Contact

Amália Mendes
E-mail: amalia.mendes@clul.ul.pt

Michel Généreux
E-mail: genereux@clul.ul.pt

Latest news

11.09.2012

The website for the course 'Pesquisa de Informação em corpora' is now available online.

Master Course Corpus Linguistics

This master course (Portuguese title: Pesquisa de Informação em corpora) aims to provide the students with both the theory and the practice of corpus linguistics. The course will be given by Amália Mendes and Michel Généreux in the first semester of 2012.

Course Evaluation

Students will be graded on the basis of several exercises given during the course (40%), a short paper (30%) and a final exam (30%). The final exam will assess all the material covered in class, including homework. Students who have learned the class material and completed all exercises should be well prepared for the exam.

Assignments handed in after the deadline will incur a deduction of 10%. Assignments handed in after the final exam will not be evaluated.
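As a rough illustration of how the weights above combine, the following Python sketch computes a final grade. Only the 40/30/30 weights and the 10% late deduction come from this page. The 0-20 grading scale, the equal weighting of the individual exercises, the reading of the deduction as 10% of the late assignment's own grade, and counting an unevaluated assignment as zero are all assumptions made for the example, and the function names are hypothetical.

```python
# Weights stated on this page; everything else below is an illustrative assumption.
WEIGHTS = {"exercises": 0.40, "paper": 0.30, "exam": 0.30}
LATE_DEDUCTION = 0.10  # assumed to be taken off the late assignment's own grade


def assignment_grade(raw, late=False, after_exam=False):
    """Grade of one assignment after applying the late-submission rules."""
    if after_exam:
        return 0.0  # not evaluated; counted as zero here (assumption)
    return raw * (1 - LATE_DEDUCTION) if late else raw


def final_grade(exercise_grades, paper, exam):
    """Weighted course grade; exercises are averaged with equal weight (assumption)."""
    exercises = sum(exercise_grades) / len(exercise_grades)
    return (WEIGHTS["exercises"] * exercises
            + WEIGHTS["paper"] * paper
            + WEIGHTS["exam"] * exam)


# Two exercises on an assumed 0-20 scale, the second one handed in late.
grades = [assignment_grade(16), assignment_grade(18, late=True)]
print(round(final_grade(grades, paper=15, exam=14), 2))
```

The actual calculation used by the lecturers may differ, in particular in how the late deduction is applied.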

Course Schedule

Lecture 1 Introduction

Content: General introduction to this course and to corpus linguistics: what is a corpus, how is it compiled and explored?
Date: 26 Sept 2012, 10.00h - 13.00h
Room: TIC
Lecturer: Amália Mendes
slides lect 1

Lecture 2 Corpus compilation

Content: corpus design and practical aspects of corpus creation
Date: 3 Oct 2012, 10.00h - 13.00h
Room: TIC
Lecturer: Amália Mendes
slides lect 2
slides lect 2 (resources)
coralrom_xml
coralrom_xml_alg
coralrom_xml_dtd
HTML example
cleaned version of HTML example
examples spoken texts
tags of CINTIL corpus
transcription guidelines
authorisation for recordings

Lecture 3 Corpus search practice

Content: introduction to concordancers, exercises
Date: 10 Oct 2012, 10h - 13h
Room: TIC
Lecturer: Amália Mendes
slides concordancers

Lecture 4 Data representation

Content: data cleaning, encoding issues, markup
Date: 17 Oct 2012, 10h - 13h
Room: TIC
Lecturer: Michel Généreux
handout lect 4
XML assignment

Lecture 5 Spoken corpora

Content: issues in the development of spoken corpora: recordings, transcription, alignment of text and sound.
Date: 24 Oct 2012, 10h - 13h
Room: Complexo Interdisciplinar
Lecturer: Amália Mendes

Lecture 6 Introduction to text processing with Unix, part 1

Content: Unix, basic corpus cleaning, word counting, sorting, extracting information from texts, n-gram statistics, concordancing
Date: 31 Oct 2012, 10h - 13h
Room: TIC
Lecturer: Michel Généreux
Handout Unix
Assignment 1

Lecture 7 Introduction to text processing with Unix, part 2

Content: continuation of lecture 6
Date: 7 Nov 2012, 10h - 13h
Room: TIC
Lecturer: Michel Généreux
Assignment 2

Lecture 8 Linguistic annotation

Content: description and practical examples of layers of linguistic annotation.
Date: 13 Nov 2012, 9.30h - 12.30h
Room: A2-25
Lecturer: Michel Généreux
handout

Lecture 9 Case study: linguistic annotation

Content: Practical exercises with linguistically annotated data using Unix
Date: 21 Nov 2012, 10h - 13h
Room: A2-25
Lecturer: Michel Généreux
practice
assignment
corpus
CQPwebCRPC

Lecture 10 Phraseology and collocations

Content: Extraction of collocations, lexical association measures, multiword expressions
Date: 28 Nov 2012, 10h - 13h
Room: TIC
Lecturer: Amália Mendes

Lecture 11 Practice: Collocations

Content: Implementation of expression and collocation extraction methods
Date: 5 Dec 2012, 10h - 13h
Room: A2-25
Lecturer: Michel Généreux
Course notes
Assignment
Material for the assignment

Lecture 12 Short Papers

Content: Discussion of short papers with the students.
Date: 12 Dec 2012, 10h - 13h
Room: Complexo Interdisciplinar
Lecturer: Amália Mendes

Exam

Date: To be confirmed
Room: Complexo Interdisciplinar da Universidade de Lisboa, sala B2-01

Literature

For lectures 1, 2 and 3

  • FILLMORE, Ch. (1992) "Corpus linguistics" or "Computer-aided armchair linguistics" in SVARTVIK, J. (ed.) Directions in Corpus Linguistics, Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991, Berlin, Mouton de Gruyter, pp. 35-60.
  • KENNEDY, G. (1998) An Introduction to Corpus Linguistics, London-New York, Longman. Chapter 1 (Introduction) + sections 2.5 (Issues in corpus design and compilation) and 2.6 (Compiling a corpus).
  • McENERY, T. & A. WILSON (1996) Corpus Linguistics, Edinburgh, Edinburgh University Press (sections 2.1 and 2.2).

For lecture 4, Data representation

  • A Gentle Introduction to XML, available online: here
  • Chapter 4, Character encoding in corpus construction
    Book: Developing Linguistic Corpora: a Guide to Good Practice, Martin Wynne, 2005
    Available online: here

For lecture 5, Spoken corpora

  • Chapter 5, Spoken language corpora
    Book: Developing Linguistic Corpora: a Guide to Good Practice, Martin Wynne, 2005
    Available online: here

For lectures 6 and 7, Introduction to Unix, and lecture 9, Unix case study

For lecture 8, Linguistic annotation

  • Chapter (Unit) A4 Corpus annotation
    Book: Corpus-Based Language Studies, Anthony McEnery, Richard Xiao, Yukio Tono, 2005

For lecture 10, Collocations

  • Bartsch, S. (2004) Structural and Functional Properties of Collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag. pp. 57-78.
  • Fellbaum, C. (ed.) (2007) Idioms and collocations. London: Continuum. pp. 1-13
  • Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press. pp. 109-121.
  • Sag, I., T. Baldwin, F. Bond, A. Copestake & D. Flickinger (2002) “Multiword Expressions: A Pain in the Neck for NLP”. In Gelbukh, A. (ed.), Proceedings of CICLING-2002.
    Available online: here
