Master Course Corpus Linguistics
This master's course (Portuguese title: Pesquisa de Informação em corpora) aims to give students both the theory and the practice of corpus linguistics. It is taught by three teachers, Amália Mendes, Michel Généreux and Iris Hendrickx, in the first semester of 2011.
Course Evaluation
Students will be graded on the basis of several exercises given during the course (40%), a short paper (30%) and a final exam (30%). The final exam will assess all the material covered in class, including homework. Students who have studied the class material and completed all exercises should be well prepared for the exam.
Assignments handed in after the deadline will get a deduction of 10%. Assignments handed in after the final exam will not be evaluated.
Course Schedule
Lecture 1 Introduction
Content: General introduction to this course and to corpus linguistics: what is a corpus, how is it compiled and explored?
Date: 22 Sept 2011, 10.00h - 12.30h
Room: Sala Mattos Ramão
slides lect 1
Lecture 2 Corpus compilation
Content: corpus design and practical aspects of corpus creation
Date: 29 Sept 2011, 9.30h - 12.30h
Room: TIC
slides lect 2
slides lect 2 (resources)
coralrom_xml
coralrom_xml_alg
coralrom_xml_dtd
HTML example
cleaned version of HTML example
examples spoken texts
tags of CINTIL corpus
transcription guidelines
authorisation for recordings
Lecture 3 Corpus search practice
Content: introduction to concordancers, exercises
Date: 6 Oct, 10h - 13h
Room: TIC
slides concordancers
Concordancers assignment
Lecture 4 Data representation
Content: data cleaning, encoding issues, markup
Date: 13 Oct, 10h - 13h
Room: TIC
slides lect 4
XML assignment
Lecture 5 Spoken corpora
Content: issues in development of spoken corpora: recordings, transcription, alignment of text and sound.
Date: 20 Oct, 10h - 13h
Room: TIC
Teacher: Sandra Antunes
slides lect 5
Spoken assignment
Lecture 6 Introduction to Text Processing with Unix, part 1
Content: Unix, basic corpus cleaning, word counting, sorting, extracting information from texts, ngram statistics, concordancer
Date: 27 Oct, 10h - 13h
Room: TIC
Teacher: Michel Généreux
Notes Unix
Assignment 1
Using Computers in Linguistics
Unix for Poets
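The operations listed for this lecture (word counting, n-gram statistics, a simple concordancer) can also be sketched outside the Unix shell. The following is a minimal Python illustration, not the course's own material: the function names, the sample sentence and the tokenisation rule (lowercased runs of letters) are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def ngrams(tokens, n):
    """Return all n-grams over a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, keyword, width=3):
    """Keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)

word_freq = Counter(tokens)               # word counting
bigram_freq = Counter(ngrams(tokens, 2))  # n-gram statistics with n = 2

# Most frequent words first, comparable to sorting a frequency list.
for word, count in word_freq.most_common(3):
    print(count, word)

# A small concordance for the word "mat".
for line in kwic(tokens, "mat", width=2):
    print(line)
```

Splitting on runs of letters is a deliberately crude choice; real corpora usually need more careful handling of hyphens, clitics and numbers.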
Lecture 7 Introduction to Text Processing with Unix, part 2
Content: continuation of lecture 6
Date: 3 Nov, 10h - 13h
Room: TIC
Assignment 2
Lecture 8 Linguistic annotation
Content: description and practical examples of layers of linguistic annotation.
Date: 10 Nov, 10h - 13h
Room: TIC
slides lect 8
practice in class
Lecture 9 Case study: linguistic annotation
Content: Practical exercises with linguistically annotated data using Unix
Date: 17 Nov, 10h - 13h
Room: TIC
parlamentocorpus.100k.vrt (annotated corpus)
introduction slides
practice questions only
practice and answers
Lecture 10 Phraseology and collocations
Content: Extraction of collocations, lexical association measures, multiword expressions
Date: 13 Dec, 14h - 17h
Room: Complexo Interdisciplinar, B1-01
slides collocations
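As an illustration of a lexical association measure, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one common measure for collocation extraction. It is a minimal example under stated assumptions, not necessarily the measure used in the lecture: the function name pmi_scores, the min_count parameter and the toy token list are hypothetical.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=1):
    """Score each adjacent word pair with PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs seen fewer than min_count times
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_tokens
        p_y = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy data, used only for illustration.
tokens = "a b a b c d".split()
scores = pmi_scores(tokens)

# Print pairs from the strongest to the weakest association.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

PMI tends to favour rare pairs: in the toy data the pair seen once, ("c", "d"), outranks the pair seen twice, ("a", "b"). Raising min_count is one simple way to filter out such low-frequency pairs.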
Lecture 11 Practice: Collocations
Content: Implementation of expression and collocation extraction methods
Date: 15 Dec, 10h - 13h
Teacher: Michel Généreux
Room: TIC
Course notes
Material for the assignment
Lecture 12 Short Papers
Content: Discussion of short papers with the students.
Date: 22 Dec, 10h - 13h
Room: Complexo Interdisciplinar
Exam
Date: Jan 12th 2012 10h00-12h00
Room: Complexo Interdisciplinar da Universidade de Lisboa, room B2-01
Literature
For lectures 1, 2 and 3
- FILLMORE, Ch. (1992) "Corpus linguistics" or "Computer-aided armchair linguistics" in SVARTVIK, J. (ed.) Directions in Corpus Linguistics, Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991, Berlin, Mouton de Gruyter, pp. 35-60.
- KENNEDY, G. (1998) An Introduction to Corpus Linguistics, London-New York, Longman. Chapter 1 (Introduction) + sections 2.5 (Issues in corpus design and compilation) and 2.6 (Compiling a corpus).
- McENERY, T. & A. WILSON (1996) Corpus Linguistics, Edinburgh, Edinburgh University Press (sections 2.1 and 2.2).
For lecture 4, Data representation
- A Gentle Introduction to XML, available online: here
- Chapter 4, Character encoding in corpus construction
Book: Developing Linguistic Corpora: a Guide to Good Practice, Martin Wynne, 2005
Available online: here
For lecture 5, Spoken corpora
- Chapter 5, Spoken language corpora
Book: Developing Linguistic Corpora: a Guide to Good Practice, Martin Wynne, 2005
Available online: here
For lectures 6 and 7, Introduction to Unix, and lecture 9, Unix case study
- UNIX Tutorial for Beginners, available online: here
- Using Computers in Linguistics
- Unix for Poets
For lecture 8, Linguistic annotation
- Chapter (Unit) A4 Corpus annotation
Book: Corpus-Based Language Studies, Anthony McEnery, Richard Xiao, Yukio Tono, 2005
For lecture 10, Collocations
- Bartsch, S. (2004) Structural and Functional Properties of Collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag. pp. 57-78.
- Fellbaum, C. (ed.) (2007) Idioms and collocations. London: Continuum. pp. 1-13
- Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press. pp. 109-121.
- Sag, I., T. Baldwin, F. Bond, A. Copestake & D. Flickinger (2002) “Multiword Expressions: A Pain in the Neck for NLP”. In Gelbukh A. (ed.), Proceedings of CICLING-2002.
Available online: here
For lecture 11, Complex predicates of the type «light verb + noun»
- Langer, S. (2005) "A linguistic test battery for support verb constructions", Lingvisticæ Investigationes 27:2. Amsterdam/Philadelphia: John Benjamins, pp. 171-184.
Available online: here
- Giry-Schneider, Jacqueline (1987) Les prédicats nominaux en français. Les phrases simples à verbe support, Genève: Droz.
- Butt, M. (2003) The Light Verb Jungle. Harvard Working Papers in Linguistics 9, 1-49.
Available online: here
- Ranchhod, E. (1990) Sintaxe dos predicados nominais com estar. Lisboa: INIC.