Introduction
This course is called Corpus processing, and its aim is to give students an overview of corpus linguistics and all its aspects. What is a corpus? Why do linguists need corpora? How do you create a corpus? How do you enhance it with linguistic information? What processing tools are there for corpus linguistics?
This is a practical course and will take place in the computer room (sala TIC) in the basement of the faculty. The course will be taught in English.
Students will be graded on the basis of several assignments given during the course (70%) and a final exam (30%).
Preliminary Course Schedule
Lecture 1 Introduction
Content: General introduction to this course and to corpus linguistics.
Date: Monday Feb 13, 2012 at 8.00-10.00
Room: TIC
Lecture 2
Content: Corpus design and corpus compilation
Date: Wednesday Feb 15, 2012 at 8.00-10.00
Room: TIC
Carnival holidays (Férias de Carnaval) on Feb 20
Lecture 3
Content: Different types of corpora for Portuguese
Date: Feb 22, 2012 at 8.00-10.00
Room: TIC
Lecture 4
Content: Corpus cleaning and preparation: encoding, metadata and XML
Date: Feb 27, 2012 at 8.00-10.00
Room: TIC
Lecture 5
Content: Practice with XML
Date: Feb 29, 2012 at 8.00-10.00
Room: TIC
Lecture 6
Content: Linguistic annotation: part-of-speech tagging
Date: March 5, 2012 at 8.00-10.00
Room: TIC
Lecture 7
Content: Linguistic annotation: chunking and parsing
Date: March 7, 2012 at 8.00-10.00
Room: TIC
Lecture 8
Content: Treebanks
Date: March 12, 2012 at 8.00-10.00
Room: TIC
Lecture 9
Content: Semantic annotation overview
Date: March 15, 2012 at 8.00-10.00
Room: TIC
Lecture 10
Content: Lexical semantics: words and their meaning
Date: March 19, 2012 at 8.00-10.00
Room: TIC
Lecture 11
Content: Semantics at the sentence level
Date: March 21, 2012 at 8.00-10.00
Room: TIC
Lecture 12
Content: Semantics at the discourse level
Date: March 26, 2012 at 8.00-10.00
Room: TIC
Lecture 13
Content: Annotation tools and annotation evaluation
Date: March 28, 2012 at 8.00-10.00
Room: TIC
Easter holidays from April 2 until April 9, 2012
Lecture 14
Content: Practice with annotation tools
Date: April 11, 2012 at 8.00-10.00
Room: TIC
Lecture 15
Content: Speech corpora
Date: April 16, 2012 at 8.00-10.00
Room: TIC
Lecture 16
Content: Practice with the Exmaralda tool
Date: April 18, 2012 at 8.00-10.00
Room: TIC
Lecture 17
Content: Knowledge representations, taxonomy, ontology
Date: April 23, 2012 at 8.00-10.00
Room: TIC
Lecture 18
Content: Metadata and TEI
Date: April 25, 2012 at 8.00-10.00
Room: TIC
Lecture 19
Content: Learner corpora
Date: April 30, 2012 at 8.00-10.00
Room: TIC
Lecture 20
Content: Multi-word expressions
Date: May 2, 2012 at 8.00-10.00
Room: TIC
Lecture 21
Content: Practice with multi-word expressions
Date: May 7, 2012 at 8.00-10.00
Room: TIC
Lecture 22
Content: Register, genre and style; experiments with the BNC
Date: May 9, 2012 at 8.00-10.00
Room: TIC
Lecture 23
Content: Case study: modality annotation
Date: May 14, 2012 at 8.00-10.00
Room: TIC
Lecture 24
Content: Historical corpora
Date: May 16, 2012 at 8.00-10.00
Room: TIC
Lecture 25
Content: Course summary
Date: May 21, 2012 at 8.00-10.00
Room: TIC
Exam period: 09-07-2012 until 21-07-2012
Literature
- Kennedy, G. (1998) An Introduction to Corpus Linguistics, London-New York, Longman. Chapter 1 (Introduction) + sections 2.5 (Issues in corpus design and compilation) and 2.6 (Compiling a corpus).
- McEnery, T. & A. Wilson (1996/2001) Corpus Linguistics, Edinburgh, Edinburgh University Press.
- A Gentle Introduction to XML. Available online.
- Wynne, M. (2005) Developing Linguistic Corpora: a Guide to Good Practice. Available online.
- Using Computers in Linguistics
- McEnery, A., R. Xiao & Y. Tono (2005) Corpus-based Language Studies. Chapter (Unit) A4: Corpus annotation.
- Sag, I., T. Baldwin, F. Bond, A. Copestake & D. Flickinger (2002) “Multiword Expressions: A Pain in the Neck for NLP”. In Gelbukh A. (ed.), Proceedings of CICLING-2002. Available online.