The Penn annotation system adopts a grammar framework based on single-level constituency relations, in which constituents are overt or empty categories (in antecedent-gap chains or in situ). The aim of Penn Treebank-style annotation is to facilitate automated search, not to implement a linguistically accurate markup. As a consequence, syntactic trees are quite 'flat' compared to those adopted by most canonical linguistic theories, and they conform neither to binary-branching nor to X-bar theory requirements. The existence of some word-level nodes, the omission of undecidable information (as is the case with VP boundaries or the distinction between argument and adjunct PPs) and the use of some default rules (with respect to the location of wh-traces or structural ambiguity) are typical cases where simplicity of annotation and search has been prioritized at the expense of linguistic accuracy.

In spite of this, the Penn system is a rich annotation system that marks up highly relevant information, such as constituent boundaries, phrase and clause dependencies, categorial information (e.g. NP, PP, ADVP), grammatical relations (e.g. SBJ, ACC, DAT), discourse functions (e.g. left dislocation, pragmatic marking), sentence and clause types (e.g. EXL, CMP, QUE), some null constituents and certain transformational relations.

The syntactic annotation is applied to morphologically tagged files, which have the format illustrated in (1):

Due to the richness of Portuguese morphology, morphological labels include Part-of-Speech labels and inflectional dash tags. At the level of morphological tagging, the symbol "+" is used to mark word contraction, as in pelo/P+D or viu-o/VB-D-3S+CL, and the symbol "!" is used for mesoclisis, as in dar-te-ei/VB-R-1S!CL. At the level of syntactic annotation, contracted words, whose associated tag includes "+" or "!", are split into different constituents.
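The "+" and "!" conventions can be illustrated with a minimal Python sketch. It assumes tokens in the word/TAG shape of the examples quoted above; the names `TaggedToken` and `parse_token` are hypothetical and not part of the corpus tools. The sketch only splits the tag into its component labels. It does not attempt to split the word form itself, since the rules for that are not described here.

```python
import re
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str             # surface form, left unsplit
    tag_parts: List[str]  # one label per constituent the token will yield
    kind: str             # "simple", "contraction" or "mesoclisis"


def parse_token(token: str) -> TaggedToken:
    """Split a word/TAG token and classify it by the marker in its tag."""
    # Split on the last slash so that only the tag is separated off.
    word, _, tag = token.rpartition("/")
    if not word or not tag:
        raise ValueError(f"not a word/TAG token: {token!r}")
    if "!" in tag:
        kind = "mesoclisis"   # e.g. dar-te-ei/VB-R-1S!CL
    elif "+" in tag:
        kind = "contraction"  # e.g. pelo/P+D
    else:
        kind = "simple"
    # Hyphens belong to the inflectional dash tags, so only "+" and "!" split.
    return TaggedToken(word, re.split(r"[+!]", tag), kind)


if __name__ == "__main__":
    for tok in ["pelo/P+D", "viu-o/VB-D-3S+CL", "dar-te-ei/VB-R-1S!CL"]:
        print(parse_token(tok))
```

A token whose tag yields more than one part is one that the syntactic annotation would split into separate constituents.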
The syntactic annotation produces a hierarchical representation with labeled brackets. Each constituent is bracketed and marked with a label, either a phrase label (NP, ADJP, PP, etc.) or a word label (N, ADJ, P, etc.). Word labels are provided for every word, as in the morphologically tagged files. The main labels of phrases and clauses are category labels, while extended labels provide information about subcategory, grammatical relation or discourse function. In the labeled bracketing representation, the level of indentation corresponds to the depth of structural embedding.

(2)
( (IP-MAT (NP-VOC (NPR Senhor))
          (, :)
          (NP-SBJ *pro*)
          (VB-P Ofereço)
          (PP (P a)
              (NP (PRO$-F Vossa) (NPR Majestade)))
          (NP-ACC (D-F-P as)
                  (NPR-P Reflexões)
                  (PP (P sobre)
                      (NP (D-F a)
                          (N vaidade)
                          (PP (P de@)
                              (NP (D-P @os) (N-P homens))))))
          (. ;))
  (TYCHO BRAHE; ID A_001_PSD,03.1))

The annotation produces a rather flat tree, which often includes multiple-branching nodes, as in (2), where NP-VOC, NP-SBJ, VB-P, PP and NP-ACC are the immediate constituents of IP-MAT. For practical reasons, the structure is usually underspecified. Just as in the Penn system, there is no explicit representation for intermediate levels (N', ADJ', etc.), for the VP projection or for some functional projections (such as DP). Additionally, not every word projects an XP level. Besides verbal forms, some other words are annotated just as word-level nodes (such as FP, NEG, etc.), as in (3):

(3) ( (IP-MAT (NP-SBJ (PRO Eu))

In general, the head of a phrase is overt and matches the category of the phrase level. However, in certain cases there may be no matching head. This is the case for noun elision in NPs (see NP projections).

(4) ( (NP (D-P os) (ADJ-G-P seguintes))
(5) ( (NP (NUM dez))
(6) ( (NP (D o) (PRO$ meu))

In other cases, the phrase category does not match the head category because the system gives some words a more specific label than N, ADJ, etc.

(7) ( (NP (DEM Isso))
(8) ( (NP (PRO Nós))
(9) ( (ADJP (VB-AN perdido))

For ease of annotation, some sequences of words may be surrounded by an additional pair of parentheses, which is given a word-level tag. This is the case of:

(10) ( (IP-MAT (NP-SBJ (PRO Eu))
(11) ( (IP-MAT (CP-ADV (C se)
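Because the labeled bracketing is a plain nested-parenthesis format, a tree such as (2) can be read with a small recursive parser. The following Python sketch is only an illustration: the function names `parse_brackets`, `immediate_constituents` and `leaves` are hypothetical, the sample reproduces the IP-MAT clause of (2) without its outer wrapper and ID node, and treating any label that does not start with a letter as punctuation is an assumption made for this example.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

SAMPLE = """
(IP-MAT (NP-VOC (NPR Senhor))
        (, :)
        (NP-SBJ *pro*)
        (VB-P Ofereço)
        (PP (P a)
            (NP (PRO$-F Vossa) (NPR Majestade)))
        (NP-ACC (D-F-P as)
                (NPR-P Reflexões)
                (PP (P sobre)
                    (NP (D-F a)
                        (N vaidade)
                        (PP (P de@)
                            (NP (D-P @os) (N-P homens))))))
        (. ;))
"""


def parse_brackets(text):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # embedded constituent
            else:
                node.append(tokens[pos])  # label or word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ")"
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def immediate_constituents(node):
    """Labels of the daughters of a node, skipping punctuation nodes (assumption)."""
    return [child[0] for child in node[1:]
            if isinstance(child, list) and child[0][0].isalpha()]


def leaves(node):
    """Words (and empty categories) of a tree, in surface order."""
    out = []
    for child in node[1:]:
        if isinstance(child, list):
            out.extend(leaves(child))
        else:
            out.append(child)
    return out


if __name__ == "__main__":
    tree = parse_brackets(SAMPLE)
    print(immediate_constituents(tree))
    print(" ".join(leaves(tree)))
```

For the sample, `immediate_constituents` returns the five daughters of IP-MAT named in the text, which makes the flatness of the structure easy to inspect. A fragment with unbalanced brackets raises `ValueError` rather than being silently repaired.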