======================================================================
          IMS Corpus Workbench  --  Demonstration Corpus
                     Size: 3.4 million tokens
======================================================================

This corpus is a collection of novels by Charles Dickens:

  - A Christmas Carol
  - David Copperfield
  - Dombey and Son
  - Great Expectations
  - Hard Times
  - Master Humphrey's Clock
  - Nicholas Nickleby
  - Oliver Twist
  - Our Mutual Friend
  - Sketches by BOZ
  - A Tale of Two Cities
  - The Old Curiosity Shop
  - The Pickwick Papers
  - Three Ghost Stories

The text is derived from several Etext editions of Project Gutenberg.
It was tokenised, part-of-speech tagged and lemmatised with Helmut
Schmid's TreeTagger, and chunk-parsed with the Gramotron PCFG for
English developed at the IMS.

REFERENCES:

  Project Gutenberg
    http://www.gutenberg.net/
  TreeTagger
    http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
  Gramotron PCFG
    http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/LoPar-en.html

======================================================================

DISCLAIMER:

THIS CORPUS IS PROVIDED TO YOU "AS-IS". NO WARRANTIES OF ANY KIND,
EXPRESS OR IMPLIED, ARE MADE TO YOU AS TO THE CORPUS OR ANY MEDIUM IT
MAY BE ON, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY
OR FITNESS FOR A PARTICULAR PURPOSE.

======================================================================

INSTALLATION NOTES:

This demonstration corpus is intended for use with the CQP Tutorial.
You can access the corpus by running CQP in the current directory
with the following options:

    cqp -e -r registry

To install the corpus permanently, copy the registry file to the
global registry directory, and insert the correct absolute path to
the data/ subdirectory in the HOME and INFO entries.

======================================================================

DOCUMENT STRUCTURE:

  TITLE - AUTHOR
  TABLE OF CONTENTS / PREFACE / ETC.

  [ ... ]
  ...
  <s> ... </s> <s> ... </s> ...
  ... ...
  <s> ... </s> <s> ... </s> ...
  ... ...
  [ ]

SYNTACTIC ANNOTATIONS:

In sentences consisting of up to 42 words, noun and prepositional
phrases identified with the Gramotron PCFG are annotated as <np> and
<pp> elements:

  <s> ... <np> ... </np> <pp> ... </pp> ... </s>

Embedded NPs and PPs are represented up to a depth of two nested
phrases (more deeply embedded phrases were dropped). For access with
CQP, the nested elements are renamed to <np1>/<np2> or <pp1>/<pp2>,
depending on the level of embedding.

======================================================================
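
The renaming described under SYNTACTIC ANNOTATIONS can be illustrated
with a small sketch. It is not the tool that was used to build this
corpus. The function name rename_nested, the token-list input format,
and the reading of "depth of two" as "the outermost phrase plus two
embedded levels, counted separately for NPs and PPs" are all
assumptions made for illustration.

```python
import re

# Matches a start or end tag such as <np>, </np> or <np h="...">.
TAG = re.compile(r"^<(/?)([a-z]+)((?:\s[^>]*)?)>$")


def rename_nested(tokens, names=("np", "pp"), max_depth=2):
    """Rename nested phrase tags by embedding level (hypothetical helper).

    Level 0 keeps its name, levels 1..max_depth get a numeric suffix,
    and tags of more deeply embedded phrases are dropped. The words
    inside a dropped phrase are kept. Input is assumed to be well-formed.
    """
    depth = {name: 0 for name in names}  # open regions per element type
    out = []
    for tok in tokens:
        m = TAG.match(tok)
        if not m or m.group(2) not in names:
            out.append(tok)  # ordinary word or unrelated tag
            continue
        closing = m.group(1) == "/"
        name, attrs = m.group(2), m.group(3)
        if closing:
            depth[name] -= 1
            level = depth[name]
        else:
            level = depth[name]
            depth[name] += 1
        if level > max_depth:
            continue  # embedded too deeply: drop the tag itself
        suffix = str(level) if level else ""
        out.append(f"<{'/' if closing else ''}{name}{suffix}{attrs}>")
    return out


# Placeholder tokens w1..w3 stand in for real words.
example = ["<np>", "w1", "<np>", "w2", "<np>", "w3",
           "</np>", "</np>", "</np>"]
print(" ".join(rename_nested(example)))
# <np> w1 <np1> w2 <np2> w3 </np2> </np1> </np>
```

Because depth is tracked per element type in this sketch, an NP
directly inside a PP remains a plain <np>. Only an NP inside another
NP becomes <np1>.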