Text Processing with Linux

By Michel Généreux

(It is recommended to use the browsers Firefox or Chrome. Internet Explorer is not supported.)

This document gives an overview of the text processing facilities offered by the Linux operating system. It is intended for people working in corpus linguistics who are not familiar with Linux and want a quick panoramic view of what it can do for them. The commands used here can be tested on Webminal, a freely available online Linux terminal, once you have registered as explained below. We are going to process a short Portuguese text about science, called ciencia.txt:

Tanto as religiões como a ciência tentam descrever a natureza. A diferença está na forma de pensar. O cientista não aceita descrever o natural com o sobrenatural, para ele é necessária a observação de provas que eventualmente destroem as ideias. Para um cientista a ciência é uma só, pois a natureza é apenas uma. Sendo assim, as ideias da física devem complementar as ideias da química, da paleontologia, geografia e assim por diante. Embora a ciência seja dividida em áreas, para facilitar o estudo, ela ainda continua sendo apenas uma. Durante a Idade Média, os filósofos escolásticos criaram uma visão dogmática de ciência que ainda hoje pode ser encontrada em alguns livros e enciclopédias. Estes pensadores não admitiam o uso da matemática, aceitavam somente a dialética e a lógica aristotélica como formas de análise científica. O resultado disso é que nada de científico foi produzido durante a Idade Média.

This is a short text, certainly not representative of the size of documents normally analyzed in corpus linguistics, because Webminal limits the size of documents. A number of text processing facilities can nevertheless be demonstrated, even though their appeal is obviously limited on a file as small as ciencia.txt.

Registering for a free Linux account on Webminal:

Webminal log in:

Clear the screen

$ clear

This command simply clears the screen.

Print name of current directory

$ pwd

This command simply prints the name of the current directory, which is your home directory. This should be something like /home/your_login.

Copy the ciencia.txt file to your home directory

$ cp /tmp/ciencia.txt /home/your_login

This command copies the file ciencia.txt into your home directory. Replace your_login with your actual login name.

List directory contents

$ ls

You should see the file ciencia.txt listed.

Display the content of the file ciencia.txt

$ cat ciencia.txt

You should see the Portuguese content of the file.

Display the number of lines, words and characters

$ wc ciencia.txt

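You should see that the file has 1 line, 149 words and 950 characters. By default wc reports the third figure in bytes, so a letter stored on more than one byte (an accented letter in UTF-8, for instance) counts more than once; wc -m counts characters instead.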
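
The three figures can also be requested one at a time, which is convenient in scripts. The following is a minimal sketch on a made-up sample line (not ciencia.txt); the variable names are only illustrative.

```shell
# Made-up ASCII sample line, used here instead of ciencia.txt.
sample='a ciencia e uma so'

# Reading from a pipe keeps the file name out of the output of wc.
lines=$(printf '%s\n' "$sample" | wc -l)   # number of newline characters
words=$(printf '%s\n' "$sample" | wc -w)   # number of whitespace-separated words
bytes=$(printf '%s\n' "$sample" | wc -c)   # number of bytes, final newline included

echo "lines: $lines, words: $words, bytes: $bytes"
```

On your own file, the same options can be applied directly, for example wc -w < ciencia.txt.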

Re-organize the text one sentence per line

$ sed -r 's/([\.\?\!])\s/\1\n/g' < ciencia.txt > frases

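This command replaces the whitespace character that follows a full stop, question mark or exclamation mark with a newline, keeping the punctuation mark itself (\1 stands for the character matched inside the parentheses). The result is saved in a file called frases.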
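
To see the effect on something smaller than ciencia.txt, the same substitution can be tried on a made-up sentence. This sketch assumes GNU sed, as found on most Linux systems, where -E is equivalent to -r and where \s and \n are understood as used here; other versions of sed may behave differently.

```shell
# Made-up sample text; the variable name is only illustrative.
sentences=$(printf 'Is it true? Yes it is. Good!\n' |
  sed -E 's/([.?!])\s/\1\n/g')

# Each sentence is now on its own line, punctuation included.
printf '%s\n' "$sentences"
```

Inside a bracket expression the characters . ? and ! are already literal, so the backslashes used in the original command are optional.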

$ cat frases

You should see the text rearranged with one sentence per line.

$ head -n 3 frases

You should see the first three sentences.

$ tail -n 3 frases

You should see the last three sentences.
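
A single line in the middle of a file can be picked out in a similar way. The sketch below uses a made-up five-line input; sed -n '3p' is one common way to print only the third line, shown here as a complement to head and tail rather than as part of the original walkthrough.

```shell
# Made-up input: five short lines standing in for sentences.
numbered=$(printf '%s\n' one two three four five)

first_two=$(printf '%s\n' "$numbered" | head -n 2)   # lines 1 and 2
last_two=$(printf '%s\n' "$numbered" | tail -n 2)    # lines 4 and 5
third=$(printf '%s\n' "$numbered" | sed -n '3p')     # line 3 only

echo "third line: $third"
```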

Searching for patterns

$ grep ' para ' frases

This command looks for lines containing the pattern para with a space on either side. You should see two sentences with this pattern.

$ grep -i ' para ' frases

This command looks for lines containing the pattern para with a space on either side, without distinguishing between lowercase and uppercase letters. You should see three sentences with this pattern.

$ grep -ic ' para ' frases

This command looks for lines containing the pattern para with a space on either side and counts them instead of printing them. You should see 3 displayed.

$ grep -in ' para ' frases

This command looks for lines containing the pattern para with a space on either side and prefixes each matching line with its line number. You should see sentences 3, 4 and 6 displayed.

$ grep -iv ' para ' frases

This command looks for lines WITHOUT the pattern para with a space on either side. You should see the six sentences that do not contain this pattern.
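
Writing the spaces into the pattern means that a word at the very beginning or end of a line is not found. One alternative, not used in the commands above, is the -w option of grep, which only accepts matches that form a whole word. The sketch below compares the two on three made-up lines.

```shell
# Made-up lines: "Para" opens the first line, so no space precedes it.
lines=$(printf '%s\n' 'Para um cientista' 'e assim, para ele' 'nada aqui')

spaced=$(printf '%s\n' "$lines" | grep -ic ' para ')   # pattern with spaces
whole=$(printf '%s\n' "$lines" | grep -icw 'para')     # whole-word match

echo "with spaces: $spaced, whole word: $whole"
```

Here the pattern with spaces counts one line, while -w counts two because it also accepts the word at the start of the first line.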

Create a list of tokens (words)

$ tr -s '.?!,;: ' '\n' < ciencia.txt | tr 'A-Z' 'a-z' > tokens

This command translates all full stops, question marks, exclamation marks, commas, semicolons, colons and spaces (yes, there IS a space after the colon) to a newline; the -s option squeezes runs of consecutive newlines into one, so no empty lines are produced. The letters A to Z are then translated to lowercase. The result is saved in a file called tokens.
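
The two steps are easier to follow on a short made-up sentence. Note that the range A-Z only covers unaccented letters, so this sketch, like the command above, leaves an accented capital letter unchanged.

```shell
# Made-up sample sentence, tokenized with the same two tr commands.
tokens=$(printf 'A casa, a rua: tudo bem.\n' |
  tr -s '.?!,;: ' '\n' |   # punctuation and spaces become one newline
  tr 'A-Z' 'a-z')          # unaccented capitals become lowercase

printf '%s\n' "$tokens"
```

The output has one token per line: a, casa, a, rua, tudo, bem.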

$ cat tokens

You should see the list of words (tokens).

Create a list of types (a lexicon)

$ sort tokens | uniq > types

This command sorts the tokens, keeps a single copy of each token and saves the result in a file called types.
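
Sorting comes first because uniq only removes duplicates that are on adjacent lines. As a small sketch on a made-up token list, the number of types can be compared with the number of tokens by counting lines with wc -l.

```shell
# Made-up token list with repetitions.
tokens=$(printf '%s\n' a casa a rua casa a)

# sort brings identical tokens together, uniq keeps one copy of each.
types=$(printf '%s\n' "$tokens" | sort | uniq)

n_tokens=$(printf '%s\n' "$tokens" | wc -l)
n_types=$(printf '%s\n' "$types" | wc -l)

echo "$n_types types for $n_tokens tokens"
```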

$ cat types

You should see the lexicon.

Display a sorted list of words

$ sort tokens | uniq -c | sort -k1nr | head

This command sorts the tokens, counts the occurrences of each one (uniq -c) and sorts the result numerically on the count, most frequent first (-k1nr). Then head shows the ten most frequent words. You should see that ciência occurs four times and ideias occurs three times.
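
The same pipeline can be checked by hand on a made-up token list. In this sketch the awk step at the end is an addition for illustration: it swaps the two columns of the first line so that the word comes before its count.

```shell
# Made-up token list: a occurs 3 times, casa twice, rua once.
tokens=$(printf '%s\n' a casa a rua casa a)

# Count each token and keep the two most frequent ones.
top=$(printf '%s\n' "$tokens" | sort | uniq -c | sort -k1nr | head -n 2)
printf '%s\n' "$top"

# Word and count of the most frequent token.
most_frequent=$(printf '%s\n' "$top" | awk 'NR == 1 {print $2, $1}')
echo "most frequent: $most_frequent"
```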

Display according to rhymes

$ sort tokens | uniq | rev | sort | rev

This command reverses each type, sorts the reversed words and reverses them again. The result is a list of words sorted by their ending. For example, you should see the words científica, lógica, aristotélica, química, física, dialética, matemática and dogmática grouped together. You may need to scroll up in the terminal to see those words.
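
The sketch below applies the same idea to four made-up unaccented words. Setting LC_ALL=C for sort is an assumption made here only so that the order is the same on every system; it is not part of the original command, and on accented text it may not give the order you want.

```shell
# Made-up, unaccented word list.
types=$(printf '%s\n' provas fisica natureza logica)

# Reverse each word, sort, then reverse back: words sharing an ending
# end up next to each other.
by_ending=$(printf '%s\n' "$types" | rev | LC_ALL=C sort | rev)

printf '%s\n' "$by_ending"
```

The two words ending in -ica come out first, next to each other.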

Display palindromes

$ rev types | paste types - | awk '$1 == $2'

This command puts the types and their reversals side by side. When the two columns of a line are identical (the word is a palindrome), awk displays the line. You should see four palindromes: a, e, ele and o.
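
As a small variation, awk can be asked to print only the first column, so that each palindrome appears once instead of twice. The sketch below does this on a made-up list stored in a temporary file; the file is a stand-in for types and is removed at the end.

```shell
# Temporary file standing in for the file 'types' (made-up content).
types_file=$(mktemp)
printf '%s\n' a com ele osso > "$types_file"

# Column 1: the word; column 2: the word reversed.
# Keep the word when both columns are identical.
palindromes=$(rev "$types_file" | paste "$types_file" - |
  awk '$1 == $2 {print $1}')

printf '%s\n' "$palindromes"
rm -f "$types_file"
```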

Creating bigrams

Bigrams are sequences of two consecutive words.

$ tail -n +2 tokens > nextwords

This command creates a list of all tokens, except the first one. The result is saved in file nextwords.

$ paste tokens nextwords | head -n -1 > bigrams

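This command puts tokens and nextwords side by side, so that each line holds a word and the word that follows it. The last line is dropped (head -n -1) because the last token has no following word. The result is saved in file bigrams.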
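
The construction can be followed step by step on four made-up tokens. In this sketch the last line is removed with sed '$d' instead of head -n -1, because a negative line count for head is a GNU extension that may not be available on other systems; the directory and file names are only illustrative.

```shell
# Scratch directory so that existing files are not overwritten.
dir=$(mktemp -d)
printf '%s\n' a ciencia e so > "$dir/tokens"

# Same list shifted by one: it starts at the second token.
tail -n +2 "$dir/tokens" > "$dir/nextwords"

# Pair each token with its successor, then delete the last line,
# which has no second word.
bigrams=$(paste "$dir/tokens" "$dir/nextwords" | sed '$d')

printf '%s\n' "$bigrams"
rm -r "$dir"
```

Four tokens give three bigrams, separated by a tab: a ciencia, ciencia e, e so.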

$ cat bigrams

You should see all bigrams displayed.

$ sort bigrams | uniq -c | sort -k1nr | head

You should see the ten most frequent bigrams; for example, the bigram a ciência appears 3 times and idade média appears twice.

Syllables

$ grep -E '^[^aeiou]*[aeiou]+[^aeiou]*$' types

This command uses an extended regular expression (-E) to extract all types that may begin with a sequence of consonants, followed by at least one vowel, and may finally end with a sequence of consonants. In other words, it keeps the types that contain a single group of vowels. Vowels are simply aeiou here; for Portuguese this list should be extended with the accented vowels. You should see that com, pois and ser fit this pattern. Note: on keyboard layouts where ^ is a dead key, you may need to press it twice to type the character.
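
The pattern can be tested on a handful of made-up unaccented types. In this sketch, ideias and natureza are rejected because they contain more than one group of vowels.

```shell
# Made-up, unaccented list of types.
types=$(printf '%s\n' com ideias natureza pois ser)

# ^[^aeiou]*  optional leading consonants
# [aeiou]+    one group of one or more vowels
# [^aeiou]*$  optional trailing consonants
single_group=$(printf '%s\n' "$types" |
  grep -E '^[^aeiou]*[aeiou]+[^aeiou]*$')

printf '%s\n' "$single_group"
```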

Concordancer

$ grep -B3 -A3 'que' tokens

This command shows three words (lines) before the pattern que and three words after the pattern in the list of tokens. The sequences of three words before, que and three words after are separated by '--'.

$ grep -B3 -A3 'que' tokens | tr '\n' ' ' | tr -s '-' '\n' ; echo

This command rearranges the same concordances in a more user-friendly way, one concordance per line. You should see the following sequences:

observação de provas que eventualmente destroem as

dogmática de ciência que ainda hoje pode

resultado disso é que nada de científico
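
The same pipeline can be tried on a made-up token list with a narrower window of one word on each side. This sketch also adds the -x option, which is not in the original command: it accepts a match only when the pattern covers the whole line, so a longer token that merely contains que would not be selected.

```shell
# Made-up token list with two occurrences of "que".
tokens=$(printf '%s\n' uma prova que destroi as ideias e nada que foi feito)

# One word before and after each "que"; grep separates the two groups
# with a line containing "--", which becomes the line break.
concordances=$(printf '%s\n' "$tokens" |
  grep -x -B1 -A1 'que' | tr '\n' ' ' | tr -s '-' '\n')

printf '%s\n' "$concordances"
```

Each output line holds one concordance, here prova que destroi and nada que foi, with some extra spaces left over from the translation of newlines.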

Logging Out

$ logout

Then you should also log out from the Webminal website by clicking on the "Log Out" button in the far-right frame.

Summary of commands used

awk    pattern scanning and processing language
cat    concatenate files and print them
clear  clear the terminal screen
cp     copy files and directories
echo   display a line of text
grep   print lines matching a pattern
head   output the first part of files
ls     list directory contents
paste  merge lines of files
pwd    print name of current directory
rev    reverse lines
sed    stream editor for filtering and transforming text
sort   sort lines of text files
tail   output the last part of files
tr     translate characters
uniq   report or omit repeated lines
wc     count lines, words and characters

Creative Commons License

This tutorial is licensed under a Creative Commons License.