By Michel Généreux
Tanto as religiões como a ciência tentam descrever a natureza. A diferença está na forma de pensar. O cientista não aceita descrever o natural com o sobrenatural, para ele é necessária a observação de provas que eventualmente destroem as ideias. Para um cientista a ciência é uma só, pois a natureza é apenas uma. Sendo assim, as ideias da física devem complementar as ideias da química, da paleontologia, geografia e assim por diante. Embora a ciência seja dividida em áreas, para facilitar o estudo, ela ainda continua sendo apenas uma. Durante a Idade Média, os filósofos escolásticos criaram uma visão dogmática de ciência que ainda hoje pode ser encontrada em alguns livros e enciclopédias. Estes pensadores não admitiam o uso da matemática, aceitavam somente a dialética e a lógica aristotélica como formas de análise científica. O resultado disso é que nada de científico foi produzido durante a Idade Média.
This is a short text, certainly not representative of the size of documents normally analyzed in corpus linguistics, because Webminal limits the size of the files we can use. However, a number of text processing facilities can still be demonstrated, despite their obviously limited appeal on a file as small as ciencia.txt.
$ clear
This command simply clears the screen.
$ pwd
This command simply prints the name of the current directory, which is your home directory. This should be something like /home/your_login.
$ cp /tmp/ciencia.txt /home/your_login
This command copies the file ciencia.txt into your home directory. Replace your_login with your actual login name.
$ ls
You should see the file ciencia.txt listed.
$ cat ciencia.txt
You should see the Portuguese content of the file.
$ wc ciencia.txt
You should see that the file has 1 line, 149 words and 950 characters.
$ sed -r 's/([\.\?\!])\s/\1\n/g' < ciencia.txt > frases
This command replaces the whitespace that follows each full stop, question mark or exclamation mark with a newline, so that every sentence ends up on its own line. The result is saved in a file called frases.
$ cat frases
You should see the text rearranged with one sentence per line.
$ head -n 3 frases
You should see the first three sentences.
$ tail -n 3 frases
You should see the last three sentences.
$ grep ' para ' frases
This command looks for lines containing the pattern para with a space on each side. You should see two sentences with this pattern.
$ grep -i ' para ' frases
This command looks for lines containing the pattern para with a space on each side, without distinguishing between lowercase and uppercase letters. You should see three sentences with this pattern.
$ grep -ic ' para ' frases
This command looks for lines containing the pattern para with a space on each side and counts them instead of printing them. You should see 3 displayed.
$ grep -in ' para ' frases
This command looks for lines containing the pattern para with a space on each side and prints each matching line preceded by its line number. You should see sentences 3, 4 and 6 displayed.
$ grep -iv ' para ' frases
This command looks for lines WITHOUT the pattern para with a space on each side. You should see the six sentences that do not contain it.
$ tr -s '.?!,;: ' '\n' < ciencia.txt | tr 'A-Z' 'a-z' > tokens
This command translates all full stops, question marks, exclamation marks, commas, semicolons, colons and spaces (yes, there IS a space after the colon) to newlines, and the -s option squeezes each run of consecutive newlines into a single one. All characters are then translated to lowercase. The result is saved in a file called tokens.
$ cat tokens
You should see the list of words (tokens), one per line.
$ sort tokens | uniq > types
This command sorts the tokens, keeps a single copy of each token and saves the result in a file called types.
$ cat types
You should see the lexicon, that is, the list of distinct words (types).
$ sort tokens | uniq -c | sort -k1nr | head
This command sorts the tokens, counts the occurrences of each one and sorts the counts numerically, with the most frequent first. It then shows the ten most frequent words. You should see that ciência occurs four times and ideias occurs three times.
$ sort tokens | uniq | rev | sort | rev
This command reverses each type, sorts the reversed types and reverses them again. The result is a list of words sorted by their ending. For example, you should see the words científica, lógica, aristotélica, química, física, dialética, matemática and dogmática grouped together. You need to scroll up the terminal to see those words.
$ rev types | paste types - | awk '$1 == $2'
This command puts each type side by side with its reversal. If the two are the same (the word is a palindrome), the line is displayed. You should see four palindromes: a, e, ele and o.
$ tail -n +2 tokens > nextwords
This command creates a list of all tokens except the first one, so that line n of the new list holds the token that follows line n of tokens. The result is saved in the file nextwords.
$ paste tokens nextwords | head -n -1 > bigrams
This command joins tokens and nextwords side by side and drops the last line, where the final token has no following word. The result is saved in the file bigrams.
$ cat bigrams
You should see all bigrams displayed.
$ sort bigrams | uniq -c | sort -k1nr | head
You should see the ten most frequent bigrams. For example, the bigram a ciência appears three times and idade média appears twice.
$ grep -E '^[^aeiou]*[aeiou]+[^aeiou]*$' types
This command uses an extended regular expression (-E) to extract all types that optionally begin with a sequence of consonants, continue with at least one vowel and optionally end with a sequence of consonants. Vowels are simply aeiou here; for Portuguese this list should be extended with the accented vowels. You should see that com, pois and ser fit this pattern. Note: on some keyboard layouts the character '^' is produced by pressing its key twice.
$ grep -B3 -A3 'que' tokens
This command shows the three words (lines) before each occurrence of the pattern que in the list of tokens and the three words after it. Each group of three words before, que and three words after is separated from the next by '--'.
$ grep -B3 -A3 'que' tokens | tr '\n' ' ' | tr -s '-' '\n' ; echo
This command rearranges the same concordances in a more user-friendly way, one concordance per line. You should see the following sequences:
observação de provas que eventualmente destroem as
dogmática de ciência que ainda hoje pode
resultado disso é que nada de científico
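The concordance step above can be wrapped in a small reusable function. What follows is only a sketch under stated assumptions, not part of the original session: the function name kwic, its arguments and the sample token list are all made up for illustration, and GNU grep and GNU sed are assumed (as on a typical Linux system). It differs from the command above in two ways: grep -x -F matches the word only when it is the whole line, so a token such as porque is not reported as a hit for que, and the '--' separator is removed with sed rather than tr, so tokens that themselves contain a hyphen are left intact.

```shell
# Hypothetical helper (not part of the original session): kwic WORD FILE [N]
# Prints every occurrence of WORD in a one-token-per-line FILE together
# with N tokens of context on each side (default 3), one concordance per line.
kwic() {
    word=$1 file=$2 n=${3:-3}
    # -x: the whole line must equal WORD; -F: treat WORD as a literal string
    grep -xF -B"$n" -A"$n" -- "$word" "$file" |
        tr '\n' ' ' |                  # join all tokens on a single line
        sed 's/ -- /\n/g; s/ $//'      # split at grep's group separator (GNU sed)
    echo                               # terminate the last line
}

# Made-up sample token list, used only to try the function out
sample=$(mktemp)
printf '%s\n' a observação de provas que eventualmente destroem as ideias \
    porque visão dogmática de ciência que ainda hoje pode > "$sample"

kwic que "$sample"
```

On the real data, the equivalent call would be kwic que tokens; a third argument such as 5 widens the context window.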
$ logout
Then you should also log out from the Webminal website by clicking on the "Log Out" button on the far-right frame.

awk   | pattern scanning and processing language
cat   | concatenate files and print them
clear | clear the terminal screen
cp    | copy files and directories
echo  | display a line of text
grep  | print lines matching a pattern
head  | output the first part of files
ls    | list directory contents
paste | merge lines of files
pwd   | print name of current directory
rev   | reverse lines
sed   | stream editor for filtering and transforming text
sort  | sort lines of text files
tail  | output the last part of files
tr    | translate characters
uniq  | report or omit repeated lines
wc    | count lines, words and characters
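Several of the commands summarised above can be chained into one reusable pipeline. The following is a minimal sketch, assuming a POSIX-style shell; the function name wordfreq, its arguments and the English sample sentence are hypothetical and serve only as illustration. It tokenises a file in the same way as the tr step of the session (with the newline added to the set of delimiters), lowercases the tokens and prints the most frequent ones.

```shell
# Hypothetical helper (not part of the original session): wordfreq FILE [N]
# Prints the N most frequent tokens of FILE (default 10) with their counts.
wordfreq() {
    tr -s '.?!,;: \n' '\n' < "$1" |    # one token per line, runs squeezed
        tr 'A-Z' 'a-z' |               # lowercase (unaccented letters only)
        sort | uniq -c |               # count identical tokens
        sort -k1,1nr -k2 |             # highest count first, ties by word
        head -n "${2:-10}"
}

# Made-up sample text, used only to try the function out
text=$(mktemp)
printf 'The cat saw the dog. The dog ran.\n' > "$text"

wordfreq "$text" 2
```

On the real data, wordfreq ciencia.txt would give the frequency list directly from the original file, without creating the intermediate file tokens.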