TOPICS (thanks to Megan): Analysis of corpora with Python, frequency and ranking of words, Allometric Scaling Laws, Zipf’s and Mandelbrot’s laws, Hapax Legomena, and the transition from empirical linguistics to rationalism (via Chomsky).
A corpus is any body of language (think corpse). There are many different types of corpora, but we focused mainly on text corpora. Usually, text corpora come with pre-parsed text: the text has already been split up into paragraphs, sentences, and words in a process called tokenization. Of course, there’s some controversy here. For example, are collocations like New York one word or two words? What about data base vs database? What about blue-green?
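To make the tokenization question concrete, here is a minimal sketch comparing three simple splitting rules. The sample sentence and the regular expressions are made up for illustration; real corpora use more careful tokenizers.

```python
import re

# Hypothetical sample sentence containing the tricky cases.
text = "The blue-green data base in New York is a database."

# Whitespace splitting keeps hyphenated forms and trailing punctuation attached.
whitespace_tokens = text.split()

# Splitting on runs of letters breaks "blue-green" into two tokens.
letter_tokens = re.findall(r"[A-Za-z]+", text)

# Allowing internal hyphens keeps "blue-green" as a single token.
hyphen_tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)

print(whitespace_tokens)
print(letter_tokens)
print(hyphen_tokens)
```

None of these rules treats New York as a single token; recognizing collocations needs extra information, such as a list of known multi-word names.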
Some corpora also come with part-of-speech (POS) tagging of each word. Some really nice corpora even come with the syntactic structure of the sentence. Good corpora can be great resources for all kinds of things, from testing hypotheses about language to training data for NLP.
Choosing your corpus matters. Corpora that come from joke books will be significantly different from those that come from editorials. I gave a small demonstration using the Brown corpus and generating 100 words of editorial text vs religious text:
|Editorial|Religion|
|---|---|
|Assembly session brought much good The General Assembly shall consider and approve the budget of the Organization shall be borne by the Members as apportioned by the General Assembly decided to tackle executive powers . The final decision went to the executive but a way has been opened for strengthening budgeting procedures and to provide legislators information they need . Long-range planning of programs and ways to finance them have become musts if the state in the next few years . The I. A. P. A. found itself driven from journalism into politics as it did its best to bring|As a result , although we still make use of the pax-ordo of the earthly city and acknowledge their share in responsibility for its preservation . Not to repel injury and to preserve any particular civilized attainment of mankind or its provisional justice runs some risk of nuclear war ” . If we are born of God we overcome the world . In fact , during the burning of the vast Ch’in palace some ten years later ; ; yet he did not stop being human as a result of the Civil War , emancipation was achieved . Long before|
To generate the text, I basically counted 4-grams (sequences of 4 consecutive words), and then randomly generated each next word based on the previous three, using the counts to determine the respective probabilities. As you can tell, there are some obvious problems with this. But it’s also amazing that we can do this well just by counting sets of 4 words.
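# Written for Python 2 and NLTK 2.x; nltk.model.NgramModel was removed in NLTK 3.
from nltk.corpus import brown
from nltk.probability import MLEProbDist
from nltk.model import NgramModel

# Maximum-likelihood estimator: probabilities come straight from the raw counts.
est = lambda fdist, bins: MLEProbDist(fdist)

# One 4-gram model per Brown category.
edit = NgramModel(4, brown.words(categories='editorial'), estimator=est)
relig = NgramModel(4, brown.words(categories='religion'), estimator=est)

# Generate 100 words from each model.
print ' '.join(edit.generate(100))
print
print ' '.join(relig.generate(100))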
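For readers without the old NLTK release, the same counting idea can be sketched in plain Python 3. This is not the code used for the demonstration; the helper names `build_ngram_counts` and `generate` and the toy word list are hypothetical, and the seed is assumed to contain exactly n - 1 words.

```python
import random
from collections import Counter, defaultdict


def build_ngram_counts(words, n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        counts[context][words[i + n - 1]] += 1
    return counts


def generate(counts, seed, length, rng=random):
    """Extend `seed` (n-1 words) up to `length` words, sampling by raw counts."""
    out = list(seed)
    k = len(seed)
    while len(out) < length:
        followers = counts.get(tuple(out[-k:]))
        if not followers:  # this context was never followed by anything: stop
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out


# Toy stand-in for a corpus; a real run would pass a long list of corpus words.
toy = "the cat sat on the mat and the cat sat on the hat".split()
counts = build_ngram_counts(toy, n=4)
sample = generate(counts, seed=toy[:3], length=10, rng=random.Random(0))
print(" ".join(sample))
```

Every run of 4 consecutive words in the output was seen somewhere in the input, which is why the generated text looks locally fluent but drifts globally.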
In biology, there are some allometric scaling laws. For example:
There are similar laws with corpora. We talked about Zipf’s law ($f \propto 1/r$, where $f$ is a word’s frequency and $r$ is its rank when words are sorted by frequency). There’s also the more complicated (yet more accurate) Mandelbrot’s law: $f = P(r + \rho)^{-B}$, where $P$, $B$, and $\rho$ are all parameters of the text. We talked about some potential reasons behind these laws, before I showed that these laws also hold for randomly generated text.
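The random-text observation can be checked with a small experiment. The sketch below is one possible setup, not necessarily the one shown in class: the alphabet size, text length, and random seed are all arbitrary assumptions. It types random characters, treats the space as a word separator, and fits a line to log frequency against log rank.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
alphabet = "abcde "  # five letters plus a space acting as the word separator
text = "".join(rng.choice(alphabet) for _ in range(200_000))
words = text.split()

# Frequencies sorted from most to least common; rank r is position + 1.
freqs = [count for _, count in Counter(words).most_common()]

# Least-squares slope of log f against log r; Zipf's law predicts roughly -1.
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

print("distinct words:", len(freqs))
print("log-log slope: %.2f" % slope)
```

A clearly negative slope on a log-log plot is the power-law signature, even though nothing linguistic went into generating the text.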
One of the consequences of Zipf’s law is that most words you see only occur a couple of times. Words that only occur once are called hapax legomena. For example, in the Brown corpus, 46% of our bins (distinct words) were hapaxes, yet hapaxes made up only 2.2% of all the word occurrences. Practically, this means that if we are ever making a dictionary, we can eliminate almost half of its entries while only losing a small amount of content.
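The two percentages come from dividing the same hapax count by two different totals. Here is a minimal sketch on a toy word list; the function name `hapax_stats` is hypothetical, and the Brown corpus figures above would need the full corpus as input.

```python
from collections import Counter


def hapax_stats(tokens):
    """Return (share of distinct words, share of all tokens) that are hapaxes."""
    counts = Counter(tokens)
    hapaxes = [word for word, count in counts.items() if count == 1]
    type_share = len(hapaxes) / len(counts)   # fraction of bins
    token_share = len(hapaxes) / len(tokens)  # fraction of running words
    return type_share, token_share


toy = "the cat sat on the mat and the dog sat on the log".split()
type_share, token_share = hapax_stats(toy)
print("hapax share of distinct words: %.1f%%" % (100 * type_share))
print("hapax share of all tokens:     %.1f%%" % (100 * token_share))
```

Because each hapax contributes exactly one token, the token share is always the smaller of the two, and the gap widens as the corpus grows.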
We also briefly talked about some other Zipf’s laws, such as: and