I'm one of the founders and co-presidents of this club. I also maintain this website.
My main interests are all about cognition and intelligence. The idea that a bunch of atoms can combine and form something self-aware is absolutely fascinating. Linguistically, I'm interested in integrating theoretical syntax with NLP, grammar inference, figuring out how the brain processes language, and creating a program with true artificial language capacities.
View all posts by Alan Du →
TOPICS (thanks to Megan): Analysis of corpora with Python, frequency and ranking of words, Allometric Scaling Laws, Zipf’s and Mandelbrot’s laws, Hapax Legomena, and the transition from empirical linguistics to rationalism (via Chomsky).
A corpus is any body of language (think corpse). There are many different types of corpora, but we focused mainly on text corpora. Usually, text corpora come with pre-parsed text – the text has already been split up into paragraphs, sentences, and words in a process called tokenization. Of course, there’s some controversy here. For example, are collocations like New York one word or two words? What about data base vs database? What about blue-green?
Some corpora also come with part-of-speech (POS) tagging of each word. Some really nice corpora even come with the syntactic structure of the sentence. Good corpora can be great resources for all kinds of things, from testing hypotheses about language to training data for NLP.
Choosing your corpus matters. Corpora that come from joke books will be significantly different from those that come from editorials. I gave a small demonstration using the Brown corpus and generating 100 words of editorial text vs religious text:
Assembly session brought much good The General Assembly shall consider and approve the budget of the Organization shall be borne by the Members as apportioned by the General Assembly decided to tackle executive powers . The final decision went to the executive but a way has been opened for strengthening budgeting procedures and to provide legislators information they need . Long-range planning of programs and ways to finance them have become musts if the state in the next few years . The I. A. P. A. found itself driven from journalism into politics as it did its best to bring
As a result , although we still make use of the pax-ordo of the earthly city and acknowledge their share in responsibility for its preservation . Not to repel injury and to preserve any particular civilized attainment of mankind or its provisional justice runs some risk of nuclear war ” . If we are born of God we overcome the world . In fact , during the burning of the vast Ch’in palace some ten years later ; ; yet he did not stop being human as a result of the Civil War , emancipation was achieved . Long before
To generate the text, I basically counted 4-grams (sequences of 4 consecutive words), and then randomly generated each next word based on the previous three, using the counts to determine the respective probabilities. As you can tell, there are some obvious problems with this. But it’s also amazing that we can do this well just by counting sequences of 4 words.
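# Written against NLTK 2.x; nltk.model.NgramModel is not available in NLTK 3.
from nltk.corpus import brown
from nltk.probability import MLEProbDist
from nltk.model import NgramModel

# Maximum-likelihood estimator: probabilities are the raw relative counts.
est = lambda fdist, bins: MLEProbDist(fdist)
# One 4-gram model per Brown corpus category.
edit = NgramModel(4, brown.words(categories='editorial'), estimator=est)
relig = NgramModel(4, brown.words(categories='religion'), estimator=est)
# Generate 100 words from each model.
print(' '.join(edit.generate(100)))
print(' '.join(relig.generate(100)))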
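For anyone without an old NLTK install, here is a rough standard-library sketch of the same counting idea: tally which word follows each three-word context, then sample the next word in proportion to those counts. The names `build_ngram_counts`, `generate`, and `toy_words` are made up for this illustration, and the tiny toy corpus stands in for the Brown corpus categories. It is not a reimplementation of `NgramModel`; in particular it simply stops when it hits a context it has never seen.

```python
import random
from collections import Counter, defaultdict

def build_ngram_counts(words, n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        counts[context][words[i + n - 1]] += 1
    return counts

def generate(counts, seed_context, length, rng):
    """Sample words one at a time, each conditioned on the previous n-1 words."""
    out = list(seed_context)
    k = len(seed_context)
    while len(out) < length:
        followers = counts.get(tuple(out[-k:]))
        if not followers:  # unseen context: stop early instead of backing off
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

# Toy stand-in corpus (hypothetical); the post used Brown corpus categories.
toy_words = ("the cat sat on the mat and the cat sat on the rug "
             "and the dog sat on the mat").split()
toy_counts = build_ngram_counts(toy_words, n=4)
sample = generate(toy_counts, toy_words[:3], 12, random.Random(0))
print(' '.join(sample))
```

Every 4-word window in the output is a window that occurs somewhere in the training text, which is why the generated passages look locally fluent but wander globally.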
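In biology, there are allometric scaling laws, which relate one measured quantity to a power of another. Kleiber’s law, for example, says that an animal’s metabolic rate scales roughly as its body mass to the 3/4 power.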
There are similar laws for corpora. We talked about Zipf’s law ($f \propto 1/r$, where $f$ is a word’s frequency and $r$ is its rank when words are sorted by frequency). There’s also the more complicated (yet more accurate) Mandelbrot’s law: $f = P(r + p)^{-B}$, where P, B, and p are all parameters of the text. We talked about some potential reasons behind these laws, before I showed that these laws also hold for randomly generated text.
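To make the random-text point concrete, here is a small sketch (not the demo from the meeting) that types random characters, splits on spaces, and fits a straight line to log frequency against log rank. The alphabet, the text length, and the seed are arbitrary choices for this illustration.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
alphabet = "abcde "  # five letters plus a space, all equally likely
text = ''.join(rng.choice(alphabet) for _ in range(200_000))

# Word frequencies sorted from most to least common (rank 1 first).
freqs = sorted(Counter(text.split()).values(), reverse=True)

# Least-squares slope of log(frequency) against log(rank).
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print("distinct words:", len(freqs), "log-log slope:", round(slope, 2))
```

A power law like Zipf’s shows up as a downward-sloping line on a log-log plot, so a clearly negative slope here is at least consistent with the claim. Plotting `xs` against `ys` is the better check, since a single fitted slope hides the step-like shape that random text produces.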
One of the consequences of Zipf’s law is that most of the distinct words you see occur only a couple of times. Words that occur only once are called hapax legomena. For example, in the Brown corpus, 46% of our bins (distinct words) were hapaxes, yet hapaxes made up only 2.2% of the running words. Practically, this means that if we are ever making a dictionary, we can eliminate almost half of it while only losing a small amount of content.
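The two percentages are easy to mix up, so here is a minimal sketch of how both can be computed from a list of tokens. The function name `hapax_stats` and the toy sentence are made up for this example; to try it on the Brown corpus you would pass in that corpus’s word list instead.

```python
from collections import Counter

def hapax_stats(tokens):
    """Return (share of distinct words that are hapaxes,
               share of running words that are hapaxes)."""
    counts = Counter(tokens)
    hapaxes = [word for word, c in counts.items() if c == 1]
    type_share = len(hapaxes) / len(counts)    # fraction of bins
    token_share = len(hapaxes) / len(tokens)   # fraction of running words
    return type_share, token_share

# Toy input (hypothetical), far too small to show Brown-like numbers.
toy_tokens = "the cat saw the dog and the dog saw a bird".split()
type_share, token_share = hapax_stats(toy_tokens)
print(type_share, token_share)
```

Each hapax counts once in both numerators, so the gap between the two shares comes entirely from the denominators: distinct words versus running words.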
We also briefly talked about some of Zipf’s other laws.
Ewan Dunbar, a grad student at UMD, spoke to us about fundamental computational properties of our phonology processors. He focused on the way stress patterns work across different languages, and related them to function machines and the Chomsky hierarchy.
Ewan first showed us different stress patterns in various languages and how they were all variations on alternating stress. He then showed us a couple of finite state machines (FSMs) and showed that stress patterns could be identified with FSMs, before generalizing this to all phonological patterns. But FSMs are very limited (see Chomsky Hierarchy lecture) – for example, they can’t recognize the set of palindromes (provable via the pumping lemma). Syntactic structure is an example of a pattern class that can’t be recognized by FSMs. This leads us to believe that syntax and phonology are two separate cognitive systems, because they differ in computational complexity.
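As a rough illustration of what recognizing a stress pattern with an FSM can look like, here is a toy deterministic machine that accepts words whose syllables strictly alternate stressed (`S`) and unstressed (`u`), starting with a stressed syllable. The pattern, the state names, and the `accepts` function are assumptions made for this sketch; it is not one of the machines Ewan showed, and it does not model any particular language.

```python
# Transition table for a toy FSM; the pattern is illustrative only.
TRANSITIONS = {
    ("start", "S"): "after_stressed",
    ("after_stressed", "u"): "after_unstressed",
    ("after_unstressed", "S"): "after_stressed",
}
ACCEPTING = {"after_stressed", "after_unstressed"}

def accepts(syllables):
    """Return True if the syllable string follows the alternating pattern."""
    state = "start"
    for s in syllables:
        state = TRANSITIONS.get((state, s))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING

for word in ["SuSu", "SuS", "SSu", "uS"]:
    print(word, accepts(word))
```

The machine only ever remembers which of three states it is in, no matter how long the word is. That fixed, finite memory is the limitation behind the palindrome example: checking a palindrome requires remembering an unbounded amount of the first half.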