Frequencies of word frequencies
In the second lecture of Christopher D. Manning's CS224N course on Natural Language Processing, which covers data sparseness and N-gram language models, he presents some interesting tables of the frequencies of word frequencies you can expect to see in a regular corpus, that is, how many distinct words occur exactly once, exactly twice, and so on. The corpus he uses as an example is Tom Sawyer, freely available from Project Gutenberg. These sorts of things are really simple to compute and plot with NLTK; here I will quickly go through the steps required to get a nice plot of the data.
First, we need to get a list of all the words in our corpus:
## Assuming that nltk is available, and that the text is in the current
## directory, named 'twain-tomsawyer.txt'
import nltk

reader = nltk.corpus.reader.PlaintextCorpusReader('.', 'twain-tomsawyer.txt')
words = reader.words('twain-tomsawyer.txt')  # List of all the words in the text
Note that the NLTK corpus reader class does a lot of work behind the scenes, including tokenization. To create a word frequency distribution of the words we just collected, we can use the FreqDist class:
fdist = nltk.FreqDist([w.lower() for w in words])
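To see what this step does without downloading the book, here is a minimal sketch on a toy sentence. It only uses the standard library: collections.Counter stands in for nltk.FreqDist (which is a Counter subclass in NLTK 3), and the regular expression is my assumption of roughly how the reader splits text into word and punctuation tokens. The sample text and variable names are made up for illustration.

```python
import re
from collections import Counter

# Toy text standing in for the real corpus (hypothetical example data).
text = "The cat saw the dog. The dog saw the cat!"

# Assumption: a token is a run of word characters or a run of punctuation.
tokens = re.findall(r"\w+|[^\w\s]+", text)

# Lowercase before counting so that "The" and "the" are the same word.
counts = Counter(w.lower() for w in tokens)

# Show the three most frequent tokens with their counts.
print(counts.most_common(3))
```

Note that punctuation marks are counted as "words" too, which is worth keeping in mind when reading the final numbers.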
The values of this frequency distribution now consist of one integer for each distinct word in the corpus, telling us how often that word occurred (i.e. the word frequencies). We'd like to group these frequencies under 13 different labels (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11-50, 51-100, >100):
def group(i):
    # Frequencies 1 to 10 each get their own label
    if i < 11:
        return str(i)
    elif i <= 50:
        return "11-50"
    elif i <= 100:
        return "51-100"
    else:
        return ">100"
Now we can create another frequency distribution of the previously obtained values:
fdist_freq = nltk.FreqDist([group(freq) for freq in fdist.values()])
Here we count how often each of the 13 different "groups" occurs, which
gives us the frequencies we were after. To show the data graphically we can
now use fdist_freq.plot(); for a plain text table, use
fdist_freq.tabulate().
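If you want the rows listed in label order (1, 2, ..., >100) no matter how the library decides to sort them, the whole frequency-of-frequencies step can be sketched with the standard library alone. This is only an illustration under stated assumptions: Counter again stands in for nltk.FreqDist, the word counts are invented toy numbers rather than real Tom Sawyer data, and the names word_counts, freq_of_freq and labels are mine.

```python
from collections import Counter

def group(i):
    """Map a word frequency to one of the 13 labels used in the text."""
    if i < 11:
        return str(i)
    elif i <= 50:
        return "11-50"
    elif i <= 100:
        return "51-100"
    return ">100"

# Hypothetical word counts, NOT taken from the real corpus.
word_counts = Counter({"the": 120, "and": 60, "tom": 12,
                       "fence": 3, "whitewash": 1, "brush": 1})

# Count how many distinct words fall into each frequency group.
freq_of_freq = Counter(group(c) for c in word_counts.values())

# Print the table in label order; missing groups show up as 0.
labels = [str(i) for i in range(1, 11)] + ["11-50", "51-100", ">100"]
for label in labels:
    print(f"{label:>6}  {freq_of_freq[label]}")
```

Looking up a label that never occurred returns 0 instead of raising an error, so the table always has all 13 rows.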
In summary, I hope this small article shows how easy it is to do corpus experiments with NLTK. The best part is that the toolkit provides a uniform corpus API for dealing with different kinds of corpora (tagged, annotated, coded, etc.). The NLTK data collection also includes a lot of corpora to play around with right out of the box.
The complete code is available here.