21 Bits Ought to Be Enough for (Everyday) English

Or, Bitpacking Shenanigans

I was doing some additional work on lingo, preparing it to be moved to go-nlp. One of the things I was improving is the corpus package. The goal of package corpus is to provide a mapping from word to ID and vice versa. Along the way, package lingo also exposes a Corpus interface, as there may be other data structures which showcase corpus-like behaviour (things like word embeddings come to mind).
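
To make that concrete, here is a minimal sketch of the kind of word-to-ID mapping I mean; the interface name and method set are illustrative, not necessarily lingo's actual Corpus interface:

type WordMapper interface {
	// Id returns the ID assigned to a word, and whether the word is in the corpus.
	Id(word string) (id int, ok bool)
	// Word returns the word associated with an ID.
	Word(id int) string
	// Size returns the number of words in the corpus.
	Size() int
}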

When optimizing and possibly refactoring the package(s), I like to take stock of all the things the corpus package and the Corpus interface are useful for, as this clearly affects some of the type signatures. This practice usually involves me reaching back into the annals of history, looking at the programs and utilities I had written, and consolidating them.

One of the things that I would eventually have a use for again is n-grams. The current n-gram data structure that I have in my projects is not very nice and I wish to change that.

They look something like this:

type unigram string

type bigram struct {
	a, b string
}

type trigram struct {
	a, b, c string
}

type ngram interface {
	WC() int
	// other methods, elided
}

So I thought to myself: given that I’m trying to consolidate all my NLP projects into lingo, I should actually redesign the n-gram data structure. Being able to use corpus IDs would be a bonus, so I centered my thinking around that.

Realistically, a large majority of NLP work doesn’t require anything more than a trigram. Sure, Google had released a 40-gram book corpus, but that’s a completely different type of NLP analysis (really, what do you think the probability is that a specific 40-gram would appear in any given text?)

Exploratory Work

I decided to look at all the corpuses I have access to (and some that I don’t technically have access to, but which I crawled a long time ago when those sites were publicly available to crawl). Here’s a quick rundown:

Corpus          Corpus Size
BYU GloWbe      1.9 billion words
BYU BNC         100 million words
COCA            540 million words
Common Crawl    1 to 1000 billion words
Gigaword 5      1 billion words
Penn Treebank   1 million words

You get the point. There are many words on the internet. The real question is: what is the size of the vocabulary? By my best estimates, depending on how you consume the largest corpus in the list above, and also depending on de-duplication strategies, you come to about 2.1 million unique words. These include words like “lol”, emoticons like “:)”, and emoji, which I don't really consider words, but hey, they have Unicode codepoints, which is more than I can say for my completely fictional language, which doesn't even sit in the BMP. Also, pro-tip: just use Buck et al., as writing code to parse shitty HTML from the early 2000s is not fun.

In fact, even Buck et al. noted that there were only about 2.6 million unigrams. The OED gives a far more conservative worst-case figure of about 750,000 words in the English language. From my experience, I’m okay with having a maximum vocabulary of 2.1-ish million words.

And so, we have the ingredients we need to design a slightly more efficient n-gram data structure.

The Data Structure

This is the definition of the n-gram:

type allgram uint64

Yep. That’s it. This uint64 is able to represent a unigram, a bigram and a trigram. Once Go supports uint128, we can even do pentagrams!

The reason is quite simple: 21 bits is enough to represent most of English. 1<<21 is 2,097,152, which is nearly 2.1 million. I’m more than happy to cut anything with a frequency of 1, which brings the count down to about 1.6 million words in my Common Crawl corpus, and that’s still more than the OED’s estimate anyway.

So the trick to storing a trigram into a uint64 is to do something like this:

type wordID uint64

func makeTrigram(a, b, c wordID) allgram {
	// a occupies bits 0-20, b bits 21-41, c bits 42-62.
	return allgram(a | (b << 21) | (c << 42))
}

func makeBigram(a, b wordID) allgram {
	// a occupies bits 0-20, b bits 21-41.
	return allgram(a | (b << 21))
}
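
Going the other way is just shifting and masking. Here is a minimal sketch (wordsOf and wordMask are my names, not anything in lingo); for a bigram, the third ID simply comes back as zero:

const wordMask = 1<<21 - 1 // 21 bits: 2,097,151

func wordsOf(ng allgram) (a, b, c wordID) {
	a = wordID(ng) & wordMask
	b = wordID(ng>>21) & wordMask
	c = wordID(ng>>42) & wordMask
	return a, b, c
}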

Here you may note that 21 x 3 = 63. This gives us one spare bit. There are many things one can do with a spare bit. For example, in an occurrence-counting application, the spare bit may be used to lock/finalize the ngram’s occurrence count.
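
As a hypothetical example of using that spare bit as a lock/finalize flag (finalized, finalize and isFinalized are illustrative names, not part of any existing package):

const finalized allgram = 1 << 63 // the one spare bit

// finalize marks an ngram's count as locked.
func finalize(ng allgram) allgram { return ng | finalized }

// isFinalized reports whether the spare bit is set.
func isFinalized(ng allgram) bool { return ng&finalized != 0 }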

Either way, I have benchmarked this implementation against the older ones. Where it really shines is when used as a map key. Now, instead of passing in a fat pointer (an ngram interface value is two machine words, which complicates the hashing strategy), we have a key that is fast to hash and fast to read. But of course, this is a micro-optimization on a microbenchmark.
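
For illustration, this is the kind of use I have in mind: a plain occurrence counter keyed directly on the packed value (countTrigrams is a made-up example, not code from lingo):

// countTrigrams tallies trigram occurrences over a sequence of word IDs.
func countTrigrams(ids []wordID) map[allgram]int {
	counts := make(map[allgram]int)
	for i := 0; i+2 < len(ids); i++ {
		counts[makeTrigram(ids[i], ids[i+1], ids[i+2])]++
	}
	return counts
}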

The Decision, Yet Unknown

Now, you may have noted that wordID is a uint64. The biggest problem for me is whether to change the corresponding method signatures of the corpus package and the lingo.Corpus interface to return uint64, or to keep the interface as it is and return int as the ID of the word.

On the plus side, having it as a uint64 enables much more compact storage of word IDs and word pairings (instead of using a struct of two ints, we can just stuff them into a single uint64). This would speed up many of my applications, I suspect.

On the other hand, much of the use of corpus.Corpus is to provide IDs that are used as slice indices (slicing a tensor.Tensor for a word vector, for example). This would mean a larger code-change footprint. It also makes the code harder to understand if no proper getters and setters are written. I know I’m a rather shitty programmer, so this is something to keep in mind.
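
To make the footprint concern concrete, here is a hypothetical sketch; lookupVector stands in for any API that expects an int index:

// lookupVector stands in for any function that expects an int index.
func lookupVector(embeddings [][]float64, row int) []float64 {
	return embeddings[row]
}

// If the corpus hands back uint64 IDs, every such call site needs a conversion.
func vectorFor(embeddings [][]float64, id uint64) []float64 {
	return lookupVector(embeddings, int(id))
}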

So, tell me what you think.

Addendum

My colleague Michael E. raised a few good points regarding assumptions I should have mentioned. The first was about constructable terms, also known as compound words or spanned words in NLP. In the corpuses above, especially the Wikipedia-derived corpus, there may be compound words that are treated as a single word. Again, because we’re doing this in a word-embedding context, technically all the variants of the same words would have been learned via some form of contrastive divergence.

Secondly, he mentioned that I had neglected to address stemming. Again, there is no real need for it in the context of creating an embedding-like corpus.

Lastly, and this is quite important: there was no definition of which “English” was used. Are the hundreds of creoles included? Therefore it’s not really fair to say 21 bits ought to be enough. I contend that it should be enough. The GloWbe corpus, for example, contains variants of English, typically grouped by country. The Common Crawl corpus, on the other hand, is heavily biased towards US English. I’ve done some preliminary (and totally not rigorous) deduplication work, and the number comes to about 2.1 million words and some change.

I personally think the last question is perhaps the most important one to ask of any assertion. I’ll have more on that in a few blog posts.
