I was working on some additional work on lingo, and preparing it to be moved to go-nlp. One of the things I was working on improving is the corpus package. The goal of package corpus is to provide a map of word to ID and vice versa. Along the way package lingo also exposes a Corpus interface, as there may be other data structures which showcases corpus-like behaviour (things like word embeddings come to mind).
When optimizing and possibly refactoring the package(s), I like to take stock of all the things the corpus package and the Corpus interface is useful for, as this would clearly affect some of the type signatures. This practice usually involves me reaching back to the annals of history and looking at the programs and utilities I had written, and consolidate them.
One of the things that I would eventually have a use for again is n-grams. The current n-gram data structure that I have in my projects is not very nice and I wish to change that.
[Read More]