Quickly Annotate Your Machine Learning Dataset with One Weird Trick (It's Lisp)

TL;DR - I wrote annotation-mode, which is an emacs minor mode for annotating text documents for machine learning purposes.

Recently at work I had to annotate some text documents for a piece of NLP work. The annotation involves marking regions of the text with a category, as well as a rectangle which represents the region.

At first I thought to build a webapp to do the annotation. The webapp would load a text file, which would be drawn onto a canvas object. The user would then use the mouse to draw a box around specific regions. Then a popup would ask the user for the class of the region.

As I was typing the code for that I pressed C-M-c C-M-c (my keybinding for multi-cursor), and I realized emacs could do a better job than any webapp I could write. And it would be far quicker too. So I wrote a minor mode - annotation-mode. Feel free to check it out.

Here’s what it looks like in action:

Some Details

I am familiar with elisp. I write multiple small, usually throwaway functions every few days to aid in doing my work. However I had never written a minor mode before. This was my first.

Writing it was fairly trivial. Get the rectangle for a selected region, use an (interactive) function to query the user for the classification and you have a datum with region and classification information.
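As a rough illustration of that step (not the actual annotation-mode source), the sketch below builds a datum from a region. The names my/annotation-classes, my/region-rectangle and my/annotate-region, the list of classes, and the alist layout of the datum are all assumptions made for this example.

```elisp
;; Illustrative sketch only; every `my/' name here is hypothetical.

(defvar my/annotation-classes '("header" "body" "footer")
  "Hypothetical list of categories offered to the annotator.")

(defun my/region-rectangle (start end)
  "Return the rectangle spanned by START and END as an alist.
Line numbers are 1-based and columns are 0-based."
  (save-excursion
    (let (top left bottom right)
      (goto-char start)
      (setq top (line-number-at-pos)
            left (current-column))
      (goto-char end)
      (setq bottom (line-number-at-pos)
            right (current-column))
      ;; The end column may be left of the start column on a multi-line
      ;; region, so normalise with `min' and `max'.
      `((top . ,top) (left . ,(min left right))
        (bottom . ,bottom) (right . ,(max left right))))))

(defun my/annotate-region (start end class)
  "Return an annotation datum for the region START..END labelled CLASS.
When called interactively, prompt for CLASS and echo the datum."
  (interactive
   (list (region-beginning) (region-end)
         (completing-read "Class: " my/annotation-classes)))
  (let ((datum `((class . ,class)
                 (rect . ,(my/region-rectangle start end)))))
    (when (called-interactively-p 'interactive)
      (message "Annotated: %S" datum))
    datum))
```

Keeping the prompt inside the (interactive) form means the same function can be called from other code with an explicit class, which is handy when wrapping it later.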

This function is then wrapped in another function which determines whether to write it to a JSON file or just to display the classification message.

The last thing to do is to hook it up so that when the mouse is used to draw a region it will automatically prompt the user for a classification.

My first attempt added a hook to activate-mark-hook. But that led to a funny issue: if you use the mouse to select a region, the hook is called before the region highlight gets rendered. After fiddling for a bit, adding advice here and there, I decided that activate-mark-hook isn’t the best place to do things.

So instead I wrote a function called register-mouse-select and used it as advice on the mouse-set-region function.
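For readers who have not used advice before, here is a minimal sketch of the idea. It is not the real register-mouse-select. The :after placement, the my/ names, and the stand-in prompt function are assumptions for illustration.

```elisp
;; Illustrative sketch only; every `my/' name here is hypothetical.

(defvar my/annotation-prompt-function
  (lambda (start end)
    (message "%s: %d-%d" (read-string "Class: ") start end))
  "Stand-in for whatever should happen once a region has been selected.
Called with the region's START and END positions.")

(defun my/register-mouse-select (&rest _args)
  "Run the annotation prompt if a non-empty region is active.
Meant to be used as :after advice, so the arguments are ignored."
  (when (use-region-p)
    (funcall my/annotation-prompt-function
             (region-beginning) (region-end))))

;; Run the function every time a mouse drag finishes setting the region.
(advice-add 'mouse-set-region :after #'my/register-mouse-select)

;; To undo:
;; (advice-remove 'mouse-set-region #'my/register-mouse-select)
```

In a real minor mode the advice-add and advice-remove calls would presumably live in the mode's enable and disable branches, so the prompt only fires while the mode is on.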

Some Thoughts on Writing a Minor Mode

JSON handling in emacs (and in most unityped languages) is particularly tricky. I had originally encoded my annotation as a plist instead of an alist. However this proved to be difficult.

Here’s why. When writing to the JSON file there are three cases:

  1. The file is empty.
  2. The file has one object.
  3. The file has an array of objects.

Adding an annotation to the JSON file would require us to check for each case. The handling for each case is as follows (numbers correspond to the case numbers above):

  1. Write obj to file.
  2. Write (list existing obj) to file.
  3. Write (cons obj existing) to file.

A problem happens when you read a JSON file containing an array of objects (case 3). I had originally configured it to read it as a list.

Thus encoding the annotation as a plist means that we are unable to differentiate cases 2 and 3 without further introspection, since a single object and an array of objects both come back as lists. A snippet can be seen here:

ELISP> (setq x '((k1 v1) (k2 v2)))
((k1 v1)
 (k2 v2))

ELISP> (type-of x)
cons

ELISP> (listp x)
t

The trick, of course, is to use a vector for arrays. Then we can use (vconcat existing (list obj)) to add to an existing array. No additional introspection required.
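A minimal sketch of how the three cases fall out once arrays are vectors, assuming json.el with objects read as alists. The function name my/append-annotation is hypothetical, and so is the convention that nil stands for an empty file.

```elisp
;; Illustrative sketch only; `my/append-annotation' is a hypothetical name.
(require 'json)

(defun my/append-annotation (existing obj)
  "Return the value to write back after adding OBJ to EXISTING.
EXISTING is assumed to be nil for an empty file, an alist for a
single object, or a vector for an array of objects."
  (cond
   ((null existing) obj)                               ; case 1: empty file
   ((vectorp existing) (vconcat existing (list obj)))  ; case 3: array
   (t (vector existing obj))))                         ; case 2: one object

;; Usage: read an array, add one annotation, encode it back to JSON.
(let* ((json-object-type 'alist)
       (json-array-type 'vector)
       (existing (json-read-from-string "[{\"class\": \"a\"}]"))
       (updated (my/append-annotation existing '((class . "b")))))
  (json-encode updated))
```

Because an object is a list and an array is a vector, a single vectorp check is enough to tell case 2 from case 3.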

It was only after all that that I discovered the default config for json.el already reads arrays as vectors. Thus these lines were unnecessary. I left them in there because they let me see what gets deserialized into values of what types.

Some Thoughts

I think this is clearly a superior solution to the webapp idea I originally had. I could have spent hours writing a webapp that does the same thing and I think it wouldn’t be as smooth as what I have here. And all this was written in about 100 lines of code, building on the things other people have written.

There is a power in actually being able to modify the environment that you do your work in. More software should do that.

That said, the webapp can be scaled to people who are not me (and to non-emacs users). But for a small task I think this is quite OK.

Also, last night I showed my partner this, and I made an offhand remark saying “I don’t know why so many companies are buying dataset annotation apps/services when they could have an entire interface written in less than 2 hours”. My partner then said “is this why you don’t see these startup ideas as worthy to pursue?”. Make of that what you will.

Pull requests are welcome.
