Small Languages

Share on RedditTweet about this on TwitterShare on FacebookShare on Google+Share on StumbleUpon

“I like small languages,” said a friend of mine.

“Yeah, me too. Wait. What do you mean by small languages?” I replied

“You know, small. JavaScript. Lisp. Small, stuff… Not big,” he faltered as he struggled with the rest of his sentence.

That led to a series of discussions about what a small language is. We eventually enumerated a list of languages which we knew and could classify. Languages which we mutually agree are small are listed in small fonts; languages which we mutually agree are large are listed in large fonts:

  • C
  • Scheme
  • Lua
  • Python
  • Go
  • JavaScript
  • Haskell
  • Java
  • C#

Comparing Keywords

So what exactly is a “small language”, and why do people clamour for it? One metric discussed is the number of reserved keywords a language has. We quickly (and I must add, roughly) looked up and tabulated the results:

Language Reserved Words Count Sources
C 32(C99); 37(C11) Wikipedia
Scheme 20(R4RS); 0(R5RS) R4RS (PDF); R5RS
Lua 21 Lua Reference Manual(§2.1)
Python 31 (2.7.8); 33 (3.4) Python 2.7.8 reference; Python 3.4 reference (§2.3.1)
Go 25 Go Specification
JavaScript 41 (5.1); 43 (6) ECMAScript 5.1 Specification; ECMAScript 6 in MDN
Haskell 55 Haskell wiki
Java 50 Wikipedia
C# 79 MSDN

Okay, so C doesn’t fit into the (limited) picture of what a small language is. It’s also obvious that there are non-functional reserved words in JavaScript which I have included (i.e. the stupid weird shit known as future reserved words). The Haskell keywords wiki itself counts certain operators and lexemes as keywords (like comments).

A big problem with using keyword count is clearly that it’s not really representative of the language, though it is indicative of what we’re looking for. C# and Java have a large number of keywords (the word “enterprise” and “bureaucracy” also comes to mind when these two languages are mentioned), and are often thought of as large languages.

The use of keywords count also shed some light on the question of “what is a small language”. For such a short phrase, it’s surprisingly ambiguous. What could a small language mean? Is it small in terms of its built in functions? Is it small in terms of the standard library that comes with the language? These two do not seem to be likely what people are most often talking about when it comes to small languages. A small grammar could conceivably be what people mean when they talk about “small languages”.

So I thought to myself: why not compare grammars?

Comparing Grammars

Fortunately most programming languages have a well-defined grammar and they can be usually be expressed in (Extended) Backus-Naur Form. And fortunately too, most language specifications actually do specify the grammars of the language in some vague EBNF-like form.

So the solution is to count the number of production rules in the official language specifications. I’m going to ignore C# (because if there are 79 keywords, imagine how long the EBNF is going to be – Here’s where to find the C# 4.0 EBNF), and Java (the Java 8 specification has 17 pages worth of EBNFs.)

Some grammar specifications were embedded in long web pages, which I have extracted. Most of these grammars are not proper EBNF, but they resemble closely enough in the sense they follow this pattern Production Rule [separator] Nonterminals (some language spec has terminals in them). I also transformed all the grammars so that 1 line contained 1 production rule. Counting the number of production rules is simply counting the lines in a file.

Here’s what the languages I considered looked like:

Language Production Rules Source
C 68 Source
Scheme 122 R5RS Grammar
Lisp [1] < 10
Lua 22 Lua 5.1 Reference Manual
Python 82 Python 3.4.1 Complete Grammar
Go 153 Go specification; Extraction Script
JavaScript[2] 157 ECMAScript 5.1 Specification; Extraction script
Haskell 78 Haskell 2010 report

So there you have it. A comparison of grammars based on how many rules there are. These are the “lightweight” languages. Needless to say C# and Java will have a lot more rules.


As always (in most of my blog posts I have one of these), there is a caveat. For each language there can be multiple EBNFs – it depends on how you want to specify your grammar. The EBNFs I used are mostly official grammars from the language specifications themselves (some like C are printed on dead tree paper, so I’m not going to type that out, so I shortcutted and used someone else’s).

Perhaps another way to think of “small languages” is what kind of parser could be built for those languages. If a LL(1) parser could be built for the grammar, that implies it’s a simple grammar, and that means it’s a small language (yes, I am aware that I am conflating the concepts of simplicity with size). That pretty much also means that in the list above, only Python could be parsed (and even so it does with a lot of crutches).

Small Languages, A Conclusion

I can definitely see the appeal of small languages. A small grammar implies a smaller cognitive load. This leaves more space for the structure of the program that you’re writing itself. Of course small languages can be complex too – I often had to look up where to put the * during the course of whatever little C programming I have done (can’t remember the spiral rule for the life of me).

I think this also highlights how important the stdlib of a language is. Out of these languages, I love and use Go and Python the most simply because they have fantastic stdlibs. Small core, but great stdlibs, that’s what I look for in a language.

But surely this is not the only appeal of small languages? Tell me why you like small languages below.

  1. [1]that you built in a day
  2. [2]I only extracted a portion, since ECMA also specifies the JSON grammar, which is what I consider to be outside the programming language, as well as the regex grammar
  1. “Okay, so C doesn’t fit into the (limited) picture of what a small language is.”

    I assume you meant to write …C#…?

      • Interesting. I’ve never seen the number of AST nodes counted for a language. I think this might be a better metric than counting grammar rules. The grammar rules are parsed and then thrown away, either by the parser or your brain. It’s the AST that you have to keep in your head in order to reason about a program. Would this be a way to “measure” semantics? A language with more kinds of AST nodes would have more complex semantics, eh?

  2. (yes, I am aware that I am conflating the concepts of simplicity with size)

    That’s quite all right. Just say you’re measuring the Kolmogorov complexity and you’re in the clear. In fact, that might even work as a measure of small languages.

    • Hey WIM,

      IINM, Kolmogorov complexity can only be used to measure the complexity of a program, not the language specification.

      Unless of course, you’re saying – dump the entire grammar into a kolmogorov calculator or something