Small Languages

“I like small languages,” said a friend of mine.

“Yeah, me too. Wait. What do you mean by small languages?” I replied.

“You know, small. JavaScript. Lisp. Small, stuff… Not big,” he faltered as he struggled with the rest of his sentence.

That led to a series of discussions about what a small language is. We eventually enumerated a list of languages which we knew and could classify. Languages which we mutually agree are small are listed in small fonts; languages which we mutually agree are large are listed in large fonts:

  • C
  • Scheme
  • Lua
  • Python
  • Go
  • JavaScript
  • Haskell
  • Java
  • C#

Comparing Keywords

So what exactly is a “small language”, and why do people clamour for it? One metric discussed is the number of reserved keywords a language has. We quickly (and I must add, roughly) looked up and tabulated the results:

Language   | Reserved Words Count | Sources
C          | 32 (C99); 37 (C11)   | Wikipedia
Scheme     | 20 (R4RS); 0 (R5RS)  | R4RS (PDF); R5RS
Lua        | 21                   | Lua Reference Manual (§2.1)
Python     | 31 (2.7.8); 33 (3.4) | Python 2.7.8 reference; Python 3.4 reference (§2.3.1)
Go         | 25                   | Go Specification
JavaScript | 41 (5.1); 43 (6)     | ECMAScript 5.1 Specification; ECMAScript 6 in MDN
Haskell    | 55                   | Haskell wiki
Java       | 50                   | Wikipedia
C#         | 79                   | MSDN

Okay, so C doesn’t fit into the (limited) picture of what a small language is. It’s also obvious that there are non-functional reserved words in JavaScript which I have included (i.e. the stupid weird shit known as future reserved words). The Haskell keywords wiki itself counts certain operators and lexemes as keywords (like comments).
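For Python at least, the count does not have to come from reading the reference by hand. Here is a minimal sketch using the standard library's keyword module; the helper name python_keyword_count is my own, and the number it reports depends on which interpreter runs it, so it will not necessarily match the 2.7.8 or 3.4 figures in the table.

```python
import keyword


def python_keyword_count():
    """Return how many reserved words the running interpreter defines."""
    # keyword.kwlist is the interpreter's own list of reserved words.
    return len(keyword.kwlist)


# The result varies between Python versions.
print(python_keyword_count())
print(sorted(keyword.kwlist)[:5])
```

The other languages in the table have no equivalent that I know of, so those counts still come from the specifications.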

A big problem with using keyword count is clearly that it’s not really representative of the language, though it is indicative of what we’re looking for. C# and Java have a large number of keywords (the words “enterprise” and “bureaucracy” also come to mind when these two languages are mentioned), and are often thought of as large languages.

Using keyword count also sheds some light on the question of “what is a small language”. For such a short phrase, it’s surprisingly ambiguous. What could a small language mean? Is it small in terms of its built-in functions? Is it small in terms of the standard library that comes with the language? Neither of these seems likely to be what people usually have in mind. A small grammar could conceivably be what people mean when they talk about “small languages”.

So I thought to myself: why not compare grammars?

Comparing Grammars

Fortunately, most programming languages have a well-defined grammar that can usually be expressed in (Extended) Backus-Naur Form. And fortunately too, most language specifications actually do specify the grammar of the language in some vague EBNF-like form.

So the solution is to count the number of production rules in the official language specifications. I’m going to ignore C# (because if there are 79 keywords, imagine how long the EBNF is going to be; here’s where to find the C# 4.0 EBNF) and Java (the Java 8 specification has 17 pages’ worth of EBNF).

Some grammar specifications were embedded in long web pages, which I have extracted. Most of these grammars are not proper EBNF, but they resemble it closely enough, in the sense that they follow the pattern Production Rule [separator] Nonterminals (some language specs have terminals in them too). I also transformed all the grammars so that each line contains exactly one production rule. Counting the number of production rules is then simply counting the lines in a file.
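The one-rule-per-line transformation and the line count can be sketched roughly like this. It is not the extraction script I actually used. It assumes that a line containing the separator starts a new rule and that any other non-blank line continues the previous one; the function names, the "::=" separator and the sample grammar are all made up for illustration.

```python
def one_rule_per_line(grammar_text, separator="::="):
    """Join continuation lines so that each production rule sits on one line.

    Assumption: a line containing `separator` starts a new rule, and any
    other non-blank line continues the previous rule.
    """
    rules = []
    for raw in grammar_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no rule
        if separator in line or not rules:
            rules.append(line)
        else:
            rules[-1] += " " + line  # continuation of the previous rule
    return rules


def count_production_rules(grammar_text, separator="::="):
    """Count rules by counting the normalised lines."""
    return len(one_rule_per_line(grammar_text, separator))


# A made-up toy grammar, not taken from any of the specifications.
sample = """
expr ::= term
       | expr '+' term
term ::= NUMBER
       | '(' expr ')'
"""

print(count_production_rules(sample))  # prints 2
```

Real specifications use different separators (":", "=", "->" and so on), which is why each grammar needed its own bit of massaging before the count meant anything.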

Here’s what the languages I considered looked like:

Language                          | Production Rules | Source
C                                 | 68               | Source
Scheme                            | 122              | R5RS Grammar
Lisp (that you built in a day)    | < 10             |
Lua                               | 22               | Lua 5.1 Reference Manual
Python                            | 82               | Python 3.4.1 Complete Grammar
Go                                | 153              | Go specification; Extraction Script
JavaScript *                      | 157              | ECMAScript 5.1 Specification; Extraction script
Haskell                           | 78               | Haskell 2010 report

* I only extracted a portion, since ECMA also specifies the JSON grammar, which I consider to be outside the programming language, as well as the regex grammar.

So there you have it. A comparison of grammars based on how many rules there are. These are the “lightweight” languages. Needless to say C# and Java will have a lot more rules.

Caveat

As always (in most of my blog posts I have one of these), there is a caveat. For each language there can be multiple EBNFs - it depends on how you want to specify your grammar. The EBNFs I used are mostly the official grammars from the language specifications themselves. Some, like C’s, are printed on dead tree paper, and I’m not going to type that out, so I took a shortcut and used someone else’s.

Perhaps another way to think of “small languages” is what kind of parser could be built for those languages. If an LL(1) parser could be built for the grammar, that implies it’s a simple grammar, and that means it’s a small language (yes, I am aware that I am conflating the concepts of simplicity with size). That pretty much also means that in the list above, only Python qualifies (and even then it needs a lot of crutches).
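To make “LL(1)” a bit more concrete: it means the parser can always decide which rule to apply by looking at just the next token. Here is a minimal sketch for a toy S-expression grammar (expr := ATOM | '(' expr* ')'), roughly the kind of Lisp you could build in a day. The grammar and the function names are my own illustration and are not taken from any of the specifications above.

```python
import re


def tokenize(source):
    """Split source into '(' , ')' and atom tokens."""
    return re.findall(r"[()]|[^\s()]+", source)


def parse_expr(tokens, pos=0):
    """Parse expr := ATOM | '(' expr* ')' and return (tree, next position).

    Each decision below looks only at tokens[pos], one token of lookahead.
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = parse_expr(tokens, pos)
            items.append(item)
        if pos >= len(tokens):
            raise SyntaxError("missing ')'")
        return items, pos + 1  # skip the closing ')'
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return token, pos + 1  # a bare atom


def parse(source):
    """Parse a single expression and reject trailing tokens."""
    tokens = tokenize(source)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return tree


print(parse("(define (square x) (* x x))"))
# prints ['define', ['square', 'x'], ['*', 'x', 'x']]
```

The whole parser fits in one short recursive function with no backtracking, which is about as small as a grammar gets.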

Small Languages, A Conclusion

I can definitely see the appeal of small languages. A small grammar implies a smaller cognitive load. This leaves more space for the structure of the program that you’re writing itself. Of course small languages can be complex too - I often had to look up where to put the * during the course of whatever little C programming I have done (can’t remember the spiral rule for the life of me).

I think this also highlights how important the stdlib of a language is. Out of these languages, I love and use Go and Python the most simply because they have fantastic stdlibs. Small core, but great stdlibs, that’s what I look for in a language.

But surely this is not the only appeal of small languages? Tell me why you like small languages below.
