What Every JavaScript Developer Should Know About Floating Point Numbers

After I gave my talk on JavaScript (really, I was there trying to shamelessly plug my book - Underhanded JavaScript and its alternate title: JavaScript Technical Interview Questions), there was a Q&A session. I could answer most questions, but Khalid Hilaby asked me a very interesting and quite general question on JavaScript number types. He had simply wanted to know more about floats in JavaScript and why they act so strangely. While I could answer the question, I felt I didn’t answer it well enough. I loaded my article on Pointer Tagging in Go to explain the structure of a floating point number, explained a bit about floating point arithmetic and how in the past there had to be special processors for floating points (FPUs; nowadays they're all integrated), and then sort of meandered from there.

Now that I am back in Sydney and well rested, I thought I’d give the question a second try. The result is the article - What Every JavaScript Developer Should Know About Floating Points on Flippin’ Awesome. This is the full unedited version, before I edited it down for length and appropriateness for Flippin’ Awesome.

This article assumes the reader is familiar with base-2 representations of base-10 numbers (i.e. 1 is 1b, 2 is 10b, 3 is 11b, 4 is 100b… etc). In this article, the word “decimal” mostly refers to the decimal representation of numbers (for example: 2.718). The word “binary” refers to a machine representation. Written representations will be referred to as “base-10” and “base-2”.

Floating Points

To figure out what a floating point is, we first start with the idea that there are many kinds of numbers, which we will go through. We call 1 an integer - it is a whole number with no fractional part.

½ is what’s called a fraction. It implies that the whole number 1 is being divided into 2. The concept of fractions is a very important one in deriving floating points.

0.5 is commonly known as a decimal number. However, a very important distinction needs to be made - 0.5 is actually the decimal (base-10) representation of the fraction ¹⁄₂. This is how ¹⁄₂ is represented when written as a base-10 number - call it the positional notation. We call 0.5 a finite representation because the number of digits in the representation of the fraction is finite - there are no more digits after the 5 in 0.5. An infinite representation would be, for example, 0.3333… when representing ⅓. Again, this idea will be important later on.

There is yet another way of representing numbers, other than as whole numbers, fractions or decimal notations. You might have actually seen it before. It looks something like this: 6.022 x 10^23 (that's Avogadro's number, which is the number of particles in a mole of a substance). It’s commonly known as the standard form, or scientific notation. That form can be generalized to something that looks like this:

D_1.D_2D_3D_4...D_p x B^E

The general form is called a floating point.

The sequence of p digits, D_1.D_2D_3D_4…D_p, is called the Significand or Mantissa. p is the number of significant digits, commonly called the Precision. In the case of Avogadro’s number above, p is 4. The x that follows the mantissa is the multiplication sign and is part of the notation (elsewhere in this article, * is used as the multiplication symbol). The Base comes after, raised to the Exponent. The exponent can be a positive or negative number.

The beauty of the floating point is that it can be used to represent ANY number at all. For example, the integer 1 can be represented as 1.0 x 10^0. The speed of light can be represented as 2.99792458 x 10^8 metres per second. ¹⁄₂ can be represented in base-2 as 0.1 x 2^0.
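
As a quick illustration of the significand-and-exponent form in base-10, JavaScript's built-in toExponential() method prints a number in exactly this shape. This is only a sketch to make the notation concrete; the variable names are made up for the example.

```javascript
// Number.prototype.toExponential(n) formats a number as d.ddd...e±x:
// a significand with n digits after the radix point, then a base-10 exponent.
const one = (1).toExponential(1);                 // "1.0e+0"
const lightSpeed = (299792458).toExponential(8);  // "2.99792458e+8"
const avogadro = (6.022e23).toExponential(3);     // "6.022e+23"

console.log(one, lightSpeed, avogadro);
```

Here "e+8" reads as "x 10^8", so the digits before it are the mantissa and the number after it is the exponent.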

The Radix Point

If the last example above seemed a little strange, it’s because we don’t normally see a representation of fractions in base-2. In case you were wondering how to represent fractions in binary with a radix point, I’m going to show you how.

But first, let’s have a look at the decimal representation. Why is ¹⁄₂ 0.5? If you’re like me, you learned how to do long division in school. That was also how it was explained why ¹⁄₂ is 0.5 - you simply divide 1 by 2:

      0.5
  2 ) 1.0
      1.0
      ---
        0

There is another way to look at fractions - look at them in terms of the number base and exponent. ¹⁄₂ can be expressed as a fraction with 10^1 as the denominator: 5⁄10. In fact, that is the rule when it comes to determining if a fraction can be finitely represented with a radix point - if it can be expressed as a fraction with a power of the base as the denominator, it can be finitely expressed with the radix point notation.

The idea behind the positional notation is a simple one. Let’s look at an example. Consider the number 19.95 (the price I’m considering for my books - Underhanded JavaScript and JavaScript Technical Interview Questions). It can be broken down into the positions as follows:

Digit:      1      9      .    9       5
Position:   10^1   10^0   .    10^-1   10^-2

This says that there is 1 unit in the 10 position, 9 units in the 1 position, 9 units in the 0.1 position and 5 units in the 0.01 position. This concept can likewise be extended to base-2 numbers. Instead of powers of 10, the positional notation for base-2 numbers has powers of 2 as the positions. This is why 10 in base-2 is 2, and why 100 in base-2 is 4.

To determine if a number can be finitely expressed in base-2, the same method as above applies - check to see if the fraction can be expressed with a denominator that is a power of 2. Let’s take a simple example: 0.75. 0.75 can be expressed as 3⁄4, and 4 is 2^2, or 100 in base-2. So in base-2 the fraction can be written as 11⁄100. We know then that this can be finitely expressed as 0.11. Doing long division with base-2 numbers yields the same result.

There is also a short-cut method to convert from decimal to base-2 radix point representation, which I use for quick mental estimation:

  1. Take the non-integral part of the decimal and multiply it by 2: 0.75 * 2 = 1.50.
  2. Reserve the integral part of the result - 1. The base-2 radix point representation now reads 0.1
  3. Take the non-integral part of the result and multiply it by 2: 0.5 * 2 = 1.00.
  4. Repeat 2 and 3 until the non-integral part is 0 (or, for fractions with no finite representation, until your heart is content). The base-2 radix point representation now reads 0.11
  5. Replace any integral part of the original decimal with the base-2 equivalent.
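
The steps above can be sketched in a few lines of JavaScript. This is a minimal sketch, assuming a non-negative input; fractionToBinary and maxDigits are names invented for this example, and maxDigits exists only to cut off fractions that never terminate.

```javascript
// Convert a non-negative number to a base-2 radix point string
// by repeatedly doubling the non-integral part.
// `fractionToBinary` and `maxDigits` are illustrative names, not from any library.
function fractionToBinary(x, maxDigits = 20) {
  let fraction = x - Math.floor(x);    // step 1: keep only the non-integral part
  let digits = "";
  while (fraction > 0 && digits.length < maxDigits) {
    fraction *= 2;                     // multiply by 2
    const bit = Math.floor(fraction);  // step 2: reserve the integral part (0 or 1)
    digits += bit;
    fraction -= bit;                   // step 3: carry on with the non-integral part
  }
  // step 5: put the base-2 form of the integral part in front
  return Math.floor(x).toString(2) + "." + (digits || "0");
}

console.log(fractionToBinary(0.75));   // "0.11"
console.log(fractionToBinary(19.75));  // "10011.11"
console.log(fractionToBinary(0.1));    // "0.00011001100110011001" (cut off at 20 digits)
```

For a cross-check, the built-in (0.75).toString(2) also gives "0.11".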

Now, try it for yourself with either method, using the fraction 1⁄10. Interesting result, isn’t it? This will be important later on.

Removing the Radix Point

In the above examples, we’re still quite tied to having a radix point (the dot in the number). This presents some problems when it comes to representing something in binary. Given an arbitrary number, say π, we can represent it as a floating point as such: 3.14159 x 10^0. In a base-2 representation, it would look something like this: 11.00100100001111…. Assuming that the number is represented in 16 bits, this means the digits would be laid out in the machine like this: 1100100100001111. The question now is this: where is the radix point supposed to be? This doesn’t even yet involve the exponent (we implicitly assume the base is base-2).

What if the number was 5.14159? The integral part would be 101 instead of 11, requiring one more bit. Of course, we could specify that the first n bits belong to the integer part (i.e. the left of the radix point), and the rest belong to the fractional part, but that’s a topic for another article about fixed point numbers.

Once we remove the radix point, then we only have two things to keep track of: the exponent and the mantissa. We can remove the radix point by applying a transformation formula, making the generalized floating point look like this:

D_1D_2D_3D_4...D_p ⁄ B^(p-1) x B^E

This is where we derive most of our binary floating points from. Note that the significand is now an integer. This makes it far simpler to store a floating point number in a machine. In fact, the most widely used method of representing floating points in binary is with the IEEE 754 format.

IEEE 754

The representation of floating points in JavaScript follows the format as specified in IEEE-754. Specifically it is a double-precision format, meaning that 64 bits are allocated for each floating point. Although it is not the only way to represent floating points in binary, it is by far the most widely used format (thank goodness for that - I'd be loath to work on one format and then switch to IBM's BCP for another machine. Imagine the hell!). The format is represented in 64 bits of binary like so:

Field:   sign   exponent       mantissa
Bits:    s      eeeeeee eeee   ffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
Width:   1      11             52

Of the 64 bits available, 1 bit is used for the sign - whether a number is positive or negative. 11 bits are used for the exponent - this allows for up to 1023 as an exponent. The reason for this is that the exponent uses something called offset binary encoding to encode negative numbers: the stored value is the actual exponent plus 1023. What this basically means is that if all 11 bits are set to 0 (the decimal equivalent is 0), the exponent is nominally -1023 in decimal. When all 11 bits are set to 1 (the decimal equivalent is 2047), the exponent is nominally 1024 in decimal. The stored exponent of 2047 is actually reserved for special numbers, as described below, and the stored exponent of 0 is likewise reserved (for zero and for very small, denormalized numbers), so ordinary numbers have exponents from -1022 to 1023.

The remaining 52 bits are allocated for the mantissa. Even that is interesting. Look through a list of scientific constants - they’re all written in scientific notation. Notice that to the left of the radix point, there is usually only one non-zero digit. This is called the normalized form. Likewise with floating points, there is a concept of having a normalized form - in fact, floating points are stored in the normalized form in binary according to the IEEE-754 standard. However, there is an interesting feature when storing the normalized form.

Let us consider the fraction 3⁄4. In base-2, it is written 0.11. This is not the normalized form. The normalized form is written 1.1 x 2^-1 - recall that the integral part of the positional notation cannot be 0 in the normalized form. It is the normalized form that is stored according to the specification.

Because in base-2, digits can only either be 0 or 1, the normalized form of the floating point always has the form 1.xxxx x 2^E. This is a convenient feature - you don’t need to store the first digit - it’s implied to always be 1. This gives one whole extra bit of precision. So the mantissa field only stores the bits after the radix point. In the case of 3⁄4, the mantissa is 1000000000000000000000000000000000000000000000000000. Laid out in memory, this is what 3⁄4 looks like:

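Field:   sign   exponent       mantissa
Bits:    0      0111111 1110   1000 00000000 00000000 00000000 00000000 00000000 00000000

(The exponent field holds 1022, which is the actual exponent -1 plus the offset of 1023.)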
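
If you'd rather inspect the bits from JavaScript itself, here is one way to do it, sketched with the standard DataView API. doubleBits is a helper name invented for this example, not a built-in.

```javascript
// Split a JavaScript number into the three IEEE-754 double-precision fields.
// `doubleBits` is an illustrative helper name, not a built-in.
function doubleBits(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x); // big-endian by default, so the sign bit comes first
  let bits = "";
  for (let i = 0; i < 8; i++) {
    bits += view.getUint8(i).toString(2).padStart(8, "0");
  }
  return {
    sign: bits.slice(0, 1),       // 1 bit
    exponent: bits.slice(1, 12),  // 11 bits, stored with an offset of 1023
    mantissa: bits.slice(12),     // 52 bits, without the implied leading 1
  };
}

const threeQuarters = doubleBits(0.75);
console.log(threeQuarters.sign);                   // "0"
console.log(threeQuarters.exponent);               // "01111111110"
console.log(parseInt(threeQuarters.exponent, 2));  // 1022, i.e. an exponent of 1022 - 1023 = -1
console.log(threeQuarters.mantissa);               // "1" followed by 51 zeros
```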


The specification also allows for special numbers. Both infinity and NaN, for example, are encoded with 2047 in the exponent: infinity has 0 in the mantissa field, while NaNs have a mantissa ranging from 1 (only the last mantissa bit is 1) to 4503599627370495 (all the mantissa bits are 1). Beyond being non-zero, the value in the mantissa field of a NaN is ignored. Since pointers on most current 64-bit machines only use 48 bits, this allows for some really cool hacking - such as storing pointers inside NaNs.

This floating point format also explains why in JavaScript, there exists +0 and -0 as well as +Infinity and -Infinity - the sign bit in the front denotes that. The IEEE-754 specification also specifies that NaN will always compare unordered to any operand, even with itself, which is why in JavaScript, NaN === NaN will yield false.
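
These special values are easy to poke at in a console. A small sketch (the constant names are just for the example):

```javascript
// Signed zeros: === treats them as equal, Object.is() tells them apart.
const zerosStrictlyEqual = 0 === -0;      // true
const zerosSameValue = Object.is(0, -0);  // false

// Dividing by each zero shows which sign it carries.
const positiveInfinity = 1 / 0;           // Infinity
const negativeInfinity = 1 / -0;          // -Infinity

// NaN compares unordered, even with itself; use Number.isNaN() to detect it.
const nanEqualsItself = NaN === NaN;      // false
const isNaNDetected = Number.isNaN(0 / 0); // true

console.log(zerosStrictlyEqual, zerosSameValue);
console.log(positiveInfinity, negativeInfinity);
console.log(nanEqualsItself, isNaNDetected);
```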

If ever you want to look at how numbers are encoded in JavaScript, the IEEE 754 Decimal Converter is actually a good site to check out.

Rounding Errors

With the introduction to floating points done, we now enter a more prickly topic - rounding errors. It is the bane of all developers who develop with floating point numbers, JavaScript developers doubly so, because the only number format available to JavaScript developers is the floating point number.

It was mentioned earlier that fractions like ⅓ cannot be finitely represented in base-10. Every base has fractions like that. For example, in base-2, 1⁄10 cannot be finitely represented. It is represented as 0.000110011001100110011…. Note that 0011 is infinitely repeating. It is this particular quirk that causes rounding errors.

But first, a primer on rounding errors. Consider one of the most famous irrational numbers, Pi: 3.141592653589793…. Most people remember the first 5 digits (3.1415) really well - that’s an example of rounding down, which we will use for this example. (Unless you're a bible thumping Christian. Then you probably only remember 1 digit - 3. It's in the bible (1 Kings 7:23-26). That or you lived in Indiana circa 1897.) The rounding error can hence be calculated as such:

(R - A) ⁄ B^(p-1)

Where R stands for the rounded number, and A stands for the actual number. B is the base as previously seen, as is p, which is the precision. So the oft-remembered Pi is off from the actual value by 0.00009265… (the size of R - A).

While this does not sound too severe, let’s try this idea with base-2 numbers. Consider the fraction 1⁄10. In base-10, it’s written as 0.1. In base-2, it is: 0.00011001100110011…. Assuming we round down to just 5 digits, it’d be written as 0.0001. But 0.0001 in binary is actually 1⁄16 (or 0.0625)! This means there is a rounding error of 0.0375, which is rather large. Likewise, 0.2 would become 0.0011, which is 3⁄16 (or 0.1875). Imagine doing basic mathematics like 0.1 + 0.2, and the answer returns 0.25!
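
To see this kind of low-precision arithmetic in action, here is a rough sketch that chops a number down to a handful of base-2 digits after the radix point. truncateBinary is a made-up name for this example, and the example assumes both operands get rounded down the same way.

```javascript
// Round x down to `bits` base-2 digits after the radix point.
// `truncateBinary` is an illustrative name, not a built-in.
function truncateBinary(x, bits) {
  const scale = 2 ** bits;
  return Math.floor(x * scale) / scale;
}

const roughTenth = truncateBinary(0.1, 4); // 0.0001 in base-2
const roughFifth = truncateBinary(0.2, 4); // 0.0011 in base-2

console.log(roughTenth);              // 0.0625
console.log(roughFifth);              // 0.1875
console.log(roughTenth + roughFifth); // 0.25
```

With only four fractional bits, the sum is already off in the second decimal place.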

Fortunately, the floating point specification that ECMAScript uses specifies 52 bits of mantissa (making it 53 bits of information with the implied leading 1), so the rounding errors are quite small. In fact the specification actually goes into the details of the errors, and uses a fascinating metric called the ulp (unit in the last place) to define the precision of the floating point. Because conducting arithmetic operations on floating points causes errors to build up over time, the IEEE 754 specification also comes with specific requirements for how mathematical operations round their results.

However, it should be noted that despite all that, the associative property of binary operations (like addition and multiplication) is not guaranteed when dealing with floating points, even high precision ones. What I mean by that is (((x + y) + a) + b) is not necessarily equal to ((x + y) + (a + b)).
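
A small demonstration of this, using three ordinary-looking values (the variable names are just for the example):

```javascript
// The same three numbers, grouped differently, give different sums.
const leftGrouped = (0.1 + 0.2) + 0.3;
const rightGrouped = 0.1 + (0.2 + 0.3);

console.log(leftGrouped);                  // 0.6000000000000001
console.log(rightGrouped);                 // 0.6
console.log(leftGrouped === rightGrouped); // false
```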

And that is the bane of JavaScript developers. For example, in JavaScript, 0.1 + 0.2 === 0.3 will yield false. Hopefully, by now you would know why. What is worse, of course, is the fact that rounding errors add up with each successive mathematical operation performed.
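
Here is what that looks like in a console, along with a small loop showing the errors piling up (the variable names are just for the example):

```javascript
const sumOfTenthAndFifth = 0.1 + 0.2;
console.log(sumOfTenthAndFifth);         // 0.30000000000000004
console.log(sumOfTenthAndFifth === 0.3); // false

// Adding 0.1 ten times does not quite reach 1.
let runningTotal = 0;
for (let i = 0; i < 10; i++) {
  runningTotal += 0.1;
}
console.log(runningTotal);               // 0.9999999999999999
console.log(runningTotal === 1);         // false
```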

Handling Floating Points in JavaScript

I have one suggestion as to how to handle floating points in JavaScript: don’t. But of course, given that JavaScript is such a shit language and only has one numerical type, it is unavoidable. There have been plenty of suggestions, both good and bad, when it comes to dealing with JavaScript numbers. Most of these suggestions have to do with rounding numbers in JavaScript before or after binary operations.

The worst advice I’ve actually heard so far is to “expect floating point rounding errors, and duct tape around it”. The advice then follows on to say - if you expect 0.1 to be 0.10000000000000001 then work as if you’re working with 0.10000000000000001 all the time. I mean, wtf is with that kind of ridiculous advice??! Sorry, but that’s plain dumb.

Another suggestion - one that isn’t actually too bad on the surface but shows all sorts of problems once you’ve given it some thought - is storing everything as an integer number (not the type) for operations, and then formatting it for display. An example can be seen as used by Stripe - the amounts are stored in cents. This has a notable problem - not all currencies in the world are actually decimal (Mauritania). There are also currencies with no subunits (Japanese Yen), with non-100 subunits (Jordanian Dinars), or with more than one subunit (Chinese Renminbi). Eventually, you’d just recreate the floating point. Probably poorly too.
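
For concreteness, the integer approach looks roughly like the sketch below. This is not Stripe's API; formatMinorUnits and its decimals parameter are hypothetical names, and the per-currency digit counts (2 for cents, 0 for yen, 3 for dinars) are assumptions you would have to supply yourself, which is exactly where the trouble starts.

```javascript
// Amounts are kept as whole numbers of the smallest subunit,
// and only converted to a decimal string for display.
// `formatMinorUnits` and `decimals` are hypothetical names for this sketch.
function formatMinorUnits(amount, decimals) {
  return (amount / 10 ** decimals).toFixed(decimals);
}

const bookPriceInCents = 1995;
const threeBooksInCents = bookPriceInCents * 3; // 5985, exact integer arithmetic

console.log(formatMinorUnits(threeBooksInCents, 2)); // "59.85"
console.log(formatMinorUnits(500, 0));               // "500"   (no subunit)
console.log(formatMinorUnits(1500, 3));              // "1.500" (1000 subunits per unit)
```

Every currency needs its own decimals value, and a non-decimal currency doesn't fit this function at all.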

The best suggestion I’ve seen for handling floating points is to use properly tested libraries like sinfuljs or mathjs. I personally prefer mathjs (but really, for anything mathematics related I wouldn’t even go near JavaScript). BigDecimal is also extremely useful when arbitrary precision math needs to be done.

Another oft-repeated piece of advice is to use the built-in toPrecision() and toFixed() methods on numbers. A big warning to anyone thinking of using them - those methods return strings. So if you have something like:


function foo(x, y) {
    // toPrecision() returns strings, so + concatenates instead of adding
    return x.toPrecision() + y.toPrecision()
}

> foo(0.1, 0.2)
"0.10.2"

The built-in methods toPrecision() and toFixed() are really only for display purposes. Use with caution! Now go forth and multiply (safely)!
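
A short sketch of using toFixed() the way it is meant to be used, for display, and of what you have to do if you really want a number back (the variable names are just for the example):

```javascript
const rawSum = 0.1 + 0.2;               // 0.30000000000000004
const displayed = rawSum.toFixed(2);    // formatted for display

console.log(displayed);                 // "0.30"
console.log(typeof displayed);          // "string"

// Converting back gives a number again, but it is a new, rounded value.
const roundTripped = Number(displayed);
console.log(roundTripped === 0.3);      // true
```

Note that the round trip silently throws away everything past the second decimal place, so do it only at the edges of your program, not in the middle of a calculation.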

Conclusion

JavaScript numbers are really just floating points as specified by IEEE-754. Due to inadequacies when representing numbers in base-2, as well as a finite machine, we are left with a format that is filled with rounding errors. This article explains those rounding errors and why errors occur. Always use a good library for numbers instead of building your own. If you are interested in reading more about floating points, I highly recommend the fairly awesome Handbook of Floating Point Arithmetic. It’s a bit pricey and a bit difficult to read, but if you take your time with it, it will come through.

Another book that is actually a good and very difficult read is Modern Computer Arithmetic. I haven’t finished it - I mainly skipped the proofs and used it for reference.

If you’re interested in comparing floats, the seminal paper is one written by Bruce Dawson. It’s a must-read if you are serious about comparing floating points (but really, don’t do it. I’ve done it and it was terrible). In fact, Bruce Dawson’s body of work is quite a good read - it is scattered everywhere on the Internet though, so you will have to go find it yourself.

Incidentally, this topic on floating points is also covered in my books (Underhanded JavaScript and its alternate title: JavaScript Technical Interview Questions) - with a little bit more detail on representation of numbers and the like. I hadn’t originally wanted to address floating points, but on the flight back I wrote a long chapter on it while figuring out how to write this article. So if you liked this article, do buy the damn book :P
