One Million


Doesn’t seem like a lot, does it? Some cultures, like the Pirahã in the Brazilian Amazon or the Warlpiri in Australia, use a one-two-many counting system. (It should be noted that one-two-many is a generalization. There are subtleties in the counting methods of both cultures that, upon deeper inspection, may not conform to the usual understanding of the one-two-many theory.) I would contend that all humans use a superset form of one-two-many.

I have been pondering this issue for the past two weeks. At work, I had to write a program to analyze and score 1 million URLs. This involved lots of natural language parsing and spidering of sites. My initial attempts were too slow, and I gained an immediate newfound respect for Google and their new Caffeine index.

My initial attempts had an expected completion time of 3 months. This won’t work, I thought to myself, so I stopped the process and gave it some thought. I ended up with an architecture of four bots, each running in its own process, with a RAM-based cache and a pseudo message-queue style of writing to disk. It completed scoring yesterday, and I received the email notification at about 4 a.m. It had taken 7 days and 6 hours to spider and score 1 million URLs (parsing and scoring took the longest).
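For the curious, here is a minimal sketch of that general shape, not the actual code. To keep it short and portable, threads stand in for the separate processes, `score_url` is a hypothetical placeholder for the real fetch-parse-score step, and the cache simply remembers scores by URL (the post does not say what the real cache held). Results go through a queue to a single writer, which is one way to read "pseudo message-queue style".

```python
import queue
import threading

NUM_BOTS = 4      # the post mentions four bots
SENTINEL = None   # marks "no more work" on a queue


def score_url(url):
    """Hypothetical stand-in for fetching, parsing, and scoring a URL."""
    return len(url)


def bot(tasks, results, cache, lock):
    """Pull URLs off the task queue, score them, and hand results to the writer."""
    while True:
        url = tasks.get()
        if url is SENTINEL:
            results.put(SENTINEL)  # tell the writer this bot is done
            return
        with lock:
            score = cache.get(url)
        if score is None:
            score = score_url(url)
            with lock:
                cache[url] = score  # RAM cache: skip re-scoring repeats
        results.put((url, score))


def writer(results, path, num_bots):
    """Single consumer that owns the output file, so bots never touch the disk."""
    finished = 0
    with open(path, "w", encoding="utf-8") as out:
        while finished < num_bots:
            item = results.get()
            if item is SENTINEL:
                finished += 1
                continue
            url, score = item
            out.write(f"{url},{score}\n")


def run(urls, path):
    """Score every URL with NUM_BOTS workers and write the results to path."""
    tasks, results = queue.Queue(), queue.Queue()
    cache, lock = {}, threading.Lock()
    bots = [
        threading.Thread(target=bot, args=(tasks, results, cache, lock))
        for _ in range(NUM_BOTS)
    ]
    sink = threading.Thread(target=writer, args=(results, path, NUM_BOTS))
    for thread in bots + [sink]:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in bots:
        tasks.put(SENTINEL)  # one stop marker per bot
    for thread in bots + [sink]:
        thread.join()
    return cache
```

Calling `run(urls, "scores.csv")` writes one `url,score` line per input URL. Since parsing and scoring were the slow part, real processes (for example via `multiprocessing`) would likely be the better fit in practice; the queue-and-single-writer layout stays the same.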

Why had I run into this problem? Because I had misunderestimated (I love my Bushisms) the number 1,000,000.

I am not a programmer. That is to say, I don’t write programs for a living. I write programs to extend my toolset to help me do my job. I’m a lazy person too. If there is something that I can do but is long and tedious, I would rather spend a bit more time to write a small script that can replace me. (The Big Bang Theory fans: I am both Sheldon and Koothrappali in this sense. For some things I can be replaced by a small script.)

So when I had the task of scoring 1 million URLs, I thought it could be solved with a quick and easy script. And I was right. Within 1 hour (and most of that hour was me making the mistake of dicking around with regular expressions, instead of installing BeautifulSoup and using it), I had a basic URL scraper and parser. The problem was the 1 million URLs.

1 million URLs in a CSV file is about 25MB in size. 1 million is a huge number that I thought I had grasped. After all, I see numbers on the scale of 15 billion (that’s the number of ad impressions served through the company’s ad servers) on a daily basis.

1,000,000 was just a number. But when it comes to cardinality, that is to say, counting, 1 million is a HUGE number. I had written a loop-based script that simply looped from 0 to 999,999. Alas, that was not the way to go.
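A bit of back-of-the-envelope arithmetic shows why the plain loop hurt. The sketch below uses only the figures from this post, with one assumption: "3 months" is taken as 90 days.

```python
from datetime import timedelta

URLS = 1_000_000

# Assumption: "3 months" is approximated as 90 days.
sequential = timedelta(days=90)
# The four-bot run took 7 days and 6 hours.
parallel = timedelta(days=7, hours=6)


def seconds_per_url(total, n=URLS):
    """Average wall-clock seconds spent per URL for a run lasting `total`."""
    return total.total_seconds() / n


speedup = sequential / parallel  # dividing two timedeltas gives a float

print(f"loop:      {seconds_per_url(sequential):.3f} s per URL")  # 7.776
print(f"four bots: {seconds_per_url(parallel):.3f} s per URL")    # 0.626
print(f"speedup:   {speedup:.1f}x")                               # 12.4x
```

Under that assumption, the loop was spending nearly 8 seconds on each URL, which sounds harmless until it is multiplied by a million.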

It is true that in modern society we have labels for each cardinal number. We know 60,000 is larger than 20,000. But it takes a bit of time to let the magnitude of these numbers sink in. And when they sink in, it’s mind-boggling. Even when it’s just 10^6. Which is why I contend that every human uses the one-two-many counting system. It’s just a question of how far we go before we give up labeling our numbers and say, “that’s many”.
