Random Documents from CouchDB

Share on RedditTweet about this on TwitterShare on FacebookShare on Google+Share on StumbleUpon

A few days ago I tweeted this:

I was finding ways to return a random document from CouchDB. At a former project at Pressyo we had used emit(Math.random(), doc), but I wasn’t quite happy with it — mainly because I had convinced myself through a small number of experiments that I could actually predict the random numbers that were being emitted (Spidermonkey I am giving you so many ಠ_ಠ now). Anyway, the conclusion was I wanted to find a better way of returning a random document (or more) from CouchDB.

And so to the Googles machine I went and I ran into a few solutions. The most prominent was this StackOverflow answer by Christian Berg. His Approach 2 intrigued me quite a bit as it involved probabilities. I implemented all his ideas with a live CouchDB database, with both 2 documents and many-documents in them. But first, let’s look at his ideas, then the results of my implementations (skip to the end result):

So, let’s assume a mostly uniform (and hence mostly fair) pseudo random number generator is used, like /dev/urandom.

  1. Upon document creation, a random number between (0, 1) is assigned to it.
  2. We generate a random number, r between [0, 1) and use it as the startkey for the view. This basically enables us to look for any number of documents that is larger than or equal to r
  3. We get the documents from CouchDB. If no documents were returned, that means r is too big, so we wrap around (presumably by using 1 - r as the startkey

For the purposes of this blog (and in fact for the purposes of my project), I want to minimize database calls[1]. So I will skip the wrapping around part. Since Christian didn’t mention what is meant by wrapping around, I will assume that he meant using 1 - r to form a new request. As such, I’ll ignore that part.

According to Christian, not all the documents will have an equal chance of being picked. In the example he gave, where two documents, A and B are assigned 0.2 and 0.9 respectively. On a non-wrap-around solution, the probability of document A being picked is 0.2 while the probability of document B being picked is 0.7 Here’s the working (skip the working):

\text{Let } r \in [0,1) \text{ step 0.1 }\\ Pr(r) = \frac{1}{10} \\ \text{Given } R_A = 0.2 \text{, and } R_B = 0.8 \\ Pr(R_A) = 2 \times 0.1 \text{; 2 ways A can be selected: } r \in \{0.1, 0.2 \} \\ Pr(R_B) = 7 \times 0.1 \text{; 7 ways B can be selected: } r \in \{0.3, 0.4...0.8, 0.9 \}

Here's how the probability distribution looks like:
Probability that a document is selected given r and A

The X-axis is r, a random number, and the Y-axis is the probability of a document being selected. Due to the rules we've created, a random number can never hit 1, and the documents will never hit 0 or 1. The colours of the lines represent document A's assigned values, while the facets represents document B's assigned value

Introduce a little anarchy

As you will note, this is the curious case with using startkey, which tends to favour documents with higher assigned numbers. And so I thought: What if, by act of randomness, the angle brackets were flipped? That is to say, at any given time, a view could be using descending=true with 50% chance?

Pr(R_A) = 0.5 \times 0.2 + 0.5 \times 0.8 \\ Pr(R_B) = 0.5 \times 0.7 + 0.5 \times 0.1

This is how the distribution, given a 50% chance of the angle brackets swithching looks like (Click here for the original, more accurate, and non-xkcdish chart):

The first thing you'll note is that the maximum probability of a document being selected has dropped to about 0.5. Which is great, since the variance of probability that a document will be selected is now effectively halved. Also, do note that now the probability of a document being selected isn't any more based on the assigned number. In fact, the probability of a document being selected is based on how close it is to r. Kinda. If you note, if the assigned value is in the middle of the number line (0,1), the overall probability is lower than compared to extremal values in the number line (i.e. closer to 0 or closer to 1)

Centroid-biased

What we have identified so far is that distance of the assigned numbers to the randomly selected numbers matter, albeit somewhat center biased - in that it favours the middle of the number line. In fact, let's visualize the entire methods summarized so far with a number line:

Number line visualization

The arrowed lines at the bottom represents r. The letters to the right and left of the arrowed lines indicate which document will be selected. You will note that B wins more often than A.

But this is the case this method being center biased. Imagine, if you will, a new document, C that is assigned 0.9. If you were to draw the above diagram, you'll find that B still wins more often. And this is a good idea. Why? As the number of documents increase, it will become less and less center-biased and become more and more dependent upon the distance to r. Here's the same distribution from the same data as above if you group it as a function of the distance to r. In fact I went ahead and plotted it xkcd style

Ignoring the strange dips[2], you will note that as the distance from r, the probability of being selected drops. You will note that it still misbehaves around r = 0.5. As the number of documents increase though, it will behave better.

Not Using This

As it turned out however, I didn't have a need to use this. I needed a get-random-documents view, but I didn't need to have a fair selection. In fact, what I needed was a view that retrieved X amounts of random documents, so I simply opted to select a random block instead of a random document, which used a variant of my above described methods. It also turned out that more often than not, a second database call was needed, for when the first call returns an empty set.

So I turned this into a blog post for anyone interested in retrieving random documents in CouchDB. I played with Christian Berg's third alternative, but I wasn't too happy with the results — too many lines of code needed to be written, and too many database calls for my liking. But it is a fair selection process. Well, as fair as can be. I also played with another method where I used a random endkey instead of startkey with descending=true as I thought endkey did the same bisecting function that I wanted. Turns out, one should always RTFM, as endkey didn't work that way, and oh boy the distributions were so messed up.

To Select A Random Document from CouchDB


For those of you who couldn't wait, or have read so far, here's how to do it:

  1. Assign every document with a random number on creation. Call the field random or something
  2. Write a view that searches that field
  3. Generate a random number, r.
  4. Flip a coin (or generate a binary random number)
  5. Put r into the view as startkey
  6. If heads, use descending=true. Store the fact whether or not you're using descending=true
  7. If no results we returned, use the flipside of it (i.e if you used descending=true the first time, don't use it the second time
  8. ???
  9. PROFIT!

Errata

This is why I shouldn't be blogging in the wee hours of the morning. I had uploaded ALL THE WRONG PICTURES! To which I have fixed, and descriptions fixed too, and as a reward to you, dear readers, I have re-rendered the distance chart in xkcd style, given I woke up this morning and thought to myself, "purisa looks like a great typeface in the number line diagram". If you find any more errors, please feel free to comment.

Update 2: I decided to also re-render the second probability distribution chart xkcd-style. Just because I like it.

So, did you try this? Tell me how it went, or suggest improvements!

  1. [1] There are good reasons for these: performance is affected by latency to the database server for example
  2. [2] Which has more to do with my selection code than anything else

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>