Wednesday, June 1, 2011

Visualization of randomness


As I was writing a function to generate random values, I realized it is hard to tell if the implemented function truly generates random values. A unit test can validate the length and the format of the generated values, but not its randomness.

To identify potential implementation flaws in my random generator, I developed the following tool.


Keep Last 100 Random Values Keep All Random Values
The chart above compares 2 random generators. It shows the frequency for each value generated. By default, the chart keeps the last 100 generated values. You can click on the radio button above the chart to keep track of all generated values.

What I expect to see from a bad random generator is an uneven distribution of the random values.

In the example above, I am comparing the 2 following JavaScript random generators. They both generate a random value from 0 to 9.
function randomGeneratorUsingFloor() { 
 return Math.floor( Math.random() * 10 ); 
}

function randomGeneratorUsingRound() { 
 return Math.round( Math.random() * 9 ); 
}
Can you spot what is wrong with the second implementation? The numbers 0 and 9 have a smaller chance to be generated compared to the other numbers. If this is not obvious on the chart above, select "Keep All Random Values", it should reveal the issue as shown in the picture below.

Math.random() generates a number between 0 and 1. The value generated is then multiplied. In the implementation using Math.floor(), all the values between 0 and 0.99.. will result as a 0. In the implementation using Math.round(), only the values between 0 and 0.49.. will result as a 0.

It appears that this is a common mistake as you can see it on Google Code Hosting and GitHub.

The test I put in place here is very basic. It exists some advanced work such as the Diehard tests. I also found a very creative randomness visualization on random-walk.com.

3 comments:

  1. Your displays would be even more useful if you overplotted the expected number of counts per bin, and +/- 2 or 3 sigma error bars above and below this count.

    If the chance of landing in a bin is p (= 1/10) then over N draws the expected number of counts per bin is Np and the standard deviation is sqrt(Np(1-p)). (Look up "variance of bernoulli random variable" in wikipedia.)

    After Np > 20 or so, the Normal approximation should be quite good, so you can approximate the chance of any bin's count landing outside the (say) 2-sigma error bars using erfc. If this probability is, say, 5%, then you would expect only about a 50/50 chance of any count among the 10 bins landing outside the 2-sigma error bars.

    These probabilities should be obeyed by your RNG. This would provide a much more quantitative test.

    ReplyDelete