27 August 2009

Playing Fair with the Chi-Square Test of Homogeneity

Over at The Scrapyard Armory, Ellis/Saxywolf has written a great post about testing to see if dice are fair. This is similar to the Water Test I wrote about back in September (my second post!), but quite frankly, Ellis has taken it a step further, done a lot of hard work, and done a far better job with it than I did. Well done Ellis!

Go read Ellis' Fair Dice post and the comments, then stop back, because I've got a bit more to add.

In response to that post I created an Excel spreadsheet to do a Chi-Square test of Homogeneity. That's statisticalese for a test of equal proportions. In this case, it tests the null hypothesis that the probability of rolling each side of a die is equal to 1/6, versus the alternative hypothesis that those probabilities are not all equal to 1/6. Alternately, or more formally:

H0: p_i = 1/6, for all i = 1,2,3,4,5,6
HA: At least one p_i ~= 1/6, (read "~=" as "not equal")

Here is a screen-cap of the spreadsheet:

To use the spreadsheet, roll your die a bunch of times and tally up the number of times each side is rolled. You will need a minimum of 30 rolls for the result to be valid (the usual rule of thumb is an expected count of at least 5 per side), and unless the die you are rolling is obviously unbalanced, you will need several hundred rolls before you can reliably detect a small imbalance. Enter your counts in the appropriate cells in the spreadsheet, and then look up the p-value.

A p-value is a standard way of interpreting the results of a statistical test (computers are good at calculating them, but in the old days we had to use published tables of numbers to interpret results). The p-value is the probability of getting the counts you entered in the spreadsheet, or any more extreme results that might have occurred, IF the assumption that the die is fair (probability of each side is 1/6) really is true.
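If you don't have Excel handy, the same calculation can be sketched in a few lines of Python. This is not the spreadsheet itself, just a minimal stand-in using only the standard library; the function names (`chi_square_sf`, `fair_die_test`) and the tallies are made up for illustration.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: the tail probability is a finite sum times exp(-y).
        terms = sum(y ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-y) * terms
    # Odd df: start from the df = 1 case (erfc) and add the correction terms.
    terms = sum(y ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * terms

def fair_die_test(counts):
    """Return (chi-square statistic, p-value) for H0: every side is equally likely."""
    n = sum(counts)
    expected = n / len(counts)  # same expected count for every side under H0
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi_square_sf(stat, len(counts) - 1)

# Made-up tallies for 120 rolls of a d6 (illustrative only, not real data).
stat, p = fair_die_test([18, 22, 19, 25, 16, 20])
print(f"chi-square = {stat:.3f}, p-value = {p:.3f}")
```

Because the degrees of freedom are taken from the number of counts, the same sketch should work for other dice: pass in 20 tallies for a d20.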

If the p-value is small, generally less than 0.05 (this cutoff is called the type I error rate), this indicates a result that is unlikely to occur with a fair die. This error rate is a choice about how the test will be interpreted:

1) If the die is fair, the p-value is random and will be less than 0.05 about 5% of the time simply by random chance (that's the error rate again).

2) If the die is unbalanced, and the assumption of a fair die is false, the p-value will be less than 0.05 MORE than 5% of the time. The more unfair the die, the more likely the p-value will be less than 0.05. Exactly how likely is a complex calculation, but the more times the die is rolled, the more likely you are to correctly detect an unfair die (called statistical "power"), and the smaller the degree of "unfairness" you will be able to detect.

3) A word of caution: a smaller p-value does not necessarily indicate greater "unfairness", and you should not compare p-values between dice to determine which is more fair. This is because the p-values are partly random, so it is not meaningful to compare them that way. Instead, try looking at the ratio of proportions for the same side on two dice (this kind of statistic is called a relative risk, a close cousin of the odds ratio).
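Points 1 and 2 are easy to check by simulation. The sketch below rolls a simulated die many times over and counts how often the test would flag it. It assumes a d6, uses the standard chi-square cutoff of 11.0705 (5 degrees of freedom, p = 0.05) instead of computing a p-value each time, and the "loaded" die with one side about 30% more likely is a purely hypothetical example.

```python
import random

CRITICAL_5DF = 11.0705  # chi-square cutoff for p = 0.05 with 5 degrees of freedom

def rejection_rate(weights, rolls, trials, rng):
    """Fraction of simulated experiments whose chi-square statistic exceeds the cutoff."""
    expected = rolls / 6
    rejections = 0
    for _ in range(trials):
        counts = [0] * 6
        for side in rng.choices(range(6), weights=weights, k=rolls):
            counts[side] += 1
        stat = sum((c - expected) ** 2 / expected for c in counts)
        if stat > CRITICAL_5DF:
            rejections += 1
    return rejections / trials

rng = random.Random(2009)  # fixed seed so the run is repeatable
fair = [1, 1, 1, 1, 1, 1]
loaded = [1, 1, 1, 1, 1, 1.3]  # hypothetical die: one side about 30% more likely

fair_rate = rejection_rate(fair, rolls=300, trials=1000, rng=rng)
loaded_300 = rejection_rate(loaded, rolls=300, trials=1000, rng=rng)
loaded_1000 = rejection_rate(loaded, rolls=1000, trials=1000, rng=rng)

print(f"fair die, 300 rolls:    flagged {fair_rate:.1%} of the time")
print(f"loaded die, 300 rolls:  flagged {loaded_300:.1%} of the time")
print(f"loaded die, 1000 rolls: flagged {loaded_1000:.1%} of the time")
```

The fair die should be flagged roughly 5% of the time, and the loaded die more often, with the rate climbing as the number of rolls goes up. That climb is the "power" from point 2.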

Back to Ellis' experiment for a moment: Ellis rolled each die 1000 times (a lot of work!), and should be able to detect a fairly small imbalance in a die. An educated guess: his test can detect an imbalance as small as 0.02 to 0.03 between any two sides of a die rolled on a table-top (Update: this should be accurate to +/- 0.023 with 95% confidence). Further, we think rolling the die in water amplifies any imbalance in a die, and so gives greater power to detect unfairness. It's possible that Ellis' water test is detecting imbalances that are so small (0.001-0.005?) that we might not care (i.e., very close to perfect is good enough).
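One way to arrive at a number like +/- 0.023 is the usual normal-approximation margin of error for a single proportion, 1.96 * sqrt(p(1-p)/n), with p = 1/6. I'm assuming that is the calculation behind the update above; the helper name `margin_of_error` is just for illustration.

```python
import math

def margin_of_error(p, rolls, z=1.96):
    """Approximate 95% margin of error for an observed proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / rolls)

# How precisely can one side's proportion be pinned down for a fair d6?
for rolls in (30, 300, 1000):
    print(rolls, round(margin_of_error(1 / 6, rolls), 3))
```

For 1000 rolls this comes out to about 0.023. With only 30 rolls the margin is more than five times wider, which is why a small imbalance takes so many rolls to see.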

I think there might be another post on this topic; it's now on my list of things to do. If you are curious now, try a search on "physics dice imbalance".

[UPDATE 1/5/2010]
As requested in the comments, I have updated the spreadsheet to handle results for other dice up to d20.