27 August 2009

Playing Fair with the Chi-Square Test of Homogeneity

Over at The Scrapyard Armory Ellis/Saxywolf has written a great post about testing to see if dice are fair. This is similar to the Water Test I wrote about back in September (my second post!), but quite frankly, Ellis has taken it a step further, done a lot of hard work, and done a far better job with it than I did. Well done Ellis!

Go read Ellis' Fair Dice post and the comments, then stop back, because I've got a bit more to add.

In response to that post I created an Excel spreadsheet to do a Chi-Squared test of Homogeneity. That's statisticalese for a test of equal proportions. In this case, it tests the null hypothesis that the probability of rolling each side of a die is equal to 1/6, versus the alternative hypothesis that those probabilities are not all equal to 1/6. More formally:

H0: p_i = 1/6 for all i = 1, 2, 3, 4, 5, 6
HA: p_i ≠ 1/6 for at least one i

Here is a screen-cap of the spreadsheet:

[Image: fair dice spreadsheet, Chi-square test of homogeneity]

To use the spreadsheet, roll your die a bunch of times and tally up the number of times each side is rolled. You will need a minimum of 30 rolls for the result to be valid (the usual rule of thumb is an expected count of at least 5 per side, hence 30 for a six-sided die), and unless the die you are rolling is obviously unbalanced, you will need several hundred rolls before you can reliably detect a small imbalance. Enter your counts in the appropriate cells in the spreadsheet, and then look up the p-value.
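For anyone who would rather not open Excel, here is a minimal sketch of the same kind of calculation in plain Python. It is not a copy of the spreadsheet: the function names and the example counts are made up for illustration, and the p-value comes from a textbook closed form for the chi-square upper tail with whole-number degrees of freedom.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    y = x / 2.0
    if df % 2 == 1:
        # Odd df: start from Q(1/2, y) = erfc(sqrt(y))
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        # Even df: start from Q(1, y) = exp(-y)
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def fair_die_test(counts):
    """Return (chi-square statistic, p-value) for H0: every side is equally likely."""
    n = sum(counts)
    k = len(counts)
    expected = n / k  # same expected count for every side under H0
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi_square_sf(stat, k - 1)

# Hypothetical tallies for sides 1..6 from 1000 rolls (not Ellis' data)
counts = [150, 170, 165, 180, 160, 175]
stat, p_value = fair_die_test(counts)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3f}")
```

Because the degrees of freedom are taken from the length of the list, the same function should work for a d4 or a d20 tally as well. If you have SciPy installed, `scipy.stats.chisquare(counts)` should give the same two numbers.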

A p-value is a standard way of interpreting the results of a statistical test (computers are good at calculating them, but in the old days we had to use published tables of numbers to interpret results). The p-value is the probability of getting the counts you entered in the spreadsheet, or any more extreme results, IF the assumption that the die is fair (probability of each side is 1/6) really is true.
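That definition can be acted out directly with a simulation, which may make it easier to see what the number means. The sketch below assumes a hypothetical tally of 120 rolls; it rolls an imaginary fair die the same number of times, over and over, and reports how often the fair die produces a result at least as extreme as the tally. The function names and the counts are mine, for illustration only.

```python
import random

def chi_square_stat(counts):
    """Chi-square statistic against equal expected counts."""
    n, k = sum(counts), len(counts)
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

def simulated_p_value(counts, n_sims=2000, seed=1):
    """Estimate the p-value by repeatedly rolling a simulated fair die."""
    rng = random.Random(seed)
    n, k = sum(counts), len(counts)
    observed = chi_square_stat(counts)
    hits = 0
    for _sim in range(n_sims):
        sim_counts = [0] * k
        for _roll in range(n):
            sim_counts[rng.randrange(k)] += 1  # one fair roll
        # "As extreme or more extreme" means a statistic at least as large
        if chi_square_stat(sim_counts) >= observed - 1e-9:
            hits += 1
    return hits / n_sims

# Hypothetical tallies for sides 1..6 from 120 rolls
counts = [12, 25, 18, 30, 15, 20]
print(f"simulated p-value = {simulated_p_value(counts):.3f}")
```

The answer is an estimate and will wobble a little with the seed and the number of simulations, but it should land close to what a chi-square table gives for the same counts.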

If the p-value is small, generally less than 0.05 (this cutoff is called the type I error rate), this indicates a result that is unlikely to occur with a fair die. This error rate is a choice about how the test will be interpreted:

1) If the die is fair the p-value is random and will be less than 0.05 about 5% of the time simply by random chance (error rate again).

2) If the die is unbalanced, and the assumption of a fair die is false, the p-value will be less than 0.05 MORE than 5% of the time. The more unfair the die, the more likely the p-value will be less than 0.05. Exactly how likely is a complex calculation, but the more times the die is rolled, the more likely you will correctly detect an unfair die (called statistical "power"), and the smaller the degree of "unfairness" you will be able to detect.

3) A word of caution: a smaller p-value does not necessarily indicate greater "unfairness", and you should not compare p-values between dice to determine which is more fair. This is because the p-values are partly random, so it is not meaningful to compare them that way. Instead, try looking at the ratio of proportions for the same side on two dice (a statistic closely related to the odds-ratio).
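Points 1 and 2 can be checked with a quick simulation. This is a rough sketch, not a power calculation: the "loaded" die with a 20% chance of one side is an invented example, and the cutoff 11.0705 is the standard chi-square table value for 5 degrees of freedom at the 0.05 level, so the code only applies to six-sided dice.

```python
import random

CRITICAL_5DF = 11.0705  # chi-square cutoff for p < 0.05 with 5 degrees of freedom

def chi_square_stat(counts):
    """Chi-square statistic against equal expected counts."""
    n, k = sum(counts), len(counts)
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

def rejection_rate(probs, n_rolls, n_experiments=500, seed=7):
    """Fraction of simulated experiments where the test flags the die (p < 0.05)."""
    rng = random.Random(seed)
    sides = range(len(probs))
    flagged = 0
    for _ in range(n_experiments):
        rolls = rng.choices(sides, weights=probs, k=n_rolls)
        counts = [0] * len(probs)
        for side in rolls:
            counts[side] += 1
        if chi_square_stat(counts) > CRITICAL_5DF:
            flagged += 1
    return flagged / n_experiments

fair = [1 / 6] * 6
loaded = [0.20, 0.16, 0.16, 0.16, 0.16, 0.16]  # hypothetical unbalanced die

for label, probs, n_rolls in [("fair, 200 rolls", fair, 200),
                              ("loaded, 200 rolls", loaded, 200),
                              ("loaded, 1000 rolls", loaded, 1000)]:
    print(f"{label}: flagged in {rejection_rate(probs, n_rolls):.0%} of experiments")
```

The fair die should be flagged in roughly 5% of experiments, and the loaded die should be flagged more often, with the rate climbing as the number of rolls goes up. That climbing rate is the "power" mentioned in point 2.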

Back to Ellis' experiment for a moment: Ellis rolled each die 1000 times (a lot of work!), and should be able to detect a fairly small imbalance in a die. An educated guess: his test can detect an imbalance as small as 0.02 to 0.03 between any two sides of a die rolled on a table-top (Update: this should be accurate to +/- 0.023 with 95% confidence). Further, we think rolling the die in water amplifies any imbalance in a die, and so gives greater power to detect unfairness. It's possible that Ellis' water test is detecting imbalances that are so small (0.001-0.005?) that we might not care (i.e., very close to perfect is good enough).
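The post does not show where the +/- 0.023 comes from, but one plausible reading is the usual 95% margin of error for a single proportion, 1.96 * sqrt(p(1-p)/n), with p = 1/6 and n = 1000. The sketch below works under that assumption; the function names are invented for illustration.

```python
import math

def margin_of_error(p, n_rolls, z=1.96):
    """Normal-approximation 95% margin of error for one side's observed proportion."""
    return z * math.sqrt(p * (1 - p) / n_rolls)

def rolls_needed(p, margin, z=1.96):
    """Rolls needed so the margin of error shrinks to the requested size."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(round(margin_of_error(1 / 6, 1000), 3))  # 0.023
print(rolls_needed(1 / 6, 0.005))              # rolls needed to pin a side down to +/- 0.005
```

Because the margin shrinks only with the square root of the number of rolls, resolving an imbalance in the 0.001-0.005 range on a table-top would take tens of thousands of rolls or more, which is part of why a test that amplifies the imbalance is attractive.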

I think there might be another post on this topic; it's now on my list of things to do. If you are curious now, try a search on "physics dice imbalance".

[UPDATE 1/5/2010]
As requested in the comments, I have updated the spreadsheet to handle results for other dice up to d20.



Dan Eastwood said...

Ellis (aka Saxywolf) from Scrapyard Armory sent me some questions directly because of difficulty with the comment system. Therefore I'm posting his questions in his place. --- Dan

Ellis Writes:

I may be a dice rollin' maniac, but I'm also a statisticalese newb. Well put.

re: "3)"
There was quite a bit of variation of the p-value for each individual die, though I guess each die would vary mostly within a certain range. I imagine if I were to graph a whole bunch of p-values for a die, they would make a bell-curve(ish) graph centered on its "average". One might consider a die to be unfair if that bell curve was centered too low (and how low would still be objective). I believe the greater the number of times a die is rolled to calculate an individual p-value, the skinnier the bell curve around its average p-value would be, and I imagine a fair die would still have a pretty wide bell curve, but center around a higher p-value.

Why compare the same side on two dice? Or could one of them be an 'ideal' die? Wouldn't you try looking at the ratio of proportions for the same die?

While I understand the idea of how accurate my results were because of the high number of rolls, and that the water test would amplify an imbalance and thus better detect an imbalance, I don't see where the numbers come from to describe how accurate the results are. I look forward to your post on how one comes up with imbalance detection accuracy.

Anonymous said...

How difficult would it be to modify the spreadsheet so there are tabs for dice with different numbers of sides (d3, d10, d20, etc.)?

Dan Eastwood said...

@Anon: Not too hard. I have updated the old spreadsheet so that it recognizes up to 20 "sides" for the test. LINKY Just leave the green shaded cells blank if you don't need them.

I'll put up a post about it too.

Anonymous said...

You have done a very good job with the spreadsheet.
There is a small improvement that is recommended.
That is applying Yates' continuity correction.
For how to do that, and why see:-

Applying it changes the outcome of your example.

Dan Eastwood said...

Hello Anon,
I disagree. The link you gave says this form of Yates' correction is for 2-by-2 tables. Williams' correction does not change the result, nor does a simple continuity correction (add 0.5 to each cell). Given the sample size of N=1000 and only small deviations from the expected value of 1/6th, the asymptotic distribution of the statistic certainly applies, and the Yates correction doesn't really seem necessary. I ran the exact test and got p=0.0483, only slightly larger than what the spreadsheet gives.

However, given that people may be using this without knowledge of when such corrections might be needed, it might be a good idea to add some form of correction anyway. I'll put this on my list of things to update.