February 13, 2014

# How stats fool juries

Prof Peter Donnelly’s TED talk. You might want to skip over the first few minutes of vaguely joke-like objects.

Consider the two (coin-tossing) patterns HTH and HTT. Which of the following is true?

1. The average number of tosses until HTH is larger than the average number of tosses until HTT.
2. The average number of tosses until HTH is the same as the average number of tosses until HTT.
3. The average number of tosses until HTH is smaller than the average number of tosses until HTT.

Before you answer, you should know that most people, even mathematicians, get this wrong.

Also, as Prof Donnelly doesn’t point out, if you have a programming language handy, you can find out the answer very easily.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

• As phrased, the question seems open to two interpretations, which give different answers. If the question is which sequence is likely to happen first, the answer is they are equally likely. If the question is, in a given number of tosses >= 5, which sequence should have more occurrences, the answer is HTH, for the reason given in the video. One way to intuit the latter result is that the smallest number of tosses in which it is possible to get two HTH sequences is 5 (HTHTH). The smallest number of tosses in which it is possible to get two HTT sequences is 6 (HTTHTT).

• Thomas Lumley

I think the slide is unambiguous, but there are many related questions about these subsequences.

Now, if Peter Donnelly says something is hard and everyone gets it wrong, I’m not going to assume I can work it out in my head, so I wrote code in R.

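```
> ## simulate 100,000 runs of 100 tosses each
> results = replicate(100000,{
+ 	lots.of.coins <- paste(sample(c("H","T"),100,replace=TRUE),collapse="")
+ 	## regexpr() gives the start of the first match; +3-1 gives the toss that completes it
+ 	c(end.first.htt=regexpr("HTT",lots.of.coins)+3-1,
+       end.first.hth=regexpr("HTH",lots.of.coins)+3-1)
+     })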
> ## average time until first appearance
> mean(results["end.first.htt",])
[1] 8.00915
> mean(results["end.first.hth",])
[1] 9.9893
>
> ## probability HTT appears first
> mean(results["end.first.htt",] < results["end.first.hth",])
[1] 0.4976
>
> ## which appears more often in a sequence of eight million tosses
> coins=sample(c("H","T"),8e6,replace=TRUE)
> is.hth = function(i){(coins[i-2]=="H") & (coins[i-1]=="T") & (coins[i]=="H")}
> is.htt = function(i){(coins[i-2]=="H") & (coins[i-1]=="T") & (coins[i]=="T")}
>
> sum(is.hth(3:8e6))
[1] 1000038
> sum(is.htt(3:8e6))
[1] 1000081
```

So, on the slide, A (the first option in the list above) is correct: the mean time until the first HTT is shorter than the mean time until the first HTH, and the values are about 8 and 10 tosses respectively.

You are correct that the probability of HTT happening before HTH is one half.

And you’re wrong about HTH happening more often in a long string. They both occur in 1/8th of the (overlapping) triplets, as the video says.
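As a cross-check on the simulated means of about 8 and 10, here is a sketch of the usual first-step calculation. The notation is introduced here for illustration and is not from the talk or the code above: $E_0$, $E_H$ and $E_{HT}$ are the expected numbers of further tosses when the useful tail of the sequence so far is empty, H, or HT, assuming a fair coin and independent tosses.

```latex
% HTT: after HT, a T finishes; an H leaves us in state H (still useful).
\[
\begin{aligned}
E_0    &= 1 + \tfrac12 E_H + \tfrac12 E_0, \\
E_H    &= 1 + \tfrac12 E_H + \tfrac12 E_{HT}, \\
E_{HT} &= 1 + \tfrac12 \cdot 0 + \tfrac12 E_H
\end{aligned}
\quad\Longrightarrow\quad
E_{HT} = 4,\; E_H = 6,\; E_0 = 8.
\]

% HTH: after HT, an H finishes; a T sends us back to the empty state.
\[
\begin{aligned}
E_0    &= 1 + \tfrac12 E_H + \tfrac12 E_0, \\
E_H    &= 1 + \tfrac12 E_H + \tfrac12 E_{HT}, \\
E_{HT} &= 1 + \tfrac12 \cdot 0 + \tfrac12 E_0
\end{aligned}
\quad\Longrightarrow\quad
E_{HT} = 6,\; E_H = 8,\; E_0 = 10.
\]

% Long-run counts: every overlapping triplet matches either pattern with probability 1/8.
\[
\mathbb{E}[\text{count in } 8 \times 10^6 \text{ tosses}]
  = \frac{8 \times 10^6 - 2}{8} = 999{,}999.75 .
\]
```

The two systems differ only in the last equation. A miss after HT costs less for HTT, because the H that spoils it is already the start of a new attempt, whereas the T that spoils HTH sends you back to the beginning. The expected count of about one million for either pattern is in line with the simulated 1000038 and 1000081.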