On Sunday I broke down some of the common misconceptions surrounding the Wells Report, including the social science involved, the statistical misinterpretations and the lack of coherence in the NFL’s story based on its own evidence. Then, on Wednesday, I provided a time-based visualization of all the measurements presented in the Report, plotted against where we’d expect them to be at a given time as the balls warmed up in the locker room. Visually, it’s fairly clear that the Colt balls and Patriot balls have similar issues, as many are “under-inflated” by similar degrees. But what does this mean in terms of probability if we actually run some statistical tests on the data?

To reiterate, time is a major variable in this case because the PSI of the balls was increasing with every minute that they were in the locker room at halftime. Thus, the time that each ball was measured becomes critical in trying to analyze the discrepancy between where a ball was measured and where a ball “should” be using Ideal Gas Law parameters. Below is one such scenario presented in the previous post. The blue line is where we’d expect a Colt ball to measure given the time indoors and the gold line where we’d expect a Patriot ball to be (based on Fig. 22 of Wells Report):

What we want to calculate is the difference between each ball at a given point in time (a circle) and where we’d expect the ball to be based on how long it’s been in the locker room (the solid line). For instance, in the above graph, Patriot ball #1 is about 0.25 PSI *above* where we’d expect. Ball #2 is about 0.4 PSI *below* where we’d expect. These values will differ depending on when the balls were measured, so our parameters for simulating the actual measuring circumstances are as follows (assuming the report is correct that the balls were recorded in order, and were set to 12.5 and 13.0 PSI respectively in the pre-game):

- Set Up time (2-4 minutes according to accounts)
- Measurement time (21.8 – 27.3 seconds per ball)
- Inflation time (2-5 minutes)
- Packing time (unstated, but assumed to require some small degree of time between last measurement and re-emergence from locker room)
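To make the timing concrete, here is a minimal Python sketch of how these parameters could be turned into a measurement timestamp for each ball, and then into a "difference from expected" for each ball. It is an illustration under stated assumptions, not the calculation behind the figures in this post: the names `Scenario`, `measurement_times` and `residuals` are hypothetical, each timestamp is taken at the start of a ball's measurement, the ball counts default to 11 Patriot and 4 Colt balls, and the expected-PSI curve is passed in by the caller rather than reproduced from the Wells Report.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Scenario:
    """One hypothetical halftime timeline (all names are illustrative)."""
    setup_min: float     # minutes before the first measurement (2-4 per accounts)
    per_ball_sec: float  # seconds spent measuring each ball (21.8-27.3)
    inflate_min: float   # minutes spent re-inflating the Patriot balls (2-5)


def measurement_times(s: Scenario, n_patriot: int = 11,
                      n_colt: int = 4) -> Tuple[List[float], List[float]]:
    """Return (patriot_times, colt_times) in minutes since the balls came indoors.

    Assumes the Patriot balls are measured first, then re-inflated,
    then the Colt balls are measured.
    """
    step = s.per_ball_sec / 60.0
    patriot = [s.setup_min + i * step for i in range(n_patriot)]
    # Colt measurements begin once the last Patriot ball is done and
    # the re-inflation period has passed.
    colt_start = patriot[-1] + step + s.inflate_min
    colt = [colt_start + i * step for i in range(n_colt)]
    return patriot, colt


def residuals(measured_psi: List[float], times_min: List[float],
              expected_psi: Callable[[float], float]) -> List[float]:
    """Measured PSI minus expected PSI at the moment each ball was measured."""
    return [m - expected_psi(t) for m, t in zip(measured_psi, times_min)]


if __name__ == "__main__":
    # Mid-range values taken from the parameter list above.
    patriot_t, colt_t = measurement_times(Scenario(3.0, 25.0, 3.5))
    print("first/last Patriot ball (min):", patriot_t[0], patriot_t[-1])
    print("first/last Colt ball (min):", colt_t[0], colt_t[-1])
```

Packing time is left out because it only affects when the balls leave the locker room, not when they are measured. Swapping the order of the Colt measurements and the re-inflation would only change how `colt_start` is computed.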

If we use Exponent’s Ideal Gas Law calculations, which assume 71 degrees pre-game (possibly slightly low, as noted in the last post), and add a small “wetness” factor per their report, we can then simulate a bunch of these scenarios to see what was likely and unlikely. The scenario above attempts to average all the accounts: set up time is “medium” (3 minutes, halfway between 2 and 4), measurement time is “medium” (~25 seconds) and inflation time is “medium” (3.5 minutes). But we can also examine other scenarios: instances where the Patriot balls were tested after 2 minutes or 4 minutes, quickly or slowly re-inflated, etc. If we do that, we’re left with a number of basic permutations we can study:

So what do these numbers mean? The “Patriot-Colt mean difference from expected” column calculates where each ball should be based on the time it’s measured, takes the average difference for the Patriot balls and compares it with the average difference for the Colt balls. If we take the mean of all six hypothetical scenarios, **the average Patriot ball is about 0.003 PSI below where it should be at the time of measurement relative to the Colt balls** (i.e. using the Colt balls as a control group). Strictly speaking, the p-value is the probability of seeing a gap at least this large if both sets of balls came from the same population, i.e. if nothing was done to one set that wasn’t done to the other. The smaller the p-value, the harder it is to explain the gap by chance alone.
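For readers who want to try this kind of comparison themselves, here is a minimal Python sketch using only the standard library. Several things in it are assumptions rather than the method used for the numbers in this post: the post says only “independent t-tests,” so Welch’s unequal-variance version is assumed; the function names `t_pdf`, `two_sided_p` and `welch_t_test` are hypothetical; the p-value is obtained by numerically integrating the t density with Simpson’s rule; and the residual lists in the usage example are made-up placeholders, not Wells Report measurements.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: probability of |T| >= |t| when there is no real difference.

    Integrates the density over [0, |t|] with Simpson's rule (steps must be even).
    """
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-sided p) for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


if __name__ == "__main__":
    # PLACEHOLDER residuals (measured minus expected PSI), invented for
    # illustration only. They are NOT the Wells Report data.
    patriot_resid = [0.2, -0.4, 0.1, -0.3, 0.0, -0.2, 0.3, -0.1, -0.5, 0.1, -0.2]
    colt_resid = [0.1, -0.2, 0.0, -0.1]
    t, df, p = welch_t_test(patriot_resid, colt_resid)
    print("mean difference:", mean(patriot_resid) - mean(colt_resid))
    print("t =", t, "df =", df, "p =", p)
```

With only four balls in the control group, any such test has little power, so the size of the mean difference deserves as much attention as the p-value itself.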

- The best-case statistical scenario for the Patriots is that Walt Anderson used the Logo Gauge pre-game, that the balls were measured at 2 minutes, each took about 22 seconds to measure and that the officials took 5 minutes to re-inflate the balls (labeled “Early Start, Fast Measure, Long Inflate” above). That produces a mean where the Patriots balls are *higher* than the Colts, meaning it’s impossible for the Patriots balls to come from a population that is inherently lower than the Colts balls.
- Three of the six scenarios in which the Logo Gauge was used pre-game completely exonerate the Patriots.
- The worst-case scenario for the Patriots is that Walt Anderson used the Non-Logo Gauge in the pre-game, and that the balls were measured at 4 minutes, each took 27 seconds to measure and that the officials took 2 minutes to re-inflate the balls (labeled “Late Start, Slow Measure, Quick Inflate” above). That produces a p-value of 0.247: if our assumptions are true and the two sets of balls behaved the same way, a gap at least this large would still show up about 24.7% of the time. Read loosely as 1 − p, that is a 75.3% chance the Patriots balls come from a different population.

Although it’s far below “statistical significance,” 75.3% might sound like a lot. But what is that number actually saying? For that, we have to look at the observed difference in the averages to put this into perspective: there’s a 75% chance that the **0.3 PSI difference** is not simply from variance and is part of a different population (i.e. tampered balls).

Depending on the distribution, 0.3 PSI could easily be 99.99% likely to come from a different sample…which would suggest, what? There’s a 99.99% chance that the Patriots systematically released an *average* of 0.3 PSI per football? And that’s the worst-case scenario? That strains common sense.

In total, the results of the independent t-tests show that it was incredibly unlikely that the Patriot balls behaved any differently from the Colt balls *using the assumptions presented in the Wells Report*. Additionally, in the unlikely event that they are different (roughly a 15% chance if Anderson used the Logo Gauge in the pre-game and 57% if he used the Non-Logo Gauge), **the degree to which the balls are different is nonsensically small**. We would expect even a “small” degree of deflation to be something like 1.0-2.0 PSI; the initial reports were “11 more than 2 PSI below regulation,” with another ball falsely labeled by the NFL itself at 10.1. But the data tell a completely different story: the Patriot balls are sometimes higher than the Colt balls relative to what we’d expect, and the worst-case scenario for New England suggests a “non-significant” likelihood of tampering, to a degree so small it’s equivalent to the variance seen between the two gauges used to measure the balls.

*The next post details exactly how Exponent used faulty methods to reach faulty conclusions in the Wells Report.*

*PS: What happens if the Colts balls were measured immediately after the Patriots balls and then the Patriot balls were re-inflated? This still produces results that strongly suggest non-tampering, as shown below. If the Colt 4-ball group is indeed treated as a control group, we would expect to see the Patriot measurements 13.6% of the time in their “worst-case scenario,” which is not an occurrence considered “significant” in the scientific community. If Anderson used the Logo Gauge in the pre-game, Patriot measurements are roughly 0.1 to 0.2 PSI below the Colt measurements and far from statistically significant.*

*June 19, 2015 Update: Joe Arthur has asked me to run the calculation using a flatter transient curve. In other words, what happens if we expect the footballs to heat up more slowly than projected in Fig. 22 of the Wells Report? For a “slower” expected recalibration, we can use Fig. 24 of the Wells Report. If any physics experts out there can tell us at what exact rate the balls are expected to recalibrate, it would be much appreciated. In the meantime, the aforementioned numbers are derived with a “fast” recalibration curve, and the numbers below with a “slow” recalibration curve. Based on other amateur experiments I’ve found, it seems likely recalibration takes place somewhere between these two extremes. Below are the results of a “slower” expected recalibration. Worth noting is that on the Logo Gauge, the Colts balls are almost exactly where we’d expect.*
