On Sunday I broke down some of the common misconceptions surrounding the Wells Report, including the social science involved, the statistical misinterpretations and the lack of coherence in the NFL’s story based on its own evidence. Then, on Wednesday, I provided a time-based visualization of all the measurements presented in the Report based on where we’d expect them to be at a given time as the balls warmed up in the locker room. Visually, it’s fairly clear that the Colts balls and Patriot balls have similar issues, as many are “under-inflated” by similar degrees. But what does this mean in terms of probability if we actually run some statistical tests on the data?

To reiterate, time is a major variable in this case because the PSI of the balls was increasing with every minute that they were in the locker room at halftime. Thus, the time that each ball was measured becomes critical in trying to analyze the discrepancy between where a ball was measured and where a ball “should” be using Ideal Gas Law parameters. Below is one such scenario presented in the previous post, in which Patriot balls were measured after 3 minutes in the locker room, measuring a ball took 25 seconds, and it took 3.5 minutes to re-inflate the balls. The blue line is where we’d expect a Colt ball to measure given the time indoors and the gold line where we’d expect a Patriot ball to be:
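
The arithmetic behind those curves can be sketched in a few lines of Python. The field and room temperatures, the atmospheric pressure, and the exponential warming time constant below are illustrative assumptions (and the small "wetness" adjustment is ignored); this is not Exponent's exact transient model, just the Ideal Gas Law mechanics:

```python
import math

ATM_PSI = 14.7     # assumed atmospheric pressure (gauge -> absolute offset)
F_TO_R = 459.67    # Fahrenheit -> Rankine (absolute temperature) offset

def ideal_gas_psi(start_psi, start_temp_f, temp_f):
    """Gauge PSI after a temperature change at constant volume (Ideal Gas Law)."""
    t1 = start_temp_f + F_TO_R
    t2 = temp_f + F_TO_R
    return (start_psi + ATM_PSI) * t2 / t1 - ATM_PSI

def ball_temp_f(minutes_indoors, field_f=48.0, room_f=71.0, tau=15.0):
    """Assumed exponential warming from field temp back toward room temp."""
    return room_f - (room_f - field_f) * math.exp(-minutes_indoors / tau)

def expected_psi(start_psi, minutes_indoors, pregame_f=71.0):
    """Where a ball 'should' measure after a given time in the locker room."""
    return ideal_gas_psi(start_psi, pregame_f, ball_temp_f(minutes_indoors))

# A 12.5 PSI ball fresh off a ~48F field reads well under 12.5...
cold = expected_psi(12.5, 0.0)
# ...and climbs back toward 12.5 as it warms indoors.
warm = expected_psi(12.5, 13.0)
```

A Colt ball set to 13.0 pre-game follows the same curve, just shifted 0.5 PSI higher, which is why the time each ball was measured matters so much.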

So our parameters for simulating the actual measuring circumstances (assuming the balls were indeed recorded in the correct order and, as Exponent believes, were set to 12.5 and 13.0 PSI respectively in the pre-game) are:

- Setup time (2-4 minutes according to accounts)
- Measurement time (21.8 – 27.3 seconds per ball)
- Inflation time (2-5 minutes)
- Packing time (unstated, but assumed to require some small degree of time between last measurement and re-emergence from locker room)
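
These parameters translate directly into a timeline of when each ball was measured. A sketch under one set of assumed values from the ranges above (11 Patriot balls measured first, then the 4 Colt balls after re-inflation and packing):

```python
def measurement_times(setup_min=3.0, per_ball_sec=25.0,
                      inflate_min=3.5, pack_min=0.5,
                      n_patriot=11, n_colt=4):
    """Return (patriot_times, colt_times) in minutes spent indoors."""
    per_ball = per_ball_sec / 60.0
    # Patriot balls measured one after another, starting after setup
    patriot = [setup_min + i * per_ball for i in range(n_patriot)]
    # Colt balls measured only after the Patriot balls were re-inflated/packed
    start_colt = patriot[-1] + per_ball + inflate_min + pack_min
    colt = [start_colt + i * per_ball for i in range(n_colt)]
    return patriot, colt

pats, colts = measurement_times()
```

Under these assumptions the Colt balls aren't measured until roughly 11-13 minutes in, which is why they sit much closer to their expected warm-ball values than the Patriot balls measured in minutes 3-7.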

If we use Exponent’s Ideal Gas Law calculations assuming 71 degrees pre-game — which may be slightly low, as noted in the last post — and add a small “wetness” factor per their report, we can then simulate a bunch of these scenarios to see what was likely and unlikely. The scenario above attempts to average all the accounts. But we can also examine other scenarios — instances where the Patriots balls were tested after 2 minutes or 4 minutes, quickly or slowly re-inflated, etc. If we do that, we’re left with a number of basic permutations we can study:
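
Those permutations can be enumerated mechanically. The labels and endpoint values below are illustrative (taken from the ranges listed earlier), and the full grid has eight cells rather than the six examined in the table:

```python
from itertools import product

setups = {"Early Start": 2.0, "Late Start": 4.0}         # minutes before first measurement
measures = {"Fast Measure": 21.8, "Slow Measure": 27.3}  # seconds per ball
inflates = {"Quick Inflate": 2.0, "Long Inflate": 5.0}   # minutes to re-inflate

# Each scenario: (label, setup_min, per_ball_sec, inflate_min)
scenarios = [
    (", ".join([s, m, i]), setups[s], measures[m], inflates[i])
    for s, m, i in product(setups, measures, inflates)
]
```

Each scenario then feeds the same Ideal Gas Law expectation model, producing one hypothetical "expected PSI" per ball per scenario.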

So what do these numbers mean? The “Patriot-Colt mean difference from expected” column calculates where each ball should be based on the time it’s measured, then subtracts the average deviation of the Colt balls from the average deviation of the Patriot balls. If we take the mean of all six hypothetical scenarios, **the average Patriot ball is about 0.02 PSI below where it should be at the time of measurement relative to the Colts balls** (i.e. using the Colts balls as a control group). The p-value is the probability of seeing a difference at least this large by chance alone if both sets of balls came from the same population: the smaller it is, the stronger the evidence that one set of balls had something done to them that the other set didn’t.
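
The test itself is an independent (Welch's) t-test on each ball's deviation from its Ideal Gas Law expectation. A self-contained sketch, with deviation numbers invented purely for illustration (the real inputs come from the simulated scenarios), and with the p-value step omitted since Python's standard library lacks a t-distribution CDF:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb   # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical per-ball deviations (measured PSI minus expected PSI):
patriot_dev = [-0.4, -0.1, -0.3, 0.0, -0.2, -0.5, -0.1, -0.2, -0.3, 0.1, -0.2]
colt_dev = [-0.1, -0.3, 0.0, -0.2]

t_stat, dof = welch_t(patriot_dev, colt_dev)
# The p-value is the two-tailed area beyond |t_stat| under a
# t-distribution with dof degrees of freedom (e.g. scipy.stats.t.sf).
```

With samples this small and deviations this similar, the t statistic stays close to zero, which is exactly the pattern the six scenarios above show.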

- The best-case statistical scenario for the Patriots is that Walt Anderson used the Logo Gauge pre-game, that the balls were measured at 2 minutes, each took about 22 seconds to measure and that the officials took 5 minutes to re-inflate the balls (labeled “Early Start, Fast Measure, Long Inflate” above). That produces a mean where the Patriots balls are *higher* than the Colts balls, meaning it’s impossible for the Patriots balls to come from a population that is inherently lower than the Colts balls.
- Three of the six scenarios in which the Logo Gauge was used pre-game completely exonerate the Patriots.
- The worst-case scenario for the Patriots is that Walt Anderson used the Non-Logo Gauge in the pre-game, that the balls were measured at 4 minutes, each took about 27 seconds to measure and that the officials took 2 minutes to re-inflate the balls (labeled “Late Start, Slow Measure, Quick Inflate” above). That produces a p-value of 0.247: if our assumptions are true and nothing was done to either set of balls, a difference this large would still arise by chance about 25% of the time. Reading the complement loosely, that is at most a 75.3% chance that the Patriots balls come from a different population.

Although a p-value of 0.247 falls far short of “statistical significance,” 75.3% might sound like a lot. But what is that number actually saying? For that, we have to look at the observed difference in the averages to put it into perspective: at most, there’s a 75% chance that the **0.3 PSI difference** is not simply variance, i.e. that it reflects a different population (tampered balls).

Depending on the distribution, a 0.3 PSI gap could easily be 99.99% likely to come from a different population…which would suggest, what? That there’s a 99.99% chance the Patriots systematically released an *average* of 0.3 PSI per football? And that’s the worst-case scenario? That strains common sense.
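
That point — "statistically significant" and "large" are not the same thing — is easy to demonstrate. The numbers below are invented: two sets of readings with very tight variance, separated by exactly the same 0.3 PSI, produce an enormous t statistic even though 0.3 PSI remains a trivially small effect:

```python
import math
from statistics import mean, stdev

# Hypothetical, artificially tight readings 0.3 PSI apart
tight_a = [12.20, 12.21, 12.19, 12.20, 12.22, 12.18]
tight_b = [12.50, 12.51, 12.49, 12.50, 12.48, 12.52]

gap = mean(tight_b) - mean(tight_a)   # the same 0.3 PSI difference
n = len(tight_a)

# Rough two-sample t statistic (pooled SD, equal sample sizes)
pooled = math.sqrt((stdev(tight_a) ** 2 + stdev(tight_b) ** 2) / 2)
t = gap / (pooled * math.sqrt(2 / n))
# t is huge, so the gap is overwhelmingly "significant" --
# yet it is still only 0.3 PSI, far from the 1-2 PSI deflation alleged.
```

Significance measures how confidently a difference exists, not whether that difference is big enough to mean anything.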

In total, the independent t-test results show that it was incredibly unlikely that the Patriots balls behaved any differently from the Colts balls *using the assumptions presented in the Wells Report*. Additionally, in the unlikely event that they are different — reading the p-value complements loosely, roughly a 15% chance if Anderson used the Logo Gauge in the pre-game and 57% if he used the Non-Logo gauge — **the degree to which the balls are different is nonsensically small**. We would expect a “small” degree of deflation to be something like 1.0-2.0 PSI; the initial reports claimed 11 balls measured more than 2 PSI below regulation, with another ball falsely labeled by the NFL itself at 10.1 PSI. But the data tells a completely different story: the Patriots balls are sometimes higher than the Colts balls relative to where we’d expect them to be, and the worst-case scenario for New England suggests a “non-significant” likelihood of tampering to a degree so small it’s equivalent to the variance between the two gauges used to measure the balls.

*PS: If anyone in the statistics community would like the data used here to perform further modeling, please comment below and I’ll provide it.*