The Statistical Improbability of Deflate Gate

On Sunday I broke down some of the common misconceptions surrounding the Wells Report, including the social science involved, the statistical misinterpretations and the lack of coherence in the NFL’s story based on its own evidence. Then, on Wednesday, I provided a time-based visualization of all the measurements presented in the Report based on where we’d expect them to be at a given time as the balls warmed up in the locker room. Visually, it’s fairly clear that the Colt balls and Patriot balls have similar issues, as many are “under-inflated” by similar degrees. But what does this mean in terms of probability if we actually run some statistical tests on the data?

To reiterate, time is a major variable in this case because the PSI of the balls was increasing with every minute that they were in the locker room at halftime. Thus, the time that each ball was measured becomes critical in trying to analyze the discrepancy between where a ball was measured and where a ball “should” be using Ideal Gas Law parameters. Below is one such scenario presented in the previous post. The blue line is where we’d expect a Colt ball to measure given the time indoors and the gold line where we’d expect a Patriot ball to be (based on Fig. 22 of Wells Report):

Deflate Gate Logo Scenario

What we want to calculate is the difference between each ball at a given point in time (a circle) and where we’d expect the ball to be based on how long it’s been in the locker room (the solid line). For instance, in the above graph, Patriot ball #1 is about 0.25 PSI above where we’d expect. Ball #2 is 0.4 below where we’d expect. These values will be different depending on when the balls were measured, so our parameters for simulating the actual measuring circumstances are as follows (assuming the report is correct that the balls were recorded in order and were set to 12.5 and 13.0 PSI respectively in the pre-game):

  • Set Up time (2-4 minutes according to accounts)
  • Measurement time (21.8 – 27.3 seconds per ball)
  • Inflation time (2-5 minutes)
  • Packing time (unstated, but assumed to require some small degree of time between last measurement and re-emergence from locker room)
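
As a rough sketch of how these parameters turn into a timeline, the snippet below computes how long each ball had been indoors when it was gauged. The function name, the ball counts (11 Patriot balls and 4 Colt balls, as gauged at halftime in the Wells Report) and the assumed order (Patriot balls gauged, then re-inflated, then Colt balls gauged) are illustrative assumptions, not the exact procedure behind the charts in this post.

```python
def measurement_times(setup_min, sec_per_ball, inflate_min,
                      n_patriots=11, n_colts=4):
    """Return (patriots, colts): minutes indoors when each ball was gauged.

    Assumed order: Patriot balls gauged one after another, then the
    Patriot balls are re-inflated, then the Colt balls are gauged.
    """
    step = sec_per_ball / 60.0  # minutes spent gauging one ball
    patriots = [setup_min + i * step for i in range(n_patriots)]
    # The Colt balls start once the last Patriot ball is done and the
    # re-inflation period has passed.
    colts_start = patriots[-1] + step + inflate_min
    colts = [colts_start + i * step for i in range(n_colts)]
    return patriots, colts


# "Medium" scenario: 3 minute set-up, ~25 seconds per ball, 3.5 minute inflation.
pats, colts = measurement_times(setup_min=3.0, sec_per_ball=25.0, inflate_min=3.5)
for number, minutes in enumerate(pats, start=1):
    print(f"Patriot ball #{number}: {minutes:.2f} min indoors")
for number, minutes in enumerate(colts, start=1):
    print(f"Colt ball #{number}: {minutes:.2f} min indoors")
```

Changing the three arguments reproduces the other permutations (early or late start, fast or slow measuring, quick or long inflation).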

If we use Exponent’s Ideal Gas Law calculations that assume 71 degrees pre-game (which may be slightly low, as noted in the last post) and add a small “wetness” factor per their report, we can then simulate a bunch of these scenarios to see what was likely and unlikely. The scenario above attempts to average all the accounts: set-up time is “medium” (3 minutes, halfway between 2 and 4), measurement time is “medium” (~25 seconds) and inflation time is “medium” (3.5 minutes). But we can also examine other scenarios, such as instances where the Patriot balls were tested after 2 minutes or 4 minutes, quickly or slowly re-inflated, etc. If we do that, we’re left with a number of basic permutations we can study:

Deflate Gate p-value

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 22 of the Wells Report.
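
For readers who want to check the Ideal Gas Law endpoints themselves, here is a minimal sketch using the parameters above. It assumes a dry ball at constant volume and leaves out the “wetness” factor; the function names are mine, not Exponent’s.

```python
ATM_PSI = 14.636  # atmospheric pressure from the experiment parameters


def fahrenheit_to_rankine(temp_f):
    """Convert Fahrenheit to an absolute temperature scale (Rankine)."""
    return temp_f + 459.67


def gauge_pressure_at(start_gauge_psi, start_temp_f, end_temp_f, atm_psi=ATM_PSI):
    """Gauge pressure after a temperature change at constant volume."""
    # The gas law works on absolute pressure, so add the atmosphere first.
    start_abs = start_gauge_psi + atm_psi
    ratio = fahrenheit_to_rankine(end_temp_f) / fahrenheit_to_rankine(start_temp_f)
    return start_abs * ratio - atm_psi


patriots_cold = gauge_pressure_at(12.5, 71, 48)  # Patriot ball set to 12.5 PSI
colts_cold = gauge_pressure_at(13.0, 71, 48)     # Colt ball set to 13.0 PSI
print(f"Patriot ball at 48 degrees: {patriots_cold:.2f} PSI")
print(f"Colt ball at 48 degrees: {colts_cold:.2f} PSI")
```

Under these assumptions a dry ball loses a little under 1.2 PSI between 71 and 48 degrees, before any warming in the locker room is taken into account.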

So what do these numbers mean? The “Patriot-Colt mean difference from expected” column calculates where each ball should be based on the time it’s measured, takes the average of all such Patriot balls and subtracts it from the average of all such Colt balls. If we take the mean of all six hypothetical scenarios, the average Patriot ball is about 0.003 PSI below where it should be at the time of measurement relative to the Colt balls (i.e. using the Colt balls as a control group). Loosely, the p-value tells us how likely it is that the balls come from different populations, i.e. that one set of balls had something done to them that the other set didn’t. Strictly, it is the probability of seeing a difference at least this large if both sets of balls came from the same population, and the percentages quoted below (one minus the p-value) are a lay shorthand for that.
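
To make the idea concrete, here is a sketch of one way to get such a p-value using only the Python standard library. It swaps in an exact permutation test for the independent t-tests used in this post, and the residuals in the usage example are made-up placeholders, not the Wells Report measurements. The p-value is simply the fraction of all possible relabelings of the balls that produce a gap between the group means at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    splits = 0
    # Try every way of labeling n_a of the pooled balls as "group A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        gap = abs(sum_a / n_a - (total - sum_a) / n_b)
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical residuals (measured minus expected PSI), for illustration only.
patriot_residuals = [0.25, -0.40, -0.10, 0.05, -0.30, 0.15,
                     -0.20, 0.10, -0.35, 0.00, -0.15]
colt_residuals = [0.10, -0.05, 0.20, -0.25]
print(permutation_p_value(patriot_residuals, colt_residuals))
```

With 11 and 4 balls there are only 1,365 relabelings, so the exact test runs instantly. A t-test would give a somewhat different number because it assumes roughly normal data.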

  • The best-case statistical scenario for the Patriots is that Walt Anderson used the Logo Gauge pre-game, that the balls were measured at 2 minutes, each took about 22 seconds to measure and that the officials took 5 minutes to re-inflate the balls (labeled “Early Start, Fast Measure, Long Inflate” above). That produces a mean where the Patriots balls are higher than the Colts, meaning it’s impossible for the Patriots balls to come from a population that is inherently lower than the Colts balls.
  • Three of the six scenarios in which the Logo Gauge was used pre-game completely exonerate the Patriots
  • The worst-case scenario for the Patriots is that Walt Anderson used the Non-Logo Gauge in the pre-game, and that the balls were measured at 4 minutes, each took 27 seconds to measure and that the officials took 2 minutes to re-inflate the balls (labeled “Late Start, Slow Measure, Quick Inflate” above). That produces a p-value of 0.247, which means that if our assumptions are true, there is a 75.3% chance the Patriots balls come from a different population.

Although it’s far below “statistical significance,” 75.3% might sound like a lot. But what is that number actually saying? For that, we have to look at the observed difference in the averages to put this into perspective: there’s a 75% chance that the 0.3 PSI difference is not simply from variance and is part of a different population (i.e. tampered balls).

Depending on the distribution, 0.3 PSI could easily be 99.99% likely to come from a different sample…which would suggest, what? There’s a 99.99% chance that the Patriots systematically released an average of 0.3 PSI per football? And that’s the worst-case scenario? That strains common sense.

In total, the results (from independent t-tests) show that it was incredibly unlikely that the Patriot balls behaved any differently from the Colt balls using the assumptions presented in the Wells Report. Additionally, in the unlikely event that they are different (roughly a 15% chance if Anderson used the Logo Gauge in the pre-game and 57% if he used the Non-Logo Gauge), the degree to which the balls are different is nonsensically small. We would expect a “small” degree of deflation to be something like 1.0-2.0 PSI; the initial reports were “11 more than 2 PSI below regulation,” with another ball falsely labeled by the NFL itself at 10.1. But the data present a completely different story: the Patriot balls are sometimes higher than the Colt balls relative to what we’d expect, and the worst-case scenario for New England suggests a “non-significant” likelihood of tampering to a degree that is so small it’s equivalent to the variance seen between the two gauges used to measure the balls.

The next post details exactly how Exponent used faulty methods to reach faulty conclusions in the Wells Report. 

PS What happens if the Colts balls were measured immediately after the Patriots balls and then the Patriot balls were re-inflated? This still produces results that strongly suggest non-tampering, as shown below. If the Colt 4-ball group is indeed treated as a control group, we would expect to see the Patriot measurements 13.6% of the time in their “worst-case scenario” — not an occurrence considered “significant” in the scientific community. If Anderson used the Logo Gauge in the pre-game, Patriot measurements are roughly 0.1 to 0.2 PSI below the Colt measurements and far from statistically significant. 

Deflate Gate Fast Recal Patriots Inflate Last

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 22 of the Wells Report.

June 19, 2015 Update: Joe Arthur has asked me to run the calculation using a flatter transient curve. In other words, what happens if we expect the footballs to heat up more slowly than projected in Fig. 22 of the Wells Report? For a “slower” expected recalibration, we can use Fig. 24 of the Wells Report. If any physics experts out there can tell us at what exact rate the balls are expected to recalibrate, it would be much appreciated. In the meantime, the aforementioned numbers are derived with a “fast” recalibration curve, and the ones below with a “slow” recalibration curve. Based on other amateur experiments I’ve found, it seems likely recalibration takes place somewhere between these two extremes. Below are the results of a “slower” expected recalibration. Worth noting is that on the Logo Gauge, the Colts balls are almost exactly where we’d expect.

Slow Recalibration Model Deflate Gate

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 24 of the Wells Report.
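
To illustrate what “fast” versus “slow” recalibration means, here is a sketch that models the warm-up as a simple exponential approach (Newton’s law of heating). This is a stand-in, not the curve from Fig. 22 or Fig. 24, and the two time constants are hypothetical placeholders rather than values read off those figures. The endpoints are the dry-ball Ideal Gas Law values for a Patriot ball at 48 and 71 degrees.

```python
import math


def expected_gauge_psi(minutes_indoors, cold_psi, warm_psi, tau_min):
    """Expected pressure after warming indoors, with time constant tau_min."""
    # Fraction of the cold-to-warm gap that has been recovered so far.
    recovered = 1.0 - math.exp(-minutes_indoors / tau_min)
    return cold_psi + (warm_psi - cold_psi) * recovered


COLD_PSI = 11.32  # dry Patriot ball at 48 degrees (Ideal Gas Law)
WARM_PSI = 12.5   # same ball back at 71 degrees

# Hypothetical time constants in minutes, chosen only to show the contrast.
for label, tau in (("fast", 10.0), ("slow", 20.0)):
    for minutes in (3.0, 7.0, 12.0):
        psi = expected_gauge_psi(minutes, COLD_PSI, WARM_PSI, tau)
        print(f"{label} curve, {minutes:.0f} min indoors: {psi:.2f} PSI")
```

The slower the curve, the lower the expected reading at any given minute, which is why the choice of curve shifts every ball’s difference from expected.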

19 thoughts on “The Statistical Improbability of Deflate Gate”

  1. Would be great if you could also do analysis involving temperature as each 1 degree of temperature change results in a 0.05 psi change.

    Major flaws in the Exponent Report include:

    (1) Patriots said they set the footballs at 12.6 psi and then gave to refs. Walt Anderson said that all but 4 of the Patriots balls measured at 12.5 psi. Two measured at 12.6 psi and 2 measured less than 12.5 psi which were inflated to over 12.5 psi by another ref and Walt Anderson then deflated to be 12.5 psi.

    It is fair to assume that there was a 2 degree temperature difference between the Patriots room and refs room where footballs measured as this would be a 0.1 psi difference which is what was observed.

    Exponent did its game simulation tests at 69 for the refs room. That would mean that the Patriots room had to be 71 or the low end of the HVAC range. That would be extremely unlikely. For one, all the balls were first gloved before setting the psi. That would have been at least 24 footballs. Exponent’s own tests indicated that gloving raised the pressure by 0.7 psi which means that the footballs had their temperature raised by 14 degrees. The increase was quick to drop. Thus, this extra heat plus the body heat expended during gloving plus the sheer number of people in the Patriots locker room area had to increase the ambient room temperature above 71 and most likely above the HVAC target of 72.5. Most likely the room where the Patriots footballs were originally set was 73 or 74. If HVAC records exist, this can easily be determined. If not, one can easily estimate the amount of heat put into the room and probable temperature rise if one knows the size of the room.

    Ergo, at a minimum, the refs room for measuring the footballs had to be 71.

    (2) Given the above, Exponent’s game day simulation should have started at 71 at a minimum (and not the 69 they used), and easily could have been at 72 given the number of people in the overall room and that a room’s temperature would quickly tend to equate especially given the number of people moving around the room. This causes a 0.1 psi to 0.15 psi error in Exponent’s game day simulation numbers due to using the wrong starting temperature.

    (3) The wet football values may also be wrong because of the wrong starting temperature. This would have to be tested to see if this was true.

    (4) The reasoning used by Exponent on page 44 of its report to say that Walt Anderson used the Non-Logo gauge is flawed for the following reason. As noted on pages 49-50 of the Wells Report, the Patriots footballs were gloved and then the pressure was set at 12.6 psi. By Exponent’s own test data, the psi could increase as much as 0.7 psi due to gloving. Since the Patriots set the football psi to 12.6 psi at sometime shortly after gloving the football, the psi would have to have dropped below 12.6 psi when the footballs were given to the refs. Based upon Exponent’s own gloving data, it is highly likely that most of the Patriots footballs had a pressure of 12.1 psi to 12.3 psi when given to the refs.

    Thus, the gloving data done by Exponent plus the Patriots stated gloving-pressure set procedure plus the Logo gauge reading 0.3 psi to 0.4 psi higher plus the remembered psi values for the Patriots virtually guarantees that Walt Anderson had to have used the Logo gauge. It would have been virtually impossible to get mostly readings of 12.5 psi using the Non-Logo gauge due to the gloving effect on the psi of the footballs.

    (5) Exponent ignored the Patriots’ stated procedure of first gloving the footballs and then setting them to 12.6 psi when it dismissed the possibility that the pressure increase from gloving had an effect on what the refs measured.

    They were WRONG in their conclusion. Their own data and the Patriots stated pressure setting procedure indicates that the Patriots footballs had to have their pressure affected by the gloving.

  2. I’ve read all your posts relating to the Wells Report. Great work and easy to understand for a non-expert like me. Have you analyzed the post game measurements, and do they shed further light on the subject? My understanding is all post game readings were done later, when all 8 balls had normalized to inside temps, and colts balls were all under 13 psi with both gauges.

    • Hi Dave — the difficulty analyzing the post-game measurements is that we can’t interpret the data without knowing what time they were measured.

      It is disturbing as a scientist to see that the Wells Report did not contain this information, as the cameras or witness reports could shed some light on whether the balls were measured in a transitional period while re-acclimating to the indoor environment (i.e. right after they were taken off the field), or some time later where we would know the ball’s pressure would be based on an indoor temperature of, say, 70 degrees. I can’t emphasize enough how shoddy their analytical methods are.

      Without knowing the time the balls were measured, the only analysis one could start with is to assume that they were measured much later after the game after acclimating to the indoor temperature, and make some calculations based on the Colts balls only (sample size of 4). Why only the Colts balls? Because the Patriot balls were tampered with by the refs at halftime, and we have no idea to what degree they were tampered with when they re-inflated the balls.

      If that were the case, and we added the Colts balls to the halftime measurement sample, it would be EVEN LESS likely that there was tampering. In the worst-case scenario with the non-logo gauge, the p-value would decrease to 0.218 if we treat them as 4 separate balls (i.e. Colts have a sample of 8 balls measured). If we treat them as just another 4 random balls, the p-value drops to something like 0.7 because they are again almost identical to the Patriots’ drop in PSI.

      I’m not going to make a fresh post just based on speculation, but if the Colts balls were indeed measured later, such measurements would actually increase our (already high) confidence in the findings that there was no tampering.

  3. Actually, the biggest issue that I have is that Exponent is being allowed to lie, and apparently getting away with it, while it results in the defaming of 3 people.

    A keystone question is why did Exponent not follow the Patriots gloving procedure but use a completely opposite procedure to do its testing?

    (1) Per the Wells Report the Patriots Gloving Procedure (pages 49-50) was:

    – Glove footballs
    – Set pressure

    (2) Per the Exponent Report in Appendix 1 of the Wells Report (pages 33-35), the Exponent Gloving Procedure was:

    – Set pressure
    – Glove footballs

    (3) Exponent’s football gloving test was the opposite of what the Patriots had done.

    (4) A simple, inexpensive test by an independent testing organization can prove that Exponent was wrong in its conclusion that gloving did not have long term pressure effects on the Patriots footballs when using the actual Patriots gloving procedure.

    (5) The gloving pressure effect is a keystone conclusion. Without Exponent switching the Patriot’s gloving procedure, their data analysis, their game simulation, and their conclusions would fall apart. And the Wells Report would similarly fall apart.

    (6) Since it would be proven that the Patriots did nothing wrong, would the NFL punishments for the Patriots, Tom Brady, Jim McNally, and John Jastremski have to be rescinded?

    Did Paul Weiss, et al (i.e., “Paul Weiss”) purposely withhold the Patriots gloving procedure from Exponent?

    If yes, why?

    If yes, no matter what the reason, then Paul Weiss is guilty of aiding the creation of a lie.

    • I’ve seen some of your other comments – excellent insight and analysis. I agree with your other post that the pats ball prep process named in the report proves that the logo gauge was used pregame to measure pats balls.

      Here’s the info from the Wells Report (which you pointed to in an earlier post) that I agree shows the logo gauge was most likely used for pats pre-game:

      To summarize pages 49 and 50 of the report – For that day, Brady had Jastremski prep balls from scratch because it was raining and so he wanted them different than the ones already prepped (so they’d be less slippery). So Jastremski and his staff got brand new balls and prepped them all that day – 1st gloving a ball for 7-15 minutes, and then setting it to 12.6 psi. According to the Exponent chart on page 34, a ball gloved for 7 minutes increases in psi by 0.5. 15 minutes increases psi by 0.6. So pats would glove a ball (raising its psi by .5 to .6) and then set it at 12.6. Also according to the same Exponent chart, the psi will drop by the same amount within 30 minutes after gloving stops. So by the time the balls were delivered to Anderson (1+ hours later), each ball’s psi would be closer to 12 or 12.1.

      Anderson says all the pats balls read 12.5, when they should have read 12.0 or 12.1. We know the logo gauge read .4 higher than other gauges, so if the pats balls read 12.5, he was using the logo gauge.

      (Here’s the process quoted from pages 49-50 of the report: “He (Jastremski) and other members of the equipment staff then “gloved” the footballs, spending between 7 and 15 minutes vigorously rubbing each ball. According to Brady, this created a set of game balls “where most of the tack on the ball ended up coming from the leather receiver gloves.” Jastremski told us that he set the pressure level to 12.6 psi after each ball was gloved and then placed the ball on a trunk in the equipment room for Brady to review.”)

      You combine that with the correct analysis of the half-time data (either with colts balls being gauged before pats are re-inflated or after), and there’s no sign of tampering.

      • Thanks for the contribution. I would argue there is no sign of tampering even in the “worst-case” scenario due to “practical significance.” It strains belief to think that any individual would take 0.4 PSI out of a football only at home, or, alternatively (and even stranger), that the idea was to take like 0.6-0.7 PSI out of only a few footballs and leave the rest untampered. I still have a hard time believing that anyone can even tell the difference between such balls, let alone concocting a scheme to tamper with them in such a way. ESPN Sports Science showed an astronomically small difference in grip and weight with 2.0 PSI difference. 0.4 PSI starts to become physically microscopic and would obviously just be a placebo effect for Tom Brady. Furthermore, there’s a large-sample body of evidence we can look at to see how much air pressure impacts fumbling:

        That’s Brian Burke’s analysis of fumble rates based on temperature. Notice that the fumble rate at 40-50 degrees is slightly HIGHER than the 72-81 group. Obviously, at some point cold hands impact grip, but I have a hard time accepting that between 50-72 degrees it’s really a problem, especially with gloves. So if there is clearly no drop in fumbling between 52-80 degrees, then fumbling is not affected by differing PSI’s as predicted by Sports Science. Since PSI changes with temperature, a 12.5 ball would range from 11.5-13.0 on the field in that temperature range, and a 13.5 PSI ball would range from 12.0-13.5. Per Sports Science analysis, we’d expect no difference in grip or throwing performance based on such small differences…differences so small, that thousands of players oblivious to physics never knew they were experiencing balls of differing PSI’s. Jerome Bettis and Mark Brunell included.

      • I realize the futility of continuing to post about this stuff, but it still interests me.

        Why did two balls have to be reinflated before the game? The Exponent report assumes that the Patriots’ gauge is equivalent to the non-logo gauge, and that Walt Anderson used the non-logo gauge. This is pretty critical to their analysis. And yet two balls had to have extra air pumped into them by Anderson, meaning they were below 12.5 psi when Anderson saw them, despite Jastremski pumping them to 12.6 psi. If their gauges were equivalent, AND Jastremski pumped them up in a colder room as they assumed, why would any of them need reinflation? This suggests to me that it is very possible that a few balls Jastremski pumped up were affected by the ‘gloving’ football preparation, and subsequently dropped in pressure. So right there you have a reason for varying pressure for Patriots’ balls with almost exactly the amount of potential deflation that Exponent couldn’t explain. Once again, only Wells’ assumption that Walt Anderson was infallible (except regarding which gauge he used) lies between the Patriots and complete exoneration.

        It is amazing that the data here is literally a moving target with no actual reference points (only assumed ones) and a control group that is 25% unreliable, and yet Exponent makes their case with absolute certainty.

  4. I’ve just read the AEI report on the matter, and they add yet another piece of evidence that should have people in the “99.9999%” confidence range of how improbable this actually is: How is it possible that the referees switched gauges while measuring balls and that absolutely no one noticed…unless they re-inflated the Patriot balls FIRST before measuring the Colts balls. (This is corroborated by the “ran out of time” statement as a reason for measuring only 4 Colts balls, since how did they know how long it would take to re-inflate the Patriot footballs?)

    Thus, while I did not weigh the likelihood of the scenarios in this post, from a probability standpoint, it becomes unlikely that the Colt balls were measured earlier in the locker room period, further reducing the likelihood of tampering as that possibility no longer carries equal weight against the other scenarios.

    In conjunction, as George has pointed out, the gloving of the Patriot balls makes it more likely that the Logo Gauge was used. This could basically be “proven” if the Wells Report (or someone) would investigate whether the Colts also glove right before inflating.

    Scientifically speaking, I would say the evidence is so strongly for the Patriots (and thus why they were expecting an apology as they sat on the measurement information) that the report goes as far to “prove” (beyond a reasonable doubt) that the Patriots don’t deflate footballs…since the man labeling himself as the “deflator” went into the bathroom with the footballs….and they showed absolutely no signs of deflation. The only remaining conclusions for tampering are so unlikely they are “beyond reasonable:”

    -McNally is bad at deflating and failed to take air out of most of the balls without realizing it
    -The variability of measurements is so large that the Patriots “lucked” into readings that were nearly identical to the Colts (and quite close to the ideal gas law predictions). Of course, this explanation renders the whole concept of precisely manipulating air pressure moot…which makes sense given that, again, No NFL player or referee has EVER noticed playing with balls that regularly change PSI by 2.0 PSI and sometimes up to 5.0 PSI. (Save for the Ravens on Jan. 10, 2015.)

  5. From the Exponent Report, Fig 21, by fitting the curve, the rate constant for cooling the balls is about 15 minutes (i.e. exp(-t/15)). It should be the same for warming, except that wet balls will warm more slowly — because of the effects of evaporative cooling (which is what causes the wet balls to be lower in pressure in the first place). We see that in Fig 21 as well, however the curves for warming definitely do not fit the exponential as well, indicating some systematic problems with their experiment.

    Note that this means the balls only warm 63% of the temp difference in 15 minutes, 87% in 30 minutes, …

    There still is the effect of the cold ball field which the investigators completely missed. In Boston that January, the average temp was 20 degrees, and warmed up to 50 degrees suddenly before the game. The ground under the artificial turf would have been very cold. Artificial turf is designed to bring up coolness from below, so we can expect the wet turf field to be significantly colder than the air temperature. Just before halftime, the Patriots had a long drive in which many of their footballs were involved, and those wet balls would have been cold from the turf.

    In essence, the Wells Report “Game Day Simulation” failed to include the Game. As the balls sat at the line of scrimmage on the cold turf in between each play, they cooled well below the air temperature. With a 15 minute rate constant, they still would have been feeling that cooling when measured at halftime. This explains the wide dispersion in Patriots pressures, as well as the lower average temperature.

    • Thanks for the insight Bill. From a scientific methodological standpoint, I was floored when I read the Wells Report. Their proxies for re-creating game-day events were very poorly designed in my opinion, and this is yet another example.

  6. Tom Brady should sue eXponent for contributing flawed data that directly resulted in defamation of his character and legacy. At the least, it will bring out the technical facts in this situation, and educate the public. Sadly, people just believe what they hear from the media with zero critical thinking.

  7. At our Patriot forum, I wanted to bring attention to your articles and posted a couple of excerpts from them (along with their links).

    —>The Statistical Improbability of Deflate Gate…
    —>That produces a p-value of 0.247, which means that if our assumptions are true, there is a 75.3% chance the Patriots balls come from a different population.

    ONE POSTER’S RESPONSE: That is not what a p-value means. Trust this data analyst at your own risk.

    As I do not have the knowledge to refute this dismissal, I was hoping you might have a response. I can tell that you clearly know what p-value means. I just can’t speak on your behalf about how you might be altering the context of it in this circumstance (if you are). Thank you for the effort that you’ve put into these!

    • Yes, the poster is correct — technically my statement is not complete. It was designed for lay people who have no clue what a p-value is and would have a hard time interpreting something like 0.247 as a p-value. Saying there is a 75% chance the balls come from a different population based on that is indeed a bit of a bastardization. Getting into null hypotheses and normal distributions was overkill to help simplify a concept for people, IMO.

      Hope that helps.
