Goodell’s Illogical and False Deflategate Statements

It turns out that Roger Goodell, Exponent and Ted Wells just aren’t very good at logic. Whether that’s due to severe defensiveness and a major confirmation bias or something else is irrelevant. I’m not going to go into legal details or CBA issues, but I will discuss the scientific and logical errors and inconsistencies from Goodell’s appeal ruling and the hearing itself in deflategate.

Falsehood No. 1 — Timing was accounted for in the statistical test

On pg 6 of his ruling, Goodell writes:

“In reaching this conclusion, I took into account Dean Snyder’s opinion that the Exponent analysis had ignored timing…however, both [Dr. Caligiuri and Dr. Steffey] explained how timing was, in fact, taken into account in both their experimental and statistical analysis.”

This is patently false. It is not an opinion of Dr. Snyder’s. It is a fact. And it is a fact agreed upon by Dr. Caligiuri and Dr. Steffey after much runaround and refusal to answer this question. In Dr. Caligiuri’s testimony on pg 361 of the hearing:

“So the reason you don’t see a timing effect that we concluded in the statistical analysis is because it’s being masked out by the variability in the data due to these other effects.”

And then later on pg 380:

Kessler: So the initial test you did to determine whether there was anything to study did not have a timing variable?

Caligiuri: Not specifically, no.

Steffey echoes this fact on pages 429 and 430:

Kessler: This one-structured model that you chose to present as your only structured model in this appendix and in the entire report, okay, has no timing variable in it, correct?

Steffey: There’s no term in there that says time effect.

Goodell is either misrepresenting the truth or he is very, very confused and was not able to understand this issue at the hearing. Either way, once and for all, timing is not accounted for in Exponent’s statistical analysis. It is a major confound, and it does change the results when timing is indeed accounted for.

(By the way, the Exponent scientists were attempting to claim that an ordering effect is the same thing as accounting for timing, but that is also wrong. First, an ordering effect can have different increments of time (as the Patriot and Colt balls do) and second, an ordering effect is independent of time, which is relevant in an instance where another variable, like wetness, would completely mitigate the presence of an ordering effect but not undo the effect of time.)

Falsehood No. 2 — Brady’s “extraordinary volume” of communication for ball prep

On pg 8 of his ruling, as part of discrediting Brady’s testimony, Goodell reasons:

“After having virtually no communications by cellphone for the entire regular season, on January 19, the day following the AFC Championship Game, Mr. Brady and Mr. Jastremski had four cellphone conversations, totaling more than 25 minutes, exchanged 12 text messages, and, at Mr. Brady’s direction, met in the ‘QB room,’ which Mr. Jastremski had never visited before…the need for such frequent communication beginning on January 19 is difficult to square with the fact that there apparently was no need to communicate by cellphone with Mr. Jastremski or to meet personally with him in the ‘QB room’ during the preceding twenty weeks.”

This is a serious mischaracterization of facts. Let’s ignore the basic fact that there wasn’t a media frenzy surrounding Jastremski’s domain in any of the previous 20 weeks. During the hearing, Brady explained that, for the Super Bowl, Jastremski needs to prepare approximately one hundred footballs, at least eight times his normal volume.

Furthermore, Brady testified that deflategate allegations surfaced on days when he was not at the stadium because of the Super Bowl break. Frankly, it would have been stranger if he didn’t call Jastremski. The hoopla over the visit to the QB room is also bizarre, since Brady said he simply didn’t want to look for him in the stadium. There is no justification for how Goodell ignores this evidence, even taking it further and writing on pg 9:

“The sharp contrast between the almost complete absence of communication through the AFC Championship Game and the extraordinary volume of communication during the three days following the AFC Championship Game undermines any suggestion that the communication addressed only preparation of footballs for the Super Bowl.”

Yet Brady testified, in front of Goodell, that they were discussing Super Bowl preparation (of 100 balls, not 12) and the issue of alleged tampering.

Logic Error No. 1 — It has never happened…but it has happened…but that doesn’t matter

On page 3 of his ruling, Goodell writes that:

“Mr. McNally’s unannounced removal of the footballs from the locker room was a substantial breach of protocol, one that Mr. Anderson had never before experienced. Other referees interviewed said…that [McNally] had not engaged in similar conduct in the games that they had worked at Gillette Stadium.”

So Goodell is saying that McNally grabbing the balls is a huge deal and in fact, it has never even happened before! Which would then make it impossible for this to have been a regular practice.

Thus, when analyzing text messages, Goodell ignores this information and believes that McNally’s references to “Deflator” (in May) and “needles” in October of 2014 are signs of a tampering scheme, but when trying to establish the severity of the situation he believes nothing like this has ever happened before.

Similarly, during the hearing (pg 307) Ted Wells admitted that he ignored the testimony of Rita Calendar and Paul Galanis — game day employees — who claimed that McNally took the balls to the field about half of the time without the officials. Wells doesn’t even think this issue is relevant, explaining that:

“I didn’t need to drill down and decide when he walked down the hall 50 percent of the time by himself or was this person right or that person right.”

Got all that? This has been happening since at least 2014, but this is the first time something like this has ever happened. And Wells thinks it doesn’t matter whether this ever happened before or not.

Logic Error No. 2 — Jastremski expects a 13 PSI ball despite a tampering scheme

On pg 278 of the hearing, Wells acknowledges that John Jastremski texted his significant other about the Jets game and said that he expected the footballs to be 13 PSI. Amazingly, Wells believes he is telling the truth, which creates yet another Wellsian logical impossibility.

How can Wells believe Jastremski expected the balls to be at 13 PSI for the Jets game and believe that there was a scheme to deflate the balls below 12.5? It is a completely contradictory thought. (This is similar to Jastremski’s text message that he sent to McNally about the ref causing the balls to be 16 PSI in that game, and not a message to McNally about why the balls weren’t properly deflated.)

This makes it logically impossible for there to have been a tampering scheme for that home game against the Jets. This either means that:

  1. There was no tampering scheme ever
  2. There was a tampering scheme, but only after October, 2014
  3. The tampering is carried out inconsistently at home

The third explanation borders on preposterous, namely because the text still would have said something like “we should deflate every week from now on to avoid this!” The other two explanations make it impossible for the comments from May, 2014 to be about deflating footballs. Yet Goodell follows suit and cites such messages as evidence of a tampering scheme (pg 10 of his ruling):

“Equally, if not more telling, is a text message earlier in 2014, in which Mr. McNally referred to himself as ‘the deflator.’”

Goodell, like Wells before him, omits that McNally claimed the reference was about weight loss, which may sound crazy until you consider that other people use the term for weight loss, including the NFL’s own network in 2009, and that McNally himself appears to make a reference to weight loss using the term “deflate” during the Patriots-Packers game in 2014 in Green Bay. (McNally was watching the game on TV from his living room, and after seeing a picture of Jastremski suddenly in a large, puffy jacket, texted him a message to “deflate and give someone that jacket.”)

Logic Error No. 3 — For the Colts, the Logo gauge matters. For the Patriots, it is impossible.

On pg 3 of Goodell’s ruling, he writes:

“Eleven of New England’s footballs were tested at halftime; all were below the prescribed air pressure range as measured on each of two gauges. Four of Indianapolis’s footballs were tested at halftime; all were within the prescribed air pressure range on at least one of the two gauges.”

First, this is bizarre, because it’s clear both sets of footballs lost pressure due to environmental factors. The Colts being “within the prescribed air pressure range” is simply due to their balls starting higher — Goodell knows it, you know it, every c-minus physics student in America knows it.

But what’s more problematic, and yet another assault on common sense, is that Goodell later rules that Anderson had to have used the non-logo gauge in the pre-game due to unassailable logic, but here he references the Colts being “within the prescribed air pressure range” on a gauge he considers impossible to have been used.

Logic Error No. 4 — The balls were the same wetness

Wetness or moisture is a huge issue in the science. Yet here’s what Exponent scientist Dr. Caligiuri had to say about it as an alternative explanatory factor to tampering on pg 385 of the hearing:

“It is a possibility [that the Patriots’ balls could have been much wetter than the Colts’ balls because of the fact that the Patriots were on offense all the time with the balls], but there is no evidence that that occurred. The ball boys themselves said they tried to keep them as dry as possible.”

Brady’s attorney Jeffrey Kessler then asks him to confirm:

Kessler: Well, if you are on offense and you playing with the ball, can you keep it dry when it’s out there on the field?

Caligiuri: No

Kessler: Okay. So if the Patriots have those balls out there on the field, it’s plausible those balls were wetter, sir, right? You are under oath.

Caligiuri: Sure.

Kessler: Okay. And you didn’t test of that plausible assumption, right? Did you test for it?

Caligiuri: No…

Later Caligiuri states:

“We did not test for that because there was no basis to test for that.”

Yet, there is indisputable evidence that the Patriot balls were wetter. Namely, it was raining during the game and the Patriots possessed the ball for essentially 17 consecutive minutes in real-time, during the rain, to end the first half. Saying that there is no basis to test for that is a direct contradiction of the publicly available and undisputed information. Yet, on pages 383-384 of the hearing, Caligiuri says:

“Did we look at wetness as a variability…in the beginning, no we didn’t.”

Instead, he says they looked at “extremes.” This makes plenty of sense, except there are two giant problems. First, misting a football every 15 minutes with a hand spray and then immediately toweling it off is a nonsensical proxy for constant exposure to rain. Second, it does no good to create a range of possibilities and then not test the most likely possibility, namely that one set of footballs is on the wetter end of that range and the other is on the drier end.

Logic Error No. 5 — Evidence that inflation mattering = deflation mattering/preference for deflation

Goodell has another breakdown in logic on pg 11, footnote 9:

“Even accepting Mr. Brady’s testimony that his focus with respect to game balls is on a ball’s ‘feel’ rather than its inflation level, there is ample evidence that the inflation level of the ball does matter to him.”

Yes, there is evidence that it matters if the ball is grossly overinflated. There’s no evidence that he wants it underinflated, or that reasonable inflation levels actually matter to him. None. It is a logical fallacy to think otherwise. It’s like saying “Mr. Brady complained about his food being too salty last night, therefore there is evidence that Mr. Brady really cares about having under-salted food.”

Logic Error No. 6 — Practical Significance

Finally, lost in all the discussion of the statistical significance is the issue of practical significance. This is the area that I really wish the NFLPA would have attacked at the hearing, but they did not broach it at all. Ironically, it’s probably the easiest part of the science for the lay person to understand.

Let’s assume that we were 99.9999% certain that the Patriot balls were all 0.3 PSI below where they should have been at halftime based on temperature alone — right around the actual number we think they are based on projections. That certainly does not mean that “tampering” is the only alternative explanation, and more importantly, it’s not very likely if the real-world explanation is not practical.

What benefit would someone actually gain from a completely undetectable change in PSI? Remember, players have never even known there were PSI changes from temperature in the past.

In other words, even if there is statistical significance on data that incorporates measurement time (which there isn’t), what would that data be suggesting? That Brady can magically detect differences in footballs that others can’t (and yet despite this, does not care if balls on the road are not a few tenths below 12.5), or that some other factor, like wetness, wind, temperature difference, gauge variability, inaccurate memory, etc., is a more practical explanation?

For those who missed it, Exponent themselves discovered on the order of a few tenths of a PSI difference between the Patriot actual halftime measurements and where they projected their measurements to be.

Bonus Logic Error — It had to be the Non-Logo gauge

I’m hesitant to discuss this Red Herring, because the difference is negligible between the Logo and Non-Logo gauge when comparing the Colt and Patriot measurements. And this makes total sense: shifting the Patriot balls down a few tenths should (and does) also shift the Colt balls down a few tenths. But let’s pause to appreciate the absurdity of this logic, and of Goodell then doubling down to call it “unassailable.”

On pg 7, footnote 1, Goodell writes:

“I find unassailable the logic of the Wells Report and Mr. Wells’s testimony that the non-logo gauge was used because otherwise neither the Colts’ balls nor Patriots’ balls, when tested by Mr. Anderson prior to the game, would have measured consistently with the pressure at which each team had set their footballs prior to delivery to the game officials.”

Here’s what he’s referring to, echoed by Dr. Caligiuri on pg 364 of the hearing:

“Yes, he calculated, I rounded it up. 12.17, correct, okay. And then if you look at the Colts’ balls, if the same logo gauge was used, it’s reading 12.6, 12.7. We were told that the Patriots and the Colts were insistent that they delivered balls at 12 and a half and 13, which means, geez, looks like the logo gauge wasn’t used pre-game.”

OK, now let me assail it quickly, something that was already done at the hearing which Goodell presided over. The Logo gauge is inaccurate (reads too high) and the Non-Logo gauge is much closer to the “true” reading. Exponent tested a bunch of new gauges. Based on these two facts alone, Wells and Exponent have concluded that it’s improbable the Logo gauge was used because then the Colt and Patriot gauges would also have to be off by a similar amount, and that’s just, I mean, geez, that’s just insane.

Right?

Except for the pesky little problem that according to the Wells Report, Exponent tested one model only, Wilson model CJ-01. A model they describe as being “similar” to the Non-Logo gauge! So their sample size to make these “unassailable” conclusions is really one.

But there’s more: Exponent discovered gauges can “drift,” or grow more inaccurate with use. It’s quite possible that the Patriot and Colt ballboys both have older gauges that have “drifted” to a similar degree. At the hearing, this was scoffed at because it would be coincidental that they were off by the same amount. Again, this doesn’t actually matter — it’s a Red Herring — but it demonstrates how poor these people are at basic logic. On pg 295, Wells said:

“Maybe lightning could strike and both the Colts and Patriots also had a gauge that just happened to be out of whack like the logo gauge. I rejected that.”

The Patriots claim to set balls at 12.6 PSI, but Anderson did not remember gauging them all at 12.6 in the pre-game. (He remembered 12.5, and had to re-inflate two balls that were under 12.5.) There are two likely explanations for this:

  1. The gloving procedure created some variability in the Patriot balls. This would make it more likely the Logo gauge was used, based on Exponent’s logic.
  2. The Patriot gauge and Anderson’s pre-game gauge are off by about 0.15 PSI.

Either way, it’s impossible for the “lightning striking” concept to even apply (that the gauges were off by an identical amount). Using Wellsian logic, which means we ignore things like gloving or temperature changes from ball to ball, the very fact that the balls weren’t 12.6 as the Patriots say but some were under 12.5 for Anderson tells you that the two gauges are not identical. So there’s no need for “lightning to strike.”

Bonus Question: How closely did Roger Goodell read the Wells Report?

In his ruling, Goodell states that he relied on the factual and evidentiary findings (pg 1) of the Wells Report, but there are times during the appeal hearing when Goodell does not seem to know the basic case facts:

  • pg 49, he asks “John who?” when Brady is talking about John breaking the balls in. It’s possible this is Goodell’s way of confirming he is talking about John Jastremski. However, given the context of Brady’s explanation and Jastremski being one of a handful of central figures in the case, it’s bizarre that he has to ask who John is. Does he know about Jim and Dave too?
  • pg 61, in reference to the October Jets game, he says, “Just so I’m clear, the Jets game is in New York.” This is a huge detail to not understand as it relates to the 13 PSI text mentioned above.
  • pg 177, while Edward Snyder is discussing the halftime period, he interjects “Just so I’m clear, you are saying it would take 4 minutes for 11 balls to be properly inflated? That’s your analysis or what analysis is that?” Here, Goodell is saying that he is completely unaware that the witnesses in the room at halftime provided those estimates to Wells, who relayed them to Exponent, and that those estimates are central to the scientific and statistical analysis in the case.
  • pg 180, in the discussion about “dry time” (vs moisture), Goodell asks in regards to moisture, “that’s a what-if, right?” How can the person ruling on the case, after reading a report that was designed to determine if environmental factors could explain the halftime measurements, ask if “rain” is a “what if” when it rained during the game?
  • pg 396, perhaps the clearest indication that Goodell either did not read or did not properly retain the information in the report is that he has no idea what the “gloving” issue is. This is the gloving referenced by Bill Belichick in his press conference and given an entire section by Exponent in their report.

Deflategate: Exponent’s Bias and the Master Error

With all of the publicized corrections to the science section of the Wells Report, I’ve been asked by more than one person whether Exponent, the author of said section, was simply incompetent, or whether they were biased. It’s a question that might have legal ramifications in the near future for Tom Brady.

As I’ll detail below, there is a body of evidence suggesting that Exponent’s report was not merely the result of bad science, but conducted with a clear anti-Patriot bias. They repeatedly made errors or only looked at possibilities that weakened the Patriots’ position without ever making errors in their favor. The nature and frequency of these errors makes it unlikely to be a coincidence. Furthermore, Exponent committed a major error in one of their key figures, an error that allowed them to report, incorrectly, an anti-Patriot conclusion back to Ted Wells. What exactly am I referring to?

Not accounting for time of halftime measurements

At a high level, the biggest methodological error Exponent commits is not properly accounting for the time differences of when the balls were measured at halftime. This leads to a nonsensical statistical test that they publish to establish “statistical significance.” The problem is, they knew about this factor. They too considered it a salient factor. They made multiple transient curves mapping how things change depending on when they were measured at halftime.

They didn’t stop there.

They dedicated an entire section (Table 13, page 58) to perform a mini-version of the analysis I present here, using periods of “average measurement time” to compare the difference between expected PSI and observed PSI at a given time.

Wells writes, on page 122:

“According to Exponent, the environmental conditions with the most significant impact on the halftime measurements were the temperature in the Officials Locker Room when the game balls were tested prior to the game and at halftime, the temperature on the field during the first half of the game, the amount of time elapsed between when the game balls were brought back to the Officials Locker Room at halftime and when they were tested, and whether the game balls were wet or dry when they were tested.”

So they thought a lot about the impact of the timing of halftime measurements. On page 57, in one of many mentions of this:

“A similar effect is seen in the game day simulation data; the average pressure rises as the average measurement time is increased.”

Again on page 62:

“Based on the transient curves explained above, one would expect that if the Patriots footballs were set to a consistent or relatively consistent starting pressure, the pressure would rise relatively consistently as they were tested later in the Locker Room Period.”

Yet they still published their p-values on page 11 and conducted analyses in the opening pages without considering time! This cannot be due to incompetence since they are keenly aware of and explicitly call out the importance of time on multiple occasions. On page 64, in their concluding statement, their second point cites these statistical tests as critical pieces of evidence supporting their conclusion. Unless different people prepared different parts of the report, this is evidence of a clear bias against the Patriots. But it’s also just the beginning.

Switching Fig. 26 to the extreme low temperature of 67 degrees

The transient curve used in Figure 24 to project Non-Logo gauge results uses a pre-game room temperature of 71 degrees. The HVAC on the day of the game was set between 71 and 74 degrees. But Exponent measured the temperature in the room where the balls were gauged by officials in the pre-game to range from 67-71 degrees. It was a good 30 degrees colder outside on the day Exponent measured, and there wasn’t the same game day activity where numerous people give off extra heat in the room.

When they project the Logo gauge results on the transient curve used in Fig. 26, they switch the pre-game temperature to 67 degrees, the extreme end of the plausible spectrum that produces the lowest Patriot reading. Their explanation for using 67 degrees is so the Colt measurements align with the projections. This is a reasonable approach, given that the Colt balls “should” obey the laws of physics, but (a) it should not be the only scenario examined and (b) they did not need to drop the pre-game temp all the way to 67 degrees to achieve this! Doing so only increased the appearance of guilt for the Patriots. The Colt readings are still viable and within Exponent’s “range” of what is predicted by physics even with a 69 degree pre-game temperature.

Misting the footballs to simulate rain

When accounting for water, as described on page 42 (footnote 36), footballs were sprayed every 15 minutes with a hand held spray bottle and then toweled off immediately. As has been demonstrated, this is a minimal attempt at simulating rain. This is critical to interpreting the results (that will be discussed below and that reflect those presented here); Exponent’s wet curves between Figure 24 and Figure 26 show an additional effect of about 0.25 PSI due to wetness simply from running the simulation again. Yet, as we’ll see in a second, they cannot imagine how the Patriot footballs would be a few tenths below where they were expected based on temperature-only projections.

Not calculating the actual PSI differences from expected

The mini experiment Exponent runs in Table 13 produces the following results: at the earliest plausible time (let’s use the 4:17 reading), Patriot averages should have been 11.54 PSI on the Non-Logo gauge. The actual Master-adjusted halftime average on the Non-Logo gauge was 11.09 PSI. So the Patriots are -0.45 PSI from expected. The Colts Non-Logo average was 12.29 according to Table 11 on pg 45. (This is because Exponent uses the “switch” option to correct for the anomalous 3rd Colt ball.) Therefore, the Patriot balls are about 0.4 PSI below the Colt balls relative to expected. Is that clear from Table 13?
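The differential Exponent leaves out is a one-line subtraction. Here is a minimal sketch of it for the Patriot figures quoted above; the function and variable names are mine, and the Colt projection is not quoted in this paragraph, so it is left out rather than guessed.

```python
def shortfall(observed_psi: float, expected_psi: float) -> float:
    """Observed minus expected halftime pressure in PSI (negative = below projection)."""
    return round(observed_psi - expected_psi, 2)

# Patriot Non-Logo figures as quoted in the text (Master-adjusted).
patriots_expected = 11.54  # projection at the earliest plausible measurement time
patriots_observed = 11.09  # actual halftime average

patriots_gap = shortfall(patriots_observed, patriots_expected)
print(f"Patriots vs. expected: {patriots_gap:+.2f} PSI")
```

Publishing this number for both teams, at each candidate measurement time, is all it would have taken to make Table 13 readable.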

Exponent Table 13

Not only is it unclear, Exponent never even publishes the differences. They fail to calculate or discuss perhaps the most specific and important detail of all of their experimentation, instead simply noting that the Colt readings are in-line with these simulations and the Patriot readings are not. This is not incompetence, it is a bias of omission. More importantly, are the Colt measurement times in Table 13 even plausible?

Assuming the Colt balls are measured before the Patriot balls

Exponent assumes, contrary to the evidence, that the Colt balls were gauged before the 11 Patriot balls were reinflated. This is yet another anti-Patriot “error” or instance where they refuse to examine other plausible scenarios. The repeated and consistent manner in which this happens is hard to chalk up to coincidental incompetence.

Wells does not explicitly state that the Colt balls were gauged before the Patriot balls were re-inflated. Exponent should have asked about this and should have clearly stated it if it were provided such information. If not, they should have, “to be fair,” at least considered the possibility that the Colt balls were gauged later in the locker room period as an explanation for the differences of a few tenths of air pressure.

Burying the Logo and Non-Logo average PSI results

So, what would happen if they were to explicitly note the PSI differences in their table and include Colt measurements at 11 or 12 minutes, the times that they were likely to be gauged?

Table 13 Updated

An updated version of Exponent’s table 13, showing Non-Logo Gauge Master-Adjusted results with a 71-degree pre-game temperature. This table includes a later measurement time for the Colts as well as explicitly calling out the differences between the expected and observed halftime values.

Now, for example, it’s crystal clear that an approximate 4-and-a-half minute measure time for New England and 11-minute measure time for Indianapolis result in a difference of 0.3 PSI on the Non-Logo gauge between the Patriot and Colt balls. This is similar to what has been observed in more detailed analyses.

Forget the inclusion of a later Colt measurement though. Why doesn’t Exponent call out that differential since it’s perhaps the single most salient data point in their entire report? Without any corrections, it would reveal differences of a few tenths of PSI between the control (Colts) and Patriot Non-Logo readings. Would publishing that number have impacted people’s reactions to their conclusions?

What about the Logo gauge experiment in Table 14? The Patriot Master-adjusted Logo halftime average value was 11.21 PSI, hidden in the paragraph on the following page, meaning that their experiment again found Patriot balls 0.3-0.4 PSI below expected on the Logo gauge, with the pre-game temperature at 67 degrees.

Table 14 Updated

An updated version of Exponent’s table 14, showing Logo Gauge Master-Adjusted results with a 67-degree pre-game temperature. This table includes a later measurement time for the Colts as well as explicitly calling out the differences between the expected and observed halftime values.

Could water account for that small difference? Or a different temperature? Placing the pre-game temperature at something like 69 degrees will bring the Patriot balls about 0.1 PSI closer to expected. Again, this is something Exponent conveniently does not even consider, despite providing a plausible temperature range of 67-74 degrees and running misting tests that demonstrate an effect of wetness.

The Master Error — failing to use master projections for master results

And then there’s this enormous error.

In Figure 26 (a figure recycled again in Figure 30), Exponent used a Master-adjusted transient curve to demonstrate where the footballs are projected to be as they heat up at halftime. Only they fail to present an adjusted curve! Figure 26 is simply wrong.

The curve shows a dry starting halftime value of over 11.5 PSI for the expected Patriot values. But a Master-adjusted Patriot ball would actually be 12.17 PSI in the pre-game according to Exponent. A dry football is expected to be 11.20 PSI at 48 degrees if it were set at 12.17 PSI in a 67 degree environment in the pre-game, as Exponent is attempting to model. The graph is not master-adjusted, even though Exponent claims it is. It is a clear error and needs to be corrected.
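The 11.20 PSI figure can be checked with the constant-volume ideal gas relation, where absolute pressure scales with absolute temperature. The sketch below is my own back-of-the-envelope check, not Exponent's model: it assumes a standard atmosphere of 14.7 PSI (a value I am supplying, not one from the report), ignores any change in ball volume or moisture, and uses function names of my own choosing.

```python
ATM_PSI = 14.7  # assumed atmospheric pressure; not a figure taken from the report


def fahrenheit_to_rankine(temp_f: float) -> float:
    """Convert Fahrenheit to the absolute Rankine scale."""
    return temp_f + 459.67


def gauge_psi_after_cooling(start_gauge_psi: float, start_temp_f: float,
                            end_temp_f: float, atm_psi: float = ATM_PSI) -> float:
    """Gauge pressure after a temperature change at constant volume.

    Gauge readings are converted to absolute pressure, scaled by the ratio
    of absolute temperatures, then converted back to a gauge reading.
    """
    start_abs = start_gauge_psi + atm_psi
    ratio = fahrenheit_to_rankine(end_temp_f) / fahrenheit_to_rankine(start_temp_f)
    return start_abs * ratio - atm_psi


# Master-adjusted Patriot ball: 12.17 PSI set at 67 degrees, cooled to 48 degrees.
dry_start = gauge_psi_after_cooling(12.17, 67.0, 48.0)
print(f"Expected dry halftime start: {dry_start:.2f} PSI")  # about 11.20
```

Under those assumptions the dry curve should begin near 11.20 PSI, not above 11.5 PSI as drawn in Figure 26.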

What happens when it is corrected?

The Logo scenario that Exponent presents to support its case suddenly contradicts it. It makes their primary conclusion on page 55 simply wrong:

“Based on the above conclusions, although the relative ‘explainability’ of the results from Game Day are dependent on which gauge was used by Walt Anderson prior to the game, given the most likely timing of events during halftime, the Patriots halftime measurements do not appear to be explained by the environmental factors tested, regardless of the gauge used.

Correcting this huge error would fundamentally alter this conclusion.

Incorrectly claiming that the pre-game temperature is set to help the Patriots

They continue to write, on page 54, that

“it is important again to note that values for the pre-game and halftime locker room temperatures shown in Figure 27 put the Patriots transient curves at their lowest possible positions.”

But this is completely backwards — yet another anti-Patriot error. In order to generate the lowest starting transient curve within the HVAC parameters, the pre-game temperature would be 74 degrees, producing a starting halftime value of 10.86 PSI. 67 degrees is actually the worst starting value for the Patriot differentials.
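The direction of this effect is easy to verify: the warmer the room where a ball is set, the further its pressure falls once it reaches field temperature. A minimal sketch, assuming the same 12.17 PSI Master-adjusted starting pressure and 48 degree field temperature cited earlier, a 14.7 PSI atmosphere (my assumption), and constant ball volume:

```python
ATM_PSI = 14.7          # assumed atmospheric pressure; not from the report
FIELD_TEMP_F = 48.0     # field temperature used in the text
PREGAME_GAUGE_PSI = 12.17  # Master-adjusted Patriot pre-game pressure from the text


def halftime_start_psi(pregame_temp_f: float) -> float:
    """Dry gauge pressure at field temperature for a given pre-game room temperature."""
    pregame_abs = PREGAME_GAUGE_PSI + ATM_PSI
    ratio = (FIELD_TEMP_F + 459.67) / (pregame_temp_f + 459.67)
    return pregame_abs * ratio - ATM_PSI


# Sweep the plausible pre-game temperature range discussed in the text.
results = {temp: halftime_start_psi(temp) for temp in (67, 69, 71, 74)}
for temp, psi in results.items():
    print(f"pre-game {temp} F -> halftime start {psi:.2f} PSI")

lowest_temp = min(results, key=results.get)
print(f"Lowest starting curve is at {lowest_temp} F")
```

With these inputs the starting value falls steadily as the pre-game temperature rises, from about 11.20 PSI at 67 degrees to about 10.86 PSI at 74 degrees, so 67 degrees gives the highest projection, not the lowest.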

Inability to conceive of wetness as the explainable natural factor

The icing on the cake is that the differences in the Colt and Patriot measurements are in all likelihood the difference in their exposure to rain. For the uninitiated, this can be clearly seen in the gradient of differences among the Patriot balls that suggests some Patriot balls were exposed to more rain, and in particular those balls on the final drive of the half.

Yet on page 55, when discussing wetness as a factor, they write:

“According to Paul, Weiss, [a majority of wet balls] were most likely not present on Game Day.”

How can they say that, given the factors around wetness? They mention nothing of the Patriot balls being used more, and being in play at the end of the half. This is yet another anti-Patriot oversight. Remember, they presented back-to-back graphics in which water made on the order of 0.2 PSI-0.4 PSI differences from the “dry” condition, based on their own misting procedure. Despite the game being played in rain, Exponent concludes that results of the exact same magnitude cannot be explained by rain.

Conclusion

All told, the only time they seem to do something that isn’t anti-Patriot is when they create a row in Tables 13 and 14 for average measurement times that are improbably early in the locker room period. Otherwise, every misstep, omission and blatant error is decidedly anti-Patriot, and often committed in inexplicable fashion. In summary, Exponent demonstrates the following biases by:

  • Failing to account for halftime measurement times in publishing p-values, despite knowing time of measurement is critical
  • Switching to an (unnecessarily) extremely low temperature projection for the Patriot Logo gauge
  • Misting footballs to simulate rain (and immediately toweling them off)
  • Not publishing the actual PSI differences between halftime measurements and expected measurements
  • Assuming the Colt balls are measured improbably early in the locker room period, and not considering later measurement times
  • Presenting Figures 26 and 30 with completely false transient curves, thereby altering their conclusions vis-a-vis the Logo gauge
  • Incorrectly claiming the pre-game indoor temperature of 67 degrees is a best-case for the Patriots
  • Not considering wetness as an explanation for the few tenths difference despite finding a few tenths difference from wetness

Do the Patriots Fumble Less at Home?

Despite the mountain of evidence that there was no tampering in Deflate Gate, there’s a very basic and very obvious question to be asked: Do the Patriots fumble less at home? After all, that’s where the Wells Report alleges they deflate footballs for a competitive advantage — during home games where New England personnel can take possession of the balls outside of the referee’s watch. Is there a smoking gun in the form of their home game fumbling rates?

Spoiler: Nope. Not even close.

Ironically, out of 32 NFL teams, the Patriots come away looking like the least likely team to have a home-away discrepancy in football quality. New England has actually fumbled more frequently at home than on the road since 2007!

Home Fumble Improvement 07-14

2014 was actually the Patriots’ best home year of the period, fumbling at a 0.45% lower rate at home than on the road…good for 4th in the league this year in home fumbling improvement (data from PFR across all offensive plays, save for kneels). Tampa Bay and Oakland led the league, with a fumble rate that was 1.5% lower at home than on the road. If you aren’t familiar with how big of a difference that is, here’s a refresher on fumbling data. Also worth noting is that the three dome teams thrown out in the infamously bad “Sharp” fumble analysis (Atlanta, Indianapolis and New Orleans) are either similar on the road, in the case of New Orleans, or fumble less away from the dome; the top road fumbling teams over the period are (1) New England, 1.08% (2) Atlanta, 1.21% (3) Indianapolis, 1.35%.

Additionally, since 2007, the Patriots have the most consistent performance in the NFL between home and away fumbling rates. Year by year, they fumble at similar rates in home and away games. Below are the year-by-year fumbling rate improvements for each team from 2007-2014. Notice how the Patriots straddle the league average like a metronome, exhibiting the least variation of any team:

Home Fumble Variance 07-14

Curiously, Tampa Bay also holds the best single-season improvement for home fumbling rates over the last eight seasons, fumbling at an amazing 2.38% lower rate at home in 2011.
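For anyone who wants to reproduce this kind of comparison, the arithmetic is simple. The sketch below assumes that a fumble rate is fumbles per 100 offensive plays and that “home improvement” is the away rate minus the home rate, in percentage points. The season counts are made up purely for illustration and are not PFR data.

```python
from statistics import mean, pstdev


def fumble_rate(fumbles, plays):
    """Fumbles per 100 offensive plays."""
    return 100.0 * fumbles / plays


def home_improvement(home, away):
    """Away rate minus home rate; positive means fewer fumbles at home.

    home and away are (fumbles, plays) tuples.
    """
    return fumble_rate(*away) - fumble_rate(*home)


# HYPOTHETICAL counts for one team, for illustration only (not real data).
seasons = {
    2012: {"home": (6, 520), "away": (7, 540)},
    2013: {"home": (8, 530), "away": (7, 510)},
    2014: {"home": (5, 525), "away": (8, 535)},
}

improvements = [home_improvement(s["home"], s["away"]) for s in seasons.values()]
print(f"mean home improvement: {mean(improvements):+.2f} percentage points")
print(f"season-to-season spread: {pstdev(improvements):.2f}")
```

A team that “straddles the league average like a metronome” is one whose spread across seasons is small.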

The Patriots are alleged to have deflated footballs at home, yet they fumble slightly more at home when the average team fumbles less at home, Tom Brady is statistically better on the road, and the halftime measurements from the AFC Championship are incredibly similar to the Colts’ non-tampered footballs. I think it’s pretty clear what’s going on here…

The best team of the last decade can’t figure out how to properly deflate footballs.

 

Debunking Exponent’s Methodology in the Wells Report

Let’s ignore smaller hand-waving techniques like deciding to ignore one of the measurements for the Colt balls because it “looked strange.” Or that there might be compelling evidence in the report that Walt Anderson did indeed use the Logo gauge in his pre-game measurements. Instead, let’s just focus on the two major methods Exponent used to reach its faulty conclusion in Deflate Gate’s Wells Report.

  1. The “p-value” without controlling for time
  2. The use of a “visual proof” that is biased toward dry balls measured later in the locker-room period

To show exactly what’s so silly (and biased) about this methodology, imagine the following thought experiment:

  • All footballs, for whatever reason, are slightly below where we’d expect them at halftime, about 0.4 PSI below expected
  • The Patriots balls are dry instead of wet
  • The Patriots balls are measured second instead of first
  • The Colts balls are wet instead of dry
  • Finally, imagine that the Patriots deliberately released 0.15-0.3 PSI from some of their footballs after referee Walt Anderson approved them

Doing that produces a table with the following hypothetical halftime measurements:

Fake Exponent Table

As you can see, the Patriot balls have a much higher halftime average measurement because they were dry, inflated to 13.0 PSI in the pre-game and were measured after the Colt balls, allowing them more time to recalibrate to pre-game PSI levels. Now let’s use Exponent’s two main methodologies to see if we can detect the tampering that we’ve built in to the hypothetical.

Exponent Method #1: p-value independent of time

First, let’s run a statistical (independent t) test to see the likelihood that these two groups of footballs are from the same (untampered) group. If we do what Exponent did, which is ignore time of measurement and compare the pre-game level of the Patriot balls (13.0 PSI in our hypothetical) and Colt balls (12.5 PSI) to where they measured at halftime, we get a p-value of 0.0097. That’s almost exactly what Exponent came up with in the real-life scenario for the Patriots.

Only in our thought experiment, it’s the Colts who are found likely to have tampered. Exponent’s method literally picks out the wrong team because it ignores major, confounding variables like time. Pause for a second and appreciate how bad this is: I’ve created a scenario where the Patriots cheat, and Exponent’s methodology points to the Colts…because Exponent’s methodology is biased toward any team that was measured second in the locker room period (and also had wet footballs).

In our hypothetical, the Colt balls were measured first, and thus had less time to recalibrate. Furthermore, the Colt balls were treated as the wet group in this scenario, so they drop further in PSI compared to the dry Patriot balls (the opposite of the real-life AFC title game). So, using Exponent’s nonsensical test, it’s fairly easy to demonstrate “statistical significance” that one team tampered with the balls…even though it was not the team that actually tampered. It was just the team that had its wet footballs measured first.
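The timing half of that bias is easy to demonstrate in isolation. The sketch below assumes the balls warm toward room temperature along a simple exponential with a 10-minute time constant. That curve shape, the time constant and the two measurement times are my assumptions for illustration, not Exponent’s transient curves, and wetness is ignored. Neither set of balls is tampered with.

```python
import math

ATM_PSI = 14.636  # atmospheric pressure assumed throughout this series


def equilibrium_psi(pregame_psi, pregame_f, now_f, atm_psi=ATM_PSI):
    """Ideal Gas Law gauge pressure once the ball has settled at now_f."""
    return (pregame_psi + atm_psi) * (now_f + 459.67) / (pregame_f + 459.67) - atm_psi


def expected_psi(pregame_psi, minutes_indoors, tau_min=10.0,
                 pregame_f=71.0, field_f=48.0, room_f=71.0):
    """Expected gauge pressure after warming indoors for minutes_indoors.

    ASSUMPTION: exponential warm-up with time constant tau_min; this is a
    stand-in for, not a copy of, the transient curves in the Wells Report.
    """
    cold = equilibrium_psi(pregame_psi, pregame_f, field_f)
    warm = equilibrium_psi(pregame_psi, pregame_f, room_f)
    return warm - (warm - cold) * math.exp(-minutes_indoors / tau_min)


# Two UNTAMPERED sets of balls, gauged at different (hypothetical) times.
first_drop = 12.5 - expected_psi(12.5, minutes_indoors=3.0)
second_drop = 13.0 - expected_psi(13.0, minutes_indoors=10.0)
print(f"naive drop, measured first:  {first_drop:.2f} PSI")
print(f"naive drop, measured second: {second_drop:.2f} PSI")
```

Under these assumptions the team measured first shows a visibly larger drop from its pre-game level even though nothing was done to either set, which is exactly the gap a time-blind test will mistake for tampering.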

Exponent Method #2: a picture that lies

Exponent also attempts to use a “visual proof” of sorts to demonstrate that something is wrong with the Patriot footballs and not the Colt footballs. This approach is Exponent’s acknowledgement that time (“transient curves”) is a relevant variable in the measurement process, but their demonstration is simply incorrect.

Without getting into the nitty gritty, we can simply draw the exact same picture that Exponent draws (starting with Figure 24, pg. 210) to completely disprove their “proof.” (Note: they draw a picture because if they ran an actual statistical test on the data using a transient curve, they would reach the opposite conclusion.) I’ve taken the same data and drawn an Exponent-like picture below. According to Exponent’s “logic,” if the “window” between a dry and wet football doesn’t intersect with the band of what was actually measured (within +/- 2 standard errors from the mean) then it demonstrates something outside the physical boundaries of what is possible.

Here’s our hypothetical data presented Exponent style:

Fake Exponent Graph

Lo and behold…the Colt band does not intersect with a 12.5 PSI projection curve and the Patriot band does intersect with its 13.0 PSI projection curve. According to Exponent, this shows that Colt balls must have had something additional done to them. Except that in this scenario, it was the Patriot balls that were tampered with!
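Stripped of the picture, the check being made is whether a band of +/- 2 standard errors around the mean measurement overlaps the window between the wet-ball and dry-ball projections. Here is a rough sketch of that check at a single instant. The measurements and window values are hypothetical, and reducing Exponent’s figure to a one-instant overlap test is my simplification.

```python
import math
from statistics import mean, stdev


def two_se_band(measurements):
    """Return (low, high) for the mean +/- 2 standard errors."""
    center = mean(measurements)
    std_err = stdev(measurements) / math.sqrt(len(measurements))
    return (center - 2 * std_err, center + 2 * std_err)


def overlaps(band, window):
    """True if two closed intervals share at least one point."""
    return band[0] <= window[1] and window[0] <= band[1]


# HYPOTHETICAL halftime readings for one team (PSI), for illustration only.
measured = [11.9, 12.0, 12.1, 12.0]
band = two_se_band(measured)

# (wet projection, dry projection) at two hypothetical moments of warm-up.
later_window = (12.2, 12.5)
earlier_window = (11.8, 12.1)

print("overlaps later window:  ", overlaps(band, later_window))
print("overlaps earlier window:", overlaps(band, earlier_window))
```

The same band passes or fails depending purely on which moment of the warm-up it is compared against, so the verdict depends on how much earlier locker-room time a team has available.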

This is possible because this “visual proof” approach is biased toward a dry ball that was measured later in the locker room. Just like their p-value is biased toward a dry ball measured later in the locker room period. In the case of the visual, if normal variance (from any non-tampering factor) moves dry-ball measurements slightly below what is expected — as we’ve done in this hypothetical for the Patriots — it simply shifts the team’s band below the dry-ball upper boundary but still above the wet-ball lower boundary. If a team actually had a wet ball, they are instead shifted below the lower band, outside of what Exponent considers physically acceptable.

With regards to time, the Patriot balls in our hypothetical comfortably intersect with the acceptable region. This is because they were measured later in the locker-room period; despite dropping in PSI from the fake deflation of seven balls, there is still a band with which to intersect earlier in the locker room period. Conceptually, this is simply taking the point where the hypothetical region and bounded region (the two red regions) intersect and “shifting it left.” The team measured later in the locker-room period has room to “shift left.”

The Colts, however, by virtue of being measured first in this hypothetical, can only “shift left” for about 2 minutes, because that’s when their balls were measured. The Colts can’t “shift left” 4 minutes, because they would no longer be in the locker-room period. The team measured later can “shift left” 4 minutes, because it just means their “possible scenario” occurred earlier in the locker-room period.

I’ve tampered with the Patriot balls only, but both of Exponent’s major methods strongly suggest it’s the Colt balls that have been tampered with! For the record, if we use a proper methodology as shown in the last post (one that accounts for time), a t-test will produce a statistically significant result (p-value of 0.046) that correctly identifies the Patriot balls as being tampered with.

Conclusion

There are other peripheral weirdnesses in Exponent’s methods, but we don’t need to move beyond the two major core issues here that lead them to their conclusions. In our fake scenario, in which the Patriots deflated seven balls, both of Exponent’s methods would find the wrong team guilty of tampering with the balls! The method used in the last post that controls for time — specifically, taking the difference of each ball at the time it is measured and seeing how far it is from the projected PSI — instead correctly identified the tampered balls with statistical significance.

Yes, a proper method can demonstrate this despite a sample size of just four footballs from the control group. This is possible because of the consistency of the measurements in our hypothetical. You know what set of data did not show the same consistency? The real Colt balls, measured at halftime of the AFC Championship game. That Exponent jumps through hoops to try and demonstrate a lack of variability in measurements is fine and dandy…but as Rasheed Wallace once said, “ball don’t lie.” And the four Colt balls don’t lie — there is enough variability in the data set that, unsurprisingly, a 0.2 PSI difference in expected measurements at halftime is not statistically significant, and in many cases, not even close.

A proper analysis from Exponent, given the real halftime data presented in the Wells Report, would have found this.

 

 

The Statistical Improbability of Deflate Gate

On Sunday I broke down some of the common misconceptions surrounding the Wells Report, including the social science involved, the statistical misinterpretations and the lack of coherence in the NFL’s story based on its own evidence. Then, on Wednesday, I provided a time-based visualization of all the measurements presented in the Report based on where we’d expect them to be at a given time as the balls warmed up in the locker room. Visually, it’s fairly clear that the Colt balls and Patriot balls have similar issues, as many are “under-inflated” by similar degrees. But what does this mean in terms of probability if we actually run some statistical tests on the data?

To reiterate, time is a major variable in this case because the PSI of the balls was increasing with every minute that they were in the locker room at halftime. Thus, the time that each ball was measured becomes critical in trying to analyze the discrepancy between where a ball was measured and where a ball “should” be using Ideal Gas Law parameters. Below is one such scenario presented in the previous post. The blue line is where we’d expect a Colt ball to measure given the time indoors and the gold line where we’d expect a Patriot ball to be (based on Fig. 22 of Wells Report):

Deflate Gate Logo Scenario

What we want to calculate is the difference between each ball at a given point in time (a circle) and where we’d expect the ball to be based on how long it’s been in the locker room (the solid line). For instance, in the above graph, Patriot ball #1 is about 0.25 PSI above where we’d expect. Ball #2 is 0.4 below where we’d expect. These values will be different depending on when the balls were measured, so our parameters for simulating the actual measuring circumstances are (assuming the report is correct in that the balls were correctly recorded in order, and were set to 12.5 and 13.0 PSI respectively in the pre-game):

  • Set Up time (2-4 minutes according to accounts)
  • Measurement time (21.8 – 27.3 seconds per ball)
  • Inflation time (2-5 minutes)
  • Packing time (unstated, but assumed to require some small degree of time between last measurement and re-emergence from locker room)
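These parameters translate into a timestamp for every ball. The sketch below shows one way to build that schedule. It assumes the 11 Patriot balls are gauged first, then re-inflated, then the 4 Colt balls are gauged at the same per-ball pace, and that each ball is gauged at the start of its slot. The function names are mine.

```python
def per_ball_seconds(total_minutes, n_balls=11):
    """Seconds per ball if n_balls are gauged in total_minutes."""
    return total_minutes * 60.0 / n_balls


def measurement_schedule(setup_min, sec_per_ball, inflate_min,
                         n_patriots=11, n_colts=4):
    """Minutes after entering the locker room at which each ball is gauged.

    ASSUMPTION: Patriot balls first, then re-inflation, then Colt balls,
    with the same per-ball pace for both teams.
    """
    step = sec_per_ball / 60.0
    patriots = [setup_min + i * step for i in range(n_patriots)]
    colts_start = setup_min + n_patriots * step + inflate_min
    colts = [colts_start + i * step for i in range(n_colts)]
    return patriots, colts


# A 4-5 minute window for 11 balls gives the per-ball range quoted above.
print(f"{per_ball_seconds(4):.1f}s to {per_ball_seconds(5):.1f}s per ball")

# The "medium" scenario: 3 min set up, 25 s per ball, 3.5 min inflation.
patriots, colts = measurement_schedule(3.0, 25.0, 3.5)
print(f"first Patriot ball at {patriots[0]:.1f} min, first Colt ball at {colts[0]:.1f} min")
```

Each timestamp can then be looked up on the transient curve to get the expected PSI for that ball.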

If we use Exponent’s Ideal Gas Law calculations that assume 71 degrees pre-game (which may be slightly low, as noted in the last post) and add a small “wetness” factor per their report, we can then simulate a bunch of these scenarios to see what was likely and unlikely. The scenario above attempts to average all the accounts; set up time is “medium” (3 minutes, halfway between 2 and 4), measurement time is “medium” (~25 seconds) and inflation time is “medium” (3.5 minutes). But we can also examine other scenarios: instances where the Patriot balls were tested after 2 minutes or 4 minutes, quickly or slowly re-inflated, etc. If we do that, we’re left with a number of basic permutations we can study:

Deflate Gate p-value

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 22 of the Wells Report.

So what do these numbers mean? The “Patriot-Colt mean difference from expected” column calculates where each ball should be based on the time it’s measured, takes the average of all such Patriot balls and subtracts it from the average of all such Colt balls. If we take the mean of all six hypothetical scenarios, the average Patriot ball is about 0.003 PSI below where it should be at the time of measurement relative to the Colt balls. (i.e. using the Colt balls as a control group.) The p-value is the statistical likelihood that the balls come from different populations, i.e. that one set of balls had something done to them that the other set didn’t.

  • The best-case statistical scenario for the Patriots is that Walt Anderson used the Logo Gauge pre-game, that the balls were measured at 2 minutes, each took about 22 seconds to measure and that the officials took 5 minutes to re-inflate the balls (labeled “Early Start, Fast Measure, Long Inflate” above). That produces a mean where the Patriots balls are higher than the Colts, meaning it’s impossible for the Patriots balls to come from a population that is inherently lower than the Colts balls.
  • Three of the six scenarios in which the Logo Gauge was used pre-game completely exonerate the Patriots
  • The worst-case scenario for the Patriots is that Walt Anderson used the Non-Logo Gauge in the pre-game, and that the balls were measured at 4 minutes, each took 27 seconds to measure and that the officials took 2 minutes to re-inflate the balls (labeled “Late Start, Slow Measure, Quick Inflate” above). That produces a p-value of 0.247, which means that if our assumptions are true, there is a 75.3% chance the Patriots balls come from a different population.

Although it’s far below “statistical significance,” 75.3% might sound like a lot. But what is that number actually saying? For that, we have to look at the observed difference in the averages to put this into perspective: there’s a 75% chance that the 0.3 PSI difference is not simply from variance and is part of a different population (i.e. tampered balls).

Depending on the distribution, 0.3 PSI could easily be 99.99% likely to come from a different sample…which would suggest, what? There’s a 99.99% chance that the Patriots systematically released an average of 0.3 PSI per football? And that’s the worst-case scenario? That strains common sense.

In total, the results of the independent t-tests show that it was incredibly unlikely that the Patriot balls behaved any differently from the Colt balls using the assumptions presented in the Wells Report. Additionally, in the unlikely event that they are different (roughly a 15% chance if Anderson used the Logo Gauge in the pre-game and 57% if he used the Non-Logo gauge), the degree to which the balls are different is nonsensically small. We would expect a “small” degree of deflation to be something like 1.0-2.0 PSI; the initial reports were “11 more than 2 PSI below regulation,” with another ball falsely labeled by the NFL itself at 10.1. But the data presents a completely different story: the Patriot balls are sometimes higher than the Colt balls relative to what we’d expect, and the worst-case scenario for New England suggests a “non-significant” likelihood of tampering to a degree that is so small it’s equivalent to the variance seen between the two gauges used to measure the balls.
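For readers who want to run this kind of test themselves, here is a self-contained sketch of an independent two-sample t-test on the differences from expected. It assumes the pooled, equal-variance form of the test and computes the p-value by numerically integrating the t density, so no statistics package is needed. The residuals in the example are invented for illustration and are not the Wells Report measurements. The p-value it returns is the probability of seeing a gap at least this large if both groups came from the same population.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)


def independent_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, df, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df, two_sided_p(t, df)


# HYPOTHETICAL residuals (measured minus expected PSI), NOT real data.
patriots_resid = [-0.4, 0.25, -0.3, -0.5, -0.1, -0.35, 0.1, -0.45, -0.2, -0.3, -0.15]
colts_resid = [-0.1, -0.35, 0.05, -0.4]

t, df, p = independent_t_test(patriots_resid, colts_resid)
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

The key design choice is the input: each value is a ball’s measurement minus its expected PSI at the moment it was gauged, so time is accounted for before the test ever runs.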

The next post details exactly how Exponent used faulty methods to reach faulty conclusions in the Wells Report. 

PS What happens if the Colts balls were measured immediately after the Patriots balls and then the Patriot balls were re-inflated? This still produces results that strongly suggest non-tampering, as shown below. If the Colt 4-ball group is indeed treated as a control group, we would expect to see the Patriot measurements 13.6% of the time in their “worst-case scenario” — not an occurrence considered “significant” in the scientific community. If Anderson used the Logo Gauge in the pre-game, Patriot measurements are roughly 0.1 to 0.2 PSI below the Colt measurements and far from statistically significant. 

Deflate Gate Fast Recal Patriots Inflate Last

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 22 of the Wells Report.

June 19, 2015 Update: Joe Arthur has asked me to run the calculation using a flatter transient curve. In other words, what happens if we expect the footballs to heat up more slowly than projected in Fig. 22 of the Wells Report? For a “slower” expected re-calibration, we can use Fig. 24 of the Wells Report. If any physics experts out there can tell us at what exact rate the balls are expected to recalibrate, it would be much appreciated. In the meantime, the aforementioned numbers are derived with a “fast” recalibration curve, and the below with a “slow” recalibration curve. Based on other amateur experiments I’ve found, it seems likely recalibration takes place somewhere between these two extremes. Below are the results of a “slower” expected recalibration. Worth noting is that on the Logo Gauge, the Colts balls are almost exactly where we’d expect.

Slow Recalibration Model Deflate Gate

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 24 of the Wells Report.

Follow-Up: The Evidence for Non-Tampering in 2 Pictures

The last post on the cognitive and statistical biases in Deflate Gate included a visual of the “Logo Gauge” scenario that was added to the post based on the timeline provided on page 70 of the Wells Report. In that post I discussed the problems with the Colt balls, but failed to include a visual for the scenarios. Such a visual, for both the Logo and Non-Logo Gauge readings, clearly demonstrates how similar the Colts balls are to the Patriots balls relative to what a given ball’s PSI should be at a given time.

Below are the measurements taken at halftime, both from the “Logo” gauge and the “Non-Logo” gauge. Each line is the expected PSI of the balls (blue for Indianapolis, gold for New England) as they heat up during the locker room period. The dots are the actual measurements of the balls. Note the differences between the actual balls and where we expect them to be based on temperature and the Ideal Gas Law:

Deflate Gate Logo Scenario

Deflate Gate Non Logo 12.95

The projections use the following assumptions:

  • The temperature indoors was 71 degrees pre-game and 48 degrees outdoors, with an atmospheric pressure of 14.636
  • The balls were indeed measured in order (this is alluded to but not explicitly stated)
  • It took 3 minutes in the locker room before testing began (an average of the 2-4 min guess made by Exponent)
  • It took 25 seconds to test a ball (Exponent’s 4-5 min estimation for measuring Patriots balls produces a range of 22-27s)
  • It took 3.5 minutes to re-inflate the Patriots balls (again, an average of Exponent’s 2-5 min estimation)
  • It took just over 90 seconds to pack up and leave, using the assumption that packing would take less time than set up
  • The Patriots balls were all of the same wetness, following a “wet” curve estimated by Exponent’s wet test
  • The Colts balls were all of the same dryness
  • Most importantly, every Patriot ball was exactly 12.5 PSI pre-game and every Colt ball 13.0 PSI pre-game, as claimed by referee Walt Anderson

 

Notice how the Colts balls are “shifted down” below where they should be in a similar manner to the Patriots balls. It’s likely the measurements reflect natural variance we see in measuring actual game-play footballs, as both Patriot and Colts balls aren’t where we’d “expect” based on the assumed parameters Exponent uses and the Ideal Gas Law. (This variance can come from the operator or from other subtle environment factors not captured by temperature and atmospheric pressure. It can also come from the balls not all being perfectly 12.5 pre-game, or from the temperature not being exactly 71 degrees F.) We can say this for two major reasons:

  1. Some Patriots balls are above where we’d expect them based on a 12.5 PSI pre-game measurement and the Ideal Gas Law
  2. 7 of the 8 Colts measurements (four balls, each read on both gauges) are also below where we’d expect them based on a 13.0 PSI pre-game measurement and the Ideal Gas Law

The Colts balls are actually the best evidence for the Patriots, as they are the only four other footballs ever measured at halftime of a game and they show a departure from what’s expected despite not being tampered with. (Note: Here I’m not arbitrarily treating Colts ball #3 as a transcription error as Wells does.)

This essentially exonerates the Patriots in the Non-Logo scenario, which is what Exponent used to reach its conclusion, because in that scenario three of four Colts balls are more than 0.5 PSI under the expected range, with one of four about 0.75 PSI below expected. By comparison, 7/11 Patriot balls were more than 0.5 PSI below expected, with 5/11 more than 0.75 PSI under expected. As stated in the last post, Exponent overlooked this because they ignored the variable of time (the balls heating up) and presented all balls as being measured at the same time.

The final post in this series examines the statistical improbability that there was any tampering during Deflate Gate.

EDIT: Reader “George” astutely noted that simply increasing the indoor temperature pre-game by a degree or three, due to a slight temperature discrepancy between the HVAC and the actual room temperature (from bodies giving off heat in the room), helps explain much of the slightly below-expected values from both teams. In that case, each team’s expected PSI line would be shifted down slightly, helping to explain the results of both sets of balls, as shown below:

Deflate Gate Logo 74 Degrees

 

The Cognitive and Statistical Biases of Deflate Gate

I’ve been biting my tongue on Deflate Gate, but the scientist in me reached a tipping point today after reading this Boston Globe article analyzing the Patriots rebuttal to the Wells Report. Simply put: Many of the salient parts of this story aren’t being told properly.

This post will analyze key areas of the Wells Report based on the foundations of this blog (cognition in the context of sports statistics):

  1. The interpretation of context-free communication snippets
  2. The interpretation of memory-based claims
  3. The statistical analysis of the AFC Championship game measurements
  4. The lack of coherence in any proposed tampering scheme

Conclusions and a summary are presented at the bottom for those who want to skip over 4,000 words of details.

1. Communication and Context

a. Ambiguity

Essentially, human beings cannot communicate without context. Think about words like transactions. Deposits. Tellers. Now read the following snippet from a conversation:

Person A: “Pick me up by the bank of the river.”

In all likelihood, you thought of a financial institution. Without any context, that snippet is ambiguous. But the mind will rarely interpret it as ambiguous; talking about things related to a financial institution primes the brain to think of a financial institution and thus interpret “bank” as a building with windows and tellers. Most importantly, one’s instinct is to think it’s something, and not think it’s ambiguous. But what happens when we get to see the full conversation?

Person A: “Are we going rowing tomorrow?”

Person B: “Yes, I’m setting up the boat right now.”

Person A: “Great, so when should we meet?”

Person B: “Let’s say 9. I’ll start at the boathouse.”

Person A: “Pick me up by the bank of the river.”

Person B: “Great — you can just hop on there.”

The conversation is no longer really ambiguous: these two people are talking about an embankment next to the river, not a financial institution.

The majority of the non-statistical case in the Wells Report is based on highly ambiguous texts that were taken out of context. Surely, most people who have been following the story from the beginning don’t think they are ambiguous because most people were thinking about the context of deflating footballs when they read the text messages. They were primed to think this way and as such can no longer see another explanation as even reasonable without oodles of context.

This is a typical case of anchoring and confirmation bias, two of the most powerful mechanisms that govern our decision making. I can shape your opinions by putting information in your head (like the financial institution example above), whether accurate or wrong, and that initial information holds extra weight in your mind (anchoring). Then, once you start to believe something, you start to only look for evidence that supports your story (confirmation bias).

b. “Help the Deflator”

The Globe article takes exception to the Patriots’ explanation that Jim McNally referenced himself as the ‘deflator’ because he is a big fellow and wanted to lose weight. The author, Ben Volin, responds “It’s hard to find a rational-thinking person in the country who buys this answer.”

That’s just ignorant. And understandably — the science of thinking isn’t exactly taught in high schools.

Volin is under the impression that his mind isn’t heavily anchored to the context of deflating footballs, when for months, he’s only associated the term deflator with this issue. He, like most of us now, probably can’t even think of the word “deflate” without thinking of PSI and footballs. From a cognitive standpoint, that’s predictable.

But not necessarily accurate.

Prima facie the texts reflect incriminating language to those who have been loaded up with the idea that there was a tampering ploy in place. Once the mind has decided what the “deflator” refers to, it has a hard time accepting a counter explanation without a larger sum of evidence. But again, that’s a recipe for false conclusions and simply a predictable function of the brain’s desire to create certainty instead of ambiguity.

Conversely, if I told you that two jocular workmates came up with strange terms to needle each other with, and those terms were related to the actual work they did every day, would you think that’s strange? It’s possible, without any additional evidence (see sections 3 and 4 below), that he calls himself the “deflator” because he regularly tampers with footballs on Sunday. There are also a myriad of other possibilities for that one text message, especially given McNally’s texting habits and propensity for wild language and nonsensical statements (e.g. “what’s up dorito dink?”).

People use jargon specific to their vocation all the time, and do so in extending humor or personalizing phrases. On the outside (with no context) these references seem meaningless or are misinterpreted. That’s the definition of an “inside joke.” It takes one instant of connecting a football losing weight to a person losing weight and voila, an inside joke. (Or, the only recorded instance of McNally referring to his role in a tampering scheme, a scheme that was otherwise never apparently discussed over text.)

Just imagine what kind of story can be painted when snippets are taken out of context. The possibilities are endless:

Honestly, I just wanted to slip in Brian Williams rapping. Everyone’s so serious about these footballs that they could use a little Gin and Juice.

c. What happens when you add context?

There are a number of instances of people misinterpreting something without context. I was going to cue up a bunch of examples and research experiments, but we need to look no further than the Patriots’ inclusion of (alleged) other testimony that was omitted from the Wells Report.

Consider that the report tries to make it look like Brady is bribing McNally with gifts. What the report does not include, according to the Patriots, is that Brady regularly gives comparable gifts to “15 non-player personnel.” This, and many other omissions like this in the report, need little analysis to illustrate the problems with context-free conclusions; obviously it looks quite different if Brady only gives alleged co-conspirators gifts, versus a standard distribution of comparable gifts to a number of people regularly.

Similarly, there was a bit in the Wells Report about “getting a needle.” Without context, many interpreted this as fitting a narrative, as the brain is designed to do. But when you add the context that McNally was the individual who literally provided needles to the officials for their pre-game process, it completely changes the meaning of the text. Again, I’ll leave it up to the reader to guess why Wells left out the context that, according to the Patriots’ rebuttal, was confirmed by all witnesses involved. (Specifically, that McNally had to return to ask Jastremski for another needle at the request of the officials, and that this practice became, like seemingly everything else between them, a running joke.)

The only difference between these examples and the “deflator” joke is that the full context about the weight-loss joke hasn’t been substantiated by others. However, this is the same cognitive mechanism seen in God of the Gaps thinking. It is a fallacy, and the very problem with context-free thinking, to think that just because we don’t currently have that context that an alternative explanation is improbable or, as the Globe believes, impossible. Especially in the face of all of the other insinuations in the Wells Report being corrected.

2. Memory

There are then a number of other major claims in the Wells Report regarding memory.

a. “I don’t know Jim McNally”

Let’s say my doorman is named Bill. I know his face. I know his voice. I know some of his habits. If I were asked about “William McCluskey” I would have to honestly say “I have no idea who that is.” Both of these things are true — I know a doorman named Bill, and I also don’t know someone named “William McCluskey.”

Only it turns out that these are the same people. I would not be lying to say “I don’t know William McCluskey.” Just like Brady wasn’t lying when he said he didn’t know some lowly locker room attendant’s full name. He did think his name was “Burt,” phonetically similar to his nickname, “Bird,” so Brady was indeed aware of the existence of McNally and even had a name to pair to his face. (I’ll leave it up to the reader to decide if the Wells investigators lacked the knowledge to realize this or intentionally framed it to look like a lie.)

b. “I didn’t do anything abnormal”

Next, there is the trip to the bathroom that McNally took. He was originally asked if he did anything out of the ordinary when transporting the balls from the locker room to the field, to which he answered, “no.” He then later said he went to the bathroom, which the investigators presumed to be out of the ordinary, and thus interpreted McNally as being untruthful. This is a terribly false conclusion.

Assuming he does occasionally go to the bathroom, as he claimed, his answer of “nothing was abnormal” and then later “I went to the bathroom” is consistent, not contradictory, based on his own recollections. For McNally, going to the bathroom wasn’t out of the ordinary. Simply because the investigators find it out of the ordinary doesn’t mean it was for McNally.

c. “I’ve never seen THAT before”

The final issue I’ll discuss regarding memory is referee Walt Anderson’s recollection of the ball location. I once worked across the street from a pink house for a month straight. One day the topic of houses came up. I looked out the window and said “look, they just painted that house pink!” I was informed by many others that the house had always been pink, I’d simply never cared or devoted any attentional resources to it.

It is possible that in his entire career, the AFC Championship game was indeed the only time Walt Anderson remembers the balls “going missing.” (Although it does beg the question of why no one else found it strange that a giant man carried a giant bag of footballs out of the room in plain sight.) But, given that he had been primed before the game to pay attention to the balls for the first time in his career, it is expected that he would suddenly notice things he’s never noticed before. It’s possible that every time Anderson was in New England, or anywhere, that the procedure was equally as lackadaisical as it was during that game…he just never gave any attentional resources to it. Kind of like how that house was always pink.

3. Statistical Analysis and Physics

It is quite clear that no one at the NFL knew anything about the Ideal Gas Law when this story broke. Heck, it’s quite clear most of the public didn’t know about it either. This creates another major bias that is very hard to undo.

Due to the Ideal Gas Law, ever since footballs have been inspected pre game they’ve also been in play at different PSIs. All the time. If we could go back in time and measure balls at half-time of every game, some would be in the 10s, some in the 11s, some in the 12s, etc. People only found the Patriots report abnormal because they had never been introduced to it before. It was a physics problem, and most people, without knowing physics, declared it was abnormal. When explanations of the Ideal Gas Law sprang up on the Internet, people had the same disbelieving reaction that they do now to “the deflator” explanation.

Again, the mind is ripe to do this. We’d like to be measured and cautious and say “I’m not an expert at this so I’ll find out more,” but that’s not how the brain is hardwired to work. By the time the Ideal Gas Law was popularized, people had already made up their mind there was tampering. And it’s very, very hard to undo that. This leads to the aforementioned confirmation bias.

Here’s the thing though — the Wells Report did not present strong evidence that concluded the balls were tampered with, or even likely tampered with. Amazingly, the report tries to bury this finding by making a bunch of assumptions in an attempt to say that it was possible there was tampering, when the “more probable than not” interpretation from any scientist would have to be the opposite conclusion: that it was more probable than not there was no tampering, and that the ambiguities around game day measurements leave open some possibility of foul play, although even that is stretching it.

Analyzing the Wells Report Data

This issue has been discussed in great detail, but I want to translate some numbers to demonstrate why the Wells language is the opposite of what it should be based on the data. The Exponent team commits all sorts of scientific faux pas, such as presenting p-values based on a nonsensical dataset that is literally the best looking data they have to support tampering. (Amazingly, despite all their assumptions, this is the only area of the report they perform such statistical tests.)

So why is it so disingenuous to compare the Colts’ averages to the Patriots’? Primarily because the Colts balls were measured after the Patriots balls, so they had ample time to recalibrate to the new indoor temperature, raising the air pressure with every minute that passed. Exponent’s own graphs (fig 22, page 203) show a ~1.0 PSI increase in pressure expected in the Indianapolis balls after 10 minutes indoors…but they make no attempt to adjust the data and retest. From a methodological standpoint, this is astounding. This wouldn’t pass an undergraduate peer review.

A 12.5 PSI football, with all other factors being equal (which they weren’t), is expected to be at 11.32 PSI given the game-day conditions in Foxboro (including an atmospheric pressure of 14.636 PSI). Similarly, we’d expect a 13.0 PSI football to be 11.8 PSI when it entered the locker room at halftime. However, as the balls heat up in the locker room after coming off the field, they will rapidly increase in PSI as shown below. All of the Patriots balls were apparently measured first. With perfect instruments, and excluding the effect of water, we’d expect those balls to be about 11.5 PSI after 2 minutes indoors, when measuring could have started, and [edit] based on time estimations, the highest ball would be about 12.2 PSI.

Meanwhile, starting the Colts measurements at the 10 to 11 minute mark, we’d expect their balls (13 PSI pre game) to measure in the 12.8 to 12.9 range. In other words, in 10 minutes indoors, the Colts balls would have almost completely returned to pre game levels, while a large chunk of the Patriots balls would be significantly closer to outside-condition measurements.
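The 11.32 and 11.8 figures can be checked with a few lines of arithmetic. This is a minimal sketch, not Exponent's model: it treats the ball as a fixed volume of dry air, uses the 14.636 PSI atmospheric pressure quoted above, and assumes (these two values are not stated in this post) a locker room at about 71°F and a field at about 48°F. The function names are made up for illustration.

```python
ATM_PSI = 14.636  # atmospheric pressure quoted in the text


def f_to_rankine(temp_f):
    """Convert Fahrenheit to an absolute temperature scale (Rankine)."""
    return temp_f + 459.67


def gauge_after_temp_change(gauge_psi, start_f, end_f, atm_psi=ATM_PSI):
    """Gauge pressure after a temperature change at constant volume.

    The gas law works on absolute pressure, so the atmospheric pressure
    is added before scaling and subtracted again afterwards.
    """
    absolute_start = gauge_psi + atm_psi
    absolute_end = absolute_start * f_to_rankine(end_f) / f_to_rankine(start_f)
    return absolute_end - atm_psi


# Assumed temperatures: 71 F indoors pre game, 48 F on the field.
patriots_field = gauge_after_temp_change(12.5, 71.0, 48.0)
colts_field = gauge_after_temp_change(13.0, 71.0, 48.0)
print(round(patriots_field, 2), round(colts_field, 2))
```

Under those assumed temperatures the sketch lands on roughly 11.32 and 11.80 PSI. It says nothing about how quickly the balls warm back up indoors, which is the timing effect discussed above.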

Now, that 11.32 number does not include water, which changes the volume of the ball and has an additional effect on the pressure. Exponent seems to contradict themselves here, stating first that they couldn’t observe any volume change in a wet ball (thus making water moot). However, in their “spraying” test, where they simulated outdoor conditions with wet balls, they clearly find a difference between wet and dry balls. Again, significant because the Patriots balls were wet and the Colts balls were protected in a bag and unused at the end of the first half. Below are the expected readings based on time indoors and dry/wet conditions:

Figure 22 of the Wells Report


Now here’s the actual data:

[Image: screenshot of the halftime measurement data]

Note that we wouldn’t really ever expect to see a reading below 11.2, according to Exponent, even with water involved. So, unless something else needs to be incorporated, something beyond the factors we’ve examined would have additionally deflated the footballs per Clete Blakeman’s readings. (Assuming that it’s just simply not transcription error or gauge inaccuracy.)

Amazingly, it turns out something else does need to be incorporated: Walt Anderson possessed a gauge that read roughly 0.3-0.45 PSI below the other gauge. Exponent believes his gauges were consistent — something in science we call “reliability” — despite not being accurate. The Patriots expressed concerns about reliability when they pointed out that the intercepted football measured 11.45, 11.35 and then 11.75 using the same gauge on the sideline, although Exponent did indeed test the gauges in question and found them to be fairly reliable. We’ll assume they are reliable (consistent) for the rest of the post, although this is clearly an area that could create additional variance in the readings.

The Logo Gauge (Higher readings) Scenario

Anderson claims to have used the higher gauge to take the pre game measurements. Why does this matter?

  • If we examine just the presumed logo gauge measurements between pre game and halftime, only the 4th and 10th Patriot ball fall just below our expected floor (by 0.2 and 0.3 PSI, respectively).
  • If we examine just the non-logo gauge measurements and assume Anderson used the logo gauge pre game, then we’d expect something like a floor of 10.9 PSI without water involved, and thus probably nothing below 10.7 PSI. There, the 4th ball is right on the cutoff and the 10th ball 0.2 PSI below.

In other words, if you believe Walt Anderson, then almost all of the Patriots balls were found to be in a range that demonstrates non-tampering. Not the opposite.
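The floors quoted in the two bullets follow from simple gauge arithmetic. As a rough sketch, assume the 0.3 to 0.45 PSI gap between the gauges is constant across the pressure range (an assumption, not something Exponent states here), take 11.32 PSI as the dry expectation and 11.2 PSI as the wet floor on the logo gauge, and shift both onto the non-logo gauge. The helper name is hypothetical.

```python
# Gap between the two gauges as quoted in the text (the logo gauge reads higher).
OFFSET_LOW, OFFSET_HIGH = 0.30, 0.45


def to_non_logo(logo_psi):
    """Return the (lowest, highest) non-logo reading for a given logo-gauge value."""
    return (round(logo_psi - OFFSET_HIGH, 2), round(logo_psi - OFFSET_LOW, 2))


dry_floor = to_non_logo(11.32)  # dry-ball expectation from the gas-law estimate
wet_floor = to_non_logo(11.2)   # wet-ball floor attributed to Exponent in the text
print("dry:", dry_floor, "wet:", wet_floor)
```

That puts the dry expectation on the non-logo gauge at roughly 10.9 to 11.0 PSI and the wet floor at roughly 10.75 to 10.9 PSI, which is where the "floor of 10.9" and "nothing below 10.7" figures come from.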

From what I’ve seen about variability in measurements, I’m not comfortable chalking up 0.2-0.3 PSI on two balls to “tampering” factors outside of gauge reliability, transcription error, or some other subtle natural effect (i.e., additional water) that we aren’t accounting for. [edit] This hunch is confirmed when analyzing the data in greater detail. Heck, the sample size of this experiment is really 1, because we’d need to test balls at halftime for a handful of games to see if there are readings that also fall just outside the range predicted by the Ideal Gas Law or if that is indeed abnormal, even by a small amount. (That’s where you’d publish a p-value, FYI.)

For instance, the Colts 3rd ball measures 12.95. Exponent believes this is a transcription error because it would be the only instance of the Non Logo Gauge measuring higher than the Logo Gauge. Simply introducing this kind of measurement variability essentially puts every Patriot football measured within the expected norm.

In other words, if the Logo Gauge was used pre game, it’s most likely the Patriots balls were not adjusted or tampered with.

The Non-Logo Gauge Scenario

Now, there’s another major issue that Exponent also skirts. If the Non-Logo Gauge were used in the pre game, then how does one explain the Colts’ readings on that gauge? Indy’s four balls assumed to be measured at halftime by the Non-Logo Gauge exhibited PSIs of 12.7, 12.75, 12.5 and 12.55. However, look at the dry temperature curve presented above. Those balls should all clearly be above 12.8 PSI.

If the Non-Logo Gauge were used pre-game, then something doesn’t add up with the Colts’ balls.
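To put a number on that gap, here is a quick check using only the four Colts readings and the 12.8 PSI threshold quoted above. Treating 12.8 as a hard minimum is a simplification made for this sketch.

```python
from statistics import mean

# Colts halftime readings attributed to the Non-Logo Gauge in the text.
colts_non_logo = [12.7, 12.75, 12.5, 12.55]

# Minimum expected from the dry temperature curve, per the text.
EXPECTED_MIN = 12.8

shortfalls = [round(EXPECTED_MIN - reading, 2) for reading in colts_non_logo]
average_reading = mean(colts_non_logo)
average_shortfall = round(EXPECTED_MIN - average_reading, 3)

print("shortfall per ball:", shortfalls)
print("average reading:", average_reading)
print("average shortfall:", average_shortfall)
```

All four balls come in under the threshold, by 0.05 to 0.3 PSI, with an average shortfall of about 0.175 PSI.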

In other words, using Exponent’s (arbitrary) decision to discount Anderson’s claim that he used the Logo Gauge pre-game and assume he actually used the Non-Logo Gauge, then the Colts’ balls have the exact same problems (to almost the exact same degree) that the Patriots’ balls exhibited. [edit] Indeed, the Colts’ balls are either approximately 0.2 or 0.35 PSI higher than the Patriots’ balls on this gauge, depending on when the Colts’ balls were measured.

In conclusion, assuming we believe the (unrecorded) pre game measurements of 12.5 for Patriot balls and 13.0 for Indianapolis balls, the data shows that

  • if Anderson used the Non-Logo Gauge in the pre game, the Colts balls were also slightly below where they should be based on the physics
  • if Anderson used the Logo Gauge, the Patriots are essentially exonerated

Contrary to Exponent’s conclusions, their procedure is not reasonably scientific because it’s predicated on Anderson’s recollection that the Patriots balls were 12.5 and Colts balls were 13.0. (Memory is unreliable.) There were also no consistent (or repeated) measurement standards and no controls, such as measuring the balls at the same time. With that said, we can safely assume the Colts didn’t tamper with the balls, yet the Colts balls exhibit the same minor failure to align with the Ideal Gas Law and other physics factors as the Patriots balls do. [edit] It’s actually highly improbable that there was tampering whether Anderson used the Logo or Non-Logo gauge, and the claim from Exponent that the Non-Logo gauge implies tampering is based on the statistically invalid practice of ignoring the time and variance in the data.

4. Is there a plausible explanation for Wells’ claim?

Despite reading a good deal of reaction to this story, I have yet to encounter a coherent explanation for what is being alleged. The Wells Report is quite careful not to author such a story, using vague language instead. But let’s actually spell out what they are alleging:

  • Walt Anderson was wrong about what gauge he used in the pre-game, thus the Patriots have some under inflated footballs because of tampering
  • According to Wells, eight of their footballs seem under-inflated by (approximately) 0.5-0.6 PSI on average, with the lowest football being about 1.0 PSI below expected
  • Those footballs were deflated by Jim McNally in the 100 seconds he was in the bathroom with the balls
  • McNally has done this regularly since at least 2013, because he calls himself the “deflator”

In order to believe the above story, one also must believe the following:

  1. Brady would have to figure out at some point in time that he preferred balls just under 12.5 PSI, in the 11.5-12.0 PSI range.
  2. Brady determined that the difference between 11.5-12.0 PSI and 12.5 PSI was so great that he felt he needed to ask an employee to tamper with the footballs, and not even risk under inflating them and hoping they passed inspection.
  3. However, Brady could only tamper with the footballs at home. He’d be using footballs that were so different in his mind that they were worth tampering with at home…but on the road, he would be out of luck. Despite this, he’s better on the road than most NFL QBs.
    • Note that this makes the Indianapolis Colts claim that the Patriots used deflated footballs in Indianapolis during the regular season essentially impossible.
  4. During the October, 2014 game against the Jets, “the deflator” Jim McNally failed to deflate at least some footballs (that were 16 PSI)
  5. Tom Brady, after blowing up on the sideline about the quality of the footballs during that game, performed the following as a charade to protect the cover-up:
    • Brady (allegedly) in front of others, declared he wanted balls at the low permissible range (~12.5 PSI) before giving them to the referee. (Even though McNally was already deflating balls…so why would they not already be at the low range to save McNally time in his deflation process?)
    • Brady brought a rule book to the officials to show them 12.5 PSI balls should not be touched…even though he knew McNally was going to alter them.
  6. Furthermore, Brady would have to go through the charade of inspecting the balls pre-game in front of other people, knowing that these would not be the balls he would be playing with. His true pre-game ritual was one of the following:
    • He secretly inspected footballs at some 11.5-12.0 range and then told the staff to inflate them to 12.5 so he could stage a second, phony inspection in front of others every game (right before the balls are delivered to the officials), while no one noticed him missing or sneaking away during this period, OR
    • He simply inspected them at 12.5 for tack and feel, knowing that once he let out a little air, the PSI would be where he wanted it. This explanation assumes that he is both so meticulous about PSI that he wanted less than a pound of pressure (which no human can seemingly detect) out of the ball, and simultaneously does not think there would be enough of a tactile difference between a 12.5 and 11.5 PSI ball to require inspecting the balls in the condition he will actually play with.
  7. Jastremski and Brady are either horrible at tampering — setting balls to 12.75-12.85 instead of the lowest permissible 12.5 before the Jets game (which would make McNally’s work harder), or they did indeed start at 12.5 pre-October, 2014 and decided to make up a story to tell the investigators that they used to inflate to 12.75-12.85 so Brady could plausibly deny ever knowing about PSI before October, 2014.

I have yet to see Wells, or anyone, make sense of this convoluted, contradictory set of events that must have had to happen according to their meaning of the “deflator” text and allegations about regularly deflating footballs. Which of the following conclusions seems more likely to you?

Conclusion A: Tom Brady figured out that he really liked footballs just under the legal limit, decided not to have his equipment team slightly under inflate balls to hope they would pass a lackadaisical NFL inspection process (the technique Aaron Rodgers told Phil Simms about), but instead set up an elaborate process to take just a little air out of the balls. But only at home. And Walt Anderson forgot what gauge he used…so he took another tenth or two of PSI out of the balls.

Conclusion B: Walt Anderson correctly remembered what gauge he used, footballs can have incredibly small perturbations outside what we’d expect based on just temperature and pressure and Jim McNally was indeed referring to “deflating” his waist.

Can anyone reconcile these issues? Because Occam’s Razor says Conclusion B is a significantly more — excuse me — Conclusion B is “more probable than not.” Yet despite this, recent polls suggest that a majority of the country believe the Patriots cheated…without actually being able to offer up a coherent story for how the events in the Wells Report make sense. There is evidence in the report that needs to be reconciled with the alleged claim because it seemingly contradicts the claim or relies on less-than-likely dependencies.

Conclusion

  1. A lack of context and predictable cognitive biases can make text messages appear to be something they aren’t, and make alternative explanations less believable than they really are
  2. A lack of understanding around memory likely led the Wells investigators to a number of false conclusions
  3. The AFC Championship game data shows the following [edit]:
  4. People are comfortable claiming tampering despite the story from the Wells Report lacking coherence and requiring the following to be true:
    • Brady would have discovered he feels a large enough advantage in slightly deflating, when no one else seems to be able to tell the difference
    • He would have been OK with using different (non-deflated) footballs, on the road, despite leading the 2006 rule change to create uniform preparation for QBs at home and on the road
    • Despite tampering only being carried out at home, Brady’s performance has been better on the road.
    • Despite his better road performance, Brady still went through with tampering by carrying out a phony inspection in the locker room before every game (because those would not be the conditions of the balls he would use post-McNally deflation.)
    • After the October, 2014 game against the Jets, Brady extended the charade by providing a copy of the rulebook to the officials before games, knowing full well that McNally would deflate below 12.5 anyway.

PS Please don’t use this post to disparage others. It’s designed to educate, regardless of your opinion or rooting allegiances.

Is Tom Brady Better at Home?

All of this talk about deflating footballs begs a natural question: How does Tom Brady’s performance at home (where the Patriots are alleged to have tampered with the footballs) compare with his performance on the road (where, based on the existing accusations, they would not have tampered with the balls)?

The following table is a list of all the quarterbacks in the last two seasons who have attempted at least 200 passes at home and on the road. The numbers shown are the difference between home and road performance. In other words, a positive number means a higher number at home. Below are the results:

[Image: table of home-minus-road differences for quarterbacks with at least 200 attempts home and away over the last two seasons]

In the last 2 years, Aaron Rodgers has shown the greatest improvement at home relative to away games. Rodgers ranks first in interception percentage drop, first in increase in touchdown percentage, second in increase in yards per attempt (Brandon Weeden is first) and first by a landslide in QB Rating. Tom Brady is 10th in improvement at home in QB Rating, leagues behind Rodgers’ amazing 37.9 QB Rating jump in home games.
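For readers who want to reproduce this kind of home-minus-road split, here is one way it could be computed. The rating function follows the standard NFL passer rating formula. The two stat lines are invented purely to exercise the code and are not any real quarterback's numbers, and the function and field names are hypothetical.

```python
def passer_rating(att, comp, yards, td, ints):
    """Standard NFL passer rating; each component is clamped to [0, 2.375]."""
    def clamp(value):
        return max(0.0, min(2.375, value))

    a = clamp((comp / att - 0.3) * 5)
    b = clamp((yards / att - 3) * 0.25)
    c = clamp((td / att) * 20)
    d = clamp(2.375 - (ints / att) * 25)
    return (a + b + c + d) / 6 * 100


def split_stats(line):
    """Rate stats for one stat line (a dict of raw counting totals)."""
    return {
        "rating": passer_rating(**line),
        "ypa": line["yards"] / line["att"],
        "td_pct": 100 * line["td"] / line["att"],
        "int_pct": 100 * line["ints"] / line["att"],
    }


def home_minus_road(home, road):
    """Positive values mean the number is higher at home."""
    h, r = split_stats(home), split_stats(road)
    return {key: h[key] - r[key] for key in h}


# Hypothetical stat lines, for illustration only.
home_line = {"att": 500, "comp": 325, "yards": 3750, "td": 30, "ints": 8}
road_line = {"att": 500, "comp": 310, "yards": 3500, "td": 22, "ints": 12}

diff = home_minus_road(home_line, road_line)
print({key: round(value, 2) for key, value in diff.items()})
```

Note the sign convention: a negative int_pct difference means fewer interceptions per pass at home, which is an improvement.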

What if we expand the sample to go back to the 2006 rule change where quarterbacks from each team could control the ball? How does Brady look then? (Min 500 attempts home and away to qualify for this sample.)

[Image: table of home-minus-road differences for quarterbacks with at least 500 attempts home and away since the 2006 rule change]

Brady has actually been quite poor over the long haul at home, at least from a basic statistical perspective. Brady is well below average in these areas, showing a pretty significant road bias (not home) in performance. The average for these 45 qualifying quarterbacks was an improvement of 3.9 QB Rating points at home, 0.2 more yards per attempt at home, a 0.4% bump in TD% and 0.2% drop in INT%. Brady does throw fewer interceptions at home per pass, but his other numbers actually trend in the opposite direction and are better on the road. His QB Rating is a shade better at home, but again, that indicates abnormally strong play on the road relative to the rest of the quarterbacks in the league.