In 2014 we first heard of a branch of scholarly research purporting to support the accuracy of mainstream fact checking.
The research was mentioned in a paper by political scientists Brendan Nyhan and Jason Reifler, The Effect of Fact-checking on Elites: A field experiment on U.S. state legislators:
While individual fact-checks sometimes veer into punditry or semantic disputes (Marx 2012; Nyhan 2012, 2013b), an academic analysis of the ratings by elite fact-checking organizations finds a very high level of agreement when they evaluate identical or similar claims (Amazeen 2012, 66–68).
We reviewed an earlier version of Nyhan and Reifler’s paper in October 2013. It’s worth noting that Nyhan and Reifler cited fact checker agreement to support the idea that fact checkers are consistent and reliably accurate.
Glenn Kessler’s June 2014 report from the first-ever Global Fact Checking Summit in London also touched on the research:
Some skeptics maintain that fact checking is merely another form of opinion journalism, disguised behind a veneer of objectivity. But Michelle Amazeen, assistant professor at Rider University, said that preliminary research indicates that during the 2008 and 2012 elections, The Fact Checker, PolitiFact and FactCheck.org reached the same conclusion on similar statements at least 95 percent of the time.
We considered the prospect of using fact checker agreement to support fact checker accuracy a difficult problem, so we looked forward to reviewing the methods the researchers used. In late 2014 we found Checking the Fact-Checkers in 2008: Predicting Political Ad Scrutiny and Assessing Consistency, posted online on Oct. 14, 2014, pending publication in the Journal of Political Marketing.
Author Michelle A. Amazeen graciously shared a draft copy with us for purposes of review.
Thanks to Amazeen’s willingness to answer some questions about her research, we divide our review into three sections. The first consists of our review in the standard format. The second presents a narrative telling the story of our work double-checking our conclusions and trying to communicate our criticisms to Amazeen. The third supplements the review by digging more deeply into a key technical detail.
Checking the Fact-Checkers in 2008: Predicting Political Ad Scrutiny and Assessing Consistency
As the title suggests, Amazeen’s paper (hereafter “Checking”) tries to predict which ads fact checkers will scrutinize and aims to assess fact checker consistency. Our criticism will focus first on problems with Checking’s claims of fact checker consistency, then on Checking’s misguided attempt to use that consistency to support fact checker reliability.
Political ad scrutiny
Checking identifies attack ads as the political ads most likely to draw the attention of fact checkers, largely owing to attack ads’ common use of verifiable figures.
We find that part of the paper reasonable and uncontroversial.
We anticipated problems with supporting the accuracy of fact checkers based on consistency. We hypothesized that journalists’ tendency to lean left would lead to similar ratings for similar claims. Fact checker consistency, while a reasonable prediction if two different accurate fact checkers look at the same or similar claims, also occurs if two different fact checkers carry similar bias into a fact check.
Checking does not address the bias problem at all, though that neglect ends up a minor problem by comparison: we find Checking uses dubious means to support its claims of fact checker consistency.
Checking places all fact checker ratings into one of two groups, true or untrue. Checking places claims for which no fault was found in the first group. All other claims fall into the second group. So all claims PolitiFact rated “Mostly True” or worse receive the same designation, untrue. Likewise, all claims receiving one or more “Pinocchios” from Glenn Kessler end up in the group of untrue statements.
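The collapse described above is easy to make explicit. A minimal sketch (rating labels from PolitiFact’s published scale; the grouping rule is the one Checking uses, with only a completely clean rating landing in the first group):

```python
# Collapse ordinal fact-check ratings into Checking's binary scheme.
# Only a rating with no fault found counts as "true"; "Mostly True"
# or worse all receive the same designation, "untrue".
POLITIFACT_SCALE = ["True", "Mostly True", "Half True",
                    "Mostly False", "False", "Pants on Fire"]

def binarize(rating: str) -> str:
    """Map an ordinal rating onto Checking's two groups."""
    return "true" if rating == "True" else "untrue"

print([binarize(r) for r in POLITIFACT_SCALE])
# → ['true', 'untrue', 'untrue', 'untrue', 'untrue', 'untrue']
```

Five of the six rungs on the scale map onto the same group, which is precisely the loss of resolution we describe next.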
What’s the problem?
First, lumping mildly flawed statements with deeply flawed statements hides many of the differences in the way the three “elite” fact checkers rate statements. It’s like measuring inches with all the markings erased from a ruler, or refereeing a basketball game while wearing frosted lenses. Checking only looks at whether the “elite” fact checkers found statements true or untrue.
Second, though Checking notes the tendency of fact checkers to seek out flawed statements to check, the research makes no attempt to account for the resulting selection bias. Worse, the resulting population of fact checker ratings contains only five true ratings out of 110 total. The fact checkers’ selection bias leaves Checking with a population of ratings that guarantees a high rate of agreement. When the preponderance of false ratings guarantees the fact checkers will agree over 90 percent of the time, shouldn’t that fact accompany the claims of high consistency?
The research method in Checking resembles a study of expert shoe examiners. The examiners specialize in evaluating adults’ shoes. The experts pick out shoes to examine, mostly the shoes of adults since that’s their focus. Then the experts evaluate the shoes. Sure enough, the great majority of the shoes are adults’ shoes. Two experts check the same 50 shoes, finding that 48 of them are the shoes of adults. The experts disagree on the other two shoes, with each expert saying one of the shoes is a child’s shoe, yet differing on which of the two is the child’s shoe. That’s 96 percent agreement, but obviously the tendency to focus from the outset on adults’ shoes gives us a biased shoe population incapable of adequately measuring disagreements. The experts could disagree at most 4 percent of the time for this population, and the research fails to make clear whether it measures the experts’ initial assessment or their later careful examination. We don’t find out whether the experts evaluate adults’ shoes well.
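The arithmetic behind both the shoe analogy and the real ratings is simple enough to check. A minimal sketch (using the 5-of-110 split reported above and the hypothetical shoe numbers) of how a skewed population forces high raw agreement all by itself:

```python
# Checking's skewed rating population: 5 "true" ratings out of 110.
p_true = 5 / 110
p_untrue = 1 - p_true

# If two raters independently produced ratings at these same marginal
# rates, they would agree by chance alone with probability
# p_true^2 + p_untrue^2 (both say true, or both say untrue).
chance_agreement = p_true**2 + p_untrue**2
print(f"chance agreement: {chance_agreement:.1%}")  # about 91%

# The shoe-examiner analogy: 48 of 50 shoes agreed on as adults'.
raw_agreement = 48 / 50
print(f"raw shoe agreement: {raw_agreement:.0%}")  # 96%
```

The over-90-percent figure falls out of the base rate before anyone measures the quality of a single fact check, which is why raw percent agreement cannot carry the paper’s conclusion.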
Checking suffers from a comparable statistical problem, and as a result fails to support its main point.
Checking clearly identifies fact checker agreement as the key finding (bold emphasis added):
The primary take-away from this research is that political marketers, reporters and voters can be comfortable that evaluations of the leading fact-checkers are consistent. This study demonstrates that the elite fact-checkers were overwhelmingly in agreement about the presence of political ad claim inaccuracies in 2008. Because of the differing methods used by the fact-checking organizations to assess the claims, however, the analysis in this study was limited to binary agreement. Nonetheless, despite the differing philosophies about whether the degree of accuracy can be measured, the elite fact-checkers do exhibit consistency in determining whether a fundamental inaccuracy is present. Convergence around evidence using different approaches lends credibility to this analysis (Jackson and Jamieson 2007).
Checking’s citation of the book “Un-Spun” by Brooks Jackson and Kathleen Hall Jamieson helps cement the point that the paper offers fact checker consistency as support for fact checker accuracy. Jackson and Jamieson state their principle simply:
We can be more confident about a conclusion when different sources using different methods end up agreeing on it.
Despite Checking’s emphasis on this point, the paper provides no convincing evidence that the so-called elite three fact checkers show a high level of agreement in their finished assessments of political claims. Checking makes no attempt to separate fact checkers’ selection bias toward flawed statements from the research phase of fact checking. That failure would undermine the research even if the study had addressed the other neglected problems, such as similarly biased fact checkers.
We charge Checking with misappropriating Jackson and Jamieson’s maxim. Jackson and Jamieson do not arbitrarily exclude all sources save for three respected sources. The proper application of their guideline includes testing the so-called “elite” fact checkers against other observations. Checking twists Jackson and Jamieson’s test and manipulates the scientific method (see Page 2) to fabricate an endorsement of mainstream fact checking.
James Cochran, professor of statistics at the University of Alabama, summed up the problem neatly:
(A)ny attempt to interpret a high level of agreement as an indication (o)f accuracy is improper and misleading. Two very poor fact-checkers may agree on every assessment they both make, but the agreement of poor fact-checkers certainly could not be considered evidence of their accuracy.
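Cochran’s point can be made concrete with a toy example (all numbers hypothetical): two fact checkers who share the same bias can agree perfectly while both being wrong much of the time.

```python
# Hypothetical ground truth for ten claims: True means the claim is accurate.
truth = [True, True, True, True,
         False, False, False, False, False, False]

# Two hypothetical fact checkers sharing the same bias: both rate
# every claim "untrue" regardless of the evidence.
checker_a = [False] * len(truth)
checker_b = [False] * len(truth)

agreement = sum(a == b for a, b in zip(checker_a, checker_b)) / len(truth)
accuracy = sum(r == t for r, t in zip(checker_a, truth)) / len(truth)

print(f"agreement: {agreement:.0%}")  # 100% agreement...
print(f"accuracy:  {accuracy:.0%}")   # ...but only 60% accuracy
```

Perfect agreement, mediocre accuracy: agreement measures only whether the raters resemble each other, not whether either resembles the truth.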
Page 2: The Trail of Inquiry
Page 3: Additional Notes on Krippendorff’s Alpha