All That Glitters Is Not Gold
Polling results from the 2020 presidential election show why live phones should no longer be considered the gold standard for survey research; instead of privileging “legacy” methodologies, survey researchers should prioritize innovation and evaluation.
By Kevin Collins, Survey 160
The American survey research industry is in a period of transition, with deepening challenges to phone poll response rates but also emerging technologies like SMS surveys. Yet despite this changing landscape, there remains a common perception that live-caller polls are still the best interview mode for producing accurate results. For example, according to FiveThirtyEight, only firms that use “exclusively live caller with cellphones” can be considered “gold-standard” pollsters.
But was it still true in 2020 that live-caller-only polls were more accurate than polls using other modes of research? In short, no. As the statistical analysis described below shows, mixed-mode surveys with an SMS component performed best in state-level presidential polling in 2020, while live-phone-only polls were, in fact, among the worst. Now is the moment for the polling community to re-think its habit of privileging “legacy” methodologies that worked well in the past, and to instead prioritize novel methodological research and innovation backed by empirical validation.
Methodology
To test the relationship between interview mode and accuracy in 2020, we start with a sample of all statewide likely-voter polls in the FiveThirtyEight database whose field period started on or after 10/1/2020, though most of the polls in this set finished on 10/20 or later [1]. We then calculate the Biden margin in each poll by subtracting Trump’s vote share from Biden’s. We do the same with the election results for that state; election result data come from MIT’s Election Lab. Then we calculate the absolute value of the difference between Biden’s margin over Trump in the poll and Biden’s margin in the election result. This is our primary measure of accuracy. For example, if a poll in North Carolina had Biden leading Trump 50-48 (a margin of +2.0), and Biden ended up losing to Trump 48.6 to 49.9 (a margin of -1.3), that poll would have an absolute error of 3.3 percentage points.
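In code, the calculation looks something like this minimal sketch; the column names are illustrative placeholders, not the actual FiveThirtyEight or MIT Election Lab field names.

```python
import pandas as pd

# Minimal sketch of the error calculation. Column names are illustrative
# placeholders, not the actual FiveThirtyEight / MIT Election Lab fields.
polls = pd.DataFrame({
    "state": ["North Carolina"],
    "biden_poll": [50.0],   # Biden's share in the poll
    "trump_poll": [48.0],   # Trump's share in the poll
})
results = pd.DataFrame({
    "state": ["North Carolina"],
    "biden_result": [48.6], # Biden's actual vote share
    "trump_result": [49.9], # Trump's actual vote share
})

df = polls.merge(results, on="state")
df["poll_margin"] = df["biden_poll"] - df["trump_poll"]           # +2.0
df["result_margin"] = df["biden_result"] - df["trump_result"]     # -1.3
df["abs_error"] = (df["poll_margin"] - df["result_margin"]).abs() # 3.3
print(df[["state", "abs_error"]])
```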
We prefer absolute error to bias (i.e. the signed difference between Biden’s margin in the poll and in the election result, rather than the absolute difference) because it avoids a known problem: when a mode systematically favors a demographic group, it creates a reliable skew, and that skew looks like accuracy whenever the party favored by that group does well [2].
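A toy example, with invented numbers, shows how the two metrics can diverge: a mode whose errors are large but cancel out scores well on bias while scoring poorly on absolute error.

```python
import numpy as np

# Invented signed errors (poll margin minus result margin) for two
# hypothetical modes, purely to illustrate bias versus absolute error.
mode_a = np.array([4.0, -4.0, 3.0, -3.0])  # errors cancel out
mode_b = np.array([1.0, 1.5, 0.5, 1.0])    # small but consistent skew

for name, errs in [("A", mode_a), ("B", mode_b)]:
    print(name, "bias:", errs.mean(), "mean abs error:", np.abs(errs).mean())
# Mode A: bias 0.0, mean abs error 3.5 -- low bias hides large misses
# Mode B: bias 1.0, mean abs error 1.0 -- "worse" bias, better accuracy
```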
FiveThirtyEight’s methodology coding includes a number of categories and leaves some polls uncoded [3]. We collapse the multi-mode methods into two categories, those that include text (or SMS) and those that do not, to better understand how use of SMS relates to poll accuracy. Almost 60% of these polls were conducted solely online, but about 15% were conducted with live phones only, and another 15% were conducted using mixed modes including SMS (or were SMS only).
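A sketch of this collapsing step might look like the following; the methodology label strings are assumptions, since the database uses its own codes.

```python
# Illustrative collapsing of methodology labels into the coarse categories
# used here. The label strings are assumptions, not FiveThirtyEight's codes.
def collapse_mode(methodology: str) -> str:
    m = methodology.lower()
    if "text" in m or "sms" in m:
        return "mixed mode with SMS"  # includes SMS-only polls
    if m == "live phone":
        return "live phone only"
    if m == "online":
        return "online only"
    return "other mixed mode"

for raw in ["Live Phone", "Online", "Live Phone/Text", "IVR/Online"]:
    print(raw, "->", collapse_mode(raw))
```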
Importantly, while we divide these polls into these coarse categories, there is significant methodological variation within them as well. Live phone polls include both registration-based samples and random digit dialing. Online surveys include both high-quality panels and samples sourced from online marketplaces. SMS surveys include both live-interviewer SMS surveys and SMS push-to-web surveys. And each mode includes pollsters with a variety of weighting strategies. So we should expect (and we do find) variation in accuracy within each survey type. But this at least provides a starting place for considering the relationship between interview modality and survey accuracy.
Analysis by Mode
First, let’s just look at the raw distributions of error in this sample by interview mode. It’s relatively clear from this simple look that live phone polling is not obviously superior to other modes. There is substantial variation in accuracy within modes, far more than there is across modes.
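This raw-distribution view is straightforward to reproduce; the sketch below uses simulated stand-in data, since the underlying poll-level file is not reproduced here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated stand-in data; the real input is the poll-level file of 799
# polls with absolute error and collapsed mode attached.
rng = np.random.default_rng(0)
modes = ["online only", "live phone only", "mixed mode with SMS"]
df = pd.DataFrame({
    "mode3": rng.choice(modes, size=300),
    "abs_error": rng.gamma(shape=2.0, scale=2.0, size=300),
})

df.boxplot(column="abs_error", by="mode3")
plt.suptitle("")  # drop pandas' automatic grouped-by title
plt.title("Absolute error by interview mode (simulated data)")
plt.ylabel("Absolute error (percentage points)")
plt.show()
```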
However, there could be confounds here that should be taken into account. All else being equal, we might expect larger samples to yield more accurate polls, polls closer to Election Day to be more accurate, and some states to be easier to poll than others. To take those into account, we turn to regression analysis. Specifically, using OLS, we regress absolute error on indicators for each mode, the logged number of days between the final day the survey is in the field and the election, logged sample size, and indicators for state, clustering standard errors by pollster. The graph below shows the predicted absolute error for each mode derived from that regression analysis, with corresponding confidence intervals.
The regression analysis here shows that on average, mixed mode with SMS has the lowest absolute error, and that live phone polls have the highest error of any polling method recorded in this sample. While there is a considerable amount of uncertainty in these estimates, mixed mode surveys with SMS have a lower average absolute error than do live interviewer phone surveys (p<0.05) [4].
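For readers who want to replicate the setup, the specification can be sketched with statsmodels as follows; the data are simulated stand-ins and every column name is illustrative, so the output will not match the published estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data; every column name here is illustrative.
rng = np.random.default_rng(1)
modes = ["online only", "live phone only", "mixed mode with SMS"]
n_polls = 300
df = pd.DataFrame({
    "mode3": rng.choice(modes, size=n_polls),
    "days_out": rng.integers(1, 34, size=n_polls),  # days before Election Day
    "n": rng.integers(400, 2000, size=n_polls),     # sample size
    "state": rng.choice(["AZ", "FL", "NC", "PA", "WI"], size=n_polls),
    "pollster": rng.choice([f"firm_{i}" for i in range(30)], size=n_polls),
    "abs_error": rng.gamma(shape=2.0, scale=2.0, size=n_polls),
})

# OLS of absolute error on mode indicators, logged days out, logged sample
# size, and state indicators, with standard errors clustered by pollster.
model = smf.ols(
    "abs_error ~ C(mode3) + np.log(days_out) + np.log(n) + C(state)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["pollster"]})

# Predicted absolute error by mode, holding other covariates at typical values.
grid = pd.DataFrame({
    "mode3": modes,
    "days_out": df["days_out"].median(),
    "n": df["n"].median(),
    "state": "PA",
})
pred = model.get_prediction(grid).summary_frame(alpha=0.05)
print(pd.concat([grid["mode3"],
                 pred[["mean", "mean_ci_lower", "mean_ci_upper"]]], axis=1))
```

Clustering by pollster reflects the fact that a firm’s polls share house methodological choices, so their errors are unlikely to be independent observations.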
To be clear, this regression analysis shows that interview mode is (somewhat) correlated with accuracy, but it cannot show that the choice to use phones *causes* those polls to be less accurate than mixed-mode surveys with SMS. Other methodological choices may be correlated with the decision to use SMS and act as confounding variables. But this finding does point to the importance of active experimentation in survey methodology, in which randomized controlled trials by mode can identify what method of interviewing yields the most accurate and lowest cost results, both overall and within defined segments of the electorate.
Key Takeaways
Looking at 2020 presidential polling, it is clearly a mistake to treat live-phone-only polls as a gold standard that reliably produces more accurate results than other modes do. Other researchers are coming to similar conclusions. For example, in a new journal article that looks at a more restricted set of polling using different inclusion criteria, Costas Panagopoulos finds that “Polls conducted entirely by phone were the least accurate in 2020.”
Instead of relying on the methodological choices that have worked in the past, we should place greater value on efforts to innovate. For some, that means SMS or online surveys, but increasingly it will mean mixed-mode research, in which survey modes other than live-interviewer phones are selected to optimize the accuracy and cost of the survey, rather than used only as a last resort when standard practices have failed. To be sure, phones will continue to be essential tools of survey research for the foreseeable future. But often that will mean respondents typing answers into an SMS app or web browser, rather than speaking them to a live operator.
In light of phones’ vanishing accuracy advantage, the polling community needs to embrace innovation over tradition, while continuing to hold itself to high standards of accuracy. Ensuring that accuracy requires active, ongoing experimentation between elections, instead of simply waiting to see how the next round of election polling turns out. Some industry leaders, like the Pew Research Center, are already doing this kind of essential research. But all media and politically aligned pollsters should be doing the same; the continued credibility of survey research demands it.
[1] Polls rather than pollsters are the unit of analysis here, so some polling firms have multiple polls for the same state in this period. There are multiple observations for 32 of these polls, including duplicates for leaned or un-leaned vote choice, for low- or high-turnout scenarios, and for questions that include all candidates on the ballot or only Biden and Trump. For those, we choose the un-leaned vote choice, the high-turnout scenario, and the fuller set of candidates, yielding 799 polls.
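As a concrete sketch of these selection rules (all column names and codes here are assumptions):

```python
import pandas as pd

# Hypothetical poll-level observations; all column names and codes invented.
obs = pd.DataFrame({
    "poll_id":    [101, 101, 101, 102],
    "leaned":     [True, False, False, False],
    "turnout":    ["high", "high", "low", "high"],
    "candidates": ["all", "all", "all", "head_to_head"],
})

# Lower preference score = more preferred version of the same poll.
obs["pref"] = (
    obs["leaned"].astype(int)                  # prefer un-leaned vote choice
    + (obs["turnout"] == "low").astype(int)    # prefer high-turnout scenario
    + (obs["candidates"] != "all").astype(int) # prefer the fuller candidate list
)
deduped = obs.sort_values("pref").drop_duplicates("poll_id", keep="first")
print(deduped)
```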
[2] For example, IVR-only surveys skew old and Republican, so they have some of the lowest bias metrics on average. However, because this causes them to be too Republican in cases where the Democrat won, their average accuracy is worse, relative to other modes, than their average bias.
[3] We were also able to code some of the uncoded polls on the basis of their releases; SurveyMonkey polls were all coded as online, and the Public Policy Polling releases in this set of 799 all indicated they were phone and text.
[4] This finding is largely robust to the case selection decisions described above. However, when the “low turnout” scenarios in the Monmouth polls are used in place of the high-turnout scenarios, the average difference between live phone polls and mixed-mode SMS polls is slightly smaller, and the significance level drops to p<0.1.
Kevin Collins (@kwcollins) is Co-founder and Chief Research Officer at Survey 160, a firm that offers software and services for live interviewer SMS surveys.