Whiteness as the Defining Predictor in American Politics

By Shom Mazumder

Introduction

What sorts of features were useful for predicting the 2016 Presidential election, and what do they tell us about American politics today? To answer these questions, we at Data for Progress built a forecasting model based on 2012 Presidential election data and machine learning methods. While past Democratic Party performance is still a strong predictor of future performance, our analysis demonstrates that the racial composition of the American electorate has become a powerful predictor of Presidential elections. In essence, our predictive models demonstrate that American elections have become racialized and are likely to remain that way in the future.

How did we build our model?

Data

Throughout the analysis, our main outcome variable was the county-level percentage of the two-party vote cast for the Democratic Party’s Presidential candidate in 2012 and 2016 (Barack Obama and Hillary Clinton, respectively). The most important ingredient for any model is the choice of predictor variables. Our goal was to build a model that could forecast from readily available socio-economic, demographic, and electoral data as well as public opinion data. Using data from the Census as well as the Cooperative Congressional Election Study, we measured a wide range of features, including racial composition, economic structure, turnout, campaign contributions, educational attainment, racial attitudes, and gender attitudes.

Analysis

Most prediction models attempt to generate the best predictions in the most parsimonious way possible. We consider two such machine learning algorithms (as well as a combination of the two): regularized regression and random forests. Regularized regression builds on basic regression models by shrinking the coefficients of predictors that aren’t very good at predicting election outcomes, in some cases removing them entirely. Random forests take a different approach, building a large number of weaker regression trees that each learn different areas of the data and then combining their predictions.
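To make the idea of "removing" weak predictors concrete, here is a minimal sketch of an elastic net fit by coordinate descent, written in plain Python. It is an illustration under stated assumptions, not the code behind our models: the function names, the penalty settings, and the four-row toy dataset are all invented for this example, and the data are assumed to be centered so that no intercept is needed.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, penalty, l1_ratio=0.5, sweeps=100):
    """Fit coefficients by cyclic coordinate descent (no intercept; data assumed centered)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Average product of feature j with the residual that leaves feature j out.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            # The L1 part can zero the coefficient out; the L2 part shrinks what is left.
            beta[j] = soft_threshold(rho, penalty * l1_ratio) / (scale + penalty * (1 - l1_ratio))
    return beta


# Toy, centered data invented for illustration (not the article's data).
# The three columns stand in for a strong, a moderate, and a very weak predictor.
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
y = [1.12, 0.48, -0.52, -1.08]

coefficients = elastic_net(X, y, penalty=0.1)
print(coefficients)
```

In this toy setup the very weak third predictor ends up with a coefficient of exactly zero, while the two stronger predictors are kept but shrunk.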

In our analysis, we use 2012 as our training data and then test predictions on 2016 data to optimize forecasting ability. By doing this iteratively across a number of different tuning parameters, we can choose how much regularization to apply and how many trees to grow so as to produce the best possible forecasts.
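The tuning loop described above can be sketched as follows. This is a simplified illustration, not our actual pipeline: it assumes a single predictor, a ridge-style penalty, and made-up numbers, and the names fit_ridge_slope, holdout_error, and tune_penalty are hypothetical. The same pattern would apply to any tuning parameter, such as the number of trees in a random forest.

```python
def fit_ridge_slope(x, y, penalty):
    """One-predictor ridge fit through the origin; the slope shrinks as the penalty grows."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def holdout_error(x, y, slope):
    """Mean squared error of the fitted line on held-out data."""
    return sum((b - slope * a) ** 2 for a, b in zip(x, y)) / len(x)


def tune_penalty(train, test, grid):
    """Fit on the training year for each penalty; keep the one with the lowest test-year error."""
    scores = {}
    for penalty in grid:
        slope = fit_ridge_slope(train[0], train[1], penalty)
        scores[penalty] = holdout_error(test[0], test[1], slope)
    best = min(scores, key=scores.get)
    return best, scores


# Invented, centered toy values: (predictor, outcome) pairs for each year.
train_2012 = ([-2.0, -1.0, 1.0, 2.0], [-2.4, -1.2, 1.2, 2.4])
test_2016 = ([-2.0, -1.0, 1.0, 2.0], [-2.0, -1.0, 1.0, 2.0])

best_penalty, scores = tune_penalty(train_2012, test_2016, grid=[0.0, 1.0, 2.0, 4.0])
print(best_penalty, scores)
```

Here the unpenalized fit overshoots the 2016 relationship, so a moderate penalty forecasts better than either no penalty or a heavy one.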

Results

Overall, our models performed fairly well. The regularized regression model had an r-squared (the proportion of variance explained) of 0.85 and the random forest algorithm had an r-squared of 0.72. Using an ensemble algorithm that combines both sets of predictions, we get an overall r-squared of 0.85.
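For readers unfamiliar with the metric, the sketch below shows how an r-squared is computed and one simple way two sets of predictions could be combined. The weighted average used here is an assumption made for illustration; it is not necessarily the ensemble method we used, and the vote shares are invented toy values rather than real county results.

```python
def r_squared(actual, predicted):
    """Proportion of variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


def average_ensemble(pred_a, pred_b, weight_a=0.5):
    """Combine two prediction lists with a weighted average (an assumed, simple ensemble)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Invented toy vote shares for four hypothetical counties.
actual_share = [0.30, 0.45, 0.60, 0.75]
regression_pred = [0.32, 0.44, 0.58, 0.77]
forest_pred = [0.40, 0.50, 0.55, 0.65]

combined_pred = average_ensemble(regression_pred, forest_pred)
for name, pred in [("regression", regression_pred),
                   ("forest", forest_pred),
                   ("ensemble", combined_pred)]:
    print(name, round(r_squared(actual_share, pred), 3))
```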

 
 

What sorts of features were important for predicting elections? Two things stand out. First, both models unsurprisingly agree that the past election’s vote share is the strongest predictor of future elections. Second, both models also agree that the percentage of a county that is white is the second most important feature for predicting vote shares at the county level. In fact, the regularized regression (elastic net) suggests that past vote share and percent white are the only important predictors. The random forest model suggests that a few more variables, such as campaign contributions (the percentage going to Democrats and the total number of Democratic donors), also matter. In short, our models suggest that race is a powerful predictor of elections even above and beyond past election performance.
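One common, model-agnostic way to rank features is permutation importance: shuffle one feature across counties and measure how much the prediction error rises. The sketch below illustrates that idea only. It is not necessarily the importance measure behind the rankings reported here, and the model weights, feature values, and function names are arbitrary values invented for the example.

```python
import random


def model_error(model, rows, targets):
    """Mean squared error of a prediction function over a set of rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, repeats=20, seed=0):
    """Average rise in error when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = model_error(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        rise = 0.0
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            rise += model_error(model, shuffled, targets) - baseline
        importances.append(rise / repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model: uses the first two features and ignores the third."""
    return 0.7 * row[0] - 0.3 * row[1]


# Invented, centered feature values for six hypothetical counties.
rows = [
    [0.2, 0.3, 0.1],
    [-0.1, 0.1, -0.2],
    [0.3, -0.2, 0.0],
    [-0.3, -0.1, 0.2],
    [0.1, 0.2, -0.1],
    [-0.2, -0.3, 0.3],
]
targets = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, targets)
print(importances)
```

A feature the model never uses gets an importance of exactly zero, because shuffling it leaves every prediction unchanged.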

Conclusion

Using data on elections as well as socio-economic and demographic information, we assessed which types of predictors seem to be the most important for forecasting presidential elections. Our analysis demonstrates that race is a powerful factor in predicting elections today. 


Shom Mazumder is a Ph.D. Candidate in Government at Harvard University.
