Read the technical writeup | See the code and sample data

Since November 2019, The Washington Post has built several models to estimate the number of outstanding votes in an election.

Our first model was built to estimate turnout in the Virginia state house and state senate elections in 2019. This model allowed us to stop using the old “precincts reporting” measurement and instead show a “percentage of final vote recorded.”

Our second model was built to estimate turnout and vote share in the 2020 Democratic primaries. It tracked the flow of vote share between the 2016 and 2020 Democratic candidates. Helpfully, there were only two Democratic candidates in 2016, which simplified the process. This approach also yielded a wonderful descriptive tool showing the estimated vote flows between candidates, so we could help answer questions like, “Which 2020 candidates are 2016 Clinton voters supporting?”

Today, we’re introducing our third model. In collaboration with our friends at Decision Desk HQ/0ptimus Analytics, we’ve built a model to estimate turnout and vote share for the 2020 Presidential and Senate elections. And while we’ll summarize the high points here, you should read our technical writeup and dive into the code yourself.

Our new method no longer estimates the covariance between geographic units as our Virginia model did. Instead, we model the quantities we are interested in directly. We do that by estimating the normalized “residual,” or error, between our pre-night prediction — spoiler: the 2016 turnout — and what we think turnout will be this year. We use demographic features such as age, gender, ethnicity, income and education as covariates, which means we implicitly leverage the observed night-of covariances for estimation.
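
To make that concrete, here is a minimal sketch of the response variable in Python. The column names and the choice to normalize by the 2016 baseline are illustrative assumptions on our part — the technical writeup has the exact definition.

```python
import pandas as pd

def normalized_residual(counties: pd.DataFrame) -> pd.Series:
    """Sketch of the response variable: how far this year's turnout
    deviates from the pre-night prediction (2016 turnout), expressed
    as a fraction of that baseline. Column names are hypothetical."""
    baseline = counties["turnout_2016"]
    observed = counties["turnout_2020"]
    # Positive values mean a county is running ahead of its 2016 turnout.
    return (observed - baseline) / baseline
```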

Estimating the normalized residual instead of turnout directly makes our model less sensitive to population differences between counties. It also makes our model less reactive to the fact that the order in which counties report is not random. In normal elections, rural counties tend to report earlier than urban counties. And because of time zones, states further east report results sooner than states further west. This year, the order of reported results will be especially bizarre because, with the increase in vote-by-mail, many states might not report vote totals on election night. All of this means we know the order in which counties report isn’t random. But using normalized residuals as our response variable makes that ordering look closer to random.

The predictive core of our model is a quantile regression. The difference between quantile regression and the regression you’re used to seeing, ordinary least squares, is that quantile regression predicts the conditional median instead of the conditional mean. This is yet another technique that makes our prediction less sensitive to outliers. There will invariably be some counties where our covariates aren’t sufficient to predict turnout — because of some unmodeled event like a scandal, or a candidate who runs a particularly good campaign. Using the median instead of the mean ensures that we don’t adjust our parameters too much because of those few outliers, especially early in the night.
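
To illustrate the idea — this is a toy sketch, not our production code — here is a conditional-median fit using statsmodels, with synthetic data standing in for county demographics and normalized residuals:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Stand-ins for the demographic covariates and normalized residuals
# of counties that have already reported; heavy-tailed noise mimics
# the occasional outlier county.
X = sm.add_constant(rng.normal(size=(200, 3)))
y = X @ np.array([0.0, 0.5, -0.2, 0.1]) + 0.1 * rng.standard_t(df=3, size=200)

# q=0.5 fits the conditional median, which a few extreme counties
# cannot drag around the way they would an OLS conditional mean.
median_fit = sm.QuantReg(y, X).fit(q=0.5)
print(median_fit.params)
```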

Another benefit of using quantile regression is that we can estimate quantiles other than the median. Estimating the 0.05 and 0.95 quantiles, for example, gives us a first estimate of our prediction intervals. Critically, this means we did not have to make any distributional assumptions to get these prediction intervals. This is a big change from our Virginia model, where we assumed turnout to be jointly normal!
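
Continuing the same toy setup, fitting the tail quantiles gives a first-cut 90 percent interval for each county that has yet to report — with no normality assumption anywhere:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 3)))  # reported counties
y = X @ np.array([0.0, 0.5, -0.2, 0.1]) + 0.1 * rng.standard_t(df=3, size=200)

model = sm.QuantReg(y, X)
lower = model.fit(q=0.05)  # 0.05 quantile fit
upper = model.fit(q=0.95)  # 0.95 quantile fit

# First-cut 90 percent prediction intervals for unreported counties.
X_new = sm.add_constant(rng.normal(size=(5, 3)))
intervals = np.column_stack([lower.predict(X_new), upper.predict(X_new)])
print(intervals)
```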

Unfortunately, these prediction intervals can be inaccurate, either because our model may be misspecified or because the counties that have delivered early results are particularly unrepresentative. To solve this problem, we use a method called conformal prediction, which adjusts the prediction intervals generated by the quantile regression so that they are valid under certain minimal assumptions.
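
Here is the flavor of that adjustment, in the style of split-conformal calibration of quantile regression intervals — the exact procedure we use is in the technical writeup. You hold out some reported counties as a calibration set, measure how badly the raw intervals miss there, and widen every interval accordingly:

```python
import numpy as np

def conformal_widening(y_cal, lo_cal, hi_cal, alpha=0.1):
    """Split-conformal adjustment sketch: conformity scores measure how
    far each calibration county falls outside its raw interval (negative
    if it falls inside); return a finite-sample-corrected quantile."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank for 1 - alpha coverage
    return np.sort(scores)[min(k, n) - 1]

# Usage sketch: widen each raw quantile-regression interval by q.
# q = conformal_widening(y_cal, lo_cal, hi_cal)
# adjusted interval: (lo_new - q, hi_new + q)
```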

Intrigued? Dive into our model internals in this technical writeup or get started with a sample data set by cloning our model code repository.

As always, reach out to Jeremy Bowers, Lenny Bronner or John Cherian with questions.

Acknowledgements

This model was created in cooperation with Decision Desk HQ/0ptimus Analytics. Special thanks to Alexander Podkul, Alex Alduncin, Matt Shor, Kiel Williams, Scott Tranter and Drew McCoy.

Furthermore, Jessica Eng was integral to this work. We would not be where we are today without her.

We want to thank everyone at The Washington Post who made this work possible. That includes, but is not limited to: Peter Andringa, Jason Bernert, Jeremy Bowers, David Byler, Reuben Fischer-Baum, Simon Glenn-Gregg, Shana Hadi, Jason Holt, Aditya Jain, Teddy Kentor, Emily Liu, Anthony Pesce, Erik Reyna, Terri Rupar, Ashlyn Still and Susan Tyler.

Finally, we want to thank Laura Bronner, Akosua Busia, Drew Dimmery, Jessica Hullman, Idrees Kahloon, Kabir Khanna and G. Elliott Morris for their help, advice and support along the way.