On most election nights, the best way for us to communicate how many votes are outstanding is the fraction of precincts that have reported their results. Unfortunately, this metric can mislead our readers. For example, if many small precincts had already reported but a few large precincts hadn’t, the percentage of precincts reporting would give the impression that most of the vote had been counted and the race was possibly already decided. Conversely, if only a few large precincts had finished counting, it could appear that many votes were outstanding even though election night was actually nearly over. And this doesn’t take into account other complications like early, absentee or provisional voting, which can make up much more than half of the total vote in some areas.

To avoid this issue, we began working on a model that estimates turnout before the election begins and then updates these predictions throughout the night as precincts report their results.

This Is The Very Model of a Modern Election General

Our model is founded on the assumption that past turnout predicts future turnout. While it is true that there are exceptions, this maxim often gets us very close to where we want to be. And because our model was built to adapt to live data rather than act as a static pre-election prediction, these estimates are just our starting point.

Our focus on the live experience is also the reason we chose to model this election on the precinct level. By returning predictions for each precinct, we would not have to wait for entire districts to finish counting and report their results.

To learn the relationship between past turnout and future turnout, we looked at historical elections that most closely resembled this election. For the House of Delegates, we used the 2017 election, which included a Governor’s race but no state Senate elections, and 2015, which on the surface is the most similar to 2019. For the state Senate, we chose the 2015 and 2011 elections.

It’s possible that choosing the 2017 election was a mistake since, in a lot of important ways, it was quite different from the election this year. But we made a conscious choice to include it because we think that an election in which Democrats nearly doubled their gubernatorial win margin compared to 2013 may have marked a fundamental shift in Virginia politics.

To predict the turnout for any of those elections, we looked at the turnout in prior elections. So to predict the 2017 turnout, we looked at turnout in 2016, 2015 and 2013. We also included turnout from the year’s primary and the turnout in the primary of the equivalent election two years previously. Since a lot can change between elections, primary turnout gave us a better indicator of enthusiasm for an election in a specific year. Other features we added included an indicator of whether or not there was a Governor’s race — which helps us account for increased turnout in 2017 — and a number of demographic features. After some testing, we settled on 10-year age buckets and ethnicity.

We then fed those features into a model, which we used to predict turnout precinct by precinct in that year. While we tested more sophisticated models, we settled on Ordinary Least Squares — or OLS — because it gave us an easy way of computing confidence intervals. Also, because of the assumption of normality of errors, OLS gave us a straightforward way of updating our predictions for the live version of the model. Furthermore, we didn’t see a drop in quality of fit between OLS and the more complex models that we tried.

The setup is common to OLS:

\(Y = X \beta + \epsilon\)

If \(n\) is the number of precincts and \(p\) is the number of covariates, then \(Y \in \mathbb{R}^{n}\) is the vector of turnouts. \(X \in \mathbb{R}^{n \times p}\) is the matrix of covariates where each row of the matrix consists of the data we use to predict turnout for a precinct. \(\beta \in \mathbb{R}^{p}\) is the vector of parameters we are trying to estimate and \(\epsilon \in \mathbb{R}^n\) is the error such that \(\epsilon \sim N(0, \Sigma)\) so that \(Y \sim N(X \beta, \Sigma)\).

This equation is then solved using the Moore-Penrose pseudo-inverse:

\(\hat{\beta} = (X^T X)^{-1} X^T Y\)

The fitted model gave us parameter estimates for the covariates specified above, which we could then apply to the corresponding data in 2019: the 2018, 2017 and 2015 general election turnouts, the 2019 and 2017 primary turnouts, an indicator of a Governor’s election (set to 0) and the same demographic features — 10-year age buckets and ethnicity.
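To make that concrete, here is a minimal sketch of the fitting step in Python with statsmodels. The data frame, column names and coefficients are made-up stand-ins for the features described above, not our production code:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200  # stand-in for Virginia's ~2,000 precincts

# Hypothetical training frame: one row per precinct. The lagged columns hold the
# 2016/2015/2013 general and 2017/2015 primary turnouts when fitting on 2017,
# and the 2018/2017/2015 general and 2019/2017 primary turnouts when predicting 2019.
X = pd.DataFrame({
    "general_lag1": rng.uniform(200, 2000, n),
    "general_lag2": rng.uniform(200, 2000, n),
    "general_lag4": rng.uniform(200, 2000, n),
    "primary_same_year": rng.uniform(50, 800, n),
    "primary_lag2": rng.uniform(50, 800, n),
    "governors_race": rng.integers(0, 2, n),  # 1 for 2017 rows, 0 otherwise
    "age_18_29": rng.uniform(0, 1, n),        # ...plus the other age buckets
    "pct_white": rng.uniform(0, 1, n),        # ...plus the other ethnicity shares
})
y = 0.6 * X["general_lag1"] + 0.2 * X["general_lag2"] + rng.normal(0, 50, n)

ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.params)  # coefficients we would then apply to the 2019 covariates
```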

Interval Training

Because we wanted a sense of how confident we could be about our predictions, and also to give the predictions an upper and lower bound, we computed prediction intervals. Prediction intervals are similar to confidence intervals, which contain the true value of a population parameter with some predefined probability. The difference is that prediction intervals take into account the additional uncertainty of predicting future observations.

To understand the difference between prediction and confidence intervals let us consider estimating the mean of a population. Assume that \(x_i \sim N(\mu, \sigma^2), iid, i = 1, ..., n\) where both \(\mu\) and \(\sigma^2\) are unknown. We use the sample mean, \(\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\) as the estimate for the population mean.

A 95% confidence interval would contain the true population mean with 95% probability. The variance we use to compute this confidence interval is the variance of \(\bar{x}\), which, by our assumption of independence, is \(\frac{\sigma^2}{n}\).

Now assume we want to predict a new point \(x_{n + 1}\), and we also want to estimate an interval that quantifies our uncertainty in this prediction. These prediction intervals ought to contain the true value of \(x_{n + 1}\) with 95% probability. Given that \(x_{n + 1}\) is also drawn from a distribution with mean \(\mu\), a reasonable estimator \(\hat{x}_{n + 1}\) is the sample mean \(\bar{x}\).

Counterintuitively, the variance we need for this prediction interval is not the variance of the sample mean, \(\frac{\sigma^2}{n}\). That term only quantifies how far away \(\bar{x}\) is from \(\mu\) (and not from \(x_{n + 1}\)) in expectation. Drawing \(x_{n + 1}\) introduces an additional source of uncertainty: the sampling process itself, which contributes a further \(\sigma^2\). Since these two sources of error are independent, the variance we use for the prediction interval is simply the sum of \(\frac{\sigma^2}{n}\) and \(\sigma^2\). After factoring, this yields a final variance of \(\sigma^2(1 + \frac{1}{n})\).
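To make the distinction concrete, here is a small, purely illustrative simulation that checks the coverage of both kinds of interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, alpha = 100.0, 15.0, 30, 0.05
trials = 20_000

ci_covers = pi_covers = 0
for _ in range(trials):
    x = rng.normal(mu, sigma, n)
    x_new = rng.normal(mu, sigma)               # the future observation
    xbar, s = x.mean(), x.std(ddof=1)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)

    ci_half = t_crit * s * np.sqrt(1 / n)       # variance sigma^2 / n
    pi_half = t_crit * s * np.sqrt(1 + 1 / n)   # variance sigma^2 * (1 + 1/n)

    ci_covers += abs(xbar - mu) <= ci_half      # does the CI contain the mean?
    pi_covers += abs(x_new - xbar) <= pi_half   # does the PI contain the new point?

print(f"CI covers mu:        {ci_covers / trials:.3f}")  # ~0.95
print(f"PI covers new point: {pi_covers / trials:.3f}")  # ~0.95; the narrower CI would not
```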

To compute prediction intervals for our turnout model, we generalize this derivation for sample means to the case of OLS predictions.

Since our predictions are the output of a linear regression, it would seem reasonable to use a t-distribution for our prediction intervals. However, using this distribution would build on the assumption that the precincts are independent of each other. This is likely incorrect. The factors that drive high turnout in one precinct would, at a minimum, likely also drive high turnout in other precincts within the same district. And these same factors might even suggest high turnout generally across the entire election.

To solve this issue we used the multidimensional analog of the t statistic, known as Hotelling’s T-squared statistic. It gave us a multidimensional prediction region for all precincts simultaneously. Despite its name, this statistic follows an F-distribution rather than a t-distribution. The F-distribution has thicker tails, which take into account more uncertainty and give us more robust prediction intervals for each precinct.

Again, let \(x_i \sim N(\mu, \sigma^2), iid, i = 1, ..., n\) with unknown mean and variance. Let \(\bar{x}\) be the sample mean and \(s\) be the sample standard deviation. The test statistic for a hypothesis test on the mean is:

\(t = \frac{\sqrt{n} (\bar{x} - \mu)}{s}\)

By assumption \(x_i\) is normal, so \(\bar{x}\), as a sum of normal random variables, is too. Since \(s\) is the sample standard deviation, it is proportional to the square root of a sum of squared normal random variables, making it a scaled \(\chi\) random variable. We further know that the quotient of a normal random variable and the square root of an independent chi-square random variable divided by its degrees of freedom has a t-distribution with \(n - 1\) degrees of freedom.

By Basu’s theorem we know that the sample mean and sample standard deviation of a normal random variable are in fact independent of each other, which gives us:

\(t \sim t_{n - 1}\)

Unfortunately, it is not obvious how to generalize the test statistic \(t\) to the multivariate setting.

Let us square both sides, resulting in an equivalent test statistic:

\(t^2 = \frac{n(\bar{x} - \mu)^2}{s^2}\)

This is now the quotient of two \(\chi^2\) random variables, each divided by its degrees of freedom, which we know follows an \(F\)-distribution with \(1\) and \(n-1\) degrees of freedom.
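As a quick, illustrative sanity check of that claim, squaring the two-sided t critical value reproduces the corresponding F critical value:

```python
from scipy import stats

n, alpha = 25, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)  # two-sided t critical value
f_crit = stats.f.ppf(1 - alpha, 1, n - 1)   # F(1, n - 1) critical value

print(t_crit ** 2, f_crit)  # both ~= 4.26: squaring the t critical value gives the F one
```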

Unlike \(t\), this form generalizes intuitively to the multivariate setting:

\(T^2 = n(\bar{\mathbf{X}} - \mathbf{\mu})^T \mathbf{S^{-1}} (\bar{\mathbf{X}} - \mathbf{\mu})\)

Where \(\mathbf{\mu}\) is the multivariate population mean, \(\bar{\mathbf{X}}\) is the multivariate sample mean and \(\mathbf{S}\) is the sample covariance matrix.

According to Harold Hotelling, a journalist turned statistician, \(T^2\) also follows an \(F\)-distribution, and once we scale by the corresponding degrees of freedom, we get:

\(\frac{n - p}{(n - 1)p} T^2 \sim F_{p, n - p}\)

We can further show that the resulting simultaneous \(1 - \alpha\) confidence intervals, one per component \(i\) (where \(s_{ii}\) is the \(i\)-th diagonal entry of \(\mathbf{S}\)), look like:

\(\left( \bar{X}_i - \sqrt{ \frac{p (n - 1)}{n - p} F_{p, n - p} (\alpha) \frac{s_{ii}}{n}},\ \bar{X}_i + \sqrt{ \frac{p (n - 1)}{n - p} F_{p, n - p} (\alpha) \frac{s_{ii}}{n}} \right)\)
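As an illustrative sketch (with synthetic data and far more historical elections than we actually have, so that the sample covariance is invertible), these simultaneous intervals can be computed like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative only: p "precincts" observed over n past "elections". Here n > p so
# that the sample covariance is invertible; our real data is far sparser than this.
n, p, alpha = 12, 3, 0.05
X = rng.normal(loc=[900.0, 1500.0, 600.0], scale=[80.0, 120.0, 50.0], size=(n, p))

xbar = X.mean(axis=0)                 # componentwise sample mean
S = np.cov(X, rowvar=False)           # p x p sample covariance matrix
scale = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)

half_width = np.sqrt(scale * np.diag(S) / n)
intervals = np.column_stack([xbar - half_width, xbar + half_width])
print(intervals)                      # simultaneous 95% intervals, one row per precinct
```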

Unfortunately, this was not quite what we wanted. While it was helpful to have prediction intervals for each precinct, we were more interested in prediction intervals on the district level. Worse, we couldn’t simply sum the precinct variances that we used to compute the precinct-level prediction intervals. That would rely on the assumption that turnouts in different precincts are independent of each other. Instead, we needed to add in the covariances between the precincts, accounting for the fact that if one precinct had high turnout, other precincts in the same district would likely have high turnout too.

Fortunately, Hotelling’s T-squared prediction intervals were easily generalized to a sum of random variables by summing the covariances of the precincts within the same district. This gave us simultaneous prediction intervals for district-level predictions.
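A minimal sketch of that district-level step, with hypothetical variable names; the key point is that the whole covariance block is summed, off-diagonal entries included:

```python
import numpy as np

def district_interval(point_est, cov, precinct_idx, crit):
    """Prediction interval for a district total from precinct-level estimates.

    point_est: precinct-level point estimates; cov: full precinct covariance
    matrix; precinct_idx: indices of the precincts in the district; crit: the
    critical value from the (scaled) F-distribution described above.
    """
    idx = np.asarray(precinct_idx)
    total = point_est[idx].sum()
    # Summing the whole block -- variances plus off-diagonal covariances -- is what
    # captures precincts in the same district moving together.
    var = cov[np.ix_(idx, idx)].sum()
    half = crit * np.sqrt(var)
    return total - half, total + half
```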

We’ll Do It Live

The second — and more important — component of our turnout model was a live model that could use actual turnout results from precincts that had already reported to update our predictions for those that hadn’t.

Through the use of Ordinary Least Squares regression and the Hotelling prediction intervals, we had already baked in the assumption that the vector of precinct-level predictions was jointly normal. Throughout the night, we were interested in the conditional distribution of the unreported subset of those predictions given the observed turnout in the precincts that had reported. Thankfully, the conditional distribution of the unreported precincts is also normal, with well-known closed-form expressions for the conditional mean and covariance matrix.

Let \(\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}\) such that \(\mathbf{x} \sim N(\mathbf{\mu}, \mathbf{\Sigma})\) where \(\mathbf{\mu} = \begin{bmatrix} \mathbf{\mu}_1 \\ \mathbf{\mu}_2 \end{bmatrix}\) and \(\mathbf{\Sigma} = \begin{bmatrix} \mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12} \\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22} \end{bmatrix}\).

Then \(\mathbf{x}_1 | \mathbf{x}_2 = \mathbf{a} \sim N(\bar{\mathbf{\mu}}, \bar{\mathbf{\Sigma}})\) where \(\bar{\mathbf{\mu}} = \mathbf{\mu}_1 + \mathbf{\Sigma}_{12} \mathbf{\Sigma}_{22}^{-1} (\mathbf{a} - \mathbf{\mu}_2)\) and \(\bar{\mathbf{\Sigma}} = \mathbf{\Sigma}_{11} - \mathbf{\Sigma}_{12} \mathbf{\Sigma}_{22}^{-1} \mathbf{\Sigma}_{21}\).
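In code, this conditioning step is only a few lines of numpy. The function below is a sketch with our own hypothetical variable names, assuming the prior mean and covariance come from the pre-election model:

```python
import numpy as np

def condition_on_reported(mu, cov, reported, observed):
    """Condition the joint normal on the precincts that have reported.

    mu, cov: prior mean vector and covariance matrix over all precincts.
    reported: boolean mask of precincts that have reported.
    observed: actual turnout in those reported precincts (same order).
    Returns the conditional mean and covariance of the unreported precincts.
    """
    r = np.asarray(reported)
    u = ~r

    mu1, mu2 = mu[u], mu[r]
    S11, S12, S22 = cov[np.ix_(u, u)], cov[np.ix_(u, r)], cov[np.ix_(r, r)]

    # K = Sigma_12 Sigma_22^{-1}, via a linear solve rather than an explicit inverse.
    K = np.linalg.solve(S22, S12.T).T

    mu_bar = mu1 + K @ (observed - mu2)
    cov_bar = S11 - K @ S12.T
    return mu_bar, cov_bar
```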

Covariances and Surfing Waves of Change

Because our model relies heavily on covariances — both for prediction intervals and for the live updates — generating the covariance matrix became an area of intense scrutiny for us. The usual approach would have been to take the sample covariance as the covariance estimate. Unfortunately, we couldn’t just do that because of the intricacies of our sparse data set of historical elections.

There are approximately 2,000 precincts in Virginia. Our covariance matrix — a 2,000 by 2,000 matrix — would have two million unique entries. Unfortunately, there have been very few — at best, four — comparable elections that we can use as samples to compute the covariances. The under-specification of this problem meant that a naively computed covariance estimate was numerically unstable. The result? The updated covariances in the live model jumped to zero after very few precincts were observed. This would suggest that after only a handful of precincts had reported there was no extra information to be learned by seeing more precincts. This is almost certainly incorrect. Also, because the resulting matrices were close to singular and no longer guaranteed to be positive semidefinite, there would be cases where we could have observed negative district-level variances, something which would have broken our prediction intervals.

We tried a number of methods to mitigate this issue, including applying an inverse-Wishart prior distribution to the covariance matrix and regularizing the matrix with a multiple of the identity matrix. Unfortunately, none of these methods stabilized the estimates enough.

After some trial-and-error, we settled on two different approaches — one for the House of Delegates and one for the state Senate. We are using a different approach for each chamber because there have been fewer state Senate elections using the current districts — and many of those races were previously uncontested — so our sample covariance estimates were even more unstable.

The covariance matrix for the state Senate consists of just three unique numbers: the precinct variance, the covariance between two precincts in the same district and the covariance between two precincts in different districts. Each is calculated as the average of the corresponding entries in the original sample covariance matrix.
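Here is a sketch of how such a three-parameter covariance matrix can be assembled; the function name and the numbers in the toy example are made up:

```python
import numpy as np

def senate_covariance(district_ids, var, cov_same, cov_diff):
    """Build a covariance matrix from the three numbers described above:
    a common precinct variance on the diagonal, one covariance for pairs of
    precincts in the same district and another for pairs in different districts."""
    d = np.asarray(district_ids)
    same_district = d[:, None] == d[None, :]
    cov = np.where(same_district, float(cov_same), float(cov_diff))
    np.fill_diagonal(cov, var)
    return cov

# Toy example: three precincts, the first two in the same district.
cov = senate_covariance(["d01", "d01", "d02"], var=900.0, cov_same=400.0, cov_diff=50.0)
```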

For the House of Delegates, we decided to model the covariance as a linear function of distance and demographic similarity. In other words, we assumed that the covariance between two precincts is a function of the physical distance between them and the Mahalanobis distance between their demographic vectors. We solved this linear equation using OLS, with the sample covariance as the dependent variable. As in the Senate, though, we used the average sample variance as our variance estimate.

We estimate the covariance matrix as:

\(\hat{\sigma}_{i, j} = \alpha + \beta \cdot d(i, j) + \gamma \cdot sim(i, j)\)

Where \(d(i, j)\) is the distance between the centroids of precinct \(i\) and precinct \(j\), and \(sim(i, j)\) is the Mahalanobis distance between the demographic features of the precincts.

Currently we are using OLS to solve the above equation for all pairs of precincts. Unfortunately, this gives us no guarantee that the resulting matrix is positive semidefinite. For the next iteration of this model, we are hoping to solve the equation as a positive semidefinite (PSD) programming problem instead.
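Below is an illustrative sketch of that pairwise regression on synthetic data; the centroid and demographic inputs are random stand-ins for the real geographic and demographic features:

```python
import numpy as np
import statsmodels.api as sm
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(3)
n = 40                                         # illustrative precinct count

centroids = rng.uniform(0, 100, size=(n, 2))   # precinct centroids (made up)
demo = rng.uniform(0, 1, size=(n, 5))          # demographic feature vectors (made up)
sample_cov = np.cov(rng.normal(size=(6, n)), rowvar=False)  # stand-in sample covariance

VI = np.linalg.inv(np.cov(demo, rowvar=False)) # inverse covariance for the Mahalanobis distance

pairs, y = [], []
for i in range(n):
    for j in range(i + 1, n):                  # off-diagonal pairs only
        d_ij = np.linalg.norm(centroids[i] - centroids[j])
        sim_ij = mahalanobis(demo[i], demo[j], VI)
        pairs.append([d_ij, sim_ij])
        y.append(sample_cov[i, j])

fit = sm.OLS(np.array(y), sm.add_constant(np.array(pairs))).fit()
alpha_hat, beta_hat, gamma_hat = fit.params    # intercept, distance and similarity coefficients

# Note: filling a matrix with these fitted values does not, by itself,
# guarantee positive semidefiniteness -- hence the PSD programming plan above.
```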

Using these methods to compute the covariance matrix loses a lot of information — especially in the case of the state Senate — but we decided that the gain in numerical stability was worth the loss.

Election Night Jitters

Our model starts updating its predictions as soon as the first precincts start reporting. Realistically, we expect to start seeing stable numbers by the time about 10 percent of precincts have reported. We’ll likely be able to begin publishing our model results when we get to about 25 percent of precincts reporting. Why do we wait? Early numbers can have wild effects on the predictions if some of the precincts are outliers. Precincts do not report in random order. In many states, rural precincts tend to report earlier, and this makes our model unreliable for urban precincts early in the night.

There are a number of metrics we will be watching in real time to see how our model is performing. These include the average residual on the precincts that have reported so far and the average residual for the precincts that are reporting at that moment. We will also be looking at the average prediction interval coverage of the precincts that have reported. We expect it to be relatively low initially as the model adjusts to the real data, but we expect it to stabilize at around 95 percent by the time about 10 percent of precincts have reported. Additionally, we will be looking at the average size of the prediction intervals, which we expect to decrease over time. Finally, we will be looking at the raw precinct and district predictions and making sure that the numbers look reasonable with the help of The Washington Post’s local team, who I’ll be sitting with on election night.
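These diagnostics are simple to compute once the predictions and intervals are in hand; a hypothetical sketch:

```python
import numpy as np

def live_metrics(actual, predicted, lower, upper, reported):
    """Diagnostics we watch as precincts report; names and layout are illustrative.

    actual, predicted: turnout per precinct; lower, upper: prediction interval
    bounds per precinct; reported: boolean mask of precincts that have reported.
    """
    r = np.asarray(reported)
    resid = actual[r] - predicted[r]
    covered = (actual[r] >= lower[r]) & (actual[r] <= upper[r])
    return {
        "mean_residual": resid.mean(),                        # bias so far
        "mean_abs_residual": np.abs(resid).mean(),
        "interval_coverage": covered.mean(),                  # should settle near 0.95
        "mean_outstanding_width": (upper[~r] - lower[~r]).mean(),  # should shrink over the night
    }
```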

What’s Next

Even before our model goes live on Tuesday, we are already thinking about what’s coming next. The first, and most obvious, point is that there is much we don’t know about how to build election night models. We came up with and built the above model from scratch. This means we will continue to run models in future elections, improving the current one and testing out other ideas. We hope to run a very similar model for the Louisiana Governor election on Saturday, November 16th.

Additionally, we will probably move from a precinct-level approach to a county-level one. While a precinct-level model gives us finer granularity and allows live updates to start earlier, it also comes with costs. Gathering data is painful, especially because precincts often change between elections. This makes comparing results over time very tricky. Also, as explained above, estimating such a large covariance matrix is difficult and moving to counties will hopefully make the problem a lot more tractable.

We are also planning some smaller structural improvements to the estimation process, such as using positive semidefinite programming to solve for our covariance estimate. And we’re planning to invest more time in feature engineering, especially for demographic features.

Watch this space for more explanation behind our models — both for general elections and primaries!

Special thanks to Drew Dimmery, John Cherian and Idrees Kahloon for their advice and insight.