Some of you may have seen our recent piece about how The Washington Post estimates turnout on election nights using live partial results. Although we did not publicize these results for the 2019 elections in Virginia and Louisiana, our previous model also allowed us to estimate the two-party vote counts.

We are now interested in estimating turnout and vote counts for primary elections. Unfortunately, the model we used for the 2019 elections does not lend itself to estimating the vote count for party primaries. Its prediction updates relied heavily on a covariance matrix, which itself was estimated from previous elections. That was possible because most recent general elections in the U.S. have featured a Democrat running against a Republican, so we can generally compute a covariance matrix for each party across different precincts.

In a party primary, however, this isn’t possible because the makeup of the candidates running is almost completely unique to each cycle. Elizabeth Warren has never run in a presidential primary before, so we don’t know what her performance in one precinct tells us about her performance in another precinct. And even for candidates who have run before — such as Bernie Sanders — our ability to learn from the past is rather limited. Sanders has run in only one previous presidential primary, so we don’t have enough data to estimate a covariance matrix like the one we used in our general election model.

So, we found ourselves in the position of having to build a new model for estimating vote counts in a party primary.

America’s Next Top Model

We needed to solve this problem without relying heavily on past elections to estimate a covariance matrix. Instead, we considered how the previous election informs the current one. What if we could predict what Hillary Clinton and Bernie Sanders supporters were going to do this time?

The mental model for this new approach is to think about the primary as having two states — and when we say “states” here, we mean conditions, not geography. The first state is the result of the previous election and the second state is the result of the current one. This frames the problem as computing the transition matrix between the two elections. Obviously, we want to do this before the election is over, so it’s about computing this transition matrix from partial results.

Specifically, we are interested in the 2020 primary, which means this is the second state. The last relevant election was in 2016, so we define that as the first state. Therefore we are trying to estimate a transition matrix between the results of the 2016 primary and the 2020 primary. In other words, we are trying to predict where Hillary Clinton’s and Bernie Sanders’ votes from 2016 were going in the 2020 primary.

For the sake of simplicity, let us assume there are four major candidates in the 2020 Democratic primary. This means we will define six variables to represent the 2020 Democratic primary outcomes: one for each of the four major candidates, one for “other” and write-in candidates, and one for all of the voters who decided not to vote in 2020 for any reason, including voters who have left the state or passed away.

We define a similar setup for 2016, though with fewer variables: one for Sanders, one for Clinton, one for “other” and write-in candidates, and one for people who decided not to vote in 2016 or who weren’t yet eligible to vote in 2016 but are in 2020.

We’re generally interested in estimating the proportion of people who transferred from each of the 2016 variables to each of the 2020 variables. We will do this by looking at the precincts that have reported at every stage of the night and comparing their 2016 and 2020 results to estimate how voters’ preferences have shifted.

Since we have six variables for 2020 and four for 2016, we are trying to estimate 6 × 4 = 24 transition parameters. This means that at least 24 precincts have to have reported before the model estimate is mathematically stable.

Once we have reached this minimum number of precincts, we can set up the problem. Our model tells us that the precinct results from 2016 (including 2016 non-voters) times our variables should be equal to our precinct results from 2020 (also including 2020 non-voters). This is because our variables represent where the 2016 voters shift to in the 2020 election.

Formally, let \(n_{2016}\) be the number of variables in 2016 and \(n_{2020}\) be the number of variables in 2020 (number of major candidates + 2). Let \(p\) be the number of precincts that have reported (\(p \geq n_{2016} \times n_{2020}\)).

Then the problem we are trying to solve can be set up as:

\(\min ||AX - B||_F\)

\( \text{s.t. } \sum_{j = 1}^{n_{2020}} x_{i, j} = 1, \forall i \)

\(x_{i, j} \geq 0, \forall i, j\)

where \(A \in \mathbb{R}^{p \times n_{2016}}\), \(B \in \mathbb{R}^{p \times n_{2020}}\) and \(X \in \mathbb{R}^{n_{2016} \times n_{2020}}\).

That is, \( A \) and \( B \) are the 2016 and 2020 results, respectively. Each row in \(A\) and \(B\) corresponds to one precinct; each column in \(A\) corresponds to one 2016 variable and each column in \(B\) to one 2020 variable.

The function we are trying to minimize is the difference between the actual 2020 results (\( B \)) and the 2020 results we are estimating using our parameters (\( AX \)). Since \( A \) is fixed and \( X \) is the matrix of our variables, this minimization problem fits the variables to minimize the difference between the true and estimated 2020 results.

Our first \(n_{2016}\) constraints say that the rows of the transition matrix must sum to 1. In other words, every voter who voted for Hillary Clinton in 2016 has to do something in 2020 (vote for one of the major candidates, vote for one of the other candidates or not vote at all). The second set of \(n_{2016} \times n_{2020}\) constraints makes sure that the transition matrix is non-negative: since the numbers in both \( A \) and \( B \) are counts of voters, no entry of the transition matrix can represent a negative flow of people.
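As a concrete illustration, here is a minimal sketch of this constrained least-squares problem in Python using cvxpy. This is not our production code; the function name and the toy data are made up for the example.

```python
import cvxpy as cp
import numpy as np

def estimate_transition_matrix(A, B):
    """Fit the voter-flow matrix X mapping 2016 precinct results (A, p x n_2016)
    to 2020 precinct results (B, p x n_2020), with row-stochastic constraints."""
    n_2016, n_2020 = A.shape[1], B.shape[1]
    X = cp.Variable((n_2016, n_2020))
    objective = cp.Minimize(cp.norm(A @ X - B, "fro"))
    constraints = [
        cp.sum(X, axis=1) == 1,  # every 2016 voter group must go somewhere in 2020
        X >= 0,                  # voter flows cannot be negative
    ]
    cp.Problem(objective, constraints).solve()
    return X.value

# Toy example: 30 reporting precincts, 4 groups in 2016, 6 in 2020
rng = np.random.default_rng(0)
A = rng.integers(50, 500, size=(30, 4)).astype(float)
B = A @ rng.dirichlet(np.ones(6), size=4)  # fabricate 2020 results from a known flow
X_hat = estimate_transition_matrix(A, B)
```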

As with every optimization problem, the questions that need to be answered are (1) whether this problem has a solution and (2) if so, whether that solution is unique.

To see that (1) is true, let us assume that all Hillary Clinton voters go to Joe Biden, all Bernie Sanders voters remain with Sanders, and all 2016 minor-candidate voters and all non-voters stay home in 2020. It is clear that this is a feasible point for our optimization problem, as it doesn't violate any constraints. Therefore we know that the feasible set is non-empty.

To answer (2), let us note that we are minimizing the Frobenius norm of the matrix difference, which is defined as:

\( ||A||_F = \sqrt{\sum_{i = 1}^m \sum_{j = 1}^n |a_{i, j}|^2} \)

Therefore we are minimizing the square root of a sum of squares, which is a convex function of \( X \). Minimizing this norm is equivalent to minimizing its square, which is strictly convex as long as \( A \) has full column rank, so the minimizer is unique. Since we know from (1) that the feasible set is non-empty, and the feasible set is also closed and bounded, the objective attains its minimum over the set of feasible solutions. In other words, we can solve this optimization problem.

We also run a second model without constraints. Its estimates are generally slightly better (especially early in the night), but its parameters are less interpretable, which is why we prefer the constrained model.

After we have estimated the 2016 to 2020 voter flow, we can apply these estimates to the outstanding precincts in 2020 to get a sense for where the rest of the night might be heading. We can add in the vote totals of the precincts that have already reported to get predictions for vote totals by candidate.
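Continuing the sketch above, that projection step is just a matrix product followed by a sum. Here `A_outstanding` and `B_reported` are illustrative names for the 2016 results of the not-yet-reported precincts and the 2020 votes already counted.

```python
def predict_totals(X_hat, A_outstanding, B_reported):
    """Project statewide 2020 totals: apply the estimated flows to precincts that
    have not yet reported and add the votes already counted."""
    predicted_outstanding = A_outstanding @ X_hat   # expected 2020 results per outstanding precinct
    return predicted_outstanding.sum(axis=0) + B_reported.sum(axis=0)
```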

As mentioned above, the parameters we are estimating can be interpreted as the proportions of each 2016 candidate's supporters who moved to support each 2020 candidate. To see why this is useful, let us apply the same model to the results of the 2008 and 2016 primaries.

Note: 2012 did not feature a competitive primary among Democrats.

We can now visualize the estimated voter flow using a Sankey diagram. Our model believes that Hillary Clinton retained most of her voters between 2008 and 2016, while some of her support went to Bernie Sanders or became non-voters. Unsurprisingly, the model thinks that most John Edwards supporters decided to vote for Bernie Sanders in 2016, while a large chunk became non-voters. Since both candidates attracted younger and more progressive voters, this split makes sense. Finally, the model estimates that a little less than half of Obama voters in 2008 decided to vote for Hillary Clinton, while a slight majority voted for Bernie Sanders. This split in Obama voters is interesting and is probably due to age differences. Areas that had many young Obama voters (such as college towns) were more likely to vote for Bernie Sanders in 2016, while areas with older Obama voters probably saw most of their support go to Hillary Clinton.

Of course some of the results we see above are caused by error in the model — and there are multiple potential sources of error.

One major problem with our model is that it is subject to a type of ecological inference fallacy. We are trying to estimate the behavior of a subgroup of individuals — Clinton supporters — from aggregate data — Clinton, Sanders and O’Malley supporters and non-voters. This can lead to estimates that are quite wrong. The thing to remember is that the model is not a perfect estimate of how Clinton and Sanders voters have changed their preferences; it is a model of voter shift that is consistent with the election results observed.

An issue that comes with this ecological inference fallacy is the order in which precincts report and what that might do to the estimate. To illustrate this, let us assume that Bernie Sanders retains 80% of his 2016 support in precincts with large numbers of college students but only 40% of his support in other precincts. Now assume that the college precincts report early. In this case, our model would overestimate the fraction of supporters that Bernie Sanders keeps in all other precincts, thus overestimating his support in general. So it’s important for us to run our model only once we are confident that precincts from many different regions of the state have reported. Typically, we’ve been waiting until 15-20% of precincts have reported.

But even once enough precincts from different areas have reported, the ecological inference problem continues to crop up and is definitely something we will be looking out for by comparing the model parameters to exit polls. One potential solution would be to run multiple versions of our model: one for urban precincts, one for suburban precincts and one for rural precincts. The problem is that some states have few reporting units — there are only 260 townships in New Hampshire, for example — and we could then have too little data for reasonable estimates of the transition matrices.

A Vary Model Sequel

As with every model we produce, we are interested in estimating the uncertainty that comes with our predictions. And you’ll remember from the introduction that primaries leave us wanting a new way of estimating the covariance of our predictions because we do not have relevant historical elections to use.

We estimate our covariance using the bootstrap. The bootstrap allows us to estimate the sampling distribution of our prediction, which gives us an estimate of its covariance. The benefit of this method is that it makes no distributional assumptions about the data.

We run the bootstrap \( B \) times: for each \( b \in \{1, \dots, B\} \), we uniformly sample \( p \) precincts with replacement from the training set (i.e., the precincts that have reported so far) and use that sample to construct a new estimator \( \hat{X}_b \).

We then use the set of bootstrapped estimators \( \{\hat{X}_1, ..., \hat{X}_B \} \) to estimate the covariance of \( \hat{X} \).

In order to generate the uncertainty estimates for the second (unconstrained) model, we want a covariance estimate for our predictions. This means we multiply each \( \hat{X}_b \) by the outstanding precincts to generate \( \hat{Y}_b \). We then compute the covariance for \( Y \) using \( \{\hat{Y}_1, ..., \hat{Y}_B \} \).
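As a rough sketch of this procedure (reusing the illustrative `estimate_transition_matrix` helper from above, with a made-up number of replicates):

```python
def bootstrap_prediction_covariance(A, B, A_outstanding, n_boot=200, seed=2020):
    """Resample reporting precincts with replacement, refit the transition matrix,
    and estimate the covariance of the predicted outstanding vote totals."""
    rng = np.random.default_rng(seed)                  # fixed seed so resamples are reproducible
    p = A.shape[0]
    totals = []
    for _ in range(n_boot):
        idx = rng.integers(0, p, size=p)               # p precincts drawn with replacement
        X_b = estimate_transition_matrix(A[idx], B[idx])
        totals.append((A_outstanding @ X_b).sum(axis=0))
    return np.cov(np.vstack(totals), rowvar=False)     # covariance across bootstrap replicates
```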

A fun note on the implementation: to speed up the computation of the bootstrap we multi-thread the process. One thing to be careful about is that the random draws across threads need not be independent of each other, and independent resamples are a core assumption of the bootstrap. Remember to set a seed!

One of the main assumptions of the bootstrap is that the observations are independent and identically distributed. It’s unlikely that this is the case for election results. Generally, smaller precincts are more likely to report earlier than larger precincts and the results in any one precinct are not going to be independent from the results in another. This means that our procedure is likely underestimating the uncertainty somewhat, and in a way that we’re unable to quantify.

Note: We love to talk to people about math. If you have a solution to this problem — perhaps some form of the wild or pairwise bootstrap? — please feel free to reach out.

In order to mitigate the underestimation of uncertainty in the second model, we add in an estimate of the variance (we compute the mean squared error using the jackknife). For each precinct, we compute the parameters using all other precincts (leaving that one out) and compute the residual for the held-out precinct; we then take the average of the squared residuals as an estimate of the variance. This gives us intervals that are akin to prediction intervals instead of confidence intervals. You can read more about the difference in our previous blog post.
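A sketch of that leave-one-out loop, again under the same illustrative names:

```python
def jackknife_mse(A, B):
    """Leave one precinct out, refit, and average the squared residuals of the
    held-out precinct to get a variance estimate."""
    p = A.shape[0]
    squared_residuals = []
    for i in range(p):
        keep = np.arange(p) != i
        X_i = estimate_transition_matrix(A[keep], B[keep])
        residual = B[i] - A[i] @ X_i                   # residual for the held-out precinct
        squared_residuals.append(residual ** 2)
    return np.mean(squared_residuals, axis=0)
```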

Uncertain Terms

We now want to use our estimate of the covariance to approximate the uncertainty in our predictions. In our previous model, we used an F-distribution, together with estimates of the covariance and mean squared error, to produce prediction intervals. For the primaries, instead of generating a prediction interval, we are more interested in samples from the distribution of our predictions.

In that previous model, using an F-distribution was justified by the normality assumption of the errors in Ordinary Least Squares (OLS), which follows from the application of the central limit theorem. In this case, we are not using OLS to estimate our parameters, and our parameters definitely aren’t normal this time. We know this because they are bounded between zero and one and sum to one.

Instead, the parameters can be modeled as random variables that follow a Dirichlet distribution. Dirichlet random variables are bounded between zero and one and sum to one, just like our parameters. So we can sample from that distribution to create variations of our results, and we use those samples to construct our confidence intervals.

The issue with the standard Dirichlet distribution is that it is defined by a single vector of parameters. This means that the mean and variance are determined simultaneously and cannot be set independently of each other. Also, standard Dirichlet random variables are only loosely dependent: since they sum to one, the last variable has to be one minus the sum of the rest, but we cannot actually specify how they are related to each other.

So, instead of using a Dirichlet distribution, we decided to use a generalized Dirichlet distribution. This distribution is defined by two vectors, which means we have the extra degrees of freedom necessary for setting the mean and the variances independently of each other. Furthermore, this distribution allows for covariances between the different elements of the distribution.

We have parameter estimates from our model, which we use as the means of our generalized Dirichlet random variables, and we have a covariance matrix, which we computed using the bootstrap. The generalized Dirichlet distribution is defined by two vectors of parameters, \( \alpha \) and \( \beta \); we now need to match these parameters to our empirical estimates of the moments, a process called the method of moments.

From the Wikipedia page, we can see that the general moment function is:

\( E[X_1^{r_1} X_2^{r_2} ... X_k^{r_k}] = \prod_{j = 1}^k \frac{\Gamma(\alpha_j + \beta_j) \Gamma(\alpha_j + r_j) \Gamma(\beta_j + \delta_j)}{\Gamma(\alpha_j) \Gamma(\beta_j) \Gamma(\alpha_j + \beta_j + r_j + \delta_j)} \)

Where \( \delta_j = r_{j+1} + r_{j + 2} + ... + r_{k} \) for \( j = 1, 2, ..., k-1 \) and \( \delta_k = 0 \).

Helpfully, the Wikipedia page also gives us the first moment:

\( E[X_j] = \frac{\alpha_j}{\alpha_j + \beta_j} \prod_{m = 1}^{j - 1} \frac{\beta_m}{\alpha_m + \beta_m} \)

Plugging in the right values for the \( r_j \) and using the fact that \( \Gamma(x) = (x-1) \Gamma(x-1) \), we get the following two equations:

\( E[X_j^2] = \frac{(\alpha_j + 1) \alpha_j}{(\alpha_j + \beta_j + 1) (\alpha_j + \beta_j)} \prod_{m = 1}^{j - 1} \frac{(\beta_m + 1) \beta_m}{(\alpha_m + \beta_m + 1)(\alpha_m + \beta_m)} \)

\( E[X_i X_j] = \frac{\alpha_i \beta_i}{(\alpha_i + \beta_i) (\alpha_i + \beta_i + 1)} \frac{\alpha_j}{\alpha_j + \beta_j} \prod_{s = 1}^{i - 1} \frac{\beta_s (\beta_s + 1)}{(\alpha_s + \beta_s + 1)(\alpha_s + \beta_s)} \prod_{t = i + 1}^{j - 1} \frac{\beta_t}{\alpha_t + \beta_t} \) for \( i < j \)

Note: If you happen to find an error in the solution, please do let me know

This means we can set up the following system of equations for all \( i \) and \( j \):

\( \mu_j = E[X_j] \)

\( \sigma_j^2 = E[X_j^2] - E[X_j]^2 \)

\( \sigma_{ij} = E[X_i X_j] - E[X_i] E[X_j] \)

Finally, we solve this system of equations to get estimates for \( \alpha \) and \( \beta \), and using those estimates we can sample from the generalized Dirichlet distribution.
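Given \( \alpha \) and \( \beta \), sampling from the generalized Dirichlet can be done with the standard stick-breaking construction using independent Beta draws. The sketch below assumes the moment-matching step has already produced the two parameter vectors.

```python
def sample_generalized_dirichlet(alpha, beta, size, rng):
    """Stick-breaking sampler: Z_j ~ Beta(alpha_j, beta_j) independently,
    X_j = Z_j * prod_{m < j} (1 - Z_m)."""
    k = len(alpha)
    X = np.empty((size, k))
    remaining = np.ones(size)              # fraction of probability mass not yet allocated
    for j in range(k):
        Z = rng.beta(alpha[j], beta[j], size=size)
        X[:, j] = Z * remaining
        remaining *= 1.0 - Z
    return X                               # components of each row sum to at most one
```

Each row of the output is one draw of the first \( k \) components; the last component is simply one minus their sum.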

For our unconstrained model we do something slightly less complicated: we sample from a multivariate t-distribution with n-2 degrees of freedom, using our predictions as the means and our bootstrapped covariances (with the additional variance estimate added in) as the covariance. The justification is similar to the one for our previous model: the central limit theorem lets us assume that our errors are approximately normally distributed, but since we estimated the mean and covariance ourselves, we choose a t-distribution rather than a normal distribution because its tails are thicker.
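One standard way to generate such multivariate t samples (a sketch, not necessarily how we implement it) is to divide multivariate normal draws by an independent chi-square factor:

```python
def sample_multivariate_t(mean, cov, df, size, rng):
    """Multivariate t via a normal/chi-square mixture: shift scaled normal draws by the mean."""
    normal = rng.multivariate_normal(np.zeros(len(mean)), cov, size=size)
    chi2 = rng.chisquare(df, size=size)
    # Note: with shape matrix `cov`, the covariance of the draws is cov * df / (df - 2).
    return mean + normal / np.sqrt(chi2 / df)[:, None]
```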

We can now visualize these samples to show the uncertainty in our estimates.

Conclusion

With every model, there is much we could have tried and much we would like to try in the future. For future primaries, we’re thinking about building a model that incorporates demographic information of the precincts or running multiple models that are specific to geographic areas — e.g., the urban/rural split we discussed above. We’d also like to find a better way of estimating the covariance matrix, ideally one for which we are less certain about assumption violations.

Generally, in posts like these, we’re interested in explaining the choices we made and why we made them. But we’re also interested in illustrating places where we know our model is imperfect and some scenarios where things might go wrong. We do this because we think it’s important to explain what our models can and cannot do.

In some cases, we chose methods that are easier to understand or better justified over methods that were possibly more accurate but also more complex. This is because we think there is value in having our models be easier to explain and understand. We believe our mission is not to have the best possible estimate on primary night but instead to have one we can quickly and accurately explain to readers while also giving them insight into what is happening.

We ran this model internally for New Hampshire and we’ll be running it for South Carolina and many of the Super Tuesday contests. If you’d like to see it in action, watch The Washington Post’s live show on Tuesday, March 3rd.

Special thanks to Walter Schachermayer, Drew Dimmery, John Cherian, Idrees Kahloon, Akosua Busia, G. Elliott Morris, Kabir Khanna, David Byler, Al Johri and Evan Rosenman for their advice and insight.