Several times during this election cycle, you may have noted that we posted these lovely estimated vote-share change diagrams on Twitter:

The flows in these diagrams are the parameters of a model that The Washington Post uses to predict expected votes remaining for various candidates during single-party primaries. You can read more about that model in a previous post.

Ecological Inference: A Niche Problem

To save you a click, we described a problem with our previous model that was in need of a solution. That problem is one of ecological inference. The word “ecological” doesn’t refer to anything green when we’re talking about statistics. We’re using it to describe the fallacy of using aggregated data to infer individual behavior. In particular, we hoped to infer the voting patterns of subgroups of people who voted for Hillary Clinton or Bernie Sanders in 2016 — or people who didn’t vote in 2016 at all. Here’s a snippet that explains why this is so challenging:

One major problem with our model is that it is subject to a type of ecological inference fallacy. We are trying to estimate behaviour of a subgroup of individuals — Clinton supporters — from aggregate data — Clinton, Sanders, O’Malley supporters and non-voters. This can lead to estimates that are quite wrong. The thing to remember is that the model is not a perfect estimate for how Clinton and Sanders voters have changed their preferences. It is a model of voter shift that is consistent with the election results observed.

An issue that comes with this ecological inference fallacy is the order in which precincts report and what that might do to the estimate. To illustrate this, let us assume that Bernie Sanders retains 80% of his 2016 support in precincts with large numbers of college students but only retains 40% of his support in other precincts. Now assume that the college precincts report early. In this case, our model would overestimate the fraction of supporters that Bernie Sanders would keep in all other precincts, thus overestimating his support in general. Thus, it’s important for us to only run our model when we are confident that precincts from many different regions in the state have reported. Typically, we’ve been waiting until 15-20% of precincts are reporting.

Trying to make subgroup inferences from population data is a common problem. For example, ecological inference comes up occasionally in voting rights cases. After the Thornburg v. Gingles Supreme Court decision, one way to win cases under the 1965 Voting Rights Act was to demonstrate racial voting patterns. To illustrate how that would work, you might use data to show that precincts with higher Black populations tended to cast more votes for candidate A, in order to infer that Black voters tend to vote for candidate A.

Like any inference, this is based on certain prior assumptions about which kinds of voting patterns are more or less plausible. Unfortunately, it’s very possible for this inference to be wrong. For instance, it’s possible that mostly White people voted for candidate A, but only in places where they had more Black neighbors.

To see how an inference might go wrong in practice, let us imagine there are two counties. In one county, Clinton won 60% of the 2016 vote and Sanders won 40%. In the other county, the vote was split 45% for Clinton and 50% for Sanders. Assume that the county with a majority of Clinton voters voted for Biden and the split county voted for Sanders in 2020. At first glance, this result is consistent with the hypothesis that Clinton voters voted for Biden and Sanders voters remained with him. However, this instinct is not supported by the evidence, and is an example of the ecological inference fallacy. Consider the possibility that Sanders voters in the majority-Clinton county were driven towards Biden because they are more likely to know and be friends with Clinton voters — just a few Clinton voters had to vote for Biden to give him a majority in that case. To avoid this fallacy, we must be more careful when drawing conclusions about voter subgroups from aggregated data.
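Here is a small numeric sketch of that scenario (the counts are invented, and we ignore the minor candidates to keep it simple): two very different individual-level flow tables reproduce exactly the same county-level totals, so the aggregates alone cannot distinguish them.

```python
import numpy as np

# 2016 support per county: columns are (Clinton, Sanders) voters.
x = np.array([[600, 400],    # county 1: Clinton 60%, Sanders 40%
              [450, 500]])   # county 2: Clinton 45%, Sanders 50%

# Scenario A: every Clinton voter -> Biden, every Sanders voter -> Sanders.
# flows[r, c] = fraction of 2016 group r voting for 2020 candidate c.
flows_a = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

# Scenario B, county 1 only: all Sanders voters defect to Biden, while
# two-thirds of Clinton voters switch to Sanders.
flows_b = np.array([[1/3, 2/3],
                    [1.0, 0.0]])

totals_a = x @ flows_a                  # both counties under scenario A
totals_b = np.vstack([x[0] @ flows_b,   # county 1 under scenario B
                      x[1] @ flows_a])  # county 2 unchanged

print(totals_a)  # [[600. 400.], [450. 500.]]: Biden wins county 1
print(totals_b)  # identical totals from radically different voter behavior
```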

A Brief Regression

The first statistical technique to approach ecological inference was called “ecological regression”. Though it’s been superseded, it’s worthwhile to take a brief look at how it worked and, more importantly, how it could fail to work, as it will be instructive for other methods.

Ecological regression works when there are two pre-existing groups that we’d like to map to two behaviors. The idea is to simply plot each geographic area as a point, with the percentage of the first group on the x axis and the percentage of the first behavior on the y axis, then run a linear regression. You extrapolate the line out to see what would happen for an area which had 0% or 100% of the first group — and thus, 100% or 0% of the second group.
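Before looking at real data, here is a minimal sketch of the technique on synthetic data (all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 synthetic areas: x = share of voters in group 1,
# y = share of the area exhibiting behavior 1.
x = rng.uniform(0.15, 0.55, size=200)
y = 0.80 - 1.0 * x + rng.normal(0, 0.03, size=200)

slope, intercept = np.polyfit(x, y, 1)

# Extrapolate to hypothetical areas that are 0% or 100% group 1.
# Note that nothing stops the line from leaving the [0, 1] range.
print(f"inferred rate for group 2 (x = 0): {intercept:.2f}")          # ~0.80
print(f"inferred rate for group 1 (x = 1): {intercept + slope:.2f}")  # ~-0.20
```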

For example, here’s a plot of how the 650 constituencies in the UK voted on the 2016 Brexit referendum, with the percentage of voters who have a post-secondary degree on the x axis:

There is a clear correlation that appears roughly linear. Constituencies where more voters have a degree tended to vote to remain in the EU in higher percentages.

We might then ask ourselves: What is the probability that a voter with a post-secondary degree voted for Brexit? By itself, the data cannot answer this question, as it only gives us information about populations with a certain percentage of degree-holders; an individual, however, either has the degree or doesn’t. A simple way to estimate this individual probability, then, is to imagine a constituency where 100% or 0% of the voters have a degree. But if you extend the regression line out to a hypothetical constituency where 100% of the voters had a degree, there is a problem:

This simple model is inferring that Brexit was supported by 79% of people without a degree... and negative 16% of those with one? Of course, it’s impossible for a group to have voted 116% to remain.

What actually happened is that in constituencies with a high proportion of degree-havers, even the non-degree-havers are less likely to support Brexit; but the model did not allow for this possibility. This is another example of the ecological fallacy.

This is why it was so important to note that the diagrams we produced for the 2020 primaries were vote-share change diagrams and, critically, not actual voter-flow diagrams. These diagrams cannot represent actual voters because we cannot infer the decisions that Hillary Clinton voters from 2016 made in 2020. Instead, they represent the geographic correlation between two candidates and estimate the aggregate movement from one group of voters to another.

Model - We may have run out of model-related puns

Thankfully, scholars have developed novel statistical techniques to help us mitigate this problem. Let’s reconsider our Brexit example. While we may not be able to pinpoint the precise individual behavior of Brexit voters, we can estimate the possible individual-level behaviors that are consistent with the aggregate data we observe by quantifying the uncertainty in our ecological inference. In order to quantify that uncertainty, we rely on the “RJKT model”, so-called because it was originally proposed by Rosen, Jiang, King and Tanner in 2001 (you can read their paper here). In many ways, this model is similar to our original turnout model — for instance, neither of these models can ever give impossible estimates like the -16% vote share we got from the simple ecological regression algorithm above. However, the RJKT model, unlike our original turnout model, uses Bayesian inference to make sure that it isn’t over-fitting or under-fitting to the data we give it. This can’t guarantee it will avoid the ecological fallacy, but it certainly helps make that problem rarer in practice.

Our updated model assumes that the number of people that voted for each candidate in 2020 follows a multinomial distribution. The probability for each category is the sum, over the 2016 groups, of the fraction of people that moved from that 2016 candidate to the 2020 candidate, times the share of the precinct that voted for that 2016 candidate — so far the same as our model, except for the distributional assumptions.

\(Y_{\vec{rc},p}\sim \operatorname{Multinomial}\left(\sum_r x_{rp},\ \left[\frac{\beta_{rcp}\,x_{rp}}{\sum_{r'} x_{r'p}}\right]\right)\)

where \(r\) is an index over the 2016 voting categories (Clinton/Sanders/nonvoter), \(c\) is an index over the 2020 voting categories (Biden/Sanders/etc.) and \(p\) is an index over the precincts.

\(Y_{\vec{rc},p}\) is the number of voters in precinct \(p\) that voted for candidate \(r\) in 2016 and candidate \(c\) in 2020.

\(x_{rp}\) is the number of voters in precinct \(p\) that voted for candidate \(r\) in 2016.

\(\left[\frac{\beta_{rcp}x_{rp}}{\sum_{r'} x_{r'p}}\right]\) is the probability vector whose elements are given by the formula \(\frac{\beta_{rcp}x_{rp}}{\sum_{r'} x_{r'p}}\). That is, the precinct-specific probability of a voter who voted for candidate \(r\) in 2016 voting for candidate \(c\) in 2020, multiplied by the fraction of the precinct’s 2016 voters in group \(r\). Because each \(\beta_{r\cdot p}\) sums to one over \(c\), these elements sum to one over all \((r,c)\) pairs.

We further assume that the vector of fractions of people that went from a 2016 candidate to a 2020 candidate follows a Dirichlet distribution, with parameters that follow an exponential distribution. Finally, we use Markov Chain Monte Carlo to solve this Bayesian inference problem.
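The following is not our production code, but a minimal sketch of this model in PyMC with toy data; the array names, dimensions and the exponential rate are assumptions for illustration. Note that in practice only the per-candidate 2020 totals are observed, so the sketch models those rather than the joint \(Y_{\vec{rc},p}\) counts.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# Toy data standing in for the real precinct tables:
# x[p, r] = 2016 vote counts in precinct p for group r (Clinton/Sanders/...),
# y[p, c] = 2020 vote counts in precinct p for candidate c (Biden/Sanders/...).
P, R, C = 50, 3, 4
x = rng.integers(100, 1000, size=(P, R))
n = x.sum(axis=1)
y = np.stack([rng.multinomial(n[p], np.full(C, 1 / C)) for p in range(P)])

with pm.Model():
    # Exponential hyperpriors on the Dirichlet concentration parameters.
    alpha = pm.Exponential("alpha", lam=0.25, shape=(R, C))

    # beta[p, r] is a probability vector over 2020 candidates: the
    # precinct-specific flow fractions from 2016 group r to candidate c.
    beta = pm.Dirichlet("beta", a=alpha, shape=(P, R, C))

    # Chance that a random voter in precinct p votes for candidate c:
    # sum over r of (share of group r in precinct) * (flow from r to c).
    theta = (beta * (x / n[:, None])[:, :, None]).sum(axis=1)

    # Observed 2020 totals per precinct and candidate.
    pm.Multinomial("y_obs", n=n, p=theta, observed=y)

    # Markov Chain Monte Carlo over the posterior of the flows.
    trace = pm.sample()
```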

Put another way, voter behavior in 2020 is assumed to be made up of independent random choices, with probabilities of choosing each 2020 candidate that depend only on their 2016 behavior and their precinct. The variation of these probabilities across precincts is assumed to be completely random, though with Dirichlet parameters controlling whether a given 2016 group’s behavior is relatively uniform or whether it varies widely.
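For intuition, here is a toy illustration (the concentration values are made up, not fitted) of how the Dirichlet parameters control that spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Flow probabilities for one 2016 group, drawn for 1,000 precincts.
# Same mean behavior (60/30/10), very different cross-precinct spread.
tight = rng.dirichlet([300, 150, 50], size=1000)   # large concentrations
loose = rng.dirichlet([3.0, 1.5, 0.5], size=1000)  # small concentrations

print(tight.std(axis=0))  # ~[0.02, 0.02, 0.01]: nearly uniform behavior
print(loose.std(axis=0))  # ~[0.20, 0.19, 0.12]: behavior varies widely
```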

One technical flaw in this model is that it allows for a small amount of random variation in how many of each group are in each precinct, even if those numbers are actually known exactly. This variation makes fitting the model easier, but tends to “soak up” some of the variability that should be attributed to behavior, leading to credible intervals in the output that can be too narrow in practice.

This model doesn’t completely solve the ecological fallacy we saw above with ecological regression, because no model can do so perfectly. However, as a true Bayesian model, it can give us a principled logical inference based on what one could conclude if its basic assumptions were true. Bayesian models work well if their assumptions are realistic at the lowest level while still allowing for meaningful variation at the highest level. This model clears both of those bars. It models individual voter behavior in at least a semi-realistic manner — a multinomial model in which each voter decides independently using group-specific probabilities — while still allowing for systematic variation between precincts.

As we said, this model isn’t perfect and doesn’t solve the ecological inference issue in its entirety either. But in order to get as close as possible, and to make the problem tractable, this model makes one key assumption as well as one minor simplification.

Key assumption(s)

This model assumes that voting behavior — that is, each voter’s chance of voting for each candidate — depends only on who they voted for in 2016 and their location. Furthermore, we assume that the behavior of each 2016 candidate’s supporters varies randomly around some central average.

In the simplest version, this random variation is assumed not to correlate with anything else in the model. For example, assume that 2016 Hillary Clinton voters had a 30% chance of voting for Elizabeth Warren in 2020, with a standard deviation of 5%. Then in one precinct, they might have a 35% chance of doing so, while in the precinct next door the probability might be 25%.

There is an extended version of the model which allows this cross-precinct variation in group behavior to correlate with a single measurable characteristic of each precinct, such as latitude or median income. As explained below, we checked to see if using covariates would improve our estimates, and it seemed likely that doing so would change the results only marginally, so we decided to use only the simpler model.

Minor simplification(s)

This model assumes that each voter, upon entering the voting booth, randomly sampled which 2016 candidate they had supported, with probability equal to the fraction of people that had supported each 2016 candidate. For example, if 55% of voters had voted for Hillary Clinton and 45% of voters had voted for Bernie Sanders, then, when entering the voting booth, each person would have a 55% chance of having been a Clinton supporter and a 45% chance of having been a Sanders supporter. They then would choose their 2020 candidate based on the assumption above. This mechanism is not how voters actually choose who to vote for, but the impact on the conclusions of the inference should be relatively small; with hundreds to thousands of voters in each precinct, the law of averages (or as mathematicians call it, the law of large numbers) begins to kick in and the difference between “20% of voters are in group X” and “each voter has a 20% probability of being in group X” becomes minor.
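A quick simulation with toy numbers shows how small that difference is at precinct scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n_voters, p_group = 1000, 0.20            # precinct size; share in group X

# "Exactly 20% of voters are in group X" versus "each voter independently
# has a 20% chance of being in group X", repeated over 10,000 precincts.
fixed = np.full(10_000, int(n_voters * p_group))
sampled = rng.binomial(n_voters, p_group, size=10_000)

print(fixed.mean(), sampled.mean())  # both average ~200 voters
print(sampled.std())                 # ~12.6 voters: only ~1.3 points of wobble
```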

Additionally, recall that the possible groups of 2016 voters are: Clinton voters; Sanders voters; other candidate voters; didn’t vote in 2016; weren’t eligible in 2016. What does it mean that the model thinks a given 2020 voter “wasn’t eligible” in 2016? It could be voters who were underage in 2016, who were unregistered, or who moved into the precinct between 2016 and 2020. But in college towns, the total number of student voters was probably roughly flat between 2016 and 2020 (though some evidence suggests a bit higher in New Hampshire and a bit lower in Virginia). Even though such precincts would have relatively many first-time voters in 2020 compared to other precincts, the model will tend to misclassify those voters as being the “same people” as the ones who occupied their dorm rooms back in 2016. This may or may not be correct — a four-year degree often takes longer than four years to achieve — but critically, it should not meaningfully affect the aggregate flow estimates.

Results

We ran this model using data from the New Hampshire and Virginia 2020 primaries in order to compare the new voter-flow estimates to those that came out of the original model.

Generally, the estimated vote-share diagrams look quite similar to the original ones for New Hampshire and Virginia, which is good! But there are some important differences, which are largely the same across both states — indicating that they may stem from a flaw in the original model rather than from state-specific noise.

We can see that Elizabeth Warren appears to have gotten fewer 2016 Sanders voters than our model had initially estimated. New Hampshire exit polls had suggested that Warren had received 12% of Clinton’s vote and 5% of Sanders’ vote. While there may have been some confusion surrounding that question in the exit polls, the new model’s estimates are more in line with those numbers than what the original model had produced.

Another aspect that is now in line with the exit polls is that the new model thinks Joe Biden got fewer of the 2016 non-voters than the original model had estimated in New Hampshire. Exit polls had said that Biden had gotten around 2% of those voters, while the original model had put that number closer to 20%. Interestingly, the original model seems to have been correct about Biden galvanizing 2016 non-voters in Virginia. Correspondingly, the estimated flow from 2016 Hillary Clinton voters to 2020 non-voters is larger than the original estimate.

Most importantly, and unsurprisingly, our initial model underestimated the number of 2016 non-voters that voted for Bernie Sanders in 2020. We’ve previously discussed why this might have happened.

As we mentioned above, while the issue is mitigated slightly, it’s still present, so this estimate is very likely still an underestimate of the true non-voter-to-Sanders flow.

Another glaring difference is that our new vote-share change diagram has a lot more flows than the original one. Viewers of the original diagrams may have been surprised to see that there was no flow from Sanders to Biden in either New Hampshire or Virginia, even though they might know voters who chose those two candidates in the respective elections. The reason for this is that our original model, similarly to regularization in lasso regression models, pushes estimates that are close to zero all the way to zero. The technical reason is that we were looking for a constrained optimization solution, which tends to lie on the edge of the simplex of feasible solutions (see the sketch below). Now, however, we have no such constraint, and so our new model is allowed to estimate the small flows too.
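Here is a toy sketch of that boundary effect using a bounded least-squares fit (scipy’s lsq_linear, standing in for our original, more elaborate model; we bound each flow to [0, 1] rather than enforcing the full sum-to-one constraint, but the clamping behavior is the same):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)

# x[p, r]: 2016 vote shares per precinct for three groups. The true flow
# from group 2 to this 2020 candidate is exactly zero.
x = rng.dirichlet([5.0, 4.0, 2.0], size=200)
true_flows = np.array([0.70, 0.00, 0.30])
y = x @ true_flows + rng.normal(0, 0.02, size=200)

unbounded = lsq_linear(x, y)                   # can dip slightly negative
bounded = lsq_linear(x, y, bounds=(0.0, 1.0))  # clamps at the boundary
print(unbounded.x)  # the middle estimate goes negative for many noise draws
print(bounded.x)    # ...and the bounded fit then returns exactly 0.0
```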

One thing that does remain the same across the models and approaches is the difference between the elections in New Hampshire and Virginia. The diagrams show to what extent Joe Biden was able to unify voters that had not voted for Bernie Sanders behind him. This is largely the reason why he is the presumptive Democratic nominee.

Checks And Balances

The ecological inference model we used to derive the above numbers is simple. It assumes the voter behavior in any two precincts is drawn independently from a random distribution, without any correlation with other observable characteristics — covariates — of the precincts except for their 2016 voter behavior. In fact, we know that it’s possible for the between-precinct variation to correlate with other things, such as average income, population density, proportion of White voters, etc.

To check whether we needed to add any of these real-world features to our model, we measured the correlation between the parameters from the simple model — which correspond to the probability that a voter from a given 2016 group in a given precinct would vote for a given candidate in 2020 — and various possible covariates, including the three mentioned above. This small test would indicate whether adding covariates would improve the model fit, though it will tend to underestimate the potential advantage somewhat. In no case was the average r² we estimated from this check greater than 0.05. Overall, this suggests that while it may be possible to improve the model somewhat by adding covariates, it’s unlikely that doing so would lead to substantially different overall conclusions that would be visible in the flow diagrams. In other words, knowing the 2016 vote in a precinct (as well as the overall patterns of behavior from other precincts) can get you most of the way towards predicting the 2020 behavior; adding covariate information is likely to improve that prediction only slightly.
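Schematically, that check looks like the following (the function and inputs are our own naming for this sketch, not production code):

```python
import numpy as np

def covariate_r2(beta_mean, covariate):
    """Squared correlation between per-precinct flow estimates and a covariate.

    beta_mean: (P, R, C) posterior means of the flow fractions from the
               simple, covariate-free model.
    covariate: (P,) one value per precinct, e.g. median income.
    """
    P, R, C = beta_mean.shape
    r2 = np.empty((R, C))
    for r in range(R):
        for c in range(C):
            corr = np.corrcoef(covariate, beta_mean[:, r, c])[0, 1]
            r2[r, c] = corr ** 2
    return r2  # every such r-squared we measured stayed below 0.05
```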

Conclusion

For future primaries, The Washington Post will use this new approach to estimate the voter flows between candidates.

Many of you reached out with questions related to the original numbers we posted. We’re very grateful you did so, because we take your questions seriously. We hope this blog post sheds some light on the issues that come with overinterpreting the parameters from our old model and how our new model mitigates that issue.

As always, please reach out with questions — we’d love to hear from you!