Forecast: The PA House

The midterm elections are only two weeks away. Finally. Thankfully.

All eyes have been focused on the national elections. But in this space I’ve been looking at races for the PA General Assembly’s lower house. At first glance the House could appear out of reach for Democrats: the districts are famously gerrymandered, and the State Supreme Court’s February decision didn’t affect the state legislative boundaries. Democrats are down by 37 seats heading into the race.

On the other hand, 19 districts that voted for Republican representatives in 2016 also voted for Clinton. And it looks like the state could have higher turnout than any midterm since at least 2002. Could the Blue Wave possibly sweep down to the state races, and change our state houses? Today I present my forecast.

Modeling the PA House
Because it’s nerve-wracking to publish predictions into the world, and because I don’t have nearly the extent of data to do this responsibly (9 elections since 2002 doesn’t give me much to work with), I’m going to walk through some preamble before we get to the predictions. What information does the model use? When it inevitably is wrong, why will that be?

The challenge in forecasting these state races is that we don’t have polling. This is a huge problem, because public sentiment in each district is correlated with sentiment in all the others. If you don’t have some indication of what “type” of election it’s going to be, your prediction has to cover every option from Blue Wave to Red Wave, and the resulting error bars are so gigantic that the forecast is useless.
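
To see how much a shared statewide swing widens the error bars, here is a minimal sketch. It is not the fitted model: every district starts as a pure toss-up, and the noise sizes (`shared_sd`, `local_sd`) are made-up assumptions chosen only for illustration.

```python
import random
import statistics

def simulate_seat_counts(n_districts, shared_sd, local_sd, n_sims, seed=0):
    """Simulate Democratic seat counts when every district is a toss-up.

    Each district's margin = one shared statewide swing + its own local noise.
    All inputs are hypothetical; this is not the post's fitted model.
    """
    rng = random.Random(seed)
    counts = []
    for _sim in range(n_sims):
        # One draw per simulated election, shared by every district.
        swing = rng.gauss(0.0, shared_sd) if shared_sd > 0 else 0.0
        seats = 0
        for _district in range(n_districts):
            if swing + rng.gauss(0.0, local_sd) > 0:
                seats += 1
        counts.append(seats)
    return counts

# Same local noise in both runs; only the shared swing differs.
independent = simulate_seat_counts(203, shared_sd=0.0, local_sd=0.10, n_sims=1000)
correlated = simulate_seat_counts(203, shared_sd=0.05, local_sd=0.10, n_sims=1000)

spread_independent = statistics.pstdev(independent)
spread_correlated = statistics.pstdev(correlated)
print("SD of seat count, independent districts:", round(spread_independent, 1))
print("SD of seat count, correlated districts: ", round(spread_correlated, 1))
```

With independent districts the local noise mostly cancels out across 203 races. Once a shared swing is added, it moves every district in the same direction and the spread of seat counts grows several times over.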

To get around this, I borrow information from polls for the Congressional races, and measure how strongly congressional races have historically correlated with their local state races. It turns out that correlation is strong. So I pull in the current predictions from FiveThirtyEight’s House Model as a noisy estimate of each congressional outcome. Those predictions are highly suggestive of a strong Democratic year.

There are obviously a ton of other factors, including each district’s own history, incumbent candidates, and ways that certain districts tend to move together. I ended up choosing a fairly simple model, to account for the few years I have to work with. This means that it will capture the strongest trends, and leave the rest as uncertainty. The model relies mostly on (a) a district’s last election results for state house, congress, and the president/governor, and (b) the FiveThirtyEight predictions for the current governor and congressional races. It also has election and candidate random effects, to capture correlated outcomes in an election and certain historically over-performing candidates (e.g. 177’s John Taylor).

As such, I expect the model to do a good job of capturing the overall mood of the state, as well as the historic stickiness of incumbents in specific races. When a race has a new, unknown candidate, the model’s uncertainty about that race greatly increases, since it knows nothing about them. If you have a favorite candidate who has an awesome Twitter account, the model doesn’t know that. Instead, it shows the full range of outcomes implied by the historic variance in candidate quality.

How to read the model
It’s scary to release a forecast into the world, especially one that’s so bullish (see below…). Here’s how to read the results: It’s a simple accounting of a few features that clearly affect elections, weighted for how those features have mattered since 2004. How do you account for gerrymandered districts? For incumbency? For the fact that 2018 is a midterm, the increase in districts that Democrats are contesting, and the overall tenor of US Congress races? This model looks at the historical size of the effects, and weights them accordingly. What it doesn’t do is account for the fact that maybe this election is Different™, and won’t look like any we’ve seen in the last 12 years.

How my model could be wrong
Given the lack of historic data, it’s possible that I haven’t captured the full range of election types. Given the need for lagged variables, I’ve further limited the elections I model to 2006-2016. If this election looks wildly different from any of those, then my forecasts could be wrong.

In particular, if the intense attention to the midterm or the nationalization of state politics has changed the correlation between competitive races and turnout, the power of incumbency, or the quality of new candidates, the model could be very wrong, mostly in ways where all races swing together. Similarly, if Democrats are running fundamentally different types of candidates (perhaps more skilled, perhaps more leftist) than in the past, the model won’t capture whether the candidates fit the district better or worse than before.

Finally, one of the largest sources of uncertainty is new candidates. I give the model no information about them, so every time it sees a new candidate, it has to cover the full range of quality, from terrible to great. This increases the uncertainty in each race, and uncertainty shrinks the prediction towards 50-50. (If the Republicans are favored to win most seats by a narrow margin, for example, added uncertainty will more often switch Republican seats to Democratic than vice versa.)
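
The shrinkage towards 50-50 can be shown with a small worked example. This sketch assumes a race’s margin is normally distributed, which is a simplification and not necessarily how the model treats it; the 4-point lead and both standard deviations are hypothetical.

```python
from statistics import NormalDist

def win_probability(expected_margin, margin_sd):
    """P(margin > 0) when the margin is Normal(expected_margin, margin_sd)."""
    return 1.0 - NormalDist(mu=expected_margin, sigma=margin_sd).cdf(0.0)

# Hypothetical race: the Republican is expected to win by 4 points,
# so the expected Democratic margin is -0.04.
known_candidate = win_probability(-0.04, margin_sd=0.03)  # little uncertainty
new_candidate = win_probability(-0.04, margin_sd=0.08)    # unknown challenger

print("Democratic win probability, known candidates:", round(known_candidate, 2))
print("Democratic win probability, new candidate:   ", round(new_candidate, 2))
```

The expected margin is identical in both cases. Only the uncertainty changes, and the wider distribution puts more of its mass on the Democratic side of zero, so the underdog’s chances move closer to a coin flip.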

The Prediction
Enough hand wringing.

I predict that in two weeks the Democrats will win between 92 and 110 seats. On average in my simulations, they win 101.5 seats, which is annoyingly, bizarrely, exactly half of 203. They win the majority 53% of the time.
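
For readers curious how numbers like these come out of a set of simulations, here is a rough sketch. The draws below are made up and are not the actual simulations, and the 95% level is an assumption. The only facts taken from the post are the 203 seats in the chamber.

```python
import statistics

TOTAL_SEATS = 203  # size of the PA House

def summarize(seat_draws, level=0.95):
    """Mean, central interval, and majority probability from simulated seat counts."""
    draws = sorted(seat_draws)
    n = len(draws)
    tail = (1.0 - level) / 2.0
    low = draws[int(tail * (n - 1))]
    high = draws[int(round((1.0 - tail) * (n - 1)))]
    # A majority means strictly more than half of 203, i.e. at least 102 seats.
    majority = sum(1 for d in draws if d > TOTAL_SEATS / 2) / n
    return statistics.mean(draws), (low, high), majority

# Hypothetical draws, NOT the post's actual simulations.
example_draws = [92, 96, 99, 101, 101, 102, 103, 105, 108, 110]
mean_seats, interval, p_majority = summarize(example_draws)
print(mean_seats, interval, p_majority)
```

Because 203 is odd, an average of 101.5 can never be an actual outcome. Every simulated election lands on one side of the line or the other, which is why the majority probability is reported separately from the average.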

This is surprisingly bullish for Democrats. The average represents an 18.5-seat pickup for them, and even the low end of the confidence interval (92 seats) would be a 9-seat pickup.

My predictions are particularly optimistic on the Philadelphia area’s Democratic challengers, largely because of the sweeping victories expected in the region’s Congressional races. Remember: the model doesn’t know anything about the state candidates themselves, and just uses broad indications of the district and the environment, so your knowledge of a given candidate could mean very different predictions. It gives Democrat Kristin Seale a 32% chance of unseating Quinn in Delco’s 168. It gives Hohenstein a 34% chance of winning the River Wards’ 177, now that John Taylor isn’t in the picture. And it gives Michael Doyle a 36% chance of beating Martina White in 170 in the Northeast.

Below is a widget where you can see the predictions for every single race:
[EDIT: The widget didn’t embed correctly. CLICK HERE for it!]

One thing you may notice above is that even in the close races, Republicans are favored to win. That’s a fascinating fact of the model, and the race in general. Republicans are actually favored to win in 115 seats.

How is that possible? How can Republicans be favored in more seats than my upper bound on how many they’ll win? The answer is gerrymandering. Because of “cracking and packing”, there are a ton of districts that are safely Republican in any typical year, but not so safe as to waste too many votes. A Blue Wave would push them just to the boundary. Any additional randomness (a good Democratic candidate, a local story) pushes them over into the Democratic column. This is also helped by the fact that Democrats are contesting more districts than ever before. And there just aren’t any similarly teetering seats on the Democratic side. So while the model isn’t sure which of the close seats will swing over the line, it’s sure that some will. And maybe enough to win Democrats the House.
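
The gap between “seats where a party is favored” and “seats a party is expected to win” is easy to see with a toy chamber. The ten win probabilities below are invented for illustration and have nothing to do with the real districts.

```python
def seat_outlook(dem_win_probs):
    """Return (seats where Democrats are favored, expected Democratic seats)."""
    favored = sum(1 for p in dem_win_probs if p > 0.5)
    expected = sum(dem_win_probs)
    return favored, expected

# Hypothetical chamber of 10 seats: 4 safe Democratic, 2 safe Republican,
# and 4 Republican-leaning seats that a wave has pushed close to the line.
probs = [0.99] * 4 + [0.02] * 2 + [0.40] * 4
favored, expected = seat_outlook(probs)
print("Democrats favored in", favored, "of", len(probs), "seats")
print("Expected Democratic seats:", round(expected, 2))
```

Democrats are favored in only 4 of the 10 seats, yet they are expected to win 5.6, because each of the four teetering seats contributes 0.4 of an expected seat without being individually favored.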

See you in two weeks!
I was stunned to predict such a close race. By my forecast, Democrats pick up at least 9 seats, and are roughly an even bet to win the House. With that aggressive prediction on the internet forever, I’m turning my attention to the Turnout Tracker. Stay tuned!

Sources
Data comes from the amazing Open Elections Project.
I also leaned heavily on Ballotpedia to complement and extend the data.
GIS data is from the US Census.

Predicting November Turnout

Predicting November
I’ve spent the last few months trying to see if there’s a way I could plausibly predict November. Sites like FiveThirtyEight do a plenty good job of national races, but what can we say about state races? Could Democrats win the Pennsylvania House? The PA Senate?

Well, I finally think I’ve got a model that does a plausible job.

Soon, I’ll publish some predictions for the winners. But first, let’s look at turnout.

Trends in PA Turnout

Today, I’m focusing on turnout in even-year general elections (so all Presidential or Gubernatorial races) since 2002. I’m going to use only the two-party vote, taking the total votes for President or Governor as turnout rather than the actual turnout. This ignores third-party voters and people who skipped the topline election altogether. The difference between this and actual turnout won’t be large, and this makes the predictions later easier.
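
As a concrete illustration of that turnout proxy, here is a tiny sketch. The party codes (`DEM`, `REP`, and so on) and the vote counts are hypothetical and are not the format of the underlying data.

```python
def two_party_turnout(votes_by_party):
    """Turnout proxy: Democratic plus Republican votes in the topline race."""
    return votes_by_party.get("DEM", 0) + votes_by_party.get("REP", 0)

# Hypothetical precinct; party labels and counts are made up.
precinct = {"DEM": 412, "REP": 230, "LIB": 9, "GRN": 4}
proxy = two_party_turnout(precinct)
all_topline_votes = sum(precinct.values())

print("Two-party proxy:", proxy)
print("All topline votes:", all_topline_votes)
```

In this made-up precinct the proxy misses 13 third-party votes, plus anyone who showed up but skipped the topline race entirely.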

Between 3.5M and 4.1M Pennsylvanians voted in the midterms since 2002.

What’s a good guess for turnout this year? 2006 seems like an obvious benchmark. In that year, an incumbent Democratic Governor and Senate candidate Bob Casey, Jr capitalized on a national Democratic surge against an unpopular president. Sounds familiar. In that election, 4.09 million Pennsylvanians voted for governor. The other high was from the other wave election in the period: Republicans’ sweep of 2010.

The Model
I’ve built a model that predicts the turnout of every precinct using data from even years from 2002-2016. The model uses information on the election (whether it’s a midterm, the party in the presidency, whether local races are contested, the incumbency of local races, the presence of female candidates, and district population), and allows for different precinct-level responses to midterm elections, presidential party, and turnout growth or shrinkage over time.

The thing that makes predicting state races so hard is that there aren’t surveys. Without them, it’s really hard to find good proxies for voter excitement and disproportionate interest. Instead, I’ve built the model to simulate the full distribution of types of elections, from very Democratic to very Republican, and then give the entire range of possible results. We can then use that to either (a) examine the full range of possible outcomes, or (b) plug in specific values and see the results, for example “what if the election looked like 2006?”

To achieve this, I’ve modeled the correlations in turnout among precincts, to identify groups of precincts that all turn out together. Some precincts all come out disproportionately in midterms, others come out only in Democratic wave years. It’s this factor that is the biggest unknown moving into November: what type of election it will be. These correlations create a lot of uncertainty: you can’t rely on the Law of Large Numbers to cancel out all of the districts’ idiosyncrasies.
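
Here is a back-of-the-envelope version of why the Law of Large Numbers doesn’t rescue us. It assumes equal-sized precincts whose relative turnout errors are one shared factor plus independent local noise, which is much simpler than the actual model. The precinct count and both noise levels are hypothetical.

```python
import math

def statewide_relative_sd(n_precincts, local_sd, shared_sd):
    """SD of the statewide turnout error, as a fraction of turnout.

    Assumes equal-sized precincts whose relative errors are
    shared_factor + independent_local_noise (a simplification).
    The local part averages out like 1/sqrt(n); the shared part does not.
    """
    return math.sqrt(shared_sd ** 2 + local_sd ** 2 / n_precincts)

# Hypothetical sizes: 9,000 precincts, 15% local noise, 5% shared swing.
only_local = statewide_relative_sd(9000, local_sd=0.15, shared_sd=0.0)
with_shared = statewide_relative_sd(9000, local_sd=0.15, shared_sd=0.05)

print("Statewide SD, local noise only: {:.2%}".format(only_local))
print("Statewide SD, with shared swing: {:.2%}".format(with_shared))
```

With purely local noise, thousands of precincts average out to a statewide error well under 1%. Add even a modest shared factor and the statewide error stays at roughly the size of that factor, no matter how many precincts there are.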

So, does the model work?

Testing the Model: 2016

To test the model, let’s pretend it’s September 2016. Using only data from 2002-2014, I fit the model and then generate predictions for 2016 turnout.

In 2016, I would have estimated 5.68 million votes cast statewide, with a 95% credible interval of 5.16M – 6.29M (the uncertainty is huge, but listen, science is hard, and I’m a serious person). In reality, 6.01 million votes were cast for President. I undershot it by a little bit, but the result is well within the interval.
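
The arithmetic behind that holdout check, as a short sketch. The three inputs are the figures quoted above; the function name and the returned fields are just illustrative.

```python
def holdout_check(predicted, interval, observed):
    """Was the observed value inside the interval, and how far off was the point estimate?"""
    low, high = interval
    return {
        "covered": low <= observed <= high,
        # Negative means the prediction undershot the observed value.
        "relative_error": (predicted - observed) / observed,
    }

# Figures quoted in the post: prediction, 95% credible interval, actual votes.
result = holdout_check(5.68e6, (5.16e6, 6.29e6), 6.01e6)
print(result["covered"], "{:.1%}".format(result["relative_error"]))
```

The point estimate undershoots by about 5.5%, and the observed value sits inside the interval. One covered holdout year is encouraging, but it can’t tell us on its own whether the intervals are well calibrated.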

Capturing relative turnout is arguably more important for final results than capturing overall turnout. Which places voted more than usual, and which less? Let’s compare the model’s predictions for Vote in 2016 / Vote in 2012 to the actual values.
I did less well on that. Above is a plot of the observed turnout growth in each geography (measured as turnout in 2016 divided by turnout in 2014) versus what the model would have predicted. A perfect prediction would have all of the points on the 45-degree line.

There may be some correlation between my predicted growth and the observed results, but it’s weak. It turns out that the growth depends heavily on the partisanship of the election: the correlation factors that I discussed above. Since I don’t know what those are ahead of time, I have to simulate them from all of the possibilities, resulting in the elliptical blob above. The model easily identifies these factors retrospectively (I can say, for example, that 2006 was a very strong Democratic year), but I don’t in general have a way to predict them for an upcoming election.

The Predictions
Enough delay. What do I predict for turnout in 2018?

There will be 4,295,981 votes for Governor.

This strikes me as high. It’s higher turnout than any midterm in my dataset. But the model did relatively well in the holdout test of 2016, and I don’t want to commit the sin of post-hoc adjusting. So this is my prediction, and I’m sticking to it.

What are the arguments for this astronomical number? You, a person who somehow reads this blog and thus are well down the elections analysis rabbit-hole, might have noticed unprecedented excitement for a midterm, and be unsurprised by a high prediction. But the model doesn’t have that info. Instead, it does see that (a) a Republican is president, which increases midterm turnout more in Democratic precincts than a Democratic president increases in Republican ones, (b) many more races are contested, including in the newly-redrawn congressional districts and a ton of contested state house seats, and (c) after all of the adjustments, turnout has been steadily increasing since 2002. All of these combine to create a prediction for midterm turnout that is unprecedented in the dataset. And some of those features, particularly the contested races, are probably serving as proxies for voter enthusiasm.

There’s a lot of uncertainty in the prediction because, again, science is hard. The 95% credible interval is 3.85M to 4.72M. That interval would include the turnout of the last two wave midterm elections—4.09M in 2006 and 4.00M in 2010—and exclude the lower-turnout years of 2002 and 2014.

Within Philadelphia, I project 460,000 voters, with a 95% CI of (410,000, 517,000). Even at the lower end, that would beat out the 2006 and 2010 turnout highs.

What does the model have to say about precinct-specific changes? Below is a plot of its predictions in Philadelphia, relative to turnout in 2014. Keep in mind that these predictions are equivalent to the blob plot above: there’s a loose predictive power, but a ton of noise based on what type of election this ends up being.

I predict particularly high turnout in Center City East and the River Wards, upwards of 60% growth over 2014. That one bright yellow precinct in the River Wards is because of population changes that have seen increasing midterm turnout, and a competitive State House election in a neighborhood that hasn’t seen one for years. West Philly, North Philly, up to West Oak Lane, will likely turn out similarly to 2014, given their largely uncontested races.

Coming Soon
So I tentatively expect record turnout, at least among elections since 2002. Will it happen? I’ve over-predicted turnout before. We’ll see if I learned my lesson.

Until that test comes, let’s brazenly barrel forward and predict the actual results. Coming soon.

Sources
Data comes, as always, from the amazing Open Elections Project.
I also leaned heavily on Ballotpedia to complement and extend the data.
GIS data is from the US Census.