Predicting Elections Tutorial!

I’m teaching a workshop at Penn’s Master’s of Urban Spatial Analytics on April 22nd.

I’ve posted all of the materials on github, including RMarkdown walkthroughs. Want to learn how I predicted the 2018 election and only made one bad mistake? Check it out!

Note: As an early tester, by reading the posts you commit to sending me feedback. Preferably before April 22nd. thx.

Post 1: The Relational Database. How I’ve organized the election data.

Post 2: Geographies. How I crosswalked geographies across moving boundaries.

Post 3: Creating the rectangular data.frame. Final steps to get ready to model.

Post 4: Predicting the election. The good stuff! (You can skip the others. This is what I’ll be teaching.)

The Turnout Tracker is open sourced. And going to Chicago!

The Turnout Tracker: An Introduction

The turnout tracker is a citizen science tool to track election turnout in real time.

In May 2017, I noticed that Philadelphians had organically started sharing on social media where and when they voted, and what number voter they were at their precinct.

I thought “Wow, all that needs is a statistical model to know what turnout is across the city.” So I built it.

We’re going to Chicago

Chicagoans, I need your help! The municipal runoff elections are April 2nd. Let’s track turnout together.

Before Election Day

  • Tell your friends! Share this post!

On Election Day

Open Sourcing the Turnout Tracker

I’m also sharing the code behind the Turnout Tracker with the world. I’ve cleaned it up some, but better engineering and documentation are a work in progress.

Check out the repository at https://github.com/jtannen/turnout_tracker

Programmer/Data Scientist? The codebase may not be fully self-serve yet. So if you want to bring the Turnout Tracker to your city, get in touch. jonathan (dot) tannen (at) gmail (dot) com


The neighborhoods that decide Council District 2

[Note 2019-03-09: this post has been heavily updated thanks to an insightful suggestion from @DanthePHLman]

Could Kenyatta Lose?

Kenyatta Johnson, the two-term councilmember from Southwest and South Philly’s District 2, is being challenged by Lauren Vidas, the former assistant finance director under Mayor Nutter. Johnson dominated a challenge from developer Ori Feibush four years ago, but has since been mired in land deal scandals. In Wednesday’s post, I claimed District 3’s challenger faced a plausible but steep path. How about for District 2? What would it take for Vidas to win?

Johnson’s District 2 is quite different from West Philly’s District 3. The gentrification has covered less ground, and Graduate Hospital didn’t take to Bernie Sanders and Larry Krasner in the same way that University City did. On the other hand, Johnson’s recent scandals will likely kneecap his 2015 popularity, and Vidas occupies a quite different lane than developer Feibush.

What are the neighborhood cohorts that will decide District 2? If Johnson holds, what neighborhoods will he have done well in? If Vidas’s challenge is successful, which neighborhoods’ vote will she have monopolized?

District 2’s voting blocs

The voting blocs for District 2 are less distinct than for District 3: there’s the pro-Kenyatta base of Point Breeze, the challenger base of Grad Hospital and a nub of East Passyunk, and then there’s Southwest Philly, which is somewhere in between.

View code
library(tidyverse)
library(rgdal)
library(rgeos)
library(sp)
library(ggmap)

sp_council <- readOGR("../../../data/gis/city_council/Council_Districts_2016.shp", verbose = FALSE)
sp_council <- spChFIDs(sp_council, as.character(sp_council$DISTRICT))

sp_divs <- readOGR("../../../data/gis/2016/2016_Ward_Divisions.shp", verbose = FALSE)
sp_divs <- spChFIDs(sp_divs, as.character(sp_divs$WARD_DIVSN))
sp_divs <- spTransform(sp_divs, CRS(proj4string(sp_council)))

load("../../../data/processed_data/df_major_2017_12_01.Rda")

ggcouncil <- fortify(sp_council) %>% mutate(council_district = id)
ggdivs <- fortify(sp_divs) %>% mutate(WARD_DIVSN = id)
View code
## Need to add District 2 election from 2015
raw_d2 <-  read.csv("../../../data/raw_election_data/2015_primary.csv") 
raw_d2 <- raw_d2 %>% 
  filter(OFFICE == "DISTRICT COUNCIL-2ND DISTRICT-DEM") %>%
  mutate(
    WARD = sprintf("%02d", asnum(WARD)),
    DIV = sprintf("%02d", asnum(DIVISION))
  )

load('../../../data/gis_crosswalks/div_crosswalk_2013_to_2016.Rda')
## "%02s" pads strings with spaces on most platforms, so zero-pad explicitly
crosswalk_to_16 <- crosswalk_to_16 %>% ungroup() %>%
  mutate(
    WARD = str_pad(as.character(WARD), 2, pad = "0"),
    DIV = str_pad(as.character(DIV), 2, pad = "0")
  )

d2 <- raw_d2 %>% 
  left_join(crosswalk_to_16) %>%
  group_by(WARD16, DIV16, OFFICE, CANDIDATE) %>%
  summarise(VOTES = sum(VOTES * weight_to_16)) %>%
  mutate(PARTY="DEMOCRATIC", year="2015", election="primary")
df_major <- bind_rows(df_major, d2)
View code
races <- tribble(
  ~year, ~OFFICE, ~office_name,
  "2015", "MAYOR", "Mayor",
  "2015", "DISTRICT COUNCIL-2ND DISTRICT-DEM", "Council 2nd District",
  "2015", "COUNCIL AT LARGE", "City Council At Large",
  "2016", "PRESIDENT OF THE UNITED STATES", "President",
  "2017", "DISTRICT ATTORNEY", "District Attorney"
) %>% mutate(election_name = paste(year, office_name))

candidate_votes <- df_major %>% 
  filter(election == "primary" & PARTY == "DEMOCRATIC") %>%
  inner_join(races %>% select(year, OFFICE)) %>%
  mutate(WARD_DIVSN = paste0(WARD16, DIV16)) %>%
  group_by(WARD_DIVSN, OFFICE, year, election) %>%
  mutate(
    total_votes = sum(VOTES),
    pvote = VOTES / sum(VOTES)
  ) %>% 
  group_by()

turnout_df <- candidate_votes %>%
  filter(!grepl("COUNCIL", OFFICE)) %>% 
  group_by(WARD_DIVSN, OFFICE, year, election) %>%
  summarise(total_votes = sum(VOTES)) %>%
  left_join(
    sp_divs@data %>% select(WARD_DIVSN, AREA_SFT)
  )

turnout_df$AREA_SFT <- asnum(turnout_df$AREA_SFT)

The second council district covers Southwest Philly, and parts of South Philly including Point Breeze and Graduate Hospital.

View code
get_labpt_df <- function(sp){
  mat <- sapply(sp@polygons, slot, "labpt")
  df <- data.frame(x = mat[1,], y=mat[2,])
  return(
    cbind(sp@data, df)
  )
}

ggplot(ggcouncil, aes(x=long, y=lat)) +
  geom_polygon(
    aes(group=group),
    fill = strong_green, color = "white", size = 1
  ) +
  geom_text(
    data = get_labpt_df(sp_council),
    aes(x=x,y=y,label=DISTRICT)
  ) +
  theme_map_sixtysix() +
  coord_map() +
  ggtitle("Council Districts")

plot of chunk council_map

View code
DISTRICT <- "2"
sp_district <- sp_council[row.names(sp_council) == DISTRICT,]

bbox <- sp_district@bbox
## expand the bbox 20% for mapping
bbox <- rowMeans(bbox) + 1.2 * sweep(bbox, 1, rowMeans(bbox))

basemap <- get_map(bbox, maptype="toner-lite")

district_map <- ggmap(
  basemap, 
  extent="normal", 
  base_layer=ggplot(ggcouncil, aes(x=long, y=lat, group=group)),
  maprange = FALSE
) 
## without basemap:
# district_map <- ggplot(ggcouncil, aes(x=long, y=lat, group=group))

district_map <- district_map +
  theme_map_sixtysix() +
  coord_map(xlim=bbox[1,], ylim=bbox[2,])


sp_divs$council_district <- over(
  gCentroid(sp_divs, byid = TRUE), 
  sp_council
)$DISTRICT

sp_divs$in_bbox <- sapply(
  sp_divs@polygons,
  function(p) {
    coords <- p@Polygons[[1]]@coords
    any(
      coords[,1] > bbox[1,1] &
      coords[,1] < bbox[1,2] &
      coords[,2] > bbox[2,1] &
      coords[,2] < bbox[2,2] 
    )
  }
)

ggdivs <- ggdivs %>% 
  left_join(
    sp_divs@data %>% select(WARD_DIVSN, in_bbox)
  )

district_map +
  geom_polygon(
    aes(alpha = (id == DISTRICT)),
    fill="black",
    color = "grey50",
    size=2
  ) +
  scale_alpha_manual(values = c(`TRUE` = 0.2, `FALSE` = 0), guide = FALSE) +
  ggtitle(sprintf("Council District %s", DISTRICT))

plot of chunk district_map
Despite the large expanse of land, the vast majority of the district’s votes come from Center City and northern South Philly.
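
The density mapped here is votes divided by `AREA_SFT`, which I take to be each division's area in square feet (an assumption based on the column name). As a rough sketch with made-up divisions, this is how that ratio converts to votes per square mile, and what the map's fill cap of 0.0005 corresponds to:

```r
# Hypothetical divisions: vote totals and areas in square feet (illustrative values only).
divs <- data.frame(
  WARD_DIVSN = c("3601", "3602"),
  total_votes = c(400, 150),
  AREA_SFT = c(2.0e6, 6.0e6)
)

SQFT_PER_SQMILE <- 5280^2  # one mile is 5,280 feet

divs$votes_per_sqft <- divs$total_votes / divs$AREA_SFT
divs$votes_per_sqmile <- divs$votes_per_sqft * SQFT_PER_SQMILE

# The map caps the fill at 0.0005 votes per square foot so that a few dense
# divisions don't wash out the color scale. The same cap per square mile:
cap_per_sqmile <- 0.0005 * SQFT_PER_SQMILE

print(divs)
print(cap_per_sqmile)
```

Because the conversion is a constant multiple, the map's colors are the same whichever unit is used.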

View code
# hist(turnout_df$total_votes / turnout_df$AREA_SFT, breaks = 1000)

turnout_df <- turnout_df %>%
  left_join(races)

district_map +
  geom_polygon(
    data = ggdivs %>%
      filter(in_bbox) %>%
      left_join(turnout_df, by =c("id" = "WARD_DIVSN")),
    aes(fill = pmin(total_votes / AREA_SFT, 0.0005))
  ) +
  scale_fill_viridis_c(guide = FALSE) +
  geom_polygon(
    fill=NA,
    color = "white",
    size=1
  ) +
  facet_wrap(~ election_name) +
  ggtitle(
    "Votes per mile in the Democratic Primary", 
    sprintf("Council District %s", DISTRICT)
  )

plot of chunk turnout_map
In fact, so few votes come from the industrial southernmost tip of the city that I’ll drop it from the maps. Sorry Navy Yard, but you’re ruining my scale.

View code
d2_subset <- sp_divs[sp_divs$council_district == DISTRICT,]
d2_subset <- d2_subset[
  d2_subset$WARD_DIVSN %in% 
    turnout_df$WARD_DIVSN[turnout_df$total_votes / turnout_df$AREA_SFT > 0.0001],
]

bbox <- gUnionCascaded(d2_subset)@bbox
## expand the bbox 20% for mapping
bbox <- rowMeans(bbox) + 1.2 * sweep(bbox, 1, rowMeans(bbox))

basemap <- get_map(bbox, maptype="toner-lite")

district_map <- ggmap(
  basemap, 
  extent="normal", 
  base_layer=ggplot(ggcouncil, aes(x=long, y=lat, group=group)),
  maprange = FALSE
) 
## without basemap:
# district_map <- ggplot(ggcouncil, aes(x=long, y=lat, group=group))

district_map <- district_map +
  theme_map_sixtysix() +
  coord_map(xlim=bbox[1,], ylim=bbox[2,])


sp_divs$council_district <- over(
  gCentroid(sp_divs, byid = TRUE), 
  sp_council
)$DISTRICT

First, let’s look at the results from five recent, compelling Democratic Primary races: 2015 City Council At Large, City Council District 2, and Mayor; 2016 President; and 2017 District Attorney. The maps below show the vote for the top two candidates in District 2 (except for City Council in 2015, where I use Helen Gym and Isaiah Thomas, who were 4th and 5th in the district, and 5th and 6th citywide.)
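
For readers who want to reproduce the "top two candidates in the district" selection, here is a minimal base-R sketch. The data frame and the helper name `top_two_in_district` are hypothetical, made up for illustration; the real analysis works from the division-level `candidate_votes` table.

```r
# Hypothetical division-level results (illustrative values only).
votes <- data.frame(
  OFFICE = rep("MAYOR", 6),
  CANDIDATE = c("A", "B", "C", "A", "B", "C"),
  council_district = c("2", "2", "2", "3", "3", "3"),
  VOTES = c(120, 80, 30, 10, 200, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the two highest vote-getters per office within one district.
top_two_in_district <- function(votes, district) {
  in_district <- votes[votes$council_district == district, ]
  totals <- aggregate(VOTES ~ OFFICE + CANDIDATE, data = in_district, FUN = sum)
  # Sort by office, then by votes descending, and keep the first two per office.
  totals <- totals[order(totals$OFFICE, -totals$VOTES), ]
  rank_in_office <- ave(totals$VOTES, totals$OFFICE, FUN = seq_along)
  totals[rank_in_office <= 2, ]
}

top2 <- top_two_in_district(votes, "2")
print(top2)
```

Note that the ranking depends on the district: candidate B leads the made-up citywide totals but is second inside district 2.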

View code
candidate_votes <- candidate_votes %>%
  left_join(sp_divs@data %>% select(WARD_DIVSN, council_district))

## Choose the top two candidates in district 2
## Except for city council, where we choose Gym and Thomas
# candidate_votes %>%
#   group_by(OFFICE, year, CANDIDATE) %>%
#   summarise(
#     city_votes = sum(VOTES),
#     district_votes = sum(VOTES * (council_district == DISTRICT))
#   ) %>%
#   arrange(desc(city_votes)) %>%
#   filter(OFFICE == "DISTRICT ATTORNEY")

candidates_to_compare <- tribble(
  ~year, ~OFFICE, ~CANDIDATE, ~candidate_name, ~row,
  "2015", "COUNCIL AT LARGE", "HELEN GYM", "Helen Gym", 2,
  "2015", "COUNCIL AT LARGE", "ISAIAH THOMAS", "Isaiah Thomas", 1,
  "2015", "DISTRICT COUNCIL-2ND DISTRICT-DEM", "KENYATTA JOHNSON", "Kenyatta Johnson", 1,
  "2015", "DISTRICT COUNCIL-2ND DISTRICT-DEM", "ORI C FEIBUSH", "Ori Feibush", 2,
  "2015", "MAYOR", "JIM KENNEY", "Jim Kenney",  2,
  "2015", "MAYOR", "ANTHONY HARDY WILLIAMS", "Anthony Hardy Williams", 1,
  "2016", "PRESIDENT OF THE UNITED STATES", "BERNIE SANDERS", "Bernie Sanders", 2,
  "2016", "PRESIDENT OF THE UNITED STATES", "HILLARY CLINTON", "Hillary Clinton", 1,
  "2017", "DISTRICT ATTORNEY", "LAWRENCE S KRASNER", "Larry Krasner", 2,
  "2017", "DISTRICT ATTORNEY", "JOE KHAN","Joe Khan", 1
)

candidate_votes <- candidate_votes %>%
  left_join(races) %>%
  left_join(candidates_to_compare)

vote_adjustment <- function(pct_vote, office){
  ifelse(office == "COUNCIL AT LARGE", pct_vote * 4, pct_vote)
}

district_map +
  geom_polygon(
    data = ggdivs %>%
      filter(in_bbox) %>%
      left_join(
        candidate_votes %>% filter(!is.na(row))
      ),
    aes(fill = 100 * vote_adjustment(pvote, OFFICE))
  ) +
  scale_fill_viridis_c("Percent of Vote") +
  theme(
    legend.position =  "bottom",
    legend.direction = "horizontal",
    legend.justification = "center"
  ) +
  geom_polygon(
    fill=NA,
    color = "white",
    size=1
  ) +
  geom_label(
    data=candidates_to_compare %>% left_join(races),
    aes(label = candidate_name),
    group=NA,
    hjust=0, vjust=1,
    x=bbox[1,1],
    y=bbox[2,2]
  ) +
  facet_grid(row ~ election_name) +
  theme(strip.text.y = element_blank()) +
  ggtitle(
    sprintf("Candidate performance in District %s", DISTRICT), 
    "Percent of vote (times 4 for Council, times 1 for other offices)"
  )

plot of chunk proportion
Notice two things. First, the section of Point Breeze that voted overwhelmingly for Kenyatta Johnson in 2015 also voted disproportionately for Isaiah Thomas, Anthony Hardy Williams, and Hillary Clinton. These are predominantly Black neighborhoods that didn’t bite on Helen Gym, Jim Kenney, or Bernie Sanders. Unlike in West Philly, Krasner did even better in Black Point Breeze than he did in the White, gentrified Graduate Hospital, where Joe Khan did unusually well. East Passyunk exhibited similar Krasner excitement to University City.

Second, note that Washington Avenue provides the stark boundary between pro-Kenyatta Point Breeze and pro-Gym, Feibush, and Kenney Graduate Hospital. (Interested in this emergent boundary? Boy, have I got a dissertation for you!) The divisions above Washington, along with the nub of East Passyunk that extends into the east of the district, both support the farther-left challengers and turn out in force, although they didn’t support Sanders and Krasner as sharply as other gentrified parts of the city.

The district had one more coalition, hidden by these maps: Trump supporters.

View code
usp_2016 <- df_major %>%
  filter(
    election=="general"&
      year == 2016 &
      OFFICE == "PRESIDENT OF THE UNITED STATES" &
      CANDIDATE %in% c("DONALD J TRUMP", "HILLARY CLINTON")
    ) %>%
  mutate(WARD_DIVSN = paste0(WARD16, DIV16)) %>%
  group_by(WARD_DIVSN, CANDIDATE) %>%
  summarise(VOTES = sum(VOTES)) %>%
  group_by(WARD_DIVSN) %>%
  summarise(
    turnout = sum(VOTES),
    pdem = sum(VOTES * (CANDIDATE == "HILLARY CLINTON")) / sum(VOTES)
  )

district_map +
  geom_polygon(
    data = ggdivs %>%
      filter(in_bbox) %>%
      left_join(
        usp_2016
      ),
    aes(fill = 100 * (1-pdem))
  ) +
  scale_fill_gradient2(
    "Percent for Donald Trump",
    low = strong_blue, mid = "white", high = strong_red, midpoint = 50
  )+
  theme(
    legend.position =  "bottom",
    legend.direction = "horizontal",
    legend.justification = "center"
  ) +
  geom_polygon(
    fill=NA,
    color = "white",
    size=1
  ) +
  expand_limits(fill = 80) +
  ggtitle("South of Passyunk went for Trump", "Percent of two-party vote in the 2016 Presidential election")

plot of trump support

South of Passyunk voted Trump, with up to 60% of the vote! Coupled with parts of the Northeast, this represents Philadelphia’s Trump Democrats. We’ll treat them separately.

To simplify the analysis, let’s divide the District into coalitions. We’ll use four: “Johnson’s Base” of Point Breeze, “Gentrified Challengers” of Graduate Hospital and East Passyunk, “Southwest Philly”, which supported Johnson but not homogeneously, and “Trumpist South Philly”, below Passyunk.

View code

xcand <- "Kenyatta Johnson"
ycand <- "Larry Krasner"

## Everything west of the Schuylkill call Southwest.
div_centroids <- gCentroid(sp_divs[sp_divs$council_district == DISTRICT,], byid=TRUE)
sw_divs <- attr(div_centroids@coords, "dimnames")[[1]][div_centroids@coords[,1] < -75.20486]

## Pull out the places trump won
trump_winners <- usp_2016 %>%
  inner_join(sp_divs@data %>% filter(council_district == DISTRICT)) %>%
  filter(pdem < 0.5)

district_categories <- candidate_votes %>%
  filter(!is.na(candidate_name)) %>%
  group_by(WARD_DIVSN) %>%
  mutate(votes_2016 = total_votes[candidate_name == 'Bernie Sanders']) %>%
  group_by() %>%
    filter(
      council_district == DISTRICT & 
        candidate_name %in% c(xcand, ycand)
    ) %>%
    group_by(WARD_DIVSN, votes_2016) %>%
    summarise(
      x_pvote = pvote[candidate_name == xcand],
      y_pvote = pvote[candidate_name == ycand]
    ) %>%
  mutate(
    is_sw = WARD_DIVSN %in% sw_divs,
    trump_winner = WARD_DIVSN %in% trump_winners$WARD_DIVSN,
    cat = ifelse(is_sw, "Southwest", ifelse(trump_winner, "Trumpists", "East"))
  ) 

# district_categories <- district_categories %>% left_join(turnout_wide, by = "WARD_DIVSN")

ggplot(
  district_categories,
  aes(x = 100 * x_pvote, y = 100 * y_pvote)
  # aes(x = 100 * x_pvote, y = (votes_2017 - votes_2015) * 5280^2)
) +
  geom_point(aes(size = votes_2016), alpha = 0.3) +
  scale_size_area("Total Votes in 2016")+
  theme_sixtysix() +
  xlab(sprintf("Percent of Vote for %s", xcand)) +
  ylab(sprintf("Percent of Vote for %s", ycand)) +
  coord_fixed() +
  geom_abline(slope = 3, intercept =  -130) +
  # geom_hline(yintercept=0) +
  # geom_vline(xintercept=60, linetype="dashed") +
  # geom_abline(slope = 100, intercept =  -7000) +
  geom_text(
    data = data.frame(cat = rep("East", 2)),
    x = c(28, 80),
    y = c(70, 10),
    hjust = 0.5,
    label = c("Challenger\nBase", "Johnson\nBase"),
    color = c(strong_green, strong_purple),
    fontface="bold"
  ) +
  facet_wrap(~cat)+
  ggtitle("Divisions' vote", sprintf("District %s Democratic Primary", DISTRICT))

plot of chunk scatter_bernie_gym
Surprisingly, the vote for Krasner is only weakly negatively correlated with the vote for Johnson. I’ve drawn an arbitrary line that appears to divide the clusters. We’ll call the divisions in the East above the line the Gentrified Challengers, and those below the line Johnson’s Base.
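
The dividing line is drawn on the percent scale (slope 3, intercept -130), while the division-level shares are stored as proportions, so the same rule there has intercept -1.30. A small sketch of the classification under both scalings, using made-up divisions and hypothetical helper names:

```r
# The same dividing line on two scales:
#   percent scale:    y = 3 * x - 130
#   proportion scale: y = 3 * x - 1.30
above_line_pct <- function(x_pct, y_pct) y_pct > 3 * x_pct - 130
above_line_prop <- function(x_p, y_p) y_p > 3 * x_p - 1.30

# Hypothetical East divisions (illustrative proportions only):
# x is Johnson's 2015 share, y is Krasner's 2017 share.
east <- data.frame(
  x_pvote = c(0.30, 0.85, 0.55),
  y_pvote = c(0.60, 0.35, 0.45)
)

east$category <- ifelse(
  above_line_prop(east$x_pvote, east$y_pvote),
  "Gentrified Challengers",
  "Johnson Base"
)
print(east)
```

Because the line is steep, a division needs a very high Johnson share before it lands in Johnson's Base, unless its Krasner share is also low.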

Here’s the map of the cohorts that this categorization gives us.

View code
district_categories$category <- with(
  district_categories,
  ifelse(
    cat != "East", cat,
    ifelse(y_pvote > 3 * x_pvote - 1.30, "Gentrified Challengers", "Johnson Base")
  )
)

cohort_colors <- c(
      "Johnson Base" = strong_purple,
      "Gentrified Challengers" = strong_green,
      "Southwest" = strong_orange,
      "Trumpists" = strong_red
    )

district_map + 
  geom_polygon(
    data = ggdivs %>% 
      left_join(district_categories) %>% 
      filter(!is.na(category)),
    aes(fill = category)
  ) +
  scale_fill_manual(
    "Cohort",
    values=cohort_colors
  ) +
  ggtitle(sprintf("District %s neighborhood divisions", DISTRICT))+
  theme(legend.position = c(0,1), legend.justification = c(0,1))

plot of chunk category_map
Looks reasonable.

How did the candidates do in each of the sections? The boundaries separate drastic performance splits.

View code
neighborhood_summary <- candidate_votes %>% 
  inner_join(candidates_to_compare) %>%
  group_by(candidate_name, election_name) %>%
  mutate(
    citywide_votes = sum(VOTES),
    citywide_pvote = 100 * sum(VOTES) / sum(total_votes)
  ) %>%
  filter(council_district == DISTRICT) %>%
  left_join(district_categories) %>%
  group_by(candidate_name, citywide_votes, citywide_pvote, election_name, category) %>%
  summarise(
    votes = sum(VOTES),
    pvote = 100 * sum(VOTES) / sum(total_votes),
    total_votes = sum(total_votes)
  ) %>%
  group_by(candidate_name, election_name) %>%
  mutate(
    district_votes = sum(votes),
    district_pvote = 100 * sum(votes) / sum(total_votes)
  ) %>% select(
    election_name, candidate_name, citywide_pvote, district_pvote, category, pvote, total_votes
  ) %>%
  gather(key="key", value="value", pvote, total_votes) %>%
  unite("key", category, key) %>%
  spread(key, value)
  

neighborhood_summary %>%
  knitr::kable(
    digits=0, 
    format.args=list(big.mark=','),
    col.names=c("Election", "Candidate", "Citywide %", sprintf("District %s %%", DISTRICT), "Gentrified Challengers %", "Gentrified Challengers Turnout", "Johnson Base %", "Johnson Base Turnout", "Southwest %", "Southwest Turnout", "Trumpist %", "Trumpist Turnout")
  )
| Election | Candidate | Citywide % | District 2 % | Gentrified Challengers % | Gentrified Challengers Turnout | Johnson Base % | Johnson Base Turnout | Southwest % | Southwest Turnout | Trumpist % | Trumpist Turnout |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 2015 City Council At Large | Helen Gym | 8 | 9 | 14 | 24,471 | 7 | 25,139 | 4 | 15,569 | 5 | 5,338 |
| 2015 City Council At Large | Isaiah Thomas | 7 | 7 | 5 | 24,471 | 10 | 25,139 | 8 | 15,569 | 2 | 5,338 |
| 2015 Council 2nd District | Kenyatta Johnson | 62 | 62 | 44 | 6,669 | 79 | 9,508 | 64 | 5,894 | 36 | 2,057 |
| 2015 Council 2nd District | Ori Feibush | 38 | 38 | 55 | 6,669 | 21 | 9,508 | 36 | 5,894 | 64 | 2,057 |
| 2015 Mayor | Anthony Hardy Williams | 26 | 30 | 11 | 7,606 | 41 | 10,127 | 45 | 6,740 | 2 | 2,580 |
| 2015 Mayor | Jim Kenney | 56 | 56 | 70 | 7,606 | 46 | 10,127 | 44 | 6,740 | 89 | 2,580 |
| 2016 President | Bernie Sanders | 37 | 37 | 39 | 11,702 | 38 | 14,293 | 29 | 9,224 | 47 | 2,051 |
| 2016 President | Hillary Clinton | 63 | 63 | 60 | 11,702 | 62 | 14,293 | 71 | 9,224 | 52 | 2,051 |
| 2017 District Attorney | Joe Khan | 20 | 23 | 33 | 6,985 | 18 | 6,569 | 13 | 3,354 | 14 | 982 |
| 2017 District Attorney | Larry Krasner | 38 | 39 | 43 | 6,985 | 41 | 6,569 | 32 | 3,354 | 18 | 982 |

The turnout splits are fascinating. The Johnson Base represented a consistent 37ish percent of the votes, leading the district in 2015 and 2016, but was surpassed by the Gentrified Challengers’ 39% in 2017. Still, the Southwest typically represents 25% of the votes (this fell to 19% in 2017), so Johnson’s Base combined with the Southwest made up a strong 63% of the 2016 vote, and 55% of the 2017 vote.
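
These shares can be checked directly from the turnout columns of the table. A minimal sketch, with the cohort totals copied by hand from the Mayor, President, and District Attorney rows:

```r
# Cohort vote totals copied from the table above.
turnout <- rbind(
  "2015 Mayor"             = c(7606, 10127, 6740, 2580),
  "2016 President"         = c(11702, 14293, 9224, 2051),
  "2017 District Attorney" = c(6985, 6569, 3354, 982)
)
colnames(turnout) <- c(
  "Gentrified Challengers", "Johnson Base", "Southwest", "Trumpists"
)

# Each cohort's share of the district's votes, by election (rows sum to 100).
share <- 100 * prop.table(turnout, margin = 1)
print(round(share, 1))

# Johnson Base and Southwest combined.
combined <- share[, "Johnson Base"] + share[, "Southwest"]
print(round(combined, 1))
```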

View code
cohort_turnout <- neighborhood_summary %>%
  group_by() %>%
  filter(election_name %in% c("2015 Mayor", "2016 President", "2017 District Attorney")) %>%
  select(election_name, ends_with("_total_votes")) %>%
  gather("cohort", "turnout", -election_name) %>%
  unique() %>%
  mutate(
    year = substr(election_name, 1, 4),
    cohort = gsub("^(.*)_total_votes", "\\1", cohort)
  ) %>%
  group_by(year) %>%
  mutate(pct_turnout = turnout / sum(turnout))
  
ggplot(cohort_turnout, aes(x=year, y=100*pct_turnout)) +
  geom_line(aes(group=cohort, color=cohort), size=2) +
  geom_point(aes(color=cohort), size=4) +
  scale_color_manual(values=cohort_colors, guide=FALSE) +
  theme_sixtysix() +
  expand_limits(y=0) +
  expand_limits(x=4)+
  geom_text(
    data = cohort_turnout %>% filter(year == 2017),
    aes(label = cohort, color = cohort),
    x = 3.05,
    fontface="bold",
    hjust = 0
  ) +
  ylab("Percent of District 2's votes") +
  xlab("") +
  ggtitle(
    "Cohorts' electoral strength", "Percent of District 2's votes in the Democratic Primary"
  )

plot of chunk turnout by cohort

How does the combination of (a) Johnson’s Base’s sheer size and (b) the Gentrifiers’ surge in voting impact the election? It comes down to the percent of the vote in each region. In 2015, Kenyatta won 44% in the Challenger Base even as he dominated his own Base and the Southwest, with 79% and 64%. Feibush won 64% from the South Philly Trumpists. Vidas, who is a very different candidate from Feibush (to put it mildly), would have to do much, much better in the Gentrified regions, and hope Johnson’s dominance of Point Breeze has fallen.

The relative power of Johnson’s Base and the Gentrified Challengers

How much does the power shift between the two cohorts? Let’s do some math.

How much does a candidate need from each of the sections to win? Let $t_i$ be the relative turnout in section $i$, defined as the proportion of total votes. So in the 2017 District Attorney race, $t_i$ was 0.39 for the Gentrified Challengers, and 0.37 for the Johnson Base. Let $p_{ic}$ be the proportion of the vote received by candidate $c$ in section $i$, so in 2017, $p_{ic}$ is 0.41 for Krasner in the Johnson Base.

Then a candidate wins a two-way race whenever the turnout-weighted proportion of their vote is greater than 0.5: $\sum_i t_i p_{ic} > 0.5$.
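
As a quick check of the turnout-weighted condition, here is a sketch that applies it to the 2015 Council 2nd District row of the table. The helper name `weighted_share` is made up for illustration, and the shares are the table's rounded percentages, so the result is approximate.

```r
# Relative turnout t_i and vote share p_ic by section; a candidate wins a
# two-way race when sum(t * p) > 0.5.
weighted_share <- function(votes_cast, p) {
  t <- votes_cast / sum(votes_cast)  # relative turnout by section
  sum(t * p)
}

# 2015 Council 2nd District figures from the table above
# (order: Gentrified Challengers, Johnson Base, Southwest, Trumpists).
votes_cast <- c(6669, 9508, 5894, 2057)
p_johnson <- c(0.44, 0.79, 0.64, 0.36)

johnson <- weighted_share(votes_cast, p_johnson)
print(round(johnson, 2))
print(johnson > 0.5)
```

The weighted sum recovers Johnson's district-wide 62%, which is what the table reports.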

Since we’ve divided District 2 into four sections, it’s hard to plot on two axes. For simplicity, I’ll combine the Johnson Base with Southwest Philly, and the Gentrified Challengers with the Trumpists (these are in my opinion the likely race-correlated dynamics that will play out). On the x-axis, let’s map a candidate’s percent of the vote in the Gentrifiers + Trumpists, and on the y, a candidate’s percent of the vote in Southwest + the Johnson Base (assuming a two-person race). The candidate wins whenever the average of their proportions, weighted by t, is greater than 50%. The dashed lines show the win boundaries; candidates to the top-right of the lines win. Turnout matters less in District 2 than in District 3 because it swings less; District 2 didn’t experience the Krasner bump in 2017.

I’ll plot only the two-candidate vote for the top two candidates in the district for each race, to emulate a two-person race. (For City Council in 2015, I use Helen Gym and Isaiah Thomas, who were 4th and 5th in the district, and 5th and 6th citywide.)
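
For reference, the win boundaries follow from rearranging the weighted-average condition. In the short derivation below, $t_x$ and $t_y$ are labels chosen here for the turnout shares of Gentrified Challengers + Trumpists and Johnson Base + Southwest, and the numbers come from the cohort totals in the table above (2015 Mayor and 2017 District Attorney), rounded:

```latex
\begin{align*}
% Win condition on the percent scale, with t_x + t_y = 1
t_x p_x + t_y p_y > 50
  \quad&\Longleftrightarrow\quad
  p_y > \frac{50}{t_y} - \frac{t_x}{t_y}\, p_x \\[4pt]
% 2015 Mayor turnout: t_x = 10186 / 27053, t_y = 16867 / 27053
\text{2015: } t_x \approx 0.377,\ t_y \approx 0.623
  \quad&\Longrightarrow\quad
  p_y > 80.2 - 0.60\, p_x \\[4pt]
% 2017 District Attorney turnout: t_x = 7967 / 17890, t_y = 9923 / 17890
\text{2017: } t_x \approx 0.445,\ t_y \approx 0.555
  \quad&\Longrightarrow\quad
  p_y > 90.1 - 0.80\, p_x
\end{align*}
```

The 2017 line is steeper because the Gentrified Challengers + Trumpists cast a larger share of the district's votes that year, so each point won there counts for more.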

View code
get_line <- function(x_total_votes, y_total_votes){
  ## solve p_x t_x+ p_y t_y > 50
  tot <- x_total_votes + y_total_votes
  tx <- x_total_votes / tot
  ty <- y_total_votes / tot

  slope <- -tx / ty
  intercept <- 50 / ty  # use 50 since proportions are x100
  c(intercept, slope)
}

line_2017 <- with(
  neighborhood_summary,
  get_line(
    (`Gentrified Challengers_total_votes` + Trumpists_total_votes)[candidate_name == "Larry Krasner"],
    (`Johnson Base_total_votes` + `Southwest_total_votes`)[candidate_name == "Larry Krasner"]
  )
)

line_2015 <- with(
  neighborhood_summary,
  get_line(
    (`Gentrified Challengers_total_votes` + Trumpists_total_votes)[candidate_name == "Jim Kenney"],
    (`Johnson Base_total_votes` + `Southwest_total_votes`)[candidate_name == "Jim Kenney"]
  )
)

## get the two-candidate vote
neighborhood_summary <- neighborhood_summary %>%
  group_by(election_name)  %>% 
  mutate(
    challenger_pvote_2cand = (
      `Gentrified Challengers_pvote` + Trumpists_pvote
      ) / sum(`Gentrified Challengers_pvote` + Trumpists_pvote),
    kenyatta_pvote_2cand = (`Southwest_pvote` + `Johnson Base_pvote`)/sum(`Southwest_pvote` + `Johnson Base_pvote`)
  )


library(ggrepel)

ggplot(
  neighborhood_summary,
  aes(
    x=100*challenger_pvote_2cand,
    y=100*kenyatta_pvote_2cand
  )
) +
  geom_point() +
  geom_text_repel(aes(label=candidate_name)) +
  geom_abline(
    intercept = c(line_2015[1], line_2017[1]),
    slope = c(line_2015[2], line_2017[2]),
    linetype="dashed"
  ) +
  coord_fixed() + 
  scale_x_continuous(
    "Gentrified Challenger + Trumpist percent of vote",
    breaks = seq(0,100,10)
  ) +
  scale_y_continuous(
    "Johnson Base + Southwest percent of vote",
    breaks = seq(0, 100, 10)
  ) +
  annotate(
    geom="text",
    label=paste(c(2015, 2017), "turnout"),
    x=c(10, 8),
    y=c(
      line_2015[1] + 10 * line_2015[2],
      line_2017[1] + 8 * line_2017[2]
    ),
    hjust=0,
    vjust=-0.2,
    angle = atan(c(line_2015[2], line_2017[2])) / pi * 180,
    color="grey40"
  )+
  annotate(
    geom="text",
    x = 80,
    y=75,
    label="Candidate wins",
    fontface="bold",
    color = strong_green
  ) +
  geom_hline(yintercept = 50, color="grey50") +
  geom_vline(xintercept = 50, color="grey50")+
  expand_limits(x=100, y=80)+
  theme_sixtysix() +
  ggtitle(
    "The relative strengths of District 2 neighborhoods",
    "Candidates to the top-right of the lines win. Points are two-candidate vote."
  )

plot of chunk win_scatter

Hillary Clinton, Larry Krasner, and Jim Kenney won the two-way votes in all sections. Kenyatta lost the Gentrified + Trumpist vote 59-41, but dominated Point Breeze and Southwest Philly. (Notice the points don’t match the table above because these are two-candidate votes.)

What would be Vidas’s path to victory? Helen Gym looks like a prototype (remember that there were actually 16 candidates for five spots, so this head-to-head analysis is hypothetical). Developer Ori Feibush didn’t do nearly well enough in Grad Hospital and East Passyunk to win. If Vidas burnishes more progressive credentials, and pushes that percentage up to 80%, then she could win even if Johnson doesn’t lose any support in his base.
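
To make the 80% scenario concrete, here is a rough sketch. It rests on two assumptions of mine: the challenger keeps Feibush's two-candidate share in Johnson Base + Southwest (computed from the table's rounded percentages by summing across the two cohorts, as the plotting code does), and turnout looks like either the 2015 Mayor or the 2017 District Attorney race.

```r
# Votes cast by combined cohort, from the table above:
# x = Gentrified Challengers + Trumpists, y = Johnson Base + Southwest.
turnout <- list(
  "2015 Mayor" = c(x = 7606 + 2580, y = 10127 + 6740),
  "2017 District Attorney" = c(x = 6985 + 982, y = 6569 + 3354)
)

# Assumption: the challenger holds Feibush's two-candidate share in
# Johnson Base + Southwest, from the table's rounded percentages.
feibush_y <- 100 * (21 + 36) / (21 + 36 + 79 + 64)  # 28.5

# Scenario from the text: 80% in Gentrified Challengers + Trumpists.
scenario_x <- 80

# Turnout-weighted district share under each turnout pattern.
district_share <- sapply(turnout, function(v) {
  t <- v / sum(v)
  unname(t["x"] * scenario_x + t["y"] * feibush_y)
})
print(round(district_share, 1))
```

Under these assumptions the scenario clears 50% with 2017-style turnout but falls just short with 2015-style turnout, so it leans on the Gentrified cohorts' turnout surge persisting.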

Looking to May

We’re left in a grey area. There are reasons to believe that the recent scandals could drastically change Johnson’s support from 2015, but without polling, we have no way to tell exactly how much. It would take a huge change from 2015 for him to lose, but the combination of scandal and not running against Feibush could be that change.

Up next, I’ll stick with scandal-plagued incumbents and look at Henon’s District 6. Stay tuned!

The neighborhoods that decide Council District 3

Could Jannie Lose?

Jannie Blackwell, the six-term councilmember from West Philly’s District 3, is being challenged by Jamie Gauthier. The race appears to be shaping up as a reform-minded challenger against a powerful longtime incumbent, and it’s generated some serious buzz due to recent protests and homophobic remarks. Could it really be close?

More generally, I’m curious about the way Philadelphia’s gentrification and the 2016 election have changed electoral power structures. Even in 2015, Helen Gym won largely on the votes of Center City and the ring around it. But then 2016 happened, and turnout in those neighborhoods reached unprecedented heights. Exactly how powerful is that cohort? And while they’re strong citywide, have they taken over specific districts, to be able to dictate outcomes there?

Blackwell hasn’t faced a primary challenger since 1999, so we don’t have any evidence on her individual strength. Let’s instead look at recent competitive elections that could illustrate the neighborhoods’ relative views.

What are the neighborhood cohorts that will decide District 3? Is the Krasner/Gym base strong enough on its own to dictate the election, or does the traditionally decisive West and Southwest Philly base still hold sway?

District 3’s voting blocs

In the last three Democratic primaries, District 3 has displayed two clear voting blocs: University City and farther West/Southwest Philly.

View code
library(tidyverse)
library(rgdal)
library(rgeos)
library(sp)
library(ggmap)

sp_council <- readOGR("../../../data/gis/city_council/Council_Districts_2016.shp", verbose = FALSE)
sp_council <- spChFIDs(sp_council, as.character(sp_council$DISTRICT))

sp_divs <- readOGR("../../../data/gis/2016/2016_Ward_Divisions.shp", verbose = FALSE)
sp_divs <- spChFIDs(sp_divs, as.character(sp_divs$WARD_DIVSN))
sp_divs <- spTransform(sp_divs, CRS(proj4string(sp_council)))

load("../../../data/processed_data/df_major_2017_12_01.Rda")

ggcouncil <- fortify(sp_council) %>% mutate(council_district = id)
ggdivs <- fortify(sp_divs) %>% mutate(WARD_DIVSN = id)
View code
races <- tribble(
  ~year, ~OFFICE, ~office_name,
  "2015", "MAYOR", "Mayor",
  "2015", "COUNCIL AT LARGE", "City Council",
  "2016", "PRESIDENT OF THE UNITED STATES", "President",
  "2017", "DISTRICT ATTORNEY", "District Attorney"
) %>% mutate(election_name = paste(year, office_name))

candidate_votes <- df_major %>% 
  filter(election == "primary" & PARTY == "DEMOCRATIC") %>%
  inner_join(races %>% select(year, OFFICE)) %>%
  mutate(WARD_DIVSN = paste0(WARD16, DIV16)) %>%
  group_by(WARD_DIVSN, OFFICE, year, election) %>%
  mutate(
    total_votes = sum(VOTES),
    pvote = VOTES / sum(VOTES)
  ) %>% 
  group_by()

turnout_df <- candidate_votes %>%
  filter(OFFICE != "COUNCIL AT LARGE") %>% 
  group_by(WARD_DIVSN, OFFICE, year, election) %>%
  summarise(total_votes = sum(VOTES)) %>%
  left_join(
    sp_divs@data %>% select(WARD_DIVSN, AREA_SFT)
  )

turnout_df$AREA_SFT <- asnum(turnout_df$AREA_SFT)

The third council district covers West Philly, from the Schuylkill River to the city line.

View code
get_labpt_df <- function(sp){
  mat <- sapply(sp@polygons, slot, "labpt")
  df <- data.frame(x = mat[1,], y=mat[2,])
  return(
    cbind(sp@data, df)
  )
}

ggplot(ggcouncil, aes(x=long, y=lat)) +
  geom_polygon(
    aes(group=group),
    fill = strong_green, color = "white", size = 1
  ) +
  geom_text(
    data = get_labpt_df(sp_council),
    aes(x=x,y=y,label=DISTRICT)
  ) +
  theme_map_sixtysix() +
  coord_map() +
  ggtitle("Council Districts")

plot of chunk council_map

View code
DISTRICT <- "3"
sp_district <- sp_council[row.names(sp_council) == DISTRICT,]

bbox <- sp_district@bbox
## expand the bbox 20% for mapping
bbox <- rowMeans(bbox) + 1.2 * sweep(bbox, 1, rowMeans(bbox))

basemap <- get_map(bbox, maptype="toner-lite")

district_map <- ggmap(
  basemap, 
  extent="normal", 
  base_layer=ggplot(ggcouncil, aes(x=long, y=lat, group=group)),
  maprange = FALSE
) +
  theme_map_sixtysix() +
  coord_map(xlim=bbox[1,], ylim=bbox[2,])

sp_divs$council_district <- over(
  gCentroid(sp_divs, byid = TRUE), 
  sp_council
)$DISTRICT

sp_divs$in_bbox <- sapply(
  sp_divs@polygons,
  function(p) {
    coords <- p@Polygons[[1]]@coords
    any(
      coords[,1] > bbox[1,1] &
      coords[,1] < bbox[1,2] &
      coords[,2] > bbox[2,1] &
      coords[,2] < bbox[2,2] 
    )
  }
)

ggdivs <- ggdivs %>% 
  left_join(
    sp_divs@data %>% select(WARD_DIVSN, in_bbox)
  )

district_map +
  geom_polygon(
    aes(alpha = (id == DISTRICT)),
    fill="black",
    color = "grey50",
    size=2
  ) +
  scale_alpha_manual(values = c(`TRUE` = 0.2, `FALSE` = 0), guide = FALSE) +
  ggtitle(sprintf("Council District %s", DISTRICT))

plot of chunk district_map

First, let’s look at the results from four recent, compelling Democratic Primary races: 2015 City Council At Large and Mayor, 2016 President, and 2017 District Attorney. The maps below show the vote for the top two candidates in District 3. The exception is City Council in 2015, where I use Helen Gym and Isaiah Thomas, who were 4th and 5th in the district, and 5th and 6th citywide.

View code
candidate_votes <- candidate_votes %>%
  left_join(sp_divs@data %>% select(WARD_DIVSN, council_district))

## Choose the top two candidates in district 3
# Except for city council, where we choose Gym and Thomas
# candidate_votes %>% 
#   group_by(OFFICE, year, CANDIDATE) %>% 
#   summarise(
#     city_votes = sum(VOTES), 
#     district_votes = sum(VOTES * (council_district == DISTRICT))
#   ) %>% 
#   arrange(desc(district_votes)) %>%
#   filter(OFFICE == "COUNCIL AT LARGE")

candidates_to_compare <- tribble(
  ~year, ~OFFICE, ~CANDIDATE, ~candidate_name, ~row,
  "2015", "COUNCIL AT LARGE", "HELEN GYM", "Helen Gym", 1,
  "2015", "COUNCIL AT LARGE", "ISAIAH THOMAS", "Isaiah Thomas", 2,
  "2015", "MAYOR", "JIM KENNEY", "Jim Kenney",  1,
  "2015", "MAYOR", "ANTHONY HARDY WILLIAMS", "Anthony Hardy Williams", 2,
  "2016", "PRESIDENT OF THE UNITED STATES", "BERNIE SANDERS", "Bernie Sanders", 1,
  "2016", "PRESIDENT OF THE UNITED STATES", "HILLARY CLINTON", "Hillary Clinton", 2,
  "2017", "DISTRICT ATTORNEY", "LAWRENCE S KRASNER", "Larry Krasner", 1,
  "2017", "DISTRICT ATTORNEY", "TARIQ KARIM EL SHABAZZ","Tariq Karim El Shabazz", 2
)

candidate_votes <- candidate_votes %>%
  left_join(races) %>%
  left_join(candidates_to_compare)

vote_adjustment <- function(pct_vote, office){
  ifelse(office == "COUNCIL AT LARGE", pct_vote * 4, pct_vote)
}

district_map +
  geom_polygon(
    data = ggdivs %>%
      filter(in_bbox) %>%
      left_join(
        candidate_votes %>% filter(!is.na(row))
      ),
    aes(fill = 100 * vote_adjustment(pvote, OFFICE))
  ) +
  scale_fill_viridis_c("Percent of Vote") +
  theme(
    legend.position =  "bottom",
    legend.direction = "horizontal",
    legend.justification = "center"
  ) +
  geom_polygon(
    fill=NA,
    color = "white",
    size=1
  ) +
  geom_label(
    data=candidates_to_compare %>% left_join(races),
    aes(label = candidate_name),
    group=NA,
    hjust=0, vjust=1,
    x=-75.258,
    y=39.985
  ) +
  facet_grid(row ~ election_name) +
  theme(strip.text.y = element_blank()) +
  ggtitle(
    sprintf("Candidate performance in District %s", DISTRICT), 
    "Percent of vote (times 4 for Council, times 1 for other offices)"
  )

plot of chunk proportion
Notice that these competitive elections all split along the same boundaries: University City versus farther West and Southwest Philly. The candidates’ overall results were different (Sanders lost the district, Krasner won), but their relative strengths were in exactly the same places. Demographically, the split is obvious: University City is predominantly White and wealthier; farther West is predominantly Black and has lower incomes. Even though Krasner did well across the city, and Shabazz poorly, Krasner did disproportionately well in University City, and Shabazz disproportionately well farther West and Southwest.

Turnout is a more complicated story.

View code
# hist(turnout_df$total_votes / turnout_df$AREA_SFT)

turnout_df <- turnout_df %>%
  left_join(races)

district_map +
  geom_polygon(
    data = ggdivs %>%
      filter(in_bbox) %>%
      left_join(turnout_df, by =c("id" = "WARD_DIVSN")),
    aes(fill = pmin(total_votes / AREA_SFT, 0.0005))
  ) +
  scale_fill_viridis_c(guide = FALSE) +
  geom_polygon(
    fill=NA,
    color = "white",
    size=1
  ) +
  facet_wrap(~ election_name) +
  ggtitle(
    "Votes per square mile in the Democratic Primary", 
    sprintf("Council District %s", DISTRICT)
  )

plot of chunk turnout_map
The 2017 election was completely different from 2015. In 2015, we saw the West and Southwest Philly neighborhoods dominate the vote, and decide the election. In 2017, University City (really, Cedar Park and Spruce Hill) boomed for Krasner. While Gym, Kenney, and Sanders all monopolized the University City percent of the vote, only Krasner multiplied that effect by monopolizing the turnout.

The change in votes per square mile from 2015 to 2017 illustrates that starkly.

View code
turnout_wide <- turnout_df %>%
  group_by() %>%
  mutate(
    votes_per_sf = total_votes / AREA_SFT,
    key = paste0("votes_", year)
  ) %>%
  select(WARD_DIVSN, key, votes_per_sf) %>%
  spread(key = key, value = votes_per_sf)

district_map +
  geom_polygon(
    data = ggdivs %>%
      filter(in_bbox) %>%
      left_join(turnout_wide),
    aes(
      fill = (votes_2017 - votes_2015)*5280^2
    )
  ) +
  scale_fill_gradient2(
    "Change in votes per square mile\n  2015 to 2017",
    low=strong_orange,
    mid="white",
    high=strong_purple,
    midpoint=0
  ) +
  geom_polygon(
    fill=NA,
    color = "black",
    size=1
  )  +
  theme(legend.position = "bottom", legend.direction = "horizontal") +
  ggtitle(
    sprintf("Change in votes per square mile, District %s", DISTRICT),
    "Orange: More votes in 2015, Purple: More in 2017"
  )

plot of chunk relative_turnout_15_17

To simplify the analysis, let’s divide the District into the two distinct coalitions: the Clinton/Hardy Williams “West & Southwest”, and the Krasner/Sanders “University City”. While they’re obvious on the map, we need a rule to split them up; ideally, there would be natural clusters to divide them. A simple rule turns out to be surprisingly useful: split on whether a division’s average Krasner/Sanders vote was greater than 50%.

View code
district_categories <- candidate_votes %>% 
    filter(
      council_district == DISTRICT & 
        candidate_name %in% c("Larry Krasner", "Bernie Sanders")
    ) %>%
    group_by(WARD_DIVSN) %>%
    mutate(votes_2016 = total_votes[year == 2016]) %>%
    select(WARD_DIVSN, votes_2016, candidate_name, pvote) %>%
    spread(key=candidate_name, value=pvote)

ggplot(
  district_categories,
  aes(x = 100 * `Bernie Sanders`, y = 100 * `Larry Krasner`)
) +
  geom_point(aes(size = votes_2016), alpha = 0.7) +
  scale_size_area("Total Votes in 2016")+
  theme_sixtysix() +
  xlab("Percent of Vote for Bernie Sanders") +
  ylab("Percent of Vote for Larry Krasner") +
  coord_fixed() + 
  geom_abline(slope = -1, intercept =  100) +
  annotate(
    geom = "text",
    x = c(35, 20),
    y = c(15, 87),
    hjust = 0,
    label = c("West & Southwest", "University City"),
    color = c(strong_green, strong_purple),
    fontface="bold"
  ) +
  ggtitle("Divisions' vote", sprintf("District %s Democratic Primary", DISTRICT))

plot of chunk scatter_bernie_gym
We’ll call the divisions above the line University City, and those below the line West & Southwest.

Here’s the map of the cohorts that this categorization gives us.

View code
district_categories$category <- with(
  district_categories,
  (`Bernie Sanders` + `Larry Krasner`) > 1.0
)
district_categories$cat_name <- ifelse(
  district_categories$category,
  "University City",
  "West & Southwest"
)

district_map + 
  geom_polygon(
    data = ggdivs %>% 
      left_join(district_categories) %>% 
      filter(!is.na(cat_name)),
    aes(fill = cat_name)
  ) +
  scale_fill_manual(
    "",
    values = c("University City" = strong_purple, "West & Southwest" = strong_green)
  ) +
  ggtitle(sprintf("District %s neighborhood divisions", DISTRICT))

plot of chunk category_map
Looks reasonable.

How did the candidates do in each of the two sections? The boundary separates starkly different results.

View code
neighborhood_summary <- candidate_votes %>% 
  inner_join(candidates_to_compare) %>%
  group_by(candidate_name, election_name) %>%
  mutate(
    citywide_votes = sum(VOTES),
    citywide_pvote = 100 * sum(VOTES) / sum(total_votes)
  ) %>%
  filter(council_district == DISTRICT) %>%
  left_join(district_categories) %>%
  group_by(candidate_name, citywide_votes, citywide_pvote, election_name, cat_name) %>%
  summarise(
    votes = sum(VOTES),
    pvote = 100 * sum(VOTES) / sum(total_votes),
    total_votes = sum(total_votes)
  ) %>%
  group_by(candidate_name, election_name) %>%
  mutate(
    district_votes = sum(votes),
    district_pvote = 100 * sum(votes) / sum(total_votes)
  ) %>% select(
    election_name, candidate_name, citywide_pvote, district_pvote, cat_name, pvote, total_votes
  ) %>%
  gather(key="key", value="value", pvote, total_votes) %>%
  unite("key", cat_name, key) %>%
  spread(key, value)


neighborhood_summary %>%
  knitr::kable(
    digits=0, 
    format.args=list(big.mark=','),
    col.names=c("Election", "Candidate", "Citywide %", sprintf("District %s %%", DISTRICT), "University City %", "University City Turnout", "West & Southwest %", "West & Southwest Turnout")
  )

 

| Election | Candidate | Citywide % | District 3 % | University City % | University City Turnout | West & Southwest % | West & Southwest Turnout |
|---|---|---|---|---|---|---|---|
| 2015 City Council | Helen Gym | 8 | 8 | 16 | 18,521 | 5 | 47,400 |
| 2015 City Council | Isaiah Thomas | 7 | 8 | 7 | 18,521 | 8 | 47,400 |
| 2015 Mayor | Anthony Hardy Williams | 26 | 48 | 24 | 5,738 | 55 | 19,335 |
| 2015 Mayor | Jim Kenney | 56 | 39 | 62 | 5,738 | 33 | 19,335 |
| 2016 President | Bernie Sanders | 37 | 39 | 59 | 12,376 | 30 | 27,991 |
| 2016 President | Hillary Clinton | 63 | 61 | 41 | 12,376 | 70 | 27,991 |
| 2017 District Attorney | Larry Krasner | 38 | 51 | 73 | 7,125 | 36 | 11,113 |
| 2017 District Attorney | Tariq Karim El Shabazz | 12 | 15 | 5 | 7,125 | 22 | 11,113 |

Gym won 16% in University City, but only 5% in West & Southwest; Thomas ran an even 7% and 8%, respectively. Kenney won 62% in University City and only 33% in West and Southwest; Hardy Williams flipped that, with 24% and 55%. Krasner won an astounding 73% of the vote in University City (in a crowded race!), and only 36% in West and Southwest, though that was still good enough to win the neighborhood. El Shabazz won 5% and 22%.

Also, notice the dramatic change in relative turnout. In the 2015 Mayoral race, West & Southwest had 3.4 times the vote of University City. The dramatic turnout swing of 2017 shrank that to 1.6. West and Southwest still hold most of the voters (among substantially more households), but the share of each section that a candidate needs to win shifts with turnout.

The relative power of West and Southwest and University City

How much does the power shift between the two cohorts? Let’s do some math.

How much does a candidate need from each of the sections to win? Let \(t_i\) be the relative turnout in section \(i\), defined as that section’s proportion of total votes. So in the 2017 District Attorney race, \(t_i\) was 0.39 for University City, and 0.61 for West & Southwest. Let \(p_{ic}\) be the proportion of the vote received by candidate \(c\) in section \(i\), so in 2017, \(p_{ic}\) is 0.73 for Krasner in University City.

Then a candidate wins a two-way race whenever the turnout-weighted proportion of their vote is greater than 0.5: \(\sum_i t_i p_{ic} > 0.5\).

Since we’ve divided District 3 into only two sections, we can show this on a two-dimensional plot. On the x-axis, let’s map a candidate’s percent of the vote in University City, and on the y-axis, a candidate’s percent of the vote in West & Southwest (assuming a two-person race). The candidate wins whenever the average of their proportions, weighted by \(t_i\), is greater than 50%. If the turnout looks like 2015, West & Southwest easily carry the District; if it looks like 2017, the sections carry nearly equal weight. The dashed lines show the win boundaries; candidates to the top-right of the lines win.

I’ll plot only the two-candidate vote for the top two candidates in the district for each race, to emulate a two-person race. (For City Council in 2015, I use Helen Gym and Isaiah Thomas, who were 4th and 5th in the district, and 5th and 6th citywide.)

View code
get_line <- function(x_total_votes, y_total_votes){
  ## solve p_x t_x+ p_y t_y > 50
  tot <- x_total_votes + y_total_votes
  tx <- x_total_votes / tot
  ty <- y_total_votes / tot

  slope <- -tx / ty
  intercept <- 50 / ty  # use 50 since proportions are x100
  c(intercept, slope)
}

line_2017 <- with(
  neighborhood_summary,
  get_line(
    `University City_total_votes`[candidate_name == "Larry Krasner"],
    `West & Southwest_total_votes`[candidate_name == "Larry Krasner"]
  )
)

## get the two-candidate vote
neighborhood_summary <- neighborhood_summary %>%
  group_by(election_name)  %>% 
  mutate(
    ucity_pvote_2cand = `University City_pvote` / sum(`University City_pvote`),
    wsw_pvote_2cand = `West & Southwest_pvote`/sum(`West & Southwest_pvote`)
  )

line_2015 <- with(
  neighborhood_summary,
  get_line(
    `University City_total_votes`[candidate_name == "Jim Kenney"],
    `West & Southwest_total_votes`[candidate_name == "Jim Kenney"]
  )
)

library(ggrepel)

ggplot(
  neighborhood_summary,
  aes(
    x=100*ucity_pvote_2cand,
    y=100*wsw_pvote_2cand
  )
) +
  geom_point() +
  geom_text_repel(aes(label=candidate_name)) +
  geom_abline(
    intercept = c(line_2015[1], line_2017[1]),
    slope = c(line_2015[2], line_2017[2]),
    linetype="dashed"
  ) +
  coord_fixed() + 
  scale_x_continuous(
    "University City percent of vote",
    breaks = seq(0,100,10)
  ) +
  scale_y_continuous(
    "West & Southwest percent of vote",
    breaks = seq(0, 100, 10)
  ) +
  annotate(
    geom="text",
    label=paste(c(2015, 2017), "turnout"),
    x=c(10, 8),
    y=c(
      line_2015[1] + 10 * line_2015[2],
      line_2017[1] + 8 * line_2017[2]
    ),
    hjust=0,
    vjust=-0.2,
    angle = atan(c(line_2015[2], line_2017[2])) / pi * 180,
    color="grey40"
  )+
  annotate(
    geom="text",
    x = 70,
    y=75,
    label="Candidate wins",
    fontface="bold",
    color = strong_green
  ) +
  geom_hline(yintercept = 50, color="grey50") +
  geom_vline(xintercept = 50, color="grey50")+
  expand_limits(x=100, y=80)+
  theme_sixtysix() +
  ggtitle(
    "The relative strength of W & SW Philly and U City",
    "Candidates to the top-right of the lines win."
  )

plot of chunk win_scatter

Hillary Clinton and Larry Krasner won the district in a landslide, with Clinton winning despite losing University City to Sanders. Helen Gym and Jim Kenney were in the turnout-dependent zone: they would win the district if turnout looked like 2017, and lose it if turnout looked like 2015 (and vice versa for Hardy Williams and Thomas).

So could a candidate who monopolized University City win? Maybe, but it’s hard. If turnout looks like 2017, then a candidate who wins 70% of the University City vote still needs to win 37% of the West and Southwest vote. If the turnout looks like 2015, the required W/SW vote jumps to 44%. Clinton and Krasner pulled off dominant victories that would win in any turnout climate; for Hardy Williams, Kenney, Thomas, and Gym, the neighborhoods’ relative turnout was decisive.
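
Those thresholds follow directly from the win condition. As a minimal sketch (the helper name `required_share` is hypothetical, not part of the original code), solve the turnout-weighted average for the West & Southwest share, using the turnout totals from the table above:

```r
# Hypothetical helper: the West & Southwest share (in percent) a candidate
# needs to reach 50% district-wide, given their University City share.
required_share <- function(p_ucity, ucity_votes, wsw_votes) {
  t_ucity <- ucity_votes / (ucity_votes + wsw_votes)  # relative turnout
  t_wsw <- 1 - t_ucity
  (50 - t_ucity * p_ucity) / t_wsw
}

# Turnout totals copied from the summary table.
need_2017 <- required_share(70, ucity_votes = 7125, wsw_votes = 11113)  # 2017 DA
need_2015 <- required_share(70, ucity_votes = 5738, wsw_votes = 19335)  # 2015 Mayor

round(c(`2017` = need_2017, `2015` = need_2015))  # 37 and 44
```

The same function gives the required share for any other University City result, for example `required_share(60, 7125, 11113)`.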

Looking to May

I don’t know how Jamie Gauthier will fare in University City or in West & Southwest Philly, but my hunch is that she’s seeking the reformist, University City lane. But that’s a hard lane to win in. Even if she achieves Gym and Kenney percentages, she would need to additionally inspire turnout the way that Krasner did. Alternatively, she needs to pull substantially more support from West and Southwest than Gym and Kenney did. It’s possible, but a steep climb.

What At Large City Councilors most polarized the vote?

May’s primary will include elections for Philadelphia City Council. The council consists of 17 members, ten of whom are elected by specific districts and seven of whom are At Large, elected by the city as a whole. Of those seven at large, only five can come from the same party. In practice, this means that five Democrats will win this primary, and then win landslide elections in November.

In advance of May, I’m going to be looking at what it takes to win a Democratic City Council At Large seat. Today, let’s look at how polarizing candidates are.

[Note: I’m starting today making my blog posts in RMarkdown. Click the View Code to see the R code!]

View code
## You can access the data at: 
## https://github.com/jtannen/jtannen.github.io/tree/master/data
# load("df_major_2017_12_01.Rda")

df_major$CANDIDATE <- gsub("\\s+", " ", df_major$CANDIDATE)
df_major$PARTY[df_major$PARTY == "DEMOCRATIC"] <- 'DEMOCRAT'

df_major <- df_major %>% 
  filter(
    election == "primary" &
      OFFICE == "COUNCIL AT LARGE" &
      PARTY %in% c("DEMOCRAT")
  )

df_total <- df_major %>% 
  group_by(CANDIDATE, year, PARTY) %>%
  summarise(votes = sum(VOTES)) %>%
  group_by(year, PARTY) %>%
  arrange(desc(votes)) %>%
  mutate(rank = rank(desc(votes)))

div_votes <- df_major %>%
  group_by(WARD16, DIV16, OFFICE, year) %>%
  summarise(div_votes = sum(VOTES))

Measuring Vote Polarization

One way to measure polarization is using the Gini coefficient, common in studying inequality. Suppose for each candidate we line up the precincts in order of their percent of the vote. We then move down the precincts, adding up the total voters and the votes for that candidate. We plot the curve, with the cumulative voters along the x axis, and the cumulative votes for that candidate along the y.

The curvature of that line is a measure of the inequality of the distribution of votes. In this case, I call that polarization. Suppose a candidate got 50% of the vote in every single precinct. Then the curve would just be a straight line with a slope of 0.5; there would be no polarization. Alternatively, if a candidate got zero of the votes from 90% of the precincts, but all of the vote in the remaining 10%, then the curve would be flat at 0 for the first 90% of the x-axis, but then bend and shoot up; a sharp curve and a lot of polarization.

View code
vote_cdf <- df_major %>%
  left_join(div_votes) %>%
  group_by(CANDIDATE, year) %>%
  mutate(
    p_vote_div = VOTES / div_votes,
    cand_vote_total = sum(VOTES)
  ) %>%
  arrange(p_vote_div) %>%
  mutate(
    cum_votes = cumsum(VOTES),
    vote_cdf = cum_votes / cand_vote_total,
    cum_denom = cumsum(div_votes) / sum(div_votes)
  ) 

ggplot(
  vote_cdf %>% 
    left_join(df_total) %>%
    filter(year == 2015 & rank <= 7),
  aes(x=cum_denom, y=cum_votes)
) + geom_line(
    aes(group=CANDIDATE, color=CANDIDATE),
    size=1
) +
  geom_text(
    data = vote_cdf %>% 
    left_join(df_total) %>%
    filter(year == 2015 & rank <= 7) %>%
      group_by(CANDIDATE) %>%
      filter(cum_votes == max(cum_votes)),
    aes(label = tolower(CANDIDATE)),
    x = 1.01,
    hjust = 0
  ) +
  xlab("Cumulative voters") +
  scale_y_continuous(
    "Cumulative votes for candidate",
    labels=scales::comma
  ) +
  scale_color_discrete(guide=FALSE)+
  expand_limits(x=1.3)+
  theme_sixtysix() +
  ggtitle(
    "Vote distributions for 2015 Council At Large",
    "Top seven finishers"
  )

plot of chunk gini

Above is that plot for the top seven At Large finishers in 2015 (remember that five Democrats can win). Helen Gym was the fifth. Interestingly, she also was the most polarizing: 49.4% of her votes came from her best 25% of divisions. For comparison, 38.3% of Derek Green’s votes came from his best 25% of divisions.

If we scale each candidate’s y-axis by their final total votes, the difference in curvature is even more stark.

View code
ggplot(
  vote_cdf %>% 
    left_join(df_total) %>%
    filter(year == 2015 & rank <= 7),
  aes(x=cum_denom, y=vote_cdf)
) + geom_line(
  aes(group=CANDIDATE, color=CANDIDATE),
  size=1
) +
  coord_fixed() +
  geom_abline(slope = 1, intercept = 0) +
  xlab("Cumulative voters") +
  ylab("Cumulative proportion of candidate's votes") +
  scale_color_discrete(guide = FALSE) +
  annotate(
    geom="text",
    y = c(0.45, 0.3),
    x = c(0.52, 0.6),
    hjust = c(1, 0),
    label = c("william k greenlee", "helen gym")
  ) +
  theme_sixtysix() +
    ggtitle(
    "Vote distributions for 2015 Council At Large",
    "Top seven finishers, scaled for total votes"
  )

plot of chunk gini_scaled

So Helen Gym snuck in four years ago, with a highly polarized vote. Is that common for new challengers? Not really. Usually, it’s hard to win without more even support.

To summarise the curvature into a single number, the Gini coefficient is defined as the area above the curve but below the 45-degree line, divided by the total area of the triangle. Notice that the more curved the line, the more area between the 45-degree line and the curve, and the higher the coefficient. If there is no inequality, the Gini coefficient is 0; if there’s complete inequality, it’s 1. Helen Gym’s Gini coefficient is 0.35, Bill Greenlee’s is 0.19.
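
To make the definition concrete, here is a toy sketch with ten equal-sized, made-up precincts. The helper `gini_votes` is hypothetical and written only for this illustration; it uses the same trapezoid idea as the full calculation, but on two invented candidates: one with 50% everywhere, and one whose votes all come from a single precinct.

```r
# Hypothetical helper: Gini coefficient of one candidate's vote distribution.
# votes  = the candidate's votes in each precinct
# voters = total votes cast in each precinct
gini_votes <- function(votes, voters) {
  ord <- order(votes / voters)                   # weakest precincts first
  x <- c(0, cumsum(voters[ord]) / sum(voters))   # cumulative share of voters
  y <- c(0, cumsum(votes[ord]) / sum(votes))     # cumulative share of the candidate's votes
  area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)  # trapezoid rule
  1 - 2 * area
}

voters <- rep(100, 10)  # toy data, not real precincts

gini_even <- gini_votes(rep(50, 10), voters)           # 50% in every precinct
gini_lumpy <- gini_votes(c(rep(0, 9), 100), voters)    # everything from one precinct

c(even = gini_even, lumpy = gini_lumpy)  # 0 and 0.9
```

With only ten precincts the most concentrated candidate tops out at 0.9 rather than 1; the maximum approaches 1 as the votes concentrate in a smaller share of voters.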

Below I plot each candidate’s proportion of the vote on the x-axis (blue names are winners), and their Gini coefficient on the y-axis (higher values are more polarized).

View code
gini <- vote_cdf %>% 
  arrange(CANDIDATE, year, cum_denom) %>%
  group_by(CANDIDATE, year) %>%
  mutate(
    is_first = cum_denom == min(cum_denom),
    bin_width = cum_denom - ifelse(is_first, 0, lag(cum_denom)),
    avg_height = (vote_cdf + ifelse(is_first, 0, lag(vote_cdf)))/2,
    area = bin_width * avg_height
  ) %>% 
  summarise(
    gini = 1 - 2 * sum(area),
    total_votes = weighted.mean(p_vote_div, div_votes)
  )

ggplot(
  gini %>% left_join(df_total) %>% filter(rank <= 10), 
  aes(x=total_votes, y=gini)
) + 
  geom_text(
    aes(label=tolower(CANDIDATE), color=(rank<=5)),
    size = 3
  ) +
  scale_color_manual(
    "winner", 
    values=c(`TRUE` = strong_blue, `FALSE` = strong_red),
    guide = FALSE
  )+
  scale_x_continuous(
    "proportion of vote",
    expand=expand_scale(mult=0.2)
  ) +
  ylab("gini coefficient (higher means more polarization)")+
  facet_wrap(~year) +
  theme_sixtysix() +
  ggtitle("Total votes versus vote polarization",
          "Top ten finishers for City Council At Large. Winners in blue.")

plot of chunk gini_scatter

Helen Gym had the highest Gini coefficient of any winner in the last four elections, and no one else was close.

There are a few things going on here. First, the winners are usually incumbents, and incumbents probably benefit from name recognition across the city. All of the winners in 2011 were incumbents, for example.

But even the non-incumbents who won had more even support. Allan Domb had the second lowest Gini coefficient in 2015, and Derek Green the third. Greenlee and Bill Green had the lowest Gini coefficients when they won as challengers in 2007 (Greenlee was technically an incumbent from a 2006 Special Election).

There are a few ways to view Helen Gym’s polarization. Remember that this is unrelated to total proportion of the vote; she won the fifth most votes, more than candidates who had even and low support across the city. She did so by particularly consolidating her neighborhoods, mobilizing the wealthier, whiter progressive wards that formed her coalition (presumably with the incumbency, she will receive broader support this time around).

View code
# library(sf)
# divs <- st_read("2016_Ward_Divisions.shp", quiet = TRUE)

gym_vote <- divs %>% 
  left_join(
    df_major %>% 
      filter(year == 2015) %>% 
      mutate(WARD_DIVSN = paste0(WARD16, DIV16)) %>% 
      group_by(WARD_DIVSN) %>% 
      mutate(p_vote = VOTES / sum(VOTES)) %>% 
      filter(CANDIDATE == "HELEN GYM")
    )

ggplot(gym_vote)+ 
  geom_sf(
    aes(fill = p_vote * 100),
    color = NA
  ) +
  theme_map_sixtysix() +
  scale_fill_viridis_c("% of Vote") +
  ggtitle(
    "Helen Gym's percent of the vote, 2015",
    "Voters could vote for up to five At Large candidates"
  )

plot of chunk gym_vote

One perspective is that she won entirely on the support of whiter, wealthier liberals. Another is that she managed to squeeze the last drips of votes out of those neighborhoods, eking out her edge over candidates with similar city-wide votes. Notably, the common concern around a candidate with this base would be that she would ignore the lower income, Black and Hispanic neighborhoods that didn’t vote for her, but I don’t think that’s a common complaint lodged against the fierce public education advocate.

What coalitions win the City Council At Large seats?

One question I find fascinating is what coalitions candidates use to win. Gym clearly won with the wealthier white progressive wards, but candidates may also just as often win with support of the Black wards, or the more conservative Northeast and deep South Philly. In the upcoming months, I’m going to dig more into this question.

Philadelphia’s Court of Common Pleas elections are decided by a lottery.

New year, new election.

This time around, May’s election will have the Mayoral race at the top of the ballot, and City Council races. Odd-year primaries also bring Philadelphia’s lesser-followed judicial elections. These elections are problematic: we vote for multiple candidates each election, upwards of 7, and even the most educated voter really doesn’t know anything about any of them. In these low-information elections, the most important factor in getting elected is whether a candidate ends up in the first column of the ballot. Ballot position is decided by a lottery. Philadelphia elects its judges at random.

I’ve looked into this before, measuring the impact of ballot position on the Court of Common Pleas, and then that impact by neighborhood. We even ran an experiment to see if we could improve the impact of the Philadelphia Bar’s Recommendations. But first, let’s revisit the story, update it with 2017 results, and establish the gruesome baseline.

The Court of Common Pleas
The most egregious race is for the city’s Court of Common Pleas. That court is responsible for major civil and criminal trials, juvenile and domestic relations, and orphans, so the bulk of our judicial cases. Common Pleas judges are elected to 10-year terms.

The field for Common Pleas is huge; we don’t know how many openings there will be this year, but in the five elections since 2009, there was an average of 31 candidates competing for 9 openings. Even the most informed voter doesn’t know the name of most of these candidates when she enters the voting booth, and ends up either relying on others’ recommendations or voting at random. This is exactly the type of election where other factors–flyers handed out in front of a polling place, the order of names on the ballot–will dominate.

Because of this low attention, some of the judges that do end up winning come with serious allegations against them. In 2015, Scott Diclaudio won; months later he would be censured for misconduct and negligence, and then get caught having given illegal donations in the DA Seth Williams probe. The 2015 race also elected Judge Lyris Younge, who has since been removed from Family Court for violating family rights, and just in October made headlines by evicting a full building of residents with less than a week’s notice. In 2016, Mark Cohen was voted out as state rep after reports on his excessive use of per-diem expenses. In 2017, he was elected to the Court of Common Pleas.

Every one of those candidates—Diclaudio, Younge, and Cohen—was in the first column of the ballot.

Random luck is more important than Democratic Endorsement or Quality
The order of names on the ballot is entirely random, so this gives us a rare chance to causally estimate the importance of order, compared to other factors.

First, what do I mean by ballot order? Here’s a picture of the 2017 Democratic ballot for the Court of Common Pleas.

There are so many candidates that they need to be arranged in a rectangle. In 2017, candidates were organized into 11 rows and 3 columns. Six of the nine winners came from the first column. (The empty spaces are candidates who dropped out of the race after seeing their lottery pick.)

The shape of the ballot changes wildly over years. Sometimes it’s short and wide, in 2017 it was tall and thin. But in every year, candidates in the first column fare better than others. The results year over year show that the first column consistently receives more votes:

As the above makes clear, candidates do win from later columns every year. But first-column candidates win more than twice as often as if all columns won equally. Below, I collapse the columns for each year, and calculate the number of actual winners from each column, compared to the expected winners if winners were completely at random. The 2011 election marked the high-water mark, when more than three times as many winners came from the first column than should have.
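
The expected-winners comparison is simple arithmetic. As a rough sketch for 2017, assuming a completely full 11-by-3 grid (the real ballot had a few empty slots, so the true candidate counts differ slightly):

```r
# Assumption for illustration: every slot on the 11 x 3 ballot is filled.
n_rows <- 11
n_cols <- 3
n_winners <- 9
actual_first_col <- 6  # first-column winners in 2017, as described above

candidates_per_col <- rep(n_rows, n_cols)

# If winners were drawn at random, each column would expect winners in
# proportion to its share of the candidates.
expected_per_col <- n_winners * candidates_per_col / sum(candidates_per_col)

ratio_first_col <- actual_first_col / expected_per_col[1]
c(expected = expected_per_col[1], actual = actual_first_col, ratio = ratio_first_col)
```

Under that assumption, chance alone would put 3 winners in the first column; the actual 6 is twice that.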

This effect means that Philadelphia ends up with unqualified judges. Let’s use recommendations by the Philadelphia Bar Association as a measure of candidate quality. The Bar Association’s Judicial Commission does an in-depth review of candidates each cycle, interviewing candidates and people who have worked with them. They then rate candidates Recommended or Not Recommended. A Bar Association Recommendation should be considered a low bar; on average, they Recommend two thirds of candidates and more than twice as many candidates as are able to win in a given year. While the Bar Association maintains confidentiality about why they Recommend or don’t Recommend a given candidate (giving candidates wiggle room to complain about its decisions), my understanding is that there have to be serious concerns to be Not Recommended: a history of corruption, gross incompetence, etc.

In short, it’s worrisome when Not Recommended candidates win. But win they do. In the last two elections, a total of six Not Recommended candidates won. They were all from the first column.

Note that, from a causal standpoint, the analysis could stop here. Ballot position is randomized, so there is no need to control for other factors. The effect we see combines the various ways ballot position might matter: most directly because voters are more likely to push buttons in the first column, but perhaps also because endorsers take ballot position into account, or because candidates campaign harder once they see their lottery result. This analysis is uncommonly clean.

But clearly other factors affect who wins, too. How does the strength of ballot position compare to those others?

Let’s compare the strength of ballot position to three other explanatory features: endorsement by the Democratic City Committee, recommendation by the Philadelphia Bar Association, and endorsement by the Philadelphia Inquirer. (There are two other obvious ones, Ward-level endorsements and money spent. I don’t currently have data on that, but maybe I’ll get around to it!)

I regress the log of each candidate’s total votes on their ballot position (being in the first, second, or third column, versus later ones, and being in the first or second row), endorsements from the Democratic City Committee and the Inquirer, and Recommendation by the Philadelphia Bar, using year fixed effects.
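
A regression of that shape could be sketched as follows. Everything here is an assumption for illustration: the data are simulated (not the real results), the column names are made up, and the effect sizes baked into the simulation are arbitrary. The point is only the model formula and how to read a logged outcome.

```r
set.seed(66)
n <- 150

# Simulated stand-in for a candidate-level table; NOT the real election data.
cands <- data.frame(
  year = factor(sample(c(2009, 2011, 2013, 2015, 2017), n, replace = TRUE)),
  col = sample(1:5, n, replace = TRUE),  # ballot column
  row = sample(1:8, n, replace = TRUE),  # ballot row
  dcc = rbinom(n, 1, 0.3),               # Democratic City Committee endorsement
  bar = rbinom(n, 1, 0.65),              # Bar Association Recommended
  inquirer = rbinom(n, 1, 0.3)           # Inquirer endorsement
)

# Ballot-position indicators; columns 4+ and rows 3+ are the baseline.
cands$col1 <- as.numeric(cands$col == 1)
cands$col2 <- as.numeric(cands$col == 2)
cands$col3 <- as.numeric(cands$col == 3)
cands$row1 <- as.numeric(cands$row == 1)
cands$row2 <- as.numeric(cands$row == 2)

# Arbitrary, made-up data-generating process for the simulated votes.
cands$votes <- exp(9 + 1.0 * cands$col1 + 0.7 * cands$dcc + rnorm(n, sd = 0.4))

# Log votes on ballot position and endorsements, with year fixed effects.
fit <- lm(
  log(votes) ~ col1 + col2 + col3 + row1 + row2 + dcc + bar + inquirer + year,
  data = cands
)

# Because the outcome is logged, exp(coefficient) is a multiplier on votes.
exp(coef(fit)[c("col1", "dcc")])
```

In a model like this, a first-column coefficient near log(3) would correspond to roughly tripling a candidate's votes.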

Being in the first column nearly triples your votes versus being in the fourth or later. The second largest effect is the Democratic City Committee endorsement, which doubles your votes versus not having it. (Notice that the DCC correlation isn’t plausibly causal; the Committee certainly takes other factors into consideration. Perhaps this correlation is absorbing candidates’ abilities to raise money or campaign. And it almost certainly is absorbing some of the first-column effect: the DCC is known to take ballot position into consideration!)

The Philadelphia Bar Association’s correlation is sadly indistinguishable from none at all, but with a *big* caveat: the Philadelphia Inquirer has begun simply adopting the Bar’s recommendations as its own, so the Inquirer correlation reflects only Recommended candidates.

​While this analysis doesn’t include candidate fundraising or Ward endorsements, those certainly matter, and would probably rank high on this list if we had the data. Two years ago, I found tentative evidence that Ward endorsements may matter even more than the first column. And Ward endorsements, in turn, cost money.

What are the solutions?
There are obvious solutions to this. I personally think holding elections for positions that no-one pays attention to is an invitation for bad results. It’s better to have those positions be appointed by higher profile officials; it concentrates power in fewer hands, but at least the Mayor or Council know that voters are watching.

A more plausible solution than eliminating judicial elections altogether might be to just randomize ballot positions between divisions, rather than use the same ones across the city. That would mean that all candidates would get the first-column benefit in some divisions and not in others, and would wash away its effect, allowing other factors to determine who wins.
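A toy simulation shows why per-division randomization would wash the effect out. This is a sketch under simplifying assumptions: the "first column" is collapsed into a single top slot, and the candidate and division counts are hypothetical round numbers, not the real ballot.

```python
import random

def top_slot_shares(n_candidates, n_divisions, per_division, seed=0):
    """Share of divisions in which each candidate holds the top ballot slot."""
    rng = random.Random(seed)
    counts = [0] * n_candidates
    citywide_winner = rng.randrange(n_candidates)  # one lottery for the whole city
    for _ in range(n_divisions):
        # Either re-draw the lottery in each division, or reuse the citywide draw.
        top = rng.randrange(n_candidates) if per_division else citywide_winner
        counts[top] += 1
    return [c / n_divisions for c in counts]

# Hypothetical sizes: 10 candidates, 1,700 divisions.
citywide = top_slot_shares(10, 1700, per_division=False)
randomized = top_slot_shares(10, 1700, per_division=True)
print("Single citywide lottery, largest share:", max(citywide))
print("Per-division lottery, largest share:", round(max(randomized), 3))
```

Under the single lottery one candidate holds the top slot everywhere. Under per-division draws every candidate holds it in roughly one tenth of divisions, so no one gets a citywide head start.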

Unfortunately, while these are easy in theory, any changes would require a change to the Pennsylvania Election Code, and may be a hard haul. But one thing everyone seems to agree on is that our current system is bad.

The Turnout Funnel

The turnout in November was unprecedented. Nationally, we had the highest midterm turnout since the passage of the Voting Rights Act. In Philadelphia, some 53% of registered voters voted, a huge increase over four years ago, or really any midterm election in recent memory. Turnout can be hard to wrap your mind around, with a lot of factors that affect who votes and in what elections. What drives those numbers? Is it registration deviations? Or variations in the profile of the election? And how is that structured by neighborhood?

Decomposing the Funnel
I’m conceptualizing the entire process that filters down the whole population as a funnel, with four stages at which a fraction of the population drops out.

Let’s break down the funnel using the following accounting equation:

Voters per mile =
(Population over 18 per mile) x
(Proportion of over-18 population that are citizens) x
(Proportion of citizens that are registered to vote) x
(Proportion of registered voters that voted in 2016) x
(Proportion of those who voted in 2016 that voted in 2018).
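The accounting equation above can be sketched in a few lines of Python. The two wards below are entirely hypothetical, chosen so that they end up with the same voters per mile while losing people at different stages of the funnel.

```python
def voters_per_mile(adults_per_mile, citizen_share, registered_share,
                    voted_2016_share, retained_2018_share):
    """Multiply the funnel stages together, exactly as in the accounting equation."""
    return (adults_per_mile * citizen_share * registered_share
            * voted_2016_share * retained_2018_share)

# Hypothetical ward A: low registration, strong midterm retention.
ward_a = voters_per_mile(10_000, 0.95, 0.60, 0.80, 0.80)
# Hypothetical ward B: high registration, weak midterm retention.
ward_b = voters_per_mile(10_000, 0.95, 0.80, 0.80, 0.60)

print(round(ward_a), round(ward_b))
```

Both wards land on the same number of voters per mile, which is why the final turnout figure alone can't say where people fell out.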

The implications are very different if a neighborhood lags in a given one of these steps. If a ward has a low proportion of citizens that are registered, registration drives make sense. If a neighborhood has very low turnout in midterms vs presidential elections, then awareness and motivation are what matter.

We are also going to see that using metrics based on Registered Voters in Philadelphia is flawed. The lack of removal of voters from the rolls—which is actually a good practice, since the alternative is to aggressively remove potential voters—means that the proportion of registered voters that vote is not a useful metric.

Funnel Maps
Let’s walk through the equation above, map by map.

Overall, 44% of Philadelphia’s adults voted in 2018. That represented high turnout (measured as a proportion of adults) from the usual suspects: the ring around Center City, Chestnut Hill, Mount Airy, and the Oak Lanes.

How does the funnel decompose these differences?

First, consider something not in that map: population density. This is obviously, almost definitionally, important. Wards with more population will have more voters. ​

​The ring around Center City, North Philly, and the lower Northeast have much higher densities than the rest of the city. [Note: See the appendix for a discussion of how I mapped Census data to Wards]

​We don’t allow all of those adults to vote. I don’t have data on explicit voter eligibility, but we can at least look at citizenship.

The city as a whole is 92% citizens, and some 53 of the city’s 66 wards are more than 90%. Notable exceptions include University City, and immigrant destinations in South Philly and the lower Northeast.

The next step in the funnel: what fraction of those citizens are registered to vote? Here’s where things get hairy.

Overall in the city, there are 93% as many registered voters as there are adult citizens. That’s high, and a surprising 22 wards have *more registered voters than the census thinks there are adult citizens*. What’s up? Philadelphia is very slow to remove voters from the rolls, and this imbalance likely represents people who have moved, but haven’t been removed.

While people tend to use facts like this to suggest conspiracy theories, Philadelphia’s ratio is actually quite typical for the state, across wealthy and poor, Republican and Democratic counties. [See below!]

And it’s very (very) important to point out that this is good for democracy: being too aggressive in removing voters means disenfranchising people who haven’t moved. And we have no evidence that anybody who has moved is actually *voting*, just that their name remains in the database.

It does, however, mean that measuring turnout as a fraction of registered voters is misleading.

I explore the registered voter question below, but for the time being let’s remove the registration process from the equation above to sidestep the issue:

Voters per mile =
(Population over 18 per mile) x
(Proportion of Over 18 pop that are citizens) x
(Proportion of citizens that voted in 2016) x
(Proportion of those who voted in 2016 that voted in 2018).

Why break out 2016 voters first, then 2018? Because there are deep differences in the processes that lead people to voting in Presidential elections versus midterms, and low participation in 2016 and 2018 have different solutions. Presidential elections are obviously high-attention, high-energy affairs. If you didn’t vote in 2016, didn’t turn out in the highest-profile, biggest-budget election of the cycle, you either have steep barriers to voting (time constraints, bureaucratic blockers, awareness), or are seriously disengaged. If someone didn’t vote in 2016, it’s hard to imagine you’d get them to vote in 2018.

Compare that to people who voted in 2016 but not in 2018. These voters (a) are all registered, (b) are able to get to the polling place, and (c) know where their polling place is. This group is potentially easier to get to vote in a midterm: they’re either unaware of or uninterested in the lower-profile races of the midterm, or have made the calculated decision to skip it. Whichever the reason, it seems a lot easier to get presidential voters to vote in a midterm than to get the citizens who didn’t even vote in the Presidential race.

So which citizens voted in 2016?

Overall, 63% of Philadelphia’s adult citizens voted. This percentage has large variation across wards, ranging from the low 50s along the river, to 81% in Grad Hospital’s Ward 30 [Scroll to the bottom for a map with Ward numbers]. Wards 28 and 47 in Strawberry Mansion and North Philly have high percentages here, and also had high percentages in the prior map of registered voters, which to me suggests a combination of high registration rates in the neighborhood and a low population estimate from the ACS (see the discussion of registered voters below).

How many of these 2016 voters came out again two years later?

This map is telling. While there were fine-grained changes in the map versus 2014’s midterm, the overall pattern of who votes in midterms didn’t fundamentally change: it’s the predominantly White neighborhoods in Center City, its neighbors, and Chestnut Hill and Mount Airy, coupled with some high-turnout predominantly Black Wards in Overbrook, Wynnefield, and West Oak Lane.

The dark spot is the predominantly Hispanic section of North Philly. It’s not entirely surprising that this region would have low turnout in a midterm, but remember that this map has as its denominator *people who voted in 2016*! So there’s even disproportionate fall-off among people who voted just two years ago.

So that’s the funnel. We have some small variations in citizenship, confusing variations in proportions registered, steep differences in who voted in 2016, and then severely class- and race-based differences in who comes out in the midterm. Chestnut Hill and Center City have high scores on basically all of these dimensions, leading to their electoral dominance (except for Chestnut Hill’s low population density and Center City’s relatively high proportion of non-citizens). Upper North Philly and the Lower Northeast are handcuffed at each stage, with telling correlations between non-citizen residents and later voting patterns, even those which could have been unrelated, such as the turnout among citizens.
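As a rough check on the citywide funnel, the last stage can be backed out from the other figures quoted in this post. This is a sketch, not an official number: all three inputs are rounded, so the result is only approximate.

```python
# Citywide figures quoted in the post (all rounded).
adults_who_voted_2018 = 0.44  # share of adults who voted in 2018
citizen_share = 0.92          # share of adults who are citizens
citizens_voted_2016 = 0.63    # share of adult citizens who voted in 2016

# Dropping density from the equation:
#   adults_who_voted_2018 = citizen_share * citizens_voted_2016 * retained_2018
retained_2018 = adults_who_voted_2018 / (citizen_share * citizens_voted_2016)
print(f"Implied share of 2016 voters who voted again in 2018: {retained_2018:.0%}")
```

So citywide, roughly three quarters of 2016 voters came back for the midterm, with the ward-level maps showing how unevenly that average is distributed.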

What’s going on with the Registered Voters?
It’s odd to see more registered voters than the Census claims there are adult citizens. I’ve claimed this is probably due to failing to remove from the rolls people who move, so let’s look at some evidence.

First, let’s consider the problem of uncertainty in the population estimates. The American Community Survey is a random sample, meaning the population counts have uncertainty. I’ve used the 5-year estimates to minimize that uncertainty, but some still exists. The median margin of error in the population estimates among the wards is +/- 7%. This uncertainty is larger for less populous wards: the margin of error for Ward 28’s population is +/-12%. So the high registration-to-population ratios may be partially driven by an unlucky sample giving a small population estimate. But the size of the uncertainty isn’t enough to explain the values that are well above one, and even if it were large enough to bring the ratios just below one, the idea that nearly 100% of the population would be registered would still be implausibly high.
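To see how far sampling uncertainty can move a registration ratio, here is a minimal sketch. The ward counts are hypothetical, and it assumes the +/- 12% margin quoted for total population applies equally to the adult-citizen estimate, which may not hold exactly.

```python
def ratio_bounds(registered, adult_citizens_estimate, relative_moe):
    """Range of the registered-to-citizen ratio if the population estimate
    is off by up to +/- relative_moe."""
    low_pop = adult_citizens_estimate * (1 - relative_moe)
    high_pop = adult_citizens_estimate * (1 + relative_moe)
    # A larger true population gives a smaller ratio, and vice versa.
    return registered / high_pop, registered / low_pop

# Hypothetical ward: 11,500 registered voters, 10,000 estimated adult citizens.
low, high = ratio_bounds(11_500, 10_000, 0.12)
print(f"Ratio between {low:.2f} and {high:.2f}")
```

Even with the population estimate pushed to the top of its margin, this hypothetical ward's ratio stays above one, which is the pattern the paragraph above describes.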

So instead, let’s look at evidence that these high ratios might be systematic, and due to lags in removal. ​First, consider renters.

Wards with more renters have higher registration rates in the population, including many rates over 100%. On the one hand, this is surprising because we actually expect renters to be *less* likely to vote; they are often newer to their home and to the city, and less invested in local politics. On the other hand, they have higher turnover, so any lag in removing voters will mean more people are registered at a given address. The visible positive correlation suggests that the second is so strong as to even overcome the first.

​(For reference, here’s a map of Ward numbers)

There’s an even stronger signal than renters, though. Consider the Wards that lost population between 2010 and 2016, according to the Census.

The group of Wards that saw the greatest declines in population also have the highest proportions of registered voters (except for 65, which is a drastic outlier). This correlation suggests that the leaving residents may not be removed from the rolls, further pointing the finger at that as the culprit.

Finally, let’s look at Philadelphia versus *the rest of the state*. It turns out that Philadelphia’s registered voter to adult citizen ratio is typical of the state, especially when you consider its high turnover.

First, there isn’t a strong correlation between Registered Voters per adult citizen and Democratic counties.

Allegheny and Philadelphia, the two most Democratic counties, are 1st and 6th, respectively, but predominantly-Republican Pike County, outside of Scranton, is second, and the rest of Philadelphia’s wealthier suburbs are 3, 4, 5, and 7. (I’m not totally sure what’s going on with Forest County, at the bottom of the plot, but the next scatterplot helps a little bit.)

More telling is a county’s fraction of the residents that have moved in since 2000: it looks like counties with higher turnover have higher ratios, which we would expect if the culprit were lagging removal.

Philadelphia here looks entirely normal: counties with more recent arrivals have higher ratios. This also sheds some light on Forest County. With a population of only 7,000, it’s an outlier in terms of long-term residents, and thus has a *very* low registration-to-citizen ratio.

Apologies for the long explanation. I didn’t want to just ignore this finding, but I’m terrified it’ll be discovered and used by some conspiracy theorist.

Exactly where the voters fall out of the funnel matters
Philadelphia’s diverse wards have a diverse array of turnout challenges. Unfortunately, the voter registration rolls are pretty unhelpful as a signal, at least in the simplistic aggregate way I considered them here. (Again: good for democracy, bad for blogging).

Which stage of the funnel matters depends on two things: their respective sizes, and how easy it is to move each. Is it more plausible to get out the citizens who didn’t vote in 2016? Or is it more plausible to get the 2016 voters out again in 2018? Where will registration drives help, and where is that not the issue?

​Next year’s mayoral primary, a local election with an incumbent mayor, will likely be an exaggerated version of the midterm, with even more significant fall-off than we saw in November. More on that later.

Appendix: Merging Census data with Wards
I use population data from the Census Bureau’s American Community Survey, an annual sample of the U.S. population with an extensive questionnaire on demographics and housing. Because it’s a sample, I use the five-year aggregate collected from 2012-2016. This will minimize the sampling uncertainty in the estimates, but means there could be some bias in regions where the population has been changing since that period.

The Census collects data in its own regions: block groups and tracts. These don’t line up with Wards, so I’ve built a crosswalk from census geographies to political ones:
– I map the smallest geography–Census Blocks–to the political wards.
– Using that mapping, I calculate the fraction of each block group’s 2010 population in each Ward. I use 2010 because blocks are only released with the Decennial Census. Largely, any changes over time in block populations are swamped by across-block variations in populations that are consistent through time.
– I then apportion each block group’s ACS counts–of adults, of citizens–according to those proportions.
This method is imperfect; it will be off wherever block groups are split by a ward and the block group is internally segregated, but I think it’s the best available approach.
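The three steps above can be sketched in plain Python. This is an illustration rather than the actual pipeline: the block group IDs, ward labels, and counts are made up, and the function names are hypothetical.

```python
from collections import defaultdict

def block_group_weights(blocks):
    """blocks: iterable of (block_group_id, ward, pop_2010) rows, one per Census block.
    Returns {block_group_id: {ward: share of the block group's 2010 population}}."""
    totals = defaultdict(float)
    by_ward = defaultdict(lambda: defaultdict(float))
    for block_group, ward, pop in blocks:
        totals[block_group] += pop
        by_ward[block_group][ward] += pop
    # Block groups with zero 2010 population get no weights and are skipped.
    return {bg: {ward: pop / totals[bg] for ward, pop in wards.items()}
            for bg, wards in by_ward.items() if totals[bg] > 0}

def apportion(acs_counts, weights):
    """Split each block group's ACS count across wards using the 2010 shares."""
    ward_totals = defaultdict(float)
    for block_group, count in acs_counts.items():
        for ward, share in weights.get(block_group, {}).items():
            ward_totals[ward] += count * share
    return dict(ward_totals)

# Made-up example: BG1 straddles two wards, BG2 sits entirely in W2.
blocks = [("BG1", "W1", 300), ("BG1", "W2", 100), ("BG2", "W2", 200)]
adults = {"BG1": 800, "BG2": 500}  # hypothetical ACS adult counts
ward_adults = apportion(adults, block_group_weights(blocks))
print(ward_adults)
```

In the example, BG1's 800 adults are split 75/25 between the wards because that is how its 2010 block population was split. That is exactly where the method goes wrong if the two halves of a block group differ demographically.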

The State House Model did really well. But it was broken.

Before November’s election, I published a prediction of the Pennsylvania General Assembly Lower House: Democrats would win on average 101.5 seats (half of the 203), and were slight favorites (53%) to win the majority. Then I found a bug, and published an updated prediction: Democrats would win 96 seats on average (95% confidence interval from 88 to 104), and had only a 13% chance of taking the house. This prediction, while still giving Republicans the majority, represented an average 14 seat pickup for Democrats from 2016.

That prediction ended up being correct: Democrats currently have the lead in 93 seats, right in the meat of the distribution.

But as I dug into the predictions, seat-by-seat, it looked like there were a number of seats that I got wildly wrong. And I’ve finally decided that the model had another bug; probably one that I introduced in the fix to the first one. In this post I’ll outline what it was, and what I think the high-level modelling lessons are for next time. 

Where the model did well, and where it didn’t

The mistake ended up being in the exact opposite of the fix I implemented: candidates in races with no incumbents.

The State House has 203 seats. This year, there were 23 uncontested Republicans and 55 uncontested Democrats. I got all of them right 😎. The party imbalance in uncontested seats was actually unprecedented in at least the last two decades; they’re usually about even. Among the 125 contested races, 21 had a Democratic incumbent, 76 a Republican incumbent, and 28 no incumbent. I was a little bit worried this new imbalance meant that the process of choosing which seats to contest had changed, and that past results in contested races would be different from this year. Perhaps Democrats were contesting harder seats, and would win a lower percentage of them than in the past. That didn’t prove true.

The aggregate relationship between my predictions and the results looks good. I got the number of incumbents that would lose, both Democrat and Republican, right. In races without incumbents, I predicted an even split of 14 Democrats and 14 Republicans, and the final result of 17 R – 11 D might seem… ok? Within the range of error, as we saw in the histogram above. But the scatterplot tells a different story.

Above is a scatterplot of my predicted outcome (the X axis) versus the actual outcome (the Y axis). Perfect predictions would be along the 45 degree line. Points are colored by the party of the incumbent (if there was one). The blue dots and the red dots look fine; my model actually expected any given point to be centered around this line, plus/minus 20 points, so that distribution was exactly expected. But black dots look horribly wrong. Among those 28 races without incumbents, I predicted all but three to be close contests, and missed many of them wildly. (That top black dot, for example, was Philadelphia’s District 181, where I predicted Malcolm Kenyatta (D) slightly losing to Milton Street (R). That was wrong, to put it mildly. Kenyatta won 95% of the vote.)

What happened? My new model specification, done in haste to fix the first bug, imposed a faulty logic. It forced the past Presidential races to carry the same information about races without incumbents as races with, even though races with incumbents had other information. I should have allowed the model to fall back on district partisanship when it didn’t have incumbent results, but the equivalent of a modelling typo didn’t allow that. Instead, all of these predictions ended up at a bland, basically even race, because the model couldn’t use the right information to differentiate them. My overall house prediction ended up being good only because just 28 of the total 203 districts were affected, and getting three too many wrong didn’t make the topline look too bad. But it was a bug.

Modelling Takeaways
I’m new to this world of publishing predictive models based on limited datasets and with severe time constraints (I can’t spend the months on a model that I would in grad school or at work). What are the lessons of how to build useful models under these constraints?

Lesson 1: Go through every single prediction. I never looked at District 181. If I had seen that prediction, I would have realized something was terribly wrong. Instead, I looked at the aggregate predictions (similar to the table), and things looked okay enough. Next time, I’ll force myself to go through every single prediction (or a large enough sample of predictions if there are too many). When I tried to hand-pick sanity checks based on my gut, I happened to not choose “a race with no incumbents, but which had an incumbent for decades, and which voted for Clinton at over 85%”.
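One way such a pass over every prediction might look is sketched below: compare each predicted Democratic share against the district's presidential vote and flag the big gaps. The field names, the 20-point threshold, and the numbers are all illustrative assumptions. The District 181 row is only loosely based on the description above (a predicted narrow loss in a district that voted over 85% for Clinton), and the other rows are made up.

```python
def flag_suspicious(predictions, threshold=0.20):
    """Return district IDs whose predicted Democratic share differs from the
    district's presidential Democratic share by more than `threshold`,
    largest gap first."""
    def gap(row):
        return abs(row["predicted_dem_share"] - row["clinton_share"])
    ranked = sorted(predictions, key=gap, reverse=True)
    return [row["district"] for row in ranked if gap(row) > threshold]

# Illustrative rows only; not the model's actual output.
predictions = [
    {"district": "181", "predicted_dem_share": 0.49, "clinton_share": 0.85},
    {"district": "hypothetical A", "predicted_dem_share": 0.55, "clinton_share": 0.52},
    {"district": "hypothetical B", "predicted_dem_share": 0.40, "clinton_share": 0.45},
]
print(flag_suspicious(predictions))
```

A check like this doesn't replace reading the predictions one by one, but it puts the rows most likely to expose a bug at the top of the list.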

Lesson 2: Prefer clarity of the model’s calculations over flexibility. I fell into the trap of trying to specify the full model in a single linear form. Through generous use of interactions, I thought I would allow the model flexibility for it to identify different relationships between historic presidential races and length of incumbency. This would have been correct, if I had implemented it bug-free. But I happened to leave out an important three-way interaction. If I had fit separate models for different classes of races–perhaps allowing the estimates to be correlated across models–I would have immediately noticed the differences.

Lesson 2b: I actually learned this extension to Lesson 2 in the process of fitting, but the post-hoc assessment bangs it home: when you have good information, the model can be quite simple. In this case, the final predictions did well even with the bug because the aggregate result was really pretty easy to predict given three valuable pieces of information: (a) incumbency, (b) past state house results, and (c) FiveThirtyEight’s Congressional predictions. The last was vital: it’s a high-quality signal about the overall sentiment of this year’s election, which is the biggest factor after each district’s partisanship. A model that used only these three data points and then correctly estimated districts’ correlated errors around those trends would have gotten this election spot on.

Predictions will be back
For better or worse, I’ve convinced myself that this project is actually possible to do well, and I’m going to take another stab at it in upcoming elections. First up is May’s Court of Common Pleas election. These judges are easy to predict: nobody knows anything about the candidates, so you can nail it with just structural factors. More on that later!

When do Philadelphians Vote? Midterm General Election Edition

On November 6, 2018, we (i.e. all of you readers out there) collectively submitted 1,341 turnout data points to the Sixty-Six Wards Live Turnout Tracker. That made my modelling effort pretty easy: we missed the true turnout of 548,000 by only 7,000 votes! (This is almost too good… More thoughts on that in a later post.)

One novel feature of this dataset is that we can look at when through the day Philadelphians voted. I did this back in the Primary, and my surprising finding was that turnout was relatively flat. While there were before- and after-work surges, they weren’t nearly as pronounced as I had expected.

That was a midterm primary. How about this time, in the record-setting general? Two weeks ago we saw an unprecedented, energized electorate, and it looks like their time-of-day voting pattern was different too.

Analysis
Estimating the overall pattern requires some statistics. Each data submission contains three pieces of information: precinct, time of day, and the cumulative votes at that precinct. Here, I fit something different than my live election-day model, since I have the true final results. I assume that between hours h-1 and h, a fraction f(h) of the city’s population votes, and each precinct has a fraction that randomly deviates from that distribution.

For division d, the fraction of the final votes that have been achieved by hour h is given by
v_d(0) = 0
v_d(h) = v_d(h-1) + f(h) + f_d(h)
where f(h) is the city-wide precinct average fraction (I use an improper, flat prior), and the deviations are drawn from a normal distribution with an unknown standard deviation sigma_dev.

Because observations don’t occur only on the hour, I use linear interpolation in between, so a data submission x_i, which occurs in division d at time h + t with t between 0 and 1, is modeled as
x_i ~ student_t(1, mean = (1-t) v_d(h) + t v_d(h+1), sd = sd_error),
with sd_error unknown.
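The mean of that model can be sketched directly from the two equations. This only computes the expected cumulative fraction for a submission; it does not fit anything or simulate the Student-t noise. Hours are counted from when polls open (an indexing convention assumed here), and the hourly fractions are hypothetical.

```python
import math

def cumulative_fractions(citywide, deviations):
    """v_d(0) = 0; v_d(h) = v_d(h-1) + f(h) + f_d(h) for h = 1, 2, ..."""
    v = [0.0]
    for f_h, dev_h in zip(citywide, deviations):
        v.append(v[-1] + f_h + dev_h)
    return v

def expected_submission(v, time):
    """Linear interpolation (1 - t) * v_d(h) + t * v_d(h+1) for time = h + t."""
    h = min(int(math.floor(time)), len(v) - 2)  # clamp so h + 1 stays in range
    t = time - h
    return (1 - t) * v[h] + t * v[h + 1]

# Hypothetical first three hours: citywide fractions f(h) and one division's deviations f_d(h).
citywide = [0.10, 0.15, 0.05]
deviations = [0.02, -0.01, 0.00]
v_d = cumulative_fractions(citywide, deviations)

# A submission 1.5 hours after polls open sits halfway between v_d(1) and v_d(2).
print(round(expected_submission(v_d, 1.5), 3))
```

Each observed submission, divided by the division's final vote count, is then modeled as this interpolated mean plus heavy-tailed error.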

The result is that I estimate the fraction of the votes cast in a given hour, f(h), around which divisions randomly deviate. Notice that this isn’t exactly the citywide fraction of votes if those deviations from the mean are correlated with overall turnout (if, for example, high-turnout wards also are more likely to vote before 9am), but the coverage of submissions is heavily skewed towards high-turnout wards, so you should read these results as mainly pertaining to high-turnout wards anyway (see below for more). I also don’t account for the fact that needing to add up to 100% of the vote induces correlations in the estimates but… um… this will be fine.

When Philadelphia Votes
The results: In this election, we saw strong before- and after-work volume, with 26.9% of votes coming before 9am [95% CI: (25.8, 28.1)] and 27.4% of votes coming between 4 and 7pm [95% CI: (24.7, 30.1)].

The overall excitement of the election appears to have largely manifested in the morning, when reports of lines (in a midterm!) came from across the city. There is another surge from 4-7, as people leave work, but once again voting from 7-8 was quiet (without the thunderstorm, this time).

Do neighborhoods vote differently?
It seems likely that residents of different neighborhoods would vote at different times of the day. Unfortunately, the (albeit humbling) participation in the Turnout Tracker came disproportionately from Whiter, wealthier wards, and I don’t have great data to identify differences between groups. But here’s a tentative stab at it.

I’ve manually divided wards into six groups based on their overall turnout rates, their correlated voting patterns across the last 16 elections, and racial demographics. (Any categorization is sure to upset people, so I’m preemptively sorry. But these wards appear to vote similarly, and I’ve chosen clear names over euphemistic ones.)

Below is a plot of the raw data submissions, divided by the eventual final votes, by ward group. The difference between groups is not especially pronounced, but it appears that Center City and Ring did vote disproportionately before 9am  (along with Manayunk, which behaves like them). [Notice that in the raw submissions, some of the fractions are over 1, meaning that a submitted voter number was higher than the final vote count.]
To estimate the fraction of the vote at exactly 9am, I filter all of the data to between 8 and 10am, drop the obvious outliers, and fit a simple linear model of fraction on time for each group. The estimated fractions at 9am are below.

Ward Group | Percent Voted Before 9am (95% MOE)
Center City and Ring | 27.8 +/- 0.7
Far-Flung White Wards | 24.0 +/- 1.3
High-Turnout Black Wards | 23.7 +/- 1.7
Low-Turnout Black Wards | 23.7 +/- 1.5
North Philly and Lower Northeast | 16.6 +/- 1.1
University Wards | 28.2 +/- 4.7

Nearly 28% of the Center City and Ring Wards’ votes came before 9am, which was statistically significantly higher than all other groups except for the Universities (which had large uncertainty). The other wards hovered around 24% by 9am, except for the very low-turnout North Philly and Lower Northeast wards which only had 17% of their final turnout by that hour.
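The 9am estimate described above amounts to a straight-line fit on the submissions between 8 and 10am, evaluated at 9. Here is a minimal sketch of that calculation with made-up submissions. It skips the outlier removal and the margin-of-error calculation, and the function name is hypothetical.

```python
def fraction_at(target_hour, observations, window=(8.0, 10.0)):
    """Fit fraction ~ a + b * hour by least squares on observations inside
    the window, then evaluate the fitted line at target_hour."""
    pts = [(h, f) for h, f in observations if window[0] <= h <= window[1]]
    n = len(pts)
    mean_h = sum(h for h, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    sxx = sum((h - mean_h) ** 2 for h, _ in pts)
    sxy = sum((h - mean_h) * (f - mean_f) for h, f in pts)
    slope = sxy / sxx
    return mean_f + slope * (target_hour - mean_h)

# Made-up submissions for one ward group: (hour of day, fraction of final votes).
submissions = [(8.25, 0.20), (8.75, 0.24), (9.5, 0.30), (9.75, 0.32), (11.0, 0.45)]
estimate = fraction_at(9.0, submissions)
print(f"Estimated fraction voted by 9am: {estimate:.1%}")
```

The 11am submission falls outside the window and is ignored, so only the four morning points determine the line.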

Energy means early voting?
It certainly appears that in this highly energetic election, a disproportionate share of votes came in the morning. This could mean that the population that gets energized is more likely to vote in the morning (perhaps workers, as opposed to retirees), or that in excited elections, voters want to lock in their vote in the morning.

In May 2019, the Live Election Tracker will be back. And at that point, we’ll have three whole datasets to be able to evaluate time of day data! Then we’ll really be able to say something. Before then, I’ve got a number of posts in the pipeline. Next up: evaluating my various models, some better than others. Stay tuned!

Wow, that turnout

The vote count on Tuesday morning had me worried. When the Turnout Tracker crawled above 198,000 voters before 10am–half of 2014 turnout through just three hours of voting–I scrambled to my computer to double, triple check the calculations. But my code wasn’t off, it was the voters that had completely changed. By 8pm that night, over 547,000 Philadelphians had voted, a 43% increase over four years ago.

I’ll dig into the tracker in another post, but let’s do a quick hit of the actual final turnout and some maps.

* Quick Note: In everything below, I use the votes cast for President/Governor, rather than actual turnout. This will be short by however many people leave that race blank, which appears to be < 1% of voters in past years. These results are also preliminary, representing only 99.5% of precincts.

Some 537,231 Philadelphians cast votes for Governor. This is a 158 thousand increase over the 379,046 of 2014. The absolute size of this increase is unprecedented since at least 2002, and would have been the largest proportional increase in that span if not for the *82%* increase to elect Krasner in 2017. As the plot makes jarringly clear, in the aftermath of 2016, something is different.

And Philadelphia somewhat outpaced the rest of the state, where votes grew by 41%. So not only did turnout surge, but Philadelphia eked out some more statewide clout, too.

Where the votes came from
The turnout boom was not evenly distributed. As we saw in 2017, it was driven by the wealthier, predominantly white wards, and especially those that have gentrified over the last twenty years.

Immediately in the morning, the Tracker made one thing clear: Ward 27 saw the biggest turnout growth. It ruined the color scale on the map.

The 27th, in University City, saw an increase in votes of 135%. That manages to dwarf even the 95% growth of Wards 31 and 18 (in Kensington/Fishtown), in second and third.

This is worth digging into more. How did a ward possibly see this much growth? It’s all Penn. The division right along the river, containing most undergrad dorms, went from 116 votes in 2014 to 585 in 2018. That manages to make the second and third place divisions look a weaker green, even though they quadrupled(!!) their votes, from 88 to 363 and from 82 to 322, respectively.

This is mostly a problem of denominators, though. Turnout at Penn was tiny in 2014. It’s not as if Ward 27 all of a sudden has the most votes in the city. Instead, the student-heavy ward just this year performed like the average center-city-ringing neighborhood.
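To make the denominator point concrete, here is a small sketch using the three division counts quoted above. It computes both the growth ratio and the raw number of added votes; the division labels are placeholders, since the post doesn't name the second and third place divisions.

```python
def growth(votes_2014, votes_2018):
    """Return the growth ratio and the absolute number of added votes."""
    return {"ratio": votes_2018 / votes_2014,
            "added_votes": votes_2018 - votes_2014}

# (2014 votes, 2018 votes) as quoted in the text.
divisions = {
    "Penn dorms division": (116, 585),
    "second place division": (88, 363),
    "third place division": (82, 322),
}
summary = {name: growth(*votes) for name, votes in divisions.items()}
for name, result in summary.items():
    print(f"{name}: {result['ratio']:.1f}x, +{result['added_votes']} votes")
```

A fivefold increase sounds enormous, but it comes from a base of barely a hundred votes, so the ratio says more about 2014 than about 2018.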

As evidence, consider instead 2018 votes as a fraction of 2016. I like this comparison, because the Presidential election of 2016 probably represents the highest plausible attainable turnout for a midterm; you’re just never going to get someone who doesn’t vote for President to vote for Governor.


By this comparison, Wards 9 and 22 (Chestnut Hill and Mount Airy) look great, but typically so, with 88% and 87% of the 2016 voters coming out again in 2018. In third place was Ward 31 in Kensington, and Ward 18 below it was sixth, with 87% and 85% of 2016 turnout, respectively. Notice that these two also saw the second and third most *growth* since 2014, and have quickly ascended the ranks of top voting wards in the city.

Meanwhile, many of the Black wards that always come out in midterms continued to do so, including in Overbrook and Wynnefield in West Philly, and in Cedar Brook and West Oak Lane in the Northwest. These wards didn’t see their turnout grow a ton, mostly because they always, always vote (at least relative to the rest of the city).

Many of the divisions in Ward 31 to the North saw triple the 2014 turnout, whereas divisions in Ward 18 below it merely doubled.

[Edited: My map was incorrectly handling new division 18-18, so I’ve removed it until I have a fix].

The growth in these districts isn’t always what a candidate will care about. Growth in a division doesn’t matter all that much if it started at a very low point, or if nobody lives there. For candidates, what might be most useful is just density of votes, which will be a function of population density and turnout. In this metric, Fairmount and West Center City glow, along with Ward 46 in University City. These are all divisions with high turnout and a ton of people.

Finally, here’s turnout as a function of the population over 18. This has some benefits over the typical reported turnout of voters divided by registered voters because it (a) doesn’t rely on efficiently taking people off the books, which is notoriously slow in Philadelphia, and (b) bakes into the calculation any systematic differences in getting registered to vote, which I claim should be considered part of the voting process. Its main (large) downside is that it includes in the denominator non-citizens or other residents not eligible to register, which will make immigrant communities look particularly bad.

Grad Hospital, Fairmount, and Chestnut Hill shine by this metric, while North Philly, the lower Northeast, and even Penn despite all its growth have low percentages.

Sadly, one demographic has been under-represented in every single map in today’s post: the Hispanic communities in North Philly. For example, Ward 7 is the darkest ward in the map of 2018 vs 2016, meaning that despite its 63% vote growth over 2014 (11th best growth in the city!), it still had the worst turnout relative to 2016: only 52% of the votes from 2016 came out for 2018. This community already has the lowest turnout in Presidential races, but it shrinks even farther in non-Presidential elections.

Two straight elections of something different
Philadelphia’s turnout since 2016 has been astounding. Across the city there were 43% more votes than four years ago, and every single Ward’s turnout grew.

While voters always come out in Center City, Chestnut Hill, Overbrook, and Cedar Brook, we saw unprecedented booms in Fishtown, University City, and the rest of the neighborhoods ringing Center City.

These changes have now stuck around for two straight elections–2017 and 2018–and could presage a fundamental change in our city’s political calculus for years to come.