R Tutorial: How I crosswalk Philadelphia’s election data

Philadelphia’s past election results are all released based on past division boundaries. In order to make good historical comparisons, it’s necessary to apportion these old results into the modern boundaries. This post is a tutorial on how I use R’s GIS packages to crosswalk old data to present-day boundaries.

Note: All of the data and resources are available at https://github.com/jtannen/crosswalk. Just want the final crosswalked election data? You can find it in the outputs folder.

Note 2: I’ve turned this into an R package, available at https://github.com/jtannen/jaywalkr.

For GIS in R, I use the sf package, which is a tidyverse-friendly mapping library.

What is crosswalking?

What’s the challenge? Below is a map of the divisions in Philadelphia’s 5th Ward, using 2016 and 2019 boundaries.

library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)

sixtysix_colors <- list(
  red="#D0445E",
  blue="#0077E0",
  purple="#C92EC4",
  green="#009871",
  orange="#9C7200",
  grey="#009871",
  light_red="#EE6178"
)

theme_map_sixtysix <- function(){ 
  ggthemes::theme_map() %+replace%
    theme(
      text = ggthemes::theme_fivethirtyeight()$text,
      title = ggthemes::theme_fivethirtyeight()$plot.title,
      panel.grid.major = element_line(color="grey90")
    )
}

divs_16 <- st_read("data/division_shapefiles/2016/2016_Ward_Divisions.shp") %>%
  rename(warddiv16 = WARD_DIVSN) %>%
  mutate(
    DIVSN = sprintf("%02d", DIVSN),
    WARD = sprintf("%02d", WARD)
  ) %>%
  st_transform(4326)
  
divs_19 <- st_read("data/division_shapefiles/2019/Political_Divisions.shp") %>%
  rename(warddiv19 = DIVISION_N) %>%
  mutate(WARD = substr(warddiv19, 1, 2))%>%
  st_transform(4326)
  
ggplot(
  divs_16 %>% filter(WARD == "05")
) +
    geom_sf(
    color = sixtysix_colors$light_red,
    size=2, 
    fill=NA
  ) + 
  geom_text(
    aes(x=X, y=Y, label = DIVSN),
    data = with(
      divs_16 %>% filter(WARD == "05"),
      data.frame(DIVSN, st_centroid(geometry) %>% st_coordinates())
    ),
    color=sixtysix_colors$light_red,
    size=4,
    fontface="bold"
  ) +
  geom_sf(
    data=divs_19 %>% filter(WARD == "05"),
    color = "black",
    size=1,
    fill=NA
  ) +
  geom_text(
    aes(x=X, y=Y, label = SHORT_DIV_),
    data = with(
      divs_19 %>% filter(WARD == "05"),
      data.frame(SHORT_DIV_, st_centroid(geometry) %>% st_coordinates())
    ),
    color="black",
    size=4
  ) +
  geom_text(
    aes(label=year, x=x, y=y, fontface=fontface, color=color),
    data=tribble(
      ~year, ~x, ~y, ~fontface, ~color,
      2016, -75.12607, 39.941, "bold", sixtysix_colors$light_red,
      2019, -75.12607, 39.943, "plain", "black"
    ),
    hjust=1
  ) +
  scale_color_identity(guide=FALSE) +
  theme_map_sixtysix() +
  ggtitle("Division change in Ward 5")

Notice that many of the boundaries changed. Five divisions were added: 05-30 through 05-34. Division 05-03 expanded down into 05-19. Division 05-26 shrank. If we were to naively compare 2016 and 2019 results by division id, we see that 05-21’s votes nosedived, while 05-34 went from 0 votes (it was non-existent) to a huge number. What we need to do instead is apportion the votes from the old boundaries to the new ones.

The challenge is that we don’t have the right data to perfectly make that decision. Look at 2016’s 05-21. It gets divided into 05-21, 05-33, and 05-34. Should we divide its votes evenly among the three? Should we apportion it based on the area of each division? By population?

Areal Weighting

A simple, common technique is to apportion values by area. We can calculate the area of each intersection, and give 2016’s votes to 2019 boundaries proportional to how much of its area lies in each (this is called “areal weighting”). We’ll reproject them to EPSG:2272 (the PA South State Plane) to get the most accurate areas:

areal_weights <- st_intersection(
  st_transform(divs_16, 2272), 
  st_transform(divs_19, 2272)
) %>% 
  mutate(area = st_area(geometry)) %>% 
  as.data.frame() %>% 
  select(warddiv16, warddiv19, area) %>% 
  group_by(warddiv16) %>% 
  mutate(prop_of_16 = as.numeric(area / sum(area))) 

areal_weights %>%
  filter(warddiv16 == "0521")
## # A tibble: 5 x 4
## # Groups:   warddiv16 [1]
##   warddiv16 warddiv19             area   prop_of_16
##      <fctr>    <fctr>      <S3: units>        <dbl>
## 1      0521      1802     712.1734 US_survey_foot^2 6.830801e-05
## 2      0521      0530    9891.1043 US_survey_foot^2	9.487038e-04
## 3      0521      0534 7279215.1422 US_survey_foot^2	6.981848e-01
## 4      0521      0533 1929128.9079 US_survey_foot^2	1.850321e-01
## 5      0521      0521 1206966.6049 US_survey_foot^2	1.157660e-01

The above shows us that Division 05-21 in 2016 was split up into five divisions for 2019. Two of them–18-02 and 05-30–look like noise; the areas are tiny and probably represent a trivial move in the boundary.

To apportion 2016’s votes to 2019 boundaries, we would give new division 05-34 69.8% of the votes, 05-33 18.5%, and 05-21 11.6%. Done.

Exercise: What assumptions make this apportionment valid? (answer in appendix)

Population Weighting

I usually do one better than areal weighting: population weighting. Instead of calculating the area of the intersection, I calculate the population living in each intersection. We then apportion 2016 counts proportionally to the population in each boundary. For example, 2019’s 05-34 lies along the Delaware River, and includes a vast industrial region. It probably represents many fewer votes than its area would suggest. Instead of using the area of each population, we could use the population.

We need fine-scale data to make this calculation. Luckily, the Census provides us with block-level populations every ten years.

phila_blocks <- st_read("data/census_blocks/tl_2010_42101_tabblock10.shp") %>%
  st_transform(4326)
  
## We'll only deal with centroids, since blocks are contained within divisions
phila_blocks <- st_centroid(phila_blocks)

phila_block_pops <- read.csv(
  "data/census_blocks/phila_block_pops.csv", 
  colClasses=c("character","numeric")
)

phila_blocks <- left_join(phila_blocks, phila_block_pops)

Here are the 2010 populations of the Census Blocks of the area.


ward_05_blocks <- sapply(
  phila_blocks %>% st_within(divs_19 %>% filter(WARD == "05")),
  length
) == 1
ward_05_blocks <- phila_blocks[ward_05_blocks,] %>%
  mutate(x = st_coordinates(geometry)[,1],
         y = st_coordinates(geometry)[,2])

ggplot(
  divs_16 %>% filter(WARD == "05")
) +
  geom_point(
    data = ward_05_blocks, 
    aes(x=x,y=y,size = pop10),
    alpha = 0.3,
    pch = 16,
    color=sixtysix_colors$blue
  ) +
  scale_size_area(guide=FALSE) +
  expand_limits(size=0) +
  geom_sf(
    color = sixtysix_colors$light_red,
    size=2, 
    fill=NA
  ) + 
  geom_text(
    aes(x=X, y=Y, label = DIVSN),
    data = with(
      divs_16 %>% filter(WARD == "05"),
      data.frame(DIVSN, st_centroid(geometry) %>% st_coordinates())
    ),
    color=sixtysix_colors$light_red,
    size=4,
    fontface="bold"
  ) +
  geom_sf(
    data=divs_19 %>% filter(WARD == "05"),
    color = "black",
    size=1,
    fill=NA
  ) +
  geom_text(
    aes(x=X, y=Y, label = SHORT_DIV_),
    data = with(
      divs_19 %>% filter(WARD == "05"),
      data.frame(SHORT_DIV_, st_centroid(geometry) %>% st_coordinates())
    ),
    color="black",
    size=4
  ) +
  geom_text(
    aes(label=year, x=x, y=y, fontface=fontface, color=color),
    data=tribble(
      ~year, ~x, ~y, ~fontface, ~color,
      2016, -75.12607, 39.941, "bold", sixtysix_colors$light_red,
      2019, -75.12607, 39.943, "plain", "black"
    ),
    hjust=1
  ) +
  scale_color_identity(guide=FALSE) +
  theme_map_sixtysix() +
  ggtitle("Division change in Ward 5", "Dots are 2010 Census Block Populations")

The goal is simple: to add up the blue dots inside of each intersection, and apportion votes that way.
get_pop_of_geoms <- function(g, idcol, blocks=phila_blocks){
  if(st_crs(g) != st_crs(blocks)) stop("g and blocks must have same crs.")
  intersection <- st_intersects(blocks, g)
  nmatch <- sapply(intersection, length)
  
  if(sum(blocks$pop10[nmatch==0]) > 0) stop("blocks weren't matched; population was lost.")
  if(any(nmatch > 1)) stop("blocks matched to multiple geometries. This shouldn't happen.")
  
  matches <- data.frame(
    blocki = (1:length(nmatch))[nmatch == 1], 
    gi = unlist(intersection)
  ) %>% mutate(
    block_id = blocks$GEOID10[blocki],
    g_id = g[[idcol]][gi]
  ) %>%
    left_join(
      blocks %>% as.data.frame %>% select(pop10, GEOID10),
      by = c("block_id"="GEOID10")
    ) %>%
    group_by(g_id) %>%
    summarise(pop10 = sum(pop10))
  
  if(sum(matches$pop10) != sum(blocks$pop10)) stop("Population was lost/added. Something's wrong.")
  
  names(matches)[1] <- idcol
  
  return(matches)
}

get_crosswalk <- function(g1, g2, idcol1, idcol2, blocks=phila_blocks){
  if(st_crs(g1) != st_crs(g2)) stop("g1 and g2 must have same crs.")
  g1$id1 <- g1[[idcol1]]
  g2$id2 <- g2[[idcol2]]
  
  intersection <- st_intersection(g1, g2)
  intersection <- intersection %>%
    mutate(id1_id2 = paste(id1, id2, sep="_")) %>%
    select(id1_id2, geometry)
  pop_of_intersection <- get_pop_of_geoms(intersection, "id1_id2", blocks)
  crosswalk <- pop_of_intersection %>%
    separate(id1_id2, c("id1", "id2"), sep = "_") %>%
    group_by(id1, id2) %>%
    summarise(pop10 = sum(pop10)) %>%
    group_by(id1) %>%
    mutate(frac_of_g1 = {if(sum(pop10) > 0) pop10/sum(pop10) else 1/n()}) %>%
    group_by(id2) %>%
    mutate(frac_of_g2 = {if(sum(pop10) > 0) pop10/sum(pop10) else 1/n()}) %>%
    group_by()
  
  names(crosswalk)[1:2] <- c(idcol1, idcol2)
  return(crosswalk)
}

crosswalk_16_19 <- get_crosswalk(divs_16, divs_19, "warddiv16", "warddiv19")
head(crosswalk_16_19)
## # A tibble: 6 x 5
##   warddiv16 warddiv19 pop10 frac_of_g1 frac_of_g2
##       <chr>     <chr> <dbl>      <dbl>      <dbl>
## 1      0101      0101   878          1          1
## 2      0102      0102   834          1          1
## 3      0103      0103   759          1          1
## 4      0104      0104  1006          1          1
## 5      0105      0105   807          1          1
## 6      0106      0106   762          1          1

What happened to 05-10 and 05-21, above?

crosswalk_16_19 %>% 
  filter(warddiv16 %in% c("0521", "0510"))
## # A tibble: 5 x 5
##   warddiv16 warddiv19 pop10 frac_of_g1 frac_of_g2
##       <chr>     <chr> <dbl>      <dbl>      <dbl>
## 1      0510      0510   390  0.8405172  0.4016478
## 2      0510      0530    74  0.1594828  0.1223140
## 3      0521      0521   716  0.3308688  1.0000000
## 4      0521      0533   936  0.4325323  1.0000000
## 5      0521      0534   512  0.2365989  1.0000000

The crosswalk says that 05-21 was split into 05-21, 33, and 34, in proportions of 0.33, 0.43, and 0.24. Notice that crosswalking back, it says that all of 2019’s 05-33 is contained in 2016’s 05-21, which is correct (frac_of_g2 = 1.0).

It says that 2019’s 05-30 is only 0.12 within 2016’s 05-10. Most of its population actually comes from that region to the West, which is smaller in area.

Exercise 2: What is the assumption(s) in using population-based weighting?

Finally, let’s crosswalk the raw data from 2016 forward to 2019 divisions. The folder data/raw_votes contains csv files of Philadelphia elections as downloaded directly from www.philadelphiavotes.com.

general_16 <- read.csv("data/raw_votes/2016_general.csv") %>%
  mutate(warddiv16 = paste0(sprintf("%02d", WARD), sprintf("%02d", DIVISION)))
head(general_16)
##   WARD DIVISION TYPE           OFFICE           CANDIDATE      PARTY VOTES
## 1    1        1    A ATTORNEY GENERAL            Write In                0
## 2    1        1    A ATTORNEY GENERAL       JOHN RAFFERTY REPUBLICAN     1
## 3    1        1    A ATTORNEY GENERAL        JOSH SHAPIRO DEMOCRATIC     6
## 4    1        1    A  AUDITOR GENERAL            Write In                0
## 5    1        1    A  AUDITOR GENERAL EUGENE A DEPASQUALE DEMOCRATIC     5
## 6    1        1    A  AUDITOR GENERAL          JOHN BROWN REPUBLICAN     0
##   warddiv16
## 1      0101
## 2      0101
## 3      0101
## 4      0101
## 5      0101
## 6      0101

To crosswalk, we will apportion each 2016 division’s votes to 2019 boundaries proportional to the 2016 fraction of population.

general_16_crosswalked <- general_16 %>%
  left_join(crosswalk_16_19, by = "warddiv16") %>%
  group_by(TYPE, OFFICE, CANDIDATE, PARTY, warddiv19) %>%
  summarise(VOTES = sum(VOTES * frac_of_g1))

head(general_16_crosswalked)
## # A tibble: 6 x 6
## # Groups:   TYPE, OFFICE, CANDIDATE, PARTY [1]
##     TYPE           OFFICE     CANDIDATE      PARTY warddiv19 VOTES
##   <fctr>           <fctr>        <fctr>     <fctr>     <chr> <dbl>
## 1      A ATTORNEY GENERAL JOHN RAFFERTY REPUBLICAN      0101     1
## 2      A ATTORNEY GENERAL JOHN RAFFERTY REPUBLICAN      0102     1
## 3      A ATTORNEY GENERAL JOHN RAFFERTY REPUBLICAN      0103     3
## 4      A ATTORNEY GENERAL JOHN RAFFERTY REPUBLICAN      0104     5
## 5      A ATTORNEY GENERAL JOHN RAFFERTY REPUBLICAN      0105     2
## 6      A ATTORNEY GENERAL JOHN RAFFERTY REPUBLICAN      0106     0

We’re done! We have 2016’s results in 2019 boundaries.

Note: We could probably do even one better, and geocode the addresses of where people actually voted, using data from the voter file for each year. Maybe I’ll get around to that one day…

Crosswalking all the data

Just for finality, let’s loop through all years and store the outputs. Sounds easy, but… oof.

I’ve only been able to find shapefiles for divisions from a handful of years. The amazing folks at Azavea and the Commissioners’ Office have put together a batch of historical division boundaries, but even their herculean efforts don’t perfectly cover all recent elections. I’ve only got boundaries for 2005, 2008-2013, 2016, and 2019. For each election, I’ve chosen the boundaries that appear to fit best based on divisions being created or destroyed, but I can’t know how pesty boundary changes might affect these numbers.

Note: Have more complete division boundaries? Want to be a hero and create them from the text description of boundaries? Let me know!

crosswalk_years <- list.dirs("data/division_shapefiles", full.names = F)

for(crosswalk_year in crosswalk_years[!crosswalk_years %in% c("", "2019")]){
  crosswalk_year2 <- substr(crosswalk_year, 3, 4)
  div_dir <- sprintf("data/division_shapefiles/%s", crosswalk_year)
  div_path <- list.files(div_dir, pattern = ".*\\.shp$", full.names = TRUE)
  g1 <- st_read(div_path) %>% 
    mutate(warddiv = paste0(sprintf("%02d", WARD), sprintf("%02d", DIVSN))) %>%
    st_transform(4326)
  
  idcol <- paste0("warddiv", crosswalk_year2)
  names(g1)[names(g1) == "warddiv"] <- idcol
  
  crosswalk <- get_crosswalk(g1, divs_19, idcol, "warddiv19")
  write.csv(
    crosswalk, 
    file=paste0("outputs/", "crosswalk_", crosswalk_year2, "_to_19.csv"),
    row.names = FALSE
  )
}

use_crosswalk <- tribble(
  ~year, ~election, ~crosswalk_year,
    2002, "primary", 2005,
    2002, "general", 2005,
    2003, "primary", 2005,
    2003, "general", 2005,
    2004, "primary", 2005,
    2004, "general", 2005,
    2005, "primary", 2005,
    2005, "general", 2005,
    2006, "primary", 2005,
    2006, "general", 2005,
    2007, "primary", 2005,
    2007, "general", 2005,
    2008, "primary", 2008,
    2008, "general", 2008,
    2009, "primary", 2008, # broken
    2009, "general", 2008, # broken
    2010, "primary", 2009,
    2010, "general", 2011,
    2011, "primary", 2011,
    2011, "general", 2011,
    2012, "primary", 2011,
    2012, "general", 2012,
    2013, "primary", 2013,
    2013, "general", 2013,
    2014, "primary", 2016, 
    2014, "general", 2016,
    2015, "primary", 2016,
    2015, "general", 2016,
    2016, "primary", 2016,
    2016, "general", 2016,
    2017, "primary", 2016,
    2017, "general", 2016,
    2018, "primary", 2019,
    2018, "general", 2019
)

fix_colnames <- function(df){
  ## 2018 general results changed all the column names...
  replace_name <- function(df, oldname, newname){
    names(df)[names(df) == oldname] <- newname
    return(df)
  }
  
  df <- df %>% 
    replace_name("CATEGORY", "OFFICE") %>%
    replace_name("SELECTION", "CANDIDATE") %>%
    replace_name("VOTE.COUNT", "VOTES")
  return(df)
}


ALLOW_MISTAKES <- TRUE
VERBOSE <- FALSE
elections_with_mistakes <- data.frame()

for(i in 1:nrow(use_crosswalk)){
  if(VERBOSE) print(use_crosswalk[i,])
  
  year <- use_crosswalk$year[i]
  election <- use_crosswalk$election[i]
  crosswalk_year <- use_crosswalk$crosswalk_year[i]
  
  election_df <- read.csv(sprintf("data/raw_votes/%s_%s.csv", year, election))
  election_df <- fix_colnames(election_df)
  election_df <- election_df %>% 
    mutate(warddiv = paste0(sprintf("%02d", WARD), sprintf("%02d", DIVISION)))
  
  if(crosswalk_year == "2019"){
    election_df_19 <- election_df %>% rename(warddiv19 = warddiv)
    
    missing_g1 <- unique(election_df$warddiv)[!(unique(election_df$warddiv) %in% divs_19$warddiv19)]
    missing_g2 <- divs_19$warddiv19[!(divs_19$warddiv19 %in% unique(election_df$warddiv))]
  } else {
    idcol <- paste0("warddiv", substr(crosswalk_year, 3, 4))
    
    crosswalk <- read.csv(
      sprintf("outputs/crosswalk_%s_to_19.csv", substr(crosswalk_year, 3, 4)),
      colClasses = c(rep("character", 2), rep("numeric", 3))
    )  
    
    election_df_19 <- election_df %>%
      left_join(crosswalk, by=c("warddiv"=idcol)) %>%
      group_by(TYPE, OFFICE, CANDIDATE, PARTY, warddiv19) %>%
      summarise(VOTES = sum(VOTES * frac_of_g1))
    
    missing_g1 <- unique(election_df$warddiv)[!(unique(election_df$warddiv) %in% unique(crosswalk[,idcol]))]
    missing_g2 <- unique(crosswalk[,idcol])[!(unique(crosswalk[,idcol]) %in% unique(election_df$warddiv))]
  }
  

  if(length(missing_g1) > 0 || length(missing_g2) > 0){
    if(!ALLOW_MISTAKES) stop("Votes mismatch. Something went wrong.")
    elections_with_mistakes <- bind_rows(
      elections_with_mistakes, 
      data.frame(
        year=year, 
        election=election, 
        lost_votes=sum(election_df$VOTES) - sum(election_df_19$VOTES),
        total_votes = sum(election_df$VOTES),
        g1_missing_from_crosswalk=paste(missing_g1, collapse=","),
        crosswalk_missing_from_g1=paste(missing_g2, collapse=",")
      )
    )
  }
  
  write.csv(
    election_df_19, 
    file=sprintf("outputs/%s_%s_crosswalked_to_19.csv", year, election),
    row.names=FALSE
  )
}

print(elections_with_mistakes)
##   year election lost_votes total_votes g1_missing_from_crosswalk   crosswalk_missing_from_g1
## 1 2009  primary          0     1564872            4923,4924,4925                          
## 2 2009  general          0     1816570            4923,4924,4925                          
## 3 2018  primary          0     1305757       1818,0532,0534,0533 

Notice above the three elections with flagged mistakes. For the 2009 results, the 2008 crosswalk that I used was expecting three divisions that weren’t actually in the 2009 results. This is a problem; it means that I over-apportioned those votes somewhere else. The boundaries for 2009 aren’t any better, six divisions are missing. This isn’t the end of the world, three out of 1,681 divisions isn’t terrible. And the ward-level results will still be right. But don’t trust the division-level results for the 49th ward in 2009.

Similarly, the 2018 primary had two divisions that weren’t in the 2016 map, 05-30 and 05-31, but was missing four divisions that were used in the general: 18-18, 05-32, 05-33, and 05-34. I’ve used the 2019 boundaries, to avoid losing votes. Instead, 18-18 et al will be assumed to have 0 votes (which isn’t great either).

The fact that only these three elections are flagged isn’t dispositive proof that the other crosswalks were perfect: the boundaries could have moved without creating or removing division numbers, which is all that we’re checking.

I’ll keep looking for shapefiles, and will update the repo as I find them, but for the time being this is probably the best we can do.

That’s it!

We’ve created our crosswalks.

You can download the results from the outputs folder.

You may be uncomfortable with the simplicity of the assumptions above. This method usually does a darn good job of getting sensible data, but you might imagine some extensions:

– Using annual voter file weights instead of Decennial Census populations.

– A model-based Bayesian approach that uses similarities in voting patterns across years to guess which votes came from where. You could additionally use race and income data, which is released by the census at the Block level. These methods risk the Ecological Fallacy, but not in a worse way than what we’re already doing.

Answers to Exercises

What assumptions make the areal apportionment valid? It assumes that the value was spread uniformly in space in the original boundaries. For example, apportioning 2016 votes to 2019 boundaries using areal weighting assumes that the votes per mile in 2016 was evenly spread within each division (although it will obviously be different for different divisions).

What assumptions make the population apportionment valid? It assumes that the population is concentrated at the centroid of the block or that the blocks are entirely contained within one division (since we only use the centroids) and that the 2010 population is proportional to the target metric. The first is true in our case: divisions don’t split blocks, so blocks are entirely contained in divisions. The second is wrong, but much less wrong than the areal assumption: votes are more proportional to population than they are to area. We could make this better by using the correct year’s population if we had it, or using voting-age population or registered voters instead of census population, but this does a pretty darn good job.

The Needle knew who would win at 8:38 PM. Was it omniscient or lucky?

I think it was… omniscient?

On May 21st, I unveiled my “Needle”. Simlar to the New York Times, the idea was to process the live, incomplete election results and predict who would win. In the weeks before, I fine-tuned the model, tested it on historic data, made some pretty graphics.

Election night came. The polls closed at 8pm, and I started grabbing the live results from Commissioners’ website. The first batch finally came in at 8:38. The model cranked its numbers and produced its projections.

And I was stunned.

With 80 out of 1692 precincts reporting, the Needle was ready to call almost every race.

By the end of the night, every one of those predictions would end up being right. So what was going on? Was the needle psychic? Or lucky?

It goes against every fiber of my being to put this down on paper, but honestly, this wasn’t a mistake. The Needle was omniscient.

(Or rather, it really is possible to predict final outcomes with a high degree of certainty with just 5% of precincts. We’re just super easy to predict.)

((To be complete: there are definitely ways to improve the model. It got Justin DiBerardinis pretty wrong, for example. But it still will probably be able to call most races within about 30 minutes.))

The Needle’s Methodology

The Needle makes its predictions based on historic correlations in how precincts vote. Before election night, I calculate a huge covariance matrix among all of the divisions in the city based on the past 17 primaries. (The covariance matrix is actually not fully identified, so I use the Singular Value Decomposition dimensions from my last post.)

When those 80 precincts reported results, I then predicted the leftover 1612. The strategy is to calculate each unknown division as an average of the known ones, weighted by historic correlations, with appropriate uncertainty. (The math ends up being clean, just a conditional multivariate normal distribution.) I simulate from that distribution, and voila, have my range of predictions.

What are the ways this could go wrong? There are a few:

1. If This Time is Different(TM): if our politics have fundamentally changed in a way that breaks the historic correlations. I don’t give this worry much weight. I mean, even in an election in which we demolished my high turnout projection, the correlations between which divisions turned out strong and which didn’t stayed exactly the same.

2. If the model doesn’t capture historic correlations sufficiently well, then it will pile all of the leftover uncertainty into precincts’ independent noise. Then, when I simulate, all of those independent noises will get washed out thanks to the Central Limit Theorem, and my range of predictions will be way too small (correlated errors are much more dangerous than independent ones.) This was my largest concern on Election Night.

3. A bug. It’s been known to happen.

The only real way to rule out (2) is to test the needle on more elections. Since I don’t want to wait another five months, I’ll do the second best thing: look at why the needle was making the predictions it did at 8:38, and see if it was right for the wrong reasons or the right ones.

There’s one very important aspect of the methodology: the model doesn’t know anything about the candidates. At the beginning of the night, every candidate has the same chance of winning, and has the same chance of doing well in each neighborhood. The Needle doesn’t know that Isaiah Thomas was endorsed by the DCC, or that Justin DiBerardinis will probably do well in Center City. All it does is look at where each candidate has done well in the divisions that have reported so far, and how those divisions historically correlate with the divisions that haven’t yet reported.

What the Needle saw at 8:38

Flash back to May 21st, 8:38PM. We didn’t know Gym would romp to a dominant victory. Philadelphia was in its two weeks of Spring. Bendall was still a thing.

Here’s what the Needle saw.

View code
library(tidyverse)
library(sf)

source("../../admin_scripts/util.R")

should_download <- FALSE
use_maps <- FALSE
use_saved_covariance <- TRUE
use_log <- TRUE
use_latest_file <- FALSE

source("needle.R")

df_838 <- load_data("PRECINCT_2019521_H20_M38_S25.txt")
df_938 <- load_data("PRECINCT_2019521_H21_M38_S25.txt")
df_full <- load_data("PRECINCT_2019525_H08_M09_S54.TXT")

ggplot(
  phila_whole
) + 
  geom_sf(fill = "grey50", color=NA) +
  geom_sf(
    data=divs %>% inner_join(df_838 %>% select(warddiv) %>% unique),
    fill = "black", color = NA
  ) +
  theme_map_sixtysix() +
  ggtitle("Divisions Reporting as of 8:38 PM")

Here are the votes for the Council At Large candidates who eventually end up in the top six (though Rivera Reyes led DiBerardinis at the time).

View code
candidate_order <- df_838 %>% 
  # filter(OFFICE == "COUNCIL AT LARGE-DEM") %>%
  group_by(candidate, OFFICE) %>%
  summarise(votes = sum(Vote_Count)) %>%
  group_by(OFFICE) %>%
  arrange(desc(votes)) %>%
  mutate(pvote = votes/ sum(votes)) %>%
  group_by() 
use_cands <- c("Gym", "Domb", "Green", "Thomas", "Gilmore Richardson", "DiBerardinis")

ggplot(
  phila_whole
) + 
  geom_sf(fill = "grey50", color=NA) +
  geom_sf(
    data=divs %>% 
      inner_join(
        df_838 %>%
          mutate(
            candidate = factor(candidate, use_cands)
          ) %>%
          filter(candidate %in% use_cands) 
      ),
    aes(fill = pvote*100), color = NA
  ) +
  scale_fill_viridis_c("Percent of Votes") +
  geom_text(
    data=candidate_order %>% 
      filter(candidate %in% use_cands) %>% mutate(candidate=factor(candidate, use_cands)),
    aes(label = sprintf("%0.1f%%", 100 * pvote)),
    x = 2755e3, y=218e3, size = 6, fontface="bold",
    hjust=1, vjust=1,
    color = "grey30"
  )+
  theme_map_sixtysix() %+replace% theme(legend.position = "right") +
  facet_wrap(~candidate)+
  ggtitle("At Large results as of 8:38 PM", "With unadjusted percent of vote so far.")

And here’s what the Needle predicted:

View code
turnout_dem <- get_turnout(df_838, "MAYOR-DEM", turnout_cov_dem)
council_at_large_838 <- get_needle_for_office(
  df_838,
  use_office="COUNCIL AT LARGE-DEM",
  office_name="Council At Large (D)",
  turnout_sim = turnout_dem,
  n_winners=5
)

council_at_large_838$needle %+% 
  (council_at_large_838$needle$data %>% filter(candidate %in% use_cands)) +
  theme(strip.text = element_text(face="bold", size = 12)) +
  ggtitle("At Large probability of winning as of 8:38", "Rivera Reyes had the other 16% chance.")

Wait what? You may not have looked at the maps above and thought “It’s over for DiBerardinis.” But the Needle did.

So why did the Needle think it was over? Let’s look at the average results from the simulations, focusing on Katherine Gilmore Richardson and Justin DiBerardinis, who finished 5 and 6.

View code
map_pred <- function(
  df_obs, 
  df_sim, 
  candidates, 
  consider_divs=divs_to_council$warddiv,
  color_max = 0.2,
  color_min=0
){

  df_obs <- df_obs[df_obs$candidate %in% candidates,] 
  df_sim <- df_sim[df_sim$candidate %in% candidates,] %>%
    group_by(warddiv, candidate) %>%
    summarise(pvote = mean(pvote))
  
  df_obs$candidate <- factor(df_obs$candidate, levels = candidates)
  df_sim$candidate <- factor(df_sim$candidate, levels = candidates)
  
  df_sim$pvote <- pmin(df_sim$pvote, color_max)
  
  ggplot(
    divs %>% filter(warddiv %in% consider_divs)
  ) +
    geom_sf(
      data=divs %>% inner_join(df_sim),
      aes(fill=100*pvote), color = NA
    ) +
    geom_sf(
      data=divs %>% inner_join(df_obs),
      aes(fill=100*pvote), color="black"
    ) +
    scale_fill_viridis_c("Percent\n of votes") +
    expand_limits(fill = c(color_min, 100*color_max))+ 
    facet_wrap(~candidate) +
    theme_map_sixtysix() %+replace%
    theme(
      legend.position = "right",
      strip.text = element_text(face="bold", size = 15)
    )
}

map_pred(
  df_838, 
  council_at_large_838$office_sim, 
  c("Gilmore Richardson", "DiBerardinis"), 
  color_max=0.2
) + ggtitle("Average Simulated Results at 8:38", "Reporting divisions are outlined.")

This is exactly what we want to see. Basically, with just 80 precincts reporting, the model had identified that Katherine was a broadly popular candidate, with a high floor and slightly stronger results in Black neighborhoods. It also figured out that Justin was going to do significantly better in wealthier neighborhoods and the Northeast than elsewhere.

(The thing that we wouldn’t want to see is uniformity across the city, or random noise.)

Your eye may not have noticed the divisions in Kensington, South Philly, and Mantua that went strongly Justin, but the Needle did. And it knew exactly how he would eventually split the city.

That was the average simulated result, but just as important is the variance in the result. A candidate’s probability of winning depends on how many times their simulation is higher than the others’. If that candidate has highly variable swings, then they’ll win a larger percent of the time. Was it really true that among all the possible outcomes that could match the data we had so far observed, Justin won in zero of them?

Here is the range of outcomes the Needle gave to Justin at 8:38.

View code
histogram_simulation <- function(
  office_sim_total, 
  candidate, 
  df_final, 
  time, 
  win_bar,
  ndigits=0
){
  final_p <- 100 * df_final$prop[df_final$candidate==candidate]
  
  binwidth = 10^-ndigits
  ggplot(
    office_sim_total[office_sim_total$candidate == candidate,]
  ) + 
    geom_histogram(
      aes(x=pvote * 100, y=stat(density)),
      boundary=0,
      binwidth=binwidth,
      fill = strong_blue
    ) +
    geom_vline(xintercept = win_bar, linetype="dashed") +
    annotate(
      "text",
      label=sprintf("Win Percent = %s%%", win_bar),
      x=win_bar,
      y=0,
      hjust=0,
      vjust=1.2,
      angle=90
    ) +
    geom_vline(
      xintercept = final_p,
      color = "grey20",
      size=1
    ) +
    annotate(
      "text", 
      label=sprintf("Actual Final Result = %s%%", round(final_p, ndigits)),
      x=final_p,
      y=0,
      angle=90,
      vjust=1.2,
      hjust=0
    ) +
    xlab("Percent of Vote") +
    ylab("Density of Simulations") +
    theme_sixtysix() +
    ggtitle(sprintf("%s's Simulated Results at %s", candidate, time))
}

final_at_large <- df_full %>% 
  filter(OFFICE == "COUNCIL AT LARGE-DEM") %>%
  group_by(candidate) %>%
  summarise(votes = sum(Vote_Count)) %>%
  group_by() %>%
  mutate(prop = votes/sum(votes))

histogram_simulation(
  council_at_large_838$office_sim_total, 
  "DiBerardinis", 
  final_at_large, 
  "8:38", 
  6.6, 
  ndigits=1
)

This isn’t good. The Needle was confident that Justin would finish between 4-6%, and he ended up at 6.2%. That wasn’t close to the 6.6% needed to win, but it’s still an outcome the model says is basically impossible.

What happened? It must be that there’s an additional source of uncertainty, a way that the division-level errors are correlated that I haven’t accounted for. I have two guesses.

The first is turnout. Here’s what the Needle thought turnout would look like at 8:38.

View code
map_turnout <- function(
  df_obs, 
  turnout_sim, 
  consider_divs=divs_to_council$warddiv,
  use_outline=TRUE
){

  turnout_obs <- df_obs %>%
    filter(OFFICE == "MAYOR-DEM") %>%
    group_by(warddiv) %>%
    summarise(turnout = sum(Vote_Count))
  
  turnout_sim <- turnout_sim %>%
    group_by(warddiv) %>%
    summarise(turnout = mean(turnout))
  
  color_max <- 10e3

  ggplot(
    divs %>% filter(warddiv %in% consider_divs)
  ) +
    geom_sf(
      data=divs %>% inner_join(turnout_sim),
      aes(fill=pmin(turnout / Shape__Are * 5280^2, color_max)), 
      color = NA
    ) +
    geom_sf(
      data=divs %>% inner_join(turnout_obs),
      aes(fill=pmin(turnout / Shape__Are * 5280^2, color_max)), 
      color={if(use_outline) "black" else NA}
    ) +
    scale_fill_viridis_c(
      "Votes per\n Sq. Mile",
      labels=function(x) ifelse(x >= color_max, paste0(scales::comma(x),"+"), scales::comma(x))
    ) +
    theme_map_sixtysix()
}

map_turnout(df_838, turnout_dem) +
  ggtitle("Projected Turnout at 8:38", "Reporting divisions are outlined.")

Here’s what the turnout actually ended up looking like.

View code
map_turnout(df_full, turnout_dem[integer(0),], use_outline=FALSE) +
  ggtitle("Final Turnout")

The model did captured the broad pattern, but the average wasn’t enthusiastic enough about Center City/South Philly/University City turnout. Notice that’s also Justin’s base.

Underestimating turnout on average isn’t itself wrong. In fact, some of the individual simulations look a lot like this map:

View code
use_sim <- turnout_dem %>% 
  filter(substr(warddiv, 1, 2) == "27") %>% 
  group_by(sim) %>% 
  summarise(turnout_27 = sum(turnout)) %>% 
  arrange(desc(turnout_27)) %>%
  with(sim[1])

map_turnout(
  df_838,
  turnout_dem %>% filter(sim==use_sim), 
  use_outline=FALSE
) +
  ggtitle(sprintf("Simulation %s's Turnout", use_sim))

The problem, though, is that the model didn’t allow for a candidate’s performance to be correlated with turnout (it randomly sampled turnout, then the candidate’s performance independently). Instead, it’s likely that a candidate’s performance is able to itself drive turnout; this would have increased Justin’s ceiling in the model.

The second possibility is that I didn’t allow for enough ward-level correlation. I have city-wide correlations through the SVD, but it would be easy to add ward-level random effects, which account for the fact that wards as a whole may swing together, even separately from demographically similar wards in other parts of the city. This would add correlated uncertainty to the model, and make it a little less certain.

Finally, it might be productive to add in a prior for candidates. Remember that the model didn’t know anything about the candidates heading in. If I had pre-programmed the notion that DiBerardinis would do better in Center City than elsewhere, or added in some Ward endorsements, it may have been more bullish on his chances, rather than simultaneously having to allow for all of the possibilities of his performance in Center City.

Calling District 3

The big upset in the city was Jamie Gauthier in West Philly’s District 3. I was ready to call that one at 9:38.

View code
use_cands <- c("Gauthier", "Blackwell")
ggplot(
  divs %>% filter(warddiv %in% divs_to_council$warddiv[divs_to_council$council==3])
) + 
  geom_sf(fill = "grey50", color=NA) +
  geom_sf(
    data=divs %>% 
      inner_join(
        df_938 %>%
          filter(OFFICE == "DISTRICT COUNCIL-3RD DISTRICT-DEM") %>%
          mutate(candidate =  factor(candidate, use_cands)) 
      ),
    aes(fill = pvote*100), color = NA
  ) +
  scale_fill_viridis_c("Percent of Votes") +
  geom_text(
    data=candidate_order %>% filter(candidate %in% use_cands),
    aes(label = sprintf("%0.1f%%", 100 * pvote)),
    x = 2690e3, y=228e3, size = 10, fontface="bold",
    hjust=1, vjust=1,
    color = "grey30"
  )+
  theme_map_sixtysix()%+replace%
    theme(
      legend.position = "right",
      strip.text = element_text(face="bold", size = 15)
    ) +
  facet_wrap(~candidate)+
  ggtitle("District 3 results as of 9:38 PM", "With unadjusted percent of vote so far.")

Most of the reporting precincts were above Market Street, with a handful coming from the 51st in Southwest.

Here’s what the Needle thought:

View code
district_3_938 <- get_needle_for_office(
  df_938,
  use_office="DISTRICT COUNCIL-3RD DISTRICT-DEM",
  office_name="District 3",
  turnout_sim = turnout_dem,
  consider_divs=divs_to_council$warddiv[divs_to_council$council==3],
  n_winners=1
)

district_3_938$needle +
  ggtitle("Predictions for District 3 at 9:38")

Here’s what a family friend thought.

The Inquirer didn’t end up calling the race until about 10:30. So was the Needle right to be so confident? Here is the average of its 9:38 predictions.

View code
map_pred(
  df_938, 
  district_3_938$office_sim, 
  candidates=c("Blackwell", "Gauthier"), 
  consider_divs=divs_to_council$warddiv[divs_to_council$council==3],
  color_max = 0.8,
  color_min = 0.2
) +
  ggtitle("District 3 Simulated Results at 9:38", "Reporting divisions outlined in black.")

The Needle saw the Jamie divisions in Powelton and at the edge of gentrified University City, and knew that she would do well in the high-turnout University City divisions left to report. But more importantly, it also identified that she would run basically even with Jannie in farther West Philly (she ended up losing the area only 53-47).

Here’s the range of simulated results the model made at 9:38.

View code
final_d3 <- df_full %>% 
  filter(OFFICE == "DISTRICT COUNCIL-3RD DISTRICT-DEM") %>%
  group_by(candidate) %>%
  summarise(votes = sum(Vote_Count)) %>%
  group_by() %>%
  mutate(prop = votes/sum(votes))

histogram_simulation(
  district_3_938$office_sim_total, 
  "Gauthier", 
  final_d3, 
  "9:38", 
  50, 
  ndigits=0
)

It was dead on. The model said Jamie would have a modal win of 56% of the vote, she did. It said she had only a 0.5% chance of losing. She didn’t.

Model performance

Notice that Justin’s final performance was at the top of the simulations, while Jamie’s was right in the heart of the distribution. This gives a useful way to evaluate the model. We can calculate the percentile of the final results versus the simulated distribution. Justin’s final result was the 99.75th percentile of the simulations, Jamie’s was the 51st percentile of hers. If the model is well-calibrated, then these percentiles should be uniformly distributed between 0 and 1.

Alternatively, if we sort and then plot them, they should fall on the line y=x, maybe with some noise. Here’s that plot for At Large at 9:38.

View code
council_at_large_838$office_sim_total %>%
  left_join(final_at_large) %>% 
  group_by(candidate) %>%
  summarise(percentile = mean(pvote < prop)) %>%
  ggplot() +
  geom_abline(intercept = 0, slope = 1)+
  geom_point(aes(x=(rank(percentile)-0.5)/30, y=percentile)) +
  coord_fixed() +
  ylab("Final results as percentile")+
  xlab("Candidate's percentile rank") +
  theme_sixtysix() +
  ggtitle("Final results vs simulations", "Percentile of Simulations at 8:38, Council at Large")

This plot shows that we have too many values on the extremes: the high predictions are too high, the low ones are too low. This is a hallmark of the fact that the prediction intervals are too narrow, and didn’t allow enough uncertainty. It’s not terrible, but can be improved.

So are the race-callers wrong?

I’m claiming the Needle was basically right, if a little over-confident in a way I can fix. My hunch is that even after I tweak it, the model would still have been able to call all the races by 10pm, a lot faster than the race-callers in the media. Are the race-callers wrong to be so slow?

I think there are two things going on.

First, the incentives for the papers are to be super risk-averse. Getting a call wrong is a lot worse than waiting 30 minutes. And they have a lot more readership than I do.

When the needle is 99% sure, that 1% may still be just too large to warrant a newspaper making a call. If you have 5 races in each of 2 elections every year, you would expect to get a call wrong once a decade. That’s probably bad. Couple in the likelihood of maybe your model isn’t perfectly calibrated, and you may want to be especially cautious.

But second, people often don’t realize just how much information is contained in the historic data. They see a map like District 3’s 9:38 map above, and only notice the vast West Philly non-reports, and imagine a full range of possibilities. In their head, they imagine that those could swing +/- 10 points, leading to very different final outcomes. The truth is, though, that with 17 years worth of data (I have back to 2002), that uncertainty is really only +/- 4 points. You need a full model to tell you if the probability of a comeback is 10%, 1% or 0.1%.

One third thing that’s going on (I know I said two…) is that our elections aren’t really that close. Gauthier won by 12 points. In presidential elections, that would be the largest gap since Reagan beat Mondale in ’84. At 9:38, I was able to predict Gauthier within +/- 4 points. In a closer race I wouldn’t have been sure of the final outcome; in this one I was. Four point uncertainty on 56% of the vote means we know who will win. It’s similar with At Large, though with even smaller numbers. It took 6.6% of the vote to win (voters could vote for up to five candidates). At 8:38, I thought Justin would win about 4.8% of the vote. Getting up to 6.6 requires him receiving *38%* more votes than he currently was. That’s a huge ask, even if two percentage points looks small on paper.

The Needle will be back in November!

I’ll retool the needle using the insights above, and then we’ll be back in November! Municipal Election Novembers can be sleepy, but we’ve got some potentially interesting races, including District 10 Republican incumbent Brian O’Neill being challenged by Democrat Judy Moore, and whether maybe, possibly, a third party At Large candidate could beat one of the Republicans. More on each of those soon.