Individual voter patterns in Philadelphia’s 2022 General Election

Jonathan Tannen, Sixty Six Wards
Seth Bluestein, Philadelphia City Commissioner

Today, Sixty Six Wards is partnering with Philadelphia City Commissioner Seth Bluestein to look at individual voter heterogeneity in the 2022 General Election, and how well Division-level capture voter-level patterns.

Election results are generally released only by precinct. This means that media figures, political pollsters, candidates for office, and conscientious citizens are limited to explaining results at the precinct level, which might contain hundreds or thousands of voters. We can know precinct-level correlations between candidates–-Mehmet Oz did better in the precincts where Doug Mastriano performed well, and John Fetterman where Josh Shapiro did–-but can’t disentangle how much of that is within-voter correlations (everyone is assumed to be either an Oz-Mastriano voter or a Fetterman-Shapiro voter) versus the fact that Oz voters are more likely to live where Mastriano voters do, but might not be the same people. Anecdotal evidence clearly shows that there are voters who don’t vote exactly along the party line, but the limitation of precinct-level results would ordinarily make it impossible measure the actual rate.

The assumption that geographical aggregates are the same as individual patterns for the people within them is called the “Ecological Fallacy”. Much election analysis is Ecological (if you’re reading this, you’ve seen maps upon maps of division results). Ecological analysis can come in the form of demographic analyses (e.g. Black wards voted highly for Derek Green), or in the form of cross-candidate results (e.g. Mastriano Divisions also voted for Oz). Ecological analyses can hide important diversity among voters; that diversity will be especially relevant in our upcoming race for Mayor, where 11 candidates may split the vote.

Just how bad is the Ecological Fallacy? Obviously there were a lot of voters who did pick both Oz and Mastriano, or both Fetterman and Shapiro. So just how correct are precinct-level correlations as an approximation of voter-level ones?

Today, Sixty-Six Wards is partnering with the Office of City Commissioner Seth Bluestein to answer this question. The Commissioner’s Office has created a novel dataset of de-identified voter-level counts of candidate combinations from the 2022 Philadelphia General Election, which allows us to measure voter-level correlations without violating voter privacy.

Methodology

Broadly, we seek to answer the question: how close are Division-level inferences to individual-level reality? We’ll consider three measurements, all related but with results that can appear to tell very different stories.

First, we consider the correlation between votes for Mastriano and Oz (or Fetterman and Shapiro), at the Division and individual level.

Second, we consider the covariances between candidates’ votes. Although unscaled, using the covariance allows us to use the Law of Total Covariance: \[\begin{align*} Cov(X, Y) = Cov_D(E[X | D], E[Y | D]) + E_D[Cov(X, Y | D)] \end{align*}\] where the subscript \(D\) emphasizes the variable over which the expectations are taken. For our example, \(X\) might be a person’s vote for Oz, \(Y\) a person’s vote for Mastriano, and \(D\) their division. The Law states that the overall covariance between the votes can be decomposed into the covariance of division-level results (the first term), plus the average within-division covariances.

Third, we consider the OLS regression coefficient of one candidate’s votes on the other’s. For example, an additional vote for Oz was correlated with how many more votes for Mastriano?

Remember that the correlation is just a scaled covariance: \(Corr(X, Y) = Cov(X, Y) / \sqrt{Var(X)Var(Y)}\), as is the OLS coefficient: \(\beta = Cov(X, Y) / Var(X)\), so differences in these metrics are due to different denominators.

Reverse engineering individual patterns from ecological data is a holy grail of spatial statistics. Following King, Rosen and Tanner (1999), researchers have built Bayesian models to estimate individual patterns, gaining leverage by parametrizing the individual variances. We will finally consider a model inspired by those.

Is Mail-In Different?

For all of the metrics above, we will also consider questions by vote type.

In 2020, Pennsylvania introduced excuse-free mail-in voting, and the pandemic created a surge in its use. Mail ballots have become a partisan signal, with Democrats much more likely to use them than Republicans. Do those partisan differences carry over to ballot splitting?

We also might expect the physical act of mail-in voting to create different rates of crossover from in-person. Mail-in voters can fill out their ballot over days, doing more research with the ballot in their hands. While Election Day voters certainly can research ahead of time, it’s inevitable that the in-booth time constraints could encourage spur-of-the-moment decisions. Will that increase the partisan correlation? Decrease it?

Results

View code
library(tidyverse)
library(sf)
source("../../admin_scripts/util.R")
setwd("C:/Users/Jonathan Tannen/Dropbox/sixty_six/posts/voter_level_analysis/")
candidate_pairs_raw <- read_csv(
  "../../data/processed_data/candidate_pairs_2022.csv"
)

candidate_totals_raw <- read_csv(
  "../../data/processed_data/candidate_totals_2022.csv"
)

asnum <- function(x) as.numeric(as.character(x))

fix_month <- function(x){
  for(i in 1:12){
    has_month_second <- grep(paste0("[0-9]+-",month.abb[i]), x)
    x[has_month_second] <- paste(
      sprintf("%02d", i), 
      sprintf("%02d", asnum(gsub("^([0-9]+)-.*$", "\\1", x[has_month_second]))),
      sep="-"
    )
    has_month_first <- grep(paste0(month.abb[i], "-[0-9]+"), x)
    x[has_month_first] <- paste(
      sprintf("%02d", i), 
      sprintf("%02d", asnum(gsub("^.*-([0-9]+)$", "\\1", x[has_month_first]))),
      sep="-"
    )
  }
  x
}

candidate_pairs <- candidate_pairs_raw |> 
  filter(candidate.x != "overvote", candidate.y != "overvote") |>
  mutate(
    Division=fix_month(Division),
    ward = sapply(
      strsplit(Division, "-"),
      function(x) x[1]
    ) |> asnum(),
    div = sapply(
      strsplit(Division, "-"),
      function(x) x[2]
    ) |> asnum(),
    warddiv = paste(
      sprintf("%02d", ward),
      sprintf("%02d", div),
      sep="-"
    )
  ) |>
  rename(vote_type = `Mail Votes`) |>
  # Mail in seems to have extra "Fed" rows
  group_by(year, vote_type, warddiv, office.x, candidate.x, office.y, candidate.y) |>
  summarise(pair_count=sum(pair_count), .groups="drop")

candidate_totals <- candidate_totals_raw |> 
  mutate(
    Division=fix_month(Division),
    ward = sapply(
      strsplit(Division, "-"),
      function(x) x[1]
    ) |> asnum(),
    div = sapply(
      strsplit(Division, "-"),
      function(x) x[2]
    ) |> asnum(),
    warddiv = paste(
      sprintf("%02d", ward),
      sprintf("%02d", div),
      sep="-"
    )
  ) |>
    rename(vote_type = `Mail Votes`) |>
  # Mail in seems to have extra "Fed" rows
  group_by(year, vote_type, warddiv, office, candidate) |>
  summarise(
    candidate_total=sum(candidate_total),
    .groups="drop"
  )

divs <- st_read("../../data/gis/warddivs/201911/Political_Divisions.shp", quiet=TRUE) |>
  mutate(
    warddiv = paste0(substr(DIVISION_N,1,2), "-", substr(DIVISION_N, 3, 4))
  )

# Assert all the divisions match
testthat::expect_equal(
  divs |> filter(!warddiv %in% unique(candidate_totals$warddiv)) |> nrow(),
  0
)
testthat::expect_equal(
  candidate_totals |> filter(!warddiv %in% divs$warddiv) |> nrow(),
  0
)

In the 2022 General Election for US Senate, Fetterman won 84% of Philadelphia’s two-way votes, to Oz’s 16% (two-way votes percentages are calculated ignoring votes for other candidates). For Governor, Shapiro won 87% of Philadelphia’s two-way votes to Mastriano’s 13%. Below is a cross-tabulation of the voter-level results.

View code
twoway_offices <- c(
  "GOVERNOR AND LIEUTENANT GOVERNOR (8687)" = "Governor",
  "UNITED STATES SENATOR (8682)" = "Senate"  
)


twoway_cands <- c(
  "MEHMET OZ REP (102)" = "Oz",
  "DOUGLAS V MASTRIANO AND CARRIE LEWIS DELROSSO REP (107)" = "Mastriano",
  "JOHN FETTERMAN DEM (101)" = "Fetterman",
  "JOSH SHAPIRO AND AUSTIN DAVIS DEM (106)" = "Shapiro"
)


candidate_pairs <- candidate_pairs |> 
  mutate(
    office.x = twoway_offices[office.x],
    office.y = twoway_offices[office.y],
    candidate.x = replace_na(twoway_cands[candidate.x], "Other Candidate"),
    candidate.y = replace_na(twoway_cands[candidate.y], "Other Candidate")
  ) |>
  filter(!is.na(office.x), !is.na(office.y)) |>
  group_by(warddiv, vote_type, office.x, office.y, candidate.x, candidate.y) |>
  summarise(pair_count = sum(pair_count), .groups="drop")

candidate_totals <- candidate_totals |>
  mutate(
    office = twoway_offices[office],
    candidate = replace_na(twoway_cands[candidate], "Other Candidate")
  ) |>
  filter(!is.na(office)) |>
  group_by(warddiv, vote_type, office, candidate) |>
  summarise(candidate_total = sum(candidate_total), .groups="drop")
  
office_totals <- candidate_totals |>
  group_by(vote_type, warddiv, office) |>
  summarise(
    office_total = sum(candidate_total),
    .groups="drop"
  )

office_totals <- office_totals |>
  group_by(vote_type, warddiv) |>
  mutate(votes_both_offices = sum(office_total)) |>
  left_join(
    candidate_pairs |> group_by(vote_type, warddiv) |>
      summarise(total_pairs = sum(pair_count))
  ) |>
  mutate(
    denominator = votes_both_offices - total_pairs
  )

df_topline <- 
  expand.grid(
    warddiv = unique(candidate_pairs$warddiv),
    vote_type = unique(candidate_pairs$vote_type),
    office.x = unique(candidate_pairs$office.x),
    office.y = unique(candidate_pairs$office.y),
    candidate.x = unique(candidate_pairs$candidate.x),
    candidate.y = unique(candidate_pairs$candidate.y)
  ) |>
  left_join(candidate_pairs) |> 
  left_join(
    candidate_totals |> 
      rename(candidate_total.x = candidate_total) |>
      filter(office == "Governor"),
    by = c(
      "candidate.x"="candidate", 
      "warddiv"="warddiv", 
      "office.x"="office",
      "vote_type"="vote_type")
  ) |>
  left_join(
    candidate_totals |> 
      rename(candidate_total.y = candidate_total) |>
      filter(office == "Senate"),
    by = c(
      "candidate.y"="candidate", 
      "warddiv"="warddiv", 
      "office.y"="office",
      "vote_type"="vote_type")
  ) |>
  left_join(
    office_totals |> 
      filter(office == "Senate") |>
      group_by(vote_type, warddiv) |>
      summarise(denominator= sum(denominator, na.rm=TRUE))
  )

df_topline_total <- df_topline |>
  group_by(candidate.x, candidate.y) |>
  summarise(across(
    c(pair_count, candidate_total.x, candidate_total.y, denominator),
    function(x) sum(x, na.rm=TRUE)
  ))

# write.csv(df_topline_total, file="sen_gov_crosstabs.csv")

df_topline_type <- df_topline |>
  group_by(vote_type, candidate.x, candidate.y) |>
  summarise(across(
    c(pair_count, candidate_total.x, candidate_total.y, denominator),
    function(x) sum(x, na.rm=TRUE)
  ))

# write.csv(df_topline_type, file="sen_gov_crosstabs_type.csv")


# sen_div <- df_topline_twoway |> 
#   group_by(warddiv, candidate.y) |> 
#   rename(candidate = candidate.y) |> 
#   summarise(votes = sum(votes))
# 
# candidate_totals |> filter(office %in% sen_str) |>
#   mutate(candidate = replace_na(twoway_cands[candidate], "Other")) |>
#   group_by(candidate, warddiv) |>
#   summarise(votes_tot = sum(candidate_total)) |>
#   left_join(sen_div) |>
#   filter(votes_tot != votes)
# 
# candidate_totals |> filter(warddiv == "01-03", candidate == names(twoway_cands)[2])
# candidate_pairs |> filter(candidate.y == names(twoway_cands)[3], warddiv == "01-01", office.x == gov_str)

Notice from the table that there are more Shapiro-Oz than Mastriano-Fetterman voters. This suggests an ordering of candidate combinations from most Republican to most Democratic: Mastriano-Oz, Shapiro-Oz, Shapiro-Fetterman. (It’s not clear how other candidates or no votes fit into this ordering.)

There are obvious differences between Mail-in and Election Day voters, as well. Mail-In voters are more Democratic, with 89% voting for both Fetterman and Shapiro vs 76% on Election Day. Even among Oz voters, Mail-In Oz voters voted more for Shapiro than Election Day Oz voters, 32% vs 15%.

These rates of cross-party voting differed by Division as well. Below, we divide voters by their vote for Governor, and measure how they voted for Senate. These rates themselves differ by Division partisanship, which we estimate using overall percentage for Mastriano. This heterogeneity is the core of the Ecological Inference problem; Mastriano voters are themselves different in Mastriano Divisions and Shapiro Divisions (as are Shapiro voters).

View code
div_res <- readRDS("../../data/processed_data/df_major_type_20230116.Rds")
div_res <- div_res |> filter(year == 2022, election_type == "general")

div_gov_sen <- div_res |>
  filter(office %in% c("GOVERNOR", "UNITED STATES SENATOR")) |>
  mutate(candidate = case_when(
    grepl("FETTERMAN", candidate) ~ "Fetterman",
    grepl("MEHMET OZ", candidate) ~ "Oz",
    grepl("MASTRIANO", candidate) ~ "Mastriano",
    grepl("SHAPIRO", candidate) ~ "Shapiro",
    TRUE ~ paste(office, "- Other Candidate")
  )) |>
  group_by(office, candidate, warddiv) |>
  summarise(votes = sum(votes), .groups="drop") |>
  group_by(office, warddiv) |>
  mutate(
    total_votes = sum(votes),
    pvote = votes / total_votes
  ) |>
  ungroup() |>
  select(warddiv, candidate, votes, pvote, total_votes) |>
  pivot_wider(
    names_from = candidate,
    values_from = c(votes, pvote, total_votes)
  )



div_gov_sen <- div_gov_sen |>
  mutate(
    pvote_rep_avg = (pvote_Mastriano + pvote_Oz) / 2,
    pvote_rep_gov = pvote_Mastriano,
    pvote_rep_sen = pvote_Oz,
    rep_group = cut(pvote_rep_avg, breaks=c(-0.001, seq(0.1,1,0.1)))
  )
levels(div_gov_sen$rep_group)[1] <- "[0.0,0.1]"

testthat::expect_true(
  df_topline |> filter(is.na(denominator)) |> with(all(vote_type=="Provisional"))
)

df_topline_divcat <- df_topline |>
  group_by(warddiv, office.x, office.y, candidate.x, candidate.y) |>
  summarise(
    across(pair_count:denominator, function(x) sum(x, na.rm=TRUE)),
    .groups="drop"
  ) |>
  left_join(
    as.data.frame(div_gov_sen) |> 
      select(warddiv, rep_group, pvote_rep_avg, pvote_rep_gov, pvote_rep_sen)
  ) |>
  filter(!is.na(denominator)) |>
  group_by(
    rep_group, office.x, office.y, candidate.x, candidate.y
  ) |>
  summarise(
    pvote_rep_avg = weighted.mean(pvote_rep_avg, w=denominator, na.rm=TRUE),
    pvote_rep_gov = weighted.mean(pvote_rep_gov, w=denominator, na.rm=TRUE),
    pvote_rep_sen = weighted.mean(pvote_rep_sen, w=denominator, na.rm=TRUE),
    n_divs = length(unique(warddiv)),
    across(
      c(pair_count, candidate_total.x, candidate_total.y, denominator),
      function(x) sum(x, na.rm=TRUE)
    )
  )


df_topline_divcat |> 
  group_by(rep_group, candidate.x) |>
  mutate(
    prop_among_voters = pair_count / candidate_total.x
  ) |>
  ungroup() |>
  filter(candidate.y %in% c("Fetterman", "Oz")) |>
  select(rep_group, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov) |>
  bind_rows(
    df_topline_divcat |>
      ungroup() |>
      #arbitrary row
      filter(candidate.x == "Mastriano", candidate.y %in% c("Fetterman", "Oz")) |>
      mutate(
        candidate.x = "All",
        prop_among_voters = candidate_total.y / denominator
      ) |>
      select(rep_group, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov)
  ) |>
  mutate(
    candidate.x = case_when(
      candidate.x %in% c("Mastriano", "Shapiro", "All") ~ paste(candidate.x, "voters"),
      candidate.x == "Other Candidate" ~ "Other voters"
    ),
    candidate.x = factor(candidate.x, levels=sort(unique(candidate.x))[c(1,2,4,3)])
  ) |>
  ggplot(
    aes(x=100*pvote_rep_gov, y=100*prop_among_voters, color=candidate.x)
  ) +
    geom_line() +
    geom_point(size=2) +
    ggthemes::theme_clean() +
    # scale_size_area() +
    scale_color_manual(
      values = c(
        "Mastriano voters" = strong_red,
        "Shapiro voters" = strong_blue,
        "Other voters" = strong_orange,
        "All voters" = "black"
      )
    ) +
  facet_grid(~candidate.y) +
  expand_limits(x=0, y=c(0,100))+
  labs(
    title = "Senate results based on individuals' Governor vote",
    subtitle = "By Division partisanship",
    size = "Total Voters",
    color = NULL,
    linetype = NULL,
    x = "Divisions' percent vote for Mastriano",
    y = "Percent of votes received by Senate candidate"
  )

A naive assumption would be that Shapiro voters would vote for Fetterman at the same rate in all Divisions; the only difference between Republican and Democratic Divisions is that there are different proportions of them. But Shapiro voters were actually quite different in Republican Divisions than in Democratic Divisions. In the most Democratic Divisions (where Mastriano received 4% of the vote), 97% of Shapiro voters also voted for Fetterman. But in Philadelphia’s most Republican Divisions (where Mastriano received 59% of the vote), only 76% of Shapiro voters also voted for Fetterman. In retrospect this is obvious, but naive ecological inference assumes this isn’t the case.

Voters for “Other” candidates for Governor also show strong correlation with their overall Division partisanship: those voters were more likely to vote for Fetterman in Democratic Divisions, and for Oz in Republican Divisions.

Consider these same results by vote type.

View code
df_topline_divcat_type <- df_topline |>
  group_by(warddiv, vote_type, office.x, office.y, candidate.x, candidate.y) |>
  summarise(
    across(pair_count:denominator, function(x) sum(x, na.rm=TRUE)),
    .groups="drop"
  ) |>
  left_join(
    as.data.frame(div_gov_sen) |> 
      select(warddiv, rep_group, pvote_rep_avg, pvote_rep_gov, pvote_rep_sen)
  ) |>
  filter(!is.na(denominator)) |>
  group_by(
    rep_group, vote_type, office.x, office.y, candidate.x, candidate.y
  ) |>
  summarise(
    pvote_rep_avg = weighted.mean(pvote_rep_avg, w=denominator, na.rm=TRUE),
    pvote_rep_gov = weighted.mean(pvote_rep_gov, w=denominator, na.rm=TRUE),
    pvote_rep_sen = weighted.mean(pvote_rep_sen, w=denominator, na.rm=TRUE),
    n_divs = length(unique(warddiv)),
    across(
      c(pair_count, candidate_total.x, candidate_total.y, denominator),
      function(x) sum(x, na.rm=TRUE)
    )
  )


df_topline_divcat_type |> 
  group_by(rep_group, vote_type, candidate.x) |>
  mutate(
    prop_among_voters = pair_count / candidate_total.x
  ) |>
  ungroup() |>
  filter(candidate.y %in% c("Fetterman", "Oz")) |>
  select(rep_group, vote_type, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov) |>
  bind_rows(
    df_topline_divcat_type |>
      ungroup() |>
      #arbitrary row
      filter(candidate.x == "Mastriano", candidate.y %in% c("Fetterman", "Oz")) |>
      group_by(vote_type) |>
      mutate(
        candidate.x = "All",
        prop_among_voters = candidate_total.y / denominator
      ) |>
      select(rep_group, vote_type, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov)
  ) |>
  mutate(
    candidate.x = case_when(
      candidate.x %in% c("Mastriano", "Shapiro", "All") ~ paste(candidate.x, "voters"),
      candidate.x == "Other Candidate" ~ "Other voters"
    ),
    candidate.x = factor(candidate.x, levels=sort(unique(candidate.x))[c(1,2,4,3)])
  ) |>
  filter(vote_type %in% c("Mail Votes", "Election Day")) |>
  ggplot(
    aes(x=100*pvote_rep_gov, y=100*prop_among_voters, color=candidate.x)
  ) +
    geom_line() +
    geom_point(size=2) +
    ggthemes::theme_clean()  +
    # scale_size_area() +
    scale_color_manual(
      values = c(
        "Mastriano voters" = strong_red,
        "Shapiro voters" = strong_blue,
        "Other voters" = strong_orange,
        "All voters" = "black"
      )
    ) +
  facet_grid(vote_type~candidate.y) +
  expand_limits(x=0, y=c(0,100))+
  labs(
    title = "Senate results based on individuals' Governor vote",
    subtitle = "By Division partisanship and vote type",
    size = "Total Voters",
    color = NULL,
    linetype = NULL,
    x = "Divisions' percent vote for Mastriano",
    y = "Percent of votes received by Senate candidate"
  )

Mail-in voters showed more consistency across Divisions. Shapiro voters also voted for Fetterman, even within Republican Divisions: in the most Republican Divisions, 87% of mail-in Shapiro voters voted for Fetterman, but only 71% of Election Day Shapiro voters did.

Metrics for Ecological Inference

Having seen the heterogeneity in voters by Division, what do those differences mean for Division-level analyses of candidate correlations?

At the Division level, votes for Mastriano are strongly correlated with votes for Oz, and votes for Shapiro strongly correlated with votes for Fetterman.

View code
# stats::cov.wt(
#   x = div_gov_sen |> select(starts_with("pvote_")),
#   w = div_gov_sen |> with(total_votes_Oz + total_votes_Mastriano),
#   cor = TRUE
# )$cor
# 
# stats::cov.wt(
#   x = div_gov_sen |> select(starts_with("pvote_")),
#   w = div_gov_sen |> with(total_votes_Oz + total_votes_Mastriano)
# )$cov
# 
# 
# lm(
#   pvote_Oz ~ pvote_Mastriano, data = div_gov_sen
# )
# lm(
#   pvote_Fetterman ~ pvote_Shapiro, data = div_gov_sen
# )

div_gov_sen |>
    mutate(total_votes = (total_votes_Mastriano + total_votes_Oz)/2) |>
    select(
      warddiv, total_votes, pvote_Mastriano, pvote_Shapiro, pvote_Fetterman, pvote_Oz
    ) |>
    pivot_longer(
      cols=c(pvote_Mastriano, pvote_Shapiro, pvote_Fetterman, pvote_Oz),
      names_to="candidate",
      values_to="pvote"
    ) |>
  mutate(
    office = ifelse(grepl("Mastriano|Shapiro", candidate), "Governor", "Senator"),
    party = ifelse(grepl("Mastriano|Oz", candidate), "Republican Candidates", "Democratic Candidates")
  ) |>
  pivot_wider(
    id_cols = c(warddiv, party, total_votes),
    names_from = c(office),
    values_from = c(pvote)
  ) |>
ggplot(
  aes(x=100*Governor, y=100*Senator)
) +
  geom_point(aes(size=total_votes), alpha = 0.2) +
  scale_size_area() +
  facet_grid(~party) +
  theme_minimal() +
  geom_abline(slope=1, intercept=0) +
  coord_fixed() +
  labs(
    title="Division-level Governor and Senator results are highly correlated",
    x = "Percent for Governor",
    y = "Percent for Senate",
    size="Total Votes"
  )

How do summary metrics compare between Division- and individual levels?

View code
corr_from_counts <- function(
    a_and_b, a_only, b_only, total, 
    method = c("corr", "cov", "slope_a", "slope_b")
){
  method = match.arg(method)
  prop_a <- (a_and_b + a_only) / total
  prop_b <- (a_and_b + b_only) / total
  
  cov <- a_and_b / total - prop_a * prop_b 
  if(method == "cov") return(cov)
  
  sd_a <- sqrt(prop_a * (1-prop_a))
  sd_b <- sqrt(prop_b * (1-prop_b))
  corr <- cov / (sd_a * sd_b)
  
  if(method == "corr"){
    return(corr)
  } else if(method == "slope_a"){
    return(corr * sd_b / sd_a)
  } else if(method == "slope_b"){
    return(corr * sd_a / sd_b)
  } else {
      stop(paste("bad slope_denom:", slope_denom))
  }
}

get_corr_for_candidates <- function(cand_x, cand_y, df_topline_total, ...){
  a_and_b <- df_topline_total |> 
    filter(candidate.x==cand_x, candidate.y == cand_y) |> 
    with(pair_count)
  a_total <- df_topline_total |> 
    filter(candidate.x==cand_x) |> 
    with(candidate_total.x[1])
  b_total <- df_topline_total |> 
    filter(candidate.y==cand_y) |> 
    with(candidate_total.y[1])
  total <- df_topline_total$denominator[1]
  
  corr_from_counts(
    a_and_b,
    a_total - a_and_b,
    b_total - a_and_b,
    total,
    ...
  )
}

overall_corrs <- df_topline_total |> 
  rowwise() |>
  mutate(
    corr = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator
    ),
    slope = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "slope_a"
    ),
    cov = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "cov"
    )
  ) 

# write.csv(overall_corrs, file="overall_corrs.csv")

# cov.wt(
#   x = div_gov_sen |> select(pvote_Mastriano, pvote_Oz, pvote_Fetterman, pvote_Shapiro),
#   wt = (div_gov_sen$total_votes_Mastriano + div_gov_sen$total_votes_Oz)/2
# )$cov

type_corrs <- df_topline_type |>
  rowwise() |>
  mutate(
    corr = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator
    ),
    slope = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "slope_a"
    ),
    cov = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "cov"
    )
  )

# write.csv(type_corrs, file = "type_corrs.csv")
  
div_gov_sen_type <- div_res |>
  rename(vote_type = type) |>
  filter(office %in% c("GOVERNOR", "UNITED STATES SENATOR")) |>
  mutate(candidate = case_when(
    grepl("FETTERMAN", candidate) ~ "Fetterman",
    grepl("MEHMET OZ", candidate) ~ "Oz",
    grepl("MASTRIANO", candidate) ~ "Mastriano",
    grepl("SHAPIRO", candidate) ~ "Shapiro",
    TRUE ~ paste(office, "- Other Candidate")
  )) |>
  group_by(office, candidate, warddiv, vote_type) |>
  summarise(votes = sum(votes), .groups="drop") |>
  group_by(office, warddiv, vote_type) |>
  mutate(
    total_votes = sum(votes),
    pvote = votes / total_votes
  ) |>
  ungroup() |>
  select(vote_type, warddiv, candidate, votes, pvote, total_votes) |>
  pivot_wider(
    names_from = candidate,
    values_from = c(votes, pvote, total_votes)
  )

cov_types <- div_gov_sen_type |>
    filter(vote_type %in% c("In Person", "Mail")) |>
    group_by(vote_type) |>
    nest() |>
  mutate(
    corr = map(
      .x=data,
      ~stats::cov.wt(
        x = .x |> select(starts_with("pvote_")),
        w = .x |> with(total_votes_Oz + total_votes_Mastriano),
        cor = TRUE
      )$cor
    ),
    cov = map(
      .x=data,
      ~stats::cov.wt(
        x = .x |> select(starts_with("pvote_")),
        w = .x |> with(total_votes_Oz + total_votes_Mastriano),
      )$cov |> as.data.frame()
    ),
    slope = map(
      .x=data,
      ~coef(lm(pvote_Oz ~ pvote_Mastriano, data = .x, weights = (total_votes_Oz + total_votes_Mastriano)))
    )
  )

# cov_types |> select(vote_type, cov) |> unnest(cols=c(cov)) |>
#   write.csv("div_corr_type.csv")

At the individual voter level, voting for Mastriano has a 0.85 correlation with voting for Oz; voting for Shapiro a 0.79 correlation with voting for Fetterman. Translating to regression coefficients, people who voted for Mastriano were 91pp more likely to vote for Oz than those who didn’t, people who voted for Shapiro were 84pp more likely to vote for Fetterman than those who didn’t.

The correlation measured in Division results were much higher. A Division’s proportion for Mastriano and proportion for Oz has a near-perfect 0.994 correlation, and for Shapiro and for Fetterman a similar 0.993. The regression coefficients are also larger: a Division that voted 1pp more for Mastriano voted 1.15 pp more for Oz, and a Division that voted 1pp more for Shapiro voted 1.16 pp more for Fetterman. (Regression coefficients being greater than 1 is obviously impossible at the person level.)

The fact that Division-level correlations are higher than individual provides a warning for Ecological inference.

The total person-level covariance of Mastriano and Oz votes is 0.103, of Shapiro and Fetterman votes 0.111. The covariance of the candidates’ averages across Divisions is 0.023 for Mastriano-Oz, and 0.024 for Shapiro-Fetterman. This means that the total covariance within divisions is 0.080 and 0.087, respectively. (How do we reconcile that most of the person-level covariance is not explained by Division, yet the Division correlation is much higher? It’s because the standard deviation of Division averages–the denominator of the correlation–is also much smaller.)

Bayesian Model

Let’s pretend that we didn’t have the person-level results, and try to reconstruct them from the division-level results using a model inspired by King, Rosen and Tanner (1999). Consider the two-by-two contingency tables above, with votes for Governor by row and Senate by column. Use subscripts \(o,f,m,s,oth\) to mean Oz, Fetterman, Mastriano, and Shapiro, and Other, respectively. We’ll model the Senate votes conditioned on the total Gubernatorial votes \(G_{d,m}, G_{d,s}, G_{d, oth}\). Write the proportion of votes for senate among voters for a Governor candidate in division \(d\) as e.g. \(p_{d,o,m}\). Then the votes for Senate are generated as:

\[\begin{align*} \mu_{d, o} =& p_{d,o,m} G_{d, m} + p_{d,o,s} G_{d, s} + p_{d,o,oth} G_{d, oth} \\ var_{d, o} =& p_{d,o,m}(1-p_{d,o,m}) G_{d, m} + p_{d,o,s} (1-p_{d,o,s}) G_{d, s} + p_{d,o,oth}(1-p_{d,o,oth}) G_{d, oth} \\ S_{d, o} \sim& N(\text{mean}=G_d p_{d, o}, \text{var}=var_{d, o}) \end{align*}\]

But we also want each \(p\) to vary by the partisanship of the Division. So we additionally model

\[\begin{align*} \text{for(gov in (m, s, oth)):}& \\ &p_{d,o,gov} \propto \text{exp}(\alpha_{o,gov} + \beta_{o,gov} p_{d,m}) \\ &p_{d,s,gov} \propto \text{exp}(\alpha_{s,gov} + \beta_{s,gov} p_{d,m}) \\ &p_{d,oth,gov} \propto \text{exp}(\alpha_{oth,gov} + \beta_{oth,gov} p_{d,m}) \end{align*}\]

where we model the senate vote choice among the voters for each governor as a multinomial regression.

We fit this data in numpyro (see the numpyro notebook for the code). The model produces estimates of the full contingency table (Table 1) for each Division. Below we recreate the Division plot from above using the simulations.

View code
sims <- read.csv("sen_gov_sim.csv")
# sims |> 
#   group_by(sen, gov) |>
#   summarise(votes = sum(sim), .groups="drop") |>
#   mutate(total=sum(votes)) |>
#   pivot_wider(
#     id_cols = c(total),
#     names_from = c(sen, gov),
#     values_from=votes
#   ) |>
#   with(
#       corr_from_counts(
#         a_and_b = Oz_Mastriano,
#         a_only = Oz_Shapiro + Oz_Other,
#         b_only = Fetterman_Mastriano + Other_Mastriano,
#         total = total
#       )
#   )

sims |>
  left_join(div_gov_sen |> select(warddiv, rep_group)) |>
  group_by(rep_group, gov, sen) |>
  summarise(votes=sum(sim)) |>
  group_by(gov, rep_group) |>
  mutate(
    total_votes_gov = sum(votes),
    prop_among_voters = votes / total_votes_gov
  ) |>
  group_by(rep_group) |>
  mutate(pvote_rep_gov = sum(votes[gov == "Mastriano"]) / sum(votes)) |>
  ungroup() |>
  mutate(
    gov = case_when(
      gov %in% c("Mastriano", "Shapiro", "All") ~ paste(gov, "voters"),
      gov == "Other" ~ "Other voters"
    ),
    gov = factor(gov, levels=sort(unique(gov))[c(1,2,4,3)])
  ) |>
  filter(sen != "Other") |>
  ggplot(
    aes(x=100*pvote_rep_gov, y=100*prop_among_voters, color=gov)
  ) +
    geom_line() +
    geom_point(size=2) +
    ggthemes::theme_clean() +
    # scale_size_area() +
    scale_color_manual(
      values = c(
        "Mastriano voters" = strong_red,
        "Shapiro voters" = strong_blue,
        "Other voters" = strong_orange,
        "All voters" = "black"
      )
    ) +
  facet_grid(~sen) +
  expand_limits(x=0, y=c(0,100))+
  labs(
    title = "(Bad) Simulated results using Ecological Inference",
    subtitle = "By Division partisanship",
    size = "Total Voters",
    color = NULL,
    linetype = NULL,
    x = "Divisions' percent vote for Mastriano",
    y = "Percent of votes received by Senate candidate"
  )

The model overestimated the heterogeneity of voters. Specifically, it believes that in more Republican Divisions, Shapiro voters are more likely than Mastriano voters to vote for Oz; it has the slopes of Mastriano voters with the entirely wrong sign (we know the true sign from the individual-level data). This illustrates the difficulty of ecological analysis: when Oz does better, the model assumes it must be because he did better among the more numerous Shapiro voters. Its only leverage for identifying the slope is the relative variances among divisions; that’s a very indirect identification strategy. Knowing the real individual patterns from above, we could obviously tweak the model to recreate the true results by adding strong priors. But in general we don’t know the actual individual patterns, so wouldn’t know whether the model was broken or how to fix it.

Conclusion

In sum, Division measurements clearly overstate voter-level correlations, though in this election both were extremely high. We hope to do more work to understand how we might infer voter-level statistics in general. And with May 16 approaching, we will be able to recreate this analysis on a heavily-contested primary election.

Mayor 2023: Back of the Envelope, Part 2

As the Mayor’s race heats up, I’m doing a series establishing some baseline numbers. What follows are simplistic calculations using reasonable assumptions. Welcome to the Back of the Envelope. See Part 1 here.

Will candidates split the vote?

In Part 1, I argued that Philadelphia’s votes will probably be constituted by 35% from the Black Voter wards, 30% from Wealthy Progressive wards, 25% from White Moderate wards, and 10% from Hispanic Voter wards.

This means no single bloc can win the election on its own, and the winner will need to dominate at least one bloc, and do well enough in the others.

The question on my mind is whether candidates could run for the same bloc, and split the vote. With 10 candidates announced, 20% of the vote could be enough to win. If three candidates each try to consolidate the Black Voter divisions, or the Progressive divisions, could a single candidate optimize for the White Moderate divisions and win? And what neighborhoods will candidates presumably do best in?

Below, I consider candidates’ past elections to guess which blocs they’ll be targetting.

View code
library(tidyverse)
library(sf)

source("../../admin_scripts/util.R")


setwd("C:/Users/Jonathan Tannen/Dropbox/sixty_six/posts/council_ballot_position_23/")

df_major_type <- readRDS("../../data/processed_data/df_major_type_20230116.Rds")
df_major <- df_major_type %>%
  group_by(office, candidate, party, warddiv, year, election_type, district, ward, is_topline_office) %>%
  summarise(votes = sum(votes))

df_major <- df_major %>% 
  group_by(year, election_type, office, district, warddiv) %>%
  mutate(pvote = votes / sum(votes)) %>%
  ungroup()

topline_votes <- df_major %>% 
  filter(is_topline_office) %>%
  group_by(election_type, year) %>%
  summarise(votes = sum(votes)) %>%
  mutate(
    year = asnum(year),
    cycle = case_when(
      year %% 4 == 0 ~ "President",
      year %% 4 == 1 ~ "District Attorney",
      year %% 4 == 2 ~ "Governor",
      year %% 4 == 3 ~ "Mayor"
    )
  )

cycle_colors <- c("President" = strong_red, "Mayor" = strong_blue, "District Attorney" = strong_green, "Governor" = strong_orange)

office_votes = df_major %>%
  group_by(
    year, warddiv, election_type, office, district
  ) %>%
  summarise(total_votes = sum(votes))

format_name <- function(x){
  x <- stringr::str_to_lower(x)
  x <- gsub("(^|\\s)([a-z])", "\\1\\U\\2", x, perl = TRUE)
  x
}
divs <- st_read("../../data/gis/warddivs/202011/Political_Divisions.shp") %>%
  mutate(warddiv = pretty_div(DIVISION_N))source("../../data/prep_data/div_svd_time_util.R")
div_cat_fn <- readRDS("../../data/processed_data/svd_time_20230116.RDS")

div_cats <- div_cat_fn %>% get_row_cats(2017) %>% rename(warddiv = row_id)

cats <- c(
  "Black Voters",
  "Wealthy Progressives",
  "Hispanic Voters",
  "White Moderates"
)

cat_colors <- c(light_blue, light_red, light_orange, light_green)
names(cat_colors) <- cats

What blocs will each candidate target?

For each candidate, let’s look at some relevant elections. First, the At Large Council candidates, for whom we have the best, city-wide, reference elections.

View code
relevant_elections <- tribble(
  ~candidate, ~year, ~election_type, ~office,
  "ALLAN DOMB", 2015, "primary", "COUNCIL AT LARGE",
  "ALLAN DOMB", 2019, "primary", "COUNCIL AT LARGE",
  "DEREK S GREEN", 2015, "primary", "COUNCIL AT LARGE",
  "DEREK S GREEN", 2019, "primary", "COUNCIL AT LARGE",
  "HELEN GYM", 2015, "primary", "COUNCIL AT LARGE",
  "HELEN GYM", 2019, "primary", "COUNCIL AT LARGE",
  "REBECCA RHYNHART", 2017, "primary", "CITY CONTROLLER",
  # "REBECCA RHYNHART", 2021, "primary", "CITY CONTROLLER",
) %>% mutate(year = as.character(year))

comp_results <- df_major %>% 
  inner_join(relevant_elections) %>%
  left_join(office_votes) %>%
  group_by(candidate, year, election_type, office) %>%
  mutate(
    pvote_total = sum(votes, na.rm=TRUE) / sum(total_votes, na.rm=TRUE)
  ) %>%
  ungroup()

cat_results <- comp_results %>% 
  left_join(div_cats %>% select(-year)) %>%
  group_by(year, cat, candidate, election_type, office) %>%
  summarise(pvote = sum(votes) / sum(total_votes))


ggplot(
  divs %>% 
    left_join(comp_results) %>%
    filter(office == "COUNCIL AT LARGE")
) +
  geom_sf(aes(fill=100 * pmin(pvote, 0.20)), color=NA) +
  facet_grid(format_name(candidate) ~ year) +
  scale_fill_viridis_c("% of Votes") +
  theme_map_sixtysix() %+replace%
  theme(legend.position = "right") +
  labs(
    title="City Council At Large Results"
  )

In 2015, Domb did best in the River Wards, the Northeast, South Philly, and immediate Center City. Derek Green did best among Black divisions in West, and North Philly. Helen Gym did best in the ring around Center City and Chestnut Hill / Mount Airy.

In 2019, we see the effects of incumbency. Gym led the pack by dominating the Wealthy Progressive and Black Wards from the first column, Green did well enough everywhere despite losing his top ballot position, and Domb improved across Black Wards thanks to incumbency and the party endorsement.

View code
ggplot(
  cat_results %>% filter(office == "COUNCIL AT LARGE"),
  aes(x=cat, y=100*pvote)
) +
  geom_bar(aes(fill=cat), color=NA, stat="identity") +
  facet_grid(year ~ format_name(candidate)) +
  scale_fill_manual(values=cat_colors, guide=FALSE) +
  theme_sixtysix() %+replace%
  theme(axis.text.x = element_text(angle=45, hjust = 1, vjust=1)) +
  labs(
    title="City Council At Large by Bloc",
    x=NULL,
    y = "% of Votes"
  )

Rhynhart won a competitive primary in 2017, albeit in the lower profile City Controller race. She did so by cleaning up in the Wealthy Progressive wards around Center City, though did beat 50% in the Black and Hispanic Voter blocs.

View code
ggplot(
  divs %>% 
    left_join(comp_results) %>%
    filter(office == "CITY CONTROLLER")
) +
  geom_sf(aes(fill=100 * pvote), color=NA) +
  # facet_grid(format_name(candidate) ~ year) +
  scale_fill_viridis_c("% of Votes") +
  theme_map_sixtysix() %+replace%
  theme(legend.position = "right") +
  labs(
    title="Rebecca Rhynhart in the 2017 Primary"
  )

View code
ggplot(
  cat_results %>% filter(office == "CITY CONTROLLER"),
  aes(x=cat, y=100*pvote)
) +
  geom_bar(aes(fill=cat), color=NA, stat="identity") +
  # facet_grid(year ~ format_name(candidate)) +
  scale_fill_manual(values=cat_colors, guide=FALSE) +
  theme_sixtysix() %+replace%
  theme(axis.text.x = element_text(angle=45, hjust = 1, vjust=1)) +
  labs(
    title="Rebecca Rhynhart's 2017 Primary by Bloc",
    x=NULL,
    y = "% of Votes"
  )

Three candidates come from districted seats: Cherelle Parker, Maria Quiñones-Sánchez, and Amen Brown. We don’t have city-wide elections for them, but can reasonably guess that they’ll do well in the blocs that their districts constitute.

View code
district_elections <- tribble(
  ~candidate, ~year, ~election_type, ~office,
  "CHERELLE L PARKER", 2019, "primary", "DISTRICT COUNCIL",
  "MARIA QUIÑONES SÁNCHEZ", 2019, "primary", "DISTRICT COUNCIL",
  "AMEN BROWN", 2022, "primary", "REPRESENTATIVE IN THE GENERAL ASSEMBLY"
) %>% mutate(year = as.character(year))

districts <- 
  divs %>% inner_join(
    df_major %>% 
    inner_join(district_elections) %>%
    select(office, district, warddiv, candidate)
  ) %>% group_by(office, district, candidate) %>% 
  summarise(
    geometry=st_union(geometry)
  )
districts <- nngeo::st_remove_holes(districts)


phila_whole <- st_read("../../data/gis/warddivs/201911/Political_Wards.shp") %>% 
  st_union()ggplot(phila_whole) + 
  geom_sf(color=NA) + 
  geom_sf(data=districts, aes(fill = candidate), color=NA) +
  geom_sf_text(
    data=districts %>% mutate(geom = st_centroid(geom)),
    aes(label = sprintf(
      "%s\n(%s-%s)", 
      format_name(candidate), 
      case_when(office == "DISTRICT COUNCIL" ~ "Council", TRUE ~ "PA"),
      district
    )),
    fontface="bold"
  ) +
  scale_fill_discrete(guide=F) +
  theme_map_sixtysix() +
  labs(
    title="Mayoral Candidates' Districts"
  )

Parker and Brown represent predominantly-Black districts, Quiñones-Sánchez a predominantly Hispanic one.

That leaves Warren Bloom, Jeff Brown, and Jimmy DeLeon as candidates without instructive elections. Brown is spending a lot of money, and running on a platform that I expect to do better in the White Moderate bloc. Bloom has run six times before without making any ripples, and DeLeon is a… longshot.

The crowded lanes

Where does that leave us? We have three candidates who have done best in Black wards, Parker, Green, and Amen Brown; two candidates who have done best in Wealthy Progressive wards, Gym and Rhynhart; two we can expect to do best in White Moderate wards, Domb and Jeff Brown; one who’ll do best in the Hispanic bloc, Quiñones-Sánchez; and two without an electoral history, Bloom and Deleon.

Obviously these candidates won’t split the blocs evenly, so we can’t just divide the vote by the number of candidates. And winning will require doing well enough in other blocs to hit a winning profile. The candidates from At Large council–Green, Gym, and Domb–probably enter with the best name recognition, and might have higher baselines even in their worse blocs.

That was clear in the last two competitive mayoral primaries of 2007 and 2015. In 2007, Nutter dominated the Wealthy Progressive bloc, but also won the Black Voter bloc and broke 20% in White Moderate and Hispanic Voter blocs. In 2015, Kenney did best in the White Moderate and Wealthy Progressive blocs, but actually won in all of them. Both Kenney and Nutter were City Councilmembers.

View code
df_major %>% 
  filter(
    office == "MAYOR",
    election_type == "primary",
    party == "DEMOCRATIC",
    year %in% c(2015, 2007)
  ) %>% 
  left_join(div_cats %>% select(-year)) %>%
  group_by(cat, year, candidate) %>%
  summarise(votes=sum(votes, na.rm=TRUE)) %>%
  group_by(cat, year) %>%
  mutate(pvote = votes / sum(votes)) %>%
  ggplot(
    aes(x = cat, y=100*pvote) 
  ) +
  geom_text(aes(label = format_name(candidate))) +
  facet_grid(year ~.) +
  geom_hline(yintercept=0)+
  theme_bw() +
  labs(
    x=NULL,
    y="% of Votes",
    title="Results in each bloc",
    subtitle="2007 and 2015 Mayoral Primaries"
  )

With petitions due soon and the race maturing, we will probably see several candidates drop out ahead of the election. That could help a candidate consolidate their bloc. Without drop-outs, though, this election would probably prove more fractured than 2007 and 2015. But the broader point holds: the winner will probably do well in two blocs, and well-enough in the rest.