Individual voter patterns in Philadelphia’s 2022 General Election

Jonathan Tannen, Sixty Six Wards
Seth Bluestein, Philadelphia City Commissioner

Today, Sixty Six Wards is partnering with Philadelphia City Commissioner Seth Bluestein to look at individual voter heterogeneity in the 2022 General Election, and how well Division-level capture voter-level patterns.

Election results are generally released only by precinct. This means that media figures, political pollsters, candidates for office, and conscientious citizens are limited to explaining results at the precinct level, which might contain hundreds or thousands of voters. We can know precinct-level correlations between candidates–-Mehmet Oz did better in the precincts where Doug Mastriano performed well, and John Fetterman where Josh Shapiro did–-but can’t disentangle how much of that is within-voter correlations (everyone is assumed to be either an Oz-Mastriano voter or a Fetterman-Shapiro voter) versus the fact that Oz voters are more likely to live where Mastriano voters do, but might not be the same people. Anecdotal evidence clearly shows that there are voters who don’t vote exactly along the party line, but the limitation of precinct-level results would ordinarily make it impossible measure the actual rate.

The assumption that geographical aggregates are the same as individual patterns for the people within them is called the “Ecological Fallacy”. Much election analysis is Ecological (if you’re reading this, you’ve seen maps upon maps of division results). Ecological analysis can come in the form of demographic analyses (e.g. Black wards voted highly for Derek Green), or in the form of cross-candidate results (e.g. Mastriano Divisions also voted for Oz). Ecological analyses can hide important diversity among voters; that diversity will be especially relevant in our upcoming race for Mayor, where 11 candidates may split the vote.

Just how bad is the Ecological Fallacy? Obviously there were a lot of voters who did pick both Oz and Mastriano, or both Fetterman and Shapiro. So just how correct are precinct-level correlations as an approximation of voter-level ones?

Today, Sixty-Six Wards is partnering with the Office of City Commissioner Seth Bluestein to answer this question. The Commissioner’s Office has created a novel dataset of de-identified voter-level counts of candidate combinations from the 2022 Philadelphia General Election, which allows us to measure voter-level correlations without violating voter privacy.

Methodology

Broadly, we seek to answer the question: how close are Division-level inferences to individual-level reality? We’ll consider three measurements, all related but with results that can appear to tell very different stories.

First, we consider the correlation between votes for Mastriano and Oz (or Fetterman and Shapiro), at the Division and individual level.

Second, we consider the covariances between candidates’ votes. Although unscaled, using the covariance allows us to use the Law of Total Covariance: \[\begin{align*} Cov(X, Y) = Cov_D(E[X | D], E[Y | D]) + E_D[Cov(X, Y | D)] \end{align*}\] where the subscript \(D\) emphasizes the variable over which the expectations are taken. For our example, \(X\) might be a person’s vote for Oz, \(Y\) a person’s vote for Mastriano, and \(D\) their division. The Law states that the overall covariance between the votes can be decomposed into the covariance of division-level results (the first term), plus the average within-division covariances.

Third, we consider the OLS regression coefficient of one candidate’s votes on the other’s. For example, an additional vote for Oz was correlated with how many more votes for Mastriano?

Remember that the correlation is just a scaled covariance: \(Corr(X, Y) = Cov(X, Y) / \sqrt{Var(X)Var(Y)}\), as is the OLS coefficient: \(\beta = Cov(X, Y) / Var(X)\), so differences in these metrics are due to different denominators.

Reverse engineering individual patterns from ecological data is a holy grail of spatial statistics. Following King, Rosen and Tanner (1999), researchers have built Bayesian models to estimate individual patterns, gaining leverage by parametrizing the individual variances. We will finally consider a model inspired by those.

Is Mail-In Different?

For all of the metrics above, we will also consider questions by vote type.

In 2020, Pennsylvania introduced excuse-free mail-in voting, and the pandemic created a surge in its use. Mail ballots have become a partisan signal, with Democrats much more likely to use them than Republicans. Do those partisan differences carry over to ballot splitting?

We also might expect the physical act of mail-in voting to create different rates of crossover from in-person. Mail-in voters can fill out their ballot over days, doing more research with the ballot in their hands. While Election Day voters certainly can research ahead of time, it’s inevitable that the in-booth time constraints could encourage spur-of-the-moment decisions. Will that increase the partisan correlation? Decrease it?

Results

View code
library(tidyverse)
library(sf)
source("../../admin_scripts/util.R")
setwd("C:/Users/Jonathan Tannen/Dropbox/sixty_six/posts/voter_level_analysis/")
candidate_pairs_raw <- read_csv(
  "../../data/processed_data/candidate_pairs_2022.csv"
)

candidate_totals_raw <- read_csv(
  "../../data/processed_data/candidate_totals_2022.csv"
)

asnum <- function(x) as.numeric(as.character(x))

fix_month <- function(x){
  for(i in 1:12){
    has_month_second <- grep(paste0("[0-9]+-",month.abb[i]), x)
    x[has_month_second] <- paste(
      sprintf("%02d", i), 
      sprintf("%02d", asnum(gsub("^([0-9]+)-.*$", "\\1", x[has_month_second]))),
      sep="-"
    )
    has_month_first <- grep(paste0(month.abb[i], "-[0-9]+"), x)
    x[has_month_first] <- paste(
      sprintf("%02d", i), 
      sprintf("%02d", asnum(gsub("^.*-([0-9]+)$", "\\1", x[has_month_first]))),
      sep="-"
    )
  }
  x
}

candidate_pairs <- candidate_pairs_raw |> 
  filter(candidate.x != "overvote", candidate.y != "overvote") |>
  mutate(
    Division=fix_month(Division),
    ward = sapply(
      strsplit(Division, "-"),
      function(x) x[1]
    ) |> asnum(),
    div = sapply(
      strsplit(Division, "-"),
      function(x) x[2]
    ) |> asnum(),
    warddiv = paste(
      sprintf("%02d", ward),
      sprintf("%02d", div),
      sep="-"
    )
  ) |>
  rename(vote_type = `Mail Votes`) |>
  # Mail in seems to have extra "Fed" rows
  group_by(year, vote_type, warddiv, office.x, candidate.x, office.y, candidate.y) |>
  summarise(pair_count=sum(pair_count), .groups="drop")

candidate_totals <- candidate_totals_raw |> 
  mutate(
    Division=fix_month(Division),
    ward = sapply(
      strsplit(Division, "-"),
      function(x) x[1]
    ) |> asnum(),
    div = sapply(
      strsplit(Division, "-"),
      function(x) x[2]
    ) |> asnum(),
    warddiv = paste(
      sprintf("%02d", ward),
      sprintf("%02d", div),
      sep="-"
    )
  ) |>
    rename(vote_type = `Mail Votes`) |>
  # Mail in seems to have extra "Fed" rows
  group_by(year, vote_type, warddiv, office, candidate) |>
  summarise(
    candidate_total=sum(candidate_total),
    .groups="drop"
  )

divs <- st_read("../../data/gis/warddivs/201911/Political_Divisions.shp", quiet=TRUE) |>
  mutate(
    warddiv = paste0(substr(DIVISION_N,1,2), "-", substr(DIVISION_N, 3, 4))
  )

# Assert all the divisions match
testthat::expect_equal(
  divs |> filter(!warddiv %in% unique(candidate_totals$warddiv)) |> nrow(),
  0
)
testthat::expect_equal(
  candidate_totals |> filter(!warddiv %in% divs$warddiv) |> nrow(),
  0
)

In the 2022 General Election for US Senate, Fetterman won 84% of Philadelphia’s two-way votes, to Oz’s 16% (two-way votes percentages are calculated ignoring votes for other candidates). For Governor, Shapiro won 87% of Philadelphia’s two-way votes to Mastriano’s 13%. Below is a cross-tabulation of the voter-level results.

View code
twoway_offices <- c(
  "GOVERNOR AND LIEUTENANT GOVERNOR (8687)" = "Governor",
  "UNITED STATES SENATOR (8682)" = "Senate"  
)


twoway_cands <- c(
  "MEHMET OZ REP (102)" = "Oz",
  "DOUGLAS V MASTRIANO AND CARRIE LEWIS DELROSSO REP (107)" = "Mastriano",
  "JOHN FETTERMAN DEM (101)" = "Fetterman",
  "JOSH SHAPIRO AND AUSTIN DAVIS DEM (106)" = "Shapiro"
)


candidate_pairs <- candidate_pairs |> 
  mutate(
    office.x = twoway_offices[office.x],
    office.y = twoway_offices[office.y],
    candidate.x = replace_na(twoway_cands[candidate.x], "Other Candidate"),
    candidate.y = replace_na(twoway_cands[candidate.y], "Other Candidate")
  ) |>
  filter(!is.na(office.x), !is.na(office.y)) |>
  group_by(warddiv, vote_type, office.x, office.y, candidate.x, candidate.y) |>
  summarise(pair_count = sum(pair_count), .groups="drop")

candidate_totals <- candidate_totals |>
  mutate(
    office = twoway_offices[office],
    candidate = replace_na(twoway_cands[candidate], "Other Candidate")
  ) |>
  filter(!is.na(office)) |>
  group_by(warddiv, vote_type, office, candidate) |>
  summarise(candidate_total = sum(candidate_total), .groups="drop")
  
office_totals <- candidate_totals |>
  group_by(vote_type, warddiv, office) |>
  summarise(
    office_total = sum(candidate_total),
    .groups="drop"
  )

office_totals <- office_totals |>
  group_by(vote_type, warddiv) |>
  mutate(votes_both_offices = sum(office_total)) |>
  left_join(
    candidate_pairs |> group_by(vote_type, warddiv) |>
      summarise(total_pairs = sum(pair_count))
  ) |>
  mutate(
    denominator = votes_both_offices - total_pairs
  )

df_topline <- 
  expand.grid(
    warddiv = unique(candidate_pairs$warddiv),
    vote_type = unique(candidate_pairs$vote_type),
    office.x = unique(candidate_pairs$office.x),
    office.y = unique(candidate_pairs$office.y),
    candidate.x = unique(candidate_pairs$candidate.x),
    candidate.y = unique(candidate_pairs$candidate.y)
  ) |>
  left_join(candidate_pairs) |> 
  left_join(
    candidate_totals |> 
      rename(candidate_total.x = candidate_total) |>
      filter(office == "Governor"),
    by = c(
      "candidate.x"="candidate", 
      "warddiv"="warddiv", 
      "office.x"="office",
      "vote_type"="vote_type")
  ) |>
  left_join(
    candidate_totals |> 
      rename(candidate_total.y = candidate_total) |>
      filter(office == "Senate"),
    by = c(
      "candidate.y"="candidate", 
      "warddiv"="warddiv", 
      "office.y"="office",
      "vote_type"="vote_type")
  ) |>
  left_join(
    office_totals |> 
      filter(office == "Senate") |>
      group_by(vote_type, warddiv) |>
      summarise(denominator= sum(denominator, na.rm=TRUE))
  )

df_topline_total <- df_topline |>
  group_by(candidate.x, candidate.y) |>
  summarise(across(
    c(pair_count, candidate_total.x, candidate_total.y, denominator),
    function(x) sum(x, na.rm=TRUE)
  ))

# write.csv(df_topline_total, file="sen_gov_crosstabs.csv")

df_topline_type <- df_topline |>
  group_by(vote_type, candidate.x, candidate.y) |>
  summarise(across(
    c(pair_count, candidate_total.x, candidate_total.y, denominator),
    function(x) sum(x, na.rm=TRUE)
  ))

# write.csv(df_topline_type, file="sen_gov_crosstabs_type.csv")


# sen_div <- df_topline_twoway |> 
#   group_by(warddiv, candidate.y) |> 
#   rename(candidate = candidate.y) |> 
#   summarise(votes = sum(votes))
# 
# candidate_totals |> filter(office %in% sen_str) |>
#   mutate(candidate = replace_na(twoway_cands[candidate], "Other")) |>
#   group_by(candidate, warddiv) |>
#   summarise(votes_tot = sum(candidate_total)) |>
#   left_join(sen_div) |>
#   filter(votes_tot != votes)
# 
# candidate_totals |> filter(warddiv == "01-03", candidate == names(twoway_cands)[2])
# candidate_pairs |> filter(candidate.y == names(twoway_cands)[3], warddiv == "01-01", office.x == gov_str)

Notice from the table that there are more Shapiro-Oz than Mastriano-Fetterman voters. This suggests an ordering of candidate combinations from most Republican to most Democratic: Mastriano-Oz, Shapiro-Oz, Shapiro-Fetterman. (It’s not clear how other candidates or no votes fit into this ordering.)

There are obvious differences between Mail-in and Election Day voters, as well. Mail-In voters are more Democratic, with 89% voting for both Fetterman and Shapiro vs 76% on Election Day. Even among Oz voters, Mail-In Oz voters voted more for Shapiro than Election Day Oz voters, 32% vs 15%.

These rates of cross-party voting differed by Division as well. Below, we divide voters by their vote for Governor, and measure how they voted for Senate. These rates themselves differ by Division partisanship, which we estimate using overall percentage for Mastriano. This heterogeneity is the core of the Ecological Inference problem; Mastriano voters are themselves different in Mastriano Divisions and Shapiro Divisions (as are Shapiro voters).

View code
div_res <- readRDS("../../data/processed_data/df_major_type_20230116.Rds")
div_res <- div_res |> filter(year == 2022, election_type == "general")

div_gov_sen <- div_res |>
  filter(office %in% c("GOVERNOR", "UNITED STATES SENATOR")) |>
  mutate(candidate = case_when(
    grepl("FETTERMAN", candidate) ~ "Fetterman",
    grepl("MEHMET OZ", candidate) ~ "Oz",
    grepl("MASTRIANO", candidate) ~ "Mastriano",
    grepl("SHAPIRO", candidate) ~ "Shapiro",
    TRUE ~ paste(office, "- Other Candidate")
  )) |>
  group_by(office, candidate, warddiv) |>
  summarise(votes = sum(votes), .groups="drop") |>
  group_by(office, warddiv) |>
  mutate(
    total_votes = sum(votes),
    pvote = votes / total_votes
  ) |>
  ungroup() |>
  select(warddiv, candidate, votes, pvote, total_votes) |>
  pivot_wider(
    names_from = candidate,
    values_from = c(votes, pvote, total_votes)
  )



div_gov_sen <- div_gov_sen |>
  mutate(
    pvote_rep_avg = (pvote_Mastriano + pvote_Oz) / 2,
    pvote_rep_gov = pvote_Mastriano,
    pvote_rep_sen = pvote_Oz,
    rep_group = cut(pvote_rep_avg, breaks=c(-0.001, seq(0.1,1,0.1)))
  )
levels(div_gov_sen$rep_group)[1] <- "[0.0,0.1]"

testthat::expect_true(
  df_topline |> filter(is.na(denominator)) |> with(all(vote_type=="Provisional"))
)

df_topline_divcat <- df_topline |>
  group_by(warddiv, office.x, office.y, candidate.x, candidate.y) |>
  summarise(
    across(pair_count:denominator, function(x) sum(x, na.rm=TRUE)),
    .groups="drop"
  ) |>
  left_join(
    as.data.frame(div_gov_sen) |> 
      select(warddiv, rep_group, pvote_rep_avg, pvote_rep_gov, pvote_rep_sen)
  ) |>
  filter(!is.na(denominator)) |>
  group_by(
    rep_group, office.x, office.y, candidate.x, candidate.y
  ) |>
  summarise(
    pvote_rep_avg = weighted.mean(pvote_rep_avg, w=denominator, na.rm=TRUE),
    pvote_rep_gov = weighted.mean(pvote_rep_gov, w=denominator, na.rm=TRUE),
    pvote_rep_sen = weighted.mean(pvote_rep_sen, w=denominator, na.rm=TRUE),
    n_divs = length(unique(warddiv)),
    across(
      c(pair_count, candidate_total.x, candidate_total.y, denominator),
      function(x) sum(x, na.rm=TRUE)
    )
  )


df_topline_divcat |> 
  group_by(rep_group, candidate.x) |>
  mutate(
    prop_among_voters = pair_count / candidate_total.x
  ) |>
  ungroup() |>
  filter(candidate.y %in% c("Fetterman", "Oz")) |>
  select(rep_group, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov) |>
  bind_rows(
    df_topline_divcat |>
      ungroup() |>
      #arbitrary row
      filter(candidate.x == "Mastriano", candidate.y %in% c("Fetterman", "Oz")) |>
      mutate(
        candidate.x = "All",
        prop_among_voters = candidate_total.y / denominator
      ) |>
      select(rep_group, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov)
  ) |>
  mutate(
    candidate.x = case_when(
      candidate.x %in% c("Mastriano", "Shapiro", "All") ~ paste(candidate.x, "voters"),
      candidate.x == "Other Candidate" ~ "Other voters"
    ),
    candidate.x = factor(candidate.x, levels=sort(unique(candidate.x))[c(1,2,4,3)])
  ) |>
  ggplot(
    aes(x=100*pvote_rep_gov, y=100*prop_among_voters, color=candidate.x)
  ) +
    geom_line() +
    geom_point(size=2) +
    ggthemes::theme_clean() +
    # scale_size_area() +
    scale_color_manual(
      values = c(
        "Mastriano voters" = strong_red,
        "Shapiro voters" = strong_blue,
        "Other voters" = strong_orange,
        "All voters" = "black"
      )
    ) +
  facet_grid(~candidate.y) +
  expand_limits(x=0, y=c(0,100))+
  labs(
    title = "Senate results based on individuals' Governor vote",
    subtitle = "By Division partisanship",
    size = "Total Voters",
    color = NULL,
    linetype = NULL,
    x = "Divisions' percent vote for Mastriano",
    y = "Percent of votes received by Senate candidate"
  )

A naive assumption would be that Shapiro voters would vote for Fetterman at the same rate in all Divisions; the only difference between Republican and Democratic Divisions is that there are different proportions of them. But Shapiro voters were actually quite different in Republican Divisions than in Democratic Divisions. In the most Democratic Divisions (where Mastriano received 4% of the vote), 97% of Shapiro voters also voted for Fetterman. But in Philadelphia’s most Republican Divisions (where Mastriano received 59% of the vote), only 76% of Shapiro voters also voted for Fetterman. In retrospect this is obvious, but naive ecological inference assumes this isn’t the case.

Voters for “Other” candidates for Governor also show strong correlation with their overall Division partisanship: those voters were more likely to vote for Fetterman in Democratic Divisions, and for Oz in Republican Divisions.

Consider these same results by vote type.

View code
df_topline_divcat_type <- df_topline |>
  group_by(warddiv, vote_type, office.x, office.y, candidate.x, candidate.y) |>
  summarise(
    across(pair_count:denominator, function(x) sum(x, na.rm=TRUE)),
    .groups="drop"
  ) |>
  left_join(
    as.data.frame(div_gov_sen) |> 
      select(warddiv, rep_group, pvote_rep_avg, pvote_rep_gov, pvote_rep_sen)
  ) |>
  filter(!is.na(denominator)) |>
  group_by(
    rep_group, vote_type, office.x, office.y, candidate.x, candidate.y
  ) |>
  summarise(
    pvote_rep_avg = weighted.mean(pvote_rep_avg, w=denominator, na.rm=TRUE),
    pvote_rep_gov = weighted.mean(pvote_rep_gov, w=denominator, na.rm=TRUE),
    pvote_rep_sen = weighted.mean(pvote_rep_sen, w=denominator, na.rm=TRUE),
    n_divs = length(unique(warddiv)),
    across(
      c(pair_count, candidate_total.x, candidate_total.y, denominator),
      function(x) sum(x, na.rm=TRUE)
    )
  )


df_topline_divcat_type |> 
  group_by(rep_group, vote_type, candidate.x) |>
  mutate(
    prop_among_voters = pair_count / candidate_total.x
  ) |>
  ungroup() |>
  filter(candidate.y %in% c("Fetterman", "Oz")) |>
  select(rep_group, vote_type, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov) |>
  bind_rows(
    df_topline_divcat_type |>
      ungroup() |>
      #arbitrary row
      filter(candidate.x == "Mastriano", candidate.y %in% c("Fetterman", "Oz")) |>
      group_by(vote_type) |>
      mutate(
        candidate.x = "All",
        prop_among_voters = candidate_total.y / denominator
      ) |>
      select(rep_group, vote_type, candidate.x, candidate.y, prop_among_voters, pvote_rep_gov)
  ) |>
  mutate(
    candidate.x = case_when(
      candidate.x %in% c("Mastriano", "Shapiro", "All") ~ paste(candidate.x, "voters"),
      candidate.x == "Other Candidate" ~ "Other voters"
    ),
    candidate.x = factor(candidate.x, levels=sort(unique(candidate.x))[c(1,2,4,3)])
  ) |>
  filter(vote_type %in% c("Mail Votes", "Election Day")) |>
  ggplot(
    aes(x=100*pvote_rep_gov, y=100*prop_among_voters, color=candidate.x)
  ) +
    geom_line() +
    geom_point(size=2) +
    ggthemes::theme_clean()  +
    # scale_size_area() +
    scale_color_manual(
      values = c(
        "Mastriano voters" = strong_red,
        "Shapiro voters" = strong_blue,
        "Other voters" = strong_orange,
        "All voters" = "black"
      )
    ) +
  facet_grid(vote_type~candidate.y) +
  expand_limits(x=0, y=c(0,100))+
  labs(
    title = "Senate results based on individuals' Governor vote",
    subtitle = "By Division partisanship and vote type",
    size = "Total Voters",
    color = NULL,
    linetype = NULL,
    x = "Divisions' percent vote for Mastriano",
    y = "Percent of votes received by Senate candidate"
  )

Mail-in voters showed more consistency across Divisions. Shapiro voters also voted for Fetterman, even within Republican Divisions: in the most Republican Divisions, 87% of mail-in Shapiro voters voted for Fetterman, but only 71% of Election Day Shapiro voters did.

Metrics for Ecological Inference

Having seen the heterogeneity in voters by Division, what do those differences mean for Division-level analyses of candidate correlations?

At the Division level, votes for Mastriano are strongly correlated with votes for Oz, and votes for Shapiro strongly correlated with votes for Fetterman.

View code
# stats::cov.wt(
#   x = div_gov_sen |> select(starts_with("pvote_")),
#   w = div_gov_sen |> with(total_votes_Oz + total_votes_Mastriano),
#   cor = TRUE
# )$cor
# 
# stats::cov.wt(
#   x = div_gov_sen |> select(starts_with("pvote_")),
#   w = div_gov_sen |> with(total_votes_Oz + total_votes_Mastriano)
# )$cov
# 
# 
# lm(
#   pvote_Oz ~ pvote_Mastriano, data = div_gov_sen
# )
# lm(
#   pvote_Fetterman ~ pvote_Shapiro, data = div_gov_sen
# )

div_gov_sen |>
    mutate(total_votes = (total_votes_Mastriano + total_votes_Oz)/2) |>
    select(
      warddiv, total_votes, pvote_Mastriano, pvote_Shapiro, pvote_Fetterman, pvote_Oz
    ) |>
    pivot_longer(
      cols=c(pvote_Mastriano, pvote_Shapiro, pvote_Fetterman, pvote_Oz),
      names_to="candidate",
      values_to="pvote"
    ) |>
  mutate(
    office = ifelse(grepl("Mastriano|Shapiro", candidate), "Governor", "Senator"),
    party = ifelse(grepl("Mastriano|Oz", candidate), "Republican Candidates", "Democratic Candidates")
  ) |>
  pivot_wider(
    id_cols = c(warddiv, party, total_votes),
    names_from = c(office),
    values_from = c(pvote)
  ) |>
ggplot(
  aes(x=100*Governor, y=100*Senator)
) +
  geom_point(aes(size=total_votes), alpha = 0.2) +
  scale_size_area() +
  facet_grid(~party) +
  theme_minimal() +
  geom_abline(slope=1, intercept=0) +
  coord_fixed() +
  labs(
    title="Division-level Governor and Senator results are highly correlated",
    x = "Percent for Governor",
    y = "Percent for Senate",
    size="Total Votes"
  )

How do summary metrics compare between Division- and individual levels?

View code
corr_from_counts <- function(
    a_and_b, a_only, b_only, total, 
    method = c("corr", "cov", "slope_a", "slope_b")
){
  method = match.arg(method)
  prop_a <- (a_and_b + a_only) / total
  prop_b <- (a_and_b + b_only) / total
  
  cov <- a_and_b / total - prop_a * prop_b 
  if(method == "cov") return(cov)
  
  sd_a <- sqrt(prop_a * (1-prop_a))
  sd_b <- sqrt(prop_b * (1-prop_b))
  corr <- cov / (sd_a * sd_b)
  
  if(method == "corr"){
    return(corr)
  } else if(method == "slope_a"){
    return(corr * sd_b / sd_a)
  } else if(method == "slope_b"){
    return(corr * sd_a / sd_b)
  } else {
      stop(paste("bad slope_denom:", slope_denom))
  }
}

get_corr_for_candidates <- function(cand_x, cand_y, df_topline_total, ...){
  a_and_b <- df_topline_total |> 
    filter(candidate.x==cand_x, candidate.y == cand_y) |> 
    with(pair_count)
  a_total <- df_topline_total |> 
    filter(candidate.x==cand_x) |> 
    with(candidate_total.x[1])
  b_total <- df_topline_total |> 
    filter(candidate.y==cand_y) |> 
    with(candidate_total.y[1])
  total <- df_topline_total$denominator[1]
  
  corr_from_counts(
    a_and_b,
    a_total - a_and_b,
    b_total - a_and_b,
    total,
    ...
  )
}

overall_corrs <- df_topline_total |> 
  rowwise() |>
  mutate(
    corr = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator
    ),
    slope = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "slope_a"
    ),
    cov = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "cov"
    )
  ) 

# write.csv(overall_corrs, file="overall_corrs.csv")

# cov.wt(
#   x = div_gov_sen |> select(pvote_Mastriano, pvote_Oz, pvote_Fetterman, pvote_Shapiro),
#   wt = (div_gov_sen$total_votes_Mastriano + div_gov_sen$total_votes_Oz)/2
# )$cov

type_corrs <- df_topline_type |>
  rowwise() |>
  mutate(
    corr = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator
    ),
    slope = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "slope_a"
    ),
    cov = corr_from_counts(
      pair_count, 
      candidate_total.x - pair_count, 
      candidate_total.y - pair_count, 
      denominator,
      method = "cov"
    )
  )

# write.csv(type_corrs, file = "type_corrs.csv")
  
div_gov_sen_type <- div_res |>
  rename(vote_type = type) |>
  filter(office %in% c("GOVERNOR", "UNITED STATES SENATOR")) |>
  mutate(candidate = case_when(
    grepl("FETTERMAN", candidate) ~ "Fetterman",
    grepl("MEHMET OZ", candidate) ~ "Oz",
    grepl("MASTRIANO", candidate) ~ "Mastriano",
    grepl("SHAPIRO", candidate) ~ "Shapiro",
    TRUE ~ paste(office, "- Other Candidate")
  )) |>
  group_by(office, candidate, warddiv, vote_type) |>
  summarise(votes = sum(votes), .groups="drop") |>
  group_by(office, warddiv, vote_type) |>
  mutate(
    total_votes = sum(votes),
    pvote = votes / total_votes
  ) |>
  ungroup() |>
  select(vote_type, warddiv, candidate, votes, pvote, total_votes) |>
  pivot_wider(
    names_from = candidate,
    values_from = c(votes, pvote, total_votes)
  )

cov_types <- div_gov_sen_type |>
    filter(vote_type %in% c("In Person", "Mail")) |>
    group_by(vote_type) |>
    nest() |>
  mutate(
    corr = map(
      .x=data,
      ~stats::cov.wt(
        x = .x |> select(starts_with("pvote_")),
        w = .x |> with(total_votes_Oz + total_votes_Mastriano),
        cor = TRUE
      )$cor
    ),
    cov = map(
      .x=data,
      ~stats::cov.wt(
        x = .x |> select(starts_with("pvote_")),
        w = .x |> with(total_votes_Oz + total_votes_Mastriano),
      )$cov |> as.data.frame()
    ),
    slope = map(
      .x=data,
      ~coef(lm(pvote_Oz ~ pvote_Mastriano, data = .x, weights = (total_votes_Oz + total_votes_Mastriano)))
    )
  )

# cov_types |> select(vote_type, cov) |> unnest(cols=c(cov)) |>
#   write.csv("div_corr_type.csv")

At the individual voter level, voting for Mastriano has a 0.85 correlation with voting for Oz; voting for Shapiro a 0.79 correlation with voting for Fetterman. Translating to regression coefficients, people who voted for Mastriano were 91pp more likely to vote for Oz than those who didn’t, people who voted for Shapiro were 84pp more likely to vote for Fetterman than those who didn’t.

The correlation measured in Division results were much higher. A Division’s proportion for Mastriano and proportion for Oz has a near-perfect 0.994 correlation, and for Shapiro and for Fetterman a similar 0.993. The regression coefficients are also larger: a Division that voted 1pp more for Mastriano voted 1.15 pp more for Oz, and a Division that voted 1pp more for Shapiro voted 1.16 pp more for Fetterman. (Regression coefficients being greater than 1 is obviously impossible at the person level.)

The fact that Division-level correlations are higher than individual provides a warning for Ecological inference.

The total person-level covariance of Mastriano and Oz votes is 0.103, of Shapiro and Fetterman votes 0.111. The covariance of the candidates’ averages across Divisions is 0.023 for Mastriano-Oz, and 0.024 for Shapiro-Fetterman. This means that the total covariance within divisions is 0.080 and 0.087, respectively. (How do we reconcile that most of the person-level covariance is not explained by Division, yet the Division correlation is much higher? It’s because the standard deviation of Division averages–the denominator of the correlation–is also much smaller.)

Bayesian Model

Let’s pretend that we didn’t have the person-level results, and try to reconstruct them from the division-level results using a model inspired by King, Rosen and Tanner (1999). Consider the two-by-two contingency tables above, with votes for Governor by row and Senate by column. Use subscripts \(o,f,m,s,oth\) to mean Oz, Fetterman, Mastriano, and Shapiro, and Other, respectively. We’ll model the Senate votes conditioned on the total Gubernatorial votes \(G_{d,m}, G_{d,s}, G_{d, oth}\). Write the proportion of votes for senate among voters for a Governor candidate in division \(d\) as e.g. \(p_{d,o,m}\). Then the votes for Senate are generated as:

\[\begin{align*} \mu_{d, o} =& p_{d,o,m} G_{d, m} + p_{d,o,s} G_{d, s} + p_{d,o,oth} G_{d, oth} \\ var_{d, o} =& p_{d,o,m}(1-p_{d,o,m}) G_{d, m} + p_{d,o,s} (1-p_{d,o,s}) G_{d, s} + p_{d,o,oth}(1-p_{d,o,oth}) G_{d, oth} \\ S_{d, o} \sim& N(\text{mean}=G_d p_{d, o}, \text{var}=var_{d, o}) \end{align*}\]

But we also want each \(p\) to vary by the partisanship of the Division. So we additionally model

\[\begin{align*} \text{for(gov in (m, s, oth)):}& \\ &p_{d,o,gov} \propto \text{exp}(\alpha_{o,gov} + \beta_{o,gov} p_{d,m}) \\ &p_{d,s,gov} \propto \text{exp}(\alpha_{s,gov} + \beta_{s,gov} p_{d,m}) \\ &p_{d,oth,gov} \propto \text{exp}(\alpha_{oth,gov} + \beta_{oth,gov} p_{d,m}) \end{align*}\]

where we model the senate vote choice among the voters for each governor as a multinomial regression.

We fit this data in numpyro (see the numpyro notebook for the code). The model produces estimates of the full contingency table (Table 1) for each Division. Below we recreate the Division plot from above using the simulations.

View code
sims <- read.csv("sen_gov_sim.csv")
# sims |> 
#   group_by(sen, gov) |>
#   summarise(votes = sum(sim), .groups="drop") |>
#   mutate(total=sum(votes)) |>
#   pivot_wider(
#     id_cols = c(total),
#     names_from = c(sen, gov),
#     values_from=votes
#   ) |>
#   with(
#       corr_from_counts(
#         a_and_b = Oz_Mastriano,
#         a_only = Oz_Shapiro + Oz_Other,
#         b_only = Fetterman_Mastriano + Other_Mastriano,
#         total = total
#       )
#   )

sims |>
  left_join(div_gov_sen |> select(warddiv, rep_group)) |>
  group_by(rep_group, gov, sen) |>
  summarise(votes=sum(sim)) |>
  group_by(gov, rep_group) |>
  mutate(
    total_votes_gov = sum(votes),
    prop_among_voters = votes / total_votes_gov
  ) |>
  group_by(rep_group) |>
  mutate(pvote_rep_gov = sum(votes[gov == "Mastriano"]) / sum(votes)) |>
  ungroup() |>
  mutate(
    gov = case_when(
      gov %in% c("Mastriano", "Shapiro", "All") ~ paste(gov, "voters"),
      gov == "Other" ~ "Other voters"
    ),
    gov = factor(gov, levels=sort(unique(gov))[c(1,2,4,3)])
  ) |>
  filter(sen != "Other") |>
  ggplot(
    aes(x=100*pvote_rep_gov, y=100*prop_among_voters, color=gov)
  ) +
    geom_line() +
    geom_point(size=2) +
    ggthemes::theme_clean() +
    # scale_size_area() +
    scale_color_manual(
      values = c(
        "Mastriano voters" = strong_red,
        "Shapiro voters" = strong_blue,
        "Other voters" = strong_orange,
        "All voters" = "black"
      )
    ) +
  facet_grid(~sen) +
  expand_limits(x=0, y=c(0,100))+
  labs(
    title = "(Bad) Simulated results using Ecological Inference",
    subtitle = "By Division partisanship",
    size = "Total Voters",
    color = NULL,
    linetype = NULL,
    x = "Divisions' percent vote for Mastriano",
    y = "Percent of votes received by Senate candidate"
  )

The model overestimated the heterogeneity of voters. Specifically, it believes that in more Republican Divisions, Shapiro voters are more likely than Mastriano voters to vote for Oz; it has the slopes of Mastriano voters with the entirely wrong sign (we know the true sign from the individual-level data). This illustrates the difficulty of ecological analysis: when Oz does better, the model assumes it must be because he did better among the more numerous Shapiro voters. Its only leverage for identifying the slope is the relative variances among divisions; that’s a very indirect identification strategy. Knowing the real individual patterns from above, we could obviously tweak the model to recreate the true results by adding strong priors. But in general we don’t know the actual individual patterns, so wouldn’t know whether the model was broken or how to fix it.

Conclusion

In sum, Division measurements clearly overstate voter-level correlations, though in this election both were extremely high. We hope to do more work to understand how we might infer voter-level statistics in general. And with May 16 approaching, we will be able to recreate this analysis on a heavily-contested primary election.