Tutorials

Learn how to use this data in your own work.

In these tutorials, learn how to analyze the traffic stop data and apply key statistical tests to measure racial disparities and possible bias. The video and accompanying R code can help you get started working with the data in our repository.

View tutorial on GitHub

Load libraries and data

We use Connecticut as an example: the dataset is small and quick to load, but fairly complete, with a rich set of fields to analyze.

library(tidyverse)
library(lubridate)

d = read_csv('CT-clean.csv', col_types = list(stop_time = 'c', officer_id = 'c'))

Data exploration

Now let’s apply the same filters we used in the analysis. We analyze stops of white, black, and Hispanic drivers between 2011 and 2015.

d = filter(d, 
           driver_race %in% c('White', 'Black', 'Hispanic'), 
           year(stop_date) >= 2011, 
           year(stop_date) <= 2015)

After filtering, we lose about 8,000 rows. Now let’s compute some basic statistics broken down by the race of the driver.

summary_stats <- function(search_conducted, contraband_found) {
  n_stops     = length(search_conducted)
  n_searches  = sum(search_conducted)
  n_hits      = sum(contraband_found)
  search_rate = n_searches / n_stops
  hit_rate    = n_hits / n_searches
  return(data.frame(n_stops, n_searches, n_hits, search_rate, hit_rate))
}

basic_summary_statistics_by_race = d %>% 
  group_by(driver_race) %>% 
  do(summary_stats(.$search_conducted, .$contraband_found))
basic_summary_statistics_by_race
## Source: local data frame [3 x 6]
## Groups: driver_race [3]
## 
##   driver_race n_stops n_searches n_hits search_rate  hit_rate
##         <chr>   <int>      <int>  <int>       <dbl>     <dbl>
## 1       Black   37463       1219    346  0.03253877 0.2838392
## 2    Hispanic   31157        966    282  0.03100427 0.2919255
## 3       White  242349       3108   1179  0.01282448 0.3793436

The first thing we notice is that black and Hispanic drivers are much more likely to be searched: their search rates are roughly 2–3x higher than the search rate for white drivers. This pattern appears across states. On its own, this disparity does not prove the police are being discriminatory (perhaps white drivers are less likely to carry contraband), but it is still worth noting.
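As a quick check, we can compute those disparity ratios directly, plugging in the search rates printed in the summary table above:

```r
# Search rates copied from the summary table above
search_rate_white    = 0.01282448
search_rate_black    = 0.03253877
search_rate_hispanic = 0.03100427

# Ratio of each minority group's search rate to the white search rate
search_rate_black / search_rate_white     # ~2.5
search_rate_hispanic / search_rate_white  # ~2.4
```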

We also note that the hit rate is higher for white drivers than for black and Hispanic drivers. This is the classic “outcome test”: if searches of minority drivers are less likely to be successful, it may indicate that minority drivers are searched when they are less likely to be carrying contraband, suggesting a discriminatory search standard. In general, the combination of higher search rates and lower hit rates for minority drivers suggests that minority drivers are being searched on the basis of less evidence.
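As a quick sanity check (not part of the original analysis), we can verify that the hit-rate gap is unlikely to be noise with a standard two-proportion test, using the counts printed in the summary table above:

```r
# Hits and searches for black vs. white drivers, from the summary table
hits     = c(346, 1179)   # black, white
searches = c(1219, 3108)

# Two-sample test for equality of proportions
res = prop.test(hits, searches)
res$p.value  # far below 0.05: the hit-rate gap is statistically significant
```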

Incidentally, race isn’t the only variable you could stratify by. We can easily break down the data by gender or other categories too. Far fewer female drivers are stopped, and they’re less likely to be searched, but have comparable hit rates.

basic_summary_statistics_by_gender = d %>% 
  group_by(driver_gender) %>% 
  do(summary_stats(.$search_conducted, .$contraband_found))
basic_summary_statistics_by_gender
## Source: local data frame [2 x 6]
## Groups: driver_gender [2]
## 
##   driver_gender n_stops n_searches n_hits search_rate  hit_rate
##           <chr>   <int>      <int>  <int>       <dbl>     <dbl>
## 1             F  104696        821    275 0.007841751 0.3349574
## 2             M  206273       4472   1532 0.021680007 0.3425760

We can also break down the data by both race and location. It’s important to do this because search rates, hit rates, or racial composition could vary by location for legitimate reasons. So we want to see whether search and hit rates differ by race — even when we control for location.

basic_summary_statistics_by_race_and_county = d %>% 
  filter(!is.na(county_name)) %>%
  group_by(driver_race, county_name) %>%
  do(summary_stats(.$search_conducted, .$contraband_found))

Let’s make a scatter plot to compare search rates and hit rates for minority and white drivers within the same county or patrol district.

data_for_plot <- basic_summary_statistics_by_race_and_county %>%
  filter(driver_race == 'White') %>% 
  right_join(basic_summary_statistics_by_race_and_county %>% filter(driver_race != 'White'), by='county_name')

# plot search rates. 
max_val = max(basic_summary_statistics_by_race_and_county$search_rate) * 1.05
search_plot = ggplot(data_for_plot) + 
  # specify data we want to plot
  geom_point(aes(x = search_rate.x, y = search_rate.y, size = n_stops.y)) + 
  # make one subplot for each minority race group
  facet_grid(.~driver_race.y) + 
  # add a diagonal line to indicate parity
  geom_abline(slope = 1, intercept = 0, linetype='dashed') +   
  scale_x_continuous('White search rate', limits=c(0, max_val), labels = scales::percent, expand=c(0,0)) + 
  scale_y_continuous('Minority search rate', limits=c(0, max_val), labels = scales::percent, expand=c(0,0)) + 
  theme_bw(base_size=15) + 
  theme(legend.position="none") + 
  scale_size_area(max_size=5)
search_plot

search rate plot

Points are all above the diagonal line, indicating search rates are higher for minorities within the same county or district.

max_val = max(basic_summary_statistics_by_race_and_county$hit_rate) * 1.05
hit_plot = ggplot(data_for_plot) + 
  geom_point(aes(x = hit_rate.x, y = hit_rate.y, size = n_stops.y)) + 
  facet_grid(.~driver_race.y) + 
  geom_abline(slope = 1, intercept = 0) + 
  scale_x_continuous('White hit rate', limits=c(0, max_val), labels = scales::percent, expand=c(0,0)) + 
  scale_y_continuous('Minority hit rate', limits=c(0, max_val), labels = scales::percent, expand=c(0,0)) + 
  theme_bw(base_size=15) + 
  theme(legend.position="none") + 
  scale_size_area(max_size=5)
hit_plot

hit rate plot

Points are generally below the diagonal line, indicating hit rates are lower for minorities within the same county or district.

The outcome test isn’t perfect due to the problem of “infra-marginality”. We could observe different hit rates for different race groups even if there’s no discrimination. Imagine there are two types of white drivers: those with a 1% chance of carrying contraband, and those with a 75% chance. Assume black drivers have either a 1% or a 50% chance of carrying contraband. Even if the police apply the same threshold, hit rates will be different. You can read more about this problem, and solutions to it, in The Problem of Infra-marginality in Outcome Tests for Discrimination.
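To make the infra-marginality problem concrete, here is a small worked example using the numbers above. The 10% search threshold is an assumption chosen for illustration; what matters is that it is identical for both groups:

```r
# Each group has two driver types, with these probabilities of carrying contraband
white_types = c(low = 0.01, high = 0.75)
black_types = c(low = 0.01, high = 0.50)

# Officers search any driver whose probability exceeds the same 10% threshold
threshold = 0.10

# Only the high-probability type is searched, so the hit rate among searched
# drivers equals that type's contraband probability
white_hit_rate = unname(white_types[white_types > threshold])  # 0.75
black_hit_rate = unname(black_types[black_types > threshold])  # 0.50
```

Even though the police apply exactly the same threshold to both groups, the observed hit rates differ (75% vs. 50%), which is why a hit-rate gap alone is not conclusive evidence of discrimination.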

You can also perform regressions. For example, we can model search rates while controlling for race, age, gender, and location.

summary_stats_for_regression = d %>% 
  mutate(driver_age_category = cut(driver_age, 
                                   c(15, 19, 29, 39, 49, 100), 
                                   labels = c('16-19', '20-29', '30-39', '40-49', '50+')),
         driver_race = factor(driver_race, levels = c('White', 'Black', 'Hispanic'))) %>%
  group_by(driver_race, driver_age_category, driver_gender, county_name) %>% 
  do(summary_stats(.$search_conducted, .$contraband_found))
model = glm(cbind(n_searches, n_stops - n_searches) ~ 
              driver_race + driver_age_category + driver_gender + county_name, 
            data = summary_stats_for_regression, 
            family = binomial)
summary(model)
## 
## Call:
## glm(formula = cbind(n_searches, n_stops - n_searches) ~ driver_race + 
##     driver_age_category + driver_gender + county_name, family = binomial, 
##     data = summary_stats_for_regression)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3789  -0.9858  -0.2372   0.8305   3.1798  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -4.26089    0.06645 -64.120  < 2e-16 ***
## driver_raceBlack              0.83253    0.03583  23.235  < 2e-16 ***
## driver_raceHispanic           0.68305    0.03868  17.661  < 2e-16 ***
## driver_age_category20-29     -0.35959    0.04858  -7.403 1.33e-13 ***
## driver_age_category30-39     -0.85353    0.05319 -16.048  < 2e-16 ***
## driver_age_category40-49     -1.51798    0.06277 -24.184  < 2e-16 ***
## driver_age_category50+       -2.12448    0.07057 -30.103  < 2e-16 ***
## driver_genderM                1.02658    0.03841  26.724  < 2e-16 ***
## county_nameHartford County    0.15518    0.05078   3.056  0.00225 ** 
## county_nameLitchfield County  0.31639    0.05882   5.379 7.51e-08 ***
## county_nameMiddlesex County  -0.55763    0.06859  -8.130 4.29e-16 ***
## county_nameNew Haven County   0.09241    0.04942   1.870  0.06150 .  
## county_nameNew London County  0.17020    0.05173   3.290  0.00100 ** 
## county_nameTolland County    -0.07889    0.05526  -1.427  0.15344    
## county_nameWindham County     0.07303    0.06221   1.174  0.24046    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4360.01  on 239  degrees of freedom
## Residual deviance:  391.39  on 225  degrees of freedom
##   (45 observations deleted due to missingness)
## AIC: 1312.2
## 
## Number of Fisher Scoring iterations: 4
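The logistic-regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. Using the race coefficients printed above (in a live session, `exp(coef(model))` gives all of them at once):

```r
# Odds of being searched relative to white drivers, all else equal
exp(0.83253)  # black drivers: roughly 2.3x the odds
exp(0.68305)  # Hispanic drivers: roughly 2.0x the odds
```

So the search-rate disparity persists even after adjusting for age, gender, and county.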

So that’s how to perform basic analyses for a single state. Of course, there are many other things you could do. There are several columns we haven’t even looked at in this tutorial.

What if you want to scale up and analyze multiple states? Some states have much larger datasets than Connecticut and take longer to load, and loading every state at once requires substantial time and memory. We suggest working with aggregate data if you want to analyze all states.
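If you do decide to load several states, one approach is to read each cleaned file and stack the rows. This is a sketch, assuming the files follow the same `XX-clean.csv` naming pattern as the Connecticut file; the state list is illustrative:

```r
library(tidyverse)

# Read a set of per-state files and bind them into one data frame,
# tagging each row with its source file. Reading every column as
# character avoids type conflicts between states.
load_states = function(files) {
  files %>%
    set_names() %>%
    map_dfr(read_csv, col_types = cols(.default = 'c'), .id = 'state_file')
}

# e.g. d_all = load_states(c('CT-clean.csv', 'RI-clean.csv'))
```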