Learn how to use this data in your own work.

In these tutorials, learn how to analyze the traffic stop data and apply key statistical tests to measure racial disparities and possible bias. The video and accompanying R code can help you get started working with the data in our repository.

We use Connecticut as an example because its dataset is small and quick to load, yet fairly complete, with a rich set of fields to analyze.

```
library(tidyverse)
library(lubridate)
# read in the cleaned Connecticut data, forcing stop_time and officer_id to be read as character
d = read_csv('CT-clean.csv', col_types = list(stop_time = 'c', officer_id = 'c'))
```

Now let’s apply the same filters we used in the analysis. We analyze stops of white, black, and Hispanic drivers between 2011 and 2015.

```
d = filter(d,
           driver_race %in% c('White', 'Black', 'Hispanic'),
           year(stop_date) >= 2011,
           year(stop_date) <= 2015)
```

After filtering, we lose about 8,000 rows. Now let’s compute some basic statistics broken down by the race of the driver.

```
summary_stats <- function(search_conducted, contraband_found) {
  n_stops = length(search_conducted)
  n_searches = sum(search_conducted)
  n_hits = sum(contraband_found)
  search_rate = n_searches / n_stops
  hit_rate = n_hits / n_searches
  return(data.frame(n_stops, n_searches, n_hits, search_rate, hit_rate))
}
basic_summary_statistics_by_race = d %>%
  group_by(driver_race) %>%
  do(summary_stats(.$search_conducted, .$contraband_found))
basic_summary_statistics_by_race
```

```
## Source: local data frame [3 x 6]
## Groups: driver_race [3]
##
##   driver_race n_stops n_searches n_hits search_rate  hit_rate
##         <chr>   <int>      <int>  <int>       <dbl>     <dbl>
## 1       Black   37463       1219    346  0.03253877 0.2838392
## 2    Hispanic   31157        966    282  0.03100427 0.2919255
## 3       White  242349       3108   1179  0.01282448 0.3793436
```

The first thing we notice is that black and Hispanic drivers are much more likely to be searched: their search rates are roughly 2–3x the white search rate, a pattern that appears across states. On its own, this disparity does not prove the police are being discriminatory (perhaps white drivers are simply less likely to carry contraband), but it is still worth noting.

We also note that the hit rate is higher for white drivers than for black and Hispanic drivers. This is the classic “outcome test”: if searches of minority drivers are less likely to be successful, it may indicate that minority drivers are searched when they are less likely to be carrying contraband, suggesting discriminatory search standards. In general, the combination of higher search rates and lower hit rates for minority drivers suggests minority drivers are being searched on less evidence.
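To put a number on the size of the gap, we can compute each group’s search rate relative to white drivers. The rates below are hard-coded (rounded) from the summary table above so the snippet stands alone:

```
# Search rates taken from the summary table above (rounded, hard-coded for illustration)
search_rates = c(White = 0.0128, Black = 0.0325, Hispanic = 0.0310)

# Ratio of each group's search rate to the white search rate
round(search_rates / search_rates['White'], 1)
# Black drivers are searched at roughly 2.5x, and Hispanic drivers at roughly 2.4x,
# the rate of white drivers
```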

Incidentally, race isn’t the only variable you could stratify by. We can easily break down the data by gender or other categories too. Far fewer female drivers are stopped, and they’re less likely to be searched, but have comparable hit rates.

```
basic_summary_statistics_by_gender = d %>%
  group_by(driver_gender) %>%
  do(summary_stats(.$search_conducted, .$contraband_found))
basic_summary_statistics_by_gender
```

```
## Source: local data frame [2 x 6]
## Groups: driver_gender [2]
##
##   driver_gender n_stops n_searches n_hits search_rate  hit_rate
##           <chr>   <int>      <int>  <int>       <dbl>     <dbl>
## 1             F  104696        821    275 0.007841751 0.3349574
## 2             M  206273       4472   1532 0.021680007 0.3425760
```

We can also break down the data by both race and location. It’s important to do this because search rates, hit rates, or racial composition could vary by location for legitimate reasons. So we want to see whether search and hit rates differ by race — even when we control for location.

```
basic_summary_statistics_by_race_and_county = d %>%
  filter(!is.na(county_name)) %>%
  group_by(driver_race, county_name) %>%
  do(summary_stats(.$search_conducted, .$contraband_found))
```

Let’s make a scatter plot to compare search rates and hit rates for minority and white drivers within the same county or patrol district.

```
data_for_plot <- basic_summary_statistics_by_race_and_county %>%
  filter(driver_race == 'White') %>%
  right_join(basic_summary_statistics_by_race_and_county %>%
               filter(driver_race != 'White'),
             by = 'county_name')
# plot search rates
max_val = max(basic_summary_statistics_by_race_and_county$search_rate) * 1.05
search_plot = ggplot(data_for_plot) +
  # specify data we want to plot
  geom_point(aes(x = search_rate.x, y = search_rate.y, size = n_stops.y)) +
  # make one subplot for each minority race group
  facet_grid(. ~ driver_race.y) +
  # add a diagonal line to indicate parity
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  scale_x_continuous('White search rate', limits = c(0, max_val),
                     labels = scales::percent, expand = c(0, 0)) +
  scale_y_continuous('Minority search rate', limits = c(0, max_val),
                     labels = scales::percent, expand = c(0, 0)) +
  theme_bw(base_size = 15) +
  theme(legend.position = "none") +
  scale_size_area(max_size = 5)
search_plot
```

Points are all above the diagonal line, indicating search rates are higher for minorities within the same county or district.

```
max_val = max(basic_summary_statistics_by_race_and_county$hit_rate) * 1.05
hit_plot = ggplot(data_for_plot) +
  geom_point(aes(x = hit_rate.x, y = hit_rate.y, size = n_stops.y)) +
  facet_grid(. ~ driver_race.y) +
  geom_abline(slope = 1, intercept = 0) +
  scale_x_continuous('White hit rate', limits = c(0, max_val),
                     labels = scales::percent, expand = c(0, 0)) +
  scale_y_continuous('Minority hit rate', limits = c(0, max_val),
                     labels = scales::percent, expand = c(0, 0)) +
  theme_bw(base_size = 15) +
  theme(legend.position = "none") +
  scale_size_area(max_size = 5)
hit_plot
```

Points are generally below the diagonal line, indicating hit rates are lower for minorities within the same county or district.

The outcome test isn’t perfect, due to the problem of “infra-marginality”: we could observe different hit rates for different race groups even in the absence of discrimination. Imagine there are two types of white drivers, those with a 1% chance of carrying contraband and those with a 75% chance, and two types of black drivers, those with a 1% chance and those with a 50% chance. Even if the police apply the same threshold to both groups, say searching anyone with more than a 10% chance of carrying contraband, only the high-probability drivers will be searched, so the white hit rate will be 75% and the black hit rate 50%. You can read more about this problem, and solutions to it, in “The Problem of Infra-marginality in Outcome Tests for Discrimination”.
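The infra-marginality problem can be made concrete with a toy calculation using the hypothetical sub-groups above: even when officers search everyone above the same race-blind threshold, the expected hit rates differ by race.

```
# Hypothetical chances of carrying contraband for two sub-groups within each race
white_probs = c(0.01, 0.75)
black_probs = c(0.01, 0.50)

# Officers search any driver whose chance exceeds a single, race-blind threshold
threshold = 0.10

# Only the high-probability sub-group is searched, so the expected hit rate
# equals that sub-group's chance of carrying contraband
white_hit_rate = mean(white_probs[white_probs > threshold])  # 0.75
black_hit_rate = mean(black_probs[black_probs > threshold])  # 0.50
```

The same 10% threshold produces a 75% hit rate for white drivers and a 50% hit rate for black drivers, with no discrimination anywhere in the model.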

You can also perform regressions. For example, we can model search rates as a function of race, age, gender, and location.

```
summary_stats_for_regression = d %>%
  mutate(driver_age_category = cut(driver_age,
                                   c(15, 19, 29, 39, 49, 100),
                                   labels = c('16-19', '20-29', '30-39', '40-49', '50+')),
         driver_race = factor(driver_race, levels = c('White', 'Black', 'Hispanic'))) %>%
  group_by(driver_race, driver_age_category, driver_gender, county_name) %>%
  do(summary_stats(.$search_conducted, .$contraband_found))
model = glm(cbind(n_searches, n_stops - n_searches) ~
              driver_race + driver_age_category + driver_gender + county_name,
            data = summary_stats_for_regression, family = binomial)
summary(model)
```

```
##
## Call:
## glm(formula = cbind(n_searches, n_stops - n_searches) ~ driver_race +
##     driver_age_category + driver_gender + county_name, family = binomial,
##     data = summary_stats_for_regression)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.3789  -0.9858  -0.2372   0.8305   3.1798
##
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)
## (Intercept)                  -4.26089    0.06645 -64.120  < 2e-16 ***
## driver_raceBlack              0.83253    0.03583  23.235  < 2e-16 ***
## driver_raceHispanic           0.68305    0.03868  17.661  < 2e-16 ***
## driver_age_category20-29     -0.35959    0.04858  -7.403 1.33e-13 ***
## driver_age_category30-39     -0.85353    0.05319 -16.048  < 2e-16 ***
## driver_age_category40-49     -1.51798    0.06277 -24.184  < 2e-16 ***
## driver_age_category50+       -2.12448    0.07057 -30.103  < 2e-16 ***
## driver_genderM                1.02658    0.03841  26.724  < 2e-16 ***
## county_nameHartford County    0.15518    0.05078   3.056  0.00225 **
## county_nameLitchfield County  0.31639    0.05882   5.379 7.51e-08 ***
## county_nameMiddlesex County  -0.55763    0.06859  -8.130 4.29e-16 ***
## county_nameNew Haven County   0.09241    0.04942   1.870  0.06150 .
## county_nameNew London County  0.17020    0.05173   3.290  0.00100 **
## county_nameTolland County    -0.07889    0.05526  -1.427  0.15344
## county_nameWindham County     0.07303    0.06221   1.174  0.24046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4360.01 on 239 degrees of freedom
## Residual deviance: 391.39 on 225 degrees of freedom
## (45 observations deleted due to missingness)
## AIC: 1312.2
##
## Number of Fisher Scoring iterations: 4
```

So that’s how to perform basic analyses for a single state. Of course, there are many other things you could do. There are several columns we haven’t even looked at in this tutorial.

What if you want to scale up and analyze multiple states? You’ll find that larger states take longer to load than Connecticut, and loading every state at once takes longer still. If you want to analyze all states, we suggest working with the aggregate data.
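If you do decide to load several state files yourself, one approach is to read every column as character so that differing column types across states don’t break the row-bind, then convert types afterwards. This is a sketch, not the project’s own loading code, and the `-clean.csv` file-name pattern is an assumption; adjust it to match your local copies.

```
library(tidyverse)

# Find all cleaned state files in the working directory
# (the file-name pattern is an assumption; adjust for your setup)
state_files = list.files(pattern = '-clean\\.csv$')

# Read each file with every column as character to avoid type conflicts
# across states, then stack them into one data frame with a column
# recording which file each row came from
d_all = map_dfr(set_names(state_files),
                ~ read_csv(.x, col_types = cols(.default = 'c')),
                .id = 'source_file')
```

Once the states are combined, readr’s `type_convert()` can re-parse the character columns into sensible types.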