The Stanford Open Policing Project

In these tutorials, learn how to analyze the traffic stop data and apply key statistical tests to measure racial disparities and possible bias.

View tutorial on GitHub

Setup

First, let’s load the necessary libraries and data that will allow us to begin our investigation!

## Libraries to include
library(tidyverse)
library(lubridate)
# For Veil of Darkness
library(lutz)
library(suncalc)
library(splines)

## Load the data
# Replace the path below with the path to where your data lives
data_path <- "~/Desktop/pa_philadelphia_2019_02_25.rds"
stops <- read_rds(data_path)

# Additional data and fixed values we'll be using
population_2017 <- tibble(
  subject_race = c(
    "asian/pacific islander", "black", "hispanic", "other/unknown","white"
  ),
  num_people = c(110864, 648846, 221777, 39858, 548312)
) %>% 
  mutate(subject_race = as.factor(subject_race))

center_lat <- 39.9525839
center_lng <- -75.1652215

Covering the basics

The core of R is the dataframe. We’ve given you one to start with, in the form of stops. Think of dataframes like a spreadsheet: they have rows and columns. Usually, rows are a “datapoint”: in stops, each row corresponds to a single stop from Philadelphia. The columns are the “variables”: again, in stops, these are the things we know about the stop, like where the stop happened, the age of the stop subject, whether an arrest was made, and so on.

First, let’s take a look at what columns exist in our stops dataframe.

colnames(stops)

##  [1] "raw_row_number"   "date"             "time"            
##  [4] "location"         "lat"              "lng"             
##  [7] "district"         "service_area"     "subject_age"     
## [10] "subject_race"     "subject_sex"      "type"            
## [13] "arrest_made"      "outcome"          "contraband_found"
## [16] "frisk_performed"  "search_conducted" "search_person"   
## [19] "search_vehicle"

How many stops do we have in our dataset? (Hint: Try the nrow() function.)

nrow(stops)

## [1] 1891916

What date range does our data cover? One way to do this would be to examine the min() and max() dates in our dataframe. The problem is we can’t just apply min() to our dataframe: doing so produces an error.

The min() function wants its input to be just one column from the stops dataframe. To access a single column from a dataframe, use $: what goes before $ is the dataframe, and what comes after is the name of the column we want. (Hint: What happens if you type stops$date into the console?)

min(stops$date)

## [1] "2014-01-01"

max(stops$date)

## [1] "2018-04-14"

Since we only have four and a half months of data for 2018, let’s filter it out. Note that there are ways to deal with partial years in analysis, but to make things easier for ourselves, let’s focus on 2014-2017.

To do this we’ll want to:

Use the filter() function to specify the years we want, and
Use the assignment operator <- (it’s like = in R) to overwrite stops.

Aside: `filter()`

The filter() function is used to separate rows from the dataframe that interest us from rows that do not. In particular, filter(DATA, CONDITION) returns DATA with all of the rows that don’t satisfy CONDITION removed. For instance, we might want to only look at stops where the subject was arrested. To do this, we would type filter(stops, arrest_made == TRUE), since we only want rows from stops where the arrest_made column is (i.e., ==) TRUE. (Note: we could actually just type filter(stops, arrest_made). Do you see why?)

You might have noticed in the colnames, that we have no year column (our data only has date). These dates are full dates that look like YYYY-MM-DD. Luckily, with the lubridate package, getting the year from the date is as easy as year(date)!

stops <- filter(stops, year(date) < 2018)

How many stops do we have now?

# NOTE that this is equivalent to `nrow(stops)`
stops %>% nrow()

## [1] 1782092

tidyverse tip: The “pipe” operator, %>%, helps to keep our code clean. It just places the previous item into the first argument of the function. So, x %>% f() is simply f(x). While in a one-function call the pipe might feel silly and unnecessary, it’s going to become really helpful once we start wanting to do multiple transformations to our data. For instance, that means that instead of typing nrows(filter(stops, year < 2018)) we can write stops %>% filter(year < 2018) %>% nrows(), which is easier to understand.

Now, you may have noticed before that there was a column called type. Let’s take a closer look to see what that means.

Aside: `select()`

A few minutes ago, we used filter() to whittle our dataframe down to just the rows that interested us. What happens if we want to whittle down the columns instead? The select() function is designed to solve exactly this problem. It takes a dataframe, and gets rid of all of the columns you don’t ask for. For instance, select(stops, lat, long) produces a dataframe with exactly two columns: lat and long. If we wanted to know the time as well, we could type select(stops, lat, long, time).

Try using select() to look at the type column.

stops %>% 
  select(type) %>% 
  head(10) # this allows us to look at just the first 10 rows

## # A tibble: 10 x 1
##    type      
##    <fct>     
##  1 pedestrian
##  2 pedestrian
##  3 pedestrian
##  4 vehicular 
##  5 pedestrian
##  6 pedestrian
##  7 vehicular 
##  8 pedestrian
##  9 pedestrian
## 10 vehicular

Ah! Some of our stops are vehicular (i.e., traffic) stops, but some are pedestrian stops. Many of the larger cities in our database provided us with both pedestrian and vehicular data. We’ve provided it all for you so that you can dive in and uncover stories relating to pedestrian stops too! Many of the analysis techniques we’ll be covering today can be applied to the pedestrian data as well. And we encourage you to spend some time on your own replicating the relevant analyses on the pedestrian data. But for now, let’s filter to just vehicular stops, since one of our analyses is only relevant for traffic stops, and since this matches what the majority of the data in our database looks like.

stops <- stops %>% filter(type == "vehicular")

How many stops do we have now?

stops %>% nrow()

## [1] 1090441

It’d be nice to see if the 1.1 million stops were evenly distributed across years, or if stop counts have changed. To find stop counts per year, we need to define a notion of year (recall that our data only has date).

The count() function gives us an easy way to tally up observations over any number of groupings we desire. In this case, let’s count over year.

Aside: `mutate()`

Notice, though, that we have the same problem as before: there’s no year column in the stops dataframe. We’ll use the mutate() function to fix that. The mutate() function adds new columns to a dataframe based on old columns. The basic setup is mutate(DATA, NEW_COL = FUN(OLD_COL)) where DATA is our dataframe, NEW_COL is the name of the new column we want, and FUN is the function we apply to the old column, OLD_COL, to get it.

Use the year() and mutate() functions to add a new column called year to our stops dataframe and then use count() to determine the number of stops per year.

stops %>%
  mutate(year = year(date)) %>% 
  count(year)

## # A tibble: 4 x 2
##    year      n
##   <dbl>  <int>
## 1 2014. 219071
## 2 2015. 285914
## 3 2016. 288306
## 4 2017. 297150

How about counts over race?

stops %>% 
  count(subject_race)

## # A tibble: 5 x 2
##   subject_race                n
##   <fct>                   <int>
## 1 asian/pacific islander  31197
## 2 black                  714772
## 3 hispanic               112950
## 4 other/unknown           15109
## 5 white                  216413

Let’s make another table that gives us the proprtion of stops by race. (There are a few equivalent ways to do this—choose the method that feels most natural to you.)

# This method builds off of using `count` as above
stops %>% 
  count(subject_race) %>% 
  mutate(prop = n / sum(n))

## # A tibble: 5 x 3
##   subject_race                n   prop
##   <fct>                   <int>  <dbl>
## 1 asian/pacific islander  31197 0.0286
## 2 black                  714772 0.655 
## 3 hispanic               112950 0.104 
## 4 other/unknown           15109 0.0139
## 5 white                  216413 0.198

# This method uses the group_by/summarize paradigm
stops %>% 
  group_by(subject_race) %>% 
  summarize(
    n = n(),
    prop = n / nrow(.)
  )

## # A tibble: 5 x 3
##   subject_race                n   prop
##   <fct>                   <int>  <dbl>
## 1 asian/pacific islander  31197 0.0286
## 2 black                  714772 0.655 
## 3 hispanic               112950 0.104 
## 4 other/unknown           15109 0.0139
## 5 white                  216413 0.198

At first glance, we see there are far more stops of black drivers than any other race. About two-thirds of stops in our dataset were of black drivers! This stat on is own, though, doesn’t actually say much. We’ll return to this more rigorously later on.

How about counting how many stops by year and race?

stops %>% 
  # notice that you can also mutate `date` *within* the count funciton
  count(year = year(date), subject_race)

## # A tibble: 20 x 3
##     year subject_race                n
##    <dbl> <fct>                   <int>
##  1 2014. asian/pacific islander   7317
##  2 2014. black                  134192
##  3 2014. hispanic                23036
##  4 2014. other/unknown            2631
##  5 2014. white                   51895
##  6 2015. asian/pacific islander   8613
##  7 2015. black                  183067
##  8 2015. hispanic                30105
##  9 2015. other/unknown            3793
## 10 2015. white                   60336
## 11 2016. asian/pacific islander   7763
## 12 2016. black                  193093
## 13 2016. hispanic                29810
## 14 2016. other/unknown            4156
## 15 2016. white                   53484
## 16 2017. asian/pacific islander   7504
## 17 2017. black                  204420
## 18 2017. hispanic                29999
## 19 2017. other/unknown            4529
## 20 2017. white                   50698

Now we’ve gotten to the point where we have too many rows to interpret in a single table. A simple visualization could really help us out here.

stops %>% 
  count(year = year(date), subject_race) %>% 
  ggplot(aes(x = year, y = n, color = subject_race)) +
  geom_point() +
  geom_line()

We used ggplot to make this plot. This package provides a lot of options for customization, and there are many tweaks we could make to this plot to make it nicer. Its syntax is a little complicated, though, and in our data exploration, it’s enough just to get a coarse visual like this.

From this plot we see that, at least for black and white drivers, the annual trends are very different by race. (It’s hard to tell from this plot for drivers of other races because the counts are comparatively so much smaller.) All races experienced a spike in enforcement in 2015, but thereafter, there were fewer white drivers stopped (in 2016 and 2017), whereas there continued to be an increase in number of black drivers stopped over those two years.

This is already a potential lead! We’d have to investigate further to see what the trend looks like when adjusting for population changes over time or other factors we haven’t considered in this example. But it’s often simple explorations like this that can uncover the path to stories.

Fun fact: This same trend does not hold true for pedestrian stops. In fact, if we hadn’t filtered to only vehicular stops, we wouldn’t have noticed this, because this same plot over the combined pedestrian and vehicular data looks unremarkable. The takeaway is that looking at trends by sub-categories can often be very helpful. (E.g., in Nashville, looking at the different listed stop reasons uncovers the extent of the disparities.)

To give us time to dive into a variety of analysis methods, we’ll let you investigate Philly’s annual traffic stop patterns on your own time. But because it seems like the disparities could be changing over time, we’ll filter to only 2017 and focus on that data for the rest of our analysis (and overwrite stops).

stops <- stops %>% filter(year(date) == 2017)

Benchmark test

We saw before that over two-thirds of stops were of black drivers. The by-race stop counts are only meaningful, though, when compared to some baseline. If the Philadelphia population was about two-thirds black, then two-thirds of stops being of black drivers wouldn’t be at all surprising.

Stop rates

In order to do this baseline comparison, we need to understand the racial demographics in our Philly population data. The data as we’ve given it to you has raw population numbers. To make it useful, we’ll need to compute the proportion of Philadelphia residents in each demographic group. (Hint: use the mutate() function.)

population_2017 %>% 
  mutate(prop = num_people / sum(num_people))

## # A tibble: 5 x 3
##   subject_race           num_people   prop
##   <fct>                       <dbl>  <dbl>
## 1 asian/pacific islander    110864. 0.0706
## 2 black                     648846. 0.413 
## 3 hispanic                  221777. 0.141 
## 4 other/unknown              39858. 0.0254
## 5 white                     548312. 0.349

As an eyeball comparison leads us to see that black drivers are being stopped disproportionately, relative to the city’s population. But let’s be a bit more rigorous about this. If we join the two tables together, we can compute stop rates by race (i.e., number of stops per person). Remember to take into acount how many years are in your stop data, in order to get a true value of stops per capita; we’re using only 2017 for stops and for population, so we’re in good shape.

Aside: `left_join()`

The way we put tables together is with the left_join() function. We need to input three things into this function: the two tables we’d like to combine, along with instructions on how to join them. In this case, the two tables we want to join are the table of stops counted according to subject race and population_2017. The instruction for combining the tables is to join rows that contain the same race, so we have to also give left_join() the argument by = "subject_race". This means that left_join() will look at the first table— i.e., the table stops counted by race—and then go to the subject_race column in each row. Then, left_join() will take what it finds there—in this case, "asian/pacific islander", "black", "hispanic", "other/unknown", and "white"—and look in the second table, i.e., population_2017, for all the rows that contain the same information in population_2017’s race column. Then, it will add the second row at the end of the first to create a new row with information from both. What we end up with is a dataframe with all of the columns from both tables.

The process is a little complicated, and we won’t use it too much, so don’t worry if the abstract description doesn’t make sense. To get a better understanding of what’s going on, try joining the two tables described above, being sure to include the by = "subject_race" argument.

stops %>% 
  count(subject_race) %>% 
  left_join(
    population_2017,
    by = "subject_race"
  ) %>% 
  mutate(stop_rate = n / num_people)

## # A tibble: 5 x 4
##   subject_race                n num_people stop_rate
##   <fct>                   <int>      <dbl>     <dbl>
## 1 asian/pacific islander   7504    110864.    0.0677
## 2 black                  204420    648846.    0.315 
## 3 hispanic                29999    221777.    0.135 
## 4 other/unknown            4529     39858.    0.114 
## 5 white                   50698    548312.    0.0925

Good! Now we can divide the black stop rate by the white stop rate to be able to make a quantitative statement about how much more often black drivers are stops compared to white drivers, relative to their share of the city’s population. Black drivers are stopped at a rate 3.4 times higher than white drivers, and Hispanic drivers are stopped at a rate 1.5 times higher than white drivers.

Search rates

Let’s do the same sort of benchmark comparison for search and frisk rates. These are easier than the last one since we don’t need an external population benchmark. We can use the stopped population as our baseline, defining search rate to be proportion of stopped people who were subsequently searched, and frisk rate as proportion of stopped people who were subsequently frisked. Let’s get these values by race.

Aside: `group_by()` and `summarize()`

One thing that we often want to do with data is disaggregate it. That is, we might want to take the data and break it down into smaller subpopulations. Then, when we ask questions, we can ask about each piece—for instance, each demographic group, or each police district—instead of asking about the population as a whole.

The way to do this in R is with group_by() and summarize(). The standard way to use group_by() is to call group_by(DATA, COL_NAME), where DATA is our dataframe and COL_NAME is the name of a column. What group_by() then does is take all the rows in the dataframe DATA and put them into different groups, one for each different value in the column COL_NAME. So, for instance, if we called group_by(stops, district), R would hand back to us the stops dataframe with all of its columns broken into different groups, one for each police district.

The second step is to do something with those groups. That’s what summarize() does. The way summarize() works is to take a dataframe broken into groups by group_by() and calculate a statistic for each group . The basic syntax is summarize(DATA, STAT = FUN(COL_NAME)), where DATA is some dataframe broken up by group_by(), STAT is some statistic we want to calculate, FUN is the function that calculates that statistic, and COL_NAME is the name of the column (or columns) used to calculate the statistic.

Let’s put it all together with an example. A straightforward one is: stops %>% group_by(subject_race) %>% summarize(arrest_rate = mean(arrest_made)). Typing this into the R console will calculates arrest rates disaggregated by race.

Using group_by() and summarize(), let’s calculate frisk rates by race.

stops %>% 
  group_by(subject_race) %>% 
  summarize(
    search_rate = mean(search_conducted, na.rm = T),
    frisk_rate = mean(frisk_performed, na.rm = T)
  )

## # A tibble: 5 x 3
##   subject_race           search_rate frisk_rate
##   <fct>                        <dbl>      <dbl>
## 1 asian/pacific islander      0.0261     0.0188
## 2 black                       0.0605     0.0636
## 3 hispanic                    0.0510     0.0471
## 4 other/unknown               0.0369     0.0340
## 5 white                       0.0410     0.0306

Note that with functions like mean(), if any of the values that are being averaged are NA, the output value of the mean will simply be NA. We don’t have to worry about that with our Philly data, because it’s super clean. But you may encounter this problem in some of your own data. The easy workaround is to pass the argument na.rm = T into mean()—as you see in the code above. This tells R to remove the NA values and compute the mean over all the non-NA values.

Now let’s dive into these results! As with the stop rates, we can make a quantitative claim about disparities in search and frisk rates by dividing the minority rate by the white rate. Here we see that among drivers who were stopped, black drivers were searched at a rate 1.5 times higher than white drivers, and Hispanic drivers were searched at a rate 1.2 times higher than white drivers. Black drivers were frisked at a rate 2.1 times higher than white drivers were, and Hispanic drivers were frisked at a rate 1.5 times higher than white drivers were.

Caveats about the benchmark test

While these baseline stats give us a sense that there are racial disparities in policing practices in Philadelphia, they are not evidence of discrimination. The argument against the benchmark test is that we haven’t identified the correct baseline to compare to.

For the stop rate benchmark, what we really want to know is what the true distribution is for individuals breaking traffic laws or exhibiting other criminal behavior in their vehicles. If black and Hispanic drivers are disproportionately stopped relative to their rates of offending, that would be stronger evidence. Some people then proposed to use benchmarks that approximate those offending rates, like arrests, for example. However, we know arrests to themselves be racially skewed (especially for low-level drug offenses, for example), so it wouldn’t give us the true offending population’s racial distribution. Furthermore, only 2.1% of stops result in an arrest, so the arrested population will naturally not match the stopped population.

An even simpler critique of the population bechmark for stop rates is that it doesn’t account for possible race-specific differences in driving behavior, including amount of time spent on the road (and adherence to traffic laws, as mentioned above). If black drivers, hypothetically, spend more time on the road than white drivers, that in and of itself could explain the higher stop rates we see for black drivers, even in the absence of discrimination.

Search and frisk rates are slightly less suspect, since among the stopped population, it’s more reasonable to believe that people of different races offend at equal rates. In the context of searches, this means assuming that all races exhibit probable cause of possessing contraband at equal rates. And in the case of frisks, this means assuming that all races exhibit reasonable articulable suspicion of possessing a weapon at equal rates. Of course, one could also argue against these assumptions. One could claim that the stopped population isn’t a good measure of the true racial distribution of probable cause. This is all to say that while benchmark stats are a good place to start, more investigation is required before we can draw any conclusions.

Outcome test

To circumvent the benchmarking problem, it’s common to turn to the search decision, rather than the stop decision. This is because we have a notion of what a “successful” search is. The legal justification for performing a search is probable cause that the driver possesses contraband. So a successful search is one which uncovers contraband.

We thus turn to rates of successful searches. That is, what proportion of searches, by race, were successful? This proportion is known as the contraband recovery rate, or the “hit rate.” If racial groups have different hit rates, it can imply that racial groups are being subjected to different standards.

As a caricatured example, suppose among white drivers who were searched, officers found contraband 99% of the time, while among black drivers who were searched, officers found contraband only 1% of the time. This would lead us to believe that officers made sure they were certain white individuals had contraband before deciding to search, but that they were searching black individuals on a whiff of evidence.

Let’s investigate hit rates by race in Philly in 2017.

stops %>% 
  filter(search_conducted) %>% 
  group_by(subject_race) %>% 
  summarize(
    hit_rate = mean(contraband_found, na.rm = T)
  )

## # A tibble: 5 x 2
##   subject_race           hit_rate
##   <fct>                     <dbl>
## 1 asian/pacific islander    0.245
## 2 black                     0.220
## 3 hispanic                  0.201
## 4 other/unknown             0.174
## 5 white                     0.247

We see that hit rates are slightly lower for black and Hispanic drivers than for white drivers.

However, what if hit rates vary by police district? If the bar for stopping people, irrespective of race, is lower in certain police districts, and black individuals are more likely to live in neighborhoods in those districts, then the observed disparities may not reflect bias.

Adjusting for location

Now let’s compute hit rates by race and district. Name your result hit_rates.

hit_rates <- 
  stops %>% 
  filter(search_conducted) %>% 
  group_by(subject_race, district) %>% 
  summarize(hit_rate = mean(contraband_found, na.rm = T))

hit_rates

## # A tibble: 104 x 3
## # Groups:   subject_race [?]
##    subject_race           district hit_rate
##    <fct>                  <chr>       <dbl>
##  1 asian/pacific islander 01          0.167
##  2 asian/pacific islander 02          0.500
##  3 asian/pacific islander 03          0.400
##  4 asian/pacific islander 06          0.333
##  5 asian/pacific islander 07          0.   
##  6 asian/pacific islander 08          0.   
##  7 asian/pacific islander 09          0.667
##  8 asian/pacific islander 12          0.600
##  9 asian/pacific islander 14          0.167
## 10 asian/pacific islander 15          0.   
## # ... with 94 more rows

Again, this is too many hit rates to compare in one table. To plot the hit rates of black vs. white drivers and of Hispanic vs. white drivers, we need to reshape our table to have each row containing a district, a minority race, minority hit rate in that district, and white hit rate in that district. We’ll walk you through the code below that reshapes the data for us.

# Reshape table to show hit rates of minorities vs white drivers
hit_rates <-
  hit_rates %>% 
  filter(subject_race %in% c("black", "white", "hispanic")) %>% 
  spread(subject_race, hit_rate, fill = 0) %>% 
  rename(white_hit_rate = white) %>% 
  gather(minority_race, minority_hit_rate, c(black, hispanic)) %>%
  arrange(district)

hit_rates

## # A tibble: 44 x 4
##    district white_hit_rate minority_race minority_hit_rate
##    <chr>             <dbl> <chr>                     <dbl>
##  1 01                0.611 black                     0.186
##  2 01                0.611 hispanic                  0.500
##  3 02                0.364 black                     0.344
##  4 02                0.364 hispanic                  0.247
##  5 03                0.413 black                     0.250
##  6 03                0.413 hispanic                  0.424
##  7 05                0.188 black                     0.286
##  8 05                0.188 hispanic                  0.   
##  9 06                0.304 black                     0.302
## 10 06                0.304 hispanic                  0.444
## # ... with 34 more rows

Quick Tip: You might have noticed a strange symbol in the code above: %in%. This is really just shorthand for the sentence, “the subject_race is "black" or it is "white" or it is "hispanic".” Without the %in% syntax, we would have to write this in R as: subject_race == "black" | subject_race == "white" | subject_race == "hispanic". The %in% syntax is a lot shorter and clearer, as we think you’ll agree.

Now let’s plot it!

# We'll use this just to make our axes' limits nice and even
max_hit_rate <-
  hit_rates %>% 
  select(ends_with("hit_rate")) %>% 
  max()

hit_rates %>% 
  ggplot(aes(
    x = white_hit_rate,
    y = minority_hit_rate
  )) +
  geom_point() +
  # This sets a diagonal reference line (line of equal hit rates)
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # These next few lines just make the axes pretty and even
  scale_x_continuous("White hit rate", 
    limits = c(0, max_hit_rate + 0.01),
    labels = scales::percent
  ) +
  scale_y_continuous("Minority hit rate", 
    limits = c(0, max_hit_rate + 0.01),
    labels = scales::percent
  ) +
  # This makes sure that 1% on the x-axis is the same as 1% on the y-axis
  coord_fixed() +
  # This allows us to compare black v. white and Hispanic v. white side by
  # side, in panels
  facet_grid(. ~ minority_race)

  # Depending on your version of ggplot2, you may be able to use the syntax 
  # below (the newer ggplot2 syntax)---which is clearer, in my opinion.
  # But older versions of ggplot2 will only accept the above syntax
  # facet_grid(cols = vars(minority_race))

This is a good start. However, it’s hard to tell where the mass is. That is, maybe the points above the dotted line (i.e., the points where minority hit rates are higher than white hit rates), maybe those districts have all of Philadelphia’s population, and the districts below the line only represent a few people. While this is unlikely, it’s still good to have a marker of how much weight or emphasis to give each of the points on our plot.

Let’s therefore size each of the points by number of searches.

# Get corresponding number of searches (to size points).
# Again, for each district we want to know the number of white+black searches
# and white+Hispanic searches. This requires the same spreading and gathering
# as our previous data-munging.
search_counts <-
  stops %>% 
  filter(
    search_conducted, 
    subject_race %in% c("black", "white", "hispanic")
  ) %>%  
  count(district, subject_race) %>% 
  spread(subject_race, n, fill = 0) %>% 
  rename(num_white_searches = white) %>% 
  gather(minority_race, num_minority_searches, c(black, hispanic)) %>% 
  mutate(num_searches = num_minority_searches + num_white_searches) %>% 
  select(district, minority_race, num_searches)

hit_rates %>% 
  left_join(
    search_counts, 
    by = c("district", "minority_race")
  ) %>% 
  ggplot(aes(
    x = white_hit_rate,
    y = minority_hit_rate
  )) +
  geom_point(aes(size = num_searches), pch = 21) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_continuous("White hit rate", 
    limits = c(0, max_hit_rate + 0.01),
    labels = scales::percent
  ) +
  scale_y_continuous("Minority hit rate", 
    limits = c(0, max_hit_rate + 0.01),
    labels = scales::percent
  ) +
  coord_fixed() +
  facet_grid(. ~ minority_race)

  # same facet_grid syntax issue here as before!
  # facet_grid(cols = vars(minority_race))

It seems like we can now say with confidence that even looking within location, nearly every police district has lower hit rates for minorities than white drivers.

Note that we can also run a simple model (logistic regression) to quantify how much lower the odds are of recovering contraband from searching a black or Hispanic driver versus from searching a white driver, adjusting for location.

Investigating anomalies

One thing that we should always keep an eye out for, both as journalists and as data scientists, are anomalies in the data that could be changing the meaning of our results.

In the plot above, you may have noticed a rather larger point (i.e., a district with a lot of searches) that has a hit rate of essentially zero for both white and minority drivers. That’s a red flag that something’s up.

Let’s figure out what district that is by filtering our hit_rates table to the smallest white hit rate.

hit_rates %>% 
  filter(white_hit_rate == min(white_hit_rate))

## # A tibble: 2 x 4
##   district white_hit_rate minority_race minority_hit_rate
##   <chr>             <dbl> <chr>                     <dbl>
## 1 77                   0. black                        0.
## 2 77                   0. hispanic                     0.

Next, let’s take a look at what addresses are usually written in for stops in this district.

# Hint: Note that districts in our dataset are encoded as characters, 
# so 77 is actually "77"

stops %>% 
  filter(district == "77") %>% 
  count(location, sort = T)

## # A tibble: 84 x 2
##    location                            n
##    <chr>                           <int>
##  1 0 BLOCK PIA WAY                   245
##  2 8800 BLOCK ESSINGTON AVE IA WAY   212
##  3 8800 BLOCK ESSINGTON AVE          150
##  4 98 PIA WAY                         84
##  5 1 PIA WAY                          44
##  6 DEPARTURE RD RAMP                  24
##  7 DEPARTURE ROAD                     17
##  8 PIA DEPARTURE ROAD                 16
##  9 ARRIVAL D                          11
## 10 CELL PHONE LOT                     11
## # ... with 74 more rows

Depending on how you investigated the location, you might have found that the majority of the stops happen at the airport, or have location listed as NA or unavailable. Indeed, if we check the Philadelphia Police’s website, we’ll find that there is no District 77, and that the airport does not fall within the jurisdiction of any particular police district. Between this information and the fact that hit rates are essentially zero for this district, we can conclude that the types of searches happening in this location are qualitatively different from searches in our other locations.

Without knowing more about stops and searches in this district, it makes sense to remove them from our hit rate analysis. Try computing citywide hit rates again, this time filtering out district 77.

# Remember: Districts in our dataset are encoded as characters, 
# so 77 is actually "77"

stops %>% 
  filter(search_conducted, district != "77") %>% 
  group_by(subject_race) %>% 
  summarize(
    hit_rate = mean(contraband_found, na.rm = T)
  )

## # A tibble: 5 x 2
##   subject_race           hit_rate
##   <fct>                     <dbl>
## 1 asian/pacific islander    0.308
## 2 black                     0.223
## 3 hispanic                  0.206
## 4 other/unknown             0.197
## 5 white                     0.308

Wow! A sizeable share of searches of white individuals must have been taking place at the airport with zero hit rates, deflating the white hit rate we were calculating before. But this didn’t influence hit rates of black or Hispanic drivers as much. Our aggregate hit rate analysis now shows stronger evidence of bias, similar to our by-location plot. (Note, thought, that aggregate and by-location hit rates do not necessarily always tell the same story—a phenomenon known as Simpson’s Paradox—which is why it’s important to check both to get the full picture.)

Caveats about the outcome test

While the outcome test is a very simple and intuitively appealing test, it doesn’t allow us to observe the actual threshold for searching someone, it only allows us to observe outcomes. There is a subtle statistical flaw with the outcome test, known as infra-marginality, which in certain cases can lead to the outcome test indicating discrimination when equal thresholds are actually being applied, and in other cases can lead to the outcome test indicating equal hit rates when different thresholds are actually being applied. We thus often use the threshold test (a more complex test that uses both search and hit rates to infer thresholds), in order to validate the outcome test.

(Fun fact: The threshold test confirms the results we see in Philadelphia. Officers are indeed applying lower thresholds when deciding to search black and Hispanic drivers than when deciding to search white drivers.)

We don’t have time to go into the threshold test now. The important takeaway, though, is that every statistical test has its flaws. So when reporting on data, we have to make sure not overstate our findings.

Veil of Darkness test

The outcome test is a good way to get around the benchmarking problem at the search decision, but when it comes to assessing bias at the stop decision, it’s a little trickier. The outcome test usually doesn’t work, since rarely is there a notion of what a “successful” stop is. (There are, of course, some exceptions to this, like when stop reason is listed as suspected Criminal Possession of a Weapon (CPW)—which is the case in NYC stop-question-frisk data.)

An alternative approach to assessing bias in stop decisions was proposed by Grogger and Ridgeway in 2006. This approach, known as the Veil of Darkness test, relies on the hypothesis that officers who are engaged in racial profiling are less likely to be able to identify a driver’s race after dark than during daylight. Under this hypothesis, if stops made after dark had smaller proportion of black drivers stopped than stops made during daylight, this would be evidence of the presence of racial profiling.

The naive approach is just to compare whether daytime stops or nighttime stops have higher proportion of black drivers. This, however, is problematic. More black drivers being stopped at night could be the case for a multitude of reasons: different deployment and enforcement patterns, different driving patterns, etc. All of these things correlate with clock time and would be confounding our results. So let’s find a way to adjust for clock time.

To adjust for clock time, we want to compare times that were light at some point during the year (i.e., occurred before dusk) but were dark at other points during the year (i.e., occurred after dusk). This is called the “inter-twilight period”: the range from the earliest time dusk occurs in the year to the latest time dusk occurs in the year.

Let’s calculate the sunset and dusk times for Philly in 2017.

(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark. We’ll explain why this is relevant in a moment.)

# Get timezone for Philly
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)

# Helper function
time_to_minute <- function(time) {
  hour(hms(time)) * 60 + minute(hms(time))
}

# Compute sunset time for each date in our dataset
sunset_times <- 
  stops %>%
  mutate(
    lat = center_lat,
    lon = center_lng
  ) %>% 
  select(date, lat, lon) %>%
  distinct() %>%
  getSunlightTimes(
    data = ., 
    keep = c("sunset", "dusk"), 
    tz = tz
  ) %>% 
  mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>% 
  mutate(
    sunset_minute = time_to_minute(sunset),
    dusk_minute = time_to_minute(dusk),
    date = ymd(str_sub(date, 1, 10))
  ) %>% 
  select(date, sunset, dusk, ends_with("minute"))

Great. Now, using sunset_times, what is Philly’s inter-twilight period?

sunset_times %>% 
  filter(dusk == min(dusk) | dusk == max(dusk))

##         date   sunset     dusk sunset_minute dusk_minute
## 1 2017-06-27 20:34:40 21:07:26          1234        1267
## 2 2017-06-26 20:34:39 21:07:26          1234        1267
## 3 2017-12-07 16:36:53 17:07:09           996        1027
## 4 2017-12-06 16:36:56 17:07:09           996        1027

So the earliest sunset in 2017 was at around 4:30 p.m., in December (and it was fully dark about 30 minutes later, at dusk, around 5:00 p.m.). The latest sunset time in 2017 was at around 8:30 p.m., in late June (and it was fully dark about 30 minutes later, at dusk, around 9:00 p.m.).

For a time in this range, like 6:30 p.m., part of the year it’s still light, and part of the year, it’s still dark. So we can compare the proportion of black drivers stopped at 6:30 p.m. when it’s light (usually in the summer) to the proportion of black drivers stopped at 6:30 p.m. when it’s dark (usually in the winter). This way, we’re really seeing the effect of darkness, since driving and deployment patterns tend to change only with clock time. For example, the demographics of commuters who are on the road at 6:30 p.m. likely stays pretty steady throughout the year, compared to demographics of commuters at 5:00 p.m. versus at 9:00 p.m.

One quick note: We want to make sure to filter out stops that occurred in the approximately 30-minute period between sunset and dusk (end of civil twilight), since it is ambiguous whether that period is light or dark, which would muddle up our analysis.

Let’s join the sunset times to our stops data and:

Filter to the inter-twilight period;
Remove that ambigous pre-dusk period; and
For simplicity, compare only black and white drivers.

vod_stops <- 
  stops %>% 
  left_join(
    sunset_times,
    by = "date"
  ) %>% 
  mutate(
    minute = time_to_minute(time),
    minutes_after_dark = minute - dusk_minute,
    is_dark = minute > dusk_minute,
    min_dusk_minute = min(dusk_minute),
    max_dusk_minute = max(dusk_minute),
    is_black = subject_race == "black"
  ) %>% 
  filter(
    # Filter to get only the intertwilight period
    minute >= min_dusk_minute,
    minute <= max_dusk_minute,
    # Remove ambigous period between sunset and dusk
    !(minute > sunset_minute & minute < dusk_minute),
    # Compare only white and black drivers
    subject_race %in% c("black", "white")
  )

How many stops are in our new, filtered data frame?

vod_stops %>% nrow()

## [1] 80436

Now, to get some intuition, let’s use our 6:30 p.m. example. If we filter the data to stops that occurred in the narrow window from 6:30 p.m. to 6:45 p.m., we can compute what proportion of stops were of black drivers for stops when it was dark and compare that to the proportion for stops when it was not dark yet.

vod_stops %>% 
  filter(time > hm("18:30"), time < hm("18:45")) %>% 
  group_by(is_dark) %>% 
  summarize(prop_black = mean(is_black))

## # A tibble: 2 x 2
##   is_dark prop_black
##   <lgl>        <dbl>
## 1 FALSE        0.832
## 2 TRUE         0.785

Indeed, we see that the proportion is smaller when 6:30–6:45 p.m. is dark than when it’s light. According to the veil of darkness hypothesis, this is because officers engaged in racial profiling can’t determine drivers’ races as well at night. We know that the proportion of black drivers stopped dropped by about 5% when it was dark and race was harder to identify. It’s hard to say what this means in terms of the extent or pervasiveness, though. We cannot make any claims about whether some small percent of the officers were engaged in stops that were 100% profiled on race, or whether all officers were engaged in racial profiling in some small percentage of their stops, or anything of the sort.

Furthermore, while this stat gives us some intuition that racial profiling might be present in Philly, we’ve only looked at one time. In order to make sure the result is robust, we want to check this for every time in our inter-twilight period. To do this, we can use logistic regression. The basic idea here is that we’re finding what our data implied about how darkness and clock time influence the whether a stopped driver will be black. We then extract the coefficient of darkness. (You can think of this loosely as the measure of how much darkness—as opposed to clock time—influences whether a stopped driver will be black.)

mod1 <- glm(
  is_black ~ is_dark + splines::ns(minute, df = 6),
  family = binomial,
  data = vod_stops
)

summary(mod1)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]

##    Estimate  Std. Error 
## -0.20676109  0.02171533

The fact that the coefficient is negative means that darkness lessens the likelihood that a stopped driver will be black, after adjusting for clock time. This matches our intuitive result from 6:30 p.m.

The fact that the standard error is small means that this is a statistically significant finding. Another important thing to do, besides checking statistical significance of your results, is to do a bunch of different robustness checks. The idea is that all models make some choices about how to turn the question at hand into math, and we want to vary those choices to make sure that none of them change the model result—that the result holds regardless of, for example, how you choose to control for time. In our model we used a natural spline with six degrees of freedom, but we could have chosen to use 3 degrees of freedom, or to do something way simpler like bin time to one-minute or five-minute windows.

Another robustness check that often comes up with this type of modeling is adding in a control for sub-geography. For the outcome test, we controlled for police district under the proposition that maybe it was an intentional policy decision for certain police districts to lower search thresholds than other police districts—and if those districts happened to be disproportionately black, that would “explain away” the bias we saw in the aggregate outcome test results.

In the case of the Veil of Darkness test, one could make an analogous claim: is the result that we’re seeing just a product of certain districts being policed more after dark and those districts have a lower proportion of black drivers than the districts being policed before dark? If so, then adjusting for location in our model would lead to the darkness coefficient being zero. But if the coefficient on darkness doesn’t change much (i.e., stays negative in a statistically significant way), then that’s another marker that our finding is robust.

mod2 <- glm(
  is_black ~ is_dark + splines::ns(minute, df = 6) + as.factor(district),
  family = binomial,
  data = vod_stops
)

summary(mod2)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]

##    Estimate  Std. Error 
## -0.16660152  0.02502906

Looks like no matter how you slice it, the veil of darkness test shows racial profiling of black drivers is present in Philly traffic stops.

Caveats on the Veil of Darkness test

The veil-of-darkness test is a popular technique for assessing disparate treatment, but, like all statistical methods, it comes with caveats. (Hopefully that’s one of the takeaways today—data and statistical analysis can be very powerful, but it’s good to know what the limitations are.) Perhaps most importantly, darkness—after adjusting for time of day—is a function of the date. As such, to the extent that driver behavior changes throughout the year, and these changes are correlated with race, the test can suggest discrimination where there is none. One way to account for seasonal effects is to consider the brief period around daylight savings. Driving behavior likely doesn’t change much at, say, 6:00 p.m. the day before daylight savings to the day after; but the sky is light on one day and dark on the next. Clever checks like this can help circumvent some of the concerns about the veil of darkness test. However, if driving behavior is more related to lighting than time of day, the daylight savings test wouldn’t help.

Another thing to note is that artificial lighting (e.g., from street lamps) can weaken the relationship between sunlight and visibility, and so the method may underestimate the extent to which stops are predicated on perceived race.

Other things, like vehicle make, year, and model often correlate with race and are still visible at night, which could lead to the test under-estimating the extent of racial profiling. Similarly, the test doesn’t control for stop reason, which is often correlated with both race and time of day. (Consider, for example, pretextual broken tail light stops.)

Finally, the test only speaks to presence of racial profiling in the intertwilight period – doesn’t say anything about other hours.

Nevertheless, despite these shortcomings, the test provides a useful if imperfect measure of bias in stop decisions.

Parting thoughts

Hopefully this gets you started on the basics of how to analyze data, and gives you a little glimpse of some more complex modeling you can do with the data too. While these analyses (from simple hit rates to modeling racial profiling) can be really helpful in quantifying racial disparities, it’s often basic exploration of the data that yields the most interesting stories. So, the number one rule in working with data is: be curious!

Make sure to read the full paper for a more in-depth analysis. And we look forward to reading about all of the analysis you all do, and all the interesting stories you uncover!

Tutorials

Open Policing Project Tutorial

Setup

Covering the basics

Aside: `filter()`

Aside: `select()`

Aside: `mutate()`

Benchmark test

Stop rates

Aside: `left_join()`

Search rates

Aside: `group_by()` and `summarize()`

Caveats about the benchmark test

Outcome test

Adjusting for location

Investigating anomalies

Caveats about the outcome test

Veil of Darkness test

Caveats on the Veil of Darkness test

Parting thoughts

Tutorials

Open Policing Project Tutorial

Setup

Covering the basics

Aside: filter()

Aside: select()

Aside: mutate()

Benchmark test

Stop rates

Aside: left_join()

Search rates

Aside: group_by() and summarize()

Caveats about the benchmark test

Outcome test

Adjusting for location

Investigating anomalies

Caveats about the outcome test

Veil of Darkness test

Caveats on the Veil of Darkness test

Parting thoughts

Aside: `filter()`

Aside: `select()`

Aside: `mutate()`

Aside: `left_join()`

Aside: `group_by()` and `summarize()`