To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.
Alternatively, I have made a project in RStudio Cloud that you can use (without worrying about trading room computer limitations), with the data already inside (you will still need to assign it to an object).
We are returning to the speeding tickets data that we began to explore in R Practice 15 on Multivariate Regression. Download and read in (read_csv) the data below.
This data again comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we’ll look at are:
| Variable | Description |
|---|---|
| Amount | Amount of fine (in dollars) assessed for speeding |
| Age | Age of speeding driver (in years) |
| MPHover | Miles per hour over the speed limit |
| Black | Dummy =1 if driver was black, =0 if not |
| Hispanic | Dummy =1 if driver was Hispanic, =0 if not |
| Female | Dummy =1 if driver was female, =0 if not |
| OutTown | Dummy =1 if driver was not from local town, =0 if not |
| OutState | Dummy =1 if driver was not from local state, =0 if not |
| StatePol | Dummy =1 if driver was stopped by State Police, =0 if stopped by other (local) police |
We again want to explore who gets fines, and how much.
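If you are working locally, loading the tidyverse and reading in the data might look like the sketch below. The filename `speeding_tickets.csv` is an assumption; use whatever name your downloaded file has.

```r
# load the tidyverse (readr for read_csv, dplyr for wrangling, ggplot2 for plots)
library(tidyverse)

# read in the data and assign it to an object
# NOTE: the filename here is an assumption -- match it to your actual download
speed <- read_csv("speeding_tickets.csv")
```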
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## Black = col_double(),
## Hispanic = col_double(),
## Female = col_double(),
## Amount = col_double(),
## MPHover = col_double(),
## Age = col_double(),
## OutTown = col_double(),
## OutState = col_double(),
## StatePol = col_double()
## )
We will have to do a little bit of cleaning to get the data in a more usable form.
Inspect the data with str() or head() or glimpse() to see what it looks like.
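For example, any of the following will work (assuming you named the dataframe `speed`):

```r
str(speed)      # base R: classes, dimensions, and a preview of each column
head(speed)     # base R: first six rows
glimpse(speed)  # dplyr: compact, transposed preview of every column
```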
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 68357 obs. of 9 variables:
## $ Black : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hispanic: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Female : num 1 1 1 0 0 0 1 0 1 0 ...
## $ Amount : num NA NA NA NA NA NA NA NA NA NA ...
## $ MPHover : num 14 15 15 13 12 17 15 15 15 15 ...
## $ Age : num 22 43 32 24 54 30 18 53 51 33 ...
## $ OutTown : num 1 1 0 1 1 1 0 0 1 1 ...
## $ OutState: num 0 0 0 0 0 0 0 0 0 0 ...
## $ StatePol: num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Black = col_double(),
## .. Hispanic = col_double(),
## .. Female = col_double(),
## .. Amount = col_double(),
## .. MPHover = col_double(),
## .. Age = col_double(),
## .. OutTown = col_double(),
## .. OutState = col_double(),
## .. StatePol = col_double()
## .. )
| Black <dbl> | Hispanic <dbl> | Female <dbl> | Amount <dbl> | MPHover <dbl> | Age <dbl> | OutTown <dbl> | OutState <dbl> | StatePol <dbl> |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | NA | 14 | 22 | 1 | 0 | 0 |
| 0 | 0 | 1 | NA | 15 | 43 | 1 | 0 | 0 |
| 0 | 0 | 1 | NA | 15 | 32 | 0 | 0 | 0 |
| 0 | 0 | 0 | NA | 13 | 24 | 1 | 0 | 0 |
| 0 | 0 | 0 | NA | 12 | 54 | 1 | 0 | 0 |
| 0 | 0 | 0 | NA | 17 | 30 | 1 | 0 | 0 |
## Observations: 68,357
## Variables: 9
## $ Black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hispanic <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Female <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0…
## $ Amount <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MPHover <dbl> 14, 15, 15, 13, 12, 17, 15, 15, 15, 15, 13, 11, 15, 15,…
## $ Age <dbl> 22, 43, 32, 24, 54, 30, 18, 53, 51, 33, 48, 36, 31, 45,…
## $ OutTown <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ OutState <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ StatePol <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
What class of variable are Black, Hispanic, Female, OutTown, and OutState?1
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
Notice that when importing the data from the .csv file, R interpreted these variables as numeric (double), but we want them to be factor variables, to ensure R recognizes that there are two groups (categories), 0 and 1. Convert each of these variables into factors by reassigning it according to the format:
where:

- df is the name of your dataframe
- my_var and my_var2 are example variables2
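Applied to our data, that format might look like the following (a sketch assuming your dataframe is called `speed`):

```r
# reassign each dummy as a factor so R treats 0 and 1 as categories
speed <- speed %>%
  mutate(Black = factor(Black),
         Hispanic = factor(Hispanic),
         Female = factor(Female),
         OutTown = factor(OutTown),
         OutState = factor(OutState))
```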
Confirm they are each now factors by checking their class again.
## [1] "factor"
## [1] "factor"
## [1] "factor"
## [1] "factor"
## [1] "factor"
Get a summary() of Amount. Note that there are a lot of NA’s (these are people who were stopped but did not receive a fine)! filter() only those observations for which Amount is a positive number, and save this in your dataframe (assign and overwrite it, or make a new dataframe). Then double check that this worked by getting a summary() of Amount again.
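One possible workflow, overwriting `speed` as the instructions allow:

```r
# summary before filtering: lots of NA's in Amount
summary(speed$Amount)

# keep only stops where a positive fine was actually assessed
speed <- speed %>%
  filter(Amount > 0)

# summary after filtering: the NA's should be gone
summary(speed$Amount)
```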
## Amount
## Min. : 3
## 1st Qu.: 75
## Median :115
## Mean :122
## 3rd Qu.:155
## Max. :725
## NA's :36683
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3 75 115 122 155 725 36683
## Amount
## Min. : 3
## 1st Qu.: 75
## Median :115
## Mean :122
## 3rd Qu.:155
## Max. :725
Now let’s start looking at the distribution conditionally to find the different group means.
Find the mean and standard deviation4 of Amount for male drivers and again for female drivers.
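Following the hint in footnote 4, one way is to filter() by sex and then summarize():

```r
# mean and sd of fines for male drivers
speed %>%
  filter(Female == 0) %>%
  summarize(mean = mean(Amount),
            sd = sd(Amount))

# mean and sd of fines for female drivers
speed %>%
  filter(Female == 1) %>%
  summarize(mean = mean(Amount),
            sd = sd(Amount))
```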
| mean <dbl> | sd <dbl> |
|---|---|
| 124.6654 | 58.28387 |

| mean <dbl> | sd <dbl> |
|---|---|
| 116.726 | 51.48665 |
What is the difference between the average Amount for Males and Females?
## [1] 124.6654
## [1] 116.726
## [1] 7.939397
Males get $7.94 higher fines on average than Females.
We did not go over how to do this in class, but you can run a t-test for the difference in group means to see if the difference is statistically significant. The syntax is similar to that of a regression:
t.test(y~d, data=df)
where:

- y is the variable we are testing
- d is the dummy variable
Is there a statistically significant difference between Amount for male and female drivers?5
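Plugging our variables into that syntax:

```r
# two-sample t-test of mean Amount between the Female groups
t.test(Amount ~ Female, data = speed)
```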
##
## Welch Two Sample t-test
##
## data: Amount by Female
## t = 12.356, df = 23400, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.679941 9.198853
## sample estimates:
## mean in group 0 mean in group 1
## 124.6654 116.7260
This gives us the confidence interval for the difference in average amount between men and women, (6.68, 9.20), and runs the following hypothesis test:

$$H_0: \mu_{Male} - \mu_{Female} = 0$$
$$H_a: \mu_{Male} - \mu_{Female} \neq 0$$
Since the test-statistic t is quite high, and the p-value is very small (basically 0), we have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis: there appears to be a statistically significant difference between average fines for men and women.
Now run the following regression to ensure we get the same result:

$$Amount_i = \hat{\beta}_0 + \hat{\beta}_1 \, Female_i$$
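In R, saving the model as `female_reg` (the name used in the regression table later):

```r
# bivariate regression of fines on the Female dummy
female_reg <- lm(Amount ~ Female, data = speed)
summary(female_reg)

# tidy() from the broom package returns the coefficients as a dataframe
library(broom)
tidy(female_reg)
```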
##
## Call:
## lm(formula = Amount ~ Female, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.67 -49.67 -6.73 30.33 600.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.6654 0.3857 323.23 <2e-16 ***
## Female1 -7.9394 0.6698 -11.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.12 on 31672 degrees of freedom
## Multiple R-squared: 0.004416, Adjusted R-squared: 0.004385
## F-statistic: 140.5 on 1 and 31672 DF, p-value: < 2.2e-16
| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 124.665423 | 0.3856913 | 323.22592 | 0.000000e+00 |
| Female1 | -7.939397 | 0.6698475 | -11.85254 | 2.443741e-32 |
Let’s recode the sex variable.
Make a new variable called Male and save it in your dataframe.6
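Following the hint in footnote 6, one way to define Male:

```r
# Male = 1 whenever Female is 0, and vice versa; wrap in factor() up front
speed <- speed %>%
  mutate(Male = factor(ifelse(Female == 0,  # test: is the driver not female?
                              yes = 1,      # if TRUE, Male is 1
                              no = 0)))     # if FALSE, Male is 0
```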
| Male <fctr> | Female <fctr> |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 0 | 1 |
| 1 | 0 |
| 1 | 0 |
| 0 | 1 |
| 0 | 1 |
| 1 | 0 |
| 0 | 1 |
Run the same regression as in question 5, but use Male instead of Female.
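Saving this one as `male_reg`:

```r
# same regression, swapping in the Male dummy
male_reg <- lm(Amount ~ Male, data = speed)
summary(male_reg)
```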
##
## Call:
## lm(formula = Amount ~ Male, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.67 -49.67 -6.73 30.33 600.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.7260 0.5477 213.13 <2e-16 ***
## Male1 7.9394 0.6698 11.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.12 on 31672 degrees of freedom
## Multiple R-squared: 0.004416, Adjusted R-squared: 0.004385
## F-statistic: 140.5 on 1 and 31672 DF, p-value: < 2.2e-16
| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 116.726026 | 0.5476659 | 213.13363 | 0.000000e+00 |
| Male1 | 7.939397 | 0.6698475 | 11.85254 | 2.443741e-32 |
Run a regression of Amount on Male and Female. What happens, and why?
##
## Call:
## lm(formula = Amount ~ Male + Female, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.67 -49.67 -6.73 30.33 600.33
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.7260 0.5477 213.13 <2e-16 ***
## Male1 7.9394 0.6698 11.85 <2e-16 ***
## Female1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.12 on 31672 degrees of freedom
## Multiple R-squared: 0.004416, Adjusted R-squared: 0.004385
## F-statistic: 140.5 on 1 and 31672 DF, p-value: < 2.2e-16
Male and Female are perfectly multicollinear, as for every person $i$, $Male_i + Female_i = 1$. We can confirm this by seeing that the correlation between Male and Female is exactly $-1$. To run a regression, we must exclude one of the dummies, and as we’ve seen, it makes no difference which one we exclude.
## Male Female
## Male 1 -1
## Female -1 1
Age probably has a lot to do with differences in fines; perhaps age also affects the difference in fines between males and females.
Run a regression of Amount on Age and Female. How does the coefficient on Female change?
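Saving this model as `age_reg`:

```r
# control for Age alongside the Female dummy
age_reg <- lm(Amount ~ Age + Female, data = speed)
summary(age_reg)
```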
##
## Call:
## lm(formula = Amount ~ Age + Female, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.77 -45.63 -5.91 33.67 597.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.19965 0.90969 147.52 <2e-16 ***
## Age -0.28598 0.02472 -11.57 <2e-16 ***
## Female1 -7.85172 0.66849 -11.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56 on 31671 degrees of freedom
## Multiple R-squared: 0.008604, Adjusted R-squared: 0.008542
## F-statistic: 137.4 on 2 and 31671 DF, p-value: < 2.2e-16
| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 134.1996506 | 0.90969259 | 147.52198 | 0.000000e+00 |
| Age | -0.2859785 | 0.02472361 | -11.56702 | 6.985839e-31 |
| Female1 | -7.8517192 | 0.66849063 | -11.74544 | 8.676153e-32 |
The coefficient on Female goes from −7.94 (question 5) to −7.85 when controlling for Age.
Now let’s see if the difference in fines between men and women differs depending on the driver’s age. Run the regression again, but add an interaction term between Female and Age.
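Saving this model as `interact_reg`; `Age:Female` is R's syntax for the interaction term:

```r
# interaction model: does the sex gap in fines vary with age?
interact_reg <- lm(Amount ~ Age + Female + Age:Female, data = speed)
summary(interact_reg)
```

Equivalently, the shorthand `Amount ~ Age * Female` expands to the same three regressors (Age, Female, and their interaction).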
##
## Call:
## lm(formula = Amount ~ Age + Female + Age:Female, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.35 -44.68 -5.49 33.49 597.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.54592 1.07428 126.173 < 2e-16 ***
## Age -0.32636 0.03008 -10.848 < 2e-16 ***
## Female1 -12.02331 1.89296 -6.352 2.16e-10 ***
## Age:Female1 0.12435 0.05279 2.355 0.0185 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56 on 31670 degrees of freedom
## Multiple R-squared: 0.008778, Adjusted R-squared: 0.008684
## F-statistic: 93.49 on 3 and 31670 DF, p-value: < 2.2e-16
| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 135.5459174 | 1.07428490 | 126.173158 | 0.000000e+00 |
| Age | -0.3263597 | 0.03008437 | -10.848147 | 2.273784e-27 |
| Female1 | -12.0233093 | 1.89296463 | -6.351576 | 2.160090e-10 |
| Age:Female1 | 0.1243543 | 0.05279367 | 2.355478 | 1.850497e-02 |
Write out your estimated regression equation.
$$\widehat{Amount_i} = 135.55 - 12.02 \, Female_i - 0.33 \, Age_i + 0.12 \, (Female_i \times Age_i)$$
Interpret the interaction effect. Is it statistically significant?
The coefficient on the interaction term, $\hat{\beta}_3$, is 0.12. For every additional year of age, females can expect their fine to increase by $0.12 more than males gain for every additional year of age.
$\hat{\beta}_3$ has a standard error of 0.05, which means it has a t-statistic of 2.355 and p-value of 0.0185, so it is statistically significant (at the 95% level).
Plugging in 0 or 1 as necessary, rewrite (on your paper) this regression as two separate equations, one for Males and one for Females.
For Males (Female=0):
$$\begin{aligned}
\widehat{Amount} &= 135.55 - 0.33 \, Age - 12.02 \, Female + 0.12 \, (Female \times Age) \\
&= 135.55 - 0.33 \, Age - 12.02(0) + 0.12(0) \, Age \\
&= 135.55 - 0.33 \, Age
\end{aligned}$$
For Females (Female=1):
$$\begin{aligned}
\widehat{Amount} &= 135.55 - 0.33 \, Age - 12.02 \, Female + 0.12 \, (Female \times Age) \\
&= 135.55 - 0.33 \, Age - 12.02(1) + 0.12(1) \, Age \\
&= (135.55 - 12.02) + (-0.33 + 0.12) \, Age \\
&= 123.53 - 0.21 \, Age
\end{aligned}$$
# Another way to do this is to run conditional regressions:

# subset data for only Males
speed_males <- speed %>%
  filter(Female == 0)

# subset data for only Females
speed_females <- speed %>%
  filter(Female == 1)

# run a regression for each subset
male_age_reg <- lm(Amount ~ Age, data = speed_males)
female_age_reg <- lm(Amount ~ Age, data = speed_females)

tidy(male_age_reg)

| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 135.5459174 | 1.11524335 | 121.53932 | 0.000000e+00 |
| Age | -0.3263597 | 0.03123137 | -10.44974 | 1.694998e-25 |
| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 123.5226081 | 1.43128789 | 86.30172 | 0.000000e+00 |
| Age | -0.2020053 | 0.03983956 | -5.07047 | 4.036045e-07 |
| Males Only | Females Only | |
| (Intercept) | 135.546 *** | 123.523 *** |
| (1.115) | (1.431) | |
| Age | -0.326 *** | -0.202 *** |
| (0.031) | (0.040) | |
| N | 21173 | 10501 |
| R2 | 0.005 | 0.002 |
| logLik | -116063.355 | -56274.760 |
| AIC | 232132.710 | 112555.521 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
Let’s try to visualize this. Make a scatterplot of Age (X) and Amount (Y) and include a regression line.
Try setting color=Female inside your base layer’s aes(). This will produce two lines and color the points by Female.7
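A sketch of the plot (per the hint, geom_jitter() helps with overplotting; the width and alpha values here are just starting points to play with):

```r
# scatterplot of Age vs. Amount with one regression line per sex
ggplot(data = speed, aes(x = Age, y = Amount, color = Female)) +
  geom_jitter(width = 0.5, alpha = 0.2) +  # jittered, semi-transparent points
  geom_smooth(method = "lm")               # separate line for each Female group
```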
Now let’s look at the possible interaction between Sex (Male or Female) and whether a driver is from In-State or Out-of-State (OutState).
Use R to examine the data and find the mean for (i) Males In-State, (ii) Males Out-of-State, (iii) Females In-State, and (iv) Females Out-of-State.
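One way is to filter() to each group and then summarize() (conditioning on Female and OutState):

```r
# (i) Males, In-State
speed %>% filter(Female == 0, OutState == 0) %>% summarize(mean = mean(Amount))

# (ii) Males, Out-of-State
speed %>% filter(Female == 0, OutState == 1) %>% summarize(mean = mean(Amount))

# (iii) Females, In-State
speed %>% filter(Female == 1, OutState == 0) %>% summarize(mean = mean(Amount))

# (iv) Females, Out-of-State
speed %>% filter(Female == 1, OutState == 1) %>% summarize(mean = mean(Amount))
```

A more compact alternative is `speed %>% group_by(Female, OutState) %>% summarize(mean = mean(Amount))`, which returns all four means in one table.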
| Group | mean |
|---|---|
| Males, In-State | 124 |
| Males, Out-of-State | 128 |
| Females, In-State | 115 |
| Females, Out-of-State | 124 |
Now run a regression of the following model:
$$Amount_i = \hat{\beta}_0 + \hat{\beta}_1 \, Female_i + \hat{\beta}_2 \, OutState_i + \hat{\beta}_3 \, (Female_i \times OutState_i)$$
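Saving this as `sex_state_reg`. Note from the output below that the model was actually specified as `Female + Female:OutState`, which lets R estimate a separate OutState effect within each sex:

```r
# regression with a sex-by-residency interaction
sex_state_reg <- lm(Amount ~ Female + Female:OutState, data = speed)
summary(sex_state_reg)
```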
##
## Call:
## lm(formula = Amount ~ Female + Female:OutState, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.68 -48.68 -8.68 31.32 601.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 123.6775 0.4391 281.650 < 2e-16 ***
## Female1 -8.8790 0.7541 -11.775 < 2e-16 ***
## Female0:OutState1 4.2915 0.9152 4.689 2.76e-06 ***
## Female1:OutState1 9.4672 1.3586 6.968 3.27e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.06 on 31670 degrees of freedom
## Multiple R-squared: 0.006629, Adjusted R-squared: 0.006535
## F-statistic: 70.44 on 3 and 31670 DF, p-value: < 2.2e-16
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 124 | 0.439 | 282 | 0 |
| Female1 | -8.88 | 0.754 | -11.8 | 6.14e-32 |
| Female0:OutState1 | 4.29 | 0.915 | 4.69 | 2.76e-06 |
| Female1:OutState1 | 9.47 | 1.36 | 6.97 | 3.27e-12 |
Write out the estimated regression equation.
$$\widehat{Amount_i} = 123.68 - 8.88 \, Female_i + 4.29 \, OutState_i + 5.17 \, (Female_i \times OutState_i)$$
What does each coefficient mean?
Using the regression equation, what are the means for each of the four groups (Males/Females, In-State/Out-of-State)?
Compare to your answers in part A.
Collect your regressions from questions 5, 6b, 8a, 8b, and 9b and output them in a regression table with huxreg() (from the huxtable package).
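With the models saved under the names above, the default table is one call:

```r
# huxreg() from the huxtable package builds a side-by-side regression table
library(huxtable)
huxreg(female_reg, male_reg, age_reg, interact_reg, sex_state_reg)
```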
| (1) | (2) | (3) | (4) | (5) | |
| (Intercept) | 124.665 *** | 116.726 *** | 134.200 *** | 135.546 *** | 123.678 *** |
| (0.386) | (0.548) | (0.910) | (1.074) | (0.439) | |
| Female1 | -7.939 *** | -7.852 *** | -12.023 *** | -8.879 *** | |
| (0.670) | (0.668) | (1.893) | (0.754) | ||
| Male1 | 7.939 *** | ||||
| (0.670) | |||||
| Age | -0.286 *** | -0.326 *** | |||
| (0.025) | (0.030) | ||||
| Age:Female1 | 0.124 * | ||||
| (0.053) | |||||
| Female0:OutState1 | 4.291 *** | ||||
| (0.915) | |||||
| Female1:OutState1 | 9.467 *** | ||||
| (1.359) | |||||
| N | 31674 | 31674 | 31674 | 31674 | 31674 |
| R2 | 0.004 | 0.004 | 0.009 | 0.009 | 0.007 |
| logLik | -172510.224 | -172510.224 | -172443.461 | -172440.687 | -172474.987 |
| AIC | 345026.449 | 345026.449 | 344894.922 | 344891.373 | 344959.974 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||
# let's customize a bit
huxreg(female_reg,
male_reg,
age_reg,
interact_reg,
sex_state_reg,
coefs = c("Constant" = "(Intercept)",
"Female" = "Female1",
"Male" = "Male1",
"Age" = "Age",
"Age*Female" = "Age:Female1",
"Out of State" = "Female0:OutState1",
"Female*Out of State" = "Female1:OutState1"),
statistics = c("N" = "nobs",
"R-Squared" = "r.squared",
"SER" = "sigma"),
number_format = 2)

| | (1) | (2) | (3) | (4) | (5) |
| Constant | 124.67 *** | 116.73 *** | 134.20 *** | 135.55 *** | 123.68 *** |
| (0.39) | (0.55) | (0.91) | (1.07) | (0.44) | |
| Female | -7.94 *** | -7.85 *** | -12.02 *** | -8.88 *** | |
| (0.67) | (0.67) | (1.89) | (0.75) | ||
| Male | 7.94 *** | ||||
| (0.67) | |||||
| Age | -0.29 *** | -0.33 *** | |||
| (0.02) | (0.03) | ||||
| Age*Female | 0.12 * | ||||
| (0.05) | |||||
| Out of State | 4.29 *** | ||||
| (0.92) | |||||
| Female*Out of State | 9.47 *** | ||||
| (1.36) | |||||
| N | 31674 | 31674 | 31674 | 31674 | 31674 |
| R-Squared | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 |
| SER | 56.12 | 56.12 | 56.00 | 56.00 | 56.06 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||
Hint: use the class(df$variable) command to ask what class something is, where df is your dataframe, and variable is the name of a variable.↩︎
As a bonus, you can try doing this with just one conditional command: mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"),factor). See our Data Wrangling slides for refreshers of all the fancy mutate() possibilities!↩︎
Hint: Use geom_jitter() instead of geom_point() to plot the points, and play around with width settings inside geom_jitter()↩︎
Hint: use the summarize() command, once you have properly filter()ed the data.↩︎
Hint: this is like any hypothesis test. A t-value needs to be large enough to be greater than a critical value of t. Check the p-value and see if it is less than our standard of α=0.05.↩︎
Hint: mutate() and define Male equal to factor(ifelse()). This makes the variable a factor (so we don’t have to later), and the ifelse() function has three arguments: 1. test = the condition to test, 2. yes = what to define “Male” as when the condition is TRUE, and 3. no = what to define “Male” as when the condition is FALSE.↩︎
Sometimes we may also need to remind R that Female is a factor with as.factor(Female). We don’t need to in this case because we already reset Female as a factor in question 1.↩︎