To minimize confusion, I suggest creating a new R Project
(e.g. regression_practice
) and storing any data in that folder on your computer.
Alternatively, I have made a project in R Studio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).
Download and read in (read_csv
) the data below.
speeding_tickets.csv
This data comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one’s we’ll look at are:Variable | Description |
---|---|
Amount |
Amount of fine (in dollars) assessed for speeding |
Age |
Age of speeding driver (in years) |
MPHover |
Miles per hour over the speed limit |
We want to explore who gets fines, and how much. We’ll come back to the other variables (which are categorical) in this dataset in later lessons.
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## Black = col_double(),
## Hispanic = col_double(),
## Female = col_double(),
## Amount = col_double(),
## MPHover = col_double(),
## Age = col_double(),
## OutTown = col_double(),
## OutState = col_double(),
## StatePol = col_double()
## )
How does the age of a driver affect the amount of the fine? Make a scatterplot of the Amount
of the fine and the driver’s Age
. Add a regression line with an additional layer of geom_smooth(method="lm")
.
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).
## Warning: Removed 36683 rows containing missing values (geom_point).
Find the correlation between Amount
and Age
.
## Amount
## Min. : 3
## 1st Qu.: 75
## Median :115
## Mean :122
## 3rd Qu.:155
## Max. :725
## NA's :36683
## Age Amount
## Age 1.00000000 -0.06546571
## Amount -0.06546571 1.00000000
## Age Amount
## Age 1.00000000 -0.06546571
## Amount -0.06546571 1.00000000
my_cor <dbl> | ||||
---|---|---|---|---|
-0.06546571 |
## [1] -0.06546571
We want to predict the following model:
^Amounti=^β0+^β1Agei
Run a regression, and save it as an object. Now get a summary()
of it.
##
## Call:
## lm(formula = Amount ~ Age, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.21 -46.58 -5.92 32.55 600.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131.70665 0.88649 148.57 <2e-16 ***
## Age -0.28927 0.02478 -11.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.13 on 31672 degrees of freedom
## (36683 observations deleted due to missingness)
## Multiple R-squared: 0.004286, Adjusted R-squared: 0.004254
## F-statistic: 136.3 on 1 and 31672 DF, p-value: < 2.2e-16
Write out the estimated regression equation.
^Amounti=131.71−0.29Agei
## $$
## \text{Amount} = 131.71 - 0.29(\text{Age}) + \epsilon
## $$
What is ^β0? What does it mean in the context of our question?
^β0=131.71. It means when Age
is 0, the fine Amount
is expected to be $131.71.
What is ^β1? What does it mean in the context of our question?
^β1=−0.29. It means for every additional year old someone is, their expected Amount
of the fine decreases by $0.29.
What is the marginal effect of age on amount?
The marginal effect is ^β1, described in the previous part. Just checking to make sure you know it’s the marginal effect!
Redo question 4 with the broom
package. Try out tidy()
and glance()
. This is just to keep you versatile.
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 131.7066508 | 0.88649485 | 148.57013 | 0.000000e+00 |
Age | -0.2892712 | 0.02477541 | -11.67574 | 1.967241e-31 |
r.squared <dbl> | adj.r.squared <dbl> | sigma <dbl> | statistic <dbl> | p.value <dbl> | df <int> | logLik <dbl> | AIC <dbl> | |
---|---|---|---|---|---|---|---|---|
0.004285759 | 0.004254321 | 56.1254 | 136.3228 | 1.967241e-31 | 2 | -172512.3 | 345030.6 |
How big would the difference in expected fine be for two drivers, one 18 years old and one 40 years old?
18-year-old driver:
^Amounti=^β0+^β1Agei=131.71−0.29(18)=126.49
40-year-old driver:
^Amounti=^β0+^β1Agei=131.71−0.29(40)=120.11
The difference is 126.49−120.11=6.38 only!
# we can use R as a calculator and use the regression coefficients
# extract beta0 and save it
beta_0<-reg1_tidy %>%
filter(term=="(Intercept)") %>%
select(estimate)
# extract beta1 and save it
beta_1<-reg1_tidy %>%
filter(term=="Age") %>%
select(estimate)
# predicted amount for 18 year old
amount_18yr<-beta_0+beta_1*18
# predicted amount for 40 year old
amount_40yr<-beta_0+beta_1*40
# difference
amount_18yr-amount_40yr
estimate <dbl> | ||||
---|---|---|---|---|
6.363966 |
Now run the regression again, controlling for speed (MPHover
).
##
## Call:
## lm(formula = Amount ~ Age + MPHover, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308.763 -19.783 7.682 25.757 226.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.49386 0.95428 3.661 0.000251 ***
## Age 0.02496 0.01760 1.418 0.156059
## MPHover 6.89175 0.03869 178.130 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.67 on 31671 degrees of freedom
## (36683 observations deleted due to missingness)
## Multiple R-squared: 0.5026, Adjusted R-squared: 0.5026
## F-statistic: 1.6e+04 on 2 and 31671 DF, p-value: < 2.2e-16
Write the new regression equation.
^Amounti=3.49+0.02Agei+6.89MPHoveri
## $$
## \text{Amount} = 3.49 + 0.02(\text{Age}) + 6.89(\text{MPHover}) + \epsilon
## $$
What is the marginal effect of Age
on Amount
? What happened to it?
^β1=0.02. For every additional year old someone is, holding their speed constant, their expected fine increases by $0.02. The magnitude of the effect shrank and it switched from negative to positive!
What is the marginal effect of MPHover
on Amount
?
^β2=6.89. For every additional MPH someone drives above the speed limit, holding their Age constant, their expected fine increases by $6.89
What is ^β0, and what does it mean?
^β0=3.49. This is the expected fine for a driver of Age
=0 and MPHover
=0.
What is the adjusted ˉR2? What does it mean?
It is 0.5026. We can explain 50% of the variation in Amount
from variation in our model.
Now suppose both the 18 year old and the 40 year old each went 10 MPH over the speed limit. How big would the difference in expected fine be for the two drivers?
18-year-old driver:
^Amounti=^β0+^β1Agei+^β2MPHoveri=3.49+0.02(18)+6.89(10)=72.75
40-year-old driver:
^Amounti=^β0+^β1Agei+^β2MPHoveri=3.49+0.02(40)+6.89(10)=73.19
The difference is 72.75−73.19=−0.44 only!
# we can use R as a calculator and use the regression coefficients
# first we need to tidy reg2
reg2_tidy<-tidy(reg2)
# extract beta0 and save it
multi_beta_0<-reg2_tidy %>%
filter(term=="(Intercept)") %>%
select(estimate)
# extract beta1 and save it
multi_beta_1<-reg2_tidy %>%
filter(term=="Age") %>%
select(estimate)
# extract beta2 and save it
multi_beta_2<-reg2_tidy %>%
filter(term=="MPHover") %>%
select(estimate)
# predicted amount for 18 year old going 10 MPH over
amount_18yr_10mph<-multi_beta_0+multi_beta_1*18+multi_beta_2*10
# predicted amount for 40 year old going 10 MPH over
amount_40yr_10mph<-multi_beta_0+multi_beta_1*40+multi_beta_2*10
# difference
amount_18yr_10mph-amount_40yr_10mph # close, we have some rounding error!
estimate <dbl> | ||||
---|---|---|---|---|
-0.5492232 |
How about the difference in expected fine between two 18 year olds, one who went 10 MPH over, and one who went 30 MPH over?
18-year-old driver, 10 MPH over: we saw was $72.75.
18-year-old driver, 30 MPH over:
^Amounti=^β0+^β1Agei+^β2MPHoveri=3.49+0.02(18)+6.89(30)=210.55
The difference is now 210.55−72.75=−137.80!
estimate <dbl> | ||||
---|---|---|---|---|
137.835 |
So clearly the speed plays a much bigger role than age does!
Use the huxtable
package’s huxreg()
command to make a regression table of your two regressions: the one from question 4, and the one from question 7.
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:purrr':
##
## every
## The following object is masked from 'package:ggplot2':
##
## theme_grey
(1) | (2) | |
(Intercept) | 131.707 *** | 3.494 *** |
(0.886) | (0.954) | |
Age | -0.289 *** | 0.025 |
(0.025) | (0.018) | |
MPHover | 6.892 *** | |
(0.039) | ||
N | 31674 | 31674 |
R2 | 0.004 | 0.503 |
logLik | -172512.295 | -161520.081 |
AIC | 345030.591 | 323048.163 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Simple OLS | Mulitvariate OLS | |
Constant | 131.71 *** | 3.49 *** |
(0.89) | (0.95) | |
Age | -0.29 *** | 0.02 |
(0.02) | (0.02) | |
MPH Over | 6.89 *** | |
(0.04) | ||
N | 31674 | 31674 |
R-Squared | 0.00 | 0.50 |
SER | 56.13 | 39.67 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Look at that change in R2!
Are our two independent variables multicollinear? Do younger people tend to drive faster?
Get the correlation between Age
and MPHover
.
## Age MPHover
## Age 1.0000000 -0.1035001
## MPHover -0.1035001 1.0000000
Make a scatterplot of MPHover
on Age
.
Run an auxiliary regression of MPHover
on Age
.
##
## Call:
## lm(formula = MPHover ~ Age, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.683 -4.020 -0.488 2.512 59.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.541094 0.054399 304.07 <2e-16 ***
## Age -0.039007 0.001434 -27.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.056 on 68355 degrees of freedom
## Multiple R-squared: 0.01071, Adjusted R-squared: 0.0107
## F-statistic: 740.2 on 1 and 68355 DF, p-value: < 2.2e-16
Interpret the coefficient on Age
.
In our notation, this is ^δ1=−0.03, which says for every 1 year older someone is, they drive 0.03 MPH less over the speed limit.
Look at your regression table in question 10. What happened to the standard error on Age
? Why (consider the formula for variation in ^β1)
The formula for var[^β1]=11−R21×SER2n×var[Age]
The standard error of the regression (SER) went from 56.13 to 39.67!1 Adding the second variable created a much better fit, which lowered variance on the betas (and hence, standard error, the square root of variance).
Variance might have slightly increased to more than it otherwise would be, due to multicollinearity between Age
and MPHover
, which we investigate next.
Calculate the Variance Inflation Factor (VIF) using the car
package’s vif()
command.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Age MPHover
## 1.010149 1.010149
The VIF is 1.01 for ^β1 and ^β2. Variance increases only by 1% due to (weak) multicollinearity between Age
and MPHover
.
Calculate the VIF manually, using what you learned in this question.
VIF=11−R21, where R21 comes from the auxiliary regression of MPHover
on Age
(Part C). The R2 from that regression was 0.011, so:
VIF=11−R21=11−0.011=10.989≈1.011
## [1] 1.011122
Again, we have some slight rounding error.
Let’s now think about the omitted variable bias. Suppose the “true” model is the one we ran from Question 7.
Do you suppose that MPHover
fits the two criteria for omitted variable bias?
MPHover
probably affects Amount
Possibly, though it is only weakly correlated with Age (−0.10). Speed clearly has a big role to play in affecting and predicting fines, but probably does not add very much bias to the marginal effect of Age on Amount.
Look at the regression we ran in Question 4. Consider this the “omitted” regression, where we left out MPHover
. Does our estimate of the marginal effect of Age
on Amount
overstate or understate the true marginal effect?
Since there is negative correlation between MPHover
and Age
, ^β1 likely understates the effect of Age
on Amount.
Use the “true” model (Question 7), the “omitted” regression (Question 4), and our “auxiliary” regression (Question 11) to identify each of the following parameters that describe our biased estimate of the marginal effect of Age
on Amount
:3 α1=β1+β2δ1
Using our notation from class, and the regressions from questions 4, 7, and 11, we have three regressions:
^α1=β1+β2δ1−0.29=0.02+6.89(−0.04)
Where:
Age
on Amount
(holding MPHover
constant)MPHover
on Amount
(holding Age
constant)MPHover
on Age
We have some slight rounding error, but this is the relationship.
From your answer in part C, how large is the omitted variable bias from leaving out MPHover
?
^α1=β1+β2δ1⏟Bias−0.29=0.02+6.89(−0.04)⏟biasBias=−0.2756
Make a coefficient plot of your coefficients from the regression in Question 7.
# first we have to make sure to tidy our reg with confidence intervals!
reg2_tidy<-tidy(reg2, conf.int=TRUE)
ggplot(data = reg2_tidy)+
aes(x = estimate,
y = term,
color = term)+
geom_point()+ # point for estimate
# now make "error bars" using conf. int.
geom_segment(aes(x = conf.low,
xend = conf.high,
y=term,
yend=term))+
# add line at 0
geom_vline(xintercept=0, size=1, color="black", linetype="dashed")+
#scale_x_continuous(breaks=seq(-1.5,0.5,0.5),
# limits=c(-1.5,0.5))+
labs(x = expression(paste("Marginal effect, ", hat(beta[j]))),
y = "Variable")+
guides(color=F)+ # hide legend
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)