• Set Up
    • Question 1
    • Question 2
    • Question 3
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E
      • Part F
      • Part G
      • Part H
      • Part I
    • Question 4
    • Question 5
    • Question 6
      • Part A
      • Part B
      • Part C
      • Part D
    • Question 7
    • Question 8
      • Part A
      • Part B

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

Alternatively, I have made a project in RStudio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Question 1

We are returning to the speeding tickets data that we began to explore in R Practice 15 on Multivariate Regression and R Practice 19 on Dummy Variables & Interaction Effects. Download and read in (read_csv) the data below.

This data again comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we’ll look at are:

Variable  Description
Amount    Amount of fine (in dollars) assessed for speeding
Age       Age of speeding driver (in years)
MPHover   Miles per hour over the speed limit
Black     Dummy =1 if driver was black, =0 if not
Hispanic  Dummy =1 if driver was Hispanic, =0 if not
Female    Dummy =1 if driver was female, =0 if not
OutTown   Dummy =1 if driver was not from local town, =0 if not
OutState  Dummy =1 if driver was not from local state, =0 if not
StatePol  Dummy =1 if driver was stopped by State Police, =0 if stopped by other (local)

We again want to explore who gets fines, and how much.


library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
speed<-read_csv("https://metricsf19.classes.ryansafner.com/data/speeding_tickets.csv")
## Parsed with column specification:
## cols(
##   Black = col_double(),
##   Hispanic = col_double(),
##   Female = col_double(),
##   Amount = col_double(),
##   MPHover = col_double(),
##   Age = col_double(),
##   OutTown = col_double(),
##   OutState = col_double(),
##   StatePol = col_double()
## )

Question 2

Run a regression of Amount on Age. Write out the estimated regression equation, and interpret the coefficient on Age.


reg_linear<-lm(Amount ~ Age, data = speed)
summary(reg_linear)
## 
## Call:
## lm(formula = Amount ~ Age, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123.21  -46.58   -5.92   32.55  600.24 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 131.70665    0.88649  148.57   <2e-16 ***
## Age          -0.28927    0.02478  -11.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.13 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.004286,   Adjusted R-squared:  0.004254 
## F-statistic: 136.3 on 1 and 31672 DF,  p-value: < 2.2e-16

^Amount_i = 131.71 - 0.29 Age_i

For every year of age, expected fines decrease by $0.29.
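For example, we can use the estimated equation to predict the fine for a driver of a given age; a quick sketch plugging the coefficient estimates printed above into R:

```r
# coefficient estimates from the regression output above
b0 <- 131.70665   # intercept
b1 <- -0.28927    # coefficient on Age

# predicted fine for a 25-year-old driver
b0 + b1 * 25      # about $124.47
```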


Question 3

Is the effect of Age on Amount nonlinear? Let’s run a quadratic regression.

Part A

Create a new variable for Age2. Then run a quadratic regression.


# make Age_sq variable
speed<-speed %>%
  mutate(Age_sq=Age^2)

# run quadratic regression
reg_quad<-lm(Amount ~ Age + Age_sq, data = speed) 
summary(reg_quad) 
## 
## Call:
## lm(formula = Amount ~ Age + Age_sq, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.54  -44.96   -5.25   33.25  599.88 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 146.746357   2.269451  64.662  < 2e-16 ***
## Age          -1.173833   0.125360  -9.364  < 2e-16 ***
## Age_sq        0.011357   0.001578   7.198 6.25e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.08 on 31671 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.005912,   Adjusted R-squared:  0.005849 
## F-statistic: 94.18 on 2 and 31671 DF,  p-value: < 2.2e-16

Part B

Try running the same regression using the alternate notation: lm(Y~X+I(X^2)). This method allows you to not have to create a new variable first. Do you get the same results?


reg_quad_alt<-lm(Amount ~ Age + I(Age^2), data = speed) 
summary(reg_quad_alt) 
## 
## Call:
## lm(formula = Amount ~ Age + I(Age^2), data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.54  -44.96   -5.25   33.25  599.88 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 146.746357   2.269451  64.662  < 2e-16 ***
## Age          -1.173833   0.125360  -9.364  < 2e-16 ***
## I(Age^2)      0.011357   0.001578   7.198 6.25e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.08 on 31671 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.005912,   Adjusted R-squared:  0.005849 
## F-statistic: 94.18 on 2 and 31671 DF,  p-value: < 2.2e-16

Yes, this is the same output.


Part C

Write out the estimated regression equation.


^Amounti=146.751.17Agei+0.01Age2i


Part D

Is this model an improvement from the linear model?1


Yes, a slight improvement. R2 went from 0.004 on the linear model to 0.006 on the quadratic model.
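One quick way to compare fits in R is to pull `r.squared` out of `summary()`. A minimal sketch on simulated data (the data-generating process here is made up for illustration; note a quadratic model can never have a lower R2 than the linear model it nests):

```r
set.seed(1)
age <- runif(500, 16, 70)  # hypothetical ages
amount <- 147 - 1.17 * age + 0.011 * age^2 + rnorm(500, 0, 56)  # hypothetical fines

linear    <- lm(amount ~ age)
quadratic <- lm(amount ~ age + I(age^2))

summary(linear)$r.squared     # R-squared of the linear model
summary(quadratic)$r.squared  # R-squared of the quadratic model (at least as high)
```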


Part E

Write an equation for the marginal effect of Age on Amount.


The marginal effect is measured by the first derivative of the regression equation with respect to Age. But you can just remember the resulting formula and plug in the parameters:

dY/dX = β1 + 2β2X

dAmount/dAge = -1.17 + 2(0.01) Age = -1.17 + 0.02 Age

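Plugging the rounded coefficients into this formula in R (Part F below repeats this with the exact stored coefficients):

```r
beta_1 <- -1.17  # rounded coefficient on Age
beta_2 <-  0.01  # rounded coefficient on Age squared

# marginal effect of one more year of Age on Amount
marginal_effect <- function(age) beta_1 + 2 * beta_2 * age
marginal_effect(30)  # -1.17 + 0.02 * 30 = -0.57
```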

Part F

Predict the marginal effect on Amount of being one year older when you are 18. How about when you are 40?


For 18 year olds:

dAmount/dAge = -1.17 + 0.02(18) = -1.17 + 0.36 = -$0.81

For 40 year olds:

dAmount/dAge = -1.17 + 0.02(40) = -1.17 + 0.80 = -$0.37

# Let's do this in R:

# we need broom
library(broom)
tidy_reg_quad<-tidy(reg_quad)

tidy_reg_quad
## term         estimate      std.error    statistic  p.value
## (Intercept)  146.74635707  2.269451041  64.661610  0.000000e+00
## Age           -1.17383275  0.125359691  -9.363718  8.189882e-21
## Age_sq         0.01135719  0.001577841   7.197932  6.249087e-13
# save beta 1 
quad_beta_1<-tidy_reg_quad %>%
  filter(term=="Age") %>%
  pull(estimate)

# save beta 2
quad_beta_2<-tidy_reg_quad %>%
  filter(term=="Age_sq") %>%
  pull(estimate)

# create function to estimate marginal effects
marginal_effect<-function(x){
  return(quad_beta_1+2*quad_beta_2*x)
}

# run the function on the 18-year-old and the 40-year-old
marginal_effect(c(18,40))
## [1] -0.7649738 -0.2652574
# close enough, we had some rounding error

Part G

Our quadratic function is a U-shape. According to the model, at what age is the amount of the fine minimized?


We can set the derivative equal to 0, or you can just remember the formula and plug in the parameters:

dY/dX = β1 + 2β2X
0 = β1 + 2β2X
-β1 = 2β2X
-(1/2)(β1/β2) = X

Plugging in our estimates:

Age = -(1/2)(-1.17/0.01) = (1/2)(117) = 58.5

# Let's do this in R:

-0.5*(quad_beta_1/quad_beta_2)
## [1] 51.67795
# again, some rounding error

Part H

Create a scatterplot between Amount and Age and add a layer of a linear regression (as always), and an additional layer of your predicted quadratic regression curve. The regression curve, just like any regression line, is a geom_smooth() layer on top of the geom_point() layer. We will need to customize geom_smooth() to geom_smooth(method="lm", formula="y~x+I(x^2)") (copy/paste this verbatim)! This is the same as a regression line (method="lm"), but we are modifying the formula to a polynomial of degree 2 (quadratic): y = a + bx + cx^2.


ggplot(data=speed)+
  aes(x = Age,
      y = Amount)+
  geom_point()+
  geom_smooth(method="lm",
              formula="y~poly(x,2)",
              color="red")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).
## Warning: Removed 36683 rows containing missing values (geom_point).


Part I

It’s quite hard to see the quadratic curve with all those data points. Make another plot, and this time keep only the quadratic geom_smooth() layer, leaving out the geom_point() layer. This will plot only the regression curve.


ggplot(data=speed)+
  aes(x = Age,
      y = Amount)+
  geom_smooth(method="lm",
              formula="y~poly(x,2)",
              color="red")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).


Question 4

Should we use a higher-order polynomial equation? Run a cubic regression, and determine whether it is necessary.


reg_cube<-lm(Amount ~ Age + I(Age^2) + I(Age^3), data = speed)
summary(reg_cube)
## 
## Call:
## lm(formula = Amount ~ Age + I(Age^2) + I(Age^3), data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.83  -44.71   -5.53   33.21  600.01 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.519e+02  5.558e+00  27.337  < 2e-16 ***
## Age         -1.612e+00  4.453e-01  -3.619 0.000296 ***
## I(Age^2)     2.240e-02  1.089e-02   2.058 0.039646 *  
## I(Age^3)    -8.457e-05  8.251e-05  -1.025 0.305363    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.08 on 31670 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.005945,   Adjusted R-squared:  0.005851 
## F-statistic: 63.13 on 3 and 31670 DF,  p-value: < 2.2e-16

^Amount_i = 151.9 - 1.61 Age_i + 0.02 Age_i^2 - 0.00008 Age_i^3

The t-statistic on Age^3 is small (-1.03) and the p-value is 0.31, so the cubic term does not have a statistically significant effect on Amount. We should not include it.

Just for fun, would the cubic model look any better?

ggplot(data=speed)+
  aes(x = Age,
      y = Amount)+
  geom_point()+
  geom_smooth(method="lm", formula="y~x+I(x^2)", color="red")+
  geom_smooth(method="lm", formula="y~x+I(x^2)+I(x^3)", color="orange")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).

## Warning: Removed 36683 rows containing non-finite values (stat_smooth).
## Warning: Removed 36683 rows containing missing values (geom_point).

ggplot(data=speed)+
  aes(x = Age,
      y = Amount)+
  geom_smooth(method="lm", formula="y~x+I(x^2)", color="red")+
  geom_smooth(method="lm", formula="y~x+I(x^2)+I(x^3)", color="orange")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).

## Warning: Removed 36683 rows containing non-finite values (stat_smooth).


Question 5

Run an F-test to check if a nonlinear model is appropriate. Your null hypothesis is H0: β2 = β3 = 0 from the cubic regression above. The command is linearHypothesis(reg_name, c("var1", "var2")), where reg_name is the name of the lm object you saved your regression in, and var1 and var2 (or more), in quotes, are the names of the variables you are testing. This function requires (installing and) loading the “car” package (additional regression tools).


# install.packages("car") # install once if you don't have
library("car") # load car package
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
linearHypothesis(reg_cube, c("I(Age^2)", "I(Age^3)")) # F-test 
##   Res.Df      RSS Df Sum of Sq        F       Pr(>F)
## 1  31672 99768711
## 2  31670 99602463  2  166248.3 26.43048 3.395905e-12

We get a large F of 26.43, with a very small p-value. Therefore, we can reject the null hypothesis that the model is linear (β2 = 0, β3 = 0): we should not use a linear model. Note this does not tell us whether the model should be quadratic or cubic (or even logarithmic of some sort), only that it is not linear. Remember, this was a joint hypothesis test of all of the nonlinear terms (β2 and β3)!
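We can verify this F-statistic by hand from the two residual sums of squares in the table, using F = [(RSS_restricted - RSS_unrestricted)/q] / [RSS_unrestricted/df_unrestricted]:

```r
rss_r <- 99768711   # RSS of the restricted (linear) model
rss_u <- 99602463   # RSS of the unrestricted (cubic) model
q     <- 2          # number of restrictions (beta_2 = beta_3 = 0)
df_u  <- 31670      # residual degrees of freedom of the unrestricted model

F_stat <- ((rss_r - rss_u) / q) / (rss_u / df_u)
F_stat  # about 26.43, matching linearHypothesis()
```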


Question 6

Now let’s take a look at speed (MPHover, the number of miles per hour over the speed limit).

Part A

Creating new variables as necessary, run a linear-log model of Amount on MPHover. Write down the estimated regression equation, and interpret the coefficient on MPHover (^β1). Make a scatterplot with the regression line.2


# create log of MPHover
speed<-speed %>%
  mutate(log_mph=log(MPHover))

# Run linear-log regression
linear_log_reg<-lm(Amount ~ log_mph, data = speed)
summary(linear_log_reg)
## 
## Call:
## lm(formula = Amount ~ log_mph, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -190.61  -16.44    8.56   20.52  425.33 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -200.0975     1.9401  -103.1   <2e-16 ***
## log_mph      115.7544     0.6922   167.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.99 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.4689, Adjusted R-squared:  0.4689 
## F-statistic: 2.796e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
# note we could have done this without creating the variable
# just take log() inside the regression:
linear_log_reg_alt<-lm(Amount ~ log(MPHover), data = speed)
summary(linear_log_reg)
## 
## Call:
## lm(formula = Amount ~ log_mph, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -190.61  -16.44    8.56   20.52  425.33 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -200.0975     1.9401  -103.1   <2e-16 ***
## log_mph      115.7544     0.6922   167.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.99 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.4689, Adjusted R-squared:  0.4689 
## F-statistic: 2.796e+04 on 1 and 31672 DF,  p-value: < 2.2e-16

^Amount_i = -200.10 + 115.75 ln(MPHover_i)

A 1% increase in speed (over the speed limit) increases the fine by 115.75/100 ≈ $1.16.
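To see where the $1.16 comes from: in a linear-log model, a 1% increase in X changes Y by approximately β1/100 (exactly, by β1 · ln(1.01)). Checking with the unrounded estimate:

```r
beta_1 <- 115.7544   # coefficient on log(MPHover)

beta_1 / 100         # approximate effect of a 1% speed increase: ~$1.16
beta_1 * log(1.01)   # exact effect: ~$1.15
```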


Part B

Creating new variables as necessary, run a log-linear model of Amount on MPHover. Write down the estimated regression equation, and interpret the coefficient on MPHover (^β1). Make a scatterplot with the regression line.3


# create log of Amount
speed<-speed %>%
  mutate(log_Amount=log(Amount))

# Run log-linear regression
log_linear_reg<-lm(log_Amount ~ MPHover, data = speed)
summary(log_linear_reg)
## 
## Call:
## lm(formula = log_Amount ~ MPHover, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2353 -0.1895  0.0929  0.2733  1.2970 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.8790452  0.0058716   660.6   <2e-16 ***
## MPHover     0.0484962  0.0003256   148.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3355 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.4119, Adjusted R-squared:  0.4119 
## F-statistic: 2.219e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
# again we could have done this without creating the variable
# just take log() inside the regression:
log_linear_reg_alt<-lm(log(Amount) ~ MPHover, data = speed)
summary(log_linear_reg_alt)
## 
## Call:
## lm(formula = log(Amount) ~ MPHover, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2353 -0.1895  0.0929  0.2733  1.2970 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.8790452  0.0058716   660.6   <2e-16 ***
## MPHover     0.0484962  0.0003256   148.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3355 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.4119, Adjusted R-squared:  0.4119 
## F-statistic: 2.219e+04 on 1 and 31672 DF,  p-value: < 2.2e-16

^ln(Amount_i) = 3.88 + 0.05 MPHover_i

For every 1 MPH in speed (over the speed limit), expected fine increases by 0.05×100%=5%
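The 5% figure is the usual approximation 100·β1; the exact percent change is 100·(e^β1 − 1). Checking both with the unrounded estimate:

```r
beta_1 <- 0.0484962       # coefficient on MPHover

100 * beta_1              # approximate: ~4.85% per MPH over
100 * (exp(beta_1) - 1)   # exact: ~4.97% per MPH over
```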


Part C

Creating new variables as necessary, run a log-log model of Amount on MPHover. Write down the estimated regression equation, and interpret the coefficient on MPHover (^β1). Make a scatterplot with the regression line.4


# Run log-log regression
log_log_reg<-lm(log_Amount ~ log_mph, data = speed)
summary(log_log_reg)
## 
## Call:
## lm(formula = log_Amount ~ log_mph, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1417 -0.1727  0.1034  0.2319  2.3669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.30860    0.01564   147.6   <2e-16 ***
## log_mph      0.86196    0.00558   154.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3304 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.4297, Adjusted R-squared:  0.4297 
## F-statistic: 2.386e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
# again we could have done this just taking log()s inside the regression:
log_log_reg_alt<-lm(log(Amount) ~ log(MPHover), data = speed)
summary(log_log_reg_alt)
## 
## Call:
## lm(formula = log(Amount) ~ log(MPHover), data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1417 -0.1727  0.1034  0.2319  2.3669 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.30860    0.01564   147.6   <2e-16 ***
## log(MPHover)  0.86196    0.00558   154.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3304 on 31672 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.4297, Adjusted R-squared:  0.4297 
## F-statistic: 2.386e+04 on 1 and 31672 DF,  p-value: < 2.2e-16

^ln(Amount_i) = 2.31 + 0.86 ln(MPHover_i)

For every 1% increase in speed (over the speed limit), expected fine increases by 0.86%.


Part D

Which of the three log models has the best fit?5


We can compare the R2’s of the three models or compare scatterplots with the regression lines. I will make a table of the three regressions with huxreg for easy comparison of fit:

library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:purrr':
## 
##     every
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey
huxreg("Linear-Log" = linear_log_reg,
       "Log-Linear" = log_linear_reg,
       "Log-Log" = log_log_reg,
       coefs = c("Constant" = "(Intercept)",
                 "MPH Over" = "MPHover",
                 "Log(MPH Over)" = "log_mph"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
                Linear-Log    Log-Linear    Log-Log
Constant        -200.10 ***      3.88 ***     2.31 ***
                  (1.94)        (0.01)       (0.02)
MPH Over                         0.05 ***
                                (0.00)
Log(MPH Over)    115.75 ***                   0.86 ***
                  (0.69)                     (0.01)
N               31674         31674         31674
R-Squared           0.47          0.41          0.43
SER                40.99          0.34          0.33
*** p < 0.001; ** p < 0.01; * p < 0.05.

It appears the linear-log model has the best fit with the highest R2 out of the three, but not by very much.

If we wanted to compare scatterplots:

ggplot(data = speed)+
  aes(x = log_mph,
      y = Amount)+
  geom_point()+
  geom_smooth(method="lm", color="red")+
  labs(title="Linear-Log Model")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).
## Warning: Removed 36683 rows containing missing values (geom_point).

ggplot(data = speed)+
  aes(x = MPHover,
      y = log_Amount)+
  geom_point()+
  geom_smooth(method="lm", color="red")+
  labs(title="Log-Linear Model")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).
## Warning: Removed 36683 rows containing missing values (geom_point).

ggplot(data = speed)+
  aes(x = log_mph,
      y = log_Amount)+
  geom_point()+
  geom_smooth(method="lm", color="red")+
  labs(title="Log-Log Model")
## Warning: Removed 36683 rows containing non-finite values (stat_smooth).
## Warning: Removed 36683 rows containing missing values (geom_point).


Question 7

Return to the quadratic model. Run a quadratic regression of Amount on Age, Age^2, MPHover, and all of the race dummy variables. Test the null hypothesis: “the race of the driver has no effect on Amount.”


full_reg<-lm(Amount~Age+Age_sq+MPHover+Black+Hispanic, data=speed)
summary(full_reg)
## 
## Call:
## lm(formula = Amount ~ Age + Age_sq + MPHover + Black + Hispanic, 
##     data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -308.22  -19.61    7.46   24.77  226.48 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.676048   1.782585   4.867 1.14e-06 ***
## Age         -0.278472   0.088851  -3.134 0.001725 ** 
## Age_sq       0.003914   0.001118   3.501 0.000464 ***
## MPHover      6.887107   0.038754 177.716  < 2e-16 ***
## Black       -1.641981   1.018421  -1.612 0.106911    
## Hispanic     2.482807   1.057868   2.347 0.018932 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.66 on 31668 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.5029, Adjusted R-squared:  0.5029 
## F-statistic:  6408 on 5 and 31668 DF,  p-value: < 2.2e-16
library(car)
linearHypothesis(full_reg, c("Black", "Hispanic"))
  Res.Df      RSS Df Sum of Sq    F Pr(>F)
3.17e+04 4.98e+07
3.17e+04 4.98e+07  2  1.34e+04 4.27 0.0139

With F = 4.27 and p = 0.0139 < 0.05, we reject the null hypothesis that βBlack = βHispanic = 0: the race of the driver does have a statistically significant effect on Amount at the 5% level.

Question 8

Now let’s try standardizing variables. Let’s try running a regression of Amount on Age and MPHover, but standardizing each variable.

Part A

Create new standardized variables for Amount, Age, and MPHover.6


# make standardized variables
speed<-speed %>%
  mutate(std_Amount = scale(Amount),
         std_Age = scale(Age),
         std_MPH = scale(MPHover))
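As the hint notes, scale() just subtracts the mean and divides by the standard deviation. A quick check on a toy vector:

```r
x <- c(2, 4, 6, 8)

std_manual <- (x - mean(x)) / sd(x)   # standardize by hand
std_scale  <- as.numeric(scale(x))    # scale() does the same thing

all.equal(std_manual, std_scale)      # TRUE
```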

Part B

Run a regression of standardized Amount on standardized Age and MPHover. Interpret ^β1 and ^β2. Which variable has a bigger effect on Amount?


# make standardized variables
std_reg<-lm(std_Amount ~ std_Age + std_MPH, data = speed)
summary(std_reg)
## 
## Call:
## lm(formula = std_Amount ~ std_Age + std_MPH, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4896 -0.3517  0.1366  0.4579  4.0234 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.234516   0.004205 -55.768   <2e-16 ***
## std_Age      0.005986   0.004220   1.418    0.156    
## std_MPH      0.622823   0.003496 178.130   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7053 on 31671 degrees of freedom
##   (36683 observations deleted due to missingness)
## Multiple R-squared:  0.5026, Adjusted R-squared:  0.5026 
## F-statistic: 1.6e+04 on 2 and 31671 DF,  p-value: < 2.2e-16

^β1: a 1 standard deviation increase in Age is associated with a 0.006 standard deviation increase in Amount (not statistically significant). ^β2: a 1 standard deviation increase in MPHover is associated with a 0.62 standard deviation increase in Amount. Speed has a much bigger effect on fines than age.


  1. Check R2.↩︎

  2. Hint: The simple geom_smooth(method="lm") is sufficient, so long as you use the right variables on the plot!↩︎

  3. Hint: The simple geom_smooth(method="lm") is sufficient, so long as you use the right variables on the plot!↩︎

  4. Hint: The simple geom_smooth(method="lm") is sufficient, so long as you use the right variables on the plot!↩︎

  5. Hint: Check R2↩︎

  6. Hint: use the scale() function inside of mutate()↩︎