• Set Up
    • Question 1
    • Question 2
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E
    • Question 3
    • Question 4
      • Part A
      • Part B
      • Part C
    • Question 5
      • Part A
      • Part B
    • Question 6
      • Part A
      • Part B
      • Part C
      • Part D
    • Question 7
    • Question 8
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E
      • Part F
      • Part G
      • Part H
    • Question 9
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E
    • Question 10

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

Alternatively, I have made a project in R Studio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Question 1

We are returning to the speeding tickets data that we began to explore in R Practice 15 on Multivariate Regression. Download and read in (read_csv) the data below.

This data again comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one’s we’ll look at are:

Variable Description
Amount Amount of fine (in dollars) assessed for speeding
Age Age of speeding driver (in years)
MPHover Miles per hour over the speed limit
Black Dummy =1 if driver was black, =0 if not
Hispanic Dummy =1 if driver was Hispanic, =0 if not
Female Dummy =1 if driver was female, =0 if not
OutTown Dummy =1 if driver was not from local town, =0 if not
OutState Dummy =1 if driver was not from local state, =0 if not
StatePol Dummy =1 if driver was stopped by State Police, =0 if stopped by other (local)

We again want to explore who gets fines, and how much.


library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
speed<-read_csv("../Data/speeding_tickets.csv")
## Parsed with column specification:
## cols(
##   Black = col_double(),
##   Hispanic = col_double(),
##   Female = col_double(),
##   Amount = col_double(),
##   MPHover = col_double(),
##   Age = col_double(),
##   OutTown = col_double(),
##   OutState = col_double(),
##   StatePol = col_double()
## )

Question 2

We will have to do a little bit of cleaning to get the data in a more usable form.

Part A

Inspect the data with str() or head() or glimpse() to see what it looks like.


str(speed)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 68357 obs. of  9 variables:
##  $ Black   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hispanic: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Female  : num  1 1 1 0 0 0 1 0 1 0 ...
##  $ Amount  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MPHover : num  14 15 15 13 12 17 15 15 15 15 ...
##  $ Age     : num  22 43 32 24 54 30 18 53 51 33 ...
##  $ OutTown : num  1 1 0 1 1 1 0 0 1 1 ...
##  $ OutState: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StatePol: num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Black = col_double(),
##   ..   Hispanic = col_double(),
##   ..   Female = col_double(),
##   ..   Amount = col_double(),
##   ..   MPHover = col_double(),
##   ..   Age = col_double(),
##   ..   OutTown = col_double(),
##   ..   OutState = col_double(),
##   ..   StatePol = col_double()
##   .. )
head(speed)
ABCDEFGHIJ0123456789
Black
<dbl>
Hispanic
<dbl>
Female
<dbl>
Amount
<dbl>
MPHover
<dbl>
Age
<dbl>
OutTown
<dbl>
OutState
<dbl>
StatePol
<dbl>
001NA1422100
001NA1543100
001NA1532000
000NA1324100
000NA1254100
000NA1730100
glimpse(speed)
## Observations: 68,357
## Variables: 9
## $ Black    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hispanic <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Female   <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0…
## $ Amount   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MPHover  <dbl> 14, 15, 15, 13, 12, 17, 15, 15, 15, 15, 13, 11, 15, 15,…
## $ Age      <dbl> 22, 43, 32, 24, 54, 30, 18, 53, 51, 33, 48, 36, 31, 45,…
## $ OutTown  <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ OutState <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ StatePol <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Part B

What class of variable are Black, Hispanic, Female, OutTown, and OutState?1


class(speed$Black)
## [1] "numeric"
class(speed$Hispanic)
## [1] "numeric"
class(speed$Female)
## [1] "numeric"
class(speed$OutTown)
## [1] "numeric"
class(speed$OutState)
## [1] "numeric"

Part C

Notice that when importing the data from the .csv file, R interpretted these variables as integer, but we want them to be factor variables, to ensure R recognizes that there are two groups (categories), 0 and 1. Convert each of these variables into factors by reassigning it according to the format:

df<-df %>%
  mutate(my_var=as.factor(my_var),
         my_var2=as.factor(my_var2))

where - df is the name of your dataframe - my_var and my_var2 are example variables2


speed<-speed %>%
  mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"),factor)

Part D

Confirm they are each now factors by checking their class again.


class(speed$Black)
## [1] "factor"
class(speed$Hispanic)
## [1] "factor"
class(speed$Female)
## [1] "factor"
class(speed$OutTown)
## [1] "factor"
class(speed$OutState)
## [1] "factor"

Part E

Get a summary() of Amount. Note that there are a lot of NA’s (these are people that were stopped but did not recieve a fine)! filter() only those observations for which Amount is a positive number, and save this in your dataframe (assign and overwrite it, or make a new dataframe). Then double check that this worked by getting a summary() of Amount again.

speed %>%
  select(Amount) %>%
  summary()
##      Amount     
##  Min.   :  3    
##  1st Qu.: 75    
##  Median :115    
##  Mean   :122    
##  3rd Qu.:155    
##  Max.   :725    
##  NA's   :36683
# alternatively, using base R:
summary(speed$Amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       3      75     115     122     155     725   36683
# keep only positive Amount
speed<-speed %>%
  filter(Amount>0)

# check new data
speed %>%
  select(Amount) %>%
  summary()
##      Amount   
##  Min.   :  3  
##  1st Qu.: 75  
##  Median :115  
##  Mean   :122  
##  3rd Qu.:155  
##  Max.   :725

Question 3

Create a scatterplot between Amount (as Y) and Female (as X).3


ggplot(data = speed)+
  aes(x = Female,
      y = Amount,
      color = Female)+
  geom_jitter(width=0.4)

# note if you had not made Female a factor, and kept it numeric,
# it might alter the plot.
# you can make Female a factor just for plotting by setting
# aes(x = as.factor(Female))

Question 4

Now let’s start looking at the distribution conditionally to find the different group means.

Part A

Find the mean and standard deviation4 of Amount for male drivers and again for female drivers.


# Summarize for Men

# save as male_summary
# we'll use this in next part

male_summary<-speed %>% 
  filter(Female==0) %>%
  summarize(mean = mean(Amount),
            sd = sd(Amount))

# look at it
male_summary
ABCDEFGHIJ0123456789
mean
<dbl>
sd
<dbl>
124.665458.28387
# Summarize for Women

# save as female_summary
# we'll use this in next part

female_summary<-speed %>%
  filter(Female==1) %>%
  summarize(mean = mean(Amount),
            sd = sd(Amount))

# look at it
female_summary
ABCDEFGHIJ0123456789
mean
<dbl>
sd
<dbl>
116.72651.48665

Part B

What is the difference between the average Amount for Males and Females?


# we can do this with R using our summary dataframes we made last part

male_avg<-male_summary %>%
  pull(mean) # pull extracts the value

male_avg #check
## [1] 124.6654
female_avg<-female_summary %>%
  pull(mean) 

female_avg #check
## [1] 116.726
# difference
male_avg-female_avg
## [1] 7.939397

Males get $7.94 higher fines on average than Females.


Part C

We did not go over how to do this in class, but you can run a t-test for the difference in group means to see if the difference is statistically significant. The syntax is similar for a regression:

t.test(y~d, data=df)

where: - y is the variable we are testing - d is the dummy variable

Is there a statistically significant difference between Amount for male and female drivers?5


t.test(Amount~Female, data = speed)
## 
##  Welch Two Sample t-test
## 
## data:  Amount by Female
## t = 12.356, df = 23400, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6.679941 9.198853
## sample estimates:
## mean in group 0 mean in group 1 
##        124.6654        116.7260

This gives us the confidence interval for the difference in average amount between men and women: (6.68, 9.20) and runs the following hypothesis test:

  • d=¯Amountmen¯Amountwomen
  • H0:d=0
  • Ha:d0

Since the test-statistic t is quite high, and the p-value is very small (basically 0), we have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis: there appears to be a statistically significant difference between average fines for men and women.


Question 5

Now run the following regression to ensure we get the same result

Amounti=^β0+^β1Femalei


female_reg<-lm(Amount~Female, data = speed)
summary(female_reg)
## 
## Call:
## lm(formula = Amount ~ Female, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.67  -49.67   -6.73   30.33  600.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 124.6654     0.3857  323.23   <2e-16 ***
## Female1      -7.9394     0.6698  -11.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.12 on 31672 degrees of freedom
## Multiple R-squared:  0.004416,   Adjusted R-squared:  0.004385 
## F-statistic: 140.5 on 1 and 31672 DF,  p-value: < 2.2e-16
# can also do broom's tidy()
library(broom)
tidy(female_reg)
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
(Intercept)124.6654230.3856913323.225920.000000e+00
Female1-7.9393970.6698475-11.852542.443741e-32

Part A

Write out the estimated regression equation.


^Amounti=124.677.94Femalei

# PUT CODE HERE

Part B

Use the regression coefficients to find

    1. the average Amount for men
    1. the average Amount for women
    1. the difference in average Amount between men and women

  • Males get fined $124.67 (^β0)
  • Females get fined 124.677.94=116.73 (^β0+^β1)
  • The difference is $7.94 (^β3)
# PUT CODE HERE

Question 6

Let’s recode the sex variable.

Part A

Make a new variable called Male and save it in your dataframe.6


speed<-speed %>%
  mutate(Male = factor(ifelse(test = Female==0,
                              yes = 1, # if it is (Women), code Male as 1
                              no = 0))) # if it isn't (Men), code Male as 0

# check to make sure it worked
speed %>%
  select(Male, Female) # just look at these two variables to avoid clutter
ABCDEFGHIJ0123456789
Male
<fctr>
Female
<fctr>
10
10
10
01
10
10
01
01
10
01

Part B

Run the same regression as in question 5, but use Male instead of Female.


male_reg<-lm(Amount~Male, data = speed)
summary(male_reg)
## 
## Call:
## lm(formula = Amount ~ Male, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.67  -49.67   -6.73   30.33  600.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.7260     0.5477  213.13   <2e-16 ***
## Male1         7.9394     0.6698   11.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.12 on 31672 degrees of freedom
## Multiple R-squared:  0.004416,   Adjusted R-squared:  0.004385 
## F-statistic: 140.5 on 1 and 31672 DF,  p-value: < 2.2e-16
tidy(male_reg)
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
(Intercept)116.7260260.5476659213.133630.000000e+00
Male17.9393970.669847511.852542.443741e-32

Part C

Write out the estimated regression equation.


^Amounti=116.73+7.94Malei

# PUT CODE HERE

Part D

Use the regression coefficients to find

    1. the average Amount for men
    1. the average Amount for women
    1. the difference in average Amount between men and women

  • Females get fined $116.73 (^β0)
  • Males get fined 116.73+7.94=$124.67 (^β0+^β1)
  • The difference is 7.94 (^β3)
# PUT CODE HERE

Question 7

Run a regression of Amount on Male and Female. What happens, and why?


dummy_variable_trap<-lm(Amount~Male+Female, data = speed)
summary(dummy_variable_trap)
## 
## Call:
## lm(formula = Amount ~ Male + Female, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.67  -49.67   -6.73   30.33  600.33 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.7260     0.5477  213.13   <2e-16 ***
## Male1         7.9394     0.6698   11.85   <2e-16 ***
## Female1           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.12 on 31672 degrees of freedom
## Multiple R-squared:  0.004416,   Adjusted R-squared:  0.004385 
## F-statistic: 140.5 on 1 and 31672 DF,  p-value: < 2.2e-16

Male and Female are perfectly multicollinear, as for every person i, Malei+Femalei=1. We can confirm this by seeing the correlation between Male and Female is exactly 1. To run a regression, we must exclude one of the dummies, and as we’ve seen, it makes no difference which one we exclude.

speed %>%
  select(Male, Female) %>%
  mutate_if(is.factor, as.numeric) %>% # make both factors into numeric
  cor() # correlation requires numeric data, can't have factors
##        Male Female
## Male      1     -1
## Female   -1      1

Question 8

Age probably has a lot to do with differences in fines, perhaps also age affects fines differences between males and females.

Part A

Run a regression of Amount on Age and Female. How does the coefficient on Female change?


age_reg<-lm(Amount~Age+Female, data = speed)
summary(age_reg)
## 
## Call:
## lm(formula = Amount ~ Age + Female, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.77  -45.63   -5.91   33.67  597.66 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 134.19965    0.90969  147.52   <2e-16 ***
## Age          -0.28598    0.02472  -11.57   <2e-16 ***
## Female1      -7.85172    0.66849  -11.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56 on 31671 degrees of freedom
## Multiple R-squared:  0.008604,   Adjusted R-squared:  0.008542 
## F-statistic: 137.4 on 2 and 31671 DF,  p-value: < 2.2e-16
tidy(age_reg)
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
(Intercept)134.19965060.90969259147.521980.000000e+00
Age-0.28597850.02472361-11.567026.985839e-31
Female1-7.85171920.66849063-11.745448.676153e-32

The coefficient on Female goes from 7.94 (question 5) to 7.85 when controlling for Age.


Part B

Now let’s see if the difference in fine between men and women are different depending on the driver’s age. Run the regression again, but add an interaction term between Female and Age interaction term.


# note you can also do Age*Female for the interaction term

interact_reg<-lm(Amount~Age+Female+Age:Female, data = speed)
summary(interact_reg)
## 
## Call:
## lm(formula = Amount ~ Age + Female + Age:Female, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126.35  -44.68   -5.49   33.49  597.29 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 135.54592    1.07428 126.173  < 2e-16 ***
## Age          -0.32636    0.03008 -10.848  < 2e-16 ***
## Female1     -12.02331    1.89296  -6.352 2.16e-10 ***
## Age:Female1   0.12435    0.05279   2.355   0.0185 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56 on 31670 degrees of freedom
## Multiple R-squared:  0.008778,   Adjusted R-squared:  0.008684 
## F-statistic: 93.49 on 3 and 31670 DF,  p-value: < 2.2e-16
tidy(interact_reg)
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
(Intercept)135.54591741.07428490126.1731580.000000e+00
Age-0.32635970.03008437-10.8481472.273784e-27
Female1-12.02330931.89296463-6.3515762.160090e-10
Age:Female10.12435430.052793672.3554781.850497e-02

Part C

Write out your estimated regression equation.


^Amounti=135.5512.02Femalei0.33Agei+0.12(Femalei×Agei)

# PUT CODE HERE

Part D

Interpret the interaction effect. Is it statistically significant?


The coefficient on the interaction term, ^β3 is 0.12. For every additional year of age, females can expect their fine to increase by $0.12 more than males gain for every additional year of age.

^beta3 has a standard error of 0.52, which means it has a t-statistic of 2.355 and p-value of 0.0185, so it is statistically significant (at the 95% level).

# PUT CODE HERE

Part E

Plugging in 0 or 1 as necessary, rewrite (on your paper) this regression as two separate equations, one for Males and one for Females.


For Males (Female=0):

^Amount=135.550.33Age12.02Female+0.12Female×Age=135.550.33Age12.02(0)+0.12(0)Age=135.550.33Age

For Females (Female=1):

^Amount=135.550.33Age12.02Female+0.12Female×Age=135.550.33Age12.02(1)+0.12(1)Age=(135.5512.02)+(0.33+0.12)Age=123.530.21Age

# Another way to do this is to run conditional regressions:

# subset data for only Males
speed_males<-speed %>%
  filter(Female==0)

# subset data for only Feales
speed_females<-speed %>%
  filter(Female==1)

# run a regression for each subset
male_age_reg<-lm(Amount~Age, data = speed_males)

tidy(male_age_reg)
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
(Intercept)135.54591741.11524335121.539320.000000e+00
Age-0.32635970.03123137-10.449741.694998e-25
female_age_reg<-lm(Amount~Age, data = speed_females)

tidy(female_age_reg)
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
(Intercept)123.52260811.4312878986.301720.000000e+00
Age-0.20200530.03983956-5.070474.036045e-07
library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:purrr':
## 
##     every
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey
huxreg("Males Only"=male_age_reg, "Females Only"=female_age_reg)
Males Only Females Only
(Intercept) 135.546 *** 123.523 ***
(1.115)    (1.431)   
Age -0.326 *** -0.202 ***
(0.031)    (0.040)   
N 21173         10501        
R2 0.005     0.002    
logLik -116063.355     -56274.760    
AIC 232132.710     112555.521    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Part F

Let’s try to visualize this. Make a scatterplot of Age (X) and Amount (Y) and include a regression line.


p1<-ggplot(data = speed, aes(x = Age, y = Amount))+
  geom_point()+
  geom_smooth(method="lm")
p1


Part G

Try adding to your base layer aes(), set color=Female. This will produce two lines and color the points by Female.7


p2<-ggplot(data = speed, aes(x = Age, y = Amount, color=Female))+
  geom_point()+
  geom_smooth(method="lm")
p2


Part H

Add a facet layer to make two different scatterplots with an additional layer +facet_wrap(~Female)


p2+facet_wrap(~Female)


Question 9

Now let’s look at the possible interaction between Sex (Male or Female) and whether a driver is from In-State or Out-of-State (OutState).

Part A

Use R to examine the data and find the mean for (i) Males In-State, (ii) Males Out-of-State, (iii) Females In-State, and (iv) Females Out-of-State.


# Summarize for Men in State

speed %>% 
  filter(Female==0,
         OutState==0) %>%
  summarize(mean = mean(Amount))
mean
124
# Summarize for Men Out of State

speed %>% 
  filter(Female==0,
         OutState==1) %>%
  summarize(mean = mean(Amount))
mean
128
# Summarize for Women in State

speed %>% 
  filter(Female==1,
         OutState==0) %>%
  summarize(mean = mean(Amount))
mean
115
# Summarize for Women Out of State

speed %>% 
  filter(Female==1,
         OutState==1) %>%
  summarize(mean = mean(Amount))
mean
124

Part B

Now run a regression of the following model:

Amounti=^β0+^β1Femalei+^β2OutStatei+^β3FemaleiOutStatei


sex_state_reg<-lm(Amount~Female+Female:OutState, data=speed)
summary(sex_state_reg)
## 
## Call:
## lm(formula = Amount ~ Female + Female:OutState, data = speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -120.68  -48.68   -8.68   31.32  601.32 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       123.6775     0.4391 281.650  < 2e-16 ***
## Female1            -8.8790     0.7541 -11.775  < 2e-16 ***
## Female0:OutState1   4.2915     0.9152   4.689 2.76e-06 ***
## Female1:OutState1   9.4672     1.3586   6.968 3.27e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.06 on 31670 degrees of freedom
## Multiple R-squared:  0.006629,   Adjusted R-squared:  0.006535 
## F-statistic: 70.44 on 3 and 31670 DF,  p-value: < 2.2e-16
tidy(sex_state_reg)
term estimate std.error statistic p.value
(Intercept) 124    0.439 282    0       
Female1 -8.88 0.754 -11.8  6.14e-32
Female0:OutState1 4.29 0.915 4.69 2.76e-06
Female1:OutState1 9.47 1.36  6.97 3.27e-12

Part C

Write out the estimated regression equation.


^Amounti=123.688.88Femalei+4.29OutStatei+5.17(Femalei×OutStatei)

# PUT CODE HERE

Part D

What does each coefficient mean?


  • ^β0=$123.68; mean for in-state males
  • ^β1=$8.88: difference between in-state males and females
  • ^β2=$4.29: difference between males in-state vs. out-of-state
  • ^β3=$5.17: difference between effect of being in-state vs. out-of-state between males vs. females (or, equivalently, difference between effect of being male vs. female between in-state vs. out-of-state)
# PUT CODE HERE

Part E

Using the regression equation, what are the means for

    1. Males In-State
    1. Males Out-of-State
    1. Females In-State
    1. Females Out-of-State?

Compare to your answers in part A.


  • Males In-State: ^β0=$123.68
  • Males Out-of-State: ^β0+^β2=123.68+4.29=$127.97
  • Females In-State: ^β0+^β1=123.688.88=$114.80
  • Females Out-of-State: ^β0+^β1+^β2+^β3=123.688.88+4.29+5.17=$124.26
# PUT CODE HERE

Question 10

Collect your regressions from questions 5, 6b, 8a, 8b, and 9b and output them in a regression table with huxtable().


library(huxtable)
huxreg(female_reg, male_reg, age_reg, interact_reg, sex_state_reg) # the default
(1) (2) (3) (4) (5)
(Intercept) 124.665 *** 116.726 *** 134.200 *** 135.546 *** 123.678 ***
(0.386)    (0.548)    (0.910)    (1.074)    (0.439)   
Female1 -7.939 ***          -7.852 *** -12.023 *** -8.879 ***
(0.670)             (0.668)    (1.893)    (0.754)   
Male1          7.939 ***                           
         (0.670)                              
Age                   -0.286 *** -0.326 ***         
                  (0.025)    (0.030)            
Age:Female1                            0.124 *           
                           (0.053)            
Female0:OutState1                                     4.291 ***
                                    (0.915)   
Female1:OutState1                                     9.467 ***
                                    (1.359)   
N 31674         31674         31674         31674         31674        
R2 0.004     0.004     0.009     0.009     0.007    
logLik -172510.224     -172510.224     -172443.461     -172440.687     -172474.987    
AIC 345026.449     345026.449     344894.922     344891.373     344959.974    
*** p < 0.001; ** p < 0.01; * p < 0.05.
# let's customize a bit
huxreg(female_reg,
       male_reg,
       age_reg,
       interact_reg,
       sex_state_reg,
       coefs = c("Constant" = "(Intercept)",
                 "Female" = "Female1",
                 "Male" = "Male1",
                 "Age" = "Age",
                 "Age*Female" = "Age:Female1",
                 "Out of State" = "Female0:OutState1",
                 "Female*Out of State" = "Female1:OutState1"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
(1) (2) (3) (4) (5)
Constant 124.67 *** 116.73 *** 134.20 *** 135.55 *** 123.68 ***
(0.39)    (0.55)    (0.91)    (1.07)    (0.44)   
Female -7.94 ***         -7.85 *** -12.02 *** -8.88 ***
(0.67)            (0.67)    (1.89)    (0.75)   
Male         7.94 ***                        
        (0.67)                           
Age                 -0.29 *** -0.33 ***        
                (0.02)    (0.03)           
Age*Female                         0.12 *          
                        (0.05)           
Out of State                                 4.29 ***
                                (0.92)   
Female*Out of State                                 9.47 ***
                                (1.36)   
N 31674        31674        31674        31674        31674       
R-Squared 0.00     0.00     0.01     0.01     0.01    
SER 56.12     56.12     56.00     56.00     56.06    
*** p < 0.001; ** p < 0.01; * p < 0.05.

  1. Hint use the class(df$variable) command to ask what class something is, where df is your dataframe, and variable is the name of a variable.↩︎

  2. As a bonus, you can try doing this with just one conditional command: mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"),factor). See our Data Wrangling slides for refreshers of all the fancy mutate() possibilities!↩︎

  3. Hint: Use geom_jitter() instead of geom_point() to plot the points, and play around with width settings inside geom_jitter()↩︎

  4. Hint: use the summarize() command, once you have properly filter()ed the data.↩︎

  5. Hint: this is like any hypothesis test. A t-value needs to be large enough to be greater than a critical value of t. Check the p-value and see if it is less than our standard of α=0.05.↩︎

  6. Hint: mutate() and define Male equal to factor(ifelse()). This makes the variable a factor (so we don’t have to later), and the ifelse() function has three arguments: 1. test = the condition to test, 2. yes = what to define “Male” as when the condition is TRUE, and 3. no = what to define “Male” as when the condition is FALSE.↩︎

  7. Sometimes we may also need to remind R that Female is a factor with as.factor(Female). We don’t need to in this case because we already reset Female as a faction in question 1.↩︎