• Set Up
    • Question 1
    • Question 2
    • Question 3
    • Question 4
    • Question 5
      • Part A
      • Part B
      • Part C
      • Part D
    • Question 6
      • Part A
      • Part B
      • Part C
      • Part D
    • Question 7
      • Part A
      • Part B
    • Question 8
    • Question 9
      • Part A
      • Part B
      • Part C
      • Part D
    • Question 10
      • Part A
      • Part B
      • Part C
    • Question 11 (Optional)
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

Alternatively, I have made a project in RStudio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Question 1

Our data comes from FiveThirtyEight’s Trump Congress tracker. Download and read in (read_csv) the data.


library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# again, my path (on website) may be different than yours 
congress<-read_csv("../data/congress-trump-score.csv") 
## Parsed with column specification:
## cols(
##   congress = col_double(),
##   chamber = col_character(),
##   bioguide = col_character(),
##   last_name = col_character(),
##   state = col_character(),
##   district = col_double(),
##   party = col_character(),
##   votes = col_double(),
##   agree_pct = col_double(),
##   predicted_agree = col_double(),
##   net_trump_vote = col_double()
## )

Question 2

Look at the data with glimpse().


congress %>%
  glimpse()
## Observations: 1,735
## Variables: 11
## $ congress        <dbl> 0, 115, 116, 0, 115, 116, 0, 115, 116, 0, 115, 1…
## $ chamber         <chr> "house", "house", "house", "house", "house", "ho…
## $ bioguide        <chr> "A000055", "A000055", "A000055", "A000367", "A00…
## $ last_name       <chr> "Aderholt", "Aderholt", "Aderholt", "Amash", "Am…
## $ state           <chr> "AL", "AL", "AL", "MI", "MI", "MI", "NV", "NV", …
## $ district        <dbl> 4, 4, 4, 3, 3, 3, 2, 2, 2, 12, 12, 12, 31, 31, 3…
## $ party           <chr> "Republican", "R", "R", "Independent", "R", "I",…
## $ votes           <dbl> 141, 95, 46, 140, 96, 44, 140, 94, 46, 138, 92, …
## $ agree_pct       <dbl> 0.97872340, 0.96842105, 1.00000000, 0.62142857, …
## $ predicted_agree <dbl> 0.95556332, 0.94634916, 0.97459258, 0.77280357, …
## $ net_trump_vote  <dbl> 63.0, 63.0, 63.0, 9.4, 9.4, 9.4, 12.3, 12.3, 12.…

Question 3

We want to see how the percentage of votes on which a member of Congress agrees with President Trump (agree_pct) depends on the result of the 2016 presidential election in their district (net_trump_vote). First, plot a scatterplot of agree_pct on net_trump_vote. Add a regression line with an additional layer of geom_smooth(method="lm").


scatter<-ggplot(data = congress)+
  aes(x = net_trump_vote, y=agree_pct)+
  geom_point()+
  geom_smooth(method="lm")+
  theme_light()
scatter


Question 4

Find the correlation between agree_pct and net_trump_vote.


# using base R
cor(congress$net_trump_vote, congress$agree_pct)
## [1] 0.7784996
# using tidyverse
congress %>%
  summarize(cor = cor(net_trump_vote, agree_pct))
cor
<dbl>
0.7784996
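As a quick sanity check on what cor() reports, recall that the correlation coefficient is just the covariance scaled by the two standard deviations. A minimal base R sketch (x and y here are made-up numbers, purely to verify the identity):

```r
# cor() is covariance divided by the product of the standard deviations
x <- c(1, 3, 5, 7, 9)   # made-up data for illustration
y <- c(2, 3, 7, 8, 11)

manual_cor <- cov(x, y) / (sd(x) * sd(y))
all.equal(manual_cor, cor(x, y))  # TRUE
```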

Question 5

We want to predict the following model:

^agree_pct = ^β0 + ^β1 × net_trump_vote

Run a regression, and save it as an object. Now get a summary() of it.


reg<-lm(agree_pct ~ net_trump_vote, data = congress)
summary(reg)
## 
## Call:
## lm(formula = agree_pct ~ net_trump_vote, data = congress)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72261 -0.16975  0.03761  0.19851  0.70728 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5616451  0.0060307   93.13   <2e-16 ***
## net_trump_vote 0.0099218  0.0001922   51.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.251 on 1733 degrees of freedom
## Multiple R-squared:  0.6061, Adjusted R-squared:  0.6058 
## F-statistic:  2666 on 1 and 1733 DF,  p-value: < 2.2e-16

Part A

What is ^β0? What does it mean in the context of our question?


# I just want to show you how to extract these and put it into text with markdown
# This relies on the code in Question 6 Part B
library(broom)
reg_tidy<-reg %>%
  tidy()

beta_0_hat<-reg_tidy %>% # save this as beta_0_hat
  slice(1) %>% # look only at first row (intercept)
  pull(estimate) %>% # extract the value of the estimate
  round(.,2) # round to 2 decimal places

beta_0_hat # look at it
## [1] 0.56

^β0= 0.56, meaning the rate at which Members of Congress agree with the President when the Net Trump Vote was 0 (i.e. both candidates got 50%) is 0.56 (out of 1).1


Part B

What is ^β1? What does it mean in the context of our question?


beta_1_hat<-reg_tidy %>% # save this as beta_1_hat
  slice(2) %>% # look only at second row (net_trump_vote)
  pull(estimate) %>% # extract the value of the estimate
  round(.,2) # round to 2 decimal places

beta_1_hat # look at it
## [1] 0.01

^β1= 0.01, meaning for every additional Net Trump Vote % in a Member of Congress’ district, that Member will agree 0.01 more with the President (out of 1).
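With both estimates in hand, we can sketch a prediction by hand for a hypothetical district. The coefficients below are hard-coded from the summary() output above; the net Trump vote of 20 is just an illustrative value:

```r
# Coefficients copied from the summary() output above
beta_0_hat <- 0.5616451
beta_1_hat <- 0.0099218

# predicted agreement for a hypothetical district Trump won by 20 points
agree_hat <- beta_0_hat + beta_1_hat * 20
round(agree_hat, 2)  # 0.76
```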


Part C

What is R2? What does it mean?


# this uses the glance command in question 6 part C. 

r_sq<-glance(reg)$r.squared %>%
  round(.,2) # round to 2 decimals

r_sq # look at it
## [1] 0.61

R2= 0.61, meaning we have explained 61%2 of the variation in agree_pct.
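For a regression with a single X variable, R2 is just the square of the correlation coefficient we found in Question 4. A quick check (the correlation is hard-coded from that answer):

```r
r <- 0.7784996   # cor(net_trump_vote, agree_pct) from Question 4
round(r^2, 4)    # 0.6061, matching Multiple R-squared in summary()
```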


Part D

What is the SER? What does it mean?


# this uses the glance command in question 6 part C. 

ser<-glance(reg)$sigma %>%
  round(.,2) # round to 2 decimals

ser # look at it
## [1] 0.25

SER= 0.25, meaning the average size of the residuals or errors (distance from an actual data point to the predicted regression line) is 0.25 (out of 1).
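To see where that number comes from, here is a sketch of how the “residual standard error” is computed, on made-up toy data (with one X variable, the degrees of freedom are n − 2):

```r
# A sketch of how summary()'s residual standard error is computed
x <- c(1, 2, 3, 4, 5)          # made-up data for illustration
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
toy_reg <- lm(y ~ x)

SSR <- sum(residuals(toy_reg)^2)               # sum of squared residuals
ser_manual <- sqrt(SSR / df.residual(toy_reg)) # df = n - 2 here

all.equal(ser_manual, summary(toy_reg)$sigma)  # TRUE
```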


Question 6

We can look at regression outputs in a tidier way, with the broom package.

Part A

Install and then load broom.


#install.packages("broom")
library(broom)

Part B

Run the function tidy() on your regression object (saved in question 5). Save this as an object and then look at it.


reg_tidy<-reg %>%
  tidy()
reg_tidy
term             estimate     std.error     statistic  p.value
<chr>            <dbl>        <dbl>         <dbl>      <dbl>
(Intercept)      0.561645133  0.0060306814  93.13129   0
net_trump_vote   0.009921844  0.0001921538  51.63492   0

Part C

Run the glance() function on your original regression object. What does it show you?


reg %>%
  glance()
r.squared  adj.r.squared  sigma      statistic  p.value  df  logLik     AIC       BIC
0.6060616  0.6058343      0.2510093  2666.165   0        2   -62.62784  131.2557  147.632

It shows you summary statistics of the regression, particularly its goodness of fit.

  • r.squared is the R2 value
  • sigma is the SER (^σu)

That’s all we need for now.


Part D

Now run the augment() function on your original regression object, and save this as an object. Look at it. What does it show you?


reg_aug<-reg %>%
  augment()
reg_aug
agree_pct   net_trump_vote  .fitted      .se.fit      .resid         .hat
0.97872340  63.0            1.186721301  0.013732227  -0.2079978972  0.0029929687
0.96842105  63.0            1.186721301  0.013732227  -0.2183002488  0.0029929687
1.00000000  63.0            1.186721301  0.013732227  -0.1867213015  0.0029929687
0.62142857   9.4            0.654910466  0.006362055  -0.0334818950  0.0006424141
0.54166667   9.4            0.654910466  0.006362055  -0.1132437998  0.0006424141
0.79545455   9.4            0.654910466  0.006362055   0.1405440790  0.0006424141
0.97857143  12.3            0.683683814  0.006561991   0.2948876147  0.0006834260
0.98936170  12.3            0.683683814  0.006561991   0.3056778883  0.0006834260
0.95652174  12.3            0.683683814  0.006561991   0.2728379253  0.0006834260
0.12318841  -40.0           0.164771376  0.009584055  -0.0415829702  0.0014578700

It takes the original data points for Y and X and adds new variables with values for those observations:

  • .fitted is ^Yi, the predicted value of agree_pct for a specific net_trump_vote
  • .resid is ^ui, the residual or error for a specific net_trump_vote: ^ui = Yi − ^Yi, i.e. .resid = agree_pct − .fitted

That’s all we need right now.
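The identity .resid = Y − .fitted can be verified with the base R equivalents of augment’s columns, residuals() and fitted(). A toy sketch (x and y are made-up numbers):

```r
# Verifying that residuals are just y minus the fitted values
x <- c(1, 2, 3, 4, 5)          # made-up data for illustration
y <- c(1.2, 1.9, 3.1, 3.8, 5.2)
m <- lm(y ~ x)

all.equal(unname(residuals(m)), y - unname(fitted(m)))  # TRUE
```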


Question 7

Now let’s start looking at the residuals of the regression.

Part A

Take the augmented regression object from Question 6-D and use it as the source of your data to create a histogram, where x is .resid. Does it look roughly normal?


ggplot(data = reg_aug)+
  aes(x = .resid)+
  geom_histogram(color="white")+
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It does look roughly normal.


Part B

Take the augmented regression object and make a residual plot, which is a scatterplot where x is the normal x variable, and y is the .resid. Feel free to add a horizontal line at 0 with geom_hline(yintercept=0).


ggplot(data = reg_aug)+
  aes(x = net_trump_vote, y=.resid)+
  geom_point()+
  geom_hline(yintercept = 0, color = "red")+
  theme_classic()


Question 8

Now let’s try presenting your results in a regression table. Install and load huxtable, and run the huxreg() command. Your main input is your regression object you saved in Question 5. Feel free to customize the output of this table (see the slides).


#install.packages("huxtable")
library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:purrr':
## 
##     every
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey
# basic
huxreg(reg)
                 (1)
(Intercept)      0.562 ***
                 (0.006)
net_trump_vote   0.010 ***
                 (0.000)
N                1735
R2               0.606
logLik           -62.628
AIC              131.256
*** p < 0.001; ** p < 0.01; * p < 0.05.
# some customizations
huxreg("Agree with President (Proportion)" = reg,
       coefs = c("Intercept" = "(Intercept)",
                 "Net Trump Vote" = "net_trump_vote"),
       statistics = c("n" = "nobs",
                      "R-squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 4)
                 Agree with President (Proportion)
Intercept        0.5616 ***
                 (0.0060)
Net Trump Vote   0.0099 ***
                 (0.0002)
n                1735
R-squared        0.6061
SER              0.2510
*** p < 0.001; ** p < 0.01; * p < 0.05.

Question 9

Now let’s check for heteroskedasticity.

Part A

Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?


The errors look heteroskedastic. In the residual plot, there are clear downward-sloping clusters of points that sit above the line in some ranges of net_trump_vote and below it in others, so the spread of the residuals is not constant across X.


Part B

Install and load the lmtest package and run bptest on your regression object. Was the data heteroskedastic or homoskedastic?


#install.packages("lmtest")
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 6.0092, df = 1, p-value = 0.01423

Since the p-value was less than 0.05, we can reject the null hypothesis H0 (that the errors are homoskedastic), and conclude that the errors are heteroskedastic. We will need to fix them if we want accurate standard errors.
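The logic behind bptest() can be sketched by hand: regress the squared residuals on X, and if X helps explain them, the error variance depends on X. The (studentized) test statistic is n times the R2 of that auxiliary regression. A base R sketch on simulated data (the data-generating process below is made up so that heteroskedasticity is built in):

```r
# The idea behind the (studentized) Breusch-Pagan test, by hand
set.seed(1)                                    # simulated data for illustration
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 + 2 * x)  # error spread grows with x
m <- lm(y ~ x)

aux <- lm(residuals(m)^2 ~ x)                  # auxiliary regression
bp_stat <- length(x) * summary(aux)$r.squared  # LM statistic = n * R^2
p_val <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
p_val < 0.05                                   # reject homoskedasticity here
```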


Part C

Now let’s make some heteroskedasticity-robust standard errors. Install and load the estimatr package and use the lm_robust() command (instead of the lm() command) to run a new regression (and save it). Make sure you add se_type="stata" inside the command to calculate robust SEs. Look at it. What changes?


#install.packages("estimatr")
library(estimatr)

reg_rse<-lm_robust(agree_pct ~ net_trump_vote, data = congress, se_type="stata")
reg_rse
##                   Estimate   Std. Error  t value Pr(>|t|)    CI Lower
## (Intercept)    0.561645133 0.0060117565 93.42447        0 0.549854072
## net_trump_vote 0.009921844 0.0001493938 66.41403        0 0.009628833
##                  CI Upper   DF
## (Intercept)    0.57343619 1733
## net_trump_vote 0.01021486 1733

The standard errors on both estimates (^β0 and ^β1) change, but the estimates themselves do not change!

Note that the t values and Pr(>|t|) will change, and this command adds confidence intervals (CI) and degrees of freedom for t (DF), but we have not covered any of these yet!
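As an aside, se_type="stata" corresponds to the HC1 robust variance estimator. Here is a base R sketch of what that formula computes, on simulated heteroskedastic data; this illustrates the math, not estimatr’s actual internals:

```r
# What HC1 ("stata") robust standard errors compute, sketched in base R
set.seed(3)                                   # simulated data for illustration
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100) * (1 + abs(x))    # heteroskedastic errors
m <- lm(y ~ x)

X <- model.matrix(m)                          # n x k design matrix
u <- residuals(m)
n <- nrow(X); k <- ncol(X)

bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * u^2)                   # sum of u_i^2 * x_i x_i'
V_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

sqrt(diag(V_hc1))                             # robust SEs: intercept, slope
```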


Part D

Now let’s see this in a nice regression table. Use huxreg() again, but add both your original regression and your regression saved in part C. Notice any changes?


huxreg("Original" = reg, "Robust SEs" = reg_rse,
       number_format = 6) # need to see more decimal places!
                 Original        Robust SEs
(Intercept)      0.561645 ***    0.561645 ***
                 (0.006031)      (0.006012)
net_trump_vote   0.009922 ***    0.009922 ***
                 (0.000192)      (0.000149)
N                1735            1735
R2               0.606062        0.606062
logLik           -62.627843
AIC              131.255686
*** p < 0.001; ** p < 0.01; * p < 0.05.

Note that the standard errors are too small to register any difference within 3 decimal places (the default). If we were to make a true percent variable out of 100% instead of out of 1 (as I have you do in Question 11 Part B), we would see more of a difference. Look carefully: the standard error on net_trump_vote is 0.000192 (from Question 5), and the robust standard error is 0.000149 (from Part C).
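The rescaling point is easy to verify: multiplying the outcome by 100 multiplies every coefficient (and its standard error) by 100 as well, leaving t statistics and R2 unchanged. A toy check on made-up data:

```r
# Rescaling the outcome by 100 rescales the coefficients by 100
set.seed(5)                                 # made-up data for illustration
x <- rnorm(50)
y <- 0.5 + 0.01 * x + rnorm(50, sd = 0.2)
m1 <- lm(y ~ x)
m2 <- lm(I(y * 100) ~ x)

all.equal(unname(coef(m2)), unname(coef(m1)) * 100)  # TRUE
```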


Question 10

Now let’s check for outliers.

Part A

Just looking at the scatterplot in Question 3, do you see any outliers?


It looks like there might be one: the Member whose district’s net Trump vote was around 60 but who agrees with the President only about 0.6 of the time.


Part B

Install and load the car package. Run the outlierTest command on your regression object. Does it detect any outliers?


#install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
outlierTest(reg)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 1708 -2.886125          0.0039485           NA

Yes, surprisingly it’s not the point that looks like it stands out on the graph (middle-left), it’s a different point.
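outlierTest() ranks observations by their studentized residuals. The same idea can be sketched in base R with rstudent(); the toy data below plants a hypothetical outlier so there is something to find:

```r
# The observation with the largest studentized residual is the
# prime outlier suspect (toy data with a planted outlier)
set.seed(7)                         # made-up data for illustration
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
y[10] <- y[10] + 8                  # plant an outlier at row 10
m <- lm(y ~ x)

which.max(abs(rstudent(m)))         # row 10
```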


Part C

Look in your original data to match this outlier with an observation. Hint: use the slice() command, as the outlier test gave you an observation (row) number!


congress %>%
  slice(1708)
congress  chamber  bioguide  last_name  state  district  party  votes  agree_pct  predicted_agree  net_trump_vote
116       senate   T000464   Tester     MT     NA        D      24     0.0417     0.751            20.4

Question 11 (Optional)

This data is still a bit messy. Let’s check in on your tidyverse skills again! For example, we’d probably like to plot our scatterplots with colors for Republican and Democratic party. Or plot by the House and the Senate.

Part A

First, the variable congress (session of Congress) seems a bit off. Get a count() of congress.


congress %>%
  count(congress)
congress    n
0         647
115       552
116       536

Part B

Let’s get rid of the 0 values for congress (someone made a mistake coding this, probably). Also, while we’re at it, let’s take agree_pct and mutate a variable that is a proper percentage (i.e. *100).


congress_tidy<-congress %>%
  filter(congress!=0) %>%
  mutate(agree_pct = agree_pct * 100)

Part C

The variable party is also quite a mess. count() by party to see. Then let’s mutate a variable to make a better measure of political party - just "Republican", "Democrat", and "Independent". Try doing this with the case_when() command (as your mutate formula).3 When you’re done count() by your new party variable to make sure it worked.


congress_tidy <- congress_tidy %>%
  mutate(pol_party = case_when(
    party %in% c("Democrat", "D") ~ "Democrat",
    party %in% c("Republican", "R") ~ "Republican",
    party %in% c("Independent", "I") ~ "Independent"
  ))

congress_tidy %>%
  count(pol_party)
pol_party      n
Democrat     528
Independent    5
Republican   555
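As footnote 3 mentions, the same recode can be done with nested ifelse() commands, though it reads more awkwardly than case_when(). A self-contained sketch (the party vector below is made up to mimic the messy coding in the data):

```r
# The same recode with nested ifelse() instead of case_when()
party <- c("Democrat", "D", "Republican", "R", "Independent", "I")

pol_party <- ifelse(party %in% c("Democrat", "D"), "Democrat",
             ifelse(party %in% c("Republican", "R"), "Republican",
                    "Independent"))
pol_party
```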

Part D

Now plot a scatterplot (same as Question 3) and set color to your party variable. Notice R uses its own default colors, which don’t match to the actual colors these political parties use! Make a vector where you define the party colors as follows: party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray"). Then, run your plot again, adding the following line to customize the colors +scale_colour_manual("Parties", values = party_colors).4


party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")

p<-ggplot(data = congress_tidy)+
  aes(x = net_trump_vote, y=agree_pct)+
  geom_point(aes(color = pol_party))+
  geom_smooth(method="lm", color = "black")+
  scale_colour_manual("Parties", values = party_colors)+
  theme_light()
p


Part E

Now facet your scatterplot by chamber.


p+facet_wrap(~chamber)

# just for kicks
p+facet_grid(chamber~pol_party)


  1. Look at my markdown code to see that I can write my R code directly in my text with one backtick and r. This calls the beta_0_hat object I made.

  2. Again, look at my markdown: I am calling r_sq and multiplying it by 100 inside the code chunk.

  3. The syntax for case_when() is to have a series of condition ~ "Outcome" pairs, separated by commas. For example, one condition is to assign both "Democrat" and "D" to "Democrat", as in party %in% c("Democrat", "D") ~ "Democrat". You could also do this with a few ifelse() commands, but that’s a bit more awkward.

  4. "Parties" is the title that will show up on the legend; feel free to edit it, or remove the legend with another layer +guides(color = F).