To minimize confusion, I suggest creating a new R Project (e.g. `regression_practice`) and storing any data in that folder on your computer. Alternatively, I have made a project in RStudio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Our data comes from FiveThirtyEight's Trump Congress tracker. Download and read in (`read_csv()`) the data.
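A minimal sketch of this step. The filename here is hypothetical (use whatever name you saved the download as); `congress` is the object name used in the rest of this document:

```r
library(tidyverse)

# read the downloaded CSV into an object (filename is an assumption)
congress <- read_csv("trump-congress-tracker.csv")
```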
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## congress = col_double(),
## chamber = col_character(),
## bioguide = col_character(),
## last_name = col_character(),
## state = col_character(),
## district = col_double(),
## party = col_character(),
## votes = col_double(),
## agree_pct = col_double(),
## predicted_agree = col_double(),
## net_trump_vote = col_double()
## )
Look at the data with `glimpse()`.
## Observations: 1,735
## Variables: 11
## $ congress <dbl> 0, 115, 116, 0, 115, 116, 0, 115, 116, 0, 115, 1…
## $ chamber <chr> "house", "house", "house", "house", "house", "ho…
## $ bioguide <chr> "A000055", "A000055", "A000055", "A000367", "A00…
## $ last_name <chr> "Aderholt", "Aderholt", "Aderholt", "Amash", "Am…
## $ state <chr> "AL", "AL", "AL", "MI", "MI", "MI", "NV", "NV", …
## $ district <dbl> 4, 4, 4, 3, 3, 3, 2, 2, 2, 12, 12, 12, 31, 31, 3…
## $ party <chr> "Republican", "R", "R", "Independent", "R", "I",…
## $ votes <dbl> 141, 95, 46, 140, 96, 44, 140, 94, 46, 138, 92, …
## $ agree_pct <dbl> 0.97872340, 0.96842105, 1.00000000, 0.62142857, …
## $ predicted_agree <dbl> 0.95556332, 0.94634916, 0.97459258, 0.77280357, …
## $ net_trump_vote <dbl> 63.0, 63.0, 63.0, 9.4, 9.4, 9.4, 12.3, 12.3, 12.…
We want to see how the percentage of the time that a member of Congress agrees with President Trump (`agree_pct`) depends on the result of the 2016 presidential election in their district (`net_trump_vote`). First, plot a scatterplot of `agree_pct` on `net_trump_vote`. Add a regression line with an additional layer of `geom_smooth(method = "lm")`.
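One way to build this plot, assuming the data object is named `congress`:

```r
library(tidyverse)

ggplot(data = congress)+
  aes(x = net_trump_vote, y = agree_pct)+
  geom_point()+
  geom_smooth(method = "lm") # adds the OLS regression line
```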
Find the correlation between `agree_pct` and `net_trump_vote`.
## [1] 0.7784996
| cor \<dbl\> |
|---|
| 0.7784996 |
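A sketch of one tidyverse way to get this number, assuming the data object is named `congress`:

```r
# summarize() returns the correlation as a 1x1 tibble
congress %>%
  summarize(cor = cor(agree_pct, net_trump_vote))
```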
We want to estimate the following model:

$$\widehat{\text{agree\_pct}} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{net\_trump\_vote}$$
Run a regression, and save it as an object. Now get a `summary()` of it.
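A sketch of this step, using `reg` as the object name (the name this document uses later):

```r
# run the regression and save it as an object
reg <- lm(agree_pct ~ net_trump_vote, data = congress)

summary(reg)
```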
##
## Call:
## lm(formula = agree_pct ~ net_trump_vote, data = congress)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72261 -0.16975 0.03761 0.19851 0.70728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5616451 0.0060307 93.13 <2e-16 ***
## net_trump_vote 0.0099218 0.0001922 51.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.251 on 1733 degrees of freedom
## Multiple R-squared: 0.6061, Adjusted R-squared: 0.6058
## F-statistic: 2666 on 1 and 1733 DF, p-value: < 2.2e-16
What is $\hat{\beta}_0$? What does it mean in the context of our question?
```r
# I just want to show you how to extract these and put it into text with markdown
# This relies on the code in Question 6 Part B
library(broom)

reg_tidy <- reg %>%
  tidy()

beta_0_hat <- reg_tidy %>% # save this as beta_0_hat
  slice(1) %>%             # look only at first row (intercept)
  pull(estimate) %>%       # extract the value of the estimate
  round(., 2)              # round to 2 decimal places

beta_0_hat # look at it
```
## [1] 0.56
$\hat{\beta}_0 = 0.56$, meaning the rate at which Members of Congress agree with the President when the Net Trump Vote was 0 (i.e. both candidates got 50%) is 0.56 (out of 1).1
What is $\hat{\beta}_1$? What does it mean in the context of our question?
## [1] 0.01
$\hat{\beta}_1 = 0.01$, meaning for every additional Net Trump Vote percentage point in a Member of Congress' district, that Member agrees 0.01 more with the President (out of 1).
What is $R^2$? What does it mean?
## [1] 0.61
$R^2 = 0.61$, meaning we have explained 61%2 of the variation in `agree_pct`.
What is the SER? What does it mean?
## [1] 0.25
SER $= 0.25$, meaning the average size of the residuals or errors (the distance from an actual data point to the predicted regression line) is 0.25 (out of 1).
We can look at regression outputs in a tidier way with the `broom` package.

Run the function `tidy()` on your regression object (saved in Question 5). Save this as an object and then look at it.
| term \<chr\> | estimate \<dbl\> | std.error \<dbl\> | statistic \<dbl\> | p.value \<dbl\> |
|---|---|---|---|---|
| (Intercept) | 0.561645133 | 0.0060306814 | 93.13129 | 0 |
| net_trump_vote | 0.009921844 | 0.0001921538 | 51.63492 | 0 |
Run the `glance()` function on your original regression object. What does it show you?
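A sketch of this step, using the `reg` object from Question 5:

```r
library(broom)

# goodness-of-fit statistics for the whole regression, in one row
glance(reg)
```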
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC |
|---|---|---|---|---|---|---|---|---|
| 0.6060616 | 0.6058343 | 0.2510093 | 2666.165 | 0 | 2 | -62.62784 | 131.2557 | 147.632 |
It shows you summary statistics of the regression, particularly its goodness of fit.
- `r.squared` is the $R^2$ value
- `sigma` is the SER ($\hat{\sigma}_u$)

That's all we need for now.
Now run the `augment()` function on your original regression object, and save this as an object. Look at it. What does it show you?
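A sketch of this step; `reg_aug` is a hypothetical name for the saved object:

```r
library(broom)

# add fitted values, residuals, etc. for each observation
reg_aug <- reg %>%
  augment()

reg_aug
```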
| agree_pct | net_trump_vote | .fitted | .se.fit | .resid | .hat |
|---|---|---|---|---|---|
| 0.97872340 | 63.0000000 | 1.186721301 | 0.013732227 | -0.2079978972 | 0.0029929687 |
| 0.96842105 | 63.0000000 | 1.186721301 | 0.013732227 | -0.2183002488 | 0.0029929687 |
| 1.00000000 | 63.0000000 | 1.186721301 | 0.013732227 | -0.1867213015 | 0.0029929687 |
| 0.62142857 | 9.4000000 | 0.654910466 | 0.006362055 | -0.0334818950 | 0.0006424141 |
| 0.54166667 | 9.4000000 | 0.654910466 | 0.006362055 | -0.1132437998 | 0.0006424141 |
| 0.79545455 | 9.4000000 | 0.654910466 | 0.006362055 | 0.1405440790 | 0.0006424141 |
| 0.97857143 | 12.3000000 | 0.683683814 | 0.006561991 | 0.2948876147 | 0.0006834260 |
| 0.98936170 | 12.3000000 | 0.683683814 | 0.006561991 | 0.3056778883 | 0.0006834260 |
| 0.95652174 | 12.3000000 | 0.683683814 | 0.006561991 | 0.2728379253 | 0.0006834260 |
| 0.12318841 | -40.0000000 | 0.164771376 | 0.009584055 | -0.0415829702 | 0.0014578700 |
It takes the original data points for $Y$ and $X$ and adds new variables with values for those observations:

- `.fitted` is $\hat{Y}_i | X$, the predicted value for a specific `net_trump_vote`
- `.resid` is $\hat{u}_i$, the residual or error for a specific `net_trump_vote`

$$\hat{u}_i = Y_i - \hat{Y}_i \quad \iff \quad \text{.resid} = \text{agree\_pct} - \text{.fitted}$$
That’s all we need right now.
Now let’s start looking at the residuals of the regression.
Take the augmented regression object from Question 6-D and use it as the source of your data to create a histogram, where `x` is `.resid`. Does it look roughly normal?
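One way to make this histogram; `reg_aug` is a hypothetical name for the augmented object from Question 6-D:

```r
ggplot(data = reg_aug)+
  aes(x = .resid)+   # residuals saved by augment()
  geom_histogram()
```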
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It does look roughly normal.
Take the augmented regression object and make a residual plot, which is a scatterplot where `x` is the normal $X$ variable and `y` is the `.resid`. Feel free to add a horizontal line at 0 with `geom_hline(yintercept = 0)`.
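A sketch of the residual plot, again using `reg_aug` as a hypothetical name for the augmented object:

```r
ggplot(data = reg_aug)+
  aes(x = net_trump_vote, y = .resid)+
  geom_point()+
  geom_hline(yintercept = 0, color = "red") # reference line at 0
```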
Now let's try presenting your results in a regression table. Install and load `huxtable`, and run the `huxreg()` command. Your main input is the regression object you saved in Question 5. Feel free to customize the output of this table (see the slides).
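The default version of this step is a one-liner:

```r
library(huxtable)

huxreg(reg) # default table; see the slides for customization options
```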
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:purrr':
##
## every
## The following object is masked from 'package:ggplot2':
##
## theme_grey
|  | (1) |
|---|---|
| (Intercept) | 0.562 *** |
|  | (0.006) |
| net_trump_vote | 0.010 *** |
|  | (0.000) |
| N | 1735 |
| R2 | 0.606 |
| logLik | -62.628 |
| AIC | 131.256 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |
|  | Agree with President (Proportion) |
|---|---|
| Intercept | 0.5616 *** |
|  | (0.0060) |
| Net Trump Vote | 0.0099 *** |
|  | (0.0002) |
| n | 1735 |
| R-squared | 0.6061 |
| SER | 0.2510 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |
Now let’s check for heteroskedasticity.
Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?
The errors do seem heteroskedastic: looking at the residual plot, there are clear downward-moving clusters of points that are at times above the line, then below it, then above and below it.
Install and load the `lmtest` package and run `bptest()` on your regression object. Was the data heteroskedastic or homoskedastic?
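A sketch of this step, using the `reg` object from Question 5:

```r
library(lmtest)

# Breusch-Pagan test; H0: errors are homoskedastic
bptest(reg)
```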
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 6.0092, df = 1, p-value = 0.01423
Since the p-value was less than 0.05, we can reject the null hypothesis H0 (that the errors are homoskedastic), and conclude that the errors are heteroskedastic. We will need to fix them if we want accurate standard errors.
Now let's make some heteroskedasticity-robust standard errors. Install and load the `estimatr` package and use the `lm_robust()` command (instead of the `lm()` command) to run a new regression (and save it). Make sure you add `se_type = "stata"` inside the command to calculate robust SEs. Look at it. What changes?
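A sketch of this step; `reg_robust` is a hypothetical name for the new object:

```r
library(estimatr)

# same model, but with heteroskedasticity-robust (HC1, Stata-style) SEs
reg_robust <- lm_robust(agree_pct ~ net_trump_vote,
                        data = congress,
                        se_type = "stata")

summary(reg_robust)
```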
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 0.561645133 0.0060117565 93.42447 0 0.549854072
## net_trump_vote 0.009921844 0.0001493938 66.41403 0 0.009628833
## CI Upper DF
## (Intercept) 0.57343619 1733
## net_trump_vote 0.01021486 1733
The standard errors on both estimates ($\hat{\beta}_0$ and $\hat{\beta}_1$) change, but the estimates themselves do not change!

Note that the `t value`s and `Pr(>|t|)` will change, and this command adds confidence intervals (`CI`) and degrees of freedom for $t$ (`DF`), but we have not covered any of these yet!
Now let's see this in a nice regression table. Use `huxreg()` again, but add both your original regression and your regression saved in Part C. Notice any changes?
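A sketch, naming the two models for the table's column headers (`reg_robust` is the hypothetical name from Part C):

```r
library(huxtable)

huxreg("Original" = reg,
       "Robust SEs" = reg_robust)
```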
|  | Original | Robust SEs |
|---|---|---|
| (Intercept) | 0.561645 *** | 0.561645 *** |
|  | (0.006031) | (0.006012) |
| net_trump_vote | 0.009922 *** | 0.009922 *** |
|  | (0.000192) | (0.000149) |
| N | 1735 | 1735 |
| R2 | 0.606062 | 0.606062 |
| logLik | -62.627843 |  |
| AIC | 131.255686 |  |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |
Note that the difference in standard errors is too small to register within 3 decimal places (the default). If we were to make a true percent variable out of 100% instead of out of 1 (as I have you do in Question 13), we would see more of a difference. Look carefully: the standard error on `net_trump_vote` is 0.000192 (from Question 5), and the robust standard error is 0.000149 from Part C.
Now let’s check for outliers.
Just looking at the scatterplot in Question 3, do you see any outliers?
It looks like there might be one: the Member whose district's net vote was somewhere around −60 but who agrees with the President about 0.6 of the time.
Install and load the `car` package. Run the `outlierTest()` command on your regression object. Does it detect any outliers?
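A sketch of this step, using the `reg` object:

```r
library(car)

# Bonferroni-adjusted test on the largest studentized residual
outlierTest(reg)
```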
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 1708 -2.886125 0.0039485 NA
Yes; surprisingly, it is not the point that looks like it stands out on the graph (middle left) but a different point.
Look in your original data to match this outlier with an observation. Hint: use the `slice()` command, as the outlier test gave you an observation (row) number!
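A sketch, using the row number reported by `outlierTest()`:

```r
congress %>%
  slice(1708) # the observation number from the outlier test
```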
| congress | chamber | bioguide | last_name | state | district | party | votes | agree_pct | predicted_agree | net_trump_vote |
|---|---|---|---|---|---|---|---|---|---|---|
| 116 | senate | T000464 | Tester | MT |  | D | 24 | 0.0417 | 0.751 | 20.4 |
This data is still a bit messy. Let's check in on your `tidyverse` skills again! For example, we'd probably like to plot our scatterplots with colors for the Republican and Democratic parties, or plot by the House and the Senate.

First, the variable `congress` (session of Congress) seems a bit off. Get a `count()` of `congress`.
| congress | n |
|---|---|
| 0 | 647 |
| 115 | 552 |
| 116 | 536 |
Let's get rid of the `0` values for `congress` (someone made a mistake coding this, probably). Also, while we're at it, let's take `agree_pct` and `mutate()` a variable that is a proper percentage (i.e. `* 100`).
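One way to do both steps, saving to a new object. `congress_tidy` is the name this document uses later; `agree_percent` is a hypothetical name for the new variable:

```r
congress_tidy <- congress %>%
  filter(congress != 0) %>%                # drop the miscoded 0 values
  mutate(agree_percent = agree_pct * 100)  # a proper percentage out of 100
```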
The variable `party` is also quite a mess. `count()` by `party` to see. Then let's `mutate()` a variable to make a better measure of political party: just `"Republican"`, `"Democrat"`, and `"Independent"`. Try doing this with the `case_when()` command (as your `mutate` formula).3 When you're done, `count()` by your new party variable to make sure it worked.
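A sketch, using `pol_party` (the name used later in this document). The exact set of messy codes in `party` is an assumption based on the values visible in the `glimpse()` output (`"Republican"`, `"R"`, `"Independent"`, `"I"`); check your own `count()` of `party` and adjust:

```r
congress_tidy <- congress_tidy %>%
  mutate(pol_party = case_when(
    party %in% c("Republican", "R")  ~ "Republican",
    party %in% c("Democrat", "D")    ~ "Democrat",
    party %in% c("Independent", "I") ~ "Independent"
  ))

congress_tidy %>%
  count(pol_party)
```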
| pol_party | n |
|---|---|
| Democrat | 528 |
| Independent | 5 |
| Republican | 555 |
Now plot a scatterplot (same as Question 3) and set `color` to your party variable. Notice `R` uses its own default colors, which don't match the actual colors these political parties use! Make a vector where you define the party colors as follows: `party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")`. Then, run your plot again, adding the following line to customize the colors: `+ scale_colour_manual("Parties", values = party_colors)`.4
```r
party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")

p <- ggplot(data = congress_tidy)+
  aes(x = net_trump_vote, y = agree_pct)+
  geom_point(aes(color = pol_party))+
  geom_smooth(method = "lm", color = "black")+
  scale_colour_manual("Parties", values = party_colors)+
  theme_light()

p
```
Look at my markdown code to see that I can write my R code directly in my text with one backtick and `r`. This calls my `beta_0_hat` object I made.↩︎

Again, look at my markdown: I am calling `r_sq` and multiplying it by 100 inside the code chunk.↩︎
The syntax for `case_when()` is to have a series of `condition ~ "Outcome"`, separated by commas. For example, one condition is to assign both `"Democrat"` and `"D"` to `"Democrat"`, as in `party %in% c("Democrat", "D") ~ "Democrat"`. You could also do this with a few `ifelse()` commands, but that's a bit more awkward.↩︎
"Parties"
is the title that will show up on the legend, feel free to edit it, or remove the legend with another layer +guides(color = F)
.↩︎