To minimize confusion, I suggest creating a new R Project (e.g. `regression_practice`) and storing any data in that folder on your computer. Alternatively, I have made a project in RStudio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Our data comes from FiveThirtyEight's Trump Congress tracker. Download and read in (`read_csv()`) the data.
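A minimal sketch of this step. The filename here is hypothetical (use whatever name you saved the download as); `congress` is the object name used in the rest of this document:

```r
library(tidyverse)

# read the downloaded CSV into an object (filename is an assumption)
congress <- read_csv("trump-congress-tracker.csv")
```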
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## congress = col_double(),
## chamber = col_character(),
## bioguide = col_character(),
## last_name = col_character(),
## state = col_character(),
## district = col_double(),
## party = col_character(),
## votes = col_double(),
## agree_pct = col_double(),
## predicted_agree = col_double(),
## net_trump_vote = col_double()
## )
Look at the data with `glimpse()`.
## Observations: 1,735
## Variables: 11
## $ congress <dbl> 0, 115, 116, 0, 115, 116, 0, 115, 116, 0, 115, 1…
## $ chamber <chr> "house", "house", "house", "house", "house", "ho…
## $ bioguide <chr> "A000055", "A000055", "A000055", "A000367", "A00…
## $ last_name <chr> "Aderholt", "Aderholt", "Aderholt", "Amash", "Am…
## $ state <chr> "AL", "AL", "AL", "MI", "MI", "MI", "NV", "NV", …
## $ district <dbl> 4, 4, 4, 3, 3, 3, 2, 2, 2, 12, 12, 12, 31, 31, 3…
## $ party <chr> "Republican", "R", "R", "Independent", "R", "I",…
## $ votes <dbl> 141, 95, 46, 140, 96, 44, 140, 94, 46, 138, 92, …
## $ agree_pct <dbl> 0.97872340, 0.96842105, 1.00000000, 0.62142857, …
## $ predicted_agree <dbl> 0.95556332, 0.94634916, 0.97459258, 0.77280357, …
## $ net_trump_vote <dbl> 63.0, 63.0, 63.0, 9.4, 9.4, 9.4, 12.3, 12.3, 12.…
We want to see how the percentage of the time that a member of Congress agrees with President Trump (`agree_pct`) depends on the result of the 2016 presidential election in their district (`net_trump_vote`). First, plot a scatterplot of `agree_pct` on `net_trump_vote`. Add a regression line with an additional layer of `geom_smooth(method = "lm")`.
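One way to build this plot, assuming the data object is named `congress`:

```r
library(tidyverse)

ggplot(data = congress)+
  aes(x = net_trump_vote, y = agree_pct)+
  geom_point()+
  geom_smooth(method = "lm") # adds the OLS regression line
```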
Find the correlation between `agree_pct` and `net_trump_vote`.
## [1] 0.7784996
| cor \<dbl\> |
|---|
| 0.7784996 |
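A sketch of one tidyverse way to get this number, assuming the data object is named `congress`:

```r
# summarize() returns the correlation as a 1x1 tibble
congress %>%
  summarize(cor = cor(agree_pct, net_trump_vote))
```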
We want to estimate the following model:

$$\widehat{\text{agree\_pct}} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{net\_trump\_vote}$$
Run a regression, and save it as an object. Now get a `summary()` of it.
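A sketch of this step, using `reg` as the object name (the name this document uses later):

```r
# run the regression and save it as an object
reg <- lm(agree_pct ~ net_trump_vote, data = congress)

summary(reg)
```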
##
## Call:
## lm(formula = agree_pct ~ net_trump_vote, data = congress)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72261 -0.16975 0.03761 0.19851 0.70728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5616451 0.0060307 93.13 <2e-16 ***
## net_trump_vote 0.0099218 0.0001922 51.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.251 on 1733 degrees of freedom
## Multiple R-squared: 0.6061, Adjusted R-squared: 0.6058
## F-statistic: 2666 on 1 and 1733 DF, p-value: < 2.2e-16
What is $\hat{\beta}_0$? What does it mean in the context of our question?
```r
# I just want to show you how to extract these and put it into text with markdown
# This relies on the code in Question 6 Part B
library(broom)

reg_tidy <- reg %>%
  tidy()

beta_0_hat <- reg_tidy %>% # save this as beta_0_hat
  slice(1) %>%             # look only at first row (intercept)
  pull(estimate) %>%       # extract the value of the estimate
  round(., 2)              # round to 2 decimal places

beta_0_hat # look at it
```
## [1] 0.56
$\hat{\beta}_0 = 0.56$, meaning the rate at which Members of Congress agree with the President when the Net Trump Vote was 0 (i.e. both candidates got 50%) is 0.56 (out of 1).1
What is $\hat{\beta}_1$? What does it mean in the context of our question?
## [1] 0.01
$\hat{\beta}_1 = 0.01$, meaning for every additional Net Trump Vote percentage point in a Member of Congress' district, that Member agrees 0.01 more with the President (out of 1).
What is $R^2$? What does it mean?
## [1] 0.61
$R^2 = 0.61$, meaning we have explained 61%2 of the variation in `agree_pct`.
What is the SER? What does it mean?
## [1] 0.25
SER $= 0.25$, meaning the average size of the residuals or errors (the distance from an actual data point to the predicted regression line) is 0.25 (out of 1).
We can look at regression outputs in a tidier way with the `broom` package.

Run the function `tidy()` on your regression object (saved in Question 5). Save this as an object and then look at it.
| term \<chr\> | estimate \<dbl\> | std.error \<dbl\> | statistic \<dbl\> | p.value \<dbl\> |
|---|---|---|---|---|
| (Intercept) | 0.561645133 | 0.0060306814 | 93.13129 | 0 |
| net_trump_vote | 0.009921844 | 0.0001921538 | 51.63492 | 0 |
Run the `glance()` function on your original regression object. What does it show you?
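A sketch of this step, using the `reg` object from Question 5:

```r
library(broom)

# goodness-of-fit statistics for the whole regression, in one row
glance(reg)
```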
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC |
|---|---|---|---|---|---|---|---|---|
| 0.6060616 | 0.6058343 | 0.2510093 | 2666.165 | 0 | 2 | -62.62784 | 131.2557 | 147.632 |
It shows you summary statistics of the regression, particularly its goodness of fit.
- `r.squared` is the $R^2$ value
- `sigma` is the SER ($\hat{\sigma}_u$)

That's all we need for now.
Now run the `augment()` function on your original regression object, and save this as an object. Look at it. What does it show you?
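A sketch of this step; `reg_aug` is a hypothetical name for the saved object:

```r
library(broom)

# add fitted values, residuals, etc. for each observation
reg_aug <- reg %>%
  augment()

reg_aug
```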
| agree_pct | net_trump_vote | .fitted | .se.fit | .resid | .hat |
|---|---|---|---|---|---|
| 0.97872340 | 63.0000000 | 1.186721301 | 0.013732227 | -0.2079978972 | 0.0029929687 |
| 0.96842105 | 63.0000000 | 1.186721301 | 0.013732227 | -0.2183002488 | 0.0029929687 |
| 1.00000000 | 63.0000000 | 1.186721301 | 0.013732227 | -0.1867213015 | 0.0029929687 |
| 0.62142857 | 9.4000000 | 0.654910466 | 0.006362055 | -0.0334818950 | 0.0006424141 |
| 0.54166667 | 9.4000000 | 0.654910466 | 0.006362055 | -0.1132437998 | 0.0006424141 |
| 0.79545455 | 9.4000000 | 0.654910466 | 0.006362055 | 0.1405440790 | 0.0006424141 |
| 0.97857143 | 12.3000000 | 0.683683814 | 0.006561991 | 0.2948876147 | 0.0006834260 |
| 0.98936170 | 12.3000000 | 0.683683814 | 0.006561991 | 0.3056778883 | 0.0006834260 |
| 0.95652174 | 12.3000000 | 0.683683814 | 0.006561991 | 0.2728379253 | 0.0006834260 |
| 0.12318841 | -40.0000000 | 0.164771376 | 0.009584055 | -0.0415829702 | 0.0014578700 |
It takes the original data points for $Y$ and $X$ and adds new variables with values for those observations:

- `.fitted` is $\hat{Y}_i | X$, the predicted value for a specific `net_trump_vote`
- `.resid` is $\hat{u}_i$, the residual or error for a specific `net_trump_vote`

$$\hat{u}_i = Y_i - \hat{Y}_i \quad \iff \quad \text{.resid} = \text{agree\_pct} - \text{.fitted}$$
That’s all we need right now.
Now let’s start looking at the residuals of the regression.
Take the augmented regression object from Question 6-D and use it as the source of your data to create a histogram, where `x` is `.resid`. Does it look roughly normal?
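One way to make this histogram; `reg_aug` is a hypothetical name for the augmented object from Question 6-D:

```r
ggplot(data = reg_aug)+
  aes(x = .resid)+   # residuals saved by augment()
  geom_histogram()
```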
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It does look roughly normal.
Take the augmented regression object and make a residual plot, which is a scatterplot where `x` is the normal $X$ variable and `y` is the `.resid`. Feel free to add a horizontal line at 0 with `geom_hline(yintercept = 0)`.
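A sketch of the residual plot, again using `reg_aug` as a hypothetical name for the augmented object:

```r
ggplot(data = reg_aug)+
  aes(x = net_trump_vote, y = .resid)+
  geom_point()+
  geom_hline(yintercept = 0, color = "red") # reference line at 0
```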
Now let's try presenting your results in a regression table. Install and load `huxtable`, and run the `huxreg()` command. Your main input is the regression object you saved in Question 5. Feel free to customize the output of this table (see the slides).
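The default version of this step is a one-liner:

```r
library(huxtable)

huxreg(reg) # default table; see the slides for customization options
```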
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:purrr':
##
## every
## The following object is masked from 'package:ggplot2':
##
## theme_grey
|  | (1) |
|---|---|
| (Intercept) | 0.562 *** |
|  | (0.006) |
| net_trump_vote | 0.010 *** |
|  | (0.000) |
| N | 1735 |
| R2 | 0.606 |
| logLik | -62.628 |
| AIC | 131.256 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |
|  | Agree with President (Proportion) |
|---|---|
| Intercept | 0.5616 *** |
|  | (0.0060) |
| Net Trump Vote | 0.0099 *** |
|  | (0.0002) |
| n | 1735 |
| R-squared | 0.6061 |
| SER | 0.2510 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |
Now let’s check for heteroskedasticity.
Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?
The errors do seem heteroskedastic: looking at the residual plot, there are clear downward-moving clusters of points that are at times above the line, then below it, then above and below it.
Install and load the `lmtest` package and run `bptest()` on your regression object. Was the data heteroskedastic or homoskedastic?
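A sketch of this step, using the `reg` object from Question 5:

```r
library(lmtest)

# Breusch-Pagan test; H0: errors are homoskedastic
bptest(reg)
```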
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 6.0092, df = 1, p-value = 0.01423
Since the p-value was less than 0.05, we can reject the null hypothesis H0 (that the errors are homoskedastic), and conclude that the errors are heteroskedastic. We will need to fix them if we want accurate standard errors.
Now let's make some heteroskedasticity-robust standard errors. Install and load the `estimatr` package and use the `lm_robust()` command (instead of the `lm()` command) to run a new regression (and save it). Make sure you add `se_type = "stata"` inside the command to calculate robust SEs. Look at it. What changes?
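A sketch of this step; `reg_robust` is a hypothetical name for the new object:

```r
library(estimatr)

# same model, but with heteroskedasticity-robust (HC1, Stata-style) SEs
reg_robust <- lm_robust(agree_pct ~ net_trump_vote,
                        data = congress,
                        se_type = "stata")

summary(reg_robust)
```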
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 0.561645133 0.0060117565 93.42447 0 0.549854072
## net_trump_vote 0.009921844 0.0001493938 66.41403 0 0.009628833
## CI Upper DF
## (Intercept) 0.57343619 1733
## net_trump_vote 0.01021486 1733
The standard errors on both estimates ($\hat{\beta}_0$ and $\hat{\beta}_1$) change, but the estimates themselves do not change!

Note that the `t value`s and `Pr(>|t|)` will change, and this command adds confidence intervals (`CI`) and degrees of freedom for $t$ (`DF`), but we have not covered any of these yet!
Now let's see this in a nice regression table. Use `huxreg()` again, but add both your original regression and your regression saved in Part C. Notice any changes?
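A sketch, naming the two models for the table's column headers (`reg_robust` is the hypothetical name from Part C):

```r
library(huxtable)

huxreg("Original" = reg,
       "Robust SEs" = reg_robust)
```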
|  | Original | Robust SEs |
|---|---|---|
| (Intercept) | 0.561645 *** | 0.561645 *** |
|  | (0.006031) | (0.006012) |
| net_trump_vote | 0.009922 *** | 0.009922 *** |
|  | (0.000192) | (0.000149) |
| N | 1735 | 1735 |
| R2 | 0.606062 | 0.606062 |
| logLik | -62.627843 |  |
| AIC | 131.255686 |  |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |
Note that the difference in standard errors is too small to register within 3 decimal places (the default). If we were to make a true percent variable out of 100% instead of out of 1 (as I have you do in Question 13), we would see more of a difference. Look carefully: the standard error on `net_trump_vote` is 0.000192 (from Question 5), and the robust standard error is 0.000149 from Part C.
Now let’s check for outliers.
Just looking at the scatterplot in Question 3, do you see any outliers?
It looks like there might be one: the Member whose district's net vote was somewhere around −60 but who agrees with the President about 0.6 of the time.
Install and load the `car` package. Run the `outlierTest()` command on your regression object. Does it detect any outliers?
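A sketch of this step, using the `reg` object:

```r
library(car)

# Bonferroni-adjusted test on the largest studentized residual
outlierTest(reg)
```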
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 1708 -2.886125 0.0039485 NA
Yes; surprisingly, it is not the point that looks like it stands out on the graph (middle left) but a different point.
Look in your original data to match this outlier with an observation. Hint: use the `slice()` command, as the outlier test gave you an observation (row) number!
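A sketch, using the row number reported by `outlierTest()`:

```r
congress %>%
  slice(1708) # the observation number from the outlier test
```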
| congress | chamber | bioguide | last_name | state | district | party | votes | agree_pct | predicted_agree | net_trump_vote |
|---|---|---|---|---|---|---|---|---|---|---|
| 116 | senate | T000464 | Tester | MT |  | D | 24 | 0.0417 | 0.751 | 20.4 |
This data is still a bit messy. Let's check in on your `tidyverse` skills again! For example, we'd probably like to plot our scatterplots with colors for the Republican and Democratic parties, or plot by the House and the Senate.

First, the variable `congress` (session of Congress) seems a bit off. Get a `count()` of `congress`.
| congress | n |
|---|---|
| 0 | 647 |
| 115 | 552 |
| 116 | 536 |
Let's get rid of the `0` values for `congress` (someone made a mistake coding this, probably). Also, while we're at it, let's take `agree_pct` and `mutate()` a variable that is a proper percentage (i.e. `* 100`).
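One way to do both steps, saving to a new object. `congress_tidy` is the name this document uses later; `agree_percent` is a hypothetical name for the new variable:

```r
congress_tidy <- congress %>%
  filter(congress != 0) %>%                # drop the miscoded 0 values
  mutate(agree_percent = agree_pct * 100)  # a proper percentage out of 100
```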
The variable `party` is also quite a mess. `count()` by `party` to see. Then let's `mutate()` a variable to make a better measure of political party: just `"Republican"`, `"Democrat"`, and `"Independent"`. Try doing this with the `case_when()` command (as your `mutate` formula).3 When you're done, `count()` by your new party variable to make sure it worked.
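A sketch, using `pol_party` (the name used later in this document). The exact set of messy codes in `party` is an assumption based on the values visible in the `glimpse()` output (`"Republican"`, `"R"`, `"Independent"`, `"I"`); check your own `count()` of `party` and adjust:

```r
congress_tidy <- congress_tidy %>%
  mutate(pol_party = case_when(
    party %in% c("Republican", "R")  ~ "Republican",
    party %in% c("Democrat", "D")    ~ "Democrat",
    party %in% c("Independent", "I") ~ "Independent"
  ))

congress_tidy %>%
  count(pol_party)
```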
| pol_party | n |
|---|---|
| Democrat | 528 |
| Independent | 5 |
| Republican | 555 |
Now plot a scatterplot (same as Question 3) and set `color` to your party variable. Notice `R` uses its own default colors, which don't match the actual colors these political parties use! Make a vector where you define the party colors as follows: `party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")`. Then, run your plot again, adding the following line to customize the colors: `+ scale_colour_manual("Parties", values = party_colors)`.4
```r
party_colors <- c("Democrat" = "blue", "Republican" = "red", "Independent" = "gray")

p <- ggplot(data = congress_tidy)+
  aes(x = net_trump_vote, y = agree_pct)+
  geom_point(aes(color = pol_party))+
  geom_smooth(method = "lm", color = "black")+
  scale_colour_manual("Parties", values = party_colors)+
  theme_light()

p
```
Look at my markdown code to see that I can write my R code directly in my text with one backtick and `r`. This calls my `beta_0_hat` object I made.↩︎

Again, look at my markdown: I am calling `r_sq` and multiplying it by 100 inside the code chunk.↩︎
The syntax for `case_when()` is to have a series of `condition ~ "Outcome"`, separated by commas. For example, one condition is to assign both `"Democrat"` and `"D"` to `"Democrat"`, as in `party %in% c("Democrat", "D") ~ "Democrat"`. You could also do this with a few `ifelse()` commands, but that's a bit more awkward.↩︎
"Parties"
is the title that will show up on the legend, feel free to edit it, or remove the legend with another layer +guides(color = F)
.↩︎