Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.
In your own words, describe what exogeneity and endogeneity mean, and how they are related to bias. What can we learn about the bias if we know X is endogenous?
The OLS estimators ^β0 and ^β1 are unbiased estimates of the true population parameters β0 and β1 if and only if X is exogenous. That is to say, if cor(X,u)=0 (i.e. there is no correlation between X and any unobserved variable that affects Y), then E[^β1]=β1.
If X is correlated with the error term, then X is endogenous. The true expected value of the OLS estimator is
$$E[\hat{\beta}_1]=\beta_1+\text{cor}(X,u)\frac{\sigma_u}{\sigma_X}$$
The bias is (E[^β1]−β1), i.e. the difference between the average estimated sample slope and the 'true' population slope. First, we can determine the size of the bias based on how large cor(X,u) is: the stronger the correlation, the larger the bias.
Second, we can determine the direction of the bias depending on the sign of cor(X,u).
If X and u are positively correlated (move in the same direction), we know that we have overstated the true effect of ΔX on ΔY, since a change in Y is picking up both a change in X and a further change (in the same direction as X) in the unobserved u.
If the correlation is negative (they move in opposite directions), we know that we have understated the true effect of ΔX on ΔY, since a change in Y is picking up a change in X that is dampened by a change in u in the opposite direction.
In your own words, describe what R2 means. How do we calculate it, what does it tell us, and how do we interpret it?
The R2 is a measure of how well the OLS regression line “fits” our observed data points. It is the proportion of the total variation in Y (TSS) that is explained by the variation from our model (ESS):
$$R^2=\frac{ESS}{TSS}=\frac{\sum (\hat{Y}_i-\bar{Y})^2}{\sum (Y_i-\bar{Y})^2}$$
Equivalently, it can be found by subtracting the proportion of unexplained variation in Y (SSE/TSS) from 1:
$$R^2=1-\frac{SSE}{TSS}=1-\frac{\sum (\hat{u}_i)^2}{\sum (Y_i-\bar{Y})^2}$$
This is because $\frac{SSE+ESS}{TSS}=1$. Finally, R2 is the square of the correlation coefficient between X and Y: $R^2=(r_{X,Y})^2$
The closer R2 is to 1, the better the fit; the closer to 0, the poorer the fit.
In your own words, describe what the standard error of the regression (SER) means. How do we calculate it, what does it tell us, and how do we interpret it?
SER (^σu) is the average size of the error (or residual), ^ui, that is, the average distance from the regression line to the actual data value for Y at a given X. The goal of OLS is to minimize this (well, technically minimize the sum of squared errors!).
$$SER=\sqrt{\frac{1}{n-2}\sum \hat{u}_i^2}=\sqrt{\frac{SSE}{n-2}}$$
We calculate it by squaring the residuals (to get a positive distance), taking their mean by adding them all up and dividing by n−2, and then taking the square root to return to normal (non-squared) units.
We divide by n−2 rather than by n due to the degrees of freedom correction for calculating two prior statistics with our data already, ^β0 and ^β1.
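To make this concrete, here is a small R sketch (not part of the original answers; the data and object names are made up) verifying that the manual SER formula matches R's built-in sigma() on a fitted lm object:
# a made-up example to verify the SER formula
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 + 3 * df$x + rnorm(100)
fit <- lm(y ~ x, data = df)

n <- nobs(fit)
sqrt(sum(residuals(fit)^2) / (n - 2)) # manual SER: sqrt(SSE / (n - 2))
sigma(fit)                            # R's built-in SER; the two should match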
In your own words, describe what homoskedasticity and heteroskedasticity mean: both in ordinary English, and in terms of the graph of the OLS regression line.
Homoskedasticity means the errors are distributed with the same variance for all levels of X. Knowing anything about X will not tell us anything about the distribution of errors at that level of X.
Heteroskedasticity means the errors are distributed differently for different levels of X. So, at different levels of X, there will be much more or much less variation in the residuals.
In your own words, describe what the variation in ^β1 (either variance or standard error) means, or is measuring. What three things determine the variation, and in what way?
The variation of ^β1 (either its variance or standard error) is a measure of how precise our estimate is. This idea comes from the sampling distribution of ^β1, since it is a random variable: if we were to take other samples and calculate the slope of a regression line ^β1 for each, the estimate would vary from sample to sample.
The standard error of ^β1 (square this to get variance) is:
$$se(\hat{\beta}_1)=\frac{\sigma_u}{\sqrt{n}\times se(X)}$$
The three things that affect it are:
Goodness of Fit of the Regression (σu) or SER. The worse the fit, the higher the SER, and the worse the precision (higher standard error) of ^β1.
Sample size, n: the more data, the better the precision (lower standard error) of ^β1.
Standard error of X: the more variation (spread) in X-values, the better the precision (lower standard error) of ^β1.
See the graphs in slides 6-8 of class 2.5 for more.
In your own words, describe what a p-value means, and how it is used to establish statistical significance.
The p-value is the probability, if the null hypothesis were true, of observing a test statistic at least as extreme as the one found in our sample. Specifically, if H0: β1=0, it is the probability of getting a sample slope at least as extreme as the one in our sample, if the true slope were 0.[1]
$$\Pr(\delta \geq \delta_i \mid H_0 \text{ is true})$$
where δ is a test-statistic and δi is the test statistic we obtained from our sample.
Another way to interpret this is that the p-value is the probability we commit a Type I error: the probability that, if the null hypothesis were true, we falsely reject it from our sample evidence.
Be careful: the p-value is not the probability that our alternative hypothesis is true given our findings (a common misinterpretation)! In fact it is roughly the opposite: the probability of observing our findings (or findings more extreme) given that the null hypothesis is true!
A researcher is interested in examining the impact of illegal music downloads on commercial music sales. The author collects data on commercial sales of the top 500 singles from 2017 (Y) and the number of downloads from a web site that allows 'file sharing' (X). The author estimates the following model:
$$\text{music sales}_i=\beta_0+\beta_1 \text{illegal downloads}_i+u_i$$
The author finds a large, positive, and statistically significant estimate of ^β1. The author concludes these results demonstrate that illegal downloads actually boost music sales. Is this an unbiased estimate of the impact of illegal music on sales? Why or why not? Do you expect the estimate to overstate or understate the true relationship between illegal downloads and sales?
Does knowing the amount of illegal downloads an artist has convey any information about other variables that affect music sales? In other words, we are asking if E[u|X]=0 (or more simply, cor(X,u)=0).
It is likely that the artists and songs that are most heavily pirated are the most popular ones, and these are also likely to have very high music sales. Economists say piracy is like a tax on success: it happens more to those who are already successful and less to those who are still trying to make it big.
In any case, illegal downloads are probably endogenous. Since illegal downloads are likely positively correlated with popularity (in the error term), and popularity is also positively correlated with music sales, we are likely overstating the effect of illegal downloads on sales. In other words, ^β1 is also picking up the positive effect of popular songs, and is too large. The true β1 is likely much lower than measured.
A pharmaceutical company is interested in estimating the impact of a new drug on cholesterol levels. They enroll 200 people in a clinical trial. People are randomly assigned to the treatment group or to the control group. Half of the people are given the new drug and half are given a sugar pill with no active ingredient. To examine the impact of dosage on reductions in cholesterol levels, the authors of the study regress the following model:
$$\text{cholesterol level}_i=\beta_0+\beta_1 \text{dosage level}_i+u_i$$
For people in the control group, dosage leveli=0 and for people in the treatment group, dosage leveli measures milligrams of the active ingredient. In this case, the authors find a large, negative, statistically significant estimate of ^β1. Is this an unbiased estimate of the impact of dosage on change in cholesterol level? Why or why not? Do you expect the estimate to overstate or understate the true relationship between dosage and cholesterol level?
Does knowing whether (or how much) a person was treated convey any information about other characteristics that affect cholesterol level (in ui)? Again, we are asking if E[u|X]=0 or cor(X,u)=0
In this case, the answer is clearly no; the conditions do hold and treatment is exogenous. Xi is determined by random assignment: some people are assigned the drug, and some are given no active ingredient at all. The important fact is that dosage is determined randomly, so we expect that on average it will not be correlated with u. In this case, E[^β1]=β1, so ^β1 is unbiased.
For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section "manually."
A researcher wants to estimate the relationship between average weekly earnings (AWE, measured in dollars) and Age (measured in years) using a simple OLS model. Using a random sample of college-educated full-time workers aged 25-65 yields the following:
^AWE=696.70+9.60 Age
Interpret what ^β0 means in this context.
^β0 is 696.70. This is the vertical intercept of the regression line. It means that a person who is 0 years old earns $696.70 per week on average. This is often nonsensical, so we often don't care about the economic meaning of the intercept.
Interpret what ^β1 means in this context.
^β1 is 9.60 This is the slope of the regression line. It means that for every year older a person is, they can expect their wages to increase by $9.60, on average. This is the marginal effect of Age on AWE (and the causal effect if this model were exogenous).
The R2=0.023 for this regression. What are the units of the R2, and what does this mean?
R2 has no units, it is the proportion of variation in AWE that is explained by our model, between 0 and 1. This model explains only 2.3% of the variation in AWE, meaning this model is poor, and the line does not fit the data points well.
The SER, ^σu = 624.1, for this regression. What are the units of the SER in this context, and what does it mean? Is the SER large in the context of this regression?
SER is measured in the same units as the dependent variable, AWE, so it is measured in dollars. It is the average error or residual for an individual, the difference (in dollars) between OLS’ predicted ^AWE for that person, and their true AWE in the data. This SER is quite big, $624 in average weekly earnings.
Suppose Maria is 20 years old. What is her predicted ^AWE?
$$\widehat{AWE}_{Maria}=696.70+9.60(20)=888.70$$
Suppose the data shows her actual AWE is $430. What is her residual? Is this a relatively good or a bad prediction?[2]
$$\hat{u}_{Maria}=AWE_{Maria}-\widehat{AWE}_{Maria}=430-888.70=-458.70$$
This is actually a relatively good prediction, as the size of this residual is smaller than the average prediction error (SER), which was 624.1.
What does the error term, ui, represent in this case? Why might individuals have different values of ui?
The error term represents all factors other than age that affect an individual's average weekly earnings. This could include things like experience, ability, job type, education level, conscientiousness, etc.
Do you think that Age is exogenous? Why or why not? Would we expect ^β1 to be too large or too small?
It's very unlikely that Age is exogenous. Knowing someone's age likely gives us information about u: we can guess their experience or level of education (these are likely higher for older people), and most of these positively affect wages. Thus, we have probably overestimated the effect of age on earnings (i.e. ^β1), and the true β1 is likely smaller.
Suppose a researcher is interested in estimating a simple linear regression model:
$$Y_i=\beta_0+\beta_1 X_i+u_i$$
What is the OLS estimate of ^β1?
The formula for ^β1 gives:
$$\hat{\beta}_1=\frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n (X_i-\bar{X})^2}=\frac{cov(X,Y)}{var(X)}=\frac{13800}{6900}=2$$
What is the OLS estimate of ^β0?
The formula for ^β0 gives:
$$\hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{X}=63-2(30)=3$$
Suppose the OLS estimate of ^β1 has a standard error of 0.072. Could we probably reject a null hypothesis of H0:β1=0 at the 95% level?
Yes, we could reject the null hypothesis as the estimate of ^β1=2 is more than 2 times its standard error of 0.072. The test-statistic would actually be
$$t=\frac{\hat{\beta}_1-\beta_{1,0}}{se(\hat{\beta}_1)}=\frac{2-0}{0.072}\approx 27.78$$
This is well beyond the critical value needed to reject H0, and the p-value would be basically 0.
Calculate the R2 for this model. How much variation in Y is explained by our model?
We know TSS (the 4th bullet point) and SSE (the last bullet point).
$$R^2=1-\frac{SSE}{TSS}=1-\frac{1656}{29000}=1-0.057=0.943$$
This model explains 94.3% of the variation in Yi.
How large is the average residual?
We need to find the standard error of the regression (SER), but luckily we know the SSE (last bullet point)
$$SER=\sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{1656}{48-2}}=\sqrt{36}=6$$
This tells us the average residual is 6 (in units of Y).
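Since the instructions allow using R to verify your work, here is a quick check (a sketch, simply plugging in the summary statistics given above):
# verify the hand calculations using the given summary statistics
beta_1_hat <- 13800 / 6900            # cov(X,Y) / var(X) = 2
beta_0_hat <- 63 - beta_1_hat * 30    # Ybar - beta_1_hat * Xbar = 3
t_stat     <- (beta_1_hat - 0) / 0.072 # about 27.78
r_squared  <- 1 - 1656 / 29000        # 1 - SSE/TSS = 0.943
ser        <- sqrt(1656 / (48 - 2))   # sqrt(SSE / (n - 2)) = 6
c(beta_1_hat, beta_0_hat, t_stat, r_squared, ser)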
For the following problems, please attach/write the answers to each question on the same document as the previous problems, but also include a printed/attached (and commented!) .R script file of your commands to answer the questions.
Download the MLBattend dataset from Blackboard. This dataset contains data on attendance at major league baseball games for all 32 MLB teams from the 1970s-2000.
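The loading code isn't shown before the output below; here is a sketch of what produced it (the file name is an assumption):
# load the tidyverse and import the data
# (read_csv() prints the column specification shown below)
library(tidyverse)
mlb <- read_csv("MLBattend.csv") # file name assumed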
## Parsed with column specification:
## cols(
## team = col_character(),
## city = col_character(),
## nickname = col_character(),
## league = col_character(),
## division = col_character(),
## season = col_double(),
## home_attend = col_double(),
## runs_scored = col_double(),
## runs_allowed = col_double(),
## wins = col_double(),
## losses = col_double(),
## games_behind = col_double()
## )
Clean up the data a bit by making a new variable to measure home attendance in millions. This will make it easier to interpret your regression later on.
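One way to create the new variable (a sketch, using the variable name that appears in the regression output below):
# create home attendance in millions
mlb <- mlb %>%
  mutate(home_attend_mil = home_attend / 1e6)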
Get the correlation between Runs Scored and Home Attendance.
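A sketch of one way to get the correlation (summarize() matches the tibble-style output below; other approaches work too):
# correlation between runs scored and home attendance (in millions)
mlb %>%
  summarize(Correlation = cor(runs_scored, home_attend_mil))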
| Correlation |
|---|
| 0.4944431 |
Plot a scatterplot of Runs Scored (y) on Home Attendance (x). Add a regression line.
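A sketch of the scatterplot with a fitted line (the aesthetic choices are assumptions):
# scatterplot of runs scored on home attendance, with an OLS line
ggplot(data = mlb) +
  aes(x = home_attend_mil, y = runs_scored) +
  geom_point() +
  geom_smooth(method = "lm") # adds the regression line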
Run a regression of Runs Scored on Home Attendance. What are ^β0 and ^β1? Interpret them in the context of our question.
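The estimation command isn't printed, but the Call line in the output below pins it down (the tidy table afterwards presumably comes from broom's tidy()):
# run the regression and save it as reg (the object name used in later chunks)
reg <- lm(runs_scored ~ home_attend_mil, data = mlb)
summary(reg)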
##
## Call:
## lm(formula = runs_scored ~ home_attend_mil, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -295.566 -52.754 1.414 63.769 271.377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 572.618 8.081 70.86 <2e-16 ***
## home_attend_mil 68.798 4.183 16.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.47 on 836 degrees of freedom
## Multiple R-squared: 0.2445, Adjusted R-squared: 0.2436
## F-statistic: 270.5 on 1 and 836 DF, p-value: < 2.2e-16
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 572.61829 | 8.080646 | 70.86294 | 0.000000e+00 |
| home_attend_mil | 68.79777 | 4.182920 | 16.44731 | 7.132757e-53 |
## [1] 572.62
## [1] 68.8
^β0 = 572.62, ^β1 = 68.8
Teams that have a home attendance of 0 over their season score 572.618 runs on average. For every additional 1 million fans attending home games over the season, a team scores 68.798 more runs, on average.
Write out the estimated regression equation.
$$\widehat{\text{Runs scored}}_i=572.618+68.798 \, \text{Home attendance (mil)}_i$$
## $$
## \text{runs\_scored} = 572.618 + 68.798(\text{home\_attend\_mil}) + \epsilon
## $$
Make a regression table of the output.
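A sketch of one way to build the table below with huxtable's huxreg(); the exact labels and options here are assumptions chosen to match the output:
# regression table with huxreg()
library(huxtable)
huxreg(reg,
       coefs = c("Constant" = "(Intercept)",
                 "Home Attendance (Millions)" = "home_attend_mil"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 3)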
|  | (1) |
|---|---|
| Constant | 572.618 *** |
|  | (8.081) |
| Home Attendance (Millions) | 68.798 *** |
|  | (4.183) |
| N | 838 |
| R-Squared | 0.244 |
| SER | 91.473 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Now let’s start running some diagnostics of the regression. Make a histogram of the residuals. Do they look roughly normal?
# here we need broom's augment() command to add residuals to the data
# load broom
library(broom)
# augment the regression, save as reg_aug
reg_aug<-reg %>%
augment()
# now we use this as the data in our histogram plot in ggplot, (x is .resid)
ggplot(data = reg_aug)+
aes(x = .resid)+
geom_histogram(color="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
They do look roughly normal.
Make a residual plot.
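A sketch of a residual plot, reusing the reg_aug data made above (plotting residuals against fitted values):
# residual plot: fitted values vs. residuals
ggplot(data = reg_aug) +
  aes(x = .fitted, y = .resid) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") # reference line at zero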
Test the regression for heteroskedasticity. Are the errors homoskedastic or heteroskedastic? Generate robust standard errors. Make a regression output table, with one column with regular standard errors and another with robust standard errors.
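The output below is a studentized Breusch-Pagan test; a sketch of the command (bptest() lives in the lmtest package):
# Breusch-Pagan test for heteroskedasticity on our regression
library(lmtest)
bptest(reg)
For the robust-standard-error column the question asks for, one common option (not shown in the output here, so treat it as a suggestion) is estimatr::lm_robust() on the same formula, or coeftest() with a heteroskedasticity-robust variance matrix from the sandwich package.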
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 0.22515, df = 1, p-value = 0.6351
The null hypothesis H0 is that the errors are homoskedastic. The p-value for this test is very large, so we cannot reject the null hypothesis.
This is good, it means the errors are homoskedastic, and our OLS estimators’ standard errors are accurate and do not need to be corrected for heteroskedasticity.
Test the data for outliers. If there are any, identify which team(s) and season(s) are outliers.
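The output below comes from the car package's outlierTest(); a sketch of the command:
# Bonferroni outlier test on the regression's studentized residuals
library(car)
outlierTest(reg)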
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 816 -3.255201 0.0011787 0.98779
This test detected one outlier, which is observation (row) number 816. Let’s look it up:
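A sketch of looking up that row:
# pull out observation (row) 816
mlb %>%
  slice(816)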
| team | city | nickname | league | division | season | home_attend | runs_scored | runs_allowed | wins | losses | games_behind | home_attend_mil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TOR | Toronto | Blue Jays | AL | East | 1.98e+03 | 7.55e+05 | 329 | 466 | 37 | 69 | 23.5 | 0.755 |
The Toronto Blue Jays’ 1981 season is an outlier. Just for kicks, let’s point it out on the scatterplot.
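A sketch of marking the outlier on the scatterplot (the styling is an assumption):
# re-draw the scatterplot and highlight row 816 (the 1981 Blue Jays) in red
ggplot(data = mlb) +
  aes(x = home_attend_mil, y = runs_scored) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_point(data = slice(mlb, 816), color = "red", size = 3)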
What is the marginal effect of home attendance on runs scored? Is this statistically significant? Why or why not?
This is an interpretation question, no need to calculate anything. The marginal effect is ^β1: for every 1 additional million fans attending home games over a team’s season, the team scores 69 more runs.
Looking back at the regression output in part c, the t-score for the hypothesis test H0:β1=0, H1:β1≠0 is 16.45, yielding a very very small p-value. We have sufficient evidence to reject H0 in favor of our alternative hypothesis, that there is a relationship between home attendance and runs scored over a season.
Now we'll try out the infer package to understand the p-value and confidence interval for our observed slope in our regression model. Save the (value of) our sample ^β1 from your regression in Part D as an object. Then, install and load the infer package, and calculate the slope[3] under the null hypothesis that there is no connection between attendance and runs[4] for 1000 additional simulated samples[5], saving this as an object (it's a tibble). Then, use this to get_p_value()[6]. Compare to the p-value given by lm() and tidy() above.
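The chunk that saved the slope isn't shown, but its printed value appears below; here is a sketch of one way to save it (the object name sample_slope matches the later plotting code):
# save our sample slope from the Part D regression
library(broom)
sample_slope <- tidy(reg) %>%
  filter(term == "home_attend_mil") %>%
  pull(estimate)
sample_slope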
## [1] 68.79777
# note I could also just use the beta_1_hat I optionally made in Part D
# make 1000 simulations of sample slopes under null hypothesis that slope = 0
# install.packages("infer")
library(infer) # load infer
slope_simulations<-mlb %>% # save as a tibble
specify(runs_scored ~ home_attend_mil) %>% # our lm model
hypothesize(null = "independence") %>% # null hypothesis, slope = 0 (X and Y independent)
generate(reps = 1000, type = "permute") %>% # make 1000 permutations
calculate(stat = "slope") # calculate the sample slope of each permutation
# make sure it worked
slope_simulations %>%
head(., n = 10) # there are a LOT! I only print the first 10 for space
| replicate | stat |
|---|---|
| 1 | -1.64 |
| 2 | 0.498 |
| 3 | -1.57 |
| 4 | 1.06 |
| 5 | 5.93 |
| 6 | 1.88 |
| 7 | 7.25 |
| 8 | 8.61 |
| 9 | 4.28 |
| 10 | 9.48 |
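A sketch of getting the p-value from the simulations (the arguments follow the hint in footnote 6):
# two-sided p-value: share of simulated slopes at least as extreme as our sample slope
slope_simulations %>%
  get_p_value(obs_stat = sample_slope, direction = "two_sided")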
| p_value |
|---|
| 0 |
The p-value is basically 0. According to the regression output in part D, it is smaller than 2.2e-16.
Make a histogram of the simulated slopes, and plot our sample slope on that histogram, shading the p-value.[7]
## Warning: `visualize()` should no longer be used to plot a p-value.
## Arguments `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are
## deprecated. Use `shade_p_value()` instead.
# infer no longer shades the p-value, so I'll do it manually
slope_simulations %>%
ggplot(data = .)+
aes(x = stat)+ # slope from simulations
geom_histogram(color = "white", # the histogram
fill = "indianred")+
# add "shading" for p-value's two sides
# right side
geom_rect(xmin=68.798,
xmax=100,
ymin=0,
ymax=450,
fill = "pink",
alpha=0.4)+
# left side
geom_rect(xmin=-100,
xmax=-68.798,
ymin=0,
ymax=450,
fill = "pink",
alpha=0.4)+
# add vertical line for our sample slope
geom_vline(xintercept = sample_slope,
color = "red",
size = 2)+
# add label
geom_label(x = sample_slope,
y = 200,
color = "red",
label = expression(paste("Our ", hat(beta[1]))))+
scale_x_continuous(breaks=c(-75,-50,-25,0,25,50,75),
limits=c(-75,75),
expand=c(0,0))+
scale_y_continuous(limits=c(0,450),
expand=c(0,0))+
labs(x = expression(paste("Distribution of ", hat(beta[1]), " under ", H[0], " that ", beta[1]==0)),
y = "Samples")+
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'
Note this is the sampling distribution of ^β1 under the null hypothesis (the true β1=0). Values on the horizontal axis are values of ^β1, not the number of standard deviations away from the null hypothesis.
What the test-statistic does is standardize this distribution much like we would standardize a distribution to the standard normal distribution via calculating Z-scores.
Let me visualize what would happen (and this is what R does to calculate the t value and associated p-value).
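The chunk that builds these t-scores isn't shown; here is a sketch that reproduces the printed values (standardizing by the simulated slopes' mean and standard deviation is an assumption; the names tstatistics, tscores, and our_t come from the plotting code below):
# summarize the simulated slopes (the mean/se table below)
slope_simulations %>%
  summarize(mean = mean(stat), se = sd(stat))

# standardize each simulated slope into a t-score
tstatistics <- slope_simulations %>%
  mutate(tscores = (stat - mean(stat)) / sd(stat))

# our own slope's t-statistic on the same scale (about 14, printed below)
our_t <- (sample_slope - mean(slope_simulations$stat)) / sd(slope_simulations$stat)
our_t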
| mean | se |
|---|---|
| 0.0517 | 4.86 |
## [1] 14.14881
# now plot the t-statistics
ggplot(data = tstatistics)+
aes(x = tscores)+ # slope from simulations
geom_histogram(color = "white", # the histogram
fill = "indianred")+
# add vertical line for our sample slope's t-statistic
geom_vline(xintercept = our_t,
color = "red",
size = 2)+
#add label
geom_label(x = our_t,
y = 50,
color = "red",
label = "Our t")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is off slightly (see the t-statistic R calculated in part C), but it gives you an idea. Our test statistic is somewhere around 14-16, meaning it is 14-16 standard deviations away from 0, so again, a very, very low p-value!
Get the 95% confidence interval for your slope estimate,[8] and then make a histogram of the simulated slopes (like part L), but instead add +shade_confidence_interval().[9] Compare this to what you get with tidy() above.
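A sketch of the confidence-interval calculation, following the hint in footnote 8 (renaming the endpoints to lower/upper to match the table below is an assumption):
# 95% confidence interval for the slope from the original regression
ci_endpoints <- tidy(reg, conf.int = TRUE) %>%
  filter(term == "home_attend_mil") %>%
  select(lower = conf.low, upper = conf.high)
ci_endpoints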
| lower | upper |
|---|---|
| 59.3 | 78.3 |
## Warning: `visualize()` should no longer be used to plot a p-value.
## Arguments `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are
## deprecated. Use `shade_p_value()` instead.
1. Note in the classic sense, the p-value is actually measuring the probability of a test statistic (t) being at least as extreme as ours. The test statistic essentially standardizes our sample statistic (^β1) so that it measures standard deviations from the null-hypothesized value (i.e. 0), much like a Z-score.
2. Hint: compare your answer here to your answer in Part D.
3. calculate(stat = "slope")
4. hypothesize(null = "independence")
5. generate(reps = 1000, type = "permute")
6. Set obs_stat equal to your observed slope, and set direction = "two_sided".
7. You can use ggplot2 to plot a histogram in the normal way and add a geom_vline(), setting xintercept equal to your saved object with the sample ^β1 value. Alternatively, you can use infer to pipe your tibble of simulations into visualize(), and inside visualize() set obs_stat equal to your saved ^β1 object. Regardless of which method you use, add +shade_p_value(). Inside this, set obs_stat equal to your saved slope, and add direction = "two_sided".
8. tidy() your original regression, with conf.int = TRUE inside the command, then select(conf.low, conf.high) and filter by your x variable. Save this as an object.
9. Inside of this, set endpoints equal to the object you just made with the low and high confidence interval values.