$$\min_{\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \cdots, \hat{\beta}_k} \sum_{i=1}^{n} \Big[\underbrace{Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_k X_{ki})}_{u_i}\Big]^2$$
Math FYI: in linear algebra terms, a regression model with n observations of k independent variables:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$
$$\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{\mathbf{Y}\,(n \times 1)} = \underbrace{\begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,k} \end{pmatrix}}_{\mathbf{X}\,(n \times k)} \underbrace{\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}}_{\boldsymbol{\beta}\,(k \times 1)} + \underbrace{\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}}_{\mathbf{u}\,(n \times 1)}$$
The OLS estimator for $\beta$ is $\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$
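As a quick sanity check, here is a minimal R sketch computing $\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ "by hand" and comparing it to lm(). It assumes the CASchool data frame used throughout these slides is already loaded; the variable names (testscr, str, el_pct) are the ones from the class size example below.

# OLS "by hand" via beta_hat = (X'X)^(-1) X'Y
# (assumes the CASchool data frame from the class size example is loaded)
Y <- CASchool$testscr
X <- cbind(1, CASchool$str, CASchool$el_pct)  # column of 1's for the intercept

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat
coef(lm(testscr ~ str + el_pct, data = CASchool))  # same estimates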
Appreciate that I am saving you from such sorrow
$$\hat{\beta}_j \sim N\big(E[\hat{\beta}_j],\; se(\hat{\beta}_j)\big)$$
As before, $E[\hat{\beta}_j] = \beta_j$ when $X_j$ is exogenous (i.e. $cor(X_j, u) = 0$)
We know the true $E[\hat{\beta}_j] = \beta_j + \underbrace{cor(X_j, u)\dfrac{\sigma_u}{\sigma_{X_j}}}_{\text{O.V. Bias}}$
If $X_j$ is endogenous (i.e. $cor(X_j, u) \neq 0$), $\hat{\beta}_j$ contains omitted variable bias
We can now try to quantify the omitted variable bias
Suppose the true model is:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$$
What happens when we run a regression and omit $X_{2i}$?
Suppose we estimate the following omitted regression of just $Y_i$ on $X_{1i}$ (omitting $X_{2i}$):+
$$Y_i = \alpha_0 + \alpha_1 X_{1i} + \nu_i$$
+ Note: I am using $\alpha$'s and $\nu_i$ only to denote that these are different estimates than the true model's $\beta$'s and $u_i$
Key Question: are $X_{1i}$ and $X_{2i}$ correlated?
Run an auxiliary regression of $X_{2i}$ on $X_{1i}$ to see:+
$$X_{2i} = \delta_0 + \delta_1 X_{1i} + \tau_i$$
If $\delta_1 = 0$, then $X_{1i}$ and $X_{2i}$ are not linearly related
If $|\delta_1|$ is very big, then $X_{1i}$ and $X_{2i}$ are strongly linearly related
+ Note: I am using $\delta$'s and $\tau$ to differentiate the estimates for this model.
Substituting the auxiliary regression in for $X_{2i}$ in the true model:
$$\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \\
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2(\delta_0 + \delta_1 X_{1i} + \tau_i) + u_i \\
Y_i &= \underbrace{(\beta_0 + \beta_2\delta_0)}_{\alpha_0} + \underbrace{(\beta_1 + \beta_2\delta_1)}_{\alpha_1} X_{1i} + \underbrace{(\beta_2\tau_i + u_i)}_{\nu_i}
\end{aligned}$$
which is exactly the omitted regression:
$$Y_i = \alpha_0 + \alpha_1 X_{1i} + \nu_i$$
$$\alpha_1 = \beta_1 + \beta_2\delta_1$$
The true effect of $X_1$ on $Y_i$: $\beta_1$
The true effect of $X_2$ on $Y_i$: $\beta_2$
The omitted variable bias is $\beta_2\delta_1$; it exists only when both conditions for O.V. bias hold:
$Z_i$ (here, $X_{2i}$) must be a determinant of $Y_i$ $\implies \beta_2 \neq 0$
$Z_i$ must be correlated with $X_{1i}$ $\implies \delta_1 \neq 0$
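A minimal simulation sketch of this result (all variable names and parameter values here are hypothetical, not from the class size data): with true $\beta_1 = 2$, $\beta_2 = 3$, and $\delta_1 = 0.5$, omitting $X_2$ pushes the estimated slope on $X_1$ toward $\beta_1 + \beta_2\delta_1 = 3.5$.

# Hypothetical simulation: the omitted regression slope goes to beta_1 + beta_2 * delta_1
set.seed(42)
n  <- 10000
x1 <- rnorm(n)
x2 <- 1 + 0.5 * x1 + rnorm(n)          # auxiliary relationship: delta_1 = 0.5
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n)   # true model: beta_1 = 2, beta_2 = 3

coef(lm(y ~ x1 + x2))["x1"]  # about 2   (unbiased)
coef(lm(y ~ x1))["x1"]       # about 3.5 (= beta_1 + beta_2 * delta_1)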
## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## str          -1.10130    0.38028  -2.896  0.00398 ** 
## el_pct       -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16
$$\widehat{\text{Test Score}}_i = 686.03 - 1.10\,STR_i - 0.65\,\%EL_i$$
## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124, Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
$$\widehat{\text{Test Score}}_i = 698.93 - 2.28\,STR_i$$
## 
## Call:
## lm(formula = el_pct ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.823 -13.006  -6.849   7.834  74.601 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.8541     9.1626  -2.167  0.03081 *  
## str           1.8137     0.4644   3.906  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.98 on 418 degrees of freedom
## Multiple R-squared:  0.03521, Adjusted R-squared:  0.0329 
## F-statistic: 15.25 on 1 and 418 DF,  p-value: 0.0001095
$$\widehat{\%EL}_i = -19.85 + 1.81\,STR_i$$
"True" Regression
^Test Scorei=686.03−1.10 STRi−0.65 %EL
"Omitted" Regression
^Test Scorei=698.93 −2.28 STRi
"Auxiliary" Regression
^%ELi=−19.85+1.81 STRi
"True" Regression
^Test Scorei=686.03 −1.10 STRi−0.65 %EL
"Omitted" Regression
^Test Scorei=698.93 −2.28 STRi
"Auxiliary" Regression
^%ELi=−19.85+1.81 STRi
α1=β1 + β2δ1
"True" Regression
^Test Scorei=686.03 −1.10 STRi −0.65 %EL
"Omitted" Regression
^Test Scorei=698.93 −2.28 STRi
"Auxiliary" Regression
^%ELi=−19.85+1.81 STRi
α1=β1 + β2δ1
The true effect of STR on Test Score: -1.10
The true effect of %EL on Test Score: -0.65
"True" Regression
^Test Scorei=686.03 −1.10 STRi −0.65 %EL
"Omitted" Regression
^Test Scorei=698.93 −2.28 STRi
"Auxiliary" Regression
^%ELi=−19.85+ 1.81 STRi
α1=β1 + β2δ1
The true effect of STR on Test Score: -1.10
The true effect of %EL on Test Score: -0.65
The relationship between STR and %EL: 1.81
"True" Regression
^Test Scorei=686.03 −1.10 STRi −0.65 %EL
"Omitted" Regression
^Test Scorei=698.93 −2.28 STRi
"Auxiliary" Regression
^%ELi=−19.85+ 1.81 STRi
α1=β1 + β2δ1
The true effect of STR on Test Score: -1.10
The true effect of %EL on Test Score: -0.65
The relationship between STR and %EL: 1.81
So, for the omitted regression:
α1=−2.28 =−1.10 + −0.65 ( 1.81 )
"True" Regression
^Test Scorei=686.03 −1.10 STRi −0.65 %EL
"Omitted" Regression
^Test Scorei=698.93 −2.28 STRi
"Auxiliary" Regression
^%ELi=−19.85+ 1.81 STRi
α1=β1 + β2δ1
The true effect of STR on Test Score: -1.10
The true effect of %EL on Test Score: -0.65
The relationship between STR and %EL: 1.81
So, for the omitted regression:
α1=−2.28 =−1.10 + −0.65 ( 1.81 )
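We can verify this identity directly in R with the three regressions shown above. This is a sketch assuming the CASchool data frame is loaded; the object names (true_reg, omitted_reg, aux_reg) are just labels I am introducing here.

# Check alpha_1 = beta_1 + beta_2 * delta_1 using the three regressions above
true_reg    <- lm(testscr ~ str + el_pct, data = CASchool)  # "true" regression
omitted_reg <- lm(testscr ~ str, data = CASchool)           # "omitted" regression
aux_reg     <- lm(el_pct ~ str, data = CASchool)            # "auxiliary" regression

coef(omitted_reg)["str"]                                                 # alpha_1, about -2.28
coef(true_reg)["str"] + coef(true_reg)["el_pct"] * coef(aux_reg)["str"]  # same value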
$$var(\hat{\beta}_j) = \underbrace{\frac{1}{1-R^2_j}}_{VIF} \times \frac{(SER)^2}{n \times var(X_j)}$$
$$se(\hat{\beta}_j) = \sqrt{var(\hat{\beta}_j)}$$
+ See Class 2.5 for a reminder of the variance formula with just one variable.
Multicollinearity: two (or more) regressors are correlated with each other, i.e. $cor(X_j, X_l) \neq 0$ for some $j \neq l$
Multicollinearity between X variables does not bias OLS estimates
Multicollinearity does increase the variance of an estimate by the Variance Inflation Factor (VIF):
$$VIF = \frac{1}{(1-R^2_j)}$$
Example: Suppose we have a regression with three regressors ($k=3$):
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i}$$
There is an auxiliary regression (and an $R^2_j$) for each regressor:
$$\begin{aligned}
R^2_1 \text{ for } X_{1i} &= \gamma_0 + \gamma_1 X_{2i} + \gamma_2 X_{3i} \\
R^2_2 \text{ for } X_{2i} &= \zeta_0 + \zeta_1 X_{1i} + \zeta_2 X_{3i} \\
R^2_3 \text{ for } X_{3i} &= \eta_0 + \eta_1 X_{1i} + \eta_2 X_{2i}
\end{aligned}$$
$$VIF = \frac{1}{(1-R^2_j)}$$
$R^2_j$ is the $R^2$ from an auxiliary regression of $X_j$ on all other regressors (the other $X$'s)
$R^2_j$ tells us how much the other regressors explain regressor $X_j$
Key Takeaway: if the other $X$ variables explain $X_j$ well (high $R^2_j$), it will be harder to isolate the effect of $X_j$ on $Y_i$, and so $var(\hat{\beta}_j)$ will be higher
Baseline: $R^2_j = 0 \implies$ no multicollinearity $\implies VIF = 1$ (no inflation)
Larger $R^2_j \implies$ larger $VIF$
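For instance (a made-up value of $R^2_j$, not from our data): if the other regressors explain half of the variation in $X_j$, then
$$R^2_j = 0.5 \implies VIF = \frac{1}{1 - 0.5} = 2$$
so $var(\hat{\beta}_j)$ is twice what it would be with no multicollinearity.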
ggplot(data = CASchool, aes(x = str, y = el_pct)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(x = "Student to Teacher Ratio",
       y = "Percentage of ESL Students") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20)
CAcorr_ex <- subset(CASchool, select = c("testscr", "str", "el_pct"))

# Make a correlation table
cor(CAcorr_ex)

##            testscr        str     el_pct
## testscr  1.0000000 -0.2263628 -0.6441237
## str     -0.2263628  1.0000000  0.1876424
## el_pct  -0.6441237  0.1876424  1.0000000
# our multivariate regression
elreg <- lm(testscr ~ str + el_pct, data = CASchool)

# use the "car" package for the VIF function
library("car")

# syntax: vif(lm.object)
vif(elreg)

##      str   el_pct 
## 1.036495 1.036495
The variance of $\hat{\beta}$ on str increases by 1.036 times due to multicollinearity with el_pct
The variance of $\hat{\beta}$ on el_pct increases by 1.036 times due to multicollinearity with str
# run auxiliary regression of x2 on x1
auxreg <- lm(el_pct ~ str, data = CASchool)

# use broom package's tidy() command (cleaner)
library(broom)  # load broom
tidy(auxreg)    # look at reg output
term        | estimate   | std.error | statistic | p.value
(Intercept) | -19.854055 | 9.1626044 | -2.166857 | 0.0308099863
str         |   1.813719 | 0.4643735 |  3.905733 | 0.0001095165
glance(auxreg) # look at aux reg stats for R^2
r.squared  | adj.r.squared | sigma    | statistic | p.value      | df | logLik    | AIC
0.03520966 | 0.03290155    | 17.98259 | 15.25475  | 0.0001095165 | 2  | -1808.502 | 3623.003
# extract our R-squared from aux regression (R_j^2)
aux_r_squared <- glance(auxreg) %>% pull(r.squared)
aux_r_squared  # look at it
## [1] 0.03520966
# calculate VIF manually
our_vif <- 1 / (1 - aux_r_squared)  # VIF formula
our_vif
## [1] 1.036495
Example: For our Test Scores and Class Size example, what about district expenditures per student?
CAcorr2 <- subset(CASchool, select = c("testscr", "str", "expn_stu"))

# Make a correlation table
corr2 <- cor(CAcorr2)

# look at it
corr2

##            testscr        str   expn_stu
## testscr  1.0000000 -0.2263628  0.1912728
## str     -0.2263628  1.0000000 -0.6199821
## expn_stu 0.1912728 -0.6199821  1.0000000
ggplot(data = CASchool, aes(x = str, y = expn_stu)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Student to Teacher Ratio",
       y = "Expenditures per Student ($)") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20)
$cor(\text{Test score}, expn) \neq 0$
$cor(STR, expn) \neq 0$
Omitting expn will bias $\hat{\beta}_1$ on STR
Including expn will not bias $\hat{\beta}_1$ on STR, but will make it less precise (higher variance)
Data tells us little about the effect of a change in STR holding expn constant
expreg <- lm(testscr ~ str + expn_stu, data = CASchool)
summary(expreg)

## 
## Call:
## lm(formula = testscr ~ str + expn_stu, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.507 -14.403   0.407  13.195  48.392 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 675.577174  19.562222  34.535   <2e-16 ***
## str          -1.763216   0.610914  -2.886   0.0041 ** 
## expn_stu      0.002487   0.001823   1.364   0.1733    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.56 on 417 degrees of freedom
## Multiple R-squared:  0.05545, Adjusted R-squared:  0.05092 
## F-statistic: 12.24 on 2 and 417 DF,  p-value: 6.824e-06
vif(expreg)
##      str expn_stu 
## 1.624373 1.624373
Including expn_stu increases the variance of $\hat{\beta}_1$ by 1.62 times
$se(\hat{\beta}_1)$ on str increases from 0.48 to 0.61 when we add expn_stu

library(huxtable)
huxreg("Model 1" = school_reg,
       "Model 2" = expreg,
       coefs = c("Intercept" = "(Intercept)",
                 "Class Size" = "str",
                 "Expenditures per Student" = "expn_stu"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
                         | Model 1    | Model 2   
Intercept                | 698.93 *** | 675.58 ***
                         | (9.47)     | (19.56)   
Class Size               | -2.28 ***  | -1.76 **  
                         | (0.48)     | (0.61)    
Expenditures per Student |            | 0.00      
                         |            | (0.00)    
N                        | 420        | 420       
R-Squared                | 0.05       | 0.06      
SER                      | 18.58      | 18.56     
*** p < 0.001; ** p < 0.01; * p < 0.05.
$$\widehat{Sales} = \hat{\beta}_0 + \hat{\beta}_1\,\text{Temperature (C)} + \hat{\beta}_2\,\text{Temperature (F)}$$
$$\text{Temperature (F)} = 32 + 1.8 \times \text{Temperature (C)}$$
$cor(\text{temperature (F)}, \text{temperature (C)}) = 1$
$R^2_j = 1$ implies $VIF = \frac{1}{1-1}$: we would be dividing by zero, so $var(\hat{\beta}_j)$ is undefined
This is fatal for a regression: OLS cannot separately estimate the coefficients on two perfectly collinear regressors
Example:
$$\widehat{TestScore}_i = \hat{\beta}_0 + \hat{\beta}_1 STR_i + \hat{\beta}_2\,\%EL_i + \hat{\beta}_3\,\%ES_i$$
%EL: the percentage of students learning English
%ES: the percentage of students fluent in English
$\%ES = 100 - \%EL$
$|cor(\%ES, \%EL)| = 1$
# generate %EF variable from %EL
CASchool_ex <- CASchool %>%
  mutate(ef_pct = 100 - el_pct)

CASchool_ex %>%
  summarize(cor = cor(ef_pct, el_pct))

## cor
##  -1
ggplot(data = CASchool_ex, aes(x = el_pct, y = ef_pct)) +
  geom_point(color = "blue") +
  labs(x = "Percent of ESL Students",
       y = "Percent of Non-ESL Students") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20)
mcreg <- lm(testscr ~ str + el_pct + ef_pct, data = CASchool_ex)
summary(mcreg)

## 
## Call:
## lm(formula = testscr ~ str + el_pct + ef_pct, data = CASchool_ex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## str          -1.10130    0.38028  -2.896  0.00398 ** 
## el_pct       -0.64978    0.03934 -16.516  < 2e-16 ***
## ef_pct             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16
R ignores one of the multicollinear regressors (ef_pct) if you include both in a regression

$\hat{\beta}_j$ on $X_j$ is biased only if there is an omitted variable ($Z$) such that:
$Z$ is a determinant of $Y_i$
$Z$ is correlated with $X_j$
$var[\hat{\beta}_j]$ and $se[\hat{\beta}_j]$ measure the precision of the estimate:
$$var[\hat{\beta}_j] = \frac{1}{(1-R^2_j)} \times \frac{SER^2}{n \times var[X_j]}$$
Again, how well does a linear model fit the data?
How much variation in $Y_i$ is "explained" by variation in the model ($\hat{Y}_i$)?
$$Y_i = \hat{Y}_i + \hat{u}_i \implies \hat{u}_i = Y_i - \hat{Y}_i$$
$$SER = \sqrt{\frac{SSE}{n-k-1}}$$
A measure of the spread of the observations around the regression line (in units of Y), the average "size" of the residual
Only new change: we divide by $n-k-1$ due to using $k+1$ degrees of freedom to first estimate $\beta_0$ and then all of the other $\beta$'s for the $k$ regressors1
1 Again, because your textbook defines $k$ as including the constant, the denominator would be $n-k$ instead of $n-k-1$.
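A minimal sketch checking this in R for the multivariate regression elreg from above (so $n = 420$ and $k = 2$), using sigma() to pull the SER that summary() reports:

# SER "by hand": sqrt(SSE / (n - k - 1)) for the elreg model above
n   <- nobs(elreg)              # 420
k   <- 2                        # two regressors: str and el_pct
SSE <- sum(residuals(elreg)^2)

sqrt(SSE / (n - k - 1))         # about 14.46
sigma(elreg)                    # the "Residual standard error" from summary()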
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSE}{TSS} = (r_{X,Y})^2$$
Problem: $R^2$ of a regression increases every time a new variable is added (it reduces SSE!)
This does not mean adding a variable improves the fit of the model per se; $R^2$ gets inflated
We correct for this effect with the adjusted $R^2$:
$$\bar{R}^2 = 1 - \frac{n-1}{n-k-1} \times \frac{SSE}{TSS}$$
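And a short sketch computing $\bar{R}^2$ by hand for elreg and comparing it to what summary() reports (again assuming the CASchool data and elreg object from above):

# Adjusted R^2 "by hand": 1 - (n-1)/(n-k-1) * SSE/TSS
n   <- nobs(elreg)
k   <- 2
SSE <- sum(residuals(elreg)^2)
TSS <- sum((CASchool$testscr - mean(CASchool$testscr))^2)

1 - (n - 1) / (n - k - 1) * (SSE / TSS)   # about 0.4237
summary(elreg)$adj.r.squared              # same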
## 
## Call:
## lm(formula = testscr ~ str + el_pct, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
## str          -1.10130    0.38028  -2.896  0.00398 ** 
## el_pct       -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16
$R^2$ (R calls it "Multiple R-squared") went up
"Adjusted R-squared" went down

elreg %>% glance()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
0.426 | 0.424 | 14.5 | 155 | 4.62e-51 | 3 | -1.72e+03 | 3.44e+03 | 3.46e+03 | 8.72e+04 | 417 |