Problem Set 4
Due by Thursday, November 21, 2019
Instructions
For this problem set, you may submit handwritten answers on a plain sheet of paper, or download and type/handwrite on the PDF.
Alternatively, you may download the `.Rmd` file, do the homework in markdown, and email me a single knitted `html` or `pdf` file (be sure that it shows all of your code).
You may work together (and I highly encourage that), but you must turn in your own answers. I grade homework 70% for completion; for the remaining 30%, I pick one question to grade for accuracy, so it is best to try every problem, even if you are unsure how to complete it accurately.
Theory and Concepts
Question 1
In your own words, describe what omitted variable bias means. What are the two conditions for an omitted variable to cause a bias?
Question 2
In your own words, describe what multicollinearity means. What is the cause, and what are the consequences of multicollinearity? How can we measure multicollinearity and its effects? What happens if multicollinearity is perfect?
Question 3
In your own words, describe what a proxy variable is. When or why would we use a proxy variable, and what effects does it have on our model?
Question 4
Explain how we use Directed Acyclic Graphs (DAGs) to depict a causal model: what are the two criteria that must hold for identifying a causal effect of X on Y? When should we control a variable, and when should we not control a variable?
Theory Problems
For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use `R` to verify your answers, but you are expected to reach the answers in this section “manually.”
Question 5
Data were collected from a random sample of 220 home sales from a community in 2017.
$$\widehat{Price} = 119.2 + 0.485 \, BDR + 23.4 \, Bath + 0.156 \, Hsize + 0.002 \, Lsize + 0.090 \, Age$$
- Price: selling price (in $1,000s)
- BDR: number of bedrooms
- Bath: number of bathrooms
- Hsize: size of the house (in ft²)
- Lsize: lot size (in ft²)
- Age: age of the house (in years)
Part A
Suppose that a homeowner converts part of an existing living space in her house to a new bathroom. What is the expected increase in the value of the house?
Part B
Suppose a homeowner adds a new bathroom to her house, which also increases the size of the house by 100 square feet. What is the expected increase in the value of the house?
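One way to set up a change in several regressors at once, assuming all other regressors stay fixed, is to add up each coefficient times its change. The subscripted $\hat{\beta}$ labels below are shorthand introduced here for the estimated coefficients on Bath and Hsize; they are not notation from the problem itself.

```latex
\Delta \widehat{Price} = \hat{\beta}_{Bath} \, \Delta Bath + \hat{\beta}_{Hsize} \, \Delta Hsize
```

Remember that Price is measured in $1,000s when converting the result to dollars.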
Part C
Suppose the $R^2$ of this regression is 0.727. Calculate the adjusted $\bar{R}^2$.
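As a reminder, one common textbook form of the adjusted $\bar{R}^2$ is sketched below. The notation is an assumption and may differ from your course materials: $n$ is the number of observations and $k$ is the number of regressors, not counting the intercept.

```latex
\bar{R}^2 = 1 - \frac{n-1}{n-k-1} \left( 1 - R^2 \right)
```

For the regression above, the sample has $n = 220$ home sales and the model has $k = 5$ regressors.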
Part D
Suppose the following auxiliary regression for BDR has an $R^2$ of 0.841.
$$\widehat{BDR} = \delta_0 + \delta_1 Bath + \delta_2 Hsize + \delta_3 Lsize + \delta_4 Age$$
Calculate the Variance Inflation Factor for BDR and explain what it means.
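For reference, the Variance Inflation Factor for a regressor $X_j$ is usually written as below, where $R_j^2$ is the $R^2$ from the auxiliary regression of $X_j$ on all of the other regressors. The subscript $j$ is notation introduced here for illustration.

```latex
VIF_j = \frac{1}{1 - R_j^2}
```

The result is the factor by which the variance of the OLS estimate on $X_j$ is inflated relative to the case where $X_j$ is uncorrelated with the other regressors.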
Question 6
A researcher wants to investigate the effect of education on average hourly wages. Wage, education, and experience in the dataset have the following correlations:
| | Wage | Education | Experience |
|---|---|---|---|
| Wage | 1.0000 | | |
| Education | 0.4059 | 1.0000 | |
| Experience | 0.1129 | −0.2995 | 1.0000 |
She runs a simple regression first, and gets the results:
$$\widehat{Wage} = -0.9049 + 0.5414 \, Education$$
She runs another regression, and gets the results:
$$\widehat{Experience} = 35.4615 - 1.4681 \, Education$$
Part A
If the true marginal effect of experience on wages (holding education constant) is 0.0701, calculate the omitted variable bias in the first regression caused by omitting experience. Does the estimate of $\hat{\beta}_1$ in the first regression overstate or understate the effect of education on wages?
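A sketch of the standard omitted variable bias result may help here. The symbols are assumptions introduced for illustration: $\beta_1$ is the true effect of the included variable (Education), $\beta_2$ is the true marginal effect of the omitted variable (Experience), and $\delta_1$ is the slope from regressing the omitted variable on the included one.

```latex
E[\hat{\beta}_1] = \beta_1 + \underbrace{\beta_2 \, \delta_1}_{\text{bias}}
```

The sign of the product $\beta_2 \delta_1$ tells you whether the simple regression overstates or understates the true effect.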
Part B
Knowing this, what would be the true effect of education on wages, holding experience constant?
Part C
The $R^2$ for the second regression is 0.0897. If she were to run a better regression including both education and experience, how much would the variance of the coefficients on education and experience increase? Why?
R Questions
Answer the following questions using `R`. When necessary, please write answers in the same document (knitted `Rmd` to `html` or `pdf`, typed `.doc(x)`, or handwritten) as your answers to the above questions. Be sure to include (email or print an `.R` file, or show in your knitted `markdown`) your code and the outputs of your code with the rest of your answers.
Question 7
Download the `heightwages.csv` dataset. This data is a part of a larger dataset from the National Longitudinal Survey of Youth (NLSY) 1979 cohort: a nationally representative sample of 12,686 men and women aged 14-22 years old when they were first surveyed in 1979. They were subsequently interviewed every year through 1994 and then every other year afterwards. There are many included variables, but for now we will just focus on:
- `wage96`: Adult hourly wages ($/hr) reported in 1996
- `height85`: Adult height (inches) reported in 1985
- `height81`: Adolescent height (inches) reported in 1981
We want to estimate the effect of height on wages (e.g., do taller people earn more on average than shorter people?).
Part A
Create a quick scatterplot between `height85` (as X) and `wage96` (as Y).
Part B
Regress wages on adult height. Write the equation of the estimated OLS regression. Interpret the coefficient on `height85`.
Part C
How much would someone who is 5’10" be predicted to earn per hour, according to the model?
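The regression and prediction steps above can be sketched in base `R` as follows. This is only an illustration under stated assumptions: the data frame `heights` is simulated stand-in data (not the real NLSY sample), and the object names `reg1`, `new_person`, and `pred` are hypothetical. With the real file you would load `heightwages.csv` instead, so your numbers will differ.

```r
# Stand-in data: simulated values, NOT the real NLSY sample (assumption).
set.seed(1)
n <- 200
heights <- data.frame(height85 = round(rnorm(n, mean = 67, sd = 4)))
heights$wage96 <- 5 + 0.1 * heights$height85 + rnorm(n, sd = 5)

# With the real data you would instead load the file, e.g.:
# heights <- read.csv("heightwages.csv")

# Simple regression of wages on adult height
reg1 <- lm(wage96 ~ height85, data = heights)
coef(reg1)  # intercept and slope of the fitted line

# 5'10" converted to inches: 5 * 12 + 10
new_person <- data.frame(height85 = 5 * 12 + 10)

# Predicted hourly wage at that height
pred <- predict(reg1, newdata = new_person)
pred
```

Using `predict()` with a one-row data frame avoids retyping rounded coefficients by hand, which is a common source of small arithmetic errors.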
Part D
Would adolescent height cause an omitted variable bias if it were left out? Explain using both your intuition, and some statistical evidence with `R`.
Part E
Now add adolescent height to the regression, and write the new regression equation below, as before. Interpret the coefficient on `height85`.
Part F
How much would someone who is 5’10" in 1985 and 4’8" in 1981 be predicted to earn, according to the model?
Part G
What happened to the estimate on `height85` and its standard error?
Part H
Is there multicollinearity between `height85` and `height81`? Explore with a scatterplot.
Hint: to avoid overplotting, use `geom_jitter()` instead of `geom_point()` to get a better view of the data.
Part I
Quantify how much multicollinearity affects the variance of the OLS estimates on both heights.
Hint: You’ll need the `car` package.
Part J
Reach the same number as in Part I by running an auxiliary regression.
Hint: Some `wage96` values are missing, which may give you a different answer, so `filter()` your data by `!is.na(wage96)` before running this regression. This keeps only the observations where `wage96` is not `NA`.
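A minimal sketch of the auxiliary-regression approach, again on simulated stand-in data rather than the real `heightwages.csv`. The object names (`heights_obs`, `aux`, `vif_manual`, `reg2`) are hypothetical, and the base-`R` row subsetting stands in for the `filter()` call mentioned in the hint. The cross-check with `car::vif()` only runs if the `car` package is installed.

```r
# Stand-in data with some missing wages: simulated, NOT the real NLSY sample.
set.seed(2)
n <- 300
height81 <- rnorm(n, mean = 66, sd = 4)
height85 <- height81 + rnorm(n, mean = 1, sd = 1.5)
wage96 <- 5 + 0.2 * height85 + rnorm(n, sd = 6)
wage96[sample(n, 30)] <- NA  # 30 observations with missing wages
heights <- data.frame(wage96, height85, height81)

# Keep only rows with observed wages
# (base-R equivalent of filtering by !is.na(wage96))
heights_obs <- heights[!is.na(heights$wage96), ]

# Auxiliary regression: one height on the other
aux <- lm(height85 ~ height81, data = heights_obs)
r2_aux <- summary(aux)$r.squared

# VIF = 1 / (1 - R^2 of the auxiliary regression)
vif_manual <- 1 / (1 - r2_aux)
vif_manual

# Optional cross-check, only if the car package is available
if (requireNamespace("car", quietly = TRUE)) {
  reg2 <- lm(wage96 ~ height85 + height81, data = heights_obs)
  print(car::vif(reg2))
}
```

Dropping the rows with missing wages first matters because `lm()` silently omits them from the wage regression, so an auxiliary regression on the full data would be fit on a different sample.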
Part K
Make a regression table from parts B and D using `huxtable`.