Problem Set 4

Due by Thursday, November 21, 2019

ANSWERS:

Instructions

For this problem set, you may submit handwritten answers on a plain sheet of paper, or download and type/handwrite on the PDF.

Alternatively, you may download the .Rmd file, do the homework in markdown, and email to me a single knitted html or pdf file (and be sure that it shows all of your code).

You may work together (and I highly encourage that) but you must turn in your own answers. I grade homeworks 70% for completion, and for the remaining 30%, pick one question to grade for accuracy - so it is best that you try every problem, even if you are unsure how to complete it accurately.

Theory and Concepts

Question 1

In your own words, describe what omitted variable bias means. What are the two conditions for an omitted variable to cause a bias?

Question 2

In your own words, describe what multicollinearity means. What is the cause, and what are the consequences of multicollinearity? How can we measure multicollinearity and its effects? What happens if multicollinearity is perfect?

Question 3

In your own words, describe what a proxy variable is. When or why would we use a proxy variable, and what effects does it have on our model?

Question 4

Explain how we use Directed Acyclic Graphs (DAGs) to depict a causal model: what are the two criteria that must hold for identifying a causal effect of X on Y? When should we control a variable, and when should we not control a variable?

Theory Problems

For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”

Question 5

Data were collected from a random sample of 220 home sales from a community in 2017.

^Price=119.2+0.485BDR+23.4Bath+0.156Hsize+0.002Lsize+0.090Age

Part A

Suppose that a homeowner converts part of an existing living space in her house to a new bathroom. What is the expected increase in the value of the house?

Part B

Suppose a homeowner adds a new bathroom to her house, which also increases the size of the house by 100 square feet. What is the expected increase in the value of the house?

Part C

Suppose the R2 of this regression is 0.727. Calculate the adjusted ˉR2.

Part D

Suppose the following auxiliary regression for BDR has an R2 of 0.841.

^BDR=δ0+δ1Bath+δ2Hsize+δ3Lsize+δ4Age

Calculate the Variance Inflation Factor for BDR and explain what it means.

Question 6

A researcher wants to investigate the effect of education on average hourly wages. Wage, education, and experience in the dataset have the following correlations:

Wage Education Experience
Wage 1.0000
Education 0.4059
Experience 0.1129 0.2995 1.0000

She runs a simple regression first, and gets the results:

^Wage=0.9049+0.5414Education

She runs another regression, and gets the results:

^Experience=35.46151.4681Education

Part A

If the true marginal effect of experience on wages (holding education constant) is 0.0701, calculate the omitted variable bias in the first regression caused by omitting experience. Does the estimate of ^β1 in the first regression overstate or understate the effect of education on wages?

Part B

Knowing this, what would be the true effect of education on wages, holding experience constant?

Part C

The R2 for the second regression is 0.0897. If she were to run a better regression including both education and experience, how much would the variance of the coefficients on education and experience increase? Why?

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (knitted Rmd to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your knitted markdown) your code and the outputs of your code with the rest of your answers.

Question 7

Download the heightwages.csv dataset. This data is a part of a larger dataset from the National Longitudinal Survey of Youth (NLSY) 1979 cohort: a nationally representative sample of 12,686 men and women aged 14-22 years old when they were first surveyed in 1979. They were subsequently interviewed every year through 1994 and then every other year afterwards. There are many included variables, but for now we will just focus on:

We want to figure out what is the effect of height on wages (e.g. do taller people earn more on average than shorter people?)

Part A

Create a quick scatterplot between height85 (as X) amd wage96 (as Y).

Part B

Regress wages on adult height. Write the equation of the estimated OLS regression. Interpret the coefficient on height85.

Part C

How much would someone who is 5’10" be predicted to earn per hour, according to the model?

Part D

Would adolescent height cause an omitted variable bias if it were left out? Explain using both your intuition, and some statistical evidence with R.

Part E

Now add adolescent height to the regression, and write the new regression equation below, as before. Interpret the coefficient on height85.

Part F

How much would someone who is 5’10" in 1985 and 4’8" in 1981 be predicted to earn, according to the model?

Part G

What happened to the estimate on height85 and its standard error?

Part H

Is there multicollinearity between height85 and height81? Explore with a scatterplot.Hint: to avoid overplotting, use geom_jitter() instead of geom_point() to get a better view of the data.

Part I

Quantify how much multicollinearity affects the variance of the OLS estimates on both heights.Hint: You’ll need the car package.

Part J

Reach the same number as in part I by running an auxiliary regression.Hint: There’s some missing wage96 data that may give you a different answer, so filter() your data here by !is.na(wage96) before running this regression - this will include only observations for wage96 that are not NA’s.

Part K

Make a regression table from part B and D using huxtable.