Problem Set 2

Due by Thursday, September 26, 2019

ANSWERS:

Instructions

For this problem set, you may submit handwritten answers on a plain sheet of paper, or download and type/handwrite on the PDF.

Alternatively, you may download the .Rmd file, do the homework in markdown, and email to me a single knitted html or pdf file (and be sure that it shows all of your code).

You may work together (and I highly encourage that) but you must turn in your own answers. I grade homeworks 70% for completion, and for the remaining 30%, pick one question to grade for accuracy - so it is best that you try every problem, even if you are unsure how to complete it accurately.

Theory and Concepts

Question 1

Part A

In your own words, explain the difference between endogeneity and exogeneity.Hint: you may want to look back at class 1.2 for this question.

Part B

In your own words, explain how conducting a randomized controlled trial helps to identify the causal connection between two variables.

Question 2

Part A

In your own words, explain what (sample) standard deviation means.

Part B

In your own words, explain how (sample) standard deviation is calculated. You may also write the formula, but it is not necessary.

Problems

For the remaining questions, you may use R to verify, but please calculate all sample statistics by hand and show all work.

Question 3

Suppose you have a very small class of four students that all take a quiz. Their scores are reported as follows:

Part A

Calculate the median.

Part B

Calculate the sample mean, .

Part C

Calculate the sample standard deviation, .

Part D

Make or sketch a rough histogram of this data, with the size of each bin being 10 (i.e. 70’s, 80’s, 90’s, 100’s). You can draw this by hand or use R.If you are using ggplot, you want to use +geom_histogram(breaks=seq(start,end,by)) and add +scale_x_continuous(breaks=seq(start,end,by)). For each, it creates bins in the histogram, and ticks on the x axis by creating a sequence starting at start (a number), ending at end (number), by a certain interval (i.e. by 10s.).

Is this distribution roughly symmetric or skewed? What would we expect about the mean and the median?

Part E

Suppose instead the person who got the 72 did not show up that day to class, and got a 0 instead. Recalculate the mean and median. What happened and why?

Question 4

Suppose the probabilities of a visitor to Amazon’s website buying 0, 1, or 2 books are 0.2, 0.4, and 0.4 respectively.

Part A

Calculate the expected number of books a visitor will purchase.

Part B

Calculate the standard deviation of book purchases.

Part C

Bonus: try doing this in R by making an initial dataframe of the data, and then making new columns to the “table” like we did in class.

Question 5

Scores on the SAT (out of 1600) are approximately normally distributed with a mean of 500 and standard deviation of 100.

Part A

What is the probability of getting a score between a 400 and a 600?

Part B

What is the probability of getting a score between a 300 and a 700?

Part C

What is the probability of getting at least a 700?

Part D

What is the probability of getting at most a 700?

Part E

What is the probability of getting exactly a 500?

Question 6

Redo problem 5 by using the pnorm() command in R.Hint: This function has four arguments: 1. the value of the random variable, 2. the mean of the distribution, 3. the sd of the distribution, and 4. lower.tail TRUE or FALSE.