Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.
In your own words, explain the difference between endogeneity and exogeneity.1
An exogenous model is one where the independent variable (X) is not associated with any other factors that affect the dependent variable (Y). If a model is truly exogenous, we can estimate the causal effect of X on Y.
An endogenous model is one where the independent variable (X) is associated with any other factors that affect the dependent variable (Y). If a model is endogenous, we have no accurately estimated the causal effect of X on Y, since other factors are getting entangled with X and Y.
In your own words, explain how conducting a randomized controlled trial helps to identify the causal connection between two variables.
Randomized controlled experiments are where a pool of subjects representative of a population are randomly assigned into a treatment group (or into one of a number of groups given different levels of a treatment) or into a control group. The treatment group(s) is(are) given the treatment(s), the control group is given nothing (perhaps a placebo, though this is not always necessary), and then the average results of the two groups are compared to measure the true average effect of treatment.
The key is that the assignment must be random, which controls for all factors that potentially determine the outcome (e.g. when measuring individual outcomes, their height, family background, income, race, age, etc). If subjects are randomly assigned, then knowing anything about the individual (e.g. age, height, etc) tells us nothing about whether or not they got the treatment(s). The only thing that separates a member of the treatment group(s) from the control group is whether or not they were assigned to treatment. This ensures that the average person in the treatment group(s) looks like the average person in the control group, and that we are truly comparing apples to apples, rather than apples to oranges.
In your own words, explain what (sample) standard deviation means.
Sample standard deviation measures the average deviation (distance) of any given value of a variable from the variable’s mean.
In your own words, explain how (sample) standard deviation is calculated. You may also write the formula, but it is not necessary.
The formula is sd(X)=√n∑i=1(xi−ˉX)2n−1
It helps to consider the calculation as a series of steps:
Note on Step 4: because this is a sample, we have do deal with degrees of freedom (df) loss. We use up one df to calculate the mean, ˉx, which is needed before calculating variance or standard deviation. Hence, instead of averaging like normal: 1n∑xi, we need to divide by n−1.
Suppose you have a very small class of four students that all take a quiz. Their scores are reported as follows:
{83,92,72,81}
For the remaining questions, you may use R
to verify, but please calculate all sample statistics by hand and show all work.
Calculate the median.
Arrange the values in numerical order from smallest to largest. Find the value in the middle (i.e. an equal number of values are on either side); possibly by crossing-out one number on either side at a time (like in Elementary School).
72_,81,83,92_
Since we have an even number of observations, we have two numbers in the middle, 81 and 83, so we must take the average of them:
81+832=82
Using R:
## [1] 82
Calculate the sample mean, ˉx.
ˉx=1nn∑i=1xiˉx=72+81+83+924ˉx=3284ˉx=82
Using R:
## [1] 82
Calculate the sample standard deviation, s.
My suggestion is to use the “table” method, and follow the 5 steps described in problem 2b.
xi | xi−ˉX | (xi−ˉX)2 |
---|---|---|
72 | −10 | 100 |
81 | −1 | 1 |
83 | 1 | 1 |
92 | 10 | 100 |
∑ | 202 | |
13×∑ | ≈67.33 | |
√13×∑ | ≈8.21 |
In R:
## [1] 8.205689
Make or sketch a rough histogram of this data, with the size of each bin being 10 (i.e. 70’s, 80’s, 90’s, 100’s). You can draw this by hand or use R
.2 Is this distribution roughly symmetric or skewed? What would we expect about the mean and the median?
# load tidyverse (for tibble and ggplot2)
library(tidyverse)
# make a dataframe of our data,
# called df
# one variable in it, called quiz
df <- tibble(quiz = c(83,92,72,81))
# use this as our data for plot
ggplot(data = df)+
aes(x = quiz)+
geom_histogram(breaks=seq(0,100,10), # make bins of size 10 between 0 and 100
color = "white", # color is for borders
fill = "blue")+ # fill is for area
scale_x_continuous(breaks=seq(0,100,10))+ # have x axix ticks same as breaks
theme_classic()
We can see it is roughly symmetric. We would therefore expect the mean and the median to be approximately the same (which parts A and B showed was true).
Suppose instead the person who got the 72 did not show up that day to class, and got a 0 instead. Recalculate the mean and median. What happened and why?
0_,81,83,92_
81+832=82
Replacing the 72 with a 0, and keeping the same number of observations does not change the median!
ˉx=1nn∑i=1xiˉx=0+81+83+924ˉx=2564ˉx=64
The mean is pulled down significantly by the outlier.
In R:
## [1] 64
## [1] 82
If we were to look at the histogram, it would be skewed, and the mean would be lower than the mean:
# make new tibble called df_2
df_2 <- tibble(quiz = c(83,92,0,81)) # replace 72 with 0
# use this as our data for plot
ggplot(data = df_2)+
aes(x = quiz)+
geom_histogram(breaks=seq(0,100,10), # make bins of size 10 between 0 and 100
color = "white", # color is for borders
fill = "blue")+ # fill is for area
geom_vline(aes(xintercept = median(quiz)), size = 1, color = "green", linetype = "dashed")+ # green dashed line is median
geom_label(aes(x = median(quiz), y = 1), label = "Median", color = "green")+ # label median line
geom_vline(aes(xintercept = mean(quiz)), size = 1, color = "red", linetype = "dotted")+ # green dashed line is mean
geom_label(aes(x = mean(quiz), y = 1), label = "Mean", color = "red")+ # label mean line
scale_x_continuous(breaks=seq(0,100,10))+ # have x axix ticks same as breaks
theme_classic()
Suppose the probabilities of a visitor to Amazon’s website buying 0, 1, or 2 books are 0.2, 0.4, and 0.4 respectively.
Calculate the expected number of books a visitor will purchase.
Define X to be the number of books a visitor to Amazon’s website purchases. The pdf of X is as follows:
xi | P(X=xi) |
---|---|
0 | 0.20 |
1 | 0.40 |
2 | 0.40 |
The expected value of X is the probability weighted average of X:
E(X)=n∑i=1pixi=0.2(0)+(0.4)1+(0.4)2=0+0.4+0.8=1.2
Calculate the standard deviation of book purchases.
The formula(s) for standard deviation of a random variable is:
σX=sd(X)=√E[(X−E[X])2]=√n∑i=1pi(xi−E[X])2
I suggest using the table method, again. Working from the inside out of the formula, the steps are:
xi | P(X=xi) | xi−E[X] | (xi−E[X])2 | pi(xi−E[X])2 |
---|---|---|---|---|
0 | 0.20 | −1.20 | 1.44 | 0.288 |
1 | 0.40 | −0.20 | 0.04 | 0.016 |
2 | 0.40 | 0.80 | 0.64 | 0.256 |
∑ | 0.560 | |||
√∑ | 0.748 |
Bonus: try doing this in R
by making an initial dataframe of the data, and then making new columns to the “table” like we did in class.
books <dbl> | prob <dbl> | |||
---|---|---|---|---|
0 | 0.2 | |||
1 | 0.4 | |||
2 | 0.4 |
exp_value <dbl> | ||||
---|---|---|---|---|
1.2 |
books <dbl> | prob <dbl> | devs <dbl> | devs_sq <dbl> | p_weight_devs_sq <dbl> |
---|---|---|---|---|
0 | 0.2 | -1.2 | 1.44 | 0.288 |
1 | 0.4 | -0.2 | 0.04 | 0.016 |
2 | 0.4 | 0.8 | 0.64 | 0.256 |
var <dbl> | sd <dbl> | |||
---|---|---|---|---|
0.56 | 0.7483315 |
Scores on the SAT (out of 1600) are approximately normally distributed with a mean of 500 and standard deviation of 100.
What is the probability of getting a score between a 400 and a 600?
Let random variable S be the score earned on the SAT.
Convert thse numbers to Z-scores.
P(400≤S≤600)=P(400−500100≤S−500100≤600−500100)=P(−1≤Z≤1)≈0.68
Using the 68-95-99.7 rule: about 68% of the values fall within one standard deviation (±1 Z-score) of the mean.
You don’t need to draw the pdf, but it helps to visualize what we’re looking for, and how converting to Z-scores helps:
# see class 2.3 notes on how to graph and shade stats graphs
# it helps to first figure out where the x-axis ticks should be
# show about 4 standard deviations above and below the mean (mu +/- 4*sd)
# then have ticks in intervals of one sd
# in this case, with mean 500 and sd 100, it should be seq(100,900,100)
s_plot<-ggplot(data = tibble(scores=seq(from = 100,
to = 900,
by = 100)))+
aes(x = scores)+
stat_function(fun = dnorm,
args = list(mean = 500, sd = 100),
size = 2, color = "blue")+
labs(x = "SAT Scores (out of 1600)",
y = "Probability")+
scale_x_continuous(breaks=seq(from = 100,
to = 900,
by = 100))+
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)
s_plot+stat_function(fun = dnorm,
args = list(mean = 500, sd = 100),
geom = "area",
xlim = c(400,600),
size = 2, fill = "blue", alpha = 0.5)
Z<-ggplot(data = tibble(Z=seq(from = -4,
to = 4,
by = 1)))+
aes(x = Z)+
stat_function(fun = dnorm,
size = 2, color = "blue")+
labs(x = "Z-Scores",
y = "Probability")+
scale_x_continuous(breaks=seq(from = -4,
to = 4,
by = 1))+
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)
Z+stat_function(fun = dnorm,
geom = "area",
xlim = c(-1,1),
size = 2, fill = "blue", alpha = 0.5)
What is the probability of getting a score between a 300 and a 700?
Convert thse numbers to Z-scores.
P(300≤S≤700)=P(300−500100≤S−500100≤700−500100)=P(−2≤Z≤2)≈0.95
Using the 68-95-99.7 rule: about 95% of the values fall within two standard deviations (±2 Z-score) of the mean.
What is the probability of getting at least a 700?
We saw in part B that Z for 700 is 2.
Using the 68-95-99.7 rule, we know about 95% of the values fall within two standard deviations (±2 Z-score) of the mean. That means that 5% of values fall beyond a Z score of ±2, or 2.5% in each direction. We only want the right-tail (probability above Z=2, so 2.5%).
What is the probability of getting at most a 700?
We saw in Part C that the area to the right of Z=2 is 2.5%, so the remaining 97.5% falls to the left of Z=2.
What is the probability of getting exactly a 500?
This is a trick question! For a continuous random variable, the probability of any one specific value is 0.
Redo problem 5 by using the pnorm()
command in R
.3
Look back to the graphs of the pdfs in Question 5 to visualize what we are looking for.
pnorm
converts X to Z and takes the standard normal cdf, Φ of a variable, i.e.
Φ(k)=P(Z≤k)
That means, it calculates the probability of everything up to (to the left of) that value on the pdf. So, if you want to calculate the area between values j and k, you need to take the cdf of the larger number and subtract the cdf of the smaller number:
P(j≤Z≤k)=Φ(k)−Φ(j)
In this case (in Z-scores):
P(−1≤Z≤1)=Φ(1)−Φ(−1)
## [1] 0.6826895
In this case (in Z-scores):
P(−2≤Z≤2)=Φ(2)−Φ(−2)
## [1] 0.9544997
In this case (in Z-scores):
P(Z≥2)=1−P(Z≤2)=1−Φ(2)
## [1] 0.02275013
In this case (in Z-scores):
P(Z≤2)=Φ(2)
## [1] 0.9772499
We can see that the 68-95-99.7 rule is close, but not exactly equal to the true probabilities of Z. As such, it’s just a good rule of thumb!
Again, the probability is 0, no need to do anything here.
Hint: you may want to look back at class 1.2 for this question.↩︎
If you are using ggplot
, you want to use +geom_histogram(breaks=seq(start,end,by))
and add +scale_x_continuous(breaks=seq(start,end,by))
. For each, it creates bins in the histogram, and ticks on the x axis by creating a seq
uence starting at start
(a number), ending at end
(number), by
a certain interval (i.e. by 10
s.).↩︎
Hint: This function has four arguments: 1. the value of the random variable, 2. the mean of the distribution, 3. the sd of the distribution, and 4. lower.tail
TRUE
or FALSE
.↩︎