Theory and Concepts

Question 1
Question 2
Question 3
Question 4
Question 5
Question 6
Question 7
Question 8

Theory Problems

Question 9

Part A
Part B
Part C
Part D
Part E
Part F
Part G
Part H

Question 10

Part A
Part B
Part C
Part D
Part E

R Questions

Question 11

Part A
Part B
Part C
Part D
Part E
Part F
Part G
Part H
Part I
Part J
Part K
Part L
Part M
Part N

Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.

Theory and Concepts

Question 1

In your own words, describe what exogeneity and endogeneity mean, and how they are related to bias. What can we learn about the bias if we know is endogenous?

The OLS estimators and are unbiased estimates of the true population parameters and if and only if is exogenous. That is to say, if (i.e. there is no correlation between and any unobserved variable that affects ), then .

If is correlated with the error term, then is endogenous. The true expected value of the OLS estimator is

The bias is , i.e. the difference between average estimated sample slope and the `true’ population slope, so we can determine first the size of the bias based on how large is. The stronger the correlation, the larger the bias.

Second, we can determine the direction of the bias depending on the sign of .

If and are positively correlated (move in the same direction), we know that we have overstated the true effect of on , since a change in is picking up both a change in and a further change (in the same direction as ) in the unobserved .
If the correlation is negative (move in opposite directions), we know that we have understated the true effect of on , since a change in is picking up both a change in that is dampened by a change in the opposite direction of .

Question 2

In your own words, describe what means. How do we calculate it, what does it tell us, and how do we interpret it?

The is a measure of how well the OLS regression line “fits” our observed data points. It is the proportion of the total variation in (TSS) that is explained by the variation from our model (ESS):

Equivalently, it can be found by subtracting the proportion of unexplained variation in (SSE/TSS) from 1:

This is because . Finally, is the square of the correlation coefficient between and ,

The closer is to 1, the better the fit, the closer to 0, the poorer the fit.

Question 3

In your own words, describe what the standard error of the regression () means. How do we calculate it, what does it tell us, and how do we interpret it?

SER ) is the average size of the error (or residual), , that is, the average distance from the regression line to the actual data value for at a given . The goal of OLS is to minimize this (well, technically minimize the sum of squared errors!).

We calculate it by squaring the residuals (to get a positive distance) and taking the mean of them by adding them all up and dividing by , and then taking the square root to return to normal (non-squared) units.

We divide by rather than by due to the degrees of freedom correction for calculating two prior statistics with our data already, and .

Question 4

In your own words, describe what homoskedasticity and heteroskedasticity mean: both in ordinary English, and in terms of the graph of the OLS regression line.

Homoskedasticity means the errors are distributed with the same variance for all levels of X. Knowing anything about X will not tell us anything about the distribution of errors at that level of X.

Heteroskedasticity means the errors are distributed differently for different levels of X. So, at different levels of X, there will be much more or much less variation in the residuals.

Question 5

In your own words, describe what the variation in (either variance or standard error) means, or is measuring. What three things determine the variation, and in what way?

The variation of (either it’s variance or standard error) is a measure of how precise our estimate is. This idea comes from the sampling distribution of , since it is a random variable: if we were to take other samples and calculate the slope of a regression line for each, the estimate would vary from sample to sample.

The standard error of (square this to get variance) is:

The three things that affect it are:

Goodness of Fit of the Regression or . The worse the fit, the higher the , and the worse the precision (higher standard error) of .
Sample size, : the more data, the better the precision (lower standard error) of .
Standard error of : the more variation (spread) in -values, the better the precision (lower standard error) of .

See the graphs in slides 6-8 of class 2.5 for more.

Question 6

In your own words, describe what a -value means, and how it is used to establish statistical significance.

The -value is the probability that, if the null hypothesis were true, of observing a test statistic at least as extreme as the one found in our sample. Specifically, if , it is the probability of getting a sample slope at least as extreme as the one in our sample, if the slope were truly 0.¹

where is a test-statistic and is the test statistic we obtained from our sample.

Another way to interpret this is that the -value is the probability we commit a Type I error: the probability that, if the null hypothesis were true, we falsely reject it from our sample evidence.

Be careful, the -value is not the probability that our alternative hypothesis is true given our findings (commonly believed)! In fact it is basically the opposite, the probability of our findings being valid given the null hypothesis!

Question 7

A researcher is interested in examining the impact of illegal music downloads on commercial music sales. The author collects data on commercial sales of the top 500 singles from 2017 () and the number of downloads from a web site that allows `file sharing’ (). The author estimates the following model

The author finds a large, positive, and statistically significant estimate of . The author concludes these results demonstrate that illegal downloads actually boost music sales. Is this an unbiased estimate of the impact of illegal music on sales? Why or why not? Do you expect the estimate to overstate or understate the true relationship between illegal downloads and sales?

Does knowing the amount of illegal downloads an artist has convey any information about other variables that affect music sales? In other words, we are asking if (or more simply, .

It is likely that artists and songs that are the most heavily pirated are the most popular ones, and also are likely have very high music sales. Economists say piracy is like a tax on success–it happens more to those who are already successful and less to those who are still trying to make it big.

In any case, illegal downloads is probably endogenous. Since there is likely a positive correlation between music sales and popularity (in the error term), and popularity is also positively correlated with music sales, it is likely that we are overstating the effect of illegal downloads on sales. In other words, is also picking up the positive effect of popular songs, and is too large. The true estimate of is likely much lower than measured.

Question 8

A pharmaceutical company is interested in estimating the impact of a new drug on cholesterol levels. They enroll 200 people in a clinical trial. People are randomly assigned the treatment group or into the control group. Half of the people are given the new drug and half the people are given a sugar pill with no active ingredient. To examine the impact of dosage on reductions in cholesterol levels, the authors of the study regress the following model:

For people in the control group, dosage level and for people in the treatment group, dosage level measures milligrams of the active ingredient. In this case, the authors find a large, negative, statistically significant estimate of . Is this an unbiased estimate of the impact of dosage on change in cholesterol level? Why or why not? Do you expect the estimate to overstate or understate the true relationship between dosage and cholesterol level?

Does knowing whether (or how much) a person was treated convey any information about other characteristics that affect cholesterol level (in )? Again, we are asking if or

In this case, the answer is clearly no, the equations do hold and treatment is exogenous. is determined by random assignment, some people assigned a the drug, and some nothing at all. But, the important fact is that dosage is determined randomly so we expect that on average it will not be correlated with . In this case, , is unbiased.

Theory Problems

For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”

Question 9

A researcher wants to estimate the relationship between average weekly earnings , measured in dollars) and (measured in years) using a simple OLS model. Using a random sample of college-educated full-time workers aged 25-65 yields the following:

Part A

Interpret what means in this context.

is 696.70. This is the vertical intercept of the regression line. It means that a person that is 0 years old earns a $696.70 per week on average. This is often nonsensical, so we don’t often care about the economic meaning of the intercept.

Part B

Interpret what means in this context.

is 9.60 This is the slope of the regression line. It means that for every year older a person is, they can expect their wages to increase by $9.60, on average. This is the marginal effect of Age on AWE (and the causal effect if this model were exogenous).

Part C

The for this regression. What are the units of the , and what does this mean?

has no units, it is the proportion of variation in that is explained by our model, between 0 and 1. This model explains only 2.3% of the variation in , meaning this model is poor, and the line does not fit the data points well.

Part D

The for this regression. What are the units of the SER in this context, and what does it mean? Is the SER large in the context of this regression?

is measured in the same units as the dependent variable, , so it is measured in dollars. It is the average error or residual for an individual, the difference (in dollars) between OLS’ predicted for that person, and their true in the data. This SER is quite big, $624 in average weekly earnings.

Part E

Suppose Maria is 20 years old. What is her predicted ?

Part F

Suppose the data shows her actual is $430. What is her residual? Is this a relatively good or a bad prediction?²

This actually a relatively good prediction, as it is much lower than the average prediction error (SER), which was 624.1.

Part G

What does the error term, represent in this case? What might individuals have different values of ?

The error term represents all factors other than age that affects an individual’s average weekly earnings. This could include things like experience, ability, job type, education level, conscienciousness etc.

Part H

Do you think that is exogenous? Why or why not? Would we expect to be too large or too small?

It’s very unlikely that is exogenous. Knowing someone’s age likely gives us information about : we can guess about their experience or level of education (they are likely higher for older people), and most of these positively affect wages. Thus, we have probably overstimated the effect of age on earnings (i.e. ), and the true is likely smaller.

Question 10

Suppose a researcher is interested in estimating a simple linear regression model:

In a sample of 48 observations, she generates the following descriptive statistics:

Part A

What is the OLS estimate of ?

The formula for

Part B

What is the OLS estimate of ?

The formula for

Part C

Suppose the OLS estimate of has a standard error of . Could we probably reject a null hypothesis of at the 95% level?

Yes, we could reject the null hypothesis as the estimate of is more than 2 times its standard error of 0.072. The test-statistic would actually be

This is well beyond the critical value needed to reject , and the -value would be basically 0.

Part D

Calculate the for this model. How much variation in is explained by our model?

We know TSS (4 bullet point) and SSE (last bullet point).

This model explains 94.3% of the variation in .

Part E

How large is the average residual?

We need to find the standard error of the regression (SER), but luckily we know the SSE (last bullet point)

This tells us the average residual is 36 (units of ).

R Questions

For the following problems, please attach/write the answers to each question on the same document as the previous problems, but also include a printed/attached (and commented!) .R script file of your commands to answer the questions.

Question 11

Download the MLBattend dataset from Blackboard. This data contains data on attendance at major league baseball games for all 32 MLB teams from the 1970s-2000.

# first load tidyverse 
library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# import data
mlb<-read_csv("../data/MLBAttend.csv")

## Parsed with column specification:
## cols(
##   team = col_character(),
##   city = col_character(),
##   nickname = col_character(),
##   league = col_character(),
##   division = col_character(),
##   season = col_double(),
##   home_attend = col_double(),
##   runs_scored = col_double(),
##   runs_allowed = col_double(),
##   wins = col_double(),
##   losses = col_double(),
##   games_behind = col_double()
## )

Part A

Clean up the data a bit by making a new variable to measure home attendance in millions. This will make it easier to interpret your regression later on.

# make home attendance variable in millions
mlb<-mlb %>%
  mutate(home_attend_mil=home_attend/1000000)

Part B

Get the correlation between Runs Scored and Home Attendance.

# summarize and get correlation
mlb %>%
  summarize(Correlation = cor(runs_scored, home_attend_mil))

ABCDEFGHIJ0123456789

Correlation <dbl>
0.4944431

Part C

Plot a scatterplot of Runs Scored (y) on Home Attendance (x). Add a regression line.

# create scatterplot with regression line 
scatter<-ggplot(data = mlb)+
  aes(x = home_attend_mil,
      y = runs_scored)+
  geom_point()+
  geom_smooth(method="lm")
scatter

Part D

Run a regression of Runs Scored on Home Attendance. What are and ? Interpret them in the context of our question.

# run regression, save as reg
reg<-lm(runs_scored ~ home_attend_mil, data = mlb)

# get summary of reg
summary(reg)

## 
## Call:
## lm(formula = runs_scored ~ home_attend_mil, data = mlb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -295.566  -52.754    1.414   63.769  271.377 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      572.618      8.081   70.86   <2e-16 ***
## home_attend_mil   68.798      4.183   16.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.47 on 836 degrees of freedom
## Multiple R-squared:  0.2445, Adjusted R-squared:  0.2436 
## F-statistic: 270.5 on 1 and 836 DF,  p-value: < 2.2e-16

# Here I'm going to save beta 0 hat and beta 1 hat
# as objects to call up in the text of the markdown document
# We'll need broom and tidy() first
library(broom)
reg_tidy<-tidy(reg)

reg_tidy

ABCDEFGHIJ0123456789

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	572.61829	8.080646	70.86294	0.000000e+00
home_attend_mil	68.79777	4.182920	16.44731	7.132757e-53

# first beta 0 hat 

beta_0_hat<-reg_tidy %>%
  filter(term=="(Intercept)") %>% # look at intercept row
  pull(estimate) %>% # extract beta 0 hat
  round(., 2) # round to 2 decimal places

beta_0_hat # check 

## [1] 572.62

# now beta 1 hat 

beta_1_hat<-reg_tidy %>%
  filter(term=="home_attend_mil") %>% # look at X-variable row
  pull(estimate) %>% # extract beta 1 hat
  round(., 2) # round to 2 decimal places

beta_1_hat

## [1] 68.8

572.62 : 68.8

Teams that have a home attendance of 0 over their season score 572.618 runs. For every additional 1 million fans attending at home games over the season, a team score s68.798 more runs.

Part E

Write out the estimated regression equation.

# if you are using markdown, try out the equatiomatic package
#install.packages("equatiomatic")
library(equatiomatic)
extract_eq(reg, # the regression
           use_coefs = TRUE, # use the estimated numbers
           coef_digits = 3, # how many digits to show
           fix_signs = TRUE) # fix negatives

## $$
## \text{runs\_scored} = 572.618 + 68.798(\text{home\_attend\_mil}) + \epsilon
## $$

Part F

Make a regression table of the output.

# load huxtable
library(huxtable)

## 
## Attaching package: 'huxtable'

## The following object is masked from 'package:dplyr':
## 
##     add_rownames

## The following object is masked from 'package:purrr':
## 
##     every

## The following object is masked from 'package:ggplot2':
## 
##     theme_grey

## The following object is masked from 'package:pander':
## 
##     wrap

huxreg(reg, # this is sufficient, the rest is decoration
       coefs = c("Constant" = "(Intercept)",
                 "Home Attendance (Millions)" = "home_attend_mil"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 3)

	(1)
Constant	572.618 ***
	(8.081)
Home Attendance (Millions)	68.798 ***
	(4.183)
N	838
R-Squared	0.244
SER	91.473
* p < 0.001; p < 0.01; * p < 0.05.

Part G

Now let’s start running some diagnostics of the regression. Make a histogram of the residuals. Do they look roughly normal?

# here we need broom's augment() command to add residuals to the data

# load broom
library(broom)
# augment the regression, save as reg_aug
reg_aug<-reg %>%
  augment()

# now we use this as the data in our histogram plot in ggplot, (x is .resid)
ggplot(data = reg_aug)+
  aes(x = .resid)+
  geom_histogram(color="white")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

They do look roughly normal.

Part H

Make a residual plot.

# this is another plot from reg_aug, where x is home attendance and y is .fitted
ggplot(data = reg_aug)+
  aes(x = home_attend_mil,
      y = .resid)+
  geom_point()+
  geom_hline(yintercept=0, color="red")

Part I

Test the regression for heteroskedasticity. Are the errors homoskedastic or heteroskedastic? Generate robust standard errors. Make a regression output table, with one column with regular standard errors and another with robust standard errors.

# this requires the lmtest package for the bptest() command

# install.packages("lmtest")
# load lmtest
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(reg)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 0.22515, df = 1, p-value = 0.6351

The null hypothesis is that the errors are homoskedastic. The -value for this test is very large, so we cannot reject the null hypothesis.

This is good, it means the errors are homoskedastic, and our OLS estimators’ standard errors are accurate and do not need to be corrected for heteroskedasticity.

Part J

Test the data for outliers. If there are any, identify which team(s) and season(s) are outliers.

# this requires the car package for the outlierTest() command

# install.packages("car")
# load car

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

outlierTest(reg)

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 816 -3.255201          0.0011787      0.98779

This test detected one outlier, which is observation (row) number 816. Let’s look it up:

mlb %>%
  slice(816)

team

city

nickname

league

division

season

home_attend

runs_scored

runs_allowed

wins

losses

games_behind

home_attend_mil

TOR

Toronto

Blue Jays

East

1.98e+03

7.55e+05

329

466

23.5

0.755

The Toronto Blue Jays’ 1981 season is an outlier. Just for kicks, let’s point it out on the scatterplot.

outlier<-mlb %>%
  slice(816)

library(ggrepel)
scatter+ # our scatterplot saved from part c
  geom_point(data = outlier,
             aes(x = home_attend_mil,
                 y = runs_scored),
             color = "red")+
  geom_text_repel(data = outlier,
             aes(x = home_attend_mil,
                 y = runs_scored),
             label = "1981 Blue Jays",
             color = "red")

Part K

What is the marginal effect of home attendance on runs scored? Is this statistically significant? Why or why not?

This is an interpretation question, no need to calculate anything. The marginal effect is : for every 1 additional million fans attending home games over a team’s season, the team scores 69 more runs.

Looking back at the regression output in part c, the -score for the hypothesis test , is 16.45, yielding a very very small -value. We have sufficient evidence to reject in favor of our alternative hypothesis, that there is a relationship between home attendance and runs scored over a season.

Part L

Now we’ll try out the infer package to understand the -value and confidence interval for our observed slope in our regression model. Save the (value of) our sample from your regression in Part D as an object. Then, install and load the infer package, and then calculate the slope³ under the null hypothesis that there is no connection between attendance and runs.⁴ for 1000 additional simulated samples⁵, and save this as an object (it’s a tibble). Then, use this to get_p_value()⁶. Compare to the -value given by lm() and tidy() above.

# save slope from our regression as sample_slope

sample_slope<-reg_tidy %>%
  filter(term=="home_attend_mil") %>%
  pull(estimate)

# double check it worked
sample_slope

## [1] 68.79777

# note I could also just use the beta_1_hat I optionally made in Part D

# make 1000 simulations of sample slopes under null hypothesis that slope = 0

#install.packages(infer)
library(infer) # load infer
slope_simulations<-mlb %>% # save as a tibble
  specify(runs_scored ~ home_attend_mil) %>% # our lm model
  hypothesize(null = "independence") %>% # null hypothesis, slope = 0 (X and Y independent)
  generate(reps = 1000, type = "permute") %>% # make 1000 permutations
  calculate(stat = "slope") # calculate the sample slope of each permutation

# make sure it worked
slope_simulations %>%
  head(., n = 10) # there are a LOT! I only print the first 10 for space

replicate	stat
1	-1.64
2	0.498
3	-1.57
4	1.06
5	5.93
6	1.88
7	7.25
8	8.61
9	4.28
10	9.48

slope_simulations %>%
  get_p_value(obs_stat = sample_slope,
              direction = "both")

p_value

The -value is basically 0. According to the regression output in part D, it is smaller than 0.00000000000000002.

Part M

Make a histogram of the simulated slopes, and plot our sample slope on that histogram, shading the -value.⁷

# now make a histogram of this
slope_simulations %>%
  visualize(obs_stat = sample_slope)+
  shade_p_value(obs_stat = sample_slope, # set the obs_stat equal to our saved slope from above
                direction = "both") # two-sided test, shade both sides

## Warning: `visualize()` should no longer be used to plot a p-value.
## Arguments `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are
## deprecated. Use `shade_p_value()` instead.

# infer no longer shades the p-value, so I'll do it manually

slope_simulations %>%
  ggplot(data = .)+
  aes(x = stat)+ # slope from simulations
  geom_histogram(color = "white", # the histogram
                 fill = "indianred")+
  # add "shading" for p-value's two sides
  # right side
  geom_rect(xmin=68.798,
            xmax=100,
            ymin=0,
            ymax=450,
            fill = "pink",
            alpha=0.4)+
  # left side
  geom_rect(xmin=-100,
            xmax=-68.798,
            ymin=0,
            ymax=450,
            fill = "pink",
            alpha=0.4)+
  # add vertical line for our sample slope
  geom_vline(xintercept = sample_slope,
             color = "red",
             size = 2)+
  # add label
  geom_label(x = sample_slope,
             y = 200,
             color = "red",
             label = expression(paste("Our ", hat(beta[1]))))+
  scale_x_continuous(breaks=c(-75,-50,-25,0,25,50,75),
                     limits=c(-75,75),
                     expand=c(0,0))+
  scale_y_continuous(limits=c(0,450),
                       expand=c(0,0))+
  labs(x = expression(paste("Distribution of ", hat(beta[1]), " under ", H[0], " that ", beta[1]==0)),
       y = "Samples")+
    theme_classic(base_family = "Fira Sans Condensed",
           base_size=20)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

Note this is the sampling distribution of under the null hypothesis (the true . Values on the horizontal axis are values of , not the number of standard deviations away from the null hypothesis.

What the test-statistic does is standardize this distribution much like we would standardize a distribution to the standard normal distribution via calculating -scores.

Let me visualize what would happen (and this is what R does to calculate the t value and associationed p-value.)

slope_simulations %>%
  summarize(mean = mean(stat), # get the mean slope of the simulated distribution of slopes
            se = (sd(stat))) # get the standard error of the slope 

mean	se
0.0517	4.86

# standardize slopes to t-statistics
tstatistics<-slope_simulations %>%
  mutate(tscores = ((stat - mean(stat))/sd(stat)))

our_t<-((sample_slope-mean(slope_simulations$stat))/sd(slope_simulations$stat))

# what is our t-statistic?
our_t

## [1] 14.14881

# now plot t-statistics %>%
ggplot(data = tstatistics)+
  aes(x = tscores)+ # slope from simulations
  geom_histogram(color = "white", # the histogram
                 fill = "indianred")+
  # add vertical line for our sample slope's t-statistic
  geom_vline(xintercept = our_t,
             color = "red",
             size = 2)+
  #add label
  geom_label(x = our_t,
             y = 50,
             color = "red",
             label = "Our t")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is off slightly (see the t-statistic R calculated in part C), but it gives you an idea. Our test-statistic is somewhere around 14-16, this means it is 14-16 standard deviations away from 0, so again, a very very low p-value!

Part N

Get the 95% confidence interval for your slope estimate,⁸ and then make a histogram of the simulated slopes (like part L), but instead, add +shade_confidence_interval().⁹ Compare this to what you get with tidy() above.

# save as a tibble called "ci_values"
ci_values<-slope_simulations %>%
  get_confidence_interval(level = 0.95,
                          type = "se",
                          point_estimate = sample_slope)

# see what we made
ci_values

lower	upper
59.3	78.3

# we'll use this for the endpoints in the shade_ci() command 

slope_simulations %>%
  visualize(obs_stat = sample_slope)+
  shade_confidence_interval(endpoints = ci_values)

## Warning: `visualize()` should no longer be used to plot a p-value.
## Arguments `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are
## deprecated. Use `shade_p_value()` instead.

Note in the classic sense, the -value is actually measuring the probability of a test statistic being at least as extreme as ours. The test statistic essentially standardizes our sample statistic so that it measures standard deviations from the null-hypothesized value (i.e. 0), much like a -score.↩︎
Hint: compare your answer here to your answer in Part D.↩︎
calculate(stat = "slope")↩︎
hypothesize(null = "independence")↩︎
generate(reps = 1000, type = "permute")↩︎
Set obs_stat equal to your observed slope, and set direction = "two_sided"↩︎
You can use ggplot2 to plot a histogram in the normal way and add a geom_vline(), setting xintercept equal to your saved object with the sample value. Alternatively, you can use infer to pipe your tibble of simulations into visualize(), and inside visualize() set obs_stat equal to your saved object. Regardless of which method you use, add +shade_p_value(). Inside this, set obs_stat equal to your saved slope, and add direction = "two_sided".↩︎
tidy() your original regression, with conf.int = TRUE inside the command, then select(conf.low, conf.high) and filter by your x variable. Save this as an object.↩︎
Inside of this, set endpoints equal to the object you just made with the low and high confidence interval values.↩︎

Problem Set 3 (Answers)

Ryan Safner

ECON 480 - Fall 2019

Theory and Concepts

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Theory Problems

Question 9

Part A

Part B

Part C

Part D

Part E

Part F

Part G

Part H

Question 10

Part A

Part B

Part C

Part D

Part E

R Questions

Question 11

Part A

Part B

Part C

Part D

Part E

Part F

Part G

Part H

Part I

Part J

Part K

Part L

Part M

Part N