2.2: Random Variables and Distributions - Class Notes
Thursday, September 19, 2019
Overview
Today we finish your crash course/review of basic statistics with random variables and distributions.
Slides
Assignments: Problem Set 1 DUE and Problem Set 2
Problem Set 1 is DUE
Problem Set 2 (on classes 2.1-2.2) is due by Thursday September 26.
Appendix: Graphing Statistical and Mathematical Functions in R
The mosaic package is useful for making and using mathematical functions in R.
Creating Mathematical Functions
You can create custom mathematical functions by defining an R function() with one or more arguments. As a simple example, make the function \(f(x) = 10x-x^2\) as follows:
# store as a named function, I'll call it "my_function"
my_function<-function(x){10*x-x^2}
# look at it
my_function
## function(x){10*x-x^2}
There are some notational requirements from R for making functions. Any coefficient in front of a variable (such as the 10 in 10x) must be explicitly multiplied by the variable, as in 10*x.
To use the function, simply define what the input (x) is and run your named function on it:
## [1] 16
## [1] 16 24
## [1] 16 21 24 25 24 21
## [1] 16 24
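The calls that produced the outputs above are not shown in these notes. As a sketch, the first three outputs are consistent with inputs like the following (the specific inputs are an assumption, chosen because they reproduce the printed values):

```r
# redefine the function from above so this block runs on its own
my_function <- function(x){10*x - x^2}

# a single input
my_function(2)

# a vector of inputs: the function is applied to each element
my_function(c(2, 4))

# a sequence of inputs
my_function(2:7)
```

Because the arithmetic operators in R are vectorized, the same function works on one number or on many at once.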
Graphing Mathematical Functions
In ggplot there is a dedicated stat_function() (equivalent to a geom_ layer) to graph mathematical and statistical functions. All that is needed is a data.frame with a range of x values to act as the source for data, with x mapped to those values in the aesthetics (aes).
Then we add the stat_function() layer, where fun = is the most important argument: it names the function to graph, for example the my_function created above.
You can also adjust things like size, color, and line type.
library(ggplot2) # provides ggplot(), aes(), and stat_function()
ggplot(data = data.frame(x = 1:10))+
aes(x = x)+
stat_function(fun = my_function, color = "blue", size = 2, linetype = "dashed")
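To sanity-check what the graph should look like, it can help to tabulate the function over the same x range. This is a minimal sketch in base R; the name curve_points is just an illustration. Note that stat_function() itself evaluates the function at many evenly spaced points across the x range (101 by default), not only at the integer values in the data.frame.

```r
# redefine the function from above so this block runs on its own
my_function <- function(x){10*x - x^2}

# tabulate the function at the integer x values used for the plot
curve_points <- data.frame(x = 1:10)
curve_points$y <- my_function(curve_points$x)
curve_points

# the row with the largest y value: the peak of the parabola
curve_points[which.max(curve_points$y), ]
```

The peak is at x = 5, where the function equals 25, and the curve returns to 0 at x = 10.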
Built-in Statistical Functions
There are some standard statistical distributions built into R. They require a combination of a specific prefix and a distribution.
Prefixes:
Action/Type | Prefix |
---|---|
random draw | r |
density (pdf) | d |
cumulative distribution (cdf) | p |
quantile (inverse cdf) | q |
Distributions:
Distribution | Name in R |
---|---|
Normal | norm |
Uniform | unif |
Student’s t | t |
Binomial | binom |
Negative binomial | nbinom |
Hypergeometric | hyper |
Weibull | weibull |
Beta | beta |
Gamma | gamma |
Thus, the function you want is the prefix followed by the distribution name: for example, r plus norm gives rnorm().
Some common examples:
- Take random draws from a normal distribution:
rnorm(n = 10, # take 10 draws from a normal distribution with:
mean = 2, # mean of 2
sd = 1) # sd of 1
## [1] 2.7245133 0.6634132 2.4913083 1.6882600 2.4278137 1.7143854 3.8300644
## [8] 1.6628005 2.1862470 3.0243747
- Get probability of a random variable being less than or equal to a value (cdf) from a normal distribution:
# find probability of area to the LEFT of a number on pdf (note this = cdf of that number!)
pnorm(q = 80, # number is 80 from a distribution where:
mean = 200, # mean is 200
sd = 100, # sd is 100
lower.tail = TRUE) # looking to the LEFT in lower tail
## [1] 0.1150697
- Find the value of a distribution at a specified percentile (quantile):
# find the 38th percentile value
qnorm(p = 0.38, # 38th percentile from a distribution where:
mean = 200, # mean is 200
sd = 100) # sd is 100
## [1] 169.4519
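The same prefix-plus-name pattern works for the other distributions in the table. The following is a sketch with illustrative parameter values (none of these numbers come from the notes, apart from the normal distribution with mean 200 and sd 100 used above):

```r
# d + binom: probability of exactly 3 successes in 10 trials with p = 0.5
dbinom(x = 3, size = 10, prob = 0.5)   # 0.1171875

# p + unif: probability a Uniform(0, 1) draw is less than or equal to 0.25
punif(q = 0.25, min = 0, max = 1)      # 0.25

# q + t: the median (50th percentile) of a Student's t with 10 degrees of freedom
qt(p = 0.5, df = 10)

# r + unif: 5 random draws between 0 and 10 (set.seed makes them repeatable)
set.seed(42)
runif(n = 5, min = 0, max = 10)

# lower.tail = FALSE gives the area to the RIGHT instead of the left
pnorm(q = 80, mean = 200, sd = 100, lower.tail = FALSE)

# q is the inverse of p: feeding a cdf value back into qnorm returns the original number
qnorm(p = pnorm(q = 80, mean = 200, sd = 100), mean = 200, sd = 100)
```

The upper-tail probability is one minus the lower-tail value of 0.1150697 found earlier, and the last line returns 80.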
Graphing Statistical Functions
You can also graph these commonly used statistical functions by setting fun = to the named function in your stat_function() layer. If you want to specify the mean and standard deviation, use args = list() to pass the arguments that the named function takes (e.g. dnorm takes mean and sd).
ggplot(data = data.frame(x = -400:600))+
aes(x = x)+
stat_function(fun = dnorm, args = list(mean = 200, sd = 200), color = "blue", size = 2, linetype = "dashed")
If you leave out args, dnorm uses its defaults (mean = 0, sd = 1), so the graph shows the standard normal distribution:
ggplot(data = data.frame(x = -4:4))+
aes(x = x)+
stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")
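The args = list() mechanism is not specific to dnorm. As a sketch, here is the standard normal overlaid with a Student's t distribution, whose density function dt needs a df argument (the choice of 3 degrees of freedom and the name t_vs_normal are only for illustration):

```r
library(ggplot2)

# standard normal (dnorm defaults) in blue, Student's t with 3 df in dashed red
t_vs_normal <- ggplot(data = data.frame(x = -4:4))+
  aes(x = x)+
  stat_function(fun = dnorm, color = "blue")+
  stat_function(fun = dt, args = list(df = 3), color = "red", linetype = "dashed")

# print the plot
t_vs_normal
```

The t curve should sit lower at the center and higher in the tails than the normal curve.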
To add shading under a distribution, simply add a duplicate of the stat_function() layer, but add geom="area" to indicate the area beneath the function should be filled. You can limit the domain of the fill with xlim=c(start,end), where start and end are the x-values for the endpoints of the fill.
# graph normal distribution and shade area between -2 and 2
ggplot(data = data.frame(x = -4:4))+
aes(x = x)+
stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")+
stat_function(fun = dnorm, xlim = c(-2,2), geom = "area", fill = "green", alpha=0.5)
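Shading connects directly to the p prefix: the shaded area under a density is a probability. As a sketch, this shades the lower tail to the left of -1.2 on the standard normal and then computes that area with pnorm. The cutoff -1.2 is the z-score of 80 in the earlier example with mean 200 and sd 100; the name lower_tail_plot is only for illustration.

```r
library(ggplot2)

# shade everything to the LEFT of -1.2 under the standard normal
lower_tail_plot <- ggplot(data = data.frame(x = -4:4))+
  aes(x = x)+
  stat_function(fun = dnorm, color = "blue")+
  stat_function(fun = dnorm, xlim = c(-4, -1.2), geom = "area", fill = "green", alpha = 0.5)

# print the plot
lower_tail_plot

# the size of the shaded area (same value as the pnorm example above)
pnorm(q = -1.2)
```

The fill starts at -4 only because that is the left edge of the plot; the tail technically extends further, but the area beyond -4 is negligible.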
Hence, here is one graph from my slides:
# tibble() comes from the tibble package (loaded as part of the tidyverse)
ggplot(data = tibble(x=35:115))+
aes(x = x)+
stat_function(fun = dnorm, args = list(mean = 75, sd = 10), size = 2, color="blue")+
stat_function(fun = dnorm, args = list(mean = 75, sd = 10), geom = "area", xlim = c(65,85), fill="blue", alpha=0.5)+
labs(x = "X",
y = "Probability")+
scale_x_continuous(breaks = seq(35,115,5))+
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)
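The shaded region in this graph runs from 65 to 85, which is one standard deviation on either side of the mean. A quick sketch of the probability it represents, using pnorm (the name shaded_prob is only for illustration):

```r
# P(65 <= X <= 85) when X is normal with mean 75 and sd 10:
# area to the left of 85 minus area to the left of 65
shaded_prob <- pnorm(q = 85, mean = 75, sd = 10) - pnorm(q = 65, mean = 75, sd = 10)
shaded_prob
```

The result is about 0.68, the familiar share of a normal distribution that lies within one standard deviation of the mean.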