2.2: Random Variables and Distributions - Class Notes

Contents

Thursday, September 19, 2019

Overview

Today we finish your crash course/review of basic statistics with random variables and distributions.

Slides

Assignments: Problem Set 1 DUE and Problem Set 2

Problem Set 1 is DUE

Problem Set 2 (on classes 2.1-2.2) is due by Thursday September 26.

Appendix: Graphing Statistical and Mathematical Functions in R

The mosaic package is useful for making and using mathematical functions in R.

# install.packages("mosaic") install for first use
library(mosaic) # load 

Creating Mathematical Functions

You can create custom mathematical functions using mosaic by defining an R function() with multiple arguments. As a simple example, make the function \(f(x) = 10x-x^2\) as follows:

# store as a named function, I'll call it "my_function"
my_function<-function(x){10*x-x^2}

# look at it
my_function
## function(x){10*x-x^2}

There are some notational requirements from R for making functions. Any coefficient in front of a variable (such as the 10 in 10x must be explicitly multiplied by the variable, as in 10*x).

To use the function, simply define what the input (x) is and run your named function on it:

# f of 2 
my_function(2)
## [1] 16
# f of 2 and 4
my_function(c(2,4))
## [1] 16 24
# f of 2 through 7
my_function(2:7)
## [1] 16 21 24 25 24 21
# ALTERNATIVELY
# define x first and then run function on it

x<-c(2,4)
my_function(x)
## [1] 16 24

Graphing Mathematical Functions

In ggplot there is a dedicated stat_function() (equivalent to a geom_ layer) to graph mathematical and statistical functions. All that is needed is a data.frame of a range of x values to act as the source for data, and set x equal to those values for aesthetics.

# x values are integers 1 through 10
ggplot(data = data.frame(x = 1:10))+
  aes(x = x)

Then we add the stat_function, where fun = is the most important argument where you define the to function to graph as your function created above, for example, our my_function.

ggplot(data = data.frame(x = 1:10))+
  aes(x = x)+
  stat_function(fun = my_function) 

You can also adjust things like size, color, and line type.

ggplot(data = data.frame(x = 1:10))+
  aes(x = x)+
  stat_function(fun = my_function, color = "blue", size = 2, linetype = "dashed") 

Bultin Statistical Functions

There are some standard statistical distributions built into R. They require a combination of a specific prefix and a distribution.

Prefixes:

Action/Type Prefix
random draw r
density (pdf) d
cumulative density (cdf) p
quantile (inverse cdf) q

Distributions:

Distribution Name in R
Normal norm
Uniform unif
Student’s t t
Binomial binom
Negative binomial nbinom
Hypergeometric hyper
Weibull weibull
Beta beta
Gamma gamma

Thus, what you want is a combination of the prefix and the distribution.

Some common examples:

  1. Take random draws from a normal distribution:
rnorm(n = 10, # take 10 draws from a normal distribution with:
      mean = 2, # mean of 2
      sd = 1) # sd of 1
##  [1] 2.7245133 0.6634132 2.4913083 1.6882600 2.4278137 1.7143854 3.8300644
##  [8] 1.6628005 2.1862470 3.0243747
  1. Get probability of a random variable being less than or equal to a value (cdf) from a normal distribution:
# find probability of area to the LEFT of a number on pdf (note this = cdf of that number!)
pnorm(q = 80, # number is 80 from a distribution where: 
      mean = 200, # mean is 100
      sd = 100, # sd is 100
      lower.tail = TRUE) # looking to the LEFT in lower tail
## [1] 0.1150697
  1. Find the value of a distribution that is a specified percentile.
# find the 38th percentile value
qnorm(p = 0.38, # 38th percentile from a distribution where:
      mean = 200, # mean is 200
      sd = 100) # sd is 100
## [1] 169.4519

Graphing Statistical Functions

You can also graph these commonly used statistical functions by setting fun = the named functions in your stat_function() layer. If you want to specify the mean and standard deviation, use args = list() to include the required arguments from the named function above (e.g. dnorm needs mean and sd).

ggplot(data = data.frame(x = -400:600))+
  aes(x = x)+
  stat_function(fun = dnorm, args = list(mean = 200, sd = 200), color = "blue", size = 2, linetype = "dashed") 

If you don’t include this, it will graph the standard distribution:

ggplot(data = data.frame(x = -4:4))+
  aes(x = x)+
  stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed") 

To add shading under a distribution, simply add a duplicate of the stat_function() layer, but add geom="area" to indicate the area beneath the function should be filled, and you can limit the domain of the fill with xlim=c(start,end), where start and end are the x-values for the endpoints of the fill.

# graph normal distribution and shade area between -2 and 2
ggplot(data = data.frame(x = -4:4))+
  aes(x = x)+
  stat_function(fun = dnorm, color = "blue", size = 2, linetype = "dashed")+
  stat_function(fun = dnorm, xlim = c(-2,2), geom = "area", fill = "green", alpha=0.5)

Hence, here is one graph from my slides:

ggplot(data = tibble(x=35:115))+
  aes(x = x)+
  stat_function(fun = dnorm, args = list(mean = 75, sd = 10), size = 2, color="blue")+
  stat_function(fun = dnorm, args = list(mean = 75, sd = 10), geom = "area", xlim = c(65,85), fill="blue", alpha=0.5)+
  labs(x = "X",
       y = "Probability")+
  scale_x_continuous(breaks = seq(35,115,5))+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=20)