You go into data analysis with the tools you know, not the tools you need
The next 2-3 weeks are all about giving you the tools you need
We will extend them as we learn specific models
Free and open source
A very large community
Can handle virtually any data format
Makes replication easy
Can integrate into documents (with R Markdown)
R is a language so it can do everything
library("gapminder")
library("ggplot2") # provides ggplot() and the geoms used below
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10(breaks = c(1000, 10000, 100000), labels = scales::dollar) +
  labs(x = "GDP/Capita", y = "Life Expectancy (Years)") +
  facet_wrap(~continent) +
  guides(color = "none") + # hide the redundant color legend
  theme_light()
library(gapminder)
The average GDP per capita is $`r round(mean(gapminder$gdpPercap),2)` with a standard deviation of $`r round(sd(gapminder$gdpPercap),2)`.
The average GDP per capita is $7215.33 with a standard deviation of $9857.45.
R is the programming language that executes commands
R Studio is an integrated development environment (IDE) that makes your coding life a lot easier
R Studio
R is like your car's engine, R Studio is the dashboard
You will do everything in R Studio and never open the R program itself
R itself is just a command language
The R app is basically just a command line; you can even just use your computer's command line!2
1 The (free) Desktop version of R Studio.
2 "Command Prompt" on Windows, "Terminal" on Unix (Mac and Linux). Type R and hit enter, and you can now execute R commands.
R Studio has 4 window panes:
CTRL+SHIFT+[number] will maximize a pane. Type again to see all four.
1 May not be immediately visible until you create new files.
You don't "learn R", you learn how to do things in R
In order to learn this, you need to learn how to search for what you want to do
My #rstats learning path:
1. Install R
2. Install RStudio
3. Google "How do I [THING I WANT TO DO] in R?"
Repeat step 3 ad infinitum.
- Jesse Mostipak (@kierisi), August 18, 2017
Type individual commands into the console window
Great for testing individual commands to see what happens
Not saved! Not reproducible! Not recommended!
2+2
## [1] 4
summary(mpg$hwy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12.00   18.00   24.00   23.44   27.00   44.00
ggplot(mpg, aes(x=displ, y=hwy))+geom_point(aes(color=class))+geom_smooth()
Source pane is a text-editor
Make .R files: all input commands in a single script
Comment with #
Can run any or all of script at once
Can save, reproduce, and send to others!
2+2 # just testing!
head(mpg) # look at mpg data
# create a plot
ggplot(mpg, aes(x=displ, y=hwy))+geom_point(aes(color=class))+geom_smooth()
For a later lecture: R Markdown, a simple markup language to write documents in
Can integrate text, R code, figures, citations and bibliographies into a single plain-text file1 and then output into a variety of formats: PDF, webpage, slides, Word doc, etc.
1 OK, to be fair, citations require one additional file!
Practicing typing at the Command line/Console
Learning different commands and objects relevant for analysis
Saving and running .R scripts
Later: R Markdown, literate programming, workflow management
Today may seem a bit overwhelming
R assumes a default (often inconvenient) "working directory" on your computer
This is the first place R looks to open or save files
Find out where this is with getwd()
Change it with setwd("path/to/folder")1
Soon I'll show you better ways where you won't ever have to worry about this
1 Note the path will be OS-specific. For Windows it might be C:/Documents/. For Mac it is often your username folder.
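As a minimal sketch of the getwd() and setwd() round trip described above: the scratch folder from tempdir() is only a stand-in for a real project folder (an assumption made so the example runs on any machine), and the names old_wd and scratch are illustrative.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()
print(old_wd)

# tempdir() returns a scratch folder that exists on every system;
# it stands in for a real project folder here (assumption of this sketch)
scratch <- tempdir()

# setwd() needs the path as a character string (quoted, or stored in a variable)
setwd(scratch)

# Confirm that the working directory changed
print(getwd())

# Return to the original working directory
setwd(old_wd)
```

Restoring the original directory at the end keeps the example from leaving your session pointed at a temporary folder.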
Hadley Wickham
Chief Scientist, R Studio
"There’s an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters." - R for Data Science, Ch. 4
Type help(function_name) or ?(function_name) to get documentation on a function
From Kieran Healy's excellent (free online!) book on Data Visualization.
# starts a comment; R will ignore everything on the rest of that line
# Run regression of y on x, save as reg1
reg1 <- lm(y ~ x, data = data) # runs regression
summary(reg1)$coefficients # prints coefficient table
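The commented snippet above uses placeholder names (y, x, data) that do not exist in a fresh session. A runnable version might look like the following sketch, which swaps in the built-in mtcars data and the variables mpg and wt purely for illustration; they are not the course's example.

```r
# Regress miles per gallon on weight, save as reg1
reg1 <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept and slope, as a named numeric vector
reg1_coefs <- coef(reg1)
print(reg1_coefs)

# Full coefficient table: estimates, standard errors, t values, p values
print(summary(reg1)$coefficients)
```

Note the difference between summary(reg1)$coefficients, which is the coefficient table, and summary(reg1$coefficients), which would instead give a five-number summary of the coefficient values themselves.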
I follow this style guide (you are not required to)1
Naming objects and files will become important2
"my webpage in html" turned into http://my%20webpage%20in%20html.html
i_use_underscores
some.people.use.snake.case
othersUseCamelCase
1 Also described in today's course notes page and the course reference page.
2 Consider your folders on your computer as well...
You'll have to get used to the fact that you are coding in commands to execute
Start with the easiest: simple math operators and calculations:
> 2+2
## [1] 4
R will prompt you for input with > and give you output starting with ## [1]
2^3
## [1] 8
sqrt(25)
## [1] 5
log(6)
## [1] 1.791759
pi/2
## [1] 1.570796
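One detail worth knowing about the calculations above: log() in R is the natural log. A small sketch of the related functions follows; the variable names are illustrative only.

```r
# log() is the natural logarithm (base e)
natural_log <- log(6)
print(natural_log)

# Other bases: log10() for base 10, or the base argument for any base
base_ten <- log10(1000)
base_two <- log(8, base = 2)
print(base_ten)
print(base_two)

# exp() undoes the natural log
round_trip <- exp(natural_log)
print(round_trip)

# Integer division and remainder
whole_part <- 7 %/% 2
remainder <- 7 %% 2
print(c(whole_part, remainder))
```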
Load a package with library(): library("package_name")
Install a package with install.packages(): install.packages("package_name")
tidyverse: really a meta-package combining the following packages (among others)
dplyr: used for better data-wrangling
ggplot2: used for fancy plotting
huxtable: used for automatically producing regression tables
Creating objects: assign values with <-
Running functions on objects: function_name(object_name)
# make an object
my_object <- c(1,2,3,4,5)
# look at it
my_object
## [1] 1 2 3 4 5
# find the sum
sum(my_object)
## [1] 15
# find the mean
mean(my_object)
## [1] 3
Functions have "arguments," the input(s)
Some functions may have multiple inputs
The argument of a function can be another function!
# find the sd
sd(my_object)
## [1] 1.581139
# round everything in my object to two decimals
round(my_object,2)
## [1] 1 2 3 4 5
# round the sd to two decimals
round(sd(my_object),2)
## [1] 1.58
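Arguments can be matched by position, as in the calls above, or by name. The sketch below shows both for round(), whose arguments are named x and digits; the result names are illustrative.

```r
my_object <- c(1, 2, 3, 4, 5)

# Arguments matched by position: first the number, then the digits
sd_by_position <- round(sd(my_object), 2)

# The same call with the arguments spelled out by name
sd_by_name <- round(x = sd(my_object), digits = 2)

# Named arguments can be given in any order
sd_reordered <- round(digits = 2, x = sd(my_object))

print(c(sd_by_position, sd_by_name, sd_reordered))
```

Naming arguments is optional, but it tends to make longer calls easier to read and protects against putting inputs in the wrong order.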
Numeric objects are just numbers1
Can be mathematically manipulated
x <- 2
y <- 3
x+y
## [1] 5
x*y
## [1] 6
1 R may store these as integer, or as double if there are decimal values.
Character objects are "strings" of text held inside quote marks
Can contain spaces, so long as contained within quote marks
name <- "Ryan Safner"
address <- "401 Rosemont Ave."
name
## [1] "Ryan Safner"
address
## [1] "401 Rosemont Ave."
Logical objects are binary TRUE or FALSE indicators
>, <: greater than, less than
>=, <=: greater than or equal to, less than or equal to
==, !=: is equal to, is not equal to1
%in%: is a member of the set of ($\in$)
&: "AND"
|: "OR"
z <- 10 # set z equal to 10
z==10 # is z equal to 10?
## [1] TRUE
"red"=="blue" # is red equal to blue?
## [1] FALSE
z > 1 & z < 12 # is z > 1 AND < 12?
## [1] TRUE
z <= 1 | z==10 # is z <= 1 OR equal to 10?
## [1] TRUE
1 One = assigns a value (like <-). Two == evaluate a conditional statement.
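The set-membership operator %in% is listed above but not demonstrated. A short sketch, with illustrative variable names:

```r
z <- 10

# Is z a member of the set {2, 4, 10}?
z_in_set <- z %in% c(2, 4, 10)
print(z_in_set)

# Is z a member of the set {1, 3, 5}?
z_in_odd <- z %in% c(1, 3, 5)
print(z_in_odd)

# %in% checks each element on the left against the set on the right
colors_found <- c("red", "blue") %in% c("blue", "green")
print(colors_found)

# Negate with ! to ask "is NOT a member of"
z_not_in_odd <- !(z %in% c(1, 3, 5))
print(z_not_in_odd)
```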
Factor objects contain categorical data - membership in mutually exclusive groups
Look like strings, behave more like logicals, but with more than two options
##  [1] junior    sophomore sophomore senior    sophomore sophomore junior
##  [8] junior    freshman  junior
## Levels: freshman sophomore junior senior
##  [1] junior    sophomore sophomore senior    sophomore sophomore junior
##  [8] junior    freshman  junior
## Levels: freshman < sophomore < junior < senior
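The code that produced the two printouts above is not shown. One way such factors could be built is sketched below; the object names (standing, year_levels, and so on) are hypothetical, and the values are simply copied from the printed output.

```r
# Class standing for ten students, stored as plain text
standing <- c("junior", "sophomore", "sophomore", "senior", "sophomore",
              "sophomore", "junior", "junior", "freshman", "junior")

# The categories, listed in their natural order
year_levels <- c("freshman", "sophomore", "junior", "senior")

# Unordered factor: categories only
standing_factor <- factor(standing, levels = year_levels)
print(standing_factor)

# Ordered factor: categories with a ranking (printed with <)
standing_ordered <- factor(standing, levels = year_levels, ordered = TRUE)
print(standing_ordered)

# Comparisons like > are only meaningful for ordered factors
above_sophomore <- standing_ordered > "sophomore"
print(sum(above_sophomore))

# Frequency table of each category
print(table(standing_factor))
```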
Vector: the simplest type of object, just a collection of objects
Make a vector using the combine c() function
# create a vector called vec
vec <- c(1, "orange", 83.5, pi)
# look at vec
vec
## [1] "1"                "orange"           "83.5"
## [4] "3.14159265358979"
Data frame: what we'll be using almost always
Think like a "spreadsheet"
Each column is a vector (variable)
Each row is an observation (pair of values for all variables)
library("ggplot2")
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
Dataframes are really just combinations of (column) vectors
You can make data frames by combining named vectors with data.frame() or creating each column/vector in each argument
# make two vectors
fruits <- c("apple","orange","pear","kiwi","pineapple")
numbers <- c(3.3,2.0,6.1,7.5,4.2)
# combine into dataframe
df <- data.frame(fruits,numbers)
# do it all in one step (note the = instead of <-)
df <- data.frame(fruits=c("apple","orange","pear","kiwi","pineapple"),
                 numbers=c(3.3,2.0,6.1,7.5,4.2))
# look at it
df
##      fruits numbers
## 1     apple     3.3
## 2    orange     2.0
## 3      pear     6.1
## 4      kiwi     7.5
## 5 pineapple     4.2
Assign values to an object with <-
my_vector <- c(1,2,3,4,5)
R will not give any output after an assignment
View an object (and list its contents) by typing its name
my_vector
## [1] 1 2 3 4 5
my_vector <- c(2,7,9,1,5) # assigning to an existing name overwrites it
my_vector
## [1] 2 7 9 1 5
Check what class an object is with class()
class("six")
## [1] "character"
class(6)
## [1] "numeric"
Ask whether an object is a particular class with the is.() functions, e.g. is.numeric(), is.character()
is.numeric("six")
## [1] FALSE
is.character("six")
## [1] TRUE
Convert an object to another class with as.object_class(): as.character(), as.numeric(), etc!
as.character(6)
## [1] "6"
as.numeric("six")
## [1] NA
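The NA above comes with a warning, because R cannot interpret the text "six" as a number. A short sketch of when conversion works and one way to handle the failures; survey_answers is a made-up vector for illustration.

```r
# Text that looks like a number converts cleanly
six <- as.numeric("6")
print(six)

# Text that does not look like a number becomes NA (with a warning,
# silenced here by suppressWarnings() so the example runs quietly)
not_a_number <- suppressWarnings(as.numeric("six"))
print(is.na(not_a_number))

# Hypothetical messy input: numbers stored as text, one unusable entry
survey_answers <- c("3", "4.5", "unknown")
clean_answers <- suppressWarnings(as.numeric(survey_answers))
print(clean_answers)

# Most summary functions can skip NA values with na.rm = TRUE
answer_mean <- mean(clean_answers, na.rm = TRUE)
print(answer_mean)
```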
mixed_vector <- c(pi, 12, "apple", 6.32)
class(mixed_vector)
## [1] "character"
mixed_vector
## [1] "3.14159265358979" "12"               "apple"
## [4] "6.32"
df
##      fruits numbers
## 1     apple     3.3
## 2    orange     2.0
## 3      pear     6.1
## 4      kiwi     7.5
## 5 pineapple     4.2
class(df$fruits)
## [1] "factor"
class(df$numbers)
## [1] "numeric"
1 Remember, each column in a data frame is a vector!
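If you run the class(df$fruits) check yourself, the answer may differ from the "factor" shown above: since R 4.0.0, data.frame() keeps text columns as character by default, whereas earlier versions converted them to factors. The sketch below makes the choice explicit with the stringsAsFactors argument; df_factor and df_text are illustrative names.

```r
fruits <- c("apple", "orange", "pear", "kiwi", "pineapple")
numbers <- c(3.3, 2.0, 6.1, 7.5, 4.2)

# Ask explicitly for text columns to become factors
df_factor <- data.frame(fruits, numbers, stringsAsFactors = TRUE)
print(class(df_factor$fruits))

# Ask explicitly for text columns to stay as character
df_text <- data.frame(fruits, numbers, stringsAsFactors = FALSE)
print(class(df_text$fruits))

# The numeric column is unaffected either way
print(class(df_factor$numbers))
```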
Use the str() command to view an object's structure
class(df)
## [1] "data.frame"
str(df)
## 'data.frame':    5 obs. of  2 variables:
##  $ fruits : Factor w/ 5 levels "apple","kiwi",..: 1 3 4 2 5
##  $ numbers: num  3.3 2 6.1 7.5 4.2
View the first few (n) rows with head()
head(df)
##      fruits numbers
## 1     apple     3.3
## 2    orange     2.0
## 3      pear     6.1
## 4      kiwi     7.5
## 5 pineapple     4.2
head(df, n=2)
##   fruits numbers
## 1  apple     3.3
## 2 orange     2.0
Get summary statistics1 with summary()
summary(df)
##        fruits     numbers
##  apple    :1   Min.   :2.00
##  kiwi     :1   1st Qu.:3.30
##  orange   :1   Median :4.20
##  pear     :1   Mean   :4.62
##  pineapple:1   3rd Qu.:6.10
##                Max.   :7.50
1 For numeric data only; a frequency table is displayed for character or factor data
data.frame objects can be viewed in their own panel by clicking on the name of the object
my_vector <- c(2,4,5,10)
my_vector+4 # add 4 to all elements
## [1] 6 8 9 14
my_vector^2 # square all elements
## [1] 4 16 25 100
length(my_vector) # how many elements
## [1] 4
sum(my_vector) # add all elements
## [1] 21
max(my_vector) # find largest element
## [1] 10
min(my_vector) # find smallest element
## [1] 2
mean(my_vector) # mean of all elements
## [1] 5.25
median(my_vector) # median of all elements
## [1] 4.5
sd(my_vector) # standard deviation
## [1] 3.40343
If a command is incomplete (e.g. an unclosed parenthesis), R shows a + sign waiting for you to finish the command
> 2+(2*3
+
Either finish the command, here by typing ), or hit Esc to cancel
mtcars
##     mpg cyl  disp  hp drat    wt  qsec
## 1  21.0   6 160.0 110 3.90 2.620 16.46
## 2  21.0   6 160.0 110 3.90 2.875 17.02
## 3  22.8   4 108.0  93 3.85 2.320 18.61
## 4  21.4   6 258.0 110 3.08 3.215 19.44
## 5  18.7   8 360.0 175 3.15 3.440 17.02
## 6  18.1   6 225.0 105 2.76 3.460 20.22
## 7  14.3   8 360.0 245 3.21 3.570 15.84
## 8  24.4   4 146.7  62 3.69 3.190 20.00
## 9  22.8   4 140.8  95 3.92 3.150 22.90
## 10 19.2   6 167.6 123 3.92 3.440 18.30
## 11 17.8   6 167.6 123 3.92 3.440 18.90
## 12 16.4   8 275.8 180 3.07 4.070 17.40
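Subset a data frame with df[r,c]: r indexes rows, c indexes columns
Leaving r or c blank selects all rows or columns
Select multiple values with c()1
Select a range of values with :
Can subset by both r and c!
1 You can also "negate" values, selecting everything except for values with a - in front of them.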
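The negation mentioned in the footnote is not demonstrated on the slides. A brief sketch follows, using R's built-in mtcars data, which has 32 rows and 11 columns (more than the 12-row, 7-column excerpt printed on these slides); the result names are illustrative.

```r
# Everything except the first row
all_but_first_row <- mtcars[-1, ]
print(dim(all_but_first_row))

# Everything except the first two columns
without_first_two_columns <- mtcars[, -c(1, 2)]
print(names(without_first_two_columns))

# Everything except rows 1 through 3 (note the parentheses around the range)
without_rows_1_to_3 <- mtcars[-(1:3), ]
print(dim(without_rows_1_to_3))
```

The parentheses in -(1:3) matter: -1:3 would instead mean the range from -1 to 3, which mixes negative and positive indices and causes an error.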
mtcars[1,] # first row
##   mpg cyl disp  hp drat   wt  qsec
## 1  21   6  160 110  3.9 2.62 16.46
mtcars[c(1,3,4),] # first, third, and fourth rows
##    mpg cyl disp  hp drat    wt  qsec
## 1 21.0   6  160 110 3.90 2.620 16.46
## 3 22.8   4  108  93 3.85 2.320 18.61
## 4 21.4   6  258 110 3.08 3.215 19.44
mtcars[1:3,] # first three rows
##    mpg cyl disp  hp drat    wt  qsec
## 1 21.0   6  160 110 3.90 2.620 16.46
## 2 21.0   6  160 110 3.90 2.875 17.02
## 3 22.8   4  108  93 3.85 2.320 18.61
mtcars[,6] # select column 6
##  [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
## [12] 4.070
mtcars[,2:4] # select columns 2 through 4
##    cyl  disp  hp
## 1    6 160.0 110
## 2    6 160.0 110
## 3    4 108.0  93
## 4    6 258.0 110
## 5    8 360.0 175
## 6    6 225.0 105
## 7    8 360.0 245
## 8    4 146.7  62
## 9    4 140.8  95
## 10   6 167.6 123
## 11   6 167.6 123
## 12   8 275.8 180
[[]] selects a column by position
mtcars[[6]] # same thing
##  [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
## [12] 4.070
$ selects a column by name
mtcars$wt
##  [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
## [12] 4.070
mtcars[mtcars$wt>4,] # select obs with wt>4
##     mpg cyl  disp  hp drat   wt qsec
## 12 16.4   8 275.8 180 3.07 4.07 17.4
mtcars[mtcars$cyl==6,] # select obs with exactly 6 cyl
##     mpg cyl  disp  hp drat    wt  qsec
## 1  21.0   6 160.0 110 3.90 2.620 16.46
## 2  21.0   6 160.0 110 3.90 2.875 17.02
## 4  21.4   6 258.0 110 3.08 3.215 19.44
## 6  18.1   6 225.0 105 2.76 3.460 20.22
## 10 19.2   6 167.6 123 3.92 3.440 18.30
## 11 17.8   6 167.6 123 3.92 3.440 18.90
mtcars[mtcars$wt<4 & mtcars$wt>2,] # select obs where 2<wt<4
##     mpg cyl  disp  hp drat    wt  qsec
## 1  21.0   6 160.0 110 3.90 2.620 16.46
## 2  21.0   6 160.0 110 3.90 2.875 17.02
## 3  22.8   4 108.0  93 3.85 2.320 18.61
## 4  21.4   6 258.0 110 3.08 3.215 19.44
## 5  18.7   8 360.0 175 3.15 3.440 17.02
## 6  18.1   6 225.0 105 2.76 3.460 20.22
## 7  14.3   8 360.0 245 3.21 3.570 15.84
## 8  24.4   4 146.7  62 3.69 3.190 20.00
## 9  22.8   4 140.8  95 3.92 3.150 22.90
## 10 19.2   6 167.6 123 3.92 3.440 18.30
## 11 17.8   6 167.6 123 3.92 3.440 18.90
mtcars[mtcars$cyl==4 | mtcars$cyl==6,] # select obs with 4 OR 6 cyl
##     mpg cyl  disp  hp drat    wt  qsec
## 1  21.0   6 160.0 110 3.90 2.620 16.46
## 2  21.0   6 160.0 110 3.90 2.875 17.02
## 3  22.8   4 108.0  93 3.85 2.320 18.61
## 4  21.4   6 258.0 110 3.08 3.215 19.44
## 6  18.1   6 225.0 105 2.76 3.460 20.22
## 8  24.4   4 146.7  62 3.69 3.190 20.00
## 9  22.8   4 140.8  95 3.92 3.150 22.90
## 10 19.2   6 167.6 123 3.92 3.440 18.30
## 11 17.8   6 167.6 123 3.92 3.440 18.90
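The conditions above all sit in the row position. They can be combined with a column selection in the same pair of brackets, and the "4 OR 6" test can be written more compactly with %in%. This is a sketch on R's full built-in mtcars (32 rows), so the row counts differ from the 12-row excerpt printed on the slides; selecting columns by name with a character vector is standard R but is not shown elsewhere in this deck.

```r
# Rows with exactly 6 cylinders, keeping only the mpg and wt columns
six_cyl <- mtcars[mtcars$cyl == 6, c("mpg", "wt")]
print(six_cyl)

# Rows with 4 OR 6 cylinders, written with %in% instead of two == tests
four_or_six <- mtcars[mtcars$cyl %in% c(4, 6), ]
print(nrow(four_or_six))

# Both versions of the OR condition pick out the same rows
same_rows <- identical(
  four_or_six,
  mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]
)
print(same_rows)
```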
Next class: data visualization with ggplot2
And then: data wrangling with tidyverse
And then: literate programming and workflow management with R Markdown
Finally: back to econometric theory!