Getting Set Up

Before we begin, start a new file with File → New File → R Script. As you work through this sheet in the R console, also copy and paste the commands that work into this new file. At the end, save it and run it to execute all of your commands at once.

Creating Objects

1.

Work on the following parts:

a.

Create a vector called me with two objects, your first name, and your last name.

me <- c("Ryan", "Safner")

b.

Call the vector to inspect it.

me
## [1] "Ryan"   "Safner"

c.

Confirm it is a character class vector.

class(me)
## [1] "character"
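Checking the class matters because a vector can hold only one type of data. The sketch below uses a made-up vector called `mixed` (not part of the exercise) to show how R quietly converts everything to character when types are mixed.

```r
# A vector can hold only one type; mixing a string and a number
# coerces every element to character.
mixed <- c("Ryan", 42)
class(mixed)   # "character"
mixed          # the number is now the string "42"

# A purely numeric vector keeps its numeric class.
class(c(1, 2, 3))   # "numeric"
```

If `class()` ever reports "character" for data you expected to be numeric, a stray string in the vector is a likely cause.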

2.

Use R’s help functions to determine what the paste() function does. Then paste together your first name and last name.

?paste() # or help(paste)
 # paste is a function that combines (concatenates) multiple string objects into a single object
paste("Ryan", "Safner")
## [1] "Ryan Safner"
# note you can choose how to separate string objects with the "sep" argument
# for example
paste("Ryan", "Safner", sep="") # no separation
## [1] "RyanSafner"
paste("Ryan", "Safner", sep=" ") # separate with a space " " (the default)
## [1] "Ryan Safner"
paste("Ryan", "Safner", sep="_") # separate with underscore
## [1] "Ryan_Safner"
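The `sep` argument separates the different arguments you hand to paste(). If your names are already stored in a single vector, such as `me` from question 1, the `collapse` argument is what joins its elements. A short sketch, which also shows paste0() as a shortcut for `sep = ""`:

```r
me <- c("Ryan", "Safner")

# paste() on a vector returns one string per element (still length 2)
paste(me)

# collapse joins the elements of a single vector into one string
full_name <- paste(me, collapse = " ")
full_name                  # "Ryan Safner"

# paste0() is shorthand for paste(..., sep = "")
paste0("Ryan", "Safner")   # "RyanSafner"
```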

3.

Create a vector called my_vector with all the even integers from 2 to 10.

my_vector <- c(2,4,6,8,10)

# verify it worked
my_vector
## [1]  2  4  6  8 10
# alternatively, you can use the sequence function, seq()
# see the Class page for more about this function
my_vector <- seq(from = 2, # starting integer
                 to = 10, # ending integer
                 by = 2) # by 2's

# you can shorten it by not including the names of the arguments:
my_vector <- seq(2,10,2)

# verify it worked
my_vector
## [1]  2  4  6  8 10
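Printing the vector is one way to verify it. Another is to let R do the checking. The sketch below confirms the length and uses the modulo operator `%%` (the remainder after division) to test that every element is even. The object name `also_even` is only for illustration.

```r
my_vector <- seq(2, 10, by = 2)

length(my_vector)            # 5 elements
all(my_vector %% 2 == 0)     # TRUE: every element has remainder 0 when divided by 2

# the same values built by doubling the integers 1 through 5
also_even <- 2 * (1:5)
all(also_even == my_vector)  # TRUE
```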

4.

Find the mean of my_vector with mean().

mean(my_vector)
## [1] 6

5.

Take all the integers from 18 to 763,¹ then get the mean.

# create a sequence of integers by 1 with starting_number:ending_number
# see Class 3 page for more

# you can do this all at once without making an object
mean(18:763)
## [1] 390.5
# alternatively you can save this as a vector and run the mean on it
vec1 <- 18:763

mean(vec1)
## [1] 390.5
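To see where 390.5 comes from, the mean can be rebuilt by hand. This sketch computes the sum divided by the count, and also uses the fact that the mean of evenly spaced numbers is the midpoint of the first and last values.

```r
vec1 <- 18:763

length(vec1)               # 746 integers
sum(vec1) / length(vec1)   # 390.5, the definition of the mean
(18 + 763) / 2             # 390.5, midpoint of the first and last values
```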

Playing with Data

For the following questions, we will use the diamonds dataset, included as part of ggplot2.

6.

Install ggplot2.

install.packages("ggplot2") # note the s and the quotes

7.

Load ggplot2 with the library() command.

library("ggplot2") # quotes not necessary, but can be used
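Installing is needed only once per computer, while library() is needed in every new R session. One possible way to avoid reinstalling each time a script runs is sketched below. `ensure_package` is a hypothetical helper written for this sheet, not a built-in function. The demonstration calls it with "stats", which ships with R, so nothing is downloaded. For this sheet you would pass "ggplot2" instead.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  # requireNamespace() returns TRUE if the package is already installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
}

ensure_package("stats")   # already installed with R, so this only loads it
```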

8.

Get the structure of the diamonds data frame. What are the different variables and what kind of data does each contain?

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# We have
# - carat: a number
# - cut: an ordered factor
# - color: an ordered factor
# - clarity: an ordered factor
# - depth: a number
# - table: a number
# - price: an integer
# - x: a number
# - y: a number
# - z: a number
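If you only want the type of each column rather than the full str() printout, you can ask for the class column by column. The sketch below uses a small hand-built data frame called `mini` (an assumption for illustration, with values copied from the first four diamonds) so that it runs without ggplot2.

```r
# A small stand-in for diamonds, built by hand
mini <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29),
  cut   = factor(c("Ideal", "Premium", "Good", "Premium"),
                 levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                 ordered = TRUE),
  price = c(326L, 326L, 327L, 334L)
)

dim(mini)   # 4 rows, 3 columns

# apply class() to every column and keep the first class name of each
col_classes <- sapply(mini, function(col) class(col)[1])
col_classes   # carat: "numeric", cut: "ordered", price: "integer"
```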

9.

Get summary statistics separately for carat, depth, table, and price.

summary(diamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
summary(diamonds$depth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   61.00   61.80   61.75   62.50   79.00
summary(diamonds$table)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   56.00   57.00   57.46   59.00   95.00
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823
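Each number that summary() reports can also be computed on its own. A minimal sketch on a short vector (the carat values of the first four diamonds, typed in by hand):

```r
carat <- c(0.23, 0.21, 0.23, 0.29)

min(carat)      # smallest value: 0.21
median(carat)   # middle value: 0.23
mean(carat)     # average: 0.24
max(carat)      # largest value: 0.29

# first and third quartiles
quantile(carat, probs = c(0.25, 0.75))
```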

10.

color, cut, and clarity are categorical variables (factors). Use the table() command to generate frequency tables for each.

table(diamonds$color)
## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808
table(diamonds$cut)
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
table(diamonds$clarity)
## 
##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
##   741  9194 13065 12258  8171  5066  3655  1790
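Counts are often easier to compare as shares. Wrapping a table in prop.table() converts the counts to proportions. The sketch below uses a made-up four-element factor called `cuts` rather than the full dataset.

```r
cuts <- factor(c("Ideal", "Premium", "Good", "Premium"))

counts <- table(cuts)         # frequency of each category
counts

shares <- prop.table(counts)  # each count divided by the total
shares                        # Good 0.25, Ideal 0.25, Premium 0.50
```

With the real data, the same idea would be `prop.table(table(diamonds$cut))`.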

11.

Now rerun the summary() command on the entire data frame.

summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

12.

Now look only at (subset) the first 4 diamonds in the dataset.

# remember, dataframes are indexed by: df[row#s, column#s]
diamonds[1:4,] # select first through fourth rows, all columns
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 0.23   Ideal      E      SI2      61.5   55     326    3.95   3.98   2.43
## 0.21   Premium    E      SI1      59.8   61     326    3.89   3.84   2.31
## 0.23   Good       E      VS1      56.9   65     327    4.05   4.07   2.31
## 0.29   Premium    I      VS2      62.4   58     334    4.20   4.23   2.63
# alternatively
diamonds[c(1,2,3,4),] # using a vector-approach
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 0.23   Ideal      E      SI2      61.5   55     326    3.95   3.98   2.43
## 0.21   Premium    E      SI1      59.8   61     326    3.89   3.84   2.31
## 0.23   Good       E      VS1      56.9   65     327    4.05   4.07   2.31
## 0.29   Premium    I      VS2      62.4   58     334    4.20   4.23   2.63
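For the common case of "the first few rows", base R also has head(), with tail() for the last few. A sketch on a hand-built data frame (`mini` is an assumption for illustration, with values copied from the first five diamonds in this sheet):

```r
mini <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31),
  price = c(326L, 326L, 327L, 334L, 335L)
)

first_four <- head(mini, 4)   # first 4 rows, all columns
first_four

# head() gives the same result as indexing rows 1 through 4
isTRUE(all.equal(first_four, mini[1:4, ]))   # TRUE

tail(mini, 2)                 # last 2 rows
```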

13.

Now look only at (subset) the third and seventh diamond in the dataset.

diamonds[c(3,7),] # select 3rd and 7th row, all columns
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 0.23   Good       E      VS1      56.9   65     327    4.05   4.07   2.31
## 0.24   Very Good  I      VVS1     62.3   57     336    3.95   3.98   2.47

14.

Now look only at (subset) the second column of the dataset.

diamonds[,2] # select all rows, 2nd column
## cut
## <ord>
## Ideal
## Premium
## Good
## Premium
## Good
## Very Good
## Very Good
## Very Good
## Fair
## Very Good
## (first 10 rows shown)

15.

Do this again, but look using the $ to pull up the second column by name.

# second column is called "cut"
diamonds$cut # don't run this, it'll print all 53,940 values!
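The two approaches do not return quite the same kind of object. `$` (and its cousin `[[ ]]`) pulls the column out as a plain vector, while square brackets with a comma can return either a vector or a one-column table. The sketch below shows this on a hand-built base data frame called `mini` (an assumption for illustration). diamonds is a tibble, which always keeps `[ , 2]` as a one-column table.

```r
mini <- data.frame(
  carat = c(0.23, 0.21, 0.23),
  cut   = c("Ideal", "Premium", "Good")
)

mini$cut                  # the column as a vector
mini[["cut"]]             # same thing, with the name given as a string

mini[, 2]                 # base data frame: drops to a vector
mini[, 2, drop = FALSE]   # drop = FALSE keeps a one-column data frame

# to peek at a long column without printing all of it, use head()
head(mini$cut, 2)
```

With the real data, `head(diamonds$cut)` shows the first six values instead of all of them.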

16.

Now look only at diamonds that have a carat greater than or equal to 1.

# use the [square brackets] to subset, 
# first argument (rows) are chosen by conditional: 
# - choose diamonds based on their carat, and only carats >= 1
diamonds[diamonds$carat >= 1,] # select rows on condition, and all columns
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 1.17   Very Good  J      I1       60.2   61.0   2774   6.83   6.90   4.13
## 1.01   Premium    F      I1       61.8   60.0   2781   6.39   6.36   3.94
## 1.01   Fair       E      I1       64.5   58.0   2788   6.29   6.21   4.03
## 1.01   Premium    H      SI2      62.7   59.0   2788   6.31   6.22   3.93
## 1.05   Very Good  J      SI2      63.2   56.0   2789   6.49   6.45   4.09
## 1.05   Fair       J      SI2      65.8   59.0   2789   6.41   6.27   4.18
## 1.00   Premium    I      SI2      58.2   60.0   2795   6.61   6.55   3.83
## 1.01   Fair       E      SI2      67.4   60.0   2797   6.19   6.05   4.13
## 1.04   Premium    G      I1       62.2   58.0   2801   6.46   6.41   4.00
## 1.00   Premium    J      SI2      62.3   58.0   2801   6.45   6.34   3.98
## (first 10 rows shown)
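It helps to look at the condition on its own before using it inside the brackets. The condition is just a vector of TRUE and FALSE values, one per row, and the brackets keep the rows marked TRUE. A sketch on a hand-built data frame (`mini` and `is_big` are names made up for illustration):

```r
mini <- data.frame(
  carat = c(0.23, 1.17, 1.00, 1.01, 0.24),
  price = c(326L, 2774L, 4445L, 4955L, 336L)
)

is_big <- mini$carat >= 1   # one TRUE/FALSE per row
is_big                      # FALSE TRUE TRUE TRUE FALSE

sum(is_big)                 # TRUE counts as 1, so this is the number of matches: 3

mini[is_big, ]              # keep only the rows where the condition is TRUE
```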

17.

Now look only at diamonds that have a VVS1 clarity.

# we are testing for equality, so we need two ==
# we are selecting based on clarity, a character/factor, so we need quotes
diamonds[diamonds$clarity=="VVS1",] # select rows on condition, and all columns
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 0.24   Very Good  I      VVS1     62.3   57.0   336    3.95   3.98   2.47
## 0.32   Ideal      I      VVS1     62.0   55.3   553    4.39   4.42   2.73
## 0.24   Premium    E      VVS1     60.7   58.0   553    4.01   4.03   2.44
## 0.24   Very Good  D      VVS1     61.5   60.0   553    3.97   4.00   2.45
## 0.26   Very Good  E      VVS1     62.6   59.0   554    4.06   4.09   2.55
## 0.26   Very Good  E      VVS1     63.4   59.0   554    4.00   4.04   2.55
## 0.26   Very Good  D      VVS1     62.1   60.0   554    4.03   4.12   2.53
## 0.26   Good       E      VVS1     57.9   60.0   554    4.22   4.25   2.45
## 0.24   Premium    G      VVS1     62.3   59.0   554    3.95   3.92   2.45
## 0.24   Premium    H      VVS1     61.2   58.0   554    4.01   3.96   2.44
## (first 10 rows shown)

18.

Now look only at diamonds that have a color of E, F, I, or J.

# same idea as last problem, except now we want one of any of these 4 colors

# first (tedious) way, a series of checking equality and using "OR"s (|) 
diamonds[diamonds$color=="E" | diamonds$color=="F" | diamonds$color=="I" | diamonds$color=="J",] # select rows on condition, and all columns
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 0.23   Ideal      E      SI2      61.5   55.0   326    3.95   3.98   2.43
## 0.21   Premium    E      SI1      59.8   61.0   326    3.89   3.84   2.31
## 0.23   Good       E      VS1      56.9   65.0   327    4.05   4.07   2.31
## 0.29   Premium    I      VS2      62.4   58.0   334    4.20   4.23   2.63
## 0.31   Good       J      SI2      63.3   58.0   335    4.34   4.35   2.75
## 0.24   Very Good  J      VVS2     62.8   57.0   336    3.94   3.96   2.48
## 0.24   Very Good  I      VVS1     62.3   57.0   336    3.95   3.98   2.47
## 0.22   Fair       E      VS2      65.1   61.0   337    3.87   3.78   2.49
## 0.30   Good       J      SI1      64.0   55.0   339    4.25   4.28   2.73
## 0.23   Ideal      J      VS1      62.8   56.0   340    3.93   3.90   2.46
## (first 10 rows shown)
# second (better) way, using group membership operator (%in%) and list the elements as a vector
diamonds[diamonds$color %in% c("E","F","I","J"),] # select rows on condition, and all columns 
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 0.23   Ideal      E      SI2      61.5   55.0   326    3.95   3.98   2.43
## 0.21   Premium    E      SI1      59.8   61.0   326    3.89   3.84   2.31
## 0.23   Good       E      VS1      56.9   65.0   327    4.05   4.07   2.31
## 0.29   Premium    I      VS2      62.4   58.0   334    4.20   4.23   2.63
## 0.31   Good       J      SI2      63.3   58.0   335    4.34   4.35   2.75
## 0.24   Very Good  J      VVS2     62.8   57.0   336    3.94   3.96   2.48
## 0.24   Very Good  I      VVS1     62.3   57.0   336    3.95   3.98   2.47
## 0.22   Fair       E      VS2      65.1   61.0   337    3.87   3.78   2.49
## 0.30   Good       J      SI1      64.0   55.0   339    4.25   4.28   2.73
## 0.23   Ideal      J      VS1      62.8   56.0   340    3.93   3.90   2.46
## (first 10 rows shown)
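Both approaches build the same TRUE/FALSE vector, which you can confirm directly. The sketch below uses a short made-up vector of colors (`color`, `by_or`, and `by_in` are illustrative names). It also shows one difference worth knowing: with a missing value, `==` returns NA while `%in%` returns FALSE.

```r
color <- c("D", "E", "G", "I", "J")

# the tedious way: one equality test per color, joined with OR
by_or <- color == "E" | color == "F" | color == "I" | color == "J"

# the shorter way: test membership in a set of colors
by_in <- color %in% c("E", "F", "I", "J")

by_in                      # FALSE TRUE FALSE TRUE TRUE
identical(by_or, by_in)    # TRUE: the two conditions agree

# missing values behave differently
NA == "E"                  # NA
NA %in% c("E", "F")        # FALSE
```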

19.

Now look only at diamonds that have a carat greater than or equal to 1 and a VVS1 clarity.

# testing for two conditions (AND)
diamonds[diamonds$carat>=1 & diamonds$clarity=="VVS1",] # select rows on condition, and all columns 
## carat  cut        color  clarity  depth  table  price  x      y      z
## <dbl>  <ord>      <ord>  <ord>    <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 1.00   Good       I      VVS1     56.5   62.0   4445   6.58   6.55   3.71
## 1.00   Good       J      VVS1     63.5   59.0   4633   6.29   6.34   4.01
## 1.00   Very Good  J      VVS1     63.5   59.0   4717   6.34   6.29   4.01
## 1.01   Premium    H      VVS1     60.5   57.0   4955   6.55   6.48   3.94
## 1.01   Premium    I      VVS1     62.0   59.0   4989   6.39   6.32   3.94
## 1.04   Premium    H      VVS1     60.4   58.0   5102   6.58   6.53   3.96
## 1.01   Ideal      I      VVS1     61.7   56.0   5478   6.42   6.47   3.98
## 1.09   Very Good  J      VVS1     63.9   56.0   5588   6.47   6.51   4.15
## 1.27   Premium    J      VVS1     60.1   58.0   5761   7.06   6.99   4.22
## 1.21   Premium    J      VVS1     61.3   59.0   5893   6.81   6.86   4.19
## (first 10 rows shown)
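Base R also has a subset() function that expresses the same filter without repeating the data frame's name inside the condition. A sketch comparing the two on a hand-built data frame (`mini`, `with_brackets`, and `with_subset` are names made up for illustration):

```r
mini <- data.frame(
  carat   = c(0.23, 1.17, 1.00, 1.01, 0.24),
  clarity = c("SI2", "I1", "VVS1", "VVS1", "VVS1"),
  price   = c(326L, 2774L, 4445L, 4955L, 336L)
)

# square brackets: both conditions must be TRUE (&), all columns kept
with_brackets <- mini[mini$carat >= 1 & mini$clarity == "VVS1", ]

# subset(): column names can be used directly inside the condition
with_subset <- subset(mini, carat >= 1 & clarity == "VVS1")

with_subset                                       # the two rows meeting both conditions
isTRUE(all.equal(with_brackets, with_subset))     # TRUE
```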

20.

Get the average price of diamonds in question 19.²

# use command from last question as the argument to the mean function,
## but be sure that you look at the price, specifically
## note: the condition below also keeps only color "D" diamonds (the subset used in question 21)

mean(diamonds$price[diamonds$carat>=1 & diamonds$color=="D" & diamonds$clarity=="VVS1"])
## [1] 13935.48

21.

What is the highest price for a diamond of at least 1.0 carat with D color and VVS1 clarity?

max(diamonds$price[diamonds$carat>=1 & diamonds$color=="D" & diamonds$clarity=="VVS1"])
## [1] 17932
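Notice that these commands have no comma inside the square brackets. `diamonds$price` is a single vector, so there are only elements to pick, not rows and columns. The sketch below walks through the same pattern on a hand-built data frame (`mini`, `keep`, and `prices` are names made up for illustration), and uses which.max() to find the row that holds the highest price.

```r
mini <- data.frame(
  carat   = c(0.23, 1.17, 1.00, 1.01, 0.24),
  clarity = c("SI2", "I1", "VVS1", "VVS1", "VVS1"),
  price   = c(326L, 2774L, 4445L, 4955L, 336L)
)

keep <- mini$carat >= 1 & mini$clarity == "VVS1"

# subsetting a vector: no comma needed
prices <- mini$price[keep]
prices          # 4445 4955

mean(prices)    # 4700
max(prices)     # 4955

# which.max() gives the position of the largest value,
# which we use to pull the full row for that diamond
mini[keep, ][which.max(prices), ]
```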

Execute your R Script

Save the R script you created at the beginning, into which you have (hopefully) been pasting all of your valid commands. This creates a .R file wherever you choose to save it. With the file open in the upper-left pane of RStudio, select all of the code and click the Run button in that pane's upper-right corner (on its own, Run executes only the current line or selection), or click Source to execute the whole file. Sit back and watch R redo everything you've carefully worked on, all at once.
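You can also run a saved script from the console with source(). The sketch below writes a tiny two-line script to a temporary file (the file and the object `full_name` are made up for illustration) and then executes it. With your own work, you would pass the path to your saved .R file instead.

```r
# write a small script to a temporary .R file
script <- tempfile(fileext = ".R")
writeLines(c(
  'me <- c("Ryan", "Safner")',
  'full_name <- paste(me, collapse = " ")'
), script)

# run every line of the file; echo = TRUE prints each command as it runs
source(script, echo = TRUE)

full_name   # "Ryan Safner", created by the script
```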

Your .R file should look something like this:

# 1 -------------------

## a 
me <- c("Ryan", "Safner")

## b
me

## c 
class(me)

# 2 ----------------

?paste() # or help(paste)
 # paste is a function that combines (concatenates) multiple string objects into a single object
paste("Ryan", "Safner")

# note you can choose how to separate string objects with the "sep" argument
# for example
paste("Ryan", "Safner", sep="") # no separation
paste("Ryan", "Safner", sep=" ") # separate with a space " " (the default)
paste("Ryan", "Safner", sep="_") # separate with underscore

# 3 ---------------

my_vector <- c(2,4,6,8,10)

# verify it worked
my_vector
# alternatively, you can use the sequence function, seq()
# see the Class page for more about this function
my_vector <- seq(from = 2, # starting integer
                 to = 10, # ending integer
                 by = 2) # by 2's

# you can shorten it by not including the names of the arguments:
my_vector <- seq(2,10,2)

# verify it worked
my_vector

# 4 -------------------

mean(my_vector)


# 5 -------------------

# create a sequence of integers by 1 with starting_number:ending_number
# see Class 3 page for more

# you can do this all at once without making an object
mean(18:763)

# alternatively you can save this as a vector and run the mean on it
vec1 <- 18:763

mean(vec1)

# 6 ------------------

install.packages("ggplot2") # note the s and the quotes

# 7 ------------------

library("ggplot2") # quotes not necessary, but can be used

# 8 ------------------

str(diamonds)

# We have
# - carat: a number
# - cut: an ordered factor
# - color: an ordered factor
# - clarity: an ordered factor
# - depth: a number
# - table: a number
# - price: an integer
# - x: a number
# - y: a number
# - z: a number

# 9 ------------------

summary(diamonds$carat)
summary(diamonds$depth)
summary(diamonds$table)
summary(diamonds$price)

# 10 ------------------

table(diamonds$color)
table(diamonds$cut)
table(diamonds$clarity)

# 11 ------------------

summary(diamonds)

# 12 ------------------

# remember, dataframes are indexed by: df[row#s, column#s]
diamonds[1:4,] # select first through fourth rows, all columns

# alternatively
diamonds[c(1,2,3,4),] # using a vector-approach

# 13 ------------------

diamonds[c(3,7),] # select 3rd and 7th row, all columns

# 14 ------------------
diamonds[,2] # select all rows, 2nd column

# 15 ------------------

# second column is called "cut"
# diamonds$cut # don't run this, it'll print all 53,940 values!

# 16 -------------------

# use the [square brackets] to subset, 
# first argument (rows) are chosen by conditional: 
# - choose diamonds based on their carat, and only carats >= 1
diamonds[diamonds$carat >= 1,] # select rows on condition, and all columns

# 17 -------------------

# we are testing for equality, so we need two ==
# we are selecting based on clarity, a character/factor, so we need quotes
diamonds[diamonds$clarity=="VVS1",] # select rows on condition, and all columns

# 18 -------------------

# same idea as last problem, except now we want one of any of these 4 colors

# first (tedious) way, a series of checking equality and using "OR"s (|) 
diamonds[diamonds$color=="E" | diamonds$color=="F" | diamonds$color=="I" | diamonds$color=="J",] # select rows on condition, and all columns

# second (better) way, using group membership operator (%in%) and list the elements as a vector
diamonds[diamonds$color %in% c("E","F","I","J"),] # select rows on condition, and all columns 

# 19 -------------------

# testing for two conditions (AND)
diamonds[diamonds$carat>=1 & diamonds$clarity=="VVS1",] # select rows on condition, and all columns 

# 20 -------------------

# use command from last question as the argument to the mean function
## but be sure that you look at the price, specifically
mean(diamonds$price[diamonds$carat>=1 & diamonds$color=="D" & diamonds$clarity=="VVS1"])

# 21 -------------------

max(diamonds$price[diamonds$carat>=1 & diamonds$color=="D" & diamonds$clarity=="VVS1"])

  1. Hint: use the : operator to create a sequence from a starting number to an ending number.

  2. Hint: use your subset command as an argument to the mean function. You will not need a comma here because you are subsetting a single column (a vector), not a whole data frame.