Examine the structure of the iris data set. How many observations and
variables are in the data set?
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
There are 150 observations and 5 variables.
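As a cross-check on the str() output, the same counts can be read directly with base R. This is only a small sketch; the variable names n_obs and n_vars are introduced here for illustration.

```r
# iris ships with base R (datasets package), so no extra packages are needed.
n_obs  <- nrow(iris)   # number of observations (rows)
n_vars <- ncol(iris)   # number of variables (columns)

# dim() returns both values at once as c(rows, columns).
print(dim(iris))
cat("Observations:", n_obs, "Variables:", n_vars, "\n")
```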
Create a new data frame iris1 that contains only the species virginica
and versicolor with sepal lengths greater than 6 cm and sepal widths
greater than 2.5 cm. How many observations and variables are in the
data set?
library(dplyr)  # provides filter(), select(), arrange(), mutate(), summarize()
iris1 <- filter(iris,
                Species %in% c("virginica", "versicolor"),
                Sepal.Length > 6,
                Sepal.Width > 2.5)
str(iris1)
## 'data.frame': 56 obs. of 5 variables:
## $ Sepal.Length: num 7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.1 6.1 ...
## $ Sepal.Width : num 3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.8 2.8 ...
## $ Petal.Length: num 4.7 4.5 4.9 4.6 4.7 4.6 4.7 4.4 4 4.7 ...
## $ Petal.Width : num 1.4 1.5 1.5 1.5 1.6 1.3 1.4 1.4 1.3 1.2 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
There are 56 observations and 5 variables.
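The conditions passed to filter() are separated by commas, which dplyr combines with a logical AND. The sketch below, assuming dplyr is installed, writes the AND out explicitly and compares the result with base R logical indexing. The names iris1_and, iris1_base, and keep are illustrative only. It also shows that the unused setosa level stays in the factor (hence "3 levels" in the str() output) unless droplevels() is applied.

```r
library(dplyr)

# Same filter with the AND written out explicitly.
iris1_and <- filter(iris,
                    Species %in% c("virginica", "versicolor") &
                      Sepal.Length > 6 &
                      Sepal.Width > 2.5)

# Base R equivalent using a logical index vector.
keep <- iris$Species %in% c("virginica", "versicolor") &
  iris$Sepal.Length > 6 &
  iris$Sepal.Width > 2.5
iris1_base <- iris[keep, ]

# Both approaches keep the same number of rows.
print(nrow(iris1_and) == nrow(iris1_base))

# The factor still carries the empty "setosa" level until it is dropped.
print(nlevels(iris1_and$Species))
print(nlevels(droplevels(iris1_and$Species)))
```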
Now, create an iris2 data frame from iris1 that contains only the
columns for species, sepal length, and sepal width. How many
observations and variables are in the data set?
iris2 <- select(iris1, Species, Sepal.Length, Sepal.Width)
str(iris2)
## 'data.frame': 56 obs. of 3 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Sepal.Length: num 7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.1 6.1 ...
## $ Sepal.Width : num 3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.8 2.8 ...
There are 56 observations and 3 variables.
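The same column selection can also be written with a tidyselect helper instead of naming each column. This is an alternative sketch, not the approach used in the exercise. It assumes dplyr is installed, and iris2_alt is a name made up here.

```r
library(dplyr)

# Rebuild iris1 and iris2 so the example is self-contained.
iris1 <- filter(iris,
                Species %in% c("virginica", "versicolor"),
                Sepal.Length > 6,
                Sepal.Width > 2.5)
iris2 <- select(iris1, Species, Sepal.Length, Sepal.Width)

# starts_with("Sepal") picks every column whose name begins with "Sepal".
iris2_alt <- select(iris1, Species, starts_with("Sepal"))

# Compare the column names produced by the two approaches.
print(identical(names(iris2), names(iris2_alt)))
```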
Create an iris3 data frame from iris2 that orders the observations
from largest to smallest sepal length. Show the first 6 rows of this
data set.
iris3 <- arrange(iris2, desc(Sepal.Length))  # arrange() has no 'by' argument; pass the sort key directly
head(iris3)
## Species Sepal.Length Sepal.Width
## 1 virginica 7.9 3.8
## 2 virginica 7.7 3.8
## 3 virginica 7.7 2.6
## 4 virginica 7.7 2.8
## 5 virginica 7.7 3.0
## 6 virginica 7.6 3.0
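Several rows in the output above share a sepal length of 7.7, so their relative order depends on how ties are handled. One way to make the ordering fully explicit is to add a second sort key. The sketch below assumes dplyr is installed and uses sepal width as a tie-breaker, which is a choice made here for illustration and not part of the exercise. The name iris3_ties is hypothetical.

```r
library(dplyr)

# Rebuild iris2 so the example is self-contained.
iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by sepal length (descending), then by sepal width (descending) within ties.
iris3_ties <- arrange(iris2, desc(Sepal.Length), desc(Sepal.Width))

print(head(iris3_ties))
```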
Create an iris4 data frame from iris3 that creates a column with a
sepal area (length * width) value for each observation. How many
observations and variables are in the data set?
iris4 <- mutate(iris3, Sepal.Area=Sepal.Length*Sepal.Width)
str(iris4)
## 'data.frame': 56 obs. of 4 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
## $ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
## $ Sepal.Area : num 30 29.3 20 21.6 23.1 ...
There are 56 observations and 4 variables.
Create iris5 that calculates the average sepal length, the average
sepal width, and the sample size of the entire iris4 data frame and
print iris5.
iris5 <- summarize(iris4,
                   Avg.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
                   Avg.Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
                   Sample.Size = n())
print(iris5)
## Avg.Sepal.Length Avg.Sepal.Width Sample.Size
## 1 6.698214 3.041071 56
Finally, create iris6 that calculates the average sepal length, the
average sepal width, and the sample size for each species in the iris4
data frame and print iris6.
iris6 <- summarize(group_by(iris4, Species),
                   Avg.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
                   Avg.Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
                   Sample.Size = n())
print(iris6)
## # A tibble: 2 × 4
## Species Avg.Sepal.Length Avg.Sepal.Width Sample.Size
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
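A quick sanity check on grouped summaries is that the per-species means, weighted by their sample sizes, should reproduce the overall mean. The sketch below assumes dplyr is installed. The names kept_rows, by_species, overall, and recombined are illustrative. It skips the arrange() and mutate() steps because they do not affect the means.

```r
library(dplyr)

# Same rows and columns as iris4, minus the sort and the Sepal.Area column.
kept_rows <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Per-species mean and sample size.
by_species <- kept_rows %>%
  group_by(Species) %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length),
            Sample.Size = n())

# Overall mean versus the size-weighted mean of the group means.
overall    <- mean(kept_rows$Sepal.Length)
recombined <- weighted.mean(by_species$Avg.Sepal.Length, by_species$Sample.Size)

print(isTRUE(all.equal(overall, recombined)))
```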
In these exercises, you have successively created modified versions of
the data frame: iris1, iris2, iris3, iris4, iris5, and iris6. At each
stage, the output data frame from one operation serves as the input
for the next. A more concise way to do this is to use the pipe
operator %>%, which comes from the magrittr package and is available
once dplyr is loaded. See if you can rework all of your previous
statements (except for iris5) into an extended piping operation that
uses iris as the input and generates irisFinal as the output.
irisFinal <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
            Avg.Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
            Sample.Size = n())
print(irisFinal)
## # A tibble: 2 × 4
## Species Avg.Sepal.Length Avg.Sepal.Width Sample.Size
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
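The same chain can also be written with R's native pipe |>, which is part of base R from version 4.1 onward. The sketch below assumes R 4.1 or later and an installed dplyr. The name irisFinalNative is made up here. For a simple chain like this one, where each step takes the data as its first argument, the two pipes behave the same way.

```r
library(dplyr)

# Same pipeline as above, using the native pipe (requires R >= 4.1).
irisFinalNative <- iris |>
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) |>
  select(Species, Sepal.Length, Sepal.Width) |>
  arrange(desc(Sepal.Length)) |>
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) |>
  group_by(Species) |>
  summarize(Avg.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
            Avg.Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
            Sample.Size = n())

print(irisFinalNative)
```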
Create a ‘longer’ data frame using the original iris
data set with three columns named “Species”, “Measure”, “Value”. The
column “Species” will retain the species names of the data set. The
column “Measure” will indicate whether the value corresponds to
Sepal.Length, Sepal.Width, Petal.Length, or Petal.Width, and the column
“Value” will hold the numerical values of those measurements.
library(tidyr)  # provides pivot_longer()
irisLong <- pivot_longer(iris, cols = Sepal.Length:Petal.Width,
                         names_to = "Measure",
                         values_to = "Value",
                         values_drop_na = TRUE)
head(irisLong)
## # A tibble: 6 × 3
## Species Measure Value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
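Each of the 150 original rows contributes one row per measurement column, so the long data frame should have 150 * 4 = 600 rows. The sketch below checks that and then shows one thing the long layout makes easy: a single grouped summary across every species and measure combination. It assumes dplyr and tidyr are installed. The name measure_means is illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the long data frame so the example is self-contained.
irisLong <- pivot_longer(iris, cols = Sepal.Length:Petal.Width,
                         names_to = "Measure",
                         values_to = "Value",
                         values_drop_na = TRUE)

# 150 observations x 4 measurement columns.
print(nrow(irisLong) == nrow(iris) * 4)

# One mean per species and measure (3 species x 4 measures = 12 rows).
measure_means <- irisLong %>%
  group_by(Species, Measure) %>%
  summarize(Mean.Value = mean(Value), .groups = "drop")

print(measure_means)
```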