tapply
Apply a function over subsets of a vector.
Example: Take group means with tapply.
x <- c(rnorm(10), runif(10), rnorm(10, 1))
x
## [1] -2.04333 0.72651 1.55476 -0.12411 1.08133 -0.30514 1.38638
## [8] 0.35047 0.88147 1.80250 0.56072 0.54421 0.19270 0.67049
## [15] 0.83034 0.02640 0.98629 0.11223 0.10747 0.85049 1.37510
## [22] 0.56575 0.34138 1.01877 1.33956 -0.07667 2.12114 0.70827
## [29] 0.53336 2.15315
f <- gl(3, 10) # Generator factors
f
## [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3
By default, if the applied function returns a scalar, then tapply returns a vector. In this case we are applying the mean function, so the output of tapply is a numeric vector.
tapply(x, f, mean) # Take the mean of each group. Returns a vector.
## 1 2 3
## 0.5311 0.4881 1.0080
Without simplification, tapply always returns a list.
tapply(x, f, mean, simplify = FALSE)
## $`1`
## [1] 0.5311
##
## $`2`
## [1] 0.4881
##
## $`3`
## [1] 1.008
Example: Find the group range
Range returns the minimum and maximum value for each group. Note that since the range function returns a vector, tapply returns a list.
tapply(x, f, range)
## $`1`
## [1] -2.043 1.803
##
## $`2`
## [1] 0.0264 0.9863
##
## $`3`
## [1] -0.07667 2.15315
split
The split function splits a vector int groups using a factor. Using split and then applying a function with lapply produces the same resule as tapply:
split(x, f)
## $`1`
## [1] -2.0433 0.7265 1.5548 -0.1241 1.0813 -0.3051 1.3864 0.3505
## [9] 0.8815 1.8025
##
## $`2`
## [1] 0.5607 0.5442 0.1927 0.6705 0.8303 0.0264 0.9863 0.1122 0.1075 0.8505
##
## $`3`
## [1] 1.37510 0.56575 0.34138 1.01877 1.33956 -0.07667 2.12114
## [8] 0.70827 0.53336 2.15315
lapply(split(x, f), mean) # Instead of tapply
## $`1`
## [1] 0.5311
##
## $`2`
## [1] 0.4881
##
## $`3`
## [1] 1.008
Example: Air Quality by month
library(datasets)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
s <- split(airquality, airquality$Month)
s
is a list of dataframes split by month:
class(s)
## [1] "list"
class(s[[1]])
## [1] "data.frame"
lapply(s, function(x) head(x, n = 2))
## $`5`
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
##
## $`6`
## Ozone Solar.R Wind Temp Month Day
## 32 NA 286 8.6 78 6 1
## 33 NA 287 9.7 74 6 2
##
## $`7`
## Ozone Solar.R Wind Temp Month Day
## 62 135 269 4.1 84 7 1
## 63 49 248 9.2 85 7 2
##
## $`8`
## Ozone Solar.R Wind Temp Month Day
## 93 39 83 6.9 81 8 1
## 94 9 24 13.8 81 8 2
##
## $`9`
## Ozone Solar.R Wind Temp Month Day
## 124 96 167 6.9 91 9 1
## 125 78 197 5.1 92 9 2
After splitting by month, we can use lapply
to take the column means:
takeMeans <- function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
lapply(s, takeMeans)
## $`5`
## Ozone Solar.R Wind
## 23.62 181.30 11.62
##
## $`6`
## Ozone Solar.R Wind
## 29.44 190.17 10.27
##
## $`7`
## Ozone Solar.R Wind
## 59.115 216.484 8.942
##
## $`8`
## Ozone Solar.R Wind
## 59.962 171.857 8.794
##
## $`9`
## Ozone Solar.R Wind
## 31.45 167.43 10.18
Using sapply
instead of lapply
simplifies the output into a matrix,
so we can easily see the mean values by month.
sapply(s, takeMeans)
## 5 6 7 8 9
## Ozone 23.62 29.44 59.115 59.962 31.45
## Solar.R 181.30 190.17 216.484 171.857 167.43
## Wind 11.62 10.27 8.942 8.794 10.18
Splitting on more than one level
x <- rnorm(10)
f1 <- gl(2, 5) # Create a factor with two levels
f2 <- gl(5, 2) # Create a factor with five levels
str(f1)
## Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
str(f2)
## Factor w/ 5 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5
interaction(f1, f2) # Compute a factor with 10 levels, which is the interaction of the two factors
## [1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
## Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5
Note that some levels of the interaction are empty for our data vector x
. For example, none of the samples in x
have level 2.1.
We can use multiple factor levels with split
by passing the interaction
factor. Instead of calling interaction
explicitly, if we pass the factors in a list, the interaction
function is automatically called:
str(split(x, list(f1, f2))) # Automatically calls interaction(f1,f2)
## List of 10
## $ 1.1: num [1:2] 0.993 0.939
## $ 2.1: num(0)
## $ 1.2: num [1:2] 1.092 -0.382
## $ 2.2: num(0)
## $ 1.3: num -0.985
## $ 2.3: num 1.53
## $ 1.4: num(0)
## $ 2.4: num [1:2] -0.066 -1.64
## $ 1.5: num(0)
## $ 2.5: num [1:2] 0.452 -1.324
Note that some levels are empty, but they still appear in list output by
split
. Empty levels can be dropped by passing drop = TRUE
.
str(split(x, list(f1, f2), drop = TRUE))
## List of 6
## $ 1.1: num [1:2] 0.993 0.939
## $ 1.2: num [1:2] 1.092 -0.382
## $ 1.3: num -0.985
## $ 2.3: num 1.53
## $ 2.4: num [1:2] -0.066 -1.64
## $ 2.5: num [1:2] 0.452 -1.324