```{r, echo=FALSE}
knitr::opts_chunk$set(fig.path="figure/tapplyandsplit_")
```

## tapply ##

Apply a function over subsets of a vector.

### Example: Take group means with tapply ###

```{r}
x <- c(rnorm(10), runif(10), rnorm(10,1))
x
f <- gl(3,10) # Generate a factor with 3 levels, each repeated 10 times
f
```

By default, if the applied function returns a scalar, then `tapply` returns a vector. Here we apply `mean`, so the output of `tapply` is a numeric vector.

```{r}
tapply(x, f, mean) # Take the mean of each group. Returns a vector.
```

Without simplification, `tapply` always returns a list.

```{r}
tapply(x, f, mean, simplify=FALSE)
```

### Example: Find the group range ###

`range` returns the minimum and maximum value for each group. Note that since `range` returns a vector of length two rather than a scalar, `tapply` returns a list.

```{r}
tapply(x, f, range)
```

## split ##

The `split` function splits a vector into groups using a factor. Using `split` and then applying a function with `lapply` produces the same group values as `tapply`, returned as a list:

```{r}
split(x, f)
lapply(split(x,f), mean) # Instead of tapply
```

### Example: Air Quality by month ###

```{r}
library(datasets)
head(airquality)
s <- split(airquality, airquality$Month)
```

`s` is a list of data frames split by month:

```{r}
class(s)
class(s[[1]])
lapply(s, function(x) head(x, n=2))
```

After splitting by month, we can use `lapply` to take the column means:

```{r}
takeMeans <- function(x) colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm=TRUE)
lapply(s, takeMeans)
```

Using `sapply` instead of `lapply` simplifies the output into a matrix, so we can easily see the mean values by month.

```{r}
sapply(s, takeMeans)
```

### Splitting on more than one level ###

```{r}
x <- rnorm(10)
f1 <- gl(2,5) # Create a factor with two levels
f2 <- gl(5,2) # Create a factor with five levels
str(f1)
str(f2)
interaction(f1,f2) # Compute a factor with 10 levels, which is the interaction of the two factors
```

Note that some levels of the interaction are empty for our data vector `x`.
For example, none of the samples in `x` have level 2.1. We can split on more than one factor by passing the interaction factor to `split`. Instead of calling `interaction` explicitly, we can pass the factors in a list, and `split` calls `interaction` automatically:

```{r}
str(split(x, list(f1, f2))) # Automatically calls interaction(f1,f2)
```

Note that some levels are empty, but they still appear in the list returned by `split`. Empty levels can be dropped by passing `drop = TRUE`.

```{r}
str(split(x, list(f1,f2), drop=TRUE))
```
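
As a quick check on the note about empty levels, the sketch below counts the observations in each interaction level and lists the empty ones. It rebuilds `x`, `f1`, and `f2` so the chunk runs on its own; the names `groups`, `sizes`, and `kept` are introduced here for illustration only.

```{r}
# Rebuild the data and factors from the example above
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)

groups <- split(x, list(f1, f2))   # Keeps all 10 interaction levels
sizes <- lengths(groups)           # Number of observations in each level
sizes
names(sizes)[sizes == 0]           # Levels with no observations

kept <- split(x, list(f1, f2), drop = TRUE)
length(kept)                       # Only the non-empty levels remain
```

Because the factors are generated deterministically by `gl`, the set of empty levels does not depend on the random values in `x`.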