Chapter 4 Indexing

When working with data objects, most tasks require extracting, altering, or updating parts of them. Therefore, data objects are indexed by position, name, or logical identifiers. Native data types (vectors, matrices, arrays) are indexed by <object name>[<index>]. With positional indices you can also negate the selection, i.e. extract all entries except the indexed ones, by <object name>[-<index>].
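As a minimal sketch of these kinds of index, consider a small named vector. The object v and its names are made up for illustration only:

```r
# hypothetical example vector with names
v <- c(a = 10, b = 20, c = 30)

by_position <- v[2]                      # by position: 2nd entry
by_name     <- v["b"]                    # by name: entry called "b"
by_logical  <- v[c(FALSE, TRUE, FALSE)]  # by logical vector: entries marked TRUE
negated     <- v[-2]                     # negation: everything except the 2nd entry
```

The first three expressions select the same entry; the last one returns the entries named a and c.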

In the following, we briefly introduce the basic concepts of indexing for common data objects. Note that data science applications often use tibbles and data frames. The tidyverse package offers dedicated functions for data wrangling (i.e., indexing, merging, subsetting, etc.) with so-called tidy data. Its syntax and functions are very powerful and popular in the data science community, but quite different from basic indexing. Therefore, they are not included here. For an introduction, see the book by Hadley Wickham.

4.1 Vectors

For vectors, indexing allows one to access and extract scalars and sub-vectors.

x <- c(5, 6, 7)
x[1]              # extract 1st entry --> result is a scalar
x[c(1,3)]         # extract 1st & 3rd entry --> result is a vector
x[-c(1,3)]        # extract all except 1st & 3rd entry --> result is a vector
x[4]              # a 4th entry does not exist --> result is NA
x[c(TRUE,FALSE,TRUE)] # logical index: entries marked TRUE are kept; use the same length as x (shorter index vectors are recycled)
y <- x <= 6       # generate an indexing vector by logical evaluation
z <- which(y)     # retain indices of those entries --> z is an integer vector
x[y]              # subvector of all entries of x which are <= 6
x[z]              # the same subvector
x[!y]             # extract the opposite (x>6)
x[-z]             # the same subvector
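The same index expressions can appear on the left-hand side of an assignment to alter or update entries. The following is a small sketch with arbitrarily chosen values:

```r
x <- c(5, 6, 7)
x[2] <- 60          # overwrite the 2nd entry
x[x > 50] <- 0      # overwrite all entries selected by a logical index
x[5] <- 9           # assigning beyond the end extends the vector; the gap is filled with NA
x                   # 5 0 7 NA 9
```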

Another helpful application of indexing is sorting and rearranging vectors:

x <- sample(x = 1:100, size = 10) # sample 10 values between 1 and 100 randomly
y <- order(x)     # indices of the entries in ascending order (y[1]: index of smallest entry, y[2]: index of 2nd smallest entry, ...)
y[2]              # index of 2nd smallest entry of x
x[y]              # ordered vector
sort(x)           # same result obtained by sort function
z <- rank(x)      # z contains the ranks of the entries of x
which(z == 2)     # index of 2nd smallest entry of x
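The relation between order() and rank() can be checked directly. The sketch below assumes that x contains no ties (which holds here because sample() draws without replacement); the seed and the vector labels are illustrative additions:

```r
set.seed(1)                         # fixed seed only to make the sketch reproducible
x <- sample(x = 1:100, size = 10)   # 10 distinct values, so there are no ties
o <- order(x)                       # o[k] is the position of the k-th smallest entry
r <- rank(x)                        # r[i] is the rank of the i-th entry

all(x[o] == sort(x))                # TRUE: indexing by o sorts x
all(order(o) == r)                  # TRUE without ties: rank is the inverse permutation of order
x[order(x, decreasing = TRUE)]      # descending order
labels <- letters[1:10]             # hypothetical second vector aligned with x
labels[o]                           # reorder the labels in the same way as x
```

Keeping the index vector o, rather than only the sorted values, is what allows a second, aligned object to be rearranged consistently.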

4.2 Matrices

Matrices have two dimensions and, thus, two indices are required to access an entry. The indices are separated by a comma, i.e. <name matrix>[<index row>, <index column>].

x <- cbind(1:6, 2:7, 3:8) # column-wise binding
x[1,1]                    # first row, first column (just one element)
x[1,]                     # first row (a vector)
x[,1]                     # first column (a vector)
x[1:2,]                   # first and second row (a matrix)
y <- rep(x = c(TRUE,FALSE,TRUE), times = 2) # logical index vector of length 6
x[y,]                     # rows where y is TRUE, i.e. rows 1, 3, 4 and 6
x[y,c(1,3)]               # mix row and column selections

Matrix entries can also be accessed by index tuples collected in a two-column matrix with one (row, column) pair per row, i.e., <name matrix>[<index matrix>].

x <- matrix(1:16, ncol=4) # square matrix
y <- cbind(1:4, 1:4)      # indices of diagonal elements
x[y]                      # extract diagonal elements of x
diag(x)                   # ... or directly
z <- cbind(c(2,4,1), c(1,3,2))    # indices of some other elements
x[z]
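Because R stores matrices column by column, each (row, column) pair corresponds to one position in the underlying vector. The following sketch computes these positions by hand; the name lin is introduced only for illustration:

```r
x <- matrix(1:16, ncol = 4)             # filled column by column
z <- cbind(c(2, 4, 1), c(1, 3, 2))      # (row, column) pairs
lin <- (z[, 2] - 1) * nrow(x) + z[, 1]  # corresponding positions in column-major order
x[z]                                    # 2 12 5
x[lin]                                  # same entries via a plain vector index
```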

When parts of objects are extracted, R automatically tries to simplify the resulting objects as far as possible (unless you instruct otherwise).

x <- matrix(1:16, ncol=4) # square matrix
y <- x[1,]                # extract 1st row
dim(y)                    # y is not a matrix anymore
is.matrix(y)              # check
is.numeric(y)             # but it's a vector, thus ...
y[2]                      # works
y[,2]                     # doesn't work (error: incorrect number of dimensions)
# but you can instruct R otherwise
y <- x[1, , drop=FALSE]   # extract 1st row as a 1x4 matrix
dim(y)                    # so y has two dimensions
is.numeric(y)             # y can still be used as a vector ...
is.matrix(y)              # ... and as a matrix, thus ...
y[2]                      # ... works ...
y[,2]                     # ... works too
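Automatic simplification matters most inside functions, where the number of selected rows is not known in advance. The helper below is a hypothetical sketch (its name and arguments are not part of base R) showing how drop = FALSE keeps matrix-only functions such as colMeans() working:

```r
# hypothetical helper: column means of the rows selected by `rows`
col_means_of_rows <- function(m, rows) {
  sub <- m[rows, , drop = FALSE]  # stays a matrix even if only one row is selected
  colMeans(sub)
}

x <- matrix(1:16, ncol = 4)
col_means_of_rows(x, 1:2)   # 1.5 5.5 9.5 13.5
col_means_of_rows(x, 1)     # 1 5 9 13; without drop = FALSE, colMeans() would stop with an error
```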

4.3 Arrays

Arrays generalize matrices to more than two dimensions. Indexing works analogously to matrices, with one index per dimension:

x <- array(1:12, dim=c(2,2,3))  # array in 3 dimensions (cube) with 2*2*3 entries
y <- x[1,,]                     # pick first row in each submatrix -> upper floor of cube
y <- x[,2,]                     # pick second column in each submatrix -> right wing of cube
y <- x[,,3]                     # pick last layer -> top floor
dim(y)                          # y is a matrix
# or you can keep the array type
y <- x[,,3, drop=FALSE]
dim(y)
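How far a result is simplified depends on how many dimensions are fixed to a single index. The sketch below also uses an index matrix with one (row, column, layer) triple per row, analogous to the two-column index matrix for matrices; the name idx is chosen for illustration:

```r
x <- array(1:12, dim = c(2, 2, 3))
x[2, 1, 3]                            # single entry: row 2, column 1, layer 3 --> 10
dim(x[1, , ])                         # 2 3: one dimension dropped, a matrix remains
dim(x[1, 2, ])                        # NULL: two dimensions dropped, a vector remains
idx <- rbind(c(1, 1, 1), c(2, 2, 3))  # one (row, column, layer) triple per row
x[idx]                                # 1 12
```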

4.4 Lists, data frames and tibbles

Lists, data frames and tibbles usually assign names to their components. In the case of data frames and tibbles, these components are always columns. To access a named component, the general syntax is <object name>$<component name>. To query the component names of an object, use names(). For lists you can alternatively access a single component with double brackets, i.e. <list name>[[<component index>]] or <list name>[[<component name>]]. For multiple components, single brackets are used, i.e. <list name>[<component names>].

df1 <- data.frame(one = 1:3, two = c("nest","test","fest") )
names(df1)        # names of both columns
df1$one           # access column one

library(tibble)   # provides tibble(); also attached by library(tidyverse)
tb1 <- tibble(one = 1:3, two = c("nest","test","fest"))
tb1$two           # works too

l1 <- list(a = 1:3, b = "nest", c = TRUE)
l1$a              # this is equivalent to...
l1[[1]]           # ...this...
l1[["a"]]         # ...and this
l1[c(1,3)]        # multiple components by index --> result is a list
l1[c("a","c")]    # multiple components by name --> result is a list
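The difference between single and double brackets is easiest to see from the type of the result. The following sketch uses base R only and reuses the example objects from above; the row and cell selections at the end are an illustrative addition showing that data frames also accept matrix-style indices:

```r
l1 <- list(a = 1:3, b = "nest", c = TRUE)
class(l1["a"])      # "list": single brackets return a (sub-)list
class(l1[["a"]])    # "integer": double brackets return the component itself

df1 <- data.frame(one = 1:3, two = c("nest", "test", "fest"))
df1[["two"]]        # a column as a vector, like df1$two
df1["two"]          # a data frame with one column
df1[2, "two"]       # a single cell: row 2 of column two
df1[df1$one >= 2, ] # rows selected by a logical index
```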

4.5 Exercises

  1. Construct a list with 100 entries. Afterwards, display only entries with odd indices.
  2. Construct a tibble or data frame consisting of 100 columns and 1 row. Extract only every 3rd column.
  3. Construct a matrix of size \(6\times 6\) and fill it by sampling numbers between 1 and 100. Retain indices of all entries \(\leq 50\).
  4. Use the matrix generated in task 3. Retain the order of entries in the first column and use it to reorder all rows in the matrix.
  5. Formulate a function for calculating the moving average over a vector.
  6. Write a function that checks whether \(\tilde{a}_{ij} = \frac{1}{a_{ij}}\).
  7. Write a function that normalizes a matrix supplied to the function by column sums (i.e., \(\tilde{a}_{ij} = \frac{a_{ij}}{\sum_{k} a_{kj}}\)). Return the original matrix and the normalized matrix in a named list.
  8. Write a function that processes three matrices of the same dimension. The function should return a list containing the three input matrices and a matrix that binds together the row averages of the input matrices.