Chapter 6 Programming with R

There are a number of conventions when writing R code which are helpful for debugging and backtracing, see for example here https://www.r-bloggers.com/%F0%9F%96%8A-r-coding-style-guide/. Although they appear to slow down your progress in the first place, you will value following such rules later on particularly when you have work on multiple projects over some time.

6.1 apply, sapply & co.

Programming often involves passing parameters stored in matrices or lists repetitively to functions. In these cases, apply-functions can be used without using traditional loops. There is a whole family of apply-functions briefly described below:

function arguments x type comment
apply x,MARGIN,FUN matrix MARGIN controls whether columns or rows of x are passed to FUN
lapply x,FUN vector,list each component is passed FUN , always returns a list
sapply x,FUN vector,list same as lapply but simplifies result as far as possible (e.g. a vector)
mapply FUN, … multiple multiple objects (typically vectors) whose entries are passed to FUN
m1 <- matrix(1:10, ncol=2)   
apply(m1, 1, function(x) max(x))         # maximum in each row
apply(m1, 1, max)                        # syntax can be simplified if argument is passed to another function
apply(m1, 2, function(x) diff(range(x))) # range in each column

Apply-functions are often useful, but typically quite slow. Thus, if builtin functions are available these are typically much faster.

x <- matrix(1:10^6,ncol=10)       # pretty long vector
system.time(apply(x, 1, sum))     # using apply
system.time(rowSums(x))           # using builtin function

When operations on more complex objects like matrices are repeatedly are to be processed, sapply() and lapply() can be quite useful.

l1 <- list(matrix(sample(1:100, 25), ncol = 5),  matrix(sample(1:100, 36), ncol = 6), matrix(sample(1:100, 16), ncol = 4))
lapply(l1, diag )                     # return diagonal elements
sapply(l1, diag )                     # sapply cannot simplify much
sapply(l1, function(x) sum(diag(x)) ) # now sapply returns a vector

When there are multiple objects of the same size whose elements should be passed to a particular function successively, mapply() can be used. In contrast to most other apply-functions, the 1st argument of ‘mapply’ specifies the function the elements are passed to.

weig.mh.dist <- function(a, x, y, u = 0, v = 0) a * (abs(x - u) + abs(y - v) ) # claculates the weighted Manhattan distance from (x,y) to (u,v) 
# note that u and v have default value 0 --> thus they don't need to be specified
weig.mh.dist(a = 10, x = 1, y = 1)  
weig.mh.dist(a = 10, x = 1, y = 1, u = 1, v = 1)  # but they can be specified
n <- 10                             # set a number of points/customers
a.vec <- sample(1:100, size = n)    # sample weights for each point/customer
x.vec <- rnorm(n)                   # sample x coordinates
y.vec <- rnorm(n)                   # sample y coordinates
mapply(weig.mh.dist, a.vec, x.vec, y.vec) # distances to (0,0)
mapply(weig.mh.dist, a.vec, x.vec, y.vec, u = 1, v = 1) # distances to (1,1), u and v are passed to weig.mh.dist

Exercise 1: Try to rearrange the data objects a.vec, x.vec, and y.vec such that apply can be used to calculate all distances.

As noted before, apply-functions are slow. So, before using them, try whether your function can be vectorized using vectorize() or by passing vectors directly.

v.weigh.mh.dist <- Vectorize(weig.mh.dist, c("a", "x", "y"))      # vectorize arguments a, x, and y

n <- 10^6                             # set large number of points/customers
a.vec <- sample(1:100, size = n, replace =T)    # sample weights for each point/customer
x.vec <- rnorm(n)                   # sample x coordinates
y.vec <- rnorm(n)                   # sample y coordinates

system.time(mapply(weig.mh.dist, a.vec, x.vec, y.vec))    # slow
system.time(v.weigh.mh.dist(a.vec, x.vec, y.vec))         # a bit faster
system.time(weig.mh.dist(a.vec, x.vec, y.vec))            # awesomely fast

6.2 tapply & aggregate

When data are grouped, tapply() and aggregate() allow for passing groups of data to a function.

y <- c( rnorm(10,0,1), runif(10,0,1), rnorm(10,1,1), runif(10,0,2) )  # some normally and uniformly dist. samples
x1 <- rep(c("norm","unif","norm","unif" ), each = 10)                 # distribution groups
x2 <- rep(c("base","ext."), each = 20)                                # setting groups

tapply(X = y, INDEX = list(x1,x2), FUN = mean)                        # means of sample per group

When groups are defined to categorize variables, typically data frames are constructed. In this case, aggregate can be used to by specifying a formula. A formula is constructed as follows: <dependent var.> ~ <group var. 1> + <group var. 2> + ... while the variable names are equal to some columns of the submitted data frame or tibble. If you want to group by all variables/columns (except the dependent variable), you can use the following shortcut <dependent var.> ~ .. Note that formulas are generally used to model functional dependencies between variables for functional fitting procedures (like regression analysis). Thus, in formulas much more complex functional relations can be coded (like interaction effects between variables, transformations, etc.), for an introduction see here for an introduction.

# usually a data frame/tibble is constructed
tb1 <- tibble(y, distr = x1, set = x2)                          # create a tibble/data frame
aggregate(formula = y ~ distr + set , data = tb1, FUN = mean)   # same means of sample per group as before 

6.3 loops & co.

R is designed to support traditional mathematical calculus which typically means calculus with vectors and matrices. Nonetheless, classic programming elements are available of course. The following table shows the basic syntax of these elements are shown:

element description
if(<condition>){<code>} <code> is executed if condition is TRUE
if(<condition>){<code1>}else{<code2>} <code1> is executed if condition is TRUE, <code2> otherwise
for(<increment>){<code>} increment is of form i in <vector>, e.g. i in 1:100
while(<condition>){<code>} as long as <condition> is TRUE, <code> is executed
repeat{<code>} <code> is executed until a break occurs
break stops execution of a loop
next stops execution of an iteration and jumps to the next

Loops allow one to formulate complex code iteratively which is (too) difficult to vectorize. Be aware that loops are typically quite slow in R. Thus, only use loops if no smarter way exists. To speed up code, initialize data objects in necessary sizes before loop is executed.

6.3.1 for-loops

As always for-loops are useful when one already knows how often the code is to be executed. Thus, there is no risk of infinite looping.

# recall the weighted Manhattan distance function used before
# now, we iteratively simulate the setting introduced above 
n <- 10^6        # number of trials
u <- 0; v <- 0    # current location
res.vec <- NULL   # result vector

system.time(for(i in 1:n){
  a <- sample(1:100, size = 1, replace =T)    # sample weight for current point/customer
  x <- rnorm(1)                   # sample x coordinate
  y <- rnorm(1)                   # sample y coordinate
  res.vec[i] <- a * (abs(x - u) + abs(y - v) )    # save Manhattan distance
})
# compare with results before

# vectors of all kinds can be used to iterate over
n <- 100
iter.vec <- rep(c(T,F), times = n)  # generate logical vector to iterate over
res.vec <- NULL                     # result vector
j <- 1                              # index for result vector

for(i in iter.vec){                 # i is now logical
  if(i){                            # if i is TRUE ...
    res.vec <- c(res.vec, j^2)      # append squared j to result vector ...
    j <- j + 1                      # and increase j by 1
  }
  else{                             # otherwise
    j <- 2 * j                      # double j
  } 
}

Exercise 2: Try to fasten the code by initializing all data objects in the necessary size in advance.

6.3.2 while-loops

While-loops start with an condition to be checked before code is executed. Condition checked has to be a logical scalar. Infinite looping may occur. Therefore, always a loop excess indicator should be implemented

# similar setting as before, now with while loop, but now only even weights shall be used
n <- 10^6          # number of trials
n.excess <- 10^6 * 1.5    # but no more then n.excess loops shall be executed
u <- 0; v <- 0      # current location
i <- 1              # trial index 
j <- 0              # iteration counter
res.vec <- NULL     # result vector

system.time(
  while(i <= n & j <= n.excess){    # limit number of loops and number of trials simultaneously
    j <- j + 1                      # every time the lopp starts, j is increased
    a <- sample(1:100, size = 1, replace =T)    # sample weight for current point/customer
    if(a %% 2 == 1) next            # skip iteration if weight is odd
    x <- rnorm(1)                   # sample x coordinate
    y <- rnorm(1)                   # sample y coordinate
    res.vec[i] <- a * (abs(x - u) + abs(y - v) )  # save Manhattan distance
    i <- i+1                        # i is only increased, if i is odd
  }
)

length(res.vec)                     # check length of result vector

Exercise 3: Can you simplify and fasten the code by sampling only odd numbers?

6.3.3 repeat-loops

In repeat-loops the stopping criterion must be specified within the loop code.

n <- 10^6          # number of trials
n.excess <- 10^6 * 1.5    # but no more then n.excess loops shall be executed
u <- 0; v <- 0      # current location
i <- 1              # trial index 
j <- 0              # iteration counter
res.vec <- NULL     # result vector

system.time(
  repeat{    # limit number of loops and number of trials simultaneously
    j <- j + 1                      # every time the lopp starts, j is increased
    if(j > n.excess) break          # stop loop, if j exceeds n.excess
    a <- sample(1:100, size = 1, replace =T)    # sample weight for current point/customer
    if(a %% 2 == 1) next            # skip iteration if weight is odd
    x <- rnorm(1)                   # sample x coordinate
    y <- rnorm(1)                   # sample y coordinate
    res.vec[i] <- a * (abs(x - u) + abs(y - v) )  # save Manhattan distance
    i <- i+1                        # i is only increased, if i is odd
    if(i > n) break                 # stop loop, if i exceeds n
  }
)

length(res.vec)                     # check length of result vector

Exercise 4: Can you reformulate the stopping criteria such that only one break-statement is necessary?

6.4 Exercises

  1. Formulate a loop that calculates the inventory records over \(n\) periods based on an initial stock level (say \(i_0 = 20\)) where every 4 periods 40 units arrive at the inventory. Sample the demand for each period from a normal distribution with \(\mathcal{D} \sim N(10, 2)\) and round to integers.
  2. Consider a dynamic lot sizing problem with ordering cost of \(c_o = 100\) and a holding cost rate \(c_h = 0.1\) $ per period and unit. The demand over 10 periods is sampled from a Possion distribution with \(\lambda = 10\) (use rpois()). Calculate the total cost matrix with R.
  3. Formulate a function that performs 1st-order exponential smoothing: \(p_{t+1} = \alpha \cdot x_t + (1-\alpha) \cdot p_t\). Is there also an builtin function? If so, compare run times.