Chapter 6 Programming with R
There are a number of conventions when writing R code which are helpful for debugging and backtracing, see for example here https://www.r-bloggers.com/%F0%9F%96%8A-r-coding-style-guide/. Although they appear to slow down your progress in the first place, you will value following such rules later on particularly when you have work on multiple projects over some time.
6.1 apply, sapply & co.
Programming often involves passing parameters stored in matrices or lists repetitively to functions. In these cases, apply-functions can be used without using traditional loops. There is a whole family of apply-functions briefly described below:
function | arguments | x type |
comment |
---|---|---|---|
apply |
x ,MARGIN ,FUN |
matrix | MARGIN controls whether columns or rows of x are passed to FUN |
lapply |
x ,FUN |
vector,list | each component is passed FUN , always returns a list |
sapply |
x ,FUN |
vector,list | same as lapply but simplifies result as far as possible (e.g. a vector) |
mapply |
FUN , … |
multiple | multiple objects (typically vectors) whose entries are passed to FUN |
<- matrix(1:10, ncol=2)
m1 apply(m1, 1, function(x) max(x)) # maximum in each row
apply(m1, 1, max) # syntax can be simplified if argument is passed to another function
apply(m1, 2, function(x) diff(range(x))) # range in each column
Apply-functions are often useful, but typically quite slow. Thus, if builtin functions are available these are typically much faster.
<- matrix(1:10^6,ncol=10) # pretty long vector
x system.time(apply(x, 1, sum)) # using apply
system.time(rowSums(x)) # using builtin function
When operations on more complex objects like matrices are repeatedly are to be processed, sapply()
and lapply()
can be quite useful.
<- list(matrix(sample(1:100, 25), ncol = 5), matrix(sample(1:100, 36), ncol = 6), matrix(sample(1:100, 16), ncol = 4))
l1 lapply(l1, diag ) # return diagonal elements
sapply(l1, diag ) # sapply cannot simplify much
sapply(l1, function(x) sum(diag(x)) ) # now sapply returns a vector
When there are multiple objects of the same size whose elements should be passed to a particular function successively, mapply()
can be used. In contrast to most other apply-functions, the 1st argument of ‘mapply’ specifies the function the elements are passed to.
<- function(a, x, y, u = 0, v = 0) a * (abs(x - u) + abs(y - v) ) # claculates the weighted Manhattan distance from (x,y) to (u,v)
weig.mh.dist # note that u and v have default value 0 --> thus they don't need to be specified
weig.mh.dist(a = 10, x = 1, y = 1)
weig.mh.dist(a = 10, x = 1, y = 1, u = 1, v = 1) # but they can be specified
<- 10 # set a number of points/customers
n <- sample(1:100, size = n) # sample weights for each point/customer
a.vec <- rnorm(n) # sample x coordinates
x.vec <- rnorm(n) # sample y coordinates
y.vec mapply(weig.mh.dist, a.vec, x.vec, y.vec) # distances to (0,0)
mapply(weig.mh.dist, a.vec, x.vec, y.vec, u = 1, v = 1) # distances to (1,1), u and v are passed to weig.mh.dist
Exercise 1: Try to rearrange the data objects
a.vec
,x.vec
, andy.vec
such thatapply
can be used to calculate all distances.
As noted before, apply-functions are slow. So, before using them, try whether your function can be vectorized using vectorize()
or by passing vectors directly.
<- Vectorize(weig.mh.dist, c("a", "x", "y")) # vectorize arguments a, x, and y
v.weigh.mh.dist
<- 10^6 # set large number of points/customers
n <- sample(1:100, size = n, replace =T) # sample weights for each point/customer
a.vec <- rnorm(n) # sample x coordinates
x.vec <- rnorm(n) # sample y coordinates
y.vec
system.time(mapply(weig.mh.dist, a.vec, x.vec, y.vec)) # slow
system.time(v.weigh.mh.dist(a.vec, x.vec, y.vec)) # a bit faster
system.time(weig.mh.dist(a.vec, x.vec, y.vec)) # awesomely fast
6.2 tapply & aggregate
When data are grouped, tapply()
and aggregate()
allow for passing groups of data to a function.
<- c( rnorm(10,0,1), runif(10,0,1), rnorm(10,1,1), runif(10,0,2) ) # some normally and uniformly dist. samples
y <- rep(c("norm","unif","norm","unif" ), each = 10) # distribution groups
x1 <- rep(c("base","ext."), each = 20) # setting groups
x2
tapply(X = y, INDEX = list(x1,x2), FUN = mean) # means of sample per group
When groups are defined to categorize variables, typically data frames are constructed. In this case, aggregate
can be used to by specifying a formula. A formula is constructed as follows: <dependent var.> ~ <group var. 1> + <group var. 2> + ...
while the variable names are equal to some columns of the submitted data frame or tibble. If you want to group by all variables/columns (except the dependent variable), you can use the following shortcut <dependent var.> ~ .
. Note that formulas are generally used to model functional dependencies between variables for functional fitting procedures (like regression analysis). Thus, in formulas much more complex functional relations can be coded (like interaction effects between variables, transformations, etc.), for an introduction see here for an introduction.
# usually a data frame/tibble is constructed
<- tibble(y, distr = x1, set = x2) # create a tibble/data frame
tb1 aggregate(formula = y ~ distr + set , data = tb1, FUN = mean) # same means of sample per group as before
6.3 loops & co.
R is designed to support traditional mathematical calculus which typically means calculus with vectors and matrices. Nonetheless, classic programming elements are available of course. The following table shows the basic syntax of these elements are shown:
element | description |
---|---|
if(<condition>){<code>} |
<code> is executed if condition is TRUE |
if(<condition>){<code1>}else{<code2>} |
<code1> is executed if condition is TRUE , <code2> otherwise |
for(<increment>){<code>} |
increment is of form i in <vector> , e.g. i in 1:100 |
while(<condition>){<code>} |
as long as <condition> is TRUE , <code> is executed |
repeat{<code>} |
<code> is executed until a break occurs |
break |
stops execution of a loop |
next |
stops execution of an iteration and jumps to the next |
Loops allow one to formulate complex code iteratively which is (too) difficult to vectorize. Be aware that loops are typically quite slow in R. Thus, only use loops if no smarter way exists. To speed up code, initialize data objects in necessary sizes before loop is executed.
6.3.1 for-loops
As always for-loops are useful when one already knows how often the code is to be executed. Thus, there is no risk of infinite looping.
# recall the weighted Manhattan distance function used before
# now, we iteratively simulate the setting introduced above
<- 10^6 # number of trials
n <- 0; v <- 0 # current location
u <- NULL # result vector
res.vec
system.time(for(i in 1:n){
<- sample(1:100, size = 1, replace =T) # sample weight for current point/customer
a <- rnorm(1) # sample x coordinate
x <- rnorm(1) # sample y coordinate
y <- a * (abs(x - u) + abs(y - v) ) # save Manhattan distance
res.vec[i]
})# compare with results before
# vectors of all kinds can be used to iterate over
<- 100
n <- rep(c(T,F), times = n) # generate logical vector to iterate over
iter.vec <- NULL # result vector
res.vec <- 1 # index for result vector
j
for(i in iter.vec){ # i is now logical
if(i){ # if i is TRUE ...
<- c(res.vec, j^2) # append squared j to result vector ...
res.vec <- j + 1 # and increase j by 1
j
}else{ # otherwise
<- 2 * j # double j
j
} }
Exercise 2: Try to fasten the code by initializing all data objects in the necessary size in advance.
6.3.2 while-loops
While-loops start with an condition to be checked before code is executed. Condition checked has to be a logical scalar. Infinite looping may occur. Therefore, always a loop excess indicator should be implemented
# similar setting as before, now with while loop, but now only even weights shall be used
<- 10^6 # number of trials
n <- 10^6 * 1.5 # but no more then n.excess loops shall be executed
n.excess <- 0; v <- 0 # current location
u <- 1 # trial index
i <- 0 # iteration counter
j <- NULL # result vector
res.vec
system.time(
while(i <= n & j <= n.excess){ # limit number of loops and number of trials simultaneously
<- j + 1 # every time the lopp starts, j is increased
j <- sample(1:100, size = 1, replace =T) # sample weight for current point/customer
a if(a %% 2 == 1) next # skip iteration if weight is odd
<- rnorm(1) # sample x coordinate
x <- rnorm(1) # sample y coordinate
y <- a * (abs(x - u) + abs(y - v) ) # save Manhattan distance
res.vec[i] <- i+1 # i is only increased, if i is odd
i
}
)
length(res.vec) # check length of result vector
Exercise 3: Can you simplify and fasten the code by sampling only odd numbers?
6.3.3 repeat-loops
In repeat-loops the stopping criterion must be specified within the loop code.
<- 10^6 # number of trials
n <- 10^6 * 1.5 # but no more then n.excess loops shall be executed
n.excess <- 0; v <- 0 # current location
u <- 1 # trial index
i <- 0 # iteration counter
j <- NULL # result vector
res.vec
system.time(
repeat{ # limit number of loops and number of trials simultaneously
<- j + 1 # every time the lopp starts, j is increased
j if(j > n.excess) break # stop loop, if j exceeds n.excess
<- sample(1:100, size = 1, replace =T) # sample weight for current point/customer
a if(a %% 2 == 1) next # skip iteration if weight is odd
<- rnorm(1) # sample x coordinate
x <- rnorm(1) # sample y coordinate
y <- a * (abs(x - u) + abs(y - v) ) # save Manhattan distance
res.vec[i] <- i+1 # i is only increased, if i is odd
i if(i > n) break # stop loop, if i exceeds n
}
)
length(res.vec) # check length of result vector
Exercise 4: Can you reformulate the stopping criteria such that only one break-statement is necessary?
6.4 Exercises
- Formulate a loop that calculates the inventory records over \(n\) periods based on an initial stock level (say \(i_0 = 20\)) where every 4 periods 40 units arrive at the inventory. Sample the demand for each period from a normal distribution with \(\mathcal{D} \sim N(10, 2)\) and round to integers.
- Consider a dynamic lot sizing problem with ordering cost of \(c_o = 100\) and a holding cost rate \(c_h = 0.1\) $ per period and unit. The demand over 10 periods is sampled from a Possion distribution with \(\lambda = 10\) (use
rpois()
). Calculate the total cost matrix with R. - Formulate a function that performs 1st-order exponential smoothing: \(p_{t+1} = \alpha \cdot x_t + (1-\alpha) \cdot p_t\). Is there also an builtin function? If so, compare run times.