Harmony of R and C++ (Rcpp): Introduction

Have you ever encountered the need to run nested loops in R, even for tasks that aren't computationally intensive? I often find myself wanting to manipulate large datasets, and I don't always have the patience or desire to deal with the intricacies of lower-level programming languages. To make matters worse, not all tasks are embarrassingly parallel, which used to lead me to the dilemma of either leaving my PC running overnight to complete these tasks or grappling with my conscience over not switching to a lower-level language. It's a constant battle between convenience and efficiency.

There have been times when convenience won these internal arguments, especially when R or Python already had all the libraries I needed, since the process of moving data back and forth between languages, especially for big data, isn't as straightforward as I'd hoped. Still, I've often resorted to combining R with a lower-level language. If you've ever ventured down this path, you know how time-consuming it can be: the data transfers, triggering functions, and so on.

But recently, I stumbled upon a game-changer: a library called Rcpp. This library, in a nutshell, acts as a bridge between R and C++. It allows you to harness the power and speed of C++ for those parts of your work that demand it, while seamlessly integrating it with your R code. Now, I want to share my use case of Rcpp with you, so you can get a taste of how it can revolutionize your projects too.

In my specific use case, I needed to create training data for my time series model. To provide past timesteps for training, it's common to create a "look back period" or "lag." However, the implementation can vary. I prefer to call it creating a "look back period," and how I implement it depends on the data's frequency and my objectives. My approach ensures that the same data point isn't used in the training data more than once.

Here's a simplified version of my approach: for each column, I select the values in rows y through (y + n - 1), where 'y' is the row where the current block starts and 'n' is the look back period. I repeat this process for each column, flattening them, and then bind the results from each column to create one row. Finally, I combine all these rows.

This task is inherently parallelizable, but even after parallelizing it, it was taking a significant amount of time. Chances are, your tasks may be even more complex than mine. Then came Rcpp, and even on a single core it significantly outperformed my parallelized R code. The library provides C++ equivalents of R's data types, making data manipulation across the two languages a breeze. To highlight the difference, let's compare snippets of code in both languages.

R:

# Trim df so its row count is a multiple of the look back period
df = df[1:(look_back_period*floor(nrow(df) / look_back_period)),]
outputdf = data.frame()
# Process one block of look_back_period rows per iteration
for( i in seq(1,nrow(df),look_back_period)){
  tval1 = c()
  # Flatten columns 2..ncol over the block's rows
  for( j in 2:ncol(df)){
    tval1 = c(tval1,df[i:(i+look_back_period-1),j])
  }
  # Prepend the first column's value at the block start
  tval1 = c(df[i,1],tval1)
  outputdf = rbind(outputdf,tval1)
}

Rcpp (C++ part):

#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix lbp_fun(NumericMatrix df, int look_back_period) {
  int nrow_df = df.nrow();
  int ncol_df = df.ncol();
  // One value of column 1 plus look_back_period values of every other column
  int output_ncol = (ncol_df - 1) * look_back_period + 1;

  // Create a matrix to store the output: one row per complete block
  NumericMatrix outputdf(nrow_df / look_back_period, output_ncol);

  int row_index = 0;
  for (int i = 0; i + look_back_period <= nrow_df; i += look_back_period) {
    int col_index = 0;

    // First column's value at the block start comes first, as in the R code
    outputdf(row_index, col_index++) = df(i, 0);

    // Flatten the remaining columns over the block's rows
    for (int j = 1; j < ncol_df; j++) {
      for (int k = i; k < i + look_back_period; k++) {
        outputdf(row_index, col_index++) = df(k, j);
      }
    }

    row_index++;
  }

  return outputdf;
}

Rcpp (R part):

library(Rcpp)
sourceCpp("./flattenn4r.cpp")
fdf = as.data.frame(lbp_fun(as.matrix(df),look_back_period))


The workflow of Rcpp is quite simple, and there are a few ways to reach the goal. My preferred method is saving the C++ code in a separate file and then pointing R's sourceCpp function at that file. This function compiles the code and imports it into the global environment as if it were any other R function. Then you can call the function as you please from R.

When I ran both of these on my dataset, R took 2360.386 seconds and Rcpp took 3.54 seconds for the same dataset (including compilation). Timings were done using the tictoc library.

Another example I would like to share is the Savitzky-Golay filter (if you don't know much about it, here is a good resource). In my use case I was pushing large amounts of data through it, every millisecond I could shave off mattered, and it was good practice for me to write a simple mathematical expression in RcppArmadillo. I compared my implementation to an R implementation from the signal library, specifically the sgolayfilt function. There was a slight difference in the results, but for my use case it was negligible. This example shows that the difference is not always as dramatic as in the previous example; it depends significantly on the type of operations you are conducting. The code for it can be found on GitHub.

It's worth mentioning that while I was researching this filter I came across an article I found interesting enough to share. Its topic, "Why and How Savitzky-Golay Filters Should Be Replaced," is pretty self-explanatory; it's worth checking out.

Plot showing the difference in results:

Timing the two implementations on a larger input, y = sin(seq(0, 2*pi, length.out = 6000000)) + rnorm(6000000, sd = 0.1), with window size 501 and polynomial order 7: Rcpp took 2.944 seconds to complete and R took 9.169 seconds. To further investigate which factor contributes most to this difference, I ran these functions with various window sizes and polynomial orders, then took the difference in timings to show the change.


Code for this graph can be found on GitHub.

You might have realized I only compared single-core implementations of R and Rcpp. That's because the examples I already had were designed that way, and I believe these results can still help you make an educated guess for multi-core applications.

In summary, my journey with Rcpp has gone beyond a specific use case and into a broader realm of data processing and manipulation. Acting as a bridge between R and C++, Rcpp has transformed how I handle various tasks, making it a versatile tool for a wide range of data-intensive projects. The efficiency and speed it brings to the table are remarkable: it significantly outperforms native R code, and the time savings, along with the C++ equivalents it provides for R's data types, highlight its prowess. It's not just a library; it's a game-changer that simplifies and accelerates data manipulation, making it an indispensable addition to any data scientist's or analyst's toolkit.