Jenkins, Big Data and data driven development

Improve delivery of data driven projects from development all the way to production

Applying CI/CD methodologies in an environment using R and OpenCPU

Jenkins among analysts

In data driven development, the value lies in the data beneath the software, and the work is often exempt from traditional software development schemes such as Agile. Because the people who work in these areas usually have no formal education in good software practices, it’s easy to miss out on the benefits those practices bring to the table. So how do we build quality into data driven software using modern software processes and tools?

Enter Jenkins.

In a nutshell, Jenkins is an advanced scheduler. It triggers tasks on connected hardware in reaction to events, typically source code changes or timers. Combining Jenkins with the following tools allows you to introduce and automate tasks that can substantially increase productivity and quality. Interested?

We’ll base the environment on R, one of the most popular programming languages among data analysts. It’s functional and dynamically typed, with heavy emphasis on data manipulation and projection of large datasets.

The aptly named rApache is an Apache module that embeds the R interpreter in the Apache web server. In this setup, Apache also serves as a static file server that hosts binary R packages in the CRAN repository format.

The OpenCPU API server is another popular tool; it maps R function calls to simple REST methods.

In summary, R has all the building blocks required to build a modern software pipeline:

  1. A package system
  2. A stable artifact server
  3. Plenty of application server support with modern features

The plan

We want to increase the awareness of code quality. By automatically gathering and displaying metrics, we introduce code quality and testability into the daily life of the analyst. This is uncommon in the art of data science, since you cannot test your data directly.

We want to increase the speed at which our deliveries reach production. Jenkins can automate the tedious tasks of uploading packages and restarting the server after installation, if necessary.

We want to test whether our packages can be deployed successfully. By scripting the installation, it becomes trivial to automate a deployment to a production-like environment using Jenkins, even for every single change to the package.

We want to test our deployed package before it reaches production. This becomes very easy once we can automatically deploy to a test environment and make real calls to our updated package.

So here’s what we delegate to Jenkins:

  1. Static analysis of code quality (lint and warnings)
  2. Unit tests (of the package itself), including coverage
  3. Automatic deployment of built software to application servers
  4. Functional tests (on the OpenCPU server with real web method calls)

The implementation

To increase awareness of quality of code, two types of analysis can be applied:

A linter, or static style checker, which checks for irregularities in code, e.g. incorrect indentation, inconsistent variable naming or overly long lines. A good example is lintr.

A scan of your code for TODO, FIXME or similar comments, displayed using the warnings plugin.

Generate warnings for both during your builds to bring issues to light and to motivate fixing them and preventing new ones.
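As a minimal sketch of both checks in R: the TODO/FIXME scan below uses only base R, while the lint step assumes the lintr package is installed. The function names, the default pattern and the "file:line: text" output format are our own illustrative choices, not something the warnings plugin prescribes; the plugin’s parser would need to be configured to match whatever format you emit.

```r
# Collect TODO/FIXME comments from a set of R source files.
# Returns a data frame with one row per hit (file, line number, line text).
find_task_markers <- function(paths, pattern = "\\b(TODO|FIXME)\\b") {
  empty <- data.frame(file = character(), line = integer(),
                      text = character(), stringsAsFactors = FALSE)
  hits <- lapply(paths, function(path) {
    lines <- readLines(path, warn = FALSE)
    idx <- grep(pattern, lines, perl = TRUE)
    data.frame(file = rep(path, length(idx)), line = idx,
               text = trimws(lines[idx]), stringsAsFactors = FALSE)
  })
  # Prepending the empty frame keeps the result a data frame even with no input.
  do.call(rbind, c(list(empty), hits))
}

# Print one "file:line: text" line per marker (the format is an assumption).
report_task_markers <- function(markers) {
  if (nrow(markers) > 0) {
    cat(sprintf("%s:%d: %s", markers$file, markers$line, markers$text),
        sep = "\n")
  }
  invisible(markers)
}

# Lint a whole package with lintr and print the findings to the build log.
lint_package_sources <- function(pkg_dir = ".") {
  lints <- lintr::lint_package(pkg_dir)
  print(lints)
  invisible(lints)
}
```

From the package root, a build step could pass list.files("R", full.names = TRUE) to find_task_markers and hand the result to report_task_markers.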

Use the testthat R package for unit testing. It’s a great way to test your software, as the tests are written in a very declarative way. Use testthat’s built-in TAP reporter to emit the results in the TAP format and display them in Jenkins with the TAP plugin.
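To give an idea of what this looks like, here is a small sketch. The function scale_to_unit, the directory tests/testthat and the file name results.tap are hypothetical, and capturing standard output is just one simple way to get the TAP report into a file.

```r
library(testthat)

# Hypothetical function under test; in a real package it would live in R/.
scale_to_unit <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# A declarative unit test: describe the behaviour, then state the expectation.
test_that("scale_to_unit maps the smallest value to 0 and the largest to 1", {
  expect_equal(scale_to_unit(c(10, 15, 20)), c(0, 0.5, 1))
})

# Run every test file in a directory with the TAP reporter and write the
# report to a file that the Jenkins TAP plugin can pick up.
run_unit_tests <- function(test_dir = "tests/testthat",
                           tap_file = "results.tap") {
  utils::capture.output(
    testthat::test_dir(test_dir, reporter = "tap"),
    file = tap_file
  )
  invisible(tap_file)
}
```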

Code coverage measures how much of your code is exercised during tests. Use the covr R package to measure coverage. It can output its results in the Cobertura coverage format, which allows the Cobertura plugin to display the results in Jenkins as pretty graphs.
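A coverage step could be as small as the following sketch. It assumes covr is installed and that the script runs from the package root; the function name and the report file name coverage.xml are our own choices.

```r
# Measure test coverage for a package and write a Cobertura XML report
# that the Jenkins Cobertura plugin can read.
measure_coverage <- function(pkg_dir = ".", report_file = "coverage.xml") {
  cov <- covr::package_coverage(pkg_dir)
  covr::to_cobertura(cov, filename = report_file)
  # Return the overall percentage so the build log can show it as well.
  invisible(covr::percent_coverage(cov))
}
```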

The next part, deploying to a test environment, is solved with the Pipeline job type in Jenkins. Pipeline makes it easy to transfer the built package between steps with its stash function. This makes it a great choice, as it trivializes transferring the package from the test R server to a test OpenCPU server.
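The deployment step itself can be a short R script that Jenkins runs on the test OpenCPU server once the package has been transferred. A minimal sketch, assuming the build produced a source tarball named in the usual package_version.tar.gz form; both function names are hypothetical.

```r
# Derive the package name from a tarball such as "mypkg_0.1.0.tar.gz".
package_name_from_tarball <- function(tarball) {
  sub("_.*$", "", basename(tarball))
}

# Install a built source package and fail loudly if it cannot be loaded.
install_built_package <- function(tarball, lib = .libPaths()[1]) {
  stopifnot(file.exists(tarball))
  install.packages(tarball, repos = NULL, type = "source", lib = lib)
  pkg_name <- package_name_from_tarball(tarball)
  # install.packages() only warns on failure, so verify the result explicitly.
  if (!requireNamespace(pkg_name, lib.loc = lib, quietly = TRUE)) {
    stop("Package ", pkg_name, " did not install correctly")
  }
  invisible(pkg_name)
}
```

The explicit check matters because a failed source install would otherwise leave the Jenkins step green.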

After deploying to the test environment, we can run the functional tests that interact directly with our API server (OpenCPU). The functional tests are also written using testthat.
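A functional test of this kind might look like the sketch below. It assumes the httr and jsonlite packages are available and that Jenkins passes the base URL of the test server (ending in /ocpu) in an environment variable; the variable name OCPU_TEST_URL, the package mypkg and the function scale_to_unit are all hypothetical.

```r
library(testthat)

# Build the URL of an OpenCPU function endpoint that returns JSON.
ocpu_endpoint <- function(base_url, pkg, fun) {
  sprintf("%s/library/%s/R/%s/json", base_url, pkg, fun)
}

# Call a deployed R function over HTTP and parse the JSON response.
call_ocpu <- function(base_url, pkg, fun, args = list()) {
  response <- httr::POST(ocpu_endpoint(base_url, pkg, fun),
                         body = args, encode = "json")
  httr::stop_for_status(response)
  jsonlite::fromJSON(httr::content(response, as = "text", encoding = "UTF-8"))
}

test_that("the deployed package answers real web method calls", {
  base_url <- Sys.getenv("OCPU_TEST_URL")  # hypothetical, set by Jenkins
  skip_if(base_url == "", "OCPU_TEST_URL is not set")
  result <- call_ocpu(base_url, "mypkg", "scale_to_unit",
                      list(x = c(10, 15, 20)))
  expect_equal(result, c(0, 0.5, 1))
})
```

Skipping when the variable is missing keeps the same test file harmless on a developer machine without a test server.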

If the functional tests pass, we have a valid release candidate. Now we can publish our package to the local package server and CRAN package mirror, to be installed on the production server whenever we choose to.
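Publishing can also be scripted in R. The sketch below assumes the local package server simply serves a directory laid out like a CRAN repository (src/contrib under a root directory) and that the release candidate is a source tarball; the function name and the directory layout are assumptions about the setup.

```r
# Copy a built source package into a CRAN-style repository directory and
# regenerate the PACKAGES index so install.packages() can find it.
publish_to_repo <- function(tarball, repo_root) {
  stopifnot(file.exists(tarball))
  contrib <- file.path(repo_root, "src", "contrib")
  dir.create(contrib, recursive = TRUE, showWarnings = FALSE)
  if (!file.copy(tarball, contrib, overwrite = TRUE)) {
    stop("Could not copy ", tarball, " to ", contrib)
  }
  tools::write_PACKAGES(contrib, type = "source")
  invisible(contrib)
}
```

The production server can then install from it by pointing the repos argument of install.packages at the URL under which repo_root is served.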

Putting all this together, we roughly end up with the following flow:

Analysis > Unit tests > Deploy to test > Functional tests > Generate documentation > Create release candidate

To store our job configuration as code, and make it easy to spin up this pipeline, we created all of the jobs using the Job DSL plugin, and put the scripts in our repository.

Results

Jenkins supports the overall goal of improving software quality. We use it as the driver to showcase many of the concepts normally used in software development.

The developers now have a visual representation of the state of their software. The quality of the tests and the current progress are tracked on a dashboard for all to see, which makes it easier to demonstrate progress to upper management from a developer standpoint.