Deploy R Code - 5Analytics Enterprise AI Platform

A Predictive Maintenance Solution in R

Anyone who has operated machinery for long enough accepts it: sooner or later, big or small, technical failures simply happen [1]. What choices do we have in the face of such terrible doom?

[1] Murphy’s Laws further tell us that failures happen specifically at the worst possible time and in the worst possible way. Unfortunately, I haven’t figured out how to include this information in our methods for predictive maintenance (yet).

So far, there were two ways of dealing with failures:

Thanks to recent developments, a third option is now becoming available to industries, manufacturers, and basically everyone else who can obtain enough data from their failure-prone entities: predictive maintenance. The main idea is to continuously monitor several parameters of a machine so that maintenance can be scheduled in real time and only when needed (as said above, it’s not a matter of if…). How do we predict an incoming failure, then? The answer is, of course, Machine Learning!

An example case

Preparing the data

What I am going to show is my particular take on a simulated instance of the problem, developed during my time at 5Analytics and still being improved. I must note that the main complications usually arise during the data cleaning and preparation phase, but since that goes beyond the specific topic of predictive maintenance I will pretend everything goes smoothly on the first try. First of all, let’s start our beloved R and get our precious libraries.

library(data.table)  # fread(), setkey(), rbindlist(), :=
library(readxl)      # read_xlsx()
library(zoo)         # na.locf0()

For the methods that we will be discussing here, in order to predict failures we need to know what a failure looks like; that is, our training data must include failures. Anomaly detection is another problem altogether, and we won’t deal with that right now. Let’s suppose that the data we received still needs a little handling before being fed to the algorithms.

For example, machine data and information about failures are, as usually happens, in two separate and differently formatted data objects:

df <- fread("machine_readings.csv")
errors <- read_xlsx("failure_events.xlsx")

Specifically, the data is in a table with m columns, with column names (V1, …, Vm) = observed parameters, and n rows = time points at which these parameters have been measured. Let’s also suppose that column V1 contains the timestamps of the measurements, one per second, and that this is already the timescale at which we want to focus our predictive maintenance analysis.

It might look something like this:

> head(df, 3)
                    V1 V2 V3 V4 V5        V6 V7      V8  V9      V10
1: 2017-04-15 22:01:59 40 36 90 38 0.3211971  0 -5.5552 368 0.279942
2: 2017-04-15 22:02:00 41 36 90 38 0.0500024  0 -5.5552 368 0.281163
3: 2017-04-15 22:02:01 40 36 91 38 0.5000000  0 -5.5552 368 0.289020

The failures are instead reported in a more condensed format, with only the begin- and endtime of each out-of-order time interval given:

> head(errors, 3)
   ERR_ID             begin_t               end_t
1:   0001 2017-04-18 18:03:53 2017-04-18 18:06:43
2:   0002 2017-04-18 19:11:02 2017-04-18 19:12:24
3:   0003 2017-04-19 09:26:11 2017-04-19 09:32:50

As a first step, we expand this data so that it resembles the machine measurements. The following code snippet is not as elegant as what data.table could allow, but it is rather straightforward and lets us focus on how the seq.POSIXt function gets the job done:

errlist <- list()	# create an empty list to populate sequentially
for (i in seq_len(nrow(errors))) {
	x <- errors[i, ]	# the i-th failure event
	# one row per second between the beginning and the end of the failure
	ds <- data.frame("time" = seq.POSIXt(from = x$begin_t, to = x$end_t, by = "sec"))
	ds$ERR_ID <- x$ERR_ID
	errlist[[i]] <- ds
}
failures <- rbindlist(errlist) # merge (rowbind) all elements of the list!

This way, each row of ‘errors’, meaning each failure instance, is expanded as a block of rows:

> head(failures)
                  time ERR_ID
1: 2017-04-18 18:03:53   0001
2: 2017-04-18 18:03:54   0001
3: 2017-04-18 18:03:55   0001
4: 2017-04-18 18:03:56   0001
5: 2017-04-18 18:03:57   0001
6: 2017-04-18 18:03:58   0001 
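
If the loop feels opaque, the core idea can be tried in isolation. What follows is only a minimal base-R sketch on made-up data: errors_toy, expand_one and failures_toy are names invented for this illustration, and the timestamps are assumed to be in UTC. It expands two short intervals into one row per second:

```r
# toy failure table with two short out-of-order intervals (made-up values)
errors_toy <- data.frame(
  ERR_ID  = c("0001", "0002"),
  begin_t = as.POSIXct(c("2017-04-18 18:03:53", "2017-04-18 19:11:02"), tz = "UTC"),
  end_t   = as.POSIXct(c("2017-04-18 18:03:55", "2017-04-18 19:11:03"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# hypothetical helper: expand the i-th interval into one row per second
expand_one <- function(i) {
  data.frame(
    time   = seq(errors_toy$begin_t[i], errors_toy$end_t[i], by = "sec"),
    ERR_ID = errors_toy$ERR_ID[i],
    stringsAsFactors = FALSE
  )
}

# expand every interval and stack the pieces on top of each other
failures_toy <- do.call(rbind, lapply(seq_len(nrow(errors_toy)), expand_one))
print(failures_toy)
```

The first interval covers three seconds and the second one two, so the stacked result has five rows.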

Now that both machine measurements and failures are similarly structured, we can join the two!

# set the merging keys
setkey(failures, time)
setkey(df, V1)

# join the data with the failure-seconds, keeping every measurement row
# (ERR_ID is NA wherever the machine was not failing)
dfe <- failures[df]
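
To check what such a join produces, here is a small base-R sketch with merge() on made-up data. The names readings, fails and joined are illustrative only, and merge() is used as a stand-in for the keyed data.table join rather than as the same call:

```r
# five consecutive one-second readings (made-up values, UTC assumed)
readings <- data.frame(
  V1 = as.POSIXct("2017-04-18 18:03:51", tz = "UTC") + 0:4,
  V2 = c(40, 41, 40, 42, 43)
)

# a failure that covers the third and fourth second of the readings
fails <- data.frame(
  time   = as.POSIXct("2017-04-18 18:03:53", tz = "UTC") + 0:1,
  ERR_ID = "0001",
  stringsAsFactors = FALSE
)

# keep every reading; ERR_ID stays NA where no failure was recorded
joined <- merge(readings, fails, by.x = "V1", by.y = "time", all.x = TRUE)
print(joined)
is.na(joined$ERR_ID)  # TRUE TRUE FALSE FALSE TRUE
```

Rows with a non-NA ERR_ID are the failure seconds; everything else is ordinary operation.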

This newly created dfe is what our random forest will soon be fed. First, though, we need to work on our data a bit more: our goal is not to distinguish which rows are failures and which are not, since at that point we’d already be way too late… Rather, our software should tell us when the system is about to fail. It is up to us to define how far in advance we want to be alerted. For this example, we set 300 seconds = 5 minutes.

# create a new column "nextfail" and set its value for all the failure rows
# (those with a non-NA ERR_ID) to their time (as from the timestamp).
# All other rows will be NAs
dfe[!is.na(ERR_ID), nextfail := time]

# set the distance (in seconds) from the next failure to be considered "soon"
how_soon <- 300

Another little bit of magic now. Function na.locf0() from zoo fills NA values in a column by carrying forward the last non-NA observation (hence the name: last observation carried forward). In our case we actually want the opposite to happen, namely to fill NAs with the next non-NA value, hence fromLast = TRUE.

# fill NAs in "nextfail" with the following failure time
dfe$nextfail <- na.locf0(dfe$nextfail, fromLast = TRUE)
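
For readers who want to see the backward fill spelled out, here is a tiny base-R stand-in. fill_backward is a hypothetical helper written for this illustration, not zoo's implementation; it only mimics what fromLast = TRUE is described to do above:

```r
# hypothetical helper: replace each NA with the next non-NA value after it
fill_backward <- function(x) {
  # walk from the second-to-last element back to the first one
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

filled <- fill_backward(c(NA, NA, 5, NA, 9, NA))
print(filled)  # 5 5 5 9 9 NA
```

Note the trailing NA: nothing follows it, so there is nothing to carry back.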

Column nextfail then contains time values for all the rows before any failure and at the failures themselves. (Can you think of anything that we should still take care of? The answer is in a few lines, so hold on to your chairs while we proceed.) It’s time to create a couple more columns:

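# set the time to fail as difference between next fail and current time (duh..),
# forcing the unit to seconds so that it can be compared with how_soon
dfe$ttf <- as.numeric(difftime(dfe$nextfail, dfe$time, units = "secs"))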

# if the time before next fail is less than our chosen threshold, tadan!!
dfe$fail_soon <- dfe$ttf < how_soon
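
One detail worth checking when subtracting timestamps in R: the result is a difftime whose unit is chosen automatically, so comparing it with a threshold expressed in seconds can silently go wrong. A short sketch with made-up timestamps (t_now and t_fail are illustrative names):

```r
t_now  <- as.POSIXct("2017-04-18 18:00:00", tz = "UTC")
t_fail <- as.POSIXct("2017-04-18 18:03:53", tz = "UTC")

# plain subtraction picks the unit on its own: here it is minutes
auto <- t_fail - t_now
units(auto)       # "mins"

# asking explicitly for seconds gives a number comparable with 300
secs <- as.numeric(difftime(t_fail, t_now, units = "secs"))
secs              # 233
secs < 300        # TRUE
```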

And now, back to the question left unanswered. What problems could nextfail still give us? Well, as we said, it contains time values (and specifically the same values) for all rows before a failure and the failure itself, because of na.locf0(). But what if our data ends with any number of non-failure rows? For these, nextfail will still be NA, as there is no future failure to carry back. Should we discard them all, potentially throwing away a huge bunch of information? Maybe we can do something better! Since for each row we want to know whether a failure happens in the upcoming time (in our example, 300 seconds), we surely won’t be able to say anything about the last 300 rows. But we know that every row before those, back to the last failure, is more than 300 seconds away from a failure!

# remove the very last chunk of rows for which one cannot say anything
dfe <- dfe[!(is.na(nextfail) & time %in% tail(unique(time), how_soon))]

# we know for certain that every remaining row without a next failure is not a "fail soon"...
dfe[is.na(nextfail), fail_soon := FALSE]

That was quite smart, wasn’t it? And we are now almost done: let’s get rid of the failure rows since, as said, if we are on a failure [..]

# we are interested in the non-failure rows, so we remove all failures!
dfe <- dfe[is.na(ERR_ID)]

# save the data object in a csv file
fwrite(dfe, "dfe.csv")

Random Forest

The Machine Learning tool we will be using to classify “safe” and soon-to-fail states of the machinery is Random Forests [2], which I’ve so far been taking from the randomForest library. Recently, though, I found out about the h2o package and I’m loving it – not just because my childhood hero [3] Matt Dowle is part of the project (he is the man behind data.table!) – so I will report my experience with Random Forests and h2o, loosely based on this tutorial.

[2] We won’t discuss what a Random Forest is and how it works. If you don’t have your favourite ML book at hand, the related Wikipedia page does a decent job of introducing you to the subject.
[3] For very large values of “childhood”

I must specify that h2o is not your standard R library. Rather, it works by means of a cool Java Virtual Machine running on your server/machine/wherever, while R is mostly left in charge of directing its work. Let’s initialize everything here:

# load the library
library(h2o)

# launch the h2o machine (here we set it to use 128GB of the server memory).
# h2o is by default using "many" of the available threads too! <3 <3 <3 
h2o.init(max_mem_size = "128G")

# little drawback: you need to store your data on .csv (e.g. rather than .rds) in order
# for them to be accessible by h2o
dfe <- h2o.importFile("dfe.csv")

# given how we renamed the columns before, we can say that everything in V2, ..., Vn is
# to be considered as a factor 
factors <- setdiff(colnames(dfe)[grepl("V[0-9]+", colnames(dfe))], "V1")

# what we need to keep in our data is all the factors and the “fail_soon” column! 
dfe <- dfe[ , c(factors, "fail_soon")]

Now we split our data in three subsets: one set to train the forest, one set to validate it, and one to bring them all and in the darkness bind them. No wait, I think I got confused.

# set the proportions of the first two splits to 60% and 20%; the third split is
# created automatically from the remaining 1 - (0.6 + 0.2) = 20% of the data
splits <- h2o.splitFrame(dfe, c(0.6, 0.2), seed = 1234)

# assign the splits
train <- h2o.assign(splits[[1]], "train.hex")   
valid <- h2o.assign(splits[[2]], "valid.hex")
test  <- h2o.assign(splits[[3]], "test.hex")
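
For intuition, a random three-way split can be sketched in base R. This is only an illustration on row indices, not what h2o does internally; as far as I understand, h2o.splitFrame also assigns rows at random, so the resulting sizes only approximate the requested ratios. The names n, part and rows_* are made up for the example:

```r
set.seed(1234)   # arbitrary seed, only for reproducibility
n <- 1000        # pretend the data has 1000 rows

# draw one label per row with probabilities 60% / 20% / 20%
part <- sample(c("train", "valid", "test"), size = n, replace = TRUE,
               prob = c(0.6, 0.2, 0.2))

# row indices belonging to each subset
rows_train <- which(part == "train")
rows_valid <- which(part == "valid")
rows_test  <- which(part == "test")

# the sizes are close to, but usually not exactly, 600 / 200 / 200
print(table(part))
```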

And here’s the random forest being grown!

# here's where our random forest comes to life!
rf1 <- h2o.randomForest(
		training_frame   = train,		# where to train the forest;
		validation_frame = valid,		# where to validate its parameters;
		x = factors,				# the predictor variables;
		y = "fail_soon",			# the target variable;
		model_id = "rf_covType_v1",		# name under which h2o stores the model;
		ntrees   = 200,			# number of trees used by the forest;
		stopping_rounds  	= 2,		# stop early once the scoring metric has not
							# improved over this many scoring rounds;
		score_each_iteration = TRUE,		# "Predict against training and validation
							# for each tree", says the tutorial;
		seed = 4815162342
)

We now want our forest to predict which of the rows in the test set are doomed to fail soon, with the h2o equivalent of predict(), that is – you guessed right – h2o.predict()!

# perform the prediction on the test data with the forest just obtained
pred <- h2o.predict(object = rf1, newdata = test)

Finally, we bring back from the h2o environment a list where our predictions are juxtaposed to the real values of the test set. With function table() we automatically obtain a confusion matrix for our results:

# put the prediction and the original values side by side
confmatrix <- table(as.data.frame(h2o.cbind(pred$predict, test$fail_soon)))

# show the outcome in terms of confusion matrix and related metrics
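
The usual metrics can also be computed by hand from such a table. Here is a minimal base-R sketch on made-up predictions; predicted, actual and cm are illustrative names, and the values are not results from the model above:

```r
# made-up predictions and true labels for eight rows
predicted <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
actual    <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)

# rows = predicted class, columns = actual class
cm <- table(predicted = predicted, actual = actual)
print(cm)

tp <- cm["TRUE", "TRUE"]    # failures correctly announced
fp <- cm["TRUE", "FALSE"]   # false alarms
fn <- cm["FALSE", "TRUE"]   # missed failures
tn <- cm["FALSE", "FALSE"]  # quiet periods correctly left alone

accuracy  <- (tp + tn) / sum(cm)   # 0.75
precision <- tp / (tp + fp)        # 2/3
recall    <- tp / (tp + fn)        # 2/3
```

For predictive maintenance, recall (how many upcoming failures were caught) is often the number to watch, since failure rows tend to be rare and accuracy alone can look flattering.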

Finally we can extract the importance of the different variables:

vi <- as.data.frame(rf1@model$variable_importances)

which completes a standard analysis of the results of a Random Forest. For the sake of Predictive Maintenance, though, one must be able to process the machine status in real time. It is effortless to do so with our AI platform: we just need to define the function

PredictFailure <- function(V2,V3,V4,V5,V6...) {
	# build a one-row frame whose column names match the predictors used in training
	return(h2o.predict(object = rf1, newdata = as.h2o(data.frame(V2,V3,V4,V5,V6...))))
}

And voilà! Our function will return the likelihood that the given input row corresponds to a safe or a soon-to-fail machine status.

Putting the code into production

All that is left now is to upload the prediction function to the 5Analytics AI Platform, which we made sure is as simple and straightforward as possible.

# upload file to server via webdav
> curl -u usr:pswd --digest -T predict.R 'http://localhost:5050/up/dav/'

The 5Analytics AI platform will load the code and establish the web service end-point for PredictFailure. Now your Web Service is ready to be queried, and it will return the predicted status together with the probability of each class.

> curl "http://localhost:5050/if/json/PredictFailure?_token=test_token&V2=40&V3=36… "
"data": {
"Predict" : FALSE,
"FALSE" : 0.8044024,
"TRUE" : 0.1955976