H2O.ai and Predictive Scoring

Contents

Predictive Modeling
What is H2O.ai?
How to Use H2O

 

Predictive Modeling

We hear a lot about predictive modeling and machine learning – many companies are implementing at least basic predictive models to help guide business decisions. Some predictive models can be run on an as-needed basis, when to-the-minute updates are not required. This paradigm shifts when we move from reporting to marketing. In a world where people read, socialize, and shop online, and clicks generate instantaneous responses, marketing advertisements must also respond quickly – instantaneously. And this doesn’t just mean an ad has to load quickly; we want algorithms to quickly decide WHICH ad to load, or how much money to pay to have an ad load. These models must score in real time to stay competitive in an online market. Marketing is quickly becoming the focus not only of content specialists, market researchers, and graphic designers but also of statisticians, programmers, engineers, and data scientists. Marketing just got nerdy!

 

What is H2O.ai?

At The General®, we’ve chosen to develop and test the functionality and performance of H2O.ai. H2O.ai is a software company offering open source AI solutions. H2O’s key benefits are that it processes in-memory and runs as a distributed platform: algorithms and analyses run faster, without relying on disk storage, and multiple CPU cores and/or machines are harnessed to handle the job in parallel. This architecture allows for ease of scalability – faster processing can be accomplished by increasing the number of processors or nodes available. For this proof-of-concept project, I have H2O installed and running on my laptop computer.

 

How to Use H2O

Things you’ll need to run H2O:

  • H2O.ai (free download)

  • Java Development Kit (JDK, free download)

  • Python or other programming language (free and optional)

H2O provides several options for building your predictive model: Python, R, Scala, or H2O’s browser-based user interface, H2O Flow. Here, I have chosen to develop in Python. H2O provides a Python library for formatting data into H2O-friendly frames and for access to H2O’s Java Virtual Machine (JVM) platform via a REST API, which allows use of the JVM’s objects and algorithms. After installing the library in Python, a cluster instance of H2O can be initiated for data processing and model development, as follows:

#Initiate an H2O API connection and load data into an H2O data frame
import h2o
h2o.init()
data = h2o.import_file(path="path/file.csv")

When working at the console, the above call of h2o.init() generates printed output on the status of the connection, the hosting server address, and statistics about the available processing power. For example, a portion of my output indicates that I have one node and four processing cores available for use. H2O uses available nodes as JVMs in its cloud architecture.
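The same details are also available programmatically; below is a minimal sketch using the library’s cluster accessor (show_status() is part of the h2o Python API; if more resources are needed, h2o.init() also accepts nthreads and max_mem_size arguments):

#Inspect the running cluster programmatically
h2o.cluster().show_status()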

After initiating the H2O connection and utilizing the H2O Python library to load our data into an H2O data frame, we can proceed with H2O’s model development functions. Here, I am using H2O’s implementation of the Random Forest model. H2O has various algorithm options for both supervised and unsupervised machine learning. These include a deep learning neural network algorithm, generalized linear models, gradient boosting, a naïve Bayes classifier, isolation forests, k-means clustering, and principal component analysis, among others. Also of interest is that H2O.ai has released some NLP-specific functionality, including a word2vec algorithm.

#Define predictors and response variable
predictors=['predictor1','predictor2','predictor3','predictor4','predictor5']
response='var'

#Split into train and test datasets
train,test=data.split_frame(ratios=[.8],seed=1234)

#Build model (requires the estimator import)
from h2o.estimators.random_forest import H2ORandomForestEstimator
drf_model=H2ORandomForestEstimator(
  model_id="rf_covType_v1",
  ntrees=200,
  stopping_rounds=2,
  score_each_iteration=True,
  seed=1000000)

#Train model
drf_model.train(predictors,response,training_frame=train)
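
Any of the supervised algorithms listed above drops into this same estimator pattern. As a sketch, here is gradient boosting swapped in using the same predictors and training frame (H2OGradientBoostingEstimator is H2O’s gradient boosting class; the parameter values are illustrative, not tuned):

#Swap in gradient boosting with the same predictors and training frame
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm_model=H2OGradientBoostingEstimator(
  model_id="gbm_covType_v1",
  ntrees=200,
  seed=1000000)
gbm_model.train(predictors,response,training_frame=train)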

In either case, the trained model is returned as an H2O object, and its performance can be examined using a variety of H2O model access methods (outlined in the H2O documentation).
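For example, a minimal sketch of two such methods applied to the held-out test frame (model_performance() and varimp() are standard H2O model methods; varimp(use_pandas=True) assumes pandas is installed):

#Evaluate on the held-out test frame
perf=drf_model.model_performance(test_data=test)
print(perf.mse())                         #mean squared error on the test set
print(drf_model.varimp(use_pandas=True))  #variable importance table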

Once a satisfactory model is established, converting it to a production-ready format is a two-step process: first, download the MOJO artifact that H2O generates in the cloud at model build (together with its supporting genmodel .jar); and second, write a Java method that uses the model for scoring.

Using Python:

#Download the MOJO and its supporting genmodel .jar
modelfileDRF=drf_model.download_mojo(path="C:/Location/",
                                     get_genmodel_jar=True,
                                     genmodel_name='DRFModel.jar')

This command downloads our model as a MOJO (Model Object, Optimized): a .zip archive containing all the information about the model (or models) we just built. Because get_genmodel_jar=True, it also downloads the h2o-genmodel runtime .jar (here named DRFModel.jar) that Java code needs to score with the MOJO.
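As a quick sanity check before moving to Java, the MOJO can be loaded back into H2O and scored from Python. This sketch assumes a reasonably recent h2o release (which provides h2o.import_mojo) and the default <model_id>.zip file name:

#Reload the MOJO and confirm it scores the test frame as expected
mojo_model=h2o.import_mojo("C:/Location/rf_covType_v1.zip")
print(mojo_model.predict(test))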

The last step is to write a Java program that uses the genmodel .jar to score with the MOJO. The program specifies the input data to be processed, which model to run, and the format and location of the prediction output. For this POC, incoming data was provided via a .csv file; in a fully productionized implementation, data would likely be received over the web in JSON or a similar format. Again, the data format can easily be updated by modifying your Java file.


import java.io.*;
import java.util.*;
import hex.genmodel.MojoModel;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;

public class Main{

        public static void main(String[]args)throws Exception{

                String[] header = new String[5];

                //Wrap the MOJO in H2O's easy-predict API
                RowData row = new RowData();
                EasyPredictModelWrapper model = new EasyPredictModelWrapper(MojoModel.load("rf_covType_v1.zip"));

                String fileName = "C://path//dataset.csv";

                File file = new File(fileName);

                Scanner inputStream;

                try{
                        inputStream = new Scanner(file).useDelimiter("\\n");
                        int n = 1;

                        while(inputStream.hasNext()){

                                if(n==1){
                                        //First line holds the column names
                                        String line = inputStream.next();
                                        header = line.split(",");

                                } else {
                                        //Map each value to its column name, then score the row
                                        String line = inputStream.next();
                                        String[] values = line.split(",");
                                        row.put(header[0].trim(), values[0].trim());
                                        row.put(header[1].trim(), values[1].trim());
                                        row.put(header[2].trim(), values[2].trim());
                                        row.put(header[3].trim(), values[3].trim());
                                        row.put(header[4].trim(), values[4].trim());
                                        long startTime = System.nanoTime();
                                        RegressionModelPrediction p = model.predictRegression(row);
                                        long endTime = System.nanoTime();
                                        System.out.println("");
                                        System.out.println("Line Number: "+n);
                                        System.out.println("Row Values: "+row.values());
                                        System.out.println("Prediction: "+p.value);
                                        System.out.println("Prediction took " + (endTime - startTime)/1000000 + " milliseconds.");
                                }
                                n++;
                        }
                        inputStream.close();

                }catch (FileNotFoundException e){
                        e.printStackTrace();
                }
        }
}

Compiling the Java program against the genmodel .jar and executing it (e.g., javac -cp DRFModel.jar Main.java followed by java -cp .;DRFModel.jar Main on Windows) runs the model and generates the predictions. Speeds achieved using my laptop’s processing capabilities and this 5-feature RF model varied between 0 and 4 milliseconds per case.

Lindsey Huskey