Machine Learning Deployment - Part III: The Three Pillars of Production Models

This post continues a previous series.

When I was first brought onto the Data Science team, the guidelines given to me by Data Science leadership were straight and to the point, but they were very high level; it was mostly left up to me to explore and figure out what the underlying implementation could and should actually look like.  I quickly realized how much detail would be involved in each of the following areas:

  1. Model Serving

  2. Model Monitoring

  3. Model Re-Training

I have mostly been focusing on the first one, but I will give a high-level introduction to each.

Model Serving

From a pure model prediction perspective, this means serving the models in real time to make predictions on incoming request data.  The data should be processed, ideally using the same data preparation and feature engineering logic used during training, to produce the model inputs.  Those inputs are then fed into the model, which outputs the predictions.  Any additional filtering logic should then be applied to those raw predictions, such as taking the output probability from a binary classifier and returning the actual class it represents (e.g., “Fraud” or “Not Fraud”).  As discussed in previous posts, many other traditional software engineering details also apply, just as they would to any other application.  Models also need to be versioned so we can easily roll back to a previous version if issues occur.  Another important detail to remember is that the data preparation and feature engineering logic are tightly coupled to the model, so they would also need to be rolled back if necessary.
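
To make that flow concrete, here is a minimal sketch of the request path in Python, assuming a scikit-learn-style binary fraud classifier.  The feature names, the 0.5 threshold, and the helper functions are illustrative assumptions for the example, not details of any particular production system.

    # Minimal serving sketch (illustrative only): prepare features with the
    # same logic used at training time, score the model, then map the raw
    # probability to the class it represents.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    FEATURE_COLUMNS = ["amount", "account_age_days"]  # hypothetical features
    CLASS_LABELS = {0: "Not Fraud", 1: "Fraud"}

    def prepare_features(request_data: dict) -> np.ndarray:
        """Apply the same data preparation used during training."""
        return np.array([[request_data[c] for c in FEATURE_COLUMNS]])

    def predict_class(model, request_data: dict, threshold: float = 0.5) -> str:
        """Preprocess, score, and filter the raw probability into a label."""
        features = prepare_features(request_data)
        probability = model.predict_proba(features)[0, 1]
        return CLASS_LABELS[int(probability >= threshold)]

    # Train a throwaway model so the example runs end to end.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    model = LogisticRegression().fit(X, y)
    print(predict_class(model, {"amount": 1.2, "account_age_days": 0.4}))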

Model Monitoring

Here we refer to automated monitoring specifically involving the trained models/algorithms themselves.  This is an additional monitoring layer, specific to the field of Data Science, on top of traditional application monitoring.  Why?  Because if you’re using a decision algorithm created by training and testing on data rather than by manual programming, the patterns it relies on came from that data.  What if, over time, the chosen type of model, the patterns discovered, or the features used to discover those patterns are no longer valid?  Model monitoring should help you minimize the risk of finding that out too late, or allow you to identify shifts and get a jump start on addressing them.  You may need to choose a new type of model, retrain an existing model, or explore different features to create a structurally different model.

You will also want to monitor your model for any prediction anomalies, as well as for feature or concept drift.  Over time, the distributions of the features used to make predictions can shift (feature drift), or the relationship between those features and the target can change (concept drift), either of which can cause a model to start making inaccurate predictions.  This sort of model monitoring helps us know whether the trained algorithms are performing within safe and expected bounds.  For example, if we had a fraud classification model, version 1.0.0, serving predictions for a year, we can automatically be notified if the model starts acting differently than it has for the past year.  Maybe there’s a shift in the distribution of the model input features, which indicates a re-training may be necessary.
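
As a concrete illustration of that kind of check, here is a small sketch that compares the recent distribution of one input feature against its training-time baseline using a two-sample Kolmogorov-Smirnov test.  The feature values and the 0.01 alert threshold are assumptions made for the example, not values from any real monitoring setup.

    # Illustrative feature-drift check: flag a feature whose recent values no
    # longer look like they come from the training-time distribution.
    import numpy as np
    from scipy.stats import ks_2samp

    def feature_drift_alert(baseline: np.ndarray, recent: np.ndarray,
                            p_value_threshold: float = 0.01) -> bool:
        """Return True if the recent sample looks significantly different."""
        _statistic, p_value = ks_2samp(baseline, recent)
        return p_value < p_value_threshold

    rng = np.random.default_rng(0)
    training_values = rng.normal(loc=0.0, scale=1.0, size=5000)
    serving_values = rng.normal(loc=0.7, scale=1.0, size=1000)  # shifted distribution
    print(feature_drift_alert(training_values, serving_values))  # True -> notify the team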

If we are re-training and automatically re-deploying new models on a scheduled basis, we need to be able to automatically detect whether a new release is having issues.  If it is, we need a way to temporarily turn off the automated re-deployment pipeline and roll back to the previous version of the model.  If this occurs, we will also want to work with our data scientists to diagnose the suspect model and discover what went wrong in the automated re-training.
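
One simple way to frame that safeguard is a promotion gate in the re-deployment pipeline: the newly trained model only replaces the serving version if its validation metric holds up.  The sketch below is a hypothetical illustration; the AUC metric, the 2% tolerance, and the version strings are assumptions, not details of any particular pipeline.

    # Illustrative promotion gate: keep the current model and pause automated
    # re-deployment if the candidate's holdout metric degrades too much.
    def should_promote(current_auc: float, candidate_auc: float,
                       tolerance: float = 0.02) -> bool:
        return candidate_auc >= current_auc - tolerance

    def release(current_version: str, candidate_version: str,
                current_auc: float, candidate_auc: float) -> str:
        if should_promote(current_auc, candidate_auc):
            return candidate_version
        # Roll back: keep serving the previous version and flag the candidate
        # for the data scientists to diagnose.
        print(f"Holding {candidate_version}; continuing to serve {current_version}")
        return current_version

    print(release("1.0.0", "1.1.0", current_auc=0.91, candidate_auc=0.84))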

Model Re-training

This is what it sounds like: re-training the models.  There are a number of reasons we may want to retrain a model: as data evolve, models can grow stale over time; it may be competitively beneficial to re-train on a scheduled basis to quickly adapt to your customers’ preferences; or perhaps a newly acquired data point is predictive and needs to be added to an existing model.  During my undergraduate tenure, one of my favorite professors taught a modeling course and often posed questions about model re-training: “How and when do you re-train the model?  How much historical data should you use?  Should it be a consistently shifting segment? 30 days or 1 year?”  Thinking through these questions helps guide us in making the right decisions.
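
Those questions about the data window map directly onto a parameter of the re-training job.  As a hypothetical illustration, the snippet below computes a rolling training window from a configurable lookback; the 30-day default and the function name are assumptions for the example.

    # Illustrative rolling training window: the lookback is a parameter of the
    # re-training job rather than a hard-coded date range.
    from datetime import date, timedelta

    def training_window(run_date: date, lookback_days: int = 30) -> tuple[date, date]:
        """Return the (start, end) dates of the data pull for this re-training run."""
        return run_date - timedelta(days=lookback_days), run_date

    print(training_window(date(2020, 6, 1)))                     # 30-day shifting segment
    print(training_window(date(2020, 6, 1), lookback_days=365))  # a full year of history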

These efforts require parameterizing the data pull window, configuring a training specification, and having automated validation tests to assert that the model is performing as expected.  And of course, as engineers we must work hand in hand with data scientists to understand how they validated the model originally.  Which metrics did they use?  One thing we learned by working more closely with the data scientists was that they often have “sanity checks” when validating a recently retrained model (e.g., checking whether the coefficients of an ordinal categorical variable increase or decrease).  Knowing the whole picture and understanding how the business will use the model allow us to make an informed decision about the model re-training schedule.
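
As an example of how such a sanity check might be automated, the sketch below asserts that the coefficients a model learned for an ordinal categorical feature move in one consistent direction across its ordered levels.  The feature levels and coefficient values are hypothetical.

    # Illustrative post-re-training sanity check: for an ordinal categorical
    # feature (e.g., an ordered risk tier), confirm the learned coefficients
    # are monotonic across its levels.
    def coefficients_are_monotonic(coefficients: list[float]) -> bool:
        increasing = all(a <= b for a, b in zip(coefficients, coefficients[1:]))
        decreasing = all(a >= b for a, b in zip(coefficients, coefficients[1:]))
        return increasing or decreasing

    # Hypothetical coefficients for the ordered levels "low" < "medium" < "high".
    print(coefficients_are_monotonic([-0.8, 0.1, 0.9]))  # True: passes the check
    print(coefficients_are_monotonic([-0.8, 1.2, 0.3]))  # False: flag for review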

In our experience, you can start getting value by building one-offs for Model Serving, perhaps with some manual, user-driven monitoring of the models.  While building one-offs is not a scalable solution, it certainly lets you get your feet wet, teaches you to fail quickly and try again, and builds relationships within your organization.  With time, the boundaries will begin to emerge as to what a consistent model serving system could look like to achieve the three pillars above.

 
Mike Kehayes