Model Monitoring and Anomaly Detection Part I: Introduction and The ETL Job

A Brief Introduction to Models and Drift

In data science, machine learning can be used to solve problems that traditional algorithms cannot (reasonably) solve. Given a representative training dataset, machine learning models can quickly solve complex problems with high degrees of accuracy. The biggest issue with these models in production is that the input data changes over time, rendering the original training set unrepresentative of the real world and causing the accuracy and performance of the model to diminish.

Two terms have been coined to describe changes that can occur in a model’s features and cause performance degradation: concept drift and feature drift. A feature is “a measurable property or characteristic of a phenomenon being observed” such as an input variable to the model. Concept drift refers to a change in the distribution of values of a feature that is used in the model’s prediction.

Let’s look at a trivial example to better understand concept drift:

“A clothing store is looking to open up its first location at the start of January in X, a city that experiences all four seasons. In order to determine what inventory to stock (number of shirts, pants, shorts, etc.), they have trained a model using data from a nationwide survey of what clothes people are wearing, conducted from October-December of the previous year. Initially, the store does great, having the correct inventory to meet the needs of its customers. Fast-forward to summer, where the store notices it has an overstock of warm clothes like jackets and pants.”

What happened? The model was only trained on fall and winter clothes data. As summer arrived, more people were wearing shorts and t-shirts than in the winter. The distribution of clothes changed, causing the training data to become unrepresentative of X’s population. This, in turn, resulted in the model making inaccurate predictions on what inventory to stock. The model’s inaccuracy was caused by concept drift.

Feature drift occurs when a specific feature’s relevancy in determining the correct prediction changes. In other words, the distribution of all the features that go into a specific prediction output value has changed, rendering a feature more or less important than it used to be. Let’s follow up on the previous example to better illustrate feature drift:

“Unfortunately, minimal data was available for clothing surveys in the spring and summer months. The model did not noticeably improve much from the little data. Luckily, the original survey also asked each respondent to record the current temperature as part of their response. Learning from their past mistake, the clothing store built version 2.0 of their model that now takes temperature into account when predicting what inventory to stock. This feature helped the model compensate for the lack of data. Over the next several years, the store experienced better success with v2.0 in the summer. They decide to open a second location in Y, a city with a more uniform, temperate climate year-round. Both X and Y locations do well in the winter months. However, as summer rolls around again, Y notices that the model isn’t performing as accurately as it is in X.”

It turns out that in city Y, temperature doesn’t play as big of a role in determining the clothing trends because of the cooler, more uniform climate. This decrease in relevancy of temperature describes feature drift. Note that feature drift is a change in relevancy. This includes increases in relevancy, which the example did not describe.


Now, let’s illustrate why identifying drift is important. It is easy to see why a large performance decrease is harmful, but in the real world, a model might not suffer such a noticeable decrease. Let’s say that after a year, a model was observed to have an accuracy drop of 3% when comparing its Q4 performance to the initial Q1 metrics, and assume that number remains the same for the following year. Does a small percentage drop in accuracy have a downstream impact? I’ll use an abstract scenario with hypothetical numbers to illustrate.

A company deploys a model it developed to detect if condition Ζ will occur in a process. If the process runs normally, it costs the company $500. If Ζ is detected, the company can always stop it, costing them $1,000. If Ζ occurs and the company doesn’t detect it, Ζ causes the company to lose $10,000. Ζ occurs 30% of the time overall, and the model currently has an accuracy of 90%. This means that out of 100 processes, Ζ occurs in 30 of them. The model correctly will say “no Ζ” to 63 processes (True Negatives), and “yes Ζ” to 27 (True Positives). It will incorrectly say “yes Ζ” to 7 processes where Ζ does not occur (False Positives) and “no Ζ” to 3 processes where Ζ occurs (False Negatives).

In summary:

Process(normal) = $500

Process(Ζ && detected) = Process(stop Ζ) = $1,000

Process(Ζ && undetected) = Process(Ζ) = $10,000

Confusion Matrix Over 100 Processes: Model Predictions @ 90% Accuracy

Unmonitored Process    Predicted No Z    Predicted Yes Z
No Z                   63 (TN)           7 (FP)
Yes Z                  3 (FN)            27 (TP)

Breaking it down for 100 processes:

Without Model:

Total Cost = (70 x Process(normal)) + (30 x Process(Z))
Total Cost = (70 x $500) + (30 x $10,000)
Total Cost = $335,000

With Model @ 90% Accuracy:

Total Cost = 63 TN + 27 TP + 7 FP + 3 FN
Total Cost = (63 x Process(normal)) + (27 x Process(stop Z)) + (7 x Process(stop Z)) + (3 x Process(Z))
Total Cost = (63 x $500) + (27 x $1,000) + (7 x $1,000) + (3 x $10,000)
Total Cost = $95,500

That’s an average savings of $2,395 per process! Let’s assume the company runs the process twice a day, every day of the year, for a total of 730 processes. Without the model, they would spend $2,445,500 annually. With it, they would spend only $697,150, saving them $1,748,350 annually! Now, if the model’s accuracy then dropped 3% (87% overall accuracy), for 100 processes, it would correctly identify 61 “no Ζ” and 26 “yes Ζ” processes.

Confusion Matrix Over 100 Processes: Model Predictions @ 87% Accuracy

Unmonitored Process    Predicted No Z    Predicted Yes Z
No Z                   61 (TN)           9 (FP)
Yes Z                  4 (FN)            26 (TP)

Using the same logic as above, they would spend $770,150 annually, reducing average annual savings from $1,748,350 to $1,675,350, an opportunity cost of $73,000.
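To make the arithmetic easy to re-run with different numbers, here is a minimal sketch of the cost calculation in Python. The costs and confusion matrix counts come straight from the scenario above; the function and constant names are my own and purely illustrative.

```python
# Costs per process, taken from the scenario above.
COST_NORMAL = 500      # process runs normally
COST_STOPPED = 1_000   # model says "yes Z" and the process is stopped
COST_MISSED = 10_000   # Z occurs and is not detected


def cost_per_100(tn, fp, fn, tp):
    """Total cost of 100 processes given the confusion matrix counts."""
    return (tn * COST_NORMAL) + ((tp + fp) * COST_STOPPED) + (fn * COST_MISSED)


def annual_cost(cost_100, runs_per_year=730):
    """Scale the cost of 100 processes to a full year of runs."""
    return cost_100 * runs_per_year / 100


no_model = cost_per_100(tn=70, fp=0, fn=30, tp=0)  # every Z goes undetected
model_90 = cost_per_100(tn=63, fp=7, fn=3, tp=27)
model_87 = cost_per_100(tn=61, fp=9, fn=4, tp=26)

for label, cost in [("no model", no_model), ("90%", model_90), ("87%", model_87)]:
    print(f"{label:>8}: ${cost:,} per 100 processes, ${annual_cost(cost):,.0f} per year")

drop = annual_cost(model_87) - annual_cost(model_90)
print(f"Annual cost of the 3% accuracy drop: ${drop:,.0f}")
```

Swapping in your own costs and counts gives a quick estimate of what a given accuracy drop is worth to the business.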

By monitoring models, issues can be detected faster, minimizing the resulting expenses. This can be a combination of retraining the model with new data to keep performance high, and/or identifying and correcting concept and feature drift within the data pipeline. Unfortunately, there is no secret formula for determining the correct blend – trends are unique to each dataset and costs differ for each business problem.

Building the Monitoring Pipeline

When building a system for model monitoring, there are a lot of moving parts to consider: what data is necessary, how the data needs to be transformed for monitoring, how to monitor the data to detect drift and anomalies, how to automate the process, and where to host everything (just to name a few). A high-level overview of such a system might look like this:


The monitoring pipeline can be configured to collect data and perform analysis either in real-time or as a scheduled batch job. The configuration of this pipeline should be determined by a variety of business and technical factors, the most straightforward being the availability of new model data; however, there are no set rules when determining this configuration. It depends entirely on the needs of the individual or company.


The first component in the pipeline is the ETL Job, (so cleverly) named for its functions: extract the model data, transform it for monitoring, and load the new data to the monitoring service. A custom solution was developed primarily based on the DataFrame, the main object in the popular Python package pandas. Its table-based design makes it excellent for working with the data much as it would appear in SQL, but with extended functionality, ease of development, and the ability to prototype and visualize the data at every step in a Jupyter Notebook.

ETL Job: Gathering Data (Extract)

The first step in building our ETL job is determining what data from the model is essential for monitoring. Since the data will ultimately be analyzed in a time series, it is essential to include a timestamp and unique identifier. This particular model makes multiple predictions on each individual set of features as new information becomes available, so a unique identifier will be created using multiple indexes on the DataFrame. All metrics for measuring a model’s performance are derived from two values: its output (prediction), and a ground truth (validation). Using pseudo-randomly generated values, the data might look like this:

A sample raw dataset


This is the bare minimum that is required for model monitoring. However, data relating to the model’s features, or anything else that is deemed necessary, may also be appended.
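As a rough sketch of what such a raw dataset could look like in pandas, the snippet below builds one from pseudo-random values. The column names (timestamp, prediction, validation) and the two index levels (event_id, prediction_num) are assumptions made for illustration, not the names used in the actual project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the sample is reproducible

# Hypothetical layout: 3 events, each scored 4 times as new information arrives.
event_ids = np.repeat([101, 102, 103], 4)
prediction_nums = np.tile(np.arange(4), 3)
timestamps = pd.Timestamp("2019-01-21 10:00") + pd.to_timedelta(np.arange(12), unit="h")

raw = pd.DataFrame(
    {
        "timestamp": timestamps,
        "prediction": rng.random(12).round(3),                      # model output in [0, 1)
        "validation": np.repeat(rng.integers(0, 2, size=3), 4),     # ground truth per event
    },
    # Two index levels together form the unique identifier for each row.
    index=pd.MultiIndex.from_arrays(
        [event_ids, prediction_nums], names=["event_id", "prediction_num"]
    ),
)
print(raw.head())
```

Neither index level is unique on its own, but the pair is, which is what makes it usable as a document id later on.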

ETL Job: Cleaning And Calculating Metrics (Transform)

Once the data has been loaded, it needs to be cleaned. Rows with a missing value in any column must be dropped. Remember, only the absolutely necessary data is being pulled from the model to ensure high efficiency and minimal memory requirements, so if a value is missing, that row cannot proceed. Missing values often occur in the validation column because the ground truth does not yet exist for an event in progress.
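In pandas, this cleaning step can be as small as a single dropna call. A minimal sketch, assuming hypothetical column names timestamp, prediction, and validation:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2019-08-07 09:00", "2019-08-07 10:00", "2019-08-07 11:00"]
        ),
        "prediction": [0.91, 0.12, 0.67],
        "validation": [1.0, 0.0, np.nan],  # last event is still in progress
    }
)

# Keep only rows where every required column has a value.
required = ["timestamp", "prediction", "validation"]
clean = raw.dropna(subset=required)

print(f"dropped {len(raw) - len(clean)} of {len(raw)} rows")
```

Rows dropped here are not lost for good: once the ground truth exists, a later run of the job can pick them up.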


After cleaning, metrics can be calculated. Since this model is a binary classifier, prediction results are calculated first. Prediction results are the individual units that comprise a confusion matrix. By specifying a threshold, the numeric predictions can be converted to Boolean values. These results are useful for visualizing a model’s performance on a time series with atomic granularity or aggregation to provide total counts of each prediction result category on an absolute scale.
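Here is a small sketch of that conversion. The threshold of 0.5 and the column names are assumptions for illustration; the right threshold depends on the model.

```python
import pandas as pd

THRESHOLD = 0.5  # assumed cut-off for turning a score into "yes" or "no"

df = pd.DataFrame(
    {
        "prediction": [0.91, 0.12, 0.67, 0.30],
        "validation": [1, 0, 0, 1],
    }
)

predicted = df["prediction"] >= THRESHOLD  # model says "yes"
actual = df["validation"].astype(bool)     # ground truth says "yes"

# One Boolean column per prediction result; exactly one is True in each row.
df["TP"] = predicted & actual
df["FP"] = predicted & ~actual
df["TN"] = ~predicted & ~actual
df["FN"] = ~predicted & actual

# Summing the Boolean columns gives the cells of the confusion matrix.
print(df[["TP", "FP", "TN", "FN"]].sum())
```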


Metrics are then derived from the prediction results. Metrics are rates, yielding a value on a normalized scale with a reduced granularity. On a time series, this is a repeating interval (e.g., hourly, daily, weekly). Examples of such metrics include accuracy, recall, AUC, and specificity.
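One way to compute such rates is to count the prediction results per time bucket and then divide. The sketch below uses daily buckets and hypothetical Boolean columns TP, FP, TN, and FN; it covers accuracy, recall, and specificity only.

```python
import pandas as pd

# Hypothetical prediction results: one row per prediction, one Boolean column per result.
pred = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2019-08-06 09:00", "2019-08-06 15:00", "2019-08-06 21:00",
             "2019-08-07 09:00", "2019-08-07 15:00", "2019-08-07 21:00"]
        ),
        "TP": [True, False, False, True, True, False],
        "FP": [False, False, True, False, False, False],
        "TN": [False, True, False, False, False, True],
        "FN": [False, False, False, False, False, False],
    }
).set_index("timestamp")

# Count each prediction result per calendar day.
counts = pred.astype(int).resample("D").sum()

# Turn the daily counts into rates.
metrics = pd.DataFrame(
    {
        "accuracy": (counts["TP"] + counts["TN"]) / counts.sum(axis=1),
        "recall": counts["TP"] / (counts["TP"] + counts["FN"]),
        "specificity": counts["TN"] / (counts["TN"] + counts["FP"]),
    }
)
print(metrics)
```

Changing the resample rule (for example to hourly or weekly) changes the granularity of the resulting time series.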


Due to the differing granularity of prediction results and metrics preventing their coexistence on a singular DataFrame, two “schemas” were developed for this project: pred and metrics. From a larger dataset of pseudo-random prediction and validation values:

Data transformed to pred schema

Data transformed to metrics schema

After processing, the data must be checked one last time for any null values. Null values at this stage usually occur in the metrics due to mathematical exceptions, such as a prediction result category having a count of zero in a given aggregation bucket, which results in a division by zero for a given metric. These null values may be handled in different ways.
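The snippet below sketches how such a null value can arise in pandas and three possible ways to deal with it. The counts are made up, and which option is appropriate depends on how the monitoring service should treat gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts; the second day has no positive events at all.
counts = pd.DataFrame(
    {"TP": [4, 0], "FP": [1, 2], "TN": [9, 8], "FN": [1, 0]},
    index=pd.to_datetime(["2019-08-06", "2019-08-07"]),
)

# pandas does not raise on 0 / 0; it yields NaN (and x / 0 yields inf).
recall = counts["TP"] / (counts["TP"] + counts["FN"])
recall = recall.replace([np.inf, -np.inf], np.nan)

print(recall.isna().sum(), "undefined value(s)")

dropped = recall.dropna()    # option 1: leave a gap in the time series
filled = recall.fillna(0.0)  # option 2: substitute a sentinel value
carried = recall.ffill()     # option 3: carry the last defined value forward
```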

ETL Job: Adding Transformed Data to the Monitoring Service (Load)

The transformed dataset is now ready for upload to a database. While a traditional SQL database could be utilized, a NoSQL database makes more sense. As models grow to serve more data, and more predictions must therefore be monitored, NoSQL databases facilitate scalability. There are many different flavors of NoSQL databases, such as MongoDB and Cassandra. In one implementation, data is stored in documents that have a unique id and are organized under an index. This is analogous to rows of data stored in a SQL database table. In order to upload the DataFrame to such a database, it must be converted into individual documents that reference a common index. An example of how to do this (modified from a tutorial on Towards Data Science) would look like this:

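def doc_generator(df, index_name):
    """Yield one document (dict) per DataFrame row for bulk upload."""
    for index, document in df.iterrows():
        yield {
            "index": index_name,             # index the document belongs to
            "id": "{}".format(index),        # unique id taken from the row's index
            "document": document.to_dict(),  # row contents as key-value pairs
        }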

Each row of the DataFrame is converted to a document that is identified by id (the unique index of the DataFrame row) and points to the index specified by index_name. The function returns a generator, which can then be passed to the appropriate calls to a NoSQL database to upload all the documents in bulk.
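To see what the generator produces, here is a small usage sketch on a two-row DataFrame. The function is repeated so the snippet runs on its own, and the sample metrics are made up. The actual bulk upload call is left out because it depends on the database client in use.

```python
import pandas as pd


def doc_generator(df, index_name):
    """Yield one document per DataFrame row."""
    for index, document in df.iterrows():
        yield {
            "index": index_name,
            "id": str(index),
            "document": document.to_dict(),
        }


df = pd.DataFrame(
    {"accuracy": [0.90, 0.87], "recall": [0.90, 0.87]},
    index=["2019-08-06", "2019-08-07"],
)

# Materialize the generator only to inspect it; a bulk upload would consume it lazily.
docs = list(doc_generator(df, "MyIndex"))
print(docs[0])
```

With a DataFrame that uses multiple indexes, the id becomes the string form of the index tuple, which is still unique per row.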


In order to add index updating functionality, metadata can be utilized. This metadata can include any number of data points that help summarize the index (e.g., information on the model that the data came from, the schema of data loaded by the ETL job, the timestamp of the latest entry, etc.). Although the processed model data at the index is uniform in structure per document, NoSQL databases, unlike SQL databases, do not enforce a strict schema. This means that documents can contain different key-value pairs of data. One implementation to add metadata would be to add a metadata document to each index.

"metadata": {
    "schema": "metrics",
    "index" : "MyIndex",
    "model_version": "2.3.1",
    "first_entry": "2019-01-21T10:35:27.620000",
    "last_updated": "2019-08-10T16:39:52.737836",
    "model_name": "MyModel",
    "created": "2019-03-27T09:13:46.121552",
    "doc_count": 1345,
    "last_entry": "2019-08-07T12:01:10.513751"
}

Most notable are the schema, last_entry, model_name, and model_version fields. The schema and last_entry fields contain information concerning the data’s structure and contents, which allow for dynamic querying and correct processing of new data. model_name and model_version contain application-specific data about the model and its version number, useful for a variety of purposes such as index identification and comparison of different model version performance. The sky’s the limit when determining what to include here as functionality is added to the monitoring pipeline.
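As an illustration of how last_entry can drive an update, the sketch below filters a freshly extracted batch down to rows newer than the last stored entry and then refreshes the metadata. The batch contents and the timestamp column name are assumptions; only the metadata keys come from the example above.

```python
import pandas as pd

metadata = {
    "schema": "metrics",
    "model_name": "MyModel",
    "model_version": "2.3.1",
    "last_entry": "2019-08-07T12:01:10.513751",
    "doc_count": 1345,
}

# Hypothetical batch pulled from the model; only the last two rows are new.
batch = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2019-08-07T11:00:00", "2019-08-07T12:00:00",
             "2019-08-08T09:30:00", "2019-08-09T09:30:00"]
        ),
        "accuracy": [0.90, 0.91, 0.89, 0.88],
    }
)

# Keep only rows that are strictly newer than the last stored entry.
last_entry = pd.Timestamp(metadata["last_entry"])
new_rows = batch[batch["timestamp"] > last_entry]

# Refresh the metadata once the new rows have been loaded.
if not new_rows.empty:
    metadata["last_entry"] = new_rows["timestamp"].max().isoformat()
    metadata["doc_count"] += len(new_rows)

print(metadata["last_entry"], metadata["doc_count"])
```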


Here’s a visual summary of the pipeline:

Upload Routine
1. Check for index existence
1a. If exists, switch to update mode at (1.)
2. Query and extract all model data to DataFrame

Update Routine
1. Fetch index metadata
2. Dynamically query and extract from last document timestamp onwards to DataFrame

ETL Job (Both routines converge here)
3. Transform DataFrame to specified schema
4. Convert DataFrame to generator yielding dict objects per row
5. Load data to NoSQL storage on the Monitoring Service
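The two routines can be sketched end to end in a few lines. To keep the example runnable, a plain Python dict stands in for the NoSQL storage and the transform step is a placeholder, so treat run_etl, transform, and store as hypothetical names rather than the project's actual code.

```python
import pandas as pd

store = {}  # stand-in for the NoSQL service: {index_name: {"metadata": ..., "docs": [...]}}


def transform(df):
    """Placeholder for step 3; a real job would build the pred or metrics schema."""
    return df.dropna()


def run_etl(source, index_name):
    """Run the upload routine for a new index, otherwise the update routine."""
    if index_name not in store:                                 # upload: step 1
        store[index_name] = {"metadata": {"last_entry": None}, "docs": []}
        df = source                                             # upload: step 2
    else:
        last = store[index_name]["metadata"]["last_entry"]      # update: step 1
        df = source if last is None else source[source["timestamp"] > last]  # update: step 2

    df = transform(df)                                          # step 3
    docs = (row.to_dict() for _, row in df.iterrows())          # step 4
    store[index_name]["docs"].extend(docs)                      # step 5
    if not df.empty:
        store[index_name]["metadata"]["last_entry"] = df["timestamp"].max()
    return len(df)


source = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2019-08-06", "2019-08-07", "2019-08-08"]),
        "accuracy": [0.90, 0.87, 0.88],
    }
)

first_run = run_etl(source.iloc[:2], "MyIndex")  # index does not exist yet: upload
second_run = run_etl(source, "MyIndex")          # index exists: only the new row is loaded
print(first_run, second_run)
```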


So far, we have been introduced to predictive models, how and why their performance can change over time, and why monitoring them is important. We have also discussed what data is essential for monitoring. Finally, we walked through how to build an ETL Job for transforming this data and moving it to a storage solution that is optimal for a model monitoring service.


The next post of this series will continue on my summer research by discussing how a monitoring service might be implemented and automated. In the future, our goal will be to integrate a standardized monitoring system with data science models in production.


References and Embedded Links

  • Bhatia, Richa. “How Do Machine Learning Algorithms Differ From Traditional Algorithms?” Analytics India Magazine, Analytics India Magazine Pvt Ltd., 10 Sept. 2018.

  • “Boolean Data Type.” Wikipedia, Wikimedia Foundation, 9 Aug. 2019.

  • Dungan, John A. “Exporting Pandas Data to Elasticsearch.” Towards Data Science, Medium, 23 Feb. 2019.

  • Edlich, Stefan. “List Of NOSQL Databases [Currently >225].” NOSQL Databases, 3 July 2019.

  • “Feature (Machine Learning).” Wikipedia, Wikimedia Foundation, 1 Aug. 2019.

  • Issac, Luke P. “SQL vs NoSQL Database Differences Explained with Few Example DB.” The Geek Stuff, Ramesh Nataranjan, 14 Jan. 2014.

  • “Manage Massive Amounts of Data, Fast, without Losing Sleep.” Apache Cassandra, The Apache Software Foundation, 2016.

  • Markham, Kevin. “Simple Guide to Confusion Matrix Terminology.” Data School, Data School, 25 Mar. 2014.

  • “The Most Popular Database for Modern Apps.” MongoDB, MongoDB, Inc., 2019.

  • Narkhede, Sarang. “Understanding AUC - ROC Curve.” Towards Data Science, Medium, 26 June 2018.

  • “Pandas.DataFrame.” Pandas.DataFrame - Pandas 0.25.0 Documentation, NumFOCUS.

  • “Project Jupyter.” Project Jupyter, Project Jupyter, 18 July 2019.

  • “Python Data Analysis Library.” Pandas, NumFOCUS.

James Bertrand