A machine learning model is only useful when it is used in production to solve business problems. However, both the business problem and the machine learning model are constantly evolving. That's why we need to maintain the model so that its performance stays aligned with the business KPIs. This is where the concept of MLOps comes from.
MLOps, or Machine Learning Operations, is a collection of techniques and tools for machine learning in production. MLOps covers everything from automation and versioning to deployment and monitoring. This article will focus on monitoring, and on how we can use a Python package to monitor a model in production. Let's get into it.
When we talk about monitoring in MLOps, it can refer to many things, as monitoring is one of the core principles of MLOps. For example:
– Tracking changes in data distribution over time
– Tracking features used in development versus production
– Monitoring model degradation
– Monitoring model performance
– Monitoring system health
There are many elements to monitor in MLOps, but in this article we will focus on monitoring model performance. Model performance, in our case, refers to the model's ability to make reliable predictions on unseen data, measured by specific metrics such as accuracy, precision, recall, etc.
Why do we need to monitor model performance? To maintain the reliability of the model's predictions in solving the business problem. Before production, we often calculate the model's performance and its impact on the KPI; for example, we might set a baseline of 70% accuracy, meaning the model still meets the business needs as long as it stays above that threshold, while anything below it is unacceptable. Performance monitoring is what lets us make sure the model always meets the business requirements.
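To make the idea concrete, here is a minimal sketch of such a baseline check; the labels and predictions below are toy placeholders, and in practice they would come from production data:

from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.70  # the business requirement from the example above

# Toy labels and predictions; in practice these come from production data
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # 0.8 in this toy example
if accuracy < BASELINE_ACCURACY:
    # The MLOps pipeline could raise an alert or trigger retraining here
    print(f"Accuracy {accuracy:.2f} is below the {BASELINE_ACCURACY:.0%} baseline")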
Using Python, we will learn how to monitor the model. Let's start by installing the package. There are many choices for model monitoring, but for this example we'll use an open-source monitoring package called Evidently.
First, we need to install the Evidently package with the following code.
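pip install evidently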
After installing the package, we will download a sample dataset, the insurance claim data from Kaggle. We also clean the data before using it further.
import pandas as pd

df = pd.read_csv("insurance_claims.csv")

# Sort the data based on the incident date
df = df.sort_values(by="incident_date").reset_index(drop=True)

# Variable selection
df = df[
    [
        "incident_date",
        "months_as_customer",
        "age",
        "policy_deductable",
        "policy_annual_premium",
        "umbrella_limit",
        "insured_sex",
        "insured_relationship",
        "capital-gains",
        "capital-loss",
        "incident_type",
        "collision_type",
        "total_claim_amount",
        "injury_claim",
        "property_claim",
        "vehicle_claim",
        "incident_severity",
        "fraud_reported",
    ]
]

# Data cleaning and one-hot encoding
df = pd.get_dummies(
    df,
    columns=[
        "insured_sex",
        "insured_relationship",
        "incident_type",
        "collision_type",
        "incident_severity",
    ],
    drop_first=True,
)
df["fraud_reported"] = df["fraud_reported"].apply(lambda x: 1 if x == "Y" else 0)
df = df.rename(columns={"incident_date": "timestamp", "fraud_reported": "target"})

# Cast all numeric columns to float
for i in df.select_dtypes("number").columns:
    df[i] = df[i].apply(float)

# Split into reference data (data) and current data (val)
data = df[df["timestamp"] < "2015-02-20"].copy()
val = df[df["timestamp"] >= "2015-02-20"].copy()
In the code above, we select a few columns for model-training purposes, convert them into a numeric representation, and split the data into a reference dataset (data) and a current dataset (val).
We need the reference or baseline data to monitor model performance in the MLOps pipeline. It is usually data held out from training (for example, test data). We also need the current data, i.e., data not seen by the model (incoming input data).
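If the default column detection is not enough, Evidently also lets us declare the column roles explicitly via its ColumnMapping object. The following is a minimal sketch for our dataset; the report calls below simply pass column_mapping=None and rely on the defaults instead:

from evidently import ColumnMapping

# Optional: tell Evidently explicitly which columns play which role
column_mapping = ColumnMapping(
    target="target",        # the label column we created above
    datetime="timestamp",   # the renamed incident_date column
)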
We will use Evidently to monitor both the data and the model performance. Since data drift will affect model performance, it should also be monitored.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

data_drift_report = Report(metrics=[
    DataDriftPreset(),
])
data_drift_report.run(current_data=val, reference_data=data, column_mapping=None)
data_drift_report.show(mode="inline")
Evidently automatically displays a report of what happened to the dataset. The information includes the dataset drift and the column drift. In the example above, we do not have dataset drift as a whole, but two columns have drifted.
The report shows that the "property_claim" and "timestamp" columns are detected as drifting. This information can be used in the MLOps pipeline to trigger model retraining, or it may mean we need to investigate the data further.
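For use inside a pipeline rather than a notebook, we can also save the report as a standalone HTML file; the file name here is just an example:

data_drift_report.save_html("data_drift_report.html")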
If needed, we can also get the report data as a dictionary (log) object.
data_drift_report.as_dict()
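A minimal sketch of how this dictionary could drive an automated decision; the exact keys below (metrics, result, dataset_drift) follow Evidently's report structure but may vary between versions, so treat them as an assumption to verify:

report_dict = data_drift_report.as_dict()

# The first metric in DataDriftPreset summarizes dataset-level drift;
# the key names are assumptions that may differ across Evidently versions
dataset_drift = report_dict["metrics"][0]["result"]["dataset_drift"]
if dataset_drift:
    print("Dataset drift detected; consider retraining the model")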
Next, let's train a classifier model on the data and use Evidently to monitor the model's performance.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(data.drop(["target", "timestamp"], axis=1), data["target"])
Evidently requires both the target and prediction columns to be present in the reference and current datasets. Let's add the model predictions to both datasets and use Evidently to monitor the performance.
data["prediction"] = rf.predict(data.drop(["target", "timestamp"], axis=1))
val["prediction"] = rf.predict(val.drop(["target", "timestamp"], axis=1))
Note that for real-world cases it is better if the reference data used to monitor model performance is not the training data. Let's set up the model performance monitoring with the following code.
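A minimal sketch using Evidently's ClassificationPreset, mirroring the data drift report above; it picks up the "target" and "prediction" columns we just created:

from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

classification_report = Report(metrics=[
    ClassificationPreset(),
])
classification_report.run(current_data=val, reference_data=data, column_mapping=None)
classification_report.show(mode="inline")

The resulting report includes classification quality metrics such as accuracy, precision, recall, and F1 for the reference and current data, which we can compare against the business baseline discussed earlier.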