A Beginner’s Guide to Databricks
Databricks enables data scientists to easily create and manage notebooks for research, experimentation, and deployment. The appeal of platforms like Databricks includes seamless integration with cloud services, model maintenance tools, and scalability.
Databricks is very useful for model experimentation and maintenance. Databricks integrates tightly with MLflow, a platform for managing the machine learning lifecycle that provides useful tools for model development and deployment. With MLflow, you can register models along with metadata associated with them, such as performance metrics and hyperparameters. This makes it very easy to run experiments and analyze the results.
Many Databricks features are useful for scaling steps in a machine learning workflow, such as data loading, model training, and model tracking. Koalas is a library available in Databricks that provides a distributed, Spark-backed alternative to Pandas. Pandas User-Defined Functions (UDFs) allow you to apply custom functions that are typically computationally expensive in a distributed manner, which can significantly reduce runtime. Databricks also allows you to configure jobs on larger machines, which can be useful when dealing with big data and heavy computation. In addition, the model registry allows you to track and save experiment results for hundreds or even thousands of models. This is useful for scaling the number of models a researcher develops and ultimately deploys.
In this article, we’ll cover some of the basics of Databricks. First, we’ll walk through a simple data science workflow in which we build a churn classification model. We’ll then see how tools like Koalas and Pandas UDFs can speed up specific operations. Finally, we’ll see how to use MLflow to track experiments and inspect the results.
Here, we will work with the Telco churn data set. This data contains customer billing information for a fictitious telco company. It specifies whether a customer has stopped or continued using the service, which is known as churn. The data is publicly available and is free to use, share, and modify under the Apache 2.0 license.
Getting started
To get started, go to the Databricks website and click “Get started for free”:
You should see the following:
Enter your information and click Continue. Next, you will be asked to select a cloud platform. In this article, we will not work with any external cloud platform. At the bottom of the right panel, click the “Start Community Edition” button.
Then follow the steps to create a Community Edition account.
Import data
Let’s start by navigating to the “Data” tab in the left pane:
Then click on “Create Table”:
Then drag and drop the Churn CSV file into the space that says “Drop files to upload or click to browse”
When uploading the CSV, you should see the following:
Then click the “Create Table in Notebook” button. This will display an example notebook containing the logic for writing this file to the Databricks File Store (DBFS):
DBFS allows Databricks users to upload and manage data. The system is distributed, so it is very useful for storing and managing large amounts of data.
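To confirm the upload landed in DBFS, you can list the upload directory from any notebook cell using the dbutils file system utilities (the path below assumes the default FileStore location used by the upload UI):
# List the files uploaded through the UI (default location is /FileStore/tables)
display(dbutils.fs.ls("/FileStore/tables"))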
The first cell defines the logic to read the Churn data we uploaded:
# File location and type
file_location = "/FileStore/tables/telco_churn-1.csv"
file_type = "csv"# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
If we run this cell, we get the following result:
We can see that the table contains column names that are not very useful (_c0, _c1, and so on). To fix this, we need to set first_row_is_header = "true". It is also worth setting infer_schema = "true" so that numeric columns such as tenure and MonthlyCharges are read as numbers rather than strings, which the model-building code later on relies on:
first_row_is_header = "true"
infer_schema = "true"
When we run this cell, we now get:
If you click on the table, you can scroll to the right and see additional columns in the data:
Building a classification model
Let’s proceed to build a Churn classification model in Databricks using our uploaded data. Click the “Create” button on the left panel:
Next, click on “Notebook”:
Let’s call our notebook “churn_model”:
Now we can copy the logic from the DBFS example notebook that will allow us to access the data:
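As a minimal sketch, that logic mirrors the cell generated by the example notebook, using the file path from the upload step and the options discussed above:
file_location = "/FileStore/tables/telco_churn-1.csv"

df = spark.read.format("csv") \
.option("inferSchema", "true") \
.option("header", "true") \
.option("sep", ",") \
.load(file_location)

display(df)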
Then convert the spark dataframe to a pandas dataframe:
df_pandas = df.toPandas()
Let’s build a Catboost classification model. Catboost is a tree-based ensemble machine learning algorithm that uses gradient boosting to improve the performance of the sequential trees used in the ensemble.
Let’s install the Catboost package. We do this in the top cell of the notebook:
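A notebook-scoped install works here, for example:
%pip install catboost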
And let’s build a Catboost churn classification model. We’ll use tenure, monthly charges, and contract type to predict churn. First, let’s convert the churn column to binary values:
import numpy as np
df_pandas['churn_label'] = np.where(df_pandas['Churn']== 'No', 0, 1)
X = df_pandas[["tenure", "MonthlyCharges", "Contract"]]
y = df_pandas['churn_label']
Catboost allows us to handle categorical variables directly without the need to convert them to machine-readable codes. To do this, we simply define a list containing the categorical column names:
cats = ["Contract"]
When defining the Catboost model object, we set the cat_features parameter equal to this list. Let’s split our data for training and testing:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
And we can fit the Catboost model. We’ll just use the default parameter values:
from catboost import CatBoostClassifier

model = CatBoostClassifier(cat_features=cats, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
And we can evaluate the performance:
from sklearn.metrics import accuracy_score, precision_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Precision: ", precision)
Koalas
Here we have converted the Spark dataframe to a Pandas dataframe. This is fine for our small dataset, but as the dataset grows, Pandas becomes slow and inefficient. An alternative to Pandas is the Koalas library. Koalas is a package developed by Databricks that provides a distributed version of the Pandas API. To use Koalas, we install it at the top of our notebook:
%pip install -U koalas
And import Koalas from the databricks namespace:
from databricks import koalas as ks
And to convert our Spark dataframe to a Koalas dataframe, we do the following:
df_koalas = ks.DataFrame(df)
df_koalas.head()
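Once the data is in a Koalas dataframe, most familiar Pandas operations run on Spark under the hood. As a small sketch (the column names come from the churn data set; the cast is only needed if the column was read as a string):
# Pandas-style groupby executed on Spark
df_koalas["MonthlyCharges"] = df_koalas["MonthlyCharges"].astype(float)
print(df_koalas.groupby("Contract")["MonthlyCharges"].mean())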
Pandas UDF
Pandas UDFs are another useful tool in Databricks. They allow you to apply a function to a data frame in a distributed manner, which is useful for increasing the efficiency of calculations performed on large data frames. For example, we can define a function that takes a data frame and builds a Catboost model, then use a Pandas UDF to apply this function at a grouped or categorical level. Let’s build a separate model for each type of internet service (the InternetService column).
To begin, we need to define our function and schema for the Pandas UDF. A schema simply defines column names and their data types:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, StringType

churn_schema = StructType(
[
StructField("tenure", FloatType()),
StructField("Contract", StringType()),
StructField("InternetService", StringType()),
StructField("MonthlyCharges", FloatType()),
StructField("Churn", FloatType()),
StructField("Predictions", FloatType()),
]
)
Next we define our function. We simply include the logic we defined earlier in a function called “build_model”. To use the pandas UDF we add the ‘@pandas_udf’ decorator:
@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
And we can include the model-building logic in our function. We will also store the predicted and true churn values, along with the internet service category, in our output data frame:
@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
    df['churn_label'] = np.where(df['Churn'] == 'No', 0, 1)
    X = df[["tenure", "MonthlyCharges", "Contract"]]
    y = df['churn_label']
    cats = ["Contract"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = CatBoostClassifier(cat_features=cats, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Build the output so that its columns and types match churn_schema
    output = X_test.copy()
    output['Predictions'] = y_pred.astype(float)
    output['Churn'] = y_test.astype(float)
    output['InternetService'] = df['InternetService']  # aligned on the row index
    output[['tenure', 'MonthlyCharges']] = output[['tenure', 'MonthlyCharges']].astype(float)
    return output[["tenure", "Contract", "InternetService", "MonthlyCharges", "Churn", "Predictions"]]
Finally we can apply this function to our data frame. Let’s convert our Koalas dataframe back to a spark dataframe:
df_spark = df_koalas.to_spark()
churn_results = df_spark.groupBy('InternetService').apply(build_model)
We can convert the resulting Spark dataframe to a Pandas dataframe (we could also convert it to Koalas) and display the first five rows:
churn_results = churn_results.toPandas()
churn_results.head()
Although we only stored predictions here, you can use a Pandas UDF to store any information produced by a calculation on a data frame. An interesting exercise is to include the accuracy score and precision score for each internet service type in the output data frame; a sketch of a first step toward that is shown below.
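As a starting point for that exercise, here is one possible sketch that computes accuracy per internet service type from the returned predictions (it assumes the column names defined in churn_schema above):
# Compare true and predicted churn within each InternetService group
per_group_accuracy = (
    churn_results.assign(correct=churn_results["Churn"] == churn_results["Predictions"])
    .groupby("InternetService")["correct"]
    .mean()
)
print(per_group_accuracy)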
Getting started with MLflow
Another useful tool in Databricks is MLflow. MLflow allows you to easily run, track, and analyze experiments. For this demonstration, we will work with the first model object we defined earlier in the notebook. Let’s install MLflow at the top of our notebook:
%pip install -U mlflow
and import MLflow:
import mlflow
Let’s continue by setting the name of the experiment:
mlflow.set_experiment("/Users/spierre91@gmail.com/churn_model")
One thing we can log is the Catboost feature importance, which will allow us to analyze which features are important for predicting churn:
feature_importance = pd.DataFrame(
    {"variable": model.feature_names_, "importance": model.feature_importances_}
)
feature_importance.to_csv("/feature_importance.csv")
Then we can log our Catboost model using the log_model method:
with mlflow.start_run(run_name="churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")
We get a notification saying that an MLflow run has been logged to the experiment:
We can click on the run and see the following:
This is where we can see metrics such as model performance and artifacts such as feature importance. We will show how to log both of these to MLflow shortly.
We can also click on the experiment:
This is where we see every run associated with the experiment. This is useful for tracking experiments, such as changing Catboost parameters, training data, or engineered features.
Finally, let’s log the feature importance as an artifact, the accuracy and precision scores as metrics, and the list of categorical inputs as a parameter:
with mlflow.start_run(run_name="churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")
    mlflow.log_artifact("/feature_importance.csv")
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Accuracy", accuracy)
    mlflow.log_param("Categories", cats)
If we click on the run, we’ll see that we’ve logged the feature importance, accuracy score, precision score, and categorical inputs:
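This logging pattern extends naturally to tracking many experiments. As a rough sketch, one run per Catboost depth setting could be logged as follows (the loop and the depth values are illustrative assumptions, not part of the original notebook):
for depth in [4, 6, 8]:
    with mlflow.start_run(run_name=f"churn_model_depth_{depth}"):
        # Train a Catboost model with this depth and log its accuracy
        tuned_model = CatBoostClassifier(cat_features=cats, depth=depth, random_state=42)
        tuned_model.fit(X_train, y_train)
        tuned_accuracy = accuracy_score(y_test, tuned_model.predict(X_test))
        mlflow.log_param("depth", depth)
        mlflow.log_metric("Accuracy", tuned_accuracy)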
The Databricks notebook code has been ported to an ipython file and is available on GitHub.
Conclusion
In this post, we discussed how to get started with Databricks. First, we saw how to upload data to DBFS. We then created a notebook and showed how to access the uploaded file from it. We then discussed the tools available in Databricks that help data scientists and researchers scale their data science solutions. First, we saw how to convert Spark dataframes to Koalas dataframes, which are a faster alternative to Pandas. We next saw how to apply custom functions to data frames in a distributed manner using Pandas UDFs, which is very useful for computationally heavy tasks that need to be performed on large data frames. Finally, we saw how to log metrics, parameters, and artifacts associated with modeling experiments using MLflow. Familiarity with these tools is important for anyone working in data science, machine learning, or machine learning engineering.