Getting started with Databricks: A Beginner’s Guide to Databricks

By Sadrach Pierre, Ph.D. | Towards Data Science | 10 May 2023

Photo by Alexander Gray from Pexels

    Databricks enables data scientists to easily create and manage notebooks for research, experimentation, and deployment. The appeal of platforms like Databricks includes seamless integration with cloud services, model maintenance tools, and scalability.

Databricks is very useful for model experimentation and maintenance. It integrates MLflow, an open-source machine learning platform that provides useful tools for model development and deployment. With MLflow, you can register models along with their associated metadata, such as performance metrics and hyperparameters. This makes it very easy to run experiments and analyze the results.

Many Databricks features are useful for scaling steps in a machine learning workflow, such as data loading, model training, and model tracking. Koalas is a library available in Databricks that provides a distributed, more scalable alternative to Pandas. Pandas User-Defined Functions (UDFs) let you apply custom, typically computationally expensive functions in a distributed manner, which can significantly reduce runtime. Databricks also allows you to configure jobs on larger machines, which is useful for big data and heavy computation. In addition, the model registry lets you run and save experiment results for hundreds or even thousands of models. This is useful for scaling the number of models a researcher develops and ultimately deploys.

In this article, we’ll cover some of the basics of Databricks. First, we’ll walk through a simple data science workflow in which we build a churn classification model. We’ll then see how to use tools like Koalas and Pandas UDFs to speed up specific operations. Finally, we’ll see how to use MLflow to help us run experiments and inspect the results.

Here, we will work with the Telco churn data set. This data contains customer billing information for a fictitious telecom company. It specifies whether a customer has stopped or continued using the service, which is known as churn. The data is publicly available and free to use, share, and modify under the Apache 2.0 license.

Getting started

    To get started, go to the Databricks website and click “Get started for free”:

    The screenshot was taken by the author

    You should see the following:

    The screenshot was taken by the author

Enter your information and click Continue. Next, you will be asked to select a cloud platform. In this article, we will not work with external cloud platforms. At the bottom of the right panel, click the “Start Community Edition” button.

    The screenshot was taken by the author

    Then follow the steps to create a Community Edition account.

    Import data

    Let’s start by navigating to the “Data” tab in the left pane:

    The screenshot was taken by the author

    Then click on “Data” and then click on Create Table:

    The screenshot was taken by the author

    Then drag and drop the Churn CSV file into the space that says “Drop files to upload or click to browse”

    The screenshot was taken by the author

    When uploading the CSV, you should see the following:

    The screenshot was taken by the author

Then click the “Create Table in Notebook” button. An example notebook containing the logic for writing this file to the Databricks File Store (DBFS) will be displayed:

    The screenshot was taken by the author

    DBFS allows Databricks users to upload and manage data. The system is distributed, so it is very useful for storing and managing large amounts of data.
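For example, after the upload you can browse DBFS directly from a notebook cell. A minimal sketch (the path below assumes the default upload location shown in the dialog):

# List the files uploaded through the "Create Table" dialog
display(dbutils.fs.ls("/FileStore/tables"))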

    The first cell defines the logic to read the Churn data we uploaded:

# File location and type
file_location = "/FileStore/tables/telco_churn-1.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)

    If we run this cell, we get the following result:

    The screenshot was taken by the author

We can see that the table contains column names that are not very useful (_c0, _c1, etc.). To fix this, we need to set first_row_is_header = “true”:

    first_row_is_header = "true"

    When we run this cell, we now get:

    The screenshot was taken by the author

    If you click on the table, you can scroll to the right and see additional columns in the data:

    The screenshot was taken by the author
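Alternatively, you can inspect the columns programmatically instead of scrolling through the rendered table; this is a small sketch using standard Spark DataFrame methods:

# Print the schema and the full list of column names
df.printSchema()
print(df.columns)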

    Building a classification model

    Let’s proceed to build a Churn classification model in Databricks using our uploaded data. Click the “Create” button on the left panel:

    The screenshot was taken by the author

Next, click on “Notebook”:

    The screenshot was taken by the author

    Let’s call our notebook “churn_model”:

    The screenshot was taken by the author

    Now we can copy the logic from the DBFS example notebook that will allow us to access the data:

    The screenshot was taken by the author

    Then convert the spark dataframe to a pandas dataframe:

    df_pandas = df.toPandas()

Let’s build a CatBoost classification model. CatBoost is a tree-based ensemble machine learning algorithm that uses gradient boosting: trees are added sequentially, with each new tree correcting the errors of the trees before it.

Let’s install the CatBoost package. We do this in a cell at the top of the notebook:

    The screenshot was taken by the author
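If you are following along without the screenshot, the install cell typically looks like the following (a minimal sketch; pin a version if you need reproducibility):

# Install CatBoost into the notebook's Python environment
%pip install catboost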

Now let’s build a CatBoost churn classification model. We’ll use tenure, monthly charges, and contract type to predict churn. First, let’s convert the churn column to binary values:

    import numpy as np 
    df_pandas['churn_label'] = np.where(df_pandas['Churn']== 'No', 0, 1)
    X = df_pandas[["tenure", "MonthlyCharges", "Contract"]]
    y = df_pandas['churn_label']

    Catboost allows us to handle categorical variables directly without the need to convert them to machine-readable codes. To do this, we simply define a list containing the categorical column names:

    cats = ["Contract"]

    When defining the Catboost model object, we set the cat_features parameter equal to this list. Let’s split our data for training and testing:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we can define the CatBoost model. We’ll just use the default parameter values. Note that we first need to import CatBoostClassifier from the catboost package we installed earlier:

from catboost import CatBoostClassifier

model = CatBoostClassifier(cat_features=cats, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

    And we can evaluate the performance:

from sklearn.metrics import accuracy_score, precision_score

# Note: scikit-learn metrics expect (y_true, y_pred) in that order
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Precision: ", precision)

    The screenshot was taken by the author

    Koalas

So far we have converted the Spark dataframe to a Pandas dataframe. This is fine for our small dataset, but as datasets grow, Pandas becomes slow and inefficient. An alternative to Pandas is the Koalas library. Koalas is a package developed by Databricks that provides a distributed implementation of the Pandas API. To use Koalas, we install the koalas package at the top of our notebook:

%pip install -U koalas

Then we import Koalas from databricks:

    from databricks import koalas as ks

And to convert our Spark dataframe to a Koalas dataframe, we do the following:

    df_koalas = ks.DataFrame(df)
    df_koalas.head()
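Koalas dataframes support most of the familiar Pandas operations while distributing the work across Spark. A small sketch, assuming the column names from the Telco churn file:

# Pandas-style operations, executed on Spark under the hood
df_koalas['InternetService'].value_counts()
print(df_koalas.shape)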

    Pandas UDF

Pandas UDFs are another useful tool in Databricks. They allow you to apply a function to a dataframe in a distributed manner, which is useful for speeding up computations performed on large dataframes. For example, we can define a function that takes a dataframe and builds a CatBoost model, and then use a Pandas UDF to apply this function at a grouped or categorical level. Let’s build a separate model for each type of internet service.

    To begin, we need to define our function and schema for the Pandas UDF. A schema simply defines column names and their data types:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, FloatType
import pandas as pd

churn_schema = StructType(
    [
        StructField("tenure", FloatType()),
        StructField("Contract", StringType()),
        StructField("InternetService", StringType()),
        StructField("MonthlyCharges", FloatType()),
        StructField("Churn", FloatType()),
        StructField("Predictions", FloatType()),
    ]
)

Next, we define our function. We simply wrap the logic we defined earlier in a function called “build_model”. To use the Pandas UDF, we add the ‘@pandas_udf’ decorator:

    @pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
    def build_model(df: pd.DataFrame) -> pd.DataFrame:

And we can include the model-building logic in our function. We will also store the predicted and true churn values in the output dataframe:

@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
    # Convert the churn labels to binary values for this group
    df['churn_label'] = np.where(df['Churn'] == 'No', 0, 1)
    X = df[["tenure", "MonthlyCharges", "Contract"]]
    y = df['churn_label']
    cats = ["Contract"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = CatBoostClassifier(cat_features=cats, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Return a dataframe whose columns match churn_schema
    output = X_test
    output['Predictions'] = y_pred
    output['Churn'] = y_test
    output['InternetService'] = df['InternetService']
    return output

Finally, we can apply this function to our dataframe. Let’s convert our Koalas dataframe back to a Spark dataframe:


df_spark = df_koalas.to_spark()

churn_results = (
    df_spark.groupBy('InternetService').apply(build_model)
)

We can convert the resulting Spark dataframe to a Pandas dataframe (we could also convert it to Koalas) and display the first five rows:

    churn_results = churn_results.toPandas()
    churn_results.head()
    The screenshot was taken by the author

Although we only stored predictions here, you can use a Pandas UDF to store any information obtained from a calculation performed on a dataframe. An interesting exercise is to include the accuracy score and precision score in the output Spark dataframe for each internet service category; one way to do this is sketched below.
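A sketch of that exercise, under the assumption that the schema and the function above are extended together (the new column names are illustrative):

# 1) Add two metric fields to churn_schema:
#        StructField("Accuracy", FloatType()),
#        StructField("Precision", FloatType()),
# 2) Inside build_model, after computing y_pred, attach the group-level
#    scores to every row of that group's output:
#        output['Accuracy'] = accuracy_score(y_test, y_pred)
#        output['Precision'] = precision_score(y_test, y_pred)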

    Getting started with MLflow

Another useful tool in Databricks is MLflow. MLflow allows you to easily run, track, and analyze experiments. For this demonstration, we will work with the first model object we defined earlier in the notebook. Let’s install MLflow at the top of our notebook:

    %pip install -U mlflow

and import MLflow:

    import mlflow

    Let’s continue by setting the name of the experiment:

mlflow.set_experiment(
    "/Users/spierre91@gmail.com/churn_model"
)
    The screenshot was taken by the author

One thing we can log is the CatBoost feature importance, which will allow us to analyze which features are important for predicting churn:

feature_importance = pd.DataFrame(
    {"variable": model.feature_names_, "importance": model.feature_importances_}
)
feature_importance.to_csv("/feature_importance.csv")
    The screenshot was taken by the author

Then we can log our CatBoost model using the log_model method:

with mlflow.start_run(run_name="churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")

We get a notification saying that one run has been logged to the MLflow experiment:

    The screenshot was taken by the author

We can click on the run and see the following:

    The screenshot was taken by the author

This is where we can see metrics such as model performance and model artifacts such as feature importances. We will show how to log both of these to MLflow shortly.

We can also click on the experiment:

    The screenshot was taken by the author

This is where we see every run associated with the experiment. This is useful for keeping track of experiments, such as runs in which we change CatBoost parameters, training data, engineered features, and so on.
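You can also pull runs into a dataframe and analyze them programmatically instead of through the UI. A minimal sketch, assuming the experiment set earlier is the active one:

# Returns a Pandas dataframe with one row per run, including logged
# metrics, params, and tags as columns
runs = mlflow.search_runs()
runs[["run_id", "status", "start_time"]].head()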

Finally, let’s log the feature importance as an artifact, the accuracy and precision scores as metrics, and the list of categorical inputs as a parameter:

with mlflow.start_run(run_name="churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")
    mlflow.log_artifact("/feature_importance.csv")
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Accuracy", accuracy)
    mlflow.log_param("Categories", cats)

If we click on the run, we’ll see that we’ve logged the feature importance, the accuracy and precision scores, and the categorical inputs:

    The screenshot was taken by the author
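Once logged, the model can be loaded back in a later notebook or job. A minimal sketch (the run ID is a placeholder you would copy from the run page; “Catboost Model” matches the artifact path passed to log_model above):

# Hypothetical run ID copied from the MLflow UI
run_id = "<your-run-id>"
loaded_model = mlflow.sklearn.load_model(f"runs:/{run_id}/Catboost Model")
print(loaded_model.predict(X_test[:5]))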

    The Databricks notebook code has been ported to an ipython file and is available on GitHub.

Conclusion

In this post, we discussed how to get started with Databricks. First, we saw how to upload data to DBFS. We then created a notebook and showed how to access the uploaded file from it. We went on to discuss the tools available in Databricks that help data scientists and researchers scale their data science solutions. First, we saw how to convert Spark dataframes to Koalas dataframes, which are a faster alternative to Pandas. We then saw how to apply custom functions to dataframes in a distributed manner using Pandas UDFs, which is very useful for computationally heavy tasks that need to be performed on large dataframes. Finally, we saw how to log metrics, parameters, and artifacts associated with modeling experiments using MLflow. Familiarity with these tools is important for anyone working in data science, machine learning, or machine learning engineering.
