Introduction:
Microsoft Fabric is a cloud-based platform that provides a unified data science, data engineering, and business intelligence experience. It offers a variety of features and services such as data preparation, machine learning and visualization. Fabric’s comprehensive toolset enables data professionals and business users alike to unlock the full potential of their data and shape the future of artificial intelligence.
At its core, Fabric offers services such as Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real-Time Analytics, and Power BI. Together they provide a comprehensive and powerful solution for your data science needs, from data integration and engineering to real-time analytics and visualization.
In this blog, we’ll focus on Fabric’s data science services: we’ll show how to use Microsoft Fabric to build a diabetes prediction model, and we’ll explore some useful notebook tools.
To access Microsoft Fabric, create an account at app.fabric.microsoft.com for a free trial or if you’re an existing Power BI user, you can sign in using your Power BI account credentials.
Check out our blog on Mastering Data Science with Microsoft Fabric: An Introduction to Fabric Notebook Features to learn how to take advantage of the amazing capabilities that will improve your data exploration and experimentation.
Fabric Lake House and Notebooks:
To start with our diabetes prediction, we will use the “pima-indians-diabetes” dataset from Kaggle, which contains records for 768 patients.
Data can be structured or unstructured, and Fabric’s Lakehouse can store both. The Lakehouse is a data architecture platform for managing and analyzing data: it scales to handle huge volumes and supports a variety of data processing tools and frameworks. For more information, see What is a lakehouse in Microsoft Fabric?
In the data science experience, Fabric uses the notebook artifact to demonstrate the framework’s diverse capabilities: notebooks make it easy to develop and deploy machine learning experiments. The data science service and the notebook provide a wide range of functions, which will be discussed later. See How to use Microsoft Fabric notebooks to learn more about the data science services.
Follow the steps below to save files/data to Lakehouse:
- Go to the Microsoft Fabric home and select Data Engineering from the menu.
- Create a new Lakehouse
- Upload files from your local device. You will find the uploaded files in the existing “Files” folder.
Now let’s see how we can train our model to predict diabetes.
- You can create a new notebook or import an existing notebook from the Data Engineering home page (the image shows step #2) or the Data Science home page as shown in the image below.
- Connect a Lakehouse to your notebook: either create a new one or connect to an existing Lakehouse.
- Please follow this notebook code to train a machine learning model for predicting diabetes.
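Before training, the notebook’s preparation step typically loads the uploaded CSV and splits it into train and test sets. Here is a minimal sketch of that step; the Lakehouse file path is an assumption, and a random stand-in of the same shape replaces the real data so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Columns of the pima-indians-diabetes dataset; "Outcome" is the 0/1 label.
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

# In a Fabric notebook you would read the file uploaded to the Lakehouse,
# e.g. df = pd.read_csv("/lakehouse/default/Files/diabetes.csv") (path is an
# assumption); here random values of the same shape keep the sketch runnable.
rng = np.random.default_rng(12345)
df = pd.DataFrame(rng.integers(0, 200, size=(768, len(cols))), columns=cols)
df["Outcome"] = rng.integers(0, 2, size=768)

X = df.drop(columns="Outcome")  # the 8 feature columns
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)
print(X_train.shape, X_test.shape)  # (614, 8) (154, 8)
```

The `X`, `y`, `X_train`, `X_test`, `y_train`, and `y_test` names produced here are the ones the training cell below expects.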
Machine learning model training and prediction score
This section walks through training a scikit-learn model, saving the trained model, and using the stored model for predictions once training is complete. For more information on Fabric models, please see How to train models with scikit-learn in Microsoft Fabric.
Please note that the code in this section is specifically designed for Microsoft Fabric Notebook. Attempting to run it on other platforms such as Colab may result in errors, because the PREDICT function used below requires models saved in the MLflow format and registered in the Fabric workspace, and it runs on Fabric’s Spark engine.
- A machine learning experiment is the basic unit of organization and management for all related machine learning runs. Run the code below to create the experiment for the trained model.
import mlflow
mlflow.set_experiment("Diabetes-Prediction")
It will create a new experiment called “Diabetes-Prediction” in your workspace. You can check Machine Learning Experiments in Microsoft Fabric to learn more about experiments.
Or you can create an experiment using the interface (from your workspace, select an experiment from the drop-down)
- The following code uses the MLflow API to start an MLflow run for an LGBMClassifier model (LightGBM’s scikit-learn-compatible estimator). The model version is then saved and registered in the Microsoft Fabric workspace.
In the code below, set your own model name in mlflow.sklearn.log_model().
import mlflow
import mlflow.sklearn
from lightgbm import LGBMClassifier
from mlflow.models.signature import infer_signature
from sklearn.metrics import accuracy_score

mlflow.set_experiment("Diabetes-Prediction")

with mlflow.start_run() as run:
    model = LGBMClassifier(random_state=12345)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    score = model.score(X_train, y_train)
    signature = infer_signature(X, y)
    print('score...:', score)
    print('Accuracy...:', accuracy)
    # Log and register the model; replace "diabetes-model" with your model name.
    mlflow.sklearn.log_model(
        model,
        "diabetes-model",
        signature=signature,
        registered_model_name="diabetes-model"
    )
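Beyond the single accuracy number printed above, it is often worth inspecting per-class behavior before trusting a medical-style classifier. A small sketch using scikit-learn follows; the `y_test` and `y_pred` values here are synthetic stand-ins, since the real ones come from the training cell:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-ins for the y_test and y_pred produced by the training cell.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted class
print(cm)
print(classification_report(y_test, y_pred,
                            target_names=["no diabetes", "diabetes"]))
```

Metrics like these can also be logged to the MLflow run with mlflow.log_metric() so they appear alongside the model in the experiment.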
- Once the model is saved, it can be loaded for inference. Below, we load the model and run inference on a sample test dataset. Please refer to the code below to predict your test data.
from pyspark.sql import SparkSession
from synapse.ml.predict import MLFlowTransformer

spark = SparkSession.builder.getOrCreate()

# test is a Spark DataFrame containing the CSV data from "Files/diabetes_test.csv".
test = spark.read.format("csv").option("header", "true").load("Files/diabetes_test.csv")
display(test)

# You can substitute values below for your own input columns,
# output column name, model name, and model version
model = MLFlowTransformer(
    inputCols=test.columns,
    outputCol='predictions',
    modelName='diabetes-model',
    modelVersion=1
)

predictions = model.transform(test)
display(predictions)
pred_df = predictions.toPandas()
- Replace inputCols with your test dataset’s feature columns, and modelName and modelVersion with your registered model’s name and version.
- Or, if you prefer the UI, you can generate the PREDICT code above from the model item page to infer test data.
- Open the model from your workspace where you saved it
- Select a model version from the sidebar, click the Apply Model button, and select Use this model in the wizard, as shown in the image below.
- You can see the generated code in this notebook
- Running the generated code adds a “predictions” column to your test data frame.
This way you can use Fabric Notebook for your data science experiments.