Predicting Shipping Cost: A Comparison Between Azure AutoML and Jupyter Notebook Developed Code
Introduction
Azure AutoML, part of the Azure Machine Learning service, has become a powerful tool: it empowers data scientists and developers to build, deploy, and manage high-quality models faster and with confidence. AzureML accelerates time to value with industry-leading MLOps (machine learning operations), open-source interoperability, and integrated tools, and it aims to innovate on a secure, trusted platform designed for responsible machine learning (ML).
How Does It Work?
According to Microsoft's definition, Azure AutoML "empowers professional and nonprofessional data scientists to build machine learning models rapidly" by automating the time-consuming and iterative tasks of machine learning model development, using breakthrough research to accelerate time to market [link].
This ML tool has a wide range of capabilities, and the major ones are:
Automatically build and deploy predictive models using the no-code UI or the SDK (a minimal SDK sketch follows this list)
Support a variety of automated machine-learning tasks
Increase productivity with easy data exploration and intelligent feature engineering using deep neural networks
Build models with transparency and trust in mind using responsible machine-learning solutions
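For the SDK route, a minimal sketch of connecting to a workspace and creating an experiment might look like the following (this assumes the v1 azureml-core Python SDK, a config.json downloaded from the Azure portal, and an experiment name of our own choosing):

```python
# Minimal sketch of the SDK route (AzureML Python SDK v1).
# The config file and the experiment name are illustrative assumptions.
from azureml.core import Workspace, Experiment

# Loads connection details from a local config.json downloaded from the portal
ws = Workspace.from_config()

# An experiment groups the runs that AutoML submits on our behalf
experiment = Experiment(workspace=ws, name="shipping-cost-automl")
print(ws.name, experiment.name)
```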
AutoML solutions can target various stages of the machine-learning process. These are the different steps that can be automated [link]:
Data preparation and ingestion (from raw data and miscellaneous formats)
Column type detection; e.g., boolean, discrete numerical, continuous numerical, or text
Column intent detection; e.g., target/label, stratification field, numerical feature, categorical text feature, or free text feature
Task detection; e.g., binary classification, regression, clustering…
Featurization
Feature engineering
Feature extraction
Feature selection
Meta-learning and transfer learning
Detection and handling of skewed data and/or missing values
Handling of imbalanced data
Model selection
Hyperparameter optimization of the learning algorithm
Pipeline selection under time, memory, and complexity constraints
Selection of evaluation metrics and validation procedures
Problem checking
Leakage detection
Misconfiguration detection
Monitoring
Explainability
Analysis of the results obtained
Deployment
During training, Azure Machine Learning creates a number of pipelines in parallel that try different algorithms and parameters for you. The service iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score. The better the score for the metric you want to optimize for, the better the model is considered to "fit" your data. It will stop once it hits the exit criteria defined in the experiment.
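Conceptually, the loop AutoML runs resembles the following plain scikit-learn sketch (this is an illustration, not the actual AutoML internals; the stand-in dataset is generated only to make the snippet runnable):

```python
# Conceptual sketch of AutoML's iterate-and-score loop using plain scikit-learn.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)  # stand-in data

candidates = {
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

best_name, best_score = None, float("-inf")
for name, model in candidates.items():
    # Each iteration pairs an algorithm with the data and produces a score
    score = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5).mean()
    if score > best_score:
        best_name, best_score = name, score

print(best_name, -best_score)  # the pipeline with the best metric "fits" the data best
```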
Any Need for Data Science Jobs in the Future?
Altogether, AzureML improves productivity through the studio, a development experience that supports all ML tasks for building, training, and deploying models. It integrates with Jupyter Notebooks and has built-in support for popular open-source frameworks and libraries. One of its major strengths is how quickly it creates accurate models with automated ML, using feature engineering and hyperparameter-sweeping capabilities. AzureML provides a debugger, profiler, and model explanations to improve performance as you train, offers deep Visual Studio Code integration to go from local to cloud training seamlessly, and autoscales with powerful cloud-based CPU and GPU clusters [link].
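As a small example of that autoscaling, here is a sketch (v1 SDK; the cluster name and VM size are our own assumptions) of provisioning a cluster that scales between zero and four nodes:

```python
# Sketch of provisioning an autoscaling compute cluster (AzureML SDK v1).
# "cpu-cluster" and the VM size are assumptions; ws is the workspace from the earlier sketch.
from azureml.core.compute import AmlCompute, ComputeTarget

config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",  # swap in a GPU SKU for GPU workloads
    min_nodes=0,                # scales to zero when idle
    max_nodes=4,                # scales out under load
)
compute_target = ComputeTarget.create(ws, "cpu-cluster", config)
compute_target.wait_for_completion(show_output=True)
```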
Experiment: Defining A Supply Chain Shipping Price Prediction
Navigating the logistics of buying art can be difficult. The challenges include, but are not limited to, the following:
Effective collection management
Shipping the paintings, antiques, sculptures, and other collectibles to their respective destinations after purchase
Though many companies have made shipping consumer goods a relatively quick and painless procedure, the same rules do not always apply while shipping paintings or transporting antiques and collectibles.
Data Description
Customer Id: Represents the unique identification number of the customers
Artist Name: Represents the name of the artist
Artist Reputation: Represents the reputation of an artist in the market (the greater the reputation value, the higher the reputation of the artist in the market)
Height: Represents the height of the sculpture
Width: Represents the width of the sculpture
Weight: Represents the weight of the sculpture
Material: Represents the material that the sculpture is made of
Price Of Sculpture: Represents the price of the sculpture
Base Shipping Price: Represents the base price for shipping a sculpture
International: Represents whether the shipping is international
Express Shipment: Represents whether the shipping was in the express (fast) mode
Installation Included: Represents whether the order had installation included in the purchase of the sculpture
Transport: Represents the mode of transport of the order
Fragile: Represents whether the order is fragile
Customer Information: Represents details about a customer
Remote Location: Represents whether the customer resides in a remote location
Scheduled Date: Represents the date when the order was placed
Delivery Date: Represents the date of delivery of the order
Customer Location: Represents the location of the customer
Cost: Represents the cost of the order
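Before modeling, it is worth loading the data and sanity-checking it against the description above. A quick sketch (the file name shipping_train.csv is an assumption about how the dataset was exported):

```python
# Quick local inspection of the dataset; the file name is an assumption.
import pandas as pd

df = pd.read_csv("shipping_train.csv", parse_dates=["Scheduled Date", "Delivery Date"])

print(df.shape)
print(df.dtypes)              # compare detected column types with the description above
print(df.isna().sum())        # missing values that AutoML (or we) must handle
print(df["Cost"].describe())  # the regression target
```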
Set Up the Scene in Azure AutoML
We import the above dataset to set up a regression problem in Azure AutoML. At the same time, we develop our own notebook to solve the same problem. We set the primary metric to normalized mean absolute error in AutoML; however, all metrics are available from the model results upon completion.
We restrict the models to XGBoostRegressor, LightGBM, and RandomForest, since these models have performed well on other price-prediction problems.
The training time was set to two hours, matching the runtime of the Jupyter notebook we developed. However, AutoML could perform better given more time, and the hand-written notebook could likewise improve with a longer grid search over the parameters.
The validation is set to auto, although we could choose between k-fold and Monte Carlo cross-validation (the method is picked by AutoML).
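Taken together, the setup described above corresponds roughly to this v1 SDK sketch (train_dataset, compute_target, and experiment are assumed to carry over from the earlier sketches; leaving validation settings out lets AutoML choose the technique automatically):

```python
# Sketch of the AutoML configuration described above (AzureML SDK v1).
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="regression",
    primary_metric="normalized_mean_absolute_error",
    training_data=train_dataset,    # assumed: the shipping data registered as a TabularDataset
    label_column_name="Cost",
    compute_target=compute_target,  # assumed: the cluster provisioned earlier
    allowed_models=["XGBoostRegressor", "LightGBM", "RandomForest"],
    experiment_timeout_hours=2,     # matches the notebook's runtime
    # no validation settings given, so AutoML picks the validation technique
)
run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = run.get_output()
```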
The residuals and validation plots are shown below, and we can see that the model does not perform very well. However, we should keep in mind that the dataset is not optimized and needs extensive feature engineering to produce a good result.
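For reference, plots like these can be reproduced from a hold-out set with a few lines of matplotlib (y_valid and preds are assumed validation targets and model predictions as NumPy arrays):

```python
# Sketch of residual and predicted-vs-true plots; y_valid and preds are assumed arrays.
import matplotlib.pyplot as plt

residuals = y_valid - preds

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=50)                 # distribution of errors
ax1.set(title="Residuals", xlabel="y_true - y_pred")
ax2.scatter(y_valid, preds, s=5, alpha=0.5)  # predicted vs. true
ax2.plot([y_valid.min(), y_valid.max()],
         [y_valid.min(), y_valid.max()], "r--")  # ideal-fit line
ax2.set(title="Predicted vs. true", xlabel="True cost", ylabel="Predicted cost")
plt.tight_layout()
plt.show()
```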
Let us compare the results with the hand-designed Jupyter notebook project and see which one performs better.
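For context, a condensed sketch of the notebook's approach, an imputation-and-encoding pipeline wrapped in a grid search, is shown below; the feature lists and parameter grid are illustrative assumptions rather than the project's exact choices (df is the DataFrame loaded earlier):

```python
# Condensed sketch of the hand-written notebook: preprocessing pipeline + grid search.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric = ["Artist Reputation", "Height", "Width", "Weight",
           "Price Of Sculpture", "Base Shipping Price"]
categorical = ["Material", "International", "Express Shipment", "Transport"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

pipe = Pipeline([("prep", preprocess),
                 ("model", RandomForestRegressor(random_state=0))])

grid = GridSearchCV(
    pipe,
    param_grid={"model__n_estimators": [200, 500],
                "model__max_depth": [None, 10, 20]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(df[numeric + categorical], df["Cost"])
print(grid.best_params_, -grid.best_score_)  # best settings and their MAE
```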
Conclusions
Comparing the results, we can see that the human-supervised Jupyter notebook performs better at modeling, hyperparameter tuning, and pipeline construction. However, as expected, Azure AutoML offers a much simpler path from data ingestion to modeling and pipeline production. Azure AutoML's slightly lower performance is negligible compared with all the other benefits the Azure services and studio provide, both for individuals and, on a broader scale, for enterprises.
Building and testing a model is fairly simple in Azure AutoML. However, to get the best results, a data scientist or a team of data scientists with deep fundamental and statistical knowledge of machine learning should supervise Azure AutoML; combined with other Azure capabilities such as the Designer, pipelines, Azure Cognitive Services (ACS), and Azure Kubernetes Service (AKS), a model built on Azure can deliver significant benefits and savings for businesses. We should keep in mind that AutoML still requires knowledge of the characteristics of machine learning algorithms, and it certainly will not develop new ones automatically. What it does provide is an environment where machine learning can be used effectively without low-level implementation of algorithmic knowledge. So it can be concluded that AzureML will not end the future of data science; rather, it will improve the working process with its advanced features.