A machine learning (ML) pipeline is an Azure tool for combining the phases of a machine learning cycle. The cycle can be a training cycle, a data cleaning or pre-processing cycle, or a prediction cycle. Let’s look at the aspects of an ML cycle and how a pipeline makes these tasks easier.
What is the machine learning cycle?
An ML cycle is a sequence of steps that together produce a final output, which can be a trained model or inferences from a trained model. For example, a training cycle includes these steps:
- Data loading, e.g., from a database or files
- Data pre-processing, e.g., cleansing, transforming, or scaling
- Preparing data for the training process
- Training the algorithm model and saving it for inferences
A step depends on another step when it needs that step's output as its input; otherwise, the steps are independent of each other.
Similarly, there is an inference cycle in which a trained model and real data are loaded to produce an inference.
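The training and inference cycles described above can be sketched in plain Python. This is only an illustration using the standard library: the sample data, the function names, and the tiny least-squares model are all assumptions made for the example, not part of Azure Machine Learning.

```python
import csv
import io
import json

# Hypothetical sample data; the third row has a missing value on purpose.
RAW_CSV = "x,y\n1,2.1\n2,3.9\n,5.0\n3,6.2\n4,8.1\n"


def load_data(text):
    """Step 1: load rows from a source (a string stands in for a file or database)."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """Step 2: cleanse (drop rows with missing values) and convert to floats."""
    return [(float(r["x"]), float(r["y"])) for r in rows if r["x"] and r["y"]]


def prepare(pairs):
    """Step 3: arrange the data in the shape the training step expects."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return xs, ys


def train(xs, ys):
    """Step 4: fit y = slope * x + intercept by least squares and serialize the model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return json.dumps({"slope": slope, "intercept": mean_y - slope * mean_x})


def infer(saved_model, x):
    """Inference cycle: load the saved model and apply it to new data."""
    model = json.loads(saved_model)
    return model["slope"] * x + model["intercept"]


# Each training step consumes the output of the previous one.
saved_model = train(*prepare(preprocess(load_data(RAW_CSV))))
prediction = infer(saved_model, 5.0)
```

Here every step depends on the one before it, which is why the calls nest; steps that do not share data could run in any order.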
ML cycle to ML pipeline
When ML cycle scripts run on a single workstation in production, the workstation may run out of resources while handling large amounts of data or parallel processing. An ML pipeline, by contrast, replicates the steps of the ML cycle in a Microsoft Azure Machine Learning Workspace. It can be connected to other Azure services, such as storage and compute resources, with the associated data security and reliability compliance. Each step of the cycle becomes a step in the pipeline, that is, a process that takes some inputs and produces an output. Splitting the complete process into steps creates a modular system in which steps can be reused as required. Each step of the pipeline runs in a separate container, so it is independent of system-level configuration, which the Azure Machine Learning service abstracts away.
Intermediate data can be passed into and out of steps using PipelineData configurations, and it persists across all steps for the duration of the pipeline.
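To make the idea of steps with named inputs and outputs concrete, here is a minimal sketch of a pipeline runner in plain Python. It does not use the Azure SDK: the `Step` class, the `run_pipeline` function, and the shared `store` dictionary are hypothetical stand-ins for pipeline steps and their intermediate data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    func: Callable[..., object]
    inputs: List[str]  # names of the data items this step reads
    output: str        # name of the data item this step writes


def run_pipeline(steps, initial):
    """Run each step once all of its inputs exist; return the data store and run order."""
    store = dict(initial)  # intermediate data, kept for the whole run
    pending = list(steps)
    order = []
    while pending:
        ready = [s for s in pending if all(i in store for i in s.inputs)]
        if not ready:
            missing = ", ".join(s.name for s in pending)
            raise ValueError("steps with unsatisfied inputs: " + missing)
        for step in ready:
            store[step.output] = step.func(*(store[i] for i in step.inputs))
            order.append(step.name)
            pending.remove(step)
    return store, order


# Steps are declared out of order on purpose; the data dependencies decide the order.
steps = [
    Step("train", lambda data: sum(data) / len(data), ["scaled"], "model"),
    Step("scale", lambda data: [v / 10 for v in data], ["clean"], "scaled"),
    Step("clean", lambda data: [v for v in data if v is not None], ["raw"], "clean"),
]
store, order = run_pipeline(steps, {"raw": [10, None, 20, 30]})
```

Because each step only names the data it reads and writes, a step can be swapped out or reused in another pipeline without touching the others.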
Azure Machine Learning Workspace
An ML pipeline can be created with a simple drag-and-drop method in the Designer or custom-developed using the SDK. The Azure SDK provides functions that handle most of the how-to details, so users can focus on what to do.
Azure ML pipelines are designed to reuse step output. The runtime environment decides which steps must run and which can reuse the output of a previous run. This capability speeds up execution and saves compute resources, and thus reduces the overall cost.
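One way to picture output reuse is a cache keyed on a step's identity and inputs. The sketch below is a simplified assumption about how such a decision could be made, not a description of Azure's actual mechanism; `CachingRunner`, `run_step`, and the `version` argument are hypothetical names.

```python
import hashlib
import json


class CachingRunner:
    """Re-executes a step only when its name, version, or inputs have changed."""

    def __init__(self):
        self.cache = {}
        self.executed = []  # names of steps that actually ran

    def run_step(self, name, func, inputs, version="1"):
        fingerprint = json.dumps([name, version, inputs], sort_keys=True)
        key = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]  # reuse the output of a previous run
        self.executed.append(name)
        result = func(inputs)
        self.cache[key] = result
        return result


def clean(values):
    return [v for v in values if v >= 0]


def total(values):
    return sum(values)


runner = CachingRunner()


def run(raw):
    cleaned = runner.run_step("clean", clean, raw)
    return runner.run_step("total", total, cleaned)


first = run([1, -2, 3])       # both steps execute
second = run([1, -2, 3])      # nothing executes; both outputs are reused
third = run([1, -2, 3, -4])   # "clean" re-runs, but its output is unchanged,
                              # so "total" is reused
```

The third run shows why reuse saves compute: a change upstream only forces downstream steps to run again if the data they receive actually differs.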
Handling large amounts of data using ML pipeline
ML pipelines can handle large amounts of data by running steps in parallel. Any step of the pipeline can be defined as a single-process step or a parallel-process step. In a single-process step, the complete data set is passed to a single node. In a parallel-process step, the data is split into mini-batches that are distributed across the nodes of the compute cluster as needed.
An added advantage of the ML pipeline is that a high-end compute cluster can be spun up when a parallel-process step runs and shut down automatically when the step completes.
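The mini-batch idea behind a parallel-process step can be sketched with a thread pool from the Python standard library. Threads stand in for cluster nodes here, and `make_mini_batches`, `process_batch`, and `run_parallel_step` are hypothetical names chosen for the example, so treat this as an analogy rather than how Azure schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def make_mini_batches(items, batch_size):
    """Split the data into consecutive mini-batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Work done on one mini-batch; squaring stands in for real processing."""
    return [value * value for value in batch]


def run_parallel_step(items, batch_size, node_count):
    """Distribute mini-batches over node_count workers and merge the results in order."""
    batches = make_mini_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        # map() returns results in the order of the submitted batches.
        results = list(pool.map(process_batch, batches))
    return [value for batch in results for value in batch]


squares = run_parallel_step(list(range(10)), batch_size=3, node_count=4)
```

The pool exists only for the duration of the `with` block, which loosely mirrors a cluster that is started for the step and released afterwards.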
The structured approach to the parallel-process step lets the ML pipeline abstract the underlying requirements of running a parallel process and handle the data independently. It also provides settings for failure handling and resource use, such as:
- The number of retries in case of failure
- The number of parallel processes a compute cluster can handle without breaking down, which optimizes its resource utilization
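The two settings above, a retry limit and a cap on parallel processes, can be illustrated with a small standard-library sketch. The parameter names `max_retries` and `max_parallel` and the deliberately flaky task are assumptions made for this example; they are not Azure configuration names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(task, batch, max_retries):
    """Run task(batch), retrying up to max_retries times; return (result, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(batch), attempts
        except Exception:
            if attempts > max_retries:
                raise  # retries exhausted; surface the failure


def run_batches(task, batches, max_retries=2, max_parallel=2):
    """Process batches with at most max_parallel workers running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_with_retries, task, b, max_retries) for b in batches]
        return [f.result() for f in futures]


_seen = set()
_lock = threading.Lock()


def flaky_sum(batch):
    """Hypothetical task that fails the first time it sees each batch."""
    key = tuple(batch)
    with _lock:
        first_time = key not in _seen
        _seen.add(key)
    if first_time:
        raise RuntimeError("transient failure")
    return sum(batch)


results = run_batches(flaky_sum, [[1, 2], [3, 4], [5]])
```

Each entry in `results` pairs a batch's output with the number of attempts it took, so a caller can see which batches needed a retry.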
Endpoints and pipeline deployment
An endpoint is a programmatic service provided by Azure Machine Learning Workspace. After an ML pipeline is created, it can be published with an endpoint. Publishing a pipeline means registering and deploying it and providing a trigger for it. Pipeline endpoints automate pipeline workflows by acting as that trigger. An endpoint could be:
- a REST API,
- a scheduled process that runs the pipeline at a defined time, or
- a change-based trigger.
The endpoint can also pass parameters into the pipeline at run time; these parameters are defined during pipeline development.
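As a rough sketch of triggering a published pipeline over REST with run-time parameters, the snippet below builds (but does not send) an HTTP request using the standard library. The body keys `ExperimentName` and `ParameterAssignments` follow the shape used in the Azure ML SDK v1 documentation as far as I recall, so verify them against the current docs; the URL, token, experiment name, and `batch_size` parameter are placeholders invented for the example.

```python
import json
import urllib.request


def build_pipeline_request(endpoint_url, token, experiment_name, parameters):
    """Build a POST request that would trigger a pipeline run with the given parameters."""
    body = {
        "ExperimentName": experiment_name,
        "ParameterAssignments": parameters,  # values for parameters defined at development time
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_pipeline_request(
    "https://example.invalid/pipelines/placeholder-id",  # placeholder endpoint URL
    "placeholder-token",                                 # placeholder access token
    "demo-experiment",
    {"batch_size": 20},
)
# urllib.request.urlopen(request) would submit the run; it is not called here.
```

Keeping the request construction separate from the network call makes the payload easy to inspect before anything is sent.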
The Azure ML pipeline is an easy way to build machine learning projects and take them live. It follows professional standards and has what it needs to scale the pipeline up and down when required. It therefore reduces cost and removes the burden of maintaining workstations.