MLOps
The goal of MLOps
- Measure and monitor the quality of the model.
- Make the process from model building to production smooth and fast.
Agile Manifesto
- Agile Manifesto
- Extreme Programming
- Scrum
- ...
DevOps
DevOps is the idea that Developers and Operations (and QA etc.) work together to achieve a better service and a better product.
- Eliminating silos
- DevOps is not a product
- DevOps is not a job title
Machine Learning Operations
- The term MLOps is defined as "the extension of the DevOps methodology to include Machine Learning and Data Science assets as first-class citizens within the DevOps ecology".
- "The intention is for MLOps to decorate DevOps rather than differentiate."
What is MLOps?
- Machine Learning Operations is a practice
- NOT a product
- NOT a job title
MLOps community
Silicon
- CPU - Central Processing Unit
- GPU - Graphical Processing Unit
- TPU - Tensor Processing Unit
- dedicated ASICs - Application-specific integrated circuit
- custom neuro-morphic silicon - Cognitive computer
- ...
Off-line (batch) vs On-line (streaming) learning
- Off-line (batch) learning
- On-line (streaming or live) learning
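The difference can be illustrated with scikit-learn; the estimator and the synthetic data below are illustrative assumptions, not part of the original notes:

import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, 1000)

# Off-line (batch): the model is trained on the whole dataset at once,
# and re-trained from scratch when new data arrives
batch_model = SGDClassifier()
batch_model.fit(X, y)

# On-line (streaming): the model is updated incrementally as chunks of data arrive
online_model = SGDClassifier()
for start in range(0, len(X), 100):
    online_model.partial_fit(X[start:start+100], y[start:start+100], classes=[0, 1])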
Security question of MLOps
- poisoning of the model (e.g. a chat bot that learns from real-time data; Microsoft's Tay was poisoned within 16 hours)
- private and sensitive data (e.g. gender, religion, sexual orientation, health status)
- legislation
- Audit trail of all results! (code, data, parameters, random values, etc.)
Data
- git
- git-lfs (large file support)
- External storage with hash (a sketch follows below)
- dvc (checksum)
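A minimal sketch of the "external storage with hash" idea (the storage location and the helper function are assumptions for illustration): the file is copied into storage under its content hash, and only the hash is committed to git.

import hashlib
import shutil
from pathlib import Path

STORAGE = Path('/tmp/data-storage')   # assumed external storage location

def store(path):
    # Copy the file into content-addressed storage and return its hash;
    # the hash (not the file) is what gets committed to git
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    STORAGE.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, STORAGE / digest)
    return digest

print(store('data/data.csv'))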
MLOps progress
- code in Jupyter notebook
- put the code in functions
- put the functions in modules
- write tests for these functions
- Make sure your results are repeatable (start with the current data-set)
Reload modules in Jupyter Notebook
%load_ext autoreload
%autoreload 2
examples/ml/reload.ipynb
mymodule.py
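A minimal sketch of what such a setup looks like; the clean() function in mymodule.py is hypothetical:

# mymodule.py - a module next to the notebook
def clean(text):
    return text.strip().lower()

# In the notebook: with autoreload enabled, edits to mymodule.py are picked up
# the next time a cell runs, without restarting the kernel
%load_ext autoreload
%autoreload 2

import mymodule
mymodule.clean('  Hello  ')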
Testing ML
- Create output that is easy to compare by computer (so numerical results are preferable over a graph)
- Fix randomizations to make the results repeatable
- Establish thresholds for results using different datasets (and also using different models); see the test sketch below this list
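A minimal sketch of such a test (the synthetic data and the 0.8 threshold are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_model_accuracy():
    # Fix the randomness so every run produces the same result
    rng = np.random.RandomState(42)
    X = rng.rand(200, 3)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # A numeric score with a threshold is easy for the computer to check
    assert model.score(X_test, y_test) >= 0.8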
What to track?
- the code (and the dependencies)
- the data
- the artifacts (e.g. models)
- the experiments and their results
What are the inputs and what are the artifacts?
- Data (what kind of data? how does it change? how can developers access it - privacy issues?)
- Selecting the algorithms
- Random values as input
- Hyperparameters
- The model (a series of numbers? Is it language-agnostic?)
Tooling for MLOps
- CD Foundation - Continuous Delivery Foundation
DVC
Storage can be
- local disk
- cloud
- HDFS
pip install dvc
git init (creating .git)
dvc init (creating .dvc and .dvcignore)
dvc remote add -d dvc-remote /tmp/dvc-demo-storage (changing .dvc/config)
dvc add data/data.csv
git add .
git commit -m "data: ...."
git tag -a v1 -m v1
dvc push
- Files are now in /tmp/dvc-demo-storage
- Files are also in .dvc/cache
dvc pull
dvc status
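Switching back to an earlier version of the data is then a matter of checking out the matching git tag and letting dvc restore the file (a sketch using the v1 tag created above):

git checkout v1    # the small data.csv.dvc pointer file reverts to the v1 hash
dvc checkout       # restore data/data.csv from the local cache
dvc pull           # or fetch it from the remote storage if it is not cached locally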
Data Pipelines (Workflow management)
Workflow management
MLFlow
- Tracking
- Projects
- Models
- Model Registry
MLFlow Tracking server backends
Entity Metadata Store
- FileStore (mlruns directory)
- SQLStore (via SQLAlchemy - PostgreSQL, MySQL, SQLite)
- MLFlow Plugins Scheme
- Managed MLFlow on Databricks
Artifact Store
- Local Filesystem (mlruns directory)
- S3
- Azure blob
- Google Cloud Storage
- DBFS (Databricks File System) artifact repo
MLFlow Tracking
- Parameters: key-value input to the code (learning rate, what loss function is used, number of filters to use, depth of the tree)
- Metrics: numeric values
- Tags and Notes: information about a run (free text)
- Artifacts: files, data, model
- Source: what code ran?
- Version: which version of the code?
- Run: an instance of code
- Experiment: several Runs
with mlflow.start_run():
    mlflow.log_param("name", value)
    mlflow.log_params(params_dict)   # log several parameters at once from a dict
    ...
    mlflow.log_metric("name", value)
    ...
    mlflow.sklearn.log_model(model, "model")
mlflow ui
MLFlow Projects
Package data-science code to enable reproducible runs on any platform
- Code
- Dependencies
- Data
- Configuration
$ mlflow run ...
mlflow.run()
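A minimal sketch of running a project from Python (the URI and the alpha parameter are illustrative; the project needs an MLproject file declaring its entry points and parameters):

import mlflow

# Run the 'main' entry point of the project in the current directory;
# 'alpha' must be declared as a parameter of that entry point in the MLproject file
mlflow.run('.', entry_point='main', parameters={'alpha': 0.5})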
MLFlow Models
- Deploy models in different environments
Input:
- TensorFlow
- scikit-learn
- R
- Spark
- ML Frameworks
Standardized MLFlow model format
Output:
- docker
- Spark
- Serving tools
The model is saved as a directory containing an MLmodel descriptor file.
mlflow.model_flavor.save_model(...) or log_model(...)
mlflow.model_flavor.load_model(...)
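For example, with the scikit-learn flavor (the model and the paths are illustrative):

import mlflow.sklearn
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1], [2], [3]], [1, 2, 3])

# Save the model as an MLFlow model directory (containing the MLmodel file)...
mlflow.sklearn.save_model(model, 'my_model')

# ...and load it back later, possibly in another process or serving environment
loaded = mlflow.sklearn.load_model('my_model')
print(loaded.predict([[4]]))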
Resources
- Managing the Complete Machine Learning Lifecycle with MLflow: 3-part series
import mlflow
import mlflow.sklearn
import pandas as pd
import dvc.api

path = 'data/data.csv'
repo = ''        # path to the git repository
version = 'v1'   # git sha1, branch, or tag

# Ask DVC for the URL of this version of the data file in the remote storage
data_url = dvc.api.get_url(
    path=path,
    repo=repo,
    rev=version,
)

mlflow.set_experiment('demo')

df = pd.read_csv(data_url)

# Log where the data came from and its basic shape
mlflow.log_param('data_url', data_url)
mlflow.log_param('data_version', version)
mlflow.log_param('input_rows', df.shape[0])
mlflow.log_param('input_cols', df.shape[1])

# Save the names of the target columns as an artifact
# (train_y comes from the train/test split, which is not shown here)
cols_y = pd.DataFrame(list(train_y.columns))
cols_y.to_csv('features.csv', header=False, index=False)
mlflow.log_artifact('features.csv')
Goals of SCM
- SCM = Software Configuration Management
- Reproducibility
- Change management
MLOps notes
- logging
- metrics
- data-pipelines
- Data is changing (new types of data, or the same kind of data but a newer dataset)
- Model
- Monitor the quality of the model over time
- The standard metrics: precision and recall for classification, accuracy, F-measure (F1); a sketch of computing these follows at the end
- data quality
- model decay (due to changes in the data that are not used to re-train the model)
- locality (using the same model on a different set of data, e.g. a different cluster of customers)
- Distributed learning
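A minimal sketch of computing these monitoring metrics with scikit-learn (the label arrays are illustrative); tracking them over time on fresh data helps detect model decay:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # labels observed in production
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # what the model predicted

print('accuracy ', accuracy_score(y_true, y_pred))
print('precision', precision_score(y_true, y_pred))
print('recall   ', recall_score(y_true, y_pred))
print('F1       ', f1_score(y_true, y_pred))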