MLOps
The goal of MLOps
- Measure and monitor the quality of the model.
- Make the process from model building to production smooth and fast.
Agile Manifesto
- Agile Manifesto
- Extreme Programming
- Scrum
- ...
DevOps
DevOps is the idea that Developers and Operations (and QA etc.) work together to achieve a better service and a better product.
- Eliminating silos
- DevOps is not a product
- DevOps is not a job title
Machine Learning Operations
- The term MLOps is defined as "the extension of the DevOps methodology to include Machine Learning and Data Science assets as first-class citizens within the DevOps ecology".
- "The intention is for MLOps to decorate DevOps rather than differentiate."
What is MLOps?
- Machine Learning Operations is a practice
- NOT a product
- NOT a job title
MLOps community
Silicon
- CPU - Central Processing Unit
- GPU - Graphical Processing Unit
- TPU - Tensor Processing Unit
- dedicated ASICs - Application-specific integrated circuit
- custom neuro-morphic silicon - Cognitive computer
- ...
Off-line (batch) vs On-line (streaming) learning
- Off-line (batch) learning
- On-line (streaming or live) learning
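The difference can be illustrated with scikit-learn; the estimator and the synthetic data below are illustrative assumptions, not part of the original notes:

import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, 1000)

# Off-line (batch): the model is trained on the whole dataset at once,
# and re-trained from scratch when new data arrives
batch_model = SGDClassifier()
batch_model.fit(X, y)

# On-line (streaming): the model is updated incrementally as chunks of data arrive
online_model = SGDClassifier()
for start in range(0, len(X), 100):
    online_model.partial_fit(X[start:start+100], y[start:start+100], classes=[0, 1])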
Security question of MLOps
- poisoning of the model (e.g. a chat bot that learns from real-time data; Microsoft's Tay was poisoned within 16 hours)
- private and sensitive data (e.g. gender, religion, sexual orientation, health status)
- legislation
- Audit trail of all results! (code, data, parameters, random values, etc.)
Data
- git
- git-lfs (large file support)
- External storage with hash (a sketch follows below)
- dvc (checksum)
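A minimal sketch of the "external storage with hash" idea (the storage location and the helper function are assumptions for illustration): the file is copied into storage under its content hash, and only the hash is committed to git.

import hashlib
import shutil
from pathlib import Path

STORAGE = Path('/tmp/data-storage')   # assumed external storage location

def store(path):
    # Copy the file into content-addressed storage and return its hash;
    # the hash (not the file) is what gets committed to git
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    STORAGE.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, STORAGE / digest)
    return digest

print(store('data/data.csv'))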
MLOps progress
- code in Jupyter notebook
- put the code in functions
- put the functions in modules
- write tests for these functions
- Make sure your results are repeatable (start with the current data-set)
Reload modules in Jupyter Notebook
%load_ext autoreload
%autoreload 2
examples/ml/reload.ipynb
mymodule.py
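A minimal sketch of what such a setup looks like; the clean() function in mymodule.py is hypothetical:

# mymodule.py - a module next to the notebook
def clean(text):
    return text.strip().lower()

# In the notebook: with autoreload enabled, edits to mymodule.py are picked up
# the next time a cell runs, without restarting the kernel
%load_ext autoreload
%autoreload 2

import mymodule
mymodule.clean('  Hello  ')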
Testing ML
- Create output that is easy to compare by computer (so numerical results are preferable over a graph)
- Fix randomizations to make the results repeatable
- Establish thresholds for results using different datasets (and also using different models); see the test sketch below this list
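A minimal sketch of such a test (the synthetic data and the 0.8 threshold are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_model_accuracy():
    # Fix the randomness so every run produces the same result
    rng = np.random.RandomState(42)
    X = rng.rand(200, 3)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # A numeric score with a threshold is easy for the computer to check
    assert model.score(X_test, y_test) >= 0.8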
What to track?
- the code (and the dependencies)
- the data
- the artifacts (e.g. models)
- the experiments and their results
What are the inputs and what are the artifacts?
- Data (what kind of data? how does it change? how can developers access it - privacy issues?)
- Selecting the algorithms
- Random values as input
- Hyperparameters
- The model (a series of numbers? Is it language-agnostic?)
Tooling for MLOps
- CD Foundation - Continuous Delivery Foundation
DVC
Storage can be
- local disk
- cloud
- HDFS
pip install dvc
git init (creating .git)
dvc init (creating .dvc and .dvcignore)
dvc remote add -d dvc-remote /tmp/dvc-demo-storage (changing .dvc/config)
dvc add data/data.csv
git add .
git commit -m "data: ...."
git tag -a v1 -m v1
dvc push
- Files are now in /tmp/dvc-demo-storage
- Files are also in .dvc/cache
dvc pull
dvc status
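Switching back to an earlier version of the data is then a matter of checking out the matching git tag and letting dvc restore the file (a sketch using the v1 tag created above):

git checkout v1    # the small data.csv.dvc pointer file reverts to the v1 hash
dvc checkout       # restore data/data.csv from the local cache
dvc pull           # or fetch it from the remote storage if it is not cached locally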
Data Pipelines (Workflow management)
Workflow management
MLFlow
- Tracking
- Projects
- Models
- Model Registry
MLFlow Tracking server backends
Entity Metadata Store
- FileStore (mlruns directory)
- SQLStore (via SQLAlchemy - PostgreSQL, MySQL, SQLite)
- MLFlow Plugins Scheme
- Managed MLFlow on Databricks
Artifact Store
- Local Filesystem (mlruns directory)
- S3
- Azure blob
- Google Cloud Storage
- DBFS (Databricks File System) artifact repo
MLFlow Tracking
- Parameters: key-value input to the code (learning rate, what loss function is used, number of filters to use, depth of the tree)
- Metrics: numeric values
- Tags and Notes: information about a run (free text)
- Artifacts: files, data, model
- Source: what code ran?
- Version: which version of the code?
- Run: an instance of code
- Experiment: several Runs
with mlflow.start_run():
    mlflow.log_param("name", value)
    mlflow.log_params(params_dict)   # log several parameters at once from a dict
    ...
    mlflow.log_metric("name", value)
    ...
    mlflow.sklearn.log_model(model, "model")
mlflow ui
MLFlow Projects
Package data-science code to enable reproducible runs on any platform
- Code
- Dependencies
- Data
- Configuration
$ mlflow run ...
mlflow.run()
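A minimal sketch of running a project from Python (the URI and the alpha parameter are illustrative; the project needs an MLproject file declaring its entry points and parameters):

import mlflow

# Run the 'main' entry point of the project in the current directory;
# 'alpha' must be declared as a parameter of that entry point in the MLproject file
mlflow.run('.', entry_point='main', parameters={'alpha': 0.5})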
MLFlow Models
- Deploy models in different environments
Input:
- TensorFlow
- scikit-learn
- R
- Spark
- ML Frameworks
Standardized MLFlow model format
Output:
- docker
- Spark
- Serving tools
The model is saved as a directory containing an MLmodel descriptor file.
mlflow.model_flavor.save_model(...) or log_model(...)
mlflow.model_flavor.load_model(...)
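For example, with the scikit-learn flavor (the model and the paths are illustrative):

import mlflow.sklearn
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1], [2], [3]], [1, 2, 3])

# Save the model as an MLFlow model directory (containing the MLmodel file)...
mlflow.sklearn.save_model(model, 'my_model')

# ...and load it back later, possibly in another process or serving environment
loaded = mlflow.sklearn.load_model('my_model')
print(loaded.predict([[4]]))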
Resources
- Managing the Complete Machine Learning Lifecycle with MLflow: 3-part series
import mlflow
import mlflow.sklearn
import pandas as pd
import dvc.api

path = 'data/data.csv'
repo = ''        # path to the git repository
version = 'v1'   # git sha1, branch, or tag

# Ask DVC for the URL of this version of the data file in the remote storage
data_url = dvc.api.get_url(
    path=path,
    repo=repo,
    rev=version,
)

mlflow.set_experiment('demo')

df = pd.read_csv(data_url)

# Log where the data came from and its basic shape
mlflow.log_param('data_url', data_url)
mlflow.log_param('data_version', version)
mlflow.log_param('input_rows', df.shape[0])
mlflow.log_param('input_cols', df.shape[1])

# Save the names of the target columns as an artifact
# (train_y comes from the train/test split, which is not shown here)
cols_y = pd.DataFrame(list(train_y.columns))
cols_y.to_csv('features.csv', header=False, index=False)
mlflow.log_artifact('features.csv')
Goals of SCM
- SCM = Software Configuration Management
- Reproducibility
- Change management
MLOps notes
- logging
- metrics
- data-pipelines
- Data is changing (new types of data, or the same kind of data but a newer dataset)
- Model
- Monitor the quality of the model over time
- The standard metrics: precision and recall for classification, accuracy, F-measure (F1); a sketch of computing these follows at the end
- data quality
- model decay (due to changes in the data that are not used to re-train the model)
- locality (using the same model on a different set of data, e.g. a different cluster of customers)
- Distributed learning
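A minimal sketch of computing these monitoring metrics with scikit-learn (the label arrays are illustrative); tracking them over time on fresh data helps detect model decay:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # labels observed in production
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # what the model predicted

print('accuracy ', accuracy_score(y_true, y_pred))
print('precision', precision_score(y_true, y_pred))
print('recall   ', recall_score(y_true, y_pred))
print('F1       ', f1_score(y_true, y_pred))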