Machine Learning Model Deployment: From Development to Production Environments

Training a model is the part of machine learning that gets the most attention. Deploying it is the part that actually matters. A model that lives in a notebook is a science project. A model in production, serving real requests, updating as the world changes — that’s the point.

The gap between the two is larger than most tutorials acknowledge.

The deployment landscape

Before choosing an approach, understand what you’re deploying:

Latency requirements — real-time inference (< 100ms) vs batch predictions (minutes to hours) vs offline scoring (daily)
Throughput requirements — requests per second at peak load
Model size — BERT-base is roughly 440MB in fp32 (about 110 million parameters at 4 bytes each); a large vision model might be 10GB
Hardware — CPU inference is cheaper; GPU inference is faster for large models
Update frequency — how often do you retrain and redeploy?

These answers determine your architecture before you write a line of deployment code.
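A quick back-of-the-envelope calculation helps turn these answers into numbers. The sketch below is an illustration, not a sizing tool: the function names, the 0.7 headroom factor, and the example traffic figures are all assumptions, and the replica estimate simply applies Little's law (requests in flight = arrival rate × latency).

```python
import math

# Bytes per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weights_size_mb(num_params: int, dtype: str = "fp32") -> float:
    """Approximate size of the raw weights in MB (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

def replicas_needed(peak_rps: float, latency_s: float,
                    concurrency_per_replica: int, headroom: float = 0.7) -> int:
    """Rough replica count from Little's law.

    headroom is a hypothetical safety factor: only plan to use 70% of
    each replica's concurrency so that bursts do not saturate it.
    """
    in_flight = peak_rps * latency_s              # average concurrent requests
    usable = concurrency_per_replica * headroom   # usable slots per replica
    return max(1, math.ceil(in_flight / usable))

# ~110 million parameters is the commonly quoted size of BERT-base.
print(weights_size_mb(110_000_000, "fp32"))  # 440.0
print(weights_size_mb(110_000_000, "fp16"))  # 220.0
# 200 req/s at 50 ms each, 4 concurrent requests per replica.
print(replicas_needed(200, 0.05, 4))         # 4
```

Treat the output as a starting point for a load test, not a substitute for one.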

Serving approaches

REST API with FastAPI is the right starting point for most use cases:

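from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
# Load the model once at startup, not on every request.
model = joblib.load("model.pkl")

class PredictRequest(BaseModel):
    # One row of numeric features, in the order the model was trained on.
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # predict() expects a 2D input, so wrap the single row in a list.
    prediction = model.predict([req.features])[0]
    # Cast the NumPy scalar to a plain float so it serialises to JSON.
    return {"prediction": float(prediction)}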

Add Docker, expose a port, and you have a deployable service. This works for sklearn models, XGBoost, LightGBM, and anything that fits in memory and runs on CPU.

TorchServe / TF Serving for deep learning models. These frameworks handle batching, GPU memory management, and model versioning, which are concerns that a generic web framework doesn’t address well.

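Serverless (AWS Lambda, Google Cloud Functions) works for models that fit within the platform’s memory and cold-start constraints. There is no GPU, and although AWS Lambda can be configured with up to 10GB of memory, cold-start time grows with artifact size, so in practice this suits models of a few hundred megabytes or less. Good for low-traffic, bursty workloads where you don’t want to pay for idle compute.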

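ONNX Runtime deserves mention: exporting your PyTorch or TensorFlow model to ONNX format lets you run inference with a single dependency that optimises for the target hardware. It is often faster than the native framework on CPU (speedups of 2–4x are commonly cited), but the gain depends on the model and the hardware, so benchmark before committing.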

Packaging and reproducibility

The model artifact alone is not enough. You need:

The model weights/pickle
The exact version of every library used for training and inference
Preprocessing code (the same transformations used at training time, applied at inference time)
Feature schema (what inputs are expected, their types and ranges)

Use MLflow or DVC to track experiments and associate artifacts with the code and data that produced them. A model you can’t reproduce is a liability.
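To make the list above concrete, here is a minimal sketch of a manifest written next to the model file, using only the standard library. The manifest layout, the function names, and the example feature schema are assumptions made for illustration; they are not a format that MLflow or DVC defines.

```python
import hashlib
import json
import sys
from importlib import metadata
from pathlib import Path
from typing import Optional

def sha256_of(path: Path) -> str:
    """Hash the artifact so the manifest pins one exact file."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def installed_version(package: str) -> Optional[str]:
    """Version of an installed distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def build_manifest(model_path: Path, packages: list, feature_schema: list) -> dict:
    return {
        "model_file": model_path.name,
        "model_sha256": sha256_of(model_path),
        "python": sys.version.split()[0],
        "packages": {name: installed_version(name) for name in packages},
        "features": feature_schema,
    }

def write_manifest(manifest: dict, out_path: Path) -> None:
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))

# Hypothetical usage:
#   schema = [{"name": "amount", "dtype": "float", "min": 0.0}]
#   m = build_manifest(Path("model.pkl"), ["scikit-learn", "numpy"], schema)
#   write_manifest(m, Path("model.manifest.json"))
```

At load time, the serving code can recompute the hash and compare library versions against the manifest, refusing to start on a mismatch.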

Docker solves the “works on my machine” problem:

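FROM python:3.12-slim
WORKDIR /app
# Install dependencies first so this layer stays cached until requirements.txt changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl ./
COPY app.py ./
# Bind to all interfaces so the port is reachable from outside the container.
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]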

The training-serving skew problem

This is one of the most common causes of production model failures. Training-serving skew happens when the data your model sees in production differs from the data it was trained on, usually because preprocessing is implemented in two different places.

The fix is to implement preprocessing once, in a form that runs identically at training time and inference time. A sklearn Pipeline that includes both preprocessing and the model is one pattern. A shared library imported by both training and serving code is another. Whatever you choose, the rule is: one implementation, used in both places.
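The shared-library pattern can be sketched in a few lines. Everything here is hypothetical (the feature names, the log-and-scale transformation, the module layout); the point is only that fitted parameters and the transform live in one place and both sides call the same function.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocessParams:
    """Statistics fitted on training data and shipped with the model."""
    amount_mean: float
    amount_std: float

def fit_params(raw_rows: list) -> PreprocessParams:
    """Called once, by the training job only."""
    logs = [math.log1p(r["amount"]) for r in raw_rows]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    # Fall back to 1.0 so a constant column does not divide by zero.
    return PreprocessParams(mean, math.sqrt(var) or 1.0)

def transform(row: dict, params: PreprocessParams) -> list:
    """The single implementation used by BOTH training and serving."""
    scaled = (math.log1p(row["amount"]) - params.amount_mean) / params.amount_std
    is_weekend = 1.0 if row["day_of_week"] >= 5 else 0.0
    return [scaled, is_weekend]

# Training side: fit, then transform every row.
rows = [
    {"amount": 10.0, "day_of_week": 2},
    {"amount": 250.0, "day_of_week": 6},
    {"amount": 40.0, "day_of_week": 5},
]
params = fit_params(rows)
train_matrix = [transform(r, params) for r in rows]

# Serving side: load the saved params and call the same function.
request_features = transform({"amount": 250.0, "day_of_week": 6}, params)
assert request_features == train_matrix[1]
```

The final assertion is the property that matters: the same raw input yields the same features on both sides, because there is no second implementation to drift.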

Monitoring

Once deployed, a model needs monitoring at two levels.

Infrastructure monitoring (standard DevOps):

Request latency (p50, p95, p99)
Error rate
CPU/memory/GPU utilisation
Throughput
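For the latency percentiles, a minimal sketch using the standard library is shown here. In practice your metrics system usually computes these for you; the function name and the sample data are assumptions for illustration.

```python
import statistics

def latency_percentiles(latencies_ms: list) -> dict:
    """Return p50, p95 and p99 from a list of request latencies."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic latencies of 1 ms, 2 ms, ..., 101 ms.
sample = list(range(1, 102))
print(latency_percentiles(sample))
# {'p50': 51.0, 'p95': 96.0, 'p99': 100.0}
```

Tail percentiles matter more than the mean here: a healthy p50 can hide a p99 that breaks your SLA.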

Model monitoring (specific to ML):

Input distribution drift — are the features arriving at inference time still distributed like the training data? Large shifts indicate the world has changed and the model may be stale.
Prediction distribution drift — is the model producing significantly different predictions than it used to? Can indicate drift or a bug.
Ground truth comparison — where you can collect labels (click-through rate, fraud confirmation, diagnosis outcome), compare model predictions to actual outcomes to measure real-world performance.

Tools: Evidently, whylogs, and Arize are commonly used for ML-specific monitoring.
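To show what an input drift check computes, here is a small sketch of the Population Stability Index (PSI) for one numeric feature. PSI is a swapped-in example metric, one of several in common use, and not necessarily what any of the tools above runs by default. The equal-width binning, the eps floor, and the 0.25 threshold (a frequently quoted rule of thumb) are assumptions.

```python
import math

def psi(expected: list, actual: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the range of the reference (training) sample;
    live values outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant feature

    def fractions(values: list) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical usage with a training sample and a window of live traffic:
#   score = psi(training_amounts, last_24h_amounts)
#   if score > 0.25:  # rule-of-thumb threshold, tune for your data
#       alert("input drift on 'amount'")
```

Identical samples score exactly 0, and the score grows as the live distribution moves away from the reference.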

Model versioning and rollout

Never deploy a new model to 100% of traffic immediately. Use canary deployment:

Deploy v2 alongside v1
Route 5% of traffic to v2
Monitor metrics for 24–48 hours
Increase to 20%, 50%, 100% if metrics are stable
Decommission v1

This requires your serving infrastructure to support traffic splitting. Managed services (SageMaker, Vertex AI, Azure ML) provide it out of the box; on Kubernetes it usually comes from a service mesh or ingress controller (Istio, for example) rather than from Kubernetes itself.
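If you have to implement the split yourself, a deterministic hash of a stable request key is one simple approach. This is a sketch under assumptions: the function name, the salt, and the choice of user ID as the key are hypothetical, and a real router would also need to emit the chosen version in its logs.

```python
import hashlib

def route_to_canary(request_key: str, canary_percent: float,
                    salt: str = "v2-rollout") -> bool:
    """Decide whether this key is served by the canary (v2).

    The same key always lands in the same bucket, so a user does not
    flip between versions, and raising canary_percent only adds keys
    to the canary group without removing any.
    """
    digest = hashlib.sha256(f"{salt}:{request_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100                 # 5% -> buckets 0..499

# Hypothetical usage inside a request handler:
#   model = model_v2 if route_to_canary(user_id, 5) else model_v1
```

Changing the salt for each rollout reshuffles the buckets, so the same users are not always the first to see a new model.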

Version your models with semantic versioning and keep a registry (MLflow Model Registry, Weights & Biases, or even a simple database table). You need to be able to answer “which model version is serving this prediction right now?” at any point in time.
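The "simple database table" option can be as small as the sketch below, which uses SQLite from the standard library. The table layout, stage names, and artifact URIs are assumptions for illustration and do not mirror the MLflow or Weights & Biases schemas.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

def init_registry(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS model_versions (
            version       TEXT PRIMARY KEY,
            artifact_uri  TEXT NOT NULL,
            stage         TEXT NOT NULL,
            registered_at TEXT NOT NULL
        )
    """)

def register(conn: sqlite3.Connection, version: str, artifact_uri: str) -> None:
    """Record a new version; it starts in the 'staging' stage."""
    with conn:
        conn.execute(
            "INSERT INTO model_versions VALUES (?, ?, 'staging', ?)",
            (version, artifact_uri, datetime.now(timezone.utc).isoformat()),
        )

def promote(conn: sqlite3.Connection, version: str) -> None:
    """Make `version` the production model and archive the previous one."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE model_versions SET stage = 'archived' WHERE stage = 'production'"
        )
        cur = conn.execute(
            "UPDATE model_versions SET stage = 'production' WHERE version = ?",
            (version,),
        )
        if cur.rowcount != 1:
            raise KeyError(f"unknown model version: {version}")

def current_production(conn: sqlite3.Connection) -> Optional[str]:
    row = conn.execute(
        "SELECT version FROM model_versions WHERE stage = 'production'"
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
init_registry(conn)
register(conn, "1.0.0", "s3://example-bucket/models/1.0.0/model.pkl")  # hypothetical URI
register(conn, "1.1.0", "s3://example-bucket/models/1.1.0/model.pkl")  # hypothetical URI
promote(conn, "1.0.0")
promote(conn, "1.1.0")
print(current_production(conn))  # 1.1.0
```

Rolling back is then just another call to promote with the previous version, which is why archived rows are kept, not deleted.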

Retraining pipelines

Models decay. The world changes, user behaviour shifts, the training data becomes unrepresentative. Build retraining into the system from the start:

Scheduled retraining — retrain weekly/monthly on a rolling window of recent data
Triggered retraining — retrain when drift monitoring detects a significant distribution shift
Online learning — update model weights continuously as new data arrives (complex, often overkill)

The retraining pipeline should be automated, tested, and fast enough to deploy a new version within hours of a trigger. A model that can’t be updated quickly is a liability when something goes wrong.
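Scheduled and triggered retraining can share one decision function that the pipeline scheduler calls periodically. This is a minimal sketch: the function name, the 30-day maximum age, and the 0.25 drift threshold are assumptions, and drift_score stands for whatever number your drift monitoring emits.

```python
from datetime import datetime, timedelta

def should_retrain(last_trained: datetime, now: datetime, drift_score: float,
                   max_age: timedelta = timedelta(days=30),
                   drift_threshold: float = 0.25) -> tuple:
    """Return (retrain?, reason) for the current model."""
    # Triggered retraining: drift takes priority over the calendar.
    if drift_score >= drift_threshold:
        return True, "drift"
    # Scheduled retraining: the model is simply too old.
    if now - last_trained >= max_age:
        return True, "schedule"
    return False, "fresh"

now = datetime(2024, 6, 1)
print(should_retrain(now - timedelta(days=3), now, drift_score=0.40))   # (True, 'drift')
print(should_retrain(now - timedelta(days=45), now, drift_score=0.05))  # (True, 'schedule')
print(should_retrain(now - timedelta(days=3), now, drift_score=0.05))   # (False, 'fresh')
```

Returning the reason alongside the decision makes it easy to log why each retraining run started.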

The checklist

Before promoting a model to production:

Preprocessing is implemented once, shared between training and serving
Model artifact and code are versioned together
Inference latency meets SLA at expected load (load test first)
Monitoring is set up for both infrastructure and model metrics
Rollback procedure is documented and tested
Retraining pipeline exists and is automated
Model card documents performance across demographic subgroups

The gap between a notebook model and a production system is real work. But it’s tractable work, and the patterns above cover the majority of production deployments you’ll encounter.