Machine Learning Model Deployment: From Development to Production Environments

A step-by-step guide to moving machine learning models from notebooks into real-world production systems — serving, monitoring, versioning, and avoiding the common pitfalls.

Training a model is the part of machine learning that gets the most attention. Deploying it is the part that actually matters. A model that lives in a notebook is a science project. A model in production, serving real requests, updating as the world changes — that’s the point.

The gap between the two is larger than most tutorials acknowledge.

The deployment landscape

Before choosing an approach, understand what you’re deploying:

  • Latency requirements — real-time inference (< 100ms) vs batch predictions (minutes to hours) vs offline scoring (daily)
  • Throughput requirements — requests per second at peak load
  • Model size — BERT-base (about 110M parameters) is roughly 440MB in fp32; a large vision model might be 10GB
  • Hardware — CPU inference is cheaper; GPU inference is faster for large models
  • Update frequency — how often do you retrain and redeploy?

These answers determine your architecture before you write a line of deployment code.
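
A rough way to turn these answers into numbers is sketched below. It assumes weight size is simply parameter count times bytes per parameter, and that replica count follows from peak throughput times latency (Little's law) with no headroom. The function names and the example figures are illustrative, not measurements.

```python
import math


def estimate_weights_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000


def estimate_replicas(peak_rps: int, latency_ms: int, concurrency_per_replica: int = 1) -> int:
    """Little's law: requests in flight = arrival rate x time in system.

    Divides the in-flight requests by how many one replica can handle at once.
    No safety margin is included; real deployments usually add headroom.
    """
    in_flight = peak_rps * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency_per_replica))


# BERT-base has roughly 110 million parameters.
print(estimate_weights_mb(110_000_000))                    # fp32 weights
print(estimate_weights_mb(110_000_000, bytes_per_parameter=2))  # fp16 weights
print(estimate_replicas(peak_rps=200, latency_ms=50))
```

Treat the results as a starting point for a load test, not a substitute for one.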

Serving approaches

REST API with FastAPI is the right starting point for most use cases:

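from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
# Load the model once at startup, not on every request.
model = joblib.load("model.pkl")

class PredictRequest(BaseModel):
    # One row of numeric features, in the same order used at training time.
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # predict() expects a 2D input, so wrap the single row in a list.
    prediction = model.predict([req.features])[0]
    # Convert the NumPy scalar to a plain float so it serialises to JSON.
    return {"prediction": float(prediction)}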

Add Docker, expose a port, and you have a deployable service. This works for sklearn models, XGBoost, LightGBM, and anything that fits in memory and runs on CPU.

TorchServe or TF Serving is the better fit for deep learning models. These frameworks handle batching, GPU memory management, and model versioning, all concerns that a generic web framework doesn’t address well.
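
To make the batching idea concrete, here is a minimal sketch of grouping pending requests so the model is called once per batch instead of once per request. It is not how TorchServe or TF Serving implement it; real servers also flush a partial batch after a timeout. The names `make_batches` and `predict_in_batches` are hypothetical.

```python
from typing import Callable, Iterator, Sequence


def make_batches(requests: Sequence[list[float]], max_batch_size: int) -> Iterator[list[list[float]]]:
    """Yield consecutive groups of at most max_batch_size requests."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(requests), max_batch_size):
        yield list(requests[start:start + max_batch_size])


def predict_in_batches(
    predict_fn: Callable[[list[list[float]]], list[float]],
    requests: Sequence[list[float]],
    max_batch_size: int = 8,
) -> list[float]:
    """Call predict_fn once per batch and return results in request order."""
    results: list[float] = []
    for batch in make_batches(requests, max_batch_size):
        results.extend(predict_fn(batch))
    return results


# Stand-in for a real model: "predicts" the sum of each feature row.
def fake_model(batch: list[list[float]]) -> list[float]:
    return [sum(row) for row in batch]


pending = [[float(i), 1.0] for i in range(10)]
print(predict_in_batches(fake_model, pending, max_batch_size=4))
```

On a GPU, one call over a batch of rows is usually much cheaper than the same number of single-row calls, which is why dedicated serving frameworks do this for you.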

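Serverless (AWS Lambda, Google Cloud Functions) works for models that fit within the platform’s memory, package-size, and cold-start constraints. AWS Lambda, for example, allows up to 10GB of memory and offers no GPU, but cold starts lengthen as the model grows, so models of a few hundred megabytes or less are the comfortable fit. Good for low-traffic, bursty workloads where you don’t want to pay for idle compute.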

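ONNX Runtime deserves mention: exporting your PyTorch or TensorFlow model to ONNX format lets you run inference through a single runtime that optimises for the target hardware. Speedups of 2–4x over the native framework on CPU are often achievable, but the gain varies with model architecture and hardware, so benchmark your own model before committing.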

Packaging and reproducibility

The model artifact alone is not enough. You need:

  • The model weights/pickle
  • The exact version of every library used for training and inference
  • Preprocessing code (the same transformations used at training time, applied at inference time)
  • Feature schema (what inputs are expected, their types and ranges)
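
A feature schema can be as simple as a list of expected names and ranges checked before every prediction. The sketch below assumes purely numeric features in a fixed order; `FeatureSpec`, `SCHEMA`, and the two example features are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    minimum: float
    maximum: float


# Hypothetical schema for illustration only.
SCHEMA = [
    FeatureSpec("age_years", 0.0, 120.0),
    FeatureSpec("account_balance", -1e6, 1e9),
]


def validate_features(features: list, schema: list[FeatureSpec]) -> list[str]:
    """Return a list of problems; an empty list means the row is valid."""
    if len(features) != len(schema):
        return [f"expected {len(schema)} features, got {len(features)}"]
    problems = []
    for value, spec in zip(features, schema):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{spec.name}: not a number")
        elif math.isnan(value):
            problems.append(f"{spec.name}: NaN")
        elif not spec.minimum <= value <= spec.maximum:
            problems.append(f"{spec.name}: {value} outside [{spec.minimum}, {spec.maximum}]")
    return problems


print(validate_features([35.0, 1200.5], SCHEMA))
print(validate_features([150.0, float("nan")], SCHEMA))
```

Rejecting or logging invalid rows at the service boundary turns silent bad predictions into visible errors.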

Use MLflow or DVC to track experiments and associate artifacts with the code and data that produced them. A model you can’t reproduce is a liability.
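
If you are not yet using a tracking tool, a small manifest stored next to the artifact already captures much of this. The sketch below uses only the standard library; the manifest layout and the function names are assumptions for illustration, not a format that any tool expects.

```python
import hashlib
import json
import platform
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(artifact_bytes: bytes, packages: list[str]) -> dict:
    """Describe a model artifact and the environment it was produced in."""
    return {
        # The hash ties the manifest to one exact artifact.
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "packages": {name: installed_version(name) for name in packages},
    }


# In practice artifact_bytes would be the contents of the saved model file.
manifest = build_manifest(b"abc", ["scikit-learn", "numpy"])
print(json.dumps(manifest, indent=2))
```

At serving time the same function can run again, and a mismatch between the two manifests is a cheap early warning.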

Docker solves the “works on my machine” problem:

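FROM python:3.12-slim
WORKDIR /app
# Install dependencies first so this layer is cached across code changes.
# Pin exact versions in requirements.txt to match the training environment.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl ./
COPY app.py ./
# Document the port that uvicorn listens on.
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]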

The training-serving skew problem

This is one of the most common causes of production model failures. Training-serving skew happens when the data your model sees in production differs from the data it was trained on, usually because preprocessing is implemented in two different places.

The fix is to implement preprocessing once, in a form that runs identically at training time and inference time. A sklearn Pipeline that includes both preprocessing and the model is one pattern. A shared library imported by both training and serving code is another. Whatever you choose, the rule is: one implementation, used in both places.
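
A minimal sketch of the "one implementation" rule, using only the standard library: the scaler is fitted during training, its parameters are saved, and serving reloads the same class with the same parameters. The `Standardizer` class is hypothetical and stands in for whatever preprocessing you actually use.

```python
import json
from statistics import mean, pstdev


class Standardizer:
    """Scales each feature column to zero mean and unit variance."""

    def __init__(self, means=None, stds=None):
        self.means = means
        self.stds = stds

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        if self.means is None or self.stds is None:
            raise RuntimeError("call fit() or from_json() first")
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]

    def to_json(self) -> str:
        return json.dumps({"means": self.means, "stds": self.stds})

    @classmethod
    def from_json(cls, payload: str) -> "Standardizer":
        params = json.loads(payload)
        return cls(params["means"], params["stds"])


# Training side: fit on training data and persist the fitted parameters.
train_rows = [[1.0, 10.0], [3.0, 10.0]]
saved = Standardizer().fit(train_rows).to_json()

# Serving side: same class, same parameters, no re-implementation.
serving_scaler = Standardizer.from_json(saved)
print(serving_scaler.transform([[3.0, 10.0]]))
```

The important property is that serving never recomputes means or re-implements the scaling; it only loads what training produced.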

Monitoring

Once deployed, a model needs monitoring at two levels.

Infrastructure monitoring (standard DevOps):

  • Request latency (p50, p95, p99)
  • Error rate
  • CPU/memory/GPU utilisation
  • Throughput
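
For reference, the latency percentiles above can be computed from raw request timings as sketched here. This uses the nearest-rank definition; monitoring systems often use interpolation or histogram approximations instead, so their numbers may differ slightly. The sample latencies are made up.

```python
def percentile(samples: list[float], q: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= q <= 100:
        raise ValueError("q must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 95, 99)}


# Made-up latencies in milliseconds; one slow outlier dominates the tail.
print(latency_summary([12, 15, 11, 240, 14]))
```

The mean of these samples would hide the slow request almost entirely, which is why tail percentiles are the standard latency metric.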

Model monitoring (specific to ML):

  • Input distribution drift — are the features arriving at inference time still distributed like the training data? Large shifts indicate the world has changed and the model may be stale.
  • Prediction distribution drift — is the model producing significantly different predictions than it used to? Can indicate drift or a bug.
  • Ground truth comparison — where you can collect labels (click-through rate, fraud confirmation, diagnosis outcome), compare model predictions to actual outcomes to measure real-world performance.
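
One common way to quantify input drift for a single numeric feature is the Population Stability Index (PSI). The sketch below is a simplified version that assumes equal-width bins over the training range and a small floor for empty bins; dedicated monitoring libraries make different binning choices. A widely quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but that threshold is a convention, not a guarantee.

```python
import math


def population_stability_index(reference: list[float], current: list[float], bins: int = 10) -> float:
    """PSI between a reference (training) sample and a current (production) sample."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            index = min(max(int((v - lo) / width), 0), bins - 1)
            counts[index] += 1
        # Floor at a tiny value so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


training = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
shifted = [x + 3 for x in training]         # same shape, shifted upward
print(population_stability_index(training, training))
print(population_stability_index(training, shifted))
```

In practice you would compute this per feature on a schedule and alert when any feature crosses your chosen threshold.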

Tools: Evidently, whylogs, and Arize are commonly used for ML-specific monitoring.

Model versioning and rollout

Never deploy a new model to 100% of traffic immediately. Use canary deployment:

  1. Deploy v2 alongside v1
  2. Route 5% of traffic to v2
  3. Monitor metrics for 24–48 hours
  4. Increase to 20%, 50%, 100% if metrics are stable
  5. Decommission v1

This requires your serving infrastructure to support traffic splitting. Managed services (SageMaker, Vertex AI, Azure ML) provide it out of the box; on Kubernetes it usually comes from a service mesh or ingress controller rather than from Kubernetes itself.
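
If you have to implement the split yourself, hashing a stable request key is a simple approach. The sketch below assumes each request carries some stable identifier such as a user ID; `assign_version` and the "v1"/"v2" labels are hypothetical names.

```python
import hashlib


def assign_version(request_key: str, canary_percent: int) -> str:
    """Route a stable share of keys to v2; the same key always gets the same answer."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to one of 100 buckets; buckets below the threshold go to v2.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "v2" if bucket < canary_percent else "v1"


keys = [f"user-{i}" for i in range(10_000)]
canary_share = sum(assign_version(k, 5) == "v2" for k in keys) / len(keys)
print(f"share routed to v2 at 5%: {canary_share:.3f}")
```

Because the bucket depends only on the key, raising the percentage from 5 to 20 keeps every existing canary user on v2 and only adds new ones, which keeps per-user behaviour consistent during the rollout.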

Version your models with semantic versioning and keep a registry (MLflow Model Registry, Weights & Biases, or even a simple database table). You need to be able to answer “which model version is serving this prediction right now?” at any point in time.
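
The "simple database table" option might look like the sketch below, which uses SQLite from the standard library. The table layout, the stage names, and the function names are assumptions for illustration; a real registry would also record metrics, the training data reference, and who promoted what.

```python
import sqlite3
from datetime import datetime, timezone


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS model_versions (
               version TEXT PRIMARY KEY,
               artifact_sha256 TEXT NOT NULL,
               stage TEXT NOT NULL DEFAULT 'staging',
               registered_at TEXT NOT NULL
           )"""
    )
    return conn


def register(conn: sqlite3.Connection, version: str, artifact_sha256: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO model_versions (version, artifact_sha256, registered_at) VALUES (?, ?, ?)",
            (version, artifact_sha256, datetime.now(timezone.utc).isoformat()),
        )


def promote(conn: sqlite3.Connection, version: str) -> None:
    """Make `version` the single production model and archive the previous one."""
    with conn:
        found = conn.execute(
            "SELECT 1 FROM model_versions WHERE version = ?", (version,)
        ).fetchone()
        if found is None:
            raise KeyError(version)
        conn.execute("UPDATE model_versions SET stage = 'archived' WHERE stage = 'production'")
        conn.execute("UPDATE model_versions SET stage = 'production' WHERE version = ?", (version,))


def production_version(conn: sqlite3.Connection):
    row = conn.execute(
        "SELECT version FROM model_versions WHERE stage = 'production'"
    ).fetchone()
    return row[0] if row else None


registry = open_registry()
register(registry, "1.0.0", "placeholder-hash-a")
register(registry, "1.1.0", "placeholder-hash-b")
promote(registry, "1.0.0")
promote(registry, "1.1.0")
print(production_version(registry))
```

With this layout a rollback is just another `promote` call on the earlier version, since archived rows are kept rather than deleted.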

Retraining pipelines

Models decay. The world changes, user behaviour shifts, the training data becomes unrepresentative. Build retraining into the system from the start:

  • Scheduled retraining — retrain weekly/monthly on a rolling window of recent data
  • Triggered retraining — retrain when drift monitoring detects a significant distribution shift
  • Online learning — update model weights continuously as new data arrives (complex, often overkill)

The retraining pipeline should be automated, tested, and fast enough to deploy a new version within hours of a trigger. A model that can’t be updated quickly is a liability when something goes wrong.
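
Scheduled and triggered retraining can share one decision function, as in this sketch. The 30-day age limit and the 0.25 drift threshold are placeholder values, and `RetrainPolicy` and `retrain_reason` are hypothetical names; the drift score is assumed to come from whatever drift monitoring you already run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetrainPolicy:
    max_model_age: timedelta = timedelta(days=30)  # placeholder schedule
    drift_threshold: float = 0.25                  # placeholder drift limit


def retrain_reason(
    last_trained_at: datetime,
    now: datetime,
    drift_score: float,
    policy: RetrainPolicy = RetrainPolicy(),
) -> Optional[str]:
    """Return why retraining is due, or None if the current model can stay."""
    # Drift is checked first because it signals an actual change in the data.
    if drift_score >= policy.drift_threshold:
        return "drift"
    if now - last_trained_at >= policy.max_model_age:
        return "schedule"
    return None


trained = datetime(2024, 1, 1)
print(retrain_reason(trained, datetime(2024, 1, 10), drift_score=0.05))
print(retrain_reason(trained, datetime(2024, 1, 10), drift_score=0.40))
print(retrain_reason(trained, datetime(2024, 2, 15), drift_score=0.05))
```

Returning the reason rather than a bare boolean makes it easy to log why each retraining run started.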

The checklist

Before promoting a model to production:

  • Preprocessing is implemented once, shared between training and serving
  • Model artifact and code are versioned together
  • Inference latency meets SLA at expected load (load test first)
  • Monitoring is set up for both infrastructure and model metrics
  • Rollback procedure is documented and tested
  • Retraining pipeline exists and is automated
  • Model card documents performance across demographic subgroups

The gap between a notebook model and a production system is real work. But it’s tractable work, and the patterns above cover the majority of production deployments you’ll encounter.