3 minute read

Published on July 29, 2024

Stable Diffusion 1.5 is often perceived as “just another generative model”. In reality, deploying it in production — with predictable latency, high throughput, and hardware efficiency — is a systems engineering challenge.

This article describes how I optimized and deployed Stable Diffusion 1.5 by:

  • converting Hugging Face PyTorch models to ONNX and TensorRT,
  • serving them with NVIDIA Triton Inference Server,
  • and restructuring the diffusion process using ensemble scheduling to flatten the denoising loop into a production-ready inference graph.

The result is a scalable, modular, GPU-efficient deployment that can serve complex diffusion workflows reliably.


1. Why Stable Diffusion Is Hard to Deploy

At a high level, Stable Diffusion consists of:

  • a text tokenizer,
  • a text encoder (CLIP),
  • a denoising UNet executed repeatedly,
  • a scheduler loop,
  • a VAE decoder,
  • and multiple conditioning paths (ControlNet, IP-Adapter, inpainting, etc.).

The challenge is that diffusion is inherently iterative:

  • the UNet is executed N times,
  • each step depends on the previous one,
  • control signals and guidance must be applied dynamically.

In Python, this is a simple loop.
In production inference systems, loops are a problem.


2. From PyTorch to ONNX and TensorRT

The first step is model decomposition.

Instead of serving a monolithic PyTorch pipeline, each component is exported independently:

  • tokenizer
  • text encoder
  • conditioner
  • denoiser (UNet)
  • scheduler step
  • VAE decoder
  • post-processing

Each component is:

  1. exported from PyTorch to ONNX,
  2. optimized and compiled into a TensorRT engine with the trtexec command-line tool.
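
As an illustration, exporting the UNet could look roughly like the sketch below (the checkpoint id, dummy input shapes, and opset version are assumptions; adapt them to the actual pipeline):

import torch
from diffusers import UNet2DConditionModel

# Wrapper so the exported graph returns a plain tensor instead of a dataclass.
class UNetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).eval()

# Dummy inputs matching the UNet signature: latents, timestep, CLIP text embeddings.
sample = torch.randn(2, 4, 64, 64)             # 512x512 images -> 64x64 latents
timestep = torch.tensor([981], dtype=torch.int64)
encoder_hidden_states = torch.randn(2, 77, 768)

torch.onnx.export(
    UNetWrapper(unet),
    (sample, timestep, encoder_hidden_states),
    "unet.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["noise_pred"],
    dynamic_axes={"sample": {0: "batch"}, "encoder_hidden_states": {0: "batch"}},
    opset_version=17,
)

The resulting unet.onnx is then compiled into an engine with something like trtexec --onnx=unet.onnx --fp16 --saveEngine=unet.plan (the exact flags depend on the target shapes and GPU).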

This brings:

  • kernel fusion,
  • reduced memory movement,
  • FP16 execution,
  • predictable latency.

At this stage, each model is fast, but the pipeline is still a sequence of calls orchestrated from Python.


3. Why Triton Inference Server

NVIDIA Triton is not “just a model server”.
It is a graph execution engine for inference.

Key capabilities that matter here:

  • model versioning,
  • GPU memory reuse,
  • dynamic batching,
  • and most importantly: ensemble scheduling.

Ensembles allow multiple models to be connected into a directed inference graph, where outputs of one model become inputs of the next — without Python orchestration.

This is the key to flattening the diffusion loop.


4. Flattening the Diffusion Loop with Ensemble Scheduling

Instead of executing a Python loop such as:

for t in timesteps:
    noise_pred = unet(latents, t, text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

the loop is unrolled explicitly inside a Triton ensemble.

Each denoising step becomes:

  • one UNet call,
  • followed by one scheduler step,
  • passing latents and timesteps forward explicitly.

This transforms a dynamic loop into a static inference graph.
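
Writing a few dozen near-identical step entries by hand is tedious, so one practical option is to generate the denoiser/scheduler portion of the ensemble config with a small script. A rough sketch (the model names match the pipeline described in the next section; the tensor names are assumptions):

def make_denoising_steps(num_steps: int = 20) -> str:
    """Generate the unrolled denoiser/scheduler entries for the
    ensemble_scheduling.step list of the ensemble's config.pbtxt."""
    entries = []
    for i in range(num_steps):
        entries.append(f"""\
    {{
      model_name: "5_denoiser"
      model_version: -1
      input_map {{ key: "sample" value: "latents_{i}" }}
      input_map {{ key: "timestep" value: "timestep_{i}" }}
      input_map {{ key: "encoder_hidden_states" value: "text_embeddings" }}
      output_map {{ key: "noise_pred" value: "noise_pred_{i}" }}
    }},
    {{
      model_name: "6_scheduler_step"
      model_version: -1
      input_map {{ key: "noise_pred" value: "noise_pred_{i}" }}
      input_map {{ key: "latents" value: "latents_{i}" }}
      input_map {{ key: "timestep" value: "timestep_{i}" }}
      output_map {{ key: "latents" value: "latents_{i + 1}" }}
      output_map {{ key: "timestep" value: "timestep_{i + 1}" }}
    }}""")
    return ",\n".join(entries)

print(make_denoising_steps(20))

The generated entries go inside the step [ ... ] list of the ensemble's config.pbtxt; Triton then wires the outputs of each step to the inputs of the next and schedules the whole graph server-side.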


5. Overview of the Ensemble Pipeline

The ensemble model (generate_20_pipeline) defines:

  • all inputs (prompt, seed, masks, guidance, control signals),
  • all intermediate tensors,
  • all denoising steps,
  • and final decoding and post-processing.

The pipeline consists of:

Text Processing

  • 1_tokenizer
  • 2_text_encoder (positive & negative prompts)

Conditioning

  • 4_conditionner
    • prepares latents
    • builds ControlNet inputs
    • computes timesteps
    • assembles embeddings

Denoising Loop (Unrolled)

  • Repeated blocks of:
    • 5_denoiser (TensorRT UNet)
    • 6_scheduler_step

Each step:

  • consumes the previous latent state,
  • applies guidance, ControlNet scales, and IP-Adapter weights,
  • produces the next latent and timestep.

This continues until the final denoising step.
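
To make the per-step dataflow concrete, here is a rough sketch of what a scheduler step computes, assuming a DDIM-style scheduler and plain classifier-free guidance (ControlNet scales and IP-Adapter weights are left out, and the actual implementation may differ):

from diffusers import DDIMScheduler  # assumed; the real pipeline may use another scheduler

scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)
scheduler.set_timesteps(20)  # assumed step count

def scheduler_step(noise_uncond, noise_text, latents, step_index, guidance_scale=7.5):
    """Core of one 6_scheduler_step: apply classifier-free guidance to the two
    UNet outputs, advance the latents by one step, and hand the next timestep on."""
    t = scheduler.timesteps[step_index]
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    latents_next = scheduler.step(noise_pred, t, latents).prev_sample
    if step_index + 1 < len(scheduler.timesteps):
        t_next = scheduler.timesteps[step_index + 1]
    else:
        t_next = t  # last step: nothing left to denoise
    return latents_next, t_next

The returned latents and timestep are exactly the tensors the ensemble passes forward to the next 5_denoiser call.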

Decoding & Post-Processing

  • 7_vae_decoder
  • 8_postprocessor
    • blending
    • inpainting
    • enhancement logic

All of this is executed inside Triton, without Python in the loop.
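
From the client's point of view, the whole graph collapses into a single request against the ensemble. A minimal sketch with tritonclient (the input and output names are assumptions and must match the ensemble's config; the real pipeline also takes masks, control images, and guidance parameters):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input names and shapes; they must match the ensemble's config.
prompt = np.array([["a product photo on a marble table"]], dtype=object)
seed = np.array([[1234]], dtype=np.int64)

inputs = [
    httpclient.InferInput("prompt", list(prompt.shape), "BYTES"),
    httpclient.InferInput("seed", list(seed.shape), "INT64"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(seed)

result = client.infer("generate_20_pipeline", inputs)
image = result.as_numpy("image")  # already decoded and post-processed server-side

One request in, one image out; every intermediate tensor stays on the server.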


6. Why This Matters in Production

This architecture provides several critical advantages:

Deterministic Performance

  • No Python control flow
  • Fixed execution graph
  • Predictable GPU utilization

High Throughput

  • TensorRT-optimized kernels
  • Reduced CPU-GPU synchronization
  • Better batching opportunities

Modularity

  • Each component can be updated independently
  • Easy A/B testing (e.g. new UNet, new scheduler)
  • Clear separation of concerns

Scalability

  • Multiple pipelines (img2img, inpaint, ControlNet variants)
  • Horizontal scaling across GPUs
  • Cloud-friendly deployment


7. Handling Complex Inputs

The pipeline supports:

  • variable image sizes,
  • multiple masks,
  • ControlNet conditioning,
  • IP-Adapter embeddings,
  • product-specific blending and enhancement logic.

All of this is expressed as dataflow, not code.

This is crucial:

production diffusion is a data orchestration problem, not a model problem.


8. Stable Diffusion as a System, Not a Model

The key lesson from this work is that Stable Diffusion is not a single model — it is a distributed inference system.

Performance does not come from:

  • a better checkpoint,
  • or a faster UNet alone.

It comes from:

  • graph design,
  • memory reuse,
  • execution order,
  • and removing Python from the critical path.

Triton ensemble scheduling makes this possible.


Closing Thoughts

Optimizing Stable Diffusion 1.5 for production requires rethinking how diffusion is executed. By:

  • decomposing the pipeline,
  • compiling models with TensorRT,
  • and flattening the denoising loop using Triton ensembles,

we move from an experimental Python workflow to a robust, scalable inference system.

This approach turns Stable Diffusion from a research artifact into production infrastructure.