Optimizing and Deploying Stable Diffusion 1.5 in Production with ONNX, TensorRT and NVIDIA Triton
Published on July 29, 2024
Stable Diffusion 1.5 is often perceived as “just another generative model”. In reality, deploying it in production — with predictable latency, high throughput, and hardware efficiency — is a systems engineering challenge.
This article describes how I optimized and deployed Stable Diffusion 1.5 by:
- converting Hugging Face PyTorch models to ONNX and TensorRT,
- serving them with NVIDIA Triton Inference Server,
- and restructuring the diffusion process using ensemble scheduling to flatten the denoising loop into a production-ready inference graph.
The result is a scalable, modular, GPU-efficient deployment that can serve complex diffusion workflows reliably.
1. Why Stable Diffusion Is Hard to Deploy
At a high level, Stable Diffusion consists of:
- a text tokenizer,
- a text encoder (CLIP),
- a denoising UNet executed repeatedly,
- a scheduler loop,
- a VAE decoder,
- and multiple conditioning paths (ControlNet, IP-Adapter, inpainting, etc.).
The challenge is that diffusion is inherently iterative:
- the UNet is executed N times,
- each step depends on the previous one,
- control signals and guidance must be applied dynamically.
In Python, this is a simple loop.
In production inference systems, loops are a problem.
2. From PyTorch to ONNX and TensorRT
The first step is model decomposition.
Instead of serving a monolithic PyTorch pipeline, each component is exported independently:
- tokenizer
- text encoder
- conditioner
- denoiser (UNet)
- scheduler step
- VAE decoder
- post-processing
Each component is:
- exported from PyTorch to ONNX,
- optimized and compiled into a TensorRT engine with the trtexec command-line tool (a sketch follows below).
This brings:
- kernel fusion,
- reduced memory movement,
- FP16 execution,
- predictable latency.
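As an illustration, here is a minimal sketch of exporting the UNet to ONNX and compiling it with trtexec. The checkpoint, shapes, opset, and file names are assumptions for illustration, not the exact configuration of this pipeline:

```python
import torch
from diffusers import UNet2DConditionModel


class UNetWrapper(torch.nn.Module):
    """Unwraps the diffusers output dataclass so ONNX export sees plain tensors."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]


unet = UNetWrapper(
    UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="unet"
    ).eval()
)

# Dummy inputs for SD 1.5: 64x64 latents (4 channels), 77-token CLIP embeddings (dim 768).
# Batch of 2 to account for classifier-free guidance.
sample = torch.randn(2, 4, 64, 64)
timestep = torch.tensor(999)
text_emb = torch.randn(2, 77, 768)

torch.onnx.export(
    unet,
    (sample, timestep, text_emb),
    "unet.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["noise_pred"],
    opset_version=17,
    dynamic_axes={"sample": {0: "batch"}, "encoder_hidden_states": {0: "batch"}},
)

# Compile the ONNX graph into a TensorRT engine, e.g.:
#   trtexec --onnx=unet.onnx --saveEngine=unet.plan --fp16
```

The text encoder and VAE decoder follow the same export-then-compile pattern.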
At this stage, each model is fast — but the pipeline is still sequential.
3. Why Triton Inference Server
NVIDIA Triton is not “just a model server”.
It is a graph execution engine for inference.
Key capabilities that matter here:
- model versioning,
- GPU memory reuse,
- dynamic batching,
- and most importantly: ensemble scheduling.
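For instance, dynamic batching is enabled per model with a short stanza in its config.pbtxt; the values below are illustrative, not tuned recommendations:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```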
Ensembles allow multiple models to be connected into a directed inference graph, where outputs of one model become inputs of the next — without Python orchestration.
This is the key to flattening the diffusion loop.
4. Flattening the Diffusion Loop with Ensemble Scheduling
Instead of executing the familiar Python loop (sketched here with diffusers-style calls):

```python
# text_emb: CLIP embeddings produced by the text encoder
for t in timesteps:
    noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```

the loop is unrolled explicitly inside a Triton ensemble.
Each denoising step becomes:
- one UNet call,
- followed by one scheduler step,
- passing latents and timesteps forward explicitly.
This transforms a dynamic loop into a static inference graph.
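Concretely, each unrolled step is a pair of entries in the ensemble's config.pbtxt. A hedged sketch with illustrative tensor names; the model names match the pipeline described in the next section, but the exact I/O keys are assumptions, and conditioning inputs are omitted for brevity:

```
ensemble_scheduling {
  step [
    {
      model_name: "5_denoiser"
      model_version: -1
      input_map  { key: "sample"   value: "latents_0" }
      input_map  { key: "timestep" value: "timestep_0" }
      output_map { key: "noise_pred" value: "noise_pred_0" }
    },
    {
      model_name: "6_scheduler_step"
      model_version: -1
      input_map  { key: "noise_pred" value: "noise_pred_0" }
      input_map  { key: "latents"    value: "latents_0" }
      input_map  { key: "timestep"   value: "timestep_0" }
      output_map { key: "latents"  value: "latents_1" }
      output_map { key: "timestep" value: "timestep_1" }
    }
    # ...the same pair is repeated, with fresh tensor names, for every denoising step
  ]
}
```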
5. Overview of the Ensemble Pipeline
The ensemble model (generate_20_pipeline) defines:
- all inputs (prompt, seed, masks, guidance, control signals),
- all intermediate tensors,
- all denoising steps,
- and final decoding and post-processing.
The pipeline consists of:
Text Processing
- 1_tokenizer → 2_text_encoder (positive & negative prompts)
Conditioning
- 4_conditionner:
  - prepares latents
  - builds ControlNet inputs
  - computes timesteps
  - assembles embeddings
Denoising Loop (Unrolled)
- Repeated blocks of:
  - 5_denoiser (TensorRT UNet) → 6_scheduler_step
Each step:
- consumes the previous latent state,
- applies guidance, ControlNet scales, and IP-Adapter weights,
- produces the next latent and timestep.
This continues until the final denoising step.
Decoding & Post-Processing
- 7_vae_decoder → 8_postprocessor:
  - blending
  - inpainting
  - enhancement logic
All of this is executed inside Triton, without Python in the loop.
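From a client's perspective, the whole graph is a single model. A minimal sketch using the tritonclient package; the input names, datatypes, and output name are assumptions about this particular ensemble:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Illustrative inputs; the real ensemble exposes more (masks, guidance, control signals).
prompt = httpclient.InferInput("prompt", [1], "BYTES")
prompt.set_data_from_numpy(
    np.array([b"a red sneaker on a white background"], dtype=np.object_)
)

seed = httpclient.InferInput("seed", [1], "INT64")
seed.set_data_from_numpy(np.array([42], dtype=np.int64))

# One request runs tokenization, encoding, every denoising step, decoding,
# and post-processing entirely server-side.
result = client.infer("generate_20_pipeline", inputs=[prompt, seed])
image = result.as_numpy("image")  # assumed output tensor name
```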
6. Why This Matters in Production
This architecture provides several critical advantages:
Deterministic Performance
- No Python control flow
- Fixed execution graph
- Predictable GPU utilization
High Throughput
- TensorRT-optimized kernels
- Reduced CPU-GPU synchronization
- Better batching opportunities
Modularity
- Each component can be updated independently
- Easy A/B testing (e.g. new UNet, new scheduler)
- Clear separation of concerns
Scalability
- Multiple pipelines (img2img, inpaint, ControlNet variants)
- Horizontal scaling across GPUs
- Cloud-friendly deployment
7. Handling Complex Inputs
The pipeline supports:
- variable image sizes,
- multiple masks,
- ControlNet conditioning,
- IP-Adapter embeddings,
- product-specific blending and enhancement logic.
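In Triton's model configuration this flexibility is declarative: variable dimensions are marked with -1, and conditioning tensors can be declared optional. A sketch with assumed names and shapes:

```
input [
  {
    name: "control_image"
    data_type: TYPE_FP16
    dims: [ 3, -1, -1 ]   # channels, variable height, variable width
    optional: true        # supplied only for ControlNet requests
  },
  {
    name: "mask"
    data_type: TYPE_FP16
    dims: [ 1, -1, -1 ]
    optional: true
  }
]
```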
All of this is expressed as dataflow, not code.
This is crucial:
production diffusion is a data orchestration problem, not a model problem.
8. Stable Diffusion as a System, Not a Model
The key lesson from this work is that Stable Diffusion is not a single model — it is a distributed inference system.
Performance does not come from:
- a better checkpoint,
- or a faster UNet alone.
It comes from:
- graph design,
- memory reuse,
- execution order,
- and removing Python from the critical path.
Triton ensemble scheduling makes this possible.
Closing Thoughts
Optimizing Stable Diffusion 1.5 for production requires rethinking how diffusion is executed. By:
- decomposing the pipeline,
- compiling models with TensorRT,
- and flattening the denoising loop using Triton ensembles,
we move from an experimental Python workflow to a robust, scalable inference system.
This approach turns Stable Diffusion from a research artifact into production infrastructure.