Optimizing and Deploying Stable Diffusion 1.5 in Production with ONNX, TensorRT and NVIDIA Triton
Published on July 29, 2024
Stable Diffusion 1.5 is often perceived as “just another generative model”. In reality, deploying it in production — with predictable latency, high throughput, and hardware efficiency — is a systems engineering challenge.
This article describes how I optimized and deployed Stable Diffusion 1.5 by:
- converting Hugging Face PyTorch models to ONNX and TensorRT,
- serving them with NVIDIA Triton Inference Server,
- and restructuring the diffusion process using ensemble scheduling to flatten the denoising loop into a production-ready inference graph.
The result is a scalable, modular, GPU-efficient deployment that can serve complex diffusion workflows reliably.
1. Why Stable Diffusion Is Hard to Deploy
At a high level, Stable Diffusion consists of:
- a text tokenizer,
- a text encoder (CLIP),
- a denoising UNet executed repeatedly,
- a scheduler loop,
- a VAE decoder,
- and multiple conditioning paths (ControlNet, IP-Adapter, inpainting, etc.).
The challenge is that diffusion is inherently iterative:
- the UNet is executed N times,
- each step depends on the previous one,
- control signals and guidance must be applied dynamically.
In Python, this is a simple loop.
In production inference systems, loops are a problem.
2. From PyTorch to ONNX and TensorRT
The first step is model decomposition.
Instead of serving a monolithic PyTorch pipeline, each component is exported independently:
- tokenizer
- text encoder
- conditioner
- denoiser (UNet)
- scheduler step
- VAE decoder
- post-processing
Each component is:
- exported from PyTorch to ONNX,
- optimized and compiled into a TensorRT engine with the trtexec command-line tool (a sketch follows below).
This brings:
- kernel fusion,
- reduced memory movement,
- FP16 execution,
- predictable latency.
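As an illustration, here is a minimal sketch of exporting the UNet to ONNX and compiling it with trtexec. The checkpoint, shapes, opset, and file names are assumptions for illustration, not the exact configuration of this pipeline:

```python
import torch
from diffusers import UNet2DConditionModel


class UNetWrapper(torch.nn.Module):
    """Unwraps the diffusers output dataclass so ONNX export sees plain tensors."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]


unet = UNetWrapper(
    UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="unet"
    ).eval()
)

# Dummy inputs for SD 1.5: 64x64 latents (4 channels), 77-token CLIP embeddings (dim 768).
# Batch of 2 to account for classifier-free guidance.
sample = torch.randn(2, 4, 64, 64)
timestep = torch.tensor(999)
text_emb = torch.randn(2, 77, 768)

torch.onnx.export(
    unet,
    (sample, timestep, text_emb),
    "unet.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["noise_pred"],
    opset_version=17,
    dynamic_axes={"sample": {0: "batch"}, "encoder_hidden_states": {0: "batch"}},
)

# Compile the ONNX graph into a TensorRT engine, e.g.:
#   trtexec --onnx=unet.onnx --saveEngine=unet.plan --fp16
```

The text encoder and VAE decoder follow the same export-then-compile pattern.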
At this stage, each model is fast — but the pipeline is still sequential.
3. Why Triton Inference Server
NVIDIA Triton is not “just a model server”.
It is a graph execution engine for inference.
Key capabilities that matter here:
- model versioning,
- GPU memory reuse,
- dynamic batching,
- and most importantly: ensemble scheduling.
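For instance, dynamic batching is enabled per model with a short stanza in its config.pbtxt; the values below are illustrative, not tuned recommendations:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```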
Ensembles allow multiple models to be connected into a directed inference graph, where outputs of one model become inputs of the next — without Python orchestration.
This is the key to flattening the diffusion loop.
4. Flattening the Diffusion Loop with Ensemble Scheduling
Instead of executing the familiar Python loop (sketched here with diffusers-style calls):

```python
# text_emb: CLIP embeddings produced by the text encoder
for t in timesteps:
    noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```

the loop is unrolled explicitly inside a Triton ensemble.
Each denoising step becomes:
- one UNet call,
- followed by one scheduler step,
- passing latents and timesteps forward explicitly.
This transforms a dynamic loop into a static inference graph.
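Concretely, each unrolled step is a pair of entries in the ensemble's config.pbtxt. A hedged sketch with illustrative tensor names; the model names match the pipeline described in the next section, but the exact I/O keys are assumptions, and conditioning inputs are omitted for brevity:

```
ensemble_scheduling {
  step [
    {
      model_name: "5_denoiser"
      model_version: -1
      input_map  { key: "sample"   value: "latents_0" }
      input_map  { key: "timestep" value: "timestep_0" }
      output_map { key: "noise_pred" value: "noise_pred_0" }
    },
    {
      model_name: "6_scheduler_step"
      model_version: -1
      input_map  { key: "noise_pred" value: "noise_pred_0" }
      input_map  { key: "latents"    value: "latents_0" }
      input_map  { key: "timestep"   value: "timestep_0" }
      output_map { key: "latents"  value: "latents_1" }
      output_map { key: "timestep" value: "timestep_1" }
    }
    # ...the same pair is repeated, with fresh tensor names, for every denoising step
  ]
}
```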
5. Overview of the Ensemble Pipeline
The ensemble model (generate_20_pipeline) defines:
- all inputs (prompt, seed, masks, guidance, control signals),
- all intermediate tensors,
- all denoising steps,
- and final decoding and post-processing.
The pipeline consists of:
Text Processing
- 1_tokenizer → 2_text_encoder (positive & negative prompts)
Conditioning
- 4_conditionner:
  - prepares latents
  - builds ControlNet inputs
  - computes timesteps
  - assembles embeddings
Denoising Loop (Unrolled)
- Repeated blocks of:
  - 5_denoiser (TensorRT UNet) → 6_scheduler_step
Each step:
- consumes the previous latent state,
- applies guidance, ControlNet scales, and IP-Adapter weights,
- produces the next latent and timestep.
This continues until the final denoising step.
Decoding & Post-Processing
- 7_vae_decoder → 8_postprocessor:
  - blending
  - inpainting
  - enhancement logic
All of this is executed inside Triton, without Python in the loop.
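From a client's perspective, the whole graph is a single model. A minimal sketch using the tritonclient package; the input names, datatypes, and output name are assumptions about this particular ensemble:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Illustrative inputs; the real ensemble exposes more (masks, guidance, control signals).
prompt = httpclient.InferInput("prompt", [1], "BYTES")
prompt.set_data_from_numpy(
    np.array([b"a red sneaker on a white background"], dtype=np.object_)
)

seed = httpclient.InferInput("seed", [1], "INT64")
seed.set_data_from_numpy(np.array([42], dtype=np.int64))

# One request runs tokenization, encoding, every denoising step, decoding,
# and post-processing entirely server-side.
result = client.infer("generate_20_pipeline", inputs=[prompt, seed])
image = result.as_numpy("image")  # assumed output tensor name
```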
6. Why This Matters in Production
This architecture provides several critical advantages:
Deterministic Performance
- No Python control flow
- Fixed execution graph
- Predictable GPU utilization
High Throughput
- TensorRT-optimized kernels
- Reduced CPU-GPU synchronization
- Better batching opportunities
Modularity
- Each component can be updated independently
- Easy A/B testing (e.g. new UNet, new scheduler)
- Clear separation of concerns
Scalability
- Multiple pipelines (img2img, inpaint, ControlNet variants)
- Horizontal scaling across GPUs
- Cloud-friendly deployment
7. Handling Complex Inputs
The pipeline supports:
- variable image sizes,
- multiple masks,
- ControlNet conditioning,
- IP-Adapter embeddings,
- product-specific blending and enhancement logic.
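In Triton's model configuration this flexibility is declarative: variable dimensions are marked with -1, and conditioning tensors can be declared optional. A sketch with assumed names and shapes:

```
input [
  {
    name: "control_image"
    data_type: TYPE_FP16
    dims: [ 3, -1, -1 ]   # channels, variable height, variable width
    optional: true        # supplied only for ControlNet requests
  },
  {
    name: "mask"
    data_type: TYPE_FP16
    dims: [ 1, -1, -1 ]
    optional: true
  }
]
```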
All of this is expressed as dataflow, not code.
This is crucial:
production diffusion is a data orchestration problem, not a model problem.
8. Stable Diffusion as a System, Not a Model
The key lesson from this work is that Stable Diffusion is not a single model — it is a distributed inference system.
Performance does not come from:
- a better checkpoint,
- or a faster UNet alone.
It comes from:
- graph design,
- memory reuse,
- execution order,
- and removing Python from the critical path.
Triton ensemble scheduling makes this possible.
Closing Thoughts
Optimizing Stable Diffusion 1.5 for production requires rethinking how diffusion is executed. By:
- decomposing the pipeline,
- compiling models with TensorRT,
- and flattening the denoising loop using Triton ensembles,
we move from an experimental Python workflow to a robust, scalable inference system.
This approach turns Stable Diffusion from a research artifact into production infrastructure.