Cutting chat latency in half

Six months ago our P50 time-to-first-token was 1.4 seconds. Today it’s 680 ms. P99 dropped from 4.8 s to 2.1 s. Nothing in this post is exotic — most of the win came from a few unflattering bugs and a couple of architectural choices we should have made sooner.

Where the time was actually going

The first thing we did was sit on the latency dashboard for a week and break the request lifecycle into ten labelled spans, end to end:

TLS + edge routing
Auth check (cached JWT, occasional refresh)
Conversation hydration from Postgres
Prompt assembly + tool schema rendering
Tokenization
Queue wait at the inference scheduler
Prefill (first forward pass)
First token emitted to the client
Subsequent decode steps
Stream flush + post-processing
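A breakdown like the one above can be collected with very little machinery. The following is a minimal sketch, not our production tracing: `SpanRecorder` and the span names are hypothetical, and it assumes plain wall-clock timing per request is good enough to rank the spans.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SpanRecorder:
    """Accumulates wall-clock time per named span for a single request."""

    def __init__(self):
        self.durations_ms = defaultdict(float)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.durations_ms[name] += (time.perf_counter() - start) * 1000.0

    def shares(self):
        """Return each span's fraction of the total recorded time."""
        total = sum(self.durations_ms.values())
        if total == 0:
            return {}
        return {name: ms / total for name, ms in self.durations_ms.items()}


if __name__ == "__main__":
    rec = SpanRecorder()
    with rec.span("tokenization"):
        time.sleep(0.002)  # placeholder for real work
    with rec.span("prefill"):
        time.sleep(0.005)  # placeholder for real work
    for name, share in rec.shares().items():
        print(f"{name}: {share:.0%}")
```

Aggregating the per-request shares at P50 and P99 separately is what makes a span that is small at the median but large in the tail stand out.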

The surprise: tokenization was 9% of the median request and 22% of the 99th percentile. We were calling a Python wrapper over a Rust tokenizer, then re-encoding the same system prompt every turn. Caching the tokenized system prompt and switching to the native bindings paid for the entire quarter’s engineering time in latency wins alone.
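The caching half of that fix can be sketched in a few lines. This is an illustration under assumptions, not our actual code: `encode` is a stand-in for a real tokenizer call, and the function names are hypothetical. It also assumes the tokenizer cannot merge tokens across the boundary between the system prompt and the user message (for example because the chat template puts a special token there); without that guarantee, concatenating separately encoded segments may differ from encoding the joined string.

```python
import zlib
from functools import lru_cache


def encode(text: str) -> tuple[int, ...]:
    """Stand-in tokenizer (hypothetical): maps whitespace-split words to ids."""
    return tuple(zlib.crc32(word.encode("utf-8")) % 50_000 for word in text.split())


@lru_cache(maxsize=128)
def encode_system_prompt(system_prompt: str) -> tuple[int, ...]:
    # Cached by prompt text, so an unchanged system prompt is encoded once.
    return encode(system_prompt)


def encode_turn(system_prompt: str, user_message: str) -> list[int]:
    """Token ids for one turn: cached system prompt ids plus fresh message ids."""
    return list(encode_system_prompt(system_prompt)) + list(encode(user_message))


if __name__ == "__main__":
    system = "You are a helpful assistant."
    encode_turn(system, "Hello there")
    encode_turn(system, "What is continuous batching?")
    print(encode_system_prompt.cache_info())  # second turn is a cache hit
```

Returning a tuple from the cached function keeps the cached value immutable, so one caller cannot accidentally modify the ids another caller will receive.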

Speculative decoding, finally

We’d resisted speculative decoding for two reasons: the small draft model adds operational surface, and we worried about correctness drift on rare tokens. Both fears turned out to be smaller than expected.

The setup we ended up with: a 1.3B-parameter draft model proposes k=4 tokens; the 70B verifier accepts or rejects them in a single forward pass. Acceptance rate hovers around 73% on real chat traffic, which translates to a 1.9× speedup on decode without changing the verifier’s output distribution at all (because the verifier still gets the final say on every token).

“The acceptance rate matters less than people think: even a draft that’s right 60% of the time is a free 1.5× because the rejected tokens cost almost nothing.”
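The relationship between acceptance rate and speedup can be checked with a back-of-the-envelope model. The sketch below uses the standard closed form for speculative decoding and makes two assumptions that are not from our measurements: each drafted token is accepted independently with the same probability, and one draft step costs a fixed fraction of one verifier pass. The value 0.14 for that fraction is hypothetical, picked only because it lands near the figures quoted above.

```python
def expected_tokens_per_verifier_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier pass.

    Counts the accepted prefix of the k drafted tokens plus the one token
    the verifier always contributes. Assumes independent per-token
    acceptance with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per pass divided by the cost of a pass in verifier-pass units.

    draft_cost_ratio is the cost of one draft step relative to one
    verifier pass (an assumed, hypothetical parameter).
    """
    cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_verifier_pass(alpha, k) / cost


if __name__ == "__main__":
    for alpha in (0.73, 0.60):
        tokens = expected_tokens_per_verifier_pass(alpha, k=4)
        speedup = estimated_speedup(alpha, k=4, draft_cost_ratio=0.14)
        print(f"alpha={alpha:.2f}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x")
```

Under those assumptions the model gives roughly 1.9× at 73% acceptance and roughly 1.5× at 60%. It also shows where the draft cost enters: rejected tokens are cheap only when the drafter is small relative to the verifier.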

Continuous batching beats dynamic batching

Our first batching layer waited up to 25 ms to assemble a batch. That works fine when the prompt distribution is uniform and decode steps are roughly equal, but real chat traffic is wildly heterogeneous. Put one 8k-context request in a batch with three 200-token requests and the short ones are stalled until the long generation finishes.

Switching to continuous batching (popularized by vLLM and Orca), where new requests can join an in-flight batch at every decode step, collapsed our P99. The price is a more complicated KV-cache manager and slightly worse GPU utilization on tail steps. We took the trade.
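A toy simulation makes the difference concrete. This is a sketch, not a scheduler: the function names and request lengths are hypothetical, time is counted in decode steps, all requests arrive at once, and prefill and the batch-assembly wait are ignored. It also assumes, as described above, that a fixed batch only returns when its longest member finishes.

```python
from collections import deque


def static_batching(decode_lengths, batch_size):
    """Finish step per request when each batch runs until its longest member ends."""
    finish = []
    clock = 0
    for start in range(0, len(decode_lengths), batch_size):
        batch = decode_lengths[start:start + batch_size]
        clock += max(batch)  # the whole batch is held for the longest request
        finish.extend([clock] * len(batch))
    return finish


def continuous_batching(decode_lengths, batch_size):
    """Finish step per request when free slots are refilled at every decode step."""
    finish = [0] * len(decode_lengths)
    waiting = deque(range(len(decode_lengths)))
    running = {}  # request index -> remaining decode steps
    clock = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            idx = waiting.popleft()
            running[idx] = decode_lengths[idx]
        clock += 1
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                finish[idx] = clock
                del running[idx]
    return finish


if __name__ == "__main__":
    lengths = [800, 20, 20, 20, 20, 20]  # one long generation, five short ones
    print(static_batching(lengths, batch_size=4))
    print(continuous_batching(lengths, batch_size=4))
```

With these made-up lengths, the short requests finish at steps 800 and 820 under fixed batches and at steps 20 and 40 when slots are refilled every step, while the long request finishes at step 800 either way. The tail improves because short requests stop inheriting the long request's duration.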

Things that didn’t work

Quantizing the verifier to int4. ~1.4× throughput, but our internal evals showed a regression on multi-step reasoning that we couldn’t justify. We kept it on the drafter, which is allowed to be sloppier.
A bigger KV cache. We over-provisioned by 40% expecting a hit-rate jump. We got 2 percentage points. Long-tail conversations don’t share enough prefix.
Pre-fetching the user’s next likely message. Cute idea; net negative once you account for the wasted compute on conversations that ended.

What we’d do differently

Instrument before optimizing. We almost spent a quarter on a custom CUDA kernel for layer norm before discovering the tokenizer story. Optimization without measurement is just preference, dressed up in C++.
