We run a small gateway in front of every model call AI Chat makes — internal traffic, third-party providers, our own hosted weights. It started as a hundred lines of Python and is now a non-trivial piece of infrastructure. These are the lessons we wish we’d known a year ago.
Why a gateway at all
The honest answer is that the alternative is worse. Letting product code talk directly to a provider SDK gets you, in the order you’ll discover them: surprise bills, brittle rate-limit handling, no way to A/B a new model, no way to roll back when a provider changes default behavior, and no way to answer the question “why was this conversation slow?” after the fact.
Our gateway sits between the application and any model and gives us four things: routing (pick a provider/model based on policy), caching (semantic and exact), fallback (when the primary fails), and observability (every call is logged with prompt hash, model, latency, tokens in/out, cost).
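To make the observability piece concrete, here is a minimal sketch of a per-call log record with the fields listed above. It is an illustration, not the gateway's actual schema: the names `CallRecord`, `make_record`, and `to_log_line` are hypothetical, and SHA-256 is an assumed choice for the prompt hash.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class CallRecord:
    """One log entry per model call (field names are illustrative)."""
    prompt_hash: str
    model: str
    latency_ms: float
    tokens_in: int
    tokens_out: int
    cost_usd: float


def make_record(prompt, model, latency_ms, tokens_in, tokens_out, cost_usd):
    # Store a hash rather than the prompt itself, so the log can group
    # identical prompts without containing user text.
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return CallRecord(prompt_hash, model, latency_ms, tokens_in, tokens_out, cost_usd)


def to_log_line(record):
    # One JSON object per line is easy to ship to most log pipelines.
    return json.dumps(asdict(record), sort_keys=True)
```

Hashing the prompt keeps the record small and lets "why was this conversation slow?" be answered by joining on the hash, without the log itself holding conversation content.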
Caching is the highest-leverage feature
About 11% of our chat traffic is functionally identical to a prior request — the same system prompt and the same user message within a short window. Exact-match caching on a hash of (model, system, messages, tools) is cheap to implement and pays for the entire gateway.
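A minimal sketch of what exact-match caching on that tuple could look like. The canonical JSON serialization, the SHA-256 hash, the in-memory dict, and the 300-second default window are all assumptions made for illustration; `cache_key` and `ExactCache` are hypothetical names.

```python
import hashlib
import json


def cache_key(model, system, messages, tools=None):
    """Return a stable SHA-256 hex digest over (model, system, messages, tools)."""
    payload = {
        "model": model,
        "system": system,
        "messages": messages,
        "tools": tools or [],
    }
    # sort_keys plus fixed separators gives a canonical serialization, so
    # dicts that differ only in key order produce the same hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactCache:
    """In-memory cache with a time-to-live, standing in for a real store."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (stored_at, response)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if now - stored_at > self.ttl_seconds:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return response

    def put(self, key, response, now):
        self._entries[key] = (now, response)
```

Passing `now` explicitly (for example `time.monotonic()`) keeps the cache easy to test. Message order is deliberately part of the key, since reordering a conversation changes its meaning.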
Semantic caching (“this question is similar enough to one we’ve already answered”) is more interesting and more dangerous. Our rule of thumb: only cache when the model itself signals deterministic intent, e.g. tool calls with a stable signature. Caching free-form prose answers semantically caused subtle regressions where the cache returned a slightly-stale-but-confident answer.
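One way to encode that rule of thumb is a gate that only admits responses made up entirely of tool calls, and derives a stable signature from them. This is a sketch under assumptions: the response shape (a dict with `text` and `tool_calls`, each call having `name` and `arguments`) is hypothetical and does not correspond to any particular provider's format.

```python
import json


def tool_call_signature(response):
    """Return a stable signature if the response is only tool calls, else None.

    A None result means "do not semantically cache this response".
    """
    tool_calls = response.get("tool_calls") or []
    text = (response.get("text") or "").strip()
    if not tool_calls or text:
        # Free-form prose (alone or mixed with tool calls) is never eligible.
        return None
    parts = [
        # Sorting argument keys makes the signature independent of key order.
        (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        for call in tool_calls
    ]
    return json.dumps(parts)
```

A gateway could then store a semantic-cache entry only when this function returns a signature, and treat everything else as exact-match only.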
Fallbacks are a product decision, not an SRE decision
The first version of our fallback logic was the obvious thing: on a 5xx from provider A, retry with provider B. This is wrong in production. Provider B has a different tone, slightly different refusal behavior, and a different tool-calling format. Users notice. We had a week of complaints about “the AI got weirder yesterday afternoon” traced to a transient outage that triggered a quiet failover.
The version we run now: every request is tagged with quality_tier, set by the calling product. quality_tier=interactive_chat never falls back silently; it surfaces a “the model is having trouble, retry?” message to the user. quality_tier=background_summarization falls back freely; nobody is staring at it. This is a one-line change with months of pain behind it.
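The tier rule can be sketched in a few lines. This is an illustration rather than the gateway's actual code: the exception classes, the `call_with_policy` function, and the dict-shaped request are hypothetical, and `primary` and `fallback` stand in for whatever callables wrap the two providers.

```python
class ProviderError(Exception):
    """Raised by a provider call on a 5xx-style failure."""


class UserVisibleRetry(Exception):
    """Tells the application to show a retry prompt instead of an answer."""


# Tiers that may switch providers without telling anyone.
SILENT_FALLBACK_TIERS = {"background_summarization"}


def call_with_policy(request, primary, fallback):
    """Call the primary provider; fall back only if the tier allows it."""
    try:
        return primary(request)
    except ProviderError:
        if request["quality_tier"] in SILENT_FALLBACK_TIERS:
            return fallback(request)
        # Interactive tiers surface the failure rather than changing voice.
        raise UserVisibleRetry("The model is having trouble, retry?")
```

Keeping the allowed tiers in an explicit set means an unknown or misspelled tier defaults to the safer behavior: no silent failover.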
The metric we actually trust
Tokens-per-second is the number everyone reports. It’s not very useful. Tokens-per-dollar is more useful. The metric we ended up trusting most is something we awkwardly call tokens-per-dollar-per-useful-answer: cost-weighted throughput, multiplied by the rate at which answers were rated useful by the user (thumb-up/thumb-down on the response, plus a quieter signal: did the user immediately follow up to clarify, or did the conversation move on?).
Switching from a frontier model to a cheaper one halves cost and might halve usefulness — net zero. Our cheapest reasonable model is sometimes the right pick, and sometimes it’s a false economy; this metric makes the call legible.
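A small worked sketch of the arithmetic, reading the metric as tokens-per-dollar scaled by the usefulness rate, which is the reading under which "halve cost, halve usefulness" comes out net zero. The function name and every number below are invented for illustration; they are not measurements.

```python
def useful_tokens_per_dollar(output_tokens, cost_usd, useful_answers, rated_answers):
    """Tokens per dollar, scaled by the fraction of answers rated useful."""
    if cost_usd <= 0 or rated_answers <= 0:
        raise ValueError("cost_usd and rated_answers must be positive")
    tokens_per_dollar = output_tokens / cost_usd
    useful_rate = useful_answers / rated_answers
    return tokens_per_dollar * useful_rate


# Hypothetical comparison: same output volume, half the cost, half the usefulness.
frontier = useful_tokens_per_dollar(1_000_000, 20.0, useful_answers=80, rated_answers=100)
cheaper = useful_tokens_per_dollar(1_000_000, 10.0, useful_answers=40, rated_answers=100)
```

With these made-up inputs the two models score the same, so the cheaper model only wins if its usefulness rate falls by less than its cost does.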
Cost controls that actually worked
- Per-product budgets, not per-user. Per-user budgets create support tickets. Per-product budgets create conversations with the product manager, which is what you want.
- Hard cap on max output tokens per call, set per route, not globally. Most chat answers should be under 600 tokens; a long-form summarization route can have 4000.
- Cost-aware routing. If the user’s message looks like “summarize this paragraph”, the gateway can route to a cheaper model without asking the application. This is opt-in per route.
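The three controls above could be expressed as per-route configuration plus a per-product spend counter. This is a sketch under assumptions: `RoutePolicy`, `ROUTES`, `ProductBudget`, and `clamp_max_tokens` are hypothetical names, the route names are invented, and only the 600 and 4000 token caps come from the text.

```python
from dataclasses import dataclass


@dataclass
class RoutePolicy:
    max_output_tokens: int            # hard cap, set per route
    cost_aware_routing: bool = False  # opt-in per route


ROUTES = {
    "chat": RoutePolicy(max_output_tokens=600),
    "long_summary": RoutePolicy(max_output_tokens=4000, cost_aware_routing=True),
}


class BudgetExceeded(Exception):
    """Raised when a product would spend past its budget."""


class ProductBudget:
    """Tracks spend for one product, not for individual users."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(f"budget of ${self.limit_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd


def clamp_max_tokens(route, requested=None):
    """Never let a caller exceed the route's hard output cap."""
    cap = ROUTES[route].max_output_tokens
    return cap if requested is None else min(requested, cap)
```

Clamping rather than rejecting an oversized request keeps callers working while still bounding cost; whether that or a hard error is right depends on the route.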
What we got wrong
We over-built the queueing layer. We assumed traffic would be bursty enough to need priority queues, deadline-aware scheduling, the works. In practice our traffic is smoother than we feared and 99% of the value comes from a basic FIFO with concurrency limits per provider. We deleted ~1500 lines of queueing code and nothing got worse.
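For a sense of how little code the simple version needs, here is a minimal sketch of per-provider concurrency limits using `asyncio.Semaphore`. It is an illustration, not the queueing code described above: `ProviderGate`, the provider name, and the limit of 2 are hypothetical, and waiters being served in roughly arrival order is a property of the standard library's semaphore rather than something this sketch enforces.

```python
import asyncio


class ProviderGate:
    """Caps the number of in-flight calls per provider."""

    def __init__(self, limits):
        # One semaphore per provider; construct this inside a running
        # event loop so it also works on older Python versions.
        self._semaphores = {name: asyncio.Semaphore(n) for name, n in limits.items()}

    async def run(self, provider, make_call):
        # Callers beyond the limit wait here until a slot frees up.
        async with self._semaphores[provider]:
            return await make_call()


async def demo():
    gate = ProviderGate({"provider_a": 2})
    in_flight = 0
    peak = 0

    async def fake_call():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the network round trip
        in_flight -= 1
        return "ok"

    results = await asyncio.gather(*(gate.run("provider_a", fake_call) for _ in range(6)))
    return results, peak
```

Running `asyncio.run(demo())` issues six calls while never allowing more than two in flight at once; priority classes and deadlines can be layered on later if real traffic ever demands them.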
If you’re building one of these, build the simplest version, add observability before you add cleverness, and let real traffic tell you what to optimize.