Shipping LLM features without the April Fools' bill
A practical playbook for teams putting their first LLM feature into production: caching, routing, eval, and the cost controls that keep your monthly invoice from doubling overnight.
There is a familiar shape to LLM project failures. The prototype demos beautifully, the team ships it, traffic ramps, and then a finance review surfaces a $40,000 bill where there should have been $4,000. The feature gets pulled, the team retrenches, and the conversation shifts from "how do we make this great" to "how do we afford this at all."
We have walked teams out of this corner enough times to write down the standard playbook. None of it is novel; it is just the boring work that gets skipped when the prototype is exciting.
Tier your traffic before you tier your models
The single highest-leverage decision is recognising that not every request needs your strongest model. A user-facing autocomplete and an automated nightly summarisation do not have the same latency budget, the same accuracy requirements, or the same revenue contribution.
Map your traffic into three or four tiers, and assign each tier a model and a budget. Most production systems we audit can move 60–70% of their volume to a smaller model with no measurable quality loss — they just never tested it.
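As a minimal sketch of what that mapping can look like in code — the tier names, model identifiers, budgets, and request kinds below are illustrative placeholders, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str              # placeholder model identifiers
    daily_budget_usd: float
    max_latency_ms: int

# Hypothetical tiering: interactive traffic gets the strong model,
# background batch work gets the cheap one.
TIERS = {
    "interactive": Tier("interactive", "strong-model", daily_budget_usd=500.0, max_latency_ms=1500),
    "assistive":   Tier("assistive",   "mid-model",    daily_budget_usd=200.0, max_latency_ms=3000),
    "batch":       Tier("batch",       "small-model",  daily_budget_usd=50.0,  max_latency_ms=30000),
}

def route(request_kind: str) -> Tier:
    """Map a request kind onto a tier; default to the cheapest tier."""
    kind_to_tier = {
        "autocomplete": "interactive",
        "chat": "interactive",
        "draft_summary": "assistive",
        "nightly_summarisation": "batch",
    }
    return TIERS[kind_to_tier.get(request_kind, "batch")]
```

The important part is not the data structure; it is that every request kind has an explicit owner of a model choice and a budget, so the answer to "why is this call on the expensive model" is always written down somewhere.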
Cache the deterministic path
A surprising fraction of LLM traffic is identical or near-identical requests: document classification on a stable taxonomy, function-call argument extraction, repeated reformatting tasks. A simple cache keyed on (model, prompt, params) routinely trims 30–40% of spend on systems we inherit.
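A minimal sketch of that exact-match cache, assuming an in-memory dict stands in for whatever shared store you actually run, and `call_model` is a placeholder for your provider client:

```python
import hashlib
import json

_cache: dict[str, str] = {}  # in production this would be Redis or similar

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Deterministic key over everything that affects the completion."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(model: str, prompt: str, params: dict, call_model) -> str:
    """call_model is whatever client function actually hits the provider."""
    key = cache_key(model, prompt, params)
    if key not in _cache:
        _cache[key] = call_model(model, prompt, params)
    return _cache[key]
```

The obvious caveat: this only makes sense on the deterministic path — calls run at temperature zero where you want the same answer every time — not on sampled, creative output.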
For semantic similarity, layer in an embedding-based cache: if a new query is within ε of a cached one, return the cached response. Tune ε on your eval set, not on intuition.
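A sketch of the semantic layer, assuming `embed` is a placeholder for an embedding model you already run and that it returns unit-normalised vectors (so a dot product is cosine similarity):

```python
import numpy as np

class SemanticCache:
    """Embedding-based cache: reuse a response if a new query is within eps of a stored one."""

    def __init__(self, embed, eps: float):
        self.embed = embed          # callable: str -> unit-normalised np.ndarray
        self.eps = eps              # tuned on the eval set, not on intuition
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, query: str) -> str | None:
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.stack(self.keys) @ q      # cosine similarity for unit-norm vectors
        best = int(np.argmax(sims))
        # Cosine distance = 1 - similarity; hit only if the nearest entry is within eps.
        if 1.0 - sims[best] <= self.eps:
            return self.values[best]
        return None

    def put(self, query: str, response: str) -> None:
        self.keys.append(self.embed(query))
        self.values.append(response)
```

The linear scan is fine for small caches; swap in an approximate-nearest-neighbour index once the entry count justifies it.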
Build the eval harness on day one
Without an eval set, every model change is a guess. A modest eval — 200–500 examples covering your real distribution, with reference answers and a graded rubric — pays for itself the first time you consider switching models. We have seen teams that built the eval move confidently from a frontier model to a dramatically cheaper one because the eval told them quality held; we have seen teams without one stay locked into expensive models forever, terrified to change anything.
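The harness itself can stay small. A sketch, assuming a JSONL eval file and treating `generate` (the configuration under test) and `grade` (exact match, a rubric-graded judge, whatever you trust) as placeholders:

```python
import json
from statistics import mean

def run_eval(examples_path: str, generate, grade) -> dict:
    """Run a candidate configuration over the eval set and report aggregate scores.

    examples_path: JSONL file of {"input": ..., "reference": ..., "rubric": ...}
    generate: callable(input) -> model output for the configuration under test
    grade: callable(output, reference, rubric) -> float in [0, 1]
    """
    scores = []
    with open(examples_path) as f:
        for line in f:
            ex = json.loads(line)
            output = generate(ex["input"])
            scores.append(grade(output, ex["reference"], ex["rubric"]))
    return {"n": len(scores), "mean_score": mean(scores), "min_score": min(scores)}
```

Run it for the current model and the candidate, compare the two reports, and the routing decision becomes a number rather than an argument.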
Hard-cap before you soft-warn
Every system we ship has a per-tenant daily spend cap, a global daily spend cap, and a circuit-breaker that degrades gracefully when either is hit. Falling back to a smaller model, returning cached results, or showing an explicit "capacity reached, retry shortly" message beats waking the founder up at 2am to a $30k overage.
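A sketch of that check, assuming `spend` reads today's totals from the same metering store that logs every call, and `call_primary` / `call_fallback` are placeholders for your strong and cheap model clients:

```python
def complete_with_caps(tenant_id: str, prompt: str, spend, call_primary, call_fallback,
                       tenant_cap_usd: float, global_cap_usd: float) -> dict:
    """Check per-tenant and global daily spend before making the expensive call.

    spend(scope) returns today's spend in USD for 'tenant:<id>' or 'global'.
    """
    if spend("global") >= global_cap_usd:
        # Hard stop: serve cached/static behaviour rather than overrun the global budget.
        return {"status": "capacity_reached", "result": None}
    if spend(f"tenant:{tenant_id}") >= tenant_cap_usd:
        # This tenant exhausted its budget: degrade to the cheaper model.
        return {"status": "degraded", "result": call_fallback(prompt)}
    return {"status": "ok", "result": call_primary(prompt)}
```

The statuses give the product layer something explicit to render, which is what makes the degradation graceful rather than silent.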
Instrument before you optimise
Every LLM call we make is logged with: model, tokens in, tokens out, latency, cost, route, and the eval score (when available). This data turns optimisation from a guessing game into arithmetic — you can see exactly which routes drive cost, which prompts have grown over time, and where caching is leaving money on the table.
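A sketch of the per-call record, assuming Python's standard logging module feeding whatever log pipeline you already have; the field names are illustrative:

```python
import json
import time
import uuid

def log_llm_call(logger, *, model: str, route: str, tokens_in: int, tokens_out: int,
                 latency_ms: float, cost_usd: float, eval_score: float | None = None) -> None:
    """Emit one structured record per LLM call so cost questions become queries, not guesses."""
    logger.info(json.dumps({
        "event": "llm_call",
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,
        "route": route,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "eval_score": eval_score,
    }))
```

One record per call, queryable by route and model, is enough to answer "which prompt grew 3x last quarter" without archaeology.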
The takeaway
None of these are exotic techniques. They are the operational hygiene that distinguishes an LLM feature you can grow into a real product from one that remains a quarterly conversation about whether to keep it on. Build them in week one, not after the first finance escalation.