On latency
Why sub-second budgets are a product constraint, not an optimisation.
In real-time systems, especially programmatic advertising, speed determines what is operationally possible. A model that cannot respond inside the budget is not slow. It is inoperative.
The budget is not negotiable
An exchange has tens of milliseconds to decide. A DSP has single-digit hundreds. An attribution pipeline has a processing window, not a conversation. The frontier API round-trip, no matter how cleverly you pool connections or cache prompts, does not reliably fit inside these budgets.
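To make the budget concrete: a minimal sketch of a bid scorer under a hard deadline. The budget numbers and the two stand-in model calls are illustrative assumptions, not measured figures; the point is that a call that misses the window returns nothing useful, so the caller falls back to a no-bid.

```python
import asyncio

# Hypothetical budgets (ms) matching the text: "tens of milliseconds"
# for an exchange, "single-digit hundreds" for a DSP.
EXCHANGE_BUDGET_MS = 50
DSP_BUDGET_MS = 300

async def score_bid(request, infer, budget_ms):
    """Run a model call under a hard deadline.

    `infer` is any coroutine returning a score. If it misses the
    budget, the bid falls back to None (no-bid): the model is not
    slow, it is inoperative for this decision.
    """
    try:
        return await asyncio.wait_for(infer(request), timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        return None

# Stand-ins for illustration only; latencies are assumed, not measured.
async def frontier_api(request):
    await asyncio.sleep(0.8)   # synchronous round-trip over the public internet
    return 0.9

async def local_slm(request):
    await asyncio.sleep(0.02)  # specialist served on a fast substrate
    return 0.7
```

Run under the DSP budget, the frontier call times out and the decision degrades to a no-bid; the specialist answers inside even the exchange budget. The timeout is the product constraint made executable.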
This is not a limitation we expect to go away. It is the physics of synchronous calls across the public internet combined with the non-trivial cost of generating frontier-model tokens. The answer is not to wait for the frontier API to get faster. The answer is to stop calling it.
Where specialist SLMs change what is feasible
A specialist small enough to hit sub-second budgets on the right substrate unlocks workflows that were previously architecturally unavailable. "Use AI on every bid" stops being a budget problem. It starts being an engineering problem, and engineering problems are tractable.
The economic shape of the product also changes. You move from paying per token per decision to paying per specialist per month. Your inference cost stops tracking your traffic; it starts tracking your fleet.
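The divergence between the two cost models is easiest to see as arithmetic. Every number below is an illustrative assumption (token counts, prices, hosting cost), not a quoted rate; what matters is the shape, not the figures.

```python
# Back-of-envelope comparison of the two pricing shapes.
# All constants are assumptions for illustration.
TOKENS_PER_DECISION = 500        # prompt + completion, assumed
PRICE_PER_MTOK = 2.00            # $ per 1M tokens, assumed frontier rate
SPECIALIST_PER_MONTH = 3_000.00  # assumed flat cost per deployed specialist

def per_token_cost(decisions_per_month):
    # Cost tracks traffic: every decision pays for its tokens.
    return decisions_per_month * TOKENS_PER_DECISION * PRICE_PER_MTOK / 1e6

def per_specialist_cost(decisions_per_month, specialists=1):
    # Cost tracks the fleet: traffic volume drops out entirely.
    return specialists * SPECIALIST_PER_MONTH

for decisions in (1e6, 1e8, 1e10):
    print(f"{decisions:>14,.0f} decisions/month: "
          f"per-token ${per_token_cost(decisions):>12,.0f}   "
          f"per-specialist ${per_specialist_cost(decisions):>8,.0f}")
```

At a million decisions a month the per-token model looks cheap; at ad-exchange volumes it is ruinous, while the flat cost has not moved. That is what "inference cost stops tracking your traffic" means in numbers.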
Latency is not an optimisation. It is the property that decides whether a workflow can be automated at all.
The deployment choice that matters
Production latency is set by the inference substrate the finished specialist is deployed onto, not by the substrate it was trained on. Agentsia trains on owned hardware and packages a deployment artefact. You choose whether to serve it on Groq, Fireworks, Cerebras, your own cloud, or on-prem. The control plane stays the same.
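The separation described above can be sketched as a small data model: the artefact is fixed at training time, and redeploying to a new substrate changes only the serving target. The names here are hypothetical, not the actual Agentsia interface.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Deployment:
    artefact: str   # the packaged specialist, fixed once training is done
    substrate: str  # "groq", "fireworks", "cerebras", "own-cloud", "on-prem"

def redeploy(current: Deployment, new_substrate: str) -> Deployment:
    # Swapping substrates touches only the serving target.
    # The artefact (and the control plane around it) is unchanged.
    return replace(current, substrate=new_substrate)
```

The point of the immutable artefact field is that production latency is a property of `substrate`, chosen at deploy time, not something baked in during training.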