System · Accepted state
Vision · docs/VISION.md

On the making of trusted specialist agents for narrow industries.

An argument on market position, moat, and why the operating model matters more than the benchmark.

This is the long version. If you have ten minutes, sit with it. If not, start with the thesis and come back.

Mission

Agentsia helps you move from agent experimentation to trusted agent operationalisation. We deliver specialist models that match or exceed frontier AI on your narrow commercial workflows, at lower latency and a fraction of the inference cost, trained and governed in your controlled environment and deployed on the inference substrate you choose.

The core product is Modelsmith. It is not a chatbot, a generic agent builder, or a fine-tuning sandbox. It is an agent-first specialisation platform that continuously produces low-latency, domain-trained small language models and model fleets that outperform frontier models on narrow, commercially important tasks.

The first and clearest wedge is programmatic advertising and real-time bidding, where general-purpose models are poorly optimised, latency budgets are unforgiving, and proprietary exchange data is strategically valuable.

Where we sit in the stack

Our category is the specialisation control plane: the layer between raw model infrastructure and workflow-native agent applications.

  • Application layer: your domain agents, workflow automation, and embedded intelligence in products.
  • Agentsia / Modelsmith layer: specialist selection, evaluation, retraining, promotion, rollback, fleet routing, lineage, and agent-first operating control.
  • Substrate layer: inference vendors, training vendors, runtimes, hardware, and serving infrastructure.

Infrastructure improvements at the substrate benefit us without commoditising us. Groq, Cerebras, Fireworks and similar companies can be execution layers beneath Modelsmith. The moat lives above them.

Why now

Several shifts are happening at once:

  • Frontier-model access is becoming more commoditised.
  • Agent adoption is accelerating across engineering, product, and operations teams.
  • Organisations are becoming more serious about governance, auditability, and cost control for agentic systems.
  • Open models are improving enough to become realistic specialist substrates.
  • Training hardware and inference substrates are improving fast enough that narrow specialists can be developed locally and served with strong latency and cost advantages on the right deployment target.

A few years ago, the problem was getting access to capable models at all. Now the problem is turning those models into domain-native, production-trustworthy systems.

The strategic value has moved one layer up. We built the company in the place the market is moving toward, not the place it came from.

The market problem

The market problem is not that you want smaller models. The real problem is that you are trying to push agents into real operating workflows, but most of the current stack is optimised for experimentation rather than trusted operationalisation.

Frontier models are strong enough to power copilots, prototypes, and broad assistant experiences. But when you use those same models in narrow, business-critical workflows, you repeatedly encounter the same limitations:

  • The model is too generic to internalise domain-specific judgement.
  • Runtime retrieval and orchestration add latency and fragility.
  • Outputs are difficult to govern, compare, promote, and roll back.
  • Model behaviour remains dependent on prompt and retrieval scaffolding rather than durable learned specialisation.
  • Your internal teams struggle to convert proprietary workflow data into repeatable model advantage.

This creates a gap between agent experimentation and agent operationalisation. Agentsia exists to close that gap.

Why existing alternatives fall short

The most common alternative is some combination of frontier model APIs, RAG and prompt orchestration, an internal open-model fine-tuning pipeline, and a lot of platform and ML engineering effort.

A correction on the fine-tuning landscape as of early 2026: Anthropic offers no fine-tuning API. OpenAI's fine-tuning is limited to older model generations, not their current flagship. Google has deprecated fine-tuning in the Gemini API outside Vertex AI. The practical competition is not a hosted fine-tuning API. It is a well-resourced internal ML team assembling an open-model training pipeline with Axolotl, TRL, or LLaMA-Factory.

That stack can run one fine-tune. It is much less reliable as a way to create repeatable, trusted specialist behaviour. The limits are structural:

  • RAG gives access to knowledge, but not internalised domain judgement for stable workflow patterns.
  • Internal teams can assemble the pieces, but rarely institutionalise eval rigour, provenance, rollback discipline, and repeatable specialist creation.
  • One-off fine-tunes do not compound: each training run is isolated, there is no closed eval–train feedback loop, and improvement does not accumulate across iterations.

The two things internal teams almost never institutionalise are the closed autonomous eval–train feedback loop and promotion governance with evidence discipline. Those are what we build.
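To make the missing loop concrete, here is a minimal sketch of a closed eval–train feedback loop in illustrative Python. Every name in it is a placeholder, not Modelsmith's actual API; the point is the shape of the loop, in which diagnosed failures from one iteration become curated training signal for the next.

```python
# Minimal sketch of a closed eval-train feedback loop. All names are
# hypothetical placeholders, not Modelsmith's actual API.
from dataclasses import dataclass, field

@dataclass
class EvalReport:
    score: float                                   # pass rate on the domain suite
    failures: list = field(default_factory=list)   # failed cases, with context

def run_evals(model_id: str, suite_version: str) -> EvalReport:
    ...  # score the candidate against a versioned, hard-to-game eval suite

def to_examples(failures: list) -> list:
    ...  # convert diagnosed failures into curated training examples

def train(model_id: str, examples: list) -> str:
    ...  # fine-tune on the accumulated examples, return a new candidate id

def specialise(model_id: str, suite_version: str,
               examples: list, target: float) -> str:
    # The step one-off fine-tunes lack: each iteration's failures feed the
    # next iteration's training set, so improvement accumulates across runs.
    report = run_evals(model_id, suite_version)
    while report.score < target:
        examples = examples + to_examples(report.failures)
        model_id = train(model_id, examples)
        report = run_evals(model_id, suite_version)
    return model_id
```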

Build vs buy

Most strong internal teams can piece together a frontier model or open model, a retrieval layer, an inference vendor or serving runtime, some fine-tuning code, a benchmark notebook, and a set of ad-hoc deployment scripts. That is not the same as owning a repeatable specialisation capability.

The argument is that you usually do not fail because you cannot run one fine-tune. You fail because you cannot institutionalise the full loop:

  • Deciding what specialist to build
  • Defining domain evals that are hard to game
  • Converting failures into better training signal
  • Governing promotion and rollback with evidence discipline
  • Preserving provenance and accepted-state across iterations
  • Operating continuously without the loop degrading when engineers are unavailable

Internal assembly can reproduce pieces of this. It is much harder to reproduce the compounding operating system around it. That is the real product.
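A hypothetical sketch of the promotion and rollback part of that loop, assuming illustrative field names rather than Modelsmith's actual schema: promotion requires recorded evidence against a versioned suite, and the previous accepted state is always preserved as a rollback target.

```python
# Hypothetical promotion gate. Field names are illustrative assumptions,
# not Modelsmith's schema.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Evidence:
    suite_version: str      # which versioned benchmark produced the scores
    candidate_score: float
    accepted_score: float   # the currently accepted model, same suite
    data_lineage: str       # provenance of the training data used

@dataclass(frozen=True)
class AcceptedState:
    model_id: str
    evidence: Evidence
    previous: Optional["AcceptedState"] = None   # rollback target

def promote(state: AcceptedState, candidate_id: str, ev: Evidence) -> AcceptedState:
    # Refuse promotion without measured improvement on the same versioned
    # suite; this is the opposite of "latest run wins".
    if ev.candidate_score <= ev.accepted_score:
        raise ValueError("no measured improvement on the accepted suite")
    return AcceptedState(model_id=candidate_id, evidence=ev, previous=state)

def rollback(state: AcceptedState) -> AcceptedState:
    if state.previous is None:
        raise ValueError("no prior accepted state to roll back to")
    return state.previous
```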

The five interlocking moats

Modelsmith is not one moat. It is five that reinforce each other:

  1. Autonomous failure analysis that diagnoses why a specialist failed, not just which test failed.
  2. A domain knowledge flywheel of rubrics, safety nets, evals, and curated source material that compounds over time.
  3. Expert-per-context specialist architecture, rather than one diluted generalist.
  4. An agent-native operating workflow through coding tools, stable commands, and machine-readable state.
  5. Platform depth from accumulated operational knowledge, deployment discipline, and accepted-state control.

Any one can be copied in part. The interaction is what compounds. Better failure analysis sharpens training data. Better training uncovers new edge cases. New edge cases improve safety logic and eval coverage. Better eval coverage makes promotion more trustworthy. The system compounds. The competitor starting from scratch does not.
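The first moat is the easiest to understate. A rough sketch of the difference between recording which test failed and diagnosing why, with failure modes that are assumptions for the example:

```python
# Sketch of "why it failed" versus "which test failed". The failure modes
# and routing below are illustrative assumptions.
from collections import Counter

def diagnose(failure: dict) -> str:
    # Classify the root cause rather than just logging the failing case:
    # e.g. "missing_domain_rule", "stale_source", "output_format_drift".
    ...

def analyse(failures: list[dict]) -> Counter:
    # Aggregated causes turn a red test list into an action plan: missing
    # rules feed the rubric library, stale sources feed curation, format
    # drift feeds training data. Each diagnosis sharpens the flywheel.
    return Counter(diagnose(f) for f in failures)
```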

The operating model: agentic coding first

A data-scientist-first platform assumes you work through notebooks, dashboards, manual charts, ad-hoc scripts, and interactive inspection. An agent-first platform lets your AI coding agent work through repo files, typed configs, eval suites, CLI commands, pull requests, CI gates, logs, structured artefacts, and promotion workflows.

This does not make data scientists irrelevant. It changes what the platform optimises for. Your data scientists, applied AI leaders, engineers, product managers and domain experts remain critical reviewers and designers of eval methodology. The day-to-day improvement loop is operable by agents acting through deterministic interfaces, while adjacent functions participate through issues, PRs, runbooks, eval reviews, and human-in-the-loop approval records rather than needing to become ML infrastructure specialists.

Agentsia favours:

  • declarative training and eval specs over hidden notebook state
  • versioned benchmarks over informal experiment notes
  • typed interfaces over implicit conventions
  • deterministic commands over manual UI sequences
  • structured logs over free-form console output
  • PR-native review over private local experimentation
  • accepted-state promotion over "latest run wins"
  • rollback-ready artefacts over one-off model uploads
  • machine-readable evidence over screenshots of charts
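As one illustration of what declarative specs and typed interfaces could mean in practice, here is a hypothetical schema; none of these names or values are Modelsmith's actual config format. The point is that an agent can read, diff, validate, and propose changes to a spec like this through a PR.

```python
# Hypothetical declarative spec. The schema and example values are
# assumptions for illustration, not Modelsmith's config format.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSpec:
    suite: str               # a versioned benchmark, not an informal note
    target_pass_rate: float
    max_latency_ms: int      # latency budget as a first-class gate

@dataclass(frozen=True)
class TrainingSpec:
    base_model: str
    dataset_rev: str         # pinned data revision, so lineage is reproducible
    evals: EvalSpec          # promotion criteria live in the spec, not a notebook

spec = TrainingSpec(
    base_model="open-specialist-8b",
    dataset_rev="exchange-logs@2026-01-15",
    evals=EvalSpec(suite="rtb-bid-shading/v12",
                   target_pass_rate=0.97,
                   max_latency_ms=40),
)
```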

The primary operating surface should be an agentic coding workflow acting on your behalf, not a human stitching together notebooks and shell commands.

Two surfaces, one truth

We maintain two surfaces. The operational surface is machine-readable state, structured logs, CLI commands, MCP tools, REST API, and promotion workflows: the authoritative system of record that your agents and engineers use to operate Modelsmith. The executive surface is a dashboard that makes model health, ROI against frontier baselines, and specialist fleet status legible when you need to understand value and justify investment.

Both surfaces resolve to the same accepted system state. No surface exposes a different truth about model quality, promotion status, or platform health.
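A sketch of what that discipline implies, with names that are assumptions for the example: one authoritative record, two renderings.

```python
# Sketch of "two surfaces, one truth": both surfaces render the same
# accepted-state record. Names and fields are illustrative assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class FleetStatus:
    specialist: str
    accepted_model: str
    eval_score: float
    frontier_baseline: float   # the ROI comparison the executive surface shows

def operational_view(s: FleetStatus) -> str:
    # What agents and CLIs consume: the authoritative record, verbatim.
    return json.dumps(asdict(s))

def executive_view(s: FleetStatus) -> str:
    # What the dashboard shows: a legible rendering of the same record.
    delta = s.eval_score - s.frontier_baseline
    return f"{s.specialist}: {s.accepted_model} ({delta:+.1%} vs frontier baseline)"
```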

We do not promise every workflow. We promise one wedge, done well, with enough evidence that you can trust the specialist to serve. The rest follows.