Product · github.com/agnt-os/modelsmith

Modelsmith.

Evaluation-first specialisation engine, retraining factory, and control plane for promotion, rollback, and accepted model lineage.

Modelsmith is the software that turns your operational failures into training signal, your engineering judgement into a promotion discipline, and your fleet of specialists into a compounding capability you can operate through agentic coding tools.

What it is

  • An evaluation-first specialisation engine.
  • A retraining factory for domain specialists.
  • A control plane for promotion, rollback, and accepted model lineage.
  • A fleet routing and coordination layer for specialist ensembles.
  • A packaging and deployment control layer for low-latency specialist runtimes.
  • An agent-first platform operated through structured interfaces.

What it is not

  • A generic frontier-model replacement for every workflow.
  • A platform for coding models where you have no data edge.
  • A notebook-first research environment.
  • A generic RAG wrapper or replacement. RAG and trained weights are complementary.
  • An ingestion or scraping system.
  • A product whose main workflow assumes hands-on operation by data scientists.

How it works

The closed eval–train loop.

A specialist moves through a loop that runs continuously, without manual intervention for routine improvement. You set the target composite score. You review novel failure modes. The loop handles everything else.

01

Evaluate

Run your governed scenarios against the current specialist. Compute the composite score across core, robustness, and micro-benchmark suites.

02

Diagnose

Classify each failure. Identify never-pass scenarios (routed to SFT warm-up), flip-flop scenarios (rotated into the held-out set), and always-pass scenarios (excluded from training to prevent forgetting).

03

Train

Generate augmented training data from failures. Auto-adjust hyperparameters. SFT warm-up for cold-start models, GRPO (Group Relative Policy Optimization) for reinforcement from the reward signal.

04

Re-score

Score the new adapter. If it regresses by more than ten percent, automatic rollback. If it converges, automatic promotion.

05

Propose

Persistent failure patterns generate new eval scenarios. These are staged for your review before activation. The loop generates. You gate.
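The diagnose and re-score steps reduce to simple, testable rules. A minimal sketch in Python (the function names and the pass-history representation are ours, not the Modelsmith API; the thresholds mirror the `target_composite` and `max_regression_pct` promotion gates shown later):

```python
def classify(pass_history):
    """Classify a scenario by its recent pass/fail history (list of bools).

    never-pass  -> SFT warm-up material
    flip-flop   -> rotated into the held-out set
    always-pass -> excluded from training to prevent forgetting
    """
    if not any(pass_history):
        return "never-pass"
    if all(pass_history):
        return "always-pass"
    return "flip-flop"


def rescore(previous_composite, new_composite,
            target_composite=98, max_regression_pct=10):
    """Re-score decision: roll back on regression beyond the gate,
    promote on convergence to the target, otherwise keep iterating."""
    if new_composite < previous_composite * (1 - max_regression_pct / 100):
        return "rollback"
    if new_composite >= target_composite:
        return "promote"
    return "iterate"
```

The loop wraps these decisions with training and scenario proposal; the surface you gate is the staged scenarios, not the routine promotions.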

Promotion

Six explicit states from candidate to production.

Every transition is auditable. Every promotion produces an evidence bundle. Modelsmith packages the deployment artefact and governs the state machine. You control shadow and canary in your own infrastructure.

  1. candidate

    A new adapter, not yet verified.

  2. eval-accepted

    Meets all automated gates on the governed eval set.

  3. shadow

    Runs alongside production; outputs compared, not served.

  4. canary

    Serves a bounded fraction of traffic in your infrastructure.

  5. production-accepted

    Fully promoted. Complete evidence bundle. Rollback artefact ready.

  6. deprecated

    Retained for lineage; no longer routed to.
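The six states form a small state machine. A sketch of the legal transitions (illustrative only; the edges back to candidate on rollback are our reading of the section, and Modelsmith's actual transition table may differ):

```python
# Legal transitions in the promotion state machine (illustrative).
ALLOWED = {
    "candidate": {"eval-accepted"},
    "eval-accepted": {"shadow", "candidate"},
    "shadow": {"canary", "candidate"},
    "canary": {"production-accepted", "candidate"},
    "production-accepted": {"deprecated"},
    "deprecated": set(),
}


def transition(state, target):
    """Apply one auditable transition; every call is a loggable event."""
    if target not in ALLOWED[state]:
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target
```

Keeping the table explicit is what makes every transition auditable: an illegal move fails loudly rather than silently skipping a gate.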

Model onboarding

Adding a new model is a single JSON diff.

No shell script to copy. No compose file to hand-edit. Every script, compose file, and iterate loop reads from the model profile and generates the appropriate behaviour at runtime. The profile is the single source of truth; every downstream artefact is derived from it.

config/clusters.json
{
  "qwen3_32b": {
    "hf_id": "Qwen/Qwen3-32B-AWQ",
    "architecture": "dense",
    "quantization": {
      "format": "awq",
      "kv_cache_dtype": "turboquant35"
    },
    "training": {
      "method": "grpo",
      "sft_warmup": true,
      "lora_targets": ["q_proj", "k_proj", "v_proj", "o_proj"],
      "max_completion_length": 1024
    },
    "clusters": ["exchange", "gaming", "campaign", "trust"],
    "promotion_gates": {
      "target_composite": 98,
      "max_regression_pct": 10
    }
  }
}
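Because the profile is the single source of truth, onboarding reduces to merging one more top-level entry into this file. A minimal sketch of that diff applied in code (the `add_model` helper is hypothetical; Modelsmith's own tooling does this through its structured interfaces):

```python
import json


def add_model(clusters_json, name, profile):
    """Return config/clusters.json contents with one more model profile
    merged in. Every downstream script, compose file, and iterate loop
    derives its behaviour from this entry at runtime."""
    clusters = json.loads(clusters_json)
    if name in clusters:
        raise ValueError(f"model {name!r} already onboarded")
    clusters[name] = profile
    return json.dumps(clusters, indent=2)
```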

Deployment model

Training under your control. Production inference on the substrate you choose.

Near-term: self-hosted development

The recommended starting configuration is one or more DGX Spark units, providing 128 GiB unified memory, the compute for full iterate loop cycles on models up to 32B parameters, and complete training-data residency within your infrastructure.

This reflects near-term pragmatics. It eliminates training cloud costs before revenue, keeps your proprietary training data under your control, and creates a simple licensing relationship: Agentsia provides software, you provide development compute.

Production inference: your choice

The specialist artefact produced by Modelsmith is served on whichever inference substrate fits your latency, cost, residency, and operational requirements. Your choice, not ours.

Long-term: managed cloud with optional on-prem

As revenue develops, we offer a cloud-managed option where Agentsia operates the control plane and training infrastructure. Production inference can run on Agentsia-managed infrastructure, your cloud account, a specialist inference vendor, or on-prem hardware. The control plane, eval system, and specialist logic stay consistent across modes.

Commercial terms

Pricing tracks fleet size.

Domain onboarding requires upfront investment in eval design, scenario generation, and the first specialist campaign. That is billable as professional services. The recurring licence covers Modelsmith itself: the iterate loop, the eval framework, the promotion machinery, and the support that comes with them.

Tier A
£40–60k/yr

1–3 active specialists

One wedge. Prove the loop.

Tier B
£80–120k/yr

4–8 active specialists

Adjacent specialists. Fleet emerging.

Tier C
Negotiated

8+ active specialists

Enterprise deployment. Custom governance.

Indicative ranges for annual platform licence. Professional services billed separately per engagement. Cloud consumption pricing on the managed option.

The product is not one model. It is the operating system for continuously producing, evaluating, promoting, and governing a fleet of them.