Product · github.com/agnt-os/modelsmith

Modelsmith.

Evaluation-first specialisation engine, retraining factory, and control plane for promotion, rollback, and accepted model lineage.

Modelsmith is the software that turns your operational failures into training signal, your engineering judgement into a promotion discipline, and your fleet of specialists into a compounding capability you can operate through agentic coding tools.

What it is

  • An evaluation-first specialisation engine.
  • A retraining factory for domain specialists.
  • A control plane for promotion, rollback, and accepted model lineage.
  • A fleet routing and coordination layer for specialist ensembles.
  • A packaging and deployment control layer for low-latency specialist runtimes.
  • An agent-first platform operated through structured interfaces.

What it is not

  • A generic frontier-model replacement for every workflow.
  • A platform for coding models where you have no data edge.
  • A notebook-first research environment.
  • A generic RAG wrapper or replacement. RAG and trained weights are complementary.
  • An ingestion or scraping system.
  • A product whose main workflow assumes hands-on operation by data scientists.

How it works

The closed eval–train loop.

A specialist moves through a loop that runs continuously, without manual intervention for routine improvement. You set the target composite score. You review novel failure modes. The loop handles everything else.

01

Evaluate

Run your governed scenarios against the current specialist. Compute the composite score across core, robustness, and micro-benchmark suites.

02

Diagnose

Classify each failure. Identify never-pass scenarios (routed to SFT warm-up), flip-flop scenarios (rotated into the held-out set), and always-pass scenarios (excluded from training to prevent forgetting).

03

Train

Generate augmented training data from failures. Auto-adjust hyperparameters. SFT warm-up for cold-start models, GRPO (Group Relative Policy Optimization) for reinforcement from the reward signal.

04

Re-score

Score the new adapter. If it regresses by more than ten percent, automatic rollback. If it converges, automatic promotion.

05

Propose

Persistent failure patterns generate new eval scenarios. These are staged for your review before activation. The loop generates. You gate.
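The diagnose and re-score steps reduce to simple, testable rules. A minimal sketch in Python (the function names and the pass-history representation are ours, not the Modelsmith API; the thresholds mirror the `target_composite` and `max_regression_pct` promotion gates shown later):

```python
def classify(pass_history):
    """Classify a scenario by its recent pass/fail history (list of bools).

    never-pass  -> SFT warm-up material
    flip-flop   -> rotated into the held-out set
    always-pass -> excluded from training to prevent forgetting
    """
    if not any(pass_history):
        return "never-pass"
    if all(pass_history):
        return "always-pass"
    return "flip-flop"


def rescore(previous_composite, new_composite,
            target_composite=98, max_regression_pct=10):
    """Re-score decision: roll back on regression beyond the gate,
    promote on convergence to the target, otherwise keep iterating."""
    if new_composite < previous_composite * (1 - max_regression_pct / 100):
        return "rollback"
    if new_composite >= target_composite:
        return "promote"
    return "iterate"
```

The loop wraps these decisions with training and scenario proposal; the surface you gate is the staged scenarios, not the routine promotions.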

Promotion

Six explicit states from candidate to production.

Every transition is auditable. Every promotion produces an evidence bundle. Modelsmith packages the deployment artefact and governs the state machine. You control shadow and canary in your own infrastructure.

  1. candidate

    A new adapter, not yet verified.

  2. eval-accepted

    Meets all automated gates on the governed eval set.

  3. shadow

    Runs alongside production; outputs compared, not served.

  4. canary

    Serves a bounded fraction of traffic in your infrastructure.

  5. production-accepted

    Fully promoted. Complete evidence bundle. Rollback artefact ready.

  6. deprecated

    Retained for lineage; no longer routed to.
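The six states form a small state machine. A sketch of the legal transitions (illustrative only; the edges back to candidate on rollback are our reading of the section, and Modelsmith's actual transition table may differ):

```python
# Legal transitions in the promotion state machine (illustrative).
ALLOWED = {
    "candidate": {"eval-accepted"},
    "eval-accepted": {"shadow", "candidate"},
    "shadow": {"canary", "candidate"},
    "canary": {"production-accepted", "candidate"},
    "production-accepted": {"deprecated"},
    "deprecated": set(),
}


def transition(state, target):
    """Apply one auditable transition; every call is a loggable event."""
    if target not in ALLOWED[state]:
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target
```

Keeping the table explicit is what makes every transition auditable: an illegal move fails loudly rather than silently skipping a gate.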

Model onboarding

Adding a new model is a single JSON diff.

No shell script to copy. No compose file to hand-edit. Every script, compose file, and iterate loop reads from the model profile and generates the appropriate behaviour at runtime. The profile is the single source of truth; every downstream artefact is derived from it.

config/clusters.json
{
  "qwen3_32b": {
    "hf_id": "Qwen/Qwen3-32B-AWQ",
    "architecture": "dense",
    "quantization": {
      "format": "awq",
      "kv_cache_dtype": "turboquant35"
    },
    "training": {
      "method": "grpo",
      "sft_warmup": true,
      "lora_targets": ["q_proj", "k_proj", "v_proj", "o_proj"],
      "max_completion_length": 1024
    },
    "clusters": ["exchange", "gaming", "campaign", "trust"],
    "promotion_gates": {
      "target_composite": 98,
      "max_regression_pct": 10
    }
  }
}
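Because the profile is the single source of truth, onboarding reduces to merging one more top-level entry into this file. A minimal sketch of that diff applied in code (the `add_model` helper is hypothetical; Modelsmith's own tooling does this through its structured interfaces):

```python
import json


def add_model(clusters_json, name, profile):
    """Return config/clusters.json contents with one more model profile
    merged in. Every downstream script, compose file, and iterate loop
    derives its behaviour from this entry at runtime."""
    clusters = json.loads(clusters_json)
    if name in clusters:
        raise ValueError(f"model {name!r} already onboarded")
    clusters[name] = profile
    return json.dumps(clusters, indent=2)
```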

Deployment model

Training under your control. Production inference on the substrate you choose.

Near-term: self-hosted development

The recommended starting configuration is one or more DGX Spark units, providing 128 GiB unified memory, the compute for full iterate loop cycles on models up to 32B parameters, and complete training-data residency within your infrastructure.

This reflects near-term pragmatics. It eliminates training cloud costs before revenue, keeps your proprietary training data under your control, and creates a simple licensing relationship: Agentsia provides software, you provide development compute.

Production inference: your choice

The specialist artefact produced by Modelsmith is served on whichever inference substrate fits your latency, cost, residency, and operational requirements. Your choice, not ours.

Long-term: managed cloud with optional on-prem

As revenue develops, we offer a cloud-managed option where Agentsia operates the control plane and training infrastructure. Production inference can run on Agentsia-managed infrastructure, your cloud account, a specialist inference vendor, or on-prem hardware. The control plane, eval system, and specialist logic stay consistent across modes.

Commercial terms

Pricing tracks fleet size.

Domain onboarding requires upfront investment in eval design, scenario generation, and the first specialist campaign. That is billable as professional services. The recurring licence covers Modelsmith itself: the iterate loop, the eval framework, the promotion machinery, and the support that comes with them.

Tier A
£40–60k/yr

1–3 active specialists

One wedge. Prove the loop.

Tier B
£80–120k/yr

4–8 active specialists

Adjacent specialists. Fleet emerging.

Tier C
Negotiated

8+ active specialists

Enterprise deployment. Custom governance.

Indicative ranges for annual platform licence. Professional services billed separately per engagement. Cloud consumption pricing on the managed option.

The product is not one model. It is the operating system for continuously producing, evaluating, promoting, and governing a fleet of them.