Essay · 01 · Methodlibrary/on-eval-design.md

On eval design

Why a two-layer eval architecture is the integrity boundary that lets the loop run autonomously.

Eval quality determines everything downstream. Bad rubrics produce misleading training signal, misleading promotion decisions, and misleading evidence. In your first ninety days in a new domain, invest disproportionately in eval design before any training begins.

This is the claim most founders agree with in the abstract and ignore in practice. It is also the claim that separates specialist programmes that compound from programmes that produce a good demo and then plateau.

Two layers, different governance

We use a two-layer architecture. The governed eval set contains domain anchors, rubric baselines, and golden standards that represent the non-negotiable success criteria for a specialist. Any change requires your explicit approval. The autonomous loop cannot modify this layer.

The expansion eval set contains scenario variants, edge cases, and stress tests auto-proposed by the iterate loop from persistent failure patterns. Proposals are staged for your review before activation. The loop generates; you gate.
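To make the governance split concrete, here is a minimal sketch. All class and method names are hypothetical, not Modelsmith's actual API; the point is only that the loop can stage proposals into the expansion layer, while nothing changes the governed layer or goes live without explicit human approval.

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    GOVERNED = "governed"    # anchors, rubric baselines, golden standards
    EXPANSION = "expansion"  # loop-proposed variants, staged for review

@dataclass
class EvalScenario:
    scenario_id: str
    layer: Layer
    active: bool = False  # expansion scenarios start inactive

class EvalSuite:
    """Enforces the split: the loop may only propose into the expansion
    layer; activation (and any governed change) requires a human call."""

    def __init__(self) -> None:
        self.scenarios: dict[str, EvalScenario] = {}  # live suite
        self.staged: list[EvalScenario] = []          # awaiting review

    def loop_propose(self, scenario: EvalScenario) -> None:
        # The autonomous loop cannot touch the governed layer.
        if scenario.layer is not Layer.EXPANSION:
            raise PermissionError("loop cannot modify the governed layer")
        self.staged.append(scenario)

    def human_approve(self, scenario_id: str) -> None:
        # You gate: approval moves a staged scenario into the live suite.
        for s in self.staged:
            if s.scenario_id == scenario_id:
                s.active = True
                self.scenarios[s.scenario_id] = s
                self.staged.remove(s)
                return
        raise KeyError(scenario_id)
```

The design choice worth copying is that the permission check lives in the data path, not in a policy document: there is simply no code path by which the loop writes to the governed layer.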

The separation prevents the circular risk where Modelsmith drifts its own rubrics toward what the model is already good at rather than what your domain actually requires. The governed layer is not immutable. First principles do evolve. But changes are always approved by you, not initiated by the loop.

The governed layer is the integrity boundary. Without it, the autonomous loop is a system that keeps optimising toward its own reflection.

Where scenarios come from

For a new domain, eval scenarios are generated from three data sources you control:

  • Operational failure modes. Real failures in your existing workflows, converted into test scenarios with expected outcomes. The iterate loop auto-proposes scenario variants from persistent eval failures.
  • Git commit exhaust. Your engineering history surfaces domain decisions, incident resolutions, and operational edge cases that become scenario seeds.
  • Unstructured documents. Confluence pages, retrospectives, post-mortems, and internal documentation encode institutional judgement that can be formalised into rubric criteria.

The discipline of not cheating yourself

The hardest part of eval design is writing scenarios that a well-prompted frontier model fails, that your specialist will eventually pass, and that a reasonable domain expert would agree are both important and well-scoped. Most teams stop at the first two; the third is the one that matters.
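The three bars can be written down as an explicit admission checklist. Only the first two are even partly mechanical; the third is a human judgment you record, not compute. The function below is illustrative, not part of any real pipeline:

```python
def admission_problems(frontier_fails: bool,
                       specialist_reachable: bool,
                       expert_endorses: bool) -> list[str]:
    """Return the reasons a candidate scenario should be rejected.
    An empty list means it clears all three bars."""
    problems = []
    if not frontier_fails:
        problems.append("a well-prompted frontier model already passes it")
    if not specialist_reachable:
        problems.append("no plausible path for the specialist to pass it")
    if not expert_endorses:
        problems.append("no domain expert calls it important and well-scoped")
    return problems
```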

When the governed layer is right, the autonomous loop has something to anchor on. When it is wrong, everything downstream is noise.

Invest here first. Come back to training after you can articulate, in writing, what the specialist must be able to do. If you cannot write it, you cannot evaluate it. If you cannot evaluate it, you are not building infrastructure. You are building a science project.