Shadow evaluation: how a model earns its way into production

Shadow evaluation runs a new model against live inputs while the incumbent keeps serving decisions. The candidate's outputs are logged, compared against the incumbent's on pre-agreed measures, and reviewed before anything promotes. Nothing downstream acts on the candidate until it wins.

That is the full definition. The rest of this piece is why the phrase belongs in your vendor contract verbatim, and why a paraphrase of it buys you nothing.

The term "shadow AI" saturates the 2026 governance conversation for a different reason: employees using unsanctioned external tools without IT approval. That conversation is real and worth having. This piece covers a mechanism that shares the word and shares almost nothing else. Shadow evaluation, as a model promotion protocol, is the checkpoint between a vendor's process claim and your risk officer's evidence requirement. The two are separated by a log that either exists or does not.

Most vendors describe shadow evaluation as a development practice. The enterprises that press them find it is also an audit record, a version-change event, and, in regulated workflows, a governance decision that carries the same accountability requirements as the decisions the model itself makes. The name matters because the name determines which function owns the artifact and who approves the outcome.

Shadow evaluation vs A/B testing

A/B testing promotes by splitting live traffic. A percentage of real requests routes to the candidate model, users receive its outputs, and you measure outcomes over time. In low-stakes, high-volume domains with short feedback loops, this is a workable design.

In regulated enterprise, it is the wrong design.

A/B testing exposes live decisions to a model you have not yet validated. A hiring recommendation, a claims-triage call, a retention-risk flag: each of these carries a paper trail that outlives the experiment. If the candidate model was worse on the decisions that went through it, the paper trail is evidence of a governance failure. The experiment and the liability arrive together.

Shadow evaluation removes the exposure. The candidate receives every input the incumbent receives and produces parallel outputs that nothing routes on. The decision affecting a real person continues to flow from the validated model until the candidate wins in the evaluation log.
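The routing rule in that paragraph can be sketched in a few lines. This is a minimal illustration, not a vendor implementation: the names `serve_with_shadow`, `ShadowRecord`, and the in-memory `log` list are hypothetical, and a real deployment would write to durable storage and would likely run the candidate asynchronously.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# A model is assumed here to be any callable that maps features to a score.
Model = Callable[[Dict[str, Any]], float]


@dataclass(frozen=True)
class ShadowRecord:
    request_id: str
    incumbent_output: float
    candidate_output: Optional[float]
    candidate_error: Optional[str]


def serve_with_shadow(
    request_id: str,
    features: Dict[str, Any],
    incumbent: Model,
    candidate: Model,
    log: List[ShadowRecord],
) -> float:
    """Return the incumbent's decision; record the candidate's beside it."""
    decision = incumbent(features)
    try:
        shadow: Optional[float] = candidate(features)
        error: Optional[str] = None
    except Exception as exc:
        # A failing candidate must never disturb the live decision.
        shadow, error = None, repr(exc)
    log.append(ShadowRecord(request_id, decision, shadow, error))
    # Only the incumbent's output ever leaves this function.
    return decision
```

The design point is the return statement: the candidate's score is written to the log and goes nowhere else.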

The measure of winning gets agreed before the run begins. Set the yardstick while everyone is still neutral about the result and the promotion decision cannot turn into a negotiation afterward. The candidate either clears the threshold or it does not ship. If it never wins, it never ships, and the only evidence it existed is the evaluation log.
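One way to make "agreed before the run begins" checkable is to fingerprint the yardstick at sign-off and compare that fingerprint when the results come in. The sketch below assumes a hypothetical `Yardstick` shape and illustrative threshold values; nothing here is drawn from a specific contract or tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Sequence


@dataclass(frozen=True)
class Yardstick:
    measure: str        # hypothetical measure name
    min_lift: float     # illustrative threshold, not a recommendation
    max_p_value: float  # illustrative significance level


def fingerprint(yardsticks: Sequence[Yardstick]) -> str:
    """Stable SHA-256 digest of the agreed measures and thresholds."""
    payload = json.dumps([asdict(y) for y in yardsticks], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_unchanged(yardsticks: Sequence[Yardstick], signed: str) -> bool:
    """True only if the yardstick in use matches the one signed off."""
    return fingerprint(yardsticks) == signed
```

Both sides record the digest at sign-off. If anyone relaxes a threshold after seeing the numbers, the digest no longer matches.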

Shadow evaluation vs a process assurance

"Our new model was retrained on your latest outcomes" is a process assurance. So is "we added six months of additional training data." Both are true of every retrain ever shipped by any vendor, including the ones that made results worse.

Shadow evaluation is an evidence claim. The candidate ran against the incumbent, in your environment, on your live traffic. Here is the log.

Process assurances are the default in the market because they are easy to produce and difficult to falsify. A shadow run log requires a working mechanism and an honest read. Ask for the log and watch which vendors reach for it and which reach for another slide.

When you evaluate any AI model improvement protocol, the question is narrow: can you show me the shadow run log, against the incumbent I am running today, in my environment, on my traffic? If the evaluation happened on a benchmark set in the vendor's environment, the evidence applies to a different population of inputs than the one running in your deployment.

Benchmarks tell you the ceiling. Shadow evaluation tells you what happens to your floor.

Why regulated enterprise needs this protocol by name

Every model promotion in a workflow that touches regulated decisions is an implicit assertion: the new model makes better decisions than the one it replaces. An auditor will ask you to back that assertion with something other than the vendor's word.

Shadow evaluation is how you generate that documentation inside your own perimeter. The run happens in your cloud. The log lives in your environment. Your team sets the measures, sets the threshold, and approves the promotion. The governance chain is yours from beginning to end.

Audit trails run backward. A claim adjudicated under one model version must be explainable using the state of that version. The version running when the audit happens two years later is irrelevant to the question the auditor is asking. The promotion log is part of that trail: it records when the model changed, on what evidence the change was approved, against what threshold, and by whom. It is the diff record for a system that regulators are starting to treat the way they treat any other consequential change to a regulated process. A version-change event without a promotion log is a gap the risk function cannot close after the fact, because the evidence of what changed and why was never captured.
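As a rough sketch of what "running backward" means in practice, the record below carries the four facts the paragraph names, and a lookup answers the auditor's question: which version was in force when this decision was made? The field names and the `version_in_force` helper are hypothetical, and treating a decision made at the exact promotion timestamp as belonging to the new version is an assumption of this sketch.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class PromotionRecord:
    promoted_at: datetime  # when the model changed
    model_version: str
    evidence_ref: str      # pointer to the shadow run log (hypothetical format)
    threshold: str         # the threshold the change was approved against
    approved_by: str       # the named approver


def version_in_force(
    history: List[PromotionRecord], decided_at: datetime
) -> Optional[PromotionRecord]:
    """Return the promotion that was live at the time of a past decision."""
    ordered = sorted(history, key=lambda r: r.promoted_at)
    times = [r.promoted_at for r in ordered]
    idx = bisect_right(times, decided_at)
    # idx == 0 means the decision predates every recorded promotion.
    return ordered[idx - 1] if idx else None
```

A decision that predates every record returns `None`, which is the gap the paragraph describes: a version change with no promotion log behind it.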

Two practical conclusions follow from this. First, the shadow evaluation log belongs in the same retention and access-control regime as your other audit records. The risk function should own that policy. The engineering team that built the candidate model should not set the retention terms on the log used to evaluate the candidate model. Second, the threshold for promotion should be reviewed and approved by someone outside that team before the run begins. The same second-signer logic that applies to regulated workflows at the decision level applies to the model promotion event that changes the system making those decisions. "Governance makes speed believable" covers that accountability structure in detail.

The worked example in talent

The model in production scores candidates and surfaces hire recommendations at a Fortune 500 insurance carrier. A new fine-tune arrives, calibrated on four years of outcomes from 10,765 agents.

The carrier's environment runs both models. The incumbent continues scoring and serving recommendations through the live workflow. The candidate runs in shadow on the same applicant inputs and produces parallel scores, recorded beside the incumbent's in a log held in the carrier's environment. The candidate team cannot modify that log.
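Access control is what actually stops the candidate team from writing to the log. A complementary technique, swapped in here purely for illustration and not described in the pilot, is a hash chain that makes any later edit detectable. The function names and the dictionary layout below are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, entry: Dict[str, Any]) -> str:
    body = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, Any]], entry: Dict[str, Any]) -> None:
    """Append an entry whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)}
    )


def chain_is_intact(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for link in chain:
        if link["prev"] != prev_hash:
            return False
        if link["hash"] != _digest(prev_hash, link["entry"]):
            return False
        prev_hash = link["hash"]
    return True
```

A chain like this detects tampering but does not prevent it, and it only helps if the latest hash is also held somewhere the candidate team cannot reach.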

The evaluation measures are set before the run begins: hire-rate lift on a held-out validation set, time-to-production improvement, and predictive accuracy against post-hire outcomes on the pilot cohort. The promotion threshold, statistical significance across all three, is agreed before anyone looks at the candidate's numbers.

If the candidate clears all three, the designated approver reviews the log and promotes. If it clears two but not the third, the team and the risk function examine which one failed and why, then decide together whether to extend the shadow run or reject the candidate for that cycle. If it never clears, it never ships.
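The gate itself is small enough to write down. The sketch below assumes each measure reports a lift and a p-value, and that "clears" means a positive lift at a significance level `alpha` fixed in advance. Those definitions, the names, and the treatment of one or zero cleared measures as a rejection are assumptions for illustration; the outcome for three and for two cleared measures follows the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Outcome(Enum):
    SEND_TO_APPROVER = "send_to_approver"  # a human still approves
    JOINT_REVIEW = "joint_review"          # team and risk function decide
    REJECT = "reject"


@dataclass(frozen=True)
class MeasureResult:
    lift: float     # candidate minus incumbent, higher is better
    p_value: float


def cleared(result: MeasureResult, alpha: float) -> bool:
    """A measure clears if the lift is positive and significant."""
    return result.lift > 0 and result.p_value < alpha


def gate(results: Dict[str, MeasureResult], alpha: float) -> Outcome:
    if len(results) != 3:
        raise ValueError("this example protocol evaluates exactly three measures")
    passes = sum(cleared(r, alpha) for r in results.values())
    if passes == 3:
        return Outcome.SEND_TO_APPROVER
    if passes == 2:
        return Outcome.JOINT_REVIEW
    return Outcome.REJECT
```

Note that the best outcome the code can return is a referral to the designated approver. The promotion stays a human decision.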

This is one gate in a larger pipeline described in "The weights leave. Your data never does.": fine-tune inside the customer cloud, strip PII in two independent passes before any weights leave the perimeter, pool at the industry level only after customer review. Shadow evaluation is the gate most buyers forget to name until after they have signed, because vendors describe it in process language instead of contract language.

Where it belongs in the contract

Shadow evaluation as a defined mechanism should appear in the vendor contract. Documentation can change between signature and deployment. The contract cannot.

Three provisions are the minimum.

The yardstick: which measures, which threshold, agreed in writing before any run begins, with sign-off from both sides while neutral. The log: who holds it, who may read it, how long it is retained, who controls access, and whether the candidate team may modify it. The promotion authority: which named role at the customer organization approves a promotion, and whether a second reviewer from the risk or compliance function is required before any model change touches regulated workflows.

Vendors who have built the mechanism negotiate all three quickly, because the mechanism already enforces them. The contract just makes visible what the mechanism does. Vendors offering process assurances find these provisions harder, because process assurances have no log to produce and no threshold to name.

The difficulty of negotiating three contract clauses is itself useful information. It arrives before the contract is signed, which is when it is most useful.

Saad Bin Shafiq is the founder of Nodes. Anchor pilot: Fortune 500 insurance carrier, four years of production data, 10,765 agents. Methodology: Decision Traces.
