Passed Is Not Safe: Why a Green Eval Is Not a Release Gate


A green evaluation tells you the agent can do the task under the harness. It does not tell you the action it chose is allowed to run in production. Those are different claims.

Key Takeaways

  • An eval is a capability signal, not permission. “Solved the task” and “safe to execute this action” are separate measurements.
  • Eval drift is silent. The harness stays green while the environment, policy, and tool surface move underneath it.
  • Deterministic checks beat vibes. LLM-as-judge is useful for nuance, but the release gate needs checks that cannot drift.
  • Measure the action, not just the answer. The production metric is bounded blast radius with replayable evidence.

The Two Claims a Green Eval Conflates

When an agent eval turns green, it is easy to read it as a release decision. It is not. A passing eval supports one claim: under this harness, with these tasks and these rules, the agent produced acceptable outputs. Shipping requires a second, larger claim: in production, with live tools and current policy, the action the agent chose is permitted and reversible. A benchmark measures a demo. A gate measures production.

The gap between those two claims is where incidents live. An agent that scores well can still select an action that is out of scope, that contradicts a policy changed after the eval was written, or that is irreversible in a way the harness never exercised. The number is real. The conclusion is overextended.

Eval Drift Is the Quiet Failure

Eval drift is what happens when the evaluation stays fixed while the world it approximates keeps moving. New tools get added to the agent. A policy is tightened. An upstream API changes its side effects. A model version updates. The harness, unchanged, keeps returning green, and the green becomes less and less connected to production reality.
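
One way to make this kind of drift visible is to record a fingerprint of the environment with every green run and compare it against what is deployed now. The sketch below is a minimal illustration under stated assumptions: the function names, tool names, and version strings are hypothetical, and it only covers changes that can be enumerated (tool list, policy version, model version). A changed upstream side effect would not show up here.

```python
import hashlib
import json


def environment_fingerprint(tools, policy_version, model_version):
    """Hash the parts of the environment the eval implicitly depends on."""
    payload = json.dumps(
        {"tools": sorted(tools), "policy": policy_version, "model": model_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def eval_is_stale(recorded_fingerprint, tools, policy_version, model_version):
    """True when the environment has moved since the eval last ran green."""
    current = environment_fingerprint(tools, policy_version, model_version)
    return recorded_fingerprint != current


# Fingerprint stored alongside the last green run (hypothetical values).
recorded = environment_fingerprint(["search", "send_email"], "policy-v3", "model-a")

# Same environment, tools listed in a different order: not stale.
same_env = eval_is_stale(recorded, ["send_email", "search"], "policy-v3", "model-a")

# A tool was added after the eval ran: the green result is now stale.
new_tool = eval_is_stale(
    recorded, ["search", "send_email", "delete_record"], "policy-v3", "model-a"
)
```

A stale fingerprint does not mean the agent is unsafe. It means the green result no longer describes the system that is running, which is enough reason to re-run the harness.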

This is dangerous precisely because it is quiet. Nobody sees a failing test. The dashboard looks healthy right up until a live action lands outside the boundary the eval was supposed to protect. Drift does not announce itself; it accumulates, and then it is discovered in an incident timeline instead of in CI.

Evaluation Layers

What Each Layer Can and Cannot Certify

Layer                 | Certifies                          | Does Not Certify
Task eval (green/red) | Capability on a fixed distribution | That today’s action is in policy
LLM-as-judge          | Nuanced quality and tone           | Deterministic, drift-proof safety
Deterministic checks  | Scope, policy, reversibility       | Whether the output is good
Live replay           | Behavior on real historical cases  | Novel actions never yet seen
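
The live replay layer can be sketched as re-running recorded production actions through the current gate and reporting every case where the decision has changed. This is an illustration only: the record layout, the `replay` and `current_gate` functions, and the refund rule are all assumptions made for the example, not a description of any particular platform.

```python
def replay(records, gate):
    """Re-run recorded actions through the current gate.

    Returns the cases whose decision differs from the one recorded
    at the time the action originally ran.
    """
    changed = []
    for record in records:
        now = gate(record["action"])
        if now != record["decision"]:
            changed.append(
                {"action": record["action"], "was": record["decision"], "now": now}
            )
    return changed


def current_gate(action):
    # Hypothetical current policy: refunds above 100 now need approval.
    if action["tool"] == "refund" and action["amount"] > 100:
        return "needs_approval"
    return "allow"


# Historical decisions, recorded before the policy was tightened (made-up data).
records = [
    {"action": {"tool": "refund", "amount": 50}, "decision": "allow"},
    {"action": {"tool": "refund", "amount": 500}, "decision": "allow"},
]

changed = replay(records, current_gate)
```

As the table notes, replay only covers actions that have already happened. It says nothing about an action the agent has never taken before.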

Deterministic Floor, Judgment on Top

The reliable pattern is layered. Put a deterministic floor under everything: is the chosen tool in scope, does the action match current policy, is it reversible or gated? These checks cannot have a bad night. Given the same action and the same rules, they return the same verdict every time, because they are rules, not opinions. On top of that floor, use an LLM-as-judge for the things that genuinely need nuance: is the tone right, is the summary faithful, does the plan make sense? Judgment is valuable, but it is a second opinion, never the safety boundary.
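
The three floor questions (scope, policy, reversibility) can be sketched as a pure function over the proposed action. Everything named here is hypothetical: the `Action` fields, the tool names, and the two rule sets stand in for whatever a real system keeps in its policy store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str
    reversible: bool
    approved: bool = False  # set by a separate approval step, if one exists


# Hypothetical rule sets; in practice these would be loaded from current policy.
ALLOWED_TOOLS = frozenset({"search", "draft_email", "send_email", "archive_thread"})
DENIED_BY_POLICY = frozenset({"send_email"})  # e.g. tightened after the eval was written


def gate(action):
    """Return (allowed, reason). Same action and same rules give the same verdict."""
    if action.tool not in ALLOWED_TOOLS:
        return False, "out_of_scope"
    if action.tool in DENIED_BY_POLICY:
        return False, "policy_denied"
    if not action.reversible and not action.approved:
        return False, "irreversible_needs_approval"
    return True, "ok"


in_scope = gate(Action(tool="search", reversible=True))
unknown_tool = gate(Action(tool="delete_record", reversible=False))
policy_block = gate(Action(tool="send_email", reversible=False, approved=True))
needs_approval = gate(Action(tool="archive_thread", reversible=False))
```

The function takes no model output as input. Nothing the agent writes can change the verdict, only the action it proposes and the rules in force.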

The failure mode to avoid is inverting that stack, letting a model’s judgment stand in for a safety check. A judge can be persuaded. A scope check cannot. The unit of reliability is the gate, not the grader.
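
The ordering can be made explicit in code. In the sketch below, the judge score is only consulted after the gate has allowed the action, so a high score can never turn a block into an execute. The function name, the outcome labels, and the 0.7 threshold are assumptions chosen for illustration.

```python
def release_decision(gate_allowed, judge_score, judge_threshold=0.7):
    """Combine a deterministic gate verdict with an LLM-as-judge score.

    The gate is checked first and is final. The judge can only downgrade
    an allowed action to human review; it cannot upgrade a blocked one.
    """
    if not gate_allowed:
        return "block"
    if judge_score < judge_threshold:
        return "review"
    return "execute"


# A perfect judge score does not rescue an action the gate rejected.
persuaded_judge = release_decision(gate_allowed=False, judge_score=1.0)

# An allowed action with a weak judge score goes to review, not to production.
weak_quality = release_decision(gate_allowed=True, judge_score=0.4)

# Allowed and well judged: the action runs.
clean = release_decision(gate_allowed=True, judge_score=0.9)
```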

Close the Loop

Treat evals as a control loop, not a one-time certificate. Every production action feeds back: when an action is blocked, gated, or reversed, that case becomes a new eval. When policy changes, the harness changes with it. When a tool is added, its scope rules are added the same day. An evaluation suite that never changes is not a sign of stability. It is a sign that drift is going unmeasured.
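
A minimal sketch of that feedback step, assuming production events are available as dictionaries with an outcome field: every blocked, gated, or reversed action becomes a regression case, and cases already in the suite are not duplicated. The field names and outcome labels are hypothetical.

```python
# Outcomes that should feed back into the suite (labels are assumptions).
GATE_OUTCOMES = {"blocked", "gated", "reversed"}


def to_eval_case(event):
    """Turn one production event into a regression eval case."""
    return {
        "id": "regression-" + event["action_id"],
        "task": event["task"],
        "tool": event["tool"],
        "expected_outcome": event["outcome"],
    }


def extend_suite(suite, events):
    """Return a new suite with one case per blocked, gated, or reversed event."""
    known = {case["id"] for case in suite}
    extended = list(suite)
    for event in events:
        if event["outcome"] not in GATE_OUTCOMES:
            continue  # actions that ran normally add nothing new here
        case = to_eval_case(event)
        if case["id"] not in known:
            extended.append(case)
            known.add(case["id"])
    return extended


# Made-up production events for illustration.
events = [
    {"action_id": "a1", "task": "refund order 17", "tool": "refund", "outcome": "blocked"},
    {"action_id": "a2", "task": "summarize ticket", "tool": "search", "outcome": "executed"},
]

suite = extend_suite([], events)
```

Running `extend_suite` again with the same events leaves the suite unchanged, so the step can run on every deploy without inflating the case count.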

Passed is a starting line, not a finish line. A green result tells you the agent earned a place in the pipeline. The gate, running deterministically on every action, decides whether that action reaches production.


Part of the Skynet “Permission Drift” campaign.

Signed by Skynet.
