Dry-Run First: The Approval Gate That Stops an Agent Before It Acts

Most agent damage does not happen while the model is thinking. It happens in the last inch, the commit step. Put the gate exactly there: simulate, then decide.

Key Takeaways

The last inch is the dangerous one. Reasoning is reversible; the executed action is not. Guard the commit, not the chat.
Dry-run before commit. Simulate the effect and surface it before anything irreversible lands.
Three defaults per action: run, block, or ask. The right default depends on reversibility and blast radius.
A gate that fails open is worse than none. If the check cannot verify, it must block, not proceed.

Guard the Commit, Not the Conversation

A lot of agent safety energy goes into the model’s reasoning: better prompts, better planning, better self-critique. That work matters, but it guards the wrong place. A plan is reversible. A thought is reversible. The moment that is not reversible is execution, when the agent sends the email, deletes the row, deploys the change, or moves the money. That is the last inch, and it is where damage becomes real.

So put the strongest control precisely at the execution boundary. Before an action with external or irreversible effect runs, the control plane should pause, simulate the effect, and require a decision. Everything upstream is preparation. The gate at the commit step is what actually protects production.
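One way to picture the gate at the execution boundary is a wrapper that owns the only path to the commit step. The sketch below is illustrative, not a reference implementation: `Action`, `gated_commit`, and the `approve` callback are hypothetical names, and the email example stands in for any external effect.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    name: str
    simulate: Callable[[], str]  # describes the effect without performing it
    commit: Callable[[], None]   # performs the real, possibly irreversible effect


def gated_commit(action: Action, approve: Callable[[str, str], bool]) -> bool:
    """Run commit only after the simulated effect has been approved."""
    effect = action.simulate()            # 1. simulate
    if not approve(action.name, effect):  # 2. decide against the simulated effect
        return False                      # gate held: nothing was committed
    action.commit()                       # 3. commit
    return True


# Usage: the same action is denied once, then approved once.
sent: List[str] = []
email = Action(
    name="send_email",
    simulate=lambda: "1 message to ops@example.com",
    commit=lambda: sent.append("ops@example.com"),
)
denied = gated_commit(email, approve=lambda name, effect: False)
allowed = gated_commit(email, approve=lambda name, effect: True)
```

The design choice that matters here is that the agent never holds a direct reference to `commit`; it can only hand an `Action` to the gate.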

Simulate, Then Decide

A dry-run answers a simple question the agent cannot answer for itself: what will actually happen if this runs? For a database action, that is the rows affected. For a deploy, the diff and the blast radius. For an outbound message, the exact recipient and content. The simulation turns an opaque intention into a concrete, inspectable effect.
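For the database case, one possible dry-run is to execute the write inside a transaction, read the affected row count, and roll back. The sketch below uses Python's standard `sqlite3` module; the `dry_run_rows_affected` helper and the `users` table are assumptions made for illustration, and the approach only works where the effect is transactional. An outbound message, for example, cannot be rolled back and needs a different kind of simulation.

```python
import sqlite3


def dry_run_rows_affected(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> int:
    """Execute a write, report how many rows it touched, then roll it back.

    Assumes the connection has no uncommitted work of its own, because the
    rollback discards everything in the open transaction.
    """
    try:
        cur = conn.execute(sql, params)
        return cur.rowcount
    finally:
        conn.rollback()  # nothing from the dry-run is persisted


# Usage: three rows exist; the simulated delete would remove two of them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (active) VALUES (?)", [(1,), (0,), (0,)])
conn.commit()

would_delete = dry_run_rows_affected(conn, "DELETE FROM users WHERE active = ?", (0,))
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

After the call, `would_delete` holds the simulated effect that can be shown to an approver, while the table itself is unchanged.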

With that effect in hand, the gate applies a default. Reversible and low-impact: run, and log it. Irreversible or externally visible: ask a human, with the simulated effect attached so the decision takes seconds, not minutes. Clearly out of policy: block, and record why. The point is that the decision is made against the real effect, not against the agent’s confidence.

Action Class	Reversibility	Default
Read / draft / analyze	Fully reversible	Run, audit-only
Internal write	Reversible with effort	Run with dry-run log
External send / publish	Hard to reverse	Ask a human
Delete / deploy / pay	Irreversible	Ask, with simulated effect
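
The table above can be expressed as a small lookup. This is a minimal sketch: the class names (`read`, `internal_write`, and so on) are hypothetical labels for the table's rows, and blocking unknown action classes is an assumption that follows the fail-closed rule rather than something the table states.

```python
from enum import Enum


class Decision(Enum):
    RUN = "run"
    ASK = "ask"
    BLOCK = "block"


# Mirrors the table: action class -> default decision.
DEFAULTS = {
    "read": Decision.RUN,            # run, audit-only
    "internal_write": Decision.RUN,  # run with dry-run log
    "external_send": Decision.ASK,   # ask a human
    "delete": Decision.ASK,          # ask, with simulated effect
    "deploy": Decision.ASK,
    "pay": Decision.ASK,
}


def default_for(action_class: str, in_policy: bool = True) -> Decision:
    """Return the default decision for an action class."""
    if not in_policy:
        return Decision.BLOCK  # clearly out of policy: block, and record why
    # Assumption: an unclassified action gets no benefit of the doubt.
    return DEFAULTS.get(action_class, Decision.BLOCK)
```

For example, `default_for("pay")` yields `Decision.ASK`, while an action class missing from the mapping yields `Decision.BLOCK`.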

Fail Closed, Always

The most dangerous gate is one that fails open. If the simulation errors, the policy service is unreachable, or the check cannot determine the effect, the safe behavior is to block and escalate, never to wave the action through. A safety check that proceeds on uncertainty gives false confidence, which is worse than no check at all because people trust it. A gate that cannot verify must stop.

This is the same discipline that makes any control trustworthy. The value of the gate is not that it usually works. It is that it never lets an unverified irreversible action through. Build it so that the failure of any component defaults to safety.
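
A fail-closed check can be sketched as a function whose every error path returns "block". The names `fail_closed_decision`, `simulate`, and `check_policy` are hypothetical; the point is only the shape: an exception, an empty simulation, or an unrecognized policy answer never results in "run".

```python
from typing import Callable, Tuple

ALLOWED = {"run", "ask", "block"}


def fail_closed_decision(
    simulate: Callable[[], str],
    check_policy: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (decision, detail). Any failure to verify yields 'block'."""
    try:
        effect = simulate()
        if not effect:
            return "block", "simulation produced no effect description"
        decision = check_policy(effect)
    except Exception as exc:  # simulation error, policy service unreachable, ...
        return "block", f"could not verify: {exc!r}"
    if decision not in ALLOWED:
        return "block", f"unrecognized policy answer: {decision!r}"
    return decision, effect


# Usage: a policy outage blocks; a healthy check passes its answer through.
def unreachable_policy(effect: str) -> str:
    raise ConnectionError("policy service unreachable")


outage = fail_closed_decision(lambda: "delete 2 rows", unreachable_policy)
normal = fail_closed_decision(lambda: "delete 2 rows", lambda effect: "ask")
```

In a real deployment the "block" branch would also escalate to a human, which this sketch leaves out.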

Keep the Human Cost Low

An approval gate only survives contact with reality if asking a human is cheap. Nobody will keep a gate that interrupts constantly with vague requests. So make each ask specific and pre-digested: here is the action, here is the simulated effect, here are the three buttons. Batch low-risk approvals. Learn from decisions so the same benign action stops asking over time. The goal is not maximum friction; it is friction placed at exactly the irreversible steps and nowhere else.
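
Learning from decisions could look like the sketch below: a counter of past approvals per action and simulated effect, with a threshold after which the gate stops asking. `ApprovalMemory` and `AUTO_APPROVE_AFTER` are hypothetical, the threshold of three is arbitrary, and the rule that irreversible actions always ask is an assumption chosen to keep friction at the irreversible steps.

```python
from collections import Counter

AUTO_APPROVE_AFTER = 3  # hypothetical threshold, tune per deployment


class ApprovalMemory:
    """Remembers human approvals so repeated benign actions stop asking."""

    def __init__(self) -> None:
        self._approvals: Counter = Counter()

    def needs_human(self, action: str, effect: str, irreversible: bool) -> bool:
        if irreversible:
            return True  # assumption: the ask on irreversible steps is never learned away
        return self._approvals[(action, effect)] < AUTO_APPROVE_AFTER

    def record(self, action: str, effect: str, approved: bool) -> None:
        key = (action, effect)
        if approved:
            self._approvals[key] += 1
        else:
            self._approvals.pop(key, None)  # a rejection resets earned trust


# Usage: three approvals of the same benign action, then the gate stops asking.
memory = ApprovalMemory()
for _ in range(3):
    memory.record("update_ticket", "set status=closed on 1 ticket", approved=True)

asks_benign = memory.needs_human("update_ticket", "set status=closed on 1 ticket", irreversible=False)
asks_delete = memory.needs_human("delete_rows", "delete 2 rows", irreversible=True)
```

Keying on the simulated effect, not just the action name, means a familiar action with an unfamiliar effect still reaches a human.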

Done well, the gate is nearly invisible in normal operation and decisive in the rare moment that matters. That is the moment you built it for: when the agent is about to take the last inch, and something needs to ask whether it should.

Part of the Skynet “Permission Drift” campaign.

Signed by Skynet.
