Agent safety | Checklist | 6 min

Add guardrails to multi-step agents

How to constrain agent behavior around data access, tool use, user intent, and final actions.

Real workflow example

A multi-step agent drafts a proposal response and has access to email tooling. Sending the draft without review would create business risk.

Classify the email tool as external-send and require human approval. The agent can prepare the draft and checklist, but the send action remains disabled until a reviewer approves.

The agent accelerates work without crossing the boundary from recommendation to irreversible action.
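One way to picture the approval gate in this example is the minimal sketch below. The names `ToolSpec`, `ReviewApproval`, `checkSendGate`, and `send_email` are hypothetical and used only for illustration; the sketch assumes approvals are stored per draft and written by a human reviewer, not by the agent.

```typescript
// Hypothetical risk classes; "external-send" mirrors the email example above.
type ToolRisk = "read-only" | "internal-write" | "external-send";

interface ToolSpec {
  toolName: string;
  risk: ToolRisk;
}

// Hypothetical approval record, assumed to be written by a human reviewer.
interface ReviewApproval {
  draftId: string;
  reviewerId: string;
}

type GateResult =
  | { allowed: true }
  | { allowed: false; reason: string };

// Server-side gate: an external-send tool stays disabled until an approval
// exists for the exact draft being sent. Other risk classes pass this gate.
function checkSendGate(
  tool: ToolSpec,
  draftId: string,
  approvals: ReviewApproval[],
): GateResult {
  if (tool.risk !== "external-send") {
    return { allowed: true };
  }
  const approved = approvals.some((a) => a.draftId === draftId);
  if (approved) {
    return { allowed: true };
  }
  return {
    allowed: false,
    reason: `Draft ${draftId} needs reviewer approval before sending.`,
  };
}

// Illustrative tool entry for the email tooling in the example.
const emailTool: ToolSpec = { toolName: "send_email", risk: "external-send" };
```

Keying the approval to a specific draft means an approval for one draft does not unlock sending a different one.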

Implementation approach

This guide is anchored in the OpenAI tools guide. Treat the documented API behavior as the boundary, then design the surrounding product state so the feature can be reviewed, retried, and improved.

Classify incoming requests as allowed, needs clarification, restricted, or unsafe.
Validate every tool call against user permissions and workflow state.
Require approval before destructive, financial, external-send, or production-write actions.
Validate final outputs for format, evidence, and disallowed claims.
Route blocked work to a human path with enough context to continue.
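The classification, tool-authorization, and output-validation steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the keyword rules, the `toolPolicy` table, the permission strings, and the disallowed phrases are placeholders, and a real system would replace them with its own policy.

```typescript
type RequestClass = "allowed" | "needs_clarification" | "restricted" | "unsafe";

// Placeholder keyword rules (assumed); a real classifier would be policy-specific.
function classifyRequest(text: string): RequestClass {
  const lower = text.toLowerCase().trim();
  if (lower.length === 0) return "needs_clarification";
  if (lower.includes("delete all")) return "unsafe";
  if (lower.includes("payroll")) return "restricted";
  return "allowed";
}

type WorkflowStatus = "draft" | "needs_review" | "approved" | "blocked";

interface ToolCallRequest {
  tool: string;
  userPermissions: string[];
  workflowStatus: WorkflowStatus;
}

interface ToolPolicy {
  permission: string;         // Permission the calling user must hold.
  statuses: WorkflowStatus[]; // Workflow states in which the tool may run.
}

// Hypothetical policy table; tool names and permission strings are made up.
const toolPolicy = new Map<string, ToolPolicy>([
  ["search_docs", { permission: "docs:read", statuses: ["draft", "needs_review", "approved"] }],
  ["send_email", { permission: "email:send", statuses: ["approved"] }],
]);

// Checks both user permissions and workflow state; unknown tools are denied.
function authorizeToolCall(call: ToolCallRequest): boolean {
  const policy = toolPolicy.get(call.tool);
  if (policy === undefined) return false;
  return (
    call.userPermissions.includes(policy.permission) &&
    policy.statuses.includes(call.workflowStatus)
  );
}

interface DraftOutput {
  body: string;
  evidence: Array<{ source: string; summary: string }>;
}

// Assumed examples of claims the product does not allow in final output.
const disallowedClaims = ["guaranteed", "risk-free"];

// Returns a list of problems; an empty list means the output passed.
function validateOutput(output: DraftOutput): string[] {
  const problems: string[] = [];
  if (output.body.trim().length === 0) problems.push("empty body");
  if (output.evidence.length === 0) problems.push("no evidence attached");
  const lowerBody = output.body.toLowerCase();
  for (const phrase of disallowedClaims) {
    if (lowerBody.includes(phrase)) problems.push(`disallowed claim: ${phrase}`);
  }
  return problems;
}
```

Returning a list of problems from `validateOutput`, rather than a boolean, gives the human path the context it needs to continue the work.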

Code or config snippet

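// Workflow state tracked for each unit of agent work.
type guardrails_for_multi_step_agents_workflow_state = {
  sourceId: string; // Identifier of the source item this work belongs to.
  status: "draft" | "needs_review" | "approved" | "blocked"; // Review stage.
  evidence: Array<{ source: string; summary: string }>; // Support for the output.
  nextAction: string; // The step to take next.
};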
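A state type like this is easier to trust when status changes go through a single function. The sketch below is one possible shape, not part of the original snippet: the `allowedTransitions` table is an assumption, chosen so that work cannot jump from draft straight to approved.

```typescript
type WorkflowStatus = "draft" | "needs_review" | "approved" | "blocked";

// Assumed transition table: which statuses may follow each status.
const allowedTransitions: Record<WorkflowStatus, WorkflowStatus[]> = {
  draft: ["needs_review", "blocked"],
  needs_review: ["approved", "draft", "blocked"],
  approved: [],
  blocked: ["draft"],
};

// Returns the next status, or throws if the move is not in the table.
function transition(current: WorkflowStatus, next: WorkflowStatus): WorkflowStatus {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```

With this table, `transition("draft", "approved")` throws, so approval can only be reached through `needs_review`.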

Field notes

Guardrails are most useful at boundaries: input, tool call, output, and final action.
The system should reject unsafe actions before a tool executes, not after.
A good fallback is explicit and useful, not a generic apology.
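As a rough illustration of the last two notes, the wrapper below runs a check before the tool and returns an explicit fallback when the check fails. The names `guardedExecute`, `BlockedResult`, and `safeNextStep` are hypothetical; the sketch assumes the check is synchronous and has everything it needs to decide.

```typescript
// Explicit fallback: says why the action stopped and what to do next.
interface BlockedResult {
  kind: "blocked";
  reason: string;
  safeNextStep: string;
}

interface ExecutedResult<T> {
  kind: "executed";
  value: T;
}

// Runs the check first; the executor is only invoked when the check passes,
// so a rejected action never reaches the tool.
function guardedExecute<T>(
  check: () => BlockedResult | null,
  execute: () => T,
): BlockedResult | ExecutedResult<T> {
  const blocked = check();
  if (blocked !== null) {
    return blocked;
  }
  return { kind: "executed", value: execute() };
}
```

A caller can switch on `kind` and show `reason` and `safeNextStep` to the user instead of a generic apology.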

Mistakes to avoid

Do not put approval rules only in prompt text.
Do not give agents destructive tools without server-side gates.
Do not make blocked actions disappear; explain the safe next step.

Ready checklist

Input classification
Tool authorization
Approval gates
Output validation
Human fallback path

Practical note

Guardrails belong in code and product state. Prompts can describe them, but code must enforce them.

Treat this as an implementation constraint, not just advice. The interface, server code, and validation path should all enforce the same behavior.
