Handle tool failures without breaking UX
A resilient design pattern for model workflows that depend on APIs, databases, and external services.
Real workflow example
A support assistant reads an account, checks recent invoices, drafts a reply, and optionally creates a refund request. In a demo, every tool call works and the assistant produces a polished answer. In production, the billing API times out, the account lookup returns partial data, or the refund service rejects the action because the user lacks permission.
The right product behavior is not to pretend the tool worked. The assistant should say what it completed, show what could not be verified, and offer the next safe action. For example, it can draft a response using the account profile while marking invoice details as unavailable, or route the case to a human when a write action failed.
This turns failure into a visible state in the workflow instead of a broken chat transcript.
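One way to make that visible state concrete is to record an outcome per step and render the summary from those records. The sketch below is only an illustration: the `StepOutcome` shape, the step names, and the wording of each line are assumptions, not part of any existing API.

```typescript
// Hypothetical outcome record for one step of the support workflow.
type StepOutcome =
  | { step: string; state: "completed" }
  | { step: string; state: "unverified"; reason: string }
  | { step: string; state: "failed"; reason: string; nextAction: string };

// Build a plain-text summary that keeps completed, unverified,
// and failed steps clearly separated.
function summarizeOutcomes(outcomes: StepOutcome[]): string {
  const lines: string[] = [];
  for (const outcome of outcomes) {
    switch (outcome.state) {
      case "completed":
        lines.push(`Done: ${outcome.step}`);
        break;
      case "unverified":
        lines.push(`Could not verify: ${outcome.step} (${outcome.reason})`);
        break;
      case "failed":
        lines.push(
          `Failed: ${outcome.step} (${outcome.reason}). Next step: ${outcome.nextAction}`
        );
        break;
    }
  }
  return lines.join("\n");
}

// Example scenario taken from the workflow described above.
const exampleSummary = summarizeOutcomes([
  { step: "account lookup", state: "completed" },
  { step: "invoice check", state: "unverified", reason: "billing API timed out" },
  {
    step: "refund request",
    state: "failed",
    reason: "permission denied",
    nextAction: "route to a human agent",
  },
]);
console.log(exampleSummary);
```

Because every step carries its own state, the assistant cannot describe an unverified invoice as if it had been checked.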
Implementation approach
Start by classifying every tool by how much damage a failure, or a blind retry, can cause. A read-only lookup can usually fail softly. An idempotent write can be retried safely as long as every attempt carries the same idempotency key. A destructive write or external send should stop the workflow until the user confirms how to recover.
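The classification can live in a small registry that the orchestrator consults before deciding how to react to a failure. This is a minimal sketch: the category names, the policy names, and the tool names in the registry are all hypothetical, and treating the refund request as an idempotent write is an assumption about how that service behaves.

```typescript
// Risk categories from the paragraph above; the identifiers are illustrative.
type ToolRisk = "read_only" | "idempotent_write" | "destructive_write";

// What the workflow does when a tool in that category fails.
type FailurePolicy = "fail_soft" | "retry_with_key" | "stop_and_confirm";

// Hypothetical registry. Tool names are examples, not a real API.
const toolRiskRegistry = new Map<string, ToolRisk>([
  ["getAccount", "read_only"],
  ["getInvoiceSummary", "read_only"],
  ["createRefundRequest", "idempotent_write"], // assumes the service accepts a key
  ["sendCustomerEmail", "destructive_write"],
]);

function failurePolicyFor(toolName: string): FailurePolicy {
  // Unclassified tools get the most cautious policy.
  const risk = toolRiskRegistry.get(toolName) ?? "destructive_write";
  switch (risk) {
    case "read_only":
      return "fail_soft";
    case "idempotent_write":
      return "retry_with_key";
    case "destructive_write":
      return "stop_and_confirm";
  }
}
```

Defaulting unknown tools to the strictest policy means a newly added tool cannot be retried by accident before someone has classified it.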
Return structured errors from tools rather than raw exceptions. The model and UI both need to know whether the failure is retryable, whether partial data is usable, and what message is safe to show the user.
Keep the user interface stateful. A failed tool call should create a workflow event with a status like partial, retryable, blocked, or needs_review. That event should be visible in the operator's queue, not buried in logs.
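A failed call can be turned into such an event with a small mapping function. The sketch below reuses the `ToolFailure` shape from this article's snippet; the `WorkflowEvent` and `FailureContext` types and the mapping rules themselves are assumptions, and a real product may rank the statuses differently.

```typescript
type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean;
  userMessage: string;
};

type WorkflowStatus = "partial" | "retryable" | "blocked" | "needs_review";

// Hypothetical event record shown in the UI queue.
type WorkflowEvent = {
  tool: string;
  status: WorkflowStatus;
  message: string;
  occurredAt: string; // ISO 8601 timestamp
};

// Facts the caller knows about the failed step (assumed inputs).
type FailureContext = { isWrite: boolean; hasPartialData: boolean };

// Assumed mapping rules, checked from most to least restrictive.
function statusFor(failure: ToolFailure, ctx: FailureContext): WorkflowStatus {
  if (failure.code === "unauthorized") return "blocked";
  if (ctx.isWrite) return "needs_review"; // failed writes go to a human
  if (failure.retryable) return "retryable";
  return ctx.hasPartialData ? "partial" : "needs_review";
}

function toWorkflowEvent(
  tool: string,
  failure: ToolFailure,
  ctx: FailureContext
): WorkflowEvent {
  return {
    tool,
    status: statusFor(failure, ctx),
    message: failure.userMessage, // only the user-safe text, never the raw error
    occurredAt: new Date().toISOString(),
  };
}
```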
Code or config snippet
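// Failure shape shared by every tool. `userMessage` must be safe to show to the user.
type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean;
  userMessage: string;
};

// Success shape: `ok` is the discriminant that callers branch on.
type ToolSuccess<T> = {
  ok: true;
  data: T;
};

// InvoiceSummary, billingClient, normalizeInvoice, and isTimeout are assumed
// to be defined elsewhere in the application.
async function getInvoiceSummary(accountId: string): Promise<ToolSuccess<InvoiceSummary> | ToolFailure> {
  try {
    const invoice = await billingClient.invoices.latest(accountId);
    return { ok: true, data: normalizeInvoice(invoice) };
  } catch (error) {
    // Timeouts are transient, so the caller may retry.
    if (isTimeout(error)) {
      return {
        ok: false,
        code: "timeout",
        retryable: true,
        userMessage: "Invoice details are temporarily unavailable.",
      };
    }
    // Anything else is treated as non-retryable; the raw error is never returned.
    return {
      ok: false,
      code: "upstream_error",
      retryable: false,
      userMessage: "Invoice details could not be checked.",
    };
  }
}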
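On the calling side, the `retryable` flag can drive a bounded retry loop for read-only tools. The following is a sketch under assumptions: `withRetry`, `ToolResult`, the attempt limit, and the backoff delays are hypothetical choices, not values from the original workflow.

```typescript
type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean;
  userMessage: string;
};

type ToolSuccess<T> = { ok: true; data: T };

type ToolResult<T> = ToolSuccess<T> | ToolFailure;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry a read-only tool while it keeps reporting retryable failures.
// Writes should not go through this helper without an idempotency key.
async function withRetry<T>(
  call: () => Promise<ToolResult<T>>,
  maxAttempts = 3,
  baseDelayMs = 200
): Promise<ToolResult<T>> {
  let result = await call();
  for (
    let attempt = 1;
    attempt < maxAttempts && !result.ok && result.retryable;
    attempt++
  ) {
    // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
    await sleep(baseDelayMs * 2 ** (attempt - 1));
    result = await call();
  }
  // Either a success, a non-retryable failure, or the last retryable failure.
  return result;
}
```

A call such as `withRetry(() => getInvoiceSummary(accountId))` would then return either the invoice or a structured failure that the UI can show as a retry or partial state.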
Mistakes to avoid
- Throwing raw tool errors into the model context and hoping it explains them well.
- Retrying writes without an idempotency key.
- Showing a generic "Something went wrong" message when the system knows which step failed.
- Allowing partial data to trigger irreversible actions.
- Logging full customer payloads just to debug a flaky tool.
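The idempotency-key point is easiest to see with a write whose response gets lost. In this sketch, `InMemoryRefundService` is a hypothetical stand-in for a refund service that deduplicates by key, and deriving the key from the case and invoice IDs is one possible scheme, assumed here for illustration.

```typescript
type RefundRequest = { caseId: string; invoiceId: string; amountCents: number };

// Hypothetical stand-in for a refund service that deduplicates by key.
class InMemoryRefundService {
  private readonly created = new Map<string, { refundId: string; request: RefundRequest }>();
  private failuresLeft: number;

  constructor(failuresBeforeSuccess = 0) {
    this.failuresLeft = failuresBeforeSuccess;
  }

  // Simulates a lost response: the refund is recorded first,
  // then the caller may still see an error.
  create(key: string, request: RefundRequest): string {
    let record = this.created.get(key);
    if (record === undefined) {
      record = { refundId: `refund_${this.created.size + 1}`, request };
      this.created.set(key, record);
    }
    if (this.failuresLeft > 0) {
      this.failuresLeft--;
      throw new Error("timeout");
    }
    return record.refundId;
  }

  count(): number {
    return this.created.size;
  }
}

function createRefundWithRetry(
  service: InMemoryRefundService,
  request: RefundRequest,
  maxAttempts = 3
): string | undefined {
  // One key per logical operation, reused unchanged on every attempt.
  const key = `refund:${request.caseId}:${request.invoiceId}`;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return service.create(key, request);
    } catch {
      // Retry with the same key so the service can deduplicate.
    }
  }
  return undefined; // caller should surface this as a needs_review state
}
```

Without the shared key, each retried attempt in this scenario would have created a separate refund.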
Ready checklist
- Each tool has a risk category.
- Tool responses use a typed success or failure shape.
- Retryable and non-retryable failures are distinct.
- The UI shows partial, blocked, and retry states.
- Irreversible actions stop when required tool data is missing.
- Operators can see the failed step and safe recovery action.
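The checklist item about irreversible actions can be enforced with a gate that runs before any write. This is a minimal sketch: `gateIrreversibleAction`, the `Gate` type, and the tool names in the usage example are hypothetical.

```typescript
type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean;
  userMessage: string;
};

type ToolSuccess<T> = { ok: true; data: T };

type ToolResult<T> = ToolSuccess<T> | ToolFailure;

type RequiredResult = { name: string; result: ToolResult<unknown> };

type Gate = { allowed: true } | { allowed: false; missing: string[] };

// Allow an irreversible action only when every required tool call succeeded.
function gateIrreversibleAction(required: RequiredResult[]): Gate {
  const missing = required
    .filter((entry) => !entry.result.ok)
    .map((entry) => entry.name);
  if (missing.length === 0) {
    return { allowed: true };
  }
  // The list of missing inputs tells operators which step to recover.
  return { allowed: false, missing };
}

// Example: the account loaded, but the invoice lookup timed out.
const refundGate = gateIrreversibleAction([
  { name: "getAccount", result: { ok: true, data: { id: "acct_1" } } },
  {
    name: "getInvoiceSummary",
    result: {
      ok: false,
      code: "timeout",
      retryable: true,
      userMessage: "Invoice details are temporarily unavailable.",
    },
  },
]);
console.log(refundGate);
```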
