Reliability · Guide · 5 min

Handle tool failures without breaking UX

A resilient design pattern for model workflows that depend on APIs, databases, and external services.

Real workflow example

A support assistant reads an account, checks recent invoices, drafts a reply, and optionally creates a refund request. In a demo, every tool call works and the assistant produces a polished answer. In production, the billing API times out, the account lookup returns partial data, or the refund service rejects the action because the user lacks permission.

The right product behavior is not to pretend the tool worked. The assistant should say what it completed, show what could not be verified, and offer the next safe action. For example, it can draft a response using the account profile while marking invoice details as unavailable, or route the case to a human when a write action failed.

This turns failure into a visible state in the workflow instead of a broken chat transcript.

Implementation approach

Start by classifying every tool by what a failure, or a blind retry, could cost. A read-only lookup can usually fail softly. An idempotent write can be retried with an idempotency key. A destructive write or external send should stop the workflow until the user confirms how to recover.
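One way to make this classification explicit is to have every tool declare its risk and derive the failure handling from it. The sketch below is only an illustration: the names ToolRisk, FailurePolicy, ToolDefinition, and failurePolicy, and the example tools and their categories, are assumptions rather than part of any library.

```typescript
// Hypothetical risk categories mirroring the four cases described above.
type ToolRisk = "read_only" | "idempotent_write" | "destructive_write" | "external_send";

// What the workflow does when a tool in that category fails.
type FailurePolicy = "fail_soft" | "retry_with_key" | "stop_for_confirmation";

type ToolDefinition = { name: string; risk: ToolRisk };

function failurePolicy(tool: ToolDefinition): FailurePolicy {
  switch (tool.risk) {
    case "read_only":
      return "fail_soft";
    case "idempotent_write":
      return "retry_with_key";
    case "destructive_write":
    case "external_send":
      return "stop_for_confirmation";
  }
}

// Illustrative tools for the support assistant; names and categories are assumptions.
const tools: ToolDefinition[] = [
  { name: "getAccount", risk: "read_only" },
  { name: "createRefundRequest", risk: "idempotent_write" },
  { name: "sendReply", risk: "external_send" },
];

const policies = tools.map((tool) => `${tool.name}: ${failurePolicy(tool)}`);
console.log(policies.join("\n"));
```

Because the switch covers every member of ToolRisk, adding a new category later forces a compile-time decision about how its failures are handled.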

Return structured errors from tools rather than raw exceptions. The model and UI both need to know whether the failure is retryable, whether partial data is usable, and what message is safe to show the user.

Keep the user interface stateful. A failed tool call should create a workflow event with a status like partial, retryable, blocked, or needs_review. That event should be visible to operators in their work queue, not buried in logs.
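As a rough sketch, a failed tool call could be turned into a workflow event like this. The ToolFailure shape is repeated from the snippet in the next section so the example is self-contained. WorkflowEvent, StepContext, toWorkflowEvent, and the mapping rules are assumptions chosen for illustration. The sketch also assumes a tool only sets retryable to true when a retry is safe.

```typescript
type WorkflowStatus = "partial" | "retryable" | "blocked" | "needs_review";

type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean;
  userMessage: string;
};

type WorkflowEvent = {
  step: string;
  status: WorkflowStatus;
  userMessage: string;
};

// Hypothetical context the caller knows about the step that failed.
type StepContext = { isWrite: boolean; hasUsablePartialData: boolean };

function toWorkflowEvent(step: string, failure: ToolFailure, context: StepContext): WorkflowEvent {
  let status: WorkflowStatus;
  if (failure.retryable) {
    status = "retryable"; // safe to try again
  } else if (context.isWrite) {
    status = "needs_review"; // a failed write is routed to a human
  } else if (context.hasUsablePartialData) {
    status = "partial"; // continue, with the gap marked
  } else {
    status = "blocked"; // nothing safe to do automatically
  }
  return { step, status, userMessage: failure.userMessage };
}

const refundDenied: ToolFailure = {
  ok: false,
  code: "unauthorized",
  retryable: false,
  userMessage: "The refund request could not be created.",
};

console.log(toWorkflowEvent("createRefundRequest", refundDenied, { isWrite: true, hasUsablePartialData: false }));
```

Other mappings are equally reasonable; the point is that the status is computed in one place and stored with the event rather than inferred from log lines.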

Code or config snippet

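// Failure shape returned by every tool instead of a thrown exception.
type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean; // whether the caller may safely try again
  userMessage: string; // text that is safe to show the user
};

type ToolSuccess<T> = {
  ok: true;
  data: T;
};

// Discriminated union: callers narrow on `ok` before touching `data`.
type ToolResult<T> = ToolSuccess<T> | ToolFailure;

// InvoiceSummary, billingClient, normalizeInvoice, and isTimeout are assumed
// to be defined elsewhere in the application.
async function getInvoiceSummary(accountId: string): Promise<ToolResult<InvoiceSummary>> {
  try {
    const invoice = await billingClient.invoices.latest(accountId);
    return { ok: true, data: normalizeInvoice(invoice) };
  } catch (error) {
    // A timeout is transient, so the caller may retry.
    if (isTimeout(error)) {
      return {
        ok: false,
        code: "timeout",
        retryable: true,
        userMessage: "Invoice details are temporarily unavailable.",
      };
    }

    // Anything else is treated as non-retryable; the raw error stays out of the result.
    return {
      ok: false,
      code: "upstream_error",
      retryable: false,
      userMessage: "Invoice details could not be checked.",
    };
  }
}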
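
On the calling side, the typed result lets the reply builder mark missing data instead of guessing. The following is a minimal sketch: the InvoiceSummary fields, the invoiceLine helper, and the "[Unverified]" label are assumptions, since the real summary shape is not shown in this guide.

```typescript
type ToolFailure = {
  ok: false;
  code: "timeout" | "rate_limited" | "not_found" | "unauthorized" | "upstream_error";
  retryable: boolean;
  userMessage: string;
};

type ToolSuccess<T> = { ok: true; data: T };

type ToolResult<T> = ToolSuccess<T> | ToolFailure;

// Hypothetical shape used only for this example.
type InvoiceSummary = { amountDue: number; currency: string; dueDate: string };

// Builds the invoice sentence for a drafted reply, or a clearly marked gap.
function invoiceLine(result: ToolResult<InvoiceSummary>): string {
  if (result.ok) {
    const { amountDue, currency, dueDate } = result.data;
    return `Latest invoice: ${amountDue.toFixed(2)} ${currency}, due ${dueDate}.`;
  }
  // Only the user-safe message is surfaced, never the raw error.
  const retryHint = result.retryable ? " We will check again shortly." : "";
  return `[Unverified] ${result.userMessage}${retryHint}`;
}

const timedOut: ToolResult<InvoiceSummary> = {
  ok: false,
  code: "timeout",
  retryable: true,
  userMessage: "Invoice details are temporarily unavailable.",
};

console.log(invoiceLine(timedOut));
```

Narrowing on ok means the compiler rejects any attempt to read data from a failed result, which keeps the "pretend the tool worked" path from compiling at all.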

Mistakes to avoid

Throwing raw tool errors into the model context and hoping it explains them well.
Retrying writes without an idempotency key.
Showing a generic "Something went wrong" message when the system knows which step failed.
Allowing partial data to trigger irreversible actions.
Logging full customer payloads just to debug a flaky tool.
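
To illustrate the second item in the list above, here is a small sketch of a write retry that reuses one idempotency key across attempts. The send callback stands in for a hypothetical write tool that accepts such a key; retryWrite and WriteResult are illustrative names. Backoff between attempts is left out for brevity.

```typescript
type WriteResult = { ok: true } | { ok: false; retryable: boolean };

// Retries a write, passing the SAME idempotency key on every attempt so the
// downstream service can deduplicate repeated requests.
async function retryWrite(
  send: (idempotencyKey: string) => Promise<WriteResult>,
  idempotencyKey: string,
  maxAttempts: number = 3
): Promise<WriteResult> {
  let last: WriteResult = { ok: false, retryable: false };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await send(idempotencyKey);
    // Stop on success, or on a failure the tool says must not be retried.
    if (last.ok || !last.retryable) {
      return last;
    }
  }
  return last; // still failing after maxAttempts
}
```

The key should be derived once per intended action (for example, per refund request) and stored with the workflow, so a retry after a crash or timeout still sends the same value.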

Ready checklist

Each tool has a risk category.
Tool responses use a typed success or failure shape.
Retryable and non-retryable failures are distinct.
The UI shows partial, blocked, and retry states.
Irreversible actions stop when required tool data is missing.
Operators can see the failed step and safe recovery action.

Practical note

Design the failed state before polishing the happy path. If the product still feels usable when the most important tool is down, the normal workflow will usually be much easier to support.

Treat this as an implementation constraint, not just advice. The interface, server code, and validation path should all enforce the same behavior.
