Realtime voice · Tutorial · 7 min

Build a Realtime voice agent intake flow

How to design a low-latency voice workflow that captures the right facts and hands off cleanly.

Real workflow example

A service business wants a voice agent to handle inbound project inquiries. The call should capture the customer's problem, location, deadline, budget range, decision maker, and preferred follow-up channel. The first prototype sounds natural but produces inconsistent notes and misses important fields.

The production workflow treats the voice call as intake. The agent can have a friendly tone, but its job is to complete a structured record, confirm uncertain facts, and hand off to a human when the caller asks for pricing, legal commitments, or urgent scheduling.

Implementation approach

Define the call objective before the persona. The objective should list required fields, optional fields, disallowed decisions, escalation triggers, and what counts as a completed call.
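One way to make the objective concrete is to write it down as data before any persona prompt exists. The sketch below is illustrative only: the `CallObjective` type, the field names, and the rule that an escalated call counts as completed are all assumptions, not part of any realtime API.

```typescript
// Hypothetical shape for a call objective; every name here is illustrative.
type CallObjective = {
  requiredFields: string[];
  optionalFields: string[];
  disallowedDecisions: string[];
  escalationTriggers: string[];
};

const intakeObjective: CallObjective = {
  requiredFields: ["problem", "preferredFollowUp"],
  optionalFields: ["location", "deadline", "budgetRange", "decisionMaker"],
  disallowedDecisions: ["quote_price", "make_legal_commitment"],
  escalationTriggers: ["pricing", "legal_commitment", "urgent_scheduling"],
};

// Assumed completion rule: the call was escalated to a human, or every
// required field has a non-empty value.
function isCallCompleted(
  objective: CallObjective,
  captured: Record<string, string | undefined>,
  escalated: boolean,
): boolean {
  if (escalated) return true;
  return objective.requiredFields.every(
    (field) => (captured[field] ?? "").trim() !== "",
  );
}
```

Keeping the objective as data means the same list can drive the agent's instructions, the completion check, and the operator's review screen.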

Use a server endpoint to create the realtime session and return a short-lived client secret to the browser or phone client. Keep business tools behind your server. The voice agent can request actions, but the server must authorize them.
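The "agent requests, server authorizes" rule can be sketched as a small gate that every tool request passes through. This is a minimal sketch under assumptions: the tool names, the `ToolRequest` shape, and the in-memory session set are hypothetical and not tied to any particular realtime provider.

```typescript
// Hypothetical request shape and tool names; adapt to your own server.
type ToolRequest = {
  sessionId: string;
  tool: string;
  args: Record<string, unknown>;
};

type ToolDecision = { allowed: true } | { allowed: false; reason: string };

// Tools the voice agent may ask for. Anything else is rejected.
const allowedTools = new Set(["save_intake_field", "request_human_handoff"]);

// Session IDs the server itself created; populated at session creation time.
const activeSessions = new Set<string>();

function authorizeToolRequest(req: ToolRequest): ToolDecision {
  if (!activeSessions.has(req.sessionId)) {
    return { allowed: false, reason: "unknown_session" };
  }
  if (!allowedTools.has(req.tool)) {
    return { allowed: false, reason: "tool_not_allowed" };
  }
  return { allowed: true };
}
```

In a real deployment the session set would live in a shared store with expiry, but the allowlist-first decision stays the same.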

Store both the transcript and the structured outcome. For intake calls, the durable artifact is not just the audio or chat history but the normalized lead record that an operator can review.
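A rough sketch of what "normalized" might mean in practice: trim captured values, drop empty ones, and keep the transcript next to the result with a review status. The `LeadRecord` and `TranscriptTurn` types and the `pending_review` status are assumptions made for illustration.

```typescript
// Hypothetical record types; field names are illustrative.
type TranscriptTurn = { speaker: "caller" | "agent"; text: string; at: string };

type LeadRecord = {
  callId: string;
  transcript: TranscriptTurn[];
  fields: Record<string, string>;
  reviewStatus: "pending_review" | "approved";
};

function normalizeLead(
  callId: string,
  transcript: TranscriptTurn[],
  rawFields: Record<string, string | undefined>,
): LeadRecord {
  const fields: Record<string, string> = {};
  for (const [key, value] of Object.entries(rawFields)) {
    const trimmed = (value ?? "").trim();
    // Keep only fields that actually carry a value.
    if (trimmed !== "") fields[key] = trimmed;
  }
  // New records always start unreviewed so an operator sees them first.
  return { callId, transcript, fields, reviewStatus: "pending_review" };
}
```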

Code or config snippet

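type IntakeOutcome = {
  customerName?: string;
  company?: string;
  problem: string;
  location?: string;
  deadline?: string;
  budgetSignal?: "unknown" | "low" | "medium" | "high";
  decisionMaker?: string;
  preferredFollowUp?: string;
  missingFields: string[];
  escalationReason?: string;
  recommendedNextAction: "send_followup" | "schedule_call" | "human_review";
};

// Fields that must be captured before the call counts as complete.
const requiredFields = ["problem", "preferredFollowUp"] as const;

function isComplete(outcome: IntakeOutcome): boolean {
  // A field counts only if it has a value and was not flagged as missing.
  return requiredFields.every(
    (field) => Boolean(outcome[field]) && !outcome.missingFields.includes(field),
  );
}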

Mistakes to avoid

Do not optimize the voice personality before defining the business outcome.
Do not let transcripts disappear after the realtime session ends.
Do not rely on the model to remember every required field without a checklist.
Do not let the browser receive long-lived API keys.
Do not treat interruption, silence, and background noise as edge cases; they are normal voice states.
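To illustrate the last point, interruption and silence can be modeled as ordinary transitions in the turn logic rather than as error paths. The state and event names below are hypothetical and do not correspond to any provider's event names; background noise handling is left out of this sketch.

```typescript
// Hypothetical turn states and events for a voice session.
type TurnState = "agent_speaking" | "listening" | "reprompting";
type VoiceEvent =
  | "caller_started_speaking"
  | "caller_stopped_speaking"
  | "agent_started_speaking"
  | "silence_timeout";

function nextTurnState(state: TurnState, event: VoiceEvent): TurnState {
  // Interruption (barge-in): the caller talking always moves us to listening,
  // even if the agent was mid-sentence.
  if (event === "caller_started_speaking") return "listening";
  if (event === "agent_started_speaking") return "agent_speaking";
  // Silence only matters while we are waiting on the caller.
  if (event === "silence_timeout" && state === "listening") return "reprompting";
  // Every other combination leaves the state unchanged.
  return state;
}
```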

Ready checklist

Call objective and required fields are documented.
Client sessions use short-lived credentials.
Transcript events are stored with safe retention rules.
The outcome is extracted into structured fields.
Escalation triggers are tested with realistic calls.
Operators can edit and approve the call summary.

Practical tip

Add a visible missing-fields panel for the operator. It makes the voice agent easier to QA because reviewers can immediately see whether the call captured the facts the business needs.
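The data behind such a panel can be very small. This is a minimal sketch, assuming the captured values arrive as a plain key-value object; `PanelRow` and `buildMissingFieldsPanel` are hypothetical names.

```typescript
// Hypothetical row shape for the operator's missing-fields panel.
type PanelRow = {
  field: string;
  status: "captured" | "missing";
  value?: string;
};

function buildMissingFieldsPanel(
  required: string[],
  captured: Record<string, string | undefined>,
): PanelRow[] {
  return required.map((field): PanelRow => {
    const value = (captured[field] ?? "").trim();
    // Blank or absent values are shown as missing so reviewers spot them.
    return value === ""
      ? { field, status: "missing" }
      : { field, status: "captured", value };
  });
}
```

Rendering one row per required field, in a fixed order, lets a reviewer compare calls at a glance.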

Practical note

Voice agents need operational memory.

A good realtime experience should leave behind a reviewable record, not just a pleasant conversation.
