
Add evals before expanding your prompt

Why prompt complexity should grow behind examples, regression checks, and clear pass/fail criteria.

Real workflow example

A proposal tool starts with a simple prompt: read the customer's request and draft a response outline. After a few successful demos, the team adds more instructions: detect budget risk, flag missing legal terms, classify the industry, write a warmer intro, and recommend follow-up questions.

The prompt now feels smarter, but nobody can tell whether the latest edit made deadline detection worse. Sales notices that good opportunities are sometimes labeled as low fit, while product notices that legal warnings are inconsistent.

The fix is not another paragraph of prompt instructions. The fix is a small eval set that represents the actual workflow: normal requests, incomplete requests, rushed deadlines, regulated industries, high-budget leads, low-budget leads, and noisy copy-pasted emails. Each prompt change must beat the baseline on these examples before it ships.

Implementation approach

Start with the user decision, not the model behavior. For a proposal intake flow, the decision might be "send to sales", "ask for missing details", "route to legal review", or "reject as poor fit". Then collect examples for each decision.

Keep the first eval set small. Ten to twenty examples are enough to catch obvious regressions and force clear expectations. Store the input, expected decision, required evidence, fields that must be present, and content that must not appear.
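As a rough sketch of what stored fixtures might look like, the block below uses a hypothetical `Fixture` shape that holds the five items listed above. The sample inputs, phrases, and field names such as `followUpQuestions` and `legalFlags` are invented for illustration, not taken from a real proposal tool. The small helper at the end reports which decisions have no example yet.

```typescript
// Hypothetical fixture shape: one property per item the paragraph above says to store.
type Decision = "sales_review" | "needs_context" | "legal_review" | "poor_fit";

type Fixture = {
  id: string;               // named after the behavior the case protects
  input: string;            // raw customer request
  expectedDecision: Decision;
  mustInclude: string[];    // required evidence phrases
  requiredFields: string[]; // output fields that must be present (assumed names)
  mustAvoid: string[];      // content that must not appear
};

// Invented sample fixtures, for illustration only.
const fixtures: Fixture[] = [
  {
    id: "rush_deadline_without_budget",
    input: "We need a proposal by Friday. Budget is not decided yet.",
    expectedDecision: "needs_context",
    mustInclude: ["friday", "budget"],
    requiredFields: ["followUpQuestions"],
    mustAvoid: ["guaranteed delivery"],
  },
  {
    id: "legal_terms_in_healthcare_request",
    input: "We are a hospital group and need patient data handled under a signed BAA.",
    expectedDecision: "legal_review",
    mustInclude: ["baa"],
    requiredFields: ["legalFlags"],
    mustAvoid: ["no legal review needed"],
  },
];

// Cheap coverage check: returns every decision that has no fixture yet.
function uncoveredDecisions(cases: Fixture[], all: Decision[]): Decision[] {
  const covered = new Set(cases.map((c) => c.expectedDecision));
  return all.filter((d) => !covered.has(d));
}

const allDecisions: Decision[] = ["sales_review", "needs_context", "legal_review", "poor_fit"];

// With the two fixtures above, this logs the decisions still missing an example:
// sales_review and poor_fit.
console.log(uncoveredDecisions(fixtures, allDecisions));
```

Running a check like this before adding more prompt instructions makes gaps in the eval set visible early.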

Run the current prompt against the fixtures and save those outputs as the baseline. Then change only one thing at a time: one system instruction, one schema field, one tool description, or one retrieved context block. If more than one thing changes at once, you will not know which change improved or broke the result.
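One way to compare a candidate run against the saved baseline is sketched below. It assumes each run has already been reduced to a pass/fail result per fixture; the `CaseResult` and `Comparison` types and the `compareToBaseline` name are hypothetical. The shipping rule it encodes (no regressions and at least one improvement) is one possible reading of "beat the baseline", so adjust it to your own release policy.

```typescript
// Hypothetical per-case result: whatever your scorer returns, reduced to pass/fail.
type CaseResult = { id: string; passed: boolean };

type Comparison = {
  regressions: string[];  // passed on the baseline, fail on the candidate
  improvements: string[]; // failed on the baseline, pass on the candidate
  shippable: boolean;
};

function compareToBaseline(baseline: CaseResult[], candidate: CaseResult[]): Comparison {
  // Index the candidate results by fixture id for quick lookup.
  const candidateById = new Map<string, boolean>();
  for (const result of candidate) {
    candidateById.set(result.id, result.passed);
  }

  const regressions: string[] = [];
  const improvements: string[] = [];

  for (const base of baseline) {
    // A fixture missing from the candidate run is treated as a failure.
    const nowPassed = candidateById.get(base.id) ?? false;
    if (base.passed && !nowPassed) regressions.push(base.id);
    if (!base.passed && nowPassed) improvements.push(base.id);
  }

  // Assumed rule: ship only with zero regressions and at least one improvement.
  const shippable = regressions.length === 0 && improvements.length > 0;
  return { regressions, improvements, shippable };
}

// Example: the candidate fixes one case but breaks another, so it should not ship.
const baselineRun: CaseResult[] = [
  { id: "rush_deadline_without_budget", passed: true },
  { id: "legal_terms_in_healthcare_request", passed: false },
];
const candidateRun: CaseResult[] = [
  { id: "rush_deadline_without_budget", passed: false },
  { id: "legal_terms_in_healthcare_request", passed: true },
];
const comparison = compareToBaseline(baselineRun, candidateRun);
console.log(comparison.shippable); // false
```

Listing regressions by fixture id, rather than only reporting a total score, keeps the single change under test tied to the specific behavior it broke.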

Code or config snippet

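// One eval fixture: an input plus everything the output is graded against.
type EvalCase = {
  id: string;
  input: string;
  expectedDecision: "sales_review" | "needs_context" | "legal_review" | "poor_fit";
  mustInclude: string[]; // evidence phrases that must appear
  requiredFields?: string[]; // assumption: names of output fields that must be present
  mustAvoid: string[]; // phrases that must not appear anywhere in the output
};

// Assumed shape of the model's structured output. This type is illustrative:
// replace it with your own schema.
type ProposalDecision = {
  decision: EvalCase["expectedDecision"];
  evidence: string[];
  [field: string]: unknown;
};

function scoreProposalEval(output: ProposalDecision, fixture: EvalCase) {
  const failures: string[] = [];
  // Normalize once instead of on every loop iteration.
  const evidenceText = output.evidence.join(" ").toLowerCase();
  const outputText = JSON.stringify(output).toLowerCase();

  if (output.decision !== fixture.expectedDecision) {
    failures.push(`expected ${fixture.expectedDecision}, received ${output.decision}`);
  }

  for (const phrase of fixture.mustInclude) {
    if (!evidenceText.includes(phrase.toLowerCase())) {
      failures.push(`missing evidence: ${phrase}`);
    }
  }

  for (const field of fixture.requiredFields ?? []) {
    const value = output[field];
    if (value === undefined || value === null || value === "") {
      failures.push(`missing field: ${field}`);
    }
  }

  // The serialized output includes key names, so avoid phrases that match a key.
  for (const phrase of fixture.mustAvoid) {
    if (outputText.includes(phrase.toLowerCase())) {
      failures.push(`included forbidden phrase: ${phrase}`);
    }
  }

  return { passed: failures.length === 0, failures };
}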

Mistakes to avoid

Do not grade only writing quality when the product needs a decision.
Do not use perfect examples that never appear in production.
Do not keep evals only in a spreadsheet that the code cannot run.
Do not accept a prompt change because one impressive sample improved.
Do not merge prompt, retrieval, schema, and tool changes in the same eval run.

Ready checklist

Representative fixtures exist for normal and edge-case inputs.
Every fixture has a clear expected outcome.
The baseline output is stored before the prompt change.
Scoring checks decision, evidence, missing fields, and forbidden behavior.
Failed cases produce readable notes for the next prompt edit.
The eval command can run in CI or before release.
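The last two checklist items can be sketched as a small summary step. The block below assumes each fixture has already been scored into a pass/fail result with a list of failure notes; the `ScoredCase` type and the `summarizeRun` name are hypothetical. It deliberately leaves out the model call, so it only shows how readable notes and a CI-friendly exit code might be produced.

```typescript
// Hypothetical scored result for one fixture.
type ScoredCase = { id: string; passed: boolean; failures: string[] };

// Turns scored cases into a readable report plus an exit code for CI.
function summarizeRun(results: ScoredCase[]): { exitCode: 0 | 1; report: string } {
  const lines: string[] = [];
  let failed = 0;

  for (const result of results) {
    if (result.passed) {
      lines.push(`PASS ${result.id}`);
      continue;
    }
    failed += 1;
    lines.push(`FAIL ${result.id}`);
    // One indented note per failure, so the next prompt edit has a concrete target.
    for (const note of result.failures) {
      lines.push(`  - ${note}`);
    }
  }

  lines.push(`${results.length - failed}/${results.length} passed`);
  return { exitCode: failed === 0 ? 0 : 1, report: lines.join("\n") };
}

// Example with one passing and one failing case (invented data).
const summary = summarizeRun([
  { id: "rush_deadline_without_budget", passed: true, failures: [] },
  {
    id: "legal_terms_in_healthcare_request",
    passed: false,
    failures: ["expected legal_review, received sales_review"],
  },
]);
console.log(summary.report);
```

In a Node.js script, assigning `summary.exitCode` to `process.exitCode` is one way to make the CI job fail when any fixture fails.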

Practical tip

Name eval cases after the behavior they protect, such as rush_deadline_without_budget or legal_terms_in_healthcare_request. Useful names make failures faster to understand than numeric IDs alone.
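If you want to enforce the naming habit mechanically, a small check along these lines is one option. The convention it encodes (lowercase snake_case with at least two words, and no duplicate ids) is an assumption for illustration. A pattern cannot judge whether a name is truly descriptive, so it only catches the obvious cases such as bare numbers.

```typescript
// Assumed convention: lowercase snake_case with at least two words.
const BEHAVIOR_NAME = /^[a-z]+(_[a-z0-9]+)+$/;

// Returns one readable problem per id that breaks the convention or repeats.
function findUnclearIds(ids: string[]): string[] {
  const seen = new Set<string>();
  const problems: string[] = [];

  for (const id of ids) {
    if (!BEHAVIOR_NAME.test(id)) {
      problems.push(`${id}: not a descriptive snake_case name`);
    }
    if (seen.has(id)) {
      problems.push(`${id}: duplicate id`);
    }
    seen.add(id);
  }

  return problems;
}

// Example: the numeric id and the repeated id are both flagged.
console.log(
  findUnclearIds([
    "rush_deadline_without_budget",
    "legal_terms_in_healthcare_request",
    "17",
    "rush_deadline_without_budget",
  ]),
);
```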

Practical note

Treat evals as product coverage, not model research.

A small workflow-specific eval suite is more useful than a broad generic benchmark when you are protecting a customer-facing feature.
