Quality · Checklist · 5 min

Add evals before expanding your prompt

Why prompt complexity should grow behind examples, regression checks, and clear pass/fail criteria.

Real example

Example: protect tender-risk scoring before adding more instructions

The team wants to add new instructions for deadline risk, missing certificates, and budget fit. The prompt is already long, and each change keeps breaking older cases.

Create a small eval set of 20 real tenders with expected risk labels and required evidence. Run it before and after each prompt change, then accept only changes that improve the target cases without regressing the old ones.

Prompt work becomes engineering work: measured, reviewable, and easier to roll back.

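Fixture-oriented eval idea (TypeScript)
// Assumes a Jest- or Vitest-style runner that provides `test` and `expect`.
test("tender fixtures keep their expected risk labels", async () => {
  for (const fixture of fixtures) {
    const output = await classifyTenderRisk(fixture.input);
    // The label must match, and at least one piece of evidence must be cited.
    expect(output.risk).toBe(fixture.expectedRisk);
    expect(output.evidence.length).toBeGreaterThan(0);
  }
});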
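
The acceptance rule from this example (improve the target cases, regress none of the old ones) can be written down as a small gate. This is a minimal sketch, not the team's actual harness: `EvalResult`, `GateDecision`, `shouldAcceptChange`, and the fixture ids are hypothetical names introduced for illustration.

```typescript
// Hypothetical result of running one fixture against one prompt version.
interface EvalResult {
  fixtureId: string;
  passed: boolean;
}

interface GateDecision {
  accept: boolean;
  regressions: string[]; // passed before the change, fail after it
  improvements: string[]; // target cases that failed before and pass after
}

// Accept a prompt change only if nothing regressed and at least one
// target case improved. Fixtures that appear only in `after` are treated
// as previously failing; fixtures missing from `after` are not inspected.
function shouldAcceptChange(
  before: EvalResult[],
  after: EvalResult[],
  targetIds: string[],
): GateDecision {
  const passedBefore = new Map<string, boolean>();
  for (const result of before) passedBefore.set(result.fixtureId, result.passed);

  const targets = new Set(targetIds);
  const regressions: string[] = [];
  const improvements: string[] = [];

  for (const result of after) {
    const previouslyPassed = passedBefore.get(result.fixtureId) ?? false;
    if (previouslyPassed && !result.passed) {
      regressions.push(result.fixtureId);
    }
    if (!previouslyPassed && result.passed && targets.has(result.fixtureId)) {
      improvements.push(result.fixtureId);
    }
  }

  return {
    accept: regressions.length === 0 && improvements.length > 0,
    regressions,
    improvements,
  };
}

// Made-up ids: one old case stays green, one target case turns green.
const baselineResults: EvalResult[] = [
  { fixtureId: "tender-001", passed: true },
  { fixtureId: "tender-002", passed: false },
];
const candidateResults: EvalResult[] = [
  { fixtureId: "tender-001", passed: true },
  { fixtureId: "tender-002", passed: true },
];
const decision = shouldAcceptChange(baselineResults, candidateResults, ["tender-002"]);
console.log(decision.accept, decision.improvements);
```

Returning the regression and improvement lists alongside the boolean keeps the decision reviewable: a rejected change comes with the ids of the cases it broke.
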
Tutorial path

How to implement it

Step 01: Collect ten to twenty representative inputs from actual product use.
Step 02: Define what a correct answer must include, avoid, and classify.
Step 03: Run the current prompt and store outputs as a baseline.
Step 04: Change one prompt or tool instruction at a time.
Step 05: Compare old and new results before promoting the change.
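
Storing a baseline and comparing against it can be sketched as a snapshot plus a diff. This is a rough illustration under assumptions: `LabeledOutput`, `Baseline`, `recordBaseline`, and `diffAgainstBaseline` are hypothetical names, each output is reduced to its risk label, and persistence is left out (the snapshot could be serialized with `JSON.stringify` and kept next to the prompt).

```typescript
// Hypothetical, reduced view of one model output: just the risk label.
interface LabeledOutput {
  fixtureId: string;
  risk: string;
}

// Baseline snapshot: fixture id -> label produced by the current prompt.
type Baseline = Record<string, string>;

// Run the current prompt elsewhere, then record its labels here.
function recordBaseline(outputs: LabeledOutput[]): Baseline {
  const baseline: Baseline = {};
  for (const output of outputs) baseline[output.fixtureId] = output.risk;
  return baseline;
}

interface LabelChange {
  fixtureId: string;
  before: string | undefined; // undefined when the fixture is new
  after: string;
}

// List every fixture whose label differs from the stored baseline.
function diffAgainstBaseline(
  baseline: Baseline,
  outputs: LabeledOutput[],
): LabelChange[] {
  const changes: LabelChange[] = [];
  for (const output of outputs) {
    const previous: string | undefined = baseline[output.fixtureId];
    if (previous !== output.risk) {
      changes.push({ fixtureId: output.fixtureId, before: previous, after: output.risk });
    }
  }
  return changes;
}

// Made-up labels: the prompt change flips one fixture from "low" to "medium".
const storedBaseline = recordBaseline([
  { fixtureId: "tender-001", risk: "high" },
  { fixtureId: "tender-002", risk: "low" },
]);
const labelChanges = diffAgainstBaseline(storedBaseline, [
  { fixtureId: "tender-001", risk: "high" },
  { fixtureId: "tender-002", risk: "medium" },
]);
console.log(labelChanges);
```

A changed label is not automatically a regression. The diff only shows what moved, so a reviewer can check each change against the expected label before promoting the prompt.
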
Checklist

Ready when these are true

Representative fixtures
Explicit pass/fail criteria
Baseline outputs stored
Single change per eval run
Regression notes written
Field notes

What matters in practice

01. Prompt edits without examples make regressions hard to see.
02. Small eval sets catch obvious failures before they become customer-facing issues.
03. The best evals match real workflow decisions, not generic language quality.
Avoid these mistakes

Common failure modes

01. Do not evaluate only happy paths.
02. Do not use vague pass criteria like "looks good".
03. Do not change prompt, schema, and model at the same time.
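
One way to replace a vague criterion like "looks good" is to encode each expectation as data and report exactly which ones failed. This is a minimal sketch under assumptions: `TenderRiskOutput`, `PassCriteria`, `checkCriteria`, the field names, and the sample evidence text are all hypothetical, and matching is a plain case-insensitive substring check.

```typescript
// Hypothetical output shape, mirroring the `risk` and `evidence` fields
// used in the fixture example.
interface TenderRiskOutput {
  risk: string;
  evidence: string[];
}

// Hypothetical explicit criteria for one fixture.
interface PassCriteria {
  expectedRisk: string;
  mustIncludeEvidence: string[]; // phrases that must appear in the evidence
  mustAvoid: string[]; // phrases that must not appear in the evidence
}

// Return the list of failed criteria; an empty list means the case passes.
function checkCriteria(
  output: TenderRiskOutput,
  criteria: PassCriteria,
): string[] {
  const failures: string[] = [];
  const evidenceText = output.evidence.join("\n").toLowerCase();

  if (output.risk !== criteria.expectedRisk) {
    failures.push(`risk: expected "${criteria.expectedRisk}", got "${output.risk}"`);
  }
  for (const phrase of criteria.mustIncludeEvidence) {
    if (!evidenceText.includes(phrase.toLowerCase())) {
      failures.push(`missing evidence: "${phrase}"`);
    }
  }
  for (const phrase of criteria.mustAvoid) {
    if (evidenceText.includes(phrase.toLowerCase())) {
      failures.push(`forbidden content: "${phrase}"`);
    }
  }
  return failures;
}

// Made-up sample: the label matches and the required evidence is cited.
const sampleFailures = checkCriteria(
  { risk: "high", evidence: ["ISO 9001 certificate is missing from the submission"] },
  { expectedRisk: "high", mustIncludeEvidence: ["certificate"], mustAvoid: ["budget"] },
);
console.log(sampleFailures.length === 0 ? "pass" : sampleFailures);
```

Because each failure is a readable string, a failing fixture explains itself in the eval report, which also makes it practical to include unhappy-path fixtures whose expected label is a rejection or a low-confidence answer.
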
Practical tip
A tiny eval suite with real workflow examples beats a large generic test set.