
Build a moderation pass for user-generated content

How to add a moderation checkpoint before AI-assisted workflows publish, send, or store risky content.

Real workflow example

A customer success product lets users draft public knowledge-base answers with AI help. The user can paste customer text, ask for a rewrite, and publish the result. That workflow handles user content, model-generated content, and public publishing in one path.

A moderation pass protects the irreversible step. Before publishing, the app checks both the source text and the generated answer, blocks severe issues, routes ambiguous cases to review, and explains what the user can edit. The moderation pass is not an invisible background filter. It is a product checkpoint with states the user can understand.
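Because the source text and the generated answer are checked separately, the two results need to be combined into one decision. The sketch below shows one possible rule: the more restrictive outcome wins. The `combineOutcomes` helper and the severity ordering (in particular, ranking `needs_review` above `request_edits`) are assumptions for illustration, not something the workflow prescribes.

```typescript
// Outcomes ordered from least to most restrictive (assumed ordering).
const OUTCOMES = ["allow", "request_edits", "needs_review", "block"] as const;
type Outcome = (typeof OUTCOMES)[number];

// Hypothetical helper: returns whichever outcome is more restrictive.
function combineOutcomes(source: Outcome, generated: Outcome): Outcome {
  return OUTCOMES.indexOf(source) >= OUTCOMES.indexOf(generated)
    ? source
    : generated;
}
```

With this rule, a clean source text cannot mask a problem in the generated answer, and the reverse also holds.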

Implementation approach

Find the moments where content leaves the private workspace: publishing, emailing, notifying, exporting, or making content visible to another account. Put the moderation checkpoint immediately before those actions.
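One way to keep the checkpoint immediately before each outbound action is to route every such action through a single wrapper. The following is a minimal sketch under that assumption; `guardOutboundAction`, `GuardResult`, and the `moderate` and `action` callbacks are hypothetical names, and a real app would pass in its own moderation call and publish, send, or export function.

```typescript
type Outcome = "allow" | "request_edits" | "needs_review" | "block";

type GuardResult<T> =
  | { status: "done"; value: T }
  | { status: "held"; outcome: Exclude<Outcome, "allow"> };

// Hypothetical wrapper: runs `moderate` first and calls `action` only on "allow".
async function guardOutboundAction<T>(
  content: string,
  moderate: (content: string) => Promise<Outcome>,
  action: (content: string) => Promise<T>,
): Promise<GuardResult<T>> {
  const outcome = await moderate(content);
  if (outcome !== "allow") {
    // The outbound action never runs for held content.
    return { status: "held", outcome };
  }
  return { status: "done", value: await action(content) };
}
```

Publishing, emailing, and exporting can then share one code path, which makes it harder to add a new outbound action that skips moderation.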

Define policy outcomes that map to product behavior. For example: allow, request_edits, needs_review, and block. Each outcome should have a safe user message and an operator reason.
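The mapping from outcomes to product behavior can live in one lookup table so the interface and the server read the same values. This is an illustrative sketch: the `OutcomeBehavior` shape, the `nextAction` values, and all message copy are placeholders, and whether a blocked article can still be edited is a product decision that this example assumes is "no".

```typescript
type Outcome = "allow" | "request_edits" | "needs_review" | "block";

interface OutcomeBehavior {
  nextAction: "publish" | "edit" | "wait_for_review" | "none";
  userMessage: string;    // safe to show to the user
  operatorReason: string; // internal only, for reviewers and audit
}

// Placeholder copy; every outcome must define both messages.
const OUTCOME_BEHAVIOR: Record<Outcome, OutcomeBehavior> = {
  allow: {
    nextAction: "publish",
    userMessage: "Your answer is ready to publish.",
    operatorReason: "No policy category matched.",
  },
  request_edits: {
    nextAction: "edit",
    userMessage: "Please edit the highlighted sections before publishing.",
    operatorReason: "Fixable issue detected; user can self-correct.",
  },
  needs_review: {
    nextAction: "wait_for_review",
    userMessage: "This answer was sent to our team for review.",
    operatorReason: "Ambiguous classification; requires manual review.",
  },
  block: {
    nextAction: "none",
    userMessage: "This answer cannot be published.",
    operatorReason: "Severe policy category matched.",
  },
};
```

Typing the table as `Record<Outcome, OutcomeBehavior>` means the compiler flags any outcome that is added later without a user message and operator reason.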

Store only what is needed. Keep audit metadata and decision codes, but avoid retaining sensitive raw text longer than the product requires.
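A small sketch of what "audit metadata and decision codes" could look like in practice: the record keeps the outcome, categories, a timestamp, and the content length, but never the text itself. The `AuditRecord` shape and the `toAuditRecord` helper are hypothetical, and which fields an audit actually needs depends on the product.

```typescript
interface ModerationDecision {
  outcome: string;
  categories: string[];
  userMessage: string;
}

interface AuditRecord {
  articleId: string;
  outcome: string;
  categories: string[];
  contentLength: number; // size only, not the text
  decidedAt: string;     // ISO 8601 timestamp
}

// Hypothetical helper: builds an audit row that omits the raw content.
function toAuditRecord(
  articleId: string,
  content: string,
  decision: ModerationDecision,
  now: Date = new Date(),
): AuditRecord {
  return {
    articleId,
    outcome: decision.outcome,
    categories: [...decision.categories],
    contentLength: content.length,
    decidedAt: now.toISOString(),
  };
}
```

The user-facing message is also left out of the record here, since it can be derived from the outcome and may echo parts of the content.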

Code or config snippet

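import OpenAI from "openai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

// `db` is assumed to be the application's database client
// (for example, a Prisma client) that is already in scope.

const moderationSchema = z.object({
  outcome: z.enum(["allow", "request_edits", "needs_review", "block"]),
  categories: z.array(z.string()),
  userMessage: z.string(),
  // Strict structured outputs require every field to be present,
  // so the note is nullable rather than optional.
  reviewerNote: z.string().nullable(),
});

async function moderateBeforePublish(articleId: string, content: string) {
  const response = await openai.responses.create({
    model: "gpt-4.1-mini",
    input: [
      { role: "system", content: "Classify content risk before publishing." },
      { role: "user", content },
    ],
    text: {
      format: {
        type: "json_schema",
        name: "moderation_decision",
        schema: zodToJsonSchema(moderationSchema),
        strict: true,
      },
    },
  });

  // responses.create() returns the JSON as text; parse and validate it
  // before trusting any field.
  const decision = moderationSchema.parse(JSON.parse(response.output_text));

  // Store decision codes only, not the raw content.
  await db.articleModeration.create({
    data: { articleId, outcome: decision.outcome, categories: decision.categories },
  });

  return decision;
}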

Mistakes to avoid

Running moderation after the content has already been sent or published.
Using one generic blocked message for every policy outcome.
Storing sensitive user text in logs without a retention reason.
Treating model-generated text as safe because the original user input looked safe.
Blocking users without giving a clear edit path where edits are allowed.

Ready checklist

Irreversible content actions are identified.
Moderation runs before publish, send, export, or notification.
Outcomes map to clear product states.
User-facing messages are safe and actionable.
Sensitive logs are minimized.
Manual review exists for ambiguous cases.

Practical note

Moderation should feel like a workflow checkpoint, not a mysterious rejection. Show the user what category needs attention and what action is available next.

Treat this as an implementation constraint, not just advice. The interface, server code, and validation path should all enforce the same outcomes, so that content held in the interface cannot be published by calling the server directly.
