AI Prompt Engineering Workflow

Build prompts that work the first time and keep working. This phased workflow covers every step, from precise task definition and few-shot examples to hallucination prevention and long-term prompt library maintenance.

Author: Checklistify Editorial Team
The six layers every effective prompt needs

Most prompts fail not because the model lacks capability, but because the prompt omits one of six functional layers. Each layer resolves a different ambiguity. A missing layer forces the model to guess — and it will, confidently and plausibly.

1. Role
Who is generating this? Sets knowledge domain, vocabulary, and professional defaults.
2. Task
What exactly should be produced? One sentence, four questions answered: format, audience, constraints, emphasis.
3. Context
What must the model know that it can't infer? Audience, purpose, brand, and any relevant facts.
4. Examples
What does a correct output look like? Demonstrates pattern directly — more powerful than description alone.
5. Format
What structure, length, and schema is required? Eliminates format guessing and post-processing work.
6. Constraints
What must the output not do? Prevents the most predictable failure modes before they occur.
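As a sketch, the six layers can be assembled into a single prompt template. The function name, field order, and example values below are illustrative, not a standard API:

```python
# Sketch: assembling the six layers into one prompt string.
# All names and example values here are illustrative.

def build_prompt(role, task, context, examples, fmt, constraints):
    """Combine the six layers in order; each resolves one ambiguity."""
    return "\n\n".join([
        f"Role: {role}",
        f"Task: {task}",
        f"Context: {context}",
        f"Examples:\n{examples}",
        f"Format: {fmt}",
        f"Constraints: {constraints}",
    ])

prompt = build_prompt(
    role="You are a senior technical writer at a B2B SaaS company.",
    task="Write a 3-sentence product update summary for customers.",
    context="Audience: non-technical admins. Purpose: release notes email.",
    examples="Input: 'Added SSO' -> Output: 'You can now sign in with...'",
    fmt="Plain text, max 60 words, no bullet points.",
    constraints="Do not mention features that are not in the input.",
)
```

Keeping the layers as separate named pieces also makes the one-change-at-a-time refinement discussed later much easier: you edit exactly one argument per iteration.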

Which technique should you reach for first?

Not every prompt needs every technique. This decision guide matches common failure symptoms to their most likely causes — so you fix the right thing instead of iterating blindly.

| If the output is… | Likely cause | Reach for… |
| --- | --- | --- |
| Generic or surface-level | Vague task or absent role | Sharper task definition + specific role |
| Wrong structure or format | Format not specified | Explicit format spec + example output |
| Shallow or inconsistent reasoning | No reasoning path forced | Chain-of-thought instruction |
| Plausible but invented facts | Model generating missing data | Factual constraint + provide source data in context |
| Right content, wrong tone | Audience or voice not specified | Context block with audience details + few-shot examples |
| Inconsistent across repeated runs | Ambiguous instructions or high temperature | Resolve the ambiguity + lower temperature toward 0 |
| Instructions partially ignored | Critical instructions buried in the middle | Move key instructions to the top and repeat at the end |
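For the "shallow reasoning" symptom, the smallest possible intervention is appending a chain-of-thought instruction to the existing prompt. The wording below is one common pattern, not the only valid phrasing:

```python
# Sketch: appending a chain-of-thought instruction to an existing prompt.
# The exact wording is one common pattern; adapt it to your task.

base_prompt = "Is 2024 a leap year? Answer yes or no."

cot_prompt = (
    base_prompt
    + "\n\nThink step by step: first state the rule you are applying, "
    + "then apply it to this case, then give the final answer on its "
    + "own line."
)
```

Note that this forces the reasoning into the visible output; if you need only the final answer downstream, also specify where the answer should appear so you can extract it.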

⚠️ Where you place instructions changes how reliably they're followed

Research on large language models documents a consistent pattern called the "lost in the middle" effect: models attend most reliably to content at the beginning and end of a long prompt. Instructions positioned in the middle of a long context — particularly when large documents are included — receive less reliable attention than the same instructions placed at the edges.

In practice: put your task definition, format requirements, and critical constraints at the very top of the prompt, before any background documents or examples. If you're providing a long document for analysis, repeat the core instruction after the document as well as before it. For any constraint that must not be violated — a factual restriction, a word limit, a scope boundary — position it prominently rather than burying it in a paragraph of context.
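The instruction-before-and-after pattern can be sketched as a small helper. The delimiter strings are arbitrary conventions, not required syntax:

```python
# Sketch of the "instruction sandwich": the core instruction appears
# before AND after a long document, so it sits at both high-attention
# edges of the prompt. Delimiters are arbitrary.

CORE_INSTRUCTION = (
    "Summarize the document below in exactly 3 bullet points. "
    "Use only facts stated in the document."
)

def sandwich_prompt(document: str) -> str:
    return (
        f"{CORE_INSTRUCTION}\n\n"
        f"--- DOCUMENT START ---\n{document}\n--- DOCUMENT END ---\n\n"
        f"Reminder: {CORE_INSTRUCTION}"
    )
```

The repetition costs a few dozen tokens and, for long documents, typically buys a measurable improvement in instruction adherence.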

This effect matters more as context windows grow. A 128K-token or 200K-token context window does not mean every sentence in a long document is equally weighted. For document-heavy prompts, targeted excerpts often produce more reliable outputs than feeding the full text: the model attends to the relevant material instead of spreading attention across everything you pasted in.

📖 What a three-sentence prompt costs in production

A fintech startup built an automated customer email responder using an LLM. Their prompt was three sentences: task, tone, length limit. It worked fine in internal testing. In production, it started citing non-existent policy documents — professional, plausible, completely fabricated. Customers forwarded these responses to support agents expecting follow-through on promises that had never been made.

The fix was two prompt changes and an afternoon of work. Rebuilding the customer trust took months. The hallucination problem was entirely predictable and preventable — but only if engineered for before deployment, not after.

🧮 The one-change-at-a-time rule — why it matters

Prompt engineering is empirical. When a prompt produces the wrong output, the temptation is to fix everything at once: rewrite the task, add examples, adjust the constraints, and change the format spec in a single edit. The result is a prompt that may work better — but you have no idea which change caused the improvement.

This matters because the next time the prompt behaves unexpectedly, you'll have no basis for diagnosis. Treat each refinement as an experiment: one change, one re-test, one recorded observation. It's slower in the short term and dramatically faster over the life of a prompt used in production.
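The one-change-at-a-time discipline is easier to keep when each iteration is recorded. A minimal sketch of such an experiment log, with illustrative field names:

```python
# Sketch: a minimal experiment log for one-change-at-a-time refinement.
# The record fields are illustrative; adapt them to your own tooling.
from dataclasses import dataclass

@dataclass
class PromptExperiment:
    change: str            # the single edit made this iteration
    hypothesis: str        # what you expect the edit to fix
    observation: str = ""  # filled in after re-testing

log: list[PromptExperiment] = []

log.append(PromptExperiment(
    change="Added explicit word limit to the Format layer",
    hypothesis="Outputs will stop running over 200 words",
))
# ...re-run the prompt against your test inputs, then record:
log[-1].observation = "Length now within limit on 5/5 test inputs"
```

Even a plain spreadsheet works; the point is that every row pairs exactly one change with one observed result.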

Prompt behavior varies across models — what to know before migrating

A prompt tuned for GPT-4o may produce meaningfully different output on Claude Sonnet or Gemini 1.5 Pro — not because one model is better, but because each has different training data, fine-tuning, and default behaviors for ambiguous instructions. Key practical differences to account for:

  • Instruction following: Claude models tend to follow explicit constraints more literally. GPT models sometimes interpret instructions more liberally when they conflict with "helpful" defaults. If you migrate prompts, test constraint adherence explicitly.
  • JSON reliability: No model guarantees valid JSON from prose instructions alone on complex outputs. Use structured-output features where the API provides them (OpenAI offers a JSON mode via response_format; the Anthropic API achieves the same effect with tool-use schemas or prefilled assistant turns) for any pipeline where parseable output is a hard requirement.
  • Default verbosity: Models have different default output lengths for identical prompts. If length is a hard constraint, always specify it explicitly rather than assuming parity across models.
  • Context window limits: Claude Sonnet 4.6 supports 200K tokens; GPT-4o supports 128K. For use cases involving very long documents, model selection is a practical infrastructure decision, not only a quality preference.

When migrating a prompt library from one model to another: treat it as a test-and-refine cycle, not a direct port. The structure of well-engineered prompts transfers reliably; the exact behavior often does not.
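As an illustration of the JSON-reliability point, here is a sketch of an OpenAI-style request using JSON mode. The request is built as a plain dict rather than sent, so no API key or network call is involved; the schema in the system message is a made-up example:

```python
# Sketch: requesting structured output via OpenAI's JSON mode.
# The OpenAI Chat Completions API accepts
# response_format={"type": "json_object"}; the request below is only
# constructed, not sent. The "summary" schema is an invented example.
import json

request = {
    "model": "gpt-4o",
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "system",
         "content": 'Reply with a JSON object: {"summary": string}.'},
        {"role": "user", "content": "Summarize: the deploy succeeded."},
    ],
}

# Even with JSON mode, validate the response before using it downstream:
def parse_or_fail(raw: str) -> dict:
    data = json.loads(raw)  # raises ValueError on invalid JSON
    if "summary" not in data:
        raise ValueError("missing required key: summary")
    return data
```

Validating on receipt matters regardless of model: JSON mode guarantees syntactically valid JSON, not that the object contains the keys your pipeline expects.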

💡 Retrieval-augmented generation — when to reach for it

Retrieval-augmented generation (RAG) is a pattern where, instead of asking the model to recall facts from its training data, you retrieve the relevant source documents and inject them directly into the prompt context. The model then generates a response grounded in those documents rather than in its parametric memory.

RAG is not always necessary — but it's the right tool when: (1) the task requires accurate, up-to-date, or proprietary factual information that the model's training data doesn't reliably contain; (2) hallucination risk on specific claims is unacceptable; or (3) the source of truth needs to be auditable. For customer-facing content about your products, policies, or pricing — RAG transforms the hallucination problem from a prompt engineering challenge into a retrieval quality challenge, which is significantly easier to control.

For teams not yet using RAG: the simplest starting point is manually pasting the relevant source document into the prompt context and adding a constraint — "Base your response only on the document provided below. Do not add information from outside this document." This captures most of the hallucination-reduction benefit before investing in a full retrieval pipeline.
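That manual starting point can be sketched as a small wrapper. The constraint wording follows the example above; the delimiter tags are an arbitrary convention:

```python
# Sketch: manual grounding before building a full RAG pipeline.
# The constraint wording mirrors the text; delimiters are arbitrary.

def grounded_prompt(question: str, source_document: str) -> str:
    return (
        "Base your response only on the document provided below. "
        "Do not add information from outside this document. "
        "If the document does not contain the answer, say so.\n\n"
        f"Question: {question}\n\n"
        f"<document>\n{source_document}\n</document>"
    )
```

The final "say so" clause gives the model an explicit escape hatch; without it, a model that cannot find the answer in the document is more likely to fall back on invented facts.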
