USP
Unlike generic prompt collections, this kit offers granular plugins built on research-backed context engineering techniques, delivering high-quality, predictable agent results without unnecessary token overhead.
Use cases
1. Improving AI agent reliability and accuracy
2. Developing high-quality code with AI agents
3. Optimizing token usage in AI prompts
4. Evaluating and testing agent performance
5. Implementing advanced context engineering patterns
Detected files (8)
plugins/customaize-agent/skills/create-command/SKILL.md (skill, 15052 bytes)
--- name: create-command description: Interactive assistant for creating new Claude commands with proper structure, patterns, and MCP tool integration argument-hint: Optional command name or description of command purpose --- # Command Creator Assistant <task> You are a command creation specialist. Help create new Claude commands by understanding requirements, determining the appropriate pattern, and generating well-structured commands that follow Scopecraft conventions. </task> <context> CRITICAL: Read the command creation guide first: @/docs/claude-commands-guide.md This meta-command helps create other commands by: 1. Understanding the command's purpose 2. Determining its category and pattern 3. Choosing command location (project vs user) 4. Generating the command file 5. Creating supporting resources 6. Updating documentation </context> <command_categories> 1. **Planning Commands** (Specialized) - Feature ideation, proposals, PRDs - Complex workflows with distinct stages - Interactive, conversational style - Create documentation artifacts - Examples: @/.claude/commands/01_brainstorm-feature.md @/.claude/commands/02_feature-proposal.md 2. **Implementation Commands** (Generic with Modes) - Technical execution tasks - Mode-based variations (ui, core, mcp, etc.) - Follow established patterns - Update task states - Example: @/.claude/commands/implement.md 3. **Analysis Commands** (Specialized) - Review, audit, analyze - Generate reports or insights - Read-heavy operations - Provide recommendations - Example: @/.claude/commands/review.md 4. **Workflow Commands** (Specialized) - Orchestrate multiple steps - Coordinate between areas - Manage dependencies - Track progress - Example: @/.claude/commands/04_feature-planning.md 5. **Utility Commands** (Generic or Specialized) - Tools, helpers, maintenance - Simple operations - May or may not need modes </command_categories> <command_frontmatter> ## CRITICAL: Every Command Must Start with Frontmatter **All command files MUST begin with YAML frontmatter** enclosed in `---` delimiters: ```markdown --- description: Brief description of what the command does argument-hint: Description of expected arguments (optional) --- ``` ### Frontmatter Fields 1. **`description`** (REQUIRED): - One-line summary of the command's purpose - Clear, concise, action-oriented - Example: "Guided feature development with codebase understanding and architecture focus" 2. 
**`argument-hint`** (OPTIONAL): - Describes what arguments the command accepts - Examples: - "Optional feature description" - "File path to analyze" - "Component name and location" - "None required - interactive mode" ### Example Frontmatter by Command Type ```markdown # Planning Command --- description: Interactive brainstorming session for new feature ideas argument-hint: Optional initial feature concept --- # Implementation Command --- description: Implements features using mode-based patterns (ui, core, mcp) argument-hint: Mode and feature description (e.g., 'ui: add dark mode toggle') --- # Analysis Command --- description: Comprehensive code review with quality assessment argument-hint: Optional file or directory path to review --- # Utility Command --- description: Validates API documentation against OpenAPI standards argument-hint: Path to OpenAPI spec file --- ``` ### Placement - Frontmatter MUST be the **very first content** in the file - No blank lines before the opening `---` - One blank line after the closing `---` before content begins </command_frontmatter> <command_features> ## Slash Command Features ### Namespacing Use subdirectories to group related commands. Subdirectories appear in the command description but don't affect the command name. **Example:** - `.claude/commands/frontend/component.md` creates `/component` with description "(project:frontend)" - `~/.claude/commands/component.md` creates `/component` with description "(user)" **Priority:** If a project command and user command share the same name, the project command takes precedence. ### Arguments #### All Arguments with `$ARGUMENTS` Captures all arguments passed to the command: ```bash # Command definition echo 'Fix issue #$ARGUMENTS following our coding standards' > .claude/commands/fix-issue.md # Usage > /fix-issue 123 high-priority # $ARGUMENTS becomes: "123 high-priority" ``` #### Individual Arguments with `$1`, `$2`, etc. Access specific arguments individually using positional parameters: ```bash # Command definition echo 'Review PR #$1 with priority $2 and assign to $3' > .claude/commands/review-pr.md # Usage > /review-pr 456 high alice # $1 becomes "456", $2 becomes "high", $3 becomes "alice" ``` ### Bash Command Execution Execute bash commands before the slash command runs using the `!` prefix. The output is included in the command context. **Note:** You must include `allowed-tools` with the `Bash` tool. ```markdown --- allowed-tools: Bash(git add:*), Bash(git status:*), Bash(git commit:*) description: Create a git commit --- ## Context - Current git status: !`git status` - Current git diff: !`git diff HEAD` - Current branch: !`git branch --show-current` - Recent commits: !`git log --oneline -10` ``` ### File References Include file contents using the `@` prefix to reference files: ```markdown Review the implementation in @src/utils/helpers.js Compare @src/old-version.js with @src/new-version.js ``` ### Thinking Mode Slash commands can trigger extended thinking by including extended thinking keywords. 
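### Putting the Features Together

The features above compose naturally in a single command file. The sketch below is illustrative only: the command purpose, the allowed-tools list, and the referenced guide path are assumptions, not an existing command in this project.

```markdown
---
allowed-tools: Bash(git status:*), Bash(git diff:*)
description: Review uncommitted changes against project conventions
argument-hint: Optional focus area (e.g. 'error handling')
---

## Context

- Current git status: !`git status`
- Current git diff: !`git diff HEAD`
- Conventions: @/docs/claude-commands-guide.md

## Task

Review the uncommitted changes shown above, paying special attention to: $ARGUMENTS
Summarize what changed and flag anything that conflicts with the conventions file.
```

Saved as `.claude/commands/review-changes.md`, this would be invoked as `/review-changes`, with everything after the name available through `$ARGUMENTS`.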
### Frontmatter Options | Frontmatter | Purpose | Default | |-------------|---------|---------| | `allowed-tools` | List of tools the command can use | Inherits from conversation | | `argument-hint` | Expected arguments for auto-completion | None | | `description` | Brief description of the command | First line from prompt | | `model` | Specific model string | Inherits from conversation | | `disable-model-invocation` | Prevent `Skill` tool from calling this command | false | **Example with all frontmatter options:** ```markdown --- allowed-tools: Bash(git add:*), Bash(git status:*), Bash(git commit:*) argument-hint: [message] description: Create a git commit model: claude-3-5-haiku-20241022 --- Create a git commit with message: $ARGUMENTS ``` </command_features> <pattern_research> ## Before Creating: Study Similar Commands 1. **List existing commands in target directory**: ```bash # For project commands ls -la /.claude/commands/ # For user commands ls -la ~/.claude/commands/ ``` 2. **Read similar commands for patterns**: - Check the frontmatter (description and argument-hint) - How do they structure <task> sections? - What MCP tools do they use? - How do they handle arguments? - What documentation do they reference? 3. **Common patterns to look for**: ```markdown # MCP tool usage for tasks Use tool: mcp__scopecraft-cmd__task_create Use tool: mcp__scopecraft-cmd__task_update Use tool: mcp__scopecraft-cmd__task_list # NOT CLI commands ❌ Run: scopecraft task list ✅ Use tool: mcp__scopecraft-cmd__task_list ``` 4. **Standard references to include**: - @/docs/organizational-structure-guide.md - @/docs/command-resources/{relevant-templates} - @/docs/claude-commands-guide.md </pattern_research> <interview_process> ## Phase 1: Understanding Purpose "Let's create a new command. First, let me check what similar commands exist..." *Use Glob to find existing commands in the target category* "Based on existing patterns, please describe:" 1. What problem does this command solve? 2. Who will use it and when? 3. What's the expected output? 4. Is it interactive or batch? ## Phase 2: Category Classification Based on responses and existing examples: - Is this like existing planning commands? (Check: brainstorm-feature, feature-proposal) - Is this like implementation commands? (Check: implement.md) - Does it need mode variations? - Should it follow analysis patterns? (Check: review.md) ## Phase 3: Pattern Selection **Study similar commands first**: ```markdown # Read a similar command @{similar-command-path} # Note patterns: - Task description style - Argument handling - MCP tool usage - Documentation references - Human review sections ``` ## Phase 4: Command Location 🎯 **Critical Decision: Where should this command live?** **Project Command** (`/.claude/commands/`) - Specific to this project's workflow - Uses project conventions - References project documentation - Integrates with project MCP tools **User Command** (`~/.claude/commands/`) - General-purpose utility - Reusable across projects - Personal productivity tool - Not project-specific Ask: "Should this be: 1. A project command (specific to this codebase) 2. A user command (available in all projects)?" 
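The two answers map directly to file locations. A minimal sketch, assuming a hypothetical command named `validate-api`:

```markdown
# Project command (takes precedence if a user command shares the name)
/.claude/commands/validate-api.md    -> /validate-api  (project)

# User command (available in every project)
~/.claude/commands/validate-api.md   -> /validate-api  (user)
```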
## Phase 5: Resource Planning Check existing resources: ```bash # Check templates ls -la /docs/command-resources/planning-templates/ ls -la /docs/command-resources/implement-modes/ # Check which guides exist ls -la /docs/ ``` </interview_process> <generation_patterns> ## Critical: Copy Patterns from Similar Commands Before generating, read similar commands and note: 1. **Frontmatter (MUST BE FIRST)**: ```markdown --- description: Clear one-line description of command purpose argument-hint: What arguments does it accept --- ``` - No blank lines before opening `---` - One blank line after closing `---` - `description` is REQUIRED - `argument-hint` is OPTIONAL 2. **MCP Tool Usage**: ```markdown # From existing commands Use mcp__scopecraft-cmd__task_create Use mcp__scopecraft-cmd__feature_get Use mcp__scopecraft-cmd__phase_list ``` 3. **Standard References**: ```markdown <context> Key Reference: @/docs/organizational-structure-guide.md Template: @/docs/command-resources/planning-templates/{template}.md Guide: @/docs/claude-commands-guide.md </context> ``` 4. **Task Update Patterns**: ```markdown <task_updates> After implementation: 1. Update task status to appropriate state 2. Add implementation log entries 3. Mark checklist items as complete 4. Document any decisions made </task_updates> ``` 5. **Human Review Sections**: ```markdown <human_review_needed> Flag decisions needing verification: - [ ] Assumptions about workflows - [ ] Technical approach choices - [ ] Pattern-based suggestions </human_review_needed> ``` </generation_patterns> <implementation_steps> 1. **Create Command File** - Determine location based on project/user choice - Generate content following established patterns - Include all required sections 2. **Create Supporting Files** (if project command) - Templates in `/docs/command-resources/` - Mode guides if generic command - Example documentation 3. **Update Documentation** (if project command) - Add to claude-commands-guide.md - Update feature-development-workflow.md if workflow command - Add to README if user-facing 4. **Test the Command** - Create example usage scenarios - Verify argument handling - Check MCP tool integration </implementation_steps> <creation_checklist> Before finalizing: - [ ] **Includes YAML frontmatter** with description (required) and argument-hint (optional) - [ ] Frontmatter is the very first content (no blank lines before opening `---`) - [ ] Studied similar commands in the category - [ ] Command follows naming conventions (use numeric prefix for ordered workflows) - [ ] Includes proper task/context structure - [ ] References @organizational-structure-guide.md - [ ] Uses MCP tools (not CLI) - check existing patterns - [ ] Includes human review sections - [ ] Has clear examples like other commands - [ ] Updates task states appropriately - [ ] Creates proper documentation - [ ] Follows established patterns from similar commands - [ ] Correct command prefix (project: or user:) </creation_checklist> <example_session> User: "I need a command to help validate our API documentation" 🔍 **Research**: Let me check existing analysis commands... *Use Read tool to examine: /.claude/commands/review.md* I notice the review command: - Uses MCP tools for task operations - Includes human review sections - References organizational structure - Has clear output formatting 🤔 **Question**: Can you tell me more about this API documentation validation? - What format is the documentation in? - What aspects need validation? - Should it create tasks for issues found? 
User: "It's OpenAPI specs, need to check for completeness and consistency" 💡 **Category**: This is an Analysis command similar to 'review'. 🔍 **Pattern Check**: Looking at review.md, I see it: ```markdown <task> You are a code reviewer conducting a comprehensive review... </task> <mcp_usage> Always use MCP tools: - mcp__scopecraft-cmd__task_list - mcp__scopecraft-cmd__task_update </mcp_usage> ``` 🎯 **Location Question**: Should this be: 1. A project command (specific to this API project) 2. A user command (useful for all your API projects) User: "Project command - it needs to reference our specific API standards" ✅ Creating project command: `/.claude/commands/validate-api.md` Generated command (following review.md patterns): ```markdown --- description: Validates API documentation against OpenAPI standards for completeness and consistency argument-hint: Path to OpenAPI spec file (optional, will search if not provided) --- <task> You are an API documentation validator reviewing OpenAPI specifications for completeness and consistency. </task> <context> References: - API Standards: @/docs/api-standards.md - Organizational Structure: @/docs/organizational-structure-guide.md Similar to: @/.claude/commands/review.md </context> <validation_process> 1. Load OpenAPI spec files 2. Check required endpoints documented 3. Validate response schemas 4. Verify authentication documented 5. Check for missing examples </validation_process> <mcp_usage> If issues found, create tasks: - Use tool: mcp__scopecraft-cmd__task_create - Type: "bug" or "documentation" - Phase: Current active phase - Area: "docs" or "api" </mcp_usage> <human_review_needed> Flag for manual review: - [ ] Breaking changes detected - [ ] Security implications unclear - [ ] Business logic assumptions </human_review_needed> ``` </example_session> <final_output> After gathering all information: 1. **Command Created**: - Location: {chosen location} - Name: {command-name} - Category: {category} - Pattern: {specialized/generic} 2. **Resources Created**: - Supporting templates: {list} - Documentation updates: {list} 3. **Usage Instructions**: - Command: `/{prefix}:{name}` - Example: {example usage} 4. **Next Steps**: - Test the command - Refine based on usage - Add to command documentation </final_output>plugins/customaize-agent/skills/context-engineering/SKILL.mdskillShow content (53651 bytes)
--- name: context-engineering description: Understand the components, mechanics, and constraints of context in agent systems. Use when writing, editing, or optimizing commands, skills, or sub-agents prompts. --- # Context Engineering Fundamentals Context is the complete state available to a language model at inference time. It includes everything the model can attend to when generating responses: system instructions, tool definitions, retrieved documents, message history, and tool outputs. Understanding context fundamentals is prerequisite to effective context engineering. ## Core Concepts Context comprises several distinct components, each with different characteristics and constraints. The attention mechanism creates a finite budget that constrains effective context usage. Progressive disclosure manages this constraint by loading information only as needed. The engineering discipline is curating the smallest high-signal token set that achieves desired outcomes. ## Detailed Topics ### The Anatomy of Context **System Prompts** System prompts establish the agent's core identity, constraints, and behavioral guidelines. They are loaded once at session start and typically persist throughout the conversation. System prompts should be extremely clear and use simple, direct language at the right altitude for the agent. The right altitude balances two failure modes. At one extreme, engineers hardcode complex brittle logic that creates fragility and maintenance burden. At the other extreme, engineers provide vague high-level guidance that fails to give concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics. Organize prompts into distinct sections using XML tagging or Markdown headers to delineate background information, instructions, tool guidance, and output description. The exact formatting matters less as models become more capable, but structural clarity remains valuable. **Tool Definitions** Tool definitions specify the actions an agent can take. Each tool includes a name, description, parameters, and return format. Tool definitions live near the front of context after serialization, typically before or after the system prompt. Tool descriptions collectively steer agent behavior. Poor descriptions force agents to guess; optimized descriptions include usage context, examples, and defaults. The consolidation principle states that if a human engineer cannot definitively say which tool should be used in a given situation, an agent cannot be expected to do better. **Retrieved Documents** Retrieved documents provide domain-specific knowledge, reference materials, or task-relevant information. Agents use retrieval augmented generation to pull relevant documents into context at runtime rather than pre-loading all possible information. The just-in-time approach maintains lightweight identifiers (file paths, stored queries, web links) and uses these references to load data into context dynamically. This mirrors human cognition: we generally do not memorize entire corpuses of information but rather use external organization and indexing systems to retrieve relevant information on demand. **Message History** Message history contains the conversation between the user and agent, including previous queries, responses, and reasoning. For long-running tasks, message history can grow to dominate context usage. 
Message history serves as scratchpad memory where agents track progress, maintain task state, and preserve reasoning across turns. Effective management of message history is critical for long-horizon task completion. **Tool Outputs** Tool outputs are the results of agent actions: file contents, search results, command execution output, API responses, and similar data. Tool outputs comprise the majority of tokens in typical agent trajectories, with research showing observations (tool outputs) can reach 83.9% of total context usage. Tool outputs consume context whether they are relevant to current decisions or not. This creates pressure for strategies like observation masking, compaction, and selective tool result retention. ### Context Windows and Attention Mechanics **The Attention Budget Constraint** Language models process tokens through attention mechanisms that create pairwise relationships between all tokens in context. For n tokens, this creates n^2 relationships that must be computed and stored. As context length increases, the model's ability to capture these relationships gets stretched thin. Models develop attention patterns from training data distributions where shorter sequences predominate. This means models have less experience with and fewer specialized parameters for context-wide dependencies. The result is an "attention budget" that depletes as context grows. **Position Encoding and Context Extension** Position encoding interpolation allows models to handle longer sequences by adapting them to originally trained smaller contexts. However, this adaptation introduces degradation in token position understanding. Models remain highly capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to performance on shorter contexts. **The Progressive Disclosure Principle** Progressive disclosure manages context efficiently by loading information only as needed. At startup, agents load only skill names and descriptions--sufficient to know when a skill might be relevant. Full content loads only when a skill is activated for specific tasks. This approach keeps agents fast while giving them access to more context on demand. The principle applies at multiple levels: skill selection, document loading, and even tool result retrieval. ### Context Quality Versus Context Quantity The assumption that larger context windows solve memory problems has been empirically debunked. Context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of desired outcomes. Several factors create pressure for context efficiency. Processing cost grows disproportionately with context length--not just double the cost for double the tokens, but exponentially more in time and computing resources. Model performance degrades beyond certain context lengths even when the window technically supports more tokens. Long inputs remain expensive even with prefix caching. The guiding principle is informativity over exhaustiveness. Include what matters for the decision at hand, exclude what does not, and design systems that can access additional information on demand. ### Context as Finite Resource Context must be treated as a finite resource with diminishing marginal returns. Like humans with limited working memory, language models have an attention budget drawn on when parsing large volumes of context. Every new token introduced depletes this budget by some amount. 
This creates the need for careful curation of available tokens. The engineering problem is optimizing utility against inherent constraints. Context engineering is iterative and the curation phase happens each time you decide what to pass to the model. It is not a one-time prompt writing exercise but an ongoing discipline of context management. ## Practical Guidance ### File-System-Based Access Agents with filesystem access can use progressive disclosure naturally. Store reference materials, documentation, and data externally. Load files only when needed using standard filesystem operations. This pattern avoids stuffing context with information that may not be relevant. The file system itself provides structure that agents can navigate. File sizes suggest complexity; naming conventions hint at purpose; timestamps serve as proxies for relevance. Metadata of file references provides a mechanism to efficiently refine behavior. ### Hybrid Strategies The most effective agents employ hybrid strategies. Pre-load some context for speed (like CLAUDE.md files or project rules), but enable autonomous exploration for additional context as needed. The decision boundary depends on task characteristics and context dynamics. For contexts with less dynamic content, pre-loading more upfront makes sense. For rapidly changing or highly specific information, just-in-time loading avoids stale context. ### Context Budgeting Design with explicit context budgets in mind. Know the effective context limit for your model and task. Monitor context usage during development. Implement compaction triggers at appropriate thresholds. Design systems assuming context will degrade rather than hoping it will not. Effective context budgeting requires understanding not just raw token counts but also attention distribution patterns. The middle of context receives less attention than the beginning and end. Place critical information at attention-favored positions. ## Examples **Example 1: Organizing System Prompts** ```markdown <BACKGROUND_INFORMATION> You are a Python expert helping a development team. Current project: Data processing pipeline in Python 3.9+ </BACKGROUND_INFORMATION> <INSTRUCTIONS> - Write clean, idiomatic Python code - Include type hints for function signatures - Add docstrings for public functions - Follow PEP 8 style guidelines </INSTRUCTIONS> <TOOL_GUIDANCE> Use bash for shell operations, python for code tasks. File operations should use pathlib for cross-platform compatibility. </TOOL_GUIDANCE> <OUTPUT_DESCRIPTION> Provide actionable feedback with specific line references. Explain the reasoning behind suggestions. </OUTPUT_DESCRIPTION> ``` **Example 2: Progressive Document Loading** ```markdown # Instead of loading all documentation at once: # Step 1: Load summary docs/architecture_overview.md # Lightweight overview # Step 2: Load specific section as needed docs/api/endpoints.md # Only when API work needed docs/database/schemas.md # Only when data layer work needed ``` **Example 3: Skill Description Design** ```markdown # Bad: Vague description that loads into context but provides little signal description: Helps with code things # Good: Specific description that helps model decide when to activate description: Analyze code quality and suggest refactoring patterns. Use when reviewing pull requests or improving existing code structure. ``` ## Guidelines 1. Treat context as a finite resource with diminishing returns 2. Place critical information at attention-favored positions (beginning and end) 3. 
Use progressive disclosure to defer loading until needed 4. Organize system prompts with clear section boundaries 5. Monitor context usage during development 6. Implement compaction triggers at 70-80% utilization 7. Design for context degradation rather than hoping to avoid it 8. Prefer smaller high-signal context over larger low-signal context # Context Degradation Patterns Language models exhibit predictable degradation patterns as context length increases. Understanding these patterns is essential for diagnosing failures and designing resilient systems. Context degradation is not a binary state but a continuum of performance degradation that manifests in several distinct ways. ## Core Concepts Context degradation manifests through several distinct patterns. The lost-in-middle phenomenon causes information in the center of context to receive less attention. Context poisoning occurs when errors compound through repeated reference. Context distraction happens when irrelevant information overwhelms relevant content. Context confusion arises when the model cannot determine which context applies. Context clash develops when accumulated information directly conflicts. These patterns are predictable and can be mitigated through architectural patterns like compaction, masking, partitioning, and isolation. ## Detailed Topics ### The Lost-in-Middle Phenomenon The most well-documented degradation pattern is the "lost-in-middle" effect, where models demonstrate U-shaped attention curves. Information at the beginning and end of context receives reliable attention, while information buried in the middle suffers from dramatically reduced recall accuracy. **Empirical Evidence** Research demonstrates that relevant information placed in the middle of context experiences 10-40% lower recall accuracy compared to the same information at the beginning or end. This is not a failure of the model but a consequence of attention mechanics and training data distributions. Models allocate massive attention to the first token (often the BOS token) to stabilize internal states. This creates an "attention sink" that soaks up attention budget. As context grows, the limited budget is stretched thinner, and middle tokens fail to garner sufficient attention weight for reliable retrieval. **Practical Implications** Design context placement with attention patterns in mind. Place critical information at the beginning or end of context. Consider whether information will be queried directly or needs to support reasoning--if the latter, placement matters less but overall signal quality matters more. For long documents or conversations, use summary structures that surface key information at attention-favored positions. Use explicit section headers and transitions to help models navigate structure. ### Context Poisoning Context poisoning occurs when hallucinations, errors, or incorrect information enters context and compounds through repeated reference. Once poisoned, context creates feedback loops that reinforce incorrect beliefs. **How Poisoning Occurs** Poisoning typically enters through three pathways. First, tool outputs may contain errors or unexpected formats that models accept as ground truth. Second, retrieved documents may contain incorrect or outdated information that models incorporate into reasoning. Third, model-generated summaries or intermediate outputs may introduce hallucinations that persist in context. The compounding effect is severe. 
If an agent's goals section becomes poisoned, it develops strategies that take substantial effort to undo. Each subsequent decision references the poisoned content, reinforcing incorrect assumptions. **Detection and Recovery** Watch for symptoms including degraded output quality on tasks that previously succeeded, tool misalignment where agents call wrong tools or parameters, and hallucinations that persist despite correction attempts. When these symptoms appear, consider context poisoning. Recovery requires removing or replacing poisoned content. This may involve truncating context to before the poisoning point, explicitly noting the poisoning in context and asking for re-evaluation, or restarting with clean context and preserving only verified information. ### Context Distraction Context distraction emerges when context grows so long that models over-focus on provided information at the expense of their training knowledge. The model attends to everything in context regardless of relevance, and this creates pressure to use provided information even when internal knowledge is more accurate. **The Distractor Effect** Research shows that even a single irrelevant document in context reduces performance on tasks involving relevant documents. Multiple distractors compound degradation. The effect is not about noise in absolute terms but about attention allocation--irrelevant information competes with relevant information for limited attention budget. Models do not have a mechanism to "skip" irrelevant context. They must attend to everything provided, and this obligation creates distraction even when the irrelevant information is clearly not useful. **Mitigation Strategies** Mitigate distraction through careful curation of what enters context. Apply relevance filtering before loading retrieved documents. Use namespacing and organization to make irrelevant sections easy to ignore structurally. Consider whether information truly needs to be in context or can be accessed through tool calls instead. ### Context Confusion Context confusion arises when irrelevant information influences responses in ways that degrade quality. This is related to distraction but distinct--confusion concerns the influence of context on model behavior rather than attention allocation. If you put something in context, the model has to pay attention to it. The model may incorporate irrelevant information, use inappropriate tool definitions, or apply constraints that came from different contexts. Confusion is especially problematic when context contains multiple task types or when switching between tasks within a single session. **Signs of Confusion** Watch for responses that address the wrong aspect of a query, tool calls that seem appropriate for a different task, or outputs that mix requirements from multiple sources. These indicate confusion about what context applies to the current situation. **Architectural Solutions** Architectural solutions include explicit task segmentation where different tasks get different context windows, clear transitions between task contexts, and state management that isolates context for different objectives. ### Context Clash Context clash develops when accumulated information directly conflicts, creating contradictory guidance that derails reasoning. This differs from poisoning where one piece of information is incorrect--in clash, multiple correct pieces of information contradict each other. 
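For instance, two retrieved snippets can each be correct on their own and still clash; the file names and API details below are hypothetical:

```markdown
# From docs/auth-v1.md (retrieved earlier):
Authenticate requests with an `X-Api-Key` header.

# From docs/auth-v2.md (retrieved later):
API keys are deprecated; all requests must use OAuth 2.0 bearer tokens.
```

Both statements were accurate for their respective versions, but together they give the agent contradictory guidance.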
**Sources of Clash** Clash commonly arises from multi-source retrieval where different sources have contradictory information, version conflicts where outdated and current information both appear in context, and perspective conflicts where different viewpoints are valid but incompatible. **Resolution Approaches** Resolution approaches include explicit conflict marking that identifies contradictions and requests clarification, priority rules that establish which source takes precedence, and version filtering that excludes outdated information from context. ### Counterintuitive Findings Research reveals several counterintuitive patterns that challenge assumptions about context management. **Shuffled Haystacks Outperform Coherent Ones** Studies found that shuffled (incoherent) haystacks produce better performance than logically coherent ones. This suggests that coherent context may create false associations that confuse retrieval, while incoherent context forces models to rely on exact matching. **Single Distractors Have Outsized Impact** Even a single irrelevant document reduces performance significantly. The effect is not proportional to the amount of noise but follows a step function where the presence of any distractor triggers degradation. **Needle-Question Similarity Correlation** Lower similarity between needle and question pairs shows faster degradation with context length. Tasks requiring inference across dissimilar content are particularly vulnerable. ### When Larger Contexts Hurt Larger context windows do not uniformly improve performance. In many cases, larger contexts create new problems that outweigh benefits. **Performance Degradation Curves** Models exhibit non-linear degradation with context length. Performance remains stable up to a threshold, then degrades rapidly. The threshold varies by model and task complexity. For many models, meaningful degradation begins around 8,000-16,000 tokens even when context windows support much larger sizes. **Cost Implications** Processing cost grows disproportionately with context length. The cost to process a 400K token context is not double the cost of 200K--it increases exponentially in both time and computing resources. For many applications, this makes large-context processing economically impractical. **Cognitive Load Metaphor** Even with an infinite context, asking a single model to maintain consistent quality across dozens of independent tasks creates a cognitive bottleneck. The model must constantly switch context between items, maintain a comparative framework, and ensure stylistic consistency. This is not a problem that more context solves. ## Practical Guidance ### The Four-Bucket Approach Four strategies address different aspects of context degradation: **Write**: Save context outside the window using scratchpads, file systems, or external storage. This keeps active context lean while preserving information access. **Select**: Pull relevant context into the window through retrieval, filtering, and prioritization. This addresses distraction by excluding irrelevant information. **Compress**: Reduce tokens while preserving information through summarization, abstraction, and observation masking. This extends effective context capacity. **Isolate**: Split context across sub-agents or sessions to prevent any single context from growing large enough to degrade. This is the most aggressive strategy but often the most effective. ### Architectural Patterns Implement these strategies through specific architectural patterns. 
Use just-in-time context loading to retrieve information only when needed. Use observation masking to replace verbose tool outputs with compact references. Use sub-agent architectures to isolate context for different tasks. Use compaction to summarize growing context before it exceeds limits. ## Examples **Example 1: Detecting Degradation in Prompt Design** ```markdown # Signs your command/skill prompt may be too large: Early signs (context ~50-70% utilized): - Agent occasionally misses instructions - Responses become less focused - Some guidelines ignored Warning signs (context ~70-85% utilized): - Inconsistent behavior across runs - Agent "forgets" earlier instructions - Quality varies significantly Critical signs (context >85% utilized): - Agent ignores key constraints - Hallucinations increase - Task completion fails ``` **Example 2: Mitigating Lost-in-Middle in Prompt Structure** ```markdown # Organize prompts with critical info at edges <CRITICAL_CONSTRAINTS> # At start (high attention) - Never modify production files directly - Always run tests before committing - Maximum file size: 500 lines </CRITICAL_CONSTRAINTS> <DETAILED_GUIDELINES> # Middle (lower attention) - Code style preferences - Documentation templates - Review checklists - Example patterns </DETAILED_GUIDELINES> <KEY_REMINDERS> # At end (high attention) - Run tests: npm test - Format code: npm run format - Create PR with description </KEY_REMINDERS> ``` **Example 3: Sub-Agent Context Isolation** ```markdown # Instead of one agent handling everything: ## Coordinator Agent (lean context) - Understands task decomposition - Delegates to specialized sub-agents - Synthesizes results ## Code Review Sub-Agent (isolated context) - Loaded only with code review guidelines - Focuses solely on review task - Returns structured findings ## Test Writer Sub-Agent (isolated context) - Loaded only with testing patterns - Focuses solely on test creation - Returns test files ``` ## Guidelines 1. Monitor context length and performance correlation during development 2. Place critical information at beginning or end of context 3. Implement compaction triggers before degradation becomes severe 4. Validate retrieved documents for accuracy before adding to context 5. Use versioning to prevent outdated information from causing clash 6. Segment tasks to prevent context confusion across different objectives 7. Design for graceful degradation rather than assuming perfect conditions 8. Test with progressively larger contexts to find degradation thresholds # Context Degradation Patterns: Multi-Agent Workflows This section transforms context degradation detection and mitigation concepts into actionable multi-agent workflows for Claude Code. Use these patterns when building commands, skills, or complex agent pipelines to ensure quality and reliability. ## Hallucination Detection Workflow Hallucinations in agent output can poison downstream context and propagate errors through multi-step workflows. This workflow detects hallucinations before they compound. ### When to Use - After any agent completes a task that produces factual claims - Before committing agent-generated code or documentation - When output will be used as input for subsequent agents - During review of long-running agent sessions ### Multi-Agent Verification Pattern **Step 1: Generate Output** Have the primary agent complete its task normally. **Step 2: Extract Claims** Spawn a verification sub-agent with this prompt: ```markdown <TASK> Extract all factual claims from the following output. 
List each claim on a separate line. </TASK> <FOCUS_AREAS> - File paths and their existence - Function/class/method names referenced - Code behavior assertions ("this function returns X") - External facts about APIs, libraries, or specifications - Numerical values and metrics </FOCUS_AREAS> <OUTPUT_TO_ANALYZE> {agent_output} </OUTPUT_TO_ANALYZE> <OUTPUT_FORMAT> One claim per line, prefixed with category: [PATH] /src/auth/login.ts exists [CODE] validateCredentials() returns a boolean [FACT] JWT tokens expire after 24 hours by default [METRIC] The function has O(n) complexity </OUTPUT_FORMAT> ``` **Step 3: Verify Claims** For groups of extracted claims, spawn a verification agent: ```markdown <TASK> Verify this claim by checking the actual codebase and context. </TASK> <CLAIM> {claim} </CLAIM> <VERIFICATION_APPROACH> - For file paths: Use file tools to check existence - For code claims: Read the actual code and verify behavior - For external facts: Cross-reference with documentation or web search - For metrics: Analyze the code structure </VERIFICATION_APPROACH> <RESPONSE_FORMAT> STATUS: [VERIFIED | FALSE | UNVERIFIABLE] EVIDENCE: [What you found] CONFIDENCE: [HIGH | MEDIUM | LOW] </RESPONSE_FORMAT> ``` **Step 4: Calculate Poisoning Risk** Aggregate verification results: ``` total_claims = number of claims extracted verified_count = claims marked VERIFIED false_count = claims marked FALSE unverifiable_count = claims marked UNVERIFIABLE poisoning_risk = (false_count * 2 + unverifiable_count) / total_claims ``` **Step 5: Decision Threshold** - **Risk < 0.1**: Output is reliable, proceed normally - **Risk 0.1-0.3**: Review flagged claims manually before proceeding - **Risk > 0.3**: Regenerate output with more explicit grounding instructions: ```markdown <REGENERATION_PROMPT> Previous output contained {false_count} false claims and {unverifiable_count} unverifiable claims. Specific issues: {list of FALSE and UNVERIFIABLE claims with evidence} Please regenerate your response. For each factual claim: 1. Explicitly verify it using tools before stating it 2. If you cannot verify, state "I cannot verify..." instead of asserting 3. Cite the specific file/line/source for verifiable facts </REGENERATION_PROMPT> ``` ## Lost-in-Middle Detection Workflow Critical information buried in the middle of long prompts receives less attention. This workflow detects which parts of your prompt are at risk of being ignored by running multiple agents and verifying their outputs against the original instructions. ### When to Use - When designing new commands or skills with long prompts - When agents inconsistently follow instructions across runs - Before deploying prompts to production - During prompt optimization ### Multi-Run Verification Pattern **Step 1: Identify Critical Instructions** Extract all critical instructions from your prompt that the agent MUST follow: ```markdown Critical instructions to verify: 1. "Never modify files in /production" 2. "Always run tests before committing" 3. "Use TypeScript strict mode" 4. "Maximum function length: 50 lines" 5. "Include JSDoc for public APIs" 6. "Format output as JSON" 7. "Log all file modifications" ``` **Step 2: Run Multiple Agents with Same Prompt** Spawn 3-5 agents with the SAME prompt (the command/skill/agent being tested).
Each agent runs independently with identical inputs: ```markdown <AGENT_RUN_CONFIG> Number of runs: 5 Prompt: {your_full_prompt_being_tested} Task: {representative_task_that_exercises_all_instructions} For each run, save: - run_id: unique identifier - agent_output: complete response from agent - timestamp: when run completed </AGENT_RUN_CONFIG> ``` **Step 3: Verify Each Output Against Original Prompt** For each agent's output, spawn a NEW verification agent that checks compliance with every critical instruction: ```markdown <VERIFICATION_AGENT_PROMPT> <TASK> You are a compliance verification agent. Analyze whether the agent output followed each instruction from the original prompt. </TASK> <ORIGINAL_PROMPT> {the_full_prompt_being_tested} </ORIGINAL_PROMPT> <CRITICAL_INSTRUCTIONS> {numbered_list_of_critical_instructions} </CRITICAL_INSTRUCTIONS> <AGENT_OUTPUT> {output_from_run_N} </AGENT_OUTPUT> <VERIFICATION_APPROACH> For each critical instruction: 1. Determine if the instruction was applicable to this task 2. If applicable, check whether the output complies 3. Look for both explicit violations and omissions 4. Note any partial compliance </VERIFICATION_APPROACH> <OUTPUT_FORMAT> RUN_ID: {run_id} INSTRUCTION_COMPLIANCE: - Instruction 1: "Never modify files in /production" STATUS: [FOLLOWED | VIOLATED | NOT_APPLICABLE] EVIDENCE: {quote from output or explanation} - Instruction 2: "Always run tests before committing" STATUS: [FOLLOWED | VIOLATED | NOT_APPLICABLE] EVIDENCE: {quote from output or explanation} [... continue for all instructions ...] SUMMARY: - Instructions followed: {count} - Instructions violated: {count} - Not applicable: {count} </OUTPUT_FORMAT> </VERIFICATION_AGENT_PROMPT> ``` **Step 4: Aggregate Results and Identify At-Risk Parts** Collect verification results from all runs and identify instructions that were inconsistently followed: ```markdown <AGGREGATION_LOGIC> For each instruction: followed_count = number of runs where STATUS == FOLLOWED violated_count = number of runs where STATUS == VIOLATED applicable_runs = total_runs - (runs where STATUS == NOT_APPLICABLE) compliance_rate = followed_count / applicable_runs Classification: - compliance_rate == 1.0: RELIABLE (always followed) - compliance_rate >= 0.8: MOSTLY_RELIABLE (minor inconsistency) - compliance_rate >= 0.5: AT_RISK (inconsistent - likely lost-in-middle) - compliance_rate < 0.5: FREQUENTLY_IGNORED (severe issue) - compliance_rate == 0.0: ALWAYS_IGNORED (critical failure) AT_RISK instructions are the primary signal for lost-in-middle problems. These are instructions that work sometimes but not consistently, indicating they are in attention-weak positions. </AGGREGATION_LOGIC> <AGGREGATION_OUTPUT_FORMAT> INSTRUCTION COMPLIANCE SUMMARY: | Instruction | Followed | Violated | Compliance Rate | Status | |-------------|----------|----------|-----------------|--------| | 1. Never modify /production | 5/5 | 0/5 | 100% | RELIABLE | | 2. Run tests before commit | 3/5 | 2/5 | 60% | AT_RISK | | 3. TypeScript strict mode | 4/5 | 1/5 | 80% | MOSTLY_RELIABLE | | 4. Max function length 50 | 2/5 | 3/5 | 40% | FREQUENTLY_IGNORED | | 5. Include JSDoc | 5/5 | 0/5 | 100% | RELIABLE | | 6. Format as JSON | 1/5 | 4/5 | 20% | ALWAYS_IGNORED | | 7. 
Log modifications | 3/5 | 2/5 | 60% | AT_RISK | AT-RISK INSTRUCTIONS (likely in lost-in-middle zone): - Instruction 2: "Run tests before commit" (60% compliance) - Instruction 4: "Max function length 50" (40% compliance) - Instruction 6: "Format as JSON" (20% compliance) - Instruction 7: "Log modifications" (60% compliance) </AGGREGATION_OUTPUT_FORMAT> ``` **Step 5: Output Recommendations** Based on the at-risk parts identified, provide specific remediation guidance: ```markdown <RECOMMENDATIONS_OUTPUT> LOST-IN-MIDDLE ANALYSIS COMPLETE At-Risk Instructions Detected: {count} These instructions are inconsistently followed, indicating they likely reside in attention-weak positions (middle of prompt). SPECIFIC RECOMMENDATIONS: 1. MOVE CRITICAL INFORMATION TO ATTENTION-FAVORED POSITIONS The following instructions should be relocated to the beginning or end of your prompt: - "Run tests before commit" -> Move to <CRITICAL_CONSTRAINTS> at prompt START - "Max function length 50" -> Move to <KEY_REMINDERS> at prompt END - "Format as JSON" -> Move to <OUTPUT_FORMAT> at prompt END - "Log modifications" -> Add to both START and END sections 2. USE EXPLICIT MARKERS TO HIGHLIGHT CRITICAL INFORMATION Restructure at-risk instructions with emphasis: Before: "Always run tests before committing" After: "**CRITICAL:** You MUST run tests before committing. Never skip this step." Before: "Maximum function length: 50 lines" After: "3. [REQUIRED] Maximum function length: 50 lines" Use numbered lists, bold markers, or explicit tags like [REQUIRED], [CRITICAL], [MUST]. 3. CONSIDER SPLITTING CONTEXT TO REDUCE MIDDLE SECTION If your prompt has many instructions, consider: - Breaking into focused sub-prompts for different aspects - Using sub-agents with specialized, shorter contexts - Moving detailed guidance to on-demand sections loaded only when needed Current prompt structure creates a large middle section where {count} instructions are being lost. Reduce middle section by: - Moving 2-3 most critical items to edges - Converting remaining middle items to a numbered checklist - Adding explicit "verify these items" reminder at end </RECOMMENDATIONS_OUTPUT> ``` ### Complete Workflow Example ```markdown # Example: Testing a Code Review Command ## Original Prompt Being Tested: "Review the code for: security issues, performance problems, code style, test coverage, documentation completeness, error handling, and logging practices." ## Run 5 Agents: Each agent reviews the same code sample with this prompt. ## Verification Results: | Instruction | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Rate | |-------------|-------|-------|-------|-------|-------|------| | Security | Y | Y | Y | Y | Y | 100% | | Performance | Y | X | Y | X | Y | 60% | | Code style | X | X | Y | X | X | 20% | | Test coverage | X | Y | X | X | Y | 40% | | Documentation | X | X | X | Y | X | 20% | | Error handling | Y | Y | X | Y | Y | 80% | | Logging | Y | Y | Y | Y | Y | 100% | ## Analysis: - RELIABLE: Security, Logging (at edges of list) - AT_RISK: Performance, Error handling - FREQUENTLY_IGNORED: Code style, Test coverage, Documentation (middle of list) ## Remediation Applied: "**CRITICAL REVIEW AREAS:** 1. Security vulnerabilities 2. Test coverage gaps 3. Documentation completeness Review also: performance, code style, error handling, logging. **BEFORE COMPLETING:** Verify you addressed items 1-3 above." ``` ## Error Propagation Analysis Workflow In multi-agent chains, errors from early agents propagate and amplify through subsequent agents. 
This workflow traces errors to their source. ### When to Use - When final output contains errors despite correct intermediate steps - When debugging complex multi-agent workflows - When establishing error boundaries in agent chains - During post-mortem analysis of failed agent tasks ### Error Trace Pattern **Step 1: Capture Agent Chain Outputs** Record the output of each agent in your chain: ```markdown Agent Chain Record: - Agent 1 (Analyzer): {output_1} - Agent 2 (Planner): {output_2} - Agent 3 (Implementer): {output_3} - Agent 4 (Reviewer): {output_4} ``` **Step 2: Identify Error Symptoms** Spawn an error identification agent: ```markdown <TASK> Analyze the final output and identify all errors, inconsistencies, or quality issues. </TASK> <FINAL_OUTPUT> {output_from_last_agent} </FINAL_OUTPUT> <OUTPUT_FORMAT> ERROR_ID: E1 DESCRIPTION: Function missing null check LOCATION: src/utils/parser.ts:45 SEVERITY: HIGH ERROR_ID: E2 ... </OUTPUT_FORMAT> ``` **Step 3: Trace Each Error Backward** For each identified error, spawn a trace agent: ```markdown <TASK> Trace this error backward through the agent chain to find its origin. </TASK> <ERROR> {error_description} </ERROR> <AGENT_CHAIN_OUTPUTS> Agent 1 Output: {output_1} Agent 2 Output: {output_2} Agent 3 Output: {output_3} Agent 4 Output: {output_4} </AGENT_CHAIN_OUTPUTS> <ANALYSIS_APPROACH> For each agent output (starting from the last): 1. Does this output contain the error? 2. If yes, was the error present in the input to this agent? 3. If error is in output but not input: This agent INTRODUCED the error 4. If error is in both: This agent PROPAGATED the error </ANALYSIS_APPROACH> <OUTPUT_FORMAT> ERROR: {error_id} ORIGIN_AGENT: Agent {N} ORIGIN_TYPE: [INTRODUCED | PROPAGATED_FROM_CONTEXT | PROPAGATED_FROM_TOOL_OUTPUT] ROOT_CAUSE: {explanation} CONTEXT_THAT_CAUSED_IT: {relevant context snippet if applicable} </OUTPUT_FORMAT> ``` **Step 4: Calculate Propagation Metrics** ``` For each agent in chain: errors_introduced = count of errors this agent created errors_propagated = count of errors this agent passed through errors_caught = count of errors this agent fixed or flagged propagation_rate = errors_at_end / errors_introduced_total amplification_factor = errors_at_end / errors_at_start ``` **Step 5: Establish Error Boundaries** Based on analysis, add verification checkpoints: ```markdown <ERROR_BOUNDARY_TEMPLATE> After Agent {N} completes: 1. Spawn verification agent to check for common error patterns: - {error_pattern_1 that Agent N tends to introduce} - {error_pattern_2 that Agent N tends to introduce} 2. If errors detected: - Log error for analysis - Either: Fix inline and continue - Or: Regenerate Agent N output with explicit guidance 3. Only proceed to Agent {N+1} if verification passes </ERROR_BOUNDARY_TEMPLATE> ``` ## Context Relevance Scoring Workflow Not all parts of a prompt contribute equally to task completion. This workflow identifies distractor parts within a prompt that consume attention budget without adding value. ### When to Use - When optimizing prompt length and content - When deciding what to include in CLAUDE.md - When a prompt feels bloated but you are unsure what to cut - When debugging agents that ignore provided context - Before deploying new commands, skills, or agent prompts ### Distractor Identification Pattern **Step 1: Split Prompt into Parts** Divide the prompt (command/skill/agent) into logical sections. 
Each part should be a coherent unit: ```markdown <PROMPT_PARTS> PART_1: ID: background CONTENT: | You are a Python expert helping a development team. Current project: Data processing pipeline in Python 3.9+ PART_2: ID: code_style_rules CONTENT: | - Write clean, idiomatic Python code - Include type hints for function signatures - Add docstrings for public functions - Follow PEP 8 style guidelines PART_3: ID: historical_context CONTENT: | The project was migrated from Python 2.7 in 2019. Original team used camelCase naming but we now use snake_case. Legacy modules in /legacy folder are frozen. PART_4: ID: output_format CONTENT: | Provide actionable feedback with specific line references. Explain the reasoning behind suggestions. </PROMPT_PARTS> ``` Splitting guidelines: - Each XML section or Markdown header becomes a part - Separate conceptually distinct instructions into their own parts - Keep related instructions together (do not split mid-thought) - Aim for 3-15 parts depending on prompt length **Step 2: Spawn Scoring Agents** Spawn multiple scoring agents in parallel: ```markdown <TASK> Score how relevant this prompt part is for accomplishing the specified task. </TASK> <TASK_DESCRIPTION> {description of what the agent should accomplish} Example: "Review a pull request for code quality issues and suggest improvements" </TASK_DESCRIPTION> <PROMPT_PARTS> {contents of all the parts being evaluated} </PROMPT_PARTS> <SCORING_CRITERIA> Score 0-10 based on these criteria: ESSENTIAL (8-10): - Part directly enables task completion - Removing this part would cause task failure - Part contains critical constraints that prevent errors - Part defines required output format or structure HELPFUL (5-7): - Part improves output quality but is not strictly required - Part provides useful context that guides better decisions - Part contains preferences that affect style but not correctness MARGINAL (2-4): - Part has tangential relevance to the task - Part might occasionally be useful but usually is not - Part provides historical context rarely needed DISTRACTOR (0-1): - Part is irrelevant to the task - Part could confuse the agent about what to focus on - Part competes for attention without contributing value </SCORING_CRITERIA> <OUTPUT_FORMAT> RELEVANCE_SCORE: [0-10] JUSTIFICATION: [2-3 sentences explaining the score] USAGE_LIKELIHOOD: [How often would the agent reference this part during task execution? ALWAYS | OFTEN | SOMETIMES | RARELY | NEVER] </OUTPUT_FORMAT> ``` **Step 3: Aggregate Relevance Scores** Collect scores from all scoring agents: ``` PART_SCORES = [ {id: "background", score: 8, usage: "ALWAYS"}, {id: "code_style_rules", score: 9, usage: "ALWAYS"}, {id: "historical_context", score: 3, usage: "RARELY"}, {id: "output_format", score: 7, usage: "OFTEN"} ] ``` Calculate aggregate metrics: ``` total_parts = count(PART_SCORES) high_relevance_parts = count(parts where score >= 5) distractor_parts = count(parts where score < 5) context_efficiency = high_relevance_parts / total_parts average_relevance = sum(scores) / total_parts ``` **Step 4: Identify Distractor Parts** Apply the distractor threshold (score < 5): ```markdown DISTRACTOR_ANALYSIS: Identified Distractors: 1. PART: historical_context SCORE: 3/10 JUSTIFICATION: "Migration history from Python 2.7 is rarely relevant to reviewing current code. The naming convention note is useful but should be in code_style_rules instead."
RECOMMENDATION: REMOVE or RELOCATE Summary: - Total parts: 4 - High-relevance parts (>=5): 3 - Distractor parts (<5): 1 - Context efficiency: 75% - Average relevance: 6.75 Token Impact: - Distractor tokens: ~45 (historical_context) - Potential savings: 45 tokens (11% of prompt) ``` **Step 5: Generate Optimization Recommendations** Based on distractor analysis, provide actionable recommendations: ```markdown OPTIMIZATION_RECOMMENDATIONS: 1. REMOVE: historical_context Reason: Score 3/10, usage RARELY. Migration history does not inform code review decisions. 2. RELOCATE: "we now use snake_case" from historical_context Target: code_style_rules section Reason: This specific rule is relevant but buried in irrelevant historical context. 3. CONSIDER CONDENSING: background Current: 2 sentences Could be: 1 sentence ("Python 3.9+ data pipeline expert") Savings: ~15 tokens OPTIMIZED PROMPT STRUCTURE: - background (condensed): 8 tokens - code_style_rules (with snake_case added): 52 tokens - output_format: 28 tokens - Total: 88 tokens (down from 133 tokens) - Efficiency improvement: 34% reduction ``` ### Distractor Threshold Guidelines The default threshold of 5 balances comprehensiveness against efficiency: | Threshold | Use Case | |-----------|----------| | < 3 | Aggressive pruning for token-constrained contexts | | < 5 | Standard optimization (recommended default) | | < 7 | Conservative pruning for critical prompts | Adjust threshold based on: - **Context budget pressure**: Lower threshold when approaching limits - **Task criticality**: Higher threshold for production prompts - **Prompt stability**: Lower threshold for experimental prompts ### Scoring Agent Deployment For efficiency, parallelize scoring agents: ```markdown # Parallel execution pattern spawn_parallel([ scoring_agent(part_1, task_description), scoring_agent(part_2, task_description), scoring_agent(part_3, task_description), ... ]) # Collect and aggregate scores = await_all(scoring_agents) analysis = aggregate_scores(scores) ``` For large prompts (>10 parts), batch scoring agents in groups of 5-7 to manage orchestration overhead. ## Context Health Monitoring Workflow Long-running agent sessions accumulate context that degrades over time. This workflow monitors context health and triggers intervention. ### When to Use - During long-running agent sessions (>20 turns) - When agents start exhibiting degradation symptoms - As a periodic health check in agent orchestration systems - Before critical decision points in agent workflows ### Health Check Pattern **Step 1: Periodic Symptom Detection** Every N turns (recommended: every 10 turns), spawn a health check agent: ```markdown <TASK> Analyze the recent conversation history for signs of context degradation. 
</TASK> <RECENT_HISTORY> {last 10 turns of conversation} </RECENT_HISTORY> <SYMPTOM_CHECKLIST> Check for these degradation symptoms: LOST_IN_MIDDLE: - [ ] Agent missing instructions from early in conversation - [ ] Critical constraints being ignored - [ ] Agent asking for information already provided CONTEXT_POISONING: - [ ] Same error appearing repeatedly - [ ] Agent referencing incorrect information as fact - [ ] Hallucinations that persist despite correction CONTEXT_DISTRACTION: - [ ] Responses becoming unfocused - [ ] Agent using irrelevant context inappropriately - [ ] Quality declining on previously-successful tasks CONTEXT_CONFUSION: - [ ] Agent mixing up different task requirements - [ ] Wrong tool selections for obvious tasks - [ ] Outputs that blend requirements from different tasks CONTEXT_CLASH: - [ ] Agent expressing uncertainty about conflicting information - [ ] Inconsistent behavior between turns - [ ] Agent asking for clarification on resolved issues </SYMPTOM_CHECKLIST> <OUTPUT_FORMAT> HEALTH_STATUS: [HEALTHY | DEGRADED | CRITICAL] SYMPTOMS_DETECTED: [list of checked symptoms] RECOMMENDED_ACTION: [CONTINUE | COMPACT | RESTART] SPECIFIC_ISSUES: [detailed description of problems found] </OUTPUT_FORMAT> ``` **Step 2: Automated Intervention** Based on health status, trigger appropriate intervention: ```markdown IF HEALTH_STATUS == "DEGRADED" or HEALTH_STATUS == "CRITICAL": <RESTART_INTERVENTION> 1. Extract essential state to preserve and save to a file 2. Ask user to start a new session with clean context and load the preserved state from the file after the new session is started </RESTART_INTERVENTION> ``` ## Guidelines for Multi-Agent Verification 1. Spawn verification agents with focused, single-purpose prompts 2. Use structured output formats for reliable parsing 3. Set clear thresholds for action vs. continue decisions 4. Log all verification results for debugging and optimization 5. Balance verification overhead against error prevention value 6. Implement verification at natural checkpoints, not every turn 7. Use lighter-weight checks for routine operations, heavier for critical ones 8. Design verification to be skippable in time-critical scenarios # Context Optimization Techniques Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. The goal is not to magically increase context windows but to make better use of available capacity. Effective optimization can double or triple effective context capacity without requiring larger models or longer contexts. ## Core Concepts Context optimization extends effective capacity through four primary strategies: compaction (summarizing context near limits), observation masking (replacing verbose outputs with references), KV-cache optimization (reusing cached computations), and context partitioning (splitting work across isolated contexts). The key insight is that context quality matters more than quantity. Optimization preserves signal while reducing noise. The art lies in selecting what to keep versus what to discard, and when to apply each technique. ## Detailed Topics ### Compaction Strategies **What is Compaction** Compaction is the practice of summarizing context contents when approaching limits, then reinitializing a new context window with the summary. This distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation. 
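A minimal sketch of a compaction trigger follows. The `summarize` callable, the threshold, and the 4-characters-per-token estimate are illustrative assumptions, not a specific API:

```python
# Minimal compaction sketch. `summarize` stands in for a model call that
# condenses older turns into a short, high-fidelity summary; the threshold
# and the 4-characters-per-token estimate are rough, illustrative heuristics.
COMPACTION_THRESHOLD = 150_000  # tokens

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages: list[dict], summarize) -> list[dict]:
    """Replace older turns with a summary once the context nears its limit.

    The system prompt and the most recent turns are kept verbatim; everything
    in between is collapsed into a single summary message.
    """
    if estimate_tokens(messages) < COMPACTION_THRESHOLD:
        return messages
    system, *history = messages
    recent, older = history[-6:], history[:-6]
    summary = summarize(older)  # high-fidelity distillation of old turns
    return [
        system,
        {"role": "user", "content": f"Summary of earlier work:\n{summary}"},
        *recent,
    ]
```

The key design choice in any such sketch is what survives verbatim: the system prompt and recent turns stay, and only the middle of the history is distilled.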
Compaction typically serves as the first lever in context optimization. The art lies in selecting what to keep versus what to discard. **Compaction in Practice** Compaction works by identifying sections that can be compressed, generating summaries that capture essential points, and replacing full content with summaries. Priority for compression: 1. **Tool outputs** - Replace verbose outputs with key findings 2. **Old conversation turns** - Summarize early exchanges 3. **Retrieved documents** - Summarize if task context captured 4. **Never compress** - System prompt and critical constraints **Summary Generation** Effective summaries preserve different elements depending on content type: - **Tool outputs**: Preserve key findings, metrics, and conclusions. Remove verbose raw output. - **Conversational turns**: Preserve key decisions, commitments, and context shifts. Remove filler and back-and-forth. - **Retrieved documents**: Preserve key facts and claims. Remove supporting evidence and elaboration. ### Observation Masking **The Observation Problem** Tool outputs can comprise 80%+ of token usage in agent trajectories. Much of this is verbose output that has already served its purpose. Once an agent has used a tool output to make a decision, keeping the full output provides diminishing value while consuming significant context. Observation masking replaces verbose tool outputs with compact references. The information remains accessible if needed but does not consume context continuously. **Masking Strategy Selection** Not all observations should be masked equally: **Never mask:** - Observations critical to current task - Observations from the most recent turn - Observations used in active reasoning **Consider masking:** - Observations from 3+ turns ago - Verbose outputs with key points extractable - Observations whose purpose has been served **Always mask:** - Repeated outputs - Boilerplate headers/footers - Outputs already summarized in conversation ### Context Partitioning **Sub-Agent Partitioning** The most aggressive form of context optimization is partitioning work across sub-agents with isolated contexts. Each sub-agent operates in a clean context focused on its subtask without carrying accumulated context from other subtasks. This approach achieves separation of concerns--the detailed search context remains isolated within sub-agents while the coordinator focuses on synthesis and analysis. **When to Partition** Consider partitioning when: - Task naturally decomposes into independent subtasks - Different subtasks require different specialized context - Context accumulation threatens to exceed limits - Different subtasks have conflicting requirements **Result Aggregation** Aggregate results from partitioned subtasks by: 1. Validating all partitions completed 2. Merging compatible results 3. Summarizing if combined results still too large 4. 
Resolving conflicts between partition outputs ## Practical Guidance ### Optimization Decision Framework **When to optimize:** - Response quality degrades as conversations extend - Costs increase due to long contexts - Latency increases with conversation length **What to apply:** - Tool outputs dominate: observation masking - Retrieved documents dominate: summarization or partitioning - Message history dominates: compaction with summarization - Multiple components: combine strategies ### Applying Optimization to Claude Code Prompts **Command Optimization** Commands load on-demand, so keep each command focused: ```markdown # Good: Focused command with clear scope --- name: review-security description: Review code for security vulnerabilities --- # Specific security review instructions only # Avoid: Overloaded command trying to do everything --- name: review-all description: Review code for everything --- # 50 different review checklists crammed together ``` **Skill Optimization** Skills load their descriptions by default, so descriptions must be concise: ```markdown # Good: Concise description description: Analyze code architecture. Use for design reviews. # Avoid: Verbose description that wastes context budget description: This skill provides comprehensive analysis of code architecture including but not limited to class hierarchies, dependency graphs, coupling metrics, cohesion analysis... ``` **Sub-Agent Context Design** When spawning sub-agents, provide focused context: ```markdown # Coordinator provides minimal handoff: "Review authentication module for security issues. Return findings in structured format." # NOT this verbose handoff: "I need you to look at the authentication module which is located in src/auth/ and contains several files including login.ts, session.ts, tokens.ts... [500 more tokens of context]" ``` ## Guidelines 1. Measure before optimizing--know your current state 2. Apply compaction before masking when possible 3. Design for cache stability with consistent prompts 4. Partition before context becomes problematic 5. Monitor optimization effectiveness over time 6. Balance token savings against quality preservation 7. Test optimization at production scale 8. Implement graceful degradation for edge cases
plugins/customaize-agent/skills/create-rule/SKILL.md
--- name: create-rule description: Use when you find a gap or repetitive issue produced by you or an implementation agent. Essentially, use it each time you catch yourself saying "You're absolutely right, I should have done it differently." -> create a rule for the issue so it does not appear again. --- # Create Rule Guide for creating effective `.claude/rules` files with contrastive examples that improve agent accuracy. ## Overview **Core principle:** Effective rules use contrastive examples (Incorrect vs Correct) to eliminate ambiguity. **REQUIRED BACKGROUND:** Rules are behavioral guardrails that load into every session and shape how agents behave across all tasks. Skills load on-demand. If guidance is task-specific, create a skill instead. ## About Rules Rules are modular, always-loaded instructions placed in `.claude/rules/` that enforce consistent behavior. They act as "standing orders" — every agent session inherits them automatically. ### What Rules Provide 1. **Behavioral constraints** — What to do and what NOT to do 2. **Code standards** — Formatting, patterns, architecture decisions 3. **Quality gates** — Conditions that must be met before proceeding 4. **Domain conventions** — Project-specific terminology and practices ### Rules vs Skills vs CLAUDE.md | Aspect | Rules (`.claude/rules/`) | Skills (`skills/`) | CLAUDE.md | |--------|--------------------------|---------------------|-----------| | **Loading** | Every session (or path-scoped) | On-demand when triggered | Every session | | **Purpose** | Behavioral constraints | Procedural knowledge | Project overview | | **Scope** | Narrow, focused topics | Complete workflows | Broad project context | | **Size** | Small (50-200 words each) | Medium (200-2000 words) | Medium (project summary) | | **Format** | Contrastive examples | Step-by-step guides | Key-value / bullet points | ## When to Create a Rule **Create when:** - A behavior must apply to ALL agent sessions, not just specific tasks - Agents repeatedly make the same mistake despite corrections - A convention has clear right/wrong patterns (contrastive examples possible) - Path-specific guidance is needed for certain file types **Do NOT create for:** - Task-specific workflows (use a skill instead) - One-time instructions (put in the prompt) - Broad project context (put in CLAUDE.md) - Guidance that requires multi-step procedures (use a skill) ## Rule Types ### Global Rules (no `paths` frontmatter) Load every session. Use for universal constraints. ```markdown # Error Handling All error handlers must log the error before rethrowing. Never silently swallow exceptions. ``` ### Path-Scoped Rules (`paths` frontmatter) Load only when the agent works with matching files. Use for file-type-specific guidance. ```markdown --- paths: - "src/api/**/*.ts" --- # API Development Rules All API endpoints must include input validation. Use the standard error response format. ``` ### Priority Rules (evaluator/judge guidance) Explicit high-level rules that set evaluation priorities. ```markdown # Evaluation Priorities Prioritize correctness over style. Do not reward hallucinated detail. Penalize confident wrong answers more than uncertain correct ones. ``` ## Rule Structure: The Contrastive Pattern Every rule MUST follow the Description-Incorrect-Correct template. This structure eliminates ambiguity by showing both what NOT to do and what TO do. ### Required Sections ```markdown --- title: Short Rule Name paths: # Optional but preferred: use it whenever a scope can be defined!
- "src/**/*.ts" --- # Rule Name [1-2 sentence description of what the rule enforces and WHY it matters.] ## Incorrect [Description of what is wrong with this pattern.] \`\`\`language // Anti-pattern code or behavior example \`\`\` ## Correct [Description of why this pattern is better.] \`\`\`language // Recommended code or behavior example \`\`\` ## Reference [Optional: links to documentation, papers, or related rules.] ``` ### Why Contrastive Examples Work Researches shows that rules with both positive and negative examples are significantly more discriminative than rules with only positive guidance. The Incorrect/Correct pairing: 1. **Eliminates ambiguity** — the agent sees the exact boundary between acceptable and unacceptable 2. **Prevents rationalization** — harder to argue "this is close enough" when the wrong pattern is explicitly shown 3. **Enables self-correction** — agents can compare their output against both patterns ## Writing Effective Rules ### Rule Description Principles Explicit, high-level guidance: | Principle | Example | |-----------|---------| | **Prioritize correctness over style** | "A functionally correct but ugly solution is better than an elegant but broken one" | | **Do not reward hallucinated detail** | "Extra information not grounded in the codebase should be penalized, not rewarded" | | **Penalize confident errors** | "A confidently stated wrong answer is worse than an uncertain correct one" | | **Be specific, not vague** | "Functions must not exceed 50 lines" not "Keep functions short" | | **State the WHY** | "Use early returns to reduce nesting — deeply nested code increases cognitive load" | ### Incorrect Examples: What to Show The Incorrect section must show a pattern the agent would **plausibly produce**. Abstract or contrived bad examples provide no value. **Effective Incorrect examples:** - Show the most common mistake agents make for this scenario - Include the rationalization an agent might use ("this is simpler") - Mirror real code patterns found in the codebase **Ineffective Incorrect examples:** - Obviously broken code no agent would produce - Syntax errors (agents already avoid these) - Patterns unrelated to the rule's concern ### Correct Examples: What to Show The Correct section must show the minimal change needed to fix the Incorrect pattern. Large rewrites obscure the actual lesson. **Effective Correct examples:** - Show the same scenario as Incorrect, fixed - Highlight the specific change that matters - Include a brief comment explaining WHY this is better **Ineffective Correct examples:** - Completely different code from the Incorrect example - Over-engineered solutions that add unnecessary complexity - Patterns that require additional context not shown ### Token Efficiency Rules load every session. Every token counts. 
- **Target:** 50-200 words per rule file (excluding code examples) - **One rule per file** — do not bundle unrelated constraints - **Use path scoping** to avoid loading irrelevant rules - **Code examples:** Keep under 20 lines each (Incorrect and Correct) ## Directory Structure ``` .claude/ ├── CLAUDE.md # Project overview (broad) └── rules/ ├── code-style.md # Global: code formatting rules ├── error-handling.md # Global: error handling patterns ├── testing.md # Global: testing conventions ├── security.md # Global: security requirements ├── evaluation-priorities.md # Global: judge/evaluator priorities ├── frontend/ │ ├── components.md # Path-scoped: React component rules │ └── state-management.md # Path-scoped: state management rules └── backend/ ├── api-design.md # Path-scoped: API patterns └── database.md # Path-scoped: database conventions ``` **Naming conventions:** - Use lowercase with hyphens: `error-handling.md`, not `ErrorHandling.md` - Name by the concern, not the solution: `error-handling.md`, not `try-catch-patterns.md` - One topic per file for modularity - Use subdirectories to group related rules by domain ## Rule Creation Process Follow these steps in order, skipping only when a step is clearly not applicable. ### Step 1: Identify the Behavioral Gap Before writing any rule, identify the specific agent behavior that needs correction. This understanding can come from: - **Observed failures** — the agent repeatedly makes a specific mistake - **Codebase analysis** — the project has conventions not obvious from code alone - **Evaluation findings** — a judge/meta-judge identified a quality gap - **User feedback** — explicit correction of agent behavior Document the gap as a concrete statement: "The agent does X, but should do Y." Conclude this step when there is a clear, specific behavior to correct. ### Step 2: Determine Rule Scope Decide whether this rule should be: 1. **Global** (no `paths` frontmatter) — applies to all work in the project 2. **Path-scoped** (`paths` frontmatter with glob patterns) — applies only when working with matching files 3. **User-level** (`~/.claude/rules/`) — applies across all projects for personal preferences **Decision guide:** ``` Is this project-specific? No → User-level rule (~/.claude/rules/) Yes → Is it relevant to ALL files? Yes → Global rule (.claude/rules/rule-name.md) No → Path-scoped rule (.claude/rules/rule-name.md with paths: frontmatter) ``` ### Step 3: Write Contrastive Examples This is the most critical step. Write the Incorrect and Correct examples BEFORE writing the description. 1. **Start with the Incorrect pattern** — write the exact code or behavior the agent produces that needs correction 2. **Write the Correct pattern** — show the minimal fix that addresses the issue 3. **Verify contrast is clear** — the difference between Incorrect and Correct must be obvious and focused on exactly one concept **Quality check for contrastive examples:** | Check | Pass Criteria | |-------|---------------| | Plausibility | Would an agent actually produce the Incorrect pattern? | | Minimality | Does the Correct pattern change only what is necessary? | | Clarity | Can a reader identify the difference in under 5 seconds? | | Specificity | Does each example demonstrate exactly one concept? | | Groundedness | Are the examples drawn from real codebase patterns? | ### Step 4: Write the Rule Description Now write the 1-2 sentence description that connects the contrastive examples. 
The description must: - State WHAT the rule enforces - State WHY it matters (the impact or consequence) - Use imperative form ("Use early returns" not "You should use early returns") ### Step 5: Assemble the Rule File Create the rule file following the structure template: 1. Add YAML frontmatter with `title`, `impact`, `tags`, and optionally `paths` 2. Write the heading and description 3. Add the Incorrect section with description and code 4. Add the Correct section with description and code 5. Optionally add a Reference section with links Place the file in `.claude/rules/` with a descriptive filename. ### Step 6: Validate the Rule Before finishing, verify: 1. **File location** — rule exists at `.claude/rules/<rule-name>.md` 2. **Frontmatter** — contains at minimum `title` and `impact` 3. **Contrastive examples** — both Incorrect and Correct sections present with code blocks 4. **Token budget** — description is 50-200 words (excluding code) 5. **Path scoping** — if `paths` is set, glob patterns match intended files 6. **No overlap** — rule does not duplicate guidance in CLAUDE.md or other rules ### Step 7: Iterate Based on Feedback or Observations After a rule is written, apply a Decompose → Filter → Reweight refinement cycle before finalizing: #### 7.1 Decompose Check Consider splitting complex rules into multiple focused rules. For rules that you have written, ask yourself: "Is this rule trying to cover more than one concept?" - If YES, split it into multiple focused rules, each addressing exactly one concept - If the Incorrect example shows multiple distinct anti-patterns, create separate rules for each #### 7.2 Misalignment Filter For rules that you have written, ask yourself: "Could this rule penalize acceptable variations or reward behaviors the prompt does not ask for?" - If YES, narrow the scope or rewrite the contrastive examples - Verify: would an agent actually produce the Incorrect pattern? (If not, the rule is contrived) #### 7.3 Redundancy Filter Check all existing `.claude/rules/` files for overlap: - If a rule that covers the same concept already exists, **update the existing rule** instead and remove the duplicate rule that you just created - If two rules substantially overlap (enforcing the same behavioral boundary), merge them - Use: `ls -R .claude/rules/` and `grep -r "relevant-keyword"` to find potential overlaps #### 7.4 Impact Reweight Assign or reassign the `impact` frontmatter field based on: - **CRITICAL**: Anti-pattern causes data loss, security vulnerabilities, or system failures - **HIGH**: Anti-pattern causes broken functionality, incorrect behavior, or hard-to-debug issues - **MEDIUM**: Anti-pattern degrades quality, readability, or maintainability - **LOW**: Anti-pattern is a minor style or convention issue #### 7.5 Iterate Based on Feedback After the refinement cycle, ask the user for feedback on the rule. - If the user says that the rule is good, you can stop the refinement cycle. - If the user says that the rule is bad, update the rule to close the gaps and continue iterating until it is good. ## Complete Rule Example ```markdown --- title: Use Early Returns to Reduce Nesting paths: - "**/*.ts" --- # Use Early Returns to Reduce Nesting Handle error conditions and edge cases at the top of functions using early returns. Deeply nested code increases cognitive load and makes logic harder to follow. ## Incorrect Error handling is buried inside nested conditionals, making the happy path hard to find.
\`\`\`typescript function processOrder(order: Order) { if (order) { if (order.items.length > 0) { if (order.status === 'pending') { // actual logic buried 3 levels deep const total = calculateTotal(order.items) return submitOrder(order, total) } else { throw new Error('Order not pending') } } else { throw new Error('No items') } } else { throw new Error('No order') } } \`\`\` ## Correct Error conditions are handled first with early returns, keeping the happy path at the top level. \`\`\`typescript function processOrder(order: Order) { if (!order) throw new Error('No order') if (order.items.length === 0) throw new Error('No items') if (order.status !== 'pending') throw new Error('Order not pending') const total = calculateTotal(order.items) return submitOrder(order, total) } \`\`\` ## Reference - [Flattening Arrow Code](https://blog.codinghorror.com/flattening-arrow-code/) ``` ## Complete Path-Scoped Rule Example ```markdown --- title: API Endpoints Must Validate Input paths: - "src/api/**/*.ts" - "src/routes/**/*.ts" --- # API Endpoints Must Validate Input Every API endpoint must validate request input before processing. Unvalidated input leads to runtime errors, security vulnerabilities, and data corruption. ## Incorrect The handler trusts the request body without validation, allowing malformed data through. \`\`\`typescript export async function POST(req: Request) { const body = await req.json() const user = await db.users.create({ email: body.email, name: body.name, }) return Response.json(user) } \`\`\` ## Correct Input is validated with a schema before use. Invalid requests receive a 400 response. \`\`\`typescript import { z } from 'zod' const CreateUserSchema = z.object({ email: z.string().email(), name: z.string().min(1).max(100), }) export async function POST(req: Request) { const parsed = CreateUserSchema.safeParse(await req.json()) if (!parsed.success) { return Response.json({ error: parsed.error.flatten() }, { status: 400 }) } const user = await db.users.create(parsed.data) return Response.json(user) } \`\`\` ``` ## Anti-Patterns ### Vague Rules Without Examples ```markdown # Bad: No contrastive examples, too vague Keep functions short and readable. Use meaningful variable names. ``` **Why bad:** No concrete boundary. "Short" means different things to different agents. No Incorrect/Correct to calibrate behavior. ### Rules That Should Be Skills ```markdown # Bad: Multi-step procedure in a rule When deploying to production: 1. Run all tests 2. Check coverage thresholds 3. Build the project 4. Run integration tests 5. Deploy to staging first ... ``` **Why bad:** Rules should be constraints, not workflows. This belongs in a skill. ### Duplicate Rules ```markdown # Bad: Same guidance in two places # .claude/rules/formatting.md says "use 2-space indent" # CLAUDE.md also says "use 2-space indent" ``` **Why bad:** When guidance conflicts, the agent cannot determine which takes precedence. Keep each piece of guidance in exactly one location. ### Overly Broad Path Scoping ```markdown --- paths: - "**/*" --- ``` **Why bad:** Equivalent to a global rule but with the overhead of path matching. Remove the `paths` field entirely for global rules. 
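Parts of the validation step (Step 6) can be automated. The following is a minimal sketch, assuming rules follow the frontmatter and section conventions described above; the word-count heuristic and helper names are illustrative, not a supported tool:

```python
import re
from pathlib import Path

REQUIRED_SECTIONS = ("## Incorrect", "## Correct")

def validate_rule(path: Path) -> list[str]:
    """Return a list of problems found in a .claude/rules file."""
    text = path.read_text(encoding="utf-8")
    problems = []

    # Frontmatter must be the very first content and contain title + impact.
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        problems.append("missing YAML frontmatter")
    else:
        frontmatter = match.group(1)
        for field in ("title", "impact"):
            if f"{field}:" not in frontmatter:
                problems.append(f"frontmatter missing '{field}'")

    # Both contrastive sections must be present.
    for section in REQUIRED_SECTIONS:
        if section not in text:
            problems.append(f"missing '{section}' section")

    # Rough token-budget check: 50-200 words excluding fenced code blocks.
    prose = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    word_count = len(prose.split())
    if not 50 <= word_count <= 200:
        problems.append(f"prose is {word_count} words (target 50-200)")

    return problems

if __name__ == "__main__":
    for rule_file in Path(".claude/rules").rglob("*.md"):
        issues = validate_rule(rule_file)
        print(f"{rule_file}: {'OK' if not issues else '; '.join(issues)}")
```

A check like this catches structural gaps only; the plausibility and minimality of the contrastive examples still need human review.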
## Rule Creation Checklist - [ ] Behavioral gap identified with concrete "does X, should do Y" statement - [ ] Rule type determined: global, path-scoped, or user-level - [ ] Contrastive examples written: Incorrect shows plausible agent mistake - [ ] Contrastive examples written: Correct shows minimal fix - [ ] Description states WHAT the rule enforces and WHY - [ ] Frontmatter includes `title` and `impact` - [ ] Token budget: 50-200 words (excluding code examples) - [ ] One topic per rule file - [ ] No overlap with CLAUDE.md or other rule files - [ ] Path scoping uses correct glob patterns (if applicable) - [ ] File placed in `.claude/rules/` with descriptive hyphenated name ## The Bottom Line **Effective rules show; they do not just tell.** The Incorrect/Correct contrastive pattern eliminates ambiguity that prose descriptions leave open. When an agent can see both what to avoid and what to produce, compliance improves dramatically. Every rule should answer three questions: 1. **What** behavior does this enforce? 2. **Why** does it matter? 3. **How** does right differ from wrong? (shown through contrastive examples) ## Claude Code Official Rules Guidelines For larger projects, you can organize instructions into multiple files using the `.claude/rules/` directory. This keeps instructions modular and easier for teams to maintain. Rules can also be [scoped to specific file paths](#path-specific-rules), so they only load into context when Claude works with matching files, reducing noise and saving context space. <Note> Rules load into context every session or when matching files are opened. For task-specific instructions that don't need to be in context all the time, use [skills](/en/skills) instead, which only load when you invoke them or when Claude determines they're relevant to your prompt. </Note> ### Set up rules Place markdown files in your project's `.claude/rules/` directory. Each file should cover one topic, with a descriptive filename like `testing.md` or `api-design.md`. All `.md` files are discovered recursively, so you can organize rules into subdirectories like `frontend/` or `backend/`: ```text your-project/ ├── .claude/ │ ├── CLAUDE.md # Main project instructions │ └── rules/ │ ├── code-style.md # Code style guidelines │ ├── testing.md # Testing conventions │ └── security.md # Security requirements ``` Rules without [`paths` frontmatter](#path-specific-rules) are loaded at launch with the same priority as `.claude/CLAUDE.md`. ### Path-specific rules Rules can be scoped to specific files using YAML frontmatter with the `paths` field. These conditional rules only apply when Claude is working with files matching the specified patterns. ```markdown --- paths: - "src/api/**/*.ts" --- # API Development Rules - All API endpoints must include input validation - Use the standard error response format - Include OpenAPI documentation comments ``` Rules without a `paths` field are loaded unconditionally and apply to all files. Path-scoped rules trigger when Claude reads files matching the pattern, not on every tool use.
Use glob patterns in the `paths` field to match files by extension, directory, or any combination: | Pattern | Matches | | ---------------------- | ---------------------------------------- | | `**/*.ts` | All TypeScript files in any directory | | `src/**/*` | All files under `src/` directory | | `*.md` | Markdown files in the project root | | `src/components/*.tsx` | React components in a specific directory | You can specify multiple patterns and use brace expansion to match multiple extensions in one pattern: ```markdown --- paths: - "src/**/*.{ts,tsx}" - "lib/**/*.ts" - "tests/**/*.test.ts" --- ``` ### Share rules across projects with symlinks The `.claude/rules/` directory supports symlinks, so you can maintain a shared set of rules and link them into multiple projects. Symlinks are resolved and loaded normally, and circular symlinks are detected and handled gracefully. This example links both a shared directory and an individual file: ```bash ln -s ~/shared-claude-rules .claude/rules/shared ln -s ~/company-standards/security.md .claude/rules/security.md ``` ### User-level rules Personal rules in `~/.claude/rules/` apply to every project on your machine. Use them for preferences that aren't project-specific: ```text ~/.claude/rules/ ├── preferences.md # Your personal coding preferences └── workflows.md # Your preferred workflows ``` User-level rules are loaded before project rules, giving project rules higher priority.
plugins/customaize-agent/skills/agent-evaluation/SKILL.md
--- name: agent-evaluation description: Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality. --- # Evaluation Methods for Claude Code Agents Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects. ## Core Concepts Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases. The key insight is that agents may find alternative paths to goals—the evaluation should judge whether they achieve the right outcomes while following reasonable processes. **Performance Drivers: The 95% Finding** Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance: | Factor | Variance Explained | Implication | |--------|-------------------|-------------| | Token usage | 80% | More tokens = better performance | | Number of tool calls | ~10% | More exploration helps | | Model choice | ~5% | Better models multiply efficiency | Implications for Claude Code development: - **Token budgets matter**: Evaluate with realistic token constraints - **Model upgrades beat token increases**: Upgrading models provides larger gains than increasing token budgets - **Multi-agent validation**: Validates architectures that distribute work across subagents with separate context windows ## Evaluation Challenges ### Non-Determinism and Multiple Valid Paths Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context. **Solution**: Evaluate outcomes, not exact execution paths. Judge whether the agent achieves the right result through a reasonable process. ### Context-Dependent Failures Agent failures often depend on context in subtle ways. An agent might succeed on complex queries but fail on simple ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates. **Solution**: Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries. ### Composite Quality Dimensions Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low in efficiency, or vice versa. **Solution**: Evaluation rubrics must capture multiple dimensions with appropriate weighting for the use case.
## Evaluation Rubric Design ### Multi-Dimensional Rubric Effective rubrics cover key dimensions with descriptive levels: **Instruction Following** (weight: 0.30) - Excellent (1.0): All instructions followed precisely - Good (0.8): Minor deviations that don't affect outcome - Acceptable (0.6): Major instructions followed, minor ones missed - Poor (0.3): Significant instructions ignored - Failed (0.0): Fundamentally misunderstood the task **Output Completeness** (weight: 0.25) - Excellent: All requested aspects thoroughly covered - Good: Most aspects covered with minor gaps - Acceptable: Key aspects covered, some gaps - Poor: Major aspects missing - Failed: Fundamental aspects not addressed **Tool Efficiency** (weight: 0.20) - Excellent: Optimal tool selection and minimal calls - Good: Good tool selection with minor inefficiencies - Acceptable: Appropriate tools with some redundancy - Poor: Wrong tools or excessive calls - Failed: Severe tool misuse or extremely excessive calls **Reasoning Quality** (weight: 0.15) - Excellent: Clear, logical reasoning throughout - Good: Generally sound reasoning with minor gaps - Acceptable: Basic reasoning present - Poor: Reasoning unclear or flawed - Failed: No apparent reasoning **Response Coherence** (weight: 0.10) - Excellent: Well-structured, easy to follow - Good: Generally coherent with minor issues - Acceptable: Understandable but could be clearer - Poor: Difficult to follow - Failed: Incoherent ### Scoring Approach Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate weighted overall scores. Set passing thresholds based on use case requirements (typically 0.7 for general use, 0.85 for critical operations). ## Evaluation Methodologies ### LLM-as-Judge Using an LLM to evaluate agent outputs scales to large test sets and provides consistent judgments. The key is designing effective evaluation prompts that capture the dimensions of interest. Provide a clear task description, the agent output, ground truth (if available), and an evaluation scale with level descriptions, and request a structured judgment. **Evaluation Prompt Template**: ```markdown You are evaluating the output of a Claude Code agent. ## Original Task {task_description} ## Agent Output {agent_output} ## Ground Truth (if available) {expected_output} ## Evaluation Criteria For each criterion, assess the output and provide: 1. Score (1-5) 2. Specific evidence supporting your score 3. One improvement suggestion ### Criteria 1. Instruction Following: Did the agent follow all instructions? 2. Completeness: Are all requested aspects covered? 3. Tool Efficiency: Were appropriate tools used efficiently? 4. Reasoning Quality: Is the reasoning clear and sound? 5. Response Coherence: Is the output well-structured? Provide your evaluation as a structured assessment with scores and justifications. ``` **Chain-of-Thought Requirement**: Always require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
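Once a judge returns per-dimension scores in the 0.0-1.0 range, they can be combined with the rubric weights defined above. A minimal sketch; the dictionary keys and the example scores are illustrative:

```python
# Weighted rubric scoring sketch. Dimension scores (0.0-1.0) would come from
# a judge's structured output; weights mirror the multi-dimensional rubric.
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}

def weighted_score(dimension_scores: dict[str, float]) -> float:
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

def passes(dimension_scores: dict[str, float], threshold: float = 0.7) -> bool:
    # 0.7 for general use, 0.85 for critical operations (see Scoring Approach)
    return weighted_score(dimension_scores) >= threshold

example = {
    "instruction_following": 0.8,
    "output_completeness": 0.6,
    "tool_efficiency": 1.0,
    "reasoning_quality": 0.8,
    "response_coherence": 1.0,
}
print(weighted_score(example), passes(example))  # 0.81 True
```
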
### Human Evaluation Human evaluation catches what automation misses: - Hallucinated answers on unusual queries - Subtle context misunderstandings - Edge cases that automated evaluation overlooks - Qualitative issues with tone or approach For Claude Code development: - Review agent outputs manually for edge cases - Sample systematically across complexity levels - Track patterns in failures to inform prompt improvements ### End-State Evaluation For commands that produce artifacts (files, configurations, code), evaluate the final output rather than the process: - Does the generated code work? - Is the configuration valid? - Does the output meet requirements? ## Test Set Design **Sample Selection** Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects. Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels. **Complexity Stratification** Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning). ## Context Engineering Evaluation ### Testing Prompt Variations When iterating on Claude Code prompts, evaluate systematically: 1. **Baseline**: Run current prompt on test cases 2. **Variation**: Run modified prompt on same cases 3. **Compare**: Measure quality scores, token usage, efficiency 4. **Analyze**: Identify which changes improved which dimensions ### Testing Context Strategies Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics. ### Degradation Testing Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits. ## Advanced Evaluation: LLM-as-Judge **Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops. ### The Evaluation Taxonomy Evaluation approaches fall into two primary categories with distinct reliability profiles: **Direct Scoring**: A single LLM rates one response on a defined scale. - Best for: Objective criteria (factual accuracy, instruction following, toxicity) - Reliability: Moderate to high for well-defined criteria - Failure mode: Score calibration drift, inconsistent scale interpretation **Pairwise Comparison**: An LLM compares two responses and selects the better one. - Best for: Subjective preferences (tone, style, persuasiveness) - Reliability: Higher than direct scoring for preferences - Failure mode: Position bias, length bias Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth. ### The Bias Landscape LLM judges exhibit systematic biases that must be actively mitigated: **Position Bias**: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
**Length Bias**: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring. **Self-Enhancement Bias**: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation. **Verbosity Bias**: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail. **Authority Bias**: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer. ### Metric Selection Framework Choose metrics based on the evaluation task structure: | Task Type | Primary Metrics | Secondary Metrics | |-----------|-----------------|-------------------| | Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ | | Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) | | Pairwise preference | Agreement rate, Position consistency | Confidence calibration | | Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall | The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise. ## Evaluation Metrics Reference ### Classification Metrics (Pass/Fail Tasks) **Precision**: Of all responses marked as passing, what fraction truly passed? - Use when false positives are costly **Recall**: Of all actually passing responses, what fraction did we identify? - Use when false negatives are costly **F1 Score**: Harmonic mean of precision and recall - Use for balanced single-number summary ### Agreement Metrics (Comparing to Human Judgment) **Cohen's Kappa**: Agreement adjusted for chance - > 0.8: Almost perfect agreement - 0.6-0.8: Substantial agreement - 0.4-0.6: Moderate agreement - < 0.4: Fair to poor agreement ### Correlation Metrics (Ordinal Scores) **Spearman's Rank Correlation**: Correlation between rankings - > 0.9: Very strong correlation - 0.7-0.9: Strong correlation - 0.5-0.7: Moderate correlation - < 0.5: Weak correlation ### Good Evaluation System Indicators | Metric | Good | Acceptable | Concerning | |--------|------|------------|------------| | Spearman's rho | > 0.8 | 0.6-0.8 | < 0.6 | | Cohen's Kappa | > 0.7 | 0.5-0.7 | < 0.5 | | Position consistency | > 0.9 | 0.8-0.9 | < 0.8 | | Length-score correlation | < 0.2 | 0.2-0.4 | > 0.4 | ## Evaluation Approaches ### Direct Scoring Implementation Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format. **Criteria Definition Pattern**: ``` Criterion: [Name] Description: [What this criterion measures] Weight: [Relative importance, 0-1] ``` **Scale Calibration**: - 1-3 scales: Binary with neutral option, lowest cognitive load - 1-5 scales: Standard Likert, good balance of granularity and reliability - 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics **Prompt Structure for Direct Scoring**: ``` You are an expert evaluator assessing response quality. ## Task Evaluate the following response against each criterion. ## Original Prompt {prompt} ## Response to Evaluate {response} ## Criteria {for each criterion: name, description, weight} ## Instructions For each criterion: 1. Find specific evidence in the response 2. Score according to the rubric (1-{max} scale) 3. Justify your score with evidence 4.
Suggest one specific improvement ## Output Format Respond with structured JSON containing scores, justifications, and summary. ``` **Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches. ### Pairwise Comparison Implementation Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation. **Position Bias Mitigation Protocol**: 1. First pass: Response A in first position, Response B in second 2. Second pass: Response B in first position, Response A in second 3. Consistency check: If passes disagree, return TIE with reduced confidence 4. Final verdict: Consistent winner with averaged confidence **Prompt Structure for Pairwise Comparison**: ``` You are an expert evaluator comparing two AI responses. ## Critical Instructions - Do NOT prefer responses because they are longer - Do NOT prefer responses based on position (first vs second) - Focus ONLY on quality according to the specified criteria - Ties are acceptable when responses are genuinely equivalent ## Original Prompt {prompt} ## Response A {response_a} ## Response B {response_b} ## Comparison Criteria {criteria list} ## Instructions 1. Analyze each response independently first 2. Compare them on each criterion 3. Determine overall winner with confidence level ## Output Format JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning. ``` **Confidence Calibration**: Confidence scores should reflect position consistency: - Both passes agree: confidence = average of individual confidences - Passes disagree: confidence = 0.5, verdict = TIE ## Rubric Generation Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring. ### Rubric Components 1. **Level descriptions**: Clear boundaries for each score level 2. **Characteristics**: Observable features that define each level 3. **Examples**: Representative outputs for each level (when possible) 4. **Edge cases**: Guidance for ambiguous situations 5. **Scoring guidelines**: General principles for consistent application ### Strictness Calibration - **Lenient**: Lower bar for passing scores, appropriate for encouraging iteration - **Balanced**: Fair, typical expectations for production use - **Strict**: High standards, appropriate for safety-critical or high-stakes evaluation ### Domain Adaptation Rubrics should use domain-specific terminology: - A "code readability" rubric mentions variables, functions, and comments. - Documentation rubrics reference clarity, accuracy, completeness - Analysis rubrics focus on depth, accuracy, actionability ## Practical Guidance ### Evaluation Pipeline Design Production evaluation systems require multiple layers: ``` ┌─────────────────────────────────────────────────┐ │ Evaluation Pipeline │ ├─────────────────────────────────────────────────┤ │ │ │ Input: Response + Prompt + Context │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Criteria Loader │ ◄── Rubrics, weights │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Primary Scorer │ ◄── Direct or Pairwise │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Bias Mitigation │ ◄── Position swap, etc. 
│ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Confidence Scoring │ ◄── Calibration │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ Output: Scores + Justifications + Confidence │ │ │ └─────────────────────────────────────────────────┘ ``` ### Avoiding Evaluation Pitfalls **Anti-pattern: Scoring without justification** - Problem: Scores lack grounding, difficult to debug or improve - Solution: Always require evidence-based justification before score **Anti-pattern: Single-pass pairwise comparison** - Problem: Position bias corrupts results - Solution: Always swap positions and check consistency **Anti-pattern: Overloaded criteria** - Problem: Criteria measuring multiple things are unreliable - Solution: One criterion = one measurable aspect **Anti-pattern: Missing edge case guidance** - Problem: Evaluators handle ambiguous cases inconsistently - Solution: Include edge cases in rubrics with explicit guidance **Anti-pattern: Ignoring confidence calibration** - Problem: High-confidence wrong judgments are worse than low-confidence - Solution: Calibrate confidence to position consistency and evidence strength ### Decision Framework: Direct vs. Pairwise Use this decision tree: ``` Is there an objective ground truth? ├── Yes → Direct Scoring │ └── Examples: factual accuracy, instruction following, format compliance │ └── No → Is it a preference or quality judgment? ├── Yes → Pairwise Comparison │ └── Examples: tone, style, persuasiveness, creativity │ └── No → Consider reference-based evaluation └── Examples: summarization (compare to source), translation (compare to reference) ``` ### Scaling Evaluation For high-volume evaluation: 1. **Panel of LLMs (PoLL)**: Use multiple models as judges, aggregate votes - Reduces individual model bias - More expensive but more reliable for high-stakes decisions 2. **Hierarchical evaluation**: Fast cheap model for screening, expensive model for edge cases - Cost-effective for large volumes - Requires calibration of screening threshold 3. **Human-in-the-loop**: Automated evaluation for clear cases, human review for low-confidence - Best reliability for critical applications - Design feedback loop to improve automated evaluation ## Examples ### Example 1: Direct Scoring for Accuracy **Input**: ``` Prompt: "What causes seasons on Earth?" Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, different hemispheres receive more direct sunlight at different times of year." Criterion: Factual Accuracy (weight: 1.0) Scale: 1-5 ``` **Output**: ```json { "criterion": "Factual Accuracy", "score": 5, "evidence": [ "Correctly identifies axial tilt as primary cause", "Correctly explains differential sunlight by hemisphere", "No factual errors present" ], "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.", "improvement": "Could add the specific tilt angle (23.5°) for completeness." 
} ``` ### Example 2: Pairwise Comparison with Position Swap **Input**: ``` Prompt: "Explain machine learning to a beginner" Response A: [Technical explanation with jargon] Response B: [Simple analogy-based explanation] Criteria: ["clarity", "accessibility"] ``` **First Pass (A first)**: ```json { "winner": "B", "confidence": 0.8 } ``` **Second Pass (B first)**: ```json { "winner": "A", "confidence": 0.6 } ``` (Note: Winner is A because B was in first position) **Mapped Second Pass**: ```json { "winner": "B", "confidence": 0.6 } ``` **Final Result**: ```json { "winner": "B", "confidence": 0.7, "positionConsistency": { "consistent": true, "firstPassWinner": "B", "secondPassWinner": "B" } } ``` ### Example 3: Rubric Generation **Input**: ``` criterionName: "Code Readability" criterionDescription: "How easy the code is to understand and maintain" domain: "software engineering" scale: "1-5" strictness: "balanced" ``` **Output** (abbreviated): ```json { "levels": [ { "score": 1, "label": "Poor", "description": "Code is difficult to understand without significant effort", "characteristics": [ "No meaningful variable or function names", "No comments or documentation", "Deeply nested or convoluted logic" ] }, { "score": 3, "label": "Adequate", "description": "Code is understandable with some effort", "characteristics": [ "Most variables have meaningful names", "Basic comments present for complex sections", "Logic is followable but could be cleaner" ] }, { "score": 5, "label": "Excellent", "description": "Code is immediately clear and maintainable", "characteristics": [ "All names are descriptive and consistent", "Comprehensive documentation", "Clean, modular structure" ] } ], "edgeCases": [ { "situation": "Code is well-structured but uses domain-specific abbreviations", "guidance": "Score based on readability for domain experts, not general audience" } ] } ``` ### Iterative Improvement Workflow 1. **Identify weakness**: Use evaluation to find where agent struggles 2. **Hypothesize cause**: Is it the prompt? The context? The examples? 3. **Modify prompt**: Make targeted changes based on hypothesis 4. **Re-evaluate**: Run same test cases with modified prompt 5. **Compare**: Did the change improve the target dimension? 6. **Check regression**: Did other dimensions suffer? 7. **Iterate**: Repeat until quality meets threshold ## Guidelines 1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25% 2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias 3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions 4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective 5. **Include confidence scores** - Calibrate to position consistency and evidence strength 6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance 7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations 8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment 9. **Monitor for systematic bias** - Track disagreement patterns by criterion and response type 10. **Design for iteration** - Evaluation systems improve with feedback loops ## Example: Evaluating a Claude Code Command Suppose you've created a `/refactor` command and want to evaluate its quality: **Test Cases**: 1. 
Simple: Rename a variable across a single file 2. Medium: Extract a function from existing code 3. Complex: Refactor a class to use a new design pattern 4. Very Complex: Restructure module dependencies **Evaluation Rubric**: - Correctness: Does the refactored code work? - Completeness: Were all instances updated? - Style: Does it follow project conventions? - Efficiency: Were unnecessary changes avoided? **Evaluation Prompt**: ```markdown Evaluate this refactoring output: Original Code: {original} Refactored Code: {refactored} Request: {user_request} Score 1-5 on each dimension with evidence: 1. Correctness: Does the code still work correctly? 2. Completeness: Were all relevant instances updated? 3. Style: Does it follow the project's coding patterns? 4. Efficiency: Were only necessary changes made? Provide scores with specific evidence from the code. ``` **Iteration**: If evaluation reveals the command often misses instances: 1. Add explicit instruction: "Search the entire codebase for all occurrences" 2. Re-evaluate with same test cases 3. Compare completeness scores 4. Check that correctness didn't regress # Bias Mitigation Techniques for LLM Evaluation This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems. ## Position Bias ### The Problem In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows: - GPT has mild first-position bias (~55% preference for first position in ties) - Claude shows similar patterns - Smaller models often show stronger bias ### Mitigation: Position Swapping Protocol ```python async def position_swap_comparison(response_a, response_b, prompt, criteria): # Pass 1: Original order result_ab = await compare(response_a, response_b, prompt, criteria) # Pass 2: Swapped order result_ba = await compare(response_b, response_a, prompt, criteria) # Map second result (A in second position → B in first) result_ba_mapped = { 'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']], 'confidence': result_ba['confidence'] } # Consistency check if result_ab['winner'] == result_ba_mapped['winner']: return { 'winner': result_ab['winner'], 'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2, 'position_consistent': True } else: # Disagreement indicates position bias was a factor return { 'winner': 'TIE', 'confidence': 0.5, 'position_consistent': False, 'bias_detected': True } ``` ### Alternative: Multiple Shuffles For higher reliability, use multiple position orderings: ```python async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3): results = [] for i in range(n_shuffles): if i % 2 == 0: r = await compare(response_a, response_b, prompt, criteria) else: r = await compare(response_b, response_a, prompt, criteria) r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']] results.append(r) # Majority vote winners = [r['winner'] for r in results] final_winner = max(set(winners), key=winners.count) agreement = winners.count(final_winner) / len(winners) return { 'winner': final_winner, 'confidence': agreement, 'n_shuffles': n_shuffles } ``` ## Length Bias ### The Problem LLMs tend to rate longer responses higher, regardless of quality. 
This manifests as: - Verbose responses receiving inflated scores - Concise but complete responses penalized - Padding and repetition being rewarded ### Mitigation: Explicit Prompting Include anti-length-bias instructions in the prompt: ``` CRITICAL EVALUATION GUIDELINES: - Do NOT prefer responses because they are longer - Concise, complete answers are as valuable as detailed ones - Penalize unnecessary verbosity or repetition - Focus on information density, not word count ``` ### Mitigation: Length-Normalized Scoring ```python def length_normalized_score(score, response_length, target_length=500): """Adjust score based on response length.""" length_ratio = response_length / target_length if length_ratio > 2.0: # Penalize excessively long responses penalty = (length_ratio - 2.0) * 0.1 return max(score - penalty, 1) elif length_ratio < 0.3: # Penalize excessively short responses penalty = (0.3 - length_ratio) * 0.5 return max(score - penalty, 1) else: return score ``` ### Mitigation: Separate Length Criterion Make length a separate, explicit criterion so it's not implicitly rewarded: ```python criteria = [ {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4}, {"name": "Completeness", "description": "Covers key points", "weight": 0.3}, {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3} # Explicit ] ``` ## Self-Enhancement Bias ### The Problem Models rate outputs generated by themselves (or similar models) higher than outputs from different models. ### Mitigation: Cross-Model Evaluation Use a different model family for evaluation than generation: ```python def get_evaluator_model(generator_model): """Select evaluator to avoid self-enhancement bias.""" if 'gpt' in generator_model.lower(): return 'claude-4-5-sonnet' elif 'claude' in generator_model.lower(): return 'gpt-5.2' else: return 'gpt-5.2' # Default ``` ### Mitigation: Blind Evaluation Remove model attribution from responses before evaluation: ```python def anonymize_response(response, model_name): """Remove model-identifying patterns.""" patterns = [ f"As {model_name}", "I am an AI", "I don't have personal opinions", # Model-specific patterns ] anonymized = response for pattern in patterns: anonymized = anonymized.replace(pattern, "[REDACTED]") return anonymized ``` ## Verbosity Bias ### The Problem Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect. ### Mitigation: Relevance-Weighted Scoring ```python async def relevance_weighted_evaluation(response, prompt, criteria): # First, assess relevance of each segment relevance_scores = await assess_relevance(response, prompt) # Weight evaluation by relevance segments = split_into_segments(response) weighted_scores = [] for segment, relevance in zip(segments, relevance_scores): if relevance > 0.5: # Only count relevant segments score = await evaluate_segment(segment, prompt, criteria) weighted_scores.append(score * relevance) return sum(weighted_scores) / len(weighted_scores) ``` ### Mitigation: Rubric with Verbosity Penalty Include explicit verbosity penalties in rubrics: ```python rubric_levels = [ { "score": 5, "description": "Complete and concise. All necessary information, nothing extraneous.", "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"] }, { "score": 3, "description": "Complete but verbose. Contains unnecessary detail or repetition.", "characteristics": ["Main points covered", "Some tangents", "Could be more concise"] }, # ... 
etc ] ``` ## Authority Bias ### The Problem Confident, authoritative tone is rated higher regardless of accuracy. ### Mitigation: Evidence Requirement Require explicit evidence for claims: ``` For each claim in the response: 1. Identify whether it's a factual claim 2. Note if evidence or sources are provided 3. Score based on verifiability, not confidence IMPORTANT: Confident claims without evidence should NOT receive higher scores than hedged claims with evidence. ``` ### Mitigation: Fact-Checking Layer Add a fact-checking step before scoring: ```python async def fact_checked_evaluation(response, prompt, criteria): # Extract claims claims = await extract_claims(response) # Fact-check each claim fact_check_results = await asyncio.gather(*[ verify_claim(claim) for claim in claims ]) # Adjust score based on fact-check results accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results) base_score = await evaluate(response, prompt, criteria) return base_score * (0.7 + 0.3 * accuracy_factor) # At least 70% of score ``` ## Aggregate Bias Detection Monitor for systematic biases in production: ```python class BiasMonitor: def __init__(self): self.evaluations = [] def record(self, evaluation): self.evaluations.append(evaluation) def detect_position_bias(self): """Detect if first position wins more often than expected.""" first_wins = sum(1 for e in self.evaluations if e['first_position_winner']) expected = len(self.evaluations) * 0.5 z_score = (first_wins - expected) / (expected * 0.5) ** 0.5 return {'bias_detected': abs(z_score) > 2, 'z_score': z_score} def detect_length_bias(self): """Detect if longer responses score higher.""" from scipy.stats import spearmanr lengths = [e['response_length'] for e in self.evaluations] scores = [e['score'] for e in self.evaluations] corr, p_value = spearmanr(lengths, scores) return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr} ``` ## Summary Table | Bias | Primary Mitigation | Secondary Mitigation | Detection Method | |------|-------------------|---------------------|------------------| | Position | Position swapping | Multiple shuffles | Consistency check | | Length | Explicit prompting | Length normalization | Length-score correlation | | Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study | | Verbosity | Relevance weighting | Rubric penalties | Relevance scoring | | Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation | # LLM-as-Judge Implementation Patterns for Claude Code This reference provides practical prompt patterns and workflows for evaluating Claude Code commands, skills, and agents during development. ## Pattern 1: Structured Evaluation Workflow The most reliable evaluation follows a structured workflow that separates concerns: ``` Define Criteria → Gather Test Cases → Run Evaluation → Mitigate Bias → Interpret Results ``` ### Step 1: Define Evaluation Criteria Before evaluating, establish clear criteria. Document them in a reusable format: ```markdown ## Evaluation Criteria for [Command/Skill Name] ### Criterion 1: Instruction Following (weight: 0.30) - **Description**: Does the output follow all explicit instructions? - **1 (Poor)**: Ignores or misunderstands core instructions - **3 (Adequate)**: Follows main instructions, misses some details - **5 (Excellent)**: Follows all instructions precisely ### Criterion 2: Output Completeness (weight: 0.25) - **Description**: Are all requested aspects covered? 
- **1 (Poor)**: Major aspects missing - **3 (Adequate)**: Core aspects covered with gaps - **5 (Excellent)**: All aspects thoroughly addressed ### Criterion 3: Tool Efficiency (weight: 0.20) - **Description**: Were appropriate tools used efficiently? - **1 (Poor)**: Wrong tools or excessive redundant calls - **3 (Adequate)**: Appropriate tools with some redundancy - **5 (Excellent)**: Optimal tool selection, minimal calls ### Criterion 4: Reasoning Quality (weight: 0.15) - **Description**: Is the reasoning clear and sound? - **1 (Poor)**: No apparent reasoning or flawed logic - **3 (Adequate)**: Basic reasoning present - **5 (Excellent)**: Clear, logical reasoning throughout ### Criterion 5: Response Coherence (weight: 0.10) - **Description**: Is the output well-structured and clear? - **1 (Poor)**: Difficult to follow or incoherent - **3 (Adequate)**: Understandable but could be clearer - **5 (Excellent)**: Well-structured, easy to follow ``` ### Step 2: Create Test Cases Structure test cases by complexity level: ```markdown ## Test Cases for /refactor Command ### Simple (Single Operation) - **Input**: Rename variable `x` to `count` in a single file - **Expected**: All instances renamed, code still runs - **Complexity**: Low ### Medium (Multiple Operations) - **Input**: Extract function from 20-line code block - **Expected**: New function created, original call site updated, behavior preserved - **Complexity**: Medium ### Complex (Cross-File Changes) - **Input**: Refactor class to use Strategy pattern - **Expected**: Interface created, implementations separated, all usages updated - **Complexity**: High ### Edge Case - **Input**: Refactor code with conflicting variable names in nested scopes - **Expected**: Correct scoping preserved, no accidental shadowing - **Complexity**: Edge case ``` ### Step 3: Run Direct Scoring Evaluation Use this prompt template to evaluate a single output: ```markdown You are evaluating the output of a Claude Code command. ## Original Task {paste the user's original request} ## Command Output {paste the full command output including tool calls} ## Evaluation Criteria {paste your criteria definitions from Step 1} ## Instructions For each criterion: 1. Find specific evidence in the output that supports your assessment 2. Assign a score (1-5) based on the rubric levels 3. Write a 1-2 sentence justification citing the evidence 4. Suggest one specific improvement IMPORTANT: Provide your justification BEFORE stating the score. This improves evaluation reliability. ## Output Format For each criterion, respond with: ### [Criterion Name] **Evidence**: [Quote or describe specific parts of the output] **Justification**: [Explain how the evidence maps to the rubric level] **Score**: [1-5] **Improvement**: [One actionable suggestion] ### Overall Assessment **Weighted Score**: [Calculate: sum of (score × weight)] **Pass/Fail**: [Pass if weighted score ≥ 3.5] **Summary**: [2-3 sentences summarizing strengths and weaknesses] ``` ### Step 4: Mitigate Position Bias in Comparisons When comparing two prompt variants (A vs B), use this two-pass workflow: **Pass 1 (A First):** ```markdown You are comparing two outputs from different prompt variants. 
## Original Task {task description} ## Output A (First Variant) {output from prompt variant A} ## Output B (Second Variant) {output from prompt variant B} ## Comparison Criteria - Instruction Following - Output Completeness - Reasoning Quality ## Critical Instructions - Do NOT prefer outputs because they are longer - Do NOT prefer outputs based on their position (first vs second) - Focus ONLY on quality differences - TIE is acceptable when outputs are equivalent ## Analysis Process 1. Analyze Output A independently: [strengths, weaknesses] 2. Analyze Output B independently: [strengths, weaknesses] 3. Compare on each criterion 4. Determine winner with confidence (0-1) ## Output Reasoning: [Explain why] Winner: [A/B/TIE] Confidence: [0.0-1.0] ``` **Pass 2 (B First):** Repeat the same prompt but swap the order: put Output B first and Output A second. **Interpret Results:** - If both passes agree → Winner confirmed, average the confidences - If passes disagree → Result is TIE with confidence 0.5 (position bias detected) ## Pattern 2: Hierarchical Evaluation Workflow For complex evaluations, use a hierarchical approach: ``` Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases) ``` ### Tier 1: Quick Screen (Use Haiku) ```markdown Rate this command output 0-10 for basic adequacy. Task: {brief task description} Output: {command output} Quick assessment: Does this output reasonably address the task? Score (0-10): One-line reasoning: ``` **Decision rule**: Score < 5 → Fail, Score ≥ 7 → Pass, Score 5-6 → Escalate to detailed evaluation ### Tier 2: Detailed Evaluation (Use Opus) Use the full direct scoring prompt from Pattern 1 for borderline cases. ### Tier 3: Human Review For low-confidence automated evaluations (confidence < 0.6), queue for manual review: ```markdown ## Human Review Request **Automated Score**: 3.2/5 (Confidence: 0.45) **Reason for Escalation**: Low confidence, evaluator disagreed across passes ### What to Review 1. Does the output actually complete the task? 2. Are the automated criterion scores reasonable? 3. What did the automation miss? ### Original Task {task} ### Output {output} ### Automated Assessment {paste automated evaluation} ### Human Override [ ] Agree with automation [ ] Override to PASS - Reason: ___ [ ] Override to FAIL - Reason: ___ ``` ## Pattern 3: Panel of LLM Judges (PoLL) For high-stakes evaluation, use multiple models: ### Workflow 1. **Run 3 independent evaluations** with different prompt framings: - Evaluation 1: Standard criteria prompt - Evaluation 2: Adversarial framing ("Find problems with this output") - Evaluation 3: User perspective ("Would a developer be satisfied?") 2. **Aggregate results**: - Take median score per criterion (robust to outliers) - Flag criteria with high variance (std > 1.0) for review - Overall pass requires majority agreement ### Multi-Judge Prompt Variants **Standard Framing:** ```markdown Evaluate this output against the specified criteria. Be fair and balanced. ``` **Adversarial Framing:** ```markdown Your role is to find problems with this output. Be critical and thorough. Look for: factual errors, missing requirements, inefficiencies, unclear explanations. ``` **User Perspective:** ```markdown Imagine you're a developer who requested this task. Would you be satisfied with this result? Would you need to redo any work? 
``` ### Agreement Analysis After running all judges, check consistency: | Criterion | Judge 1 | Judge 2 | Judge 3 | Median | Std Dev | |-----------|---------|---------|---------|--------|---------| | Instruction Following | 4 | 4 | 5 | 4 | 0.58 | | Completeness | 3 | 4 | 3 | 3 | 0.58 | | Tool Efficiency | 2 | 3 | 4 | 3 | 1.00 ⚠️ | **⚠️ High variance** on Tool Efficiency suggests the criterion needs clearer definition or the output has ambiguous efficiency characteristics. ## Pattern 4: Confidence Calibration Confidence scores should be calibrated to actual reliability: ### Confidence Factors | Factor | High Confidence | Low Confidence | |--------|-----------------|----------------| | Position consistency | Both passes agree | Passes disagree | | Evidence count | 3+ specific citations | Vague or no citations | | Criterion agreement | All criteria align | Criteria scores vary widely | | Edge case match | Similar to known cases | Novel situation | ### Calibration Prompt Addition Add this to evaluation prompts: ```markdown ## Confidence Assessment After scoring, assess your confidence: 1. **Evidence Strength**: How specific was the evidence you cited? - Strong: Quoted exact passages, precise observations - Moderate: General observations, reasonable inferences - Weak: Vague impressions, assumptions 2. **Criterion Clarity**: How clear were the criterion boundaries? - Clear: Easy to map output to rubric levels - Ambiguous: Output fell between levels - Unclear: Rubric didn't fit this case 3. **Overall Confidence**: [0.0-1.0] - 0.9+: Very confident, clear evidence, obvious rubric fit - 0.7-0.9: Confident, good evidence, minor ambiguity - 0.5-0.7: Moderate confidence, some ambiguity - <0.5: Low confidence, significant uncertainty Confidence: [score] Confidence Reasoning: [explain what factors affected confidence] ``` ## Pattern 5: Structured Output Format Request consistent output structure for easier analysis: ### Evaluation Output Template ```markdown ## Evaluation Results ### Metadata - **Evaluated**: [command/skill name] - **Test Case**: [test case ID or description] - **Evaluator**: [model used] - **Timestamp**: [when evaluated] ### Criterion Scores | Criterion | Score | Weight | Weighted | Confidence | |-----------|-------|--------|----------|------------| | Instruction Following | 4/5 | 0.30 | 1.20 | 0.85 | | Output Completeness | 3/5 | 0.25 | 0.75 | 0.70 | | Tool Efficiency | 5/5 | 0.20 | 1.00 | 0.90 | | Reasoning Quality | 4/5 | 0.15 | 0.60 | 0.75 | | Response Coherence | 4/5 | 0.10 | 0.40 | 0.80 | ### Summary - **Overall Score**: 3.95/5.0 - **Pass Threshold**: 3.5/5.0 - **Result**: ✅ PASS ### Evidence Summary - **Strengths**: [bullet points] - **Weaknesses**: [bullet points] - **Improvements**: [prioritized suggestions] ### Confidence Assessment - **Overall Confidence**: 0.78 - **Flags**: [any concerns or caveats] ``` ## Evaluation Workflows for Claude Code Development ### Workflow: Testing a New Command 1. **Write 5-10 test cases** spanning complexity levels 2. **Run command** on each test case, capture full output 3. **Quick screen** all outputs with Tier 1 evaluation 4. **Detailed evaluate** failures and borderline cases 5. **Identify patterns** in failures to guide prompt improvements 6. **Iterate prompt** based on specific weaknesses found 7. **Re-evaluate** same test cases to measure improvement ### Workflow: Comparing Prompt Variants 1. **Create variant prompts** (e.g., different instruction phrasings) 2. **Run both variants** on identical test cases 3. 
**Pairwise compare** with position swapping 4. **Calculate win rate** for each variant 5. **Analyze** which cases each variant handles better 6. **Decide**: Pick winner or create hybrid ### Workflow: Regression Testing 1. **Maintain test suite** of representative cases 2. **Before changes**: Run evaluation, record baseline scores 3. **After changes**: Re-run evaluation 4. **Compare**: Flag regressions (score drops > 0.5) 5. **Investigate**: Why did specific cases regress? 6. **Accept or revert**: Based on overall impact ### Workflow: Continuous Quality Monitoring 1. **Sample production usage** (if available) 2. **Run lightweight evaluation** on samples 3. **Track metrics over time**: - Average scores by criterion - Failure rate - Low-confidence rate 4. **Alert on degradation**: Score drop > 10% from baseline 5. **Periodic deep dive**: Monthly detailed evaluation on random sample ## Anti-Patterns to Avoid ### ❌ Scoring Without Justification **Problem**: Scores lack grounding, difficult to debug **Solution**: Always require evidence before score ### ❌ Single-Pass Pairwise Comparison **Problem**: Position bias corrupts results **Solution**: Always swap positions and check consistency ### ❌ Overloaded Criteria **Problem**: Criteria measuring multiple things are unreliable **Solution**: One criterion = one measurable aspect ### ❌ Missing Edge Case Guidance **Problem**: Evaluators handle ambiguous cases inconsistently **Solution**: Include edge cases in rubrics with explicit guidance ### ❌ Ignoring Low Confidence **Problem**: Acting on uncertain evaluations leads to wrong conclusions **Solution**: Escalate low-confidence cases for human review ### ❌ Generic Rubrics **Problem**: Generic criteria produce vague, unhelpful evaluations **Solution**: Create domain-specific rubrics (code commands vs documentation commands vs analysis commands) ## Handling Evaluation Failures When evaluations fail or produce unreliable results, use these recovery strategies: ### Discarding Malformed Output When the evaluator produces unparseable or incomplete output: 1. **Mark it as invalid and exclude it from analysis** - malformed output usually indicates hallucination during the evaluator's reasoning 2. **Retry the original prompt without changes** - multiple retries are usually more consistent than a single attempt 3. **If the output is still malformed, flag it for human review**: Mark as "evaluation failed, needs manual check" and queue for later ### Validation Checklist Before trusting evaluation results, verify: - [ ] All criteria have scores in valid range (1-5) - [ ] Each score has a justification referencing specific evidence - [ ] Confidence score is provided and reasonable - [ ] No contradictions between justification and assigned score - [ ] Weighted total calculation is correct ## Validating Evaluation Prompts (Meta-Evaluation) Before using an evaluation prompt in production, test it against known cases: ### Calibration Test Cases Create a small set of outputs with known quality levels: | Test Type | Description | Expected Score | |-----------|-------------|----------------| | Known-good | Clearly excellent output | 4.5+ / 5.0 | | Known-bad | Clearly poor output | < 2.5 / 5.0 | | Boundary | Borderline case | 3.0-3.5 with nuanced explanation | ### Validation Workflow 1. **Known-good test**: Evaluate a clearly excellent output - If score < 4.0 → Rubric is too strict or evidence requirements unclear 2. **Known-bad test**: Evaluate a clearly poor output - If score > 3.0 → Rubric is too lenient or criteria not specific enough 3. 
**Boundary test**: Evaluate a borderline case - Should produce moderate score (3.0-3.5) with detailed explanation - If confident high/low score → Criteria lack nuance 4. **Consistency test**: Run same evaluation 3 times - Score variance should be < 0.5 - If higher variance → Criteria need tighter definitions ### Position Bias Validation Test for position bias before using pairwise comparisons: ```markdown ## Position Bias Test Run this test with IDENTICAL outputs in both positions: Test Case: [Same output text] Position A: [Paste output] Position B: [Paste identical output] Expected Result: TIE with high confidence (>0.9) If Result Shows Winner: - Position bias detected - Add stronger anti-bias instructions to prompt - Re-test until TIE achieved consistently ``` ### Evaluation Prompt Iteration When calibration tests fail: 1. **Identify failure mode**: Too strict? Too lenient? Inconsistent? 2. **Adjust specific rubric levels**: Add examples, clarify boundaries 3. **Re-run calibration tests**: All 4 tests must pass 4. **Document changes**: Track what adjustments improved reliability # Metric Selection Guide for LLM Evaluation This reference provides guidance on selecting appropriate metrics for different evaluation scenarios. ## Metric Categories ### Classification Metrics Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect). #### Precision ``` Precision = True Positives / (True Positives + False Positives) ``` **Interpretation**: Of all responses the judge said were good, what fraction were actually good? **Use when**: False positives are costly (e.g., approving unsafe content) #### Recall ``` Recall = True Positives / (True Positives + False Negatives) ``` **Interpretation**: Of all actually good responses, what fraction did the judge identify? **Use when**: False negatives are costly (e.g., missing good content in filtering) #### F1 Score ``` F1 = 2 * (Precision * Recall) / (Precision + Recall) ``` **Interpretation**: Harmonic mean of precision and recall **Use when**: You need a single number balancing both concerns ### Agreement Metrics Use for comparing automated evaluation with human judgment. #### Cohen's Kappa (κ) ``` κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement) ``` **Interpretation**: Agreement adjusted for chance - κ > 0.8: Almost perfect agreement - κ 0.6-0.8: Substantial agreement - κ 0.4-0.6: Moderate agreement - κ < 0.4: Fair to poor agreement **Use for**: Binary or categorical judgments #### Weighted Kappa For ordinal scales where disagreement severity matters: **Interpretation**: Penalizes large disagreements more than small ones ### Correlation Metrics Use for ordinal/continuous scores. 
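To make these definitions concrete, here is a minimal sketch (assuming `scipy` and `scikit-learn` are available; the sample data is purely illustrative) that computes the classification and agreement metrics above alongside Spearman's rank correlation:

```python
# Illustrative sketch: comparing judge decisions/scores against human labels.
# The data below is made up; swap in your own evaluation records.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

human_pass = [1, 0, 1, 1, 0, 1]     # human pass/fail labels (1 = pass)
judge_pass = [1, 0, 1, 0, 0, 1]     # judge pass/fail decisions
human_scores = [5, 2, 4, 4, 1, 5]   # human 1-5 ratings
judge_scores = [4, 2, 4, 3, 2, 5]   # judge 1-5 ratings

# Classification metrics for the binary pass/fail decision
precision, recall, f1, _ = precision_recall_fscore_support(
    human_pass, judge_pass, average="binary"
)

# Chance-corrected agreement on the binary decision
kappa = cohen_kappa_score(human_pass, judge_pass)

# Rank correlation on the ordinal 1-5 scores
rho, p_value = spearmanr(human_scores, judge_scores)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"kappa={kappa:.2f} spearman_rho={rho:.2f} (p={p_value:.3f})")
```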
#### Spearman's Rank Correlation (ρ) **Interpretation**: Correlation between rankings, not absolute values - ρ > 0.9: Very strong correlation - ρ 0.7-0.9: Strong correlation - ρ 0.5-0.7: Moderate correlation - ρ < 0.5: Weak correlation **Use when**: Order matters more than exact values #### Kendall's Tau (τ) **Interpretation**: Similar to Spearman but based on pairwise concordance **Use when**: You have many tied values #### Pearson Correlation (r) **Interpretation**: Linear correlation between scores **Use when**: Exact score values matter, not just order ### Pairwise Comparison Metrics #### Agreement Rate ``` Agreement = (Matching Decisions) / (Total Comparisons) ``` **Interpretation**: Simple percentage of agreement #### Position Consistency ``` Consistency = (Consistent across position swaps) / (Total comparisons) ``` **Interpretation**: How often does swapping position change the decision? ## Selection Decision Tree ``` What type of evaluation task? │ ├── Binary classification (pass/fail) │ └── Use: Precision, Recall, F1, Cohen's κ │ ├── Ordinal scale (1-5 rating) │ ├── Comparing to human judgments? │ │ └── Use: Spearman's ρ, Weighted κ │ └── Comparing two automated judges? │ └── Use: Kendall's τ, Spearman's ρ │ ├── Pairwise preference │ └── Use: Agreement rate, Position consistency │ └── Multi-label classification └── Use: Macro-F1, Micro-F1, Per-label metrics ``` ## Metric Selection by Use Case ### Use Case 1: Validating Automated Evaluation **Goal**: Ensure automated evaluation correlates with human judgment **Recommended Metrics**: 1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical) 2. Secondary: Per-criterion agreement 3. Diagnostic: Confusion matrix for systematic errors ### Use Case 2: Comparing Two Models **Goal**: Determine which model produces better outputs **Recommended Metrics**: 1. Primary: Win rate (from pairwise comparison) 2. Secondary: Position consistency (bias check) 3. Diagnostic: Per-criterion breakdown ### Use Case 3: Quality Monitoring **Goal**: Track evaluation quality over time **Recommended Metrics**: 1. Primary: Rolling agreement with human spot-checks 2. Secondary: Score distribution stability 3. Diagnostic: Bias indicators (position, length) ## Interpreting Metric Results ### Good Evaluation System Indicators | Metric | Good | Acceptable | Concerning | |--------|------|------------|------------| | Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 | | Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 | | Position consistency | > 0.9 | 0.8-0.9 | < 0.8 | | Length correlation | < 0.2 | 0.2-0.4 | > 0.4 | ### Warning Signs 1. **High agreement but low correlation**: May indicate calibration issues 2. **Low position consistency**: Position bias affecting results 3. **High length correlation**: Length bias inflating scores 4. **Per-criterion variance**: Some criteria may be poorly defined ## Reporting Template ```markdown ## Evaluation System Metrics Report ### Human Agreement - Spearman's ρ: 0.82 (p < 0.001) - Cohen's κ: 0.74 - Sample size: 500 evaluations ### Bias Indicators - Position consistency: 91% - Length-score correlation: 0.12 ### Per-Criterion Performance | Criterion | Spearman's ρ | κ | |-----------|--------------|---| | Accuracy | 0.88 | 0.79 | | Clarity | 0.76 | 0.68 | | Completeness | 0.81 | 0.72 | ### Recommendations - All metrics within acceptable ranges - Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement ```plugins/customaize-agent/skills/create-hook/SKILL.mdskillShow content (110475 bytes)
--- name: create-hook description: Create and configure Claude Code hooks with intelligent project analysis, suggestions, and automated testing argument-hint: Optional hook type or description of desired behavior --- # Create Hook Command Analyze the project, suggest practical hooks, and create them with proper testing. ## Your Task (/create-hook) 1. **Analyze environment** - Detect tooling and existing hooks 2. **Suggest hooks** - Based on your project configuration 3. **Configure hook** - Ask targeted questions and create the script 4. **Test & validate** - Ensure the hook works correctly ## Your Workflow ### 1. Environment Analysis & Suggestions Automatically detect the project tooling and suggest relevant hooks: **When TypeScript is detected (`tsconfig.json`):** - PostToolUse hook: "Type-check files after editing" - PreToolUse hook: "Block edits with type errors" **When Prettier is detected (`.prettierrc`, `prettier.config.js`):** - PostToolUse hook: "Auto-format files after editing" - PreToolUse hook: "Require formatted code" **When ESLint is detected (`.eslintrc.*`):** - PostToolUse hook: "Lint and auto-fix after editing" - PreToolUse hook: "Block commits with linting errors" **When package.json has scripts:** - `test` script → "Run tests before commits" - `build` script → "Validate build before commits" **When a git repository is detected:** - PreToolUse/Bash hook: "Prevent commits with secrets" - PostToolUse hook: "Security scan on file changes" **Decision Tree:** ``` Project has TypeScript? → Suggest type checking hooks Project has formatter? → Suggest formatting hooks Project has tests? → Suggest test validation hooks Security sensitive? → Suggest security hooks + Scan for additional patterns and suggest custom hooks based on: - Custom scripts in package.json - Unique file patterns or extensions - Development workflow indicators - Project-specific tooling configurations ``` ### 2. Hook Configuration Start by asking: **"What should this hook do?"** and offer relevant suggestions from your analysis. Then understand the context from the user's description and **only ask about details you're unsure about**: 1. **Trigger timing**: When should it run? - `PreToolUse`: Before file operations (can block) - `PostToolUse`: After file operations (feedback/fixes) - `UserPromptSubmit`: Before processing requests - Other event types as needed 2. **Tool matcher**: Which tools should trigger it? (`Write`, `Edit`, `Bash`, `*`, etc.) 3. **Scope**: `global`, `project`, or `project-local` 4. **Response approach**: - **Exit codes only**: Simple (exit 0 = success, exit 2 = block in PreToolUse) - **JSON response**: Advanced control (blocking, context, decisions) - Guide based on complexity: simple pass/fail → exit codes, rich feedback → JSON 5. **Blocking behavior** (if relevant): "Should this stop operations when issues are found?" - PreToolUse: Can block operations (security, validation) - PostToolUse: Usually provide feedback only 6. **Claude integration** (CRITICAL): "Should Claude Code automatically see and fix issues this hook detects?" - If YES: Use `additionalContext` for error communication - If NO: Use `suppressOutput: true` for silent operation 7. **Context pollution**: "Should successful operations be silent to avoid noise?" - Recommend YES for formatting, routine checks - Recommend NO for security alerts, critical errors 8. **File filtering**: "What file types should this hook process?" ### 3. 
Hook Creation You should: - **Create hooks directory**: `~/.claude/hooks/` or `.claude/hooks/` based on scope - **Generate script**: Create hook script with: - Proper shebang and executable permissions - Project-specific commands (use detected config paths) - Comments explaining the hook's purpose - **Update settings**: Add hook configuration to appropriate settings.json - **Use absolute paths**: Avoid relative paths to scripts and executables. Use `$CLAUDE_PROJECT_DIR` to reference project root - **Offer validation**: Ask if the user wants you to test the hook **Key Implementation Standards:** - Read JSON from stdin (never use argv) - Use top-level `additionalContext`/`systemMessage` for Claude communication - Include `suppressOutput: true` for successful operations - Provide specific error counts and actionable feedback - Focus on changed files rather than entire codebase - Support common development workflows **⚠️ CRITICAL: Input/Output Format** This is where most hook implementations fail. Pay extra attention to: - **Input**: Reading JSON from stdin correctly (not argv) - **Output**: Using correct top-level JSON structure for Claude communication - **Documentation**: Consulting official docs for exact schemas when in doubt ### 4. Testing & Validation **CRITICAL: Test both happy and sad paths:** **Happy Path Testing:** 1. **Test expected success scenario** - Create conditions where hook should pass - _Examples_: TypeScript (valid code), Linting (formatted code), Security (safe commands) **Sad Path Testing:** 2. **Test expected failure scenario** - Create conditions where hook should fail/warn - _Examples_: TypeScript (type errors), Linting (unformatted code), Security (dangerous operations) **Verification Steps:** 3. **Verify expected behavior**: Check if it blocks/warns/provides context as intended **Example Testing Process:** - For a hook preventing file deletion: Create a test file, attempt the protected action, and verify the hook prevents it **If Issues Occur, you should:** - Check hook registration in settings - Verify script permissions (`chmod +x`) - Test with simplified version first - Debug with detailed hook execution analysis ## Hook Templates ### Type Checking (PostToolUse) ``` #!/usr/bin/env node // Read stdin JSON, check .ts/.tsx files only // Run: npx tsc --noEmit --pretty // Output: JSON with additionalContext for errors ``` ### Auto-formatting (PostToolUse) ``` #!/usr/bin/env node // Read stdin JSON, check supported file types // Run: npx prettier --write [file] // Output: JSON with suppressOutput: true ``` ### Security Scanning (PreToolUse) ```bash #!/bin/bash # Read stdin JSON, check for secrets/keys # Block if dangerous patterns found # Exit 2 to block, 0 to continue ``` _Complete templates available at: <https://docs.claude.com/en/docs/claude-code/hooks#examples>_ ## Quick Reference **📖 Official Docs**: <https://docs.claude.com/en/docs/claude-code/hooks.md> **Common Patterns:** - **stdin input**: `JSON.parse(process.stdin.read())` - **File filtering**: Check extensions before processing - **Success response**: `{continue: true, suppressOutput: true}` - **Error response**: `{continue: true, additionalContext: "error details"}` - **Block operation**: `exit(2)` in PreToolUse hooks **Hook Types by Use Case:** - **Code Quality**: PostToolUse for feedback and fixes - **Security**: PreToolUse to block dangerous operations - **CI/CD**: PreToolUse to validate before commits - **Development**: PostToolUse for automated improvements **Hook Execution Best Practices:** - 
**Hooks run in parallel** according to official documentation - **Design for independence** since execution order isn't guaranteed - **Plan hook interactions carefully** when multiple hooks affect the same files ## Success Criteria ✅ **Hook created successfully when:** - Script has executable permissions - Registered in correct settings.json - Responds correctly to test scenarios - Integrates properly with Claude for automated fixes - Follows project conventions and detected tooling **Result**: The user gets a working hook that enhances their development workflow with intelligent automation and quality checks. --- > ## Documentation Index > > Fetch the complete documentation index at: <https://code.claude.com/docs/llms.txt> > Use this file to discover all available pages before exploring further. # Automate workflows with hooks > Run shell commands automatically when Claude Code edits files, finishes tasks, or needs input. Format code, send notifications, validate commands, and enforce project rules. Hooks are user-defined shell commands that execute at specific points in Claude Code's lifecycle. They provide deterministic control over Claude Code's behavior, ensuring certain actions always happen rather than relying on the LLM to choose to run them. Use hooks to enforce project rules, automate repetitive tasks, and integrate Claude Code with your existing tools. For decisions that require judgment rather than deterministic rules, you can also use [prompt-based hooks](#prompt-based-hooks) or [agent-based hooks](#agent-based-hooks) that use a Claude model to evaluate conditions. For other ways to extend Claude Code, see [skills](/en/skills) for giving Claude additional instructions and executable commands, [subagents](/en/sub-agents) for running tasks in isolated contexts, and [plugins](/en/plugins) for packaging extensions to share across projects. <Tip> This guide covers common use cases and how to get started. For full event schemas, JSON input/output formats, and advanced features like async hooks and MCP tool hooks, see the [Hooks reference](/en/hooks). </Tip> ## Set up your first hook The fastest way to create a hook is through the `/hooks` interactive menu in Claude Code. This walkthrough creates a desktop notification hook, so you get alerted whenever Claude is waiting for your input instead of watching the terminal. <Steps> <Step title="Open the hooks menu"> Type `/hooks` in the Claude Code CLI. You'll see a list of all available hook events, plus an option to disable all hooks. Each event corresponds to a point in Claude's lifecycle where you can run custom code. Select `Notification` to create a hook that fires when Claude needs your attention. </Step> <Step title="Configure the matcher"> The menu shows a list of matchers, which filter when the hook fires. Set the matcher to `*` to fire on all notification types. You can narrow it later by changing the matcher to a specific value like `permission_prompt` or `idle_prompt`. </Step> <Step title="Add your command"> Select `+ Add new hook…`. The menu prompts you for a shell command to run when the event fires. Hooks run any shell command you provide, so you can use your platform's built-in notification tool. 
Copy the command for your OS: <Tabs> <Tab title="macOS"> Uses [`osascript`](https://ss64.com/mac/osascript.html) to trigger a native macOS notification through AppleScript: ``` osascript -e 'display notification "Claude Code needs your attention" with title "Claude Code"' ``` </Tab> <Tab title="Linux"> Uses `notify-send`, which is pre-installed on most Linux desktops with a notification daemon: ``` notify-send 'Claude Code' 'Claude Code needs your attention' ``` </Tab> <Tab title="Windows (PowerShell)"> Uses PowerShell to show a native message box through .NET's Windows Forms: ``` powershell.exe -Command "[System.Reflection.Assembly]::LoadWithPartialName('System.Windows.Forms'); [System.Windows.Forms.MessageBox]::Show('Claude Code needs your attention', 'Claude Code')" ``` </Tab> </Tabs> </Step> <Step title="Choose a storage location"> The menu asks where to save the hook configuration. Select `User settings` to store it in `~/.claude/settings.json`, which applies the hook to all your projects. You could also choose `Project settings` to scope it to the current project. See [Configure hook location](#configure-hook-location) for all available scopes. </Step> <Step title="Test the hook"> Press `Esc` to return to the CLI. Ask Claude to do something that requires permission, then switch away from the terminal. You should receive a desktop notification. </Step> </Steps> ## What you can automate Hooks let you run code at key points in Claude Code's lifecycle: format files after edits, block commands before they execute, send notifications when Claude needs input, inject context at session start, and more. For the full list of hook events, see the [Hooks reference](/en/hooks#hook-lifecycle). Each example includes a ready-to-use configuration block that you add to a [settings file](#configure-hook-location). The most common patterns: - [Get notified when Claude needs input](#get-notified-when-claude-needs-input) - [Auto-format code after edits](#auto-format-code-after-edits) - [Block edits to protected files](#block-edits-to-protected-files) - [Re-inject context after compaction](#re-inject-context-after-compaction) ### Get notified when Claude needs input Get a desktop notification whenever Claude finishes working and needs your input, so you can switch to other tasks without checking the terminal. This hook uses the `Notification` event, which fires when Claude is waiting for input or permission. Each tab below uses the platform's native notification command. 
Add this to `~/.claude/settings.json`, or use the [interactive walkthrough](#set-up-your-first-hook) above to configure it with `/hooks`: <Tabs> <Tab title="macOS"> ```json theme={null} { "hooks": { "Notification": [ { "matcher": "", "hooks": [ { "type": "command", "command": "osascript -e 'display notification \"Claude Code needs your attention\" with title \"Claude Code\"'" } ] } ] } } ``` </Tab> <Tab title="Linux"> ```json theme={null} { "hooks": { "Notification": [ { "matcher": "", "hooks": [ { "type": "command", "command": "notify-send 'Claude Code' 'Claude Code needs your attention'" } ] } ] } } ``` </Tab> <Tab title="Windows (PowerShell)"> ```json theme={null} { "hooks": { "Notification": [ { "matcher": "", "hooks": [ { "type": "command", "command": "powershell.exe -Command \"[System.Reflection.Assembly]::LoadWithPartialName('System.Windows.Forms'); [System.Windows.Forms.MessageBox]::Show('Claude Code needs your attention', 'Claude Code')\"" } ] } ] } } ``` </Tab> </Tabs> ### Auto-format code after edits Automatically run [Prettier](https://prettier.io/) on every file Claude edits, so formatting stays consistent without manual intervention. This hook uses the `PostToolUse` event with an `Edit|Write` matcher, so it runs only after file-editing tools. The command extracts the edited file path with [`jq`](https://jqlang.github.io/jq/) and passes it to Prettier. Add this to `.claude/settings.json` in your project root: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Edit|Write", "hooks": [ { "type": "command", "command": "jq -r '.tool_input.file_path' | xargs npx prettier --write" } ] } ] } } ``` <Note> The Bash examples on this page use `jq` for JSON parsing. Install it with `brew install jq` (macOS), `apt-get install jq` (Debian/Ubuntu), or see [`jq` downloads](https://jqlang.github.io/jq/download/). </Note> ### Block edits to protected files Prevent Claude from modifying sensitive files like `.env`, `package-lock.json`, or anything in `.git/`. Claude receives feedback explaining why the edit was blocked, so it can adjust its approach. This example uses a separate script file that the hook calls. The script checks the target file path against a list of protected patterns and exits with code 2 to block the edit. <Steps> <Step title="Create the hook script"> Save this to `.claude/hooks/protect-files.sh`: ```bash theme={null} #!/bin/bash # protect-files.sh INPUT=$(cat) FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty') PROTECTED_PATTERNS=(".env" "package-lock.json" ".git/") for pattern in "${PROTECTED_PATTERNS[@]}"; do if [[ "$FILE_PATH" == *"$pattern"* ]]; then echo "Blocked: $FILE_PATH matches protected pattern '$pattern'" >&2 exit 2 fi done exit 0 ``` </Step> <Step title="Make the script executable (macOS/Linux)"> Hook scripts must be executable for Claude Code to run them: ```bash theme={null} chmod +x .claude/hooks/protect-files.sh ``` </Step> <Step title="Register the hook"> Add a `PreToolUse` hook to `.claude/settings.json` that runs the script before any `Edit` or `Write` tool call: ```json theme={null} { "hooks": { "PreToolUse": [ { "matcher": "Edit|Write", "hooks": [ { "type": "command", "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/protect-files.sh" } ] } ] } } ``` </Step> </Steps> ### Re-inject context after compaction When Claude's context window fills up, compaction summarizes the conversation to free space. This can lose important details. 
Use a `SessionStart` hook with a `compact` matcher to re-inject critical context after every compaction. Any text your command writes to stdout is added to Claude's context. This example reminds Claude of project conventions and recent work. Add this to `.claude/settings.json` in your project root: ```json theme={null} { "hooks": { "SessionStart": [ { "matcher": "compact", "hooks": [ { "type": "command", "command": "echo 'Reminder: use Bun, not npm. Run bun test before committing. Current sprint: auth refactor.'" } ] } ] } } ``` You can replace the `echo` with any command that produces dynamic output, like `git log --oneline -5` to show recent commits. For injecting context on every session start, consider using [CLAUDE.md](/en/memory) instead. For environment variables, see [`CLAUDE_ENV_FILE`](/en/hooks#persist-environment-variables) in the reference. ## How hooks work Hook events fire at specific lifecycle points in Claude Code. When an event fires, all matching hooks run in parallel, and identical hook commands are automatically deduplicated. The table below shows each event and when it triggers: | Event | When it fires | | :------------------- | :--------------------------------------------------- | | `SessionStart` | When a session begins or resumes | | `UserPromptSubmit` | When you submit a prompt, before Claude processes it | | `PreToolUse` | Before a tool call executes. Can block it | | `PermissionRequest` | When a permission dialog appears | | `PostToolUse` | After a tool call succeeds | | `PostToolUseFailure` | After a tool call fails | | `Notification` | When Claude Code sends a notification | | `SubagentStart` | When a subagent is spawned | | `SubagentStop` | When a subagent finishes | | `Stop` | When Claude finishes responding | | `PreCompact` | Before context compaction | | `SessionEnd` | When a session terminates | Each hook has a `type` that determines how it runs. Most hooks use `"type": "command"`, which runs a shell command. Two other options use a Claude model to make decisions: `"type": "prompt"` for single-turn evaluation and `"type": "agent"` for multi-turn verification with tool access. See [Prompt-based hooks](#prompt-based-hooks) and [Agent-based hooks](#agent-based-hooks) for details. ### Read input and return output Hooks communicate with Claude Code through stdin, stdout, stderr, and exit codes. When an event fires, Claude Code passes event-specific data as JSON to your script's stdin. Your script reads that data, does its work, and tells Claude Code what to do next via the exit code. #### Hook input Every event includes common fields like `session_id` and `cwd`, but each event type adds different data. For example, when Claude runs a Bash command, a `PreToolUse` hook receives something like this on stdin: ```json theme={null} { "session_id": "abc123", // unique ID for this session "cwd": "/Users/sarah/myproject", // working directory when the event fired "hook_event_name": "PreToolUse", // which event triggered this hook "tool_name": "Bash", // the tool Claude is about to use "tool_input": { // the arguments Claude passed to the tool "command": "npm test" // for Bash, this is the shell command } } ``` Your script can parse that JSON and act on any of those fields. `UserPromptSubmit` hooks get the `prompt` text instead, `SessionStart` hooks get the `source` (startup, resume, compact), and so on. See [Common input fields](/en/hooks#common-input-fields) in the reference for shared fields, and each event's section for event-specific schemas. 
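If you prefer Python to Bash for JSON parsing, a hook handler can read the same event data from stdin. The sketch below only logs the tool call; the log location and field handling are illustrative, not prescribed:

```python theme={null}
#!/usr/bin/env python3
# Sketch of a hook handler in Python: parse the event JSON from stdin,
# pull out the fields you care about, and append them to a log file.
import json
import sys
from pathlib import Path

event = json.load(sys.stdin)  # hook input arrives as JSON on stdin

tool = event.get("tool_name", "unknown")
command = event.get("tool_input", {}).get("command", "")

log_path = Path.home() / ".claude" / "hook-log.txt"  # illustrative location
with log_path.open("a") as log:
    log.write(f"{event.get('session_id', '?')} {tool}: {command}\n")

sys.exit(0)  # exit 0: let the action proceed
```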
#### Hook output Your script tells Claude Code what to do next by writing to stdout or stderr and exiting with a specific code. For example, a `PreToolUse` hook that wants to block a command: ```bash theme={null} #!/bin/bash INPUT=$(cat) COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command') if echo "$COMMAND" | grep -q "drop table"; then echo "Blocked: dropping tables is not allowed" >&2 # stderr becomes Claude's feedback exit 2 # exit 2 = block the action fi exit 0 # exit 0 = let it proceed ``` The exit code determines what happens next: - **Exit 0**: the action proceeds. For `UserPromptSubmit` and `SessionStart` hooks, anything you write to stdout is added to Claude's context. - **Exit 2**: the action is blocked. Write a reason to stderr, and Claude receives it as feedback so it can adjust. - **Any other exit code**: the action proceeds. Stderr is logged but not shown to Claude. Toggle verbose mode with `Ctrl+O` to see these messages in the transcript. #### Structured JSON output Exit codes give you two options: allow or block. For more control, exit 0 and print a JSON object to stdout instead. <Note> Use exit 2 to block with a stderr message, or exit 0 with JSON for structured control. Don't mix them: Claude Code ignores JSON when you exit 2. </Note> For example, a `PreToolUse` hook can deny a tool call and tell Claude why, or escalate it to the user for approval: ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PreToolUse", "permissionDecision": "deny", "permissionDecisionReason": "Use rg instead of grep for better performance" } } ``` Claude Code reads `permissionDecision` and cancels the tool call, then feeds `permissionDecisionReason` back to Claude as feedback. These three options are specific to `PreToolUse`: - `"allow"`: proceed without showing a permission prompt - `"deny"`: cancel the tool call and send the reason to Claude - `"ask"`: show the permission prompt to the user as normal Other events use different decision patterns. For example, `PostToolUse` and `Stop` hooks use a top-level `decision: "block"` field, while `PermissionRequest` uses `hookSpecificOutput.decision.behavior`. See the [summary table](/en/hooks#decision-control) in the reference for a full breakdown by event. For `UserPromptSubmit` hooks, use `additionalContext` instead to inject text into Claude's context. Prompt-based hooks (`type: "prompt"`) handle output differently: see [Prompt-based hooks](#prompt-based-hooks). ### Filter hooks with matchers Without a matcher, a hook fires on every occurrence of its event. Matchers let you narrow that down. For example, if you want to run a formatter only after file edits (not after every tool call), add a matcher to your `PostToolUse` hook: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Edit|Write", "hooks": [ { "type": "command", "command": "prettier --write ..." } ] } ] } } ``` The `"Edit|Write"` matcher is a regex pattern that matches the tool name. The hook only fires when Claude uses the `Edit` or `Write` tool, not when it uses `Bash`, `Read`, or any other tool. Each event type matches on a specific field. 
Matchers support exact strings and regex patterns: | Event | What the matcher filters | Example matcher values | | :--------------------------------------------------------------------- | :------------------------ | :----------------------------------------------------------------------- | | `PreToolUse`, `PostToolUse`, `PostToolUseFailure`, `PermissionRequest` | tool name | `Bash`, `Edit\|Write`, `mcp__.*` | | `SessionStart` | how the session started | `startup`, `resume`, `clear`, `compact` | | `SessionEnd` | why the session ended | `clear`, `logout`, `prompt_input_exit`, `other` | | `Notification` | notification type | `permission_prompt`, `idle_prompt`, `auth_success`, `elicitation_dialog` | | `SubagentStart` | agent type | `Bash`, `Explore`, `Plan`, or custom agent names | | `PreCompact` | what triggered compaction | `manual`, `auto` | | `UserPromptSubmit`, `Stop` | no matcher support | always fires on every occurrence | | `SubagentStop` | agent type | same values as `SubagentStart` | A few more examples showing matchers on different event types: <Tabs> <Tab title="Log every Bash command"> Match only `Bash` tool calls and log each command to a file. The `PostToolUse` event fires after the command completes, so `tool_input.command` contains what ran. The hook receives the event data as JSON on stdin, and `jq -r '.tool_input.command'` extracts just the command string, which `>>` appends to the log file: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Bash", "hooks": [ { "type": "command", "command": "jq -r '.tool_input.command' >> ~/.claude/command-log.txt" } ] } ] } } ``` </Tab> <Tab title="Match MCP tools"> MCP tools use a different naming convention than built-in tools: `mcp__<server>__<tool>`, where `<server>` is the MCP server name and `<tool>` is the tool it provides. For example, `mcp__github__search_repositories` or `mcp__filesystem__read_file`. Use a regex matcher to target all tools from a specific server, or match across servers with a pattern like `mcp__.*__write.*`. See [Match MCP tools](/en/hooks#match-mcp-tools) in the reference for the full list of examples. The command below extracts the tool name from the hook's JSON input with `jq` and writes it to stderr, where it shows up in verbose mode (`Ctrl+O`): ```json theme={null} { "hooks": { "PreToolUse": [ { "matcher": "mcp__github__.*", "hooks": [ { "type": "command", "command": "echo \"GitHub tool called: $(jq -r '.tool_name')\" >&2" } ] } ] } } ``` </Tab> <Tab title="Clean up on session end"> The `SessionEnd` event supports matchers on the reason the session ended. This hook only fires on `clear` (when you run `/clear`), not on normal exits: ```json theme={null} { "hooks": { "SessionEnd": [ { "matcher": "clear", "hooks": [ { "type": "command", "command": "rm -f /tmp/claude-scratch-*.txt" } ] } ] } } ``` </Tab> </Tabs> For full matcher syntax, see the [Hooks reference](/en/hooks#configuration). 
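If you want to sanity-check a matcher pattern offline before adding it to your settings, you can test candidate regexes against the tool names you expect. This sketch assumes the pattern must match the whole field value; confirm the exact matching semantics against this reference before relying on it:

```python theme={null}
import re

# Illustrative matcher patterns and the field values they would be tested against
cases = [
    ("Edit|Write", "Edit"),
    ("Edit|Write", "Bash"),                                  # expect: no match
    ("mcp__github__.*", "mcp__github__search_repositories"),
    ("mcp__.*__write.*", "mcp__filesystem__read_file"),      # expect: no match
]

for pattern, value in cases:
    matched = re.fullmatch(pattern, value) is not None
    print(f"{pattern!r} vs {value!r}: {'match' if matched else 'no match'}")
```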
### Configure hook location Where you add a hook determines its scope: | Location | Scope | Shareable | | :--------------------------------------------------------- | :--------------------------------- | :--------------------------------- | | `~/.claude/settings.json` | All your projects | No, local to your machine | | `.claude/settings.json` | Single project | Yes, can be committed to the repo | | `.claude/settings.local.json` | Single project | No, gitignored | | Managed policy settings | Organization-wide | Yes, admin-controlled | | [Plugin](/en/plugins) `hooks/hooks.json` | When plugin is enabled | Yes, bundled with the plugin | | [Skill](/en/skills) or [agent](/en/sub-agents) frontmatter | While the skill or agent is active | Yes, defined in the component file | You can also use the [`/hooks` menu](/en/hooks#the-hooks-menu) in Claude Code to add, delete, and view hooks interactively. To disable all hooks at once, use the toggle at the bottom of the `/hooks` menu or set `"disableAllHooks": true` in your settings file. Hooks added through the `/hooks` menu take effect immediately. If you edit settings files directly while Claude Code is running, the changes won't take effect until you review them in the `/hooks` menu or restart your session. ## Prompt-based hooks For decisions that require judgment rather than deterministic rules, use `type: "prompt"` hooks. Instead of running a shell command, Claude Code sends your prompt and the hook's input data to a Claude model (Haiku by default) to make the decision. You can specify a different model with the `model` field if you need more capability. The model's only job is to return a yes/no decision as JSON: - `"ok": true`: the action proceeds - `"ok": false`: the action is blocked. The model's `"reason"` is fed back to Claude so it can adjust. This example uses a `Stop` hook to ask the model whether all requested tasks are complete. If the model returns `"ok": false`, Claude keeps working and uses the `reason` as its next instruction: ```json theme={null} { "hooks": { "Stop": [ { "hooks": [ { "type": "prompt", "prompt": "Check if all tasks are complete. If not, respond with {\"ok\": false, \"reason\": \"what remains to be done\"}." } ] } ] } } ``` For full configuration options, see [Prompt-based hooks](/en/hooks#prompt-based-hooks) in the reference. ## Agent-based hooks When verification requires inspecting files or running commands, use `type: "agent"` hooks. Unlike prompt hooks which make a single LLM call, agent hooks spawn a subagent that can read files, search code, and use other tools to verify conditions before returning a decision. Agent hooks use the same `"ok"` / `"reason"` response format as prompt hooks, but with a longer default timeout of 60 seconds and up to 50 tool-use turns. This example verifies that tests pass before allowing Claude to stop: ```json theme={null} { "hooks": { "Stop": [ { "hooks": [ { "type": "agent", "prompt": "Verify that all unit tests pass. Run the test suite and check the results. $ARGUMENTS", "timeout": 120 } ] } ] } } ``` Use prompt hooks when the hook input data alone is enough to make a decision. Use agent hooks when you need to verify something against the actual state of the codebase. For full configuration options, see [Agent-based hooks](/en/hooks#agent-based-hooks) in the reference. ## Limitations and troubleshooting ### Limitations - Hooks communicate through stdout, stderr, and exit codes only. They cannot trigger slash commands or tool calls directly. 
- Hook timeout is 10 minutes by default, configurable per hook with the `timeout` field (in seconds). - `PostToolUse` hooks cannot undo actions since the tool has already executed. - `PermissionRequest` hooks do not fire in [non-interactive mode](/en/headless) (`-p`). Use `PreToolUse` hooks for automated permission decisions. - `Stop` hooks fire whenever Claude finishes responding, not only at task completion. They do not fire on user interrupts. ### Hook not firing The hook is configured but never executes. - Run `/hooks` and confirm the hook appears under the correct event - Check that the matcher pattern matches the tool name exactly (matchers are case-sensitive) - Verify you're triggering the right event type (e.g., `PreToolUse` fires before tool execution, `PostToolUse` fires after) - If using `PermissionRequest` hooks in non-interactive mode (`-p`), switch to `PreToolUse` instead ### Hook error in output You see a message like "PreToolUse hook error: ..." in the transcript. - Your script exited with a non-zero code unexpectedly. Test it manually by piping sample JSON: ```bash theme={null} echo '{"tool_name":"Bash","tool_input":{"command":"ls"}}' | ./my-hook.sh echo $? # Check the exit code ``` * If you see "command not found", use absolute paths or `$CLAUDE_PROJECT_DIR` to reference scripts - If you see "jq: command not found", install `jq` or use Python/Node.js for JSON parsing - If the script isn't running at all, make it executable: `chmod +x ./my-hook.sh` ### `/hooks` shows no hooks configured You edited a settings file but the hooks don't appear in the menu. - Restart your session or open `/hooks` to reload. Hooks added through the `/hooks` menu take effect immediately, but manual file edits require a reload. - Verify your JSON is valid (trailing commas and comments are not allowed) - Confirm the settings file is in the correct location: `.claude/settings.json` for project hooks, `~/.claude/settings.json` for global hooks ### Stop hook runs forever Claude keeps working in an infinite loop instead of stopping. Your Stop hook script needs to check whether it already triggered a continuation. Parse the `stop_hook_active` field from the JSON input and exit early if it's `true`: ```bash theme={null} #!/bin/bash INPUT=$(cat) if [ "$(echo "$INPUT" | jq -r '.stop_hook_active')" = "true" ]; then exit 0 # Allow Claude to stop fi # ... rest of your hook logic ``` ### JSON validation failed Claude Code shows a JSON parsing error even though your hook script outputs valid JSON. When Claude Code runs a hook, it spawns a shell that sources your profile (`~/.zshrc` or `~/.bashrc`). If your profile contains unconditional `echo` statements, that output gets prepended to your hook's JSON: ``` Shell ready on arm64 {"decision": "block", "reason": "Not allowed"} ``` Claude Code tries to parse this as JSON and fails. To fix this, wrap echo statements in your shell profile so they only run in interactive shells: ```bash theme={null} # In ~/.zshrc or ~/.bashrc if [[ $- == *i* ]]; then echo "Shell ready" fi ``` The `$-` variable contains shell flags, and `i` means interactive. Hooks run in non-interactive shells, so the echo is skipped. ### Debug techniques Toggle verbose mode with `Ctrl+O` to see hook output in the transcript, or run `claude --debug` for full execution details including which hooks matched and their exit codes. 
## Learn more - [Hooks reference](/en/hooks): full event schemas, JSON output format, async hooks, and MCP tool hooks - [Security considerations](/en/hooks#security-considerations): review before deploying hooks in shared or production environments - [Bash command validator example](https://github.com/anthropics/claude-code/blob/main/examples/hooks/bash_command_validator_example.py): complete reference implementation --- > ## Documentation Index > > Fetch the complete documentation index at: <https://code.claude.com/docs/llms.txt> > Use this file to discover all available pages before exploring further. # Hooks reference > Reference for Claude Code hook events, configuration schema, JSON input/output formats, exit codes, async hooks, prompt hooks, and MCP tool hooks. <Tip> For a quickstart guide with examples, see [Automate workflows with hooks](/en/hooks-guide). </Tip> Hooks are user-defined shell commands or LLM prompts that execute automatically at specific points in Claude Code's lifecycle. Use this reference to look up event schemas, configuration options, JSON input/output formats, and advanced features like async hooks and MCP tool hooks. If you're setting up hooks for the first time, start with the [guide](/en/hooks-guide) instead. ## Hook lifecycle Hooks fire at specific points during a Claude Code session. When an event fires and a matcher matches, Claude Code passes JSON context about the event to your hook handler. For command hooks, this arrives on stdin. Your handler can then inspect the input, take action, and optionally return a decision. Some events fire once per session, while others fire repeatedly inside the agentic loop: <div style={{maxWidth: "500px", margin: "0 auto"}}> <Frame> <img src="https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=5c25fedbc3db6f8882af50c3cc478c32" alt="Hook lifecycle diagram showing the sequence of hooks from SessionStart through the agentic loop to SessionEnd" data-og-width="8876" width="8876" data-og-height="12492" height="12492" data-path="images/hooks-lifecycle.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?w=280&fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=62406fcd5d4a189cc8842ee1bd946b84 280w, https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?w=560&fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=fa3049022a6973c5f974e0f95b28169d 560w, https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?w=840&fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=bd2890897db61a03160b93d4f972ff8e 840w, https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?w=1100&fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=7ae8e098340479347135e39df4a13454 1100w, https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?w=1650&fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=848a8606aab22c2ccaa16b6a18431e32 1650w, https://mintcdn.com/claude-code/z2YM37Ycg6eMbID3/images/hooks-lifecycle.png?w=2500&fit=max&auto=format&n=z2YM37Ycg6eMbID3&q=85&s=f3a9ef7feb61fa8fe362005aa185efbc 2500w" /> </Frame> </div> The table below summarizes when each event fires. The [Hook events](#hook-events) section documents the full input schema and decision control options for each one. 
| Event                | When it fires                                         |
| :------------------- | :---------------------------------------------------- |
| `SessionStart`       | When a session begins or resumes                      |
| `UserPromptSubmit`   | When you submit a prompt, before Claude processes it  |
| `PreToolUse`         | Before a tool call executes. Can block it             |
| `PermissionRequest`  | When a permission dialog appears                      |
| `PostToolUse`        | After a tool call succeeds                            |
| `PostToolUseFailure` | After a tool call fails                               |
| `Notification`       | When Claude Code sends a notification                 |
| `SubagentStart`      | When a subagent is spawned                            |
| `SubagentStop`       | When a subagent finishes                              |
| `Stop`               | When Claude finishes responding                       |
| `PreCompact`         | Before context compaction                             |
| `SessionEnd`         | When a session terminates                             |

### How a hook resolves

To see how these pieces fit together, consider this `PreToolUse` hook that blocks destructive shell commands. The hook runs `block-rm.sh` before every Bash tool call:

```json theme={null}
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/block-rm.sh"
          }
        ]
      }
    ]
  }
}
```

The script reads the JSON input from stdin, extracts the command, and returns a `permissionDecision` of `"deny"` if it contains `rm -rf`:

```bash theme={null}
#!/bin/bash
# .claude/hooks/block-rm.sh
COMMAND=$(jq -r '.tool_input.command')
if echo "$COMMAND" | grep -q 'rm -rf'; then
  jq -n '{
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: "deny",
      permissionDecisionReason: "Destructive command blocked by hook"
    }
  }'
else
  exit 0 # allow the command
fi
```

Now suppose Claude Code decides to run `Bash "rm -rf /tmp/build"`. Here's what happens:

<Frame>
  <img src="https://mintcdn.com/claude-code/s7NM0vfd_wres2nf/images/hook-resolution.svg" alt="Hook resolution flow: PreToolUse event fires, matcher checks for Bash match, hook handler runs, result returns to Claude Code" />
</Frame>

<Steps>
<Step title="Event fires">
The `PreToolUse` event fires. Claude Code sends the tool input as JSON on stdin to the hook:

```json theme={null}
{
  "tool_name": "Bash",
  "tool_input": {
    "command": "rm -rf /tmp/build"
  },
  ...
}
```
</Step>

<Step title="Matcher checks">
The matcher `"Bash"` matches the tool name, so `block-rm.sh` runs.
If you omit the matcher or use `"*"`, the hook runs on every occurrence of the event. Hooks only skip when a matcher is defined and doesn't match. </Step> <Step title="Hook handler runs"> The script extracts `"rm -rf /tmp/build"` from the input and finds `rm -rf`, so it prints a decision to stdout: ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PreToolUse", "permissionDecision": "deny", "permissionDecisionReason": "Destructive command blocked by hook" } } ``` If the command had been safe (like `npm test`), the script would hit `exit 0` instead, which tells Claude Code to allow the tool call with no further action. </Step> <Step title="Claude Code acts on the result"> Claude Code reads the JSON decision, blocks the tool call, and shows Claude the reason. </Step> </Steps> The [Configuration](#configuration) section below documents the full schema, and each [hook event](#hook-events) section documents what input your command receives and what output it can return. ## Configuration Hooks are defined in JSON settings files. The configuration has three levels of nesting: 1. Choose a [hook event](#hook-events) to respond to, like `PreToolUse` or `Stop` 2. Add a [matcher group](#matcher-patterns) to filter when it fires, like "only for the Bash tool" 3. Define one or more [hook handlers](#hook-handler-fields) to run when matched See [How a hook resolves](#how-a-hook-resolves) above for a complete walkthrough with an annotated example. <Note> This page uses specific terms for each level: **hook event** for the lifecycle point, **matcher group** for the filter, and **hook handler** for the shell command, prompt, or agent that runs. "Hook" on its own refers to the general feature. </Note> ### Hook locations Where you define a hook determines its scope: | Location | Scope | Shareable | | :--------------------------------------------------------- | :---------------------------- | :--------------------------------- | | `~/.claude/settings.json` | All your projects | No, local to your machine | | `.claude/settings.json` | Single project | Yes, can be committed to the repo | | `.claude/settings.local.json` | Single project | No, gitignored | | Managed policy settings | Organization-wide | Yes, admin-controlled | | [Plugin](/en/plugins) `hooks/hooks.json` | When plugin is enabled | Yes, bundled with the plugin | | [Skill](/en/skills) or [agent](/en/sub-agents) frontmatter | While the component is active | Yes, defined in the component file | For details on settings file resolution, see [settings](/en/settings). Enterprise administrators can use `allowManagedHooksOnly` to block user, project, and plugin hooks. See [Hook configuration](/en/settings#hook-configuration). ### Matcher patterns The `matcher` field is a regex string that filters when hooks fire. Use `"*"`, `""`, or omit `matcher` entirely to match all occurrences. 
Each event type matches on a different field: | Event | What the matcher filters | Example matcher values | | :--------------------------------------------------------------------- | :------------------------ | :----------------------------------------------------------------------------- | | `PreToolUse`, `PostToolUse`, `PostToolUseFailure`, `PermissionRequest` | tool name | `Bash`, `Edit\|Write`, `mcp__.*` | | `SessionStart` | how the session started | `startup`, `resume`, `clear`, `compact` | | `SessionEnd` | why the session ended | `clear`, `logout`, `prompt_input_exit`, `bypass_permissions_disabled`, `other` | | `Notification` | notification type | `permission_prompt`, `idle_prompt`, `auth_success`, `elicitation_dialog` | | `SubagentStart` | agent type | `Bash`, `Explore`, `Plan`, or custom agent names | | `PreCompact` | what triggered compaction | `manual`, `auto` | | `SubagentStop` | agent type | same values as `SubagentStart` | | `UserPromptSubmit`, `Stop` | no matcher support | always fires on every occurrence | The matcher is a regex, so `Edit|Write` matches either tool and `Notebook.*` matches any tool starting with Notebook. The matcher runs against a field from the [JSON input](#hook-input-and-output) that Claude Code sends to your hook on stdin. For tool events, that field is `tool_name`. Each [hook event](#hook-events) section lists the full set of matcher values and the input schema for that event. This example runs a linting script only when Claude writes or edits a file: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Edit|Write", "hooks": [ { "type": "command", "command": "/path/to/lint-check.sh" } ] } ] } } ``` `UserPromptSubmit` and `Stop` don't support matchers and always fire on every occurrence. If you add a `matcher` field to these events, it is silently ignored. #### Match MCP tools [MCP](/en/mcp) server tools appear as regular tools in tool events (`PreToolUse`, `PostToolUse`, `PostToolUseFailure`, `PermissionRequest`), so you can match them the same way you match any other tool name. MCP tools follow the naming pattern `mcp__<server>__<tool>`, for example: - `mcp__memory__create_entities`: Memory server's create entities tool - `mcp__filesystem__read_file`: Filesystem server's read file tool - `mcp__github__search_repositories`: GitHub server's search tool Use regex patterns to target specific MCP tools or groups of tools: - `mcp__memory__.*` matches all tools from the `memory` server - `mcp__.*__write.*` matches any tool containing "write" from any server This example logs all memory server operations and validates write operations from any MCP server: ```json theme={null} { "hooks": { "PreToolUse": [ { "matcher": "mcp__memory__.*", "hooks": [ { "type": "command", "command": "echo 'Memory operation initiated' >> ~/mcp-operations.log" } ] }, { "matcher": "mcp__.*__write.*", "hooks": [ { "type": "command", "command": "/home/user/scripts/validate-mcp-write.py" } ] } ] } } ``` ### Hook handler fields Each object in the inner `hooks` array is a hook handler: the shell command, LLM prompt, or agent that runs when the matcher matches. There are three types: - **[Command hooks](#command-hook-fields)** (`type: "command"`): run a shell command. Your script receives the event's [JSON input](#hook-input-and-output) on stdin and communicates results back through exit codes and stdout. - **[Prompt hooks](#prompt-and-agent-hook-fields)** (`type: "prompt"`): send a prompt to a Claude model for single-turn evaluation. 
The model returns a yes/no decision as JSON. See [Prompt-based hooks](#prompt-based-hooks). - **[Agent hooks](#prompt-and-agent-hook-fields)** (`type: "agent"`): spawn a subagent that can use tools like Read, Grep, and Glob to verify conditions before returning a decision. See [Agent-based hooks](#agent-based-hooks). #### Common fields These fields apply to all hook types: | Field | Required | Description | | :-------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------------- | | `type` | yes | `"command"`, `"prompt"`, or `"agent"` | | `timeout` | no | Seconds before canceling. Defaults: 600 for command, 30 for prompt, 60 for agent | | `statusMessage` | no | Custom spinner message displayed while the hook runs | | `once` | no | If `true`, runs only once per session then is removed. Skills only, not agents. See [Hooks in skills and agents](#hooks-in-skills-and-agents) | #### Command hook fields In addition to the [common fields](#common-fields), command hooks accept these fields: | Field | Required | Description | | :-------- | :------- | :------------------------------------------------------------------------------------------------------------------ | | `command` | yes | Shell command to execute | | `async` | no | If `true`, runs in the background without blocking. See [Run hooks in the background](#run-hooks-in-the-background) | #### Prompt and agent hook fields In addition to the [common fields](#common-fields), prompt and agent hooks accept these fields: | Field | Required | Description | | :------- | :------- | :------------------------------------------------------------------------------------------ | | `prompt` | yes | Prompt text to send to the model. Use `$ARGUMENTS` as a placeholder for the hook input JSON | | `model` | no | Model to use for evaluation. Defaults to a fast model | All matching hooks run in parallel, and identical handlers are deduplicated automatically. Handlers run in the current directory with Claude Code's environment. The `$CLAUDE_CODE_REMOTE` environment variable is set to `"true"` in remote web environments and not set in the local CLI. ### Reference scripts by path Use environment variables to reference hook scripts relative to the project or plugin root, regardless of the working directory when the hook runs: - `$CLAUDE_PROJECT_DIR`: the project root. Wrap in quotes to handle paths with spaces. - `${CLAUDE_PLUGIN_ROOT}`: the plugin's root directory, for scripts bundled with a [plugin](/en/plugins). <Tabs> <Tab title="Project scripts"> This example uses `$CLAUDE_PROJECT_DIR` to run a style checker from the project's `.claude/hooks/` directory after any `Write` or `Edit` tool call: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Write|Edit", "hooks": [ { "type": "command", "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/check-style.sh" } ] } ] } } ``` </Tab> <Tab title="Plugin scripts"> Define plugin hooks in `hooks/hooks.json` with an optional top-level `description` field. When a plugin is enabled, its hooks merge with your user and project hooks. 
This example runs a formatting script bundled with the plugin: ```json theme={null} { "description": "Automatic code formatting", "hooks": { "PostToolUse": [ { "matcher": "Write|Edit", "hooks": [ { "type": "command", "command": "${CLAUDE_PLUGIN_ROOT}/scripts/format.sh", "timeout": 30 } ] } ] } } ``` See the [plugin components reference](/en/plugins-reference#hooks) for details on creating plugin hooks. </Tab> </Tabs> ### Hooks in skills and agents In addition to settings files and plugins, hooks can be defined directly in [skills](/en/skills) and [subagents](/en/sub-agents) using frontmatter. These hooks are scoped to the component's lifecycle and only run when that component is active. All hook events are supported. For subagents, `Stop` hooks are automatically converted to `SubagentStop` since that is the event that fires when a subagent completes. Hooks use the same configuration format as settings-based hooks but are scoped to the component's lifetime and cleaned up when it finishes. This skill defines a `PreToolUse` hook that runs a security validation script before each `Bash` command: ```yaml theme={null} --- name: secure-operations description: Perform operations with security checks hooks: PreToolUse: - matcher: "Bash" hooks: - type: command command: "./scripts/security-check.sh" --- ``` Agents use the same format in their YAML frontmatter. ### The `/hooks` menu Type `/hooks` in Claude Code to open the interactive hooks manager, where you can view, add, and delete hooks without editing settings files directly. For a step-by-step walkthrough, see [Set up your first hook](/en/hooks-guide#set-up-your-first-hook) in the guide. Each hook in the menu is labeled with a bracket prefix indicating its source: - `[User]`: from `~/.claude/settings.json` - `[Project]`: from `.claude/settings.json` - `[Local]`: from `.claude/settings.local.json` - `[Plugin]`: from a plugin's `hooks/hooks.json`, read-only ### Disable or remove hooks To remove a hook, delete its entry from the settings JSON file, or use the `/hooks` menu and select the hook to delete it. To temporarily disable all hooks without removing them, set `"disableAllHooks": true` in your settings file or use the toggle in the `/hooks` menu. There is no way to disable an individual hook while keeping it in the configuration. Direct edits to hooks in settings files don't take effect immediately. Claude Code captures a snapshot of hooks at startup and uses it throughout the session. This prevents malicious or accidental hook modifications from taking effect mid-session without your review. If hooks are modified externally, Claude Code warns you and requires review in the `/hooks` menu before changes apply. ## Hook input and output Hooks receive JSON data via stdin and communicate results through exit codes, stdout, and stderr. This section covers fields and behavior common to all events. Each event's section under [Hook events](#hook-events) includes its specific input schema and decision control options. 
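To make the mechanics concrete, here is a minimal sketch of a command hook handler, assuming `jq` is available. The event name comes from the JSON on stdin, and the exit code reports the result back to Claude Code; the log path used here is a hypothetical placeholder:

```bash theme={null}
#!/bin/bash
# Minimal command hook skeleton: read the event JSON from stdin,
# inspect a field, and signal the result via the exit code.
INPUT=$(cat)
EVENT=$(echo "$INPUT" | jq -r '.hook_event_name')

# Example action: record which event fired (hypothetical log location)
echo "$EVENT fired at $(date)" >> ~/.claude/hook-events.log

exit 0  # 0 = success; 2 = blocking error, with stderr fed back to Claude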
### Common input fields All hook events receive these fields via stdin as JSON, in addition to event-specific fields documented in each [hook event](#hook-events) section: | Field | Description | | :---------------- | :----------------------------------------------------------------------------------------------------------------------------------------- | | `session_id` | Current session identifier | | `transcript_path` | Path to conversation JSON | | `cwd` | Current working directory when the hook is invoked | | `permission_mode` | Current [permission mode](/en/permissions#permission-modes): `"default"`, `"plan"`, `"acceptEdits"`, `"dontAsk"`, or `"bypassPermissions"` | | `hook_event_name` | Name of the event that fired | For example, a `PreToolUse` hook for a Bash command receives this on stdin: ```json theme={null} { "session_id": "abc123", "transcript_path": "/home/user/.claude/projects/.../transcript.jsonl", "cwd": "/home/user/my-project", "permission_mode": "default", "hook_event_name": "PreToolUse", "tool_name": "Bash", "tool_input": { "command": "npm test" } } ``` The `tool_name` and `tool_input` fields are event-specific. Each [hook event](#hook-events) section documents the additional fields for that event. ### Exit code output The exit code from your hook command tells Claude Code whether the action should proceed, be blocked, or be ignored. **Exit 0** means success. Claude Code parses stdout for [JSON output fields](#json-output). JSON output is only processed on exit 0. For most events, stdout is only shown in verbose mode (`Ctrl+O`). The exceptions are `UserPromptSubmit` and `SessionStart`, where stdout is added as context that Claude can see and act on. **Exit 2** means a blocking error. Claude Code ignores stdout and any JSON in it. Instead, stderr text is fed back to Claude as an error message. The effect depends on the event: `PreToolUse` blocks the tool call, `UserPromptSubmit` rejects the prompt, and so on. See [exit code 2 behavior](#exit-code-2-behavior-per-event) for the full list. **Any other exit code** is a non-blocking error. stderr is shown in verbose mode (`Ctrl+O`) and execution continues. For example, a hook command script that blocks dangerous Bash commands: ```bash theme={null} #!/bin/bash # Reads JSON input from stdin, checks the command command=$(jq -r '.tool_input.command' < /dev/stdin) if [[ "$command" == rm* ]]; then echo "Blocked: rm commands are not allowed" >&2 exit 2 # Blocking error: tool call is prevented fi exit 0 # Success: tool call proceeds ``` #### Exit code 2 behavior per event Exit code 2 is the way a hook signals "stop, don't do this." The effect depends on the event, because some events represent actions that can be blocked (like a tool call that hasn't happened yet) and others represent things that already happened or can't be prevented. | Hook event | Can block? 
| What happens on exit 2 | | :------------------- | :--------- | :-------------------------------------------------------- | | `PreToolUse` | Yes | Blocks the tool call | | `PermissionRequest` | Yes | Denies the permission | | `UserPromptSubmit` | Yes | Blocks prompt processing and erases the prompt | | `Stop` | Yes | Prevents Claude from stopping, continues the conversation | | `SubagentStop` | Yes | Prevents the subagent from stopping | | `PostToolUse` | No | Shows stderr to Claude (tool already ran) | | `PostToolUseFailure` | No | Shows stderr to Claude (tool already failed) | | `Notification` | No | Shows stderr to user only | | `SubagentStart` | No | Shows stderr to user only | | `SessionStart` | No | Shows stderr to user only | | `SessionEnd` | No | Shows stderr to user only | | `PreCompact` | No | Shows stderr to user only | ### JSON output Exit codes let you allow or block, but JSON output gives you finer-grained control. Instead of exiting with code 2 to block, exit 0 and print a JSON object to stdout. Claude Code reads specific fields from that JSON to control behavior, including [decision control](#decision-control) for blocking, allowing, or escalating to the user. <Note> You must choose one approach per hook, not both: either use exit codes alone for signaling, or exit 0 and print JSON for structured control. Claude Code only processes JSON on exit 0. If you exit 2, any JSON is ignored. </Note> Your hook's stdout must contain only the JSON object. If your shell profile prints text on startup, it can interfere with JSON parsing. See [JSON validation failed](/en/hooks-guide#json-validation-failed) in the troubleshooting guide. The JSON object supports three kinds of fields: - **Universal fields** like `continue` work across all events. These are listed in the table below. - **Top-level `decision` and `reason`** are used by some events to block or provide feedback. - **`hookSpecificOutput`** is a nested object for events that need richer control. It requires a `hookEventName` field set to the event name. | Field | Default | Description | | :--------------- | :------ | :------------------------------------------------------------------------------------------------------------------------- | | `continue` | `true` | If `false`, Claude stops processing entirely after the hook runs. Takes precedence over any event-specific decision fields | | `stopReason` | none | Message shown to the user when `continue` is `false`. Not shown to Claude | | `suppressOutput` | `false` | If `true`, hides stdout from verbose mode output | | `systemMessage` | none | Warning message shown to the user | To stop Claude entirely regardless of event type: ```json theme={null} { "continue": false, "stopReason": "Build failed, fix errors before continuing" } ``` #### Decision control Not every event supports blocking or controlling behavior through JSON. The events that do each use a different set of fields to express that decision. 
Use this table as a quick reference before writing a hook: | Events | Decision pattern | Key fields | | :-------------------------------------------------------------------- | :------------------- | :---------------------------------------------------------------- | | UserPromptSubmit, PostToolUse, PostToolUseFailure, Stop, SubagentStop | Top-level `decision` | `decision: "block"`, `reason` | | PreToolUse | `hookSpecificOutput` | `permissionDecision` (allow/deny/ask), `permissionDecisionReason` | | PermissionRequest | `hookSpecificOutput` | `decision.behavior` (allow/deny) | Here are examples of each pattern in action: <Tabs> <Tab title="Top-level decision"> Used by `UserPromptSubmit`, `PostToolUse`, `PostToolUseFailure`, `Stop`, and `SubagentStop`. The only value is `"block"` — to allow the action to proceed, omit `decision` from your JSON, or exit 0 without any JSON at all: ```json theme={null} { "decision": "block", "reason": "Test suite must pass before proceeding" } ``` </Tab> <Tab title="PreToolUse"> Uses `hookSpecificOutput` for richer control: allow, deny, or escalate to the user. You can also modify tool input before it runs or inject additional context for Claude. See [PreToolUse decision control](#pretooluse-decision-control) for the full set of options. ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PreToolUse", "permissionDecision": "deny", "permissionDecisionReason": "Database writes are not allowed" } } ``` </Tab> <Tab title="PermissionRequest"> Uses `hookSpecificOutput` to allow or deny a permission request on behalf of the user. When allowing, you can also modify the tool's input or apply permission rules so the user isn't prompted again. See [PermissionRequest decision control](#permissionrequest-decision-control) for the full set of options. ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PermissionRequest", "decision": { "behavior": "allow", "updatedInput": { "command": "npm run lint" } } } } ``` </Tab> </Tabs> For extended examples including Bash command validation, prompt filtering, and auto-approval scripts, see [What you can automate](/en/hooks-guide#what-you-can-automate) in the guide and the [Bash command validator reference implementation](https://github.com/anthropics/claude-code/blob/main/examples/hooks/bash_command_validator_example.py). ## Hook events Each event corresponds to a point in Claude Code's lifecycle where hooks can run. The sections below are ordered to match the lifecycle: from session setup through the agentic loop to session end. Each section describes when the event fires, what matchers it supports, the JSON input it receives, and how to control behavior through output. ### SessionStart Runs when Claude Code starts a new session or resumes an existing session. Useful for loading development context like existing issues or recent changes to your codebase, or setting up environment variables. For static context that does not require a script, use [CLAUDE.md](/en/memory) instead. SessionStart runs on every session, so keep these hooks fast. The matcher value corresponds to how the session was initiated: | Matcher | When it fires | | :-------- | :------------------------------------- | | `startup` | New session | | `resume` | `--resume`, `--continue`, or `/resume` | | `clear` | `/clear` | | `compact` | Auto or manual compaction | #### SessionStart input In addition to the [common input fields](#common-input-fields), SessionStart hooks receive `source`, `model`, and optionally `agent_type`. 
The `source` field indicates how the session started: `"startup"` for new sessions, `"resume"` for resumed sessions, `"clear"` after `/clear`, or `"compact"` after compaction. The `model` field contains the model identifier. If you start Claude Code with `claude --agent <name>`, an `agent_type` field contains the agent name. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "SessionStart", "source": "startup", "model": "claude-sonnet-4-5-20250929" } ``` #### SessionStart decision control Any text your hook script prints to stdout is added as context for Claude. In addition to the [JSON output fields](#json-output) available to all hooks, you can return these event-specific fields: | Field | Description | | :------------------ | :------------------------------------------------------------------------ | | `additionalContext` | String added to Claude's context. Multiple hooks' values are concatenated | ```json theme={null} { "hookSpecificOutput": { "hookEventName": "SessionStart", "additionalContext": "My additional context here" } } ``` #### Persist environment variables SessionStart hooks have access to the `CLAUDE_ENV_FILE` environment variable, which provides a file path where you can persist environment variables for subsequent Bash commands. To set individual environment variables, write `export` statements to `CLAUDE_ENV_FILE`. Use append (`>>`) to preserve variables set by other hooks: ```bash theme={null} #!/bin/bash if [ -n "$CLAUDE_ENV_FILE" ]; then echo 'export NODE_ENV=production' >> "$CLAUDE_ENV_FILE" echo 'export DEBUG_LOG=true' >> "$CLAUDE_ENV_FILE" echo 'export PATH="$PATH:./node_modules/.bin"' >> "$CLAUDE_ENV_FILE" fi exit 0 ``` To capture all environment changes from setup commands, compare the exported variables before and after: ```bash theme={null} #!/bin/bash ENV_BEFORE=$(export -p | sort) # Run your setup commands that modify the environment source ~/.nvm/nvm.sh nvm use 20 if [ -n "$CLAUDE_ENV_FILE" ]; then ENV_AFTER=$(export -p | sort) comm -13 <(echo "$ENV_BEFORE") <(echo "$ENV_AFTER") >> "$CLAUDE_ENV_FILE" fi exit 0 ``` Any variables written to this file will be available in all subsequent Bash commands that Claude Code executes during the session. <Note> `CLAUDE_ENV_FILE` is available for SessionStart hooks. Other hook types do not have access to this variable. </Note> ### UserPromptSubmit Runs when the user submits a prompt, before Claude processes it. This allows you to add additional context based on the prompt/conversation, validate prompts, or block certain types of prompts. #### UserPromptSubmit input In addition to the [common input fields](#common-input-fields), UserPromptSubmit hooks receive the `prompt` field containing the text the user submitted. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "UserPromptSubmit", "prompt": "Write a function to calculate the factorial of a number" } ``` #### UserPromptSubmit decision control `UserPromptSubmit` hooks can control whether a user prompt is processed and add context. All [JSON output fields](#json-output) are available. 
There are two ways to add context to the conversation on exit code 0: - **Plain text stdout**: any non-JSON text written to stdout is added as context - **JSON with `additionalContext`**: use the JSON format below for more control. The `additionalContext` field is added as context Plain stdout is shown as hook output in the transcript. The `additionalContext` field is added more discretely. To block a prompt, return a JSON object with `decision` set to `"block"`: | Field | Description | | :------------------ | :----------------------------------------------------------------------------------------------------------------- | | `decision` | `"block"` prevents the prompt from being processed and erases it from context. Omit to allow the prompt to proceed | | `reason` | Shown to the user when `decision` is `"block"`. Not added to context | | `additionalContext` | String added to Claude's context | ```json theme={null} { "decision": "block", "reason": "Explanation for decision", "hookSpecificOutput": { "hookEventName": "UserPromptSubmit", "additionalContext": "My additional context here" } } ``` <Note> The JSON format isn't required for simple use cases. To add context, you can print plain text to stdout with exit code 0. Use JSON when you need to block prompts or want more structured control. </Note> ### PreToolUse Runs after Claude creates tool parameters and before processing the tool call. Matches on tool name: `Bash`, `Edit`, `Write`, `Read`, `Glob`, `Grep`, `Task`, `WebFetch`, `WebSearch`, and any [MCP tool names](#match-mcp-tools). Use [PreToolUse decision control](#pretooluse-decision-control) to allow, deny, or ask for permission to use the tool. #### PreToolUse input In addition to the [common input fields](#common-input-fields), PreToolUse hooks receive `tool_name`, `tool_input`, and `tool_use_id`. The `tool_input` fields depend on the tool: ##### Bash Executes shell commands. | Field | Type | Example | Description | | :------------------ | :------ | :----------------- | :-------------------------------------------- | | `command` | string | `"npm test"` | The shell command to execute | | `description` | string | `"Run test suite"` | Optional description of what the command does | | `timeout` | number | `120000` | Optional timeout in milliseconds | | `run_in_background` | boolean | `false` | Whether to run the command in background | ##### Write Creates or overwrites a file. | Field | Type | Example | Description | | :---------- | :----- | :-------------------- | :--------------------------------- | | `file_path` | string | `"/path/to/file.txt"` | Absolute path to the file to write | | `content` | string | `"file content"` | Content to write to the file | ##### Edit Replaces a string in an existing file. | Field | Type | Example | Description | | :------------ | :------ | :-------------------- | :--------------------------------- | | `file_path` | string | `"/path/to/file.txt"` | Absolute path to the file to edit | | `old_string` | string | `"original text"` | Text to find and replace | | `new_string` | string | `"replacement text"` | Replacement text | | `replace_all` | boolean | `false` | Whether to replace all occurrences | ##### Read Reads file contents. 
| Field | Type | Example | Description | | :---------- | :----- | :-------------------- | :----------------------------------------- | | `file_path` | string | `"/path/to/file.txt"` | Absolute path to the file to read | | `offset` | number | `10` | Optional line number to start reading from | | `limit` | number | `50` | Optional number of lines to read | ##### Glob Finds files matching a glob pattern. | Field | Type | Example | Description | | :-------- | :----- | :--------------- | :--------------------------------------------------------------------- | | `pattern` | string | `"**/*.ts"` | Glob pattern to match files against | | `path` | string | `"/path/to/dir"` | Optional directory to search in. Defaults to current working directory | ##### Grep Searches file contents with regular expressions. | Field | Type | Example | Description | | :------------ | :------ | :--------------- | :------------------------------------------------------------------------------------ | | `pattern` | string | `"TODO.*fix"` | Regular expression pattern to search for | | `path` | string | `"/path/to/dir"` | Optional file or directory to search in | | `glob` | string | `"*.ts"` | Optional glob pattern to filter files | | `output_mode` | string | `"content"` | `"content"`, `"files_with_matches"`, or `"count"`. Defaults to `"files_with_matches"` | | `-i` | boolean | `true` | Case insensitive search | | `multiline` | boolean | `false` | Enable multiline matching | ##### WebFetch Fetches and processes web content. | Field | Type | Example | Description | | :------- | :----- | :---------------------------- | :----------------------------------- | | `url` | string | `"https://example.com/api"` | URL to fetch content from | | `prompt` | string | `"Extract the API endpoints"` | Prompt to run on the fetched content | ##### WebSearch Searches the web. | Field | Type | Example | Description | | :---------------- | :----- | :----------------------------- | :------------------------------------------------ | | `query` | string | `"react hooks best practices"` | Search query | | `allowed_domains` | array | `["docs.example.com"]` | Optional: only include results from these domains | | `blocked_domains` | array | `["spam.example.com"]` | Optional: exclude results from these domains | ##### Task Spawns a [subagent](/en/sub-agents). | Field | Type | Example | Description | | :-------------- | :----- | :------------------------- | :------------------------------------------- | | `prompt` | string | `"Find all API endpoints"` | The task for the agent to perform | | `description` | string | `"Find API endpoints"` | Short description of the task | | `subagent_type` | string | `"Explore"` | Type of specialized agent to use | | `model` | string | `"sonnet"` | Optional model alias to override the default | #### PreToolUse decision control `PreToolUse` hooks can control whether a tool call proceeds. Unlike other hooks that use a top-level `decision` field, PreToolUse returns its decision inside a `hookSpecificOutput` object. This gives it richer control: three outcomes (allow, deny, or ask) plus the ability to modify tool input before execution. 
| Field | Description | | :------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------- | | `permissionDecision` | `"allow"` bypasses the permission system, `"deny"` prevents the tool call, `"ask"` prompts the user to confirm | | `permissionDecisionReason` | For `"allow"` and `"ask"`, shown to the user but not Claude. For `"deny"`, shown to Claude | | `updatedInput` | Modifies the tool's input parameters before execution. Combine with `"allow"` to auto-approve, or `"ask"` to show the modified input to the user | | `additionalContext` | String added to Claude's context before the tool executes | ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PreToolUse", "permissionDecision": "allow", "permissionDecisionReason": "My reason here", "updatedInput": { "field_to_modify": "new value" }, "additionalContext": "Current environment: production. Proceed with caution." } } ``` <Note> PreToolUse previously used top-level `decision` and `reason` fields, but these are deprecated for this event. Use `hookSpecificOutput.permissionDecision` and `hookSpecificOutput.permissionDecisionReason` instead. The deprecated values `"approve"` and `"block"` map to `"allow"` and `"deny"` respectively. Other events like PostToolUse and Stop continue to use top-level `decision` and `reason` as their current format. </Note> ### PermissionRequest Runs when the user is shown a permission dialog. Use [PermissionRequest decision control](#permissionrequest-decision-control) to allow or deny on behalf of the user. Matches on tool name, same values as PreToolUse. #### PermissionRequest input PermissionRequest hooks receive `tool_name` and `tool_input` fields like PreToolUse hooks, but without `tool_use_id`. An optional `permission_suggestions` array contains the "always allow" options the user would normally see in the permission dialog. The difference is when the hook fires: PermissionRequest hooks run when a permission dialog is about to be shown to the user, while PreToolUse hooks run before tool execution regardless of permission status. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "PermissionRequest", "tool_name": "Bash", "tool_input": { "command": "rm -rf node_modules", "description": "Remove node_modules directory" }, "permission_suggestions": [ { "type": "toolAlwaysAllow", "tool": "Bash" } ] } ``` #### PermissionRequest decision control `PermissionRequest` hooks can allow or deny permission requests. 
In addition to the [JSON output fields](#json-output) available to all hooks, your hook script can return a `decision` object with these event-specific fields: | Field | Description | | :------------------- | :------------------------------------------------------------------------------------------------------------- | | `behavior` | `"allow"` grants the permission, `"deny"` denies it | | `updatedInput` | For `"allow"` only: modifies the tool's input parameters before execution | | `updatedPermissions` | For `"allow"` only: applies permission rule updates, equivalent to the user selecting an "always allow" option | | `message` | For `"deny"` only: tells Claude why the permission was denied | | `interrupt` | For `"deny"` only: if `true`, stops Claude | ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PermissionRequest", "decision": { "behavior": "allow", "updatedInput": { "command": "npm run lint" } } } } ``` ### PostToolUse Runs immediately after a tool completes successfully. Matches on tool name, same values as PreToolUse. #### PostToolUse input `PostToolUse` hooks fire after a tool has already executed successfully. The input includes both `tool_input`, the arguments sent to the tool, and `tool_response`, the result it returned. The exact schema for both depends on the tool. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "PostToolUse", "tool_name": "Write", "tool_input": { "file_path": "/path/to/file.txt", "content": "file content" }, "tool_response": { "filePath": "/path/to/file.txt", "success": true }, "tool_use_id": "toolu_01ABC123..." } ``` #### PostToolUse decision control `PostToolUse` hooks can provide feedback to Claude after tool execution. In addition to the [JSON output fields](#json-output) available to all hooks, your hook script can return these event-specific fields: | Field | Description | | :--------------------- | :----------------------------------------------------------------------------------------- | | `decision` | `"block"` prompts Claude with the `reason`. Omit to allow the action to proceed | | `reason` | Explanation shown to Claude when `decision` is `"block"` | | `additionalContext` | Additional context for Claude to consider | | `updatedMCPToolOutput` | For [MCP tools](#match-mcp-tools) only: replaces the tool's output with the provided value | ```json theme={null} { "decision": "block", "reason": "Explanation for decision", "hookSpecificOutput": { "hookEventName": "PostToolUse", "additionalContext": "Additional information for Claude" } } ``` ### PostToolUseFailure Runs when a tool execution fails. This event fires for tool calls that throw errors or return failure results. Use this to log failures, send alerts, or provide corrective feedback to Claude. Matches on tool name, same values as PreToolUse. 
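As a sketch, a configuration along these lines could log every Bash failure for later review; the script path `log-failure.sh` is a hypothetical placeholder for your own handler:

```json theme={null}
{
  "hooks": {
    "PostToolUseFailure": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/log-failure.sh"
          }
        ]
      }
    ]
  }
}
```

Inside such a script, the `error` field described below tells you what went wrong.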
#### PostToolUseFailure input PostToolUseFailure hooks receive the same `tool_name` and `tool_input` fields as PostToolUse, along with error information as top-level fields: ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "PostToolUseFailure", "tool_name": "Bash", "tool_input": { "command": "npm test", "description": "Run test suite" }, "tool_use_id": "toolu_01ABC123...", "error": "Command exited with non-zero status code 1", "is_interrupt": false } ``` | Field | Description | | :------------- | :------------------------------------------------------------------------------ | | `error` | String describing what went wrong | | `is_interrupt` | Optional boolean indicating whether the failure was caused by user interruption | #### PostToolUseFailure decision control `PostToolUseFailure` hooks can provide context to Claude after a tool failure. In addition to the [JSON output fields](#json-output) available to all hooks, your hook script can return these event-specific fields: | Field | Description | | :------------------ | :------------------------------------------------------------ | | `additionalContext` | Additional context for Claude to consider alongside the error | ```json theme={null} { "hookSpecificOutput": { "hookEventName": "PostToolUseFailure", "additionalContext": "Additional information about the failure for Claude" } } ``` ### Notification Runs when Claude Code sends notifications. Matches on notification type: `permission_prompt`, `idle_prompt`, `auth_success`, `elicitation_dialog`. Omit the matcher to run hooks for all notification types. Use separate matchers to run different handlers depending on the notification type. This configuration triggers a permission-specific alert script when Claude needs permission approval and a different notification when Claude has been idle: ```json theme={null} { "hooks": { "Notification": [ { "matcher": "permission_prompt", "hooks": [ { "type": "command", "command": "/path/to/permission-alert.sh" } ] }, { "matcher": "idle_prompt", "hooks": [ { "type": "command", "command": "/path/to/idle-notification.sh" } ] } ] } } ``` #### Notification input In addition to the [common input fields](#common-input-fields), Notification hooks receive `message` with the notification text, an optional `title`, and `notification_type` indicating which type fired. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "Notification", "message": "Claude needs your permission to use Bash", "title": "Permission needed", "notification_type": "permission_prompt" } ``` Notification hooks cannot block or modify notifications. In addition to the [JSON output fields](#json-output) available to all hooks, you can return `additionalContext` to add context to the conversation: | Field | Description | | :------------------ | :------------------------------- | | `additionalContext` | String added to Claude's context | ### SubagentStart Runs when a Claude Code subagent is spawned via the Task tool. Supports matchers to filter by agent type name (built-in agents like `Bash`, `Explore`, `Plan`, or custom agent names from `.claude/agents/`). 
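For example, a configuration like the following would run a handler whenever an `Explore` subagent starts; the script path is a hypothetical placeholder for a script that prints the `additionalContext` output shown later in this section:

```json theme={null}
{
  "hooks": {
    "SubagentStart": [
      {
        "matcher": "Explore",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/explore-context.sh"
          }
        ]
      }
    ]
  }
}
```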
#### SubagentStart input In addition to the [common input fields](#common-input-fields), SubagentStart hooks receive `agent_id` with the unique identifier for the subagent and `agent_type` with the agent name (built-in agents like `"Bash"`, `"Explore"`, `"Plan"`, or custom agent names). ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "SubagentStart", "agent_id": "agent-abc123", "agent_type": "Explore" } ``` SubagentStart hooks cannot block subagent creation, but they can inject context into the subagent. In addition to the [JSON output fields](#json-output) available to all hooks, you can return: | Field | Description | | :------------------ | :------------------------------------- | | `additionalContext` | String added to the subagent's context | ```json theme={null} { "hookSpecificOutput": { "hookEventName": "SubagentStart", "additionalContext": "Follow security guidelines for this task" } } ``` ### SubagentStop Runs when a Claude Code subagent has finished responding. Matches on agent type, same values as SubagentStart. #### SubagentStop input In addition to the [common input fields](#common-input-fields), SubagentStop hooks receive `stop_hook_active`, `agent_id`, `agent_type`, and `agent_transcript_path`. The `agent_type` field is the value used for matcher filtering. The `transcript_path` is the main session's transcript, while `agent_transcript_path` is the subagent's own transcript stored in a nested `subagents/` folder. ```json theme={null} { "session_id": "abc123", "transcript_path": "~/.claude/projects/.../abc123.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "SubagentStop", "stop_hook_active": false, "agent_id": "def456", "agent_type": "Explore", "agent_transcript_path": "~/.claude/projects/.../abc123/subagents/agent-def456.jsonl" } ``` SubagentStop hooks use the same decision control format as [Stop hooks](#stop-decision-control). ### Stop Runs when the main Claude Code agent has finished responding. Does not run if the stoppage occurred due to a user interrupt. #### Stop input In addition to the [common input fields](#common-input-fields), Stop hooks receive `stop_hook_active`. This field is `true` when Claude Code is already continuing as a result of a stop hook. Check this value or process the transcript to prevent Claude Code from running indefinitely. ```json theme={null} { "session_id": "abc123", "transcript_path": "~/.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "Stop", "stop_hook_active": true } ``` #### Stop decision control `Stop` and `SubagentStop` hooks can control whether Claude continues. In addition to the [JSON output fields](#json-output) available to all hooks, your hook script can return these event-specific fields: | Field | Description | | :--------- | :------------------------------------------------------------------------- | | `decision` | `"block"` prevents Claude from stopping. Omit to allow Claude to stop | | `reason` | Required when `decision` is `"block"`. Tells Claude why it should continue | ```json theme={null} { "decision": "block", "reason": "Must be provided when Claude is blocked from stopping" } ``` ### PreCompact Runs before Claude Code is about to run a compact operation. 
The matcher value indicates whether compaction was triggered manually or automatically: | Matcher | When it fires | | :------- | :------------------------------------------- | | `manual` | `/compact` | | `auto` | Auto-compact when the context window is full | #### PreCompact input In addition to the [common input fields](#common-input-fields), PreCompact hooks receive `trigger` and `custom_instructions`. For `manual`, `custom_instructions` contains what the user passes into `/compact`. For `auto`, `custom_instructions` is empty. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "PreCompact", "trigger": "manual", "custom_instructions": "" } ``` ### SessionEnd Runs when a Claude Code session ends. Useful for cleanup tasks, logging session statistics, or saving session state. Supports matchers to filter by exit reason. The `reason` field in the hook input indicates why the session ended: | Reason | Description | | :---------------------------- | :----------------------------------------- | | `clear` | Session cleared with `/clear` command | | `logout` | User logged out | | `prompt_input_exit` | User exited while prompt input was visible | | `bypass_permissions_disabled` | Bypass permissions mode was disabled | | `other` | Other exit reasons | #### SessionEnd input In addition to the [common input fields](#common-input-fields), SessionEnd hooks receive a `reason` field indicating why the session ended. See the [reason table](#sessionend) above for all values. ```json theme={null} { "session_id": "abc123", "transcript_path": "/Users/.../.claude/projects/.../00893aaf-19fa-41d2-8238-13269b9b3ca0.jsonl", "cwd": "/Users/...", "permission_mode": "default", "hook_event_name": "SessionEnd", "reason": "other" } ``` SessionEnd hooks have no decision control. They cannot block session termination but can perform cleanup tasks. ## Prompt-based hooks In addition to Bash command hooks (`type: "command"`), Claude Code supports prompt-based hooks (`type: "prompt"`) that use an LLM to evaluate whether to allow or block an action. Prompt-based hooks work with the following events: `PreToolUse`, `PostToolUse`, `PostToolUseFailure`, `PermissionRequest`, `UserPromptSubmit`, `Stop`, and `SubagentStop`. ### How prompt-based hooks work Instead of executing a Bash command, prompt-based hooks: 1. Send the hook input and your prompt to a Claude model, Haiku by default 2. The LLM responds with structured JSON containing a decision 3. Claude Code processes the decision automatically ### Prompt hook configuration Set `type` to `"prompt"` and provide a `prompt` string instead of a `command`. Use the `$ARGUMENTS` placeholder to inject the hook's JSON input data into your prompt text. Claude Code sends the combined prompt and input to a fast Claude model, which returns a JSON decision. This `Stop` hook asks the LLM to evaluate whether all tasks are complete before allowing Claude to finish: ```json theme={null} { "hooks": { "Stop": [ { "hooks": [ { "type": "prompt", "prompt": "Evaluate if Claude should stop: $ARGUMENTS. Check if all tasks are complete." 
} ] } ] } } ``` | Field | Required | Description | | :-------- | :------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `type` | yes | Must be `"prompt"` | | `prompt` | yes | The prompt text to send to the LLM. Use `$ARGUMENTS` as a placeholder for the hook input JSON. If `$ARGUMENTS` is not present, input JSON is appended to the prompt | | `model` | no | Model to use for evaluation. Defaults to a fast model | | `timeout` | no | Timeout in seconds. Default: 30 | ### Response schema The LLM must respond with JSON containing: ```json theme={null} { "ok": true | false, "reason": "Explanation for the decision" } ``` | Field | Description | | :------- | :--------------------------------------------------------- | | `ok` | `true` allows the action, `false` prevents it | | `reason` | Required when `ok` is `false`. Explanation shown to Claude | ### Example: Multi-criteria Stop hook This `Stop` hook uses a detailed prompt to check three conditions before allowing Claude to stop. If `"ok"` is `false`, Claude continues working with the provided reason as its next instruction. `SubagentStop` hooks use the same format to evaluate whether a [subagent](/en/sub-agents) should stop: ```json theme={null} { "hooks": { "Stop": [ { "hooks": [ { "type": "prompt", "prompt": "You are evaluating whether Claude should stop working. Context: $ARGUMENTS\n\nAnalyze the conversation and determine if:\n1. All user-requested tasks are complete\n2. Any errors need to be addressed\n3. Follow-up work is needed\n\nRespond with JSON: {\"ok\": true} to allow stopping, or {\"ok\": false, \"reason\": \"your explanation\"} to continue working.", "timeout": 30 } ] } ] } } ``` ## Agent-based hooks Agent-based hooks (`type: "agent"`) are like prompt-based hooks but with multi-turn tool access. Instead of a single LLM call, an agent hook spawns a subagent that can read files, search code, and inspect the codebase to verify conditions. Agent hooks support the same events as prompt-based hooks. ### How agent hooks work When an agent hook fires: 1. Claude Code spawns a subagent with your prompt and the hook's JSON input 2. The subagent can use tools like Read, Grep, and Glob to investigate 3. After up to 50 turns, the subagent returns a structured `{ "ok": true/false }` decision 4. Claude Code processes the decision the same way as a prompt hook Agent hooks are useful when verification requires inspecting actual files or test output, not just evaluating the hook input data alone. ### Agent hook configuration Set `type` to `"agent"` and provide a `prompt` string. The configuration fields are the same as [prompt hooks](#prompt-hook-configuration), with a longer default timeout: | Field | Required | Description | | :-------- | :------- | :------------------------------------------------------------------------------------------ | | `type` | yes | Must be `"agent"` | | `prompt` | yes | Prompt describing what to verify. Use `$ARGUMENTS` as a placeholder for the hook input JSON | | `model` | no | Model to use. Defaults to a fast model | | `timeout` | no | Timeout in seconds. Default: 60 | The response schema is the same as prompt hooks: `{ "ok": true }` to allow or `{ "ok": false, "reason": "..." }` to block. This `Stop` hook verifies that all unit tests pass before allowing Claude to finish: ```json theme={null} { "hooks": { "Stop": [ { "hooks": [ { "type": "agent", "prompt": "Verify that all unit tests pass. 
Run the test suite and check the results. $ARGUMENTS", "timeout": 120 } ] } ] } } ``` ## Run hooks in the background By default, hooks block Claude's execution until they complete. For long-running tasks like deployments, test suites, or external API calls, set `"async": true` to run the hook in the background while Claude continues working. Async hooks cannot block or control Claude's behavior: response fields like `decision`, `permissionDecision`, and `continue` have no effect, because the action they would have controlled has already completed. ### Configure an async hook Add `"async": true` to a command hook's configuration to run it in the background without blocking Claude. This field is only available on `type: "command"` hooks. This hook runs a test script after every `Write` tool call. Claude continues working immediately while `run-tests.sh` executes for up to 120 seconds. When the script finishes, its output is delivered on the next conversation turn: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Write", "hooks": [ { "type": "command", "command": "/path/to/run-tests.sh", "async": true, "timeout": 120 } ] } ] } } ``` The `timeout` field sets the maximum time in seconds for the background process. If not specified, async hooks use the same 10-minute default as sync hooks. ### How async hooks execute When an async hook fires, Claude Code starts the hook process and immediately continues without waiting for it to finish. The hook receives the same JSON input via stdin as a synchronous hook. After the background process exits, if the hook produced a JSON response with a `systemMessage` or `additionalContext` field, that content is delivered to Claude as context on the next conversation turn. ### Example: run tests after file changes This hook starts a test suite in the background whenever Claude writes a file, then reports the results back to Claude when the tests finish. Save this script to `.claude/hooks/run-tests-async.sh` in your project and make it executable with `chmod +x`: ```bash theme={null} #!/bin/bash # run-tests-async.sh # Read hook input from stdin INPUT=$(cat) FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty') # Only run tests for source files if [[ "$FILE_PATH" != *.ts && "$FILE_PATH" != *.js ]]; then exit 0 fi # Run tests and report results via systemMessage RESULT=$(npm test 2>&1) EXIT_CODE=$? if [ $EXIT_CODE -eq 0 ]; then echo "{\"systemMessage\": \"Tests passed after editing $FILE_PATH\"}" else echo "{\"systemMessage\": \"Tests failed after editing $FILE_PATH: $RESULT\"}" fi ``` Then add this configuration to `.claude/settings.json` in your project root. The `async: true` flag lets Claude keep working while tests run: ```json theme={null} { "hooks": { "PostToolUse": [ { "matcher": "Write|Edit", "hooks": [ { "type": "command", "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/run-tests-async.sh", "async": true, "timeout": 300 } ] } ] } } ``` ### Limitations Async hooks have several constraints compared to synchronous hooks: - Only `type: "command"` hooks support `async`. Prompt-based hooks cannot run asynchronously. - Async hooks cannot block tool calls or return decisions. By the time the hook completes, the triggering action has already proceeded. - Hook output is delivered on the next conversation turn. If the session is idle, the response waits until the next user interaction. - Each execution creates a separate background process. There is no deduplication across multiple firings of the same async hook. 
## Security considerations ### Disclaimer Hooks run with your system user's full permissions. <Warning> Hooks execute shell commands with your full user permissions. They can modify, delete, or access any files your user account can access. Review and test all hook commands before adding them to your configuration. </Warning> ### Security best practices Keep these practices in mind when writing hooks: - **Validate and sanitize inputs**: never trust input data blindly - **Always quote shell variables**: use `"$VAR"` not `$VAR` - **Block path traversal**: check for `..` in file paths - **Use absolute paths**: specify full paths for scripts, using `"$CLAUDE_PROJECT_DIR"` for the project root - **Skip sensitive files**: avoid `.env`, `.git/`, keys, etc. ## Debug hooks Run `claude --debug` to see hook execution details, including which hooks matched, their exit codes, and output. Toggle verbose mode with `Ctrl+O` to see hook progress in the transcript. ``` [DEBUG] Executing hooks for PostToolUse:Write [DEBUG] Getting matching hook commands for PostToolUse with query: Write [DEBUG] Found 1 hook matchers in settings [DEBUG] Matched 1 hooks for query "Write" [DEBUG] Found 1 hook commands to execute [DEBUG] Executing hook command: <Your command> with timeout 600000ms [DEBUG] Hook command completed with status 0: <Your stdout> ``` For troubleshooting common issues like hooks not firing, infinite Stop hook loops, or configuration errors, see [Limitations and troubleshooting](/en/hooks-guide#limitations-and-troubleshooting) in the guide.plugins/customaize-agent/skills/apply-anthropic-skill-best-practices/SKILL.mdskillShow content (42366 bytes)
--- name: apply-anthropic-skill-best-practices description: Comprehensive guide for skill development based on Anthropic's official best practices - use for complex skills requiring detailed structure argument-hint: Optional skill name or path to skill being reviewed --- # Anthropic's official skill authoring best practices Apply Anthropic's official skill authoring best practices to your skill. Good Skills are concise, well-structured, and tested with real usage. This guide provides practical authoring decisions to help you write Skills that Claude can discover and use effectively. ## Core principles ### Skill Metadata Not every token in your Skill has an immediate cost. At startup, only the metadata (name and description) from all Skills is pre-loaded. Claude reads SKILL.md only when the Skill becomes relevant, and reads additional files only as needed. However, being concise in SKILL.md still matters: once Claude loads it, every token competes with conversation history and other context. ### Test with all models you plan to use Skills act as additions to models, so effectiveness depends on the underlying model. Test your Skill with all the models you plan to use it with. **Testing considerations by model**: - **Claude Haiku** (fast, economical): Does the Skill provide enough guidance? - **Claude Sonnet** (balanced): Is the Skill clear and efficient? - **Claude Opus** (powerful reasoning): Does the Skill avoid over-explaining? What works perfectly for Opus might need more detail for Haiku. If you plan to use your Skill across multiple models, aim for instructions that work well with all of them. ## Skill structure <Note> **YAML Frontmatter**: The SKILL.md frontmatter supports two fields: - `name` - Human-readable name of the Skill (64 characters maximum) - `description` - One-line description of what the Skill does and when to use it (1024 characters maximum) For complete Skill structure details, see the [Skills overview](docs.claude.com/en/docs/agents-and-tools/agent-skills/overview#skill-structure). </Note> ### Naming conventions Use consistent naming patterns to make Skills easier to reference and discuss. We recommend using **gerund form** (verb + -ing) for Skill names, as this clearly describes the activity or capability the Skill provides. **Good naming examples (gerund form)**: - "Processing PDFs" - "Analyzing spreadsheets" - "Managing databases" - "Testing code" - "Writing documentation" **Acceptable alternatives**: - Noun phrases: "PDF Processing", "Spreadsheet Analysis" - Action-oriented: "Process PDFs", "Analyze Spreadsheets" **Avoid**: - Vague names: "Helper", "Utils", "Tools" - Overly generic: "Documents", "Data", "Files" - Inconsistent patterns within your skill collection Consistent naming makes it easier to: - Reference Skills in documentation and conversations - Understand what a Skill does at a glance - Organize and search through multiple Skills - Maintain a professional, cohesive skill library ### Writing effective descriptions The `description` field enables Skill discovery and should include both what the Skill does and when to use it. <Warning> **Always write in third person**. The description is injected into the system prompt, and inconsistent point-of-view can cause discovery problems. - **Good:** "Processes Excel files and generates reports" - **Avoid:** "I can help you process Excel files" - **Avoid:** "You can use this to process Excel files" </Warning> **Be specific and include key terms**. 
Include both what the Skill does and specific triggers/contexts for when to use it. Each Skill has exactly one description field. The description is critical for skill selection: Claude uses it to choose the right Skill from potentially 100+ available Skills. Your description must provide enough detail for Claude to know when to select this Skill, while the rest of SKILL.md provides the implementation details.

Effective examples:

**PDF Processing skill:**

```yaml
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
```

**Excel Analysis skill:**

```yaml
description: Analyze Excel spreadsheets, create pivot tables, generate charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files.
```

**Git Commit Helper skill:**

```yaml
description: Generate descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes.
```

Avoid vague descriptions like these:

```yaml
description: Helps with documents
```

```yaml
description: Processes data
```

```yaml
description: Does stuff with files
```

### Progressive disclosure patterns

SKILL.md serves as an overview that points Claude to detailed materials as needed, like a table of contents in an onboarding guide. For an explanation of how progressive disclosure works, see [How Skills work](docs.claude.com/en/docs/agents-and-tools/agent-skills/overview#how-skills-work) in the overview.

**Practical guidance:**

- Keep SKILL.md body under 500 lines for optimal performance
- Split content into separate files when approaching this limit
- Use the patterns below to organize instructions, code, and resources effectively

#### Visual overview: From simple to complex

A basic Skill starts with just a SKILL.md file containing metadata and instructions:

[Image: Simple SKILL.md file showing YAML frontmatter and markdown body]

As your Skill grows, you can bundle additional content that Claude loads only when needed:

[Image: Bundling additional reference files like reference.md and forms.md]

The complete Skill directory structure might look like this:

```
pdf/
├── SKILL.md          # Main instructions (loaded when triggered)
├── FORMS.md          # Form-filling guide (loaded as needed)
├── reference.md      # API reference (loaded as needed)
├── examples.md       # Usage examples (loaded as needed)
└── scripts/
    ├── analyze_form.py   # Utility script (executed, not loaded)
    ├── fill_form.py      # Form filling script
    └── validate.py       # Validation script
```

#### Pattern 1: High-level guide with references

````markdown
---
name: PDF Processing
description: Extracts text and tables from PDF files, fills forms, and merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Extract text with pdfplumber:

```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

## Advanced features

**Form filling**: See [FORMS.md](FORMS.md) for complete guide
**API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
**Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns
````

Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.

#### Pattern 2: Domain-specific organization

For Skills with multiple domains, organize content by domain to avoid loading irrelevant context. When a user asks about sales metrics, Claude only needs to read sales-related schemas, not finance or marketing data. This keeps token usage low and context focused.
``` bigquery-skill/ ├── SKILL.md (overview and navigation) └── reference/ ├── finance.md (revenue, billing metrics) ├── sales.md (opportunities, pipeline) ├── product.md (API usage, features) └── marketing.md (campaigns, attribution) ``` ````markdown SKILL.md theme={null} # BigQuery Data Analysis ## Available datasets **Finance**: Revenue, ARR, billing → See [reference/finance.md](reference/finance.md) **Sales**: Opportunities, pipeline, accounts → See [reference/sales.md](reference/sales.md) **Product**: API usage, features, adoption → See [reference/product.md](reference/product.md) **Marketing**: Campaigns, attribution, email → See [reference/marketing.md](reference/marketing.md) ## Quick search Find specific metrics using grep: ```bash grep -i "revenue" reference/finance.md grep -i "pipeline" reference/sales.md grep -i "api usage" reference/product.md ``` ```` #### Pattern 3: Conditional details Show basic content, link to advanced content: ```markdown theme={null} # DOCX Processing ## Creating documents Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md). ## Editing documents For simple edits, modify the XML directly. **For tracked changes**: See [REDLINING.md](REDLINING.md) **For OOXML details**: See [OOXML.md](OOXML.md) ``` Claude reads REDLINING.md or OOXML.md only when the user needs those features. ### Avoid deeply nested references Claude may partially read files when they're referenced from other referenced files. When encountering nested references, Claude might use commands like `head -100` to preview content rather than reading entire files, resulting in incomplete information. **Keep references one level deep from SKILL.md**. All reference files should link directly from SKILL.md to ensure Claude reads complete files when needed. **Bad example: Too deep**: ```markdown theme={null} # SKILL.md See [advanced.md](advanced.md)... # advanced.md See [details.md](details.md)... # details.md Here's the actual information... ``` **Good example: One level deep**: ```markdown theme={null} # SKILL.md **Basic usage**: [instructions in SKILL.md] **Advanced features**: See [advanced.md](advanced.md) **API reference**: See [reference.md](reference.md) **Examples**: See [examples.md](examples.md) ``` ### Structure longer reference files with table of contents For reference files longer than 100 lines, include a table of contents at the top. This ensures Claude can see the full scope of available information even when previewing with partial reads. **Example**: ```markdown theme={null} # API Reference ## Contents - Authentication and setup - Core methods (create, read, update, delete) - Advanced features (batch operations, webhooks) - Error handling patterns - Code examples ## Authentication and setup ... ## Core methods ... ``` Claude can then read the complete file or jump to specific sections as needed. For details on how this filesystem-based architecture enables progressive disclosure, see the [Runtime environment](#runtime-environment) section in the Advanced section below. ## Workflows and feedback loops ### Use workflows for complex tasks Break complex operations into clear, sequential steps. For particularly complex workflows, provide a checklist that Claude can copy into its response and check off as it progresses. 
**Example 1: Research synthesis workflow** (for Skills without code): ````markdown theme={null} ## Research synthesis workflow Copy this checklist and track your progress: ``` Research Progress: - [ ] Step 1: Read all source documents - [ ] Step 2: Identify key themes - [ ] Step 3: Cross-reference claims - [ ] Step 4: Create structured summary - [ ] Step 5: Verify citations ``` **Step 1: Read all source documents** Review each document in the `sources/` directory. Note the main arguments and supporting evidence. **Step 2: Identify key themes** Look for patterns across sources. What themes appear repeatedly? Where do sources agree or disagree? **Step 3: Cross-reference claims** For each major claim, verify it appears in the source material. Note which source supports each point. **Step 4: Create structured summary** Organize findings by theme. Include: - Main claim - Supporting evidence from sources - Conflicting viewpoints (if any) **Step 5: Verify citations** Check that every claim references the correct source document. If citations are incomplete, return to Step 3. ```` This example shows how workflows apply to analysis tasks that don't require code. The checklist pattern works for any complex, multi-step process. **Example 2: PDF form filling workflow** (for Skills with code): ````markdown theme={null} ## PDF form filling workflow Copy this checklist and check off items as you complete them: ``` Task Progress: - [ ] Step 1: Analyze the form (run analyze_form.py) - [ ] Step 2: Create field mapping (edit fields.json) - [ ] Step 3: Validate mapping (run validate_fields.py) - [ ] Step 4: Fill the form (run fill_form.py) - [ ] Step 5: Verify output (run verify_output.py) ``` **Step 1: Analyze the form** Run: `python scripts/analyze_form.py input.pdf` This extracts form fields and their locations, saving to `fields.json`. **Step 2: Create field mapping** Edit `fields.json` to add values for each field. **Step 3: Validate mapping** Run: `python scripts/validate_fields.py fields.json` Fix any validation errors before continuing. **Step 4: Fill the form** Run: `python scripts/fill_form.py input.pdf fields.json output.pdf` **Step 5: Verify output** Run: `python scripts/verify_output.py output.pdf` If verification fails, return to Step 2. ```` Clear steps prevent Claude from skipping critical validation. The checklist helps both Claude and you track progress through multi-step workflows. ### Implement feedback loops **Common pattern**: Run validator → fix errors → repeat This pattern greatly improves output quality. **Example 1: Style guide compliance** (for Skills without code): ```markdown theme={null} ## Content review process 1. Draft your content following the guidelines in STYLE_GUIDE.md 2. Review against the checklist: - Check terminology consistency - Verify examples follow the standard format - Confirm all required sections are present 3. If issues found: - Note each issue with specific section reference - Revise the content - Review the checklist again 4. Only proceed when all requirements are met 5. Finalize and save the document ``` This shows the validation loop pattern using reference documents instead of scripts. The "validator" is STYLE\_GUIDE.md, and Claude performs the check by reading and comparing. **Example 2: Document editing process** (for Skills with code): ```markdown theme={null} ## Document editing process 1. Make your edits to `word/document.xml` 2. **Validate immediately**: `python ooxml/scripts/validate.py unpacked_dir/` 3. 
If validation fails: - Review the error message carefully - Fix the issues in the XML - Run validation again 4. **Only proceed when validation passes** 5. Rebuild: `python ooxml/scripts/pack.py unpacked_dir/ output.docx` 6. Test the output document ``` The validation loop catches errors early. ## Content guidelines ### Avoid time-sensitive information Don't include information that will become outdated: **Bad example: Time-sensitive** (will become wrong): ```markdown theme={null} If you're doing this before August 2025, use the old API. After August 2025, use the new API. ``` **Good example** (use "old patterns" section): ```markdown theme={null} ## Current method Use the v2 API endpoint: `api.example.com/v2/messages` ## Old patterns <details> <summary>Legacy v1 API (deprecated 2025-08)</summary> The v1 API used: `api.example.com/v1/messages` This endpoint is no longer supported. </details> ``` The old patterns section provides historical context without cluttering the main content. ### Use consistent terminology Choose one term and use it throughout the Skill: **Good - Consistent**: - Always "API endpoint" - Always "field" - Always "extract" **Bad - Inconsistent**: - Mix "API endpoint", "URL", "API route", "path" - Mix "field", "box", "element", "control" - Mix "extract", "pull", "get", "retrieve" Consistency helps Claude understand and follow instructions. ## Common patterns ### Template pattern Provide templates for output format. Match the level of strictness to your needs. **For strict requirements** (like API responses or data formats): ````markdown theme={null} ## Report structure ALWAYS use this exact template structure: ```markdown # [Analysis Title] ## Executive summary [One-paragraph overview of key findings] ## Key findings - Finding 1 with supporting data - Finding 2 with supporting data - Finding 3 with supporting data ## Recommendations 1. Specific actionable recommendation 2. Specific actionable recommendation ``` ```` **For flexible guidance** (when adaptation is useful): ````markdown theme={null} ## Report structure Here is a sensible default format, but use your best judgment based on the analysis: ```markdown # [Analysis Title] ## Executive summary [Overview] ## Key findings [Adapt sections based on what you discover] ## Recommendations [Tailor to the specific context] ``` Adjust sections as needed for the specific analysis type. ```` ### Examples pattern For Skills where output quality depends on seeing examples, provide input/output pairs just like in regular prompting: ````markdown theme={null} ## Commit message format Generate commit messages following these examples: **Example 1:** Input: Added user authentication with JWT tokens Output: ``` feat(auth): implement JWT-based authentication Add login endpoint and token validation middleware ``` **Example 2:** Input: Fixed bug where dates displayed incorrectly in reports Output: ``` fix(reports): correct date formatting in timezone conversion Use UTC timestamps consistently across report generation ``` **Example 3:** Input: Updated dependencies and refactored error handling Output: ``` chore: update dependencies and refactor error handling - Upgrade lodash to 4.17.21 - Standardize error response format across endpoints ``` Follow this style: type(scope): brief description, then detailed explanation. ```` Examples help Claude understand the desired style and level of detail more clearly than descriptions alone. 
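One way to act on the consistent-terminology guidance above is a quick grep pass over the Skill before sharing it. The script below is a hypothetical helper, not part of any official tooling; the discouraged variants and preferred terms are examples you would replace with your own.

```bash
#!/bin/bash
# check-terminology.sh (hypothetical helper): flag competing terms so a
# single preferred term can be used throughout the Skill's markdown files.
SKILL_DIR="${1:-.}"

check() {
  # First argument is the preferred term; the rest are discouraged variants.
  local preferred="$1"; shift
  for variant in "$@"; do
    if grep -rin --include='*.md' "$variant" "$SKILL_DIR" > /dev/null; then
      echo "Found '$variant' - consider using '$preferred' instead:"
      grep -rin --include='*.md' "$variant" "$SKILL_DIR" | head -5
    fi
  done
}

check "API endpoint" "API route" "URL path"
check "field" "form box" "form control"
check "extract" "pull out" "retrieve"
```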
### Conditional workflow pattern Guide Claude through decision points: ```markdown theme={null} ## Document modification workflow 1. Determine the modification type: **Creating new content?** → Follow "Creation workflow" below **Editing existing content?** → Follow "Editing workflow" below 2. Creation workflow: - Use docx-js library - Build document from scratch - Export to .docx format 3. Editing workflow: - Unpack existing document - Modify XML directly - Validate after each change - Repack when complete ``` <Tip> If workflows become large or complicated with many steps, consider pushing them into separate files and tell Claude to read the appropriate file based on the task at hand. </Tip> ## Evaluation and iteration ### Build evaluations first **Create evaluations BEFORE writing extensive documentation.** This ensures your Skill solves real problems rather than documenting imagined ones. **Evaluation-driven development:** 1. **Identify gaps**: Run Claude on representative tasks without a Skill. Document specific failures or missing context 2. **Create evaluations**: Build three scenarios that test these gaps 3. **Establish baseline**: Measure Claude's performance without the Skill 4. **Write minimal instructions**: Create just enough content to address the gaps and pass evaluations 5. **Iterate**: Execute evaluations, compare against baseline, and refine This approach ensures you're solving actual problems rather than anticipating requirements that may never materialize. **Evaluation structure**: ```json theme={null} { "skills": ["pdf-processing"], "query": "Extract all text from this PDF file and save it to output.txt", "files": ["test-files/document.pdf"], "expected_behavior": [ "Successfully reads the PDF file using an appropriate PDF processing library or command-line tool", "Extracts text content from all pages in the document without missing any pages", "Saves the extracted text to a file named output.txt in a clear, readable format" ] } ``` <Note> This example demonstrates a data-driven evaluation with a simple testing rubric. We do not currently provide a built-in way to run these evaluations. Users can create their own evaluation system. Evaluations are your source of truth for measuring Skill effectiveness. </Note> ### Develop Skills iteratively with Claude The most effective Skill development process involves Claude itself. Work with one instance of Claude ("Claude A") to create a Skill that will be used by other instances ("Claude B"). Claude A helps you design and refine instructions, while Claude B tests them in real tasks. This works because Claude models understand both how to write effective agent instructions and what information agents need. **Creating a new Skill:** 1. **Complete a task without a Skill**: Work through a problem with Claude A using normal prompting. As you work, you'll naturally provide context, explain preferences, and share procedural knowledge. Notice what information you repeatedly provide. 2. **Identify the reusable pattern**: After completing the task, identify what context you provided that would be useful for similar future tasks. **Example**: If you worked through a BigQuery analysis, you might have provided table names, field definitions, filtering rules (like "always exclude test accounts"), and common query patterns. 3. **Ask Claude A to create a Skill**: "Create a Skill that captures this BigQuery analysis pattern we just used. Include the table schemas, naming conventions, and the rule about filtering test accounts." 
<Tip> Claude models understand the Skill format and structure natively. You don't need special system prompts or a "writing skills" skill to get Claude to help create Skills. Simply ask Claude to create a Skill and it will generate properly structured SKILL.md content with appropriate frontmatter and body content. </Tip> 4. **Review for conciseness**: Check that Claude A hasn't added unnecessary explanations. Ask: "Remove the explanation about what win rate means - Claude already knows that." 5. **Improve information architecture**: Ask Claude A to organize the content more effectively. For example: "Organize this so the table schema is in a separate reference file. We might add more tables later." 6. **Test on similar tasks**: Use the Skill with Claude B (a fresh instance with the Skill loaded) on related use cases. Observe whether Claude B finds the right information, applies rules correctly, and handles the task successfully. 7. **Iterate based on observation**: If Claude B struggles or misses something, return to Claude A with specifics: "When Claude used this Skill, it forgot to filter by date for Q4. Should we add a section about date filtering patterns?" **Iterating on existing Skills:** The same hierarchical pattern continues when improving Skills. You alternate between: - **Working with Claude A** (the expert who helps refine the Skill) - **Testing with Claude B** (the agent using the Skill to perform real work) - **Observing Claude B's behavior** and bringing insights back to Claude A 1. **Use the Skill in real workflows**: Give Claude B (with the Skill loaded) actual tasks, not test scenarios 2. **Observe Claude B's behavior**: Note where it struggles, succeeds, or makes unexpected choices **Example observation**: "When I asked Claude B for a regional sales report, it wrote the query but forgot to filter out test accounts, even though the Skill mentions this rule." 3. **Return to Claude A for improvements**: Share the current SKILL.md and describe what you observed. Ask: "I noticed Claude B forgot to filter test accounts when I asked for a regional report. The Skill mentions filtering, but maybe it's not prominent enough?" 4. **Review Claude A's suggestions**: Claude A might suggest reorganizing to make rules more prominent, using stronger language like "MUST filter" instead of "always filter", or restructuring the workflow section. 5. **Apply and test changes**: Update the Skill with Claude A's refinements, then test again with Claude B on similar requests 6. **Repeat based on usage**: Continue this observe-refine-test cycle as you encounter new scenarios. Each iteration improves the Skill based on real agent behavior, not assumptions. **Gathering team feedback:** 1. Share Skills with teammates and observe their usage 2. Ask: Does the Skill activate when expected? Are instructions clear? What's missing? 3. Incorporate feedback to address blind spots in your own usage patterns **Why this approach works**: Claude A understands agent needs, you provide domain expertise, Claude B reveals gaps through real usage, and iterative refinement improves Skills based on observed behavior rather than assumptions. ### Observe how Claude navigates Skills As you iterate on Skills, pay attention to how Claude actually uses them in practice. Watch for: - **Unexpected exploration paths**: Does Claude read files in an order you didn't anticipate? This might indicate your structure isn't as intuitive as you thought - **Missed connections**: Does Claude fail to follow references to important files? 
Your links might need to be more explicit or prominent - **Overreliance on certain sections**: If Claude repeatedly reads the same file, consider whether that content should be in the main SKILL.md instead - **Ignored content**: If Claude never accesses a bundled file, it might be unnecessary or poorly signaled in the main instructions Iterate based on these observations rather than assumptions. The 'name' and 'description' in your Skill's metadata are particularly critical. Claude uses these when deciding whether to trigger the Skill in response to the current task. Make sure they clearly describe what the Skill does and when it should be used. ## Anti-patterns to avoid ### Avoid Windows-style paths Always use forward slashes in file paths, even on Windows: - ✓ **Good**: `scripts/helper.py`, `reference/guide.md` - ✗ **Avoid**: `scripts\helper.py`, `reference\guide.md` Unix-style paths work across all platforms, while Windows-style paths cause errors on Unix systems. ### Avoid offering too many options Don't present multiple approaches unless necessary: ````markdown theme={null} **Bad example: Too many choices** (confusing): "You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image, or..." **Good example: Provide a default** (with escape hatch): "Use pdfplumber for text extraction: ```python import pdfplumber ``` For scanned PDFs requiring OCR, use pdf2image with pytesseract instead." ```` ## Advanced: Skills with executable code The sections below focus on Skills that include executable scripts. If your Skill uses only markdown instructions, skip to [Checklist for effective Skills](#checklist-for-effective-skills). ### Solve, don't punt When writing scripts for Skills, handle error conditions rather than punting to Claude. **Good example: Handle errors explicitly**: ```python theme={null} def process_file(path): """Process a file, creating it if it doesn't exist.""" try: with open(path) as f: return f.read() except FileNotFoundError: # Create file with default content instead of failing print(f"File {path} not found, creating default") with open(path, 'w') as f: f.write('') return '' except PermissionError: # Provide alternative instead of failing print(f"Cannot access {path}, using default") return '' ``` **Bad example: Punt to Claude**: ```python theme={null} def process_file(path): # Just fail and let Claude figure it out return open(path).read() ``` Configuration parameters should also be justified and documented to avoid "voodoo constants" (Ousterhout's law). If you don't know the right value, how will Claude determine it? **Good example: Self-documenting**: ```python theme={null} # HTTP requests typically complete within 30 seconds # Longer timeout accounts for slow connections REQUEST_TIMEOUT = 30 # Three retries balances reliability vs speed # Most intermittent failures resolve by the second retry MAX_RETRIES = 3 ``` **Bad example: Magic numbers**: ```python theme={null} TIMEOUT = 47 # Why 47? RETRIES = 5 # Why 5? 
```

### Provide utility scripts

Even if Claude could write a script, pre-made scripts offer advantages:

**Benefits of utility scripts**:

- More reliable than generated code
- Save tokens (no need to include code in context)
- Save time (no code generation required)
- Ensure consistency across uses

[Image: Bundling executable scripts alongside instruction files]

The diagram above shows how executable scripts work alongside instruction files. The instruction file (forms.md) references the script, and Claude can execute it without loading its contents into context.

**Important distinction**: Make clear in your instructions whether Claude should:

- **Execute the script** (most common): "Run `analyze_form.py` to extract fields"
- **Read it as reference** (for complex logic): "See `analyze_form.py` for the field extraction algorithm"

For most utility scripts, execution is preferred because it's more reliable and efficient. See the [Runtime environment](#runtime-environment) section below for details on how script execution works.

**Example**:

````markdown
## Utility scripts

**analyze_form.py**: Extract all form fields from PDF

```bash
python scripts/analyze_form.py input.pdf > fields.json
```

Output format:

```json
{
  "field_name": {"type": "text", "x": 100, "y": 200},
  "signature": {"type": "sig", "x": 150, "y": 500}
}
```

**validate_boxes.py**: Check for overlapping bounding boxes

```bash
python scripts/validate_boxes.py fields.json
# Returns: "OK" or lists conflicts
```

**fill_form.py**: Apply field values to PDF

```bash
python scripts/fill_form.py input.pdf fields.json output.pdf
```
````

### Use visual analysis

When inputs can be rendered as images, have Claude analyze them:

````markdown
## Form layout analysis

1. Convert PDF to images:
   ```bash
   python scripts/pdf_to_images.py form.pdf
   ```
2. Analyze each page image to identify form fields
3. Claude can see field locations and types visually
````

<Note>
In this example, you'd need to write the `pdf_to_images.py` script.
</Note> Claude's vision capabilities help understand layouts and structures. ### Create verifiable intermediate outputs When Claude performs complex, open-ended tasks, it can make mistakes. The "plan-validate-execute" pattern catches errors early by having Claude first create a plan in a structured format, then validate that plan with a script before executing it. **Example**: Imagine asking Claude to update 50 form fields in a PDF based on a spreadsheet. Without validation, Claude might reference non-existent fields, create conflicting values, miss required fields, or apply updates incorrectly. **Solution**: Use the workflow pattern shown above (PDF form filling), but add an intermediate `changes.json` file that gets validated before applying changes. The workflow becomes: analyze → **create plan file** → **validate plan** → execute → verify. **Why this pattern works:** - **Catches errors early**: Validation finds problems before changes are applied - **Machine-verifiable**: Scripts provide objective verification - **Reversible planning**: Claude can iterate on the plan without touching originals - **Clear debugging**: Error messages point to specific problems **When to use**: Batch operations, destructive changes, complex validation rules, high-stakes operations. **Implementation tip**: Make validation scripts verbose with specific error messages like "Field 'signature\_date' not found. Available fields: customer\_name, order\_total, signature\_date\_signed" to help Claude fix issues. ### Package dependencies Skills run in the code execution environment with platform-specific limitations: - **claude.ai**: Can install packages from npm and PyPI and pull from GitHub repositories - **Anthropic API**: Has no network access and no runtime package installation List required packages in your SKILL.md and verify they're available in the [code execution tool documentation](docs.claude.com/en/docs/agents-and-tools/tool-use/code-execution-tool). ### Runtime environment Skills run in a code execution environment with filesystem access, bash commands, and code execution capabilities. For the conceptual explanation of this architecture, see [The Skills architecture](docs.claude.com/en/docs/agents-and-tools/agent-skills/overview#the-skills-architecture) in the overview. **How this affects your authoring:** **How Claude accesses Skills:** 1. **Metadata pre-loaded**: At startup, the name and description from all Skills' YAML frontmatter are loaded into the system prompt 2. **Files read on-demand**: Claude uses bash Read tools to access SKILL.md and other files from the filesystem when needed 3. **Scripts executed efficiently**: Utility scripts can be executed via bash without loading their full contents into context. Only the script's output consumes tokens 4. **No context penalty for large files**: Reference files, data, or documentation don't consume context tokens until actually read - **File paths matter**: Claude navigates your skill directory like a filesystem. 
Use forward slashes (`reference/guide.md`), not backslashes - **Name files descriptively**: Use names that indicate content: `form_validation_rules.md`, not `doc2.md` - **Organize for discovery**: Structure directories by domain or feature - Good: `reference/finance.md`, `reference/sales.md` - Bad: `docs/file1.md`, `docs/file2.md` - **Bundle comprehensive resources**: Include complete API docs, extensive examples, large datasets; no context penalty until accessed - **Prefer scripts for deterministic operations**: Write `validate_form.py` rather than asking Claude to generate validation code - **Make execution intent clear**: - "Run `analyze_form.py` to extract fields" (execute) - "See `analyze_form.py` for the extraction algorithm" (read as reference) - **Test file access patterns**: Verify Claude can navigate your directory structure by testing with real requests **Example:** ``` bigquery-skill/ ├── SKILL.md (overview, points to reference files) └── reference/ ├── finance.md (revenue metrics) ├── sales.md (pipeline data) └── product.md (usage analytics) ``` When the user asks about revenue, Claude reads SKILL.md, sees the reference to `reference/finance.md`, and invokes bash to read just that file. The sales.md and product.md files remain on the filesystem, consuming zero context tokens until needed. This filesystem-based model is what enables progressive disclosure. Claude can navigate and selectively load exactly what each task requires. For complete details on the technical architecture, see [How Skills work](docs.claude.com/en/docs/agents-and-tools/agent-skills/overview#how-skills-work) in the Skills overview. ### MCP tool references If your Skill uses MCP (Model Context Protocol) tools, always use fully qualified tool names to avoid "tool not found" errors. **Format**: `ServerName:tool_name` **Example**: ```markdown theme={null} Use the BigQuery:bigquery_schema tool to retrieve table schemas. Use the GitHub:create_issue tool to create issues. ``` Where: - `BigQuery` and `GitHub` are MCP server names - `bigquery_schema` and `create_issue` are the tool names within those servers Without the server prefix, Claude may fail to locate the tool, especially when multiple MCP servers are available. ### Avoid assuming tools are installed Don't assume packages are available: ````markdown theme={null} **Bad example: Assumes installation**: "Use the pdf library to process the file." **Good example: Explicit about dependencies**: "Install required package: `pip install pypdf` Then use it: ```python from pypdf import PdfReader reader = PdfReader("file.pdf") ```" ```` ## Technical notes ### YAML frontmatter requirements The SKILL.md frontmatter includes only `name` (64 characters max) and `description` (1024 characters max) fields. See the [Skills overview](docs.claude.com/en/docs/agents-and-tools/agent-skills/overview#skill-structure) for complete structure details. ### Token budgets Keep SKILL.md body under 500 lines for optimal performance. If your content exceeds this, split it into separate files using the progressive disclosure patterns described earlier. For architectural details, see the [Skills overview](docs.claude.com/en/docs/agents-and-tools/agent-skills/overview#how-skills-work). 
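As a quick sanity check against these limits, something like the following could be run before sharing a Skill. It is a rough sketch that assumes a standard SKILL.md layout with `---`-delimited frontmatter; the thresholds simply mirror the limits described above, and the script itself is not part of any official tooling.

```bash
#!/bin/bash
# Hypothetical pre-flight check for SKILL.md size limits.
SKILL_FILE="${1:-SKILL.md}"

# Lines after the closing frontmatter delimiter count toward the body budget.
body_lines=$(awk '/^---$/{n++; next} n>=2' "$SKILL_FILE" | wc -l | tr -d ' ')

name=$(grep -m1 '^name:' "$SKILL_FILE" | sed 's/^name:[[:space:]]*//')
description=$(grep -m1 '^description:' "$SKILL_FILE" | sed 's/^description:[[:space:]]*//')

echo "body: $body_lines lines, name: ${#name} chars, description: ${#description} chars"
[ "$body_lines" -le 500 ]    || echo "WARN: body exceeds the ~500 line guideline"
[ ${#name} -le 64 ]          || echo "WARN: name exceeds 64 characters"
[ ${#description} -le 1024 ] || echo "WARN: description exceeds 1024 characters"
```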
## Checklist for effective Skills Before sharing a Skill, verify: ### Core quality - [ ] Description is specific and includes key terms - [ ] Description includes both what the Skill does and when to use it - [ ] SKILL.md body is under 500 lines - [ ] Additional details are in separate files (if needed) - [ ] No time-sensitive information (or in "old patterns" section) - [ ] Consistent terminology throughout - [ ] Examples are concrete, not abstract - [ ] File references are one level deep - [ ] Progressive disclosure used appropriately - [ ] Workflows have clear steps ### Code and scripts - [ ] Scripts solve problems rather than punt to Claude - [ ] Error handling is explicit and helpful - [ ] No "voodoo constants" (all values justified) - [ ] Required packages listed in instructions and verified as available - [ ] Scripts have clear documentation - [ ] No Windows-style paths (all forward slashes) - [ ] Validation/verification steps for critical operations - [ ] Feedback loops included for quality-critical tasks ### Testing - [ ] At least three evaluations created - [ ] Tested with Haiku, Sonnet, and Opus - [ ] Tested with real usage scenarios - [ ] Team feedback incorporated (if applicable)plugins/customaize-agent/skills/create-agent/SKILL.mdskillShow content (20241 bytes)
--- name: create-agent description: Comprehensive guide for creating Claude Code agents with proper structure, triggering conditions, system prompts, and validation - combines official Anthropic best practices with proven patterns argument-hint: "[agent-name] [optional description of agent purpose]" allowed-tools: Read, Write, Glob, Grep, Bash(mkdir:*), Task --- # Create Agent Command Create autonomous Claude Code agents that handle complex, multi-step tasks independently. This command provides comprehensive guidance based on official Anthropic documentation and proven patterns. ## User Input ```text Agent Name: $1 Description: $2 ``` ## What Are Agents? Agents are **autonomous subprocesses** spawned via the Task tool that: - Handle complex, multi-step tasks independently - Have their own isolated context window - Return results to the parent conversation - Can be specialized for specific domains | Concept | Agent | Command | |---------|-------|---------| | **Trigger** | Claude decides based on description | User invokes with `/name` | | **Purpose** | Autonomous work | User-initiated actions | | **Context** | Isolated subprocess | Shared conversation | | **File format** | `agents/*.md` | `commands/*.md` | ## Agent File Structure Agents use a unique format combining **YAML frontmatter** with a **markdown system prompt**: ```markdown --- name: agent-identifier description: Use this agent when [triggering conditions]. Examples: <example> Context: [Situation description] user: "[User request]" assistant: "[How assistant should respond and use this agent]" <commentary> [Why this agent should be triggered] </commentary> </example> <example> [Additional example...] </example> model: inherit color: blue tools: ["Read", "Write", "Grep"] --- You are [agent role description]... **Your Core Responsibilities:** 1. [Responsibility 1] 2. [Responsibility 2] **Analysis Process:** [Step-by-step workflow] **Output Format:** [What to return] ``` ## Frontmatter Fields Reference ### Required Fields #### `name` (Required) **Format**: Lowercase with hyphens only **Length**: 3-50 characters **Rules**: - Must start and end with alphanumeric character - Only lowercase letters, numbers, and hyphens - No underscores, spaces, or special characters | Valid | Invalid | Reason | |-------|---------|--------| | `code-reviewer` | `helper` | Too generic | | `test-generator` | `-agent-` | Starts/ends with hyphen | | `api-docs-writer` | `my_agent` | Underscores not allowed | | `security-analyzer` | `ag` | Too short (<3 chars) | | `pr-quality-reviewer` | `MyAgent` | Uppercase not allowed | #### `description` (Required, Critical) **The most important field** - Defines when Claude triggers the agent. **Requirements**: - Length: 10-5,000 characters (ideal: 200-1,000 with 2-4 examples) - **MUST start with**: "Use this agent when..." 
- **MUST include**: `<example>` blocks showing usage patterns - Each example needs: context, user request, assistant response, commentary **Example Block Format**: ```markdown <example> Context: [Describe the situation - what led to this interaction] user: "[Exact user message or request]" assistant: "[How Claude should respond before triggering]" <commentary> [Explanation of why this agent should be triggered in this scenario] </commentary> assistant: "[How Claude triggers the agent - 'I'll use the [agent-name] agent...']" </example> ``` **Best Practices for Descriptions**: - Include 2-4 concrete examples - Show both proactive and reactive triggering scenarios - Cover different phrasings of the same intent - Explain reasoning in commentary - Be specific about when NOT to use the agent #### `model` (Required) **Values**: `inherit`, `sonnet`, `opus`, `haiku` **Default**: `inherit` (recommended) | Value | Use Case | Cost | |-------|----------|------| | `inherit` | Use parent conversation model | Default | | `haiku` | Fast, simple tasks | Lowest | | `sonnet` | Balanced performance | Medium | | `opus` | Maximum capability, complex reasoning | Highest | **Recommendation**: Use `inherit` unless you have a specific reason to override. #### `color` (Required) **Purpose**: Visual indicator in UI to distinguish agents **Values**: `blue`, `cyan`, `green`, `yellow`, `magenta`, `red` **Best Practice**: Use different colors for different agents in the same plugin to distinguish them visually. ### Optional Fields #### `tools` (Optional) **Purpose**: Restrict available tools (principle of least privilege) **Format**: Array of tool names **Default**: All tools available ```yaml # Common tool restrictions tools: ["Read", "Grep", "Glob"] # Read-only analysis tools: ["Read", "Write", "Grep", "Glob"] # Code modification tools: ["Read", "Bash", "Grep"] # System operations ``` **Security Principle**: Only grant tools the agent actually needs. ## Triggering Patterns ### Pattern 1: Explicit Request User directly asks for the agent's function. ```markdown <example> Context: User wants code reviewed user: "Can you review my code changes?" assistant: "I'll use the code-reviewer agent to analyze the changes." <commentary> User explicitly requested code review, trigger the code-reviewer agent. </commentary> </example> ``` ### Pattern 2: Implicit Need Agent needed based on context, not explicit request. ```markdown <example> Context: User mentions code is hard to understand user: "This code is confusing and hard to follow" assistant: "I can help improve the code clarity." <commentary> User indicates complexity issues. Trigger code-simplifier agent to refactor for better readability. </commentary> assistant: "I'll use the code-simplifier agent to suggest improvements." </example> ``` ### Pattern 3: Proactive Trigger Agent triggers after completing relevant work without explicit request. ```markdown <example> Context: User just implemented database queries user: "I've added the user search functionality" assistant: "Great! Let me review it for security." <commentary> Database code written. Proactively trigger security-analyzer to check for SQL injection and other database security issues. </commentary> assistant: "I'll use the security-analyzer agent to check the database code." </example> ``` ### Pattern 4: Tool Usage Pattern Agent triggers based on prior tool usage. ```markdown <example> Context: User made multiple edits to test files user: "I've updated all the tests" assistant: "Great! Let me verify test quality." 
<commentary> Multiple Edit tools used on test files. Proactively trigger test-quality-analyzer to ensure tests follow best practices. </commentary> assistant: "I'll use the test-quality-analyzer agent to review the tests." </example> ``` ## System Prompt Design The system prompt (markdown body after frontmatter) defines agent behavior. Use this proven template: ```markdown You are [role] specializing in [domain]. **Your Core Responsibilities:** 1. [Primary responsibility - what the agent MUST do] 2. [Secondary responsibility] 3. [Additional responsibilities...] **Analysis Process:** 1. [Step one - be specific] 2. [Step two] 3. [Step three] [...] **Quality Standards:** - [Standard 1 - measurable criteria] - [Standard 2] **Output Format:** Provide results in this format: - [What to include] - [How to structure] **Edge Cases:** Handle these situations: - [Edge case 1]: [How to handle] - [Edge case 2]: [How to handle] **What NOT to Do:** - [Anti-pattern 1] - [Anti-pattern 2] ``` ### System Prompt Principles | Principle | Good | Bad | |-----------|------|-----| | Be specific | "Check for SQL injection in query strings" | "Look for security issues" | | Include examples | "Format: `## Critical Issues\n- Issue 1`" | "Use proper formatting" | | Define boundaries | "Do NOT modify files, only analyze" | No boundaries stated | | Provide fallbacks | "If unsure, ask for clarification" | Assume and proceed | | Quality mechanisms | "Verify each finding with evidence" | No verification | ### Validation Requirements System prompts must be: - **Length**: 20-10,000 characters (ideal: 500-3,000) - **Well-structured**: Clear sections with responsibilities, process, output format - **Specific**: Actionable instructions, not vague guidance - **Complete**: Handles edge cases and quality standards ## AI-Assisted Agent Generation Use this prompt to generate agent configurations automatically: ```markdown Create an agent configuration based on this request: "[YOUR DESCRIPTION]" Requirements: 1. Extract core intent and responsibilities 2. Design expert persona for the domain 3. Create comprehensive system prompt with: - Clear behavioral boundaries - Specific methodologies - Edge case handling - Output format 4. Create identifier (lowercase, hyphens, 3-50 chars) 5. Write description with triggering conditions 6. Include 2-3 <example> blocks showing when to use Return JSON with: { "identifier": "agent-name", "whenToUse": "Use this agent when... Examples: <example>...</example>", "systemPrompt": "You are..." } ``` ### Elite Agent Architect Process When creating agents, follow this 6-step process: 1. **Extract Core Intent**: Identify fundamental purpose, key responsibilities, success criteria 2. **Design Expert Persona**: Create compelling expert identity with domain knowledge 3. **Architect Comprehensive Instructions**: Behavioral boundaries, methodologies, edge cases, output formats 4. **Optimize for Performance**: Decision frameworks, quality control, workflow patterns, fallback strategies 5. **Create Identifier**: Concise, descriptive, 2-4 words with hyphens 6. **Generate Examples**: Triggering scenarios with context, user/assistant dialogue, commentary ## Default Agent Standards ### Frontmatter Rules - `description`: Keep to ONE sentence - descriptions load into parent context, every token counts - Do NOT add verbose `<example>` blocks in description - they waste context tokens ### Required Agent Sections (in order) 1. **Title** - `# <Role Title>` with strong identity statement 2. 
**Identity** - Quality expectations and motivation (consequences for poor work) 3. **Goal** - Clear single-paragraph objective 4. **Input** - What files/data the agent receives 5. **CRITICAL: Load Context** - Explicit requirement to read ALL relevant files BEFORE analysis 6. **Process/Stages** - Step-by-step workflow with proper ordering ### Process Stage Ordering (critical for multi-stage agents) ``` WRONG: Decompose → Self-Critique → Produce → Solve RIGHT: Decompose → Solve → Produce Full Solution → Self-Critique → Output ``` - Self-critique comes as the last step, always - Always produce everything first, then evaluate and select ### Decision Tables Put reasoning column BEFORE decision column: ```markdown WRONG: | Section | Include? | Reasoning | RIGHT: | Section | Reasoning | Include? | ``` This forces the agent to explain WHY before deciding, improving decision quality. ## Validation Rules ### Structural Validation | Component | Rule | Valid | Invalid | |-----------|------|-------|---------| | Name | 3-50 chars, lowercase, hyphens | `code-reviewer` | `Code_Reviewer` | | Description | 10-5000 chars, starts "Use this agent when" | `Use this agent when reviewing code...` | `Reviews code` | | Model | One of: inherit, sonnet, opus, haiku | `inherit` | `gpt-4` | | Color | One of: blue, cyan, green, yellow, magenta, red | `blue` | `purple` | | System prompt | 20-10000 chars | 500+ char prompt | Empty body | | Examples | At least one `<example>` block | Has examples | No examples | ### Validation Script ```bash # Validate agent structure scripts/validate-agent.sh agents/your-agent.md ``` ### Quality Checklist Before deployment: - [ ] Name follows conventions (lowercase, hyphens, 3-50 chars) - [ ] Description starts with "Use this agent when..." - [ ] Description includes 2-4 `<example>` blocks - [ ] Each example has context, user, assistant, commentary - [ ] Model is appropriate for task complexity - [ ] Color is unique among related agents - [ ] Tools restricted to what's needed (least privilege) - [ ] System prompt has clear structure - [ ] Responsibilities are specific and actionable - [ ] Process steps are concrete - [ ] Output format is defined - [ ] Edge cases are addressed ## Production Examples ### Code Quality Reviewer Agent ```markdown --- name: code-quality-reviewer description: Use this agent when the user asks to review code changes, check code quality, or analyze modifications for bugs and improvements. Examples: <example> Context: User has completed implementing a feature user: "I've finished the authentication module" assistant: "Great! Let me review it for quality." <commentary> Code implementation complete. Proactively trigger code-quality-reviewer to check for bugs, security issues, and best practices. </commentary> assistant: "I'll use the code-quality-reviewer agent to analyze the changes." </example> <example> Context: User explicitly requests review user: "Can you review PR #123 for code quality?" assistant: "I'll use the code-quality-reviewer agent to analyze the PR." <commentary> PR review request triggers the code-quality-reviewer agent. </commentary> </example> model: inherit color: blue tools: ["Read", "Grep", "Glob"] --- You are an expert code quality reviewer specializing in identifying bugs, security vulnerabilities, and improvement opportunities. **Your Core Responsibilities:** 1. Analyze code changes for quality issues (readability, maintainability, performance) 2. Identify security vulnerabilities (injection, XSS, authentication issues) 3. 
Check adherence to project best practices and coding standards 4. Provide actionable, specific feedback with line numbers **Review Process:** 1. **Initial scan**: Quick pass for obvious issues 2. **Deep analysis**: Line-by-line review of changed code 3. **Context evaluation**: Check impact on related code 4. **Best practices**: Compare against project and language standards 5. **Recommendations**: Prioritized list of improvements **Output Format:** For each file reviewed: ## Summary [2-3 sentence overview] ## Critical Issues (must fix) - [Issue with file:line reference and fix suggestion] ## Major Issues (should fix) - [Issue with explanation] ## Minor Issues (nice to fix) - [Suggestion] ## Positive Observations - [What was done well] ## Overall Assessment [Pass/Needs Work/Fail with reasoning] **Quality Standards:** - Every issue must have a specific location (file:line) - Every issue must have a suggested fix - Prioritize security over style issues - Acknowledge good patterns, not just problems ``` ### Test Generator Agent ```markdown --- name: test-generator description: Use this agent when the user asks for test generation, needs test coverage, or has written new code requiring tests. Examples: <example> Context: User has written new functionality without tests user: "I've implemented the payment processing module" assistant: "Let me generate comprehensive tests for it." <commentary> New code written without tests. Trigger test-generator to create unit tests, integration tests, and edge case coverage. </commentary> assistant: "I'll use the test-generator agent to create comprehensive tests." </example> <example> Context: User explicitly requests tests user: "Can you write tests for the utils folder?" assistant: "I'll use the test-generator agent to create tests." <commentary> Explicit test generation request. </commentary> </example> model: inherit color: green tools: ["Read", "Write", "Grep", "Glob"] --- You are an expert test engineer specializing in creating comprehensive test suites. **Your Core Responsibilities:** 1. Analyze code to understand behavior and dependencies 2. Generate unit tests for individual functions/methods 3. Create integration tests for module interactions 4. Design edge case and error condition tests 5. Follow project testing conventions and patterns **Expertise Areas:** - **Unit testing**: Individual function/method tests - **Integration testing**: Module interaction tests - **Edge cases**: Boundary conditions, error paths - **Test organization**: Proper structure and naming - **Mocking**: Appropriate use of mocks and stubs **Process:** 1. Read target code and understand its behavior 2. Identify testable units and their dependencies 3. Design test cases covering: - Happy paths (expected behavior) - Edge cases (boundary conditions) - Error cases (invalid inputs, failures) 4. Generate tests following project patterns 5. Add comprehensive assertions **Output Format:** Complete test files with: - Proper test suite structure (describe/it or test blocks) - Setup/teardown if needed - Descriptive test names explaining what's being tested - Comprehensive assertions covering all behaviors - Comments explaining complex test logic **Quality Standards:** - Each function should have at least 3 tests (happy, edge, error) - Test names should describe the scenario being tested - Mocks should be clearly documented - No test interdependencies ``` ## Agent Creation Process ### Step 1: Gather Requirements Ask user (if not provided): 1. 
**Agent name**: What should the agent be called? (kebab-case) 2. **Purpose**: What problem does this agent solve? 3. **Triggers**: When should Claude use this agent? 4. **Responsibilities**: What are the core tasks? 5. **Tools needed**: Read-only? Can modify files? 6. **Model**: Need maximum capability (opus) or balanced (sonnet/inherit)? ### Step 2: Create Agent File ```bash # Create agents directory if needed mkdir -p ${CLAUDE_PLUGIN_ROOT}/agents # Create agent file touch ${CLAUDE_PLUGIN_ROOT}/agents/<agent-name>.md ``` ### Step 3: Write Frontmatter Generate frontmatter with: - Unique, descriptive name - Description with triggering conditions and examples - Appropriate model setting - Distinct color - Minimal required tools ### Step 4: Write System Prompt Create system prompt following the template: 1. Role statement with specialization 2. Core responsibilities (numbered list) 3. Analysis/work process (step-by-step) 4. Quality standards (measurable criteria) 5. Output format (specific structure) 6. Edge cases (how to handle special situations) ### Step 5: Validate Run validation: ```bash scripts/validate-agent.sh agents/<agent-name>.md ``` Check: - [ ] Frontmatter parses correctly - [ ] All required fields present - [ ] Examples are complete - [ ] System prompt is comprehensive ### Step 6: Test Triggering Test with various scenarios: 1. Explicit requests matching examples 2. Implicit needs where agent should activate 3. Scenarios where agent should NOT activate 4. Edge cases and variations ## Best Practices Summary ### DO - Include 2-4 concrete examples in agent descriptions - Write specific, unambiguous triggering conditions - Use "inherit" model setting unless specific need - Apply principle of least privilege for tools - Write clear, structured system prompts with explicit steps - Test agent triggering thoroughly before deployment - Use different colors for different agents - Include commentary explaining trigger logic ### DON'T - Generic descriptions without examples - Omit triggering conditions - Use same color for multiple agents in same plugin - Grant unnecessary tool access - Write vague system prompts - Skip testing phases - Use underscores or uppercase in names - Forget to handle edge cases ## Integration with Workflows Agents integrate with plugin workflows: 1. **Phase 5: Component Implementation** uses agent-creator to generate agents 2. **Validation phase** uses validate-agent.sh script 3. **Testing phase** verifies triggering across scenarios For comprehensive plugin development, use: - `/plugin-dev:create-plugin` for full plugin workflow - This command for individual agent creation/refinement ## Create the Agent Based on user input, create: 1. **Directory structure**: `${CLAUDE_PLUGIN_ROOT}/agents/` 2. **Agent file**: Complete markdown with frontmatter + system prompt 3. **Validation**: Run validation script 4. **Testing suggestions**: Scenarios to verify triggering After creation, suggest testing with `/customaize-agent:test-prompt` command to verify agent behavior under various scenarios..claude-plugin/marketplace.jsonmarketplaceShow content (5880 bytes)
{ "$schema": "https://anthropic.com/claude-code/marketplace.schema.json", "name": "context-engineering-kit", "version": "3.0.0", "description": "Hand-crafted collection of advanced context engineering techniques and patterns with minimal token footprint focused on improving agent result quality.", "owner": { "name": "NeoLabHQ", "email": "vlad.goncharov@neolab.finance" }, "plugins": [ { "name": "reflexion", "description": "Collection of commands that force LLM to reflect on previous response and output. Based on papers like Self-Refine and Reflexion. These techniques improve the output of large language models by introducing feedback and refinement loops.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/reflexion", "category": "productivity" }, { "name": "review", "description": "Introduce codebase and PR review commands and skills using multiple specialized agents.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/review", "category": "productivity" }, { "name": "git", "description": "Introduces commands for commit and PRs creation, plus skills for git worktrees and notes.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/git", "category": "productivity" }, { "name": "tdd", "description": "Introduces commands for test-driven development, common anti-patterns and skills for testing using subagents.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/tdd", "category": "development" }, { "name": "sadd", "description": "Introduces skills for subagent-driven development, dispatches fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/sadd", "category": "development" }, { "name": "ddd", "description": "Introduces command to update CLAUDE.md with best practices for domain-driven development, focused on quality of code, includes Clean Architecture, SOLID principles, and other design patterns.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/ddd", "category": "development" }, { "name": "sdd", "description": "Specification Driven Development workflow commands and agents, based on Github Spec Kit and OpenSpec. Uses specialized agents for effective context management and quality review.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/sdd", "category": "development" }, { "name": "kaizen", "description": "Inspired by Japanese continuous improvement philosophy, Agile and Lean development practices. 
Introduces commands for analysis of root cause of issues and problems, including 5 Whys, Cause and Effect Analysis, and other techniques.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/kaizen", "category": "productivity" }, { "name": "customaize-agent", "description": "Commands and skills for writing and refining commands, hooks, skills for Claude Code, includes Anthropic Best Practices and Agent Persuasion Principles that can be useful for sub-agent workflows.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/customaize-agent", "category": "development" }, { "name": "docs", "description": "Commands for analysing project, writing and refining documentation.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/docs", "category": "productivity" }, { "name": "tech-stack", "description": "Commands for setup or update of CLAUDE.md file with best practices for specific language or framework.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/tech-stack", "category": "development" }, { "name": "mcp", "description": "Commands for setup well known MCP server integration if needed and update CLAUDE.md file with requirement to use this MCP server for current project.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/mcp", "category": "development" }, { "name": "fpf", "description": "First Principles Framework (FPF) for structured reasoning. Implements ADI (Abduction-Deduction-Induction) cycle for hypothesis generation, logical verification, empirical validation, and auditable decision-making.", "version": "3.0.0", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" }, "source": "./plugins/fpf", "category": "development" } ] }
README
Advanced context engineering techniques and patterns for Claude Code, OpenCode, Cursor, Antigravity and more.
Quick Start · Plugins · GitHub Action · Reference · Docs
Context Engineering Kit
Hand-crafted collection of advanced context engineering techniques and patterns with minimal token footprint, focused on improving agent result quality and predictability.
The marketplace is based on prompts that our company's developers have used daily for a long time, supplemented with plugins derived from benchmarked papers and high-quality projects.
Key Features
- Simple to Use - Easy to install and use, with no dependencies. Includes skills that are applied automatically and self-explanatory commands.
- Token-Efficient - Carefully crafted prompts and architecture, preferring command-oriented skills with sub-agents over general information skills when possible, to minimize populating context with unnecessary information.
- Quality-Focused - Each plugin is focused on meaningfully improving agent results in a specific area.
- Granular - Install only the plugins you need. Each plugin loads only its specific agents, commands, and skills, with no overlap or redundant skills between plugins.
- Scientifically proven - Plugins are based on techniques and patterns validated by well-trusted benchmarks and studies.
- Open-Standards - Skills follow the agentskills.io specification. The SDD plugin is based on the arc42 standard for software development documentation.
News
Updates from key releases:
- v3.0.0: Added support for AMP and Hermes agents. The Tech Stack plugin now automatically injects TypeScript best practices when the agent reads or writes TypeScript files.
- v2.2.0: The Subagent-Driven Development plugin now works as a distilled version of the SDD plugin, using meta-judge and judge sub-agents to generate specifications on the fly, in parallel with implementation. The DDD plugin now includes Clean Architecture, DDD, SOLID, Functional Programming, and other pattern examples as rules that are automatically added to the context while code is being written.
- v2.1.0: Spec-Driven Development plugin agents now include high-level code quality guidelines from the DDD plugin.
- v2.0.0: Spec-Driven Development plugin was rewritten from scratch. It is now able to produce working code in 99% of cases on real-life production projects!
Quick Start
Step 1: Install Marketplace and Plugins
Claude Code
Open Claude Code and add the Context Engineering Kit marketplace
/plugin marketplace add NeoLabHQ/context-engineering-kit
This makes all plugins available for installation, but does not load any agents or skills into your context.
Install any plugin, for example reflexion:
/plugin install reflexion@NeoLabHQ/context-engineering-kit
Each installed plugin loads only its specific agents, commands, and skills into Claude's context.
Cursor, Antigravity, Codex, OpenCode and others
Run the vercel-labs/skills command in your terminal:
npx skills add NeoLabHQ/context-engineering-kit
You can pick which skills and agents to install.
Alternative installation methods
You can use OpenSkills to install skills by running the following commands:
npx openskills install NeoLabHQ/context-engineering-kit
npx openskills sync
Step 2: Use Plugin
> claude "implement user authentication"
# Claude implements user authentication, then you can ask it to reflect on implementation
> /reflexion:reflect
# It analyses results and suggests improvements
# If issues are obvious, it will fix them immediately
# If they are minor, it will suggest improvements that you can respond to
> fix the issues
# If you would like to prevent issues found during reflection from appearing again,
# ask Claude to extract resolution strategies and save the insights to project memory
> /reflexion:memorize
Alternatively, you can include the word "reflect" in the initial prompt:
> claude "implement user authentication, then reflect"
# Claude implements user authentication,
# then hook automatically runs /reflexion:reflect
To use this hook, you need to have bun installed; bun is not required for running the command itself.
Documentation
You can find the complete Context Engineering Kit documentation here.
That said, the main plugins we recommend starting with are Subagent-Driven Development and Spec-Driven Development.
Agent Reliability Engineering
Three plugins in this marketplace (Reflexion, Subagent-Driven Development, and Spec-Driven Development) are designed to improve how accurately and consistently the agent follows the provided instructions, and to reduce hallucinations and bias toward incorrect solutions. They are not competitors but complements, because they let you balance reliability against token cost. The table below compares different agent usage approaches by the probability (p) of receiving fully accurate, hallucination-free results, depending on how many files the task changes:
| Approach | p (1-3 files) | p (4-10 files) | p (10-20 files) | p (20+ files) | Token Overhead | What this means in practice |
|---|---|---|---|---|---|---|
| One-shot prompt | 60%-80% | 30%-50% | 5%-30% | 1%-20% | 0 | Accuracy depends on model, but with context growth LLM quality degrades exponentially |
| /reflect | 68%-91% | 49%-71% | 13%-41% | 1%-30% | 1k-3k | Agent finds and fixes missed requirements on its own |
| /reflect + /memorize | 79%-87% | 60%-79% | 34%-42% | 5%-30% | 2k-5k | Agent extracts repeatable mistakes and avoids them during new tasks |
| /do-and-judge | 90% | 83% | 60% | 30% | 1.5x-3x | Mitigates context rot, bias, hallucinations and missed requirements using Judge sub-agent |
| /do-in-steps | 92% | 90% | 71% | 50% | 3x-5x | Resolves all issues similarly to /do-and-judge, but separately per file group |
| /plan-task + /implement-task | 94% | 93% | 85% | 70% | 5x-20x | Performs the /do-in-steps flow, but the specification mitigates issues caused by inconsistent architecture and codebase size |
| /brainstorm + /plan-task + /implement-task | 95% | 95% | 90% | 80% | 5x-20x | Brainstorming decreases the number of incorrect decisions and missed requirements |
| /plan-task + human review + /implement-task | 99% | 99% | 99% | 95% | 5x-35x | Human review mitigates misunderstanding of requirements by LLM |
Reliability metrics are based on more than 6 months of real development usage on production projects.
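As a rough rule of thumb based on the table above (the task descriptions and file name below are illustrative, not prescribed by the plugins):
# Small, well-scoped change: a single sub-agent plus an independent judge
/do-and-judge "Rename the config loader and update its call sites"
# Large, multi-file feature: create and refine a spec, review it yourself, then implement
/add-task "Introduce role-based access control"
/plan-task
/implement-task @.specs/tasks/todo/role-based-access-control.feature.md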
Plugins List
To view all available plugins:
/plugin
- Reflexion - Introduces feedback and refinement loops to improve output quality.
- Spec-Driven Development - Introduces commands for specification-driven development, based on Continuous Learning + LLM-as-Judge + Agent Swarm. Achieves development as compilation through reliable code generation.
- Review - Introduces code and PR review commands and skills using multiple specialized agents with impact/confidence filtering.
- Git - Introduces commands for commit and PR creation.
- Test-Driven Development - Introduces commands for test-driven development, common anti-patterns and skills for testing using subagents.
- Subagent-Driven Development - Introduces skills for subagent-driven development, which dispatches a fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates.
- Domain-Driven Development - Introduces commands to update CLAUDE.md with best practices for domain-driven development, focused on code quality, and includes Clean Architecture, SOLID principles, and other design patterns.
- FPF - First Principles Framework - Introduces structured reasoning using ADI cycle (Abduction-Deduction-Induction) with knowledge layer progression. Uses workflow command pattern with fpf-agent for hypothesis generation, verification, and auditable decision-making.
- Kaizen - Inspired by Japanese continuous improvement philosophy, Agile and Lean development practices. Introduces commands for analysis of root causes of issues and problems, including 5 Whys, Cause and Effect Analysis, and other techniques.
- Customaize Agent - Commands and skills for writing and refining commands, hooks, and skills for Claude Code. Includes Anthropic Best Practices and Agent Persuasion Principles that can be useful for sub-agent workflows.
- Docs - Commands for analyzing projects, writing and refining documentation.
- Tech Stack - Rules for language-specific best practices, automatically applied when working on matching file types.
- MCP - Commands for setting up integrations with well-known MCP servers when needed, and for updating the CLAUDE.md file with a requirement to use that MCP server in the current project.
Stay ahead
Star Context Engineering Kit on GitHub to support its development and get notified about new features and updates.
Reflexion
Collection of commands that force the LLM to reflect on the previous response and output. Includes automatic reflection hooks that trigger when you include "reflect" in your prompt.
How to install
/plugin install reflexion@NeoLabHQ/context-engineering-kit
Commands
- /reflexion:reflect - Reflect on previous response and output, based on Self-refinement framework for iterative improvement with complexity triage and verification
- /reflexion:memorize - Memorize insights from reflections and critiques by curating them into the CLAUDE.md file using Agentic Context Engineering
- /reflexion:critique - Comprehensive multi-perspective review using specialized judges with debate and consensus building
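A typical follow-up after Claude completes a piece of work might look like this (none of these commands require arguments, as far as the descriptions above indicate):
# Single-pass self-refinement of the previous answer
/reflexion:reflect
# Multi-judge critique with debate and consensus for higher-stakes changes
/reflexion:critique
# Persist the resulting insights into CLAUDE.md
/reflexion:memorize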
Hooks
- Automatic Reflection Hook - Triggers /reflexion:reflect automatically when "reflect" appears in your prompt
Theoretical Foundation
The plugin is based on papers like Self-Refine and Reflexion. These techniques improve the output of large language models by introducing feedback and refinement loops.
They are proven to increase output quality by 8–21% based on both automatic metrics and human preferences across seven diverse tasks, including dialogue generation, coding, and mathematical reasoning, when compared to standard one-step model outputs.
On top of that, the plugin is based on the Agentic Context Engineering paper that uses memory updates after reflection, and consistently outperforms strong baselines by 10.6% on agents.
Review
Comprehensive code and PR review commands using multiple specialized agents for thorough code quality evaluation with impact/confidence filtering.
How to install
/plugin install review@NeoLabHQ/context-engineering-kit
Commands
- /review-local-changes - Comprehensive review of local uncommitted changes using specialized agents with code improvement suggestions
- /review-pr - Comprehensive pull request review using specialized agents
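For example (the PR number argument is an assumption based on typical usage; adjust it to your repository):
# Review uncommitted local changes before committing
/review-local-changes
# Review an open pull request
/review-pr 123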
Agents
This plugin uses multiple specialized agents for comprehensive code quality analysis:
- bug-hunter - Identifies potential bugs, edge cases, and error-prone patterns
- code-quality-reviewer - Evaluates code structure, readability, and maintainability
- contracts-reviewer - Reviews interfaces, API contracts, and data models
- historical-context-reviewer - Analyzes changes in relation to codebase history and patterns
- security-auditor - Identifies security vulnerabilities and potential attack vectors
- test-coverage-reviewer - Evaluates test coverage and suggests missing test cases
You can use this plugin to review code in GitHub Actions; to do so, follow this guide.
Git
Commands and skills for streamlined Git operations including commits, pull request creation, and advanced workflow patterns.
How to install
/plugin install git@NeoLabHQ/context-engineering-kit
Commands
- /commit - Create well-formatted commits with conventional commit messages and emoji
- /create-pr - Create pull requests using GitHub CLI with proper templates and formatting
- /analyze-issue - Analyze a GitHub issue and create a detailed technical specification
- /load-issues - Load all open issues from GitHub and save them as markdown files
- /worktree - Create, compare, and merge git worktrees for parallel development with automatic dependency installation
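A typical flow for shipping a change might look like this:
# Create a conventional, emoji-annotated commit from the current changes
/commit
# Open a pull request for the current branch via the GitHub CLI
/create-pr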
Skills
- notes - Skill about using git notes to add metadata to commits without changing history.
Test-Driven Development
Commands and skills for test-driven development with anti-pattern detection.
How to install
/plugin install tdd@NeoLabHQ/context-engineering-kit
Commands
- /write-tests - Systematically add test coverage for local code changes using specialized review and development agents
- /fix-tests - Fix failing tests after business logic changes or refactoring using orchestrated agents
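For example:
# Add test coverage for the code you just changed
/write-tests
# Repair tests that started failing after a refactoring
/fix-tests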
Skills
- test-driven-development - Introduces TDD methodology, best practices, and skills for testing using subagents
Subagent-Driven Development
Execution framework for competitive generation, multi-agent evaluation, and subagent-driven development with quality gates.
How to install
/plugin install sadd@NeoLabHQ/context-engineering-kit
Commands
- /launch-sub-agent - Launch focused sub-agents with intelligent model selection, Zero-shot CoT reasoning, and self-critique verification
- /do-and-judge - Execute a single task with implementation sub-agent, independent judge verification, and automatic retry loop until passing
- /do-in-parallel - Execute the same task across multiple independent targets in parallel with context isolation
- /do-in-steps - Execute complex tasks through sequential sub-agent orchestration with automatic decomposition and context passing
- /do-competitively - Execute tasks through competitive generation, multi-judge evaluation, and evidence-based synthesis to produce superior results
- /tree-of-thoughts - Execute complex reasoning through systematic exploration of solution space, pruning unpromising branches, and synthesizing the best solution
- /judge-with-debate - Evaluate solutions through iterative multi-judge debate with consensus building or disagreement reporting
- /judge - Evaluate completed work using LLM-as-Judge with structured rubrics and evidence-based scoring
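For example (the task descriptions below are illustrative; argument formats are an assumption based on the command descriptions):
# One task, one implementation sub-agent, one independent judge, retried until it passes
/do-and-judge "Add input validation to the signup form"
# The same change applied independently to several targets, each in an isolated context
/do-in-parallel "Migrate each service's logger to structured logging"
# A complex task decomposed into sequential sub-agent steps with context passing
/do-in-steps "Extract the billing module into its own package"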
Skills
- subagent-driven-development - Dispatches a fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates
- multi-agent-patterns - Design multi-agent architectures (supervisor, peer-to-peer, hierarchical) for complex tasks exceeding single-agent context limits
Spec-Driven Development
Comprehensive specification-driven development workflow plugin that transforms prompts into production-ready implementations through structured planning, architecture design, and quality-gated execution.
This plugin is designed to consistently produce working code. It was tested on real-life production projects by our team, and in 100% of cases, it generated working code aligned with the initial prompt. If you find a use case it cannot handle, please report it as an issue.
Key Features
- Development as compilation — The plugin works like a "compilation" or "nightly build" for your development process:
task specs → run /implement-task → working code. After writing your prompt, you can launch the plugin and expect a working result when you come back. The time it takes depends on task complexity — simple tasks may finish in 30 minutes, while complex ones can take a few days.
- Benchmark-level quality in real life — Model benchmarks improve with each release, yet real-world results usually stay the same. That's because benchmarks reflect the best possible output a model can achieve, whereas in practice LLMs tend to drift toward sub-optimal solutions that can be wrong or non-functional. This plugin uses a variety of patterns to keep the model working at its peak performance.
- Customizable — Balance result quality and process speed by adjusting command parameters. Learn more in the Customization section.
- Developer time-efficient — The overall process is designed to minimize developer time and reduce the number of interactions, while still producing results better than what a model can generate from scratch. However, overall quality is highly proportional to the time you invest in iterating and refining the specification.
- Industry-standard — The plugin's specification template is based on the arc42 standard, adjusted for LLM capabilities. Arc42 is a widely adopted, high-quality standard for software development documentation used by many companies and organizations.
- Works best in complex or large codebases — While most other frameworks work best for new projects and greenfield development, this plugin is designed to perform better the more existing code and well-structured architecture you have. At each planning phase it includes a codebase impact analysis step that evaluates which files may be affected and which patterns to follow to achieve the desired result.
- Simple — This plugin avoids unnecessary complexity and mainly uses just 3 commands, offloading process complexity to the model via multi-agent orchestration.
/implement-task is a single command that produces working code from a task specification. To create that specification, you run /sdd:add-task and /plan-task, which analyze your prompt and iteratively refine the specification until it meets the required quality.
Quick Start
/plugin install sdd@NeoLabHQ/context-engineering-kit
Then run the following commands:
# create .specs/tasks/draft/design-auth-middleware.feature.md file with initial prompt
/add-task "Design and implement authentication middleware with JWT support"
# write detailed specification for the task
/plan-task
# will move task to .specs/tasks/todo/ folder
Restart the Claude Code session to clear context and start fresh. Then run the following command:
# implement the task
/implement-task @.specs/tasks/todo/design-auth-middleware.feature.md
# produces working implementation and moves the task to .specs/tasks/done/ folder
Commands
- /add-task - Create task template file with initial prompt
- /plan-task - Analyze prompt, generate required skills and refine task specification
- /implement-task - Produce a working implementation of the task and verify it
Additional commands useful before creating a task:
- /create-ideas - Generate diverse ideas on a given topic using creative sampling techniques
- /brainstorm - Refine vague ideas into fully-formed designs through collaborative dialogue
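For example, before creating a task you might explore the problem space (the prompts below are illustrative):
# Turn a vague idea into a concrete design through collaborative dialogue
/brainstorm "How should we handle multi-tenant rate limiting?"
# Generate diverse implementation ideas to compare
/create-ideas "Rate limiting strategies for a multi-tenant API"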
Agents
| Agent | Description | Used By |
|---|---|---|
| researcher | Technology research, dependency analysis, best practices | /plan-task (Phase 2a) |
| code-explorer | Codebase analysis, pattern identification, architecture mapping | /plan-task (Phase 2b) |
| business-analyst | Requirements discovery, stakeholder analysis, specification writing | /plan-task (Phase 2c) |
| software-architect | Architecture design, component design, implementation planning | /plan-task (Phase 3) |
| tech-lead | Task decomposition, dependency mapping, risk analysis | /plan-task (Phase 4) |
| team-lead | Step parallelization, agent assignment, execution planning | /plan-task (Phase 5) |
| qa-engineer | Verification rubrics, quality gates, LLM-as-Judge definitions | /plan-task (Phase 6) |
| developer | Code implementation, TDD execution, quality review, verification | /implement-task |
| tech-writer | Technical documentation writing, API guides, architecture updates, lessons learned | /implement-task |
Patterns
Key patterns implemented in this plugin:
- Structured reasoning templates — includes Zero-shot and Few-shot Chain of Thought, Tree of Thoughts, Problem Decomposition, and Self-Critique. Each is tailored to a specific agent and task, enabling sufficiently detailed decomposition so that isolated sub-agents can implement each step independently.
- Multi-agent orchestration for context management — Context isolation of independent agents prevents the context rot problem, essentially keeping LLMs at optimal performance at each step of the process. The main agent acts as an orchestrator that launches sub-agents and controls their work.
- Quality gates based on LLM-as-Judge — Evaluate the quality of each planning and implementation step using evidence-based scoring and predefined verification rubrics. This fully eliminates cases where an agent produces non-working or incorrect solutions.
- Continuous learning — Builds skills that the agent needs to implement a specific task, which it would otherwise not be able to perform from scratch.
- Spec-driven development pattern — Based on the arc42 specification standard, adjusted for LLM capabilities, to eliminate parts of the specification that add no value to implementation quality or that could degrade it.
- MAKER — An agent reliability pattern introduced in Solving a Million-Step LLM Task with Zero Errors. It removes agent mistakes caused by accumulated context and hallucinations by utilizing clean-state agent launches, filesystem-based memory storage, and multi-agent voting during critical decision-making.
Vibe Coding vs. Specification-Driven Development
This plugin is not a "vibe coding" solution, but out of the box it works like one. By default it is designed to work from a single prompt through to the end of the task, making reasonable assumptions and evidence-based decisions instead of constantly asking for clarification. This is because developer time is more valuable than model time. As a result, the plugin is designed to allow the developer to decide how much time the task is worth. The plugin will always produce working results, but quality will be sub-optimal if no human feedback is provided.
To improve quality, after generating a specification you can correct it or leave comments using //, then run the /plan-task command again with the --refine flag. You can also verify each planning and implementation phase yourself by adding the --human-in-the-loop flag. Across the research we know of, human feedback is the most effective way to improve results.
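A minimal refinement loop might look like this, assuming the flags attach to /plan-task as described above:
# Edit the generated spec, leaving // comments where it should change,
# then fold that feedback back into the specification
/plan-task --refine
# Or gate every planning and implementation phase on your approval
/plan-task --human-in-the-loop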
Our tests showed that even when the initially generated specification was incorrect due to lack of information or task complexity, the agent was still able to self-correct until it reached a working solution. However, it usually takes much longer, and results in the agent spending time on wrong paths and stopping more frequently. To avoid this, we strongly advise decomposing tasks into smaller separate tasks with dependencies and reviewing the specification for each one independently. You can add dependencies between tasks as arguments to the /add-task command, and the agent will link them together by adding a depends_on section to the task file frontmatter.
Even if you don't want to spend much time on this process, you can still use the plugin for complex tasks without decomposition or human verification — but you will likely need tools like ralph-loop to keep the agent running for longer.
Learn more about available customization options in Customization.
Domain-Driven Development
Code quality framework with rules for Clean Architecture, SOLID principles, and Domain-Driven Design patterns.
How to install
/plugin install ddd@NeoLabHQ/context-engineering-kit
Rules
- 15 composable rules covering Clean Architecture, SOLID principles, Command-Query Separation, Functional Core/Imperative Shell, Explicit Control Flow, Domain-Specific Naming, and more. See rules reference
FPF - First Principles Framework
A structured reasoning plugin that implements the First Principles Framework (FPF) by Anatoly Levenchuk — a methodology for rigorous, auditable reasoning. Its key feature is turning the black box of AI reasoning into a transparent, evidence-backed audit trail. Instead of jumping to solutions, FPF enforces generating competing hypotheses, checking them logically, testing them against evidence, and then letting the developer choose.
Key principles:
- Transparent reasoning - Full audit trail from hypothesis to decision
- Hypothesis-driven - Generate 3-5 competing alternatives before evaluating
- Evidence-based - Computed trust scores, not estimates
- Human-in-the-loop - AI generates options; humans decide (Transformer Mandate)
The core cycle follows three modes of inference:
- Abduction — Generate competing hypotheses (don't anchor on the first idea).
- Deduction — Verify logic and constraints (does the idea make sense?).
- Induction — Gather evidence through tests or research (does the idea work in reality?).
Then, audit for bias, decide, and document the rationale in a durable record.
Warning: This plugin loads the core FPF specification into context, which is large (~600k tokens), so it runs in a subagent that uses the Sonnet[1m] model. Be aware that such an agent can consume your token limit quickly.
How to install
/plugin install fpf@NeoLabHQ/context-engineering-kit
Usage workflow
# Execute complete FPF cycle from hypothesis to decision
/propose-hypotheses What caching strategy should we use?
# The workflow will:
# 1. Initialize context and .fpf/ directory
# 2. Generate competing hypotheses
# 3. Allow you to add your own alternatives
# 4. Verify each against project constraints (parallel)
# 5. Validate with evidence (parallel)
# 6. Compute trust scores (parallel)
# 7. Present comparison for your decision
Commands
- /propose-hypotheses - Execute complete FPF cycle from hypothesis to decision (main workflow)
- /status - Show current FPF phase and hypothesis counts
- /query - Search knowledge base with assurance info
- /decay - Manage evidence freshness (refresh/deprecate/waive)
- /actualize - Reconcile knowledge with codebase changes
- /reset - Archive session and return to IDLE
Agent
- fpf-agent - FPF reasoning specialist for hypothesis generation, verification, validation, and trust calculus using ADI cycle and knowledge layer progression
Kaizen
Continuous improvement methodology inspired by Japanese philosophy and Agile practices.
How to install
/plugin install kaizen@NeoLabHQ/context-engineering-kit
Commands
- /analyse - Auto-selects best Kaizen method (Gemba Walk, Value Stream, or Muda) for target analysis
- /analyse-problem - Comprehensive A3 one-page problem analysis with root cause and action plan
- /why - Iterative Five Whys root cause analysis drilling from symptoms to fundamentals
- /root-cause-tracing - Systematically traces bugs backward through call stack to identify source of invalid data or incorrect behavior
- /cause-and-effect - Systematic Fishbone analysis exploring problem causes across six categories
- /plan-do-check-act - Iterative PDCA cycle for systematic experimentation and continuous improvement
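For example (the problem statements are illustrative):
# Drill from a symptom down to its root cause with Five Whys
/why "Deployments fail every Friday evening"
# Full A3 analysis producing a root cause and an action plan
/analyse-problem "Checkout error rate doubled after the last release"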
Skills
- kaizen - Continuous improvement methodology with multiple analysis techniques
Customaize Agent
Commands and skills for creating and refining Claude Code extensions.
How to install
/plugin install customaize-agent@NeoLabHQ/context-engineering-kit
Commands
- /customaize-agent:create-agent - Comprehensive guide for creating Claude Code agents with proper structure, triggering conditions, system prompts, and validation
- /customaize-agent:create-command - Interactive assistant for creating new Claude commands with proper structure and patterns
- /customaize-agent:create-workflow-command - Create workflow commands that orchestrate multi-step execution through sub-agents with file-based task prompts
- /customaize-agent:create-skill - Guide for creating effective skills with test-driven approach
- /customaize-agent:create-hook - Create and configure git hooks with intelligent project analysis and automated testing
- /customaize-agent:test-skill - Verify skills work under pressure and resist rationalization using RED-GREEN-REFACTOR cycle
- /customaize-agent:test-prompt - Test any prompt (commands, hooks, skills, subagent instructions) using RED-GREEN-REFACTOR cycle with subagents
- /customaize-agent:apply-anthropic-skill-best-practices - Comprehensive guide for skill development based on Anthropic's official best practices
Skills
- prompt-engineering - Well-known prompt engineering techniques and patterns, includes Anthropic Best Practices and Agent Persuasion Principles
- context-engineering - Deep understanding of context mechanics: attention budget, progressive disclosure, lost-in-middle effect, and practical optimization patterns
- agent-evaluation - Evaluation frameworks for agent systems: LLM-as-Judge, multi-dimensional rubrics, bias mitigation, and the 95% performance finding
Docs
Commands for project analysis and documentation management based on proven writing principles.
How to install
/plugin install docs@NeoLabHQ/context-engineering-kit
Commands
- /update-docs - Update implementation documentation after completing development phases
- /write-concisely - Apply The Elements of Style principles to make documentation clearer and more professional
Tech Stack
Rules for language and framework-specific best practices, automatically applied when the agent works on matching file types.
How to install
/plugin install tech-stack@NeoLabHQ/context-engineering-kit
Rules
- TypeScript Best Practices - Type system guidelines, code style, async patterns, and code quality standards, automatically loaded when the agent reads or writes .ts files
MCP
Commands for integrating Model Context Protocol servers with your project. Each setup command supports configuration at multiple levels:
- Project level (shared) - Configuration tracked in git, shared with the team via ./CLAUDE.md
- Project level (personal) - Local configuration in ./CLAUDE.local.md, not tracked in git
- User level (global) - Configuration in ~/.claude/CLAUDE.md, applies to all projects
How to install
/plugin install mcp@NeoLabHQ/context-engineering-kit
Commands
- /mcp:setup-context7-mcp - Guide for setting up Context7 MCP server to load documentation for specific technologies
- /mcp:setup-serena-mcp - Guide for setting up Serena MCP server for semantic code retrieval and editing capabilities
- /mcp:setup-codemap-cli - Guide for setting up Codemap CLI for intelligent codebase visualization and navigation
- /mcp:setup-arxiv-mcp - Guide for setting up arXiv/Paper Search MCP server via Docker MCP for academic paper search and retrieval from multiple sources
- /mcp:build-mcp - Guide for creating high-quality MCP servers that enable LLMs to interact with external services
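For example, to make the project rely on Context7 for up-to-date library documentation and record that requirement at the project level:
# Configure the Context7 MCP server and update ./CLAUDE.md (shared, tracked in git)
/mcp:setup-context7-mcp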
Theoretical Foundation
This project is based on research and papers from the following sources:
- Self-Refine - Core refinement loop
- Reflexion - Memory integration
- Constitutional AI - Principle-based critique
- LLM-as-a-Judge - Evaluation patterns
- Multi-Agent Debate - Multiple perspectives
- Agentic Context Engineering - Memory curation
- Chain-of-Verification - Hallucination reduction
- Tree of Thoughts - Structured exploration
- Process Reward Models - Step-by-step evaluation
- Verbalized Sampling - Diverse idea generation with 2-3x improvement
- Chain of Thought Prompting - Step-by-step reasoning
- Inference-Time Scaling of Verification - Rubric-guided verification
More details about the theoretical foundation can be found on the resources page.
