Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
muratcankoylan

Agent-Skills-for-Context-Engineering

Quality
9.0

This collection offers a comprehensive set of Agent Skills focused on context engineering principles for building and optimizing production-grade AI agent systems. It provides structured guidance and practical examples for managing the language model's context window effectively across various agent platforms.

USP

It uniquely combines academic rigor with practical, platform-agnostic skills, employing progressive disclosure to load content only when needed. This ensures efficient context use while providing foundational and advanced patterns for AI a…

Use cases

  • Building new AI agent systems
  • Optimizing existing agent performance
  • Debugging context-related failures
  • Designing multi-agent architectures
  • Implementing BDI cognitive models

Detected files (8)

  • examples/interleaved-thinking/generated_skills/comprehensive-research-agent/SKILL.md (8515 bytes)
    ---
    name: comprehensive-research-agent
    description: "Ensure thorough validation, error recovery, and transparent reasoning in research tasks with multiple tool calls"
    ---
    
    # Comprehensive Research Agent Best Practices
    
    This skill addresses common failures in multi-step research tasks: unhandled tool errors, missing validation, opaque reasoning, and premature conclusions. It provides structured protocols for source validation, error recovery, and thinking transparency that significantly improve research quality and reliability.
    
    ## When to Activate
    
    - Task involves web research with search, read_url, or fetch operations
    - Task requires gathering information from multiple sources
    - Task has explicit requirements for completeness or verification
    - Task includes file operations that need validation (save, write, read)
    - Any research or information-gathering workflow with 3+ tool interactions
    
    ## Core Concepts
    
    - Validation Checkpoints: Explicit verification steps at phase transitions to confirm tool outputs, source relevance, and information completeness before proceeding
    - Error Recovery Protocols: Mandatory acknowledgment and handling of tool failures with fallback strategies rather than silent continuation
    - Source Traceability: Maintaining clear tracking of which sources were actually retrieved vs. referenced from prior knowledge to prevent hallucination
    - Substantive Thinking Blocks: Detailed reasoning traces that document insights, connections, gaps, and decision rationale at each step
    - Cross-Source Validation: Verifying key claims against multiple sources and explicitly noting consensus, contradictions, and information gaps
    
    ## Patterns to Avoid
    
    - **Silent Tool Failure**: A tool call returns an error (404, timeout, invalid URL) but the agent proceeds without acknowledging it, potentially missing critical information. Always log failures and attempt recovery or document the gap.
    - **Vague Completion Claims**: Agent declares 'I have enough information' or 'research is comprehensive' without specifying what was learned, what sources support claims, or what gaps remain. Replace with specific summaries of coverage.
    - **Unvalidated Source Selection**: Agent reads URLs from search results without evaluating relevance, credibility, or recency first. This wastes tool calls on low-quality sources. Always rank and prioritize sources before deep reading.
    - **Generic Thinking Blocks**: Thinking contains only next-action descriptions ('Now I will search for X') without analysis of what was learned, how it connects to the goal, or what questions remain. Thinking should be substantive and reflective.
    - **Verification Method Error**: Using list_directory to verify file creation can produce false negatives due to caching. Always use read_file for actual content verification.
    - **Citation Without Retrieval**: Citing sources (URLs, paper titles) in the final report that were never successfully fetched or read. Track sources explicitly and prohibit citing unretrieved content.
    - **Redundant Tool Calls**: Making overlapping searches or reading sources without tracking what has already been obtained. Maintain a 'found resources' tracker to avoid duplication.
    
    ## Recommended Practices
    
    - **Implement Pre-Reading Source Evaluation**: Before reading URLs, rank search results by relevance, credibility, recency, and authority. Document selection rationale in thinking blocks.
    - **Use Structured Thinking Blocks**: Each thinking block must include: (a) what was learned from the source/action, (b) how it connects to the research goal, (c) any contradictions/gaps identified, (d) strategic decisions made. Avoid generic next-action statements.
    - **Add Mandatory Error Acknowledgment**: When any tool fails, the next thinking block must explicitly address it: note the failure type, propose a recovery strategy (retry, alternative source, or documented gap), and explain the chosen approach.
    - **Create Pre-Completion Validation Checklist**: Before declaring research complete, verify: all required sections have specific evidence, all sources were successfully retrieved, key claims are cross-validated, and gaps are documented.
    - **Implement Cross-Source Validation**: After gathering information from multiple sources, explicitly compare findings. Note where sources agree, where they contradict, and what remains unverified. Use this to assess overall confidence.
    - **Maintain Source Tracking Table**: Create a simple table in thinking showing which URLs were fetched, which failed, and which were used for specific claims. Never cite unretrieved sources (see the sketch after this list).
    - **Use read_file for Verification**: When confirming file writes, use read_file to verify actual content rather than list_directory, which can have caching issues causing false negatives.
    - **Add Explicit Validation Phase**: After reading sources, write a brief synthesis that confirms usefulness, notes relevance to research goals, and identifies remaining gaps before proceeding to the next phase.
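
    A minimal sketch of the source tracking idea in Python, for agents that run inside a scripted harness. The `SourceTracker` class and its field names are illustrative assumptions, not part of the skill itself:

    ```python
    # Hypothetical 'found resources' / source tracking table.
    from dataclasses import dataclass, field

    @dataclass
    class SourceTracker:
        """Track which URLs were fetched, which failed, and which support claims."""
        fetched: dict = field(default_factory=dict)  # url -> content summary
        failed: dict = field(default_factory=dict)   # url -> error type
        claims: dict = field(default_factory=dict)   # claim -> supporting urls

        def record_fetch(self, url: str, summary: str) -> None:
            self.fetched[url] = summary

        def record_failure(self, url: str, error: str) -> None:
            self.failed[url] = error

        def cite(self, claim: str, url: str) -> None:
            # Enforces the 'never cite unretrieved sources' rule.
            if url not in self.fetched:
                raise ValueError(f"Cannot cite {url}: never successfully fetched")
            self.claims.setdefault(claim, []).append(url)
    ```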
    
    ## Guidelines
    
    1. After each tool call, explicitly check for errors in the response and acknowledge failures in the next thinking block with recovery strategy
    2. Before reading URLs, rank sources by relevance/credibility and document selection rationale - never read results without evaluation
    3. Thinking blocks must be 3-5+ sentences minimum and include: what was learned, connections to goal, gaps/contradictions, and next steps
    4. Create a pre-completion checklist verifying: all requirements covered, sources retrieved, claims validated, gaps documented
    5. Maintain source tracking - only cite URLs that were successfully fetched; prohibit citing unretrieved sources
    6. When writing final reports, include 'Limitations & Gaps' section documenting what was attempted but failed or what remains unverified
    7. Use read_file (not list_directory) to verify file content after save operations
    8. Cross-validate key claims across at least 2 sources when possible; explicitly note consensus or contradictions
    9. Track gathered information to avoid redundant searches - implement 'found resources' tracker for multi-phase research
    10. Replace vague 'comprehensive' statements with specific summaries: 'Covered X sources on Y topic; missing Z aspects'
    
    ## Examples
    
    - **Before (Anti-Pattern)**: 'I searched for context engineering and found several results. Now I'll read some URLs and then write the report. I have enough information to proceed.'
    
    **After (Pattern)**: 'Search returned 15 results on context engineering. Evaluating relevance: Liu et al. (2024) appears most authoritative on 'lost in the middle' phenomenon; Anthropic documentation likely has current context window specs; Patel (2023) covers RAG best practices. Ranking these as top 3 priorities. Reading top result first. If the primary source fails (URL error), will try backup search for correct documentation URL and note the gap in final report.'
    - **Before (Anti-Pattern)**: Tool returns 404 error for Anthropic context windows URL. Agent continues without acknowledgment. Later cites 'Claude has 200K context window' without showing source. Final report cites Google Research paper that was never fetched.
    
    **After (Pattern)**: Tool returned 404 for Anthropic URL. Thinking: 'Primary source failed. Fallback: search for alternative Anthropic documentation URL or find archived version. If unavailable, note context window data from secondary sources only and add disclaimer about verification status.' Then: 'Cross-validated Claude context window: Anthropic blog (successfully read) and two developer documentation sources agree on 200K. Confident in this claim.' Source tracking table shows: Anthropic URL (failed, backup used), Blog (success), Dev docs (success).
    
    ---
    
    ## Score Expectations
    
    Complex research tasks with multiple tools (6+) and multi-step reasoning chains typically achieve scores in the **65-75 range**. This is not a limitation of the prompt but reflects:
    
    - Inherent variability in tool outputs affecting reasoning paths
    - Multiple valid approaches leading to different intermediate scores
    - Stochastic nature of long-horizon agent execution
    
    **Focus on relative improvement and pattern elimination** rather than absolute scores. A 5-10% improvement from optimization is significant for complex tasks.
    
    ---
    
    ## Skill Metadata
    
    **Generated**: 2026-01-11
    **Source**: Reasoning Trace Optimizer
    **Optimization Iterations**: 10
    **Best Score Achieved**: 72/100 (iteration 4)
    **Final Score**: 70.0/100
    **Score Improvement**: 67.6 → 70.0 (+3.6%)
    
  • examples/book-sft-pipeline/SKILL.md (14252 bytes)
    ---
    name: book-sft-pipeline
    description: This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
    version: 2.0.0
    ---
    
    # Book SFT Pipeline
    
    A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.
    
    ## When to Activate
    
    Activate this skill when:
    - Building fine-tuning datasets from literary works
    - Creating author-voice or style-transfer models
    - Preparing training data for Tinker or similar SFT platforms
    - Designing text segmentation pipelines for long-form content
    - Training small models (8B or less) on limited data
    
    ## Core Concepts
    
    ### The Three Pillars of Book SFT
    
    **1. Intelligent Segmentation**
    Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.
    
    **2. Diverse Instruction Generation**
    Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.
    
    **3. Style Over Content**
    The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.
    
    ## Pipeline Architecture
    
    ```
    ┌─────────────────────────────────────────────────────────────────┐
    │                    ORCHESTRATOR AGENT                           │
    │  Coordinates pipeline phases, manages state, handles failures   │
    └──────────────────────┬──────────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┬───────────────┐
           ▼               ▼               ▼               ▼
    ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
    │  EXTRACTION  │ │ SEGMENTATION │ │  INSTRUCTION │ │   DATASET    │
    │    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
    │ ePub → Text  │ │ Text → Chunks│ │ Chunks →     │ │ Pairs →      │
    │              │ │ 150-400 words│ │ Prompts      │ │ JSONL        │
    └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                           │
           ┌───────────────┴───────────────┐
           ▼                               ▼
    ┌──────────────┐               ┌──────────────┐
    │   TRAINING   │               │  VALIDATION  │
    │    AGENT     │               │    AGENT     │
    │ LoRA on      │               │ AI detector  │
    │ Tinker       │               │ Originality  │
    └──────────────┘               └──────────────┘
    ```
    
    ## Phase 1: Text Extraction
    
    ### Critical Rules
    1. **Always source ePub over PDF** - OCR errors become learned patterns
    2. **Use paragraph-level extraction** - Extract from `<p>` tags to preserve breaks
    3. **Remove front/back matter** - Copyright and TOC pollute the dataset
    
    ```python
    # Extract text from ePub paragraphs
    from epub2 import EPub
    from bs4 import BeautifulSoup
    
    def extract_epub(path):
        book = EPub(path)
        chapters = []
        for item in book.flow:
            html = book.get_chapter(item.id)
            soup = BeautifulSoup(html, 'html.parser')
            paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
            chapters.append('\n\n'.join(p for p in paragraphs if p))
        return '\n\n'.join(chapters)
    ```
    
    ## Phase 2: Intelligent Segmentation
    
    ### Smaller Chunks + Overlap
    
    Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words).
    
    ```python
    def segment(text, min_words=150, max_words=400):
        paragraphs = text.split('\n\n')
        chunks, buffer, buffer_words = [], [], 0
        
        for para in paragraphs:
            words = len(para.split())
            if buffer_words + words > max_words and buffer_words >= min_words:
                chunks.append('\n\n'.join(buffer))
                # Keep last paragraph for overlap
                buffer = [buffer[-1], para] if buffer else [para]
                buffer_words = sum(len(p.split()) for p in buffer)
            else:
                buffer.append(para)
                buffer_words += words
        
        if buffer:
            chunks.append('\n\n'.join(buffer))
        return chunks
    ```
    
    ### Expected Results
    
    For an 86,000-word book:
    - Old method (250-650 words): ~150 chunks
    - New method (150-400 + overlap): ~300 chunks
    - With 2 variants per chunk: 600+ training examples
    
    ## Phase 3: Diverse Instruction Generation
    
    ### The Key Insight
    
    Using a single prompt template causes memorization. Diverse templates teach the underlying style.
    
    ```python
    SYSTEM_PROMPTS = [
        "You are an expert creative writer capable of emulating specific literary styles.",
        "You are a literary writer with deep knowledge of classic prose styles.",
        "You are a creative writer skilled at emulating distinctive authorial voices.",
        "You write prose that captures the essence of modernist literature.",
        "You are a talented writer who can channel classic American authors.",
    ]
    
    PROMPT_TEMPLATES = [
        "Write a passage in the style of {author}: {desc}",
        "Channel {author}'s voice to write about: {desc}",
        "In {author}'s distinctive prose style, describe: {desc}",
        "Write this scene as {author} would have: {desc}",
        "Using {author}'s repetitive technique, describe: {desc}",
        "Capture the rhythm of {author} in this passage: {desc}",
        "Write like {author}: {desc}",
        "In the voice of {author}, write: {desc}",
        "This is a literary exercise. Write like {author}: {desc}",
        "Can you write in {author}'s style? {desc}",
    ]
    ```
    
    ### Instruction Generation
    
    ```python
    INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
    Focus on: characters present, actions, emotions, setting.
    Do NOT quote the text directly.
    
    Excerpt:
    {text}
    """
    
    # Use a fast, cheap LLM (e.g., Gemini Flash)
    instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
    ```
    
    ## Phase 4: Dataset Construction
    
    ### Message Format
    
    ```json
    {
        "messages": [
            {"role": "system", "content": "You are an expert creative writer..."},
            {"role": "user", "content": "Write in the style of Author: Scene description..."},
            {"role": "assistant", "content": "The actual book text from chunk..."}
        ]
    }
    ```
    
    ### Multiple Variants Per Chunk
    
    ```python
    def build_examples(chunk, instruction, author, variants=2):
        examples = []
        for i in range(variants):
            system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
            template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
            user = template.format(author=author, desc=instruction)
            examples.append({"messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
                {"role": "assistant", "content": chunk.text}
            ]})
        return examples
    ```
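
    To emit the Tinker-compatible JSONL shown in the pipeline diagram, serialize one example per line. A minimal sketch using the standard library (the output file name is illustrative):

    ```python
    import json

    def write_jsonl(examples, path="dataset.jsonl"):
        # One JSON object per line, the usual layout for SFT datasets.
        with open(path, "w", encoding="utf-8") as f:
            for ex in examples:
                f.write(json.dumps(ex, ensure_ascii=False) + "\n")
    ```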
    
    ## Phase 5: LoRA Training on Tinker
    
    ### Configuration
    
    ```python
    CONFIG = {
        "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
        "lora_rank": 32,                      # 352MB adapter
        "learning_rate": 5e-4,                # Higher for LoRA
        "batch_size": 4,
        "epochs": 3,
    }
    ```
    
    ### Why Base Model?
    
    Use **base** (pretrained) models, not instruction-tuned versions:
    - Base models are more malleable for new styles
    - Instruct models have patterns that resist overwriting
    - Style is a low-level pattern that base models capture better
    
    ### Training Loop
    
    ```python
    import tinker
    from tinker import types

    # Create the service client first; assumes Tinker credentials are configured
    # in the environment.
    service_client = tinker.ServiceClient()

    training_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-8B-Base",
        rank=32
    )
    
    for epoch in range(3):
        for batch in batches:
            await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
            await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))
    
    result = await training_client.save_weights_for_sampler_async(name="final")
    ```
    
    ## Phase 6: Validation
    
    ### Modern Scenario Test
    
    Test with scenarios that couldn't exist in the original book:
    
    ```python
    TEST_PROMPTS = [
        "Write about a barista making lattes",
        "Describe lovers communicating through text messages",
        "Write about someone anxious about climate change",
    ]
    ```
    
    If the model applies style markers to modern scenarios, it learned **style**, not **content**.
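
    One way to script this check, assuming a `generate(prompt)` helper that wraps the fine-tuned model and a hand-curated list of style-marker phrases for the target author (both are assumptions, not part of the pipeline):

    ```python
    # `generate` and `style_markers` are hypothetical inputs for illustration.
    def passes_style_check(generate, prompts, style_markers, min_hits=2):
        results = {}
        for prompt in prompts:
            output = generate(prompt).lower()
            hits = [m for m in style_markers if m.lower() in output]
            # Crude proxy: enough marker phrases present -> style transferred.
            results[prompt] = len(hits) >= min_hits
        return results
    ```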
    
    ### Originality Verification
    
    ```bash
    # Search training data for output phrases
    grep "specific phrase from output" dataset.jsonl
    # Should return: No matches
    ```
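
    The same check can be automated by testing whether any long n-gram from a generated sample appears verbatim in the training data. A rough sketch (the n-gram length of 8 is a judgment call):

    ```python
    import json

    def ngrams(text, n=8):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_original(output, dataset_path="dataset.jsonl", n=8):
        sample = ngrams(output, n)
        with open(dataset_path, encoding="utf-8") as f:
            for line in f:
                completion = json.loads(line)["messages"][-1]["content"]
                if sample & ngrams(completion, n):
                    return False  # verbatim n-gram overlap with training data
        return True
    ```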
    
    ### AI Detector Testing
    
    Test outputs with GPTZero, Pangram, or ZeroGPT.
    
    ## Known Issues and Solutions
    
    ### Character Name Leakage
    
    **Symptom**: Model uses original character names in new scenarios.
    **Cause**: Limited name diversity from one book.
    **Solution**: Train on multiple books or add synthetic examples.
    
    ### Model Parrots Exact Phrases
    
    **Symptom**: Outputs contain exact sentences from training data.
    **Cause**: Too few prompt variations or too many epochs.
    **Solution**: Use 15+ templates, limit to 3 epochs.
    
    ### Fragmented Outputs
    
    **Symptom**: Sentences feel incomplete.
    **Cause**: Poor segmentation breaking mid-thought.
    **Solution**: Always break at paragraph boundaries.
    
    ## Guidelines
    
    1. **Always source ePub over PDF** - OCR errors become learned patterns
    2. **Never break mid-sentence** - Boundaries must be grammatically complete
    3. **Use diverse prompts** - 15+ templates, 5+ system prompts
    4. **Use base models** - Not instruct versions
    5. **Use smaller chunks** - 150-400 words for more examples
    6. **Reserve test set** - 50 examples minimum
    7. **Test on modern scenarios** - Proves style transfer vs memorization
    8. **Verify originality** - Grep training data for output phrases
    
    ## Expected Results
    
    | Metric | Value |
    |--------|-------|
    | Training examples | 500-1000 per book |
    | Model | Qwen/Qwen3-8B-Base |
    | LoRA rank | 32 |
    | Adapter size | ~350 MB |
    | Training time | ~15 min |
    | Loss reduction | 90%+ |
    | Style transfer success | ~50% perfect |
    
    ## Cost Estimate
    
    | Component | Cost |
    |-----------|------|
    | LLM (instruction generation) | ~$0.50 |
    | Tinker training (15 min) | ~$1.50 |
    | **Total** | **~$2.00** |
    
    ## Integration with Context Engineering Skills
    
    This example applies several skills from the Agent Skills for Context Engineering collection:
    
    ### project-development
    The pipeline follows the staged, idempotent architecture pattern:
    - **Acquire**: Extract text from ePub
    - **Prepare**: Segment into training chunks
    - **Process**: Generate synthetic instructions
    - **Parse**: Build message format
    - **Render**: Output Tinker-compatible JSONL
    - **Train**: LoRA fine-tuning
    - **Validate**: Modern scenario testing
    
    Each phase is resumable and produces intermediate artifacts for debugging.
    
    ### context-compression
    Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity. Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.
    
    The two-tier strategy mirrors context compression evaluation:
    - Tier 1: Fast, deterministic compression
    - Tier 2: LLM-assisted for edge cases
    
    ### multi-agent-patterns
    The pipeline uses the **supervisor/orchestrator** pattern:
    - Orchestrator coordinates phases and manages state
    - Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
    - Each agent receives only the information needed for its task
    
    This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.
    
    ### evaluation
    Validation follows the **end-state evaluation** pattern:
    - Functional testing: Does output match expected style markers?
    - Originality verification: Is content genuinely generated?
    - External validation: AI detector scores
    
    The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.
    
    ### context-fundamentals
    Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping. Diverse templates force attention across the style patterns themselves.
    
    ## References
    
    Internal references:
    - [Segmentation Strategies](./references/segmentation-strategies.md) - Text chunking patterns
    - [Tinker Format Specification](./references/tinker-format.md) - Datum structure
    - [Tinker API Documentation](./references/tinker.txt) - Full API reference
    
    Related skills from Agent Skills for Context Engineering:
    - project-development - Pipeline architecture patterns
    - context-compression - Compression strategies  
    - multi-agent-patterns - Agent coordination
    - evaluation - Evaluation frameworks
    - context-fundamentals - Attention and information density
    
    External resources:
    - [Research Paper](https://arxiv.org/pdf/2510.13939) - Chakrabarty et al. 2025
    - [Dataset on Hugging Face](https://huggingface.co/datasets/MuratcanKoylan/gertrude-stein-style-sft)
    - [Gertrude Stein Case Study](./examples/gertrude-stein/) - Complete working example
    
    ---
    
    ## Skill Metadata
    
    **Created**: 2025-12-26
    **Last Updated**: 2025-12-28
    **Author**: Muratcan Koylan
    **Version**: 2.0.0
    **Standalone**: Yes (separate from main context-engineering collection)
    
  • SKILL.md (8387 bytes)
    ---
    name: context-engineering-collection
    description: A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems. Use when building, optimizing, or debugging agent systems that require effective context management.
    ---
    
    # Agent Skills for Context Engineering
    
    This collection provides structured guidance for building production-grade AI agent systems through effective context engineering.
    
    ## When to Activate
    
    Activate these skills when:
    - Building new agent systems from scratch
    - Optimizing existing agent performance
    - Debugging context-related failures
    - Designing multi-agent architectures
    - Creating or evaluating tools for agents
    - Implementing memory and persistence layers
    
    ## Skill Map
    
    ### Foundational Context Engineering
    
    **Understanding Context Fundamentals**
    Context is not just prompt text—it is the complete state available to the language model at inference time, including system instructions, tool definitions, retrieved documents, message history, and tool outputs. Effective context engineering means understanding what information truly matters for the task at hand and curating that information for maximum signal-to-noise ratio.
    
    **Recognizing Context Degradation**
    Language models exhibit predictable degradation patterns as context grows: the "lost-in-middle" phenomenon where information in the center of context receives less attention; U-shaped attention curves that prioritize beginning and end; context poisoning when errors compound; and context distraction when irrelevant information overwhelms relevant content.
    
    ### Architectural Patterns
    
    **Multi-Agent Coordination**
    Production multi-agent systems converge on three dominant patterns: supervisor/orchestrator architectures with centralized control, peer-to-peer swarm architectures for flexible handoffs, and hierarchical structures for complex task decomposition. The critical insight is that sub-agents exist primarily to isolate context rather than to simulate organizational roles.
    
    **Memory System Design**
    Memory architectures range from simple scratchpads to sophisticated temporal knowledge graphs. Vector RAG provides semantic retrieval but loses relationship information. Knowledge graphs preserve structure but require more engineering investment. The file-system-as-memory pattern enables just-in-time context loading without stuffing context windows.
    
    **Filesystem-Based Context**
    The filesystem provides a single interface for storing, retrieving, and updating effectively unlimited context. Key patterns include scratch pads for tool output offloading, plan persistence for long-horizon tasks, sub-agent communication via shared files, and dynamic skill loading. Agents use `ls`, `glob`, `grep`, and `read_file` for targeted context discovery, often outperforming semantic search for structural queries.
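
    As an illustration of the scratch-pad pattern, a minimal sketch of offloading large tool outputs to disk and returning a reference instead (the directory and size threshold are illustrative assumptions):

    ```python
    import hashlib
    import pathlib

    SCRATCH = pathlib.Path("/tmp/agent_scratch")  # hypothetical scratch location
    SCRATCH.mkdir(exist_ok=True)

    def offload_if_large(tool_output: str, max_chars: int = 2000) -> str:
        """Return small outputs inline; write large ones to disk as a reference."""
        if len(tool_output) <= max_chars:
            return tool_output
        name = hashlib.sha1(tool_output.encode()).hexdigest()[:12] + ".txt"
        path = SCRATCH / name
        path.write_text(tool_output)
        # The agent can later grep or read this file for just-in-time loading.
        return f"[output offloaded to {path} ({len(tool_output)} chars)]"
    ```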
    
    **Hosted Agent Infrastructure**
    Background coding agents run in remote sandboxed environments rather than on local machines. Key patterns include pre-built environment images refreshed on regular cadence, warm sandbox pools for instant session starts, filesystem snapshots for session persistence, and multiplayer support for collaborative agent sessions. Critical optimizations include allowing file reads before git sync completes (blocking only writes), predictive sandbox warming when users start typing, and self-spawning agents for parallel task execution.
    
    **Tool Design Principles**
    Tools are contracts between deterministic systems and non-deterministic agents. Effective tool design follows the consolidation principle (prefer single comprehensive tools over multiple narrow ones), returns contextual information in errors, supports response format options for token efficiency, and uses clear namespacing.
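
    A sketch of the consolidation principle as a single tool definition with a response-format option. The schema shape is illustrative and not tied to any specific framework:

    ```python
    # One comprehensive, namespaced search tool instead of several narrow ones.
    DOCS_SEARCH_TOOL = {
        "name": "docs_search",
        "description": "Search project documentation and return matching passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "response_format": {  # token-efficiency option
                    "type": "string",
                    "enum": ["concise", "detailed"],
                    "default": "concise",
                },
            },
            "required": ["query"],
        },
    }
    ```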
    
    ### Operational Excellence
    
    **Context Compression**
    When agent sessions exhaust memory, compression becomes mandatory. The correct optimization target is tokens-per-task, not tokens-per-request. Structured summarization with explicit sections for files, decisions, and next steps preserves more useful information than aggressive compression. Artifact trail integrity remains the weakest dimension across all compression methods.
    
    **Context Optimization**
    Techniques include compaction (summarizing context near limits), observation masking (replacing verbose tool outputs with references), prefix caching (reusing KV blocks across requests), and strategic context partitioning (splitting work across sub-agents with isolated contexts).
    
    **Latent Briefing (KV Memory Sharing)**
    Orchestrator-worker systems can waste tokens when the supervisor accumulates a long trajectory but each worker sees only a narrow text slice of it. Latent Briefing compacts the orchestrator trajectory inside the worker model's KV cache using task-guided attention (Attention Matching-style compaction), so workers receive the relevant latent state without a full-text replay. This requires a stack that exposes worker KV state and models that are compatible.
    
    **Evaluation Frameworks**
    Production agent evaluation requires multi-dimensional rubrics covering factual accuracy, completeness, tool efficiency, and process quality. Effective patterns include LLM-as-judge for scalability, human evaluation for edge cases, and end-state evaluation for agents that mutate persistent state.
    
    ### Development Methodology
    
    **Project Development**
    Effective LLM project development begins with task-model fit analysis: validating through manual prototyping that a task is well-suited for LLM processing before building automation. Production pipelines follow staged, idempotent architectures (acquire, prepare, process, parse, render) with file system state management for debugging and caching. Structured output design with explicit format specifications enables reliable parsing. Start with minimal architecture and add complexity only when proven necessary.
    
    ## Core Concepts
    
    The collection is organized around three core themes. First, context fundamentals establish what context is, how attention mechanisms work, and why context quality matters more than quantity. Second, architectural patterns cover the structures and coordination mechanisms that enable effective agent systems. Third, operational excellence addresses the ongoing work of optimizing and evaluating production systems.
    
    ## Practical Guidance
    
    Each skill can be used independently or in combination. Start with fundamentals to establish context management mental models. Branch into architectural patterns based on your system requirements. Reference operational skills when optimizing production systems.
    
    The skills are platform-agnostic and work with Claude Code, Cursor, or any agent framework that supports custom instructions or skill-like constructs.
    
    ## Integration
    
    This collection integrates with itself—skills reference each other and build on shared concepts. The fundamentals skill provides context for all other skills. Architectural skills (multi-agent, memory, tools) can be combined for complex systems. Operational skills (optimization, evaluation) apply to any system built using the foundational and architectural skills.
    
    ## References
    
    Internal skills in this collection:
    - [context-fundamentals](skills/context-fundamentals/SKILL.md)
    - [context-degradation](skills/context-degradation/SKILL.md)
    - [context-compression](skills/context-compression/SKILL.md)
    - [multi-agent-patterns](skills/multi-agent-patterns/SKILL.md)
    - [memory-systems](skills/memory-systems/SKILL.md)
    - [tool-design](skills/tool-design/SKILL.md)
    - [filesystem-context](skills/filesystem-context/SKILL.md)
    - [hosted-agents](skills/hosted-agents/SKILL.md)
    - [context-optimization](skills/context-optimization/SKILL.md)
    - [latent-briefing](skills/latent-briefing/SKILL.md)
    - [evaluation](skills/evaluation/SKILL.md)
    - [advanced-evaluation](skills/advanced-evaluation/SKILL.md)
    - [project-development](skills/project-development/SKILL.md)
    - [bdi-mental-states](skills/bdi-mental-states/SKILL.md)
    
    External resources on context engineering:
    - Research on attention mechanisms and context window limitations
    - Production experience from leading AI labs on agent system design
    - Framework documentation for LangGraph, AutoGen, and CrewAI
    
    ---
    
    ## Skill Metadata
    
    **Created**: 2025-12-20
    **Last Updated**: 2026-04-14
    **Author**: Agent Skills for Context Engineering Contributors
    **Version**: 1.3.0
    
  • skills/advanced-evaluation/SKILL.md (16260 bytes)
    ---
    name: advanced-evaluation
    description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
    ---
    
    # Advanced Evaluation
    
    This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
    
    **Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
    
    ## When to Activate
    
    Activate this skill when:
    
    - Building automated evaluation pipelines for LLM outputs
    - Comparing multiple model responses to select the best one
    - Establishing consistent quality standards across evaluation teams
    - Debugging evaluation systems that show inconsistent results
    - Designing A/B tests for prompt or model changes
    - Creating rubrics for human or automated evaluation
    - Analyzing correlation between automated and human judgments
    
    ## Core Concepts
    
    ### The Evaluation Taxonomy
    
    Select between two primary approaches based on whether ground truth exists:
    
    **Direct Scoring** — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.
    
    **Pairwise Comparison** — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Achieves higher human-judge agreement than direct scoring for preference tasks (Zheng et al., 2023). Watch for position bias and length bias.
    
    ### The Bias Landscape
    
    Mitigate these systematic biases in every evaluation system:
    
    **Position Bias**: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then apply majority vote or consistency check.
    
    **Length Bias**: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.
    
    **Self-Enhancement Bias**: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.
    
    **Verbosity Bias**: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.
    
    **Authority Bias**: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.
    
    ### Metric Selection Framework
    
    Match metrics to the evaluation task structure:
    
    | Task Type | Primary Metrics | Secondary Metrics |
    |-----------|-----------------|-------------------|
    | Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
    | Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
    | Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
    | Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
    
    Prioritize systematic disagreement patterns over absolute agreement rates because a judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
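
    A sketch of validating a judge against human labels with these metrics, using SciPy and scikit-learn (the ratings here are toy data; both lists must score the same items in the same order):

    ```python
    from scipy.stats import spearmanr, kendalltau
    from sklearn.metrics import cohen_kappa_score

    human = [4, 2, 5, 3, 1, 4, 5]  # human ratings on a 1-5 ordinal scale
    judge = [4, 3, 5, 3, 2, 4, 4]  # LLM-judge ratings on the same items

    rho, _ = spearmanr(human, judge)
    tau, _ = kendalltau(human, judge)
    kappa = cohen_kappa_score(human, judge, weights="quadratic")  # weighted for ordinal data

    print(f"Spearman rho={rho:.2f}, Kendall tau={tau:.2f}, weighted kappa={kappa:.2f}")
    ```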
    
    ## Evaluation Approaches
    
    ### Direct Scoring Implementation
    
    Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.
    
    **Criteria Definition Pattern**:
    ```
    Criterion: [Name]
    Description: [What this criterion measures]
    Weight: [Relative importance, 0-1]
    ```
    
    **Scale Calibration** — Choose scale granularity based on rubric detail:
    - 1-3: Binary with neutral option, lowest cognitive load
    - 1-5: Standard Likert, best balance of granularity and reliability
    - 1-10: Use only with detailed per-level rubrics because calibration is harder
    
    **Prompt Structure for Direct Scoring**:
    ```
    You are an expert evaluator assessing response quality.
    
    ## Task
    Evaluate the following response against each criterion.
    
    ## Original Prompt
    {prompt}
    
    ## Response to Evaluate
    {response}
    
    ## Criteria
    {for each criterion: name, description, weight}
    
    ## Instructions
    For each criterion:
    1. Find specific evidence in the response
    2. Score according to the rubric (1-{max} scale)
    3. Justify your score with evidence
    4. Suggest one specific improvement
    
    ## Output Format
    Respond with structured JSON containing scores, justifications, and summary.
    ```
    
    Always require justification before the score in all scoring prompts because research shows this improves reliability by 15-25% compared to score-first approaches.
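
    Once per-criterion scores come back as structured JSON, the weighted overall score follows directly from the criteria weights. A minimal sketch (the result shape mirrors the output format above but is otherwise an assumption):

    ```python
    def overall_score(criterion_results):
        """criterion_results: list of {"name", "score", "weight"} dicts."""
        total_weight = sum(c["weight"] for c in criterion_results)
        weighted = sum(c["score"] * c["weight"] for c in criterion_results)
        return weighted / total_weight

    results = [
        {"name": "Factual Accuracy", "score": 5, "weight": 1.0},
        {"name": "Completeness", "score": 3, "weight": 0.5},
    ]
    print(overall_score(results))  # ~4.33 on the 1-5 scale
    ```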
    
    ### Pairwise Comparison Implementation
    
    Apply position bias mitigation in every pairwise evaluation:
    
    1. First pass: Response A in first position, Response B in second
    2. Second pass: Response B in first position, Response A in second
    3. Consistency check: If passes disagree, return TIE with reduced confidence
    4. Final verdict: Consistent winner with averaged confidence
    
    **Prompt Structure for Pairwise Comparison**:
    ```
    You are an expert evaluator comparing two AI responses.
    
    ## Critical Instructions
    - Do NOT prefer responses because they are longer
    - Do NOT prefer responses based on position (first vs second)
    - Focus ONLY on quality according to the specified criteria
    - Ties are acceptable when responses are genuinely equivalent
    
    ## Original Prompt
    {prompt}
    
    ## Response A
    {response_a}
    
    ## Response B
    {response_b}
    
    ## Comparison Criteria
    {criteria list}
    
    ## Instructions
    1. Analyze each response independently first
    2. Compare them on each criterion
    3. Determine overall winner with confidence level
    
    ## Output Format
    JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
    ```
    
    **Confidence Calibration** — Map confidence to position consistency:
    - Both passes agree: confidence = average of individual confidences
    - Passes disagree: confidence = 0.5, verdict = TIE
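
    Putting the swap protocol and confidence mapping together, assuming a `judge(prompt, first, second)` callable that reports the winner by position ("first"/"second") and a confidence in [0, 1] (the callable is an assumption for illustration):

    ```python
    def pairwise_with_swap(judge, prompt, resp_a, resp_b):
        p1 = judge(prompt, resp_a, resp_b)  # pass 1: A in first position
        p2 = judge(prompt, resp_b, resp_a)  # pass 2: B in first position

        w1 = "A" if p1["winner"] == "first" else "B"
        w2 = "B" if p2["winner"] == "first" else "A"  # map position back to label

        if w1 == w2:
            confidence = (p1["confidence"] + p2["confidence"]) / 2
            return {"winner": w1, "confidence": confidence, "consistent": True}
        # Passes disagree: position bias suspected, fall back to a tie.
        return {"winner": "TIE", "confidence": 0.5, "consistent": False}
    ```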
    
    ### Rubric Generation
    
    Generate rubrics to reduce evaluation variance by 40-60% compared to open-ended scoring.
    
    **Include these rubric components**:
    1. **Level descriptions**: Clear boundaries for each score level
    2. **Characteristics**: Observable features that define each level
    3. **Examples**: Representative text for each level (optional but valuable)
    4. **Edge cases**: Guidance for ambiguous situations
    5. **Scoring guidelines**: General principles for consistent application
    
    **Set strictness calibration** for the use case:
    - **Lenient**: Lower passing bar, appropriate for encouraging iteration
    - **Balanced**: Typical production expectations
    - **Strict**: High standards for safety-critical or high-stakes evaluation
    
    Adapt rubrics to the domain — use domain-specific terminology. A code readability rubric mentions variables, functions, and comments. A medical accuracy rubric references clinical terminology and evidence standards.
    
    ## Practical Guidance
    
    ### Evaluation Pipeline Design
    
    Build production evaluation systems with these layers: Criteria Loader (rubrics + weights) -> Primary Scorer (direct or pairwise) -> Bias Mitigation (position swap, etc.) -> Confidence Scoring (calibration) -> Output (scores + justifications + confidence). See [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) for the full visual layout.
    
    ### Decision Framework: Direct vs. Pairwise
    
    Apply this decision tree:
    
    ```
    Is there an objective ground truth?
    +-- Yes -> Direct Scoring
    |   Examples: factual accuracy, instruction following, format compliance
    |
    +-- No -> Is it a preference or quality judgment?
        +-- Yes -> Pairwise Comparison
        |   Examples: tone, style, persuasiveness, creativity
        |
        +-- No -> Consider reference-based evaluation
            Examples: summarization (compare to source), translation (compare to reference)
    ```
    
    ### Scaling Evaluation
    
    For high-volume evaluation, apply one of these strategies:
    
    1. **Panel of LLMs (PoLL)**: Use multiple models as judges and aggregate votes to reduce individual model bias. More expensive but more reliable for high-stakes decisions.
    
    2. **Hierarchical evaluation**: Use a fast, cheap model for screening and an expensive model for edge cases (see the sketch after this list). Requires calibration of the screening threshold.
    
    3. **Human-in-the-loop**: Automate clear cases and route low-confidence decisions to human review. Design feedback loops to improve automated evaluation over time.
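
    A sketch of strategy 2, hierarchical evaluation, with hypothetical `cheap_judge` and `strong_judge` callables (both assumptions; each returns a verdict carrying a confidence field):

    ```python
    def hierarchical_eval(items, cheap_judge, strong_judge, threshold=0.8):
        verdicts = []
        for item in items:
            screen = cheap_judge(item)  # fast, inexpensive first pass
            if screen["confidence"] >= threshold:
                verdicts.append(screen)  # accept confident screening verdicts
            else:
                verdicts.append(strong_judge(item))  # escalate edge cases
        return verdicts
    ```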
    
    ## Examples
    
    ### Example 1: Direct Scoring for Accuracy
    
    **Input**:
    ```
    Prompt: "What causes seasons on Earth?"
    Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
    different hemispheres receive more direct sunlight at different times of year."
    Criterion: Factual Accuracy (weight: 1.0)
    Scale: 1-5
    ```
    
    **Output**:
    ```json
    {
      "criterion": "Factual Accuracy",
      "score": 5,
      "evidence": [
        "Correctly identifies axial tilt as primary cause",
        "Correctly explains differential sunlight by hemisphere",
        "No factual errors present"
      ],
      "justification": "Response accurately explains the cause of seasons with correct
    scientific reasoning. Both the axial tilt and its effect on sunlight distribution
    are correctly described.",
      "improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
    }
    ```
    
    ### Example 2: Pairwise Comparison with Position Swap
    
    **Input**:
    ```
    Prompt: "Explain machine learning to a beginner"
    Response A: [Technical explanation with jargon]
    Response B: [Simple analogy-based explanation]
    Criteria: ["clarity", "accessibility"]
    ```
    
    **First Pass (A first)**:
    ```json
    { "winner": "B", "confidence": 0.8 }
    ```
    
    **Second Pass (B first)**:
    ```json
    { "winner": "A", "confidence": 0.6 }
    ```
    (Note: the reported winner 'A' refers to the first position, which Response B occupied in this pass)
    
    **Mapped Second Pass**:
    ```json
    { "winner": "B", "confidence": 0.6 }
    ```
    
    **Final Result**:
    ```json
    {
      "winner": "B",
      "confidence": 0.7,
      "positionConsistency": {
        "consistent": true,
        "firstPassWinner": "B",
        "secondPassWinner": "B"
      }
    }
    ```
    
    ### Example 3: Rubric Generation
    
    **Input**:
    ```
    criterionName: "Code Readability"
    criterionDescription: "How easy the code is to understand and maintain"
    domain: "software engineering"
    scale: "1-5"
    strictness: "balanced"
    ```
    
    **Output** (abbreviated):
    ```json
    {
      "levels": [
        {
          "score": 1,
          "label": "Poor",
          "description": "Code is difficult to understand without significant effort",
          "characteristics": [
            "No meaningful variable or function names",
            "No comments or documentation",
            "Deeply nested or convoluted logic"
          ]
        },
        {
          "score": 3,
          "label": "Adequate",
          "description": "Code is understandable with some effort",
          "characteristics": [
            "Most variables have meaningful names",
            "Basic comments present for complex sections",
            "Logic is followable but could be cleaner"
          ]
        },
        {
          "score": 5,
          "label": "Excellent",
          "description": "Code is immediately clear and maintainable",
          "characteristics": [
            "All names are descriptive and consistent",
            "Comprehensive documentation",
            "Clean, modular structure"
          ]
        }
      ],
      "edgeCases": [
        {
          "situation": "Code is well-structured but uses domain-specific abbreviations",
          "guidance": "Score based on readability for domain experts, not general audience"
        }
      ]
    }
    ```
    
    ## Guidelines
    
    1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25%
    
    2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias
    
    3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions
    
    4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective
    
    5. **Include confidence scores** - Calibrate to position consistency and evidence strength
    
    6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance
    
    7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations
    
    8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment
    
    9. **Monitor for systematic bias** - Track disagreement patterns by criterion, response type, model
    
    10. **Design for iteration** - Evaluation systems improve with feedback loops
    
    ## Gotchas
    
    1. **Scoring without justification**: Scores lack grounding and are difficult to debug. Always require evidence-based justification before the score.
    
    2. **Single-pass pairwise comparison**: Position bias corrupts results when positions are not swapped. Always evaluate twice with swapped positions and check consistency.
    
    3. **Overloaded criteria**: Criteria that measure multiple things at once produce unreliable scores. Enforce one criterion = one measurable aspect.
    
    4. **Missing edge case guidance**: Evaluators handle ambiguous cases inconsistently without explicit instructions. Include edge cases in rubrics with clear resolution rules.
    
    5. **Ignoring confidence calibration**: High-confidence wrong judgments are worse than low-confidence ones. Calibrate confidence to position consistency and evidence strength.
    
    6. **Rubric drift**: Rubrics become miscalibrated as quality standards evolve or model capabilities improve. Schedule periodic rubric reviews and re-anchor score levels against fresh human-annotated examples.
    
    7. **Evaluation prompt sensitivity**: Minor wording changes in evaluation prompts (e.g., reordering instructions, changing phrasing) can cause 10-20% score swings. Version-control evaluation prompts and run regression tests before deploying prompt changes.
    
    8. **Uncontrolled length bias**: Longer responses systematically score higher even when conciseness is preferred. Add explicit length-neutrality instructions to evaluation prompts and validate with length-controlled test pairs.
    
    ## Integration
    
    This skill integrates with:
    
    - **context-fundamentals** - Evaluation prompts require effective context structure
    - **tool-design** - Evaluation tools need proper schemas and error handling
    - **context-optimization** - Evaluation prompts can be optimized for token efficiency
    - **evaluation** (foundational) - This skill extends the foundational evaluation concepts
    
    ## References
    
    Internal reference:
    - [LLM-as-Judge Implementation Patterns](./references/implementation-patterns.md) - Read when: building an evaluation pipeline from scratch or integrating LLM judges into CI/CD
    - [Bias Mitigation Techniques](./references/bias-mitigation.md) - Read when: evaluation results show inconsistent or suspicious scoring patterns
    - [Metric Selection Guide](./references/metrics-guide.md) - Read when: choosing statistical metrics to validate evaluation reliability
    - [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) - Read when: designing the architecture of a multi-stage evaluation system
    
    External research:
    - [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) - Read when: surveying the state of the art in LLM evaluation
    - [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685) - Read when: understanding position bias and MT-Bench methodology
    - [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634) - Read when: implementing chain-of-thought evaluation scoring
    - [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926) - Read when: diagnosing systematic bias in evaluation outputs
    
    Related skills in this collection:
    - evaluation - Foundational evaluation concepts
    - context-fundamentals - Context structure for evaluation prompts
    - tool-design - Building evaluation tools
    
    ---
    
    ## Skill Metadata
    
    **Created**: 2025-12-24
    **Last Updated**: 2026-03-17
    **Author**: Agent Skills for Context Engineering Contributors
    **Version**: 2.0.0
    
  • skills/bdi-mental-states/SKILL.md (14620 bytes)
    ---
    name: bdi-mental-states
    description: This skill should be used when the user asks to "model agent mental states", "implement BDI architecture", "create belief-desire-intention models", "transform RDF to beliefs", "build cognitive agent", or mentions BDI ontology, mental state modeling, rational agency, or neuro-symbolic AI integration.
    ---
    
    # BDI Mental State Modeling
    
    Transform external RDF context into agent mental states (beliefs, desires, intentions) using formal BDI ontology patterns. This skill enables agents to reason about context through cognitive architecture, supporting deliberative reasoning, explainability, and semantic interoperability within multi-agent systems.
    
    ## When to Activate
    
    Activate this skill when:
    - Processing external RDF context into agent beliefs about world states
    - Modeling rational agency with perception, deliberation, and action cycles
    - Enabling explainability through traceable reasoning chains
    - Implementing BDI frameworks (SEMAS, JADE, JADEX)
    - Augmenting LLMs with formal cognitive structures (Logic Augmented Generation)
    - Coordinating mental states across multi-agent platforms
    - Tracking temporal evolution of beliefs, desires, and intentions
    - Linking motivational states to action plans
    
    ## Core Concepts
    
    ### Mental Reality Architecture
    
    Separate mental states into two ontological categories because BDI reasoning requires distinguishing what persists from what happens:
    
    **Mental States (Endurants)** -- model these as persistent cognitive attributes that hold over time intervals:
    - `Belief`: Represent what the agent holds true about the world. Ground every belief in a world state reference.
    - `Desire`: Represent what the agent wishes to bring about. Link each desire back to the beliefs that motivate it.
    - `Intention`: Represent what the agent commits to achieving. An intention must fulfil a desire and specify a plan.
    
    **Mental Processes (Perdurants)** -- model these as events that create or modify mental states, because tracking causal transitions enables explainability:
    - `BeliefProcess`: Triggers belief formation/update from perception. Always connect to a generating world state.
    - `DesireProcess`: Generates desires from existing beliefs. Preserves the motivational chain.
    - `IntentionProcess`: Commits to selected desires as actionable intentions.
    
    ### Cognitive Chain Pattern
    
    Wire beliefs, desires, and intentions into directed chains using bidirectional properties (`motivates`/`isMotivatedBy`, `fulfils`/`isFulfilledBy`) because this enables both forward reasoning (what should the agent do?) and backward tracing (why did the agent act?):
    
    ```turtle
    :Belief_store_open a bdi:Belief ;
        rdfs:comment "Store is open" ;
        bdi:motivates :Desire_buy_groceries .
    
    :Desire_buy_groceries a bdi:Desire ;
        rdfs:comment "I desire to buy groceries" ;
        bdi:isMotivatedBy :Belief_store_open .
    
    :Intention_go_shopping a bdi:Intention ;
        rdfs:comment "I will buy groceries" ;
        bdi:fulfils :Desire_buy_groceries ;
        bdi:isSupportedBy :Belief_store_open ;
        bdi:specifies :Plan_shopping .
    ```
    
    ### World State Grounding
    
    Always ground mental states in world state references rather than free-text descriptions, because ungrounded beliefs break semantic querying and cross-agent interoperability:
    
    ```turtle
    :Agent_A a bdi:Agent ;
        bdi:perceives :WorldState_WS1 ;
        bdi:hasMentalState :Belief_B1 .
    
    :WorldState_WS1 a bdi:WorldState ;
        rdfs:comment "Meeting scheduled at 10am in Room 5" ;
        bdi:atTime :TimeInstant_10am .
    
    :Belief_B1 a bdi:Belief ;
        bdi:refersTo :WorldState_WS1 .
    ```
    
    ### Goal-Directed Planning
    
    Connect intentions to plans via `bdi:specifies`, and decompose plans into ordered task sequences using `bdi:precedes`, because this separation allows plan reuse across different intentions while keeping execution order explicit:
    
    ```turtle
    :Intention_I1 bdi:specifies :Plan_P1 .
    
    :Plan_P1 a bdi:Plan ;
        bdi:addresses :Goal_G1 ;
        bdi:beginsWith :Task_T1 ;
        bdi:endsWith :Task_T3 .
    
    :Task_T1 bdi:precedes :Task_T2 .
    :Task_T2 bdi:precedes :Task_T3 .
    ```
    
    ## T2B2T Paradigm
    
    Implement Triples-to-Beliefs-to-Triples as a bidirectional pipeline because agents must both consume external RDF context and produce new RDF assertions. Structure every T2B2T implementation in two explicit phases:
    
    **Phase 1: Triples-to-Beliefs** -- Translate incoming RDF triples into belief instances. Use `bdi:triggers` to connect the external world state to a `BeliefProcess`, and `bdi:generates` to produce the resulting belief. This preserves provenance from source data through to internal cognition:
    ```turtle
    :WorldState_notification a bdi:WorldState ;
        rdfs:comment "Push notification: Payment request $250" ;
        bdi:triggers :BeliefProcess_BP1 .
    
    :BeliefProcess_BP1 a bdi:BeliefProcess ;
        bdi:generates :Belief_payment_request .
    ```
    
    **Phase 2: Beliefs-to-Triples** -- After BDI deliberation selects an intention and executes a plan, project the results back into RDF using `bdi:bringsAbout`. This closes the loop so downstream systems can consume agent outputs as standard linked data:
    ```turtle
    :Intention_pay a bdi:Intention ;
        bdi:specifies :Plan_payment .
    
    :PlanExecution_PE1 a bdi:PlanExecution ;
        bdi:satisfies :Plan_payment ;
        bdi:bringsAbout :WorldState_payment_complete .
    ```
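
    Taken together, the two phases form a loop that can be sketched in Python. Everything on the `agent` object below (`raise_beliefs`, `deliberate`, `execute`) is a hypothetical interface standing in for a concrete BDI engine:

    ```python
    from rdflib import Graph

    def t2b2t_cycle(incoming_turtle: str, agent) -> str:
        """One Triples-to-Beliefs-to-Triples cycle (illustrative sketch)."""
        # Phase 1: parse external RDF and raise beliefs from the world states
        world = Graph().parse(data=incoming_turtle, format="turtle")
        beliefs = agent.raise_beliefs(world)

        # Deliberation: select desires and commit to an intention
        intention = agent.deliberate(beliefs)

        # Phase 2: execute the intention's plan and project the brought-about
        # world states back into RDF for downstream consumers
        result_graph = agent.execute(intention)
        return result_graph.serialize(format="turtle")
    ```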
    
    ## Notation Selection by Level
    
    Choose notation based on the C4 abstraction level being modeled, because mixing notations at the wrong level obscures rather than clarifies the cognitive architecture:
    
    | C4 Level | Notation | Mental State Representation |
    |----------|----------|----------------------------|
    | L1 Context | ArchiMate | Agent boundaries, external perception sources |
    | L2 Container | ArchiMate | BDI reasoning engine, belief store, plan executor |
    | L3 Component | UML | Mental state managers, process handlers |
    | L4 Code | UML/RDF | Belief/Desire/Intention classes, ontology instances |
    
    ## Justification and Explainability
    
    Attach `bdi:Justification` instances to every mental entity using `bdi:isJustifiedBy`, because unjustified mental states make agent reasoning opaque and untraceable. Each justification should capture the evidence or rule that produced the mental state:
    
    ```turtle
    :Belief_B1 a bdi:Belief ;
        bdi:isJustifiedBy :Justification_J1 .
    
    :Justification_J1 a bdi:Justification ;
        rdfs:comment "Official announcement received via email" .
    
    :Intention_I1 a bdi:Intention ;
        bdi:isJustifiedBy :Justification_J2 .
    
    :Justification_J2 a bdi:Justification ;
        rdfs:comment "Location precondition satisfied" .
    ```
    
    ## Temporal Dimensions
    
    Assign validity intervals to every mental state using `bdi:hasValidity` with `TimeInterval` instances, because beliefs without temporal bounds cannot be garbage-collected or conflict-checked during diachronic reasoning:
    
    ```turtle
    :Belief_B1 a bdi:Belief ;
        bdi:hasValidity :TimeInterval_TI1 .
    
    :TimeInterval_TI1 a bdi:TimeInterval ;
        bdi:hasStartTime :TimeInstant_9am ;
        bdi:hasEndTime :TimeInstant_11am .
    ```
    
    Query mental states active at a specific moment using SPARQL temporal filters. Use this pattern to resolve conflicts when multiple beliefs about the same world state overlap in time:
    
    ```sparql
    SELECT ?mentalState WHERE {
        ?mentalState bdi:hasValidity ?interval .
        ?interval bdi:hasStartTime ?start ;
                  bdi:hasEndTime ?end .
        FILTER(?start <= "2025-01-04T10:00:00"^^xsd:dateTime &&
               ?end >= "2025-01-04T10:00:00"^^xsd:dateTime)
    }
    ```
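
    A minimal rdflib sketch for running this query; `initNs` supplies the `bdi:` and `xsd:` prefixes the bare query omits. The file name and namespace IRI are placeholders, and the sketch assumes start and end times are stored as `xsd:dateTime` literals:

    ```python
    from rdflib import Graph, Namespace
    from rdflib.namespace import XSD

    BDI = Namespace("http://example.org/bdi#")  # placeholder namespace IRI
    g = Graph().parse("mental_states.ttl", format="turtle")  # placeholder file

    query = """
    SELECT ?mentalState WHERE {
        ?mentalState bdi:hasValidity ?interval .
        ?interval bdi:hasStartTime ?start ;
                  bdi:hasEndTime ?end .
        FILTER(?start <= "2025-01-04T10:00:00"^^xsd:dateTime &&
               ?end >= "2025-01-04T10:00:00"^^xsd:dateTime)
    }
    """

    # Bind the prefixes at query time instead of editing the query text
    for row in g.query(query, initNs={"bdi": BDI, "xsd": XSD}):
        print(row.mentalState)
    ```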
    
    ## Compositional Mental Entities
    
    Decompose complex beliefs into constituent parts using `bdi:hasPart` relations, because monolithic beliefs force full replacement on partial updates. Structure composite beliefs so that each sub-belief can be independently updated, queried, or invalidated:
    
    ```turtle
    :Belief_meeting a bdi:Belief ;
        rdfs:comment "Meeting at 10am in Room 5" ;
        bdi:hasPart :Belief_meeting_time , :Belief_meeting_location .
    
    # Update only location component without touching time
    :BeliefProcess_update a bdi:BeliefProcess ;
        bdi:modifies :Belief_meeting_location .
    ```
    
    ## Integration Patterns
    
    ### Logic Augmented Generation (LAG)
    
    Use LAG to constrain LLM outputs with ontological structure, because unconstrained generation produces triples that violate BDI class restrictions. Serialize the ontology into the prompt context, then validate generated triples against it before accepting them:
    
    ```python
    def augment_llm_with_bdi_ontology(prompt, ontology_graph):
        # Serialize the ontology so the model sees the valid BDI classes
        # and properties before generating any triples
        ontology_context = serialize_ontology(ontology_graph, format='turtle')
        augmented_prompt = f"{ontology_context}\n\n{prompt}"

        response = llm.generate(augmented_prompt)
        triples = extract_rdf_triples(response)

        # Accept only triples consistent with the ontology; otherwise retry,
        # feeding the rejected triples back as corrective context
        is_consistent = validate_triples(triples, ontology_graph)
        return triples if is_consistent else retry_with_feedback(prompt, triples)
    ```
    
    ### SEMAS Rule Translation
    
    Translate BDI ontology patterns into executable production rules when deploying to rule-based agent platforms. Map each cognitive chain link (belief-to-desire, desire-to-intention) to a HEAD/CONDITIONALS/TAIL rule, because this preserves the deliberative semantics while enabling runtime execution:
    
    ```prolog
    % Belief triggers desire formation
    [HEAD: belief(agent_a, store_open)] /
    [CONDITIONALS: time(weekday_afternoon)] »
    [TAIL: generate_desire(agent_a, buy_groceries)].
    
    % Desire triggers intention commitment
    [HEAD: desire(agent_a, buy_groceries)] /
    [CONDITIONALS: belief(agent_a, has_shopping_list)] »
    [TAIL: commit_intention(agent_a, buy_groceries)].
    ```
    
    ## Guidelines
    
    1. Model world states as configurations independent of agent perspectives, providing referential substrate for mental states.
    
    2. Distinguish endurants (persistent mental states) from perdurants (temporal mental processes), aligning with DOLCE ontology.
    
    3. Treat goals as descriptions rather than mental states, maintaining separation between cognitive and planning layers.
    
    4. Use `hasPart` relations for meronymic structures enabling selective belief updates.
    
    5. Associate every mental entity with temporal constructs via `atTime` or `hasValidity`.
    
    6. Use bidirectional property pairs (`motivates`/`isMotivatedBy`, `generates`/`isGeneratedBy`) for flexible querying.
    
    7. Link mental entities to `Justification` instances for explainability and trust.
    
    8. Implement T2B2T through: (1) translate RDF to beliefs, (2) execute BDI reasoning, (3) project mental states back to RDF.
    
    9. Define existential restrictions on mental processes (e.g., `BeliefProcess ⊑ ∃generates.Belief`); see the sketch after this list.
    
    10. Reuse established ODPs (EventCore, Situation, TimeIndexedSituation, BasicPlan, Provenance) for interoperability.
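
    Guideline 9's restriction, expressed as OWL triples via rdflib (a sketch; the namespace IRI is a placeholder):

    ```python
    from rdflib import Graph, Namespace, BNode
    from rdflib.namespace import OWL, RDF, RDFS

    BDI = Namespace("http://example.org/bdi#")  # placeholder namespace IRI

    g = Graph()
    g.bind("bdi", BDI)

    # BeliefProcess ⊑ ∃generates.Belief as an anonymous owl:Restriction
    restriction = BNode()
    g.add((restriction, RDF.type, OWL.Restriction))
    g.add((restriction, OWL.onProperty, BDI.generates))
    g.add((restriction, OWL.someValuesFrom, BDI.Belief))
    g.add((BDI.BeliefProcess, RDFS.subClassOf, restriction))

    print(g.serialize(format="turtle"))
    ```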
    
    ## Competency Questions
    
    Validate implementation against these SPARQL queries:
    
    ```sparql
    # CQ1: What beliefs motivated formation of a given desire?
    SELECT ?belief WHERE {
        :Desire_D1 bdi:isMotivatedBy ?belief .
    }
    
    # CQ2: Which desire does a particular intention fulfil?
    SELECT ?desire WHERE {
        :Intention_I1 bdi:fulfils ?desire .
    }
    
    # CQ3: Which mental process generated a belief?
    SELECT ?process WHERE {
        ?process bdi:generates :Belief_B1 .
    }
    
    # CQ4: What is the ordered sequence of tasks in a plan?
    SELECT ?task ?nextTask WHERE {
        :Plan_P1 bdi:hasComponent ?task .
        OPTIONAL { ?task bdi:precedes ?nextTask }
    } ORDER BY ?task
    ```
    
    ## Gotchas
    
    1. **Conflating mental states with world states**: Mental states reference world states via `bdi:refersTo`; they are not world states themselves. Mixing them collapses the perception-cognition boundary and breaks SPARQL queries that filter by type.
    
    2. **Missing temporal bounds**: Every mental state needs validity intervals for diachronic reasoning. Without them, stale beliefs persist indefinitely and conflict detection becomes impossible.
    
    3. **Flat belief structures**: Use compositional modeling with `hasPart` for complex beliefs. Monolithic beliefs force full replacement when only one attribute changes.
    
    4. **Implicit justifications**: Always link mental entities to explicit `Justification` instances. Unjustified mental states cannot be audited or traced.
    
    5. **Direct intention-to-action mapping**: Intentions specify plans which contain tasks; actions execute tasks. Skipping the plan layer removes the ability to reuse, reorder, or share execution strategies.
    
    6. **Ontology over-complexity**: Start with 5-10 core classes and properties (Belief, Desire, Intention, WorldState, Plan, plus key relations). Expanding the ontology prematurely inflates prompt context and slows SPARQL queries without improving reasoning quality.
    
    7. **Reasoning cost explosion**: Keep belief chains to 3 levels or fewer (belief -> desire -> intention). Deeper chains become prohibitively expensive for LLM inference and rarely improve decision quality over shallower alternatives.
    
    ## Integration
    
    - **RDF Processing**: Apply after parsing external RDF context to construct cognitive representations
    - **Semantic Reasoning**: Combine with ontology reasoning to infer implicit mental state relationships
    - **Multi-Agent Communication**: Integrate with FIPA ACL for cross-platform belief sharing
    - **Temporal Context**: Coordinate with temporal reasoning for mental state evolution
    - **Explainable AI**: Feed into explanation systems tracing perception through deliberation to action
    - **Neuro-Symbolic AI**: Apply in LAG pipelines to constrain LLM outputs with cognitive structures
    
    ## References
    
    Internal references:
    - [BDI Ontology Core](./references/bdi-ontology-core.md) - Read when: implementing BDI class hierarchies or defining ontology properties from scratch
    - [RDF Examples](./references/rdf-examples.md) - Read when: writing Turtle serializations of mental states or debugging triple structure
    - [SPARQL Competency Queries](./references/sparql-competency.md) - Read when: validating an implementation against competency questions or building custom queries
    - [Framework Integration](./references/framework-integration.md) - Read when: deploying BDI models to SEMAS, JADE, or LAG pipelines
    
    Primary sources:
    - Zuppiroli et al. "The Belief-Desire-Intention Ontology" (2025) — Read when: implementing formal BDI class hierarchies or validating ontology alignment
    - Rao & Georgeff "BDI agents: From theory to practice" (1995) — Read when: understanding the theoretical foundations of practical reasoning agents
    - Bratman "Intention, plans, and practical reason" (1987) — Read when: grounding implementation decisions in the philosophical basis of intentionality
    
    ---
    
    ## Skill Metadata
    
    **Created**: 2026-01-07
    **Last Updated**: 2026-03-17
    **Author**: Agent Skills for Context Engineering Contributors
    **Version**: 2.0.0
    
  • examples/digital-brain-skill/SKILL.md (skill, 7043 bytes)
    ---
    name: digital-brain
    description: This skill should be used when the user asks to "write a post", "check my voice", "look up contact", "prepare for meeting", "weekly review", "track goals", or mentions personal brand, content creation, network management, or voice consistency.
    version: 1.0.0
    ---
    
    # Digital Brain
    
    A structured personal operating system for managing digital presence, knowledge, relationships, and goals with AI assistance. Designed for founders building in public, content creators growing their audience, and tech-savvy professionals seeking AI-assisted personal management.
    
    **Important**: This skill uses progressive disclosure. Module-specific instructions are in each subdirectory's `.md` file. Only load what's needed for the current task.
    
    ## When to Activate
    
    Activate this skill when the user:
    
    - Requests content creation (posts, threads, newsletters) - load identity/voice.md first
    - Asks for help with personal brand or positioning
    - Needs to look up or manage contacts/relationships
    - Wants to capture or develop content ideas
    - Requests meeting preparation or follow-up
    - Asks for weekly reviews or goal tracking
    - Needs to save or retrieve bookmarked resources
    - Wants to organize research or learning materials
    
    **Trigger phrases**: "write a post", "my voice", "content ideas", "who is [name]", "prepare for meeting", "weekly review", "save this", "my goals"
    
    ## Core Concepts
    
    ### Progressive Disclosure Architecture
    
    The Digital Brain follows a three-level loading pattern:
    
    | Level | When Loaded | Content |
    |-------|-------------|---------|
    | **L1: Metadata** | Always | This SKILL.md overview |
    | **L2: Module Instructions** | On-demand | `[module]/[MODULE].md` files |
    | **L3: Data Files** | As-needed | `.jsonl`, `.yaml`, `.md` data |
    
    ### File Format Strategy
    
    Formats chosen for optimal agent parsing:
    
    - **JSONL** (`.jsonl`): Append-only logs - ideas, posts, contacts, interactions
    - **YAML** (`.yaml`): Structured configs - goals, values, circles
    - **Markdown** (`.md`): Narrative content - voice, brand, calendar, todos
    - **XML** (`.xml`): Complex prompts - content generation templates
    
    ### Append-Only Data Integrity
    
    JSONL files are **append-only**. Never delete entries (see the sketch after this list):
    - Mark as `"status": "archived"` instead of deleting
    - Preserves history for pattern analysis
    - Enables "what worked" retrospectives
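
    A minimal sketch of the archive-instead-of-delete discipline (the `id` and `status` field names are assumptions about the JSONL schema):

    ```python
    import json

    def archive_entry(path: str, entry_id: str) -> None:
        """Append an archival record rather than rewriting the file."""
        record = {"id": entry_id, "status": "archived"}
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def active_entries(path: str) -> list:
        """Replay the log; the latest record per id wins."""
        latest = {}
        with open(path) as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    latest[record["id"]] = record
        return [r for r in latest.values() if r.get("status") != "archived"]
    ```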
    
    ## Detailed Topics
    
    ### Module Overview
    
    ```
    digital-brain/
    ├── identity/     → Voice, brand, values (READ FIRST for content)
    ├── content/      → Ideas, drafts, posts, calendar
    ├── knowledge/    → Bookmarks, research, learning
    ├── network/      → Contacts, interactions, intros
    ├── operations/   → Todos, goals, meetings, metrics
    └── agents/       → Automation scripts
    ```
    
    ### Identity Module (Critical for Content)
    
    **Always read `identity/voice.md` before generating any content.**
    
    Contains:
    - `voice.md` - Tone, style, vocabulary, patterns
    - `brand.md` - Positioning, audience, content pillars
    - `values.yaml` - Core beliefs and principles
    - `bio-variants.md` - Platform-specific bios
    - `prompts/` - Reusable generation templates
    
    ### Content Module
    
    Pipeline: `ideas.jsonl` → `drafts/` → `posts.jsonl`
    
    - Capture ideas immediately to `ideas.jsonl`
    - Develop in `drafts/` using `templates/`
    - Log published content to `posts.jsonl` with metrics
    - Plan in `calendar.md`
    
    ### Network Module
    
    Personal CRM with relationship tiers:
    - `inner` - Weekly touchpoints
    - `active` - Bi-weekly touchpoints
    - `network` - Monthly touchpoints
    - `dormant` - Quarterly reactivation checks
    
    ### Operations Module
    
    Productivity system with priority levels:
    - P0: Do today, blocking
    - P1: This week, important
    - P2: This month, valuable
    - P3: Backlog, nice to have
    
    ## Practical Guidance
    
    ### Content Creation Workflow
    
    ```
    1. Read identity/voice.md (REQUIRED)
    2. Check identity/brand.md for topic alignment
    3. Reference content/posts.jsonl for successful patterns
    4. Use content/templates/ as starting structure
    5. Draft matching voice attributes
    6. Log to posts.jsonl after publishing
    ```
    
    ### Pre-Meeting Preparation
    
    ```
    1. Look up contact: network/contacts.jsonl
    2. Get history: network/interactions.jsonl
    3. Check pending: operations/todos.md
    4. Generate brief with context
    ```
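
    A minimal sketch of that lookup (field names such as `name` and `contact` are assumptions about the JSONL schemas):

    ```python
    import json

    def load_jsonl(path: str) -> list:
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def meeting_brief(name: str) -> dict:
        """Compile a pre-meeting brief from the contact and interaction logs."""
        contacts = load_jsonl("network/contacts.jsonl")
        interactions = load_jsonl("network/interactions.jsonl")
        contact = next((c for c in contacts
                        if name.lower() in c.get("name", "").lower()), None)
        history = [i for i in interactions if i.get("contact") == name]
        return {"contact": contact, "recent_interactions": history[-3:]}
    ```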
    
    ### Weekly Review Process
    
    ```
    1. Run: python agents/scripts/weekly_review.py
    2. Review metrics in operations/metrics.jsonl
    3. Check stale contacts: agents/scripts/stale_contacts.py
    4. Update goals progress in operations/goals.yaml
    5. Plan next week in content/calendar.md
    ```
    
    ## Examples
    
    ### Example: Writing an X Post
    
    **Input**: "Help me write a post about AI agents"
    
    **Process**:
    1. Read `identity/voice.md` → Extract voice attributes
    2. Check `identity/brand.md` → Confirm "ai_agents" is a content pillar
    3. Reference `content/posts.jsonl` → Find similar successful posts
    4. Draft post matching voice patterns
    5. Suggest adding to `content/ideas.jsonl` if not publishing immediately
    
    **Output**: Post draft in user's authentic voice with platform-appropriate format.
    
    ### Example: Contact Lookup
    
    **Input**: "Prepare me for my call with Sarah Chen"
    
    **Process**:
    1. Search `network/contacts.jsonl` for "Sarah Chen"
    2. Get recent entries from `network/interactions.jsonl`
    3. Check `operations/todos.md` for pending items with Sarah
    4. Compile brief: role, context, last discussed, follow-ups
    
    **Output**: Pre-meeting brief with relationship context.
    
    ## Guidelines
    
    1. **Voice First**: Always read `identity/voice.md` before any content generation
    2. **Append Only**: Never delete from JSONL files - archive instead
    3. **Update Timestamps**: Set `updated` field when modifying tracked data
    4. **Cross-Reference**: Knowledge informs content, network informs operations
    5. **Log Interactions**: Always log meetings/calls to `interactions.jsonl`
    6. **Preserve History**: Past content in `posts.jsonl` informs future performance
    
    ## Integration
    
    This skill integrates context engineering principles:
    
    - **context-fundamentals** - Progressive disclosure, attention budget management
    - **memory-systems** - JSONL for persistent memory, structured recall
    - **tool-design** - Scripts in `agents/scripts/` follow tool design principles
    - **context-optimization** - Module separation prevents context bloat
    
    ## References
    
    Internal references:
    - [Identity Module](./identity/IDENTITY.md) - Voice and brand details
    - [Content Module](./content/CONTENT.md) - Content pipeline docs
    - [Network Module](./network/NETWORK.md) - CRM documentation
    - [Operations Module](./operations/OPERATIONS.md) - Productivity system
    - [Agent Scripts](./agents/AGENTS.md) - Automation documentation
    
    External resources:
    - [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering)
    - [Anthropic Context Engineering Guide](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
    
    ---
    
    ## Skill Metadata
    
    **Created**: 2024-12-29
    **Last Updated**: 2024-12-29
    **Author**: Murat Can Koylan
    **Version**: 1.0.0
    
  • examples/interleaved-thinking/SKILL.md (skill, 6430 bytes)
    ---
    name: reasoning-trace-optimizer
    description: "Debug and optimize AI agents by analyzing reasoning traces. Activates on 'debug agent', 'optimize prompt', 'analyze reasoning', 'why did the agent fail', 'improve agent performance', or when diagnosing agent failures and context degradation."
    ---
    
    # Reasoning Trace Optimizer
    
    Debug and optimize AI agents by analyzing their reasoning traces. This skill uses MiniMax M2.1's interleaved thinking to provide deep insight into agent decision-making and generate concrete improvements.
    
    ## When to Activate
    
    - User asks to "debug agent", "analyze reasoning", or "optimize prompt"
    - Agent task fails and user wants to understand why
    - User mentions "context degradation", "tool confusion", or "instruction drift"
    - Request to improve agent performance or reduce errors
    - User wants to generate shareable learnings from debugging sessions
    - After repeated failures on similar tasks
    
    ## Core Concepts
    
    ### Interleaved Thinking
    
    Unlike standard reasoning models that think once at the start, interleaved thinking allows reasoning BETWEEN each tool interaction. This is critical because:
    
    1. **Long-horizon tasks** require maintaining focus across many turns
    2. **External perturbations** (tool outputs, environment changes) need real-time adaptation
    3. **Debugging** requires seeing HOW decisions were made, not just WHAT was output
    
    ### The Optimization Loop
    
    ```
    Execute Agent → Capture Traces → Analyze Patterns → Optimize Prompt → Re-run
                                                              ↑____________|
    ```
    
    Each iteration improves the prompt based on detected patterns until convergence.
    
    ### Pattern Detection
    
    Common failure patterns the analyzer detects (one heuristic is sketched after the table):
    
    | Pattern | Description |
    |---------|-------------|
    | `context_degradation` | Model loses track of information over long contexts |
    | `tool_confusion` | Model misunderstands tool capabilities or outputs |
    | `instruction_drift` | Model gradually deviates from original instructions |
    | `goal_abandonment` | Model stops pursuing the original goal |
    | `circular_reasoning` | Model repeats similar actions without progress |
    | `premature_conclusion` | Model concludes before completing the task |
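
    One such heuristic, sketched for illustration (the real `TraceAnalyzer` may use different signals):

    ```python
    from collections import Counter

    def detect_circular_reasoning(tool_calls: list, threshold: int = 3):
        """Flag circular_reasoning when an identical (tool, args) call repeats.

        tool_calls: list of (tool_name, serialized_args) tuples from a trace.
        """
        counts = Counter(tool_calls)
        repeats = {call: n for call, n in counts.items() if n >= threshold}
        return bool(repeats), repeats
    ```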
    
    ## Usage Modes
    
    ### Mode 1: M2.1 Agent Debugging
    
    Run a task through M2.1 and analyze its reasoning:
    
    ```python
    from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer
    
    capture = TraceCapture()
    trace = capture.run(
        task="Search for Python tutorials and summarize them",
        system_prompt="You are a research assistant.",
        tools=[search_tool],
        tool_executor=execute_search
    )
    
    analyzer = TraceAnalyzer()
    analysis = analyzer.analyze(trace)
    
    print(f"Score: {analysis.overall_score}/100")
    for pattern in analysis.patterns:
        print(f"Found: {pattern.type.value} - {pattern.suggestion}")
    ```
    
    ### Mode 2: Full Optimization Loop
    
    Automatically iterate until the prompt is optimized:
    
    ```python
    from reasoning_trace_optimizer import OptimizationLoop, LoopConfig
    
    config = LoopConfig(
        max_iterations=5,
        min_score_threshold=80.0,
    )
    
    loop = OptimizationLoop(config=config)
    result = loop.run(
        task="Analyze this codebase and suggest improvements",
        initial_prompt="You are a code reviewer.",
        tools=[read_file_tool, search_tool],
        tool_executor=execute_tool
    )
    
    print(f"Improved: {result.initial_score} → {result.final_score}")
    print(f"Final prompt:\n{result.final_prompt}")
    ```
    
    ### Mode 3: Universal Session Analysis
    
    Analyze any agent's previous thinking (works with Claude, GPT, etc.):
    
    When this skill is activated in Claude Code, it can analyze the current session's thinking blocks to identify issues and suggest improvements.
    
    ```
    /reasoning-trace-optimizer analyze-session
    ```
    
    ### Mode 4: Generate Shareable Skills
    
    Convert optimization learnings into reusable Agent Skills:
    
    ```python
    from reasoning_trace_optimizer import SkillGenerator
    
    generator = SkillGenerator()
    skill_path = generator.generate(
        result=loop_result,
        skill_name="web-search-best-practices",
        output_dir="./skills"
    )
    ```
    
    ## CLI Commands
    
    ```bash
    # Capture reasoning trace
    rto capture "Search for Python tutorials" -s "You are a helpful assistant."
    
    # Analyze a task
    rto analyze "Debug this code" -o analysis.txt
    
    # Run optimization loop
    rto optimize "Research AI papers" --max-iterations 5 --generate-skill
    
    # Generate skill from artifacts
    rto generate-skill my-skill-name --artifacts-dir ./optimization_artifacts
    ```
    
    ## Integration with Claude Code
    
    ### Auto-trigger on Failure
    
    Add to your hooks to automatically analyze failures:
    
    ```json
    {
      "hooks": {
        "post_tool_error": {
          "command": "rto analyze-session --last-error"
        }
      }
    }
    ```
    
    ### On-demand Analysis
    
    Use the slash command to analyze current session:
    
    ```
    /reasoning-trace-optimizer
    ```
    
    This will:
    1. Extract thinking blocks from the current session
    2. Identify patterns and issues
    3. Suggest prompt improvements
    4. Optionally update the system prompt
    
    ## Guidelines
    
    1. **Preserve full context**: M2.1 requires full response history including thinking blocks for optimal performance
    2. **Use appropriate tools**: Define tools clearly with unambiguous descriptions
    3. **Set realistic convergence thresholds**: 5-10% improvement per iteration is typical
    4. **Review generated skills**: Auto-generated skills should be reviewed before sharing
    5. **Monitor token usage**: Each optimization iteration uses significant tokens
    
    ## Examples
    
    ### Before Optimization
    
    ```
    System: You are a helpful assistant.
    
    Issue: Agent called wrong tools, lost track of goal after 3 turns
    Score: 45/100
    Patterns: tool_confusion, goal_abandonment
    ```
    
    ### After Optimization
    
    ```
    System: You are a research assistant focused on finding accurate information.
    
    IMPORTANT GUIDELINES:
    - Always verify search results before summarizing
    - If a tool returns an error, try an alternative approach
    - Keep track of your original goal throughout the task
    - Validate findings against multiple sources when possible
    
    Issue: None
    Score: 85/100
    Patterns: None detected
    ```
    
    ## References
    
    - MiniMax M2.1 Documentation: https://platform.minimax.io/docs
    - Interleaved Thinking Guide: See `docs/interleavedthinking.md`
    - Agent Generalization: See `docs/agentthinking.md`
    
    ---
    
    ## Skill Metadata
    
    **Created**: 2025-01-11
    **Author**: Muratcan Koylan
    **Version**: 0.1.0
    **Powered by**: MiniMax M2.1
    **Partnership**: Built in collaboration with MiniMax AI
    
  • .claude-plugin/marketplace.json (marketplace, 1351 bytes)
    {
      "name": "context-engineering-marketplace",
      "owner": {
        "name": "Muratcan Koylan",
        "email": "muratcan.koylan@outlook.com"
      },
      "metadata": {
        "description": "Context Engineering skills for building production-grade AI agent systems",
        "version": "2.1.0"
      },
      "plugins": [
        {
          "name": "context-engineering",
          "description": "Comprehensive context engineering skills for building production-grade AI agent systems — covering fundamentals, degradation patterns, compression, optimization, latent briefing (KV sharing between agents), multi-agent coordination, memory systems, tool design, filesystem context, hosted agents, evaluation, advanced evaluation, project development, and cognitive architecture",
          "source": "./",
          "strict": false,
          "skills": [
            "./skills/context-fundamentals",
            "./skills/context-degradation",
            "./skills/context-compression",
            "./skills/context-optimization",
            "./skills/multi-agent-patterns",
            "./skills/memory-systems",
            "./skills/tool-design",
            "./skills/filesystem-context",
            "./skills/hosted-agents",
            "./skills/evaluation",
            "./skills/advanced-evaluation",
            "./skills/project-development",
            "./skills/bdi-mental-states",
            "./skills/latent-briefing"
          ]
        }
      ]
    }
    

README

Agent Skills for Context Engineering

A comprehensive, open collection of Agent Skills focused on context engineering principles for building production-grade AI agent systems. These skills teach the art and science of curating context to maximize agent effectiveness across any agent platform.

DeepWiki: Learn more here

What is Context Engineering?

Context engineering is the discipline of managing the language model's context window. Unlike prompt engineering, which focuses on crafting effective instructions, context engineering addresses the holistic curation of all information that enters the model's limited attention budget: system prompts, tool definitions, retrieved documents, message history, and tool outputs.

The fundamental challenge is that context windows are constrained not by raw token capacity but by attention mechanics. As context length increases, models exhibit predictable degradation patterns: the "lost-in-the-middle" phenomenon, U-shaped attention curves, and attention scarcity. Effective context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of desired outcomes.

Recognition

This repository is cited in academic research as foundational work on static skill architecture:

"While static skills are well-recognized [Anthropic, 2025b; Muratcan Koylan, 2025], MCE is among the first to dynamically evolve them, bridging manual skill engineering and autonomous self-improvement."

Meta Context Engineering via Agentic Skill Evolution, Peking University State Key Laboratory of General Artificial Intelligence (2026)

Skills Overview

Foundational Skills

These skills establish the foundational understanding required for all subsequent context engineering work.

| Skill | Description |
|-------|-------------|
| context-fundamentals | Understand what context is, why it matters, and the anatomy of context in agent systems |
| context-degradation | Recognize patterns of context failure: lost-in-middle, poisoning, distraction, and clash |
| context-compression | Design and evaluate compression strategies for long-running sessions |

Architectural Skills

These skills cover the patterns and structures for building effective agent systems.

| Skill | Description |
|-------|-------------|
| multi-agent-patterns | Master orchestrator, peer-to-peer, and hierarchical multi-agent architectures |
| memory-systems | Design short-term, long-term, and graph-based memory architectures |
| tool-design | Build tools that agents can use effectively |
| filesystem-context | Use filesystems for dynamic context discovery, tool output offloading, and plan persistence |
| hosted-agents | NEW Build background coding agents with sandboxed VMs, pre-built images, multiplayer support, and multi-client interfaces |

Operational Skills

These skills address the ongoing operation and optimization of agent systems.

| Skill | Description |
|-------|-------------|
| context-optimization | Apply compaction, masking, and caching strategies |
| latent-briefing | Share task-relevant orchestrator state with workers via task-guided KV cache compaction when the worker runtime is controllable |
| evaluation | Build evaluation frameworks for agent systems |
| advanced-evaluation | Master LLM-as-a-Judge techniques: direct scoring, pairwise comparison, rubric generation, and bias mitigation |

Development Methodology

These skills cover the meta-level practices for building LLM-powered projects.

| Skill | Description |
|-------|-------------|
| project-development | Design and build LLM projects from ideation through deployment, including task-model fit analysis, pipeline architecture, and structured output design |

Cognitive Architecture Skills

These skills cover formal cognitive modeling for rational agent systems.

| Skill | Description |
|-------|-------------|
| bdi-mental-states | NEW Transform external RDF context into agent mental states (beliefs, desires, intentions) using formal BDI ontology patterns for deliberative reasoning and explainability |

Design Philosophy

Progressive Disclosure

Each skill is structured for efficient context use. At startup, agents load only skill names and descriptions. Full content loads only when a skill is activated for relevant tasks.
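
A minimal sketch of this two-level loading, assuming each SKILL.md starts with a `---`-delimited frontmatter carrying `name` and `description` fields:

```python
from pathlib import Path

def load_skill_index(skills_dir: str = "skills") -> dict:
    """Startup pass: read only name and description from each frontmatter."""
    index = {}
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        meta = {}
        lines = skill_md.read_text().splitlines()
        for line in lines[1:]:            # skip the opening '---'
            if line.strip() == "---":
                break                     # stop before the body is loaded
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
        index[meta.get("name", skill_md.parent.name)] = meta.get("description", "")
    return index

def activate_skill(name: str, skills_dir: str = "skills") -> str:
    """Activation pass: load the full SKILL.md only when the task needs it."""
    return (Path(skills_dir) / name / "SKILL.md").read_text()
```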

Platform Agnosticism

These skills focus on transferable principles rather than vendor-specific implementations. The patterns work across Claude Code, Cursor, and any agent platform that supports skills or allows custom instructions.

Conceptual Foundation with Practical Examples

Scripts and examples demonstrate concepts using Python pseudocode that works across environments without requiring specific dependency installations.

Usage

Usage with Claude Code

This repository is a Claude Code Plugin Marketplace containing context engineering skills that Claude automatically discovers and activates based on your task context.

Installation

Step 1: Add the Marketplace

Run this command in Claude Code to register this repository as a plugin source:

/plugin marketplace add muratcankoylan/Agent-Skills-for-Context-Engineering

Step 2: Install the Plugin

Option A - Browse and install:

  1. Select Browse and install plugins
  2. Select context-engineering-marketplace
  3. Select context-engineering
  4. Select Install now

Option B - Direct install via command:

/plugin install context-engineering@context-engineering-marketplace

This installs all 14 skills in a single plugin. Skills are activated automatically based on your task context.

Skill Triggers

| Skill | Triggers On |
|-------|-------------|
| context-fundamentals | "understand context", "explain context windows", "design agent architecture" |
| context-degradation | "diagnose context problems", "fix lost-in-middle", "debug agent failures" |
| context-compression | "compress context", "summarize conversation", "reduce token usage" |
| context-optimization | "optimize context", "reduce token costs", "implement KV-cache" |
| latent-briefing | "KV cache compaction between agents", "worker KV memory handoff", "latent briefing", "share trajectory without summarization" |
| multi-agent-patterns | "design multi-agent system", "implement supervisor pattern" |
| memory-systems | "implement agent memory", "build knowledge graph", "track entities" |
| tool-design | "design agent tools", "reduce tool complexity", "implement MCP tools" |
| filesystem-context | "offload context to files", "dynamic context discovery", "agent scratch pad", "file-based context" |
| hosted-agents | "build background agent", "create hosted coding agent", "sandboxed execution", "multiplayer agent", "Modal sandboxes" |
| evaluation | "evaluate agent performance", "build test framework", "measure quality" |
| advanced-evaluation | "implement LLM-as-judge", "compare model outputs", "mitigate bias" |
| project-development | "start LLM project", "design batch pipeline", "evaluate task-model fit" |
| bdi-mental-states | "model agent mental states", "implement BDI architecture", "transform RDF to beliefs", "build cognitive agent" |

For Cursor (Open Plugins)

This repository is listed on the Cursor Plugin Directory.

The .plugin/plugin.json manifest follows the Open Plugins standard, so the repo also works with any conformant agent tool (Codex, GitHub Copilot, etc.).

Using Individual Skills

To use a single skill without installing the full plugin, copy its SKILL.md directly into your project's .claude/skills/ directory:

```bash
# Example: add just the context-fundamentals skill
mkdir -p .claude/skills
curl -o .claude/skills/context-fundamentals.md \
  https://raw.githubusercontent.com/muratcankoylan/Agent-Skills-for-Context-Engineering/main/skills/context-fundamentals/SKILL.md
```

Available skills: context-fundamentals, context-degradation, context-compression, context-optimization, latent-briefing, multi-agent-patterns, memory-systems, tool-design, filesystem-context, hosted-agents, evaluation, advanced-evaluation, project-development, bdi-mental-states

For Custom Implementations

Extract the principles and patterns from any skill and implement them in your agent framework. The skills are deliberately platform-agnostic.

Examples

The examples folder contains complete system designs that demonstrate how multiple skills work together in practice.

| Example | Description | Skills Applied |
|---------|-------------|----------------|
| digital-brain-skill | NEW Personal operating system for founders and creators. Complete Claude Code skill with 6 modules, 4 automation scripts | context-fundamentals, context-optimization, memory-systems, tool-design, multi-agent-patterns, evaluation, project-development |
| x-to-book-system | Multi-agent system that monitors X accounts and generates daily synthesized books | multi-agent-patterns, memory-systems, context-optimization, tool-design, evaluation |
| llm-as-judge-skills | Production-ready LLM evaluation tools with TypeScript implementation, 19 passing tests | advanced-evaluation, tool-design, context-fundamentals, evaluation |
| book-sft-pipeline | Train models to write in any author's style. Includes Gertrude Stein case study with 70% human score on Pangram, $2 total cost | project-development, context-compression, multi-agent-patterns, evaluation |

Each example includes:

  • Complete PRD with architecture decisions
  • Skills mapping showing which concepts informed each decision
  • Implementation guidance

Digital Brain Skill Example

The digital-brain-skill example is a complete personal operating system demonstrating comprehensive skills application:

  • Progressive Disclosure: 3-level loading (SKILL.md → MODULE.md → data files)
  • Module Isolation: 6 independent modules (identity, content, knowledge, network, operations, agents)
  • Append-Only Memory: JSONL files with schema-first lines for agent-friendly parsing
  • Automation Scripts: 4 consolidated tools (weekly_review, content_ideas, stale_contacts, idea_to_draft)

Includes detailed traceability in HOW-SKILLS-BUILT-THIS.md mapping every architectural decision to specific skill principles.

LLM-as-Judge Skills Example

The llm-as-judge-skills example is a complete TypeScript implementation demonstrating:

  • Direct Scoring: Evaluate responses against weighted criteria with rubric support
  • Pairwise Comparison: Compare responses with position bias mitigation
  • Rubric Generation: Create domain-specific evaluation standards
  • EvaluatorAgent: High-level agent combining all evaluation capabilities

Book SFT Pipeline Example

The book-sft-pipeline example demonstrates training small models (8B) to write in any author's style:

  • Intelligent Segmentation: Two-tier chunking with overlap for maximum training examples
  • Prompt Diversity: 15+ templates to prevent memorization and force style learning
  • Tinker Integration: Complete LoRA training workflow with $2 total cost
  • Validation Methodology: Modern scenario testing proves style transfer vs content memorization

Integrates with context engineering skills: project-development, context-compression, multi-agent-patterns, evaluation.


Structure

Each skill follows the Agent Skills specification:

```
skill-name/
├── SKILL.md              # Required: instructions + metadata
├── scripts/              # Optional: executable code demonstrating concepts
└── references/           # Optional: additional documentation and resources
```

See the template folder for the canonical skill structure.

Contributing

This repository follows the Agent Skills open development model. Contributions are welcome from the broader ecosystem. When contributing:

  1. Follow the skill template structure
  2. Provide clear, actionable instructions
  3. Include working examples where appropriate
  4. Document trade-offs and potential issues
  5. Keep SKILL.md under 500 lines for optimal performance

Feel free to contact Muratcan Koylan for collaboration opportunities or any inquiries.

License

MIT License - see LICENSE file for details.

References

The principles in these skills are derived from research and production experience at leading AI labs and framework developers. Each skill includes references to the underlying research and case studies that inform its recommendations.