USP

Combines real-world session data analysis with research-backed static checks to give a holistic view of skill performance. The output is a set of prioritized P0/P1/P2 fixes rather than generic suggestions.

Use cases

1. Identifying why a skill isn't triggering as expected.
2. Improving user satisfaction with agent interactions.
3. Optimizing skill descriptions for better agent selection (CSO).
4. Debugging broken skill workflows or environmental inconsistencies.
5. Ensuring cost-effectiveness and token compliance of skills.

Detected files (1)

skills/skill-optimizer/SKILL.md (skill)


---
name: skill-optimizer
description: "Use when the user wants to analyze, audit, or improve their Agent Skills (SKILL.md files). Triggers on /optimize-skill, /skill-audit, 'optimize skills', 'analyze skills', 'check my skills', 'skill quality'. Also use proactively when the user mentions skills aren't triggering, skills feel broken, or asks why a skill didn't fire."
---

## Rules

- **Read-only**: never modify skill files. Only output report.
- **All 8 dimensions**: do not skip any. If data is insufficient, report "N/A — insufficient session data" rather than omitting.
- **Quantify**: "you had 12 research tasks last week but the skill never triggered" beats "you often do research".
- **Suggest, don't prescribe**: give specific wording suggestions for description improvements, but frame as suggestions.
- **Show evidence**: for undertrigger claims, quote the actual user message that should have triggered the skill.
- **Evidence-based suggestions**: when suggesting description rewrites, cite the specific research finding that motivates the change (e.g., "front-load trigger keywords — MCP study shows 3.6x selection rate improvement").

## Overview

Analyze skills using **historical session data + static quality checks**, output a diagnostic report with P0/P1/P2 prioritized fixes. Scores each skill on a 5-point composite scale across 8 dimensions.

CSO (Claude/Agent Search Optimization) = writing skill descriptions so agents select the right skill at the right time. This skill checks for CSO violations.

## Usage

- `/optimize-skill` → scan all skills
- `/optimize-skill my-skill` → single skill
- `/optimize-skill skill-a skill-b` → multiple specified skills

## Data Sources

Auto-detect the current agent platform and scan the corresponding paths:

| Source | Claude Code | Codex | Shared |
|--------|------------|-------|--------|
| Session transcripts | `~/.claude/projects/**/*.jsonl` | `~/.codex/sessions/**/*.jsonl` | — |
| Skill files | `~/.claude/skills/*/SKILL.md` | `~/.codex/skills/*/SKILL.md` | `~/.agents/skills/*/SKILL.md` |

**Platform detection:** Check which directories exist. Scan all available sources — a user may have both Claude Code and Codex installed.
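The directory check described above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual script: the function name `detect_platforms` and the platform labels are hypothetical, and only the three skill directories from the table are taken from the source.

```python
from pathlib import Path

# Skill directories from the Data Sources table, relative to the home directory.
# The platform labels on the left are illustrative names, not part of any spec.
SKILL_DIRS = {
    "claude-code": ".claude/skills",
    "codex": ".codex/skills",
    "shared": ".agents/skills",
}


def detect_platforms(home: Path) -> dict[str, Path]:
    """Return every platform whose skills directory exists under `home`."""
    found = {}
    for platform, relative in SKILL_DIRS.items():
        path = home / relative
        if path.is_dir():
            found[platform] = path
    return found
```

Calling `detect_platforms(Path.home())` returns all matching platforms rather than the first hit, which mirrors the instruction to scan every available source.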

## Workflow

```
Identify target skills
        ↓
Collect session data (python3 scripts scan JSONL transcripts)
        ↓
Run 8 analysis dimensions
        ↓
Compute composite scores
        ↓
Output report with P0/P1/P2
```

### Step 1: Identify Target Skills

Scan skill directories in order: `~/.claude/skills/`, `~/.codex/skills/`, `~/.agents/skills/`. Deduplicate by skill name (same name in multiple locations = same skill). For each, read `SKILL.md` and extract:
- name, description (from YAML frontmatter)
- trigger keywords (from description field)
- defined workflow steps (Step 1/2/3... or ### sections under Workflow)
- word count

If user specified skill names, filter to only those.
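A rough sketch of the extraction and deduplication in this step follows. It assumes the frontmatter is a flat block of `key: value` lines between `---` markers and deliberately avoids a full YAML parser; `parse_frontmatter`, `summarize_skill`, and `dedupe_by_name` are hypothetical helper names.

```python
import re


def parse_frontmatter(text: str) -> dict[str, str]:
    """Extract flat `key: value` pairs from a leading --- block (not full YAML)."""
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", text, re.DOTALL)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        # Split on the first colon only, so values may contain ": ".
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip().strip('"')
    return fields


def summarize_skill(text: str) -> dict:
    """Collect the per-skill fields that Step 1 asks for."""
    meta = parse_frontmatter(text)
    return {
        "name": meta.get("name", ""),
        "description": meta.get("description", ""),
        "word_count": len(text.split()),
    }


def dedupe_by_name(skills: list[dict]) -> list[dict]:
    """Keep the first skill seen for each name, preserving scan order."""
    seen = {}
    for skill in skills:
        seen.setdefault(skill["name"], skill)
    return list(seen.values())
```

Because the directories are scanned in a fixed order, keeping the first occurrence means a skill in `~/.claude/skills/` wins over a same-named copy in `~/.agents/skills/`.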

### Step 2: Collect Session Data

Use python3 scripts via Bash to scan session JSONL files. Extract:

**Claude Code sessions** (`~/.claude/projects/**/*.jsonl`):
- `Skill` tool_use calls (which skills were invoked)
- User messages (full text)
- Assistant messages after skill invocation (for workflow tracking)
- User messages after skill invocation (for reaction analysis)

**Codex sessions** (`~/.codex/sessions/**/*.jsonl`):
- `session_meta` events → extract `base_instructions` for skill loading evidence
- `response_item` events → assistant outputs (workflow tracking)
- `event_msg` events → tool execution and skill-related events
- User messages from `turn_context` events (for reaction analysis)

**Note:** Codex injects skills via context rather than explicit `Skill` tool calls. Skill loading (present in `base_instructions`) does NOT equal active invocation. To detect actual use, search for skill-specific workflow markers (step headers, output formats) in `response_item` content within that session. A skill is "invoked" only if the agent produced output following the skill's defined workflow.

**Aggregated:**
- Per-skill: invocation count, trigger keyword match count
- Per-skill: user reaction sentiment after invocation
- Per-skill: workflow step completion markers
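As an illustration of the Claude Code side of this step, the sketch below counts `Skill` tool calls in JSONL lines. The transcript schema is an assumption: it presumes each line is a JSON object whose `message.content` is a list of blocks, and that a skill call looks like `{"type": "tool_use", "name": "Skill", "input": {"skill": "<name>"}}`. Verify the field names against real transcripts before relying on it.

```python
import json
from collections import Counter
from typing import Iterable


def count_skill_invocations(lines: Iterable[str]) -> Counter:
    """Count Skill tool_use blocks per skill name (schema is assumed, see above)."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        message = event.get("message") if isinstance(event, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use" and block.get("name") == "Skill":
                args = block.get("input")
                # "skill" as the argument key is a hypothetical field name.
                name = args.get("skill", "<unknown>") if isinstance(args, dict) else "<unknown>"
                counts[name] += 1
    return counts
```

The function takes an iterable of lines, so it can be fed an open file handle for each transcript found under `~/.claude/projects/`.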

### Step 3: Run 8 Analysis Dimensions

**You MUST run ALL 8 dimensions.** The baseline behavior without this skill is to skip dimensions 4.2, 4.3, 4.5b, and 4.8. These are the most valuable dimensions — do not skip them.

#### 4.1 Trigger Rate

Count how many times each skill was actually invoked vs how many times its trigger keywords appeared in user messages.

**Claude Code:** count `Skill` tool_use calls in transcripts.
**Codex:** count sessions where the agent produced output following the skill's workflow markers (not merely loaded in context).

**Diagnose:**
- Never triggered → skill may be useless or trigger words wrong
- Keywords match >> actual invocations → undertrigger problem, description needs work
- High frequency → core skill, worth optimizing
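The comparison above can be expressed as a small helper. This is a sketch under stated assumptions: the source does not define how large the gap must be to count as "much greater", so `gap_ratio` is a hypothetical threshold, and keyword matching here is plain case-insensitive substring search.

```python
def keyword_matches(messages: list[str], keywords: list[str]) -> int:
    """Count user messages that contain at least one trigger keyword."""
    lowered = [k.lower() for k in keywords]
    return sum(any(k in m.lower() for k in lowered) for m in messages)


def diagnose_trigger(invocations: int, matches: int, gap_ratio: float = 3.0) -> str:
    """Map invocation and keyword-match counts to a coarse diagnosis."""
    if invocations == 0:
        return "never-triggered"
    # gap_ratio is an illustrative cutoff for "keyword matches >> invocations".
    if matches >= gap_ratio * invocations:
        return "undertrigger"
    return "ok"
```

For example, a skill invoked once while its keywords appeared in six messages would be reported as `"undertrigger"` with the default ratio.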

#### 4.2 Post-Invocation User Reaction

**This dimension is critical and easy to skip. Do not skip it.**

After a skill is invoked in a session, read the user's next 3 messages. Classify:
- **Negative**: "no", "wrong", "never mind", "not what I wanted", user interrupts
- **Correction**: user re-describes their intent, manually overrides skill output
- **Positive**: "good", "ok", "continue", "nice", user follows the workflow
- **Silent switch**: user changes topic entirely (likely false positive trigger)

Report per-skill satisfaction rate.
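A keyword-based first pass at this classification might look like the following. It only covers the negative and positive classes using the example phrases listed above; corrections and silent switches need judgment about intent, so the sketch leaves them as `"unclassified"` for the agent to review. Defining the satisfaction rate over rated messages only is an assumption.

```python
import re
from typing import Optional

# Example phrases taken from the classification list above.
NEGATIVE = ("no", "wrong", "never mind", "not what i wanted")
POSITIVE = ("good", "ok", "continue", "nice")


def _has_phrase(text: str, phrases) -> bool:
    """Whole-word match, so "no" does not fire inside "not" or "notify"."""
    return any(re.search(rf"\b{re.escape(p)}\b", text) for p in phrases)


def classify_reaction(message: str) -> str:
    text = message.lower()
    if _has_phrase(text, NEGATIVE):
        return "negative"
    if _has_phrase(text, POSITIVE):
        return "positive"
    return "unclassified"


def satisfaction_rate(reactions: list[str]) -> Optional[float]:
    """Share of positive reactions among rated ones; None if nothing was rated."""
    rated = [r for r in reactions if r in ("positive", "negative")]
    if not rated:
        return None
    return rated.count("positive") / len(rated)
```

Negative phrases are checked first so that a mixed message such as "no, not good" counts as a rejection.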

#### 4.3 Workflow Completion Rate

**This dimension is critical and easy to skip. Do not skip it.**

For each skill invocation found in session data:
1. Extract the skill's defined steps from SKILL.md
2. Search the assistant messages in that session for step markers (Step N, specific output formats defined in the skill)
3. Calculate: how far did execution get?

Report: `{skill-name} (N steps): avg completed Step X/N (Y%)`

If a specific step is frequently where execution stops, flag it.
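The calculation can be sketched as below, assuming the only step markers are literal `Step N` strings in the assistant output. Skills that signal progress through custom output formats would need their own patterns. Both function names are hypothetical.

```python
import re


def last_step_reached(assistant_text: str, total_steps: int) -> int:
    """Highest valid `Step N` marker found in the text, or 0 if none."""
    found = [int(n) for n in re.findall(r"\bStep\s+(\d+)\b", assistant_text)]
    valid = [n for n in found if 1 <= n <= total_steps]
    return max(valid, default=0)


def completion_report(skill: str, total_steps: int, sessions: list[str]) -> str:
    """Format the per-skill line using the report template above."""
    reached = [last_step_reached(text, total_steps) for text in sessions]
    avg = sum(reached) / len(reached) if reached else 0.0
    pct = round(100 * avg / total_steps) if total_steps else 0
    return f"{skill} ({total_steps} steps): avg completed Step {avg:.1f}/{total_steps} ({pct}%)"
```

Keeping the per-session values from `last_step_reached` also makes it easy to spot a step where execution repeatedly stops.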

#### 4.4 Static Quality Analysis

Check each SKILL.md against these 14 rules:

| Check | Pass Criteria |
|-------|--------------|
| Frontmatter format | Only `name` + `description`, total < 1024 chars |
| Name format | Letters, numbers, hyphens only |
| Description trigger | Starts with "Use when..." or has explicit trigger conditions |
| Description workflow leak | Description does NOT summarize the skill's workflow steps (CSO violation) |
| Description pushiness | Description actively claims scenarios where it should be used, not just passive |
| Overview section | Present |
| Rules section | Present |
| MUST/NEVER density | Count ALL-CAPS directive words; >5 per 100 words = flag. Note: Meincke et al. (2025) found persuasion directives have inconsistent effects across models. Suggest converting to concrete bright-line rules with rationale, not mere emphasis. |
| Word count | < 500 words (flag if over) |
| Narrative anti-pattern | No "In session X, we found..." storytelling — skills should be instructions, not post-hoc reports |
| YAML quoting safety | description containing `: ` must be wrapped in double quotes, otherwise YAML parse failure makes skill invisible |
| Critical info position | Core trigger conditions and primary actions must be in the first 20% of SKILL.md, not buried in the middle (Lost in the Middle, Liu et al. TACL 2024: U-shaped performance curve, with accuracy lowest for information in the middle of a long context) |
| Description 250-char check | Primary trigger keywords must appear within the first 250 characters of description (skill listing truncation point in most agents) |
| Trigger condition count | ≤ 2 trigger conditions in description is ideal; consistent with IFEval (Zhou et al. 2023) finding that LLMs struggle with multi-constraint prompts |
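Three of the checks in the table are mechanical enough to sketch in code. The example below is illustrative only: the `DIRECTIVES` word list is an assumption (the table says "ALL-CAPS directive words" without enumerating them), and the quoting check accepts single quotes as well as double quotes since both are valid YAML.

```python
import re

# Hypothetical directive list; extend it to match your own conventions.
DIRECTIVES = {"MUST", "NEVER", "ALWAYS"}


def directive_density(body: str) -> float:
    """ALL-CAPS directive words per 100 words (flag when above 5)."""
    words = re.findall(r"[A-Za-z']+", body)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in DIRECTIVES)
    return 100.0 * hits / len(words)


def keywords_in_first_250(description: str, keywords: list[str]) -> bool:
    """True if every primary keyword appears before the 250-char cutoff."""
    head = description[:250].lower()
    return all(k.lower() in head for k in keywords)


def needs_yaml_quotes(raw_value: str) -> bool:
    """True if an unquoted frontmatter value contains ': '."""
    value = raw_value.strip()
    quoted = len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'"
    return ": " in value and not quoted
```

`needs_yaml_quotes` expects the raw text after `description:` as it appears in the file, before any quote stripping.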

#### 4.5a False Positive Rate (Overtrigger)

Skill was invoked but user immediately rejected or ignored it.

#### 4.5b Undertrigger Detection

**This is the highest-value dimension.** Memento-Skills (arXiv:2603.18743) demonstrates that skills stored as structured files require accurate retrieval/routing to be effective — skills that are never retrieved cannot improve through their read-write learning loop, making undertriggering a compounding problem.

For each skill, extract its **capability keywords** (not just trigger keywords — what the skill CAN do). Then scan user messages for tasks that match those capabilities but where the skill was NOT invoked.

Example: user says "run these tasks in parallel" but parallel-runner was not triggered → undertrigger.

Report: which user messages SHOULD have triggered the skill but didn't, and suggest description improvements.

**Compounding Risk Assessment:**
For skills with chronic undertriggering (0 triggers across 5+ sessions where relevant tasks appeared), flag as "compounding risk" — undertriggered skills cannot self-improve through usage feedback, causing the gap to widen over time. Recommend immediate description rewrite as P0.
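The scan and the compounding-risk rule can be sketched as follows. The session shape is an assumption made for illustration: each session is a dict with `user_messages` (a list of strings) and `invoked` (whether the skill fired in that session). Capability matching is simple substring search, which a real audit would likely replace with the agent's own judgment.

```python
def _mentions(message: str, keywords: list[str]) -> bool:
    text = message.lower()
    return any(k.lower() in text for k in keywords)


def find_undertriggers(sessions: list[dict], capability_keywords: list[str]) -> list[str]:
    """User messages that matched a capability in sessions where the skill never fired."""
    misses = []
    for session in sessions:
        if session["invoked"]:
            continue
        misses.extend(
            m for m in session["user_messages"] if _mentions(m, capability_keywords)
        )
    return misses


def is_compounding_risk(sessions: list[dict], capability_keywords: list[str],
                        min_sessions: int = 5) -> bool:
    """0 triggers across 5+ sessions in which a relevant task appeared."""
    relevant = [
        s for s in sessions
        if any(_mentions(m, capability_keywords) for m in s["user_messages"])
    ]
    return len(relevant) >= min_sessions and not any(s["invoked"] for s in relevant)
```

The list returned by `find_undertriggers` doubles as the evidence the Rules section asks for, since it contains the actual user messages to quote.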

#### 4.6 Cross-Skill Conflicts

Compare all skill pairs:
- Trigger keyword overlap (same keywords in two descriptions)
- Workflow overlap (two skills teach similar processes)
- Contradictory guidance
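The keyword-overlap part of this comparison is straightforward to sketch; workflow overlap and contradictory guidance require reading the skills and are not covered here. The input shape (skill name mapped to a set of trigger keywords) and the function name are assumptions.

```python
from itertools import combinations


def keyword_overlaps(skills: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return shared trigger keywords for every pair of skills that overlap."""
    conflicts = {}
    # Sorting the names makes each pair appear once, in a stable order.
    for first, second in combinations(sorted(skills), 2):
        shared = skills[first] & skills[second]
        if shared:
            conflicts[(first, second)] = shared
    return conflicts
```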

#### 4.7 Environment Consistency

For each skill, extract referenced:
- File paths → check if they exist (`test -e`)
- CLI tools → check if installed (`which`)
- Directories → check if they exist

Flag any broken references.
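The same checks can be done from Python instead of shelling out to `test -e` and `which`. This sketch uses `os.path.exists` and `shutil.which` from the standard library; extracting the paths and tool names from a SKILL.md is left out, and `check_references` is a hypothetical name.

```python
import os
import shutil


def check_references(paths: list[str], tools: list[str]) -> list[str]:
    """Return a description of each referenced path or CLI tool that is missing."""
    broken = []
    for path in paths:
        # expanduser resolves a leading "~" the way a shell would.
        if not os.path.exists(os.path.expanduser(path)):
            broken.append(f"missing path: {path}")
    for tool in tools:
        if shutil.which(tool) is None:
            broken.append(f"missing tool: {tool}")
    return broken
```

An empty list means every reference resolved on the current machine.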

#### 4.8 Token Economics

**This dimension is critical and easy to skip. Do not skip it.**

For each skill:
- Word count (from Step 1)
- Trigger frequency (from 4.1)
- Cost-effectiveness = trigger count / word count
- Flag: large + never-triggered skills as candidates for removal or compression

**Progressive Disclosure Tier Check:**
Evaluate each skill against the 3-tier loading model (Agent Skills spec):
- Tier 1 (frontmatter): ~100 tokens. Check: is description ≤ 1024 chars?
- Tier 2 (SKILL.md body): <500 lines recommended. Check: word count.
- Tier 3 (reference files): loaded on demand. Check: does skill use reference files for detailed content, or cram everything into SKILL.md?

Flag skills that put 500+ words in SKILL.md without using reference files as "poor progressive disclosure".
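The ratio and the two flags can be sketched as below. The source does not say what counts as a "large" skill, so this example reuses the 500-word figure from the progressive disclosure rule as an assumed threshold; the function names are hypothetical.

```python
def cost_effectiveness(trigger_count: int, word_count: int) -> float:
    """Triggers per word; 0.0 for an empty skill file."""
    return trigger_count / word_count if word_count else 0.0


def token_flags(word_count: int, trigger_count: int, has_reference_files: bool,
                large_threshold: int = 500) -> list[str]:
    """Apply the two flags from this dimension (threshold is an assumption)."""
    flags = []
    if word_count >= large_threshold and trigger_count == 0:
        flags.append("candidate for removal or compression")
    if word_count >= large_threshold and not has_reference_files:
        flags.append("poor progressive disclosure")
    return flags
```

For instance, a 400-word skill that fired twice has a cost-effectiveness of 2 / 400 = 0.005 and raises no flags.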

### Step 4: Composite Score

Rate each skill on a 5-point scale:

| Score | Meaning |
|-------|---------|
| 5 | Healthy: high trigger rate, positive reactions, complete workflows, clean static |
| 4 | Good: minor issues in 1-2 dimensions |
| 3 | Needs attention: significant gap in 1 dimension or minor gaps in 3+ |
| 2 | Problematic: never triggered, or negative user reactions, or major static issues |
| 1 | Broken: doesn't work, references missing, or fundamentally misaligned |

**Scored dimensions** (weighted average):
- Trigger rate: 25%
- User reaction: 20%
- Workflow completion: 15%
- Static quality: 15%
- Undertrigger: 15%
- Token economics: 10%

**Qualitative dimensions** (reported but not scored — no reliable numeric metric):
- 4.5a Overtrigger: reported as count + examples
- 4.6 Cross-Skill Conflicts: reported as conflict pairs
- 4.7 Environment Consistency: reported as pass/fail per reference

(If a scored dimension has no data — e.g., skill was never invoked so no user reaction — mark as "N/A" and redistribute weight.)
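One way to implement the weighted average with N/A handling is sketched below. Two things are assumptions rather than part of the source: each dimension is taken to already have a 1-5 sub-score, and the weight of missing dimensions is redistributed proportionally by renormalizing over the dimensions that have data.

```python
from typing import Optional

# Weights from the scored-dimensions list above.
WEIGHTS = {
    "trigger_rate": 0.25,
    "user_reaction": 0.20,
    "workflow_completion": 0.15,
    "static_quality": 0.15,
    "undertrigger": 0.15,
    "token_economics": 0.10,
}


def composite_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted average of available sub-scores; None marks an N/A dimension."""
    available = {k: v for k, v in scores.items() if k in WEIGHTS and v is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    if total_weight == 0:
        return None  # nothing to score
    # Dividing by the remaining weight redistributes the N/A share proportionally.
    return sum(WEIGHTS[k] * v for k, v in available.items()) / total_weight
```

With user reaction and workflow completion marked N/A, the remaining four weights sum to 0.65, so a skill scoring 2, 4, 2, 4 on trigger rate, static quality, undertrigger, and token economics gets 1.8 / 0.65, roughly 2.8.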

## Report Format

```markdown
# Skill Optimization Report
**Date**: {date}
**Scope**: {all / specified skills}
**Session data**: {N} sessions, {date range}

## Overview
| Skill | Triggers | Reaction | Completion | Static | Undertrigger | Token | Score |
|-------|----------|----------|------------|--------|--------------|-------|-------|
| example-skill | 2 | 100% | 86% | B+ | 1 miss | 486w | 4/5 |

## P0 Fixes (blocking usage)
1. ...

## P1 Improvements (better experience)
1. ...

## P2 Optional Optimizations
1. ...

## Per-Skill Diagnostics
### {skill-name}
#### 4.1 Trigger Rate
...
#### 4.2 User Reaction
...
(all 8 dimensions)
```

## Research Background
The analysis dimensions in this report are grounded in the following research:
- **Undertrigger detection**: Memento-Skills (arXiv:2603.18743) — skills as structured files require accurate routing; unrouted skills cannot self-improve via the read-write learning loop
- **Description quality**: MCP Description Quality (arXiv:2602.18914) — well-written descriptions achieve 72% tool selection rate vs. 20% random baseline (3.6x improvement)
- **Information position**: Lost in the Middle (Liu et al., TACL 2024), which reports a U-shaped performance curve: accuracy is lowest when relevant information sits in the middle of a long context
- **Format impact**: He et al. (arXiv:2411.10541) — format changes alone can cause 9-40% performance variance
- **Instruction compliance**: IFEval (arXiv:2311.07911) — LLMs struggle with multi-constraint prompts

README

Skill Optimizer


Diagnose and optimize your Agent Skills (SKILL.md files) using real session data + research-backed static analysis. Get a prioritized report with P0/P1/P2 fixes.

Works with Claude Code, Codex, and any agent supporting the Agent Skills open standard. Auto-detects your platform and scans the right paths.

Most skill auditors only do static checks on your SKILL.md. This one also mines your actual session transcripts to measure trigger rates, user satisfaction, workflow completion, and undertrigger gaps — then scores each skill on a 5-point composite scale.

What It Does

6 scored dimensions (weighted into composite score):

| Dimension | What's Measured |
|-----------|-----------------|
| Trigger Rate | How often is the skill actually invoked vs. how often it should be? |
| User Reaction | Does the user accept, correct, or reject the skill after invocation? |
| Workflow Completion | How far through the skill's defined steps does execution get? |
| Static Quality | 14 checks: YAML safety, CSO compliance, info position, word count, etc. |
| Undertrigger | Missed opportunities: user needed the skill but it wasn't invoked |
| Token Economics | Cost-effectiveness and progressive disclosure tier compliance |

3 qualitative dimensions (reported but not scored):

| Dimension | What's Measured |
|-----------|-----------------|
| Overtrigger | False positives: skill fired but user didn't want it |
| Cross-Skill Conflicts | Trigger keyword overlap and contradictory guidance between skills |
| Environment Consistency | Broken file paths, missing CLI tools, non-existent directories |

Installation

Copy the command below and paste it directly into your agent's chat — it will install automatically:

Claude Code

Install the skill-optimizer skill from https://github.com/hqhq1025/skill-optimizer

Codex

Install the skill-optimizer skill from https://github.com/hqhq1025/skill-optimizer into ~/.codex/skills/

Other Agents (Cursor, OpenCode, Gemini CLI, etc.)

Install the skill-optimizer skill from https://github.com/hqhq1025/skill-optimizer into ~/.agents/skills/

Manual install

# Claude Code
git clone https://github.com/hqhq1025/skill-optimizer.git /tmp/skill-optimizer
cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.claude/skills/
rm -rf /tmp/skill-optimizer

# Codex
git clone https://github.com/hqhq1025/skill-optimizer.git /tmp/skill-optimizer
cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.codex/skills/
rm -rf /tmp/skill-optimizer

# Shared (any agent)
git clone https://github.com/hqhq1025/skill-optimizer.git /tmp/skill-optimizer
cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.agents/skills/
rm -rf /tmp/skill-optimizer

Usage

/optimize-skill              # Scan all skills
/optimize-skill my-skill     # Single skill
/optimize-skill skill-a skill-b  # Multiple skills

The skill generates a diagnostic report with:

- Overview table: all skills at a glance with scores
- P0 Fixes: blocking issues that must be resolved
- P1 Improvements: experience improvements
- P2 Optimizations: optional tweaks
- Per-skill diagnostics: all 8 dimensions for each skill

Multi-Platform Session Analysis

The optimizer auto-detects available platforms and scans session data from all of them:

| Platform | Skills Path | Session Data Path |
|----------|-------------|-------------------|
| Claude Code | `~/.claude/skills/` | `~/.claude/projects/**/*.jsonl` |
| Codex | `~/.codex/skills/` | `~/.codex/sessions/**/*.jsonl` |
| Shared | `~/.agents/skills/` | n/a |

Research Background

The analysis dimensions are grounded in the following research:

| Research | Key Finding | Applied In |
|----------|-------------|------------|
| Memento-Skills (2026) | Skills as structured files require accurate routing; unrouted skills cannot self-improve via the read-write learning loop | Undertrigger detection, compounding risk assessment |
| MCP Description Quality (2026) | Well-written descriptions achieve 72% tool selection rate vs. 20% random baseline (3.6x improvement) | Description quality checks, evidence-based rewrite suggestions |
| Lost in the Middle (Liu et al., TACL 2024) | LLM accuracy follows a U-shaped curve over position: information in the middle of a long context is used least reliably | Critical info position check |
| Prompt Format Impact (He et al., 2024) | Format changes alone cause 9-40% performance variance | Static quality analysis |
| IFEval (Zhou et al., 2023) | LLMs struggle with multi-constraint prompts | Trigger condition count check |
| Meincke et al. (2025) | Persuasion directives have inconsistent effects across models | MUST/NEVER density guidance |

How It Works

Identify target skills (scan ~/.claude/skills/, ~/.codex/skills/, ~/.agents/skills/)
        ↓
Collect session data (auto-detect platform, scan JSONL transcripts)
        ↓
Run 8 analysis dimensions (4.5 is split into overtrigger and undertrigger, giving 6 scored + 3 qualitative)
        ↓
Compute composite scores (weighted average of 6 scored dimensions)
        ↓
Output report with P0/P1/P2 prioritized fixes

Scored dimensions (weighted average):

- Trigger rate: 25%
- User reaction: 20%
- Workflow completion: 15%
- Static quality: 15%
- Undertrigger: 15%
- Token economics: 10%

Qualitative dimensions (overtrigger, cross-skill conflicts, environment consistency) are reported with examples but do not affect the numeric score.

Compatibility

Works with any agent that supports the Agent Skills open standard:

- Claude Code
- Codex
- Cursor
- OpenCode
- Gemini CLI

Community

LINUX DO — Where we first shared this project

License

MIT