Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
Orchestra-Research

AI-Research-SKILLs

Quality
10.0

This library provides 98 specialized skills enabling AI agents to autonomously conduct the full AI research lifecycle, from ideation and literature review to experiment execution and paper writing. It is ideal for accelerating scientific discovery by offloading complex infrastructure and framework management to AI agents.

USP

It's the most comprehensive open-source library specifically designed for autonomous AI research, offering 98 expert-level skills across 23 categories, enabling agents to handle the entire research lifecycle from idea to paper.

Use cases

  • 01 Autonomous AI research orchestration
  • 02 Literature survey and idea generation
  • 03 Experiment execution and debugging
  • 04 ML paper writing and academic plotting
  • 05 Distributed LLM pretraining

Detected files (8)

  • 01-model-architecture/litgpt/SKILL.md
    ---
    name: implementing-llms-litgpt
    description: Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when you need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, LitGPT, Lightning AI, LLM Implementation, LoRA, QLoRA, Fine-Tuning, Llama, Gemma, Phi, Mistral, Educational]
    dependencies: [litgpt, torch, transformers]
    ---
    
    # LitGPT - Clean LLM Implementations
    
    ## Quick start
    
    LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.
    
    **Installation**:
    ```bash
    pip install 'litgpt[extra]'
    ```
    
    **Load and use any model**:
    ```python
    from litgpt import LLM
    
    # Load pretrained model
    llm = LLM.load("microsoft/phi-2")
    
    # Generate text
    result = llm.generate(
        "What is the capital of France?",
        max_new_tokens=50,
        temperature=0.7
    )
    print(result)
    ```
    
    **List available models**:
    ```bash
    litgpt download list
    ```
    
    ## Common workflows
    
    ### Workflow 1: Fine-tune on custom dataset
    
    Copy this checklist:
    
    ```
    Fine-Tuning Setup:
    - [ ] Step 1: Download pretrained model
    - [ ] Step 2: Prepare dataset
    - [ ] Step 3: Configure training
    - [ ] Step 4: Run fine-tuning
    ```
    
    **Step 1: Download pretrained model**
    
    ```bash
    # Download Llama 3 8B
    litgpt download meta-llama/Meta-Llama-3-8B
    
    # Download Phi-2 (smaller, faster)
    litgpt download microsoft/phi-2
    
    # Download Gemma 2B
    litgpt download google/gemma-2b
    ```
    
    Models are saved to the `checkpoints/` directory.
    
    **Step 2: Prepare dataset**
    
    LitGPT supports multiple formats:
    
    **Alpaca format** (instruction-response):
    ```json
    [
      {
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris."
      },
      {
        "instruction": "Translate to Spanish: Hello, how are you?",
        "input": "",
        "output": "Hola, ¿cómo estás?"
      }
    ]
    ```
    
    Save as `data/my_dataset.json`.
    
    **Step 3: Configure training**
    
    ```bash
    # Full fine-tuning (requires 40GB+ GPU for 7B models)
    litgpt finetune \
      meta-llama/Meta-Llama-3-8B \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --train.max_steps 1000 \
      --train.learning_rate 2e-5 \
      --train.micro_batch_size 1 \
      --train.global_batch_size 16
    
    # LoRA fine-tuning (efficient, 16GB GPU)
    litgpt finetune_lora \
      microsoft/phi-2 \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --lora_r 16 \
      --lora_alpha 32 \
      --lora_dropout 0.05 \
      --train.max_steps 1000 \
      --train.learning_rate 1e-4
    ```
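
    As a rough sanity check of the batch settings above (plain arithmetic, not a LitGPT API), the gradient-accumulation factor follows from the two batch sizes:

    ```python
    # Hypothetical check: micro-batches accumulated per optimizer step (assuming a single GPU)
    micro_batch_size = 1      # samples per forward/backward pass
    global_batch_size = 16    # samples per optimizer step
    print(global_batch_size // micro_batch_size)  # 16
    ```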
    
    **Step 4: Run fine-tuning**
    
    Training saves checkpoints to `out/finetune/` automatically.
    
    Monitor training:
    ```bash
    # View logs
    tail -f out/finetune/logs.txt
    
    # TensorBoard (if using --train.logger_name tensorboard)
    tensorboard --logdir out/finetune/lightning_logs
    ```
    
    ### Workflow 2: LoRA fine-tuning on single GPU
    
    Most memory-efficient option.
    
    ```
    LoRA Training:
    - [ ] Step 1: Choose base model
    - [ ] Step 2: Configure LoRA parameters
    - [ ] Step 3: Train with LoRA
    - [ ] Step 4: Merge LoRA weights (optional)
    ```
    
    **Step 1: Choose base model**
    
    For limited GPU memory (12-16GB):
    - **Phi-2** (2.7B) - Best quality/size tradeoff
    - **Llama 3.2 1B** - Smallest, fastest
    - **Gemma 2B** - Good reasoning
    
    **Step 2: Configure LoRA parameters**
    
    ```bash
    # Flag reference (inline comments would break the line-continued command):
    #   --lora_r 16             LoRA rank (8-64, higher = more capacity)
    #   --lora_alpha 32         LoRA scaling (typically 2×r)
    #   --lora_dropout 0.05     Prevent overfitting
    #   --lora_query true       Apply LoRA to query projection
    #   --lora_key false        Usually not needed
    #   --lora_value true       Apply LoRA to value projection
    #   --lora_projection true  Apply LoRA to output projection
    #   --lora_mlp false        Usually not needed
    #   --lora_head false       Usually not needed
    litgpt finetune_lora \
      microsoft/phi-2 \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --lora_r 16 \
      --lora_alpha 32 \
      --lora_dropout 0.05 \
      --lora_query true \
      --lora_key false \
      --lora_value true \
      --lora_projection true \
      --lora_mlp false \
      --lora_head false
    ```
    
    LoRA rank guide:
    - `r=8`: Lightweight, 2-4MB adapters
    - `r=16`: Standard, good quality
    - `r=32`: High capacity, use for complex tasks
    - `r=64`: Maximum quality, 4× larger adapters
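
    Where these sizes come from: each adapted weight matrix gains two low-rank factors with roughly `r × (d_in + d_out)` parameters, so adapter size grows linearly with rank. A rough, hypothetical estimate (the hidden size, layer count, and set of adapted matrices below are illustrative assumptions, not LitGPT defaults):

    ```python
    # Illustrative LoRA adapter size estimate (not LitGPT code; defaults are assumptions)
    def lora_adapter_megabytes(r, d_model=2560, n_layers=32, n_adapted=3, bytes_per_param=2):
        params_per_matrix = r * (d_model + d_model)        # A: d_model×r, B: r×d_model
        total_params = params_per_matrix * n_adapted * n_layers
        return total_params * bytes_per_param / 1e6

    for r in (8, 16, 32, 64):
        print(r, round(lora_adapter_megabytes(r), 1))      # size scales linearly with r
    ```

    Exact adapter size depends on the base model's dimensions and which projections LoRA is applied to.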
    
    **Step 3: Train with LoRA**
    
    ```bash
    litgpt finetune_lora \
      microsoft/phi-2 \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --lora_r 16 \
      --train.epochs 3 \
      --train.learning_rate 1e-4 \
      --train.micro_batch_size 4 \
      --train.global_batch_size 32 \
      --out_dir out/phi2-lora
    
    # Memory usage: ~8-12GB for Phi-2 with LoRA
    ```
    
    **Step 4: Merge LoRA weights** (optional)
    
    Merge LoRA adapters into base model for deployment:
    
    ```bash
    litgpt merge_lora \
      out/phi2-lora/final \
      --out_dir out/phi2-merged
    ```
    
    Now use merged model:
    ```python
    from litgpt import LLM
    llm = LLM.load("out/phi2-merged")
    ```
    
    ### Workflow 3: Pretrain from scratch
    
    Train new model on your domain data.
    
    ```
    Pretraining:
    - [ ] Step 1: Prepare pretraining dataset
    - [ ] Step 2: Configure model architecture
    - [ ] Step 3: Set up multi-GPU training
    - [ ] Step 4: Launch pretraining
    ```
    
    **Step 1: Prepare pretraining dataset**
    
    LitGPT expects tokenized data. Use `prepare_dataset.py`:
    
    ```bash
    python scripts/prepare_dataset.py \
      --source_path data/my_corpus.txt \
      --checkpoint_dir checkpoints/tokenizer \
      --destination_path data/pretrain \
      --split train,val
    ```
    
    **Step 2: Configure model architecture**
    
    Edit config file or use existing:
    
    ```yaml
    # config/pythia-160m.yaml
    model_name: pythia-160m
    block_size: 2048
    vocab_size: 50304
    n_layer: 12
    n_head: 12
    n_embd: 768
    rotary_percentage: 0.25
    parallel_residual: true
    bias: true
    ```
    
    **Step 3: Set up multi-GPU training**
    
    ```bash
    # Single GPU
    litgpt pretrain \
      --config config/pythia-160m.yaml \
      --data.data_dir data/pretrain \
      --train.max_tokens 10_000_000_000
    
    # Multi-GPU with FSDP
    litgpt pretrain \
      --config config/pythia-1b.yaml \
      --data.data_dir data/pretrain \
      --devices 8 \
      --train.max_tokens 100_000_000_000
    ```
    
    **Step 4: Launch pretraining**
    
    For large-scale pretraining on cluster:
    
    ```bash
    # Using SLURM
    sbatch --nodes=8 --gpus-per-node=8 \
      pretrain_script.sh
    
    # pretrain_script.sh content:
    litgpt pretrain \
      --config config/pythia-1b.yaml \
      --data.data_dir /shared/data/pretrain \
      --devices 8 \
      --num_nodes 8 \
      --train.global_batch_size 512 \
      --train.max_tokens 300_000_000_000
    ```
    
    ### Workflow 4: Convert and deploy model
    
    Export LitGPT models for production.
    
    ```
    Model Deployment:
    - [ ] Step 1: Test inference locally
    - [ ] Step 2: Quantize model (optional)
    - [ ] Step 3: Convert to GGUF (for llama.cpp)
    - [ ] Step 4: Deploy with API
    ```
    
    **Step 1: Test inference locally**
    
    ```python
    from litgpt import LLM
    
    llm = LLM.load("out/phi2-lora/final")
    
    # Single generation
    print(llm.generate("What is machine learning?"))
    
    # Streaming
    for token in llm.generate("Explain quantum computing", stream=True):
        print(token, end="", flush=True)
    
    # Batch inference
    prompts = ["Hello", "Goodbye", "Thank you"]
    results = [llm.generate(p) for p in prompts]
    ```
    
    **Step 2: Quantize model** (optional)
    
    Reduce model size with minimal quality loss:
    
    ```bash
    # 4-bit NF4 quantization
    litgpt convert_lit_checkpoint \
      out/phi2-lora/final \
      --dtype bfloat16 \
      --quantize bnb.nf4

    # 4-bit NF4 with double quantization (slightly smaller)
    litgpt convert_lit_checkpoint \
      out/phi2-lora/final \
      --quantize bnb.nf4-dq
    ```
    
    **Step 3: Convert to GGUF** (for llama.cpp)
    
    ```bash
    python scripts/convert_lit_checkpoint.py \
      --checkpoint_path out/phi2-lora/final \
      --output_path models/phi2.gguf \
      --model_name microsoft/phi-2
    ```
    
    **Step 4: Deploy with API**
    
    ```python
    from fastapi import FastAPI
    from litgpt import LLM
    
    app = FastAPI()
    llm = LLM.load("out/phi2-lora/final")
    
    @app.post("/generate")
    def generate(prompt: str, max_tokens: int = 100):
        result = llm.generate(
            prompt,
            max_new_tokens=max_tokens,
            temperature=0.7
        )
        return {"response": result}
    
    # Run: uvicorn api:app --host 0.0.0.0 --port 8000
    ```
    
    ## When to use vs alternatives
    
    **Use LitGPT when:**
    - Want to understand LLM architectures (clean, readable code)
    - Need production-ready training recipes
    - Educational purposes or research
    - Prototyping new model ideas
    - Lightning ecosystem user
    
    **Use alternatives instead:**
    - **Axolotl/TRL**: More fine-tuning features, YAML configs
    - **Megatron-Core**: Maximum performance for >70B models
    - **HuggingFace Transformers**: Broadest model support
    - **vLLM**: Inference-only (no training)
    
    ## Common issues
    
    **Issue: Out of memory during fine-tuning**
    
    Use LoRA instead of full fine-tuning:
    ```bash
    # Instead of litgpt finetune (requires 40GB+)
    litgpt finetune_lora  # Only needs 12-16GB
    ```
    
    Or keep the effective batch size while shrinking the micro-batch via gradient accumulation:
    ```bash
    litgpt finetune_lora \
      ... \
      --train.gradient_accumulation_iters 4  # Accumulate gradients
    ```
    
    **Issue: Training too slow**
    
    Enable Flash Attention (built-in, automatic on compatible hardware):
    ```python
    # Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
    # No configuration needed
    ```
    
    Use smaller micro-batch and accumulate:
    ```bash
    --train.micro_batch_size 1 \
    --train.global_batch_size 32 \
    --train.gradient_accumulation_iters 32  # Effective batch=32
    ```
    
    **Issue: Model not loading**
    
    Check model name:
    ```bash
    # List all available models
    litgpt download list
    
    # Download if not exists
    litgpt download meta-llama/Meta-Llama-3-8B
    ```
    
    Verify checkpoints directory:
    ```bash
    ls checkpoints/
    # Should see: meta-llama/Meta-Llama-3-8B/
    ```
    
    **Issue: LoRA adapters too large**
    
    Reduce LoRA rank:
    ```bash
    --lora_r 8  # Instead of 16 or 32
    ```
    
    Apply LoRA to fewer layers:
    ```bash
    --lora_query true \
    --lora_value true \
    --lora_projection false \
    --lora_mlp false  # Disable projection and MLP adapters
    ```
    
    ## Advanced topics
    
    **Supported architectures**: See [references/supported-models.md](references/supported-models.md) for complete list of 20+ model families with sizes and capabilities.
    
    **Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for proven hyperparameter configurations for pretraining and fine-tuning.
    
    **FSDP configuration**: See [references/distributed-training.md](references/distributed-training.md) for multi-GPU training with Fully Sharded Data Parallel.
    
    **Custom architectures**: See [references/custom-models.md](references/custom-models.md) for implementing new model architectures in LitGPT style.
    
    ## Hardware requirements
    
    - **GPU**: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
    - **Memory**:
      - Inference (Phi-2): 6GB
      - LoRA fine-tuning (7B): 16GB
      - Full fine-tuning (7B): 40GB+
      - Pretraining (1B): 24GB
    - **Storage**: 5-50GB per model (depending on size)
    
    ## Resources
    
    - GitHub: https://github.com/Lightning-AI/litgpt
    - Docs: https://lightning.ai/docs/litgpt
    - Tutorials: https://lightning.ai/docs/litgpt/tutorials
    - Model zoo: 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral, Mixtral, Falcon, etc.)
    
    
    
  • 01-model-architecture/nanogpt/SKILL.md
    ---
    name: nanogpt
    description: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, NanoGPT, GPT-2, Educational, Andrej Karpathy, Transformer, Minimalist, From Scratch, Training]
    dependencies: [torch, transformers, datasets, tiktoken, wandb]
    ---
    
    # nanoGPT - Minimalist GPT Training
    
    ## Quick start
    
    nanoGPT is a simplified GPT implementation designed for learning and experimentation.
    
    **Installation**:
    ```bash
    pip install torch numpy transformers datasets tiktoken wandb tqdm
    ```
    
    **Train on Shakespeare** (CPU-friendly):
    ```bash
    # Prepare data
    python data/shakespeare_char/prepare.py
    
    # Train (5 minutes on CPU)
    python train.py config/train_shakespeare_char.py
    
    # Generate text
    python sample.py --out_dir=out-shakespeare-char
    ```
    
    **Output**:
    ```
    ROMEO:
    What say'st thou? Shall I speak, and be a man?
    
    JULIET:
    I am afeard, and yet I'll speak; for thou art
    One that hath been a man, and yet I know not
    What thou art.
    ```
    
    ## Common workflows
    
    ### Workflow 1: Character-level Shakespeare
    
    **Complete training pipeline**:
    ```bash
    # Step 1: Prepare data (creates train.bin, val.bin)
    python data/shakespeare_char/prepare.py
    
    # Step 2: Train small model
    python train.py config/train_shakespeare_char.py
    
    # Step 3: Generate text
    python sample.py --out_dir=out-shakespeare-char
    ```
    
    **Config** (`config/train_shakespeare_char.py`):
    ```python
    # Model config
    n_layer = 6          # 6 transformer layers
    n_head = 6           # 6 attention heads
    n_embd = 384         # 384-dim embeddings
    block_size = 256     # 256 char context
    
    # Training config
    batch_size = 64
    learning_rate = 1e-3
    max_iters = 5000
    eval_interval = 500
    
    # Hardware
    device = 'cpu'  # Or 'cuda'
    compile = False # Set True for PyTorch 2.0
    ```
    
    **Training time**: ~5 minutes (CPU), ~1 minute (GPU)
    
    ### Workflow 2: Reproduce GPT-2 (124M)
    
    **Multi-GPU training on OpenWebText**:
    ```bash
    # Step 1: Prepare OpenWebText (takes ~1 hour)
    python data/openwebtext/prepare.py
    
    # Step 2: Train GPT-2 124M with DDP (8 GPUs)
    torchrun --standalone --nproc_per_node=8 \
      train.py config/train_gpt2.py
    
    # Step 3: Sample from trained model
    python sample.py --out_dir=out
    ```
    
    **Config** (`config/train_gpt2.py`):
    ```python
    # GPT-2 (124M) architecture
    n_layer = 12
    n_head = 12
    n_embd = 768
    block_size = 1024
    dropout = 0.0
    
    # Training
    batch_size = 12
    gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
    learning_rate = 6e-4
    max_iters = 600000
    lr_decay_iters = 600000
    
    # System
    compile = True  # PyTorch 2.0
    ```
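
    A quick check of the "~0.5M tokens" comment above (plain arithmetic, not nanoGPT code):

    ```python
    # Tokens processed per optimizer step with the config above
    batch_size = 12
    block_size = 1024
    gradient_accumulation_steps = 5 * 8   # the factor of 8 corresponds to 8 GPUs
    print(batch_size * block_size * gradient_accumulation_steps)  # 491520 ≈ 0.5M tokens
    ```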
    
    **Training time**: ~4 days (8× A100)
    
    ### Workflow 3: Fine-tune pretrained GPT-2
    
    **Start from OpenAI checkpoint**:
    ```python
    # In train.py or config
    init_from = 'gpt2'  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
    
    # Model loads OpenAI weights automatically when you run:
    #   python train.py config/finetune_shakespeare.py
    ```
    
    **Example config** (`config/finetune_shakespeare.py`):
    ```python
    # Start from GPT-2
    init_from = 'gpt2'
    
    # Dataset
    dataset = 'shakespeare_char'
    batch_size = 1
    block_size = 1024
    
    # Fine-tuning
    learning_rate = 3e-5  # Lower LR for fine-tuning
    max_iters = 2000
    warmup_iters = 100
    
    # Regularization
    weight_decay = 1e-1
    ```
    
    ### Workflow 4: Custom dataset
    
    **Train on your own text**:
    ```python
    # data/custom/prepare.py
    import numpy as np
    
    # Load your data
    with open('my_data.txt', 'r') as f:
        text = f.read()
    
    # Create character mappings
    chars = sorted(list(set(text)))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    
    # Tokenize
    data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
    
    # Split train/val
    n = len(data)
    train_data = data[:int(n*0.9)]
    val_data = data[int(n*0.9):]
    
    # Save
    train_data.tofile('data/custom/train.bin')
    val_data.tofile('data/custom/val.bin')
    ```
    
    **Train**:
    ```bash
    python data/custom/prepare.py
    python train.py --dataset=custom
    ```
    
    ## When to use vs alternatives
    
    **Use nanoGPT when**:
    - Learning how GPT works
    - Experimenting with transformer variants
    - Teaching/education purposes
    - Quick prototyping
    - Limited compute (can run on CPU)
    
    **Simplicity advantages**:
    - **~300 lines**: Entire model in `model.py`
    - **~300 lines**: Training loop in `train.py`
    - **Hackable**: Easy to modify
    - **No abstractions**: Pure PyTorch
    
    **Use alternatives instead**:
    - **HuggingFace Transformers**: Production use, many models
    - **Megatron-LM**: Large-scale distributed training
    - **LitGPT**: More architectures, production-ready
    - **PyTorch Lightning**: Need high-level framework
    
    ## Common issues
    
    **Issue: CUDA out of memory**
    
    Reduce batch size or context length:
    ```python
    batch_size = 1  # Reduce from 12
    block_size = 512  # Reduce from 1024
    gradient_accumulation_steps = 40  # Increase to maintain effective batch
    ```
    
    **Issue: Training too slow**
    
    Enable compilation (PyTorch 2.0+):
    ```python
    compile = True  # 2× speedup
    ```
    
    Use mixed precision:
    ```python
    dtype = 'bfloat16'  # Or 'float16'
    ```
    
    **Issue: Poor generation quality**
    
    Train longer:
    ```python
    max_iters = 10000  # Increase from 5000
    ```
    
    Lower temperature:
    ```python
    # In sample.py
    temperature = 0.7  # Lower from 1.0
    top_k = 200       # Add top-k sampling
    ```
    
    **Issue: Can't load GPT-2 weights**
    
    Install transformers:
    ```bash
    pip install transformers
    ```
    
    Check model name:
    ```python
    init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
    ```
    
    ## Advanced topics
    
    **Model architecture**: See [references/architecture.md](references/architecture.md) for GPT block structure, multi-head attention, and MLP layers explained simply.
    
    **Training loop**: See [references/training.md](references/training.md) for learning rate schedule, gradient accumulation, and distributed data parallel setup.
    
    **Data preparation**: See [references/data.md](references/data.md) for tokenization strategies (character-level vs BPE) and binary format details.
    
    ## Hardware requirements
    
    - **Shakespeare (char-level)**:
      - CPU: 5 minutes
      - GPU (T4): 1 minute
      - VRAM: <1GB
    
    - **GPT-2 (124M)**:
      - 1× A100: ~1 week
      - 8× A100: ~4 days
      - VRAM: ~16GB per GPU
    
    - **GPT-2 Medium (350M)**:
      - 8× A100: ~2 weeks
      - VRAM: ~40GB per GPU
    
    **Performance**:
    - With `compile=True`: 2× speedup
    - With `dtype=bfloat16`: 50% memory reduction
    
    ## Resources
    
    - GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
    - Video: "Let's build GPT" by Andrej Karpathy
    - Paper: "Attention is All You Need" (Vaswani et al.)
    - OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
    - Educational: Best for understanding transformers from scratch
    
    
    
  • 01-model-architecture/rwkv/SKILL.md
    ---
    name: rwkv-architecture
    description: RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [RWKV, Model Architecture, RNN, Transformer Hybrid, Linear Complexity, Infinite Context, Efficient Inference, Linux Foundation, Alternative Architecture]
    dependencies: [rwkv, torch, transformers]
    ---
    
    # RWKV - Receptance Weighted Key Value
    
    ## Quick start
    
    RWKV (pronounced "RwaKuv") combines Transformer parallelization (training) with RNN efficiency (inference).
    
    **Installation**:
    ```bash
    # Install PyTorch
    pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
    
    # Install dependencies
    pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
    
    # Install RWKV
    pip install rwkv
    ```
    
    **Basic usage** (GPT mode + RNN mode):
    ```python
    import os
    from rwkv.model import RWKV
    
    os.environ["RWKV_JIT_ON"] = '1'
    os.environ["RWKV_CUDA_ON"] = '1'  # Use CUDA kernel for speed
    
    # Load model
    model = RWKV(
        model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
        strategy='cuda fp16'
    )
    
    # GPT mode (parallel processing)
    out, state = model.forward([187, 510, 1563, 310, 247], None)
    print(out.detach().cpu().numpy())  # Logits
    
    # RNN mode (sequential processing, same result)
    out, state = model.forward([187, 510], None)  # First 2 tokens
    out, state = model.forward([1563], state)      # Next token
    out, state = model.forward([310, 247], state)  # Last tokens
    print(out.detach().cpu().numpy())  # Same logits as above!
    ```
    
    ## Common workflows
    
    ### Workflow 1: Text generation (streaming)
    
    **Efficient token-by-token generation**:
    ```python
    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE
    
    model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
    pipeline = PIPELINE(model, "20B_tokenizer.json")
    
    # Initial prompt
    prompt = "The future of AI is"
    state = None
    
    # Feed the prompt (processed in parallel, like a Transformer)
    out, state = pipeline.model.forward(pipeline.encode(prompt), state)

    # Continue generation token by token: sample, print, feed the token back
    for _ in range(100):
        token = pipeline.sample_logits(out)
        print(pipeline.decode([token]), end='', flush=True)
        out, state = pipeline.model.forward([token], state)
    ```
    
    **Key advantage**: Constant memory per token (no growing KV cache)
    
    ### Workflow 2: Long context processing (infinite context)
    
    **Process million-token sequences**:
    ```python
    model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')
    
    # Process very long document
    state = None
    long_document = load_document()  # e.g., 1M tokens
    
    # Stream through entire document
    for chunk in chunks(long_document, chunk_size=1024):
        out, state = model.forward(chunk, state)
    
    # State now contains information from entire 1M token document
    # Memory usage: O(1) (constant, not O(n)!)
    ```
    
    ### Workflow 3: Fine-tuning RWKV
    
    **Standard fine-tuning workflow**:
    ```python
    # Training script (sketch; full-scale RWKV training uses the RWKV-LM repo)
    import pytorch_lightning as pl
    from rwkv.model import RWKV
    
    # Configure model
    config = {
        'n_layer': 24,
        'n_embd': 1024,
        'vocab_size': 50277,
        'ctx_len': 1024
    }
    
    # Setup trainer
    trainer = pl.Trainer(
        accelerator='gpu',
        devices=8,
        precision='bf16',
        strategy='deepspeed_stage_2',
        max_epochs=1
    )
    
    # Train (train_dataloader: a DataLoader over your tokenized dataset)
    model = RWKV(config)
    trainer.fit(model, train_dataloader)
    ```
    
    ### Workflow 4: RWKV vs Transformer comparison
    
    **Memory comparison** (1M token sequence):
    ```python
    # Transformer (GPT)
    # Memory: O(n²) for attention
    # KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
    # Example: 1M × 4096 × 24 × 2 = ~400GB (impractical!)
    
    # RWKV
    # Memory: O(1) per token
    # State: hidden_dim × n_layers = 4096 × 24 = ~400KB
    # 1,000,000× more efficient!
    ```
    
    **Speed comparison** (inference):
    ```python
    # Transformer: O(n) per token (quadratic overall)
    # First token: 1 computation
    # Second token: 2 computations
    # ...
    # 1000th token: 1000 computations
    
    # RWKV: O(1) per token (linear overall)
    # Every token: 1 computation
    # 1000th token: 1 computation (same as first!)
    ```
    
    ## When to use vs alternatives
    
    **Use RWKV when**:
    - Need very long context (100K+ tokens)
    - Want constant memory usage
    - Building streaming applications
    - Need RNN efficiency with Transformer performance
    - Memory-constrained deployment
    
    **Key advantages**:
    - **Linear time**: O(n) vs O(n²) for Transformers
    - **No KV cache**: Constant memory per token
    - **Infinite context**: No fixed window limit
    - **Parallelizable training**: Like GPT
    - **Sequential inference**: Like RNN
    
    **Use alternatives instead**:
    - **Transformers**: Need absolute best performance, have compute
    - **Mamba**: Want state-space models
    - **RetNet**: Need retention mechanism
    - **Hyena**: Want convolution-based approach
    
    ## Common issues
    
    **Issue: Out of memory during training**
    
    Use gradient checkpointing and DeepSpeed:
    ```python
    trainer = pl.Trainer(
        strategy='deepspeed_stage_3',  # Full ZeRO-3
        precision='bf16'
    )
    ```
    
    **Issue: Slow inference**
    
    Enable CUDA kernel:
    ```python
    os.environ["RWKV_CUDA_ON"] = '1'
    ```
    
    **Issue: Model not loading**
    
    Check model path and strategy:
    ```python
    model = RWKV(
        model='/absolute/path/to/model.pth',
        strategy='cuda fp16'  # Or 'cpu fp32' for CPU
    )
    ```
    
    **Issue: State management in RNN mode**
    
    Always pass state between forward calls:
    ```python
    # WRONG: State lost
    out1, _ = model.forward(tokens1, None)
    out2, _ = model.forward(tokens2, None)  # No context from tokens1!
    
    # CORRECT: State preserved
    out1, state = model.forward(tokens1, None)
    out2, state = model.forward(tokens2, state)  # Has context from tokens1
    ```
    
    ## Advanced topics
    
    **Time-mixing and channel-mixing**: See [references/architecture-details.md](references/architecture-details.md) for WKV operation, time-decay mechanism, and receptance gates.
    
    **State management**: See [references/state-management.md](references/state-management.md) for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.
    
    **RWKV-7 improvements**: See [references/rwkv7.md](references/rwkv7.md) for latest architectural improvements (March 2025) and multimodal capabilities.
    
    ## Hardware requirements
    
    - **GPU**: NVIDIA (CUDA 11.6+) or CPU
    - **VRAM** (FP16):
      - 169M model: 1GB
      - 430M model: 2GB
      - 1.5B model: 4GB
      - 3B model: 8GB
      - 7B model: 16GB
      - 14B model: 32GB
    - **Inference**: O(1) memory per token
    - **Training**: Parallelizable like GPT
    
    **Performance** (vs Transformers):
    - **Speed**: Similar training, faster inference
    - **Memory**: 1000× less for long sequences
    - **Scaling**: Linear vs quadratic
    
    ## Resources
    
    - Paper (RWKV): https://arxiv.org/abs/2305.13048 (May 2023)
    - Paper (RWKV-7): https://arxiv.org/abs/2503.14456 (March 2025)
    - GitHub: https://github.com/BlinkDL/RWKV-LM ⭐ 12,000+
    - Docs: https://wiki.rwkv.com/
    - Models: https://huggingface.co/BlinkDL
    - Linux Foundation AI: Official project
    - Production: Microsoft Windows, Office integration, NeMo support
    
    
    
  • 01-model-architecture/torchtitan/SKILL.md
    ---
    name: distributed-llm-pretraining-torchtitan
    description: Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, Distributed Training, TorchTitan, FSDP2, Tensor Parallel, Pipeline Parallel, Context Parallel, Float8, Llama, Pretraining]
    dependencies: [torch>=2.6.0, torchtitan>=0.2.0, torchao>=0.5.0]
    ---
    
    # TorchTitan - PyTorch Native Distributed LLM Pretraining
    
    ## Quick start
    
    TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
    
    **Installation**:
    ```bash
    # From PyPI (stable)
    pip install torchtitan
    
    # From source (latest features, requires PyTorch nightly)
    git clone https://github.com/pytorch/torchtitan
    cd torchtitan
    pip install -r requirements.txt
    ```
    
    **Download tokenizer**:
    ```bash
    # Get HF token from https://huggingface.co/settings/tokens
    python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
    ```
    
    **Start training on 8 GPUs**:
    ```bash
    CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
    ```
    
    ## Common workflows
    
    ### Workflow 1: Pretrain Llama 3.1 8B on single node
    
    Copy this checklist:
    
    ```
    Single Node Pretraining:
    - [ ] Step 1: Download tokenizer
    - [ ] Step 2: Configure training
    - [ ] Step 3: Launch training
    - [ ] Step 4: Monitor and checkpoint
    ```
    
    **Step 1: Download tokenizer**
    
    ```bash
    python scripts/download_hf_assets.py \
      --repo_id meta-llama/Llama-3.1-8B \
      --assets tokenizer \
      --hf_token=YOUR_HF_TOKEN
    ```
    
    **Step 2: Configure training**
    
    Edit or create a TOML config file:
    
    ```toml
    # llama3_8b_custom.toml
    [job]
    dump_folder = "./outputs"
    description = "Llama 3.1 8B training"
    
    [model]
    name = "llama3"
    flavor = "8B"
    hf_assets_path = "./assets/hf/Llama-3.1-8B"
    
    [optimizer]
    name = "AdamW"
    lr = 3e-4
    
    [lr_scheduler]
    warmup_steps = 200
    
    [training]
    local_batch_size = 2
    seq_len = 8192
    max_norm = 1.0
    steps = 1000
    dataset = "c4"
    
    [parallelism]
    data_parallel_shard_degree = -1  # Use all GPUs for FSDP
    
    [activation_checkpoint]
    mode = "selective"
    selective_ac_option = "op"
    
    [checkpoint]
    enable = true
    folder = "checkpoint"
    interval = 500
    ```
    
    **Step 3: Launch training**
    
    ```bash
    # 8 GPUs on single node
    CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
    
    # Or explicitly with torchrun
    torchrun --nproc_per_node=8 \
      -m torchtitan.train \
      --job.config_file ./llama3_8b_custom.toml
    ```
    
    **Step 4: Monitor and checkpoint**
    
    TensorBoard logs are saved to `./outputs/tb/`:
    ```bash
    tensorboard --logdir ./outputs/tb
    ```
    
    ### Workflow 2: Multi-node training with SLURM
    
    ```
    Multi-Node Training:
    - [ ] Step 1: Configure parallelism for scale
    - [ ] Step 2: Set up SLURM script
    - [ ] Step 3: Submit job
    - [ ] Step 4: Resume from checkpoint
    ```
    
    **Step 1: Configure parallelism for scale**
    
    For 70B model on 256 GPUs (32 nodes):
    ```toml
    [parallelism]
    data_parallel_shard_degree = 32  # FSDP across 32 ranks
    tensor_parallel_degree = 8        # TP within node
    pipeline_parallel_degree = 1      # No PP for 70B
    context_parallel_degree = 1       # Increase for long sequences
    ```
    
    **Step 2: Set up SLURM script**
    
    ```bash
    #!/bin/bash
    #SBATCH --job-name=llama70b
    #SBATCH --nodes=32
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    
    srun torchrun \
      --nnodes=32 \
      --nproc_per_node=8 \
      --rdzv_backend=c10d \
      --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
      -m torchtitan.train \
      --job.config_file ./llama3_70b.toml
    ```
    
    **Step 3: Submit job**
    
    ```bash
    sbatch multinode_trainer.slurm
    ```
    
    **Step 4: Resume from checkpoint**
    
    Training auto-resumes if checkpoint exists in configured folder.
    
    ### Workflow 3: Enable Float8 training for H100s
    
    Float8 provides 30-50% speedup on H100 GPUs.
    
    ```
    Float8 Training:
    - [ ] Step 1: Install torchao
    - [ ] Step 2: Configure Float8
    - [ ] Step 3: Launch with compile
    ```
    
    **Step 1: Install torchao**
    
    ```bash
    USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
    ```
    
    **Step 2: Configure Float8**
    
    Add to your TOML config:
    ```toml
    [model]
    converters = ["quantize.linear.float8"]
    
    [quantize.linear.float8]
    enable_fsdp_float8_all_gather = true
    precompute_float8_dynamic_scale_for_fsdp = true
    filter_fqns = ["output"]  # Exclude output layer
    
    [compile]
    enable = true
    components = ["model", "loss"]
    ```
    
    **Step 3: Launch with compile**
    
    ```bash
    CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
      --model.converters="quantize.linear.float8" \
      --quantize.linear.float8.enable_fsdp_float8_all_gather \
      --compile.enable
    ```
    
    ### Workflow 4: 4D parallelism for 405B models
    
    ```
    4D Parallelism (FSDP + TP + PP + CP):
    - [ ] Step 1: Create seed checkpoint
    - [ ] Step 2: Configure 4D parallelism
    - [ ] Step 3: Launch on 512 GPUs
    ```
    
    **Step 1: Create seed checkpoint**
    
    Required for consistent initialization across PP stages:
    ```bash
    NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
      --checkpoint.enable \
      --checkpoint.create_seed_checkpoint \
      --parallelism.data_parallel_shard_degree 1 \
      --parallelism.tensor_parallel_degree 1 \
      --parallelism.pipeline_parallel_degree 1
    ```
    
    **Step 2: Configure 4D parallelism**
    
    ```toml
    [parallelism]
    data_parallel_shard_degree = 8   # FSDP
    tensor_parallel_degree = 8       # TP within node
    pipeline_parallel_degree = 8     # PP across nodes
    context_parallel_degree = 1      # CP for long sequences
    
    [training]
    local_batch_size = 32
    seq_len = 8192
    ```
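
    The degrees must compose to the job's GPU count: with data-parallel replication left at 1, total GPUs = FSDP × TP × PP × CP. A quick check of the layout above (plain arithmetic):

    ```python
    # 8 (FSDP) × 8 (TP) × 8 (PP) × 1 (CP) = 512 GPUs = 64 nodes × 8 GPUs
    fsdp, tp, pp, cp = 8, 8, 8, 1
    assert fsdp * tp * pp * cp == 64 * 8
    ```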
    
    **Step 3: Launch on 512 GPUs**
    
    ```bash
    # 64 nodes x 8 GPUs = 512 GPUs
    srun torchrun --nnodes=64 --nproc_per_node=8 \
      -m torchtitan.train \
      --job.config_file ./llama3_405b.toml
    ```
    
    ## When to use vs alternatives
    
    **Use TorchTitan when:**
    - Pretraining LLMs from scratch (8B to 405B+)
    - Need PyTorch-native solution without third-party dependencies
    - Require composable 4D parallelism (FSDP2, TP, PP, CP)
    - Training on H100s with Float8 support
    - Want interoperable checkpoints with torchtune/HuggingFace
    
    **Use alternatives instead:**
    - **Megatron-LM**: Maximum performance for NVIDIA-only deployments
    - **DeepSpeed**: Broader ZeRO optimization ecosystem, inference support
    - **Axolotl/TRL**: Fine-tuning rather than pretraining
    - **LitGPT**: Educational, smaller-scale training
    
    ## Common issues
    
    **Issue: Out of memory on large models**
    
    Enable activation checkpointing and reduce batch size:
    ```toml
    [activation_checkpoint]
    mode = "full"  # Instead of "selective"
    
    [training]
    local_batch_size = 1
    ```
    
    Or use gradient accumulation:
    ```toml
    [training]
    local_batch_size = 1
    global_batch_size = 32  # Accumulates gradients
    ```
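
    The number of accumulation steps follows from the batch settings; a rough illustration (plain arithmetic, with an assumed data-parallel degree of 8 for a single node):

    ```python
    # Hypothetical example: micro-steps accumulated per optimizer step
    local_batch_size = 1     # per-rank batch per forward/backward
    dp_degree = 8            # assumed data-parallel ranks
    global_batch_size = 32
    print(global_batch_size // (local_batch_size * dp_degree))  # 4
    ```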
    
    **Issue: TP causes high memory with async collectives**
    
    Set environment variable:
    ```bash
    export TORCH_NCCL_AVOID_RECORD_STREAMS=1
    ```
    
    **Issue: Float8 training not faster**
    
    Float8 only benefits large GEMMs. Filter small layers:
    ```toml
    [quantize.linear.float8]
    filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
    ```
    
    **Issue: Checkpoint loading fails after parallelism change**
    
    Use DCP's resharding capability:
    ```bash
    # Convert sharded checkpoint to single file
    python -m torch.distributed.checkpoint.format_utils \
      dcp_to_torch checkpoint/step-1000 checkpoint.pt
    ```
    
    **Issue: Pipeline parallelism initialization**
    
    Create seed checkpoint first (see Workflow 4, Step 1).
    
    ## Supported models
    
    | Model | Sizes | Status |
    |-------|-------|--------|
    | Llama 3.1 | 8B, 70B, 405B | Production |
    | Llama 4 | Various | Experimental |
    | DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
    | GPT-OSS | 20B, 120B (MoE) | Experimental |
    | Qwen 3 | Various | Experimental |
    | Flux | Diffusion | Experimental |
    
    ## Performance benchmarks (H100)
    
    | Model | GPUs | Parallelism | TPS/GPU | Techniques |
    |-------|------|-------------|---------|------------|
    | Llama 8B | 8 | FSDP | 5,762 | Baseline |
    | Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
    | Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
    | Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
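
    The "+48%" entry follows directly from the two throughput values (plain arithmetic):

    ```python
    baseline_tps, fp8_tps = 5762, 8532   # TPS/GPU from the table above
    print(f"{(fp8_tps / baseline_tps - 1) * 100:.0f}%")  # 48%
    ```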
    
    ## Advanced topics
    
    **FSDP2 configuration**: See [references/fsdp.md](references/fsdp.md) for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
    
    **Float8 training**: See [references/float8.md](references/float8.md) for tensorwise vs rowwise scaling recipes.
    
    **Checkpointing**: See [references/checkpoint.md](references/checkpoint.md) for HuggingFace conversion and async checkpointing.
    
    **Adding custom models**: See [references/custom-models.md](references/custom-models.md) for TrainSpec protocol.
    
    ## Resources
    
    - GitHub: https://github.com/pytorch/torchtitan
    - Paper: https://arxiv.org/abs/2410.06511
    - ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
    - PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44
    
    
  • 02-tokenization/huggingface-tokenizers/SKILL.md
    ---
    name: huggingface-tokenizers
    description: Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production]
    dependencies: [tokenizers, transformers, datasets]
    ---
    
    # HuggingFace Tokenizers - Fast Tokenization for NLP
    
    Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
    
    ## When to use HuggingFace Tokenizers
    
    **Use HuggingFace Tokenizers when:**
    - Need extremely fast tokenization (<20s per GB of text)
    - Training custom tokenizers from scratch
    - Want alignment tracking (token → original text position)
    - Building production NLP pipelines
    - Need to tokenize large corpora efficiently
    
    **Performance**:
    - **Speed**: <20 seconds to tokenize 1GB on CPU
    - **Implementation**: Rust core with Python/Node.js bindings
    - **Efficiency**: 10-100× faster than pure Python implementations
    
    **Use alternatives instead**:
    - **SentencePiece**: Language-independent, used by T5/ALBERT
    - **tiktoken**: OpenAI's BPE tokenizer for GPT models
    - **transformers AutoTokenizer**: Loading pretrained only (uses this library internally)
    
    ## Quick start
    
    ### Installation
    
    ```bash
    # Install tokenizers
    pip install tokenizers
    
    # With transformers integration
    pip install tokenizers transformers
    ```
    
    ### Load pretrained tokenizer
    
    ```python
    from tokenizers import Tokenizer
    
    # Load from HuggingFace Hub
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    
    # Encode text
    output = tokenizer.encode("Hello, how are you?")
    print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
    print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]
    
    # Decode back
    text = tokenizer.decode(output.ids)
    print(text)  # "hello, how are you?"
    ```
    
    ### Train custom BPE tokenizer
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace
    
    # Initialize tokenizer with BPE model
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    
    # Configure trainer
    trainer = BpeTrainer(
        vocab_size=30000,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        min_frequency=2
    )
    
    # Train on files
    files = ["train.txt", "validation.txt"]
    tokenizer.train(files, trainer)
    
    # Save
    tokenizer.save("my-tokenizer.json")
    ```
    
    **Training time**: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
    
    ### Batch encoding with padding
    
    ```python
    # Enable padding
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
    
    # Encode batch
    texts = ["Hello world", "This is a longer sentence"]
    encodings = tokenizer.encode_batch(texts)
    
    for encoding in encodings:
        print(encoding.ids)
    # [101, 7592, 2088, 102, 3, 3, 3]
    # [101, 2023, 2003, 1037, 2936, 6251, 102]
    ```
    
    ## Tokenization algorithms
    
    ### BPE (Byte-Pair Encoding)
    
    **How it works**:
    1. Start with character-level vocabulary
    2. Find most frequent character pair
    3. Merge into new token, add to vocabulary
    4. Repeat until vocabulary size reached
    
    **Used by**: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import ByteLevel
    
    tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
    tokenizer.pre_tokenizer = ByteLevel()
    
    trainer = BpeTrainer(
        vocab_size=50257,
        special_tokens=["<|endoftext|>"],
        min_frequency=2
    )
    
    tokenizer.train(files=["data.txt"], trainer=trainer)
    ```
    
    **Advantages**:
    - Handles OOV words well (breaks into subwords)
    - Flexible vocabulary size
    - Good for morphologically rich languages
    
    **Trade-offs**:
    - Tokenization depends on merge order
    - May split common words unexpectedly
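
    To make the merge loop above concrete, here is a toy, pure-Python illustration of the BPE training steps (word frequencies and the number of merges are made up; this is not the `tokenizers` implementation):

    ```python
    from collections import Counter

    # Toy corpus: word -> frequency, with words pre-split into characters
    words = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("n","e","w"): 6, ("w","i","d","e","r"): 3}

    def most_frequent_pair(words):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0]

    def apply_merge(words, pair):
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged[tuple(out)] = freq
        return merged

    for _ in range(3):                      # 3 merges, for illustration
        pair = most_frequent_pair(words)
        print("merge:", pair)
        words = apply_merge(words, pair)
    print(words)
    ```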
    
    ### WordPiece
    
    **How it works**:
    1. Start with character vocabulary
    2. Score merge pairs: `frequency(pair) / (frequency(first) × frequency(second))`
    3. Merge highest scoring pair
    4. Repeat until vocabulary size reached
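
    For intuition, a hypothetical score computation with made-up frequencies:

    ```python
    # Made-up counts for a candidate merge, e.g. ("un", "##able")
    freq_pair, freq_first, freq_second = 20, 100, 50
    print(freq_pair / (freq_first * freq_second))  # 0.004: rare components boost the score
    ```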
    
    **Used by**: BERT, DistilBERT, MobileBERT
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.trainers import WordPieceTrainer
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.normalizers import BertNormalizer
    
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = BertNormalizer(lowercase=True)
    tokenizer.pre_tokenizer = Whitespace()
    
    trainer = WordPieceTrainer(
        vocab_size=30522,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        continuing_subword_prefix="##"
    )
    
    tokenizer.train(files=["corpus.txt"], trainer=trainer)
    ```
    
    **Advantages**:
    - Prioritizes meaningful merges (high score = semantically related)
    - Used successfully in BERT (state-of-the-art results)
    
    **Trade-offs**:
    - Unknown words become `[UNK]` if no subword match
    - Saves vocabulary, not merge rules (larger files)
    
    ### Unigram
    
    **How it works**:
    1. Start with large vocabulary (all substrings)
    2. Compute loss for corpus with current vocabulary
    3. Remove tokens with minimal impact on loss
    4. Repeat until vocabulary size reached
    
    **Used by**: ALBERT, T5, mBART, XLNet (via SentencePiece)
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import Unigram
    from tokenizers.trainers import UnigramTrainer
    
    tokenizer = Tokenizer(Unigram())
    
    trainer = UnigramTrainer(
        vocab_size=8000,
        special_tokens=["<unk>", "<s>", "</s>"],
        unk_token="<unk>"
    )
    
    tokenizer.train(files=["data.txt"], trainer=trainer)
    ```
    
    **Advantages**:
    - Probabilistic (finds most likely tokenization)
    - Works well for languages without word boundaries
    - Handles diverse linguistic contexts
    
    **Trade-offs**:
    - Computationally expensive to train
    - More hyperparameters to tune
    
    ## Tokenization pipeline
    
    Complete pipeline: **Normalization → Pre-tokenization → Model → Post-processing**
    
    ### Normalization
    
    Clean and standardize text:
    
    ```python
    from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
    
    tokenizer.normalizer = Sequence([
        NFD(),           # Unicode normalization (decompose)
        Lowercase(),     # Convert to lowercase
        StripAccents()   # Remove accents
    ])
    
    # Input: "Héllo WORLD"
    # After normalization: "hello world"
    ```
    
    **Common normalizers**:
    - `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
    - `Lowercase()` - Convert to lowercase
    - `StripAccents()` - Remove accents (é → e)
    - `Strip()` - Remove whitespace
    - `Replace(pattern, content)` - Regex replacement
    
    ### Pre-tokenization
    
    Split text into word-like units:
    
    ```python
    from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
    
    # Split on whitespace and punctuation
    tokenizer.pre_tokenizer = Sequence([
        Whitespace(),
        Punctuation()
    ])
    
    # Input: "Hello, world!"
    # After pre-tokenization: ["Hello", ",", "world", "!"]
    ```
    
    **Common pre-tokenizers**:
    - `Whitespace()` - Split on spaces, tabs, newlines
    - `ByteLevel()` - GPT-2 style byte-level splitting
    - `Punctuation()` - Isolate punctuation
    - `Digits(individual_digits=True)` - Split digits individually
    - `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
    
    ### Post-processing
    
    Add special tokens for model input:
    
    ```python
    from tokenizers.processors import TemplateProcessing
    
    # BERT-style: [CLS] sentence [SEP]
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[
            ("[CLS]", 1),
            ("[SEP]", 2),
        ],
    )
    ```
    
    **Common patterns**:
    ```python
    # GPT-2: sentence <|endoftext|>
    TemplateProcessing(
        single="$A <|endoftext|>",
        special_tokens=[("<|endoftext|>", 50256)]
    )
    
    # RoBERTa: <s> sentence </s>
    TemplateProcessing(
        single="<s> $A </s>",
        pair="<s> $A </s> </s> $B </s>",
        special_tokens=[("<s>", 0), ("</s>", 2)]
    )
    ```
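
    Putting the four stages together, a minimal end-to-end sketch (the component choices and file name are illustrative):

    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.processors import TemplateProcessing
    from tokenizers.trainers import WordPieceTrainer

    # 1. Normalization + 2. Pre-tokenization + 3. Model
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
    tokenizer.pre_tokenizer = Whitespace()

    trainer = WordPieceTrainer(
        vocab_size=30000,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
    )
    tokenizer.train(files=["corpus.txt"], trainer=trainer)

    # 4. Post-processing, attached after training so special-token ids are known
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )

    print(tokenizer.encode("Héllo, world!").tokens)
    ```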
    
    ## Alignment tracking
    
    Track token positions in original text:
    
    ```python
    text = "Hello, world!"
    output = tokenizer.encode(text)
    
    # Get token offsets
    for token, offset in zip(output.tokens, output.offsets):
        start, end = offset
        print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
    
    # Output:
    # hello      → [ 0,  5): 'Hello'
    # ,          → [ 5,  6): ','
    # world      → [ 7, 12): 'world'
    # !          → [12, 13): '!'
    ```
    
    **Use cases**:
    - Named entity recognition (map predictions back to text)
    - Question answering (extract answer spans)
    - Token classification (align labels to original positions)
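
    For example, a sketch of the question-answering case: mapping a predicted answer span back to the original text via offsets (the predicted token indices below are made up):

    ```python
    text = "The Eiffel Tower is located in Paris, France."
    output = tokenizer.encode(text)

    # Suppose a QA model predicts the answer spans tokens 7..8 (hypothetical indices)
    start_tok, end_tok = 7, 8
    char_start = output.offsets[start_tok][0]
    char_end = output.offsets[end_tok][1]
    print(text[char_start:char_end])  # the answer as it appears in the original text
    ```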
    
    ## Integration with transformers
    
    ### Load with AutoTokenizer
    
    ```python
    from transformers import AutoTokenizer
    
    # AutoTokenizer automatically uses fast tokenizers
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    # Check if using fast tokenizer
    print(tokenizer.is_fast)  # True
    
    # Access underlying tokenizers.Tokenizer
    fast_tokenizer = tokenizer.backend_tokenizer
    print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
    ```
    
    ### Convert custom tokenizer to transformers
    
    ```python
    from tokenizers import Tokenizer
    from transformers import PreTrainedTokenizerFast
    
    # Train custom tokenizer
    tokenizer = Tokenizer(BPE())
    # ... train tokenizer ...
    tokenizer.save("my-tokenizer.json")
    
    # Wrap for transformers
    transformers_tokenizer = PreTrainedTokenizerFast(
        tokenizer_file="my-tokenizer.json",
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]"
    )
    
    # Use like any transformers tokenizer
    outputs = transformers_tokenizer(
        "Hello world",
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    ```
    
    ## Common patterns
    
    ### Train from iterator (large datasets)
    
    ```python
    from datasets import load_dataset
    
    # Load dataset
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    
    # Create batch iterator
    def batch_iterator(batch_size=1000):
        for i in range(0, len(dataset), batch_size):
            yield dataset[i:i + batch_size]["text"]
    
    # Train tokenizer
    tokenizer.train_from_iterator(
        batch_iterator(),
        trainer=trainer,
        length=len(dataset)  # For progress bar
    )
    ```
    
    **Performance**: Processes 1GB in ~10-20 minutes
    
    ### Enable truncation and padding
    
    ```python
    # Enable truncation
    tokenizer.enable_truncation(max_length=512)
    
    # Enable padding
    tokenizer.enable_padding(
        pad_id=tokenizer.token_to_id("[PAD]"),
        pad_token="[PAD]",
        length=512  # Fixed length, or None for batch max
    )
    
    # Encode with both
    output = tokenizer.encode("This is a long sentence that will be truncated...")
    print(len(output.ids))  # 512
    ```
    
    ### Multi-processing
    
    ```python
    from tokenizers import Tokenizer
    from multiprocessing import Pool
    
    # Load tokenizer
    tokenizer = Tokenizer.from_file("tokenizer.json")
    
    def encode_batch(texts):
        return tokenizer.encode_batch(texts)
    
    # Process large corpus in parallel
    with Pool(8) as pool:
        # Split corpus into chunks
        chunk_size = 1000
        chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]
    
        # Encode in parallel
        results = pool.map(encode_batch, chunks)
    ```
    
    **Speedup**: 5-8× with 8 cores
    
    ## Performance benchmarks
    
    ### Training speed
    
    | Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
    |-------------|-----------------|-----------------|--------------|
    | 10 MB       | 15 sec          | 18 sec          | 25 sec       |
    | 100 MB      | 1.5 min         | 2 min           | 4 min        |
    | 1 GB        | 15 min          | 20 min          | 40 min       |
    
    **Hardware**: 16-core CPU, tested on English Wikipedia
    
    ### Tokenization speed
    
    | Implementation | 1 GB corpus | Throughput    |
    |----------------|-------------|---------------|
    | Pure Python    | ~20 minutes | ~50 MB/min    |
    | HF Tokenizers  | ~15 seconds | ~4 GB/min     |
    | **Speedup**    | **80×**     | **80×**       |
    
    **Test**: English text, average sentence length 20 words
    
    ### Memory usage
    
    | Task                    | Memory  |
    |-------------------------|---------|
    | Load tokenizer          | ~10 MB  |
    | Train BPE (30k vocab)   | ~200 MB |
    | Encode 1M sentences     | ~500 MB |
    
    ## Supported models
    
    Pre-trained tokenizers available via `from_pretrained()`:
    
    **BERT family**:
    - `bert-base-uncased`, `bert-large-cased`
    - `distilbert-base-uncased`
    - `roberta-base`, `roberta-large`
    
    **GPT family**:
    - `gpt2`, `gpt2-medium`, `gpt2-large`
    - `distilgpt2`
    
    **T5 family**:
    - `t5-small`, `t5-base`, `t5-large`
    - `google/flan-t5-xxl`
    
    **Other**:
    - `facebook/bart-base`, `facebook/mbart-large-cc25`
    - `albert-base-v2`, `albert-xlarge-v2`
    - `xlm-roberta-base`, `xlm-roberta-large`
    
    Browse all: https://huggingface.co/models?library=tokenizers
    
    ## References
    
    - **[Training Guide](references/training.md)** - Train custom tokenizers, configure trainers, handle large datasets
    - **[Algorithms Deep Dive](references/algorithms.md)** - BPE, WordPiece, Unigram explained in detail
    - **[Pipeline Components](references/pipeline.md)** - Normalizers, pre-tokenizers, post-processors, decoders
    - **[Transformers Integration](references/integration.md)** - AutoTokenizer, PreTrainedTokenizerFast, special tokens
    
    ## Resources
    
    - **Docs**: https://huggingface.co/docs/tokenizers
    - **GitHub**: https://github.com/huggingface/tokenizers ⭐ 9,000+
    - **Version**: 0.20.0+
    - **Course**: https://huggingface.co/learn/nlp-course/chapter6/1
    - **Paper**: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)
    
    
    
  • 01-model-architecture/mamba/SKILL.md
    ---
    name: mamba-architecture
    description: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, Mamba, State Space Models, SSM, Linear Complexity, Long Context, Efficient Inference, Hardware-Aware, Alternative To Transformers]
    dependencies: [mamba-ssm, torch, transformers, causal-conv1d]
    ---
    
    # Mamba - Selective State Space Models
    
    ## Quick start
    
    Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.
    
    **Installation**:
    ```bash
    # Install causal-conv1d (optional, for efficiency)
    pip install 'causal-conv1d>=1.4.0'
    
    # Install Mamba
    pip install mamba-ssm
    # Or both together
    pip install 'mamba-ssm[causal-conv1d]'
    ```
    
    **Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+
    
    **Basic usage** (Mamba block):
    ```python
    import torch
    from mamba_ssm import Mamba
    
    batch, length, dim = 2, 64, 16
    x = torch.randn(batch, length, dim).to("cuda")
    
    model = Mamba(
        d_model=dim,      # Model dimension
        d_state=16,       # SSM state dimension
        d_conv=4,         # Conv1d kernel size
        expand=2          # Expansion factor
    ).to("cuda")
    
    y = model(x)  # O(n) complexity!
    assert y.shape == x.shape
    ```
    
    ## Common workflows
    
    ### Workflow 1: Language model with Mamba-2
    
    **Complete LM with generation**:
    ```python
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
    from mamba_ssm.models.config_mamba import MambaConfig
    import torch
    
    # Configure Mamba-2 LM
    config = MambaConfig(
        d_model=1024,           # Hidden dimension
        n_layer=24,             # Number of layers
        vocab_size=50277,       # Vocabulary size
        ssm_cfg=dict(
            layer="Mamba2",     # Use Mamba-2
            d_state=128,        # Larger state for Mamba-2
            headdim=64,         # Head dimension
            ngroups=1           # Number of groups
        )
    )
    
    model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
    
    # Generate text
    input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
    output = model.generate(
        input_ids=input_ids,
        max_length=100,
        temperature=0.7,
        top_p=0.9
    )
    ```
    
    ### Workflow 2: Use pretrained Mamba models
    
    **Load from HuggingFace**:
    ```python
    from transformers import AutoTokenizer
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
    
    # Load pretrained model
    model_name = "state-spaces/mamba-2.8b"
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
    model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)
    
    # Generate
    prompt = "The future of AI is"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=200,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2
    )
    generated_text = tokenizer.decode(output_ids[0])
    print(generated_text)
    ```
    
    **Available models**:
    - `state-spaces/mamba-130m`
    - `state-spaces/mamba-370m`
    - `state-spaces/mamba-790m`
    - `state-spaces/mamba-1.4b`
    - `state-spaces/mamba-2.8b`
    
    ### Workflow 3: Mamba-1 vs Mamba-2
    
    **Mamba-1** (smaller state):
    ```python
    from mamba_ssm import Mamba
    
    model = Mamba(
        d_model=256,
        d_state=16,      # Smaller state dimension
        d_conv=4,
        expand=2
    ).to("cuda")
    ```
    
    **Mamba-2** (multi-head, larger state):
    ```python
    from mamba_ssm import Mamba2
    
    model = Mamba2(
        d_model=256,
        d_state=128,     # Larger state dimension
        d_conv=4,
        expand=2,
        headdim=64,      # Head dimension for multi-head
        ngroups=1        # Parallel groups
    ).to("cuda")
    ```
    
    **Key differences**:
    - **State size**: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
    - **Architecture**: Mamba-2 has multi-head structure
    - **Normalization**: Mamba-2 uses RMSNorm
    - **Distributed**: Mamba-2 supports tensor parallelism
    
    ### Workflow 4: Benchmark vs Transformers
    
    **Generation speed comparison**:
    ```bash
    # Benchmark Mamba
    python benchmarks/benchmark_generation_mamba_simple.py \
      --model-name "state-spaces/mamba-2.8b" \
      --prompt "The future of machine learning is" \
      --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
    
    # Benchmark Transformer
    python benchmarks/benchmark_generation_mamba_simple.py \
      --model-name "EleutherAI/pythia-2.8b" \
      --prompt "The future of machine learning is" \
      --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
    ```
    
    **Expected results**:
    - **Mamba**: 5× faster inference
    - **Memory**: No KV cache needed
    - **Scaling**: Linear with sequence length
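
    The repo's benchmark scripts above are the authoritative comparison. As a rough local check of the linear-scaling claim, a minimal timing sketch for a single Mamba block (sequence lengths and dimensions are illustrative):
    
    ```python
    import time
    import torch
    from mamba_ssm import Mamba
    
    # Single Mamba block, same constructor arguments as the Quick start example
    model = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
    
    for length in [1024, 4096, 16384]:
        x = torch.randn(2, length, 256, device="cuda")
        torch.cuda.synchronize()
        start = time.time()
        with torch.no_grad():
            model(x)
        torch.cuda.synchronize()
        # Wall time should grow roughly linearly with sequence length
        print(f"seq_len={length}: {time.time() - start:.3f} s")
    ```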
    
    ## When to use vs alternatives
    
    **Use Mamba when**:
    - Need long sequences (100K+ tokens)
    - Want faster inference than Transformers
    - Memory-constrained (no KV cache)
    - Building streaming applications
    - Linear scaling important
    
    **Advantages**:
    - **O(n) complexity**: Linear vs quadratic
    - **5× faster inference**: No attention overhead
    - **No KV cache**: Lower memory usage
    - **Million-token sequences**: Hardware-efficient
    - **Streaming**: Constant memory per token
    
    **Use alternatives instead**:
    - **Transformers**: Need best-in-class performance, have compute
    - **RWKV**: Want RNN+Transformer hybrid
    - **RetNet**: Need retention-based architecture
    - **Hyena**: Want convolution-based approach
    
    ## Common issues
    
    **Issue: CUDA out of memory**
    
    Reduce batch size or use gradient checkpointing:
    ```python
    model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
    model.gradient_checkpointing_enable()  # Enable checkpointing
    ```
    
    **Issue: Slow installation**
    
    If the build is slow or fails because pip's isolated build environment cannot see your installed PyTorch, build against the existing environment:
    ```bash
    pip install mamba-ssm --no-build-isolation
    ```
    
    **Issue: Missing causal-conv1d**
    
    Install separately:
    ```bash
    pip install 'causal-conv1d>=1.4.0'
    ```
    
    **Issue: Model not loading from HuggingFace**
    
    Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
    ```python
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
    model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
    ```
    
    ## Advanced topics
    
    **Selective SSM**: See [references/selective-ssm.md](references/selective-ssm.md) for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
    
    **Mamba-2 architecture**: See [references/mamba2-details.md](references/mamba2-details.md) for multi-head structure, tensor parallelism, and distributed training setup.
    
    **Performance optimization**: See [references/performance.md](references/performance.md) for hardware-aware design, CUDA kernels, and memory efficiency techniques.
    
    ## Hardware requirements
    
    - **GPU**: NVIDIA with CUDA 11.6+
    - **VRAM**:
      - 130M model: 2GB
      - 370M model: 4GB
      - 790M model: 8GB
      - 1.4B model: 14GB
      - 2.8B model: 28GB (FP16)
    - **Inference**: 5× faster than Transformers
    - **Memory**: No KV cache (lower than Transformers)
    
    **Performance** (vs Transformers):
    - **Speed**: 5× faster inference
    - **Memory**: 50% less (no KV cache)
    - **Scaling**: Linear vs quadratic
    
    ## Resources
    
    - Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
    - Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
    - GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
    - Models: https://huggingface.co/state-spaces
    - Docs: Repository README and wiki
    
    
    
  • 0-autoresearch-skill/SKILL.mdskill
    Show content (24945 bytes)
    ---
    name: autoresearch
    description: Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Autonomous Research, Two-Loop Architecture, Experiment Orchestration, Research Synthesis, Project Management]
    ---
    
    # Autoresearch
    
    Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution.
    
    You are a research project manager, not a domain expert. You orchestrate; the domain skills execute.
    
    **This runs fully autonomously.** Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own.
    
    ## Getting Started
    
    Users arrive in different states. Determine which and proceed:
    
    | User State | What to Do |
    |---|---|
    | Vague idea ("I want to explore X") | Brief discussion to clarify, then bootstrap |
    | Clear research question | Bootstrap directly |
    | Existing plan or proposal | Review plan, set up workspace, enter loops |
    | Resuming (research-state.yaml exists) | Read state, continue from where you left off |
    
    If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching.
    
    **Step 0 — before anything else**: Set up the agent continuity loop. See [Agent Continuity](#agent-continuity-mandatory--set-up-first). This is MANDATORY. Without it, the research stops after one cycle.
    
    ### Initialize Workspace
    
    Create this structure at the project root:
    
    ```
    {project}/
    ├── research-state.yaml       # Central state tracking
    ├── research-log.md           # Decision timeline
    ├── findings.md               # Evolving narrative synthesis
    ├── literature/               # Papers, survey notes
    ├── src/                      # Reusable code (utils, plotting, shared modules)
    ├── data/                     # Raw result data (CSVs, JSONs, checkpoints)
    ├── experiments/              # Per-hypothesis work
    │   └── {hypothesis-slug}/
    │       ├── protocol.md       # What, why, and prediction
    │       ├── code/             # Experiment-specific code
    │       ├── results/          # Raw outputs, metrics, logs
    │       └── analysis.md       # What we learned
    ├── to_human/                 # Progress presentations and reports for human review
    └── paper/                    # Final paper (via ml-paper-writing)
    ```
    
    - **`src/`**: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory.
    - **`data/`**: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g., `trajectory_H1_runs001-010.csv`). Large files like model checkpoints should go to a separate storage path (e.g., `/data/`, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory.
    
    Initialize `research-state.yaml`, `research-log.md`, and `findings.md` from [templates/](templates/). Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement.
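
    A minimal initialization sketch in Python (the field names below are illustrative, not the canonical schema; use the files in [templates/](templates/) as the source of truth):
    
    ```python
    from pathlib import Path
    import yaml
    
    root = Path(".")  # project root
    for d in ["literature", "src", "data", "experiments", "to_human", "paper"]:
        (root / d).mkdir(parents=True, exist_ok=True)
    
    # Illustrative fields only — prefer the schema in templates/research-state.yaml
    state = {
        "question": "Does X improve Y?",
        "status": "bootstrap",     # bootstrap | inner_loop | outer_loop | concluding
        "hypotheses": [],          # filled in during bootstrap
        "experiments_completed": 0,
        "current_direction": None,
    }
    (root / "research-state.yaml").write_text(yaml.safe_dump(state, sort_keys=False))
    (root / "research-log.md").write_text("# Research Log\n")
    (root / "findings.md").write_text("# Findings\n")
    ```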
    
    ## The Two-Loop Architecture
    
    This is the core engine. Everything else supports it.
    
    ```
    BOOTSTRAP (once, lightweight)
      Scope question → search literature → form initial hypotheses
    
    INNER LOOP (fast, autonomous, repeating)
      Pick hypothesis → experiment → measure → record → learn → next
      Goal: run constrained experiments with clear measurable outcomes
    
    OUTER LOOP (periodic, reflective)
      Review results → find patterns → update findings.md →
      new hypotheses → decide direction
      Goal: synthesize understanding, find the story — this is where novelty comes from
    
    FINALIZE (when concluding)
      Write paper via ml-paper-writing → final presentation → archive
    ```
    
    The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results *mean*? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover.
    
    There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm.
    
    ### Research is Non-Linear
    
    The two-loop structure is a rhythm, not a railroad. At any point during research you can and should:
    
    - **Return to literature** when results surprise you, assumptions break, or you need context for a new direction — always save what you find to `literature/`
    - **Brainstorm new ideas** using `21-research-ideation/` skills when you're stuck or when results open unexpected questions
    - **Pivot the question entirely** if experiments reveal the original question was wrong or less interesting than what you found
    
    This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help.
    
    ## Bootstrap: Literature and Hypotheses
    
    Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey.
    
    1. **Search literature** for the research question. Use multiple sources — never stop at one:
       - **Exa MCP** (`web_search_exa`) if available — best for broad discovery and finding relevant papers quickly
       - **Semantic Scholar** (`pip install semanticscholar`) — best for ML/AI papers, citation graphs, and specific paper lookup. See `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code examples
       - **arXiv** (`pip install arxiv`) — best for recent preprints and open-access papers
       - **CrossRef** — best for DOI lookup and BibTeX retrieval
       - Keep searching until you have good coverage. If one source comes up empty, try another with different keywords
    
       **Save everything to `literature/`**: For every paper you find, save a summary to `literature/` — title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a running `literature/survey.md` with all summaries. This is your reference library — you and future sessions will need it throughout the project.
    
    2. **Identify gaps** from the literature
       - What's been tried? What hasn't? Where do existing methods break?
       - What do Discussion sections flag as future work?
    
    3. **Form initial hypotheses** — invoke `21-research-ideation/` skills
       - `brainstorming-research-ideas` for structured diverge-converge workflow
       - `creative-thinking-for-research` for deeper cognitive frameworks
       - Each hypothesis must be testable with a clear prediction
    
    4. **Define the evaluation**
       - Set the proxy metric and baseline before running experiments
       - The metric should be computable quickly (minutes, not hours)
       - Lock evaluation criteria upfront to prevent unconscious metric gaming
    
    5. **Record** in research-state.yaml, log the bootstrap in research-log.md
    
    ## The Inner Loop
    
    Rapid iteration with clear measurable outcomes. Two flavors:
    
    - **Optimization**: make a metric go up/down (val_loss, accuracy, throughput). Think Karpathy's autoresearch.
    - **Discovery**: test mechanistic hypotheses about why something works. The metric is a measurement (does grokking happen faster? does entropy increase before forgetting?), not just a target to optimize.
    
    ```
    1.  Pick the highest-priority untested hypothesis
    2.  Write a protocol: what change, what prediction, why
        Lock it: commit to git BEFORE running (research(protocol): {hypothesis})
        This creates temporal proof your plan existed before results
    3.  Run the experiment (invoke the relevant domain skill)
    4.  Sanity check before trusting results:
        - Did training converge? No NaN/Inf?
        - Does baseline reproduce expected performance?
        - Data loading correct? (spot-check a few samples)
    5.  Measure the proxy metric
    6.  Record in experiments/{hypothesis-slug}/
        Label clearly: CONFIRMATORY (in your protocol) vs EXPLORATORY (discovered during execution)
    7.  If positive: keep, note WHY it worked
    8.  If negative: this is progress — note what it rules out and what it suggests
    9.  Update research-state.yaml
    10. If stuck: search literature or invoke ideation skills — don't just keep trying random things
    ```
    
    **Never stop.** Even if something fails, find a path forward. Debug, adjust, simplify, or pivot — but keep the research moving. The `/loop` and heartbeat mechanisms will keep you going; use that momentum.
    
    ### Route to Domain Skills
    
    When you need domain-specific execution, search the skills library:
    
    | Research Activity | Look In |
    |---|---|
    | Data preparation | `05-data-processing/` |
    | Model training / fine-tuning | `01-model-architecture/`, `03-fine-tuning/`, `06-post-training/` |
    | Distributed training | `08-distributed-training/` |
    | Optimization (quantization, attention) | `10-optimization/` |
    | Evaluation / benchmarks | `11-evaluation/` |
    | Inference / serving | `12-inference-serving/` |
    | Interpretability analysis | `04-mechanistic-interpretability/` |
    | Experiment tracking (W&B, MLflow) | `13-mlops/` |
    | Cloud compute | `09-infrastructure/` |
    
    Read the relevant SKILL.md before starting — it has workflows, common issues, and code examples. See [references/skill-routing.md](references/skill-routing.md) for a complete guide.
    
    ### Track the Experiment Trajectory
    
    Maintain a running record of measurable outcomes across experiments:
    
    ```json
    {
      "experiment_id": "run_014",
      "hypothesis": "H3",
      "metric_value": 0.847,
      "baseline": 0.812,
      "delta": "+0.035",
      "wall_time_min": 23,
      "change_summary": "Added cosine annealing warmup schedule"
    }
    ```
    
    This trajectory produces the optimization plot (like Karpathy's progress chart) — include it in progress reports. Humans love seeing the upward curve.
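
    A minimal plotting sketch, assuming the trajectory records are appended one JSON object per line to a hypothetical `data/trajectory.jsonl`:
    
    ```python
    import json
    import matplotlib.pyplot as plt
    
    # One JSON record per experiment, in the format shown above
    with open("data/trajectory.jsonl") as f:
        records = [json.loads(line) for line in f]
    
    xs = list(range(1, len(records) + 1))
    ys = [r["metric_value"] for r in records]
    
    plt.plot(xs, ys, marker="o", label="proxy metric")
    plt.axhline(records[0]["baseline"], linestyle="--", label="baseline")
    plt.xlabel("Experiment")
    plt.ylabel("Metric")
    plt.legend()
    plt.savefig("to_human/trajectory.png", dpi=150)
    ```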
    
    ## The Outer Loop
    
    Step back from individual experiments. Synthesize.
    
    ```
    1. Review all results since last reflection
    2. Cluster by type: what kinds of changes worked? Which didn't?
    3. Ask WHY — identify the mechanism behind successes and failures
    4. Update findings.md with current understanding
    5. Search literature if results were surprising or assumptions need revisiting
    6. Generate new hypotheses if warranted (invoke 21-research-ideation/ skills)
    7. Decide direction (see criteria below)
    8. Update research-state.yaml with new direction
    9. Log the reflection in research-log.md
    10. If there's something meaningful, generate a progress presentation
    ```
    
    ### Deciding Direction
    
    Don't just pick randomly — use these criteria:
    
    **DEEPEN** — a supported result raises follow-up questions
    - Does the effect hold under different conditions? What's the mechanism?
    - Action: generate sub-hypotheses (H1.1, H1.2) → back to inner loop
    
    **BROADEN** — current results are solid, but adjacent questions are untested
    - New questions emerged. The current contribution is clear but more is possible.
    - Action: generate new root hypotheses → back to inner loop
    
    **PIVOT** — results invalidate key assumptions or something more interesting appeared
    - A core assumption was wrong, or an unexpected finding is more promising than the original question.
    - Action: return to literature with new questions → re-bootstrap
    
    **CONCLUDE** — sufficient evidence for a contribution
    - At least one hypothesis is strongly supported (or a coherent set of negative results)
    - Key ablations completed, error analysis done
    - findings.md reads like a paper backbone — a human could write the abstract from it
    - No critical open questions that would change the story
    
    Note: coherent negative results are a valid contribution. "X does NOT work because Y" is publishable if the reasoning is rigorous.
    
    ### findings.md Is Your Project Memory
    
    This file serves two purposes: it's the research narrative for humans AND your accumulated knowledge base as an agent. Read it at the start of every session, /loop tick, or heartbeat to remember what you've learned.
    
    After every outer loop, update it to answer:
    
    - What do we know so far? (Current Understanding)
    - What patterns explain our results? (Patterns and Insights)
    - What specific things did we learn not to repeat? (Lessons and Constraints)
    - What remains open? (Open Questions)
    
    The "Lessons and Constraints" section is especially important — it captures specific actionable learnings like "weight decay > 0.1 diverges at this scale" or "baseline only reproduces with batch_size=64." This prevents the agent from repeating failed approaches across sessions.
    
    **Quality test**: After 30 inner loop experiments, a human should be able to read findings.md and write a paper abstract from it. If they can't, the outer loop isn't synthesizing — it's just logging.
    
    ## Agent Continuity (MANDATORY — Set Up First)
    
    **Before doing anything else**, set up the wall-clock loop. This is what keeps the research running continuously. Without it, the agent stops after one cycle.
    
    ### Claude Code
    
    **First action — run this immediately:**
    
    ```
    /loop 20m Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a report in to_human/ and opening the HTML/PDF. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper.
    ```
    
    This fires every 20 minutes regardless of what's happening. It's a rhythm that keeps you working — not a research phase boundary. If your previous work isn't done, just continue it.
    
    ### OpenClaw
    
    **First action — set up a 20-minute cron job in the current session. You MUST do this before any research work.**
    
    Use the `cron.add` tool to create a recurring job bound to this chat session:
    
    ```json
    {
      "name": "autoresearch-loop",
      "schedule": { "kind": "every", "everyMs": 1200000 },
      "sessionTarget": "current",
      "payload": {
        "kind": "agentTurn",
        "message": "Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a PDF report in to_human/ and sending it to the user via Telegram, WhatsApp, or Slack. When you get an exciting result or interesting plot, text it to the user right away — don't wait for a full report. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper."
      }
    }
    ```
    
    Key details:
    - `sessionTarget: "current"` binds the cron to this chat session so it maintains conversation context across ticks
    - `everyMs: 1200000` = 20 minutes
    - After creation, verify with `cron.list` that the job exists and is enabled
    - If the cron fires while you're mid-experiment, just continue — the tick is a nudge, not a restart
    
    ### What the Loop Does
    
    The `/loop` and cron job are purely **wall-clock rhythm**. They are completely separate from your research loops (inner/outer). On each tick:
    
    1. Read `research-state.yaml` and `findings.md` — remember where you are
    2. Check if anything is broken (failed experiments, stalled training, errors)
    3. If on track → keep working on whatever you were doing
    4. If stuck or something's wrong → step back, diagnose, fix, then continue
    5. Never idle. Always be making progress.
    
    ## Progress Reporting
    
    When you have something meaningful to share, create a research presentation — not just a status dashboard, but a compelling story.
    
    **When to report** (your judgment):
    - After an outer loop that found a significant pattern
    - When the optimization trajectory shows clear progress (include the plot!)
    - After a pivot in direction
    - Before requesting human input on a decision
    - When concluding
    
    **What to include** (adapt to what's compelling):
    - The research question and why it matters
    - Key results with visualizations (plots, metric tables)
    - The optimization trajectory chart (metric over experiments)
    - What was tried and why (selective, not exhaustive)
    - Current understanding (the findings narrative)
    - What's planned next
    
    For Claude Code: generate HTML and `open` it. If HTML fails to open or render, convert to PDF as fallback (use `weasyprint`, `playwright pdf`, or `wkhtmltopdf`). For OpenClaw: generate PDF directly.
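
    A minimal fallback sketch using weasyprint, one of the converters named above (file paths are illustrative):
    
    ```python
    from weasyprint import HTML
    
    # Convert an already-generated HTML report to PDF
    HTML("to_human/progress_report.html").write_pdf("to_human/progress_report.pdf")
    ```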
    
    See [references/progress-reporting.md](references/progress-reporting.md) for template scaffolding and the optimization plot approach. Use the template as a starting point — be creative with what you show.
    
    ## Git Protocol
    
    Commit at natural research milestones:
    
    | When | Message Pattern |
    |---|---|
    | Workspace initialized | `research(init): {project} — {question}` |
    | Experiment protocol locked | `research(protocol): {hypothesis}` |
    | Significant results | `research(results): {hypothesis} — {outcome}` |
    | Outer loop direction change | `research(reflect): {direction} — {reason}` |
    | Paper draft complete | `research(paper): {title}` |
    
    **Hard rule**: Protocol commits MUST precede result commits. Never combine them. The git history is your lightweight pre-registration — it proves what you planned before you saw results. Don't commit after every experiment — commit when there's meaningful progress.
    
    ## Concluding: Paper Writing
    
    When the outer loop decides to CONCLUDE:
    
    1. Ensure findings.md has a clear, well-supported narrative
    2. Study 2-3 top related papers to learn their format, style, and section structure
    3. Invoke the `20-ml-paper-writing` skill — it has LaTeX templates for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, and systems venues
    4. Feed it the accumulated literature, experimental results, and findings
    5. Follow its citation verification workflow — never hallucinate references
    6. Generate a final comprehensive research presentation
    
    Proceed autonomously through the writing process. If the ml-paper-writing skill suggests human collaboration points, adapt and keep going — produce the best draft you can. The human will review and provide feedback.
    
    ## Research Discipline
    
    Principles to enforce continuously — not tied to any specific phase:
    
    - **Lock before you run**: Commit your experiment protocol to git before executing. This proves your plan existed before you saw results. Never combine protocol + results in one commit.
    - **Confirmatory vs exploratory**: Results matching your locked protocol are confirmatory. Everything else is exploratory — interesting but requiring more skepticism.
    - **Negative results are progress**: A refuted hypothesis tells you something. Log what it rules out and what it suggests. Don't treat it as failure.
    - **Sanity check before analysis**: Verify training converged, baselines reproduce, and data is correct before trusting your primary metric.
    - **Return to literature when confused**: Don't guess — search. If results surprise you or assumptions break, go find papers. Use Exa MCP for discovery, Semantic Scholar for specific ML/AI paper lookup, arXiv for preprints.
    - **Never stop**: Don't wait for human approval on routine decisions. If a skill or tool suggests collaboration, adapt and keep going. Find the best path forward autonomously. The human will see your progress reports and can redirect if needed.
    - **Use whatever compute is available**: Adapt to the user's environment — local GPU, cluster job submission, cloud instances, or just CPU. If no GPU is available, use CPU and adjust experiment scale accordingly. Don't block on compute availability.
    
    ## Quality Standards
    
    **Good agent behavior:**
    - Hypotheses have mechanistic reasoning ("X because Y, predicting Z"), not just "try X"
    - findings.md builds a coherent narrative, not a flat list of results
    - Negative results are recorded with what they rule out
    - The agent updates its model when experiments contradict expectations
    - Progress reports tell a research story with compelling visualizations
    
    **Bad agent behavior:**
    - Pure hyperparameter sweeps without interpretation
    - findings.md is just experiment logs copy-pasted
    - Agent never revisits its assumptions after failures
    - Optimizing metrics without understanding why changes work
    
    ## When to Use vs Alternatives
    
    **Use autoresearch when:**
    - You have a research question explorable through experiments
    - There's a measurable proxy metric for inner loop optimization
    - The real contribution requires synthesis beyond the metric
    - You want continuous autonomous research operation
    
    **Use individual domain skills instead when:**
    - You have a specific one-off task (train a model, run eval, write a paper)
    - No iterative experimentation needed
    
    ## Common Issues
    
    **Inner loop stalls (no metric improvement)**
    Run an outer loop. Is the metric the right one? Is the search space exhausted? Consider broadening or pivoting. Search literature for new approaches.
    
    **Stuck and not making progress**
    Don't keep trying random changes. Step back: search literature for related work, invoke `21-research-ideation/` brainstorming skills, or run an outer loop reflection. Being stuck means you need new information or a new perspective, not more experiments.
    
    **Results contradict baseline expectations**
    Investigate, don't ignore. Return to literature — your protocol might have an error, the published baseline may be wrong, or conditions differ. Update findings.md with what you learn.
    
    **Agent loses context between ticks**
    Ensure research-state.yaml and findings.md are updated after every action. These files are your memory across sessions.
    
    **Can't find relevant papers**
    Try multiple approaches in order: Exa MCP for broad search, Semantic Scholar for specific ML/AI paper lookup (`pip install semanticscholar`), arXiv for preprints (`pip install arxiv`). Check `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code. Note: Google Scholar has no official API — use Semantic Scholar instead for programmatic search.
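
    A minimal sketch of a programmatic arXiv query with the `arxiv` package (query string is illustrative):
    
    ```python
    import arxiv
    
    client = arxiv.Client()
    search = arxiv.Search(query="selective state space models", max_results=5)
    for result in client.results(search):
        # Save title, URL, and summary into literature/ for future sessions
        print(result.published.date(), result.title)
        print(result.entry_id)
    ```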
    
    **No GPU available**
    Use CPU and scale experiments down. Many research tasks (analysis, interpretability, small model training) run fine on CPU. Adjust experiment design to fit available compute rather than blocking.
    
    **Experiments take longer than /loop interval**
    Normal. On the next tick, check if it finished. If not, keep waiting or do something else useful (update notes, search papers). Adjust interval if needed.
    
    **Not sure when to conclude**
    Three questions: Do you have a strongly supported finding? Can you explain WHY it works? Would findings.md make a convincing paper abstract? If yes to all: conclude.
    
    ## Advanced Topics
    
    - **Detailed agent continuity**: [references/agent-continuity.md](references/agent-continuity.md)
    - **Progress presentation templates**: [references/progress-reporting.md](references/progress-reporting.md)
    - **Complete skill routing**: [references/skill-routing.md](references/skill-routing.md)
    
  • .claude-plugin/marketplace.jsonmarketplace
    Show content (12081 bytes)
    {
      "name": "ai-research-skills",
      "owner": {
        "name": "Orchestra Research",
        "email": "zechen@orchestra-research.com"
      },
      "metadata": {
        "description": "Comprehensive library of 98 AI research engineering skills enabling autonomous AI research from hypothesis to experimental verification",
        "version": "1.2.0"
      },
      "plugins": [
        {
          "name": "model-architecture",
          "description": "LLM architectures and implementations including LitGPT, Mamba, NanoGPT, RWKV, and TorchTitan. Use when implementing, training, or understanding transformer and alternative architectures.",
          "source": "./",
          "strict": false,
          "skills": [
            "./01-model-architecture/litgpt",
            "./01-model-architecture/mamba",
            "./01-model-architecture/nanogpt",
            "./01-model-architecture/rwkv",
            "./01-model-architecture/torchtitan"
          ]
        },
        {
          "name": "tokenization",
          "description": "Text tokenization for LLMs including HuggingFace Tokenizers and SentencePiece. Use when training custom tokenizers or handling multilingual text.",
          "source": "./",
          "strict": false,
          "skills": [
            "./02-tokenization/huggingface-tokenizers",
            "./02-tokenization/sentencepiece"
          ]
        },
        {
          "name": "fine-tuning",
          "description": "LLM fine-tuning frameworks including Axolotl, LLaMA-Factory, PEFT, and Unsloth. Use when fine-tuning models with LoRA, QLoRA, or full fine-tuning.",
          "source": "./",
          "strict": false,
          "skills": [
            "./03-fine-tuning/axolotl",
            "./03-fine-tuning/llama-factory",
            "./03-fine-tuning/peft",
            "./03-fine-tuning/unsloth"
          ]
        },
        {
          "name": "mechanistic-interpretability",
          "description": "Neural network interpretability tools including TransformerLens, SAELens, NNSight, and pyvene. Use when analyzing model internals, finding circuits, or understanding how models compute.",
          "source": "./",
          "strict": false,
          "skills": [
            "./04-mechanistic-interpretability/nnsight",
            "./04-mechanistic-interpretability/pyvene",
            "./04-mechanistic-interpretability/saelens",
            "./04-mechanistic-interpretability/transformer-lens"
          ]
        },
        {
          "name": "data-processing",
          "description": "Data curation and processing at scale including NeMo Curator and Ray Data. Use when preparing training datasets or processing large-scale data.",
          "source": "./",
          "strict": false,
          "skills": [
            "./05-data-processing/nemo-curator",
            "./05-data-processing/ray-data"
          ]
        },
        {
          "name": "post-training",
          "description": "RLHF and preference alignment including TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, and torchforge. Use when aligning models with human preferences, training reward models, or large-scale RL training.",
          "source": "./",
          "strict": false,
          "skills": [
            "./06-post-training/grpo-rl-training",
            "./06-post-training/miles",
            "./06-post-training/openrlhf",
            "./06-post-training/simpo",
            "./06-post-training/slime",
            "./06-post-training/torchforge",
            "./06-post-training/trl-fine-tuning",
            "./06-post-training/verl"
          ]
        },
        {
          "name": "safety-alignment",
          "description": "AI safety and content moderation including Constitutional AI, LlamaGuard, NeMo Guardrails, and Prompt Guard. Use when implementing safety filters, content moderation, or prompt injection detection.",
          "source": "./",
          "strict": false,
          "skills": [
            "./07-safety-alignment/constitutional-ai",
            "./07-safety-alignment/llamaguard",
            "./07-safety-alignment/nemo-guardrails",
            "./07-safety-alignment/prompt-guard"
          ]
        },
        {
          "name": "distributed-training",
          "description": "Multi-GPU and multi-node training including DeepSpeed, PyTorch FSDP, Accelerate, Megatron-Core, PyTorch Lightning, and Ray Train. Use when training large models across GPUs.",
          "source": "./",
          "strict": false,
          "skills": [
            "./08-distributed-training/accelerate",
            "./08-distributed-training/deepspeed",
            "./08-distributed-training/megatron-core",
            "./08-distributed-training/pytorch-fsdp2",
            "./08-distributed-training/pytorch-lightning",
            "./08-distributed-training/ray-train"
          ]
        },
        {
          "name": "infrastructure",
          "description": "GPU cloud and compute orchestration including Modal, Lambda Labs, and SkyPilot. Use when deploying training jobs or managing GPU resources.",
          "source": "./",
          "strict": false,
          "skills": [
            "./09-infrastructure/lambda-labs",
            "./09-infrastructure/modal",
            "./09-infrastructure/skypilot"
          ]
        },
        {
          "name": "optimization",
          "description": "Model optimization and quantization including Flash Attention, bitsandbytes, GPTQ, AWQ, GGUF, and HQQ. Use when reducing memory, accelerating inference, or quantizing models.",
          "source": "./",
          "strict": false,
          "skills": [
            "./10-optimization/awq",
            "./10-optimization/bitsandbytes",
            "./10-optimization/flash-attention",
            "./10-optimization/gguf",
            "./10-optimization/gptq",
            "./10-optimization/hqq",
            "./10-optimization/ml-training-recipes"
          ]
        },
        {
          "name": "evaluation",
          "description": "LLM benchmarking and evaluation including lm-evaluation-harness, BigCode Evaluation Harness, and NeMo Evaluator. Use when benchmarking models or measuring performance.",
          "source": "./",
          "strict": false,
          "skills": [
            "./11-evaluation/bigcode-evaluation-harness",
            "./11-evaluation/lm-evaluation-harness",
            "./11-evaluation/nemo-evaluator"
          ]
        },
        {
          "name": "inference-serving",
          "description": "Production LLM inference including vLLM, TensorRT-LLM, llama.cpp, and SGLang. Use when deploying models for production inference.",
          "source": "./",
          "strict": false,
          "skills": [
            "./12-inference-serving/llama-cpp",
            "./12-inference-serving/sglang",
            "./12-inference-serving/tensorrt-llm",
            "./12-inference-serving/vllm"
          ]
        },
        {
          "name": "mlops",
          "description": "ML experiment tracking and lifecycle including Weights & Biases, MLflow, and TensorBoard. Use when tracking experiments or managing models.",
          "source": "./",
          "strict": false,
          "skills": [
            "./13-mlops/mlflow",
            "./13-mlops/tensorboard",
            "./13-mlops/weights-and-biases"
          ]
        },
        {
          "name": "agents",
          "description": "LLM agent frameworks including LangChain, LlamaIndex, CrewAI, and AutoGPT. Use when building chatbots, autonomous agents, or tool-using systems.",
          "source": "./",
          "strict": false,
          "skills": [
            "./14-agents/autogpt",
            "./14-agents/crewai",
            "./14-agents/langchain",
            "./14-agents/llamaindex"
          ]
        },
        {
          "name": "rag",
          "description": "Retrieval-Augmented Generation including Chroma, FAISS, Pinecone, Qdrant, and Sentence Transformers. Use when building semantic search or document retrieval systems.",
          "source": "./",
          "strict": false,
          "skills": [
            "./15-rag/chroma",
            "./15-rag/faiss",
            "./15-rag/pinecone",
            "./15-rag/qdrant",
            "./15-rag/sentence-transformers"
          ]
        },
        {
          "name": "prompt-engineering",
          "description": "Structured LLM outputs including DSPy, Instructor, Guidance, and Outlines. Use when extracting structured data or constraining LLM outputs.",
          "source": "./",
          "strict": false,
          "skills": [
            "./16-prompt-engineering/dspy",
            "./16-prompt-engineering/guidance",
            "./16-prompt-engineering/instructor",
            "./16-prompt-engineering/outlines"
          ]
        },
        {
          "name": "observability",
          "description": "LLM application monitoring including LangSmith and Phoenix. Use when debugging LLM apps or monitoring production systems.",
          "source": "./",
          "strict": false,
          "skills": [
            "./17-observability/langsmith",
            "./17-observability/phoenix"
          ]
        },
        {
          "name": "multimodal",
          "description": "Vision, audio, and multimodal models including CLIP, Whisper, LLaVA, BLIP-2, Segment Anything, Stable Diffusion, AudioCraft, Cosmos Policy, OpenPI, and OpenVLA-OFT. Use when working with images, audio, multimodal tasks, or vision-language-action robot policies.",
          "source": "./",
          "strict": false,
          "skills": [
            "./18-multimodal/audiocraft",
            "./18-multimodal/blip-2",
            "./18-multimodal/clip",
            "./18-multimodal/cosmos-policy",
            "./18-multimodal/llava",
            "./18-multimodal/openpi",
            "./18-multimodal/openvla-oft",
            "./18-multimodal/segment-anything",
            "./18-multimodal/stable-diffusion",
            "./18-multimodal/whisper"
          ]
        },
        {
          "name": "emerging-techniques",
          "description": "Advanced ML techniques including MoE Training, Model Merging, Long Context, Speculative Decoding, Knowledge Distillation, and Model Pruning. Use when implementing cutting-edge optimization or architecture techniques.",
          "source": "./",
          "strict": false,
          "skills": [
            "./19-emerging-techniques/knowledge-distillation",
            "./19-emerging-techniques/long-context",
            "./19-emerging-techniques/model-merging",
            "./19-emerging-techniques/model-pruning",
            "./19-emerging-techniques/moe-training",
            "./19-emerging-techniques/speculative-decoding"
          ]
        },
        {
          "name": "autoresearch",
          "description": "Autonomous research orchestration using a two-loop architecture. Manages the full research lifecycle from literature survey to paper writing, routing to domain-specific skills for execution. Use when starting a research project, running autonomous experiments, or managing multi-hypothesis research.",
          "source": "./",
          "strict": false,
          "skills": [
            "./0-autoresearch-skill"
          ]
        },
        {
          "name": "ml-paper-writing",
          "description": "Write publication-ready ML/AI/Systems papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, OSDI, NSDI, ASPLOS, SOSP. Includes LaTeX templates, citation verification, reviewer guidelines, publication-quality figure generation, systems paper structural blueprints, and conference presentation slides.",
          "source": "./",
          "strict": false,
          "skills": [
            "./20-ml-paper-writing/ml-paper-writing",
            "./20-ml-paper-writing/academic-plotting",
            "./20-ml-paper-writing/systems-paper-writing",
            "./20-ml-paper-writing/presenting-conference-talks"
          ]
        },
        {
          "name": "ideation",
          "description": "Research ideation frameworks including structured brainstorming and creative thinking. Use when exploring new research directions, generating novel ideas, or seeking fresh angles on existing work.",
          "source": "./",
          "strict": false,
          "skills": [
            "./21-research-ideation/brainstorming-research-ideas",
            "./21-research-ideation/creative-thinking-for-research"
          ]
        },
        {
          "name": "agent-native-research-artifact",
          "description": "Agent-Native Research Artifact (ARA) tooling: compile any research input (paper, repo, notes) into a structured artifact, record session provenance as a post-task epilogue, and run Seal Level 2 epistemic review. Use when ingesting research into a falsifiable, agent-traversable artifact, capturing how a research project actually evolved, or auditing an ARA for evidence-claim alignment.",
          "source": "./",
          "strict": false,
          "skills": [
            "./22-agent-native-research-artifact/compiler",
            "./22-agent-native-research-artifact/research-manager",
            "./22-agent-native-research-artifact/rigor-reviewer"
          ]
        }
      ]
    }
    

README

AI Research Skills Library

The most comprehensive open-source skills library enabling AI agents to autonomously conduct AI research — from idea to paper



98 Skills Powering AI Research in 2026

View All 23 Categories
Autoresearch (1) · Ideation (2) · ML Paper Writing (2) · Model Architecture (5) · Fine-Tuning (4) · Post-Training (8) · Distributed Training (6) · Optimization (6) · Inference (4) · Tokenization (2) · Data Processing (2) · Evaluation (3) · Safety & Alignment (4) · Agents (4) · RAG (5) · Multimodal (7) · Prompt Engineering (4) · MLOps (3) · Observability (2) · Infrastructure (3) · Mech Interp (4) · Emerging Techniques (6) · Agent-Native Research Artifact (3)


Our Mission

We enable AI agents to autonomously conduct AI research — from literature survey and idea generation through experiment execution to paper writing. The library provides both the research orchestration layer (autoresearch, ideation, paper writing) and the engineering skills (training, evaluation, deployment) needed at each stage.

AI Research Agent System
[Figure: System diagram of an AI research agent]

Path Towards AI Research Agent

Modern AI research requires mastering dozens of specialized tools and frameworks. AI Researchers spend more time debugging infrastructure than testing hypotheses — slowing the pace of scientific discovery. We provide a comprehensive skills library that enables AI agents to autonomously conduct the full research lifecycle — from brainstorming ideas to writing the paper.

  • Autonomous Research - The autoresearch skill orchestrates the entire research workflow using a two-loop architecture, routing to domain skills as needed
  • Specialized Expertise - Each domain skill provides deep, production-ready knowledge of a specific framework (Megatron-LM, vLLM, TRL, etc.)
  • End-to-End Coverage - 98 skills spanning the full AI research lifecycle, from ideation and literature survey to experiments and paper writing
  • Research-Grade Quality - Documentation sourced from official repos, real GitHub issues, and battle-tested production workflows

Available AI Research Engineering Skills

Quality over quantity: Each skill provides comprehensive, expert-level guidance with real code examples, troubleshooting guides, and production-ready workflows.

📦 Quick Install (Recommended)

For humans — interactive installer with one command:

npx @orchestra-research/ai-research-skills

For AI agents — point your agent to the welcome doc and it handles the rest:

Read https://www.orchestra-research.com/ai-research-skills/welcome.md and follow the instructions to install and use AI Research Skills.

This installs all 98 skills, loads the autoresearch orchestration layer, and starts autonomous research.

What the installer does
  • Auto-detects your installed coding agents (Claude Code, Hermes Agent, OpenCode, Cursor, Gemini CLI, etc.)
  • Installs skills to ~/.orchestra/skills/ with symlinks to each agent (falls back to copy on Windows)
  • Offers everything, quickstart bundle, by category, or individual skills
  • Updates installed skills with latest versions
  • Uninstalls all or selected skills
CLI Commands
# Interactive installer (recommended)
npx @orchestra-research/ai-research-skills

# Direct commands
npx @orchestra-research/ai-research-skills list      # View installed skills
npx @orchestra-research/ai-research-skills update    # Update installed skills
Claude Code Marketplace (Alternative)

Install skill categories directly using the Claude Code CLI:

# Add the marketplace
/plugin marketplace add orchestra-research/AI-research-SKILLs

# Install by category (23 categories available)
/plugin install fine-tuning@ai-research-skills        # Axolotl, LLaMA-Factory, PEFT, Unsloth
/plugin install post-training@ai-research-skills      # TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge
/plugin install inference-serving@ai-research-skills  # vLLM, TensorRT-LLM, llama.cpp, SGLang
/plugin install distributed-training@ai-research-skills
/plugin install optimization@ai-research-skills

All 23 Categories (98 Skills)

| Category | Skills | Included |
|---|---|---|
| Autoresearch | 1 | Autonomous research orchestration — central layer that manages the full lifecycle and routes to all other skills |
| Ideation | 2 | Research Brainstorming, Creative Thinking |
| ML Paper Writing | 2 | ML Paper Writing (LaTeX templates, citation verification), Academic Plotting |
| Model Architecture | 5 | LitGPT, Mamba, NanoGPT, RWKV, TorchTitan |
| Tokenization | 2 | HuggingFace Tokenizers, SentencePiece |
| Fine-Tuning | 4 | Axolotl, LLaMA-Factory, PEFT, Unsloth |
| Mech Interp | 4 | TransformerLens, SAELens, pyvene, nnsight |
| Data Processing | 2 | NeMo Curator, Ray Data |
| Post-Training | 8 | TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge |
| Safety | 4 | Constitutional AI, LlamaGuard, NeMo Guardrails, Prompt Guard |
| Distributed | 6 | DeepSpeed, FSDP, Accelerate, Megatron-Core, Lightning, Ray Train |
| Infrastructure | 3 | Modal, Lambda Labs, SkyPilot |
| Optimization | 6 | Flash Attention, bitsandbytes, GPTQ, AWQ, HQQ, GGUF |
| Evaluation | 3 | lm-eval-harness, BigCode, NeMo Evaluator |
| Inference | 4 | vLLM, TensorRT-LLM, llama.cpp, SGLang |
| MLOps | 3 | W&B, MLflow, TensorBoard |
| Agents | 4 | LangChain, LlamaIndex, CrewAI, AutoGPT |
| RAG | 5 | Chroma, FAISS, Pinecone, Qdrant, Sentence Transformers |
| Prompt Eng | 4 | DSPy, Instructor, Guidance, Outlines |
| Observability | 2 | LangSmith, Phoenix |
| Multimodal | 7 | CLIP, Whisper, LLaVA, BLIP-2, SAM, Stable Diffusion, AudioCraft |
| Emerging | 6 | MoE, Model Merging, Long Context, Speculative Decoding, Distillation, Pruning |
| Agent-Native Research Artifact | 3 | ARA Compiler, Research Manager, Rigor Reviewer |
View All 98 Skills in Detail

🔬 Autoresearch (1 skill) — Central Orchestration Layer

  • Autoresearch - Autonomous research orchestration using a two-loop architecture (inner optimization + outer synthesis). Manages the full lifecycle from literature survey to paper writing, routing to all domain-specific skills. Supports Claude Code /loop and OpenClaw heartbeat for continuous operation (390 lines + 3 refs)

🏗️ Model Architecture (5 skills)

  • LitGPT - Lightning AI's 20+ clean LLM implementations with production training recipes (462 lines + 4 refs)
  • Mamba - State-space models with O(n) complexity, 5× faster than Transformers (253 lines + 3 refs)
  • RWKV - RNN+Transformer hybrid, infinite context, Linux Foundation project (253 lines + 3 refs)
  • NanoGPT - Educational GPT in ~300 lines by Karpathy (283 lines + 3 refs)
  • TorchTitan - PyTorch-native distributed training for Llama 3.1 with 4D parallelism

🔤 Tokenization (2 skills)

  • HuggingFace Tokenizers - Rust-based, <20s/GB, BPE/WordPiece/Unigram algorithms (486 lines + 4 refs)
  • SentencePiece - Language-independent, 50k sentences/sec, used by T5/ALBERT (228 lines + 2 refs)

🎯 Fine-Tuning (4 skills)

  • Axolotl - YAML-based fine-tuning with 100+ models (156 lines + 4 refs)
  • LLaMA-Factory - WebUI no-code fine-tuning (78 lines + 5 refs)
  • Unsloth - 2x faster QLoRA fine-tuning (75 lines + 4 refs)
  • PEFT - Parameter-efficient fine-tuning with LoRA, QLoRA, DoRA, 25+ methods (431 lines + 2 refs)

🔬 Mechanistic Interpretability (4 skills)

  • TransformerLens - Neel Nanda's library for mech interp with HookPoints, activation caching (346 lines + 3 refs)
  • SAELens - Sparse Autoencoder training and analysis for feature discovery (386 lines + 3 refs)
  • pyvene - Stanford's causal intervention library with declarative configs (473 lines + 3 refs)
  • nnsight - Remote interpretability via NDIF, run experiments on 70B+ models (436 lines + 3 refs)

📊 Data Processing (2 skills)

  • Ray Data - Distributed ML data processing, streaming execution, GPU support (318 lines + 2 refs)
  • NeMo Curator - GPU-accelerated data curation, 16× faster deduplication (375 lines + 2 refs)

🎓 Post-Training (8 skills)

  • TRL Fine-Tuning - Transformer Reinforcement Learning (447 lines + 4 refs)
  • GRPO-RL-Training (TRL) - Group Relative Policy Optimization with TRL (569 lines, gold standard)
  • OpenRLHF - Full RLHF pipeline with Ray + vLLM (241 lines + 4 refs)
  • SimPO - Simple Preference Optimization, no reference model needed (211 lines + 3 refs)
  • verl - ByteDance's HybridFlow RL framework, FSDP/Megatron + vLLM/SGLang backends (389 lines + 2 refs)
  • slime - THUDM's Megatron+SGLang framework powering GLM-4.x models (464 lines + 2 refs)
  • miles - Enterprise fork of slime with FP8, INT4, speculative RL for MoE training (315 lines + 2 refs)
  • torchforge - Meta's PyTorch-native RL with Monarch+TorchTitan+vLLM (380 lines + 2 refs)

🛡️ Safety & Alignment (4 skills)

  • Constitutional AI - AI-driven self-improvement via principles (282 lines)
  • LlamaGuard - Safety classifier for LLM inputs/outputs (329 lines)
  • NeMo Guardrails - Programmable guardrails with Colang (289 lines)
  • Prompt Guard - Meta's 86M prompt injection & jailbreak detector, 99%+ TPR, <2ms GPU (313 lines)

⚡ Distributed Training (6 skills)

  • Megatron-Core - NVIDIA's framework for training 2B-462B param models with 47% MFU on H100 (359 lines + 4 refs)
  • DeepSpeed - Microsoft's ZeRO optimization (137 lines + 9 refs)
  • PyTorch FSDP2 - Fully Sharded Data Parallel v2 with fully_shard and DTensor (231 lines + 12 refs)
  • Accelerate - HuggingFace's 4-line distributed training API (324 lines + 3 refs)
  • PyTorch Lightning - High-level training framework with Trainer class (339 lines + 3 refs)
  • Ray Train - Multi-node orchestration and hyperparameter tuning (399 lines + 1 ref)
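
A minimal sketch of the Accelerate pattern referenced above: prepare the model, optimizer, and dataloader once, then train as usual; the toy model and data are stand-ins.

```python
# Minimal sketch: the core Accelerate loop with placeholder model and data.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# Accelerate handles device placement and, when launched distributed, sharding
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch_x, batch_y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_x), batch_y)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
```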

🚀 Optimization (6 skills)

  • Flash Attention - 2-4x faster attention with memory efficiency (359 lines + 2 refs)
  • bitsandbytes - 8-bit/4-bit quantization for 50-75% memory reduction (403 lines + 3 refs; see the sketch after this list)
  • GPTQ - 4-bit post-training quantization, 4× memory reduction, <2% accuracy loss (443 lines + 3 refs)
  • AWQ - Activation-aware weight quantization, 4-bit with minimal accuracy loss (310 lines + 2 refs)
  • HQQ - Half-Quadratic Quantization, no calibration data needed, multi-backend (370 lines + 2 refs)
  • GGUF - llama.cpp quantization format, K-quant methods, CPU/Metal inference (380 lines + 2 refs)
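
A minimal sketch of 4-bit loading with bitsandbytes through transformers' BitsAndBytesConfig; the model name is illustrative and a CUDA GPU is assumed.

```python
# Minimal sketch: load a causal LM in 4-bit NF4 via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",     # illustrative model choice
    quantization_config=bnb_config,
    device_map="auto",
)
```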

📊 Evaluation (3 skills)

  • lm-evaluation-harness - EleutherAI's standard for benchmarking LLMs across 60+ tasks (482 lines + 4 refs; see the sketch after this list)
  • BigCode Evaluation Harness - Code model benchmarking with HumanEval, MBPP, MultiPL-E, pass@k metrics (406 lines + 3 refs)
  • NeMo Evaluator - NVIDIA's enterprise platform for 100+ benchmarks across 18+ harnesses with multi-backend execution (454 lines + 4 refs)
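
A minimal sketch of running lm-evaluation-harness from Python, assuming a v0.4+ release where simple_evaluate is exposed at the package level; the model and task choices are illustrative.

```python
# Minimal sketch: benchmark a HF model on two tasks with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])
```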

☁️ Infrastructure (3 skills)

  • Modal - Serverless GPU cloud with Python-native API, T4-H200 on-demand (342 lines + 2 refs; see the sketch after this list)
  • SkyPilot - Multi-cloud orchestration across 20+ providers with spot recovery (390 lines + 2 refs)
  • Lambda Labs - Reserved/on-demand GPU cloud with H100/A100, persistent filesystems (390 lines + 2 refs)
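
A minimal Modal sketch, assuming a recent client where modal.App is the entry point; the GPU type and function body are illustrative.

```python
# Minimal sketch: run a function on a cloud GPU with Modal.
import modal

app = modal.App("gpu-demo")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="T4", image=image)
def gpu_info() -> str:
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    print(gpu_info.remote())   # executes remotely, returns the result locally
```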

🔥 Inference & Serving (4 skills)

  • vLLM - High-throughput LLM serving with PagedAttention (356 lines + 4 refs, production-ready; see the sketch after this list)
  • TensorRT-LLM - NVIDIA's fastest inference, 24k tok/s, FP8/INT4 quantization (180 lines + 3 refs)
  • llama.cpp - CPU/Apple Silicon inference, GGUF quantization (251 lines + 3 refs)
  • SGLang - Structured generation with RadixAttention, 5-10× faster for agents (435 lines + 3 refs)
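
A minimal sketch of offline batched generation with vLLM; the model name and sampling settings are illustrative assumptions.

```python
# Minimal sketch: batched offline generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(
    ["Summarize PagedAttention in one sentence.",
     "What does continuous batching optimize?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```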

🤖 Agents (4 skills)

  • LangChain - Most popular agent framework, 500+ integrations, ReAct pattern (658 lines + 3 refs, production-ready)
  • LlamaIndex - Data framework for LLM apps, 300+ connectors, RAG-focused (535 lines + 3 refs)
  • CrewAI - Multi-agent orchestration, role-based collaboration, autonomous workflows (498 lines + 3 refs)
  • AutoGPT - Autonomous AI agent platform, visual workflow builder, continuous execution (400 lines + 2 refs)

🔍 RAG (5 skills)

  • Chroma - Open-source embedding database, local/cloud, 24k stars (385 lines + 1 ref)
  • FAISS - Facebook's similarity search, billion-scale, GPU acceleration (295 lines; see the sketch after this list)
  • Sentence Transformers - 5000+ embedding models, multilingual, 15k stars (370 lines)
  • Pinecone - Managed vector database, auto-scaling, <100ms latency (410 lines)
  • Qdrant - High-performance vector search, Rust-powered, hybrid search with filtering (493 lines + 2 refs)
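
A minimal sketch of the embed-and-search loop behind these RAG skills, combining Sentence Transformers with a FAISS index; the corpus and embedding model are toy assumptions.

```python
# Minimal sketch: embed a tiny corpus and run a FAISS nearest-neighbour search.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "PagedAttention reduces KV-cache fragmentation.",
    "LoRA adapts a frozen model with low-rank updates.",
    "FAISS performs billion-scale similarity search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on unit vectors
index.add(embeddings)

query = model.encode(["How does low-rank fine-tuning work?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([docs[i] for i in ids[0]])
```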

🎨 Multimodal (7 skills)

  • CLIP - OpenAI's vision-language model, zero-shot classification, 25k stars (320 lines; see the sketch after this list)
  • Whisper - Robust speech recognition, 99 languages, 73k stars (395 lines)
  • LLaVA - Vision-language assistant, image chat, GPT-4V level (360 lines)
  • Stable Diffusion - Text-to-image generation via HuggingFace Diffusers, SDXL, ControlNet (380 lines + 2 refs)
  • Segment Anything - Meta's SAM for zero-shot image segmentation with points/boxes (500 lines + 2 refs)
  • BLIP-2 - Vision-language pretraining with Q-Former, image captioning, VQA (500 lines + 2 refs)
  • AudioCraft - Meta's MusicGen/AudioGen for text-to-music and text-to-sound (470 lines + 2 refs)
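
A minimal sketch of zero-shot classification with CLIP via transformers; the image path and label set are illustrative assumptions.

```python
# Minimal sketch: zero-shot image classification with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a GPU cluster"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # label probabilities per image
print(dict(zip(labels, probs[0].tolist())))
```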

🎯 Prompt Engineering (4 skills)

  • DSPy - Declarative prompt programming with optimizers, Stanford NLP, 22k stars (438 lines + 3 refs)
  • Instructor - Structured LLM outputs with Pydantic validation, 15k stars (726 lines + 3 refs; see the sketch after this list)
  • Guidance - Constrained generation with regex/grammars, Microsoft Research, 18k stars (485 lines + 3 refs)
  • Outlines - Structured text with FSM, zero-overhead, 8k stars (601 lines + 3 refs)
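
A minimal sketch of schema-constrained extraction with Instructor, assuming Instructor 1.x (instructor.from_openai) and an OpenAI API key in the environment; the model and schema are illustrative.

```python
# Minimal sketch: validated structured output with Instructor + Pydantic.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Paper(BaseModel):
    title: str
    year: int
    venue: str

client = instructor.from_openai(OpenAI())

paper = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Paper,     # output is validated against this schema
    messages=[{"role": "user",
               "content": "Extract: 'Attention Is All You Need, NeurIPS 2017'."}],
)
print(paper.model_dump())
```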

📊 MLOps (3 skills)

  • Weights & Biases - Experiment tracking, sweeps, artifacts, model registry (427 lines + 3 refs)
  • MLflow - Model registry, tracking, deployment, autologging (514 lines + 3 refs)
  • TensorBoard - Visualization, profiling, embeddings, scalars/images (538 lines + 3 refs)
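
A minimal sketch of scalar logging with TensorBoard's PyTorch SummaryWriter; the run directory and placeholder metric are illustrative.

```python
# Minimal sketch: log scalars to TensorBoard from a toy loop.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/skills-demo")

for step in range(100):
    loss = 1.0 / (step + 1)                 # placeholder metric
    writer.add_scalar("train/loss", loss, step)

writer.close()
# View with: tensorboard --logdir runs
```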

👁️ Observability (2 skills)

  • LangSmith - LLM observability, tracing, evaluation, monitoring for AI apps (422 lines + 2 refs)
  • Phoenix - Open-source AI observability with OpenTelemetry tracing and LLM evaluation (380 lines + 2 refs)

🔬 Emerging Techniques (6 skills)

  • MoE Training - Mixture of Experts training with DeepSpeed, Mixtral 8x7B, 5× cost reduction (515 lines + 3 refs)
  • Model Merging - Combine models with TIES, DARE, SLERP using mergekit (528 lines + 3 refs)
  • Long Context - Extend context windows with RoPE, YaRN, ALiBi, 32k-128k tokens (624 lines + 3 refs)
  • Speculative Decoding - 1.5-3.6× faster inference with Medusa, Lookahead (379 lines)
  • Knowledge Distillation - Compress models 70B→7B with MiniLLM, temperature scaling (424 lines)
  • Model Pruning - 50% sparsity with Wanda, SparseGPT, <1% accuracy loss (417 lines)

📝 ML Paper Writing (2 skills)

  • ML Paper Writing - Write publication-ready papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM with LaTeX templates, citation verification, and writing best practices (532 lines + 5 refs)
  • Academic Plotting - Generate publication-quality figures for ML papers: architecture diagrams via Gemini AI and data-driven charts via matplotlib/seaborn with venue-specific styling (479 lines + 3 refs)

💡 Ideation (2 skills)

  • Research Brainstorming - Structured ideation frameworks for discovering high-impact research directions with 10 complementary lenses (384 lines)
  • Creative Thinking - Cognitive science frameworks (bisociation, structure-mapping, constraint manipulation) for genuinely novel research ideas (366 lines)

🧬 Agent-Native Research Artifact (3 skills)

  • ARA Compiler - Compiles any research input (PDF papers, repos, experiment logs, raw notes) into a complete Agent-Native Research Artifact with claims, exploration graph, evidence, and code stubs (245 lines + 3 refs)
  • ARA Research Manager - Post-task research recorder that runs at session end to extract decisions, experiments, dead ends, and pivots from conversation history into the ara/ directory with user-vs-AI provenance tags (324 lines + 3 refs)
  • ARA Rigor Reviewer - ARA Seal Level 2 semantic epistemic review scoring six dimensions of research rigor (evidence relevance, falsifiability, scope, coherence, exploration integrity, methodology) with severity-ranked findings (322 lines + 1 ref)

Demos

All 98 skills in this repo are automatically synced to Orchestra Research, where you can add them to your projects with one click and use them with AI research agents.

See skills in action → demos/

We maintain a curated collection of demo repositories showing how to use skills for real AI research tasks:

| Demo | Skills Used | What It Does |
| --- | --- | --- |
| Norm Heterogeneity → LoRA Brittleness | Autoresearch, ML Paper Writing, Ideation | Agent autonomously discovered that norm heterogeneity predicts fine-tuning difficulty (r = -0.99), pivoting from a null result on ETF overlaps |
| RL Algorithm Brain Scan | Autoresearch, GRPO, TRL, SAELens, TransformerLens, ML Paper Writing | Agent found DPO is a rank-1 perturbation (95.6% recovery from one SVD direction) while online RL is distributed and structure-preserving |
| NeMo Eval: GPQA Benchmark | NeMo Evaluator | Compare Llama 8B/70B/405B on graduate-level science questions |
| LoRA Without Regret Reproduction | GRPO, TRL | Reproduce SFT + GRPO RL experiments via prompting |
| Layer-Wise Quantization Experiment | llama.cpp, GGUF | Investigate optimal layer precision allocation: early layers at Q8 achieve 1.9× compression with 1.3% perplexity loss |
| Cross-Lingual Alignment Analysis | FAISS | Quantify how well multilingual embeddings align semantic concepts across 8 languages using FAISS similarity search |
| Scientific Plotting Demo | Academic Plotting | Generate publication-quality figures for the Andes QoE-aware LLM serving paper: Gemini AI architecture diagrams plus matplotlib data charts (CDF, multi-panel grids, bar charts) |

Featured Demos: Two papers produced entirely by AI agents using the autoresearch skill. The Norm Heterogeneity paper demonstrates autonomous research pivoting: the agent refuted its own hypothesis and discovered a stronger finding. The RL Brain Scan paper demonstrates multi-skill orchestration: the agent trained RL models, analyzed their internals with interpretability tools, and synthesized the insight that "DPO is rank-1 alignment." Both papers were written end-to-end by the agent.

Skill Structure

Each skill follows a battle-tested format for maximum usefulness:

skill-name/
├── SKILL.md                    # Quick reference (50-150 lines)
│   ├── Metadata (name, description, version)
│   ├── When to use this skill
│   ├── Quick patterns & examples
│   └── Links to references
│
├── references/                 # Deep documentation (300KB+)
│   ├── README.md              # From GitHub/official docs
│   ├── api.md                 # API reference
│   ├── tutorials.md           # Step-by-step guides
│   ├── issues.md              # Real GitHub issues & solutions
│   ├── releases.md            # Version history & breaking changes
│   └── file_structure.md      # Codebase navigation
│
├── scripts/                    # Helper scripts (optional)
└── assets/                     # Templates & examples (optional)

Quality Standards
  • 300KB+ documentation from official sources
  • Real GitHub issues & solutions (when available)
  • Code examples with language detection
  • Version history & breaking changes
  • Links to official docs

Roadmap

We originally set out to build 80 comprehensive skills across the full AI research lifecycle and have since passed that milestone. See our detailed roadmap for the complete development plan.

View Full Roadmap →

View Detailed Statistics
| Metric | Current | Target |
| --- | --- | --- |
| Skills | 98 (high-quality, standardized YAML) | 80 ✅ |
| Avg Lines/Skill | 420 lines (focused + progressive disclosure) | 200-600 lines |
| Documentation | ~130,000 lines total (SKILL.md + references) | 100,000+ lines |
| Gold Standard Skills | 65 with comprehensive references | 50+ |
| Contributors | 1 | 100+ |
| Coverage | Architecture, Tokenization, Fine-Tuning, Mechanistic Interpretability, Data Processing, Post-Training, Safety, Distributed, Optimization, Evaluation, Infrastructure, Inference, Agents, RAG, Multimodal, Prompt Engineering, MLOps, Observability, Emerging Techniques, ML Paper Writing, Ideation, Autoresearch, Agent-Native Research Artifact | Full Lifecycle ✅ |

Recent Progress: npm package @orchestra-research/ai-research-skills for one-command installation across all coding agents

Philosophy: Quality > Quantity. Following Anthropic's official best practices, each skill provides 200-500 lines of focused, actionable guidance with progressive disclosure.

Repository Structure

claude-ai-research-skills/
├── README.md                    ← You are here
├── CONTRIBUTING.md              ← Contribution guide
├── demos/                       ← Curated demo gallery (links to demo repos)
├── docs/
├── 0-autoresearch-skill/        (1 skill ✓ - Autonomous research orchestration)
├── 01-model-architecture/       (5 skills ✓ - LitGPT, Mamba, RWKV, NanoGPT, TorchTitan)
├── 02-tokenization/             (2 skills ✓ - HuggingFace Tokenizers, SentencePiece)
├── 03-fine-tuning/              (4 skills ✓ - Axolotl, LLaMA-Factory, Unsloth, PEFT)
├── 04-mechanistic-interpretability/ (4 skills ✓ - TransformerLens, SAELens, pyvene, nnsight)
├── 05-data-processing/          (2 skills ✓ - Ray Data, NeMo Curator)
├── 06-post-training/            (8 skills ✓ - TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge)
├── 07-safety-alignment/         (4 skills ✓ - Constitutional AI, LlamaGuard, NeMo Guardrails, Prompt Guard)
├── 08-distributed-training/     (6 skills ✓ - Megatron-Core, DeepSpeed, FSDP, Accelerate, Lightning, Ray Train)
├── 09-infrastructure/           (3 skills ✓ - Modal, SkyPilot, Lambda Labs)
├── 10-optimization/             (6 skills ✓ - Flash Attention, bitsandbytes, GPTQ, AWQ, HQQ, GGUF)
├── 11-evaluation/               (3 skills ✓ - lm-evaluation-harness, BigCode, NeMo Evaluator)
├── 12-inference-serving/        (4 skills ✓ - vLLM, TensorRT-LLM, llama.cpp, SGLang)
├── 13-mlops/                    (3 skills ✓ - Weights & Biases, MLflow, TensorBoard)
├── 14-agents/                   (4 skills ✓ - LangChain, LlamaIndex, CrewAI, AutoGPT)
├── 15-rag/                      (5 skills ✓ - Chroma, FAISS, Sentence Transformers, Pinecone, Qdrant)
├── 16-prompt-engineering/       (4 skills ✓ - DSPy, Instructor, Guidance, Outlines)
├── 17-observability/            (2 skills ✓ - LangSmith, Phoenix)
├── 18-multimodal/               (7 skills ✓ - CLIP, Whisper, LLaVA, Stable Diffusion, SAM, BLIP-2, AudioCraft)
├── 19-emerging-techniques/      (6 skills ✓ - MoE, Model Merging, Long Context, Speculative Decoding, Distillation, Pruning)
├── 20-ml-paper-writing/         (2 skills ✓ - ML Paper Writing with LaTeX templates, Academic Plotting)
├── 21-research-ideation/           (2 skills ✓ - Research Brainstorming, Creative Thinking)
├── 22-agent-native-research-artifact/ (3 skills ✓ - ARA Compiler, Research Manager, Rigor Reviewer)
└── packages/ai-research-skills/ (npm package for one-command installation)

Use Cases

For Researchers

"I need to fine-tune Llama 3 with custom data" → 03-fine-tuning/axolotl/ - YAML configs, 100+ model support

For ML Engineers

"How do I optimize inference latency?" → 12-inference-serving/vllm/ - PagedAttention, batching

For Students

"I want to learn how transformers work" → 01-model-architecture/litgpt/ - Clean implementations

For Teams

"We need to scale training to 100 GPUs" → 08-distributed-training/deepspeed/ - ZeRO stages, 3D parallelism

License

MIT License - See LICENSE for details.

Note: Individual skills may reference libraries with different licenses. Please check each project's license before use.

Citation

If you use AI Research Skills in your work or find it helpful for a publication, we'd appreciate a citation:

BibTeX

@software{ai_research_skills,
  title     = {AI Research Skills Library},
  author    = {{Orchestra Research}},
  year      = {2025},
  url       = {https://github.com/orchestra-research/AI-research-SKILLs},
  note      = {Open-source skills library enabling AI agents to autonomously conduct AI research}
}

APA

Orchestra Research. (2025). AI Research Skills Library [Computer software]. https://github.com/orchestra-research/AI-research-SKILLs

Chicago

Orchestra Research. "AI Research Skills Library." GitHub, 2025. https://github.com/orchestra-research/AI-research-SKILLs.

IEEE

Orchestra Research, "AI Research Skills Library," 2025. [Online]. Available: https://github.com/orchestra-research/AI-research-SKILLs

Tip: You can also click "Cite this repository" in the GitHub sidebar for auto-formatted citations.

Acknowledgments

Built with:

  • Claude Code - AI pair programming
  • Skill Seeker - Automated doc scraping
  • Open Source AI Community - For amazing tools and docs

Special thanks to:

  • EleutherAI, HuggingFace, NVIDIA, Lightning AI, Meta AI, Anthropic
  • All researchers who maintain excellent documentation

Contributors

Thanks to all the people who have contributed to the AI Research Skills Library.

We welcome contributions from the AI research community! See CONTRIBUTING.md for detailed guidelines on:

  • Adding new skills
  • Improving existing skills
  • Quality standards and best practices
  • Submission process

Recent Updates

April 2026 - v1.6.0 🧬 Agent-Native Research Artifact (ARA) — 23rd Category, 98 Skills
  • 🧬 NEW CATEGORY: 22-agent-native-research-artifact/ (the 23rd category) — three skills that turn research outputs into a falsifiable, agent-traversable artifact:
    • 🛠️ ARA Compiler — compiles any input (PDF papers, GitHub repos, experiment logs, raw notes) into a structured ARA with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph (research DAG), and grounded evidence
    • 📋 ARA Research Manager — post-task epilogue that scans conversation history at session end and writes decisions, experiments, dead ends, claims, heuristics, and pivots into the ara/ directory with user / ai-suggested / ai-executed / user-revised provenance tags
    • 🔍 ARA Rigor Reviewer — Seal Level 2 semantic epistemic review scoring six dimensions of research rigor (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and emitting a severity-ranked report with a Strong Accept-to-Reject recommendation
  • 🔗 Sourced from the Agent-Native-Research-Artifact-Init reference repo, restructured to AI-research-SKILLs standards (kebab-case names, third-person descriptions, Title-Case tags, one-level-deep references)
  • 🧩 Plugin entry agent-native-research-artifact added to .claude-plugin/marketplace.json; CLI category registered as 22-agent-native-research-artifact with three individual skill entries in the npm installer
  • 🔄 Auto-syncs to Orchestra marketplace via sync-skills.yml on push; npm package republished as @orchestra-research/ai-research-skills@1.6.0 via publish-npm.yml on version bump
  • 📊 98 total skills across 23 categories — full lifecycle from idea → paper → falsifiable, auditable artifact
March 2026 - v1.4.0 🔬 Autoresearch & 86 Skills — Full Research Lifecycle
  • 🔬 NEW SKILL: Autoresearch — autonomous research orchestration using a two-loop architecture (inner optimization loop + outer synthesis loop)
  • 🧠 Manages the full research lifecycle: literature survey → ideation → experiments → synthesis → paper writing
  • 🔄 Routes to all 86 domain skills automatically — agents don't need to know which skill to use
  • ⏰ Mandatory /loop (Claude Code) and cron job (OpenClaw) for continuous autonomous operation
  • 📊 Generates research presentations (HTML/PDF) with optimization trajectory plots for human review
  • 📝 Findings.md as persistent project memory across sessions with "Lessons and Constraints" tracking
  • 🗂️ Structured workspace: research-state.yaml, findings.md, research-log.md, literature/, experiments/, src/, data/, to_human/
  • 📄 Two demo papers produced by autoresearch: Norm Heterogeneity → LoRA Brittleness and RL Algorithm Brain Scan
  • 🚀 WELCOME.md for cold-start agent bootstrap — one URL to go from zero to autonomous research
  • 📦 npm v1.4.x with Windows symlink fallback, all 22 categories installable
  • 🤖 Supported agents: Claude Code, Hermes Agent, OpenCode, OpenClaw, Cursor, Codex, Gemini CLI, Qwen Code
  • 📊 87 total skills across 22 categories — complete research lifecycle coverage
February 2026 - v0.15.0 🛡️ Prompt Guard & 83 Skills
  • 🛡️ NEW SKILL: Prompt Guard - Meta's 86M prompt injection & jailbreak detector
  • ⚡ 99%+ TPR, <1% FPR, <2ms GPU latency, multilingual (8 languages)
  • 🔒 3 workflows: user input filtering, third-party data filtering, batch RAG processing
  • 📊 83 total skills across 20 categories
January 2026 - v0.14.0 📦 npm Package & 82 Skills
  • 📦 NEW: npx @orchestra-research/ai-research-skills - One-command installation for all coding agents
  • 🤖 Supported agents: Claude Code, OpenCode, Cursor, Codex, Gemini CLI, Qwen Code
  • ✨ Interactive installer with category/individual skill selection
  • 🔄 Update installed skills, selective uninstall
  • 📊 82 total skills (5 new post-training skills: verl, slime, miles, torchforge + TorchTitan)
  • 🏗️ Megatron-Core moved to Distributed Training category
January 2026 - v0.13.0 📝 ML Paper Writing & Demos Gallery
  • 📝 NEW CATEGORY: ML Paper Writing (20th category, 77th skill)
  • 🎯 Write publication-ready papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM
  • 📚 Writing philosophy from top researchers (Neel Nanda, Farquhar, Gopen & Swan, Lipton, Perez)
  • 🔬 Citation verification workflow - never hallucinate references
  • 📄 LaTeX templates for 6 major conferences
  • 🎪 NEW: Curated demos gallery (demos/) showcasing skills in action
  • 🔗 Demo repos: NeMo Evaluator benchmark, LoRA Without Regret reproduction
  • 📖 936-line comprehensive SKILL.md with 4 workflows
January 2026 - v0.12.0 📊 NeMo Evaluator SDK
  • 📊 NEW SKILL: NeMo Evaluator SDK for enterprise LLM benchmarking
  • 🔧 NVIDIA's evaluation platform with 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM)
  • ⚡ Multi-backend execution: local Docker, Slurm HPC, Lepton cloud
  • 📦 Container-first architecture for reproducible evaluation
  • 📝 454 lines SKILL.md + 4 comprehensive reference files (~48KB documentation)
December 2025 - v0.11.0 🔬 Mechanistic Interpretability
  • 🔬 NEW CATEGORY: Mechanistic Interpretability (4 skills)
  • 🔍 TransformerLens skill: Neel Nanda's library for mech interp with HookPoints, activation caching, circuit analysis
  • 🧠 SAELens skill: Sparse Autoencoder training and analysis for feature discovery, monosemanticity research
  • ⚡ pyvene skill: Stanford's causal intervention library with declarative configs, DAS, activation patching
  • 🌐 nnsight skill: Remote interpretability via NDIF, run experiments on 70B+ models without local GPUs
  • 📝 ~6,500 new lines of documentation across 16 files
  • 76 total skills (filling the missing 04 category slot)
November 25, 2025 - v0.10.0 🎉 70 Skills Complete!
  • 🎉 ROADMAP COMPLETE: Reached 70-skill milestone!
  • 🚀 Added 4 skills: Lambda Labs, Segment Anything (SAM), BLIP-2, AudioCraft
  • ☁️ Lambda Labs skill: Reserved/on-demand GPU cloud with H100/A100, persistent filesystems, 1-Click Clusters
  • 🖼️ SAM skill: Meta's Segment Anything for zero-shot image segmentation with points/boxes/masks
  • 👁️ BLIP-2 skill: Vision-language pretraining with Q-Former, image captioning, VQA
  • 🎵 AudioCraft skill: Meta's MusicGen/AudioGen for text-to-music and text-to-sound generation
  • 📝 ~10,000 new lines of documentation across 12 files
  • 70 total skills (100% roadmap complete!)
November 25, 2025 - v0.9.0
  • 🚀 Added 2 infrastructure skills: Modal, SkyPilot
  • ☁️ Modal skill: Serverless GPU cloud with Python-native API, T4-H200 on-demand, auto-scaling
  • 🌐 SkyPilot skill: Multi-cloud orchestration across 20+ providers with spot recovery
  • ✨ New Infrastructure category (2 skills - serverless GPU and multi-cloud orchestration)
  • 📝 ~2,500 new lines of documentation across 6 files
  • 66 total skills (94% towards 70-skill target)
November 25, 2025 - v0.8.0
  • 🚀 Added 5 high-priority skills: HQQ, GGUF, Phoenix, AutoGPT, Stable Diffusion
  • ⚡ HQQ skill: Half-Quadratic Quantization without calibration data, multi-backend support
  • 📦 GGUF skill: llama.cpp quantization format, K-quant methods, CPU/Metal inference
  • 👁️ Phoenix skill: Open-source AI observability with OpenTelemetry tracing and LLM evaluation
  • 🤖 AutoGPT skill: Autonomous AI agent platform with visual workflow builder
  • 🎨 Stable Diffusion skill: Text-to-image generation via Diffusers, SDXL, ControlNet, LoRA
  • 📝 ~9,000 new lines of documentation across 15 files
  • 64 total skills (91% towards 70-skill target)
November 25, 2025 - v0.7.0
  • 🚀 Added 5 high-priority skills: PEFT, CrewAI, Qdrant, AWQ, LangSmith
  • ✨ New Observability category with LangSmith for LLM tracing and evaluation
  • 🎯 PEFT skill: Parameter-efficient fine-tuning with LoRA, QLoRA, DoRA, 25+ methods
  • 🤖 CrewAI skill: Multi-agent orchestration with role-based collaboration
  • 🔍 Qdrant skill: High-performance Rust vector search with hybrid filtering
  • ⚡ AWQ skill: Activation-aware 4-bit quantization with minimal accuracy loss
  • 📝 ~8,000 new lines of documentation across 15 files
  • 59 total skills (84% towards 70-skill target)
November 15, 2025 - v0.6.0
  • 📊 Added 3 comprehensive MLOps skills: Weights & Biases, MLflow, TensorBoard
  • ✨ New MLOps category (3 skills - experiment tracking, model registry, visualization)
  • 📝 ~10,000 new lines of documentation across 13 files
  • 🔧 Comprehensive coverage: experiment tracking, hyperparameter sweeps, model registry, profiling, embeddings visualization
  • 54 total skills (77% towards 70-skill target)
November 12, 2025 - v0.5.0
  • 🎯 Added 4 comprehensive prompt engineering skills: DSPy, Instructor, Guidance, Outlines
  • ✨ New Prompt Engineering category (4 skills - DSPy, Instructor, Guidance, Outlines)
  • 📝 ~10,000 new lines of documentation across 16 files
  • 🔧 Comprehensive coverage: declarative programming, structured outputs, constrained generation, FSM-based generation
  • 47 total skills (67% towards 70-skill target)
November 9, 2025 - v0.4.0
  • 🤖 Added 11 comprehensive skills: LangChain, LlamaIndex, Chroma, FAISS, Sentence Transformers, Pinecone, CLIP, Whisper, LLaVA
  • ✨ New Agents category (2 skills - LangChain, LlamaIndex)
  • 🔍 New RAG category (4 skills - Chroma, FAISS, Sentence Transformers, Pinecone)
  • 🎨 New Multimodal category (3 skills - CLIP, Whisper, LLaVA)
  • 📝 ~15,000 new lines of documentation
  • 43 total skills (61% towards 70-skill target)
November 8, 2025 - v0.3.0
  • 🚀 Added 8 comprehensive skills: TensorRT-LLM, llama.cpp, SGLang, GPTQ, HuggingFace Tokenizers, SentencePiece, Ray Data, NeMo Curator
  • ⚡ Completed Inference & Serving category (4/4 skills)
  • 🔤 New Tokenization category (2 skills)
  • 📊 New Data Processing category (2 skills)
  • 📝 9,617 new lines of documentation across 30 files
  • 32 total skills (45% towards 70-skill target)
November 6, 2025 - v0.2.0
  • Added 10 skills from GitHub (Megatron-Core, Lightning, Ray Train, etc.)
  • Improved skill structure with comprehensive references
  • Created strategic roadmap to 70 skills
  • Added contribution guidelines
November 3, 2025 - v0.1.0
  • 🎉 Initial release with 5 fine-tuning skills

Community

Join our community to stay updated, ask questions, and connect with other AI researchers:

  • SkillEvolve Meta-Skill - Connect your agent to the collective intelligence of the community. Captures techniques discovered during sessions and shares them back as curated skills.
  • Slack Community - Chat with the team and other users
  • Twitter/X - Follow for updates and announcements
  • LinkedIn - Connect professionally

Star History

Star History Chart