Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
Orchestra-Research

AI-Research-SKILLs

Quality
10.0

This library provides 98 specialized skills enabling AI agents to autonomously conduct the full AI research lifecycle, from ideation and literature review to experiment execution and paper writing. It is ideal for accelerating scientific discovery by offloading complex infrastructure and framework management to AI agents.

USP

It's the most comprehensive open-source library specifically designed for autonomous AI research, offering 98 expert-level skills across 23 categories, enabling agents to handle the entire research lifecycle from idea to paper.

Use cases

  • 01 Autonomous AI research orchestration
  • 02 Literature survey and idea generation
  • 03 Experiment execution and debugging
  • 04 ML paper writing and academic plotting
  • 05 Distributed LLM pretraining

Detected files (8)

  • 01-model-architecture/litgpt/SKILL.md
    ---
    name: implementing-llms-litgpt
    description: Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when you need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, LitGPT, Lightning AI, LLM Implementation, LoRA, QLoRA, Fine-Tuning, Llama, Gemma, Phi, Mistral, Educational]
    dependencies: [litgpt, torch, transformers]
    ---
    
    # LitGPT - Clean LLM Implementations
    
    ## Quick start
    
    LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.
    
    **Installation**:
    ```bash
    pip install 'litgpt[extra]'
    ```
    
    **Load and use any model**:
    ```python
    from litgpt import LLM
    
    # Load pretrained model
    llm = LLM.load("microsoft/phi-2")
    
    # Generate text
    result = llm.generate(
        "What is the capital of France?",
        max_new_tokens=50,
        temperature=0.7
    )
    print(result)
    ```
    
    **List available models**:
    ```bash
    litgpt download list
    ```
    
    ## Common workflows
    
    ### Workflow 1: Fine-tune on custom dataset
    
    Copy this checklist:
    
    ```
    Fine-Tuning Setup:
    - [ ] Step 1: Download pretrained model
    - [ ] Step 2: Prepare dataset
    - [ ] Step 3: Configure training
    - [ ] Step 4: Run fine-tuning
    ```
    
    **Step 1: Download pretrained model**
    
    ```bash
    # Download Llama 3 8B
    litgpt download meta-llama/Meta-Llama-3-8B
    
    # Download Phi-2 (smaller, faster)
    litgpt download microsoft/phi-2
    
    # Download Gemma 2B
    litgpt download google/gemma-2b
    ```
    
    Models are saved to the `checkpoints/` directory.
    
    **Step 2: Prepare dataset**
    
    LitGPT supports multiple formats:
    
    **Alpaca format** (instruction-response):
    ```json
    [
      {
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris."
      },
      {
        "instruction": "Translate to Spanish: Hello, how are you?",
        "input": "",
        "output": "Hola, ¿cómo estás?"
      }
    ]
    ```
    
    Save as `data/my_dataset.json`.
    
    **Step 3: Configure training**
    
    ```bash
    # Full fine-tuning (requires 40GB+ GPU for 7B models)
    litgpt finetune \
      meta-llama/Meta-Llama-3-8B \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --train.max_steps 1000 \
      --train.learning_rate 2e-5 \
      --train.micro_batch_size 1 \
      --train.global_batch_size 16
    
    # LoRA fine-tuning (efficient, 16GB GPU)
    litgpt finetune_lora \
      microsoft/phi-2 \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --lora_r 16 \
      --lora_alpha 32 \
      --lora_dropout 0.05 \
      --train.max_steps 1000 \
      --train.learning_rate 1e-4
    ```
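
    As a rough sanity check of the batch settings above (plain arithmetic, not a LitGPT API), the gradient-accumulation factor follows from the two batch sizes:

    ```python
    # Hypothetical check: micro-batches accumulated per optimizer step (assuming a single GPU)
    micro_batch_size = 1      # samples per forward/backward pass
    global_batch_size = 16    # samples per optimizer step
    print(global_batch_size // micro_batch_size)  # 16
    ```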
    
    **Step 4: Run fine-tuning**
    
    Training saves checkpoints to `out/finetune/` automatically.
    
    Monitor training:
    ```bash
    # View logs
    tail -f out/finetune/logs.txt
    
    # TensorBoard (if using --train.logger_name tensorboard)
    tensorboard --logdir out/finetune/lightning_logs
    ```
    
    ### Workflow 2: LoRA fine-tuning on single GPU
    
    Most memory-efficient option.
    
    ```
    LoRA Training:
    - [ ] Step 1: Choose base model
    - [ ] Step 2: Configure LoRA parameters
    - [ ] Step 3: Train with LoRA
    - [ ] Step 4: Merge LoRA weights (optional)
    ```
    
    **Step 1: Choose base model**
    
    For limited GPU memory (12-16GB):
    - **Phi-2** (2.7B) - Best quality/size tradeoff
    - **Llama 3.2 1B** - Smallest, fastest
    - **Gemma 2B** - Good reasoning
    
    **Step 2: Configure LoRA parameters**
    
    ```bash
    # Flag reference (inline comments would break the line-continued command):
    #   --lora_r 16             LoRA rank (8-64, higher = more capacity)
    #   --lora_alpha 32         LoRA scaling (typically 2×r)
    #   --lora_dropout 0.05     Prevent overfitting
    #   --lora_query true       Apply LoRA to query projection
    #   --lora_key false        Usually not needed
    #   --lora_value true       Apply LoRA to value projection
    #   --lora_projection true  Apply LoRA to output projection
    #   --lora_mlp false        Usually not needed
    #   --lora_head false       Usually not needed
    litgpt finetune_lora \
      microsoft/phi-2 \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --lora_r 16 \
      --lora_alpha 32 \
      --lora_dropout 0.05 \
      --lora_query true \
      --lora_key false \
      --lora_value true \
      --lora_projection true \
      --lora_mlp false \
      --lora_head false
    ```
    
    LoRA rank guide:
    - `r=8`: Lightweight, 2-4MB adapters
    - `r=16`: Standard, good quality
    - `r=32`: High capacity, use for complex tasks
    - `r=64`: Maximum quality, 4× larger adapters
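
    Where these sizes come from: each adapted weight matrix gains two low-rank factors with roughly `r × (d_in + d_out)` parameters, so adapter size grows linearly with rank. A rough, hypothetical estimate (the hidden size, layer count, and set of adapted matrices below are illustrative assumptions, not LitGPT defaults):

    ```python
    # Illustrative LoRA adapter size estimate (not LitGPT code; defaults are assumptions)
    def lora_adapter_megabytes(r, d_model=2560, n_layers=32, n_adapted=3, bytes_per_param=2):
        params_per_matrix = r * (d_model + d_model)        # A: d_model×r, B: r×d_model
        total_params = params_per_matrix * n_adapted * n_layers
        return total_params * bytes_per_param / 1e6

    for r in (8, 16, 32, 64):
        print(r, round(lora_adapter_megabytes(r), 1))      # size scales linearly with r
    ```

    Exact adapter size depends on the base model's dimensions and which projections LoRA is applied to.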
    
    **Step 3: Train with LoRA**
    
    ```bash
    litgpt finetune_lora \
      microsoft/phi-2 \
      --data JSON \
      --data.json_path data/my_dataset.json \
      --lora_r 16 \
      --train.epochs 3 \
      --train.learning_rate 1e-4 \
      --train.micro_batch_size 4 \
      --train.global_batch_size 32 \
      --out_dir out/phi2-lora
    
    # Memory usage: ~8-12GB for Phi-2 with LoRA
    ```
    
    **Step 4: Merge LoRA weights** (optional)
    
    Merge LoRA adapters into base model for deployment:
    
    ```bash
    litgpt merge_lora \
      out/phi2-lora/final \
      --out_dir out/phi2-merged
    ```
    
    Now use merged model:
    ```python
    from litgpt import LLM
    llm = LLM.load("out/phi2-merged")
    ```
    
    ### Workflow 3: Pretrain from scratch
    
    Train new model on your domain data.
    
    ```
    Pretraining:
    - [ ] Step 1: Prepare pretraining dataset
    - [ ] Step 2: Configure model architecture
    - [ ] Step 3: Set up multi-GPU training
    - [ ] Step 4: Launch pretraining
    ```
    
    **Step 1: Prepare pretraining dataset**
    
    LitGPT expects tokenized data. Use `prepare_dataset.py`:
    
    ```bash
    python scripts/prepare_dataset.py \
      --source_path data/my_corpus.txt \
      --checkpoint_dir checkpoints/tokenizer \
      --destination_path data/pretrain \
      --split train,val
    ```
    
    **Step 2: Configure model architecture**
    
    Edit config file or use existing:
    
    ```yaml
    # config/pythia-160m.yaml
    model_name: pythia-160m
    block_size: 2048
    vocab_size: 50304
    n_layer: 12
    n_head: 12
    n_embd: 768
    rotary_percentage: 0.25
    parallel_residual: true
    bias: true
    ```
    
    **Step 3: Set up multi-GPU training**
    
    ```bash
    # Single GPU
    litgpt pretrain \
      --config config/pythia-160m.yaml \
      --data.data_dir data/pretrain \
      --train.max_tokens 10_000_000_000
    
    # Multi-GPU with FSDP
    litgpt pretrain \
      --config config/pythia-1b.yaml \
      --data.data_dir data/pretrain \
      --devices 8 \
      --train.max_tokens 100_000_000_000
    ```
    
    **Step 4: Launch pretraining**
    
    For large-scale pretraining on cluster:
    
    ```bash
    # Using SLURM
    sbatch --nodes=8 --gpus-per-node=8 \
      pretrain_script.sh
    
    # pretrain_script.sh content:
    litgpt pretrain \
      --config config/pythia-1b.yaml \
      --data.data_dir /shared/data/pretrain \
      --devices 8 \
      --num_nodes 8 \
      --train.global_batch_size 512 \
      --train.max_tokens 300_000_000_000
    ```
    
    ### Workflow 4: Convert and deploy model
    
    Export LitGPT models for production.
    
    ```
    Model Deployment:
    - [ ] Step 1: Test inference locally
    - [ ] Step 2: Quantize model (optional)
    - [ ] Step 3: Convert to GGUF (for llama.cpp)
    - [ ] Step 4: Deploy with API
    ```
    
    **Step 1: Test inference locally**
    
    ```python
    from litgpt import LLM
    
    llm = LLM.load("out/phi2-lora/final")
    
    # Single generation
    print(llm.generate("What is machine learning?"))
    
    # Streaming
    for token in llm.generate("Explain quantum computing", stream=True):
        print(token, end="", flush=True)
    
    # Batch inference
    prompts = ["Hello", "Goodbye", "Thank you"]
    results = [llm.generate(p) for p in prompts]
    ```
    
    **Step 2: Quantize model** (optional)
    
    Reduce model size with minimal quality loss:
    
    ```bash
    # 4-bit NF4 quantization
    litgpt convert_lit_checkpoint \
      out/phi2-lora/final \
      --dtype bfloat16 \
      --quantize bnb.nf4

    # 4-bit NF4 with double quantization (slightly smaller)
    litgpt convert_lit_checkpoint \
      out/phi2-lora/final \
      --quantize bnb.nf4-dq
    ```
    
    **Step 3: Convert to GGUF** (for llama.cpp)
    
    ```bash
    python scripts/convert_lit_checkpoint.py \
      --checkpoint_path out/phi2-lora/final \
      --output_path models/phi2.gguf \
      --model_name microsoft/phi-2
    ```
    
    **Step 4: Deploy with API**
    
    ```python
    from fastapi import FastAPI
    from litgpt import LLM
    
    app = FastAPI()
    llm = LLM.load("out/phi2-lora/final")
    
    @app.post("/generate")
    def generate(prompt: str, max_tokens: int = 100):
        result = llm.generate(
            prompt,
            max_new_tokens=max_tokens,
            temperature=0.7
        )
        return {"response": result}
    
    # Run: uvicorn api:app --host 0.0.0.0 --port 8000
    ```
    
    ## When to use vs alternatives
    
    **Use LitGPT when:**
    - Want to understand LLM architectures (clean, readable code)
    - Need production-ready training recipes
    - Educational purposes or research
    - Prototyping new model ideas
    - Lightning ecosystem user
    
    **Use alternatives instead:**
    - **Axolotl/TRL**: More fine-tuning features, YAML configs
    - **Megatron-Core**: Maximum performance for >70B models
    - **HuggingFace Transformers**: Broadest model support
    - **vLLM**: Inference-only (no training)
    
    ## Common issues
    
    **Issue: Out of memory during fine-tuning**
    
    Use LoRA instead of full fine-tuning:
    ```bash
    # Instead of litgpt finetune (requires 40GB+)
    litgpt finetune_lora  # Only needs 12-16GB
    ```
    
    Or keep the effective batch size while shrinking the micro-batch via gradient accumulation:
    ```bash
    litgpt finetune_lora \
      ... \
      --train.gradient_accumulation_iters 4  # Accumulate gradients
    ```
    
    **Issue: Training too slow**
    
    Enable Flash Attention (built-in, automatic on compatible hardware):
    ```python
    # Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
    # No configuration needed
    ```
    
    Use smaller micro-batch and accumulate:
    ```bash
    --train.micro_batch_size 1 \
    --train.global_batch_size 32 \
    --train.gradient_accumulation_iters 32  # Effective batch=32
    ```
    
    **Issue: Model not loading**
    
    Check model name:
    ```bash
    # List all available models
    litgpt download list
    
    # Download if not exists
    litgpt download meta-llama/Meta-Llama-3-8B
    ```
    
    Verify checkpoints directory:
    ```bash
    ls checkpoints/
    # Should see: meta-llama/Meta-Llama-3-8B/
    ```
    
    **Issue: LoRA adapters too large**
    
    Reduce LoRA rank:
    ```bash
    --lora_r 8  # Instead of 16 or 32
    ```
    
    Apply LoRA to fewer layers:
    ```bash
    --lora_query true \
    --lora_value true \
    --lora_projection false \
    --lora_mlp false  # Disable projection and MLP adapters
    ```
    
    ## Advanced topics
    
    **Supported architectures**: See [references/supported-models.md](references/supported-models.md) for complete list of 20+ model families with sizes and capabilities.
    
    **Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for proven hyperparameter configurations for pretraining and fine-tuning.
    
    **FSDP configuration**: See [references/distributed-training.md](references/distributed-training.md) for multi-GPU training with Fully Sharded Data Parallel.
    
    **Custom architectures**: See [references/custom-models.md](references/custom-models.md) for implementing new model architectures in LitGPT style.
    
    ## Hardware requirements
    
    - **GPU**: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
    - **Memory**:
      - Inference (Phi-2): 6GB
      - LoRA fine-tuning (7B): 16GB
      - Full fine-tuning (7B): 40GB+
      - Pretraining (1B): 24GB
    - **Storage**: 5-50GB per model (depending on size)
    
    ## Resources
    
    - GitHub: https://github.com/Lightning-AI/litgpt
    - Docs: https://lightning.ai/docs/litgpt
    - Tutorials: https://lightning.ai/docs/litgpt/tutorials
    - Model zoo: 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral, Mixtral, Falcon, etc.)
    
    
    
  • 01-model-architecture/nanogpt/SKILL.md
    ---
    name: nanogpt
    description: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, NanoGPT, GPT-2, Educational, Andrej Karpathy, Transformer, Minimalist, From Scratch, Training]
    dependencies: [torch, transformers, datasets, tiktoken, wandb]
    ---
    
    # nanoGPT - Minimalist GPT Training
    
    ## Quick start
    
    nanoGPT is a simplified GPT implementation designed for learning and experimentation.
    
    **Installation**:
    ```bash
    pip install torch numpy transformers datasets tiktoken wandb tqdm
    ```
    
    **Train on Shakespeare** (CPU-friendly):
    ```bash
    # Prepare data
    python data/shakespeare_char/prepare.py
    
    # Train (5 minutes on CPU)
    python train.py config/train_shakespeare_char.py
    
    # Generate text
    python sample.py --out_dir=out-shakespeare-char
    ```
    
    **Output**:
    ```
    ROMEO:
    What say'st thou? Shall I speak, and be a man?
    
    JULIET:
    I am afeard, and yet I'll speak; for thou art
    One that hath been a man, and yet I know not
    What thou art.
    ```
    
    ## Common workflows
    
    ### Workflow 1: Character-level Shakespeare
    
    **Complete training pipeline**:
    ```bash
    # Step 1: Prepare data (creates train.bin, val.bin)
    python data/shakespeare_char/prepare.py
    
    # Step 2: Train small model
    python train.py config/train_shakespeare_char.py
    
    # Step 3: Generate text
    python sample.py --out_dir=out-shakespeare-char
    ```
    
    **Config** (`config/train_shakespeare_char.py`):
    ```python
    # Model config
    n_layer = 6          # 6 transformer layers
    n_head = 6           # 6 attention heads
    n_embd = 384         # 384-dim embeddings
    block_size = 256     # 256 char context
    
    # Training config
    batch_size = 64
    learning_rate = 1e-3
    max_iters = 5000
    eval_interval = 500
    
    # Hardware
    device = 'cpu'  # Or 'cuda'
    compile = False # Set True for PyTorch 2.0
    ```
    
    **Training time**: ~5 minutes (CPU), ~1 minute (GPU)
    
    ### Workflow 2: Reproduce GPT-2 (124M)
    
    **Multi-GPU training on OpenWebText**:
    ```bash
    # Step 1: Prepare OpenWebText (takes ~1 hour)
    python data/openwebtext/prepare.py
    
    # Step 2: Train GPT-2 124M with DDP (8 GPUs)
    torchrun --standalone --nproc_per_node=8 \
      train.py config/train_gpt2.py
    
    # Step 3: Sample from trained model
    python sample.py --out_dir=out
    ```
    
    **Config** (`config/train_gpt2.py`):
    ```python
    # GPT-2 (124M) architecture
    n_layer = 12
    n_head = 12
    n_embd = 768
    block_size = 1024
    dropout = 0.0
    
    # Training
    batch_size = 12
    gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
    learning_rate = 6e-4
    max_iters = 600000
    lr_decay_iters = 600000
    
    # System
    compile = True  # PyTorch 2.0
    ```
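
    A quick check of the "~0.5M tokens" comment above (plain arithmetic, not nanoGPT code):

    ```python
    # Tokens processed per optimizer step with the config above
    batch_size = 12
    block_size = 1024
    gradient_accumulation_steps = 5 * 8   # the factor of 8 corresponds to 8 GPUs
    print(batch_size * block_size * gradient_accumulation_steps)  # 491520 ≈ 0.5M tokens
    ```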
    
    **Training time**: ~4 days (8× A100)
    
    ### Workflow 3: Fine-tune pretrained GPT-2
    
    **Start from OpenAI checkpoint**:
    ```python
    # In train.py or config
    init_from = 'gpt2'  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
    
    # Model loads OpenAI weights automatically when you run:
    #   python train.py config/finetune_shakespeare.py
    ```
    
    **Example config** (`config/finetune_shakespeare.py`):
    ```python
    # Start from GPT-2
    init_from = 'gpt2'
    
    # Dataset
    dataset = 'shakespeare_char'
    batch_size = 1
    block_size = 1024
    
    # Fine-tuning
    learning_rate = 3e-5  # Lower LR for fine-tuning
    max_iters = 2000
    warmup_iters = 100
    
    # Regularization
    weight_decay = 1e-1
    ```
    
    ### Workflow 4: Custom dataset
    
    **Train on your own text**:
    ```python
    # data/custom/prepare.py
    import numpy as np
    
    # Load your data
    with open('my_data.txt', 'r') as f:
        text = f.read()
    
    # Create character mappings
    chars = sorted(list(set(text)))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    
    # Tokenize
    data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
    
    # Split train/val
    n = len(data)
    train_data = data[:int(n*0.9)]
    val_data = data[int(n*0.9):]
    
    # Save
    train_data.tofile('data/custom/train.bin')
    val_data.tofile('data/custom/val.bin')
    ```
    
    **Train**:
    ```bash
    python data/custom/prepare.py
    python train.py --dataset=custom
    ```
    
    ## When to use vs alternatives
    
    **Use nanoGPT when**:
    - Learning how GPT works
    - Experimenting with transformer variants
    - Teaching/education purposes
    - Quick prototyping
    - Limited compute (can run on CPU)
    
    **Simplicity advantages**:
    - **~300 lines**: Entire model in `model.py`
    - **~300 lines**: Training loop in `train.py`
    - **Hackable**: Easy to modify
    - **No abstractions**: Pure PyTorch
    
    **Use alternatives instead**:
    - **HuggingFace Transformers**: Production use, many models
    - **Megatron-LM**: Large-scale distributed training
    - **LitGPT**: More architectures, production-ready
    - **PyTorch Lightning**: Need high-level framework
    
    ## Common issues
    
    **Issue: CUDA out of memory**
    
    Reduce batch size or context length:
    ```python
    batch_size = 1  # Reduce from 12
    block_size = 512  # Reduce from 1024
    gradient_accumulation_steps = 40  # Increase to maintain effective batch
    ```
    
    **Issue: Training too slow**
    
    Enable compilation (PyTorch 2.0+):
    ```python
    compile = True  # 2× speedup
    ```
    
    Use mixed precision:
    ```python
    dtype = 'bfloat16'  # Or 'float16'
    ```
    
    **Issue: Poor generation quality**
    
    Train longer:
    ```python
    max_iters = 10000  # Increase from 5000
    ```
    
    Lower temperature:
    ```python
    # In sample.py
    temperature = 0.7  # Lower from 1.0
    top_k = 200       # Add top-k sampling
    ```
    
    **Issue: Can't load GPT-2 weights**
    
    Install transformers:
    ```bash
    pip install transformers
    ```
    
    Check model name:
    ```python
    init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
    ```
    
    ## Advanced topics
    
    **Model architecture**: See [references/architecture.md](references/architecture.md) for GPT block structure, multi-head attention, and MLP layers explained simply.
    
    **Training loop**: See [references/training.md](references/training.md) for learning rate schedule, gradient accumulation, and distributed data parallel setup.
    
    **Data preparation**: See [references/data.md](references/data.md) for tokenization strategies (character-level vs BPE) and binary format details.
    
    ## Hardware requirements
    
    - **Shakespeare (char-level)**:
      - CPU: 5 minutes
      - GPU (T4): 1 minute
      - VRAM: <1GB
    
    - **GPT-2 (124M)**:
      - 1× A100: ~1 week
      - 8× A100: ~4 days
      - VRAM: ~16GB per GPU
    
    - **GPT-2 Medium (350M)**:
      - 8× A100: ~2 weeks
      - VRAM: ~40GB per GPU
    
    **Performance**:
    - With `compile=True`: 2× speedup
    - With `dtype=bfloat16`: 50% memory reduction
    
    ## Resources
    
    - GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
    - Video: "Let's build GPT" by Andrej Karpathy
    - Paper: "Attention is All You Need" (Vaswani et al.)
    - OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
    - Educational: Best for understanding transformers from scratch
    
    
    
  • 01-model-architecture/rwkv/SKILL.md
    ---
    name: rwkv-architecture
    description: RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [RWKV, Model Architecture, RNN, Transformer Hybrid, Linear Complexity, Infinite Context, Efficient Inference, Linux Foundation, Alternative Architecture]
    dependencies: [rwkv, torch, transformers]
    ---
    
    # RWKV - Receptance Weighted Key Value
    
    ## Quick start
    
    RWKV (pronounced "RwaKuv") combines Transformer parallelization (training) with RNN efficiency (inference).
    
    **Installation**:
    ```bash
    # Install PyTorch
    pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
    
    # Install dependencies
    pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
    
    # Install RWKV
    pip install rwkv
    ```
    
    **Basic usage** (GPT mode + RNN mode):
    ```python
    import os
    from rwkv.model import RWKV
    
    os.environ["RWKV_JIT_ON"] = '1'
    os.environ["RWKV_CUDA_ON"] = '1'  # Use CUDA kernel for speed
    
    # Load model
    model = RWKV(
        model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
        strategy='cuda fp16'
    )
    
    # GPT mode (parallel processing)
    out, state = model.forward([187, 510, 1563, 310, 247], None)
    print(out.detach().cpu().numpy())  # Logits
    
    # RNN mode (sequential processing, same result)
    out, state = model.forward([187, 510], None)  # First 2 tokens
    out, state = model.forward([1563], state)      # Next token
    out, state = model.forward([310, 247], state)  # Last tokens
    print(out.detach().cpu().numpy())  # Same logits as above!
    ```
    
    ## Common workflows
    
    ### Workflow 1: Text generation (streaming)
    
    **Efficient token-by-token generation**:
    ```python
    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE
    
    model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
    pipeline = PIPELINE(model, "20B_tokenizer.json")
    
    # Initial prompt
    prompt = "The future of AI is"
    state = None
    
    # Feed the prompt (processed in parallel, like a Transformer)
    out, state = pipeline.model.forward(pipeline.encode(prompt), state)

    # Continue generation token by token: sample, print, feed the token back
    for _ in range(100):
        token = pipeline.sample_logits(out)
        print(pipeline.decode([token]), end='', flush=True)
        out, state = pipeline.model.forward([token], state)
    ```
    
    **Key advantage**: Constant memory per token (no growing KV cache)
    
    ### Workflow 2: Long context processing (infinite context)
    
    **Process million-token sequences**:
    ```python
    model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')
    
    # Process very long document
    state = None
    long_document = load_document()  # e.g., 1M tokens
    
    # Stream through entire document
    for chunk in chunks(long_document, chunk_size=1024):
        out, state = model.forward(chunk, state)
    
    # State now contains information from entire 1M token document
    # Memory usage: O(1) (constant, not O(n)!)
    ```
    
    ### Workflow 3: Fine-tuning RWKV
    
    **Standard fine-tuning workflow**:
    ```python
    # Training script (sketch; full-scale RWKV training uses the RWKV-LM repo)
    import pytorch_lightning as pl
    from rwkv.model import RWKV
    
    # Configure model
    config = {
        'n_layer': 24,
        'n_embd': 1024,
        'vocab_size': 50277,
        'ctx_len': 1024
    }
    
    # Setup trainer
    trainer = pl.Trainer(
        accelerator='gpu',
        devices=8,
        precision='bf16',
        strategy='deepspeed_stage_2',
        max_epochs=1
    )
    
    # Train (train_dataloader: a DataLoader over your tokenized dataset)
    model = RWKV(config)
    trainer.fit(model, train_dataloader)
    ```
    
    ### Workflow 4: RWKV vs Transformer comparison
    
    **Memory comparison** (1M token sequence):
    ```python
    # Transformer (GPT)
    # Memory: O(n²) for attention
    # KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
    # Example: 1M × 4096 × 24 × 2 = ~400GB (impractical!)
    
    # RWKV
    # Memory: O(1) per token
    # State: hidden_dim × n_layers = 4096 × 24 = ~400KB
    # 1,000,000× more efficient!
    ```
    
    **Speed comparison** (inference):
    ```python
    # Transformer: O(n) per token (quadratic overall)
    # First token: 1 computation
    # Second token: 2 computations
    # ...
    # 1000th token: 1000 computations
    
    # RWKV: O(1) per token (linear overall)
    # Every token: 1 computation
    # 1000th token: 1 computation (same as first!)
    ```
    
    ## When to use vs alternatives
    
    **Use RWKV when**:
    - Need very long context (100K+ tokens)
    - Want constant memory usage
    - Building streaming applications
    - Need RNN efficiency with Transformer performance
    - Memory-constrained deployment
    
    **Key advantages**:
    - **Linear time**: O(n) vs O(n²) for Transformers
    - **No KV cache**: Constant memory per token
    - **Infinite context**: No fixed window limit
    - **Parallelizable training**: Like GPT
    - **Sequential inference**: Like RNN
    
    **Use alternatives instead**:
    - **Transformers**: Need absolute best performance, have compute
    - **Mamba**: Want state-space models
    - **RetNet**: Need retention mechanism
    - **Hyena**: Want convolution-based approach
    
    ## Common issues
    
    **Issue: Out of memory during training**
    
    Use gradient checkpointing and DeepSpeed:
    ```python
    trainer = pl.Trainer(
        strategy='deepspeed_stage_3',  # Full ZeRO-3
        precision='bf16'
    )
    ```
    
    **Issue: Slow inference**
    
    Enable CUDA kernel:
    ```python
    os.environ["RWKV_CUDA_ON"] = '1'
    ```
    
    **Issue: Model not loading**
    
    Check model path and strategy:
    ```python
    model = RWKV(
        model='/absolute/path/to/model.pth',
        strategy='cuda fp16'  # Or 'cpu fp32' for CPU
    )
    ```
    
    **Issue: State management in RNN mode**
    
    Always pass state between forward calls:
    ```python
    # WRONG: State lost
    out1, _ = model.forward(tokens1, None)
    out2, _ = model.forward(tokens2, None)  # No context from tokens1!
    
    # CORRECT: State preserved
    out1, state = model.forward(tokens1, None)
    out2, state = model.forward(tokens2, state)  # Has context from tokens1
    ```
    
    ## Advanced topics
    
    **Time-mixing and channel-mixing**: See [references/architecture-details.md](references/architecture-details.md) for WKV operation, time-decay mechanism, and receptance gates.
    
    **State management**: See [references/state-management.md](references/state-management.md) for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.
    
    **RWKV-7 improvements**: See [references/rwkv7.md](references/rwkv7.md) for latest architectural improvements (March 2025) and multimodal capabilities.
    
    ## Hardware requirements
    
    - **GPU**: NVIDIA (CUDA 11.6+) or CPU
    - **VRAM** (FP16):
      - 169M model: 1GB
      - 430M model: 2GB
      - 1.5B model: 4GB
      - 3B model: 8GB
      - 7B model: 16GB
      - 14B model: 32GB
    - **Inference**: O(1) memory per token
    - **Training**: Parallelizable like GPT
    
    **Performance** (vs Transformers):
    - **Speed**: Similar training, faster inference
    - **Memory**: 1000× less for long sequences
    - **Scaling**: Linear vs quadratic
    
    ## Resources
    
    - Paper (RWKV): https://arxiv.org/abs/2305.13048 (May 2023)
    - Paper (RWKV-7): https://arxiv.org/abs/2503.14456 (March 2025)
    - GitHub: https://github.com/BlinkDL/RWKV-LM ⭐ 12,000+
    - Docs: https://wiki.rwkv.com/
    - Models: https://huggingface.co/BlinkDL
    - Linux Foundation AI: Official project
    - Production: Microsoft Windows, Office integration, NeMo support
    
    
    
  • 01-model-architecture/torchtitan/SKILL.md
    ---
    name: distributed-llm-pretraining-torchtitan
    description: Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, Distributed Training, TorchTitan, FSDP2, Tensor Parallel, Pipeline Parallel, Context Parallel, Float8, Llama, Pretraining]
    dependencies: [torch>=2.6.0, torchtitan>=0.2.0, torchao>=0.5.0]
    ---
    
    # TorchTitan - PyTorch Native Distributed LLM Pretraining
    
    ## Quick start
    
    TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
    
    **Installation**:
    ```bash
    # From PyPI (stable)
    pip install torchtitan
    
    # From source (latest features, requires PyTorch nightly)
    git clone https://github.com/pytorch/torchtitan
    cd torchtitan
    pip install -r requirements.txt
    ```
    
    **Download tokenizer**:
    ```bash
    # Get HF token from https://huggingface.co/settings/tokens
    python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
    ```
    
    **Start training on 8 GPUs**:
    ```bash
    CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
    ```
    
    ## Common workflows
    
    ### Workflow 1: Pretrain Llama 3.1 8B on single node
    
    Copy this checklist:
    
    ```
    Single Node Pretraining:
    - [ ] Step 1: Download tokenizer
    - [ ] Step 2: Configure training
    - [ ] Step 3: Launch training
    - [ ] Step 4: Monitor and checkpoint
    ```
    
    **Step 1: Download tokenizer**
    
    ```bash
    python scripts/download_hf_assets.py \
      --repo_id meta-llama/Llama-3.1-8B \
      --assets tokenizer \
      --hf_token=YOUR_HF_TOKEN
    ```
    
    **Step 2: Configure training**
    
    Edit or create a TOML config file:
    
    ```toml
    # llama3_8b_custom.toml
    [job]
    dump_folder = "./outputs"
    description = "Llama 3.1 8B training"
    
    [model]
    name = "llama3"
    flavor = "8B"
    hf_assets_path = "./assets/hf/Llama-3.1-8B"
    
    [optimizer]
    name = "AdamW"
    lr = 3e-4
    
    [lr_scheduler]
    warmup_steps = 200
    
    [training]
    local_batch_size = 2
    seq_len = 8192
    max_norm = 1.0
    steps = 1000
    dataset = "c4"
    
    [parallelism]
    data_parallel_shard_degree = -1  # Use all GPUs for FSDP
    
    [activation_checkpoint]
    mode = "selective"
    selective_ac_option = "op"
    
    [checkpoint]
    enable = true
    folder = "checkpoint"
    interval = 500
    ```
    
    **Step 3: Launch training**
    
    ```bash
    # 8 GPUs on single node
    CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
    
    # Or explicitly with torchrun
    torchrun --nproc_per_node=8 \
      -m torchtitan.train \
      --job.config_file ./llama3_8b_custom.toml
    ```
    
    **Step 4: Monitor and checkpoint**
    
    TensorBoard logs are saved to `./outputs/tb/`:
    ```bash
    tensorboard --logdir ./outputs/tb
    ```
    
    ### Workflow 2: Multi-node training with SLURM
    
    ```
    Multi-Node Training:
    - [ ] Step 1: Configure parallelism for scale
    - [ ] Step 2: Set up SLURM script
    - [ ] Step 3: Submit job
    - [ ] Step 4: Resume from checkpoint
    ```
    
    **Step 1: Configure parallelism for scale**
    
    For 70B model on 256 GPUs (32 nodes):
    ```toml
    [parallelism]
    data_parallel_shard_degree = 32  # FSDP across 32 ranks
    tensor_parallel_degree = 8        # TP within node
    pipeline_parallel_degree = 1      # No PP for 70B
    context_parallel_degree = 1       # Increase for long sequences
    ```
    
    **Step 2: Set up SLURM script**
    
    ```bash
    #!/bin/bash
    #SBATCH --job-name=llama70b
    #SBATCH --nodes=32
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    
    srun torchrun \
      --nnodes=32 \
      --nproc_per_node=8 \
      --rdzv_backend=c10d \
      --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
      -m torchtitan.train \
      --job.config_file ./llama3_70b.toml
    ```
    
    **Step 3: Submit job**
    
    ```bash
    sbatch multinode_trainer.slurm
    ```
    
    **Step 4: Resume from checkpoint**
    
    Training auto-resumes if checkpoint exists in configured folder.
    
    ### Workflow 3: Enable Float8 training for H100s
    
    Float8 provides 30-50% speedup on H100 GPUs.
    
    ```
    Float8 Training:
    - [ ] Step 1: Install torchao
    - [ ] Step 2: Configure Float8
    - [ ] Step 3: Launch with compile
    ```
    
    **Step 1: Install torchao**
    
    ```bash
    USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
    ```
    
    **Step 2: Configure Float8**
    
    Add to your TOML config:
    ```toml
    [model]
    converters = ["quantize.linear.float8"]
    
    [quantize.linear.float8]
    enable_fsdp_float8_all_gather = true
    precompute_float8_dynamic_scale_for_fsdp = true
    filter_fqns = ["output"]  # Exclude output layer
    
    [compile]
    enable = true
    components = ["model", "loss"]
    ```
    
    **Step 3: Launch with compile**
    
    ```bash
    CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
      --model.converters="quantize.linear.float8" \
      --quantize.linear.float8.enable_fsdp_float8_all_gather \
      --compile.enable
    ```
    
    ### Workflow 4: 4D parallelism for 405B models
    
    ```
    4D Parallelism (FSDP + TP + PP + CP):
    - [ ] Step 1: Create seed checkpoint
    - [ ] Step 2: Configure 4D parallelism
    - [ ] Step 3: Launch on 512 GPUs
    ```
    
    **Step 1: Create seed checkpoint**
    
    Required for consistent initialization across PP stages:
    ```bash
    NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
      --checkpoint.enable \
      --checkpoint.create_seed_checkpoint \
      --parallelism.data_parallel_shard_degree 1 \
      --parallelism.tensor_parallel_degree 1 \
      --parallelism.pipeline_parallel_degree 1
    ```
    
    **Step 2: Configure 4D parallelism**
    
    ```toml
    [parallelism]
    data_parallel_shard_degree = 8   # FSDP
    tensor_parallel_degree = 8       # TP within node
    pipeline_parallel_degree = 8     # PP across nodes
    context_parallel_degree = 1      # CP for long sequences
    
    [training]
    local_batch_size = 32
    seq_len = 8192
    ```
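
    The degrees must compose to the job's GPU count: with data-parallel replication left at 1, total GPUs = FSDP × TP × PP × CP. A quick check of the layout above (plain arithmetic):

    ```python
    # 8 (FSDP) × 8 (TP) × 8 (PP) × 1 (CP) = 512 GPUs = 64 nodes × 8 GPUs
    fsdp, tp, pp, cp = 8, 8, 8, 1
    assert fsdp * tp * pp * cp == 64 * 8
    ```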
    
    **Step 3: Launch on 512 GPUs**
    
    ```bash
    # 64 nodes x 8 GPUs = 512 GPUs
    srun torchrun --nnodes=64 --nproc_per_node=8 \
      -m torchtitan.train \
      --job.config_file ./llama3_405b.toml
    ```
    
    ## When to use vs alternatives
    
    **Use TorchTitan when:**
    - Pretraining LLMs from scratch (8B to 405B+)
    - Need PyTorch-native solution without third-party dependencies
    - Require composable 4D parallelism (FSDP2, TP, PP, CP)
    - Training on H100s with Float8 support
    - Want interoperable checkpoints with torchtune/HuggingFace
    
    **Use alternatives instead:**
    - **Megatron-LM**: Maximum performance for NVIDIA-only deployments
    - **DeepSpeed**: Broader ZeRO optimization ecosystem, inference support
    - **Axolotl/TRL**: Fine-tuning rather than pretraining
    - **LitGPT**: Educational, smaller-scale training
    
    ## Common issues
    
    **Issue: Out of memory on large models**
    
    Enable activation checkpointing and reduce batch size:
    ```toml
    [activation_checkpoint]
    mode = "full"  # Instead of "selective"
    
    [training]
    local_batch_size = 1
    ```
    
    Or use gradient accumulation:
    ```toml
    [training]
    local_batch_size = 1
    global_batch_size = 32  # Accumulates gradients
    ```
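
    The number of accumulation steps follows from the batch settings; a rough illustration (plain arithmetic, with an assumed data-parallel degree of 8 for a single node):

    ```python
    # Hypothetical example: micro-steps accumulated per optimizer step
    local_batch_size = 1     # per-rank batch per forward/backward
    dp_degree = 8            # assumed data-parallel ranks
    global_batch_size = 32
    print(global_batch_size // (local_batch_size * dp_degree))  # 4
    ```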
    
    **Issue: TP causes high memory with async collectives**
    
    Set environment variable:
    ```bash
    export TORCH_NCCL_AVOID_RECORD_STREAMS=1
    ```
    
    **Issue: Float8 training not faster**
    
    Float8 only benefits large GEMMs. Filter small layers:
    ```toml
    [quantize.linear.float8]
    filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
    ```
    
    **Issue: Checkpoint loading fails after parallelism change**
    
    Use DCP's resharding capability:
    ```bash
    # Convert sharded checkpoint to single file
    python -m torch.distributed.checkpoint.format_utils \
      dcp_to_torch checkpoint/step-1000 checkpoint.pt
    ```
    
    **Issue: Pipeline parallelism initialization**
    
    Create seed checkpoint first (see Workflow 4, Step 1).
    
    ## Supported models
    
    | Model | Sizes | Status |
    |-------|-------|--------|
    | Llama 3.1 | 8B, 70B, 405B | Production |
    | Llama 4 | Various | Experimental |
    | DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
    | GPT-OSS | 20B, 120B (MoE) | Experimental |
    | Qwen 3 | Various | Experimental |
    | Flux | Diffusion | Experimental |
    
    ## Performance benchmarks (H100)
    
    | Model | GPUs | Parallelism | TPS/GPU | Techniques |
    |-------|------|-------------|---------|------------|
    | Llama 8B | 8 | FSDP | 5,762 | Baseline |
    | Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
    | Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
    | Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
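
    The "+48%" entry follows directly from the two throughput values (plain arithmetic):

    ```python
    baseline_tps, fp8_tps = 5762, 8532   # TPS/GPU from the table above
    print(f"{(fp8_tps / baseline_tps - 1) * 100:.0f}%")  # 48%
    ```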
    
    ## Advanced topics
    
    **FSDP2 configuration**: See [references/fsdp.md](references/fsdp.md) for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
    
    **Float8 training**: See [references/float8.md](references/float8.md) for tensorwise vs rowwise scaling recipes.
    
    **Checkpointing**: See [references/checkpoint.md](references/checkpoint.md) for HuggingFace conversion and async checkpointing.
    
    **Adding custom models**: See [references/custom-models.md](references/custom-models.md) for TrainSpec protocol.
    
    ## Resources
    
    - GitHub: https://github.com/pytorch/torchtitan
    - Paper: https://arxiv.org/abs/2410.06511
    - ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
    - PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44
    
    
  • 02-tokenization/huggingface-tokenizers/SKILL.md
    ---
    name: huggingface-tokenizers
    description: Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production]
    dependencies: [tokenizers, transformers, datasets]
    ---
    
    # HuggingFace Tokenizers - Fast Tokenization for NLP
    
    Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
    
    ## When to use HuggingFace Tokenizers
    
    **Use HuggingFace Tokenizers when:**
    - Need extremely fast tokenization (<20s per GB of text)
    - Training custom tokenizers from scratch
    - Want alignment tracking (token → original text position)
    - Building production NLP pipelines
    - Need to tokenize large corpora efficiently
    
    **Performance**:
    - **Speed**: <20 seconds to tokenize 1GB on CPU
    - **Implementation**: Rust core with Python/Node.js bindings
    - **Efficiency**: 10-100× faster than pure Python implementations
    
    **Use alternatives instead**:
    - **SentencePiece**: Language-independent, used by T5/ALBERT
    - **tiktoken**: OpenAI's BPE tokenizer for GPT models
    - **transformers AutoTokenizer**: Loading pretrained only (uses this library internally)
    
    ## Quick start
    
    ### Installation
    
    ```bash
    # Install tokenizers
    pip install tokenizers
    
    # With transformers integration
    pip install tokenizers transformers
    ```
    
    ### Load pretrained tokenizer
    
    ```python
    from tokenizers import Tokenizer
    
    # Load from HuggingFace Hub
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    
    # Encode text
    output = tokenizer.encode("Hello, how are you?")
    print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
    print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]
    
    # Decode back
    text = tokenizer.decode(output.ids)
    print(text)  # "hello, how are you?"
    ```
    
    ### Train custom BPE tokenizer
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace
    
    # Initialize tokenizer with BPE model
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    
    # Configure trainer
    trainer = BpeTrainer(
        vocab_size=30000,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        min_frequency=2
    )
    
    # Train on files
    files = ["train.txt", "validation.txt"]
    tokenizer.train(files, trainer)
    
    # Save
    tokenizer.save("my-tokenizer.json")
    ```
    
    **Training time**: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
    
    ### Batch encoding with padding
    
    ```python
    # Enable padding
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
    
    # Encode batch
    texts = ["Hello world", "This is a longer sentence"]
    encodings = tokenizer.encode_batch(texts)
    
    for encoding in encodings:
        print(encoding.ids)
    # [101, 7592, 2088, 102, 3, 3, 3]
    # [101, 2023, 2003, 1037, 2936, 6251, 102]
    ```
    
    ## Tokenization algorithms
    
    ### BPE (Byte-Pair Encoding)
    
    **How it works**:
    1. Start with character-level vocabulary
    2. Find most frequent character pair
    3. Merge into new token, add to vocabulary
    4. Repeat until vocabulary size reached
    
    **Used by**: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import ByteLevel
    
    tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
    tokenizer.pre_tokenizer = ByteLevel()
    
    trainer = BpeTrainer(
        vocab_size=50257,
        special_tokens=["<|endoftext|>"],
        min_frequency=2
    )
    
    tokenizer.train(files=["data.txt"], trainer=trainer)
    ```
    
    **Advantages**:
    - Handles OOV words well (breaks into subwords)
    - Flexible vocabulary size
    - Good for morphologically rich languages
    
    **Trade-offs**:
    - Tokenization depends on merge order
    - May split common words unexpectedly
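
    To make the merge loop above concrete, here is a toy, pure-Python illustration of the BPE training steps (word frequencies and the number of merges are made up; this is not the `tokenizers` implementation):

    ```python
    from collections import Counter

    # Toy corpus: word -> frequency, with words pre-split into characters
    words = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("n","e","w"): 6, ("w","i","d","e","r"): 3}

    def most_frequent_pair(words):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0]

    def apply_merge(words, pair):
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged[tuple(out)] = freq
        return merged

    for _ in range(3):                      # 3 merges, for illustration
        pair = most_frequent_pair(words)
        print("merge:", pair)
        words = apply_merge(words, pair)
    print(words)
    ```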
    
    ### WordPiece
    
    **How it works**:
    1. Start with character vocabulary
    2. Score merge pairs: `frequency(pair) / (frequency(first) × frequency(second))`
    3. Merge highest scoring pair
    4. Repeat until vocabulary size reached
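
    For intuition, a hypothetical score computation with made-up frequencies:

    ```python
    # Made-up counts for a candidate merge, e.g. ("un", "##able")
    freq_pair, freq_first, freq_second = 20, 100, 50
    print(freq_pair / (freq_first * freq_second))  # 0.004: rare components boost the score
    ```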
    
    **Used by**: BERT, DistilBERT, MobileBERT
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.trainers import WordPieceTrainer
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.normalizers import BertNormalizer
    
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = BertNormalizer(lowercase=True)
    tokenizer.pre_tokenizer = Whitespace()
    
    trainer = WordPieceTrainer(
        vocab_size=30522,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        continuing_subword_prefix="##"
    )
    
    tokenizer.train(files=["corpus.txt"], trainer=trainer)
    ```
    
    **Advantages**:
    - Prioritizes meaningful merges (high score = semantically related)
    - Used successfully in BERT (state-of-the-art results)
    
    **Trade-offs**:
    - Unknown words become `[UNK]` if no subword match
    - Saves vocabulary, not merge rules (larger files)
    
    ### Unigram
    
    **How it works**:
    1. Start with large vocabulary (all substrings)
    2. Compute loss for corpus with current vocabulary
    3. Remove tokens with minimal impact on loss
    4. Repeat until vocabulary size reached
    
    **Used by**: ALBERT, T5, mBART, XLNet (via SentencePiece)
    
    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import Unigram
    from tokenizers.trainers import UnigramTrainer
    
    tokenizer = Tokenizer(Unigram())
    
    trainer = UnigramTrainer(
        vocab_size=8000,
        special_tokens=["<unk>", "<s>", "</s>"],
        unk_token="<unk>"
    )
    
    tokenizer.train(files=["data.txt"], trainer=trainer)
    ```
    
    **Advantages**:
    - Probabilistic (finds most likely tokenization)
    - Works well for languages without word boundaries
    - Handles diverse linguistic contexts
    
    **Trade-offs**:
    - Computationally expensive to train
    - More hyperparameters to tune
    
    ## Tokenization pipeline
    
    Complete pipeline: **Normalization → Pre-tokenization → Model → Post-processing**
    
    ### Normalization
    
    Clean and standardize text:
    
    ```python
    from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
    
    tokenizer.normalizer = Sequence([
        NFD(),           # Unicode normalization (decompose)
        Lowercase(),     # Convert to lowercase
        StripAccents()   # Remove accents
    ])
    
    # Input: "Héllo WORLD"
    # After normalization: "hello world"
    ```
    
    **Common normalizers**:
    - `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
    - `Lowercase()` - Convert to lowercase
    - `StripAccents()` - Remove accents (é → e)
    - `Strip()` - Remove whitespace
    - `Replace(pattern, content)` - Regex replacement
    
    ### Pre-tokenization
    
    Split text into word-like units:
    
    ```python
    from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
    
    # Split on whitespace and punctuation
    tokenizer.pre_tokenizer = Sequence([
        Whitespace(),
        Punctuation()
    ])
    
    # Input: "Hello, world!"
    # After pre-tokenization: ["Hello", ",", "world", "!"]
    ```
    
    **Common pre-tokenizers**:
    - `Whitespace()` - Split on spaces, tabs, newlines
    - `ByteLevel()` - GPT-2 style byte-level splitting
    - `Punctuation()` - Isolate punctuation
    - `Digits(individual_digits=True)` - Split digits individually
    - `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
    
    ### Post-processing
    
    Add special tokens for model input:
    
    ```python
    from tokenizers.processors import TemplateProcessing
    
    # BERT-style: [CLS] sentence [SEP]
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[
            ("[CLS]", 1),
            ("[SEP]", 2),
        ],
    )
    ```
    
    **Common patterns**:
    ```python
    # GPT-2: sentence <|endoftext|>
    TemplateProcessing(
        single="$A <|endoftext|>",
        special_tokens=[("<|endoftext|>", 50256)]
    )
    
    # RoBERTa: <s> sentence </s>
    TemplateProcessing(
        single="<s> $A </s>",
        pair="<s> $A </s> </s> $B </s>",
        special_tokens=[("<s>", 0), ("</s>", 2)]
    )
    ```
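
    Putting the four stages together, a minimal end-to-end sketch (the component choices and file name are illustrative):

    ```python
    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.processors import TemplateProcessing
    from tokenizers.trainers import WordPieceTrainer

    # 1. Normalization + 2. Pre-tokenization + 3. Model
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
    tokenizer.pre_tokenizer = Whitespace()

    trainer = WordPieceTrainer(
        vocab_size=30000,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
    )
    tokenizer.train(files=["corpus.txt"], trainer=trainer)

    # 4. Post-processing, attached after training so special-token ids are known
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )

    print(tokenizer.encode("Héllo, world!").tokens)
    ```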
    
    ## Alignment tracking
    
    Track token positions in original text:
    
    ```python
    text = "Hello, world!"
    output = tokenizer.encode(text)
    
    # Get token offsets
    for token, offset in zip(output.tokens, output.offsets):
        start, end = offset
        print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
    
    # Output:
    # hello      → [ 0,  5): 'Hello'
    # ,          → [ 5,  6): ','
    # world      → [ 7, 12): 'world'
    # !          → [12, 13): '!'
    ```
    
    **Use cases**:
    - Named entity recognition (map predictions back to text)
    - Question answering (extract answer spans)
    - Token classification (align labels to original positions)
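
    For example, a sketch of the question-answering case: mapping a predicted answer span back to the original text via offsets (the predicted token indices below are made up):

    ```python
    text = "The Eiffel Tower is located in Paris, France."
    output = tokenizer.encode(text)

    # Suppose a QA model predicts the answer spans tokens 7..8 (hypothetical indices)
    start_tok, end_tok = 7, 8
    char_start = output.offsets[start_tok][0]
    char_end = output.offsets[end_tok][1]
    print(text[char_start:char_end])  # the answer as it appears in the original text
    ```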
    
    ## Integration with transformers
    
    ### Load with AutoTokenizer
    
    ```python
    from transformers import AutoTokenizer
    
    # AutoTokenizer automatically uses fast tokenizers
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    # Check if using fast tokenizer
    print(tokenizer.is_fast)  # True
    
    # Access underlying tokenizers.Tokenizer
    fast_tokenizer = tokenizer.backend_tokenizer
    print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
    ```
    
    ### Convert custom tokenizer to transformers
    
    ```python
    from tokenizers import Tokenizer
    from transformers import PreTrainedTokenizerFast
    
    # Train custom tokenizer
    tokenizer = Tokenizer(BPE())
    # ... train tokenizer ...
    tokenizer.save("my-tokenizer.json")
    
    # Wrap for transformers
    transformers_tokenizer = PreTrainedTokenizerFast(
        tokenizer_file="my-tokenizer.json",
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]"
    )
    
    # Use like any transformers tokenizer
    outputs = transformers_tokenizer(
        "Hello world",
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    ```
    
    ## Common patterns
    
    ### Train from iterator (large datasets)
    
    ```python
    from datasets import load_dataset
    
    # Load dataset
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    
    # Create batch iterator
    def batch_iterator(batch_size=1000):
        for i in range(0, len(dataset), batch_size):
            yield dataset[i:i + batch_size]["text"]
    
    # Train tokenizer
    tokenizer.train_from_iterator(
        batch_iterator(),
        trainer=trainer,
        length=len(dataset)  # For progress bar
    )
    ```
    
    **Performance**: Processes 1GB in ~10-20 minutes
    
    ### Enable truncation and padding
    
    ```python
    # Enable truncation
    tokenizer.enable_truncation(max_length=512)
    
    # Enable padding
    tokenizer.enable_padding(
        pad_id=tokenizer.token_to_id("[PAD]"),
        pad_token="[PAD]",
        length=512  # Fixed length, or None for batch max
    )
    
    # Encode with both
    output = tokenizer.encode("This is a long sentence that will be truncated...")
    print(len(output.ids))  # 512
    ```
    
    ### Multi-processing
    
    ```python
    from tokenizers import Tokenizer
    from multiprocessing import Pool
    
    # Load tokenizer
    tokenizer = Tokenizer.from_file("tokenizer.json")
    
    def encode_batch(texts):
        return tokenizer.encode_batch(texts)
    
    # Process large corpus in parallel
    with Pool(8) as pool:
        # Split corpus into chunks
        chunk_size = 1000
        chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]
    
        # Encode in parallel
        results = pool.map(encode_batch, chunks)
    ```
    
    **Speedup**: 5-8× with 8 cores
    
    ## Performance benchmarks
    
    ### Training speed
    
    | Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
    |-------------|-----------------|-----------------|--------------|
    | 10 MB       | 15 sec          | 18 sec          | 25 sec       |
    | 100 MB      | 1.5 min         | 2 min           | 4 min        |
    | 1 GB        | 15 min          | 20 min          | 40 min       |
    
    **Hardware**: 16-core CPU, tested on English Wikipedia
    
    ### Tokenization speed
    
    | Implementation | 1 GB corpus | Throughput    |
    |----------------|-------------|---------------|
    | Pure Python    | ~20 minutes | ~50 MB/min    |
    | HF Tokenizers  | ~15 seconds | ~4 GB/min     |
    | **Speedup**    | **80×**     | **80×**       |
    
    **Test**: English text, average sentence length 20 words
    
    ### Memory usage
    
    | Task                    | Memory  |
    |-------------------------|---------|
    | Load tokenizer          | ~10 MB  |
    | Train BPE (30k vocab)   | ~200 MB |
    | Encode 1M sentences     | ~500 MB |
    
    ## Supported models
    
    Pre-trained tokenizers available via `from_pretrained()`:
    
    **BERT family**:
    - `bert-base-uncased`, `bert-large-cased`
    - `distilbert-base-uncased`
    - `roberta-base`, `roberta-large`
    
    **GPT family**:
    - `gpt2`, `gpt2-medium`, `gpt2-large`
    - `distilgpt2`
    
    **T5 family**:
    - `t5-small`, `t5-base`, `t5-large`
    - `google/flan-t5-xxl`
    
    **Other**:
    - `facebook/bart-base`, `facebook/mbart-large-cc25`
    - `albert-base-v2`, `albert-xlarge-v2`
    - `xlm-roberta-base`, `xlm-roberta-large`
    
    Browse all: https://huggingface.co/models?library=tokenizers
    
    ## References
    
    - **[Training Guide](references/training.md)** - Train custom tokenizers, configure trainers, handle large datasets
    - **[Algorithms Deep Dive](references/algorithms.md)** - BPE, WordPiece, Unigram explained in detail
    - **[Pipeline Components](references/pipeline.md)** - Normalizers, pre-tokenizers, post-processors, decoders
    - **[Transformers Integration](references/integration.md)** - AutoTokenizer, PreTrainedTokenizerFast, special tokens
    
    ## Resources
    
    - **Docs**: https://huggingface.co/docs/tokenizers
    - **GitHub**: https://github.com/huggingface/tokenizers ⭐ 9,000+
    - **Version**: 0.20.0+
    - **Course**: https://huggingface.co/learn/nlp-course/chapter6/1
    - **Paper**: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)
    
    
    
  • 01-model-architecture/mamba/SKILL.md
    ---
    name: mamba-architecture
    description: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Model Architecture, Mamba, State Space Models, SSM, Linear Complexity, Long Context, Efficient Inference, Hardware-Aware, Alternative To Transformers]
    dependencies: [mamba-ssm, torch, transformers, causal-conv1d]
    ---
    
    # Mamba - Selective State Space Models
    
    ## Quick start
    
    Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.
    
    **Installation**:
    ```bash
    # Install causal-conv1d (optional, for efficiency)
    pip install 'causal-conv1d>=1.4.0'
    
    # Install Mamba
    pip install mamba-ssm
    # Or both together
    pip install 'mamba-ssm[causal-conv1d]'
    ```
    
    **Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+
    
    **Basic usage** (Mamba block):
    ```python
    import torch
    from mamba_ssm import Mamba
    
    batch, length, dim = 2, 64, 16
    x = torch.randn(batch, length, dim).to("cuda")
    
    model = Mamba(
        d_model=dim,      # Model dimension
        d_state=16,       # SSM state dimension
        d_conv=4,         # Conv1d kernel size
        expand=2          # Expansion factor
    ).to("cuda")
    
    y = model(x)  # O(n) complexity!
    assert y.shape == x.shape
    ```
    
    ## Common workflows
    
    ### Workflow 1: Language model with Mamba-2
    
    **Complete LM with generation**:
    ```python
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
    from mamba_ssm.models.config_mamba import MambaConfig
    import torch
    
    # Configure Mamba-2 LM
    config = MambaConfig(
        d_model=1024,           # Hidden dimension
        n_layer=24,             # Number of layers
        vocab_size=50277,       # Vocabulary size
        ssm_cfg=dict(
            layer="Mamba2",     # Use Mamba-2
            d_state=128,        # Larger state for Mamba-2
            headdim=64,         # Head dimension
            ngroups=1           # Number of groups
        )
    )
    
    model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
    
    # Generate text
    input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
    output = model.generate(
        input_ids=input_ids,
        max_length=100,
        temperature=0.7,
        top_p=0.9
    )
    ```
    
    ### Workflow 2: Use pretrained Mamba models
    
    **Load from HuggingFace**:
    ```python
    from transformers import AutoTokenizer
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
    
    # Load pretrained model
    model_name = "state-spaces/mamba-2.8b"
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
    model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)
    
    # Generate
    prompt = "The future of AI is"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=200,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2
    )
    generated_text = tokenizer.decode(output_ids[0])
    print(generated_text)
    ```
    
    **Available models**:
    - `state-spaces/mamba-130m`
    - `state-spaces/mamba-370m`
    - `state-spaces/mamba-790m`
    - `state-spaces/mamba-1.4b`
    - `state-spaces/mamba-2.8b`
    
    ### Workflow 3: Mamba-1 vs Mamba-2
    
    **Mamba-1** (smaller state):
    ```python
    from mamba_ssm import Mamba
    
    model = Mamba(
        d_model=256,
        d_state=16,      # Smaller state dimension
        d_conv=4,
        expand=2
    ).to("cuda")
    ```
    
    **Mamba-2** (multi-head, larger state):
    ```python
    from mamba_ssm import Mamba2
    
    model = Mamba2(
        d_model=256,
        d_state=128,     # Larger state dimension
        d_conv=4,
        expand=2,
        headdim=64,      # Head dimension for multi-head
        ngroups=1        # Parallel groups
    ).to("cuda")
    ```
    
    **Key differences**:
    - **State size**: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
    - **Architecture**: Mamba-2 has multi-head structure
    - **Normalization**: Mamba-2 uses RMSNorm
    - **Distributed**: Mamba-2 supports tensor parallelism
    
    ### Workflow 4: Benchmark vs Transformers
    
    **Generation speed comparison**:
    ```bash
    # Benchmark Mamba
    python benchmarks/benchmark_generation_mamba_simple.py \
      --model-name "state-spaces/mamba-2.8b" \
      --prompt "The future of machine learning is" \
      --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
    
    # Benchmark Transformer
    python benchmarks/benchmark_generation_mamba_simple.py \
      --model-name "EleutherAI/pythia-2.8b" \
      --prompt "The future of machine learning is" \
      --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
    ```
    
    **Expected results**:
    - **Mamba**: 5× faster inference
    - **Memory**: No KV cache needed
    - **Scaling**: Linear with sequence length
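
    The repo's benchmark scripts above are the authoritative comparison. As a rough local check of the linear-scaling claim, a minimal timing sketch for a single Mamba block (sequence lengths and dimensions are illustrative):
    
    ```python
    import time
    import torch
    from mamba_ssm import Mamba
    
    # Single Mamba block, same constructor arguments as the Quick start example
    model = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
    
    for length in [1024, 4096, 16384]:
        x = torch.randn(2, length, 256, device="cuda")
        torch.cuda.synchronize()
        start = time.time()
        with torch.no_grad():
            model(x)
        torch.cuda.synchronize()
        # Wall time should grow roughly linearly with sequence length
        print(f"seq_len={length}: {time.time() - start:.3f} s")
    ```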
    
    ## When to use vs alternatives
    
    **Use Mamba when**:
    - Need long sequences (100K+ tokens)
    - Want faster inference than Transformers
    - Memory-constrained (no KV cache)
    - Building streaming applications
    - Linear scaling important
    
    **Advantages**:
    - **O(n) complexity**: Linear vs quadratic
    - **5× faster inference**: No attention overhead
    - **No KV cache**: Lower memory usage
    - **Million-token sequences**: Hardware-efficient
    - **Streaming**: Constant memory per token
    
    **Use alternatives instead**:
    - **Transformers**: Need best-in-class performance, have compute
    - **RWKV**: Want RNN+Transformer hybrid
    - **RetNet**: Need retention-based architecture
    - **Hyena**: Want convolution-based approach
    
    ## Common issues
    
    **Issue: CUDA out of memory**
    
    Reduce batch size or use gradient checkpointing:
    ```python
    model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
    model.gradient_checkpointing_enable()  # Enable checkpointing
    ```
    
    **Issue: Slow installation**
    
    If the build is slow or fails because pip's isolated build environment cannot see your installed PyTorch, build against the existing environment:
    ```bash
    pip install mamba-ssm --no-build-isolation
    ```
    
    **Issue: Missing causal-conv1d**
    
    Install separately:
    ```bash
    pip install 'causal-conv1d>=1.4.0'
    ```
    
    **Issue: Model not loading from HuggingFace**
    
    Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
    ```python
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
    model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
    ```
    
    ## Advanced topics
    
    **Selective SSM**: See [references/selective-ssm.md](references/selective-ssm.md) for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
    
    **Mamba-2 architecture**: See [references/mamba2-details.md](references/mamba2-details.md) for multi-head structure, tensor parallelism, and distributed training setup.
    
    **Performance optimization**: See [references/performance.md](references/performance.md) for hardware-aware design, CUDA kernels, and memory efficiency techniques.
    
    ## Hardware requirements
    
    - **GPU**: NVIDIA with CUDA 11.6+
    - **VRAM**:
      - 130M model: 2GB
      - 370M model: 4GB
      - 790M model: 8GB
      - 1.4B model: 14GB
      - 2.8B model: 28GB (FP16)
    - **Inference**: 5× faster than Transformers
    - **Memory**: No KV cache (lower than Transformers)
    
    **Performance** (vs Transformers):
    - **Speed**: 5× faster inference
    - **Memory**: 50% less (no KV cache)
    - **Scaling**: Linear vs quadratic
    
    ## Resources
    
    - Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
    - Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
    - GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
    - Models: https://huggingface.co/state-spaces
    - Docs: Repository README and wiki
    
    
    
  • 0-autoresearch-skill/SKILL.mdskill
    Show content (24945 bytes)
    ---
    name: autoresearch
    description: Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.
    version: 1.0.0
    author: Orchestra Research
    license: MIT
    tags: [Autonomous Research, Two-Loop Architecture, Experiment Orchestration, Research Synthesis, Project Management]
    ---
    
    # Autoresearch
    
    Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution.
    
    You are a research project manager, not a domain expert. You orchestrate; the domain skills execute.
    
    **This runs fully autonomously.** Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own.
    
    ## Getting Started
    
    Users arrive in different states. Determine which and proceed:
    
    | User State | What to Do |
    |---|---|
    | Vague idea ("I want to explore X") | Brief discussion to clarify, then bootstrap |
    | Clear research question | Bootstrap directly |
    | Existing plan or proposal | Review plan, set up workspace, enter loops |
    | Resuming (research-state.yaml exists) | Read state, continue from where you left off |
    
    If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching.
    
    **Step 0 — before anything else**: Set up the agent continuity loop. See [Agent Continuity](#agent-continuity-mandatory--set-up-first). This is MANDATORY. Without it, the research stops after one cycle.
    
    ### Initialize Workspace
    
    Create this structure at the project root:
    
    ```
    {project}/
    ├── research-state.yaml       # Central state tracking
    ├── research-log.md           # Decision timeline
    ├── findings.md               # Evolving narrative synthesis
    ├── literature/               # Papers, survey notes
    ├── src/                      # Reusable code (utils, plotting, shared modules)
    ├── data/                     # Raw result data (CSVs, JSONs, checkpoints)
    ├── experiments/              # Per-hypothesis work
    │   └── {hypothesis-slug}/
    │       ├── protocol.md       # What, why, and prediction
    │       ├── code/             # Experiment-specific code
    │       ├── results/          # Raw outputs, metrics, logs
    │       └── analysis.md       # What we learned
    ├── to_human/                 # Progress presentations and reports for human review
    └── paper/                    # Final paper (via ml-paper-writing)
    ```
    
    - **`src/`**: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory.
    - **`data/`**: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g., `trajectory_H1_runs001-010.csv`). Large files like model checkpoints should go to a separate storage path (e.g., `/data/`, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory.
    
    Initialize `research-state.yaml`, `research-log.md`, and `findings.md` from [templates/](templates/). Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement.
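
    A minimal initialization sketch in Python (the field names below are illustrative, not the canonical schema; use the files in [templates/](templates/) as the source of truth):
    
    ```python
    from pathlib import Path
    import yaml
    
    root = Path(".")  # project root
    for d in ["literature", "src", "data", "experiments", "to_human", "paper"]:
        (root / d).mkdir(parents=True, exist_ok=True)
    
    # Illustrative fields only — prefer the schema in templates/research-state.yaml
    state = {
        "question": "Does X improve Y?",
        "status": "bootstrap",     # bootstrap | inner_loop | outer_loop | concluding
        "hypotheses": [],          # filled in during bootstrap
        "experiments_completed": 0,
        "current_direction": None,
    }
    (root / "research-state.yaml").write_text(yaml.safe_dump(state, sort_keys=False))
    (root / "research-log.md").write_text("# Research Log\n")
    (root / "findings.md").write_text("# Findings\n")
    ```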
    
    ## The Two-Loop Architecture
    
    This is the core engine. Everything else supports it.
    
    ```
    BOOTSTRAP (once, lightweight)
      Scope question → search literature → form initial hypotheses
    
    INNER LOOP (fast, autonomous, repeating)
      Pick hypothesis → experiment → measure → record → learn → next
      Goal: run constrained experiments with clear measurable outcomes
    
    OUTER LOOP (periodic, reflective)
      Review results → find patterns → update findings.md →
      new hypotheses → decide direction
      Goal: synthesize understanding, find the story — this is where novelty comes from
    
    FINALIZE (when concluding)
      Write paper via ml-paper-writing → final presentation → archive
    ```
    
    The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results *mean*? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover.
    
    There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm.
    
    ### Research is Non-Linear
    
    The two-loop structure is a rhythm, not a railroad. At any point during research you can and should:
    
    - **Return to literature** when results surprise you, assumptions break, or you need context for a new direction — always save what you find to `literature/`
    - **Brainstorm new ideas** using `21-research-ideation/` skills when you're stuck or when results open unexpected questions
    - **Pivot the question entirely** if experiments reveal the original question was wrong or less interesting than what you found
    
    This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help.
    
    ## Bootstrap: Literature and Hypotheses
    
    Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey.
    
    1. **Search literature** for the research question. Use multiple sources — never stop at one:
       - **Exa MCP** (`web_search_exa`) if available — best for broad discovery and finding relevant papers quickly
       - **Semantic Scholar** (`pip install semanticscholar`) — best for ML/AI papers, citation graphs, and specific paper lookup. See `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code examples
       - **arXiv** (`pip install arxiv`) — best for recent preprints and open-access papers
       - **CrossRef** — best for DOI lookup and BibTeX retrieval
       - Keep searching until you have good coverage. If one source comes up empty, try another with different keywords
    
       **Save everything to `literature/`**: For every paper you find, save a summary to `literature/` — title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a running `literature/survey.md` with all summaries. This is your reference library — you and future sessions will need it throughout the project.
    
    2. **Identify gaps** from the literature
       - What's been tried? What hasn't? Where do existing methods break?
       - What do Discussion sections flag as future work?
    
    3. **Form initial hypotheses** — invoke `21-research-ideation/` skills
       - `brainstorming-research-ideas` for structured diverge-converge workflow
       - `creative-thinking-for-research` for deeper cognitive frameworks
       - Each hypothesis must be testable with a clear prediction
    
    4. **Define the evaluation**
       - Set the proxy metric and baseline before running experiments
       - The metric should be computable quickly (minutes, not hours)
       - Lock evaluation criteria upfront to prevent unconscious metric gaming
    
    5. **Record** in research-state.yaml, log the bootstrap in research-log.md
    
    ## The Inner Loop
    
    Rapid iteration with clear measurable outcomes. Two flavors:
    
    - **Optimization**: make a metric go up/down (val_loss, accuracy, throughput). Think Karpathy's autoresearch.
    - **Discovery**: test mechanistic hypotheses about why something works. The metric is a measurement (does grokking happen faster? does entropy increase before forgetting?), not just a target to optimize.
    
    ```
    1.  Pick the highest-priority untested hypothesis
    2.  Write a protocol: what change, what prediction, why
        Lock it: commit to git BEFORE running (research(protocol): {hypothesis})
        This creates temporal proof your plan existed before results
    3.  Run the experiment (invoke the relevant domain skill)
    4.  Sanity check before trusting results:
        - Did training converge? No NaN/Inf?
        - Does baseline reproduce expected performance?
        - Data loading correct? (spot-check a few samples)
    5.  Measure the proxy metric
    6.  Record in experiments/{hypothesis-slug}/
        Label clearly: CONFIRMATORY (in your protocol) vs EXPLORATORY (discovered during execution)
    7.  If positive: keep, note WHY it worked
    8.  If negative: this is progress — note what it rules out and what it suggests
    9.  Update research-state.yaml
    10. If stuck: search literature or invoke ideation skills — don't just keep trying random things
    ```
    
    **Never stop.** Even if something fails, find a path forward. Debug, adjust, simplify, or pivot — but keep the research moving. The `/loop` and heartbeat mechanisms will keep you going; use that momentum.
    
    ### Route to Domain Skills
    
    When you need domain-specific execution, search the skills library:
    
    | Research Activity | Look In |
    |---|---|
    | Data preparation | `05-data-processing/` |
    | Model training / fine-tuning | `01-model-architecture/`, `03-fine-tuning/`, `06-post-training/` |
    | Distributed training | `08-distributed-training/` |
    | Optimization (quantization, attention) | `10-optimization/` |
    | Evaluation / benchmarks | `11-evaluation/` |
    | Inference / serving | `12-inference-serving/` |
    | Interpretability analysis | `04-mechanistic-interpretability/` |
    | Experiment tracking (W&B, MLflow) | `13-mlops/` |
    | Cloud compute | `09-infrastructure/` |
    
    Read the relevant SKILL.md before starting — it has workflows, common issues, and code examples. See [references/skill-routing.md](references/skill-routing.md) for a complete guide.
    
    ### Track the Experiment Trajectory
    
    Maintain a running record of measurable outcomes across experiments:
    
    ```json
    {
      "experiment_id": "run_014",
      "hypothesis": "H3",
      "metric_value": 0.847,
      "baseline": 0.812,
      "delta": "+0.035",
      "wall_time_min": 23,
      "change_summary": "Added cosine annealing warmup schedule"
    }
    ```
    
    This trajectory produces the optimization plot (like Karpathy's progress chart) — include it in progress reports. Humans love seeing the upward curve.
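
    A minimal plotting sketch, assuming the trajectory records are appended one JSON object per line to a hypothetical `data/trajectory.jsonl`:
    
    ```python
    import json
    import matplotlib.pyplot as plt
    
    # One JSON record per experiment, in the format shown above
    with open("data/trajectory.jsonl") as f:
        records = [json.loads(line) for line in f]
    
    xs = list(range(1, len(records) + 1))
    ys = [r["metric_value"] for r in records]
    
    plt.plot(xs, ys, marker="o", label="proxy metric")
    plt.axhline(records[0]["baseline"], linestyle="--", label="baseline")
    plt.xlabel("Experiment")
    plt.ylabel("Metric")
    plt.legend()
    plt.savefig("to_human/trajectory.png", dpi=150)
    ```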
    
    ## The Outer Loop
    
    Step back from individual experiments. Synthesize.
    
    ```
    1. Review all results since last reflection
    2. Cluster by type: what kinds of changes worked? Which didn't?
    3. Ask WHY — identify the mechanism behind successes and failures
    4. Update findings.md with current understanding
    5. Search literature if results were surprising or assumptions need revisiting
    6. Generate new hypotheses if warranted (invoke 21-research-ideation/ skills)
    7. Decide direction (see criteria below)
    8. Update research-state.yaml with new direction
    9. Log the reflection in research-log.md
    10. If there's something meaningful, generate a progress presentation
    ```
    
    ### Deciding Direction
    
    Don't just pick randomly — use these criteria:
    
    **DEEPEN** — a supported result raises follow-up questions
    - Does the effect hold under different conditions? What's the mechanism?
    - Action: generate sub-hypotheses (H1.1, H1.2) → back to inner loop
    
    **BROADEN** — current results are solid, but adjacent questions are untested
    - New questions emerged. The current contribution is clear but more is possible.
    - Action: generate new root hypotheses → back to inner loop
    
    **PIVOT** — results invalidate key assumptions or something more interesting appeared
    - A core assumption was wrong, or an unexpected finding is more promising than the original question.
    - Action: return to literature with new questions → re-bootstrap
    
    **CONCLUDE** — sufficient evidence for a contribution
    - At least one hypothesis is strongly supported (or a coherent set of negative results)
    - Key ablations completed, error analysis done
    - findings.md reads like a paper backbone — a human could write the abstract from it
    - No critical open questions that would change the story
    
    Note: coherent negative results are a valid contribution. "X does NOT work because Y" is publishable if the reasoning is rigorous.
    
    ### findings.md Is Your Project Memory
    
    This file serves two purposes: it's the research narrative for humans AND your accumulated knowledge base as an agent. Read it at the start of every session, /loop tick, or heartbeat to remember what you've learned.
    
    After every outer loop, update it to answer:
    
    - What do we know so far? (Current Understanding)
    - What patterns explain our results? (Patterns and Insights)
    - What specific things did we learn not to repeat? (Lessons and Constraints)
    - What remains open? (Open Questions)
    
    The "Lessons and Constraints" section is especially important — it captures specific actionable learnings like "weight decay > 0.1 diverges at this scale" or "baseline only reproduces with batch_size=64." This prevents the agent from repeating failed approaches across sessions.
    
    **Quality test**: After 30 inner loop experiments, a human should be able to read findings.md and write a paper abstract from it. If they can't, the outer loop isn't synthesizing — it's just logging.
    
    ## Agent Continuity (MANDATORY — Set Up First)
    
    **Before doing anything else**, set up the wall-clock loop. This is what keeps the research running continuously. Without it, the agent stops after one cycle.
    
    ### Claude Code
    
    **First action — run this immediately:**
    
    ```
    /loop 20m Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a report in to_human/ and opening the HTML/PDF. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper.
    ```
    
    This fires every 20 minutes regardless of what's happening. It's a rhythm that keeps you working — not a research phase boundary. If your previous work isn't done, just continue it.
    
    ### OpenClaw
    
    **First action — set up a 20-minute cron job in the current session. You MUST do this before any research work.**
    
    Use the `cron.add` tool to create a recurring job bound to this chat session:
    
    ```json
    {
      "name": "autoresearch-loop",
      "schedule": { "kind": "every", "everyMs": 1200000 },
      "sessionTarget": "current",
      "payload": {
        "kind": "agentTurn",
        "message": "Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a PDF report in to_human/ and sending it to the user via Telegram, WhatsApp, or Slack. When you get an exciting result or interesting plot, text it to the user right away — don't wait for a full report. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper."
      }
    }
    ```
    
    Key details:
    - `sessionTarget: "current"` binds the cron to this chat session so it maintains conversation context across ticks
    - `everyMs: 1200000` = 20 minutes
    - After creation, verify with `cron.list` that the job exists and is enabled
    - If the cron fires while you're mid-experiment, just continue — the tick is a nudge, not a restart
    
    ### What the Loop Does
    
    The `/loop` and cron job are purely **wall-clock rhythm**. They are completely separate from your research loops (inner/outer). On each tick:
    
    1. Read `research-state.yaml` and `findings.md` — remember where you are
    2. Check if anything is broken (failed experiments, stalled training, errors)
    3. If on track → keep working on whatever you were doing
    4. If stuck or something's wrong → step back, diagnose, fix, then continue
    5. Never idle. Always be making progress.
    
    ## Progress Reporting
    
    When you have something meaningful to share, create a research presentation — not just a status dashboard, but a compelling story.
    
    **When to report** (your judgment):
    - After an outer loop that found a significant pattern
    - When the optimization trajectory shows clear progress (include the plot!)
    - After a pivot in direction
    - Before requesting human input on a decision
    - When concluding
    
    **What to include** (adapt to what's compelling):
    - The research question and why it matters
    - Key results with visualizations (plots, metric tables)
    - The optimization trajectory chart (metric over experiments)
    - What was tried and why (selective, not exhaustive)
    - Current understanding (the findings narrative)
    - What's planned next
    
    For Claude Code: generate HTML and `open` it. If HTML fails to open or render, convert to PDF as fallback (use `weasyprint`, `playwright pdf`, or `wkhtmltopdf`). For OpenClaw: generate PDF directly.
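
    A minimal fallback sketch using weasyprint, one of the converters named above (file paths are illustrative):
    
    ```python
    from weasyprint import HTML
    
    # Convert an already-generated HTML report to PDF
    HTML("to_human/progress_report.html").write_pdf("to_human/progress_report.pdf")
    ```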
    
    See [references/progress-reporting.md](references/progress-reporting.md) for template scaffolding and the optimization plot approach. Use the template as a starting point — be creative with what you show.
    
    ## Git Protocol
    
    Commit at natural research milestones:
    
    | When | Message Pattern |
    |---|---|
    | Workspace initialized | `research(init): {project} — {question}` |
    | Experiment protocol locked | `research(protocol): {hypothesis}` |
    | Significant results | `research(results): {hypothesis} — {outcome}` |
    | Outer loop direction change | `research(reflect): {direction} — {reason}` |
    | Paper draft complete | `research(paper): {title}` |
    
    **Hard rule**: Protocol commits MUST precede result commits. Never combine them. The git history is your lightweight pre-registration — it proves what you planned before you saw results. Don't commit after every experiment — commit when there's meaningful progress.
    
    ## Concluding: Paper Writing
    
    When the outer loop decides to CONCLUDE:
    
    1. Ensure findings.md has a clear, well-supported narrative
    2. Study 2-3 top related papers to learn their format, style, and section structure
    3. Invoke the `20-ml-paper-writing` skill — it has LaTeX templates for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, and systems venues
    4. Feed it the accumulated literature, experimental results, and findings
    5. Follow its citation verification workflow — never hallucinate references
    6. Generate a final comprehensive research presentation
    
    Proceed autonomously through the writing process. If the ml-paper-writing skill suggests human collaboration points, adapt and keep going — produce the best draft you can. The human will review and provide feedback.
    
    ## Research Discipline
    
    Principles to enforce continuously — not tied to any specific phase:
    
    - **Lock before you run**: Commit your experiment protocol to git before executing. This proves your plan existed before you saw results. Never combine protocol + results in one commit.
    - **Confirmatory vs exploratory**: Results matching your locked protocol are confirmatory. Everything else is exploratory — interesting but requiring more skepticism.
    - **Negative results are progress**: A refuted hypothesis tells you something. Log what it rules out and what it suggests. Don't treat it as failure.
    - **Sanity check before analysis**: Verify training converged, baselines reproduce, and data is correct before trusting your primary metric.
    - **Return to literature when confused**: Don't guess — search. If results surprise you or assumptions break, go find papers. Use Exa MCP for discovery, Semantic Scholar for specific ML/AI paper lookup, arXiv for preprints.
    - **Never stop**: Don't wait for human approval on routine decisions. If a skill or tool suggests collaboration, adapt and keep going. Find the best path forward autonomously. The human will see your progress reports and can redirect if needed.
    - **Use whatever compute is available**: Adapt to the user's environment — local GPU, cluster job submission, cloud instances, or just CPU. If no GPU is available, use CPU and adjust experiment scale accordingly. Don't block on compute availability.
    
    ## Quality Standards
    
    **Good agent behavior:**
    - Hypotheses have mechanistic reasoning ("X because Y, predicting Z"), not just "try X"
    - findings.md builds a coherent narrative, not a flat list of results
    - Negative results are recorded with what they rule out
    - The agent updates its model when experiments contradict expectations
    - Progress reports tell a research story with compelling visualizations
    
    **Bad agent behavior:**
    - Pure hyperparameter sweeps without interpretation
    - findings.md is just experiment logs copy-pasted
    - Agent never revisits its assumptions after failures
    - Optimizing metrics without understanding why changes work
    
    ## When to Use vs Alternatives
    
    **Use autoresearch when:**
    - You have a research question explorable through experiments
    - There's a measurable proxy metric for inner loop optimization
    - The real contribution requires synthesis beyond the metric
    - You want continuous autonomous research operation
    
    **Use individual domain skills instead when:**
    - You have a specific one-off task (train a model, run eval, write a paper)
    - No iterative experimentation needed
    
    ## Common Issues
    
    **Inner loop stalls (no metric improvement)**
    Run an outer loop. Is the metric the right one? Is the search space exhausted? Consider broadening or pivoting. Search literature for new approaches.
    
    **Stuck and not making progress**
    Don't keep trying random changes. Step back: search literature for related work, invoke `21-research-ideation/` brainstorming skills, or run an outer loop reflection. Being stuck means you need new information or a new perspective, not more experiments.
    
    **Results contradict baseline expectations**
    Investigate, don't ignore. Return to literature — your protocol might have an error, the published baseline may be wrong, or conditions differ. Update findings.md with what you learn.
    
    **Agent loses context between ticks**
    Ensure research-state.yaml and findings.md are updated after every action. These files are your memory across sessions.
    
    **Can't find relevant papers**
    Try multiple approaches in order: Exa MCP for broad search, Semantic Scholar for specific ML/AI paper lookup (`pip install semanticscholar`), arXiv for preprints (`pip install arxiv`). Check `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code. Note: Google Scholar has no official API — use Semantic Scholar instead for programmatic search.
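
    A minimal sketch of a programmatic arXiv query with the `arxiv` package (query string is illustrative):
    
    ```python
    import arxiv
    
    client = arxiv.Client()
    search = arxiv.Search(query="selective state space models", max_results=5)
    for result in client.results(search):
        # Save title, URL, and summary into literature/ for future sessions
        print(result.published.date(), result.title)
        print(result.entry_id)
    ```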
    
    **No GPU available**
    Use CPU and scale experiments down. Many research tasks (analysis, interpretability, small model training) run fine on CPU. Adjust experiment design to fit available compute rather than blocking.
    
    **Experiments take longer than /loop interval**
    Normal. On the next tick, check if it finished. If not, keep waiting or do something else useful (update notes, search papers). Adjust interval if needed.
    
    **Not sure when to conclude**
    Three questions: Do you have a strongly supported finding? Can you explain WHY it works? Would findings.md make a convincing paper abstract? If yes to all: conclude.
    
    ## Advanced Topics
    
    - **Detailed agent continuity**: [references/agent-continuity.md](references/agent-continuity.md)
    - **Progress presentation templates**: [references/progress-reporting.md](references/progress-reporting.md)
    - **Complete skill routing**: [references/skill-routing.md](references/skill-routing.md)
    
  • .claude-plugin/marketplace.jsonmarketplace
    Show content (12081 bytes)
    {
      "name": "ai-research-skills",
      "owner": {
        "name": "Orchestra Research",
        "email": "zechen@orchestra-research.com"
      },
      "metadata": {
        "description": "Comprehensive library of 98 AI research engineering skills enabling autonomous AI research from hypothesis to experimental verification",
        "version": "1.2.0"
      },
      "plugins": [
        {
          "name": "model-architecture",
          "description": "LLM architectures and implementations including LitGPT, Mamba, NanoGPT, RWKV, and TorchTitan. Use when implementing, training, or understanding transformer and alternative architectures.",
          "source": "./",
          "strict": false,
          "skills": [
            "./01-model-architecture/litgpt",
            "./01-model-architecture/mamba",
            "./01-model-architecture/nanogpt",
            "./01-model-architecture/rwkv",
            "./01-model-architecture/torchtitan"
          ]
        },
        {
          "name": "tokenization",
          "description": "Text tokenization for LLMs including HuggingFace Tokenizers and SentencePiece. Use when training custom tokenizers or handling multilingual text.",
          "source": "./",
          "strict": false,
          "skills": [
            "./02-tokenization/huggingface-tokenizers",
            "./02-tokenization/sentencepiece"
          ]
        },
        {
          "name": "fine-tuning",
          "description": "LLM fine-tuning frameworks including Axolotl, LLaMA-Factory, PEFT, and Unsloth. Use when fine-tuning models with LoRA, QLoRA, or full fine-tuning.",
          "source": "./",
          "strict": false,
          "skills": [
            "./03-fine-tuning/axolotl",
            "./03-fine-tuning/llama-factory",
            "./03-fine-tuning/peft",
            "./03-fine-tuning/unsloth"
          ]
        },
        {
          "name": "mechanistic-interpretability",
          "description": "Neural network interpretability tools including TransformerLens, SAELens, NNSight, and pyvene. Use when analyzing model internals, finding circuits, or understanding how models compute.",
          "source": "./",
          "strict": false,
          "skills": [
            "./04-mechanistic-interpretability/nnsight",
            "./04-mechanistic-interpretability/pyvene",
            "./04-mechanistic-interpretability/saelens",
            "./04-mechanistic-interpretability/transformer-lens"
          ]
        },
        {
          "name": "data-processing",
          "description": "Data curation and processing at scale including NeMo Curator and Ray Data. Use when preparing training datasets or processing large-scale data.",
          "source": "./",
          "strict": false,
          "skills": [
            "./05-data-processing/nemo-curator",
            "./05-data-processing/ray-data"
          ]
        },
        {
          "name": "post-training",
          "description": "RLHF and preference alignment including TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, and torchforge. Use when aligning models with human preferences, training reward models, or large-scale RL training.",
          "source": "./",
          "strict": false,
          "skills": [
            "./06-post-training/grpo-rl-training",
            "./06-post-training/miles",
            "./06-post-training/openrlhf",
            "./06-post-training/simpo",
            "./06-post-training/slime",
            "./06-post-training/torchforge",
            "./06-post-training/trl-fine-tuning",
            "./06-post-training/verl"
          ]
        },
        {
          "name": "safety-alignment",
          "description": "AI safety and content moderation including Constitutional AI, LlamaGuard, NeMo Guardrails, and Prompt Guard. Use when implementing safety filters, content moderation, or prompt injection detection.",
          "source": "./",
          "strict": false,
          "skills": [
            "./07-safety-alignment/constitutional-ai",
            "./07-safety-alignment/llamaguard",
            "./07-safety-alignment/nemo-guardrails",
            "./07-safety-alignment/prompt-guard"
          ]
        },
        {
          "name": "distributed-training",
          "description": "Multi-GPU and multi-node training including DeepSpeed, PyTorch FSDP, Accelerate, Megatron-Core, PyTorch Lightning, and Ray Train. Use when training large models across GPUs.",
          "source": "./",
          "strict": false,
          "skills": [
            "./08-distributed-training/accelerate",
            "./08-distributed-training/deepspeed",
            "./08-distributed-training/megatron-core",
            "./08-distributed-training/pytorch-fsdp2",
            "./08-distributed-training/pytorch-lightning",
            "./08-distributed-training/ray-train"
          ]
        },
        {
          "name": "infrastructure",
          "description": "GPU cloud and compute orchestration including Modal, Lambda Labs, and SkyPilot. Use when deploying training jobs or managing GPU resources.",
          "source": "./",
          "strict": false,
          "skills": [
            "./09-infrastructure/lambda-labs",
            "./09-infrastructure/modal",
            "./09-infrastructure/skypilot"
          ]
        },
        {
          "name": "optimization",
          "description": "Model optimization and quantization including Flash Attention, bitsandbytes, GPTQ, AWQ, GGUF, and HQQ. Use when reducing memory, accelerating inference, or quantizing models.",
          "source": "./",
          "strict": false,
          "skills": [
            "./10-optimization/awq",
            "./10-optimization/bitsandbytes",
            "./10-optimization/flash-attention",
            "./10-optimization/gguf",
            "./10-optimization/gptq",
            "./10-optimization/hqq",
            "./10-optimization/ml-training-recipes"
          ]
        },
        {
          "name": "evaluation",
          "description": "LLM benchmarking and evaluation including lm-evaluation-harness, BigCode Evaluation Harness, and NeMo Evaluator. Use when benchmarking models or measuring performance.",
          "source": "./",
          "strict": false,
          "skills": [
            "./11-evaluation/bigcode-evaluation-harness",
            "./11-evaluation/lm-evaluation-harness",
            "./11-evaluation/nemo-evaluator"
          ]
        },
        {
          "name": "inference-serving",
          "description": "Production LLM inference including vLLM, TensorRT-LLM, llama.cpp, and SGLang. Use when deploying models for production inference.",
          "source": "./",
          "strict": false,
          "skills": [
            "./12-inference-serving/llama-cpp",
            "./12-inference-serving/sglang",
            "./12-inference-serving/tensorrt-llm",
            "./12-inference-serving/vllm"
          ]
        },
        {
          "name": "mlops",
          "description": "ML experiment tracking and lifecycle including Weights & Biases, MLflow, and TensorBoard. Use when tracking experiments or managing models.",
          "source": "./",
          "strict": false,
          "skills": [
            "./13-mlops/mlflow",
            "./13-mlops/tensorboard",
            "./13-mlops/weights-and-biases"
          ]
        },
        {
          "name": "agents",
          "description": "LLM agent frameworks including LangChain, LlamaIndex, CrewAI, and AutoGPT. Use when building chatbots, autonomous agents, or tool-using systems.",
          "source": "./",
          "strict": false,
          "skills": [
            "./14-agents/autogpt",
            "./14-agents/crewai",
            "./14-agents/langchain",
            "./14-agents/llamaindex"
          ]
        },
        {
          "name": "rag",
          "description": "Retrieval-Augmented Generation including Chroma, FAISS, Pinecone, Qdrant, and Sentence Transformers. Use when building semantic search or document retrieval systems.",
          "source": "./",
          "strict": false,
          "skills": [
            "./15-rag/chroma",
            "./15-rag/faiss",
            "./15-rag/pinecone",
            "./15-rag/qdrant",
            "./15-rag/sentence-transformers"
          ]
        },
        {
          "name": "prompt-engineering",
          "description": "Structured LLM outputs including DSPy, Instructor, Guidance, and Outlines. Use when extracting structured data or constraining LLM outputs.",
          "source": "./",
          "strict": false,
          "skills": [
            "./16-prompt-engineering/dspy",
            "./16-prompt-engineering/guidance",
            "./16-prompt-engineering/instructor",
            "./16-prompt-engineering/outlines"
          ]
        },
        {
          "name": "observability",
          "description": "LLM application monitoring including LangSmith and Phoenix. Use when debugging LLM apps or monitoring production systems.",
          "source": "./",
          "strict": false,
          "skills": [
            "./17-observability/langsmith",
            "./17-observability/phoenix"
          ]
        },
        {
          "name": "multimodal",
          "description": "Vision, audio, and multimodal models including CLIP, Whisper, LLaVA, BLIP-2, Segment Anything, Stable Diffusion, AudioCraft, Cosmos Policy, OpenPI, and OpenVLA-OFT. Use when working with images, audio, multimodal tasks, or vision-language-action robot policies.",
          "source": "./",
          "strict": false,
          "skills": [
            "./18-multimodal/audiocraft",
            "./18-multimodal/blip-2",
            "./18-multimodal/clip",
            "./18-multimodal/cosmos-policy",
            "./18-multimodal/llava",
            "./18-multimodal/openpi",
            "./18-multimodal/openvla-oft",
            "./18-multimodal/segment-anything",
            "./18-multimodal/stable-diffusion",
            "./18-multimodal/whisper"
          ]
        },
        {
          "name": "emerging-techniques",
          "description": "Advanced ML techniques including MoE Training, Model Merging, Long Context, Speculative Decoding, Knowledge Distillation, and Model Pruning. Use when implementing cutting-edge optimization or architecture techniques.",
          "source": "./",
          "strict": false,
          "skills": [
            "./19-emerging-techniques/knowledge-distillation",
            "./19-emerging-techniques/long-context",
            "./19-emerging-techniques/model-merging",
            "./19-emerging-techniques/model-pruning",
            "./19-emerging-techniques/moe-training",
            "./19-emerging-techniques/speculative-decoding"
          ]
        },
        {
          "name": "autoresearch",
          "description": "Autonomous research orchestration using a two-loop architecture. Manages the full research lifecycle from literature survey to paper writing, routing to domain-specific skills for execution. Use when starting a research project, running autonomous experiments, or managing multi-hypothesis research.",
          "source": "./",
          "strict": false,
          "skills": [
            "./0-autoresearch-skill"
          ]
        },
        {
          "name": "ml-paper-writing",
          "description": "Write publication-ready ML/AI/Systems papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, OSDI, NSDI, ASPLOS, SOSP. Includes LaTeX templates, citation verification, reviewer guidelines, publication-quality figure generation, systems paper structural blueprints, and conference presentation slides.",
          "source": "./",
          "strict": false,
          "skills": [
            "./20-ml-paper-writing/ml-paper-writing",
            "./20-ml-paper-writing/academic-plotting",
            "./20-ml-paper-writing/systems-paper-writing",
            "./20-ml-paper-writing/presenting-conference-talks"
          ]
        },
        {
          "name": "ideation",
          "description": "Research ideation frameworks including structured brainstorming and creative thinking. Use when exploring new research directions, generating novel ideas, or seeking fresh angles on existing work.",
          "source": "./",
          "strict": false,
          "skills": [
            "./21-research-ideation/brainstorming-research-ideas",
            "./21-research-ideation/creative-thinking-for-research"
          ]
        },
        {
          "name": "agent-native-research-artifact",
          "description": "Agent-Native Research Artifact (ARA) tooling: compile any research input (paper, repo, notes) into a structured artifact, record session provenance as a post-task epilogue, and run Seal Level 2 epistemic review. Use when ingesting research into a falsifiable, agent-traversable artifact, capturing how a research project actually evolved, or auditing an ARA for evidence-claim alignment.",
          "source": "./",
          "strict": false,
          "skills": [
            "./22-agent-native-research-artifact/compiler",
            "./22-agent-native-research-artifact/research-manager",
            "./22-agent-native-research-artifact/rigor-reviewer"
          ]
        }
      ]
    }
    

README

AI Research Skills Library

The most comprehensive open-source skills library enabling AI agents to autonomously conduct AI research — from idea to paper



98 Skills Powering AI Research in 2026

View All 23 Categories
Autoresearch (1) · Ideation (2) · ML Paper Writing (2) · Model Architecture (5) · Fine-Tuning (4) · Post-Training (8) · Distributed Training (6) · Optimization (6) · Inference (4) · Tokenization (2) · Data Processing (2) · Evaluation (3) · Safety & Alignment (4) · Agents (4) · RAG (5) · Multimodal (7) · Prompt Engineering (4) · MLOps (3) · Observability (2) · Infrastructure (3) · Mech Interp (4) · Emerging Techniques (6) · Agent-Native Research Artifact (3)


Our Mission

We enable AI agents to autonomously conduct AI research — from literature survey and idea generation through experiment execution to paper writing. The library provides both the research orchestration layer (autoresearch, ideation, paper writing) and the engineering skills (training, evaluation, deployment) needed at each stage.

AI Research Agent System
[Figure: System diagram of an AI research agent]

Path Towards AI Research Agent

Modern AI research requires mastering dozens of specialized tools and frameworks. AI Researchers spend more time debugging infrastructure than testing hypotheses — slowing the pace of scientific discovery. We provide a comprehensive skills library that enables AI agents to autonomously conduct the full research lifecycle — from brainstorming ideas to writing the paper.

  • Autonomous Research - The autoresearch skill orchestrates the entire research workflow using a two-loop architecture, routing to domain skills as needed
  • Specialized Expertise - Each domain skill provides deep, production-ready knowledge of a specific framework (Megatron-LM, vLLM, TRL, etc.)
  • End-to-End Coverage - 98 skills spanning the full AI research lifecycle, from ideation and literature survey to experiments and paper writing
  • Research-Grade Quality - Documentation sourced from official repos, real GitHub issues, and battle-tested production workflows

Available AI Research Engineering Skills

Quality over quantity: Each skill provides comprehensive, expert-level guidance with real code examples, troubleshooting guides, and production-ready workflows.

📦 Quick Install (Recommended)

For humans — interactive installer with one command:

npx @orchestra-research/ai-research-skills

For AI agents — point your agent to the welcome doc and it handles the rest:

Read https://www.orchestra-research.com/ai-research-skills/welcome.md and follow the instructions to install and use AI Research Skills.

This installs all 98 skills, loads the autoresearch orchestration layer, and starts autonomous research.

What the installer does
  • Auto-detects your installed coding agents (Claude Code, Hermes Agent, OpenCode, Cursor, Gemini CLI, etc.)
  • Installs skills to ~/.orchestra/skills/ with symlinks to each agent (falls back to copy on Windows)
  • Offers everything, quickstart bundle, by category, or individual skills
  • Updates installed skills with latest versions
  • Uninstalls all or selected skills
CLI Commands
# Interactive installer (recommended)
npx @orchestra-research/ai-research-skills

# Direct commands
npx @orchestra-research/ai-research-skills list      # View installed skills
npx @orchestra-research/ai-research-skills update    # Update installed skills
Claude Code Marketplace (Alternative)

Install skill categories directly using the Claude Code CLI:

# Add the marketplace
/plugin marketplace add orchestra-research/AI-research-SKILLs

# Install by category (23 categories available)
/plugin install fine-tuning@ai-research-skills        # Axolotl, LLaMA-Factory, PEFT, Unsloth
/plugin install post-training@ai-research-skills      # TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge
/plugin install inference-serving@ai-research-skills  # vLLM, TensorRT-LLM, llama.cpp, SGLang
/plugin install distributed-training@ai-research-skills
/plugin install optimization@ai-research-skills

All 23 Categories (98 Skills)

| Category | Skills | Included |
|---|---|---|
| Autoresearch | 1 | Autonomous research orchestration — central layer that manages the full lifecycle and routes to all other skills |
| Ideation | 2 | Research Brainstorming, Creative Thinking |
| ML Paper Writing | 2 | ML Paper Writing (LaTeX templates, citation verification), Academic Plotting |
| Model Architecture | 5 | LitGPT, Mamba, NanoGPT, RWKV, TorchTitan |
| Tokenization | 2 | HuggingFace Tokenizers, SentencePiece |
| Fine-Tuning | 4 | Axolotl, LLaMA-Factory, PEFT, Unsloth |
| Mech Interp | 4 | TransformerLens, SAELens, pyvene, nnsight |
| Data Processing | 2 | NeMo Curator, Ray Data |
| Post-Training | 8 | TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge |
| Safety | 4 | Constitutional AI, LlamaGuard, NeMo Guardrails, Prompt Guard |
| Distributed | 6 | DeepSpeed, FSDP, Accelerate, Megatron-Core, Lightning, Ray Train |
| Infrastructure | 3 | Modal, Lambda Labs, SkyPilot |
| Optimization | 6 | Flash Attention, bitsandbytes, GPTQ, AWQ, HQQ, GGUF |
| Evaluation | 3 | lm-eval-harness, BigCode, NeMo Evaluator |
| Inference | 4 | vLLM, TensorRT-LLM, llama.cpp, SGLang |
| MLOps | 3 | W&B, MLflow, TensorBoard |
| Agents | 4 | LangChain, LlamaIndex, CrewAI, AutoGPT |
| RAG | 5 | Chroma, FAISS, Pinecone, Qdrant, Sentence Transformers |
| Prompt Eng | 4 | DSPy, Instructor, Guidance, Outlines |
| Observability | 2 | LangSmith, Phoenix |
| Multimodal | 7 | CLIP, Whisper, LLaVA, BLIP-2, SAM, Stable Diffusion, AudioCraft |
| Emerging | 6 | MoE, Model Merging, Long Context, Speculative Decoding, Distillation, Pruning |
| Agent-Native Research Artifact | 3 | ARA Compiler, Research Manager, Rigor Reviewer |
View All 98 Skills in Detail

🔬 Autoresearch (1 skill) — Central Orchestration Layer

  • Autoresearch - Autonomous research orchestration using a two-loop architecture (inner optimization + outer synthesis). Manages the full lifecycle from literature survey to paper writing, routing to all domain-specific skills. Supports Claude Code /loop and OpenClaw heartbeat for continuous operation (390 lines + 3 refs)

🏗️ Model Architecture (5 skills)

  • LitGPT - Lightning AI's 20+ clean LLM implementations with production training recipes (462 lines + 4 refs)
  • Mamba - State-space models with O(n) complexity, 5× faster than Transformers (253 lines + 3 refs)
  • RWKV - RNN+Transformer hybrid, infinite context, Linux Foundation project (253 lines + 3 refs)
  • NanoGPT - Educational GPT in ~300 lines by Karpathy (283 lines + 3 refs)
  • TorchTitan - PyTorch-native distributed training for Llama 3.1 with 4D parallelism

🔤 Tokenization (2 skills)

  • HuggingFace Tokenizers - Rust-based, <20s/GB, BPE/WordPiece/Unigram algorithms (486 lines + 4 refs)
  • SentencePiece - Language-independent, 50k sentences/sec, used by T5/ALBERT (228 lines + 2 refs)

🎯 Fine-Tuning (4 skills)

  • Axolotl - YAML-based fine-tuning with 100+ models (156 lines + 4 refs)
  • LLaMA-Factory - WebUI no-code fine-tuning (78 lines + 5 refs)
  • Unsloth - 2x faster QLoRA fine-tuning (75 lines + 4 refs)
  • PEFT - Parameter-efficient fine-tuning with LoRA, QLoRA, DoRA, 25+ methods (431 lines + 2 refs)

🔬 Mechanistic Interpretability (4 skills)

  • TransformerLens - Neel Nanda's library for mech interp with HookPoints, activation caching (346 lines + 3 refs)
  • SAELens - Sparse Autoencoder training and analysis for feature discovery (386 lines + 3 refs)
  • pyvene - Stanford's causal intervention library with declarative configs (473 lines + 3 refs)
  • nnsight - Remote interpretability via NDIF, run experiments on 70B+ models (436 lines + 3 refs)

📊 Data Processing (2 skills)

  • Ray Data - Distributed ML data processing, streaming execution, GPU support (318 lines + 2 refs)
  • NeMo Curator - GPU-accelerated data curation, 16× faster deduplication (375 lines + 2 refs)

🎓 Post-Training (8 skills)

  • TRL Fine-Tuning - Transformer Reinforcement Learning (447 lines + 4 refs)
  • GRPO-RL-Training (TRL) - Group Relative Policy Optimization with TRL (569 lines, gold standard)
  • OpenRLHF - Full RLHF pipeline with Ray + vLLM (241 lines + 4 refs)
  • SimPO - Simple Preference Optimization, no reference model needed (211 lines + 3 refs)
  • verl - ByteDance's HybridFlow RL framework, FSDP/Megatron + vLLM/SGLang backends (389 lines + 2 refs)
  • slime - THUDM's Megatron+SGLang framework powering GLM-4.x models (464 lines + 2 refs)
  • miles - Enterprise fork of slime with FP8, INT4, speculative RL for MoE training (315 lines + 2 refs)
  • torchforge - Meta's PyTorch-native RL with Monarch+TorchTitan+vLLM (380 lines + 2 refs)

🛡️ Safety & Alignment (4 skills)

  • Constitutional AI - AI-driven self-improvement via principles (282 lines)
  • LlamaGuard - Safety classifier for LLM inputs/outputs (329 lines)
  • NeMo Guardrails - Programmable guardrails with Colang (289 lines)
  • Prompt Guard - Meta's 86M prompt injection & jailbreak detector, 99%+ TPR, <2ms GPU (313 lines)

⚡ Distributed Training (6 skills)

  • Megatron-Core - NVIDIA's framework for training 2B-462B param models with 47% MFU on H100 (359 lines + 4 refs)
  • DeepSpeed - Microsoft's ZeRO optimization (137 lines + 9 refs)
  • PyTorch FSDP2 - Fully Sharded Data Parallel v2 with fully_shard and DTensor (231 lines + 12 refs)
  • Accelerate - HuggingFace's 4-line distributed training API (324 lines + 3 refs)
  • PyTorch Lightning - High-level training framework with Trainer class (339 lines + 3 refs)
  • Ray Train - Multi-node orchestration and hyperparameter tuning (399 lines + 1 ref)
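
A minimal sketch of the Accelerate pattern referenced above: prepare the model, optimizer, and dataloader once, then train as usual; the toy model and data are stand-ins.

```python
# Minimal sketch: the core Accelerate loop with placeholder model and data.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# Accelerate handles device placement and, when launched distributed, sharding
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch_x, batch_y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_x), batch_y)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
```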

🚀 Optimization (6 skills)

  • Flash Attention - 2-4x faster attention with memory efficiency (359 lines + 2 refs)
  • bitsandbytes - 8-bit/4-bit quantization for 50-75% memory reduction (403 lines + 3 refs; see the sketch after this list)
  • GPTQ - 4-bit post-training quantization, 4× memory reduction, <2% accuracy loss (443 lines + 3 refs)
  • AWQ - Activation-aware weight quantization, 4-bit with minimal accuracy loss (310 lines + 2 refs)
  • HQQ - Half-Quadratic Quantization, no calibration data needed, multi-backend (370 lines + 2 refs)
  • GGUF - llama.cpp quantization format, K-quant methods, CPU/Metal inference (380 lines + 2 refs)
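
A minimal sketch of 4-bit loading with bitsandbytes through transformers' BitsAndBytesConfig; the model name is illustrative and a CUDA GPU is assumed.

```python
# Minimal sketch: load a causal LM in 4-bit NF4 via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",     # illustrative model choice
    quantization_config=bnb_config,
    device_map="auto",
)
```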

📊 Evaluation (3 skills)

  • lm-evaluation-harness - EleutherAI's standard for benchmarking LLMs across 60+ tasks (482 lines + 4 refs; see the sketch after this list)
  • BigCode Evaluation Harness - Code model benchmarking with HumanEval, MBPP, MultiPL-E, pass@k metrics (406 lines + 3 refs)
  • NeMo Evaluator - NVIDIA's enterprise platform for 100+ benchmarks across 18+ harnesses with multi-backend execution (454 lines + 4 refs)
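
A minimal sketch of running lm-evaluation-harness from Python, assuming a v0.4+ release where simple_evaluate is exposed at the package level; the model and task choices are illustrative.

```python
# Minimal sketch: benchmark a HF model on two tasks with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])
```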

☁️ Infrastructure (3 skills)

  • Modal - Serverless GPU cloud with Python-native API, T4-H200 on-demand (342 lines + 2 refs; see the sketch after this list)
  • SkyPilot - Multi-cloud orchestration across 20+ providers with spot recovery (390 lines + 2 refs)
  • Lambda Labs - Reserved/on-demand GPU cloud with H100/A100, persistent filesystems (390 lines + 2 refs)
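
A minimal Modal sketch, assuming a recent client where modal.App is the entry point; the GPU type and function body are illustrative.

```python
# Minimal sketch: run a function on a cloud GPU with Modal.
import modal

app = modal.App("gpu-demo")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="T4", image=image)
def gpu_info() -> str:
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    print(gpu_info.remote())   # executes remotely, returns the result locally
```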

🔥 Inference & Serving (4 skills)

  • vLLM - High-throughput LLM serving with PagedAttention (356 lines + 4 refs, production-ready; see the sketch after this list)
  • TensorRT-LLM - NVIDIA's fastest inference, 24k tok/s, FP8/INT4 quantization (180 lines + 3 refs)
  • llama.cpp - CPU/Apple Silicon inference, GGUF quantization (251 lines + 3 refs)
  • SGLang - Structured generation with RadixAttention, 5-10× faster for agents (435 lines + 3 refs)
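
A minimal sketch of offline batched generation with vLLM; the model name and sampling settings are illustrative assumptions.

```python
# Minimal sketch: batched offline generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(
    ["Summarize PagedAttention in one sentence.",
     "What does continuous batching optimize?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```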

🤖 Agents (4 skills)

  • LangChain - Most popular agent framework, 500+ integrations, ReAct pattern (658 lines + 3 refs, production-ready)
  • LlamaIndex - Data framework for LLM apps, 300+ connectors, RAG-focused (535 lines + 3 refs)
  • CrewAI - Multi-agent orchestration, role-based collaboration, autonomous workflows (498 lines + 3 refs)
  • AutoGPT - Autonomous AI agent platform, visual workflow builder, continuous execution (400 lines + 2 refs)

🔍 RAG (5 skills)

  • Chroma - Open-source embedding database, local/cloud, 24k stars (385 lines + 1 ref)
  • FAISS - Facebook's similarity search, billion-scale, GPU acceleration (295 lines; see the sketch after this list)
  • Sentence Transformers - 5000+ embedding models, multilingual, 15k stars (370 lines)
  • Pinecone - Managed vector database, auto-scaling, <100ms latency (410 lines)
  • Qdrant - High-performance vector search, Rust-powered, hybrid search with filtering (493 lines + 2 refs)
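
A minimal sketch of the embed-and-search loop behind these RAG skills, combining Sentence Transformers with a FAISS index; the corpus and embedding model are toy assumptions.

```python
# Minimal sketch: embed a tiny corpus and run a FAISS nearest-neighbour search.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "PagedAttention reduces KV-cache fragmentation.",
    "LoRA adapts a frozen model with low-rank updates.",
    "FAISS performs billion-scale similarity search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on unit vectors
index.add(embeddings)

query = model.encode(["How does low-rank fine-tuning work?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([docs[i] for i in ids[0]])
```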

🎨 Multimodal (7 skills)

  • CLIP - OpenAI's vision-language model, zero-shot classification, 25k stars (320 lines; see the sketch after this list)
  • Whisper - Robust speech recognition, 99 languages, 73k stars (395 lines)
  • LLaVA - Vision-language assistant, image chat, GPT-4V level (360 lines)
  • Stable Diffusion - Text-to-image generation via HuggingFace Diffusers, SDXL, ControlNet (380 lines + 2 refs)
  • Segment Anything - Meta's SAM for zero-shot image segmentation with points/boxes (500 lines + 2 refs)
  • BLIP-2 - Vision-language pretraining with Q-Former, image captioning, VQA (500 lines + 2 refs)
  • AudioCraft - Meta's MusicGen/AudioGen for text-to-music and text-to-sound (470 lines + 2 refs)
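
A minimal sketch of zero-shot classification with CLIP via transformers; the image path and label set are illustrative assumptions.

```python
# Minimal sketch: zero-shot image classification with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a GPU cluster"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # label probabilities per image
print(dict(zip(labels, probs[0].tolist())))
```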

🎯 Prompt Engineering (4 skills)

  • DSPy - Declarative prompt programming with optimizers, Stanford NLP, 22k stars (438 lines + 3 refs)
  • Instructor - Structured LLM outputs with Pydantic validation, 15k stars (726 lines + 3 refs; see the sketch after this list)
  • Guidance - Constrained generation with regex/grammars, Microsoft Research, 18k stars (485 lines + 3 refs)
  • Outlines - Structured text with FSM, zero-overhead, 8k stars (601 lines + 3 refs)
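
A minimal sketch of schema-constrained extraction with Instructor, assuming Instructor 1.x (instructor.from_openai) and an OpenAI API key in the environment; the model and schema are illustrative.

```python
# Minimal sketch: validated structured output with Instructor + Pydantic.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Paper(BaseModel):
    title: str
    year: int
    venue: str

client = instructor.from_openai(OpenAI())

paper = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Paper,     # output is validated against this schema
    messages=[{"role": "user",
               "content": "Extract: 'Attention Is All You Need, NeurIPS 2017'."}],
)
print(paper.model_dump())
```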

📊 MLOps (3 skills)

  • Weights & Biases - Experiment tracking, sweeps, artifacts, model registry (427 lines + 3 refs)
  • MLflow - Model registry, tracking, deployment, autologging (514 lines + 3 refs)
  • TensorBoard - Visualization, profiling, embeddings, scalars/images (538 lines + 3 refs)
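
A minimal sketch of scalar logging with TensorBoard's PyTorch SummaryWriter; the run directory and placeholder metric are illustrative.

```python
# Minimal sketch: log scalars to TensorBoard from a toy loop.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/skills-demo")

for step in range(100):
    loss = 1.0 / (step + 1)                 # placeholder metric
    writer.add_scalar("train/loss", loss, step)

writer.close()
# View with: tensorboard --logdir runs
```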

👁️ Observability (2 skills)

  • LangSmith - LLM observability, tracing, evaluation, monitoring for AI apps (422 lines + 2 refs)
  • Phoenix - Open-source AI observability with OpenTelemetry tracing and LLM evaluation (380 lines + 2 refs)

🔬 Emerging Techniques (6 skills)

  • MoE Training - Mixture of Experts training with DeepSpeed, Mixtral 8x7B, 5× cost reduction (515 lines + 3 refs)
  • Model Merging - Combine models with TIES, DARE, SLERP using mergekit (528 lines + 3 refs)
  • Long Context - Extend context windows with RoPE, YaRN, ALiBi, 32k-128k tokens (624 lines + 3 refs)
  • Speculative Decoding - 1.5-3.6× faster inference with Medusa, Lookahead (379 lines)
  • Knowledge Distillation - Compress models 70B→7B with MiniLLM, temperature scaling (424 lines)
  • Model Pruning - 50% sparsity with Wanda, SparseGPT, <1% accuracy loss (417 lines)

📝 ML Paper Writing (2 skills)

  • ML Paper Writing - Write publication-ready papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM with LaTeX templates, citation verification, and writing best practices (532 lines + 5 refs)
  • Academic Plotting - Generate publication-quality figures for ML papers: architecture diagrams via Gemini AI and data-driven charts via matplotlib/seaborn with venue-specific styling (479 lines + 3 refs)

💡 Ideation (2 skills)

  • Research Brainstorming - Structured ideation frameworks for discovering high-impact research directions with 10 complementary lenses (384 lines)
  • Creative Thinking - Cognitive science frameworks (bisociation, structure-mapping, constraint manipulation) for genuinely novel research ideas (366 lines)

🧬 Agent-Native Research Artifact (3 skills)

  • ARA Compiler - Compiles any research input (PDF papers, repos, experiment logs, raw notes) into a complete Agent-Native Research Artifact with claims, exploration graph, evidence, and code stubs (245 lines + 3 refs)
  • ARA Research Manager - Post-task research recorder that runs at session end to extract decisions, experiments, dead ends, and pivots from conversation history into the ara/ directory with user-vs-AI provenance tags (324 lines + 3 refs)
  • ARA Rigor Reviewer - ARA Seal Level 2 semantic epistemic review scoring six dimensions of research rigor (evidence relevance, falsifiability, scope, coherence, exploration integrity, methodology) with severity-ranked findings (322 lines + 1 ref)

Demos

All 98 skills in this repo are automatically synced to Orchestra Research, where you can add them to your projects with one click and use them with AI research agents.

See skills in action → demos/

We maintain a curated collection of demo repositories showing how to use skills for real AI research tasks:

| Demo | Skills Used | What It Does |
| --- | --- | --- |
| Norm Heterogeneity → LoRA Brittleness | Autoresearch, ML Paper Writing, Ideation | Agent autonomously discovered that norm heterogeneity predicts fine-tuning difficulty (r = -0.99), pivoting from a null result on ETF overlaps |
| RL Algorithm Brain Scan | Autoresearch, GRPO, TRL, SAELens, TransformerLens, ML Paper Writing | Agent found DPO is a rank-1 perturbation (95.6% recovery from one SVD direction) while online RL is distributed and structure-preserving |
| NeMo Eval: GPQA Benchmark | NeMo Evaluator | Compare Llama 8B/70B/405B on graduate-level science questions |
| LoRA Without Regret Reproduction | GRPO, TRL | Reproduce SFT + GRPO RL experiments via prompting |
| Layer-Wise Quantization Experiment | llama.cpp, GGUF | Investigate optimal layer precision allocation: early layers at Q8 achieve 1.9× compression with 1.3% perplexity loss |
| Cross-Lingual Alignment Analysis | FAISS | Quantify how well multilingual embeddings align semantic concepts across 8 languages using FAISS similarity search |
| Scientific Plotting Demo | Academic Plotting | Generate publication-quality figures for the Andes QoE-aware LLM serving paper: Gemini AI architecture diagrams plus matplotlib data charts (CDF, multi-panel grids, bar charts) |

Featured Demos: Two papers produced entirely by AI agents using the autoresearch skill. The Norm Heterogeneity paper demonstrates autonomous research pivoting: the agent refuted its own hypothesis and discovered a stronger finding. The RL Brain Scan paper demonstrates multi-skill orchestration: the agent trained RL models, analyzed their internals with interpretability tools, and synthesized the insight that "DPO is rank-1 alignment." Both papers were written end-to-end by the agent.

Skill Structure

Each skill follows a battle-tested format for maximum usefulness:

skill-name/
├── SKILL.md                    # Quick reference (50-150 lines)
│   ├── Metadata (name, description, version)
│   ├── When to use this skill
│   ├── Quick patterns & examples
│   └── Links to references
│
├── references/                 # Deep documentation (300KB+)
│   ├── README.md              # From GitHub/official docs
│   ├── api.md                 # API reference
│   ├── tutorials.md           # Step-by-step guides
│   ├── issues.md              # Real GitHub issues & solutions
│   ├── releases.md            # Version history & breaking changes
│   └── file_structure.md      # Codebase navigation
│
├── scripts/                    # Helper scripts (optional)
└── assets/                     # Templates & examples (optional)

Quality Standards
  • 300KB+ documentation from official sources
  • Real GitHub issues & solutions (when available)
  • Code examples with language detection
  • Version history & breaking changes
  • Links to official docs

Roadmap

We originally set out to build 80 comprehensive skills across the full AI research lifecycle and have since passed that milestone. See our detailed roadmap for the complete development plan.

View Full Roadmap →

View Detailed Statistics
| Metric | Current | Target |
| --- | --- | --- |
| Skills | 98 (high-quality, standardized YAML) | 80 ✅ |
| Avg Lines/Skill | 420 lines (focused + progressive disclosure) | 200-600 lines |
| Documentation | ~130,000 lines total (SKILL.md + references) | 100,000+ lines |
| Gold Standard Skills | 65 with comprehensive references | 50+ |
| Contributors | 1 | 100+ |
| Coverage | Architecture, Tokenization, Fine-Tuning, Mechanistic Interpretability, Data Processing, Post-Training, Safety, Distributed, Optimization, Evaluation, Infrastructure, Inference, Agents, RAG, Multimodal, Prompt Engineering, MLOps, Observability, Emerging Techniques, ML Paper Writing, Ideation, Autoresearch, Agent-Native Research Artifact | Full Lifecycle ✅ |

Recent Progress: npm package @orchestra-research/ai-research-skills for one-command installation across all coding agents

Philosophy: Quality > Quantity. Following Anthropic's official best practices, each skill provides 200-500 lines of focused, actionable guidance with progressive disclosure.

Repository Structure

claude-ai-research-skills/
├── README.md                    ← You are here
├── CONTRIBUTING.md              ← Contribution guide
├── demos/                       ← Curated demo gallery (links to demo repos)
├── docs/
├── 0-autoresearch-skill/        (1 skill ✓ - Autonomous research orchestration)
├── 01-model-architecture/       (5 skills ✓ - LitGPT, Mamba, RWKV, NanoGPT, TorchTitan)
├── 02-tokenization/             (2 skills ✓ - HuggingFace Tokenizers, SentencePiece)
├── 03-fine-tuning/              (4 skills ✓ - Axolotl, LLaMA-Factory, Unsloth, PEFT)
├── 04-mechanistic-interpretability/ (4 skills ✓ - TransformerLens, SAELens, pyvene, nnsight)
├── 05-data-processing/          (2 skills ✓ - Ray Data, NeMo Curator)
├── 06-post-training/            (8 skills ✓ - TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge)
├── 07-safety-alignment/         (4 skills ✓ - Constitutional AI, LlamaGuard, NeMo Guardrails, Prompt Guard)
├── 08-distributed-training/     (6 skills ✓ - Megatron-Core, DeepSpeed, FSDP, Accelerate, Lightning, Ray Train)
├── 09-infrastructure/           (3 skills ✓ - Modal, SkyPilot, Lambda Labs)
├── 10-optimization/             (6 skills ✓ - Flash Attention, bitsandbytes, GPTQ, AWQ, HQQ, GGUF)
├── 11-evaluation/               (3 skills ✓ - lm-evaluation-harness, BigCode, NeMo Evaluator)
├── 12-inference-serving/        (4 skills ✓ - vLLM, TensorRT-LLM, llama.cpp, SGLang)
├── 13-mlops/                    (3 skills ✓ - Weights & Biases, MLflow, TensorBoard)
├── 14-agents/                   (4 skills ✓ - LangChain, LlamaIndex, CrewAI, AutoGPT)
├── 15-rag/                      (5 skills ✓ - Chroma, FAISS, Sentence Transformers, Pinecone, Qdrant)
├── 16-prompt-engineering/       (4 skills ✓ - DSPy, Instructor, Guidance, Outlines)
├── 17-observability/            (2 skills ✓ - LangSmith, Phoenix)
├── 18-multimodal/               (7 skills ✓ - CLIP, Whisper, LLaVA, Stable Diffusion, SAM, BLIP-2, AudioCraft)
├── 19-emerging-techniques/      (6 skills ✓ - MoE, Model Merging, Long Context, Speculative Decoding, Distillation, Pruning)
├── 20-ml-paper-writing/         (2 skills ✓ - ML Paper Writing with LaTeX templates, Academic Plotting)
├── 21-research-ideation/           (2 skills ✓ - Research Brainstorming, Creative Thinking)
├── 22-agent-native-research-artifact/ (3 skills ✓ - ARA Compiler, Research Manager, Rigor Reviewer)
└── packages/ai-research-skills/ (npm package for one-command installation)

Use Cases

For Researchers

"I need to fine-tune Llama 3 with custom data" → 03-fine-tuning/axolotl/ - YAML configs, 100+ model support

For ML Engineers

"How do I optimize inference latency?" → 12-inference-serving/vllm/ - PagedAttention, batching

For Students

"I want to learn how transformers work" → 01-model-architecture/litgpt/ - Clean implementations

For Teams

"We need to scale training to 100 GPUs" → 08-distributed-training/deepspeed/ - ZeRO stages, 3D parallelism

License

MIT License - See LICENSE for details.

Note: Individual skills may reference libraries with different licenses. Please check each project's license before use.

Citation

If you use AI Research Skills in your work or find it helpful for a publication, we'd appreciate a citation:

BibTeX

@software{ai_research_skills,
  title     = {AI Research Skills Library},
  author    = {{Orchestra Research}},
  year      = {2025},
  url       = {https://github.com/orchestra-research/AI-research-SKILLs},
  note      = {Open-source skills library enabling AI agents to autonomously conduct AI research}
}

APA

Orchestra Research. (2025). AI Research Skills Library [Computer software]. https://github.com/orchestra-research/AI-research-SKILLs

Chicago

Orchestra Research. "AI Research Skills Library." GitHub, 2025. https://github.com/orchestra-research/AI-research-SKILLs.

IEEE

Orchestra Research, "AI Research Skills Library," 2025. [Online]. Available: https://github.com/orchestra-research/AI-research-SKILLs

Tip: You can also click "Cite this repository" in the GitHub sidebar for auto-formatted citations.

Acknowledgments

Built with:

  • Claude Code - AI pair programming
  • Skill Seeker - Automated doc scraping
  • Open Source AI Community - For amazing tools and docs

Special thanks to:

  • EleutherAI, HuggingFace, NVIDIA, Lightning AI, Meta AI, Anthropic
  • All researchers who maintain excellent documentation

Contributors

Thanks to all the people who have contributed to the AI Research Skills Library.

We welcome contributions from the AI research community! See CONTRIBUTING.md for detailed guidelines on:

  • Adding new skills
  • Improving existing skills
  • Quality standards and best practices
  • Submission process

Recent Updates

April 2026 - v1.6.0 🧬 Agent-Native Research Artifact (ARA) — 23rd Category, 98 Skills
  • 🧬 NEW CATEGORY: 22-agent-native-research-artifact/ (the 23rd category) — three skills that turn research outputs into a falsifiable, agent-traversable artifact:
    • 🛠️ ARA Compiler — compiles any input (PDF papers, GitHub repos, experiment logs, raw notes) into a structured ARA with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph (research DAG), and grounded evidence
    • 📋 ARA Research Manager — post-task epilogue that scans conversation history at session end and writes decisions, experiments, dead ends, claims, heuristics, and pivots into the ara/ directory with user / ai-suggested / ai-executed / user-revised provenance tags
    • 🔍 ARA Rigor Reviewer — Seal Level 2 semantic epistemic review scoring six dimensions of research rigor (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and emitting a severity-ranked report with a Strong Accept-to-Reject recommendation
  • 🔗 Sourced from the Agent-Native-Research-Artifact-Init reference repo, restructured to AI-research-SKILLs standards (kebab-case names, third-person descriptions, Title-Case tags, one-level-deep references)
  • 🧩 Plugin entry agent-native-research-artifact added to .claude-plugin/marketplace.json; CLI category registered as 22-agent-native-research-artifact with three individual skill entries in the npm installer
  • 🔄 Auto-syncs to Orchestra marketplace via sync-skills.yml on push; npm package republished as @orchestra-research/ai-research-skills@1.6.0 via publish-npm.yml on version bump
  • 📊 98 total skills across 23 categories — full lifecycle from idea → paper → falsifiable, auditable artifact
March 2026 - v1.4.0 🔬 Autoresearch & 86 Skills — Full Research Lifecycle
  • 🔬 NEW SKILL: Autoresearch — autonomous research orchestration using a two-loop architecture (inner optimization loop + outer synthesis loop)
  • 🧠 Manages the full research lifecycle: literature survey → ideation → experiments → synthesis → paper writing
  • 🔄 Routes to all 86 domain skills automatically — agents don't need to know which skill to use
  • ⏰ Mandatory /loop (Claude Code) and cron job (OpenClaw) for continuous autonomous operation
  • 📊 Generates research presentations (HTML/PDF) with optimization trajectory plots for human review
  • 📝 Findings.md as persistent project memory across sessions with "Lessons and Constraints" tracking
  • 🗂️ Structured workspace: research-state.yaml, findings.md, research-log.md, literature/, experiments/, src/, data/, to_human/
  • 📄 Two demo papers produced by autoresearch: Norm Heterogeneity → LoRA Brittleness and RL Algorithm Brain Scan
  • 🚀 WELCOME.md for cold-start agent bootstrap — one URL to go from zero to autonomous research
  • 📦 npm v1.4.x with Windows symlink fallback, all 22 categories installable
  • 🤖 Supported agents: Claude Code, Hermes Agent, OpenCode, OpenClaw, Cursor, Codex, Gemini CLI, Qwen Code
  • 📊 87 total skills across 22 categories — complete research lifecycle coverage
February 2026 - v0.15.0 🛡️ Prompt Guard & 83 Skills
  • 🛡️ NEW SKILL: Prompt Guard - Meta's 86M prompt injection & jailbreak detector
  • ⚡ 99%+ TPR, <1% FPR, <2ms GPU latency, multilingual (8 languages)
  • 🔒 3 workflows: user input filtering, third-party data filtering, batch RAG processing
  • 📊 83 total skills across 20 categories
January 2026 - v0.14.0 📦 npm Package & 82 Skills
  • 📦 NEW: npx @orchestra-research/ai-research-skills - One-command installation for all coding agents
  • 🤖 Supported agents: Claude Code, OpenCode, Cursor, Codex, Gemini CLI, Qwen Code
  • ✨ Interactive installer with category/individual skill selection
  • 🔄 Update installed skills, selective uninstall
  • 📊 82 total skills (5 new post-training skills: verl, slime, miles, torchforge + TorchTitan)
  • 🏗️ Megatron-Core moved to Distributed Training category
January 2026 - v0.13.0 📝 ML Paper Writing & Demos Gallery
  • 📝 NEW CATEGORY: ML Paper Writing (20th category, 77th skill)
  • 🎯 Write publication-ready papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM
  • 📚 Writing philosophy from top researchers (Neel Nanda, Farquhar, Gopen & Swan, Lipton, Perez)
  • 🔬 Citation verification workflow - never hallucinate references
  • 📄 LaTeX templates for 6 major conferences
  • 🎪 NEW: Curated demos gallery (demos/) showcasing skills in action
  • 🔗 Demo repos: NeMo Evaluator benchmark, LoRA Without Regret reproduction
  • 📖 936-line comprehensive SKILL.md with 4 workflows
January 2026 - v0.12.0 📊 NeMo Evaluator SDK
  • 📊 NEW SKILL: NeMo Evaluator SDK for enterprise LLM benchmarking
  • 🔧 NVIDIA's evaluation platform with 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM)
  • ⚡ Multi-backend execution: local Docker, Slurm HPC, Lepton cloud
  • 📦 Container-first architecture for reproducible evaluation
  • 📝 454 lines SKILL.md + 4 comprehensive reference files (~48KB documentation)
December 2025 - v0.11.0 🔬 Mechanistic Interpretability
  • 🔬 NEW CATEGORY: Mechanistic Interpretability (4 skills)
  • 🔍 TransformerLens skill: Neel Nanda's library for mech interp with HookPoints, activation caching, circuit analysis
  • 🧠 SAELens skill: Sparse Autoencoder training and analysis for feature discovery, monosemanticity research
  • ⚡ pyvene skill: Stanford's causal intervention library with declarative configs, DAS, activation patching
  • 🌐 nnsight skill: Remote interpretability via NDIF, run experiments on 70B+ models without local GPUs
  • 📝 ~6,500 new lines of documentation across 16 files
  • 76 total skills (filling the missing 04 category slot)
November 25, 2025 - v0.10.0 🎉 70 Skills Complete!
  • 🎉 ROADMAP COMPLETE: Reached 70-skill milestone!
  • 🚀 Added 4 skills: Lambda Labs, Segment Anything (SAM), BLIP-2, AudioCraft
  • ☁️ Lambda Labs skill: Reserved/on-demand GPU cloud with H100/A100, persistent filesystems, 1-Click Clusters
  • 🖼️ SAM skill: Meta's Segment Anything for zero-shot image segmentation with points/boxes/masks
  • 👁️ BLIP-2 skill: Vision-language pretraining with Q-Former, image captioning, VQA
  • 🎵 AudioCraft skill: Meta's MusicGen/AudioGen for text-to-music and text-to-sound generation
  • 📝 ~10,000 new lines of documentation across 12 files
  • 70 total skills (100% roadmap complete!)
November 25, 2025 - v0.9.0
  • 🚀 Added 2 infrastructure skills: Modal, SkyPilot
  • ☁️ Modal skill: Serverless GPU cloud with Python-native API, T4-H200 on-demand, auto-scaling
  • 🌐 SkyPilot skill: Multi-cloud orchestration across 20+ providers with spot recovery
  • ✨ New Infrastructure category (2 skills - serverless GPU and multi-cloud orchestration)
  • 📝 ~2,500 new lines of documentation across 6 files
  • 66 total skills (94% towards 70-skill target)
November 25, 2025 - v0.8.0
  • 🚀 Added 5 high-priority skills: HQQ, GGUF, Phoenix, AutoGPT, Stable Diffusion
  • ⚡ HQQ skill: Half-Quadratic Quantization without calibration data, multi-backend support
  • 📦 GGUF skill: llama.cpp quantization format, K-quant methods, CPU/Metal inference
  • 👁️ Phoenix skill: Open-source AI observability with OpenTelemetry tracing and LLM evaluation
  • 🤖 AutoGPT skill: Autonomous AI agent platform with visual workflow builder
  • 🎨 Stable Diffusion skill: Text-to-image generation via Diffusers, SDXL, ControlNet, LoRA
  • 📝 ~9,000 new lines of documentation across 15 files
  • 64 total skills (91% towards 70-skill target)
November 25, 2025 - v0.7.0
  • 🚀 Added 5 high-priority skills: PEFT, CrewAI, Qdrant, AWQ, LangSmith
  • ✨ New Observability category with LangSmith for LLM tracing and evaluation
  • 🎯 PEFT skill: Parameter-efficient fine-tuning with LoRA, QLoRA, DoRA, 25+ methods
  • 🤖 CrewAI skill: Multi-agent orchestration with role-based collaboration
  • 🔍 Qdrant skill: High-performance Rust vector search with hybrid filtering
  • ⚡ AWQ skill: Activation-aware 4-bit quantization with minimal accuracy loss
  • 📝 ~8,000 new lines of documentation across 15 files
  • 59 total skills (84% towards 70-skill target)
November 15, 2025 - v0.6.0
  • 📊 Added 3 comprehensive MLOps skills: Weights & Biases, MLflow, TensorBoard
  • ✨ New MLOps category (3 skills - experiment tracking, model registry, visualization)
  • 📝 ~10,000 new lines of documentation across 13 files
  • 🔧 Comprehensive coverage: experiment tracking, hyperparameter sweeps, model registry, profiling, embeddings visualization
  • 54 total skills (77% towards 70-skill target)
November 12, 2025 - v0.5.0
  • 🎯 Added 4 comprehensive prompt engineering skills: DSPy, Instructor, Guidance, Outlines
  • ✨ New Prompt Engineering category (4 skills - DSPy, Instructor, Guidance, Outlines)
  • 📝 ~10,000 new lines of documentation across 16 files
  • 🔧 Comprehensive coverage: declarative programming, structured outputs, constrained generation, FSM-based generation
  • 47 total skills (67% towards 70-skill target)
November 9, 2025 - v0.4.0
  • 🤖 Added 11 comprehensive skills: LangChain, LlamaIndex, Chroma, FAISS, Sentence Transformers, Pinecone, CLIP, Whisper, LLaVA
  • ✨ New Agents category (2 skills - LangChain, LlamaIndex)
  • 🔍 New RAG category (4 skills - Chroma, FAISS, Sentence Transformers, Pinecone)
  • 🎨 New Multimodal category (3 skills - CLIP, Whisper, LLaVA)
  • 📝 ~15,000 new lines of documentation
  • 43 total skills (61% towards 70-skill target)
November 8, 2025 - v0.3.0
  • 🚀 Added 8 comprehensive skills: TensorRT-LLM, llama.cpp, SGLang, GPTQ, HuggingFace Tokenizers, SentencePiece, Ray Data, NeMo Curator
  • ⚡ Completed Inference & Serving category (4/4 skills)
  • 🔤 New Tokenization category (2 skills)
  • 📊 New Data Processing category (2 skills)
  • 📝 9,617 new lines of documentation across 30 files
  • 32 total skills (45% towards 70-skill target)
November 6, 2025 - v0.2.0
  • Added 10 skills from GitHub (Megatron-Core, Lightning, Ray Train, etc.)
  • Improved skill structure with comprehensive references
  • Created strategic roadmap to 70 skills
  • Added contribution guidelines
November 3, 2025 - v0.1.0
  • 🎉 Initial release with 5 fine-tuning skills

Community

Join our community to stay updated, ask questions, and connect with other AI researchers:

  • SkillEvolve Meta-Skill - Connect your agent to the collective intelligence of the community. Captures techniques discovered during sessions and shares them back as curated skills.
  • Slack Community - Chat with the team and other users
  • Twitter/X - Follow for updates and announcements
  • LinkedIn - Connect professionally

Star History

Star History Chart