## USP

It is the most comprehensive open-source library built specifically for autonomous AI research: 98 expert-level skills across 23 categories cover the entire research lifecycle, from idea to paper.

## Use cases

1. Autonomous AI research orchestration
2. Literature survey and idea generation
3. Experiment execution and debugging
4. ML paper writing and academic plotting
5. Distributed LLM pretraining

## Detected files (8)
### 01-model-architecture/litgpt/SKILL.md (11010 bytes)
---
name: implementing-llms-litgpt
description: Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when you need clean model implementations, an educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, LitGPT, Lightning AI, LLM Implementation, LoRA, QLoRA, Fine-Tuning, Llama, Gemma, Phi, Mistral, Educational]
dependencies: [litgpt, torch, transformers]
---

# LitGPT - Clean LLM Implementations

## Quick start

LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.

**Installation**:

```bash
pip install 'litgpt[extra]'
```

**Load and use any model**:

```python
from litgpt import LLM

# Load pretrained model
llm = LLM.load("microsoft/phi-2")

# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7
)
print(result)
```

**List available models**:

```bash
litgpt download list
```

## Common workflows

### Workflow 1: Fine-tune on custom dataset

Copy this checklist:

```
Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning
```

**Step 1: Download pretrained model**

```bash
# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B

# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2

# Download Gemma 2B
litgpt download google/gemma-2b
```

Models are saved to the `checkpoints/` directory.

**Step 2: Prepare dataset**

LitGPT supports multiple formats.

**Alpaca format** (instruction-response):

```json
[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish: Hello, how are you?",
    "input": "",
    "output": "Hola, ¿cómo estás?"
  }
]
```

Save as `data/my_dataset.json`.

**Step 3: Configure training**

```bash
# Full fine-tuning (requires 40GB+ GPU for 7B models)
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --train.max_steps 1000 \
  --train.learning_rate 1e-4
```

**Step 4: Run fine-tuning**

Training saves checkpoints to `out/finetune/` automatically.

Monitor training:

```bash
# View logs
tail -f out/finetune/logs.txt

# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs
```

### Workflow 2: LoRA fine-tuning on single GPU

The most memory-efficient option.

```
LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)
```

**Step 1: Choose base model**

For limited GPU memory (12-16GB):

- **Phi-2** (2.7B) - Best quality/size tradeoff
- **Llama 3 1B** - Smallest, fastest
- **Gemma 2B** - Good reasoning

**Step 2: Configure LoRA parameters**

```bash
# --lora_r        LoRA rank (8-64, higher = more capacity)
# --lora_alpha    LoRA scaling (typically 2×r)
# --lora_dropout  Prevents overfitting
# --lora_query / --lora_value / --lora_projection  Apply LoRA to these projections
# --lora_key / --lora_mlp / --lora_head            Usually not needed
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_key false \
  --lora_value true \
  --lora_projection true \
  --lora_mlp false \
  --lora_head false
```

LoRA rank guide:

- `r=8`: Lightweight, 2-4MB adapters
- `r=16`: Standard, good quality
- `r=32`: High capacity, use for complex tasks
- `r=64`: Maximum quality, 4× larger adapters

**Step 3: Train with LoRA**

```bash
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora

# Memory usage: ~8-12GB for Phi-2 with LoRA
```

**Step 4: Merge LoRA weights** (optional)

Merge LoRA adapters into the base model for deployment:

```bash
litgpt merge_lora \
  out/phi2-lora/final \
  --out_dir out/phi2-merged
```

Now use the merged model:

```python
from litgpt import LLM

llm = LLM.load("out/phi2-merged")
```

### Workflow 3: Pretrain from scratch

Train a new model on your domain data.

```
Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining
```

**Step 1: Prepare pretraining dataset**

LitGPT expects tokenized data. Use `prepare_dataset.py`:

```bash
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
```

**Step 2: Configure model architecture**

Edit a config file or use an existing one:

```yaml
# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
```

**Step 3: Set up multi-GPU training**

```bash
# Single GPU
litgpt pretrain \
  --config config/pythia-160m.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```

**Step 4: Launch pretraining**

For large-scale pretraining on a cluster:

```bash
# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 pretrain_script.sh

# pretrain_script.sh content:
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir /shared/data/pretrain \
  --devices 8 \
  --num_nodes 8 \
  --train.global_batch_size 512 \
  --train.max_tokens 300_000_000_000
```

### Workflow 4: Convert and deploy model

Export LitGPT models for production.

```
Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API
```

**Step 1: Test inference locally**

```python
from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")

# Single generation
print(llm.generate("What is machine learning?"))

# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]
```

**Step 2: Quantize model** (optional)

Reduce model size with minimal quality loss:

```bash
# 4-bit NF4 quantization (~75% size reduction vs fp16)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --dtype bfloat16 \
  --quantize bnb.nf4

# NF4 with double quantization (extra savings)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --quantize bnb.nf4-dq
```

**Step 3: Convert to GGUF** (for llama.cpp)

```bash
python scripts/convert_lit_checkpoint.py \
  --checkpoint_path out/phi2-lora/final \
  --output_path models/phi2.gguf \
  --model_name microsoft/phi-2
```

**Step 4: Deploy with API**

```python
from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-lora/final")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(
        prompt,
        max_new_tokens=max_tokens,
        temperature=0.7
    )
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
```

## When to use vs alternatives

**Use LitGPT when:**

- You want to understand LLM architectures (clean, readable code)
- You need production-ready training recipes
- Working on educational projects or research
- Prototyping new model ideas
- You already use the Lightning ecosystem

**Use alternatives instead:**

- **Axolotl/TRL**: More fine-tuning features, YAML configs
- **Megatron-Core**: Maximum performance for >70B models
- **HuggingFace Transformers**: Broadest model support
- **vLLM**: Inference-only (no training)

## Common issues

**Issue: Out of memory during fine-tuning**

Use LoRA instead of full fine-tuning:

```bash
# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora  # Only needs 12-16GB
```

Or keep the micro-batch small and accumulate gradients:

```bash
litgpt finetune_lora \
  ... \
  --train.gradient_accumulation_iters 4  # Accumulate gradients
```

**Issue: Training too slow**

Flash Attention is built in and enabled automatically on compatible hardware (Ampere+ GPUs such as the A100 and RTX 30/40 series); no configuration is needed.

Use a smaller micro-batch and accumulate:

```bash
--train.micro_batch_size 1 \
--train.global_batch_size 32 \
--train.gradient_accumulation_iters 32  # Effective batch = 32
```

**Issue: Model not loading**

Check the model name:

```bash
# List all available models
litgpt download list

# Download if not present
litgpt download meta-llama/Meta-Llama-3-8B
```

Verify the checkpoints directory:

```bash
ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/
```

**Issue: LoRA adapters too large**

Reduce the LoRA rank:

```bash
--lora_r 8  # Instead of 16 or 32
```

Apply LoRA to fewer layers (disable the projection and MLP adapters):

```bash
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false
```

## Advanced topics

**Supported architectures**: See [references/supported-models.md](references/supported-models.md) for the complete list of 20+ model families with sizes and capabilities.

**Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for proven hyperparameter configurations for pretraining and fine-tuning.

**FSDP configuration**: See [references/distributed-training.md](references/distributed-training.md) for multi-GPU training with Fully Sharded Data Parallel.
**Custom architectures**: See [references/custom-models.md](references/custom-models.md) for implementing new model architectures in LitGPT style.

## Hardware requirements

- **GPU**: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
- **Memory**:
  - Inference (Phi-2): 6GB
  - LoRA fine-tuning (7B): 16GB
  - Full fine-tuning (7B): 40GB+
  - Pretraining (1B): 24GB
- **Storage**: 5-50GB per model (depending on size)

## Resources

- GitHub: https://github.com/Lightning-AI/litgpt
- Docs: https://lightning.ai/docs/litgpt
- Tutorials: https://lightning.ai/docs/litgpt/tutorials
- Model zoo: 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral, Mixtral, Falcon, etc.)

### 01-model-architecture/nanogpt/SKILL.md (6752 bytes)
---
name: nanogpt
description: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding the GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, NanoGPT, GPT-2, Educational, Andrej Karpathy, Transformer, Minimalist, From Scratch, Training]
dependencies: [torch, transformers, datasets, tiktoken, wandb]
---

# nanoGPT - Minimalist GPT Training

## Quick start

nanoGPT is a simplified GPT implementation designed for learning and experimentation.

**Installation**:

```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

**Train on Shakespeare** (CPU-friendly):

```bash
# Prepare data
python data/shakespeare_char/prepare.py

# Train (5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Output**:

```
ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
```

## Common workflows

### Workflow 1: Character-level Shakespeare

**Complete training pipeline**:

```bash
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py

# Step 2: Train small model
python train.py config/train_shakespeare_char.py

# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Config** (`config/train_shakespeare_char.py`):

```python
# Model config
n_layer = 6       # 6 transformer layers
n_head = 6        # 6 attention heads
n_embd = 384      # 384-dim embeddings
block_size = 256  # 256-character context

# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500

# Hardware
device = 'cpu'   # Or 'cuda'
compile = False  # Set True for PyTorch 2.0
```

**Training time**: ~5 minutes (CPU), ~1 minute (GPU)

### Workflow 2: Reproduce GPT-2 (124M)

**Multi-GPU training on OpenWebText**:

```bash
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py

# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 \
  train.py config/train_gpt2.py

# Step 3: Sample from trained model
python sample.py --out_dir=out
```

**Config** (`config/train_gpt2.py`):

```python
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000

# System
compile = True  # PyTorch 2.0
```

**Training time**: ~4 days (8× A100)

### Workflow 3: Fine-tune pretrained GPT-2

**Start from an OpenAI checkpoint**. Set `init_from` in the config and the model loads OpenAI weights automatically:

```python
# In train.py or the config
init_from = 'gpt2'  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```

```bash
python train.py config/finetune_shakespeare.py
```

**Example config** (`config/finetune_shakespeare.py`):

```python
# Start from GPT-2
init_from = 'gpt2'

# Dataset
dataset = 'shakespeare_char'
batch_size = 1
block_size = 1024

# Fine-tuning
learning_rate = 3e-5  # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100

# Regularization
weight_decay = 1e-1
```

### Workflow 4: Custom dataset

**Train on your own text**:

```python
# data/custom/prepare.py
import numpy as np

# Load your data
with open('my_data.txt', 'r') as f:
    text = f.read()

# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Split train/val
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# Save
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
```

**Train**:

```bash
python data/custom/prepare.py
python train.py --dataset=custom
```

## When to use vs alternatives

**Use nanoGPT when**:

- Learning how GPT works
- Experimenting with transformer variants
- Teaching/education purposes
- Quick prototyping
- Limited compute (can run on CPU)

**Simplicity advantages**:

- **~300 lines**: Entire model in `model.py`
- **~300 lines**: Training loop in `train.py`
- **Hackable**: Easy to modify
- **No abstractions**: Pure PyTorch

**Use alternatives instead**:

- **HuggingFace Transformers**: Production use, many models
- **Megatron-LM**: Large-scale distributed training
- **LitGPT**: More architectures, production-ready
- **PyTorch Lightning**: Need a high-level framework

## Common issues

**Issue: CUDA out of memory**

Reduce batch size or context length:

```python
batch_size = 1    # Reduce from 12
block_size = 512  # Reduce from 1024
gradient_accumulation_steps = 40  # Increase to maintain effective batch
```

**Issue: Training too slow**

Enable compilation (PyTorch 2.0+):

```python
compile = True  # 2× speedup
```

Use mixed precision:

```python
dtype = 'bfloat16'  # Or 'float16'
```

**Issue: Poor generation quality**

Train longer:

```python
max_iters = 10000  # Increase from 5000
```

Lower the temperature:

```python
# In sample.py
temperature = 0.7  # Lower from 1.0
top_k = 200        # Add top-k sampling
```

**Issue: Can't load GPT-2 weights**

Install transformers:

```bash
pip install transformers
```

Check the model name:

```python
init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```

## Advanced topics

**Model architecture**: See [references/architecture.md](references/architecture.md) for the GPT block structure, multi-head attention, and MLP layers explained simply.

**Training loop**: See [references/training.md](references/training.md) for the learning rate schedule, gradient accumulation, and distributed data parallel setup.

**Data preparation**: See [references/data.md](references/data.md) for tokenization strategies (character-level vs BPE) and binary format details.

## Hardware requirements

- **Shakespeare (char-level)**:
  - CPU: 5 minutes
  - GPU (T4): 1 minute
  - VRAM: <1GB
- **GPT-2 (124M)**:
  - 1× A100: ~1 week
  - 8× A100: ~4 days
  - VRAM: ~16GB per GPU
- **GPT-2 Medium (350M)**:
  - 8× A100: ~2 weeks
  - VRAM: ~40GB per GPU

**Performance**:

- With `compile=True`: 2× speedup
- With `dtype=bfloat16`: 50% memory reduction

## Resources

- GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
- Video: "Let's build GPT" by Andrej Karpathy
- Paper: "Attention is All You Need" (Vaswani et al.)
- OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
- Educational: Best for understanding transformers from scratch

### 01-model-architecture/rwkv/SKILL.md (7099 bytes)
---
name: rwkv-architecture
description: RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like an RNN (sequential). Linux Foundation AI project. In production in Windows, Office, and NeMo. RWKV-7 (March 2025). Models up to 14B parameters.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [RWKV, Model Architecture, RNN, Transformer Hybrid, Linear Complexity, Infinite Context, Efficient Inference, Linux Foundation, Alternative Architecture]
dependencies: [rwkv, torch, transformers]
---

# RWKV - Receptance Weighted Key Value

## Quick start

RWKV (pronounced "RwaKuv") combines Transformer parallelization (training) with RNN efficiency (inference).

**Installation**:

```bash
# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade

# Install RWKV
pip install rwkv
```

**Basic usage** (GPT mode + RNN mode):

```python
import os

# Set before importing rwkv so the flags take effect
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'  # Use CUDA kernel for speed

from rwkv.model import RWKV

# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)

# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy())  # Logits

# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None)   # First 2 tokens
out, state = model.forward([1563], state)      # Next token
out, state = model.forward([310, 247], state)  # Last tokens
print(out.detach().cpu().numpy())  # Same logits as above!
```

## Common workflows

### Workflow 1: Text generation (streaming)

**Efficient token-by-token generation**:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")

# Feed the prompt in one forward pass
prompt = "The future of AI is"
out, state = model.forward(pipeline.encode(prompt), None)

# Generate token by token, feeding each sample back in
for _ in range(100):
    token = pipeline.sample_logits(out)
    print(pipeline.decode([token]), end='', flush=True)
    out, state = model.forward([token], state)
```

**Key advantage**: Constant memory per token (no growing KV cache)

### Workflow 2: Long context processing (infinite context)

**Process million-token sequences**:

```python
model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')

# Process a very long document
state = None
long_document = load_document()  # e.g., 1M tokens

# Stream through the entire document
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)

# The state now contains information from the entire 1M-token document.
# Memory usage: O(1) (constant, not O(n)!)
```

### Workflow 3: Fine-tuning RWKV

**Standard fine-tuning workflow** (sketch; actual fine-tuning uses the training scripts in the RWKV-LM repo):

```python
import pytorch_lightning as pl
from rwkv.model import RWKV

# Configure model
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024
}

# Set up trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1
)

# Train (train_dataloader is user-provided)
model = RWKV(config)
trainer.fit(model, train_dataloader)
```

### Workflow 4: RWKV vs Transformer comparison

**Memory comparison** (1M-token sequence):

```python
# Transformer (GPT)
# Memory: O(n²) for attention
# KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
# Example: 1M × 4096 × 24 × 2 = ~400GB (impractical!)
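# The arithmetic above can be checked directly. A sketch: this assumes an
# fp16 KV cache for the Transformer and a single fp32 state vector per
# layer for RWKV; real RWKV keeps a few state tensors per layer, so treat
# the exact KB figure as an estimate.
seq_len, hidden_dim, n_layers = 1_000_000, 4096, 24
kv_cache_bytes = seq_len * hidden_dim * n_layers * 2 * 2  # keys+values, 2 bytes each
rwkv_state_bytes = hidden_dim * n_layers * 4              # one fp32 state
print(kv_cache_bytes / 1e9, rwkv_state_bytes / 1e3)       # GB vs KB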
# RWKV
# Memory: O(1) per token
# State: hidden_dim × n_layers = 4096 × 24 ≈ 98K values (~400KB)
# 1,000,000× more efficient!
```

**Speed comparison** (inference):

```python
# Transformer: O(n) per token (quadratic overall)
# First token: 1 computation
# Second token: 2 computations
# ...
# 1000th token: 1000 computations

# RWKV: O(1) per token (linear overall)
# Every token: 1 computation
# 1000th token: 1 computation (same as the first!)
```

## When to use vs alternatives

**Use RWKV when**:

- You need very long context (100K+ tokens)
- You want constant memory usage
- Building streaming applications
- You need RNN efficiency with Transformer-level quality
- Memory-constrained deployment

**Key advantages**:

- **Linear time**: O(n) vs O(n²) for Transformers
- **No KV cache**: Constant memory per token
- **Infinite context**: No fixed window limit
- **Parallelizable training**: Like GPT
- **Sequential inference**: Like an RNN

**Use alternatives instead**:

- **Transformers**: Need the absolute best performance and have the compute
- **Mamba**: Want state-space models
- **RetNet**: Need the retention mechanism
- **Hyena**: Want a convolution-based approach

## Common issues

**Issue: Out of memory during training**

Use gradient checkpointing and DeepSpeed:

```python
trainer = pl.Trainer(
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16'
)
```

**Issue: Slow inference**

Enable the CUDA kernel:

```python
os.environ["RWKV_CUDA_ON"] = '1'
```

**Issue: Model not loading**

Check the model path and strategy:

```python
model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)
```

**Issue: State management in RNN mode**

Always pass the state between forward calls:

```python
# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None)  # No context from tokens1!

# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state)  # Has context from tokens1
```

## Advanced topics

**Time-mixing and channel-mixing**: See [references/architecture-details.md](references/architecture-details.md) for the WKV operation, time-decay mechanism, and receptance gates.

**State management**: See [references/state-management.md](references/state-management.md) for the att_x_prev, att_kv, and ffn_x_prev states, and numerical stability considerations.

**RWKV-7 improvements**: See [references/rwkv7.md](references/rwkv7.md) for the latest architectural improvements (March 2025) and multimodal capabilities.

## Hardware requirements

- **GPU**: NVIDIA (CUDA 11.6+) or CPU
- **VRAM** (FP16):
  - 169M model: 1GB
  - 430M model: 2GB
  - 1.5B model: 4GB
  - 3B model: 8GB
  - 7B model: 16GB
  - 14B model: 32GB
- **Inference**: O(1) memory per token
- **Training**: Parallelizable like GPT

**Performance** (vs Transformers):

- **Speed**: Similar training, faster inference
- **Memory**: 1000× less for long sequences
- **Scaling**: Linear vs quadratic

## Resources

- Paper (RWKV): https://arxiv.org/abs/2305.13048 (May 2023)
- Paper (RWKV-7): https://arxiv.org/abs/2503.14456 (March 2025)
- GitHub: https://github.com/BlinkDL/RWKV-LM ⭐ 12,000+
- Docs: https://wiki.rwkv.com/
- Models: https://huggingface.co/BlinkDL
- Linux Foundation AI: Official project
- Production: Microsoft Windows, Office integration, NeMo support

### 01-model-architecture/torchtitan/SKILL.md (8927 bytes)
---
name: distributed-llm-pretraining-torchtitan
description: Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, Distributed Training, TorchTitan, FSDP2, Tensor Parallel, Pipeline Parallel, Context Parallel, Float8, Llama, Pretraining]
dependencies: [torch>=2.6.0, torchtitan>=0.2.0, torchao>=0.5.0]
---

# TorchTitan - PyTorch Native Distributed LLM Pretraining

## Quick start

TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

**Installation**:

```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:

```bash
# Get an HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

## Common workflows

### Workflow 1: Pretrain Llama 3.1 8B on single node

Copy this checklist:

```
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint
```

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:

```toml
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

**Step 3: Launch training**

```bash
# 8 GPUs on a single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```

**Step 4: Monitor and checkpoint**

TensorBoard logs are saved to `./outputs/tb/`:

```bash
tensorboard --logdir ./outputs/tb
```

### Workflow 2: Multi-node training with SLURM

```
Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint
```

**Step 1: Configure parallelism for scale**

For a 70B model on 256 GPUs (32 nodes):

```toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 1     # No PP for 70B
context_parallel_degree = 1      # Increase for long sequences
```

**Step 2: Set up SLURM script**

```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```

**Step 3: Submit job**

```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training auto-resumes if a checkpoint exists in the configured folder.

### Workflow 3: Enable Float8 training for H100s

Float8 provides a 30-50% speedup on H100 GPUs.

```
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile
```

**Step 1: Install torchao**

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**

Add to your TOML config:

```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]
```

**Step 3: Launch with compile**

```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```

### Workflow 4: 4D parallelism for 405B models

```
4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs
```

**Step 1: Create seed checkpoint**

Required for consistent initialization across PP stages:

```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**

```toml
[parallelism]
data_parallel_shard_degree = 8  # FSDP
tensor_parallel_degree = 8      # TP within node
pipeline_parallel_degree = 8    # PP across nodes
context_parallel_degree = 1     # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```

**Step 3: Launch on 512 GPUs**

```bash
# 64 nodes × 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```

## When to use vs alternatives

**Use TorchTitan when:**

- Pretraining LLMs from scratch (8B to 405B+)
- You need a PyTorch-native solution without third-party dependencies
- You require composable 4D parallelism (FSDP2, TP, PP, CP)
- Training on H100s with Float8 support
- You want checkpoints interoperable with torchtune/HuggingFace

**Use alternatives instead:**

- **Megatron-LM**: Maximum performance for NVIDIA-only deployments
- **DeepSpeed**: Broader ZeRO optimization ecosystem, inference support
- **Axolotl/TRL**: Fine-tuning rather than pretraining
- **LitGPT**: Educational, smaller-scale training

## Common issues

**Issue: Out of memory on large models**

Enable full activation checkpointing and reduce batch size:

```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

Or use gradient accumulation:

```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```

**Issue: TP causes high memory with async collectives**

Set this environment variable:

```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**

Float8 only benefits large GEMMs.
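A rough way to see why (an illustrative sketch only; the dimensions below are made up, and real Float8 gains also depend on kernel overheads and scaling-factor bookkeeping): a matmul with small inner dimensions has low arithmetic intensity, so it is memory-bandwidth-bound and faster tensor-core math cannot speed it up.

```python
# Arithmetic intensity (FLOPs per byte moved) of an (M x K) @ (K x N) GEMM.
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_el: int = 1) -> float:
    flops = 2 * m * k * n                             # multiply-accumulates
    traffic = (m * k + k * n + m * n) * bytes_per_el  # read A and B, write C
    return flops / traffic

# A large MLP projection: high intensity, compute-bound, Float8 helps.
print(gemm_arithmetic_intensity(8192, 4096, 14336))
# A tiny layer: low intensity, bandwidth-bound, Float8 adds overhead only.
print(gemm_arithmetic_intensity(8192, 64, 64))
```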
Filter small layers out of the Float8 conversion:

```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:

```bash
# Convert a sharded checkpoint to a single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt
```

**Issue: Pipeline parallelism initialization**

Create a seed checkpoint first (see Workflow 4, Step 1).

## Supported models

| Model | Sizes | Status |
|-------|-------|--------|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

## Performance benchmarks (H100)

| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|-------|------|-------------|---------|------------|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |

## Advanced topics

**FSDP2 configuration**: See [references/fsdp.md](references/fsdp.md) for a detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.

**Float8 training**: See [references/float8.md](references/float8.md) for tensorwise vs rowwise scaling recipes.

**Checkpointing**: See [references/checkpoint.md](references/checkpoint.md) for HuggingFace conversion and async checkpointing.

**Adding custom models**: See [references/custom-models.md](references/custom-models.md) for the TrainSpec protocol.

## Resources

- GitHub: https://github.com/pytorch/torchtitan
- Paper: https://arxiv.org/abs/2410.06511
- ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
- PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44

### 02-tokenization/huggingface-tokenizers/SKILL.md (13674 bytes)
--- name: huggingface-tokenizers description: Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training. version: 1.0.0 author: Orchestra Research license: MIT tags: [Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production] dependencies: [tokenizers, transformers, datasets] --- # HuggingFace Tokenizers - Fast Tokenization for NLP Fast, production-ready tokenizers with Rust performance and Python ease-of-use. ## When to use HuggingFace Tokenizers **Use HuggingFace Tokenizers when:** - Need extremely fast tokenization (<20s per GB of text) - Training custom tokenizers from scratch - Want alignment tracking (token → original text position) - Building production NLP pipelines - Need to tokenize large corpora efficiently **Performance**: - **Speed**: <20 seconds to tokenize 1GB on CPU - **Implementation**: Rust core with Python/Node.js bindings - **Efficiency**: 10-100× faster than pure Python implementations **Use alternatives instead**: - **SentencePiece**: Language-independent, used by T5/ALBERT - **tiktoken**: OpenAI's BPE tokenizer for GPT models - **transformers AutoTokenizer**: Loading pretrained only (uses this library internally) ## Quick start ### Installation ```bash # Install tokenizers pip install tokenizers # With transformers integration pip install tokenizers transformers ``` ### Load pretrained tokenizer ```python from tokenizers import Tokenizer # Load from HuggingFace Hub tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # Encode text output = tokenizer.encode("Hello, how are you?") print(output.tokens) # ['hello', ',', 'how', 'are', 'you', '?'] print(output.ids) # [7592, 1010, 2129, 
2024, 2017, 1029] # Decode back text = tokenizer.decode(output.ids) print(text) # "hello, how are you?" ``` ### Train custom BPE tokenizer ```python from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import Whitespace # Initialize tokenizer with BPE model tokenizer = Tokenizer(BPE(unk_token="[UNK]")) tokenizer.pre_tokenizer = Whitespace() # Configure trainer trainer = BpeTrainer( vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], min_frequency=2 ) # Train on files files = ["train.txt", "validation.txt"] tokenizer.train(files, trainer) # Save tokenizer.save("my-tokenizer.json") ``` **Training time**: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB ### Batch encoding with padding ```python # Enable padding tokenizer.enable_padding(pad_id=3, pad_token="[PAD]") # Encode batch texts = ["Hello world", "This is a longer sentence"] encodings = tokenizer.encode_batch(texts) for encoding in encodings: print(encoding.ids) # [101, 7592, 2088, 102, 3, 3, 3] # [101, 2023, 2003, 1037, 2936, 6251, 102] ``` ## Tokenization algorithms ### BPE (Byte-Pair Encoding) **How it works**: 1. Start with character-level vocabulary 2. Find most frequent character pair 3. Merge into new token, add to vocabulary 4. 
Repeat until vocabulary size reached **Used by**: GPT-2, GPT-3, RoBERTa, BART, DeBERTa ```python from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import ByteLevel tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>")) tokenizer.pre_tokenizer = ByteLevel() trainer = BpeTrainer( vocab_size=50257, special_tokens=["<|endoftext|>"], min_frequency=2 ) tokenizer.train(files=["data.txt"], trainer=trainer) ``` **Advantages**: - Handles OOV words well (breaks into subwords) - Flexible vocabulary size - Good for morphologically rich languages **Trade-offs**: - Tokenization depends on merge order - May split common words unexpectedly ### WordPiece **How it works**: 1. Start with character vocabulary 2. Score merge pairs: `frequency(pair) / (frequency(first) × frequency(second))` 3. Merge highest scoring pair 4. Repeat until vocabulary size reached **Used by**: BERT, DistilBERT, MobileBERT ```python from tokenizers import Tokenizer from tokenizers.models import WordPiece from tokenizers.trainers import WordPieceTrainer from tokenizers.pre_tokenizers import Whitespace from tokenizers.normalizers import BertNormalizer tokenizer = Tokenizer(WordPiece(unk_token="[UNK]")) tokenizer.normalizer = BertNormalizer(lowercase=True) tokenizer.pre_tokenizer = Whitespace() trainer = WordPieceTrainer( vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], continuing_subword_prefix="##" ) tokenizer.train(files=["corpus.txt"], trainer=trainer) ``` **Advantages**: - Prioritizes meaningful merges (high score = semantically related) - Used successfully in BERT (state-of-the-art results) **Trade-offs**: - Unknown words become `[UNK]` if no subword match - Saves vocabulary, not merge rules (larger files) ### Unigram **How it works**: 1. Start with large vocabulary (all substrings) 2. Compute loss for corpus with current vocabulary 3. Remove tokens with minimal impact on loss 4. 
Repeat until vocabulary size reached **Used by**: ALBERT, T5, mBART, XLNet (via SentencePiece) ```python from tokenizers import Tokenizer from tokenizers.models import Unigram from tokenizers.trainers import UnigramTrainer tokenizer = Tokenizer(Unigram()) trainer = UnigramTrainer( vocab_size=8000, special_tokens=["<unk>", "<s>", "</s>"], unk_token="<unk>" ) tokenizer.train(files=["data.txt"], trainer=trainer) ``` **Advantages**: - Probabilistic (finds most likely tokenization) - Works well for languages without word boundaries - Handles diverse linguistic contexts **Trade-offs**: - Computationally expensive to train - More hyperparameters to tune ## Tokenization pipeline Complete pipeline: **Normalization → Pre-tokenization → Model → Post-processing** ### Normalization Clean and standardize text: ```python from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence tokenizer.normalizer = Sequence([ NFD(), # Unicode normalization (decompose) Lowercase(), # Convert to lowercase StripAccents() # Remove accents ]) # Input: "Héllo WORLD" # After normalization: "hello world" ``` **Common normalizers**: - `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms - `Lowercase()` - Convert to lowercase - `StripAccents()` - Remove accents (é → e) - `Strip()` - Remove whitespace - `Replace(pattern, content)` - Regex replacement ### Pre-tokenization Split text into word-like units: ```python from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel # Split on whitespace and punctuation tokenizer.pre_tokenizer = Sequence([ Whitespace(), Punctuation() ]) # Input: "Hello, world!" 
# After pre-tokenization: ["Hello", ",", "world", "!"]
```

**Common pre-tokenizers**:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)

### Post-processing

Add special tokens for model input:

```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

**Common patterns**:

```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```

## Alignment tracking

Track token positions in the original text:

```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
```

**Use cases**:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)

## Integration with transformers

### Load with AutoTokenizer

```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

### Convert custom tokenizer to transformers

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

## Common patterns

### Train from iterator (large datasets)

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```

**Performance**: Processes 1GB in ~10-20 minutes

### Enable truncation and padding

```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or
                # None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

### Multi-processing

```python
from tokenizers import Tokenizer
from multiprocessing import Pool

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# corpus: a list of raw text strings to encode
corpus = ["Example sentence to tokenize."] * 100_000

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

**Speedup**: 5-8× with 8 cores

## Performance benchmarks

### Training speed

| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|-------------|-----------------|-----------------|--------------|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |

**Hardware**: 16-core CPU, tested on English Wikipedia

### Tokenization speed

| Implementation | 1 GB corpus | Throughput |
|----------------|-------------|------------|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| **Speedup** | **80×** | **80×** |

**Test**: English text, average sentence length 20 words

### Memory usage

| Task | Memory |
|------|--------|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |

## Supported models

Pre-trained tokenizers available via `from_pretrained()`:

**BERT family**:
- `bert-base-uncased`, `bert-large-cased`
- `distilbert-base-uncased`
- `roberta-base`, `roberta-large`

**GPT family**:
- `gpt2`, `gpt2-medium`, `gpt2-large`
- `distilgpt2`

**T5 family**:
- `t5-small`, `t5-base`, `t5-large`
- `google/flan-t5-xxl`

**Other**:
- `facebook/bart-base`, `facebook/mbart-large-cc25`
- `albert-base-v2`, `albert-xlarge-v2`
- `xlm-roberta-base`, `xlm-roberta-large`

Browse all:
https://huggingface.co/models?library=tokenizers

## References

- **[Training Guide](references/training.md)** - Train custom tokenizers, configure trainers, handle large datasets
- **[Algorithms Deep Dive](references/algorithms.md)** - BPE, WordPiece, Unigram explained in detail
- **[Pipeline Components](references/pipeline.md)** - Normalizers, pre-tokenizers, post-processors, decoders
- **[Transformers Integration](references/integration.md)** - AutoTokenizer, PreTrainedTokenizerFast, special tokens

## Resources

- **Docs**: https://huggingface.co/docs/tokenizers
- **GitHub**: https://github.com/huggingface/tokenizers ⭐ 9,000+
- **Version**: 0.20.0+
- **Course**: https://huggingface.co/learn/nlp-course/chapter6/1
- **Paper**: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)
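As a worked illustration of the BPE training loop described above (find the most frequent adjacent pair, merge it into a new symbol, repeat), here is a toy sketch. `toy_bpe` is a hypothetical name for this didactic version — it is not the library's Rust implementation, and real trainers add pre-tokenization, byte-level handling, and frequency thresholds:

```python
# Toy BPE (didactic sketch, not the tokenizers library):
# repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def toy_bpe(words, num_merges):
    """words: list of (word, frequency) pairs; returns the learned merges."""
    corpus = [(list(w), f) for w, f in words]  # start at character level
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere
        new_corpus = []
        for symbols, freq in corpus:
            i, out = 0, []
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append((out, freq))
        corpus = new_corpus
    return merges

# Classic example corpus (Sennrich et al., 2016)
merges = toy_bpe([("low", 5), ("lower", 2), ("newest", 6), ("widest", 3)], 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The first two merges build the frequent suffix `est` from `newest`/`widest` — exactly the "most frequent pair first" behavior the algorithm section describes.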
---
name: mamba-architecture
description: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, Mamba, State Space Models, SSM, Linear Complexity, Long Context, Efficient Inference, Hardware-Aware, Alternative To Transformers]
dependencies: [mamba-ssm, torch, transformers, causal-conv1d]
---

# Mamba - Selective State Space Models

## Quick start

Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.

**Installation**:

```bash
# Install causal-conv1d (optional, for efficiency)
# Quote the requirement so the shell doesn't treat '>' as a redirect
pip install 'causal-conv1d>=1.4.0'

# Install Mamba
pip install mamba-ssm

# Or both together
pip install 'mamba-ssm[causal-conv1d]'
```

**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+

**Basic usage** (Mamba block):

```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,   # Model dimension
    d_state=16,    # SSM state dimension
    d_conv=4,      # Conv1d kernel size
    expand=2       # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape ``` ## Common workflows ### Workflow 1: Language model with Mamba-2 **Complete LM with generation**: ```python from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel from mamba_ssm.models.config_mamba import MambaConfig import torch # Configure Mamba-2 LM config = MambaConfig( d_model=1024, # Hidden dimension n_layer=24, # Number of layers vocab_size=50277, # Vocabulary size ssm_cfg=dict( layer="Mamba2", # Use Mamba-2 d_state=128, # Larger state for Mamba-2 headdim=64, # Head dimension ngroups=1 # Number of groups ) ) model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16) # Generate text input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long) output = model.generate( input_ids=input_ids, max_length=100, temperature=0.7, top_p=0.9 ) ``` ### Workflow 2: Use pretrained Mamba models **Load from HuggingFace**: ```python from transformers import AutoTokenizer from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel # Load pretrained model model_name = "state-spaces/mamba-2.8b" tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b") # Use compatible tokenizer model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16) # Generate prompt = "The future of AI is" input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda") output_ids = model.generate( input_ids=input_ids, max_length=200, temperature=0.7, top_p=0.9, repetition_penalty=1.2 ) generated_text = tokenizer.decode(output_ids[0]) print(generated_text) ``` **Available models**: - `state-spaces/mamba-130m` - `state-spaces/mamba-370m` - `state-spaces/mamba-790m` - `state-spaces/mamba-1.4b` - `state-spaces/mamba-2.8b` ### Workflow 3: Mamba-1 vs Mamba-2 **Mamba-1** (smaller state): ```python from mamba_ssm import Mamba model = Mamba( d_model=256, d_state=16, # Smaller state dimension d_conv=4, expand=2 ).to("cuda") ``` **Mamba-2** (multi-head, larger state): ```python from mamba_ssm import Mamba2 
model = Mamba2( d_model=256, d_state=128, # Larger state dimension d_conv=4, expand=2, headdim=64, # Head dimension for multi-head ngroups=1 # Parallel groups ).to("cuda") ``` **Key differences**: - **State size**: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128) - **Architecture**: Mamba-2 has multi-head structure - **Normalization**: Mamba-2 uses RMSNorm - **Distributed**: Mamba-2 supports tensor parallelism ### Workflow 4: Benchmark vs Transformers **Generation speed comparison**: ```bash # Benchmark Mamba python benchmarks/benchmark_generation_mamba_simple.py \ --model-name "state-spaces/mamba-2.8b" \ --prompt "The future of machine learning is" \ --topp 0.9 --temperature 0.7 --repetition-penalty 1.2 # Benchmark Transformer python benchmarks/benchmark_generation_mamba_simple.py \ --model-name "EleutherAI/pythia-2.8b" \ --prompt "The future of machine learning is" \ --topp 0.9 --temperature 0.7 --repetition-penalty 1.2 ``` **Expected results**: - **Mamba**: 5× faster inference - **Memory**: No KV cache needed - **Scaling**: Linear with sequence length ## When to use vs alternatives **Use Mamba when**: - Need long sequences (100K+ tokens) - Want faster inference than Transformers - Memory-constrained (no KV cache) - Building streaming applications - Linear scaling important **Advantages**: - **O(n) complexity**: Linear vs quadratic - **5× faster inference**: No attention overhead - **No KV cache**: Lower memory usage - **Million-token sequences**: Hardware-efficient - **Streaming**: Constant memory per token **Use alternatives instead**: - **Transformers**: Need best-in-class performance, have compute - **RWKV**: Want RNN+Transformer hybrid - **RetNet**: Need retention-based architecture - **Hyena**: Want convolution-based approach ## Common issues **Issue: CUDA out of memory** Reduce batch size or use gradient checkpointing: ```python model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16) model.gradient_checkpointing_enable() # Enable checkpointing 
```

**Issue: Slow installation**

Build against the PyTorch already installed in your environment rather than an isolated build environment:

```bash
pip install mamba-ssm --no-build-isolation
```

**Issue: Missing causal-conv1d**

Install it separately (quoted so the shell doesn't treat `>` as a redirect):

```bash
pip install 'causal-conv1d>=1.4.0'
```

**Issue: Model not loading from HuggingFace**

Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):

```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

## Advanced topics

**Selective SSM**: See [references/selective-ssm.md](references/selective-ssm.md) for the mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.

**Mamba-2 architecture**: See [references/mamba2-details.md](references/mamba2-details.md) for the multi-head structure, tensor parallelism, and distributed training setup.

**Performance optimization**: See [references/performance.md](references/performance.md) for hardware-aware design, CUDA kernels, and memory-efficiency techniques.

## Hardware requirements

- **GPU**: NVIDIA with CUDA 11.6+
- **VRAM**:
  - 130M model: 2GB
  - 370M model: 4GB
  - 790M model: 8GB
  - 1.4B model: 14GB
  - 2.8B model: 28GB (FP16)

**Performance** (vs Transformers):
- **Speed**: 5× faster inference
- **Memory**: 50% less (no KV cache)
- **Scaling**: Linear vs quadratic

## Resources

- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
- GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
- Models: https://huggingface.co/state-spaces
- Docs: Repository README and wiki
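To make the "no KV cache" claim concrete, here is a minimal sketch of a diagonal linear SSM run as a recurrence — toy scalar values, plain Python, not the selective Mamba CUDA kernels. The point is that the state `h` has a fixed size (`d_state`) no matter how long the sequence gets, so per-token memory is constant, unlike attention's cache, which grows with sequence length:

```python
# Didactic sketch of a diagonal linear SSM recurrence (not mamba_ssm).
# h[i] <- a[i]*h[i] + b[i]*x    (state update, fixed-size state)
# y    <- sum_i c[i]*h[i]       (readout)

def ssm_scan(a, b, c, xs):
    """Scan a 1-D input sequence xs through a diagonal SSM."""
    h = [0.0] * len(a)  # fixed-size state, reused every step (no cache growth)
    ys = []
    for x in xs:        # O(n): one pass over the sequence
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys

d_state = 4
a = [0.5] * d_state   # toy decay rates (learned in the real model)
b = [1.0] * d_state
c = [0.25] * d_state

# An impulse input: the output decays as the state does
ys = ssm_scan(a, b, c, [1.0, 0.0, 0.0])
print(ys)  # [1.0, 0.5, 0.25]
```

Mamba's selectivity makes `a`, `b`, and the step size input-dependent, and the CUDA kernels parallelize this scan, but the constant-state, linear-time structure is exactly what the sketch shows.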
--- name: autoresearch description: Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort. version: 1.0.0 author: Orchestra Research license: MIT tags: [Autonomous Research, Two-Loop Architecture, Experiment Orchestration, Research Synthesis, Project Management] --- # Autoresearch Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution. You are a research project manager, not a domain expert. You orchestrate; the domain skills execute. **This runs fully autonomously.** Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own. ## Getting Started Users arrive in different states. 
Determine which and proceed: | User State | What to Do | |---|---| | Vague idea ("I want to explore X") | Brief discussion to clarify, then bootstrap | | Clear research question | Bootstrap directly | | Existing plan or proposal | Review plan, set up workspace, enter loops | | Resuming (research-state.yaml exists) | Read state, continue from where you left off | If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching. **Step 0 — before anything else**: Set up the agent continuity loop. See [Agent Continuity](#agent-continuity-mandatory--set-up-first). This is MANDATORY. Without it, the research stops after one cycle. ### Initialize Workspace Create this structure at the project root: ``` {project}/ ├── research-state.yaml # Central state tracking ├── research-log.md # Decision timeline ├── findings.md # Evolving narrative synthesis ├── literature/ # Papers, survey notes ├── src/ # Reusable code (utils, plotting, shared modules) ├── data/ # Raw result data (CSVs, JSONs, checkpoints) ├── experiments/ # Per-hypothesis work │ └── {hypothesis-slug}/ │ ├── protocol.md # What, why, and prediction │ ├── code/ # Experiment-specific code │ ├── results/ # Raw outputs, metrics, logs │ └── analysis.md # What we learned ├── to_human/ # Progress presentations and reports for human review └── paper/ # Final paper (via ml-paper-writing) ``` - **`src/`**: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory. - **`data/`**: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g., `trajectory_H1_runs001-010.csv`). 
Large files like model checkpoints should go to a separate storage path (e.g., `/data/`, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory. Initialize `research-state.yaml`, `research-log.md`, and `findings.md` from [templates/](templates/). Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement. ## The Two-Loop Architecture This is the core engine. Everything else supports it. ``` BOOTSTRAP (once, lightweight) Scope question → search literature → form initial hypotheses INNER LOOP (fast, autonomous, repeating) Pick hypothesis → experiment → measure → record → learn → next Goal: run constrained experiments with clear measurable outcomes OUTER LOOP (periodic, reflective) Review results → find patterns → update findings.md → new hypotheses → decide direction Goal: synthesize understanding, find the story — this is where novelty comes from FINALIZE (when concluding) Write paper via ml-paper-writing → final presentation → archive ``` The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results *mean*? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover. There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm. ### Research is Non-Linear The two-loop structure is a rhythm, not a railroad. 
At any point during research you can and should: - **Return to literature** when results surprise you, assumptions break, or you need context for a new direction — always save what you find to `literature/` - **Brainstorm new ideas** using `21-research-ideation/` skills when you're stuck or when results open unexpected questions - **Pivot the question entirely** if experiments reveal the original question was wrong or less interesting than what you found This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help. ## Bootstrap: Literature and Hypotheses Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey. 1. **Search literature** for the research question. Use multiple sources — never stop at one: - **Exa MCP** (`web_search_exa`) if available — best for broad discovery and finding relevant papers quickly - **Semantic Scholar** (`pip install semanticscholar`) — best for ML/AI papers, citation graphs, and specific paper lookup. See `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code examples - **arXiv** (`pip install arxiv`) — best for recent preprints and open-access papers - **CrossRef** — best for DOI lookup and BibTeX retrieval - Keep searching until you have good coverage. If one source comes up empty, try another with different keywords **Save everything to `literature/`**: For every paper you find, save a summary to `literature/` — title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a running `literature/survey.md` with all summaries. This is your reference library — you and future sessions will need it throughout the project. 2. **Identify gaps** from the literature - What's been tried? What hasn't? 
Where do existing methods break? - What do Discussion sections flag as future work? 3. **Form initial hypotheses** — invoke `21-research-ideation/` skills - `brainstorming-research-ideas` for structured diverge-converge workflow - `creative-thinking-for-research` for deeper cognitive frameworks - Each hypothesis must be testable with a clear prediction 4. **Define the evaluation** - Set the proxy metric and baseline before running experiments - The metric should be computable quickly (minutes, not hours) - Lock evaluation criteria upfront to prevent unconscious metric gaming 5. **Record** in research-state.yaml, log the bootstrap in research-log.md ## The Inner Loop Rapid iteration with clear measurable outcomes. Two flavors: - **Optimization**: make a metric go up/down (val_loss, accuracy, throughput). Think Karpathy's autoresearch. - **Discovery**: test mechanistic hypotheses about why something works. The metric is a measurement (does grokking happen faster? does entropy increase before forgetting?), not just a target to optimize. ``` 1. Pick the highest-priority untested hypothesis 2. Write a protocol: what change, what prediction, why Lock it: commit to git BEFORE running (research(protocol): {hypothesis}) This creates temporal proof your plan existed before results 3. Run the experiment (invoke the relevant domain skill) 4. Sanity check before trusting results: - Did training converge? No NaN/Inf? - Does baseline reproduce expected performance? - Data loading correct? (spot-check a few samples) 5. Measure the proxy metric 6. Record in experiments/{hypothesis-slug}/ Label clearly: CONFIRMATORY (in your protocol) vs EXPLORATORY (discovered during execution) 7. If positive: keep, note WHY it worked 8. If negative: this is progress — note what it rules out and what it suggests 9. Update research-state.yaml 10. 
If stuck: search literature or invoke ideation skills — don't just keep trying random things ``` **Never stop.** Even if something fails, find a path forward. Debug, adjust, simplify, or pivot — but keep the research moving. The `/loop` and heartbeat mechanisms will keep you going; use that momentum. ### Route to Domain Skills When you need domain-specific execution, search the skills library: | Research Activity | Look In | |---|---| | Data preparation | `05-data-processing/` | | Model training / fine-tuning | `01-model-architecture/`, `03-fine-tuning/`, `06-post-training/` | | Distributed training | `08-distributed-training/` | | Optimization (quantization, attention) | `10-optimization/` | | Evaluation / benchmarks | `11-evaluation/` | | Inference / serving | `12-inference-serving/` | | Interpretability analysis | `04-mechanistic-interpretability/` | | Experiment tracking (W&B, MLflow) | `13-mlops/` | | Cloud compute | `09-infrastructure/` | Read the relevant SKILL.md before starting — it has workflows, common issues, and code examples. See [references/skill-routing.md](references/skill-routing.md) for a complete guide. ### Track the Experiment Trajectory Maintain a running record of measurable outcomes across experiments: ```json { "experiment_id": "run_014", "hypothesis": "H3", "metric_value": 0.847, "baseline": 0.812, "delta": "+0.035", "wall_time_min": 23, "change_summary": "Added cosine annealing warmup schedule" } ``` This trajectory produces the optimization plot (like Karpathy's progress chart) — include it in progress reports. Humans love seeing the upward curve. ## The Outer Loop Step back from individual experiments. Synthesize. ``` 1. Review all results since last reflection 2. Cluster by type: what kinds of changes worked? Which didn't? 3. Ask WHY — identify the mechanism behind successes and failures 4. Update findings.md with current understanding 5. Search literature if results were surprising or assumptions need revisiting 6. 
Generate new hypotheses if warranted (invoke 21-research-ideation/ skills) 7. Decide direction (see criteria below) 8. Update research-state.yaml with new direction 9. Log the reflection in research-log.md 10. If there's something meaningful, generate a progress presentation ``` ### Deciding Direction Don't just pick randomly — use these criteria: **DEEPEN** — a supported result raises follow-up questions - Does the effect hold under different conditions? What's the mechanism? - Action: generate sub-hypotheses (H1.1, H1.2) → back to inner loop **BROADEN** — current results are solid, but adjacent questions are untested - New questions emerged. The current contribution is clear but more is possible. - Action: generate new root hypotheses → back to inner loop **PIVOT** — results invalidate key assumptions or something more interesting appeared - A core assumption was wrong, or an unexpected finding is more promising than the original question. - Action: return to literature with new questions → re-bootstrap **CONCLUDE** — sufficient evidence for a contribution - At least one hypothesis is strongly supported (or a coherent set of negative results) - Key ablations completed, error analysis done - findings.md reads like a paper backbone — a human could write the abstract from it - No critical open questions that would change the story Note: coherent negative results are a valid contribution. "X does NOT work because Y" is publishable if the reasoning is rigorous. ### findings.md Is Your Project Memory This file serves two purposes: it's the research narrative for humans AND your accumulated knowledge base as an agent. Read it at the start of every session, /loop tick, or heartbeat to remember what you've learned. After every outer loop, update it to answer: - What do we know so far? (Current Understanding) - What patterns explain our results? (Patterns and Insights) - What specific things did we learn not to repeat? (Lessons and Constraints) - What remains open? 
(Open Questions)

The "Lessons and Constraints" section is especially important — it captures specific actionable learnings like "weight decay > 0.1 diverges at this scale" or "baseline only reproduces with batch_size=64." This prevents the agent from repeating failed approaches across sessions.

**Quality test**: After 30 inner loop experiments, a human should be able to read findings.md and write a paper abstract from it. If they can't, the outer loop isn't synthesizing — it's just logging.

## Agent Continuity (MANDATORY — Set Up First)

**Before doing anything else**, set up the wall-clock loop. This is what keeps the research running continuously. Without it, the agent stops after one cycle.

### Claude Code

**First action — run this immediately:**

```
/loop 20m Continue autoresearch. Read research-state.yaml and findings.md.
Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back
and reflect holistically — is the research making real progress? Are you
deepening understanding or just running experiments? If stalling, pivot or
search literature for new ideas. Keep making research progress — never
idle, never stop. Update findings.md, research-log.md, and
research-state.yaml when there's new progress. Git commit periodically and
clean up the repo if needed. Show the human your research progress with
key plots and findings by preparing a report in to_human/ and opening the
HTML/PDF. Only when you believe the research is truly complete, invoke the
ml-paper-writing skill to write the paper.
```

This fires every 20 minutes regardless of what's happening. It's a rhythm that keeps you working — not a research phase boundary. If your previous work isn't done, just continue it.

### OpenClaw

**First action — set up a 20-minute cron job in the current session. You MUST do this before any research work.**

Use the `cron.add` tool to create a recurring job bound to this chat session:

```json
{
  "name": "autoresearch-loop",
  "schedule": { "kind": "every", "everyMs": 1200000 },
  "sessionTarget": "current",
  "payload": {
    "kind": "agentTurn",
    "message": "Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a PDF report in to_human/ and sending it to the user via Telegram, WhatsApp, or Slack. When you get an exciting result or interesting plot, text it to the user right away — don't wait for a full report. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper."
  }
}
```

Key details:
- `sessionTarget: "current"` binds the cron to this chat session so it maintains conversation context across ticks
- `everyMs: 1200000` = 20 minutes
- After creation, verify with `cron.list` that the job exists and is enabled
- If the cron fires while you're mid-experiment, just continue — the tick is a nudge, not a restart

### What the Loop Does

The `/loop` and cron job are purely **wall-clock rhythm**. They are completely separate from your research loops (inner/outer). On each tick:

1. Read `research-state.yaml` and `findings.md` — remember where you are
2. Check if anything is broken (failed experiments, stalled training, errors)
3. If on track → keep working on whatever you were doing
4. If stuck or something's wrong → step back, diagnose, fix, then continue
5.
Never idle. Always be making progress.

## Progress Reporting

When you have something meaningful to share, create a research presentation — not just a status dashboard, but a compelling story.

**When to report** (your judgment):
- After an outer loop that found a significant pattern
- When the optimization trajectory shows clear progress (include the plot!)
- After a pivot in direction
- Before requesting human input on a decision
- When concluding

**What to include** (adapt to what's compelling):
- The research question and why it matters
- Key results with visualizations (plots, metric tables)
- The optimization trajectory chart (metric over experiments)
- What was tried and why (selective, not exhaustive)
- Current understanding (the findings narrative)
- What's planned next

For Claude Code: generate HTML and `open` it. If HTML fails to open or render, convert to PDF as fallback (use `weasyprint`, `playwright pdf`, or `wkhtmltopdf`). For OpenClaw: generate PDF directly.

See [references/progress-reporting.md](references/progress-reporting.md) for template scaffolding and the optimization plot approach. Use the template as a starting point — be creative with what you show.

## Git Protocol

Commit at natural research milestones:

| When | Message Pattern |
|---|---|
| Workspace initialized | `research(init): {project} — {question}` |
| Experiment protocol locked | `research(protocol): {hypothesis}` |
| Significant results | `research(results): {hypothesis} — {outcome}` |
| Outer loop direction change | `research(reflect): {direction} — {reason}` |
| Paper draft complete | `research(paper): {title}` |

**Hard rule**: Protocol commits MUST precede result commits. Never combine them. The git history is your lightweight pre-registration — it proves what you planned before you saw results.

Don't commit after every experiment — commit when there's meaningful progress.

## Concluding: Paper Writing

When the outer loop decides to CONCLUDE:

1. Ensure findings.md has a clear, well-supported narrative
2. Study 2-3 top related papers to learn their format, style, and section structure
3. Invoke the `20-ml-paper-writing` skill — it has LaTeX templates for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, and systems venues
4. Feed it the accumulated literature, experimental results, and findings
5. Follow its citation verification workflow — never hallucinate references
6. Generate a final comprehensive research presentation

Proceed autonomously through the writing process. If the ml-paper-writing skill suggests human collaboration points, adapt and keep going — produce the best draft you can. The human will review and provide feedback.

## Research Discipline

Principles to enforce continuously — not tied to any specific phase:

- **Lock before you run**: Commit your experiment protocol to git before executing. This proves your plan existed before you saw results. Never combine protocol + results in one commit.
- **Confirmatory vs exploratory**: Results matching your locked protocol are confirmatory. Everything else is exploratory — interesting but requiring more skepticism.
- **Negative results are progress**: A refuted hypothesis tells you something. Log what it rules out and what it suggests. Don't treat it as failure.
- **Sanity check before analysis**: Verify training converged, baselines reproduce, and data is correct before trusting your primary metric.
- **Return to literature when confused**: Don't guess — search. If results surprise you or assumptions break, go find papers. Use Exa MCP for discovery, Semantic Scholar for specific ML/AI paper lookup, arXiv for preprints.
- **Never stop**: Don't wait for human approval on routine decisions. If a skill or tool suggests collaboration, adapt and keep going. Find the best path forward autonomously. The human will see your progress reports and can redirect if needed.
- **Use whatever compute is available**: Adapt to the user's environment — local GPU, cluster job submission, cloud instances, or just CPU. If no GPU is available, use CPU and adjust experiment scale accordingly. Don't block on compute availability.

## Quality Standards

**Good agent behavior:**
- Hypotheses have mechanistic reasoning ("X because Y, predicting Z"), not just "try X"
- findings.md builds a coherent narrative, not a flat list of results
- Negative results are recorded with what they rule out
- The agent updates its model when experiments contradict expectations
- Progress reports tell a research story with compelling visualizations

**Bad agent behavior:**
- Pure hyperparameter sweeps without interpretation
- findings.md is just experiment logs copy-pasted
- Agent never revisits its assumptions after failures
- Optimizing metrics without understanding why changes work

## When to Use vs Alternatives

**Use autoresearch when:**
- You have a research question explorable through experiments
- There's a measurable proxy metric for inner loop optimization
- The real contribution requires synthesis beyond the metric
- You want continuous autonomous research operation

**Use individual domain skills instead when:**
- You have a specific one-off task (train a model, run eval, write a paper)
- No iterative experimentation needed

## Common Issues

**Inner loop stalls (no metric improvement)**
Run an outer loop. Is the metric the right one? Is the search space exhausted? Consider broadening or pivoting. Search literature for new approaches.

**Stuck and not making progress**
Don't keep trying random changes. Step back: search literature for related work, invoke `21-research-ideation/` brainstorming skills, or run an outer loop reflection. Being stuck means you need new information or a new perspective, not more experiments.

**Results contradict baseline expectations**
Investigate, don't ignore. Return to literature — your protocol might have an error, the published baseline may be wrong, or conditions differ. Update findings.md with what you learn.

**Agent loses context between ticks**
Ensure research-state.yaml and findings.md are updated after every action. These files are your memory across sessions.

**Can't find relevant papers**
Try multiple approaches in order: Exa MCP for broad search, Semantic Scholar for specific ML/AI paper lookup (`pip install semanticscholar`), arXiv for preprints (`pip install arxiv`). Check the `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code. Note: Google Scholar has no official API — use Semantic Scholar instead for programmatic search.

**No GPU available**
Use CPU and scale experiments down. Many research tasks (analysis, interpretability, small model training) run fine on CPU. Adjust experiment design to fit available compute rather than blocking.

**Experiments take longer than /loop interval**
Normal. On the next tick, check if it finished. If not, keep waiting or do something else useful (update notes, search papers). Adjust interval if needed.

**Not sure when to conclude**
Three questions: Do you have a strongly supported finding? Can you explain WHY it works? Would findings.md make a convincing paper abstract? If yes to all: conclude.

## Advanced Topics

- **Detailed agent continuity**: [references/agent-continuity.md](references/agent-continuity.md)
- **Progress presentation templates**: [references/progress-reporting.md](references/progress-reporting.md)
- **Complete skill routing**: [references/skill-routing.md](references/skill-routing.md)

.claude-plugin/marketplace.json
{ "name": "ai-research-skills", "owner": { "name": "Orchestra Research", "email": "zechen@orchestra-research.com" }, "metadata": { "description": "Comprehensive library of 98 AI research engineering skills enabling autonomous AI research from hypothesis to experimental verification", "version": "1.2.0" }, "plugins": [ { "name": "model-architecture", "description": "LLM architectures and implementations including LitGPT, Mamba, NanoGPT, RWKV, and TorchTitan. Use when implementing, training, or understanding transformer and alternative architectures.", "source": "./", "strict": false, "skills": [ "./01-model-architecture/litgpt", "./01-model-architecture/mamba", "./01-model-architecture/nanogpt", "./01-model-architecture/rwkv", "./01-model-architecture/torchtitan" ] }, { "name": "tokenization", "description": "Text tokenization for LLMs including HuggingFace Tokenizers and SentencePiece. Use when training custom tokenizers or handling multilingual text.", "source": "./", "strict": false, "skills": [ "./02-tokenization/huggingface-tokenizers", "./02-tokenization/sentencepiece" ] }, { "name": "fine-tuning", "description": "LLM fine-tuning frameworks including Axolotl, LLaMA-Factory, PEFT, and Unsloth. Use when fine-tuning models with LoRA, QLoRA, or full fine-tuning.", "source": "./", "strict": false, "skills": [ "./03-fine-tuning/axolotl", "./03-fine-tuning/llama-factory", "./03-fine-tuning/peft", "./03-fine-tuning/unsloth" ] }, { "name": "mechanistic-interpretability", "description": "Neural network interpretability tools including TransformerLens, SAELens, NNSight, and pyvene. 
Use when analyzing model internals, finding circuits, or understanding how models compute.", "source": "./", "strict": false, "skills": [ "./04-mechanistic-interpretability/nnsight", "./04-mechanistic-interpretability/pyvene", "./04-mechanistic-interpretability/saelens", "./04-mechanistic-interpretability/transformer-lens" ] }, { "name": "data-processing", "description": "Data curation and processing at scale including NeMo Curator and Ray Data. Use when preparing training datasets or processing large-scale data.", "source": "./", "strict": false, "skills": [ "./05-data-processing/nemo-curator", "./05-data-processing/ray-data" ] }, { "name": "post-training", "description": "RLHF and preference alignment including TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, and torchforge. Use when aligning models with human preferences, training reward models, or large-scale RL training.", "source": "./", "strict": false, "skills": [ "./06-post-training/grpo-rl-training", "./06-post-training/miles", "./06-post-training/openrlhf", "./06-post-training/simpo", "./06-post-training/slime", "./06-post-training/torchforge", "./06-post-training/trl-fine-tuning", "./06-post-training/verl" ] }, { "name": "safety-alignment", "description": "AI safety and content moderation including Constitutional AI, LlamaGuard, NeMo Guardrails, and Prompt Guard. Use when implementing safety filters, content moderation, or prompt injection detection.", "source": "./", "strict": false, "skills": [ "./07-safety-alignment/constitutional-ai", "./07-safety-alignment/llamaguard", "./07-safety-alignment/nemo-guardrails", "./07-safety-alignment/prompt-guard" ] }, { "name": "distributed-training", "description": "Multi-GPU and multi-node training including DeepSpeed, PyTorch FSDP, Accelerate, Megatron-Core, PyTorch Lightning, and Ray Train. 
Use when training large models across GPUs.", "source": "./", "strict": false, "skills": [ "./08-distributed-training/accelerate", "./08-distributed-training/deepspeed", "./08-distributed-training/megatron-core", "./08-distributed-training/pytorch-fsdp2", "./08-distributed-training/pytorch-lightning", "./08-distributed-training/ray-train" ] }, { "name": "infrastructure", "description": "GPU cloud and compute orchestration including Modal, Lambda Labs, and SkyPilot. Use when deploying training jobs or managing GPU resources.", "source": "./", "strict": false, "skills": [ "./09-infrastructure/lambda-labs", "./09-infrastructure/modal", "./09-infrastructure/skypilot" ] }, { "name": "optimization", "description": "Model optimization and quantization including Flash Attention, bitsandbytes, GPTQ, AWQ, GGUF, and HQQ. Use when reducing memory, accelerating inference, or quantizing models.", "source": "./", "strict": false, "skills": [ "./10-optimization/awq", "./10-optimization/bitsandbytes", "./10-optimization/flash-attention", "./10-optimization/gguf", "./10-optimization/gptq", "./10-optimization/hqq", "./10-optimization/ml-training-recipes" ] }, { "name": "evaluation", "description": "LLM benchmarking and evaluation including lm-evaluation-harness, BigCode Evaluation Harness, and NeMo Evaluator. Use when benchmarking models or measuring performance.", "source": "./", "strict": false, "skills": [ "./11-evaluation/bigcode-evaluation-harness", "./11-evaluation/lm-evaluation-harness", "./11-evaluation/nemo-evaluator" ] }, { "name": "inference-serving", "description": "Production LLM inference including vLLM, TensorRT-LLM, llama.cpp, and SGLang. 
Use when deploying models for production inference.", "source": "./", "strict": false, "skills": [ "./12-inference-serving/llama-cpp", "./12-inference-serving/sglang", "./12-inference-serving/tensorrt-llm", "./12-inference-serving/vllm" ] }, { "name": "mlops", "description": "ML experiment tracking and lifecycle including Weights & Biases, MLflow, and TensorBoard. Use when tracking experiments or managing models.", "source": "./", "strict": false, "skills": [ "./13-mlops/mlflow", "./13-mlops/tensorboard", "./13-mlops/weights-and-biases" ] }, { "name": "agents", "description": "LLM agent frameworks including LangChain, LlamaIndex, CrewAI, and AutoGPT. Use when building chatbots, autonomous agents, or tool-using systems.", "source": "./", "strict": false, "skills": [ "./14-agents/autogpt", "./14-agents/crewai", "./14-agents/langchain", "./14-agents/llamaindex" ] }, { "name": "rag", "description": "Retrieval-Augmented Generation including Chroma, FAISS, Pinecone, Qdrant, and Sentence Transformers. Use when building semantic search or document retrieval systems.", "source": "./", "strict": false, "skills": [ "./15-rag/chroma", "./15-rag/faiss", "./15-rag/pinecone", "./15-rag/qdrant", "./15-rag/sentence-transformers" ] }, { "name": "prompt-engineering", "description": "Structured LLM outputs including DSPy, Instructor, Guidance, and Outlines. Use when extracting structured data or constraining LLM outputs.", "source": "./", "strict": false, "skills": [ "./16-prompt-engineering/dspy", "./16-prompt-engineering/guidance", "./16-prompt-engineering/instructor", "./16-prompt-engineering/outlines" ] }, { "name": "observability", "description": "LLM application monitoring including LangSmith and Phoenix. 
Use when debugging LLM apps or monitoring production systems.", "source": "./", "strict": false, "skills": [ "./17-observability/langsmith", "./17-observability/phoenix" ] }, { "name": "multimodal", "description": "Vision, audio, and multimodal models including CLIP, Whisper, LLaVA, BLIP-2, Segment Anything, Stable Diffusion, AudioCraft, Cosmos Policy, OpenPI, and OpenVLA-OFT. Use when working with images, audio, multimodal tasks, or vision-language-action robot policies.", "source": "./", "strict": false, "skills": [ "./18-multimodal/audiocraft", "./18-multimodal/blip-2", "./18-multimodal/clip", "./18-multimodal/cosmos-policy", "./18-multimodal/llava", "./18-multimodal/openpi", "./18-multimodal/openvla-oft", "./18-multimodal/segment-anything", "./18-multimodal/stable-diffusion", "./18-multimodal/whisper" ] }, { "name": "emerging-techniques", "description": "Advanced ML techniques including MoE Training, Model Merging, Long Context, Speculative Decoding, Knowledge Distillation, and Model Pruning. Use when implementing cutting-edge optimization or architecture techniques.", "source": "./", "strict": false, "skills": [ "./19-emerging-techniques/knowledge-distillation", "./19-emerging-techniques/long-context", "./19-emerging-techniques/model-merging", "./19-emerging-techniques/model-pruning", "./19-emerging-techniques/moe-training", "./19-emerging-techniques/speculative-decoding" ] }, { "name": "autoresearch", "description": "Autonomous research orchestration using a two-loop architecture. Manages the full research lifecycle from literature survey to paper writing, routing to domain-specific skills for execution. Use when starting a research project, running autonomous experiments, or managing multi-hypothesis research.", "source": "./", "strict": false, "skills": [ "./0-autoresearch-skill" ] }, { "name": "ml-paper-writing", "description": "Write publication-ready ML/AI/Systems papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, OSDI, NSDI, ASPLOS, SOSP. 
Includes LaTeX templates, citation verification, reviewer guidelines, publication-quality figure generation, systems paper structural blueprints, and conference presentation slides.", "source": "./", "strict": false, "skills": [ "./20-ml-paper-writing/ml-paper-writing", "./20-ml-paper-writing/academic-plotting", "./20-ml-paper-writing/systems-paper-writing", "./20-ml-paper-writing/presenting-conference-talks" ] }, { "name": "ideation", "description": "Research ideation frameworks including structured brainstorming and creative thinking. Use when exploring new research directions, generating novel ideas, or seeking fresh angles on existing work.", "source": "./", "strict": false, "skills": [ "./21-research-ideation/brainstorming-research-ideas", "./21-research-ideation/creative-thinking-for-research" ] }, { "name": "agent-native-research-artifact", "description": "Agent-Native Research Artifact (ARA) tooling: compile any research input (paper, repo, notes) into a structured artifact, record session provenance as a post-task epilogue, and run Seal Level 2 epistemic review. Use when ingesting research into a falsifiable, agent-traversable artifact, capturing how a research project actually evolved, or auditing an ARA for evidence-claim alignment.", "source": "./", "strict": false, "skills": [ "./22-agent-native-research-artifact/compiler", "./22-agent-native-research-artifact/research-manager", "./22-agent-native-research-artifact/rigor-reviewer" ] } ] }
README
AI Research Skills Library
The most comprehensive open-source skills library enabling AI agents to autonomously conduct AI research — from idea to paper
98 Skills Powering AI Research in 2026
View All 23 Categories
| Autoresearch (1) | Ideation (2) | ML Paper Writing (2) |
|---|---|---|
| Model Architecture (5) | Fine-Tuning (4) | Post-Training (8) |
| Distributed Training (6) | Optimization (6) | Inference (4) |
| Tokenization (2) | Data Processing (2) | Evaluation (3) |
| Safety & Alignment (4) | Agents (4) | RAG (5) |
| Multimodal (7) | Prompt Engineering (4) | MLOps (3) |
| Observability (2) | Infrastructure (3) | Mech Interp (4) |
| Emerging Techniques (6) | Agent-Native Research Artifact (3) | |
Table of Contents
- Our Mission
- Path Towards AI Research Agent
- Available AI Research Engineering Skills
- Demos
- Skill Structure
- Roadmap
- Repository Structure
- Use Cases
- Contributors
- Citation
- Community
Our Mission
We enable AI agents to autonomously conduct AI research — from literature survey and idea generation through experiment execution to paper writing. The library provides both the research orchestration layer (autoresearch, ideation, paper writing) and the engineering skills (training, evaluation, deployment) needed at each stage.
System diagram of an AI research agent
Path Towards AI Research Agent
Modern AI research requires mastering dozens of specialized tools and frameworks. AI researchers spend more time debugging infrastructure than testing hypotheses — slowing the pace of scientific discovery. We provide a comprehensive skills library that enables AI agents to autonomously conduct the full research lifecycle — from brainstorming ideas to writing the paper.
- Autonomous Research - The autoresearch skill orchestrates the entire research workflow using a two-loop architecture, routing to domain skills as needed
- Specialized Expertise - Each domain skill provides deep, production-ready knowledge of a specific framework (Megatron-LM, vLLM, TRL, etc.)
- End-to-End Coverage - 98 skills spanning the full AI research lifecycle, from ideation and literature survey to experiments and paper writing
- Research-Grade Quality - Documentation sourced from official repos, real GitHub issues, and battle-tested production workflows
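The routing idea behind the orchestration layer can be sketched in a few lines of Python. This is an illustrative sketch, not the library's API: the `SKILL_ROUTES` mapping and `route` function are hypothetical, though the category directory names come from the repository layout.

```python
# Hypothetical sketch: map a research activity to the category
# directory whose SKILL.md an agent should read first.
SKILL_ROUTES = {
    "data preparation": "05-data-processing/",
    "fine-tuning": "03-fine-tuning/",
    "distributed training": "08-distributed-training/",
    "evaluation": "11-evaluation/",
    "experiment tracking": "13-mlops/",
}

def route(activity: str) -> str:
    """Return the skills directory for an activity, defaulting to the
    orchestration layer when no domain category matches."""
    return SKILL_ROUTES.get(activity.lower(), "0-autoresearch-skill/")
```

An unmatched activity falls through to the autoresearch skill, mirroring how the orchestration layer is described as the central entry point.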
Available AI Research Engineering Skills
Quality over quantity: Each skill provides comprehensive, expert-level guidance with real code examples, troubleshooting guides, and production-ready workflows.
📦 Quick Install (Recommended)
For humans — interactive installer with one command:
npx @orchestra-research/ai-research-skills
For AI agents — point your agent to the welcome doc and it handles the rest:
Read https://www.orchestra-research.com/ai-research-skills/welcome.md and follow the instructions to install and use AI Research Skills.
This installs all 98 skills, loads the autoresearch orchestration layer, and starts autonomous research.
What the installer does
- Auto-detects your installed coding agents (Claude Code, Hermes Agent, OpenCode, Cursor, Gemini CLI, etc.)
- Installs skills to `~/.orchestra/skills/` with symlinks to each agent (falls back to copy on Windows)
- Offers everything, a quickstart bundle, by category, or individual skills
- Updates installed skills with latest versions
- Uninstalls all or selected skills
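The symlink-with-copy-fallback behavior described above can be approximated as follows. This is a hedged sketch of the idea, assuming a `link_skill` helper of our own invention; it is not the installer's actual code.

```python
import shutil
from pathlib import Path

def link_skill(skill_src: Path, agent_skills_dir: Path) -> Path:
    """Expose one skill from a shared store to an agent via symlink,
    copying instead where symlinks are unavailable (e.g. Windows
    without developer mode)."""
    agent_skills_dir.mkdir(parents=True, exist_ok=True)
    dest = agent_skills_dir / skill_src.name
    if dest.exists() or dest.is_symlink():
        return dest  # already linked or copied
    try:
        dest.symlink_to(skill_src, target_is_directory=True)
    except OSError:
        # Symlink creation can fail on Windows; fall back to a copy
        shutil.copytree(skill_src, dest)
    return dest
```

The symlink-first design keeps one canonical copy per skill, so an `update` only has to touch the shared store.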
CLI Commands
```bash
# Interactive installer (recommended)
npx @orchestra-research/ai-research-skills

# Direct commands
npx @orchestra-research/ai-research-skills list     # View installed skills
npx @orchestra-research/ai-research-skills update   # Update installed skills
```
Claude Code Marketplace (Alternative)
Install skill categories directly using the Claude Code CLI:
```
# Add the marketplace
/plugin marketplace add orchestra-research/AI-research-SKILLs

# Install by category (23 categories available)
/plugin install fine-tuning@ai-research-skills        # Axolotl, LLaMA-Factory, PEFT, Unsloth
/plugin install post-training@ai-research-skills      # TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge
/plugin install inference-serving@ai-research-skills  # vLLM, TensorRT-LLM, llama.cpp, SGLang
/plugin install distributed-training@ai-research-skills
/plugin install optimization@ai-research-skills
```
All 23 Categories (98 Skills)
| Category | Skills | Included |
|---|---|---|
| Autoresearch | 1 | Autonomous research orchestration — central layer that manages the full lifecycle and routes to all other skills |
| Ideation | 2 | Research Brainstorming, Creative Thinking |
| ML Paper Writing | 2 | ML Paper Writing (LaTeX templates, citation verification), Academic Plotting |
| Model Architecture | 5 | LitGPT, Mamba, NanoGPT, RWKV, TorchTitan |
| Tokenization | 2 | HuggingFace Tokenizers, SentencePiece |
| Fine-Tuning | 4 | Axolotl, LLaMA-Factory, PEFT, Unsloth |
| Mech Interp | 4 | TransformerLens, SAELens, pyvene, nnsight |
| Data Processing | 2 | NeMo Curator, Ray Data |
| Post-Training | 8 | TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge |
| Safety | 4 | Constitutional AI, LlamaGuard, NeMo Guardrails, Prompt Guard |
| Distributed | 6 | DeepSpeed, FSDP, Accelerate, Megatron-Core, Lightning, Ray Train |
| Infrastructure | 3 | Modal, Lambda Labs, SkyPilot |
| Optimization | 6 | Flash Attention, bitsandbytes, GPTQ, AWQ, HQQ, GGUF |
| Evaluation | 3 | lm-eval-harness, BigCode, NeMo Evaluator |
| Inference | 4 | vLLM, TensorRT-LLM, llama.cpp, SGLang |
| MLOps | 3 | W&B, MLflow, TensorBoard |
| Agents | 4 | LangChain, LlamaIndex, CrewAI, AutoGPT |
| RAG | 5 | Chroma, FAISS, Pinecone, Qdrant, Sentence Transformers |
| Prompt Eng | 4 | DSPy, Instructor, Guidance, Outlines |
| Observability | 2 | LangSmith, Phoenix |
| Multimodal | 7 | CLIP, Whisper, LLaVA, BLIP-2, SAM, Stable Diffusion, AudioCraft |
| Emerging | 6 | MoE, Model Merging, Long Context, Speculative Decoding, Distillation, Pruning |
| Agent-Native Research Artifact | 3 | ARA Compiler, Research Manager, Rigor Reviewer |
View All 98 Skills in Detail
🔬 Autoresearch (1 skill) — Central Orchestration Layer
- Autoresearch - Autonomous research orchestration using a two-loop architecture (inner optimization + outer synthesis). Manages the full lifecycle from literature survey to paper writing, routing to all domain-specific skills. Supports Claude Code /loop and OpenClaw heartbeat for continuous operation (390 lines + 3 refs)
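The two-loop architecture (inner optimization plus outer synthesis) can be illustrated with a minimal Python sketch. Everything here is hypothetical scaffolding for exposition; the real skill drives an agent through these phases rather than running code like this.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    hypotheses: list
    results: list = field(default_factory=list)
    direction: str = "DEEPEN"

def inner_loop(state, run_experiment, budget=5):
    """Inner loop: run experiments against current hypotheses and
    record measurable outcomes (the optimization trajectory)."""
    for h in state.hypotheses[:budget]:
        state.results.append({"hypothesis": h, "metric": run_experiment(h)})

def outer_loop(state, target=0.9):
    """Outer loop: synthesize accumulated results and pick a
    direction (CONCLUDE when a hypothesis is strongly supported)."""
    best = max((r["metric"] for r in state.results), default=0.0)
    state.direction = "CONCLUDE" if best >= target else "DEEPEN"
    return state.direction
```

The key design point is that the outer loop reads the whole trajectory, not just the latest run, before deciding whether to deepen, broaden, pivot, or conclude.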
🏗️ Model Architecture (5 skills)
- LitGPT - Lightning AI's 20+ clean LLM implementations with production training recipes (462 lines + 4 refs)
- Mamba - State-space models with O(n) complexity, 5× faster than Transformers (253 lines + 3 refs)
- RWKV - RNN+Transformer hybrid, infinite context, Linux Foundation project (253 lines + 3 refs)
- NanoGPT - Educational GPT in ~300 lines by Karpathy (283 lines + 3 refs)
- TorchTitan - PyTorch-native distributed training for Llama 3.1 with 4D parallelism
🔤 Tokenization (2 skills)
- HuggingFace Tokenizers - Rust-based, <20s/GB, BPE/WordPiece/Unigram algorithms (486 lines + 4 refs)
- SentencePiece - Language-independent, 50k sentences/sec, used by T5/ALBERT (228 lines + 2 refs)
🎯 Fine-Tuning (4 skills)
- Axolotl - YAML-based fine-tuning with 100+ models (156 lines + 4 refs)
- LLaMA-Factory - WebUI no-code fine-tuning (78 lines + 5 refs)
- Unsloth - 2x faster QLoRA fine-tuning (75 lines + 4 refs)
- PEFT - Parameter-efficient fine-tuning with LoRA, QLoRA, DoRA, 25+ methods (431 lines + 2 refs)
🔬 Mechanistic Interpretability (4 skills)
- TransformerLens - Neel Nanda's library for mech interp with HookPoints, activation caching (346 lines + 3 refs)
- SAELens - Sparse Autoencoder training and analysis for feature discovery (386 lines + 3 refs)
- pyvene - Stanford's causal intervention library with declarative configs (473 lines + 3 refs)
- nnsight - Remote interpretability via NDIF, run experiments on 70B+ models (436 lines + 3 refs)
📊 Data Processing (2 skills)
- Ray Data - Distributed ML data processing, streaming execution, GPU support (318 lines + 2 refs)
- NeMo Curator - GPU-accelerated data curation, 16× faster deduplication (375 lines + 2 refs)
🎓 Post-Training (8 skills)
- TRL Fine-Tuning - Transformer Reinforcement Learning (447 lines + 4 refs)
- GRPO-RL-Training (TRL) - Group Relative Policy Optimization with TRL (569 lines, gold standard)
- OpenRLHF - Full RLHF pipeline with Ray + vLLM (241 lines + 4 refs)
- SimPO - Simple Preference Optimization, no reference model needed (211 lines + 3 refs)
- verl - ByteDance's HybridFlow RL framework, FSDP/Megatron + vLLM/SGLang backends (389 lines + 2 refs)
- slime - THUDM's Megatron+SGLang framework powering GLM-4.x models (464 lines + 2 refs)
- miles - Enterprise fork of slime with FP8, INT4, speculative RL for MoE training (315 lines + 2 refs)
- torchforge - Meta's PyTorch-native RL with Monarch+TorchTitan+vLLM (380 lines + 2 refs)
🛡️ Safety & Alignment (4 skills)
- Constitutional AI - AI-driven self-improvement via principles (282 lines)
- LlamaGuard - Safety classifier for LLM inputs/outputs (329 lines)
- NeMo Guardrails - Programmable guardrails with Colang (289 lines)
- Prompt Guard - Meta's 86M prompt injection & jailbreak detector, 99%+ TPR, <2ms GPU (313 lines)
⚡ Distributed Training (6 skills)
- Megatron-Core - NVIDIA's framework for training 2B-462B param models with 47% MFU on H100 (359 lines + 4 refs)
- DeepSpeed - Microsoft's ZeRO optimization (137 lines + 9 refs)
- PyTorch FSDP2 - Fully Sharded Data Parallel v2 with `fully_shard` and DTensor (231 lines + 12 refs)
- Accelerate - HuggingFace's 4-line distributed training API (324 lines + 3 refs)
- PyTorch Lightning - High-level training framework with Trainer class (339 lines + 3 refs)
- Ray Train - Multi-node orchestration and hyperparameter tuning (399 lines + 1 ref)
🚀 Optimization (6 skills)
- Flash Attention - 2-4x faster attention with memory efficiency (359 lines + 2 refs)
- bitsandbytes - 8-bit/4-bit quantization for 50-75% memory reduction (403 lines + 3 refs)
- GPTQ - 4-bit post-training quantization, 4× memory reduction, <2% accuracy loss (443 lines + 3 refs)
- AWQ - Activation-aware weight quantization, 4-bit with minimal accuracy loss (310 lines + 2 refs)
- HQQ - Half-Quadratic Quantization, no calibration data needed, multi-backend (370 lines + 2 refs)
- GGUF - llama.cpp quantization format, K-quant methods, CPU/Metal inference (380 lines + 2 refs)
📊 Evaluation (3 skills)
- lm-evaluation-harness - EleutherAI's standard for benchmarking LLMs across 60+ tasks (482 lines + 4 refs)
- BigCode Evaluation Harness - Code model benchmarking with HumanEval, MBPP, MultiPL-E, pass@k metrics (406 lines + 3 refs)
- NeMo Evaluator - NVIDIA's enterprise platform for 100+ benchmarks across 18+ harnesses with multi-backend execution (454 lines + 4 refs)
☁️ Infrastructure (3 skills)
- Modal - Serverless GPU cloud with Python-native API, T4-H200 on-demand (342 lines + 2 refs)
- SkyPilot - Multi-cloud orchestration across 20+ providers with spot recovery (390 lines + 2 refs)
- Lambda Labs - Reserved/on-demand GPU cloud with H100/A100, persistent filesystems (390 lines + 2 refs)
🔥 Inference & Serving (4 skills)
- vLLM - High-throughput LLM serving with PagedAttention (356 lines + 4 refs, production-ready)
- TensorRT-LLM - NVIDIA's fastest inference, 24k tok/s, FP8/INT4 quantization (180 lines + 3 refs)
- llama.cpp - CPU/Apple Silicon inference, GGUF quantization (251 lines + 3 refs)
- SGLang - Structured generation with RadixAttention, 5-10× faster for agents (435 lines + 3 refs)
🤖 Agents (4 skills)
- LangChain - Most popular agent framework, 500+ integrations, ReAct pattern (658 lines + 3 refs, production-ready)
- LlamaIndex - Data framework for LLM apps, 300+ connectors, RAG-focused (535 lines + 3 refs)
- CrewAI - Multi-agent orchestration, role-based collaboration, autonomous workflows (498 lines + 3 refs)
- AutoGPT - Autonomous AI agent platform, visual workflow builder, continuous execution (400 lines + 2 refs)
🔍 RAG (5 skills)
- Chroma - Open-source embedding database, local/cloud, 24k stars (385 lines + 1 ref)
- FAISS - Facebook's similarity search, billion-scale, GPU acceleration (295 lines)
- Sentence Transformers - 5000+ embedding models, multilingual, 15k stars (370 lines)
- Pinecone - Managed vector database, auto-scaling, <100ms latency (410 lines)
- Qdrant - High-performance vector search, Rust-powered, hybrid search with filtering (493 lines + 2 refs)
🎨 Multimodal (7 skills)
- CLIP - OpenAI's vision-language model, zero-shot classification, 25k stars (320 lines)
- Whisper - Robust speech recognition, 99 languages, 73k stars (395 lines)
- LLaVA - Vision-language assistant, image chat, GPT-4V level (360 lines)
- Stable Diffusion - Text-to-image generation via HuggingFace Diffusers, SDXL, ControlNet (380 lines + 2 refs)
- Segment Anything - Meta's SAM for zero-shot image segmentation with points/boxes (500 lines + 2 refs)
- BLIP-2 - Vision-language pretraining with Q-Former, image captioning, VQA (500 lines + 2 refs)
- AudioCraft - Meta's MusicGen/AudioGen for text-to-music and text-to-sound (470 lines + 2 refs)
🎯 Prompt Engineering (4 skills)
- DSPy - Declarative prompt programming with optimizers, Stanford NLP, 22k stars (438 lines + 3 refs)
- Instructor - Structured LLM outputs with Pydantic validation, 15k stars (726 lines + 3 refs)
- Guidance - Constrained generation with regex/grammars, Microsoft Research, 18k stars (485 lines + 3 refs)
- Outlines - Structured text with FSM, zero-overhead, 8k stars (601 lines + 3 refs)
📊 MLOps (3 skills)
- Weights & Biases - Experiment tracking, sweeps, artifacts, model registry (427 lines + 3 refs)
- MLflow - Model registry, tracking, deployment, autologging (514 lines + 3 refs)
- TensorBoard - Visualization, profiling, embeddings, scalars/images (538 lines + 3 refs)
👁️ Observability (2 skills)
- LangSmith - LLM observability, tracing, evaluation, monitoring for AI apps (422 lines + 2 refs)
- Phoenix - Open-source AI observability with OpenTelemetry tracing and LLM evaluation (380 lines + 2 refs)
🔬 Emerging Techniques (6 skills)
- MoE Training - Mixture of Experts training with DeepSpeed, Mixtral 8x7B, 5× cost reduction (515 lines + 3 refs)
- Model Merging - Combine models with TIES, DARE, SLERP using mergekit (528 lines + 3 refs)
- Long Context - Extend context windows with RoPE, YaRN, ALiBi, 32k-128k tokens (624 lines + 3 refs)
- Speculative Decoding - 1.5-3.6× faster inference with Medusa, Lookahead (379 lines)
- Knowledge Distillation - Compress models 70B→7B with MiniLLM, temperature scaling (424 lines)
- Model Pruning - 50% sparsity with Wanda, SparseGPT, <1% accuracy loss (417 lines)
📝 ML Paper Writing (2 skills)
- ML Paper Writing - Write publication-ready papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM with LaTeX templates, citation verification, and writing best practices (532 lines + 5 refs)
- Academic Plotting - Generate publication-quality figures for ML papers: architecture diagrams via Gemini AI and data-driven charts via matplotlib/seaborn with venue-specific styling (479 lines + 3 refs)
💡 Ideation (2 skills)
- Research Brainstorming - Structured ideation frameworks for discovering high-impact research directions with 10 complementary lenses (384 lines)
- Creative Thinking - Cognitive science frameworks (bisociation, structure-mapping, constraint manipulation) for genuinely novel research ideas (366 lines)
🧬 Agent-Native Research Artifact (3 skills)
- ARA Compiler - Compiles any research input (PDF papers, repos, experiment logs, raw notes) into a complete Agent-Native Research Artifact with claims, exploration graph, evidence, and code stubs (245 lines + 3 refs)
- ARA Research Manager - Post-task research recorder that runs at session end to extract decisions, experiments, dead ends, and pivots from conversation history into the `ara/` directory with user-vs-AI provenance tags (324 lines + 3 refs)
- ARA Rigor Reviewer - ARA Seal Level 2 semantic epistemic review scoring six dimensions of research rigor (evidence relevance, falsifiability, scope, coherence, exploration integrity, methodology) with severity-ranked findings (322 lines + 1 ref)
Demos
All 98 skills in this repo are automatically synced to Orchestra Research, where you can add them to your projects with one click and use them with AI research agents.
See skills in action → demos/
We maintain a curated collection of demo repositories showing how to use skills for real AI research tasks:
| Demo | Skills Used | What It Does |
|---|---|---|
| Norm Heterogeneity → LoRA Brittleness | Autoresearch, ML Paper Writing, Ideation | Agent autonomously discovered norm heterogeneity predicts fine-tuning difficulty (r=-0.99), pivoting from a null result on ETF overlaps |
| RL Algorithm Brain Scan | Autoresearch, GRPO, TRL, SAELens, TransformerLens, ML Paper Writing | Agent found DPO is a rank-1 perturbation (95.6% recovery from one SVD direction) while online RL is distributed and structure-preserving |
| NeMo Eval: GPQA Benchmark | NeMo Evaluator | Compare Llama 8B/70B/405B on graduate-level science questions |
| LoRA Without Regret Reproduction | GRPO, TRL | Reproduce SFT + GRPO RL experiments via prompting |
| Layer-Wise Quantization Experiment | llama.cpp, GGUF | Investigate optimal layer precision allocation—early layers at Q8 achieve 1.9× compression with 1.3% perplexity loss |
| Cross-Lingual Alignment Analysis | FAISS | Quantify how well multilingual embeddings align semantic concepts across 8 languages using FAISS similarity search |
| Scientific Plotting Demo | Academic Plotting | Generate publication-quality figures for the Andes QoE-aware LLM serving paper — Gemini AI architecture diagrams + matplotlib data charts (CDF, multi-panel grids, bar charts) |
Featured Demos: Two papers produced entirely by AI agents using the autoresearch skill. The Norm Heterogeneity paper demonstrates autonomous research pivoting — the agent refuted its own hypothesis and discovered a stronger finding. The RL Brain Scan paper demonstrates multi-skill orchestration — the agent trained RL models, analyzed internals with interpretability tools, and synthesized the insight that "DPO is rank-1 alignment." Both papers were written end-to-end by the agent.
Skill Structure
Each skill follows a battle-tested format for maximum usefulness:
skill-name/
├── SKILL.md # Quick reference (50-150 lines)
│ ├── Metadata (name, description, version)
│ ├── When to use this skill
│ ├── Quick patterns & examples
│ └── Links to references
│
├── references/ # Deep documentation (300KB+)
│ ├── README.md # From GitHub/official docs
│ ├── api.md # API reference
│ ├── tutorials.md # Step-by-step guides
│ ├── issues.md # Real GitHub issues & solutions
│ ├── releases.md # Version history & breaking changes
│ └── file_structure.md # Codebase navigation
│
├── scripts/ # Helper scripts (optional)
└── assets/ # Templates & examples (optional)
Quality Standards
- 300KB+ documentation from official sources
- Real GitHub issues & solutions (when available)
- Code examples with language detection
- Version history & breaking changes
- Links to official docs
Roadmap
We originally set a target of 80 comprehensive skills across the full AI research lifecycle, a milestone the library has now surpassed. See our detailed roadmap for the complete development plan.
View Detailed Statistics
| Metric | Current | Target |
|---|---|---|
| Skills | 98 (high-quality, standardized YAML) | 80 ✅ |
| Avg Lines/Skill | 420 lines (focused + progressive disclosure) | 200-600 lines |
| Documentation | ~130,000 lines total (SKILL.md + references) | 100,000+ lines |
| Gold Standard Skills | 65 with comprehensive references | 50+ |
| Contributors | 1 | 100+ |
| Coverage | Architecture, Tokenization, Fine-Tuning, Mechanistic Interpretability, Data Processing, Post-Training, Safety, Distributed, Optimization, Evaluation, Infrastructure, Inference, Agents, RAG, Multimodal, Prompt Engineering, MLOps, Observability, Emerging Techniques, ML Paper Writing, Ideation, Agent-Native Research Artifact, Autoresearch | Full Lifecycle ✅ |
Recent Progress: npm package @orchestra-research/ai-research-skills for one-command installation across all coding agents
Philosophy: Quality > Quantity. Following Anthropic's official best practices, each skill provides 200-500 lines of focused, actionable guidance with progressive disclosure.
Repository Structure
claude-ai-research-skills/
├── README.md ← You are here
├── CONTRIBUTING.md ← Contribution guide
├── demos/ ← Curated demo gallery (links to demo repos)
├── docs/
├── 0-autoresearch-skill/ (1 skill ✓ - Autonomous research orchestration)
├── 01-model-architecture/ (5 skills ✓ - LitGPT, Mamba, RWKV, NanoGPT, TorchTitan)
├── 02-tokenization/ (2 skills ✓ - HuggingFace Tokenizers, SentencePiece)
├── 03-fine-tuning/ (4 skills ✓ - Axolotl, LLaMA-Factory, Unsloth, PEFT)
├── 04-mechanistic-interpretability/ (4 skills ✓ - TransformerLens, SAELens, pyvene, nnsight)
├── 05-data-processing/ (2 skills ✓ - Ray Data, NeMo Curator)
├── 06-post-training/ (8 skills ✓ - TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, torchforge)
├── 07-safety-alignment/ (4 skills ✓ - Constitutional AI, LlamaGuard, NeMo Guardrails, Prompt Guard)
├── 08-distributed-training/ (6 skills ✓ - Megatron-Core, DeepSpeed, FSDP, Accelerate, Lightning, Ray Train)
├── 09-infrastructure/ (3 skills ✓ - Modal, SkyPilot, Lambda Labs)
├── 10-optimization/ (6 skills ✓ - Flash Attention, bitsandbytes, GPTQ, AWQ, HQQ, GGUF)
├── 11-evaluation/ (3 skills ✓ - lm-evaluation-harness, BigCode, NeMo Evaluator)
├── 12-inference-serving/ (4 skills ✓ - vLLM, TensorRT-LLM, llama.cpp, SGLang)
├── 13-mlops/ (3 skills ✓ - Weights & Biases, MLflow, TensorBoard)
├── 14-agents/ (4 skills ✓ - LangChain, LlamaIndex, CrewAI, AutoGPT)
├── 15-rag/ (5 skills ✓ - Chroma, FAISS, Sentence Transformers, Pinecone, Qdrant)
├── 16-prompt-engineering/ (4 skills ✓ - DSPy, Instructor, Guidance, Outlines)
├── 17-observability/ (2 skills ✓ - LangSmith, Phoenix)
├── 18-multimodal/ (7 skills ✓ - CLIP, Whisper, LLaVA, Stable Diffusion, SAM, BLIP-2, AudioCraft)
├── 19-emerging-techniques/ (6 skills ✓ - MoE, Model Merging, Long Context, Speculative Decoding, Distillation, Pruning)
├── 20-ml-paper-writing/ (2 skills ✓ - ML Paper Writing with LaTeX templates, Academic Plotting)
├── 21-research-ideation/ (2 skills ✓ - Research Brainstorming, Creative Thinking)
├── 22-agent-native-research-artifact/ (3 skills ✓ - ARA Compiler, Research Manager, Rigor Reviewer)
└── packages/ai-research-skills/ (npm package for one-command installation)
Use Cases
For Researchers
"I need to fine-tune Llama 3 with custom data" → 03-fine-tuning/axolotl/ - YAML configs, 100+ model support
For ML Engineers
"How do I optimize inference latency?" → 12-inference-serving/vllm/ - PagedAttention, batching
For Students
"I want to learn how transformers work" → 01-model-architecture/litgpt/ - Clean implementations
For Teams
"We need to scale training to 100 GPUs" → 08-distributed-training/deepspeed/ - ZeRO stages, 3D parallelism
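As a taste of what the DeepSpeed skill covers, here is a minimal configuration sketch that enables ZeRO stage 2 with bf16. The keys follow DeepSpeed's published JSON config schema (`train_batch_size`, `zero_optimization`, `bf16`); the specific values are illustrative, not recommendations:

```python
import json

# Minimal DeepSpeed config sketch: ZeRO stage 2 sharding with bf16 mixed precision.
# Values are illustrative placeholders, not tuned defaults.
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,          # shard optimizer states + gradients across ranks
        "overlap_comm": True # overlap gradient communication with backward pass
    },
    "bf16": {"enabled": True},
}

# Write the config so it can be passed to the deepspeed launcher
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting file is typically passed to a training script launched via `deepspeed` (e.g. with the `--deepspeed_config` argument); see the skill's references for the full schema and ZeRO stage trade-offs.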
License
MIT License - See LICENSE for details.
Note: Individual skills may reference libraries with different licenses. Please check each project's license before use.
Citation
If you use AI Research Skills in your work or find it helpful for a publication, we'd appreciate a citation:
BibTeX
@software{ai_research_skills,
title = {AI Research Skills Library},
author = {{Orchestra Research}},
year = {2025},
url = {https://github.com/orchestra-research/AI-research-SKILLs},
note = {Open-source skills library enabling AI agents to autonomously conduct AI research}
}
APA
Orchestra Research. (2025). AI Research Skills Library [Computer software]. https://github.com/orchestra-research/AI-research-SKILLs
Chicago
Orchestra Research. "AI Research Skills Library." GitHub, 2025. https://github.com/orchestra-research/AI-research-SKILLs.
IEEE
Orchestra Research, "AI Research Skills Library," 2025. [Online]. Available: https://github.com/orchestra-research/AI-research-SKILLs
Tip: You can also click "Cite this repository" in the GitHub sidebar for auto-formatted citations.
Acknowledgments
Built with:
- Claude Code - AI pair programming
- Skill Seeker - Automated doc scraping
- Open Source AI Community - For amazing tools and docs
Special thanks to:
- EleutherAI, HuggingFace, NVIDIA, Lightning AI, Meta AI, Anthropic
- All researchers who maintain excellent documentation
Contributors
Thanks to all the people who have contributed to the AI Research Skills Library:
We welcome contributions from the AI research community! See CONTRIBUTING.md for detailed guidelines on:
- Adding new skills
- Improving existing skills
- Quality standards and best practices
- Submission process
Recent Updates
April 2026 - v1.6.0 🧬 Agent-Native Research Artifact (ARA) — 23rd Category, 98 Skills
- 🧬 NEW CATEGORY: `22-agent-native-research-artifact/` (the 23rd category) — three skills that turn research outputs into a falsifiable, agent-traversable artifact:
  - 🛠️ ARA Compiler — compiles any input (PDF papers, GitHub repos, experiment logs, raw notes) into a structured ARA with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph (research DAG), and grounded evidence
  - 📋 ARA Research Manager — post-task epilogue that scans conversation history at session end and writes decisions, experiments, dead ends, claims, heuristics, and pivots into the `ara/` directory with `user`/`ai-suggested`/`ai-executed`/`user-revised` provenance tags
  - 🔍 ARA Rigor Reviewer — Seal Level 2 semantic epistemic review scoring six dimensions of research rigor (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and emitting a severity-ranked report with a Strong Accept-to-Reject recommendation
- 🔗 Sourced from the Agent-Native-Research-Artifact-Init reference repo, restructured to AI-research-SKILLs standards (kebab-case names, third-person descriptions, Title-Case tags, one-level-deep references)
- 🧩 Plugin entry `agent-native-research-artifact` added to `.claude-plugin/marketplace.json`; CLI category registered as `22-agent-native-research-artifact` with three individual skill entries in the npm installer
- 🔄 Auto-syncs to Orchestra marketplace via `sync-skills.yml` on push; npm package republished as `@orchestra-research/ai-research-skills@1.6.0` via `publish-npm.yml` on version bump
- 📊 98 total skills across 23 categories — full lifecycle from idea → paper → falsifiable, auditable artifact
March 2026 - v1.4.0 🔬 Autoresearch & 86 Skills — Full Research Lifecycle
- 🔬 NEW SKILL: Autoresearch — autonomous research orchestration using a two-loop architecture (inner optimization loop + outer synthesis loop)
- 🧠 Manages the full research lifecycle: literature survey → ideation → experiments → synthesis → paper writing
- 🔄 Routes to all 86 domain skills automatically — agents don't need to know which skill to use
- ⏰ Mandatory `/loop` (Claude Code) and cron job (OpenClaw) for continuous autonomous operation
- 📊 Generates research presentations (HTML/PDF) with optimization trajectory plots for human review
- 📝 Findings.md as persistent project memory across sessions with "Lessons and Constraints" tracking
- 🗂️ Structured workspace: research-state.yaml, findings.md, research-log.md, literature/, experiments/, src/, data/, to_human/
- 📄 Two demo papers produced by autoresearch: Norm Heterogeneity → LoRA Brittleness and RL Algorithm Brain Scan
- 🚀 WELCOME.md for cold-start agent bootstrap — one URL to go from zero to autonomous research
- 📦 npm v1.4.x with Windows symlink fallback, all 22 categories installable
- 🤖 Supported agents: Claude Code, Hermes Agent, OpenCode, OpenClaw, Cursor, Codex, Gemini CLI, Qwen Code
- 📊 87 total skills across 22 categories — complete research lifecycle coverage
February 2026 - v0.15.0 🛡️ Prompt Guard & 83 Skills
- 🛡️ NEW SKILL: Prompt Guard - Meta's 86M prompt injection & jailbreak detector
- ⚡ 99%+ TPR, <1% FPR, <2ms GPU latency, multilingual (8 languages)
- 🔒 3 workflows: user input filtering, third-party data filtering, batch RAG processing
- 📊 83 total skills across 20 categories
January 2026 - v0.14.0 📦 npm Package & 82 Skills
- 📦 NEW: `npx @orchestra-research/ai-research-skills` — one-command installation for all coding agents
- 🤖 Supported agents: Claude Code, OpenCode, Cursor, Codex, Gemini CLI, Qwen Code
- ✨ Interactive installer with category/individual skill selection
- 🔄 Update installed skills, selective uninstall
- 📊 82 total skills (5 new post-training skills: verl, slime, miles, torchforge + TorchTitan)
- 🏗️ Megatron-Core moved to Distributed Training category
January 2026 - v0.13.0 📝 ML Paper Writing & Demos Gallery
- 📝 NEW CATEGORY: ML Paper Writing (20th category, 77th skill)
- 🎯 Write publication-ready papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM
- 📚 Writing philosophy from top researchers (Neel Nanda, Farquhar, Gopen & Swan, Lipton, Perez)
- 🔬 Citation verification workflow - never hallucinate references
- 📄 LaTeX templates for 6 major conferences
- 🎪 NEW: Curated demos gallery (`demos/`) showcasing skills in action
- 🔗 Demo repos: NeMo Evaluator benchmark, LoRA Without Regret reproduction
- 📖 936-line comprehensive SKILL.md with 4 workflows
January 2026 - v0.12.0 📊 NeMo Evaluator SDK
- 📊 NEW SKILL: NeMo Evaluator SDK for enterprise LLM benchmarking
- 🔧 NVIDIA's evaluation platform with 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM)
- ⚡ Multi-backend execution: local Docker, Slurm HPC, Lepton cloud
- 📦 Container-first architecture for reproducible evaluation
- 📝 454 lines SKILL.md + 4 comprehensive reference files (~48KB documentation)
December 2025 - v0.11.0 🔬 Mechanistic Interpretability
- 🔬 NEW CATEGORY: Mechanistic Interpretability (4 skills)
- 🔍 TransformerLens skill: Neel Nanda's library for mech interp with HookPoints, activation caching, circuit analysis
- 🧠 SAELens skill: Sparse Autoencoder training and analysis for feature discovery, monosemanticity research
- ⚡ pyvene skill: Stanford's causal intervention library with declarative configs, DAS, activation patching
- 🌐 nnsight skill: Remote interpretability via NDIF, run experiments on 70B+ models without local GPUs
- 📝 ~6,500 new lines of documentation across 16 files
- 76 total skills (filling the missing 04 category slot)
November 25, 2025 - v0.10.0 🎉 70 Skills Complete!
- 🎉 ROADMAP COMPLETE: Reached 70-skill milestone!
- 🚀 Added 4 skills: Lambda Labs, Segment Anything (SAM), BLIP-2, AudioCraft
- ☁️ Lambda Labs skill: Reserved/on-demand GPU cloud with H100/A100, persistent filesystems, 1-Click Clusters
- 🖼️ SAM skill: Meta's Segment Anything for zero-shot image segmentation with points/boxes/masks
- 👁️ BLIP-2 skill: Vision-language pretraining with Q-Former, image captioning, VQA
- 🎵 AudioCraft skill: Meta's MusicGen/AudioGen for text-to-music and text-to-sound generation
- 📝 ~10,000 new lines of documentation across 12 files
- 70 total skills (100% roadmap complete!)
November 25, 2025 - v0.9.0
- 🚀 Added 2 infrastructure skills: Modal, SkyPilot
- ☁️ Modal skill: Serverless GPU cloud with Python-native API, T4-H200 on-demand, auto-scaling
- 🌐 SkyPilot skill: Multi-cloud orchestration across 20+ providers with spot recovery
- ✨ New Infrastructure category (2 skills - serverless GPU and multi-cloud orchestration)
- 📝 ~2,500 new lines of documentation across 6 files
- 66 total skills (94% towards 70-skill target)
November 25, 2025 - v0.8.0
- 🚀 Added 5 high-priority skills: HQQ, GGUF, Phoenix, AutoGPT, Stable Diffusion
- ⚡ HQQ skill: Half-Quadratic Quantization without calibration data, multi-backend support
- 📦 GGUF skill: llama.cpp quantization format, K-quant methods, CPU/Metal inference
- 👁️ Phoenix skill: Open-source AI observability with OpenTelemetry tracing and LLM evaluation
- 🤖 AutoGPT skill: Autonomous AI agent platform with visual workflow builder
- 🎨 Stable Diffusion skill: Text-to-image generation via Diffusers, SDXL, ControlNet, LoRA
- 📝 ~9,000 new lines of documentation across 15 files
- 64 total skills (91% towards 70-skill target)
November 25, 2025 - v0.7.0
- 🚀 Added 5 high-priority skills: PEFT, CrewAI, Qdrant, AWQ, LangSmith
- ✨ New Observability category with LangSmith for LLM tracing and evaluation
- 🎯 PEFT skill: Parameter-efficient fine-tuning with LoRA, QLoRA, DoRA, 25+ methods
- 🤖 CrewAI skill: Multi-agent orchestration with role-based collaboration
- 🔍 Qdrant skill: High-performance Rust vector search with hybrid filtering
- ⚡ AWQ skill: Activation-aware 4-bit quantization with minimal accuracy loss
- 📝 ~8,000 new lines of documentation across 15 files
- 59 total skills (84% towards 70-skill target)
November 15, 2025 - v0.6.0
- 📊 Added 3 comprehensive MLOps skills: Weights & Biases, MLflow, TensorBoard
- ✨ New MLOps category (3 skills - experiment tracking, model registry, visualization)
- 📝 ~10,000 new lines of documentation across 13 files
- 🔧 Comprehensive coverage: experiment tracking, hyperparameter sweeps, model registry, profiling, embeddings visualization
- 54 total skills (77% towards 70-skill target)
November 12, 2025 - v0.5.0
- 🎯 Added 4 comprehensive prompt engineering skills: DSPy, Instructor, Guidance, Outlines
- ✨ New Prompt Engineering category (4 skills - DSPy, Instructor, Guidance, Outlines)
- 📝 ~10,000 new lines of documentation across 16 files
- 🔧 Comprehensive coverage: declarative programming, structured outputs, constrained generation, FSM-based generation
- 47 total skills (67% towards 70-skill target)
November 9, 2025 - v0.4.0
- 🤖 Added 11 comprehensive skills: LangChain, LlamaIndex, Chroma, FAISS, Sentence Transformers, Pinecone, CLIP, Whisper, LLaVA
- ✨ New Agents category (2 skills - LangChain, LlamaIndex)
- 🔍 New RAG category (4 skills - Chroma, FAISS, Sentence Transformers, Pinecone)
- 🎨 New Multimodal category (3 skills - CLIP, Whisper, LLaVA)
- 📝 ~15,000 new lines of documentation
- 43 total skills (61% towards 70-skill target)
November 8, 2025 - v0.3.0
- 🚀 Added 8 comprehensive skills: TensorRT-LLM, llama.cpp, SGLang, GPTQ, HuggingFace Tokenizers, SentencePiece, Ray Data, NeMo Curator
- ⚡ Completed Inference & Serving category (4/4 skills)
- 🔤 New Tokenization category (2 skills)
- 📊 New Data Processing category (2 skills)
- 📝 9,617 new lines of documentation across 30 files
- 32 total skills (45% towards 70-skill target)
November 6, 2025 - v0.2.0
- Added 10 skills from GitHub (Megatron-Core, Lightning, Ray Train, etc.)
- Improved skill structure with comprehensive references
- Created strategic roadmap to 70 skills
- Added contribution guidelines
November 3, 2025 - v0.1.0
- 🎉 Initial release with 5 fine-tuning skills
Community
Join our community to stay updated, ask questions, and connect with other AI researchers:
- SkillEvolve Meta-Skill - Connect your agent to the collective intelligence of the community. Captures techniques discovered during sessions and shares them back as curated skills.
- Slack Community - Chat with the team and other users
- Twitter/X - Follow for updates and announcements
- LinkedIn - Connect professionally