Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
AlterLab-IEU

AlterLab-Academic-Skills

Quality
9.0

This repository offers a collection of 186+ specialized Claude AI skills crafted for faculty, researchers, and academics. Organized across 13 research domains, the skills turn Claude into a domain-specific expert assistant. They are built on real scientific methods, professional output templates, and deep domain knowledge, supporting tasks from multi-agent research pipelines to database integrations and publication-quality output. Users can integrate them into Claude Projects or the Claude Code CLI for instant expertise in areas like genomics, clinica…

USP

Unlock 186+ Claude AI skills, transforming Claude into a domain-specific research expert. Leverage real scientific frameworks, professional output templates, and deep knowledge across 13 academic domains for enhanced scholarly workflows.

Use cases

  • 01 Generating novel protein sequences with specific functional properties using ESM3.
  • 02 Inferring gene regulatory networks from transcriptomics data to identify TF-target relationships.
  • 03 Accessing and integrating data from multiple bioinformatics databases like UniProt, KEGG, and ChEMB…
  • 04 Performing single-cell RNA-seq analysis and managing large annotated datasets with AnnData.
  • 05 Automating sequence manipulation, file parsing (FASTA/GenBank), and programmatic NCBI/PubMed access.

Detected files (8)

  • skills/bioinformatics/alterlab-biopython/SKILL.md (skill, 13895 bytes)
    ---
    name: alterlab-biopython
    description: Comprehensive molecular biology toolkit. Use for sequence manipulation, file parsing (FASTA/GenBank/PDB), phylogenetics, and programmatic NCBI/PubMed access (Bio.Entrez). Best for batch processing, custom bioinformatics pipelines, BLAST automation. For quick lookups use gget; for multi-service integration use bioservices. Part of the AlterLab Academic Skills suite.
    license: MIT
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # Biopython: Computational Molecular Biology in Python
    
    ## Overview
    
    Biopython is a comprehensive set of freely available Python tools for biological computation. It provides functionality for sequence manipulation, file I/O, database access, structural bioinformatics, phylogenetics, and many other bioinformatics tasks. The current version is **Biopython 1.85** (released January 2025), which supports Python 3 and requires NumPy.
    
    ## When to Use This Skill
    
    Use this skill when:
    
    - Working with biological sequences (DNA, RNA, or protein)
    - Reading, writing, or converting biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, etc.)
    - Accessing NCBI databases (GenBank, PubMed, Protein, Gene, etc.) via Entrez
    - Running BLAST searches or parsing BLAST results
    - Performing sequence alignments (pairwise or multiple sequence alignments)
    - Analyzing protein structures from PDB files
    - Creating, manipulating, or visualizing phylogenetic trees
    - Finding sequence motifs or analyzing motif patterns
    - Calculating sequence statistics (GC content, molecular weight, melting temperature, etc.)
    - Performing structural bioinformatics tasks
    - Working with population genetics data
    - Any other computational molecular biology task
    
    ## Core Capabilities
    
    Biopython is organized into modular sub-packages, each addressing specific bioinformatics domains:
    
    1. **Sequence Handling** - Bio.Seq and Bio.SeqIO for sequence manipulation and file I/O
    2. **Alignment Analysis** - Bio.Align and Bio.AlignIO for pairwise and multiple sequence alignments
    3. **Database Access** - Bio.Entrez for programmatic access to NCBI databases
    4. **BLAST Operations** - Bio.Blast for running and parsing BLAST searches
    5. **Structural Bioinformatics** - Bio.PDB for working with 3D protein structures
    6. **Phylogenetics** - Bio.Phylo for phylogenetic tree manipulation and visualization
    7. **Advanced Features** - Motifs, population genetics, sequence utilities, and more
    
    ## Installation and Setup
    
    Install Biopython using pip (requires Python 3 and NumPy):
    
    ```bash
    uv pip install biopython
    ```
    
    For NCBI database access, always set your email address (required by NCBI):
    
    ```python
    from Bio import Entrez
    Entrez.email = "your.email@example.com"
    
    # Optional: API key for higher rate limits (10 req/s instead of 3 req/s)
    Entrez.api_key = "your_api_key_here"
    ```
    
    ## Using This Skill
    
    This skill provides comprehensive documentation organized by functionality area. When working on a task, consult the relevant reference documentation:
    
    ### 1. Sequence Handling (Bio.Seq & Bio.SeqIO)
    
    **Reference:** `references/sequence_io.md`
    
    Use for:
    - Creating and manipulating biological sequences
    - Reading and writing sequence files (FASTA, GenBank, FASTQ, etc.)
    - Converting between file formats
    - Extracting sequences from large files
    - Sequence translation, transcription, and reverse complement
    - Working with SeqRecord objects
    
    **Quick example:**
    ```python
    from Bio import SeqIO
    
    # Read sequences from FASTA file
    for record in SeqIO.parse("sequences.fasta", "fasta"):
        print(f"{record.id}: {len(record.seq)} bp")
    
    # Convert GenBank to FASTA
    SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
    ```
    
    ### 2. Alignment Analysis (Bio.Align & Bio.AlignIO)
    
    **Reference:** `references/alignment.md`
    
    Use for:
    - Pairwise sequence alignment (global and local)
    - Reading and writing multiple sequence alignments
    - Using substitution matrices (BLOSUM, PAM)
    - Calculating alignment statistics
    - Customizing alignment parameters
    
    **Quick example:**
    ```python
    from Bio import Align
    
    # Pairwise alignment
    aligner = Align.PairwiseAligner()
    aligner.mode = 'global'
    alignments = aligner.align("ACCGGT", "ACGGT")
    print(alignments[0])
    ```
    
    ### 3. Database Access (Bio.Entrez)
    
    **Reference:** `references/databases.md`
    
    Use for:
    - Searching NCBI databases (PubMed, GenBank, Protein, Gene, etc.)
    - Downloading sequences and records
    - Fetching publication information
    - Finding related records across databases
    - Batch downloading with proper rate limiting
    
    **Quick example:**
    ```python
    from Bio import Entrez
    Entrez.email = "your.email@example.com"
    
    # Search PubMed
    handle = Entrez.esearch(db="pubmed", term="biopython", retmax=10)
    results = Entrez.read(handle)
    handle.close()
    print(f"Found {results['Count']} results")
    ```
    
    ### 4. BLAST Operations (Bio.Blast)
    
    **Reference:** `references/blast.md`
    
    Use for:
    - Running BLAST searches via NCBI web services
    - Running local BLAST searches
    - Parsing BLAST XML output
    - Filtering results by E-value or identity
    - Extracting hit sequences
    
    **Quick example:**
    ```python
    from Bio.Blast import NCBIWWW, NCBIXML
    
    # Run BLAST search
    result_handle = NCBIWWW.qblast("blastn", "nt", "ATCGATCGATCG")
    blast_record = NCBIXML.read(result_handle)
    
    # Display top hits
    for alignment in blast_record.alignments[:5]:
        print(f"{alignment.title}: E-value={alignment.hsps[0].expect}")
    ```
    
    ### 5. Structural Bioinformatics (Bio.PDB)
    
    **Reference:** `references/structure.md`
    
    Use for:
    - Parsing PDB and mmCIF structure files
    - Navigating protein structure hierarchy (SMCRA: Structure/Model/Chain/Residue/Atom)
    - Calculating distances, angles, and dihedrals
    - Secondary structure assignment (DSSP)
    - Structure superimposition and RMSD calculation
    - Extracting sequences from structures
    
    **Quick example:**
    ```python
    from Bio.PDB import PDBParser
    
    # Parse structure
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("1crn", "1crn.pdb")
    
    # Calculate distance between alpha carbons
    chain = structure[0]["A"]
    distance = chain[10]["CA"] - chain[20]["CA"]
    print(f"Distance: {distance:.2f} Å")
    ```
    
    ### 6. Phylogenetics (Bio.Phylo)
    
    **Reference:** `references/phylogenetics.md`
    
    Use for:
    - Reading and writing phylogenetic trees (Newick, NEXUS, phyloXML)
    - Building trees from distance matrices or alignments
    - Tree manipulation (pruning, rerooting, ladderizing)
    - Calculating phylogenetic distances
    - Creating consensus trees
    - Visualizing trees
    
    **Quick example:**
    ```python
    from Bio import Phylo
    
    # Read and visualize tree
    tree = Phylo.read("tree.nwk", "newick")
    Phylo.draw_ascii(tree)
    
    # Calculate distance
    distance = tree.distance("Species_A", "Species_B")
    print(f"Distance: {distance:.3f}")
    ```
    
    ### 7. Advanced Features
    
    **Reference:** `references/advanced.md`
    
    Use for:
    - **Sequence motifs** (Bio.motifs) - Finding and analyzing motif patterns
    - **Population genetics** (Bio.PopGen) - GenePop files, Fst calculations, Hardy-Weinberg tests
    - **Sequence utilities** (Bio.SeqUtils) - GC content, melting temperature, molecular weight, protein analysis
    - **Restriction analysis** (Bio.Restriction) - Finding restriction enzyme sites
    - **Clustering** (Bio.Cluster) - K-means and hierarchical clustering
    - **Genome diagrams** (GenomeDiagram) - Visualizing genomic features
    
    **Quick example:**
    ```python
    from Bio.SeqUtils import gc_fraction, molecular_weight
    from Bio.Seq import Seq
    
    seq = Seq("ATCGATCGATCG")
    print(f"GC content: {gc_fraction(seq):.2%}")
    print(f"Molecular weight: {molecular_weight(seq, seq_type='DNA'):.2f} g/mol")
    ```
    
    ## General Workflow Guidelines
    
    ### Reading Documentation
    
    When a user asks about a specific Biopython task:
    
    1. **Identify the relevant module** based on the task description
    2. **Read the appropriate reference file** using the Read tool
    3. **Extract relevant code patterns** and adapt them to the user's specific needs
    4. **Combine multiple modules** when the task requires it
    
    Example search patterns for reference files:
    ```bash
    # Find information about specific functions
    grep -n "SeqIO.parse" references/sequence_io.md
    
    # Find examples of specific tasks
    grep -n "BLAST" references/blast.md
    
    # Find information about specific concepts
    grep -n "alignment" references/alignment.md
    ```
    
    ### Writing Biopython Code
    
    Follow these principles when writing Biopython code:
    
    1. **Import modules explicitly**
       ```python
       from Bio import SeqIO, Entrez
       from Bio.Seq import Seq
       ```
    
    2. **Set Entrez email** when using NCBI databases
       ```python
       Entrez.email = "your.email@example.com"
       ```
    
    3. **Use appropriate file formats** - Check which format best suits the task
       ```python
       # Common formats: "fasta", "genbank", "fastq", "clustal", "phylip"
       ```
    
    4. **Handle files properly** - Close handles after use or use context managers
       ```python
       with open("file.fasta") as handle:
           # SeqIO.parse is lazy; consume or materialize records before the handle closes
           records = list(SeqIO.parse(handle, "fasta"))
       ```
    
    5. **Use iterators for large files** - Avoid loading everything into memory
       ```python
       for record in SeqIO.parse("large_file.fasta", "fasta"):
           print(record.id)  # Process one record at a time
       ```
    
    6. **Handle errors gracefully** - Network operations and file parsing can fail
       ```python
       from urllib.error import HTTPError

       try:
           handle = Entrez.efetch(db="nucleotide", id=accession)
       except HTTPError as e:
           print(f"Error: {e}")
       ```
    
    ## Common Patterns
    
    ### Pattern 1: Fetch Sequence from GenBank
    
    ```python
    from Bio import Entrez, SeqIO
    
    Entrez.email = "your.email@example.com"
    
    # Fetch sequence
    handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    
    print(f"Description: {record.description}")
    print(f"Sequence length: {len(record.seq)}")
    ```
    
    ### Pattern 2: Sequence Analysis Pipeline
    
    ```python
    from Bio import SeqIO
    from Bio.SeqUtils import gc_fraction
    
    for record in SeqIO.parse("sequences.fasta", "fasta"):
        # Calculate statistics
        gc = gc_fraction(record.seq)
        length = len(record.seq)
    
        # Find ORFs, translate, etc.
        protein = record.seq.translate()
    
        print(f"{record.id}: {length} bp, GC={gc:.2%}")
    ```
    
    ### Pattern 3: BLAST and Fetch Top Hits
    
    ```python
    from Bio.Blast import NCBIWWW, NCBIXML
    from Bio import Entrez, SeqIO
    
    Entrez.email = "your.email@example.com"
    
    # Run BLAST
    result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
    blast_record = NCBIXML.read(result_handle)
    
    # Get top hit accessions
    accessions = [aln.accession for aln in blast_record.alignments[:5]]
    
    # Fetch sequences
    for acc in accessions:
        handle = Entrez.efetch(db="nucleotide", id=acc, rettype="fasta", retmode="text")
        record = SeqIO.read(handle, "fasta")
        handle.close()
        print(f">{record.description}")
    ```
    
    ### Pattern 4: Build Phylogenetic Tree from Sequences
    
    ```python
    from Bio import AlignIO, Phylo
    from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
    
    # Read alignment
    alignment = AlignIO.read("alignment.fasta", "fasta")
    
    # Calculate distances
    calculator = DistanceCalculator("identity")
    dm = calculator.get_distance(alignment)
    
    # Build tree
    constructor = DistanceTreeConstructor()
    tree = constructor.nj(dm)
    
    # Visualize
    Phylo.draw_ascii(tree)
    ```
    
    ## Best Practices
    
    1. **Always read relevant reference documentation** before writing code
    2. **Use grep to search reference files** for specific functions or examples
    3. **Validate file formats** before parsing
    4. **Handle missing data gracefully** - Not all records have all fields
    5. **Cache downloaded data** - Don't repeatedly download the same sequences
    6. **Respect NCBI rate limits** - Use API keys and proper delays (see the sketch after this list)
    7. **Test with small datasets** before processing large files
    8. **Keep Biopython updated** to get latest features and bug fixes
    9. **Use appropriate genetic code tables** for translation
    10. **Document analysis parameters** for reproducibility
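
    A minimal sketch that combines practices 5 and 6, assuming a local `cache/` directory and a ~0.35 s delay to stay under NCBI's 3 requests/second limit for clients without an API key (adjust both to your setup):

    ```python
    import os
    import time
    from Bio import Entrez, SeqIO

    Entrez.email = "your.email@example.com"

    def fetch_cached(accession, cache_dir="cache"):
        """Fetch a GenBank record, reusing a local copy when one exists."""
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, f"{accession}.gb")
        if not os.path.exists(path):
            handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
            with open(path, "w") as out:
                out.write(handle.read())
            handle.close()
            time.sleep(0.35)  # keep the request rate below ~3 requests/second
        return SeqIO.read(path, "genbank")

    record = fetch_cached("EU490707")
    print(record.id, len(record.seq))
    ```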
    
    ## Troubleshooting Common Issues
    
    ### Issue: "No handlers could be found for logger 'Bio.Entrez'"
    **Solution:** This is just a warning. Set Entrez.email to suppress it.
    
    ### Issue: "HTTP Error 400" from NCBI
    **Solution:** Check that IDs/accessions are valid and properly formatted.
    
    ### Issue: "ValueError: EOF" when parsing files
    **Solution:** Verify file format matches the specified format string.
    
    ### Issue: Alignment fails with "sequences are not the same length"
    **Solution:** Ensure sequences are aligned before using AlignIO or MultipleSeqAlignment.
    
    ### Issue: BLAST searches are slow
    **Solution:** Use local BLAST for large-scale searches, or cache results (see the sketch after this section).
    
    ### Issue: PDB parser warnings
    **Solution:** Use `PDBParser(QUIET=True)` to suppress warnings, or investigate structure quality.
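
    A sketch of the local-BLAST route, assuming NCBI BLAST+ is installed and a local database (here called `nt_local`) has already been built with `makeblastdb`:

    ```python
    import subprocess
    from Bio.Blast import NCBIXML

    # Run blastn locally and write XML output (-outfmt 5)
    subprocess.run(
        ["blastn", "-query", "query.fasta", "-db", "nt_local",
         "-outfmt", "5", "-out", "local_results.xml"],
        check=True,
    )

    # Parse the XML with Biopython
    with open("local_results.xml") as handle:
        for blast_record in NCBIXML.parse(handle):
            for alignment in blast_record.alignments[:5]:
                print(alignment.title, alignment.hsps[0].expect)
    ```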
    
    ## Additional Resources
    
    - **Official Documentation**: https://biopython.org/docs/latest/
    - **Tutorial & Cookbook**: https://biopython.org/docs/latest/Tutorial/ (includes advanced examples)
    - **GitHub**: https://github.com/biopython/biopython
    - **Mailing List**: biopython@biopython.org
    
    ## Quick Reference
    
    To locate information in reference files, use these search patterns:
    
    ```bash
    # Search for specific functions
    grep -n "function_name" references/*.md
    
    # Find examples of specific tasks
    grep -n "example" references/sequence_io.md
    
    # Find all occurrences of a module
    grep -n "Bio.Seq" references/*.md
    ```
    
    ## Summary
    
    Biopython provides comprehensive tools for computational molecular biology. When using this skill:
    
    1. **Identify the task domain** (sequences, alignments, databases, BLAST, structures, phylogenetics, or advanced)
    2. **Consult the appropriate reference file** in the `references/` directory
    3. **Adapt code examples** to the specific use case
    4. **Combine multiple modules** when needed for complex workflows
    5. **Follow best practices** for file handling, error checking, and data management
    
    The modular reference documentation ensures detailed, searchable information for every major Biopython capability.
    
    
  • skills/bioinformatics/alterlab-bioservices/SKILL.md (skill, 10017 bytes)
    ---
    name: alterlab-bioservices
    description: Unified Python interface to 40+ bioinformatics services. Use when querying multiple databases (UniProt, KEGG, ChEMBL, Reactome) in a single workflow with consistent API. Best for cross-database analysis, ID mapping across services. For quick single-database lookups use gget; for sequence/file manipulation use biopython. Part of the AlterLab Academic Skills suite.
    license: GPLv3
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # BioServices
    
    ## Overview
    
    BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.
    
    ## When to Use This Skill
    
    This skill should be used when:
    - Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam
    - Analyzing metabolic pathways and gene functions via KEGG or Reactome
    - Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information
    - Converting identifiers between different biological databases (KEGG↔UniProt, compound IDs)
    - Running sequence similarity searches (BLAST, MUSCLE alignment)
    - Querying gene ontology terms (QuickGO, GO annotations)
    - Accessing protein-protein interaction data (PSICQUIC, IntactComplex)
    - Mining genomic data (BioMart, ArrayExpress, ENA)
    - Integrating data from multiple bioinformatics resources in a single workflow
    
    ## Core Capabilities
    
    ### 1. Protein Analysis
    
    Retrieve protein information, sequences, and functional annotations:
    
    ```python
    from bioservices import UniProt
    
    u = UniProt(verbose=False)
    
    # Search for protein by name
    results = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")
    
    # Retrieve FASTA sequence
    sequence = u.retrieve("P43403", "fasta")
    
    # Map identifiers between databases
    kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
    ```
    
    **Key methods:**
    - `search()`: Query UniProt with flexible search terms
    - `retrieve()`: Get protein entries in various formats (FASTA, XML, tab)
    - `mapping()`: Convert identifiers between databases
    
    Reference: `references/services_reference.md` for complete UniProt API details.
    
    ### 2. Pathway Discovery and Analysis
    
    Access KEGG pathway information for genes and organisms:
    
    ```python
    from bioservices import KEGG
    
    k = KEGG()
    k.organism = "hsa"  # Set to human
    
    # Search for organisms
    k.lookfor_organism("droso")  # Find Drosophila species
    
    # Find pathways by name
    k.lookfor_pathway("B cell")  # Returns matching pathway IDs
    
    # Get pathways containing specific genes
    pathways = k.get_pathway_by_gene("7535", "hsa")  # ZAP70 gene
    
    # Retrieve and parse pathway data
    data = k.get("hsa04660")
    parsed = k.parse(data)
    
    # Extract pathway interactions
    interactions = k.parse_kgml_pathway("hsa04660")
    relations = interactions['relations']  # Protein-protein interactions
    
    # Convert to Simple Interaction Format
    sif_data = k.pathway2sif("hsa04660")
    ```
    
    **Key methods:**
    - `lookfor_organism()`, `lookfor_pathway()`: Search by name
    - `get_pathway_by_gene()`: Find pathways containing genes
    - `parse_kgml_pathway()`: Extract structured pathway data
    - `pathway2sif()`: Get protein interaction networks
    
    Reference: `references/workflow_patterns.md` for complete pathway analysis workflows.
    
    ### 3. Compound Database Searches
    
    Search and cross-reference compounds across multiple databases:
    
    ```python
    from bioservices import KEGG, UniChem
    
    k = KEGG()
    
    # Search compounds by name
    results = k.find("compound", "Geldanamycin")  # Returns cpd:C11222
    
    # Get compound information with database links
    compound_info = k.get("cpd:C11222")  # Includes ChEBI links
    
    # Cross-reference KEGG → ChEMBL using UniChem
    u = UniChem()
    chembl_id = u.get_compound_id_from_kegg("C11222")  # Returns CHEMBL278315
    ```
    
    **Common workflow:**
    1. Search compound by name in KEGG
    2. Extract KEGG compound ID
    3. Use UniChem for KEGG → ChEMBL mapping
    4. ChEBI IDs are often provided in KEGG entries
    
    Reference: `references/identifier_mapping.md` for complete cross-database mapping guide.
    
    ### 4. Sequence Analysis
    
    Run BLAST searches and sequence alignments:
    
    ```python
    from bioservices import NCBIblast
    
    s = NCBIblast(verbose=False)
    
    # Run BLASTP against UniProtKB
    jobid = s.run(
        program="blastp",
        sequence=protein_sequence,
        stype="protein",
        database="uniprotkb",
        email="your.email@example.com"  # Required by NCBI
    )
    
    # Check job status and retrieve results
    s.getStatus(jobid)
    results = s.getResult(jobid, "out")
    ```
    
    **Note:** BLAST jobs are asynchronous. Check status before retrieving results.
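
    A simple polling loop built from the same `getStatus`/`getResult` calls shown above (the "RUNNING"/"FINISHED" status strings and the 10-second interval reflect the job dispatcher's usual behaviour, so verify them for your bioservices version):

    ```python
    import time

    # jobid comes from s.run(...) as in the example above
    while s.getStatus(jobid) == "RUNNING":
        time.sleep(10)  # poll every 10 seconds

    status = s.getStatus(jobid)
    if status == "FINISHED":
        results = s.getResult(jobid, "out")
        print(results[:500])  # preview the start of the report
    else:
        print(f"BLAST job ended with status: {status}")
    ```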
    
    ### 5. Identifier Mapping
    
    Convert identifiers between different biological databases:
    
    ```python
    from bioservices import UniProt, KEGG
    
    # UniProt mapping (many database pairs supported)
    u = UniProt()
    results = u.mapping(
        fr="UniProtKB_AC-ID",  # Source database
        to="KEGG",              # Target database
        query="P43403"          # Identifier(s) to convert
    )
    
    # KEGG gene ID → UniProt
    kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535")
    
    # For compounds, use UniChem
    from bioservices import UniChem
    u = UniChem()
    chembl_from_kegg = u.get_compound_id_from_kegg("C11222")
    ```
    
    **Supported mappings (UniProt):**
    - UniProtKB ↔ KEGG
    - UniProtKB ↔ Ensembl
    - UniProtKB ↔ PDB
    - UniProtKB ↔ RefSeq
    - And many more (see `references/identifier_mapping.md`)
    
    ### 6. Gene Ontology Queries
    
    Access GO terms and annotations:
    
    ```python
    from bioservices import QuickGO
    
    g = QuickGO(verbose=False)
    
    # Retrieve GO term information
    term_info = g.Term("GO:0003824", frmt="obo")
    
    # Search annotations
    annotations = g.Annotation(protein="P43403", format="tsv")
    ```
    
    ### 7. Protein-Protein Interactions
    
    Query interaction databases via PSICQUIC:
    
    ```python
    from bioservices import PSICQUIC
    
    s = PSICQUIC(verbose=False)
    
    # Query specific database (e.g., MINT)
    interactions = s.query("mint", "ZAP70 AND species:9606")
    
    # List available interaction databases
    databases = s.activeDBs
    ```
    
    **Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others.
    
    ## Multi-Service Integration Workflows
    
    BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns:
    
    ### Complete Protein Analysis Pipeline
    
    Execute a full protein characterization workflow:
    
    ```bash
    python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com
    ```
    
    This script demonstrates the following steps (a condensed inline sketch follows the list):
    1. UniProt search for protein entry
    2. FASTA sequence retrieval
    3. BLAST similarity search
    4. KEGG pathway discovery
    5. PSICQUIC interaction mapping
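
    A condensed version of steps 1, 2, and 4, pieced together from the UniProt and KEGG calls shown earlier in this skill (result parsing depends on the formats your bioservices version returns, so treat it as a sketch):

    ```python
    from bioservices import UniProt, KEGG

    u = UniProt(verbose=False)
    k = KEGG()

    # 1. Find the UniProt entry
    hits = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")
    print(hits)

    # 2. Retrieve its FASTA sequence
    fasta = u.retrieve("P43403", "fasta")

    # 4. Discover KEGG pathways for the corresponding gene (hsa:7535 = ZAP70)
    pathways = k.get_pathway_by_gene("7535", "hsa")
    print(pathways)
    ```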
    
    ### Pathway Network Analysis
    
    Analyze all pathways for an organism:
    
    ```bash
    python scripts/pathway_analysis.py hsa output_directory/
    ```
    
    Extracts and analyzes:
    - All pathway IDs for organism
    - Protein-protein interactions per pathway
    - Interaction type distributions
    - Exports to CSV/SIF formats
    
    ### Cross-Database Compound Search
    
    Map compound identifiers across databases:
    
    ```bash
    python scripts/compound_cross_reference.py Geldanamycin
    ```
    
    Retrieves:
    - KEGG compound ID
    - ChEBI identifier
    - ChEMBL identifier
    - Basic compound properties
    
    ### Batch Identifier Conversion
    
    Convert multiple identifiers at once:
    
    ```bash
    python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG
    ```
    
    ## Best Practices
    
    ### Output Format Handling
    
    Different services return data in various formats:
    - **XML**: Parse using BeautifulSoup (most SOAP services)
    - **Tab-separated (TSV)**: Pandas DataFrames for tabular data
    - **Dictionary/JSON**: Direct Python manipulation
    - **FASTA**: BioPython integration for sequence analysis
    
    ### Rate Limiting and Verbosity
    
    Control API request behavior:
    
    ```python
    from bioservices import KEGG
    
    k = KEGG(verbose=False)  # Suppress HTTP request details
    k.TIMEOUT = 30  # Adjust timeout for slow connections
    ```
    
    ### Error Handling
    
    Wrap service calls in try-except blocks:
    
    ```python
    try:
        results = u.search("ambiguous_query")
        if results:
            # Process results
            pass
    except Exception as e:
        print(f"Search failed: {e}")
    ```
    
    ### Organism Codes
    
    Use standard organism abbreviations:
    - `hsa`: Homo sapiens (human)
    - `mmu`: Mus musculus (mouse)
    - `dme`: Drosophila melanogaster
    - `sce`: Saccharomyces cerevisiae (yeast)
    
    List all organisms: `k.list("organism")` or `k.organismIds`
    
    ### Integration with Other Tools
    
    BioServices works well with:
    - **BioPython**: Sequence analysis on retrieved FASTA data (see the sketch below)
    - **Pandas**: Tabular data manipulation
    - **PyMOL**: 3D structure visualization (retrieve PDB IDs)
    - **NetworkX**: Network analysis of pathway interactions
    - **Galaxy**: Custom tool wrappers for workflow platforms
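
    For instance, a retrieved FASTA string can be handed straight to Biopython; a small sketch assuming `u.retrieve("P43403", "fasta")` returns plain FASTA text as in the protein analysis section:

    ```python
    from io import StringIO

    from Bio import SeqIO
    from bioservices import UniProt

    u = UniProt(verbose=False)
    fasta_text = u.retrieve("P43403", "fasta")

    # Parse the FASTA text with Biopython for downstream sequence analysis
    record = SeqIO.read(StringIO(fasta_text), "fasta")
    print(record.id, len(record.seq))
    ```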
    
    ## Resources
    
    ### scripts/
    
    Executable Python scripts demonstrating complete workflows:
    
    - `protein_analysis_workflow.py`: End-to-end protein characterization
    - `pathway_analysis.py`: KEGG pathway discovery and network extraction
    - `compound_cross_reference.py`: Multi-database compound searching
    - `batch_id_converter.py`: Bulk identifier mapping utility
    
    Scripts can be executed directly or adapted for specific use cases.
    
    ### references/
    
    Detailed documentation loaded as needed:
    
    - `services_reference.md`: Comprehensive list of all 40+ services with methods
    - `workflow_patterns.md`: Detailed multi-step analysis workflows
    - `identifier_mapping.md`: Complete guide to cross-database ID conversion
    
    Load references when working with specific services or complex integration tasks.
    
    ## Installation
    
    ```bash
    uv pip install bioservices
    ```
    
    Dependencies are automatically managed. Package is tested on Python 3.9-3.12.
    
    ## Additional Information
    
    For detailed API documentation and advanced features, refer to:
    - Official documentation: https://bioservices.readthedocs.io/
    - Source code: https://github.com/cokelaer/bioservices
    - Service-specific references in `references/services_reference.md`
    
    
  • skills/bioinformatics/alterlab-cellxgene/SKILL.md (skill, 15498 bytes)
    ---
    name: alterlab-cellxgene
    description: Query the CELLxGENE Census (61M+ cells) programmatically. Use when you need expression data across tissues, diseases, or cell types from the largest curated single-cell atlas. Best for population-scale queries, reference atlas comparisons. For analyzing your own data use scanpy or scvi-tools. Part of the AlterLab Academic Skills suite.
    license: MIT
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # CZ CELLxGENE Census
    
    ## Overview
    
    The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
    
    The Census includes:
    - **61+ million cells** from human and mouse
    - **Standardized metadata** (cell types, tissues, diseases, donors)
    - **Raw gene expression** matrices
    - **Pre-calculated embeddings** and statistics
    - **Integration with PyTorch, scanpy, and other analysis tools**
    
    ## When to Use This Skill
    
    This skill should be used when:
    - Querying single-cell expression data by cell type, tissue, or disease
    - Exploring available single-cell datasets and metadata
    - Training machine learning models on single-cell data
    - Performing large-scale cross-dataset analyses
    - Integrating Census data with scanpy or other analysis frameworks
    - Computing statistics across millions of cells
    - Accessing pre-calculated embeddings or model predictions
    
    ## Installation and Setup
    
    Install the Census API:
    ```bash
    uv pip install cellxgene-census
    ```
    
    For machine learning workflows, install additional dependencies:
    ```bash
    uv pip install cellxgene-census[experimental]
    ```
    
    ## Core Workflow Patterns
    
    ### 1. Opening the Census
    
    Always use the context manager to ensure proper resource cleanup:
    
    ```python
    import cellxgene_census
    
    # Open latest stable version
    with cellxgene_census.open_soma() as census:
        ...  # Work with census data
    
    # Open specific version for reproducibility
    with cellxgene_census.open_soma(census_version="2023-07-25") as census:
        ...  # Work with census data
    ```
    
    **Key points:**
    - Use context manager (`with` statement) for automatic cleanup
    - Specify `census_version` for reproducible analyses
    - Default opens latest "stable" release
    
    ### 2. Exploring Census Information
    
    Before querying expression data, explore available datasets and metadata.
    
    **Access summary information:**
    ```python
    # Get summary statistics
    summary = census["census_info"]["summary"].read().concat().to_pandas()
    print(f"Total cells: {summary['total_cell_count'][0]}")
    
    # Get all datasets
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    
    # Filter datasets by criteria
    covid_datasets = datasets[datasets["disease"].str.contains("COVID", na=False)]
    ```
    
    **Query cell metadata to understand available data:**
    ```python
    # Get unique cell types in a tissue
    cell_metadata = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["cell_type"]
    )
    unique_cell_types = cell_metadata["cell_type"].unique()
    print(f"Found {len(unique_cell_types)} cell types in brain")
    
    # Count cells by cell type (only the cell_type column was requested above)
    cell_type_counts = cell_metadata["cell_type"].value_counts()
    ```
    
    **Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells unless specifically analyzing duplicates.
    
    ### 3. Querying Expression Data (Small to Medium Scale)
    
    For queries returning < 100k cells that fit in memory, use `get_anndata()`:
    
    ```python
    # Basic query with cell type and tissue filters
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",  # or "Mus musculus"
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["assay", "disease", "sex", "donor_id"],
    )
    
    # Query specific genes with multiple filters
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
        obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
        obs_column_names=["cell_type", "tissue_general", "donor_id"],
    )
    ```
    
    **Filter syntax:**
    - Use `obs_value_filter` for cell filtering
    - Use `var_value_filter` for gene filtering
    - Combine conditions with `and`, `or`
    - Use `in` for multiple values: `tissue in ['lung', 'liver']`
    - Select only needed columns with `obs_column_names`
    
    **Getting metadata separately:**
    ```python
    # Query cell metadata
    cell_metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="disease == 'COVID-19' and is_primary_data == True",
        column_names=["cell_type", "tissue_general", "donor_id"]
    )
    
    # Query gene metadata
    gene_metadata = cellxgene_census.get_var(
        census, "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A']",
        column_names=["feature_id", "feature_name", "feature_length"]
    )
    ```
    
    ### 4. Large-Scale Queries (Out-of-Core Processing)
    
    For queries exceeding available RAM, use `axis_query()` with iterative processing:
    
    ```python
    import tiledbsoma as soma
    
    # Create axis query
    query = census["census_data"]["homo_sapiens"].axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'brain' and is_primary_data == True"
        ),
        var_query=soma.AxisQuery(
            value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
        )
    )
    
    # Iterate through expression matrix in chunks
    iterator = query.X("raw").tables()
    for batch in iterator:
        # batch is a pyarrow.Table with columns:
        # - soma_data: expression value
        # - soma_dim_0: cell (obs) coordinate
        # - soma_dim_1: gene (var) coordinate
        process_batch(batch)
    ```
    
    **Computing incremental statistics:**
    ```python
    # Example: Calculate mean expression
    n_observations = 0
    sum_values = 0.0
    
    iterator = query.X("raw").tables()
    for batch in iterator:
        values = batch["soma_data"].to_numpy()
        n_observations += len(values)
        sum_values += values.sum()
    
    mean_expression = sum_values / n_observations
    ```
    
    ### 5. Machine Learning with PyTorch
    
    For training models, use the experimental PyTorch integration:
    
    ```python
    from cellxgene_census.experimental.ml import experiment_dataloader
    
    with cellxgene_census.open_soma() as census:
        # Create dataloader
        dataloader = experiment_dataloader(
            census["census_data"]["homo_sapiens"],
            measurement_name="RNA",
            X_name="raw",
            obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
            obs_column_names=["cell_type"],
            batch_size=128,
            shuffle=True,
        )
    
        # Training loop
        for epoch in range(num_epochs):
            for batch in dataloader:
                X = batch["X"]  # Gene expression tensor
                labels = batch["obs"]["cell_type"]  # Cell type labels
    
                # Forward pass
                outputs = model(X)
                loss = criterion(outputs, labels)
    
                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    ```
    
    **Train/test splitting:**
    ```python
    from cellxgene_census.experimental.ml import ExperimentDataset
    
    # Create dataset from experiment
    dataset = ExperimentDataset(
        experiment_axis_query,
        layer_name="raw",
        obs_column_names=["cell_type"],
        batch_size=128,
    )
    
    # Split into train and test
    train_dataset, test_dataset = dataset.random_split(
        split=[0.8, 0.2],
        seed=42
    )
    ```
    
    ### 6. Integration with Scanpy
    
    Seamlessly integrate Census data with scanpy workflows:
    
    ```python
    import scanpy as sc
    
    # Load data from Census
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
    )
    
    # Standard scanpy workflow
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    
    # Dimensionality reduction
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    
    # Visualization
    sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
    ```
    
    ### 7. Multi-Dataset Integration
    
    Query and integrate multiple datasets:
    
    ```python
    # Strategy 1: Query multiple tissues separately
    tissues = ["lung", "liver", "kidney"]
    adatas = []
    
    for tissue in tissues:
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
        )
        adata.obs["tissue"] = tissue
        adatas.append(adata)
    
    # Concatenate
    combined = adatas[0].concatenate(*adatas[1:])  # unpack the remaining AnnData objects
    
    # Strategy 2: Query multiple datasets directly
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
    )
    ```
    
    ## Key Concepts and Best Practices
    
    ### Always Filter for Primary Data
    Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
    ```python
    obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
    ```
    
    ### Specify Census Version for Reproducibility
    Always specify the Census version in production analyses:
    ```python
    census = cellxgene_census.open_soma(census_version="2023-07-25")
    ```
    
    ### Estimate Query Size Before Loading
    For large queries, first check the number of cells to avoid memory issues:
    ```python
    # Get cell count
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["soma_joinid"]
    )
    n_cells = len(metadata)
    print(f"Query will return {n_cells:,} cells")
    
    # If too large (>100k), use out-of-core processing
    ```
    
    ### Use tissue_general for Broader Groupings
    The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses:
    ```python
    # Broader grouping
    obs_value_filter="tissue_general == 'immune system'"
    
    # Specific tissue
    obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
    ```
    
    ### Select Only Needed Columns
    Minimize data transfer by specifying only required metadata columns:
    ```python
    obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns
    ```
    
    ### Check Dataset Presence for Gene-Specific Queries
    When analyzing specific genes, verify which datasets measured them:
    ```python
    presence = cellxgene_census.get_presence_matrix(
        census,
        "homo_sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A']"
    )
    ```
    
    ### Two-Step Workflow: Explore Then Query
    First explore metadata to understand available data, then query expression:
    ```python
    # Step 1: Explore what's available
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="disease == 'COVID-19' and is_primary_data == True",
        column_names=["cell_type", "tissue_general"]
    )
    print(metadata.value_counts())
    
    # Step 2: Query based on findings
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
    )
    ```
    
    ## Available Metadata Fields
    
    ### Cell Metadata (obs)
    Key fields for filtering (a quick way to list their actual values follows these lists):
    - `cell_type`, `cell_type_ontology_term_id`
    - `tissue`, `tissue_general`, `tissue_ontology_term_id`
    - `disease`, `disease_ontology_term_id`
    - `assay`, `assay_ontology_term_id`
    - `donor_id`, `sex`, `self_reported_ethnicity`
    - `development_stage`, `development_stage_ontology_term_id`
    - `dataset_id`
    - `is_primary_data` (Boolean: True = unique cell)
    
    ### Gene Metadata (var)
    - `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
    - `feature_name` (Gene symbol, e.g., "FOXP2")
    - `feature_length` (Gene length in base pairs)
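
    To see which values any of these fields actually take, a metadata-only query is enough; a quick sketch using the same `get_obs` helper shown earlier, scoped to one tissue to keep the result small (swap `"disease"` for whichever field you need):

    ```python
    with cellxgene_census.open_soma() as census:
        obs = cellxgene_census.get_obs(
            census, "homo_sapiens",
            value_filter="tissue_general == 'lung' and is_primary_data == True",
            column_names=["disease"],
        )
        print(obs["disease"].value_counts())
    ```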
    
    ## Reference Documentation
    
    This skill includes detailed reference documentation:
    
    ### references/census_schema.md
    Comprehensive documentation of:
    - Census data structure and organization
    - All available metadata fields
    - Value filter syntax and operators
    - SOMA object types
    - Data inclusion criteria
    
    **When to read:** When you need detailed schema information, full list of metadata fields, or complex filter syntax.
    
    ### references/common_patterns.md
    Examples and patterns for:
    - Exploratory queries (metadata only)
    - Small-to-medium queries (AnnData)
    - Large queries (out-of-core processing)
    - PyTorch integration
    - Scanpy integration workflows
    - Multi-dataset integration
    - Best practices and common pitfalls
    
    **When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues.
    
    ## Common Use Cases
    
    ### Use Case 1: Explore Cell Types in a Tissue
    ```python
    with cellxgene_census.open_soma() as census:
        cells = cellxgene_census.get_obs(
            census, "homo_sapiens",
            value_filter="tissue_general == 'lung' and is_primary_data == True",
            column_names=["cell_type"]
        )
        print(cells["cell_type"].value_counts())
    ```
    
    ### Use Case 2: Query Marker Gene Expression
    ```python
    with cellxgene_census.open_soma() as census:
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
            obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
        )
    ```
    
    ### Use Case 3: Train Cell Type Classifier
    ```python
    from cellxgene_census.experimental.ml import experiment_dataloader
    
    with cellxgene_census.open_soma() as census:
        dataloader = experiment_dataloader(
            census["census_data"]["homo_sapiens"],
            measurement_name="RNA",
            X_name="raw",
            obs_value_filter="is_primary_data == True",
            obs_column_names=["cell_type"],
            batch_size=128,
            shuffle=True,
        )
    
        # Train model
        for epoch in range(epochs):
            for batch in dataloader:
                # Training logic
                pass
    ```
    
    ### Use Case 4: Cross-Tissue Analysis
    ```python
    with cellxgene_census.open_soma() as census:
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
        )
    
        # Analyze macrophage differences across tissues
        sc.tl.rank_genes_groups(adata, groupby="tissue_general")
    ```
    
    ## Troubleshooting
    
    ### Query Returns Too Many Cells
    - Add more specific filters to reduce scope
    - Use `tissue` instead of `tissue_general` for finer granularity
    - Filter by specific `dataset_id` if known
    - Switch to out-of-core processing for large queries
    
    ### Memory Errors
    - Reduce query scope with more restrictive filters
    - Select fewer genes with `var_value_filter`
    - Use out-of-core processing with `axis_query()`
    - Process data in batches
    
    ### Duplicate Cells in Results
    - Always include `is_primary_data == True` in filters
    - Check if intentionally querying across multiple datasets
    
    ### Gene Not Found
    - Verify gene name spelling (case-sensitive)
    - Try Ensembl ID with `feature_id` instead of `feature_name`
    - Check dataset presence matrix to see if gene was measured
    - Some genes may have been filtered during Census construction
    
    ### Version Inconsistencies
    - Always specify `census_version` explicitly
    - Use same version across all analyses
    - Check release notes for version-specific changes
    
    
  • skills/bioinformatics/alterlab-esm/SKILL.md (skill, 10632 bytes)
    ---
    name: alterlab-esm
    description: Comprehensive toolkit for protein language models including ESM3 (generative multimodal protein design across sequence, structure, and function) and ESM C (efficient protein embeddings and representations). Use this skill when working with protein sequences, structures, or function prediction; designing novel proteins; generating protein embeddings; performing inverse folding; or conducting protein engineering tasks. Supports both local model usage and cloud-based Forge API for scalable inference. Part of the AlterLab Academic Skills suite.
    license: MIT
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # ESM: Evolutionary Scale Modeling
    
    ## Overview
    
    ESM provides state-of-the-art protein language models for understanding, generating, and designing proteins. This skill enables working with two model families: ESM3 for generative protein design across sequence, structure, and function, and ESM C for efficient protein representation learning and embeddings.
    
    ## Core Capabilities
    
    ### 1. Protein Sequence Generation with ESM3
    
    Generate novel protein sequences with desired properties using multimodal generative modeling.
    
    **When to use:**
    - Designing proteins with specific functional properties
    - Completing partial protein sequences
    - Generating variants of existing proteins
    - Creating proteins with desired structural characteristics
    
    **Basic usage:**
    
    ```python
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig
    
    # Load model locally
    model: ESM3InferenceClient = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
    
    # Create protein prompt
    protein = ESMProtein(sequence="MPRT___KEND")  # '_' represents masked positions
    
    # Generate completion
    protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
    print(protein.sequence)
    ```
    
    **For remote/cloud usage via Forge API:**
    
    ```python
    from esm.sdk.forge import ESM3ForgeInferenceClient
    from esm.sdk.api import ESMProtein, GenerationConfig
    
    # Connect to Forge
    model = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", url="https://forge.evolutionaryscale.ai", token="<token>")
    
    # Generate
    protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
    ```
    
    See `references/esm3-api.md` for detailed ESM3 model specifications, advanced generation configurations, and multimodal prompting examples.
    
    ### 2. Structure Prediction and Inverse Folding
    
    Use ESM3's structure track for structure prediction from sequence or inverse folding (sequence design from structure).
    
    **Structure prediction:**
    
    ```python
    from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig
    
    # Predict structure from sequence
    protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...")
    protein_with_structure = model.generate(
        protein,
        GenerationConfig(track="structure", num_steps=protein.sequence.count("_"))
    )
    
    # Access predicted structure
    coordinates = protein_with_structure.coordinates  # 3D coordinates
    protein_with_structure.to_pdb("predicted_structure.pdb")  # write the predicted structure to a PDB file
    ```
    
    **Inverse folding (sequence from structure):**
    
    ```python
    # Design sequence for a target structure
    protein_with_structure = ESMProtein.from_pdb("target_structure.pdb")
    protein_with_structure.sequence = None  # Remove sequence
    
    # Generate sequence that folds to this structure
    designed_protein = model.generate(
        protein_with_structure,
        GenerationConfig(track="sequence", num_steps=50, temperature=0.7)
    )
    ```
    
    ### 3. Protein Embeddings with ESM C
    
    Generate high-quality embeddings for downstream tasks like function prediction, classification, or similarity analysis.
    
    **When to use:**
    - Extracting protein representations for machine learning
    - Computing sequence similarities
    - Feature extraction for protein classification
    - Transfer learning for protein-related tasks
    
    **Basic usage:**
    
    ```python
    from esm.models.esmc import ESMC
    from esm.sdk.api import ESMProtein
    
    # Load ESM C model
    model = ESMC.from_pretrained("esmc-300m").to("cuda")
    
    # Get embeddings
    protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...")
    protein_tensor = model.encode(protein)
    
    # Generate embeddings
    embeddings = model.forward(protein_tensor)
    ```
    
    **Batch processing:**
    
    ```python
    # Encode multiple proteins
    proteins = [
        ESMProtein(sequence="MPRTKEIND..."),
        ESMProtein(sequence="AGLIVHSPQ..."),
        ESMProtein(sequence="KTEFLNDGR...")
    ]
    
    embeddings_list = [model.logits(model.forward(model.encode(p))) for p in proteins]
    ```
    
    See `references/esm-c-api.md` for ESM C model details, efficiency comparisons, and advanced embedding strategies.
    
    ### 4. Function Conditioning and Annotation
    
    Use ESM3's function track to generate proteins with specific functional annotations or predict function from sequence.
    
    **Function-conditioned generation:**
    
    ```python
    from esm.sdk.api import ESMProtein, FunctionAnnotation, GenerationConfig
    
    # Create protein with desired function
    protein = ESMProtein(
        sequence="_" * 200,  # Generate 200 residue protein
        function_annotations=[
            FunctionAnnotation(label="fluorescent_protein", start=50, end=150)
        ]
    )
    
    # Generate sequence with specified function
    functional_protein = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=200)
    )
    ```
    
    ### 5. Chain-of-Thought Generation
    
    Iteratively refine protein designs using ESM3's chain-of-thought generation approach.
    
    ```python
    from esm.sdk.api import GenerationConfig
    
    # Multi-step refinement
    protein = ESMProtein(sequence="MPRT" + "_" * 100 + "KEND")
    
    # Step 1: Generate initial structure
    config = GenerationConfig(track="structure", num_steps=50)
    protein = model.generate(protein, config)
    
    # Step 2: Refine sequence based on structure
    config = GenerationConfig(track="sequence", num_steps=50, temperature=0.5)
    protein = model.generate(protein, config)
    
    # Step 3: Predict function
    config = GenerationConfig(track="function", num_steps=20)
    protein = model.generate(protein, config)
    ```
    
    ### 6. Batch Processing with Forge API
    
    Process multiple proteins efficiently using Forge's async executor.
    
    ```python
    from esm.sdk.forge import ESM3ForgeInferenceClient
    import asyncio
    
    client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token="<token>")
    
    # Async batch processing
    async def batch_generate(proteins_list):
        tasks = [
            client.async_generate(protein, GenerationConfig(track="sequence"))
            for protein in proteins_list
        ]
        return await asyncio.gather(*tasks)
    
    # Execute
    proteins = [ESMProtein(sequence=f"MPRT{'_' * 50}KEND") for _ in range(10)]
    results = asyncio.run(batch_generate(proteins))
    ```
    
    See `references/forge-api.md` for detailed Forge API documentation, authentication, rate limits, and batch processing patterns.
    
    ## Model Selection Guide
    
    **ESM3 Models (Generative):**
    - `esm3-sm-open-v1` (1.4B) - Open weights, local usage, good for experimentation
    - `esm3-medium-2024-08` (7B) - Best balance of quality and speed (Forge only)
    - `esm3-large-2024-03` (98B) - Highest quality, slower (Forge only)
    
    **ESM C Models (Embeddings):**
    - `esmc-300m` (30 layers) - Lightweight, fast inference
    - `esmc-600m` (36 layers) - Balanced performance
    - `esmc-6b` (80 layers) - Maximum representation quality
    
    **Selection criteria:**
    - **Local development/testing:** Use `esm3-sm-open-v1` or `esmc-300m`
    - **Production quality:** Use `esm3-medium-2024-08` via Forge
    - **Maximum accuracy:** Use `esm3-large-2024-03` or `esmc-6b`
    - **High throughput:** Use Forge API with batch executor
    - **Cost optimization:** Use smaller models, implement caching strategies
    
    ## Installation
    
    **Basic installation:**
    
    ```bash
    uv pip install esm
    ```
    
    **With Flash Attention (recommended for faster inference):**
    
    ```bash
    uv pip install esm
    uv pip install flash-attn --no-build-isolation
    ```
    
    **For Forge API access:**
    
    ```bash
    uv pip install esm  # SDK includes Forge client
    ```
    
    No additional dependencies needed. Obtain Forge API token at https://forge.evolutionaryscale.ai
    
    ## Common Workflows
    
    For detailed examples and complete workflows, see `references/workflows.md` which includes:
    - Novel GFP design with chain-of-thought
    - Protein variant generation and screening
    - Structure-based sequence optimization
    - Function prediction pipelines
    - Embedding-based clustering and analysis
    
    ## References
    
    This skill includes comprehensive reference documentation:
    
    - `references/esm3-api.md` - ESM3 model architecture, API reference, generation parameters, and multimodal prompting
    - `references/esm-c-api.md` - ESM C model details, embedding strategies, and performance optimization
    - `references/forge-api.md` - Forge platform documentation, authentication, batch processing, and deployment
    - `references/workflows.md` - Complete examples and common workflow patterns
    
    These references contain detailed API specifications, parameter descriptions, and advanced usage patterns. Load them as needed for specific tasks.
    
    ## Best Practices
    
    **For generation tasks:**
    - Start with smaller models for prototyping (`esm3-sm-open-v1`)
    - Use temperature parameter to control diversity (0.0 = deterministic, 1.0 = diverse)
    - Implement iterative refinement with chain-of-thought for complex designs
    - Validate generated sequences with structure prediction or wet-lab experiments
    
    **For embedding tasks:**
    - Batch process sequences when possible for efficiency
    - Cache embeddings for repeated analyses
    - Normalize embeddings when computing similarities
    - Use appropriate model size based on downstream task requirements
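
    As a minimal illustration of the caching and normalization points above (assuming per-protein embeddings have already been obtained as a NumPy array, e.g. mean-pooled ESM C representations):
    
    ```python
    import numpy as np
    
    def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
        """Normalize per-protein embeddings, then compute all pairwise cosine similarities."""
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / np.clip(norms, 1e-12, None)
        return unit @ unit.T
    
    # Placeholder data: 10 proteins with 960-dimensional embeddings
    emb = np.random.rand(10, 960)
    print(cosine_similarity_matrix(emb).shape)  # (10, 10)
    ```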
    
    **For production deployment:**
    - Use Forge API for scalability and latest models
    - Implement error handling and retry logic for API calls
    - Monitor token usage and implement rate limiting
    - Consider AWS SageMaker deployment for dedicated infrastructure
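
    A simple retry wrapper is often enough for the error-handling point above. This is a generic sketch, not part of the ESM SDK: narrow the caught exception types to whatever the SDK actually raises, and the `client.generate` call in the usage comment is assumed from the examples earlier in this skill.
    
    ```python
    import time
    
    def with_retries(call, max_attempts=3, backoff_seconds=2.0):
        """Retry an API call with exponential backoff."""
        for attempt in range(1, max_attempts + 1):
            try:
                return call()
            except Exception:  # replace with the specific SDK/network exceptions you expect
                if attempt == max_attempts:
                    raise
                time.sleep(backoff_seconds * 2 ** (attempt - 1))
    
    # Usage (hypothetical): result = with_retries(lambda: client.generate(protein, config))
    ```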
    
    ## Resources and Documentation
    
    - **GitHub Repository:** https://github.com/evolutionaryscale/esm
    - **Forge Platform:** https://forge.evolutionaryscale.ai
    - **Scientific Paper:** Hayes et al., Science (2025) - https://www.science.org/doi/10.1126/science.ads0018
    - **Blog Posts:**
      - ESM3 Release: https://www.evolutionaryscale.ai/blog/esm3-release
      - ESM C Launch: https://www.evolutionaryscale.ai/blog/esm-cambrian
    - **Community:** Slack community at https://bit.ly/3FKwcWd
    - **Model Weights:** HuggingFace EvolutionaryScale organization
    
    ## Responsible Use
    
    ESM is designed for beneficial applications in protein engineering, drug discovery, and scientific research. Follow the Responsible Biodesign Framework (https://responsiblebiodesign.ai/) when designing novel proteins. Consider biosafety and ethical implications of protein designs before experimental validation.
    
    
  • skills/bioinformatics/alterlab-anndata/SKILL.mdskill
    Show content (10267 bytes)
    ---
    name: alterlab-anndata
    description: Data structure for annotated matrices in single-cell analysis. Use when working with .h5ad files or integrating with the scverse ecosystem. This is the data format skill—for analysis workflows use scanpy; for probabilistic models use scvi-tools; for population-scale queries use cellxgene-census. Part of the AlterLab Academic Skills suite.
    license: MIT
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # AnnData
    
    ## Overview
    
    AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.
    
    ## When to Use This Skill
    
    Use this skill when:
    - Creating, reading, or writing AnnData objects
    - Working with h5ad, zarr, or other genomics data formats
    - Performing single-cell RNA-seq analysis
    - Managing large datasets with sparse matrices or backed mode
    - Concatenating multiple datasets or experimental batches
    - Subsetting, filtering, or transforming annotated data
    - Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
    
    ## Installation
    
    ```bash
    uv pip install anndata
    
    # With optional dependencies
    uv pip install anndata[dev,test,doc]
    ```
    
    ## Quick Start
    
    ### Creating an AnnData object
    ```python
    import anndata as ad
    import numpy as np
    import pandas as pd
    
    # Minimal creation
    X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
    adata = ad.AnnData(X)
    
    # With metadata
    obs = pd.DataFrame({
        'cell_type': ['T cell', 'B cell'] * 50,
        'sample': ['A', 'B'] * 50
    }, index=[f'cell_{i}' for i in range(100)])
    
    var = pd.DataFrame({
        'gene_name': [f'Gene_{i}' for i in range(2000)]
    }, index=[f'ENSG{i:05d}' for i in range(2000)])
    
    adata = ad.AnnData(X=X, obs=obs, var=var)
    ```
    
    ### Reading data
    ```python
    # Read h5ad file
    adata = ad.read_h5ad('data.h5ad')
    
    # Read with backed mode (for large files)
    adata = ad.read_h5ad('large_data.h5ad', backed='r')
    
    # Read other formats
    adata = ad.read_csv('data.csv')
    adata = ad.read_loom('data.loom')
    
    # 10x Genomics HDF5 files are read via scanpy
    import scanpy as sc
    adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
    ```
    
    ### Writing data
    ```python
    # Write h5ad file
    adata.write_h5ad('output.h5ad')
    
    # Write with compression
    adata.write_h5ad('output.h5ad', compression='gzip')
    
    # Write other formats
    adata.write_zarr('output.zarr')
    adata.write_csvs('output_dir/')
    ```
    
    ### Basic operations
    ```python
    # Subset by conditions
    t_cells = adata[adata.obs['cell_type'] == 'T cell']
    
    # Subset by indices
    subset = adata[0:50, 0:100]
    
    # Add metadata
    adata.obs['quality_score'] = np.random.rand(adata.n_obs)
    adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8
    
    # Access dimensions
    print(f"{adata.n_obs} observations × {adata.n_vars} variables")
    ```
    
    ## Core Capabilities
    
    ### 1. Data Structure
    
    Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.
    
    **See**: `references/data_structure.md` for comprehensive information on:
    - Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
    - Creating AnnData objects from various sources
    - Accessing and manipulating data components
    - Memory-efficient practices
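
    A brief sketch of a few of these components, using randomly generated placeholder data:
    
    ```python
    import anndata as ad
    import numpy as np
    
    adata = ad.AnnData(np.random.rand(100, 50))
    
    # layers: alternative matrices with the same shape as X (e.g. raw counts)
    adata.layers['counts'] = np.random.poisson(1.0, size=adata.shape)
    
    # obsm: per-observation multi-dimensional annotations (e.g. PCA coordinates)
    adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 10)
    
    # uns: unstructured metadata
    adata.uns['experiment'] = {'protocol': '10x', 'date': '2025-01-01'}
    ```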
    
    ### 2. Input/Output Operations
    
    Read and write data in various formats with support for compression, backed mode, and cloud storage.
    
    **See**: `references/io_operations.md` for details on:
    - Native formats (h5ad, zarr)
    - Alternative formats (CSV, MTX, Loom, 10X, Excel)
    - Backed mode for large datasets
    - Remote data access
    - Format conversion
    - Performance optimization
    
    Common commands:
    ```python
    # Read/write h5ad
    adata = ad.read_h5ad('data.h5ad', backed='r')
    adata.write_h5ad('output.h5ad', compression='gzip')
    
    # Read 10X data (via scanpy)
    import scanpy as sc
    adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
    
    # Read MTX format
    adata = ad.read_mtx('matrix.mtx').T
    ```
    
    ### 3. Concatenation
    
    Combine multiple AnnData objects along observations or variables with flexible join strategies.
    
    **See**: `references/concatenation.md` for comprehensive coverage of:
    - Basic concatenation (axis=0 for observations, axis=1 for variables)
    - Join types (inner, outer)
    - Merge strategies (same, unique, first, only)
    - Tracking data sources with labels
    - Lazy concatenation (AnnCollection)
    - On-disk concatenation for large datasets
    
    Common commands:
    ```python
    # Concatenate observations (combine samples)
    adata = ad.concat(
        [adata1, adata2, adata3],
        axis=0,
        join='inner',
        label='batch',
        keys=['batch1', 'batch2', 'batch3']
    )
    
    # Concatenate variables (combine modalities)
    adata = ad.concat([adata_rna, adata_protein], axis=1)
    
    # Lazy concatenation (datasets are combined without loading X into memory)
    from anndata.experimental import AnnCollection
    
    adatas = [ad.read_h5ad(f, backed='r') for f in ['data1.h5ad', 'data2.h5ad']]
    collection = AnnCollection(adatas, join_obs='outer', label='dataset')
    ```
    
    ### 4. Data Manipulation
    
    Transform, subset, filter, and reorganize data efficiently.
    
    **See**: `references/manipulation.md` for detailed guidance on:
    - Subsetting (by indices, names, boolean masks, metadata conditions)
    - Transposition
    - Copying (full copies vs views)
    - Renaming (observations, variables, categories)
    - Type conversions (strings to categoricals, sparse/dense)
    - Adding/removing data components
    - Reordering
    - Quality control filtering
    
    Common commands:
    ```python
    # Subset by metadata
    filtered = adata[adata.obs['quality_score'] > 0.8]
    hv_genes = adata[:, adata.var['highly_variable']]
    
    # Transpose
    adata_T = adata.T
    
    # Copy vs view
    view = adata[0:100, :]  # View (lightweight reference)
    copy = adata[0:100, :].copy()  # Independent copy
    
    # Convert strings to categoricals
    adata.strings_to_categoricals()
    ```
    
    ### 5. Best Practices
    
    Follow recommended patterns for memory efficiency, performance, and reproducibility.
    
    **See**: `references/best_practices.md` for guidelines on:
    - Memory management (sparse matrices, categoricals, backed mode)
    - Views vs copies
    - Data storage optimization
    - Performance optimization
    - Working with raw data
    - Metadata management
    - Reproducibility
    - Error handling
    - Integration with other tools
    - Common pitfalls and solutions
    
    Key recommendations:
    ```python
    # Use sparse matrices for sparse data
    from scipy.sparse import csr_matrix
    adata.X = csr_matrix(adata.X)
    
    # Convert strings to categoricals
    adata.strings_to_categoricals()
    
    # Use backed mode for large files
    adata = ad.read_h5ad('large.h5ad', backed='r')
    
    # Store raw before filtering
    adata.raw = adata.copy()
    adata = adata[:, adata.var['highly_variable']]
    ```
    
    ## Integration with Scverse Ecosystem
    
    AnnData serves as the foundational data structure for the scverse ecosystem:
    
    ### Scanpy (Single-cell analysis)
    ```python
    import scanpy as sc
    
    # Preprocessing
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    
    # Dimensionality reduction
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata)
    sc.tl.leiden(adata)
    
    # Visualization
    sc.pl.umap(adata, color=['cell_type', 'leiden'])
    ```
    
    ### Muon (Multimodal data)
    ```python
    import muon as mu
    
    # Combine RNA and protein data
    mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
    ```
    
    ### PyTorch integration
    ```python
    from anndata.experimental import AnnLoader
    
    # Create DataLoader for deep learning
    dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
    
    for batch in dataloader:
        X = batch.X
        # Train model
    ```
    
    ## Common Workflows
    
    ### Single-cell RNA-seq analysis
    ```python
    import anndata as ad
    import scanpy as sc
    import numpy as np
    
    # 1. Load data (10x HDF5 files are read via scanpy)
    adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
    
    # 2. Quality control (flatten sparse-matrix sums to 1-D arrays before assigning)
    adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
    adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
    adata = adata[adata.obs['n_genes'] > 200]
    adata = adata[adata.obs['n_counts'] < 50000]
    
    # 3. Store raw
    adata.raw = adata.copy()
    
    # 4. Normalize and filter
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var['highly_variable']]
    
    # 5. Save processed data
    adata.write_h5ad('processed.h5ad')
    ```
    
    ### Batch integration
    ```python
    # Load multiple batches
    adata1 = ad.read_h5ad('batch1.h5ad')
    adata2 = ad.read_h5ad('batch2.h5ad')
    adata3 = ad.read_h5ad('batch3.h5ad')
    
    # Concatenate with batch labels
    adata = ad.concat(
        [adata1, adata2, adata3],
        label='batch',
        keys=['batch1', 'batch2', 'batch3'],
        join='inner'
    )
    
    # Apply batch correction
    import scanpy as sc
    sc.pp.combat(adata, key='batch')
    
    # Continue analysis
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    ```
    
    ### Working with large datasets
    ```python
    # Open in backed mode
    adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')
    
    # Filter based on metadata (no data loading)
    high_quality = adata[adata.obs['quality_score'] > 0.8]
    
    # Load filtered subset
    adata_subset = high_quality.to_memory()
    
    # Process subset
    process(adata_subset)
    
    # Or process in chunks
    chunk_size = 1000
    for i in range(0, adata.n_obs, chunk_size):
        chunk = adata[i:i+chunk_size, :].to_memory()
        process(chunk)
    ```
    
    ## Troubleshooting
    
    ### Out of memory errors
    Use backed mode or convert to sparse matrices:
    ```python
    # Backed mode
    adata = ad.read_h5ad('file.h5ad', backed='r')
    
    # Sparse matrices
    from scipy.sparse import csr_matrix
    adata.X = csr_matrix(adata.X)
    ```
    
    ### Slow file reading
    Use compression and appropriate formats:
    ```python
    # Optimize for storage
    adata.strings_to_categoricals()
    adata.write_h5ad('file.h5ad', compression='gzip')
    
    # Use Zarr for cloud storage
    adata.write_zarr('file.zarr', chunks=(1000, 1000))
    ```
    
    ### Index alignment issues
    Always align external data on index:
    ```python
    # Wrong
    adata.obs['new_col'] = external_data['values']
    
    # Correct
    adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
    ```
    
    ## Additional Resources
    
    - **Official documentation**: https://anndata.readthedocs.io/
    - **Scanpy tutorials**: https://scanpy.readthedocs.io/
    - **Scverse ecosystem**: https://scverse.org/
    - **GitHub repository**: https://github.com/scverse/anndata
    
    
  • skills/bioinformatics/alterlab-arboreto/SKILL.mdskill
    Show content (6982 bytes)
    ---
    name: alterlab-arboreto
    description: Infer gene regulatory networks (GRNs) from gene expression data using scalable algorithms (GRNBoost2, GENIE3). Use when analyzing transcriptomics data (bulk RNA-seq, single-cell RNA-seq) to identify transcription factor-target gene relationships and regulatory interactions. Supports distributed computation for large-scale datasets. Part of the AlterLab Academic Skills suite.
    license: MIT
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # Arboreto
    
    ## Overview
    
    Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.
    
    **Core capability**: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).
    
    ## Quick Start
    
    Install arboreto:
    ```bash
    uv pip install arboreto
    ```
    
    Basic GRN inference:
    ```python
    import pandas as pd
    from arboreto.algo import grnboost2
    
    if __name__ == '__main__':
        # Load expression data (genes as columns)
        expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
    
        # Infer regulatory network
        network = grnboost2(expression_data=expression_matrix)
    
        # Save results (TF, target, importance)
        network.to_csv('network.tsv', sep='\t', index=False, header=False)
    ```
    
    **Critical**: Always use `if __name__ == '__main__':` guard because Dask spawns new processes.
    
    ## Core Capabilities
    
    ### 1. Basic GRN Inference
    
    For standard GRN inference workflows including:
    - Input data preparation (Pandas DataFrame or NumPy array)
    - Running inference with GRNBoost2 or GENIE3
    - Filtering by transcription factors
    - Output format and interpretation
    
    **See**: `references/basic_inference.md`
    
    **Use the ready-to-run script**: `scripts/basic_grn_inference.py` for standard inference tasks:
    ```bash
    python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
    ```
    
    ### 2. Algorithm Selection
    
    Arboreto provides two algorithms:
    
    **GRNBoost2 (Recommended)**:
    - Fast gradient boosting-based inference
    - Optimized for large datasets (10k+ observations)
    - Default choice for most analyses
    
    **GENIE3**:
    - Random Forest-based inference
    - Original multiple regression approach
    - Use for comparison or validation
    
    Quick comparison:
    ```python
    from arboreto.algo import grnboost2, genie3
    
    # Fast, recommended
    network_grnboost = grnboost2(expression_data=matrix)
    
    # Classic algorithm
    network_genie3 = genie3(expression_data=matrix)
    ```
    
    **For detailed algorithm comparison, parameters, and selection guidance**: `references/algorithms.md`
    
    ### 3. Distributed Computing
    
    Scale inference from local multi-core to cluster environments:
    
    **Local (default)** - Uses all available cores automatically:
    ```python
    network = grnboost2(expression_data=matrix)
    ```
    
    **Custom local client** - Control resources:
    ```python
    from distributed import LocalCluster, Client
    
    local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
    client = Client(local_cluster)
    
    network = grnboost2(expression_data=matrix, client_or_address=client)
    
    client.close()
    local_cluster.close()
    ```
    
    **Cluster computing** - Connect to remote Dask scheduler:
    ```python
    from distributed import Client
    
    client = Client('tcp://scheduler:8786')
    network = grnboost2(expression_data=matrix, client_or_address=client)
    ```
    
    **For cluster setup, performance optimization, and large-scale workflows**: `references/distributed_computing.md`
    
    ## Installation
    
    ```bash
    uv pip install arboreto
    ```
    
    **Dependencies**: scipy, scikit-learn, numpy, pandas, dask, distributed
    
    ## Common Use Cases
    
    ### Single-Cell RNA-seq Analysis
    ```python
    import pandas as pd
    from arboreto.algo import grnboost2
    
    if __name__ == '__main__':
        # Load single-cell expression matrix (cells x genes)
        sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')
    
        # Infer cell-type-specific regulatory network
        network = grnboost2(expression_data=sc_data, seed=42)
    
        # Filter high-confidence links
        high_confidence = network[network['importance'] > 0.5]
        high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)
    ```
    
    ### Bulk RNA-seq with TF Filtering
    ```python
    import pandas as pd
    
    from arboreto.utils import load_tf_names
    from arboreto.algo import grnboost2
    
    if __name__ == '__main__':
        # Load data
        expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
        tf_names = load_tf_names('human_tfs.txt')
    
        # Infer with TF restriction
        network = grnboost2(
            expression_data=expression_data,
            tf_names=tf_names,
            seed=123
        )
    
        network.to_csv('tf_target_network.tsv', sep='\t', index=False)
    ```
    
    ### Comparative Analysis (Multiple Conditions)
    ```python
    import pandas as pd
    from arboreto.algo import grnboost2
    
    if __name__ == '__main__':
        # Infer networks for different conditions
        conditions = ['control', 'treatment_24h', 'treatment_48h']
    
        for condition in conditions:
            data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
            network = grnboost2(expression_data=data, seed=42)
            network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
    ```
    
    ## Output Interpretation
    
    Arboreto returns a DataFrame with regulatory links:
    
    | Column | Description |
    |--------|-------------|
    | `TF` | Transcription factor (regulator) |
    | `target` | Target gene |
    | `importance` | Regulatory importance score (higher = stronger) |
    
    **Filtering strategy**:
    - Top N links per target gene
    - Importance threshold (e.g., > 0.5)
    - Statistical significance testing (permutation tests)
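
    A minimal sketch of the first two strategies, assuming `network` is the DataFrame returned by `grnboost2` with the columns above:
    
    ```python
    # Keep the 10 strongest regulators per target gene
    top_per_target = (
        network.sort_values('importance', ascending=False)
               .groupby('target', sort=False)
               .head(10)
    )
    
    # Or apply a global importance threshold
    strong_links = network[network['importance'] > 0.5]
    ```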
    
    ## Integration with pySCENIC
    
    Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:
    
    ```python
    # Step 1: Use arboreto for GRN inference
    from arboreto.algo import grnboost2
    network = grnboost2(expression_data=sc_data, tf_names=tf_list)
    
    # Step 2: Use pySCENIC for regulon identification and activity scoring
    # (See pySCENIC documentation for downstream analysis)
    ```
    
    ## Reproducibility
    
    Always set a seed for reproducible results:
    ```python
    network = grnboost2(expression_data=matrix, seed=777)
    ```
    
    Run multiple seeds for robustness analysis:
    ```python
    from distributed import LocalCluster, Client
    from arboreto.algo import grnboost2
    
    if __name__ == '__main__':
        client = Client(LocalCluster())
    
        seeds = [42, 123, 777]
        networks = []
    
        for seed in seeds:
            net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
            networks.append(net)
    
        # Combine networks and keep consensus links
        # (analyze_consensus is a placeholder; see the consensus_links sketch below)
        consensus = analyze_consensus(networks)
    ```
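
    One possible consensus step (a sketch, not part of arboreto itself) averages importance over (TF, target) links that are recovered in several runs:
    
    ```python
    import pandas as pd
    
    def consensus_links(networks, min_runs=3):
        """Average importance over (TF, target) links present in at least min_runs networks."""
        combined = pd.concat(networks, ignore_index=True)
        agg = (combined.groupby(['TF', 'target'])['importance']
                       .agg(['mean', 'count'])
                       .reset_index())
        return agg[agg['count'] >= min_runs].rename(columns={'mean': 'mean_importance'})
    ```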
    
    ## Troubleshooting
    
    **Memory errors**: Reduce dataset size by filtering low-variance genes or use distributed computing
    
    **Slow performance**: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list
    
    **Dask errors**: Ensure `if __name__ == '__main__':` guard is present in scripts
    
    **Empty results**: Check data format (genes as columns), verify TF names match gene names
    
    
  • skills/bioinformatics/alterlab-cobrapy/SKILL.mdskill
    Show content (12520 bytes)
    ---
    name: alterlab-cobrapy
    description: Constraint-based metabolic modeling (COBRA). FBA, FVA, gene knockouts, flux sampling, SBML models, for systems biology and metabolic engineering analysis. Part of the AlterLab Academic Skills suite.
    license: GPL-2.0 license
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # COBRApy - Constraint-Based Reconstruction and Analysis
    
    ## Overview
    
    COBRApy is a Python library for constraint-based reconstruction and analysis (COBRA) of metabolic models, essential for systems biology research. Work with genome-scale metabolic models, perform computational simulations of cellular metabolism, conduct metabolic engineering analyses, and predict phenotypic behaviors.
    
    ## Core Capabilities
    
    COBRApy provides comprehensive tools organized into several key areas:
    
    ### 1. Model Management
    
    Load existing models from repositories or files:
    ```python
    from cobra.io import load_model
    
    # Load bundled test models
    model = load_model("textbook")  # E. coli core model
    model = load_model("ecoli")     # Full E. coli model
    model = load_model("salmonella")
    
    # Load from files
    from cobra.io import read_sbml_model, load_json_model, load_yaml_model
    model = read_sbml_model("path/to/model.xml")
    model = load_json_model("path/to/model.json")
    model = load_yaml_model("path/to/model.yml")
    ```
    
    Save models in various formats:
    ```python
    from cobra.io import write_sbml_model, save_json_model, save_yaml_model
    write_sbml_model(model, "output.xml")  # Preferred format
    save_json_model(model, "output.json")  # For Escher compatibility
    save_yaml_model(model, "output.yml")   # Human-readable
    ```
    
    ### 2. Model Structure and Components
    
    Access and inspect model components:
    ```python
    # Access components
    model.reactions      # DictList of all reactions
    model.metabolites    # DictList of all metabolites
    model.genes          # DictList of all genes
    
    # Get specific items by ID or index
    reaction = model.reactions.get_by_id("PFK")
    metabolite = model.metabolites[0]
    
    # Inspect properties
    print(reaction.reaction)        # Stoichiometric equation
    print(reaction.bounds)          # Flux constraints
    print(reaction.gene_reaction_rule)  # GPR logic
    print(metabolite.formula)       # Chemical formula
    print(metabolite.compartment)   # Cellular location
    ```
    
    ### 3. Flux Balance Analysis (FBA)
    
    Perform standard FBA simulation:
    ```python
    # Basic optimization
    solution = model.optimize()
    print(f"Objective value: {solution.objective_value}")
    print(f"Status: {solution.status}")
    
    # Access fluxes
    print(solution.fluxes["PFK"])
    print(solution.fluxes.head())
    
    # Fast optimization (objective value only)
    objective_value = model.slim_optimize()
    
    # Change objective
    model.objective = "ATPM"
    solution = model.optimize()
    ```
    
    Parsimonious FBA (minimize total flux):
    ```python
    from cobra.flux_analysis import pfba
    solution = pfba(model)
    ```
    
    Geometric FBA (find central solution):
    ```python
    from cobra.flux_analysis import geometric_fba
    solution = geometric_fba(model)
    ```
    
    ### 4. Flux Variability Analysis (FVA)
    
    Determine flux ranges for all reactions:
    ```python
    from cobra.flux_analysis import flux_variability_analysis
    
    # Standard FVA
    fva_result = flux_variability_analysis(model)
    
    # FVA at 90% optimality
    fva_result = flux_variability_analysis(model, fraction_of_optimum=0.9)
    
    # Loopless FVA (eliminates thermodynamically infeasible loops)
    fva_result = flux_variability_analysis(model, loopless=True)
    
    # FVA for specific reactions
    fva_result = flux_variability_analysis(
        model,
        reaction_list=["PFK", "FBA", "PGI"]
    )
    ```
    
    ### 5. Gene and Reaction Deletion Studies
    
    Perform knockout analyses:
    ```python
    from cobra.flux_analysis import (
        single_gene_deletion,
        single_reaction_deletion,
        double_gene_deletion,
        double_reaction_deletion
    )
    
    # Single deletions
    gene_results = single_gene_deletion(model)
    reaction_results = single_reaction_deletion(model)
    
    # Double deletions (uses multiprocessing)
    double_gene_results = double_gene_deletion(
        model,
        processes=4  # Number of CPU cores
    )
    
    # Manual knockout using context manager
    with model:
        model.genes.get_by_id("b0008").knock_out()
        solution = model.optimize()
        print(f"Growth after knockout: {solution.objective_value}")
    # Model automatically reverts after context exit
    ```
    
    ### 6. Growth Media and Minimal Media
    
    Manage growth medium:
    ```python
    # View current medium
    print(model.medium)
    
    # Modify medium (must reassign entire dict)
    medium = model.medium
    medium["EX_glc__D_e"] = 10.0  # Set glucose uptake
    medium["EX_o2_e"] = 0.0       # Anaerobic conditions
    model.medium = medium
    
    # Calculate minimal media
    from cobra.medium import minimal_medium
    
    # Minimize total import flux
    min_medium = minimal_medium(model, minimize_components=False)
    
    # Minimize number of components (uses MILP, slower)
    min_medium = minimal_medium(
        model,
        minimize_components=True,
        open_exchanges=True
    )
    ```
    
    ### 7. Flux Sampling
    
    Sample the feasible flux space:
    ```python
    from cobra.sampling import sample
    
    # Sample using OptGP (default, supports parallel processing)
    samples = sample(model, n=1000, method="optgp", processes=4)
    
    # Sample using ACHR
    samples = sample(model, n=1000, method="achr")
    
    # Validate samples
    from cobra.sampling import OptGPSampler
    sampler = OptGPSampler(model, processes=4)
    sampler.sample(1000)
    validation = sampler.validate(sampler.samples)
    print(validation.value_counts())  # Should be all 'v' for valid
    ```
    
    ### 8. Production Envelopes
    
    Calculate phenotype phase planes:
    ```python
    from cobra.flux_analysis import production_envelope
    
    # Standard production envelope
    envelope = production_envelope(
        model,
        reactions=["EX_glc__D_e", "EX_o2_e"],
        objective="EX_ac_e"  # Acetate production
    )
    
    # With carbon yield
    envelope = production_envelope(
        model,
        reactions=["EX_glc__D_e", "EX_o2_e"],
        carbon_sources="EX_glc__D_e"
    )
    
    # Visualize (use matplotlib or pandas plotting)
    import matplotlib.pyplot as plt
    envelope.plot(x="EX_glc__D_e", y="EX_o2_e", kind="scatter")
    plt.show()
    ```
    
    ### 9. Gapfilling
    
    Add reactions to make models feasible:
    ```python
    from cobra.flux_analysis import gapfill
    
    # Prepare a universal model containing candidate reactions
    # ("universal" is a placeholder; supply your own reaction database model)
    universal = load_model("universal")
    
    # Perform gapfilling
    with model:
        # Remove reactions to create gaps for demonstration
        model.remove_reactions([model.reactions.PGI])
    
        # Find reactions needed
        solution = gapfill(model, universal)
        print(f"Reactions to add: {solution}")
    ```
    
    ### 10. Model Building
    
    Build models from scratch:
    ```python
    from cobra import Model, Reaction, Metabolite
    
    # Create model
    model = Model("my_model")
    
    # Create metabolites
    atp_c = Metabolite("atp_c", formula="C10H12N5O13P3",
                       name="ATP", compartment="c")
    adp_c = Metabolite("adp_c", formula="C10H12N5O10P2",
                       name="ADP", compartment="c")
    pi_c = Metabolite("pi_c", formula="HO4P",
                      name="Phosphate", compartment="c")
    
    # Create reaction
    reaction = Reaction("ATPASE")
    reaction.name = "ATP hydrolysis"
    reaction.subsystem = "Energy"
    reaction.lower_bound = 0.0
    reaction.upper_bound = 1000.0
    
    # Add metabolites with stoichiometry
    reaction.add_metabolites({
        atp_c: -1.0,
        adp_c: 1.0,
        pi_c: 1.0
    })
    
    # Add gene-reaction rule
    reaction.gene_reaction_rule = "(gene1 and gene2) or gene3"
    
    # Add to model
    model.add_reactions([reaction])
    
    # Add boundary reactions
    model.add_boundary(atp_c, type="exchange")
    model.add_boundary(adp_c, type="demand")
    
    # Set objective
    model.objective = "ATPASE"
    ```
    
    ## Common Workflows
    
    ### Workflow 1: Load Model and Predict Growth
    
    ```python
    from cobra.io import load_model
    
    # Load model
    model = load_model("ecoli")
    
    # Run FBA
    solution = model.optimize()
    print(f"Growth rate: {solution.objective_value:.3f} /h")
    
    # Show active pathways
    print(solution.fluxes[solution.fluxes.abs() > 1e-6])
    ```
    
    ### Workflow 2: Gene Knockout Screen
    
    ```python
    from cobra.io import load_model
    from cobra.flux_analysis import single_gene_deletion
    
    # Load model
    model = load_model("ecoli")
    
    # Perform single gene deletions
    results = single_gene_deletion(model)
    
    # Find essential genes (growth < threshold)
    essential_genes = results[results["growth"] < 0.01]
    print(f"Found {len(essential_genes)} essential genes")
    
    # Find genes with minimal impact (relative to wild-type growth)
    wild_type_growth = model.slim_optimize()
    neutral_genes = results[results["growth"] > 0.9 * wild_type_growth]
    ```
    
    ### Workflow 3: Media Optimization
    
    ```python
    from cobra.io import load_model
    from cobra.medium import minimal_medium
    
    # Load model
    model = load_model("ecoli")
    
    # Calculate minimal medium for 50% of max growth
    target_growth = model.slim_optimize() * 0.5
    min_medium = minimal_medium(
        model,
        target_growth,
        minimize_components=True
    )
    
    print(f"Minimal medium components: {len(min_medium)}")
    print(min_medium)
    ```
    
    ### Workflow 4: Flux Uncertainty Analysis
    
    ```python
    from cobra.io import load_model
    from cobra.flux_analysis import flux_variability_analysis
    from cobra.sampling import sample
    
    # Load model
    model = load_model("ecoli")
    
    # First check flux ranges at optimality
    fva = flux_variability_analysis(model, fraction_of_optimum=1.0)
    
    # For reactions with large ranges, sample to understand distribution
    samples = sample(model, n=1000)
    
    # Analyze specific reaction
    reaction_id = "PFK"
    import matplotlib.pyplot as plt
    samples[reaction_id].hist(bins=50)
    plt.xlabel(f"Flux through {reaction_id}")
    plt.ylabel("Frequency")
    plt.show()
    ```
    
    ### Workflow 5: Context Manager for Temporary Changes
    
    Use context managers to make temporary modifications:
    ```python
    # Model remains unchanged outside context
    with model:
        # Temporarily change objective
        model.objective = "ATPM"
    
        # Temporarily modify bounds
        model.reactions.EX_glc__D_e.lower_bound = -5.0
    
        # Temporarily knock out genes
        model.genes.b0008.knock_out()
    
        # Optimize with changes
        solution = model.optimize()
        print(f"Modified growth: {solution.objective_value}")
    
    # All changes automatically reverted
    solution = model.optimize()
    print(f"Original growth: {solution.objective_value}")
    ```
    
    ## Key Concepts
    
    ### DictList Objects
    Models use `DictList` objects for reactions, metabolites, and genes - behaving like both lists and dictionaries:
    ```python
    # Access by index
    first_reaction = model.reactions[0]
    
    # Access by ID
    pfk = model.reactions.get_by_id("PFK")
    
    # Query methods
    atp_reactions = model.reactions.query("atp")
    ```
    
    ### Flux Constraints
    Reaction bounds define feasible flux ranges:
    - **Irreversible**: `lower_bound = 0, upper_bound > 0`
    - **Reversible**: `lower_bound < 0, upper_bound > 0`
    - Set both bounds simultaneously with `.bounds` to avoid inconsistencies
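
    For example (a small sketch assuming a model is already loaded and contains the glucose exchange reaction `EX_glc__D_e`):
    
    ```python
    # Assign both bounds in one step to keep them consistent
    reaction = model.reactions.get_by_id("EX_glc__D_e")
    reaction.bounds = (-10.0, 0.0)  # allow uptake up to 10 mmol/gDW/h, forbid secretion
    ```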
    
    ### Gene-Reaction Rules (GPR)
    Boolean logic linking genes to reactions:
    ```python
    # AND logic (both required)
    reaction.gene_reaction_rule = "gene1 and gene2"
    
    # OR logic (either sufficient)
    reaction.gene_reaction_rule = "gene1 or gene2"
    
    # Complex logic
    reaction.gene_reaction_rule = "(gene1 and gene2) or (gene3 and gene4)"
    ```
    
    ### Exchange Reactions
    Special reactions representing metabolite import/export:
    - Named with prefix `EX_` by convention
    - Positive flux = secretion, negative flux = uptake
    - Managed through `model.medium` dictionary
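
    A short sketch of inspecting uptake and secretion (assuming a loaded model):
    
    ```python
    solution = model.optimize()
    
    # model.exchanges lists all exchange reactions; negative flux = uptake, positive = secretion
    exchange_fluxes = solution.fluxes[[rxn.id for rxn in model.exchanges]]
    print("Uptaken:", exchange_fluxes[exchange_fluxes < -1e-6].to_dict())
    print("Secreted:", exchange_fluxes[exchange_fluxes > 1e-6].to_dict())
    ```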
    
    ## Best Practices
    
    1. **Use context managers** for temporary modifications to avoid state management issues
    2. **Validate models** before analysis using `model.slim_optimize()` to ensure feasibility
    3. **Check solution status** after optimization - `optimal` indicates successful solve
    4. **Use loopless FVA** when thermodynamic feasibility matters
    5. **Set fraction_of_optimum** appropriately in FVA to explore suboptimal space
    6. **Parallelize** computationally expensive operations (sampling, double deletions)
    7. **Prefer SBML format** for model exchange and long-term storage
    8. **Use slim_optimize()** when only objective value needed for performance
    9. **Validate flux samples** to ensure numerical stability
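
    Practices 2 and 3 can be combined into a short guard before any downstream analysis (a sketch assuming `model` is already loaded):
    
    ```python
    import math
    
    # Quick feasibility check before running expensive analyses
    if math.isnan(model.slim_optimize()):
        raise RuntimeError("Model is infeasible; check medium and reaction bounds")
    
    solution = model.optimize()
    if solution.status != "optimal":
        raise RuntimeError(f"Optimization did not converge: status = {solution.status}")
    ```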
    
    ## Troubleshooting
    
    **Infeasible solutions**: Check medium constraints, reaction bounds, and model consistency
    **Slow optimization**: Try different solvers (GLPK, CPLEX, Gurobi) via `model.solver`
    **Unbounded solutions**: Verify exchange reactions have appropriate upper bounds
    **Import errors**: Ensure correct file format and valid SBML identifiers
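
    For the solver point above, switching the backend is a one-line assignment (a sketch; each solver's Python bindings must be installed separately):
    
    ```python
    model.solver = "glpk"      # open-source default
    # model.solver = "gurobi"  # commercial solvers are usually faster on large models
    # model.solver = "cplex"
    ```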
    
    ## References
    
    For detailed workflows and API patterns, refer to:
    - `references/workflows.md` - Comprehensive step-by-step workflow examples
    - `references/api_quick_reference.md` - Common function signatures and patterns
    
    Official documentation: https://cobrapy.readthedocs.io/en/latest/
    
    
  • skills/bioinformatics/alterlab-deeptools/SKILL.mdskill
    Show content (18047 bytes)
    ---
    name: alterlab-deeptools
    description: NGS analysis toolkit. BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles (TSS, peaks), for ChIP-seq, RNA-seq, ATAC-seq visualization. Part of the AlterLab Academic Skills suite.
    license: MIT
    metadata:
        skill-author: AlterLab
        version: "1.0.0"
    ---
    
    # deepTools: NGS Data Analysis Toolkit
    
    ## Overview
    
    deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. Use deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.
    
    **Core capabilities:**
    - Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph)
    - Quality control assessment (fingerprint, correlation, coverage)
    - Sample comparison and correlation analysis
    - Heatmap and profile plot generation around genomic features
    - Enrichment analysis and peak region visualization
    
    ## When to Use This Skill
    
    This skill should be used when:
    
    - **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data"
    - **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis"
    - **Visualization**: "create heatmap around TSS", "plot ChIP signal", "visualize enrichment", "generate profile plot"
    - **Sample comparison**: "compare treatment vs control", "correlate samples", "PCA analysis"
    - **Analysis workflows**: "analyze ChIP-seq data", "RNA-seq coverage", "ATAC-seq analysis", "complete workflow"
    - **Working with specific file types**: BAM files, bigWig files, BED region files in genomics context
    
    ## Quick Start
    
    For users new to deepTools, start with file validation and common workflows:
    
    ### 1. Validate Input Files
    
    Before running any analysis, validate BAM, bigWig, and BED files using the validation script:
    
    ```bash
    python scripts/validate_files.py --bam sample1.bam sample2.bam --bed regions.bed
    ```
    
    This checks file existence, BAM indices, and format correctness.
    
    ### 2. Generate Workflow Template
    
    For standard analyses, use the workflow generator to create customized scripts:
    
    ```bash
    # List available workflows
    python scripts/workflow_generator.py --list
    
    # Generate ChIP-seq QC workflow
    python scripts/workflow_generator.py chipseq_qc -o qc_workflow.sh \
        --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
        --genome-size 2913022398
    
    # Make executable and run
    chmod +x qc_workflow.sh
    ./qc_workflow.sh
    ```
    
    ### 3. Most Common Operations
    
    See `assets/quick_reference.md` for frequently used commands and parameters.
    
    ## Installation
    
    ```bash
    uv pip install deeptools
    ```
    
    ## Core Workflows
    
    deepTools workflows typically follow this pattern: **QC → Normalization → Comparison/Visualization**
    
    ### ChIP-seq Quality Control Workflow
    
    When users request ChIP-seq QC or quality assessment:
    
    1. **Generate workflow script** using `scripts/workflow_generator.py chipseq_qc`
    2. **Key QC steps**:
       - Sample correlation (multiBamSummary + plotCorrelation)
       - PCA analysis (plotPCA)
       - Coverage assessment (plotCoverage)
       - Fragment size validation (bamPEFragmentSize)
       - ChIP enrichment strength (plotFingerprint)
    
    **Interpreting results:**
    - **Correlation**: Replicates should cluster together with high correlation (>0.9)
    - **Fingerprint**: Strong ChIP shows steep rise; flat diagonal indicates poor enrichment
    - **Coverage**: Assess if sequencing depth is adequate for analysis
    
    Full workflow details in `references/workflows.md` → "ChIP-seq Quality Control Workflow"
    
    ### ChIP-seq Complete Analysis Workflow
    
    For full ChIP-seq analysis from BAM to visualizations:
    
    1. **Generate coverage tracks** with normalization (bamCoverage)
    2. **Create comparison tracks** (bamCompare for log2 ratio)
    3. **Compute signal matrices** around features (computeMatrix)
    4. **Generate visualizations** (plotHeatmap, plotProfile)
    5. **Enrichment analysis** at peaks (plotEnrichment)
    
    Use `scripts/workflow_generator.py chipseq_analysis` to generate template.
    
    Complete command sequences in `references/workflows.md` → "ChIP-seq Analysis Workflow"
    
    ### RNA-seq Coverage Workflow
    
    For strand-specific RNA-seq coverage tracks:
    
    Use bamCoverage with `--filterRNAstrand` to separate forward and reverse strands.
    
    **Important:** NEVER use `--extendReads` for RNA-seq (would extend over splice junctions).
    
    Use normalization: CPM for fixed bins, RPKM for gene-level analysis.
    
    Template available: `scripts/workflow_generator.py rnaseq_coverage`
    
    Details in `references/workflows.md` → "RNA-seq Coverage Workflow"
    
    ### ATAC-seq Analysis Workflow
    
    ATAC-seq requires Tn5 offset correction:
    
    1. **Shift reads** using alignmentSieve with `--ATACshift`
    2. **Generate coverage** with bamCoverage
    3. **Analyze fragment sizes** (expect nucleosome ladder pattern)
    4. **Visualize at peaks** if available
    
    Template: `scripts/workflow_generator.py atacseq`
    
    Full workflow in `references/workflows.md` → "ATAC-seq Workflow"
    
    ## Tool Categories and Common Tasks
    
    ### BAM/bigWig Processing
    
    **Convert BAM to normalized coverage:**
    ```bash
    bamCoverage --bam input.bam --outFileName output.bw \
        --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
        --binSize 10 --numberOfProcessors 8
    ```
    
    **Compare two samples (log2 ratio):**
    ```bash
    bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
        --operation log2 --scaleFactorsMethod readCount
    ```
    
    **Key tools:** bamCoverage, bamCompare, multiBamSummary, multiBigwigSummary, correctGCBias, alignmentSieve
    
    Complete reference: `references/tools_reference.md` → "BAM and bigWig File Processing Tools"
    
    ### Quality Control
    
    **Check ChIP enrichment:**
    ```bash
    plotFingerprint -b input.bam chip.bam -o fingerprint.png \
        --extendReads 200 --ignoreDuplicates
    ```
    
    **Sample correlation:**
    ```bash
    multiBamSummary bins --bamfiles *.bam -o counts.npz
    plotCorrelation -in counts.npz --corMethod pearson \
        --whatToShow heatmap -o correlation.png
    ```
    
    **Key tools:** plotFingerprint, plotCoverage, plotCorrelation, plotPCA, bamPEFragmentSize
    
    Complete reference: `references/tools_reference.md` → "Quality Control Tools"
    
    ### Visualization
    
    **Create heatmap around TSS:**
    ```bash
    # Compute matrix
    computeMatrix reference-point -S signal.bw -R genes.bed \
        -b 3000 -a 3000 --referencePoint TSS -o matrix.gz
    
    # Generate heatmap
    plotHeatmap -m matrix.gz -o heatmap.png \
        --colorMap RdBu --kmeans 3
    ```
    
    **Create profile plot:**
    ```bash
    plotProfile -m matrix.gz -o profile.png \
        --plotType lines --colors blue red
    ```
    
    **Key tools:** computeMatrix, plotHeatmap, plotProfile, plotEnrichment
    
    Complete reference: `references/tools_reference.md` → "Visualization Tools"
    
    ## Normalization Methods
    
    Choosing the correct normalization is critical for valid comparisons. Consult `references/normalization_methods.md` for comprehensive guidance.
    
    **Quick selection guide:**
    
    - **ChIP-seq coverage**: Use RPGC or CPM
    - **ChIP-seq comparison**: Use bamCompare with log2 and readCount
    - **RNA-seq bins**: Use CPM
    - **RNA-seq genes**: Use RPKM (accounts for gene length)
    - **ATAC-seq**: Use RPGC or CPM
    
    **Normalization methods:**
    - **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
    - **CPM**: Counts per million mapped reads
    - **RPKM**: Reads per kb per million (accounts for region length)
    - **BPM**: Bins per million
    - **None**: Raw counts (not recommended for comparisons)
    
    Full explanation: `references/normalization_methods.md`
    
    ## Effective Genome Sizes
    
    RPGC normalization requires effective genome size. Common values:
    
    | Organism | Assembly | Size | Usage |
    |----------|----------|------|-------|
    | Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
    | Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
    | Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
    | *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
    | *C. elegans* | ce10/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
    
    Complete table with read-length-specific values: `references/effective_genome_sizes.md`
    
    ## Common Parameters Across Tools
    
    Many deepTools commands share these options:
    
    **Performance:**
    - `--numberOfProcessors, -p`: Enable parallel processing (always use available cores)
    - `--region`: Process specific regions for testing (e.g., `chr1:1-1000000`)
    
    **Read Filtering:**
    - `--ignoreDuplicates`: Remove PCR duplicates (recommended for most analyses)
    - `--minMappingQuality`: Filter by alignment quality (e.g., `--minMappingQuality 10`)
    - `--minFragmentLength` / `--maxFragmentLength`: Fragment length bounds
    - `--samFlagInclude` / `--samFlagExclude`: SAM flag filtering
    
    **Read Processing:**
    - `--extendReads`: Extend to fragment length (ChIP-seq: YES, RNA-seq: NO)
    - `--centerReads`: Center at fragment midpoint for sharper signals
    
    ## Best Practices
    
    ### File Validation
    **Always validate files first** using `scripts/validate_files.py` to check:
    - File existence and readability
    - BAM indices present (.bai files)
    - BED format correctness
    - File sizes reasonable
    
    ### Analysis Strategy
    
    1. **Start with QC**: Run correlation, coverage, and fingerprint analysis before proceeding
    2. **Test on small regions**: Use `--region chr1:1-10000000` for parameter testing
    3. **Document commands**: Save full command lines for reproducibility
    4. **Use consistent normalization**: Apply same method across samples in comparisons
    5. **Verify genome assembly**: Ensure BAM and BED files use matching genome builds
    
    ### ChIP-seq Specific
    
    - **Always extend reads** for ChIP-seq: `--extendReads 200`
    - **Remove duplicates**: Use `--ignoreDuplicates` in most cases
    - **Check enrichment first**: Run plotFingerprint before detailed analysis
    - **GC correction**: Only apply if significant bias detected; never use `--ignoreDuplicates` after GC correction
    
    ### RNA-seq Specific
    
    - **Never extend reads** for RNA-seq (would span splice junctions)
    - **Strand-specific**: Use `--filterRNAstrand forward/reverse` for stranded libraries
    - **Normalization**: CPM for bins, RPKM for genes
    
    ### ATAC-seq Specific
    
    - **Apply Tn5 correction**: Use alignmentSieve with `--ATACshift`
    - **Fragment filtering**: Set appropriate min/max fragment lengths
    - **Check nucleosome pattern**: Fragment size plot should show ladder pattern
    
    ### Performance Optimization
    
    1. **Use multiple processors**: `--numberOfProcessors 8` (or available cores)
    2. **Increase bin size** for faster processing and smaller files
    3. **Process chromosomes separately** for memory-limited systems
    4. **Pre-filter BAM files** using alignmentSieve to create reusable filtered files
    5. **Use bigWig over bedGraph**: Compressed and faster to process
    
    ## Troubleshooting
    
    ### Common Issues
    
    **BAM index missing:**
    ```bash
    samtools index input.bam
    ```
    
    **Out of memory:**
    Process chromosomes individually using `--region`:
    ```bash
    bamCoverage --bam input.bam -o chr1.bw --region chr1
    ```
    
    **Slow processing:**
    Increase `--numberOfProcessors` and/or increase `--binSize`
    
    **bigWig files too large:**
    Increase bin size: `--binSize 50` or larger
    
    ### Validation Errors
    
    Run validation script to identify issues:
    ```bash
    python scripts/validate_files.py --bam *.bam --bed regions.bed
    ```
    
    Common errors and solutions explained in script output.
    
    ## Reference Documentation
    
    This skill includes comprehensive reference documentation:
    
    ### references/tools_reference.md
    Complete documentation of all deepTools commands organized by category:
    - BAM and bigWig processing tools (9 tools)
    - Quality control tools (6 tools)
    - Visualization tools (3 tools)
    - Miscellaneous tools (2 tools)
    
    Each tool includes:
    - Purpose and overview
    - Key parameters with explanations
    - Usage examples
    - Important notes and best practices
    
    **Use this reference when:** Users ask about specific tools, parameters, or detailed usage.
    
    ### references/workflows.md
    Complete workflow examples for common analyses:
    - ChIP-seq quality control workflow
    - ChIP-seq complete analysis workflow
    - RNA-seq coverage workflow
    - ATAC-seq analysis workflow
    - Multi-sample comparison workflow
    - Peak region analysis workflow
    - Troubleshooting and performance tips
    
    **Use this reference when:** Users need complete analysis pipelines or workflow examples.
    
    ### references/normalization_methods.md
    Comprehensive guide to normalization methods:
    - Detailed explanation of each method (RPGC, CPM, RPKM, BPM, etc.)
    - When to use each method
    - Formulas and interpretation
    - Selection guide by experiment type
    - Common pitfalls and solutions
    - Quick reference table
    
    **Use this reference when:** Users ask about normalization, comparing samples, or which method to use.
    
    ### references/effective_genome_sizes.md
    Effective genome size values and usage:
    - Common organism values (human, mouse, fly, worm, zebrafish)
    - Read-length-specific values
    - Calculation methods
    - When and how to use in commands
    - Custom genome calculation instructions
    
    **Use this reference when:** Users need genome size for RPGC normalization or GC bias correction.
    
    ## Helper Scripts
    
    ### scripts/validate_files.py
    
    Validates BAM, bigWig, and BED files for deepTools analysis. Checks file existence, indices, and format.
    
    **Usage:**
    ```bash
    python scripts/validate_files.py --bam sample1.bam sample2.bam \
        --bed peaks.bed --bigwig signal.bw
    ```
    
    **When to use:** Before starting any analysis, or when troubleshooting errors.
    
    ### scripts/workflow_generator.py
    
    Generates customizable bash script templates for common deepTools workflows.
    
    **Available workflows:**
    - `chipseq_qc`: ChIP-seq quality control
    - `chipseq_analysis`: Complete ChIP-seq analysis
    - `rnaseq_coverage`: Strand-specific RNA-seq coverage
    - `atacseq`: ATAC-seq with Tn5 correction
    
    **Usage:**
    ```bash
    # List workflows
    python scripts/workflow_generator.py --list
    
    # Generate workflow
    python scripts/workflow_generator.py chipseq_qc -o qc.sh \
        --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
        --genome-size 2913022398 --threads 8
    
    # Run generated workflow
    chmod +x qc.sh
    ./qc.sh
    ```
    
    **When to use:** Users request standard workflows or need template scripts to customize.
    
    ## Assets
    
    ### assets/quick_reference.md
    
    Quick reference card with most common commands, effective genome sizes, and typical workflow pattern.
    
    **When to use:** Users need quick command examples without detailed documentation.
    
    ## Handling User Requests
    
    ### For New Users
    
    1. Start with installation verification
    2. Validate input files using `scripts/validate_files.py`
    3. Recommend appropriate workflow based on experiment type
    4. Generate workflow template using `scripts/workflow_generator.py`
    5. Guide through customization and execution
    
    ### For Experienced Users
    
    1. Provide specific tool commands for requested operations
    2. Reference appropriate sections in `references/tools_reference.md`
    3. Suggest optimizations and best practices
    4. Offer troubleshooting for issues
    
    ### For Specific Tasks
    
    **"Convert BAM to bigWig":**
    - Use bamCoverage with appropriate normalization
    - Recommend RPGC or CPM based on use case
    - Provide effective genome size for organism
    - Suggest relevant parameters (extendReads, ignoreDuplicates, binSize)
    
    **"Check ChIP quality":**
    - Run full QC workflow or use plotFingerprint specifically
    - Explain interpretation of results
    - Suggest follow-up actions based on results
    
    **"Create heatmap":**
    - Guide through two-step process: computeMatrix → plotHeatmap
    - Help choose appropriate matrix mode (reference-point vs scale-regions)
    - Suggest visualization parameters and clustering options
    
    **"Compare samples":**
    - Recommend bamCompare for two-sample comparison
    - Suggest multiBamSummary + plotCorrelation for multiple samples
    - Guide normalization method selection
    
    ### Referencing Documentation
    
    When users need detailed information:
    - **Tool details**: Direct to specific sections in `references/tools_reference.md`
    - **Workflows**: Use `references/workflows.md` for complete analysis pipelines
    - **Normalization**: Consult `references/normalization_methods.md` for method selection
    - **Genome sizes**: Reference `references/effective_genome_sizes.md`
    
    Search references using grep patterns:
    ```bash
    # Find tool documentation
    grep -A 20 "^### toolname" references/tools_reference.md
    
    # Find workflow
    grep -A 50 "^## Workflow Name" references/workflows.md
    
    # Find normalization method
    grep -A 15 "^### Method Name" references/normalization_methods.md
    ```
    
    ## Example Interactions
    
    **User: "I need to analyze my ChIP-seq data"**
    
    Response approach:
    1. Ask about files available (BAM files, peaks, genes)
    2. Validate files using validation script
    3. Generate chipseq_analysis workflow template
    4. Customize for their specific files and organism
    5. Explain each step as script runs
    
    **User: "Which normalization should I use?"**
    
    Response approach:
    1. Ask about experiment type (ChIP-seq, RNA-seq, etc.)
    2. Ask about comparison goal (within-sample or between-sample)
    3. Consult `references/normalization_methods.md` selection guide
    4. Recommend appropriate method with justification
    5. Provide command example with parameters
    
    **User: "Create a heatmap around TSS"**
    
    Response approach:
    1. Verify bigWig and gene BED files available
    2. Use computeMatrix with reference-point mode at TSS
    3. Generate plotHeatmap with appropriate visualization parameters
    4. Suggest clustering if dataset is large
    5. Offer profile plot as complement
    
    ## Key Reminders
    
    - **File validation first**: Always validate input files before analysis
    - **Normalization matters**: Choose appropriate method for comparison type
    - **Extend reads carefully**: YES for ChIP-seq, NO for RNA-seq
    - **Use all cores**: Set `--numberOfProcessors` to available cores
    - **Test on regions**: Use `--region` for parameter testing
    - **Check QC first**: Run quality control before detailed analysis
    - **Document everything**: Save commands for reproducibility
    - **Reference documentation**: Use comprehensive references for detailed guidance
    
    

README

AlterLab Academic Skills



📢 Featured in awesome-claude-skills (5.7k ⭐)



🧬 186+ purpose-built Claude AI skills for faculty, researchers & academicians

Organized across 13 research domains — from bioinformatics to digital humanities

Research Pipeline · Scientific Databases · Bioinformatics · Data Science · Visualization · Clinical Research · and more




Built by AlterLab Creative Technologies Laboratory

Not tied to any specific university — these skills work for any researcher, anywhere.

- 🎯 **Plug & Play**: Drop a `.md` skill file into Claude Projects or Claude Code and get instant expertise
- 🧠 **Domain Expert**: Each skill transforms Claude into a specialized research assistant with deep knowledge
- 🔬 **Real Frameworks**: Built on actual scientific methods, tools, and professional output templates
- 🌐 **Universal**: Works for any researcher at any institution, with no vendor lock-in



🎯 What Is This?

A comprehensive suite of 186+ purpose-built Claude AI skills for faculty members, academicians, and researchers — organized into 13 domain categories spanning the full academic research lifecycle.

Each skill transforms Claude into a domain-specific expert assistant tailored to academic research, scientific computing, and scholarly publishing workflows.

> [!TIP]
> **How it works:** Each skill is a structured `.md` prompt file. Drop it into a Claude Project or Claude Code, and Claude instantly becomes your research expert — with real scientific frameworks, professional output templates, and deep domain knowledge.


✨ Key Features

| Feature | Description |
|---|---|
| 🔬 Research-Ready | Skills built on real scientific methods, databases, and professional frameworks used by working researchers |
| 🤖 Multi-Agent Pipelines | Core skills chain together: Research → Write → Review → Publish in a seamless workflow |
| 📊 39 Database Integration Skills | Instant access to PubMed, ChEMBL, UniProt, ClinicalTrials.gov, COSMIC, and more |
| 🧬 Deep Domain Coverage | From single-cell RNA-seq analysis to quantum computing, from clinical trials to digital humanities |
| 📝 Publication-Quality Output | LaTeX papers, conference posters, grant proposals, scientific visualizations — all formatted to professional standards |
| 🔄 Mix & Match | Combine multiple skills in one Claude Project for a multi-expert research team |

🗂️ Domain Overview

| Domain | Skills | Focus Areas |
|---|---|---|
| 🔄 Core Pipeline | 6 | Multi-agent research → write → review → publish pipeline + teaching + thesis |
| 🗄️ Databases | 39 | Connectors to scientific databases — PubMed, ChEMBL, UniProt, ClinicalTrials.gov, COSMIC, and more |
| 🧬 Bioinformatics | 25 | Genomics, proteomics, molecular biology — Scanpy, BioPython, ESM, single-cell analysis |
| ⚗️ Cheminformatics | 12 | Chemistry and drug discovery — RDKit, molecular dynamics, docking, ADMET |
| 🏥 Clinical Research | 10 | Clinical decision support, treatment planning, medical imaging, regulatory |
| 📊 Data Science | 22 | ML/statistics — scikit-learn, PyTorch Lightning, SHAP, transformers |
| 📈 Visualization | 8 | Scientific plotting — Matplotlib, Seaborn, Plotly, schematics, infographics |
| ✍️ Writing Tools | 13 | Scientific writing, citations, grants, posters, academic career |
| 🔧 Lab Integrations | 9 | Laboratory platforms — Benchling, DNAnexus, Opentrons, Protocols.io |
| 🌍 Domain-Specific | 17 | Quantum computing, geospatial, materials science, social science methods, digital humanities |
| 📄 Document Tools | 6 | File format handling — DOCX, PDF, PPTX, XLSX, Markdown |
| 🔍 Research Tools | 12 | Search, discovery, Zotero, qualitative methods, ethics, surveys, open science |
| 💰 Finance & Economics | 7 | FRED, Alpha Vantage, SEC EDGAR, market research |

🚀 Quick Start

Option 1 — Claude Projects (Recommended)

1. Go to claude.ai → Projects → Create Project
2. Upload SKILL.md files from your domain folder into the project's Knowledge section
3. Start chatting — Claude now has your skills loaded

Option 2 — Claude Code CLI

```bash
git clone https://github.com/AlterLab-IEU/AlterLab-Academic-Skills.git
cd AlterLab-Academic-Skills
claude "help me research the latest findings on CRISPR gene editing"
```

Option 3 — Pick Individual Skills

Browse the skills/ folder and download only the ones you need. Every skill is a standalone .md file.



⚡ Core Pipeline — 6 Skills

The heart of the system — a multi-agent research-to-publication pipeline with 39 specialized agents, plus teaching and thesis supervision tools.

| # | Skill | Agents | What It Does |
|---|---|---|---|
| 1 | 🔬 Deep Research | 13 | Multi-mode research with systematic review, Socratic dialogue, fact-checking |
| 2 | 📝 Paper Writer | 12 | Academic paper authoring with LaTeX, bilingual support, 9 writing modes |
| 3 | 🔍 Paper Reviewer | 7 | Multi-perspective peer review with Devil's Advocate, 0–100 quality rubrics |
| 4 | 🔄 Research Pipeline | 7 | 10-stage orchestrator with integrity verification and material passports |
| 5 | 🎓 Teaching Design | | Course design, syllabi, rubrics, Bloom's taxonomy, backward design |
| 6 | 📋 Thesis Supervisor | | Dissertation guidance, defense prep, committee management |


📚 All 186+ Skills

🗄️ Databases — Scientific Database Connectors (39 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | AlphaFold DB | Protein structure predictions from AlphaFold |
| 2 | arXiv | Preprint search and discovery |
| 3 | BindingDB | Binding affinity data for drug-target interactions |
| 4 | bioRxiv | Biology preprint search and monitoring |
| 5 | BRENDA | Enzyme functional data |
| 6 | cBioPortal | Cancer genomics data exploration |
| 7 | ChEMBL | Bioactive molecules with drug-like properties |
| 8 | ClinicalTrials.gov | Clinical trial registry search |
| 9 | ClinPGx | Clinical pharmacogenomics data |
| 10 | ClinVar | Genomic variation and human health |
| 11 | COSMIC | Catalogue of somatic mutations in cancer |
| 12 | Data Commons | Google's open knowledge graph |
| 13 | DepMap | Cancer dependency mapping |
| 14 | DrugBank | Drug and drug target information |
| 15 | ENA | European Nucleotide Archive |
| 16 | Ensembl | Genome annotation and variation |
| 17 | FDA | FDA drug and device data |
| 18 | Gene DB | Gene-level data aggregation |
| 19 | GEO | Gene Expression Omnibus datasets |
| 20 | gnomAD | Genome aggregation and variant frequency |
| 21 | GTEx | Tissue-specific gene expression |
| 22 | GWAS Catalog | Genome-wide association studies |
| 23 | HMDB | Human Metabolome Database |
| 24 | Imaging Data Commons | Cancer imaging data |
| 25 | InterPro | Protein families and domains |
| 26 | JASPAR | Transcription factor binding profiles |
| 27 | KEGG | Biological pathways and networks |
| 28 | Metabolomics Workbench | Metabolomics data repository |
| 29 | Monarch Initiative | Disease-gene associations |
| 30 | OpenAlex | Open scholarly metadata |
| 31 | Open Targets | Drug target identification |
| 32 | PDB | Protein 3D structure database |
| 33 | PubChem | Chemical information database |
| 34 | PubMed | Biomedical literature search |
| 35 | Reactome | Biological pathway database |
| 36 | STRING | Protein-protein interaction networks |
| 37 | UniProt | Protein sequence and function |
| 38 | USPTO | Patent search and analysis |
| 39 | ZINC | Commercially-available compounds for docking |

🧬 Bioinformatics — Genomics, Proteomics & Molecular Biology (25 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | AnnData | Annotated data matrices for single-cell |
| 2 | Arboreto | Gene regulatory network inference |
| 3 | BioPython | General-purpose bioinformatics toolkit |
| 4 | BioServices | Programmatic access to biological web services |
| 5 | CellxGene | Interactive single-cell data exploration |
| 6 | COBRApy | Constraint-based metabolic modeling |
| 7 | deepTools | NGS data analysis and visualization |
| 8 | ESM | Protein language models |
| 9 | ETE Toolkit | Phylogenetic tree analysis and visualization |
| 10 | FlowIO | Flow cytometry data handling |
| 11 | gget | Query genomic databases from Python |
| 12 | Glycoengineering | Glycan analysis and engineering |
| 13 | HistoLab | Computational histopathology |
| 14 | LaminDB | Data lineage and biological data management |
| 15 | Neuropixels | Neural probe data processing |
| 16 | PathML | Machine learning for pathology |
| 17 | Phylogenetics | Evolutionary tree construction |
| 18 | PyDESeq2 | Differential gene expression analysis |
| 19 | pyOpenMS | Mass spectrometry data analysis |
| 20 | pysam | SAM/BAM file manipulation |
| 21 | Scanpy | Single-cell analysis in Python |
| 22 | scikit-bio | Bioinformatics algorithms and data structures |
| 23 | scVelo | RNA velocity analysis |
| 24 | scvi-tools | Deep generative models for single-cell |
| 25 | TileDB-VCF | Population-scale genomic variant storage |

⚗️ Cheminformatics — Chemistry & Drug Discovery (12 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Datamol | Molecular data manipulation |
| 2 | DeepChem | Deep learning for chemistry |
| 3 | DiffDock | Diffusion-based molecular docking |
| 4 | matchms | Mass spectra matching and similarity |
| 5 | MedChem | Medicinal chemistry analysis |
| 6 | Molecular Dynamics | MD simulation setup and analysis |
| 7 | MolFeat | Molecular featurization |
| 8 | PrimeKG | Precision medicine knowledge graph |
| 9 | PyTDC | Therapeutics Data Commons access |
| 10 | RDKit | Core cheminformatics toolkit |
| 11 | Rowan | Computational chemistry workflows |
| 12 | TorchDrug | Graph neural networks for drug discovery |

🏥 Clinical Research — Clinical Decision Support & Medical Tools (10 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Clinical Decision | Evidence-based clinical decision support |
| 2 | Clinical Reports | Structured clinical report generation |
| 3 | Consciousness Council | Multi-perspective medical ethics deliberation |
| 4 | DHDNA Profiler | Digital health DNA profiling |
| 5 | ISO 13485 | Medical device quality management |
| 6 | NeuroKit2 | Neurophysiological signal processing |
| 7 | PyDicom | DICOM medical image handling |
| 8 | PyHealth | Healthcare ML pipelines |
| 9 | Treatment Plans | Treatment planning and protocol design |
| 10 | What-If Oracle | Counterfactual clinical reasoning |

📊 Data Science — ML, Statistics & Data Analysis (22 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Dask | Parallel computing and out-of-core data |
| 2 | EDA | Exploratory data analysis |
| 3 | NetworkX | Network/graph analysis |
| 4 | Polars | High-performance DataFrames |
| 5 | PufferLib | Reinforcement learning environments |
| 6 | PyMC | Bayesian statistical modeling |
| 7 | pymoo | Multi-objective optimization |
| 8 | PyTorch Lightning | Structured deep learning training |
| 9 | scikit-learn | Classical machine learning |
| 10 | scikit-survival | Survival analysis |
| 11 | SHAP | Model interpretability and feature importance |
| 12 | SimPy | Discrete-event simulation |
| 13 | Stable-Baselines3 | Reinforcement learning algorithms |
| 14 | Statistical Analysis | Classical statistical tests and methods |
| 15 | statsmodels | Statistical models and econometrics |
| 16 | SymPy | Symbolic mathematics |
| 17 | TimesFM | Foundation model for time series |
| 18 | PyTorch Geometric | Graph neural networks |
| 19 | Transformers | Hugging Face transformer models |
| 20 | UMAP | Dimensionality reduction |
| 21 | Vaex | Out-of-core DataFrames for big data |
| 22 | Zarr | Chunked, compressed N-dimensional arrays |

📈 Visualization — Scientific Plotting & Graphics (8 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Generate Image | AI image generation for research figures |
| 2 | Infographics | Research infographic design |
| 3 | Matplotlib | Publication-quality 2D plots |
| 4 | Mermaid | Diagrams and flowcharts as code |
| 5 | Plotly | Interactive scientific visualizations |
| 6 | Scientific Schematics | Technical diagrams and schematics |
| 7 | Scientific Viz | Advanced scientific visualization |
| 8 | Seaborn | Statistical data visualization |

✍️ Writing Tools — Scientific Writing, Citations & Publishing (13 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Academic Career | Academic CV, research statements, tenure dossier |
| 2 | Citation Management | Reference formatting and management |
| 3 | Hypothesis Generator | Research hypothesis development |
| 4 | LaTeX Posters | Conference poster design in LaTeX |
| 5 | Literature Review | Systematic literature review assistance |
| 6 | Paper-to-Web | Convert papers to web-friendly formats |
| 7 | Peer Review | Peer review writing assistance |
| 8 | PPTX Posters | Conference posters in PowerPoint |
| 9 | Research Grants | Grant proposal writing |
| 10 | Scholar Eval | Academic output evaluation |
| 11 | Scientific Slides | Research presentation creation |
| 12 | Scientific Writing | Academic writing style and structure |
| 13 | Venue Templates | Journal/conference formatting templates |

🔧 Lab Integrations — Laboratory Platform Connectors (9 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Benchling | Molecular biology data platform |
| 2 | DNAnexus | Genomic data analysis platform |
| 3 | Ginkgo Cloud | Synthetic biology platform |
| 4 | LabArchive | Electronic lab notebook |
| 5 | LatchBio | Bioinformatics workflow platform |
| 6 | OMERO | Biological image management |
| 7 | Opentrons | Lab automation and robotics |
| 8 | Protocols.io | Protocol sharing and management |
| 9 | PyLabRobot | Lab robotics programming |

🌍 Domain-Specific — Quantum, Geospatial, Materials, Social Science & More (17 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Adaptyv | Adaptive experimental design |
| 2 | Aeon | Time series classification |
| 3 | AstroPy | Astronomy and astrophysics |
| 4 | Cirq | Quantum circuit design (Google) |
| 5 | FluidSim | Fluid dynamics simulation |
| 6 | GeniML | Genomic interval ML |
| 7 | GeoMaster | Geospatial analysis mastery |
| 8 | GeoPandas | Geospatial data analysis |
| 9 | GTARS | Genomic tool for annotation |
| 10 | HypoGenic | Hypothesis generation from data |
| 11 | Modal | Cloud compute for research |
| 12 | PennyLane | Quantum machine learning |
| 13 | Pymatgen | Materials science analysis |
| 14 | Qiskit | Quantum computing (IBM) |
| 15 | QuTiP | Quantum dynamics simulation |
| 16 | Social Science Methods | Discourse analysis, QCA, Delphi, process tracing |
| 17 | Digital Humanities | Text mining, corpus linguistics, stylometry, OCR |

📄 Document Tools — File Format Handling (6 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | DOCX | Word document generation and manipulation |
| 2 | MarkItDown | Convert documents to Markdown |
| 3 | Open Notebook | Open-format research notebooks |
| 4 | PDF | PDF generation and processing |
| 5 | PPTX | PowerPoint presentation creation |
| 6 | XLSX | Excel spreadsheet handling |

🔍 Research Tools — Search, Discovery, Methods & Reference Management (12 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | BGPT Search | AI-powered research search |
| 2 | Mixed Methods | Mixed-methods research design and integration |
| 3 | Open Science | Preregistration, FAIR data, open access publishing |
| 4 | Parallel Web | Multi-source parallel web search |
| 5 | Perplexity | Perplexity-powered research queries |
| 6 | PyZotero | Zotero reference manager integration |
| 7 | Qualitative Methods | Thematic analysis, grounded theory, IPA, coding |
| 8 | Research Ethics | IRB applications, informed consent, GDPR |
| 9 | Research Lookup | Quick research paper discovery |
| 10 | Scientific Brainstorm | Structured research ideation |
| 11 | Scientific Thinking | Critical scientific reasoning frameworks |
| 12 | Survey Design | Questionnaire construction and validation |

💰 Finance & Economics — Financial Data & Analysis (7 Skills)

| # | Skill | What It Does |
|---|---|---|
| 1 | Alpha Vantage | Stock and financial market data |
| 2 | Denario | Financial data processing |
| 3 | EDGAR Tools | SEC filing search and analysis |
| 4 | FRED | Federal Reserve economic data |
| 5 | Hedge Fund Monitor | Hedge fund tracking and analysis |
| 6 | Market Research | Market analysis and intelligence |
| 7 | US Fiscal Data | US government fiscal data |


🏗️ Project Structure

```text
AlterLab-Academic-Skills/
├── 📁 skills/
│   ├── 🔄 core/                # 6 pipeline + teaching + thesis skills
│   ├── 🗄️ databases/           # 39 database connectors
│   ├── 🧬 bioinformatics/      # 25 bio/genomics tools
│   ├── ⚗️ cheminformatics/     # 12 chemistry/drug discovery
│   ├── 🏥 clinical-research/   # 10 clinical/medical tools
│   ├── 📊 data-science/        # 22 ML/statistics tools
│   ├── 📈 visualization/       # 8 plotting/charting tools
│   ├── ✍️ writing-tools/       # 13 scientific writing & career tools
│   ├── 🔧 lab-integrations/    # 9 lab platform connectors
│   ├── 🌍 domain-specific/     # 17 specialized field tools
│   ├── 📄 document-tools/      # 6 file format tools
│   ├── 🔍 research-tools/      # 12 search, methods & ethics tools
│   └── 💰 finance-economics/   # 7 financial/economic tools
├── 📁 .claude/
│   └── CLAUDE.md               # Project-level Claude config
├── 📄 README.md                # This file
├── 📄 CLAUDE.md                # Project instructions
├── 📄 CONTRIBUTING.md          # Contribution guidelines
└── 📄 LICENSE                  # MIT License
```

⚙️ How Skills Work

Each .md skill file follows a consistent structure:

```markdown
---
name: skill-name
description: When to activate this skill...
---

# Skill Title

You are **RoleName**, a [role description]...

## Your Identity & Memory
## Your Core Mission
## Frameworks & Methods
## Output Templates
## Quality Standards
```

> [!NOTE]
> **Pro tip:** Combine multiple skills in one Claude Project for a multi-expert team. For example, load Deep Research + Paper Writer + Paper Reviewer for a complete research-to-publication workflow.


💡 Usage Examples

Skills activate automatically based on user intent:

| You say... | Skill activated |
|---|---|
| "Help me research the latest findings on CRISPR gene editing" | alterlab-deep-research |
| "Write an academic paper on machine learning in education" | alterlab-paper-writer |
| "Review my manuscript for methodology issues" | alterlab-paper-reviewer |
| "Search PubMed for recent studies on Alzheimer's biomarkers" | alterlab-pubmed |
| "Analyze my RNA-seq data" | alterlab-scanpy + alterlab-pydeseq2 |
| "Create a scientific poster for my conference" | alterlab-latex-posters |
| "Design a survey for my social science study" | alterlab-survey-design |
| "Help me with my IRB ethics application" | alterlab-research-ethics |
| "Build a Bayesian model for my clinical trial data" | alterlab-pymc |
| "Guide my PhD student's thesis writing" | alterlab-thesis-supervisor |


🔗 Sister Projects

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Quick ways to contribute:

  • 🛠️ Improve an existing skill with better frameworks or templates
  • ✨ Create a new skill following the structure above
  • 🐛 Report issues or suggest improvements
  • 📚 Add examples or use cases to documentation

📜 License

This project is licensed under the MIT License.

MIT License — Copyright (c) 2026 AlterLab Creative Technologies Laboratory

🙏 Credits

Built with ❤️ by AlterLab Creative Technologies Laboratory



186+ skills · 13 domains · 1 prompt away from expert-level research




If you find this project useful, please consider giving it a ⭐

