USP
Unlock 186+ Claude AI skills that transform Claude into a domain-specific research expert. Leverage real scientific frameworks, professional output templates, and deep knowledge across 13 academic domains for enhanced scholarly workflows.
Use cases
1. Generating novel protein sequences with specific functional properties using ESM3.
2. Inferring gene regulatory networks from transcriptomics data to identify TF-target relationships.
3. Accessing and integrating data from multiple bioinformatics databases such as UniProt, KEGG, and ChEMBL.
4. Performing single-cell RNA-seq analysis and managing large annotated datasets with AnnData.
5. Automating sequence manipulation, file parsing (FASTA/GenBank), and programmatic NCBI/PubMed access.
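For orientation, a minimal sketch of how use cases 03 and 05 can chain together. The accession, email address, and network calls are illustrative placeholders; the individual calls mirror the skill documentation below.

```python
from Bio import Entrez, SeqIO
from bioservices import UniProt

Entrez.email = "your.email@example.com"  # required by NCBI; placeholder address

# Use case 05: fetch a protein record programmatically from NCBI
handle = Entrez.efetch(db="protein", id="P43403", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()
print(f"{record.id}: {len(record.seq)} aa")

# Use case 03: map the same UniProt accession to KEGG via bioservices
u = UniProt(verbose=False)
print(u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403"))
```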
Detected files (8)
skills/bioinformatics/alterlab-biopython/SKILL.md (13895 bytes)
--- name: alterlab-biopython description: Comprehensive molecular biology toolkit. Use for sequence manipulation, file parsing (FASTA/GenBank/PDB), phylogenetics, and programmatic NCBI/PubMed access (Bio.Entrez). Best for batch processing, custom bioinformatics pipelines, BLAST automation. For quick lookups use gget; for multi-service integration use bioservices. Part of the AlterLab Academic Skills suite. license: MIT metadata: skill-author: AlterLab version: "1.0.0" --- # Biopython: Computational Molecular Biology in Python ## Overview Biopython is a comprehensive set of freely available Python tools for biological computation. It provides functionality for sequence manipulation, file I/O, database access, structural bioinformatics, phylogenetics, and many other bioinformatics tasks. The current version is **Biopython 1.85** (released January 2025), which supports Python 3 and requires NumPy. ## When to Use This Skill Use this skill when: - Working with biological sequences (DNA, RNA, or protein) - Reading, writing, or converting biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, etc.) - Accessing NCBI databases (GenBank, PubMed, Protein, Gene, etc.) via Entrez - Running BLAST searches or parsing BLAST results - Performing sequence alignments (pairwise or multiple sequence alignments) - Analyzing protein structures from PDB files - Creating, manipulating, or visualizing phylogenetic trees - Finding sequence motifs or analyzing motif patterns - Calculating sequence statistics (GC content, molecular weight, melting temperature, etc.) - Performing structural bioinformatics tasks - Working with population genetics data - Any other computational molecular biology task ## Core Capabilities Biopython is organized into modular sub-packages, each addressing specific bioinformatics domains: 1. **Sequence Handling** - Bio.Seq and Bio.SeqIO for sequence manipulation and file I/O 2. **Alignment Analysis** - Bio.Align and Bio.AlignIO for pairwise and multiple sequence alignments 3. **Database Access** - Bio.Entrez for programmatic access to NCBI databases 4. **BLAST Operations** - Bio.Blast for running and parsing BLAST searches 5. **Structural Bioinformatics** - Bio.PDB for working with 3D protein structures 6. **Phylogenetics** - Bio.Phylo for phylogenetic tree manipulation and visualization 7. **Advanced Features** - Motifs, population genetics, sequence utilities, and more ## Installation and Setup Install Biopython using pip (requires Python 3 and NumPy): ```python uv pip install biopython ``` For NCBI database access, always set your email address (required by NCBI): ```python from Bio import Entrez Entrez.email = "your.email@example.com" # Optional: API key for higher rate limits (10 req/s instead of 3 req/s) Entrez.api_key = "your_api_key_here" ``` ## Using This Skill This skill provides comprehensive documentation organized by functionality area. When working on a task, consult the relevant reference documentation: ### 1. Sequence Handling (Bio.Seq & Bio.SeqIO) **Reference:** `references/sequence_io.md` Use for: - Creating and manipulating biological sequences - Reading and writing sequence files (FASTA, GenBank, FASTQ, etc.) 
- Converting between file formats - Extracting sequences from large files - Sequence translation, transcription, and reverse complement - Working with SeqRecord objects **Quick example:** ```python from Bio import SeqIO # Read sequences from FASTA file for record in SeqIO.parse("sequences.fasta", "fasta"): print(f"{record.id}: {len(record.seq)} bp") # Convert GenBank to FASTA SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta") ``` ### 2. Alignment Analysis (Bio.Align & Bio.AlignIO) **Reference:** `references/alignment.md` Use for: - Pairwise sequence alignment (global and local) - Reading and writing multiple sequence alignments - Using substitution matrices (BLOSUM, PAM) - Calculating alignment statistics - Customizing alignment parameters **Quick example:** ```python from Bio import Align # Pairwise alignment aligner = Align.PairwiseAligner() aligner.mode = 'global' alignments = aligner.align("ACCGGT", "ACGGT") print(alignments[0]) ``` ### 3. Database Access (Bio.Entrez) **Reference:** `references/databases.md` Use for: - Searching NCBI databases (PubMed, GenBank, Protein, Gene, etc.) - Downloading sequences and records - Fetching publication information - Finding related records across databases - Batch downloading with proper rate limiting **Quick example:** ```python from Bio import Entrez Entrez.email = "your.email@example.com" # Search PubMed handle = Entrez.esearch(db="pubmed", term="biopython", retmax=10) results = Entrez.read(handle) handle.close() print(f"Found {results['Count']} results") ``` ### 4. BLAST Operations (Bio.Blast) **Reference:** `references/blast.md` Use for: - Running BLAST searches via NCBI web services - Running local BLAST searches - Parsing BLAST XML output - Filtering results by E-value or identity - Extracting hit sequences **Quick example:** ```python from Bio.Blast import NCBIWWW, NCBIXML # Run BLAST search result_handle = NCBIWWW.qblast("blastn", "nt", "ATCGATCGATCG") blast_record = NCBIXML.read(result_handle) # Display top hits for alignment in blast_record.alignments[:5]: print(f"{alignment.title}: E-value={alignment.hsps[0].expect}") ``` ### 5. Structural Bioinformatics (Bio.PDB) **Reference:** `references/structure.md` Use for: - Parsing PDB and mmCIF structure files - Navigating protein structure hierarchy (SMCRA: Structure/Model/Chain/Residue/Atom) - Calculating distances, angles, and dihedrals - Secondary structure assignment (DSSP) - Structure superimposition and RMSD calculation - Extracting sequences from structures **Quick example:** ```python from Bio.PDB import PDBParser # Parse structure parser = PDBParser(QUIET=True) structure = parser.get_structure("1crn", "1crn.pdb") # Calculate distance between alpha carbons chain = structure[0]["A"] distance = chain[10]["CA"] - chain[20]["CA"] print(f"Distance: {distance:.2f} Å") ``` ### 6. Phylogenetics (Bio.Phylo) **Reference:** `references/phylogenetics.md` Use for: - Reading and writing phylogenetic trees (Newick, NEXUS, phyloXML) - Building trees from distance matrices or alignments - Tree manipulation (pruning, rerooting, ladderizing) - Calculating phylogenetic distances - Creating consensus trees - Visualizing trees **Quick example:** ```python from Bio import Phylo # Read and visualize tree tree = Phylo.read("tree.nwk", "newick") Phylo.draw_ascii(tree) # Calculate distance distance = tree.distance("Species_A", "Species_B") print(f"Distance: {distance:.3f}") ``` ### 7. 
Advanced Features **Reference:** `references/advanced.md` Use for: - **Sequence motifs** (Bio.motifs) - Finding and analyzing motif patterns - **Population genetics** (Bio.PopGen) - GenePop files, Fst calculations, Hardy-Weinberg tests - **Sequence utilities** (Bio.SeqUtils) - GC content, melting temperature, molecular weight, protein analysis - **Restriction analysis** (Bio.Restriction) - Finding restriction enzyme sites - **Clustering** (Bio.Cluster) - K-means and hierarchical clustering - **Genome diagrams** (GenomeDiagram) - Visualizing genomic features **Quick example:** ```python from Bio.SeqUtils import gc_fraction, molecular_weight from Bio.Seq import Seq seq = Seq("ATCGATCGATCG") print(f"GC content: {gc_fraction(seq):.2%}") print(f"Molecular weight: {molecular_weight(seq, seq_type='DNA'):.2f} g/mol") ``` ## General Workflow Guidelines ### Reading Documentation When a user asks about a specific Biopython task: 1. **Identify the relevant module** based on the task description 2. **Read the appropriate reference file** using the Read tool 3. **Extract relevant code patterns** and adapt them to the user's specific needs 4. **Combine multiple modules** when the task requires it Example search patterns for reference files: ```bash # Find information about specific functions grep -n "SeqIO.parse" references/sequence_io.md # Find examples of specific tasks grep -n "BLAST" references/blast.md # Find information about specific concepts grep -n "alignment" references/alignment.md ``` ### Writing Biopython Code Follow these principles when writing Biopython code: 1. **Import modules explicitly** ```python from Bio import SeqIO, Entrez from Bio.Seq import Seq ``` 2. **Set Entrez email** when using NCBI databases ```python Entrez.email = "your.email@example.com" ``` 3. **Use appropriate file formats** - Check which format best suits the task ```python # Common formats: "fasta", "genbank", "fastq", "clustal", "phylip" ``` 4. **Handle files properly** - Close handles after use or use context managers ```python with open("file.fasta") as handle: records = SeqIO.parse(handle, "fasta") ``` 5. **Use iterators for large files** - Avoid loading everything into memory ```python for record in SeqIO.parse("large_file.fasta", "fasta"): # Process one record at a time ``` 6. **Handle errors gracefully** - Network operations and file parsing can fail ```python try: handle = Entrez.efetch(db="nucleotide", id=accession) except HTTPError as e: print(f"Error: {e}") ``` ## Common Patterns ### Pattern 1: Fetch Sequence from GenBank ```python from Bio import Entrez, SeqIO Entrez.email = "your.email@example.com" # Fetch sequence handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text") record = SeqIO.read(handle, "genbank") handle.close() print(f"Description: {record.description}") print(f"Sequence length: {len(record.seq)}") ``` ### Pattern 2: Sequence Analysis Pipeline ```python from Bio import SeqIO from Bio.SeqUtils import gc_fraction for record in SeqIO.parse("sequences.fasta", "fasta"): # Calculate statistics gc = gc_fraction(record.seq) length = len(record.seq) # Find ORFs, translate, etc. 
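# Note (added clarification): translating the raw record assumes frame +1, and
# Biopython warns about a partial codon if the length is not a multiple of three.
# For genomic records, extract CDS/ORF regions first and translate those instead.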
protein = record.seq.translate() print(f"{record.id}: {length} bp, GC={gc:.2%}") ``` ### Pattern 3: BLAST and Fetch Top Hits ```python from Bio.Blast import NCBIWWW, NCBIXML from Bio import Entrez, SeqIO Entrez.email = "your.email@example.com" # Run BLAST result_handle = NCBIWWW.qblast("blastn", "nt", sequence) blast_record = NCBIXML.read(result_handle) # Get top hit accessions accessions = [aln.accession for aln in blast_record.alignments[:5]] # Fetch sequences for acc in accessions: handle = Entrez.efetch(db="nucleotide", id=acc, rettype="fasta", retmode="text") record = SeqIO.read(handle, "fasta") handle.close() print(f">{record.description}") ``` ### Pattern 4: Build Phylogenetic Tree from Sequences ```python from Bio import AlignIO, Phylo from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor # Read alignment alignment = AlignIO.read("alignment.fasta", "fasta") # Calculate distances calculator = DistanceCalculator("identity") dm = calculator.get_distance(alignment) # Build tree constructor = DistanceTreeConstructor() tree = constructor.nj(dm) # Visualize Phylo.draw_ascii(tree) ``` ## Best Practices 1. **Always read relevant reference documentation** before writing code 2. **Use grep to search reference files** for specific functions or examples 3. **Validate file formats** before parsing 4. **Handle missing data gracefully** - Not all records have all fields 5. **Cache downloaded data** - Don't repeatedly download the same sequences 6. **Respect NCBI rate limits** - Use API keys and proper delays 7. **Test with small datasets** before processing large files 8. **Keep Biopython updated** to get latest features and bug fixes 9. **Use appropriate genetic code tables** for translation 10. **Document analysis parameters** for reproducibility ## Troubleshooting Common Issues ### Issue: "No handlers could be found for logger 'Bio.Entrez'" **Solution:** This is just a warning. Set Entrez.email to suppress it. ### Issue: "HTTP Error 400" from NCBI **Solution:** Check that IDs/accessions are valid and properly formatted. ### Issue: "ValueError: EOF" when parsing files **Solution:** Verify file format matches the specified format string. ### Issue: Alignment fails with "sequences are not the same length" **Solution:** Ensure sequences are aligned before using AlignIO or MultipleSeqAlignment. ### Issue: BLAST searches are slow **Solution:** Use local BLAST for large-scale searches, or cache results. ### Issue: PDB parser warnings **Solution:** Use `PDBParser(QUIET=True)` to suppress warnings, or investigate structure quality. ## Additional Resources - **Official Documentation**: https://biopython.org/docs/latest/ - **Tutorial**: https://biopython.org/docs/latest/Tutorial/ - **Cookbook**: https://biopython.org/docs/latest/Tutorial/ (advanced examples) - **GitHub**: https://github.com/biopython/biopython - **Mailing List**: biopython@biopython.org ## Quick Reference To locate information in reference files, use these search patterns: ```bash # Search for specific functions grep -n "function_name" references/*.md # Find examples of specific tasks grep -n "example" references/sequence_io.md # Find all occurrences of a module grep -n "Bio.Seq" references/*.md ``` ## Summary Biopython provides comprehensive tools for computational molecular biology. When using this skill: 1. **Identify the task domain** (sequences, alignments, databases, BLAST, structures, phylogenetics, or advanced) 2. **Consult the appropriate reference file** in the `references/` directory 3. 
**Adapt code examples** to the specific use case 4. **Combine multiple modules** when needed for complex workflows 5. **Follow best practices** for file handling, error checking, and data management. The modular reference documentation ensures detailed, searchable information for every major Biopython capability.

skills/bioinformatics/alterlab-bioservices/SKILL.md (10017 bytes)
--- name: alterlab-bioservices description: Unified Python interface to 40+ bioinformatics services. Use when querying multiple databases (UniProt, KEGG, ChEMBL, Reactome) in a single workflow with consistent API. Best for cross-database analysis, ID mapping across services. For quick single-database lookups use gget; for sequence/file manipulation use biopython. Part of the AlterLab Academic Skills suite. license: GPLv3 license metadata: skill-author: AlterLab version: "1.0.0" --- # BioServices ## Overview BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently. ## When to Use This Skill This skill should be used when: - Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam - Analyzing metabolic pathways and gene functions via KEGG or Reactome - Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information - Converting identifiers between different biological databases (KEGG↔UniProt, compound IDs) - Running sequence similarity searches (BLAST, MUSCLE alignment) - Querying gene ontology terms (QuickGO, GO annotations) - Accessing protein-protein interaction data (PSICQUIC, IntactComplex) - Mining genomic data (BioMart, ArrayExpress, ENA) - Integrating data from multiple bioinformatics resources in a single workflow ## Core Capabilities ### 1. Protein Analysis Retrieve protein information, sequences, and functional annotations: ```python from bioservices import UniProt u = UniProt(verbose=False) # Search for protein by name results = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism") # Retrieve FASTA sequence sequence = u.retrieve("P43403", "fasta") # Map identifiers between databases kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403") ``` **Key methods:** - `search()`: Query UniProt with flexible search terms - `retrieve()`: Get protein entries in various formats (FASTA, XML, tab) - `mapping()`: Convert identifiers between databases Reference: `references/services_reference.md` for complete UniProt API details. ### 2. Pathway Discovery and Analysis Access KEGG pathway information for genes and organisms: ```python from bioservices import KEGG k = KEGG() k.organism = "hsa" # Set to human # Search for organisms k.lookfor_organism("droso") # Find Drosophila species # Find pathways by name k.lookfor_pathway("B cell") # Returns matching pathway IDs # Get pathways containing specific genes pathways = k.get_pathway_by_gene("7535", "hsa") # ZAP70 gene # Retrieve and parse pathway data data = k.get("hsa04660") parsed = k.parse(data) # Extract pathway interactions interactions = k.parse_kgml_pathway("hsa04660") relations = interactions['relations'] # Protein-protein interactions # Convert to Simple Interaction Format sif_data = k.pathway2sif("hsa04660") ``` **Key methods:** - `lookfor_organism()`, `lookfor_pathway()`: Search by name - `get_pathway_by_gene()`: Find pathways containing genes - `parse_kgml_pathway()`: Extract structured pathway data - `pathway2sif()`: Get protein interaction networks Reference: `references/workflow_patterns.md` for complete pathway analysis workflows. ### 3. 
Compound Database Searches Search and cross-reference compounds across multiple databases: ```python from bioservices import KEGG, UniChem k = KEGG() # Search compounds by name results = k.find("compound", "Geldanamycin") # Returns cpd:C11222 # Get compound information with database links compound_info = k.get("cpd:C11222") # Includes ChEBI links # Cross-reference KEGG → ChEMBL using UniChem u = UniChem() chembl_id = u.get_compound_id_from_kegg("C11222") # Returns CHEMBL278315 ``` **Common workflow:** 1. Search compound by name in KEGG 2. Extract KEGG compound ID 3. Use UniChem for KEGG → ChEMBL mapping 4. ChEBI IDs are often provided in KEGG entries Reference: `references/identifier_mapping.md` for complete cross-database mapping guide. ### 4. Sequence Analysis Run BLAST searches and sequence alignments: ```python from bioservices import NCBIblast s = NCBIblast(verbose=False) # Run BLASTP against UniProtKB jobid = s.run( program="blastp", sequence=protein_sequence, stype="protein", database="uniprotkb", email="your.email@example.com" # Required by NCBI ) # Check job status and retrieve results s.getStatus(jobid) results = s.getResult(jobid, "out") ``` **Note:** BLAST jobs are asynchronous. Check status before retrieving results. ### 5. Identifier Mapping Convert identifiers between different biological databases: ```python from bioservices import UniProt, KEGG # UniProt mapping (many database pairs supported) u = UniProt() results = u.mapping( fr="UniProtKB_AC-ID", # Source database to="KEGG", # Target database query="P43403" # Identifier(s) to convert ) # KEGG gene ID → UniProt kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535") # For compounds, use UniChem from bioservices import UniChem u = UniChem() chembl_from_kegg = u.get_compound_id_from_kegg("C11222") ``` **Supported mappings (UniProt):** - UniProtKB ↔ KEGG - UniProtKB ↔ Ensembl - UniProtKB ↔ PDB - UniProtKB ↔ RefSeq - And many more (see `references/identifier_mapping.md`) ### 6. Gene Ontology Queries Access GO terms and annotations: ```python from bioservices import QuickGO g = QuickGO(verbose=False) # Retrieve GO term information term_info = g.Term("GO:0003824", frmt="obo") # Search annotations annotations = g.Annotation(protein="P43403", format="tsv") ``` ### 7. Protein-Protein Interactions Query interaction databases via PSICQUIC: ```python from bioservices import PSICQUIC s = PSICQUIC(verbose=False) # Query specific database (e.g., MINT) interactions = s.query("mint", "ZAP70 AND species:9606") # List available interaction databases databases = s.activeDBs ``` **Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others. ## Multi-Service Integration Workflows BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns: ### Complete Protein Analysis Pipeline Execute a full protein characterization workflow: ```bash python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com ``` This script demonstrates: 1. UniProt search for protein entry 2. FASTA sequence retrieval 3. BLAST similarity search 4. KEGG pathway discovery 5. 
PSICQUIC interaction mapping ### Pathway Network Analysis Analyze all pathways for an organism: ```bash python scripts/pathway_analysis.py hsa output_directory/ ``` Extracts and analyzes: - All pathway IDs for organism - Protein-protein interactions per pathway - Interaction type distributions - Exports to CSV/SIF formats ### Cross-Database Compound Search Map compound identifiers across databases: ```bash python scripts/compound_cross_reference.py Geldanamycin ``` Retrieves: - KEGG compound ID - ChEBI identifier - ChEMBL identifier - Basic compound properties ### Batch Identifier Conversion Convert multiple identifiers at once: ```bash python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG ``` ## Best Practices ### Output Format Handling Different services return data in various formats: - **XML**: Parse using BeautifulSoup (most SOAP services) - **Tab-separated (TSV)**: Pandas DataFrames for tabular data - **Dictionary/JSON**: Direct Python manipulation - **FASTA**: BioPython integration for sequence analysis ### Rate Limiting and Verbosity Control API request behavior: ```python from bioservices import KEGG k = KEGG(verbose=False) # Suppress HTTP request details k.TIMEOUT = 30 # Adjust timeout for slow connections ``` ### Error Handling Wrap service calls in try-except blocks: ```python try: results = u.search("ambiguous_query") if results: # Process results pass except Exception as e: print(f"Search failed: {e}") ``` ### Organism Codes Use standard organism abbreviations: - `hsa`: Homo sapiens (human) - `mmu`: Mus musculus (mouse) - `dme`: Drosophila melanogaster - `sce`: Saccharomyces cerevisiae (yeast) List all organisms: `k.list("organism")` or `k.organismIds` ### Integration with Other Tools BioServices works well with: - **BioPython**: Sequence analysis on retrieved FASTA data - **Pandas**: Tabular data manipulation - **PyMOL**: 3D structure visualization (retrieve PDB IDs) - **NetworkX**: Network analysis of pathway interactions - **Galaxy**: Custom tool wrappers for workflow platforms ## Resources ### scripts/ Executable Python scripts demonstrating complete workflows: - `protein_analysis_workflow.py`: End-to-end protein characterization - `pathway_analysis.py`: KEGG pathway discovery and network extraction - `compound_cross_reference.py`: Multi-database compound searching - `batch_id_converter.py`: Bulk identifier mapping utility Scripts can be executed directly or adapted for specific use cases. ### references/ Detailed documentation loaded as needed: - `services_reference.md`: Comprehensive list of all 40+ services with methods - `workflow_patterns.md`: Detailed multi-step analysis workflows - `identifier_mapping.md`: Complete guide to cross-database ID conversion Load references when working with specific services or complex integration tasks. ## Installation ```bash uv pip install bioservices ``` Dependencies are automatically managed. Package is tested on Python 3.9-3.12. ## Additional Information For detailed API documentation and advanced features, refer to: - Official documentation: https://bioservices.readthedocs.io/ - Source code: https://github.com/cokelaer/bioservices - Service-specific references in `references/services_reference.md`skills/bioinformatics/alterlab-cellxgene/SKILL.mdskillShow content (15498 bytes)
--- name: alterlab-cellxgene description: Query the CELLxGENE Census (61M+ cells) programmatically. Use when you need expression data across tissues, diseases, or cell types from the largest curated single-cell atlas. Best for population-scale queries, reference atlas comparisons. For analyzing your own data use scanpy or scvi-tools. Part of the AlterLab Academic Skills suite. license: MIT metadata: skill-author: AlterLab version: "1.0.0" --- # CZ CELLxGENE Census ## Overview The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets. The Census includes: - **61+ million cells** from human and mouse - **Standardized metadata** (cell types, tissues, diseases, donors) - **Raw gene expression** matrices - **Pre-calculated embeddings** and statistics - **Integration with PyTorch, scanpy, and other analysis tools** ## When to Use This Skill This skill should be used when: - Querying single-cell expression data by cell type, tissue, or disease - Exploring available single-cell datasets and metadata - Training machine learning models on single-cell data - Performing large-scale cross-dataset analyses - Integrating Census data with scanpy or other analysis frameworks - Computing statistics across millions of cells - Accessing pre-calculated embeddings or model predictions ## Installation and Setup Install the Census API: ```bash uv pip install cellxgene-census ``` For machine learning workflows, install additional dependencies: ```bash uv pip install cellxgene-census[experimental] ``` ## Core Workflow Patterns ### 1. Opening the Census Always use the context manager to ensure proper resource cleanup: ```python import cellxgene_census # Open latest stable version with cellxgene_census.open_soma() as census: # Work with census data # Open specific version for reproducibility with cellxgene_census.open_soma(census_version="2023-07-25") as census: # Work with census data ``` **Key points:** - Use context manager (`with` statement) for automatic cleanup - Specify `census_version` for reproducible analyses - Default opens latest "stable" release ### 2. Exploring Census Information Before querying expression data, explore available datasets and metadata. **Access summary information:** ```python # Get summary statistics summary = census["census_info"]["summary"].read().concat().to_pandas() print(f"Total cells: {summary['total_cell_count'][0]}") # Get all datasets datasets = census["census_info"]["datasets"].read().concat().to_pandas() # Filter datasets by criteria covid_datasets = datasets[datasets["disease"].str.contains("COVID", na=False)] ``` **Query cell metadata to understand available data:** ```python # Get unique cell types in a tissue cell_metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="tissue_general == 'brain' and is_primary_data == True", column_names=["cell_type"] ) unique_cell_types = cell_metadata["cell_type"].unique() print(f"Found {len(unique_cell_types)} cell types in brain") # Count cells by tissue tissue_counts = cell_metadata.groupby("tissue_general").size() ``` **Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells unless specifically analyzing duplicates. ### 3. 
Querying Expression Data (Small to Medium Scale) For queries returning < 100k cells that fit in memory, use `get_anndata()`: ```python # Basic query with cell type and tissue filters adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", # or "Mus musculus" obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True", obs_column_names=["assay", "disease", "sex", "donor_id"], ) # Query specific genes with multiple filters adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']", obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True", obs_column_names=["cell_type", "tissue_general", "donor_id"], ) ``` **Filter syntax:** - Use `obs_value_filter` for cell filtering - Use `var_value_filter` for gene filtering - Combine conditions with `and`, `or` - Use `in` for multiple values: `tissue in ['lung', 'liver']` - Select only needed columns with `obs_column_names` **Getting metadata separately:** ```python # Query cell metadata cell_metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="disease == 'COVID-19' and is_primary_data == True", column_names=["cell_type", "tissue_general", "donor_id"] ) # Query gene metadata gene_metadata = cellxgene_census.get_var( census, "homo_sapiens", value_filter="feature_name in ['CD4', 'CD8A']", column_names=["feature_id", "feature_name", "feature_length"] ) ``` ### 4. Large-Scale Queries (Out-of-Core Processing) For queries exceeding available RAM, use `axis_query()` with iterative processing: ```python import tiledbsoma as soma # Create axis query query = census["census_data"]["homo_sapiens"].axis_query( measurement_name="RNA", obs_query=soma.AxisQuery( value_filter="tissue_general == 'brain' and is_primary_data == True" ), var_query=soma.AxisQuery( value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']" ) ) # Iterate through expression matrix in chunks iterator = query.X("raw").tables() for batch in iterator: # batch is a pyarrow.Table with columns: # - soma_data: expression value # - soma_dim_0: cell (obs) coordinate # - soma_dim_1: gene (var) coordinate process_batch(batch) ``` **Computing incremental statistics:** ```python # Example: Calculate mean expression n_observations = 0 sum_values = 0.0 iterator = query.X("raw").tables() for batch in iterator: values = batch["soma_data"].to_numpy() n_observations += len(values) sum_values += values.sum() mean_expression = sum_values / n_observations ``` ### 5. 
Machine Learning with PyTorch For training models, use the experimental PyTorch integration: ```python from cellxgene_census.experimental.ml import experiment_dataloader with cellxgene_census.open_soma() as census: # Create dataloader dataloader = experiment_dataloader( census["census_data"]["homo_sapiens"], measurement_name="RNA", X_name="raw", obs_value_filter="tissue_general == 'liver' and is_primary_data == True", obs_column_names=["cell_type"], batch_size=128, shuffle=True, ) # Training loop for epoch in range(num_epochs): for batch in dataloader: X = batch["X"] # Gene expression tensor labels = batch["obs"]["cell_type"] # Cell type labels # Forward pass outputs = model(X) loss = criterion(outputs, labels) # Backward pass optimizer.zero_grad() loss.backward() optimizer.step() ``` **Train/test splitting:** ```python from cellxgene_census.experimental.ml import ExperimentDataset # Create dataset from experiment dataset = ExperimentDataset( experiment_axis_query, layer_name="raw", obs_column_names=["cell_type"], batch_size=128, ) # Split into train and test train_dataset, test_dataset = dataset.random_split( split=[0.8, 0.2], seed=42 ) ``` ### 6. Integration with Scanpy Seamlessly integrate Census data with scanpy workflows: ```python import scanpy as sc # Load data from Census adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True", ) # Standard scanpy workflow sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, n_top_genes=2000) # Dimensionality reduction sc.pp.pca(adata, n_comps=50) sc.pp.neighbors(adata) sc.tl.umap(adata) # Visualization sc.pl.umap(adata, color=["cell_type", "tissue", "disease"]) ``` ### 7. 
Multi-Dataset Integration Query and integrate multiple datasets: ```python # Strategy 1: Query multiple tissues separately tissues = ["lung", "liver", "kidney"] adatas = [] for tissue in tissues: adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True", ) adata.obs["tissue"] = tissue adatas.append(adata) # Concatenate combined = adatas[0].concatenate(adatas[1:]) # Strategy 2: Query multiple datasets directly adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True", ) ``` ## Key Concepts and Best Practices ### Always Filter for Primary Data Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times: ```python obs_value_filter="cell_type == 'B cell' and is_primary_data == True" ``` ### Specify Census Version for Reproducibility Always specify the Census version in production analyses: ```python census = cellxgene_census.open_soma(census_version="2023-07-25") ``` ### Estimate Query Size Before Loading For large queries, first check the number of cells to avoid memory issues: ```python # Get cell count metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="tissue_general == 'brain' and is_primary_data == True", column_names=["soma_joinid"] ) n_cells = len(metadata) print(f"Query will return {n_cells:,} cells") # If too large (>100k), use out-of-core processing ``` ### Use tissue_general for Broader Groupings The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses: ```python # Broader grouping obs_value_filter="tissue_general == 'immune system'" # Specific tissue obs_value_filter="tissue == 'peripheral blood mononuclear cell'" ``` ### Select Only Needed Columns Minimize data transfer by specifying only required metadata columns: ```python obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns ``` ### Check Dataset Presence for Gene-Specific Queries When analyzing specific genes, verify which datasets measured them: ```python presence = cellxgene_census.get_presence_matrix( census, "homo_sapiens", var_value_filter="feature_name in ['CD4', 'CD8A']" ) ``` ### Two-Step Workflow: Explore Then Query First explore metadata to understand available data, then query expression: ```python # Step 1: Explore what's available metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="disease == 'COVID-19' and is_primary_data == True", column_names=["cell_type", "tissue_general"] ) print(metadata.value_counts()) # Step 2: Query based on findings adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True", ) ``` ## Available Metadata Fields ### Cell Metadata (obs) Key fields for filtering: - `cell_type`, `cell_type_ontology_term_id` - `tissue`, `tissue_general`, `tissue_ontology_term_id` - `disease`, `disease_ontology_term_id` - `assay`, `assay_ontology_term_id` - `donor_id`, `sex`, `self_reported_ethnicity` - `development_stage`, `development_stage_ontology_term_id` - `dataset_id` - `is_primary_data` (Boolean: True = unique cell) ### Gene Metadata (var) - `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798") - `feature_name` (Gene symbol, e.g., "FOXP2") - `feature_length` (Gene length in base pairs) ## Reference 
Documentation This skill includes detailed reference documentation: ### references/census_schema.md Comprehensive documentation of: - Census data structure and organization - All available metadata fields - Value filter syntax and operators - SOMA object types - Data inclusion criteria **When to read:** When you need detailed schema information, full list of metadata fields, or complex filter syntax. ### references/common_patterns.md Examples and patterns for: - Exploratory queries (metadata only) - Small-to-medium queries (AnnData) - Large queries (out-of-core processing) - PyTorch integration - Scanpy integration workflows - Multi-dataset integration - Best practices and common pitfalls **When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues. ## Common Use Cases ### Use Case 1: Explore Cell Types in a Tissue ```python with cellxgene_census.open_soma() as census: cells = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="tissue_general == 'lung' and is_primary_data == True", column_names=["cell_type"] ) print(cells["cell_type"].value_counts()) ``` ### Use Case 2: Query Marker Gene Expression ```python with cellxgene_census.open_soma() as census: adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']", obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True", ) ``` ### Use Case 3: Train Cell Type Classifier ```python from cellxgene_census.experimental.ml import experiment_dataloader with cellxgene_census.open_soma() as census: dataloader = experiment_dataloader( census["census_data"]["homo_sapiens"], measurement_name="RNA", X_name="raw", obs_value_filter="is_primary_data == True", obs_column_names=["cell_type"], batch_size=128, shuffle=True, ) # Train model for epoch in range(epochs): for batch in dataloader: # Training logic pass ``` ### Use Case 4: Cross-Tissue Analysis ```python with cellxgene_census.open_soma() as census: adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True", ) # Analyze macrophage differences across tissues sc.tl.rank_genes_groups(adata, groupby="tissue_general") ``` ## Troubleshooting ### Query Returns Too Many Cells - Add more specific filters to reduce scope - Use `tissue` instead of `tissue_general` for finer granularity - Filter by specific `dataset_id` if known - Switch to out-of-core processing for large queries ### Memory Errors - Reduce query scope with more restrictive filters - Select fewer genes with `var_value_filter` - Use out-of-core processing with `axis_query()` - Process data in batches ### Duplicate Cells in Results - Always include `is_primary_data == True` in filters - Check if intentionally querying across multiple datasets ### Gene Not Found - Verify gene name spelling (case-sensitive) - Try Ensembl ID with `feature_id` instead of `feature_name` - Check dataset presence matrix to see if gene was measured - Some genes may have been filtered during Census construction ### Version Inconsistencies - Always specify `census_version` explicitly - Use same version across all analyses - Check release notes for version-specific changesskills/bioinformatics/alterlab-esm/SKILL.mdskillShow content (10632 bytes)
--- name: alterlab-esm description: Comprehensive toolkit for protein language models including ESM3 (generative multimodal protein design across sequence, structure, and function) and ESM C (efficient protein embeddings and representations). Use this skill when working with protein sequences, structures, or function prediction; designing novel proteins; generating protein embeddings; performing inverse folding; or conducting protein engineering tasks. Supports both local model usage and cloud-based Forge API for scalable inference. Part of the AlterLab Academic Skills suite. license: MIT license metadata: skill-author: AlterLab version: "1.0.0" --- # ESM: Evolutionary Scale Modeling ## Overview ESM provides state-of-the-art protein language models for understanding, generating, and designing proteins. This skill enables working with two model families: ESM3 for generative protein design across sequence, structure, and function, and ESM C for efficient protein representation learning and embeddings. ## Core Capabilities ### 1. Protein Sequence Generation with ESM3 Generate novel protein sequences with desired properties using multimodal generative modeling. **When to use:** - Designing proteins with specific functional properties - Completing partial protein sequences - Generating variants of existing proteins - Creating proteins with desired structural characteristics **Basic usage:** ```python from esm.models.esm3 import ESM3 from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig # Load model locally model: ESM3InferenceClient = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda") # Create protein prompt protein = ESMProtein(sequence="MPRT___KEND") # '_' represents masked positions # Generate completion protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8)) print(protein.sequence) ``` **For remote/cloud usage via Forge API:** ```python from esm.sdk.forge import ESM3ForgeInferenceClient from esm.sdk.api import ESMProtein, GenerationConfig # Connect to Forge model = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", url="https://forge.evolutionaryscale.ai", token="<token>") # Generate protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8)) ``` See `references/esm3-api.md` for detailed ESM3 model specifications, advanced generation configurations, and multimodal prompting examples. ### 2. Structure Prediction and Inverse Folding Use ESM3's structure track for structure prediction from sequence or inverse folding (sequence design from structure). **Structure prediction:** ```python from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig # Predict structure from sequence protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...") protein_with_structure = model.generate( protein, GenerationConfig(track="structure", num_steps=protein.sequence.count("_")) ) # Access predicted structure coordinates = protein_with_structure.coordinates # 3D coordinates pdb_string = protein_with_structure.to_pdb() ``` **Inverse folding (sequence from structure):** ```python # Design sequence for a target structure protein_with_structure = ESMProtein.from_pdb("target_structure.pdb") protein_with_structure.sequence = None # Remove sequence # Generate sequence that folds to this structure designed_protein = model.generate( protein_with_structure, GenerationConfig(track="sequence", num_steps=50, temperature=0.7) ) ``` ### 3. 
Protein Embeddings with ESM C Generate high-quality embeddings for downstream tasks like function prediction, classification, or similarity analysis. **When to use:** - Extracting protein representations for machine learning - Computing sequence similarities - Feature extraction for protein classification - Transfer learning for protein-related tasks **Basic usage:** ```python from esm.models.esmc import ESMC from esm.sdk.api import ESMProtein # Load ESM C model model = ESMC.from_pretrained("esmc-300m").to("cuda") # Get embeddings protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...") protein_tensor = model.encode(protein) # Generate embeddings embeddings = model.forward(protein_tensor) ``` **Batch processing:** ```python # Encode multiple proteins proteins = [ ESMProtein(sequence="MPRTKEIND..."), ESMProtein(sequence="AGLIVHSPQ..."), ESMProtein(sequence="KTEFLNDGR...") ] embeddings_list = [model.logits(model.forward(model.encode(p))) for p in proteins] ``` See `references/esm-c-api.md` for ESM C model details, efficiency comparisons, and advanced embedding strategies. ### 4. Function Conditioning and Annotation Use ESM3's function track to generate proteins with specific functional annotations or predict function from sequence. **Function-conditioned generation:** ```python from esm.sdk.api import ESMProtein, FunctionAnnotation, GenerationConfig # Create protein with desired function protein = ESMProtein( sequence="_" * 200, # Generate 200 residue protein function_annotations=[ FunctionAnnotation(label="fluorescent_protein", start=50, end=150) ] ) # Generate sequence with specified function functional_protein = model.generate( protein, GenerationConfig(track="sequence", num_steps=200) ) ``` ### 5. Chain-of-Thought Generation Iteratively refine protein designs using ESM3's chain-of-thought generation approach. ```python from esm.sdk.api import GenerationConfig # Multi-step refinement protein = ESMProtein(sequence="MPRT" + "_" * 100 + "KEND") # Step 1: Generate initial structure config = GenerationConfig(track="structure", num_steps=50) protein = model.generate(protein, config) # Step 2: Refine sequence based on structure config = GenerationConfig(track="sequence", num_steps=50, temperature=0.5) protein = model.generate(protein, config) # Step 3: Predict function config = GenerationConfig(track="function", num_steps=20) protein = model.generate(protein, config) ``` ### 6. Batch Processing with Forge API Process multiple proteins efficiently using Forge's async executor. ```python from esm.sdk.forge import ESM3ForgeInferenceClient import asyncio client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token="<token>") # Async batch processing async def batch_generate(proteins_list): tasks = [ client.async_generate(protein, GenerationConfig(track="sequence")) for protein in proteins_list ] return await asyncio.gather(*tasks) # Execute proteins = [ESMProtein(sequence=f"MPRT{'_' * 50}KEND") for _ in range(10)] results = asyncio.run(batch_generate(proteins)) ``` See `references/forge-api.md` for detailed Forge API documentation, authentication, rate limits, and batch processing patterns. 
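The ESM C section above lists similarity analysis as a use case but stops at raw embeddings. As a downstream sketch, assuming the embeddings have already been mean-pooled into fixed-length NumPy vectors (the dimensionality below is illustrative, not a guaranteed model property):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for an (n_proteins, embedding_dim) matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # unit-length rows
    return unit @ unit.T

# Stand-in vectors; replace with pooled ESM C embeddings from the workflow above
pooled = np.random.rand(3, 960)
print(cosine_similarity_matrix(pooled).round(3))
```

Normalizing before the dot product is what the best-practices note below means by "normalize embeddings when computing similarities."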
## Model Selection Guide **ESM3 Models (Generative):** - `esm3-sm-open-v1` (1.4B) - Open weights, local usage, good for experimentation - `esm3-medium-2024-08` (7B) - Best balance of quality and speed (Forge only) - `esm3-large-2024-03` (98B) - Highest quality, slower (Forge only) **ESM C Models (Embeddings):** - `esmc-300m` (30 layers) - Lightweight, fast inference - `esmc-600m` (36 layers) - Balanced performance - `esmc-6b` (80 layers) - Maximum representation quality **Selection criteria:** - **Local development/testing:** Use `esm3-sm-open-v1` or `esmc-300m` - **Production quality:** Use `esm3-medium-2024-08` via Forge - **Maximum accuracy:** Use `esm3-large-2024-03` or `esmc-6b` - **High throughput:** Use Forge API with batch executor - **Cost optimization:** Use smaller models, implement caching strategies ## Installation **Basic installation:** ```bash uv pip install esm ``` **With Flash Attention (recommended for faster inference):** ```bash uv pip install esm uv pip install flash-attn --no-build-isolation ``` **For Forge API access:** ```bash uv pip install esm # SDK includes Forge client ``` No additional dependencies needed. Obtain Forge API token at https://forge.evolutionaryscale.ai ## Common Workflows For detailed examples and complete workflows, see `references/workflows.md` which includes: - Novel GFP design with chain-of-thought - Protein variant generation and screening - Structure-based sequence optimization - Function prediction pipelines - Embedding-based clustering and analysis ## References This skill includes comprehensive reference documentation: - `references/esm3-api.md` - ESM3 model architecture, API reference, generation parameters, and multimodal prompting - `references/esm-c-api.md` - ESM C model details, embedding strategies, and performance optimization - `references/forge-api.md` - Forge platform documentation, authentication, batch processing, and deployment - `references/workflows.md` - Complete examples and common workflow patterns These references contain detailed API specifications, parameter descriptions, and advanced usage patterns. Load them as needed for specific tasks. 
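As a follow-up to the model selection guide, a minimal sketch of switching between the local open-weights model and a Forge-hosted model behind a single flag. The constructors and model names are copied from the examples above; treat the pattern itself as an assumption, not an official API guarantee.

```python
from esm.models.esm3 import ESM3
from esm.sdk.forge import ESM3ForgeInferenceClient

USE_FORGE = False  # flip to True for hosted, larger models

if USE_FORGE:
    # Forge-hosted model; requires a token from https://forge.evolutionaryscale.ai
    model = ESM3ForgeInferenceClient(
        model="esm3-medium-2024-08",
        url="https://forge.evolutionaryscale.ai",
        token="<token>",
    )
else:
    # Open-weights model for local development (GPU assumed, as in the examples above)
    model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

# Both clients are used with model.generate(protein, GenerationConfig(...)) above.
```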
## Best Practices **For generation tasks:** - Start with smaller models for prototyping (`esm3-sm-open-v1`) - Use temperature parameter to control diversity (0.0 = deterministic, 1.0 = diverse) - Implement iterative refinement with chain-of-thought for complex designs - Validate generated sequences with structure prediction or wet-lab experiments **For embedding tasks:** - Batch process sequences when possible for efficiency - Cache embeddings for repeated analyses - Normalize embeddings when computing similarities - Use appropriate model size based on downstream task requirements **For production deployment:** - Use Forge API for scalability and latest models - Implement error handling and retry logic for API calls - Monitor token usage and implement rate limiting - Consider AWS SageMaker deployment for dedicated infrastructure ## Resources and Documentation - **GitHub Repository:** https://github.com/evolutionaryscale/esm - **Forge Platform:** https://forge.evolutionaryscale.ai - **Scientific Paper:** Hayes et al., Science (2025) - https://www.science.org/doi/10.1126/science.ads0018 - **Blog Posts:** - ESM3 Release: https://www.evolutionaryscale.ai/blog/esm3-release - ESM C Launch: https://www.evolutionaryscale.ai/blog/esm-cambrian - **Community:** Slack community at https://bit.ly/3FKwcWd - **Model Weights:** HuggingFace EvolutionaryScale organization ## Responsible Use ESM is designed for beneficial applications in protein engineering, drug discovery, and scientific research. Follow the Responsible Biodesign Framework (https://responsiblebiodesign.ai/) when designing novel proteins. Consider biosafety and ethical implications of protein designs before experimental validation.skills/bioinformatics/alterlab-anndata/SKILL.mdskillShow content (10267 bytes)
--- name: alterlab-anndata description: Data structure for annotated matrices in single-cell analysis. Use when working with .h5ad files or integrating with the scverse ecosystem. This is the data format skill—for analysis workflows use scanpy; for probabilistic models use scvi-tools; for population-scale queries use cellxgene-census. Part of the AlterLab Academic Skills suite. license: MIT metadata: skill-author: AlterLab version: "1.0.0" --- # AnnData ## Overview AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis. ## When to Use This Skill Use this skill when: - Creating, reading, or writing AnnData objects - Working with h5ad, zarr, or other genomics data formats - Performing single-cell RNA-seq analysis - Managing large datasets with sparse matrices or backed mode - Concatenating multiple datasets or experimental batches - Subsetting, filtering, or transforming annotated data - Integrating with scanpy, scvi-tools, or other scverse ecosystem tools ## Installation ```bash uv pip install anndata # With optional dependencies uv pip install anndata[dev,test,doc] ``` ## Quick Start ### Creating an AnnData object ```python import anndata as ad import numpy as np import pandas as pd # Minimal creation X = np.random.rand(100, 2000) # 100 cells × 2000 genes adata = ad.AnnData(X) # With metadata obs = pd.DataFrame({ 'cell_type': ['T cell', 'B cell'] * 50, 'sample': ['A', 'B'] * 50 }, index=[f'cell_{i}' for i in range(100)]) var = pd.DataFrame({ 'gene_name': [f'Gene_{i}' for i in range(2000)] }, index=[f'ENSG{i:05d}' for i in range(2000)]) adata = ad.AnnData(X=X, obs=obs, var=var) ``` ### Reading data ```python # Read h5ad file adata = ad.read_h5ad('data.h5ad') # Read with backed mode (for large files) adata = ad.read_h5ad('large_data.h5ad', backed='r') # Read other formats adata = ad.read_csv('data.csv') adata = ad.read_loom('data.loom') adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') ``` ### Writing data ```python # Write h5ad file adata.write_h5ad('output.h5ad') # Write with compression adata.write_h5ad('output.h5ad', compression='gzip') # Write other formats adata.write_zarr('output.zarr') adata.write_csvs('output_dir/') ``` ### Basic operations ```python # Subset by conditions t_cells = adata[adata.obs['cell_type'] == 'T cell'] # Subset by indices subset = adata[0:50, 0:100] # Add metadata adata.obs['quality_score'] = np.random.rand(adata.n_obs) adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8 # Access dimensions print(f"{adata.n_obs} observations × {adata.n_vars} variables") ``` ## Core Capabilities ### 1. Data Structure Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components. **See**: `references/data_structure.md` for comprehensive information on: - Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw) - Creating AnnData objects from various sources - Accessing and manipulating data components - Memory-efficient practices ### 2. Input/Output Operations Read and write data in various formats with support for compression, backed mode, and cloud storage. 
**See**: `references/io_operations.md` for details on: - Native formats (h5ad, zarr) - Alternative formats (CSV, MTX, Loom, 10X, Excel) - Backed mode for large datasets - Remote data access - Format conversion - Performance optimization Common commands: ```python # Read/write h5ad adata = ad.read_h5ad('data.h5ad', backed='r') adata.write_h5ad('output.h5ad', compression='gzip') # Read 10X data adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') # Read MTX format adata = ad.read_mtx('matrix.mtx').T ``` ### 3. Concatenation Combine multiple AnnData objects along observations or variables with flexible join strategies. **See**: `references/concatenation.md` for comprehensive coverage of: - Basic concatenation (axis=0 for observations, axis=1 for variables) - Join types (inner, outer) - Merge strategies (same, unique, first, only) - Tracking data sources with labels - Lazy concatenation (AnnCollection) - On-disk concatenation for large datasets Common commands: ```python # Concatenate observations (combine samples) adata = ad.concat( [adata1, adata2, adata3], axis=0, join='inner', label='batch', keys=['batch1', 'batch2', 'batch3'] ) # Concatenate variables (combine modalities) adata = ad.concat([adata_rna, adata_protein], axis=1) # Lazy concatenation from anndata.experimental import AnnCollection collection = AnnCollection( ['data1.h5ad', 'data2.h5ad'], join_obs='outer', label='dataset' ) ``` ### 4. Data Manipulation Transform, subset, filter, and reorganize data efficiently. **See**: `references/manipulation.md` for detailed guidance on: - Subsetting (by indices, names, boolean masks, metadata conditions) - Transposition - Copying (full copies vs views) - Renaming (observations, variables, categories) - Type conversions (strings to categoricals, sparse/dense) - Adding/removing data components - Reordering - Quality control filtering Common commands: ```python # Subset by metadata filtered = adata[adata.obs['quality_score'] > 0.8] hv_genes = adata[:, adata.var['highly_variable']] # Transpose adata_T = adata.T # Copy vs view view = adata[0:100, :] # View (lightweight reference) copy = adata[0:100, :].copy() # Independent copy # Convert strings to categoricals adata.strings_to_categoricals() ``` ### 5. Best Practices Follow recommended patterns for memory efficiency, performance, and reproducibility. 
**See**: `references/best_practices.md` for guidelines on: - Memory management (sparse matrices, categoricals, backed mode) - Views vs copies - Data storage optimization - Performance optimization - Working with raw data - Metadata management - Reproducibility - Error handling - Integration with other tools - Common pitfalls and solutions Key recommendations: ```python # Use sparse matrices for sparse data from scipy.sparse import csr_matrix adata.X = csr_matrix(adata.X) # Convert strings to categoricals adata.strings_to_categoricals() # Use backed mode for large files adata = ad.read_h5ad('large.h5ad', backed='r') # Store raw before filtering adata.raw = adata.copy() adata = adata[:, adata.var['highly_variable']] ``` ## Integration with Scverse Ecosystem AnnData serves as the foundational data structure for the scverse ecosystem: ### Scanpy (Single-cell analysis) ```python import scanpy as sc # Preprocessing sc.pp.filter_cells(adata, min_genes=200) sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, n_top_genes=2000) # Dimensionality reduction sc.pp.pca(adata, n_comps=50) sc.pp.neighbors(adata, n_neighbors=15) sc.tl.umap(adata) sc.tl.leiden(adata) # Visualization sc.pl.umap(adata, color=['cell_type', 'leiden']) ``` ### Muon (Multimodal data) ```python import muon as mu # Combine RNA and protein data mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein}) ``` ### PyTorch integration ```python from anndata.experimental import AnnLoader # Create DataLoader for deep learning dataloader = AnnLoader(adata, batch_size=128, shuffle=True) for batch in dataloader: X = batch.X # Train model ``` ## Common Workflows ### Single-cell RNA-seq analysis ```python import anndata as ad import scanpy as sc # 1. Load data adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5') # 2. Quality control adata.obs['n_genes'] = (adata.X > 0).sum(axis=1) adata.obs['n_counts'] = adata.X.sum(axis=1) adata = adata[adata.obs['n_genes'] > 200] adata = adata[adata.obs['n_counts'] < 50000] # 3. Store raw adata.raw = adata.copy() # 4. Normalize and filter sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, n_top_genes=2000) adata = adata[:, adata.var['highly_variable']] # 5. 
adata.write_h5ad('processed.h5ad')
```

### Batch integration

```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```

### Working with large datasets

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset
adata_subset = high_quality.to_memory()

# Process subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)
```

## Troubleshooting

### Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

### Index alignment issues

Always align external data on index:

```python
# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

## Additional Resources

- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata

skills/bioinformatics/alterlab-arboreto/SKILL.md
--- name: alterlab-arboreto description: Infer gene regulatory networks (GRNs) from gene expression data using scalable algorithms (GRNBoost2, GENIE3). Use when analyzing transcriptomics data (bulk RNA-seq, single-cell RNA-seq) to identify transcription factor-target gene relationships and regulatory interactions. Supports distributed computation for large-scale datasets. Part of the AlterLab Academic Skills suite. license: MIT metadata: skill-author: AlterLab version: "1.0.0" --- # Arboreto ## Overview Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters. **Core capability**: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions). ## Quick Start Install arboreto: ```bash uv pip install arboreto ``` Basic GRN inference: ```python import pandas as pd from arboreto.algo import grnboost2 if __name__ == '__main__': # Load expression data (genes as columns) expression_matrix = pd.read_csv('expression_data.tsv', sep='\t') # Infer regulatory network network = grnboost2(expression_data=expression_matrix) # Save results (TF, target, importance) network.to_csv('network.tsv', sep='\t', index=False, header=False) ``` **Critical**: Always use `if __name__ == '__main__':` guard because Dask spawns new processes. ## Core Capabilities ### 1. Basic GRN Inference For standard GRN inference workflows including: - Input data preparation (Pandas DataFrame or NumPy array) - Running inference with GRNBoost2 or GENIE3 - Filtering by transcription factors - Output format and interpretation **See**: `references/basic_inference.md` **Use the ready-to-run script**: `scripts/basic_grn_inference.py` for standard inference tasks: ```bash python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777 ``` ### 2. Algorithm Selection Arboreto provides two algorithms: **GRNBoost2 (Recommended)**: - Fast gradient boosting-based inference - Optimized for large datasets (10k+ observations) - Default choice for most analyses **GENIE3**: - Random Forest-based inference - Original multiple regression approach - Use for comparison or validation Quick comparison: ```python from arboreto.algo import grnboost2, genie3 # Fast, recommended network_grnboost = grnboost2(expression_data=matrix) # Classic algorithm network_genie3 = genie3(expression_data=matrix) ``` **For detailed algorithm comparison, parameters, and selection guidance**: `references/algorithms.md` ### 3. 
Distributed Computing Scale inference from local multi-core to cluster environments: **Local (default)** - Uses all available cores automatically: ```python network = grnboost2(expression_data=matrix) ``` **Custom local client** - Control resources: ```python from distributed import LocalCluster, Client local_cluster = LocalCluster(n_workers=10, memory_limit='8GB') client = Client(local_cluster) network = grnboost2(expression_data=matrix, client_or_address=client) client.close() local_cluster.close() ``` **Cluster computing** - Connect to remote Dask scheduler: ```python from distributed import Client client = Client('tcp://scheduler:8786') network = grnboost2(expression_data=matrix, client_or_address=client) ``` **For cluster setup, performance optimization, and large-scale workflows**: `references/distributed_computing.md` ## Installation ```bash uv pip install arboreto ``` **Dependencies**: scipy, scikit-learn, numpy, pandas, dask, distributed ## Common Use Cases ### Single-Cell RNA-seq Analysis ```python import pandas as pd from arboreto.algo import grnboost2 if __name__ == '__main__': # Load single-cell expression matrix (cells x genes) sc_data = pd.read_csv('scrna_counts.tsv', sep='\t') # Infer cell-type-specific regulatory network network = grnboost2(expression_data=sc_data, seed=42) # Filter high-confidence links high_confidence = network[network['importance'] > 0.5] high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False) ``` ### Bulk RNA-seq with TF Filtering ```python from arboreto.utils import load_tf_names from arboreto.algo import grnboost2 if __name__ == '__main__': # Load data expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t') tf_names = load_tf_names('human_tfs.txt') # Infer with TF restriction network = grnboost2( expression_data=expression_data, tf_names=tf_names, seed=123 ) network.to_csv('tf_target_network.tsv', sep='\t', index=False) ``` ### Comparative Analysis (Multiple Conditions) ```python from arboreto.algo import grnboost2 if __name__ == '__main__': # Infer networks for different conditions conditions = ['control', 'treatment_24h', 'treatment_48h'] for condition in conditions: data = pd.read_csv(f'{condition}_expression.tsv', sep='\t') network = grnboost2(expression_data=data, seed=42) network.to_csv(f'{condition}_network.tsv', sep='\t', index=False) ``` ## Output Interpretation Arboreto returns a DataFrame with regulatory links: | Column | Description | |--------|-------------| | `TF` | Transcription factor (regulator) | | `target` | Target gene | | `importance` | Regulatory importance score (higher = stronger) | **Filtering strategy**: - Top N links per target gene - Importance threshold (e.g., > 0.5) - Statistical significance testing (permutation tests) ## Integration with pySCENIC Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis: ```python # Step 1: Use arboreto for GRN inference from arboreto.algo import grnboost2 network = grnboost2(expression_data=sc_data, tf_names=tf_list) # Step 2: Use pySCENIC for regulon identification and activity scoring # (See pySCENIC documentation for downstream analysis) ``` ## Reproducibility Always set a seed for reproducible results: ```python network = grnboost2(expression_data=matrix, seed=777) ``` Run multiple seeds for robustness analysis: ```python from distributed import LocalCluster, Client if __name__ == '__main__': client = Client(LocalCluster()) seeds = [42, 123, 777] networks = [] for seed in seeds: net = grnboost2(expression_data=matrix, 
client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links
    consensus = analyze_consensus(networks)
```

## Troubleshooting

**Memory errors**: Reduce dataset size by filtering low-variance genes or use distributed computing

**Slow performance**: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list

**Dask errors**: Ensure `if __name__ == '__main__':` guard is present in scripts

**Empty results**: Check data format (genes as columns), verify TF names match gene names

skills/bioinformatics/alterlab-cobrapy/SKILL.md
--- name: alterlab-cobrapy description: Constraint-based metabolic modeling (COBRA). FBA, FVA, gene knockouts, flux sampling, SBML models, for systems biology and metabolic engineering analysis. Part of the AlterLab Academic Skills suite. license: GPL-2.0 license metadata: skill-author: AlterLab version: "1.0.0" --- # COBRApy - Constraint-Based Reconstruction and Analysis ## Overview COBRApy is a Python library for constraint-based reconstruction and analysis (COBRA) of metabolic models, essential for systems biology research. Work with genome-scale metabolic models, perform computational simulations of cellular metabolism, conduct metabolic engineering analyses, and predict phenotypic behaviors. ## Core Capabilities COBRApy provides comprehensive tools organized into several key areas: ### 1. Model Management Load existing models from repositories or files: ```python from cobra.io import load_model # Load bundled test models model = load_model("textbook") # E. coli core model model = load_model("ecoli") # Full E. coli model model = load_model("salmonella") # Load from files from cobra.io import read_sbml_model, load_json_model, load_yaml_model model = read_sbml_model("path/to/model.xml") model = load_json_model("path/to/model.json") model = load_yaml_model("path/to/model.yml") ``` Save models in various formats: ```python from cobra.io import write_sbml_model, save_json_model, save_yaml_model write_sbml_model(model, "output.xml") # Preferred format save_json_model(model, "output.json") # For Escher compatibility save_yaml_model(model, "output.yml") # Human-readable ``` ### 2. Model Structure and Components Access and inspect model components: ```python # Access components model.reactions # DictList of all reactions model.metabolites # DictList of all metabolites model.genes # DictList of all genes # Get specific items by ID or index reaction = model.reactions.get_by_id("PFK") metabolite = model.metabolites[0] # Inspect properties print(reaction.reaction) # Stoichiometric equation print(reaction.bounds) # Flux constraints print(reaction.gene_reaction_rule) # GPR logic print(metabolite.formula) # Chemical formula print(metabolite.compartment) # Cellular location ``` ### 3. Flux Balance Analysis (FBA) Perform standard FBA simulation: ```python # Basic optimization solution = model.optimize() print(f"Objective value: {solution.objective_value}") print(f"Status: {solution.status}") # Access fluxes print(solution.fluxes["PFK"]) print(solution.fluxes.head()) # Fast optimization (objective value only) objective_value = model.slim_optimize() # Change objective model.objective = "ATPM" solution = model.optimize() ``` Parsimonious FBA (minimize total flux): ```python from cobra.flux_analysis import pfba solution = pfba(model) ``` Geometric FBA (find central solution): ```python from cobra.flux_analysis import geometric_fba solution = geometric_fba(model) ``` ### 4. Flux Variability Analysis (FVA) Determine flux ranges for all reactions: ```python from cobra.flux_analysis import flux_variability_analysis # Standard FVA fva_result = flux_variability_analysis(model) # FVA at 90% optimality fva_result = flux_variability_analysis(model, fraction_of_optimum=0.9) # Loopless FVA (eliminates thermodynamically infeasible loops) fva_result = flux_variability_analysis(model, loopless=True) # FVA for specific reactions fva_result = flux_variability_analysis( model, reaction_list=["PFK", "FBA", "PGI"] ) ``` ### 5. 
Gene and Reaction Deletion Studies Perform knockout analyses: ```python from cobra.flux_analysis import ( single_gene_deletion, single_reaction_deletion, double_gene_deletion, double_reaction_deletion ) # Single deletions gene_results = single_gene_deletion(model) reaction_results = single_reaction_deletion(model) # Double deletions (uses multiprocessing) double_gene_results = double_gene_deletion( model, processes=4 # Number of CPU cores ) # Manual knockout using context manager with model: model.genes.get_by_id("b0008").knock_out() solution = model.optimize() print(f"Growth after knockout: {solution.objective_value}") # Model automatically reverts after context exit ``` ### 6. Growth Media and Minimal Media Manage growth medium: ```python # View current medium print(model.medium) # Modify medium (must reassign entire dict) medium = model.medium medium["EX_glc__D_e"] = 10.0 # Set glucose uptake medium["EX_o2_e"] = 0.0 # Anaerobic conditions model.medium = medium # Calculate minimal media from cobra.medium import minimal_medium # Minimize total import flux min_medium = minimal_medium(model, minimize_components=False) # Minimize number of components (uses MILP, slower) min_medium = minimal_medium( model, minimize_components=True, open_exchanges=True ) ``` ### 7. Flux Sampling Sample the feasible flux space: ```python from cobra.sampling import sample # Sample using OptGP (default, supports parallel processing) samples = sample(model, n=1000, method="optgp", processes=4) # Sample using ACHR samples = sample(model, n=1000, method="achr") # Validate samples from cobra.sampling import OptGPSampler sampler = OptGPSampler(model, processes=4) sampler.sample(1000) validation = sampler.validate(sampler.samples) print(validation.value_counts()) # Should be all 'v' for valid ``` ### 8. Production Envelopes Calculate phenotype phase planes: ```python from cobra.flux_analysis import production_envelope # Standard production envelope envelope = production_envelope( model, reactions=["EX_glc__D_e", "EX_o2_e"], objective="EX_ac_e" # Acetate production ) # With carbon yield envelope = production_envelope( model, reactions=["EX_glc__D_e", "EX_o2_e"], carbon_sources="EX_glc__D_e" ) # Visualize (use matplotlib or pandas plotting) import matplotlib.pyplot as plt envelope.plot(x="EX_glc__D_e", y="EX_o2_e", kind="scatter") plt.show() ``` ### 9. Gapfilling Add reactions to make models feasible: ```python from cobra.flux_analysis import gapfill # Prepare universal model with candidate reactions universal = load_model("universal") # Perform gapfilling with model: # Remove reactions to create gaps for demonstration model.remove_reactions([model.reactions.PGI]) # Find reactions needed solution = gapfill(model, universal) print(f"Reactions to add: {solution}") ``` ### 10. 
Model Building

Build models from scratch:

```python
from cobra import Model, Reaction, Metabolite

# Create model
model = Model("my_model")

# Create metabolites
atp_c = Metabolite("atp_c", formula="C10H12N5O13P3", name="ATP", compartment="c")
adp_c = Metabolite("adp_c", formula="C10H12N5O10P2", name="ADP", compartment="c")
pi_c = Metabolite("pi_c", formula="HO4P", name="Phosphate", compartment="c")

# Create reaction
reaction = Reaction("ATPASE")
reaction.name = "ATP hydrolysis"
reaction.subsystem = "Energy"
reaction.lower_bound = 0.0
reaction.upper_bound = 1000.0

# Add metabolites with stoichiometry
reaction.add_metabolites({
    atp_c: -1.0,
    adp_c: 1.0,
    pi_c: 1.0
})

# Add gene-reaction rule
reaction.gene_reaction_rule = "(gene1 and gene2) or gene3"

# Add to model
model.add_reactions([reaction])

# Add boundary reactions
model.add_boundary(atp_c, type="exchange")
model.add_boundary(adp_c, type="demand")

# Set objective
model.objective = "ATPASE"
```

## Common Workflows

### Workflow 1: Load Model and Predict Growth

```python
from cobra.io import load_model

# Load model
model = load_model("ecoli")

# Run FBA
solution = model.optimize()
print(f"Growth rate: {solution.objective_value:.3f} /h")

# Show active pathways
print(solution.fluxes[solution.fluxes.abs() > 1e-6])
```

### Workflow 2: Gene Knockout Screen

```python
from cobra.io import load_model
from cobra.flux_analysis import single_gene_deletion

# Load model
model = load_model("ecoli")

# Wild-type growth rate for comparison
wild_type_growth = model.slim_optimize()

# Perform single gene deletions
results = single_gene_deletion(model)

# Find essential genes (growth < threshold)
essential_genes = results[results["growth"] < 0.01]
print(f"Found {len(essential_genes)} essential genes")

# Find genes with minimal impact
neutral_genes = results[results["growth"] > 0.9 * wild_type_growth]
```

### Workflow 3: Media Optimization

```python
from cobra.io import load_model
from cobra.medium import minimal_medium

# Load model
model = load_model("ecoli")

# Calculate minimal medium for 50% of max growth
target_growth = model.slim_optimize() * 0.5
min_medium = minimal_medium(
    model,
    target_growth,
    minimize_components=True
)
print(f"Minimal medium components: {len(min_medium)}")
print(min_medium)
```

### Workflow 4: Flux Uncertainty Analysis

```python
from cobra.io import load_model
from cobra.flux_analysis import flux_variability_analysis
from cobra.sampling import sample

# Load model
model = load_model("ecoli")

# First check flux ranges at optimality
fva = flux_variability_analysis(model, fraction_of_optimum=1.0)

# For reactions with large ranges, sample to understand distribution
samples = sample(model, n=1000)

# Analyze specific reaction
reaction_id = "PFK"
import matplotlib.pyplot as plt
samples[reaction_id].hist(bins=50)
plt.xlabel(f"Flux through {reaction_id}")
plt.ylabel("Frequency")
plt.show()
```

### Workflow 5: Context Manager for Temporary Changes

Use context managers to make temporary modifications:

```python
# Model remains unchanged outside context
with model:
    # Temporarily change objective
    model.objective = "ATPM"

    # Temporarily modify bounds
    model.reactions.EX_glc__D_e.lower_bound = -5.0

    # Temporarily knock out genes
    model.genes.b0008.knock_out()

    # Optimize with changes
    solution = model.optimize()
    print(f"Modified growth: {solution.objective_value}")

# All changes automatically reverted
solution = model.optimize()
print(f"Original growth: {solution.objective_value}")
```

## Key Concepts

### DictList Objects

Models use `DictList` objects for reactions, metabolites, and genes - behaving like both lists and dictionaries:
```python
# Access by index
first_reaction = model.reactions[0]

# Access by ID
pfk = model.reactions.get_by_id("PFK")

# Query methods
atp_reactions = model.reactions.query("atp")
```

### Flux Constraints

Reaction bounds define feasible flux ranges:

- **Irreversible**: `lower_bound = 0, upper_bound > 0`
- **Reversible**: `lower_bound < 0, upper_bound > 0`
- Set both bounds simultaneously with `.bounds` to avoid inconsistencies

### Gene-Reaction Rules (GPR)

Boolean logic linking genes to reactions:

```python
# AND logic (both required)
reaction.gene_reaction_rule = "gene1 and gene2"

# OR logic (either sufficient)
reaction.gene_reaction_rule = "gene1 or gene2"

# Complex logic
reaction.gene_reaction_rule = "(gene1 and gene2) or (gene3 and gene4)"
```

### Exchange Reactions

Special reactions representing metabolite import/export:

- Named with prefix `EX_` by convention
- Positive flux = secretion, negative flux = uptake
- Managed through `model.medium` dictionary

## Best Practices

1. **Use context managers** for temporary modifications to avoid state management issues
2. **Validate models** before analysis using `model.slim_optimize()` to ensure feasibility
3. **Check solution status** after optimization - `optimal` indicates successful solve
4. **Use loopless FVA** when thermodynamic feasibility matters
5. **Set fraction_of_optimum** appropriately in FVA to explore suboptimal space
6. **Parallelize** computationally expensive operations (sampling, double deletions)
7. **Prefer SBML format** for model exchange and long-term storage
8. **Use slim_optimize()** when only objective value needed for performance
9. **Validate flux samples** to ensure numerical stability

## Troubleshooting

**Infeasible solutions**: Check medium constraints, reaction bounds, and model consistency

**Slow optimization**: Try different solvers (GLPK, CPLEX, Gurobi) via `model.solver`

**Unbounded solutions**: Verify exchange reactions have appropriate upper bounds

**Import errors**: Ensure correct file format and valid SBML identifiers

## References

For detailed workflows and API patterns, refer to:

- `references/workflows.md` - Comprehensive step-by-step workflow examples
- `references/api_quick_reference.md` - Common function signatures and patterns

Official documentation: https://cobrapy.readthedocs.io/en/latest/

skills/bioinformatics/alterlab-deeptools/SKILL.md
--- name: alterlab-deeptools description: NGS analysis toolkit. BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles (TSS, peaks), for ChIP-seq, RNA-seq, ATAC-seq visualization. Part of the AlterLab Academic Skills suite. license: MIT metadata: skill-author: AlterLab version: "1.0.0" --- # deepTools: NGS Data Analysis Toolkit ## Overview deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. Use deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments. **Core capabilities:** - Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph) - Quality control assessment (fingerprint, correlation, coverage) - Sample comparison and correlation analysis - Heatmap and profile plot generation around genomic features - Enrichment analysis and peak region visualization ## When to Use This Skill This skill should be used when: - **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data" - **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis" - **Visualization**: "create heatmap around TSS", "plot ChIP signal", "visualize enrichment", "generate profile plot" - **Sample comparison**: "compare treatment vs control", "correlate samples", "PCA analysis" - **Analysis workflows**: "analyze ChIP-seq data", "RNA-seq coverage", "ATAC-seq analysis", "complete workflow" - **Working with specific file types**: BAM files, bigWig files, BED region files in genomics context ## Quick Start For users new to deepTools, start with file validation and common workflows: ### 1. Validate Input Files Before running any analysis, validate BAM, bigWig, and BED files using the validation script: ```bash python scripts/validate_files.py --bam sample1.bam sample2.bam --bed regions.bed ``` This checks file existence, BAM indices, and format correctness. ### 2. Generate Workflow Template For standard analyses, use the workflow generator to create customized scripts: ```bash # List available workflows python scripts/workflow_generator.py --list # Generate ChIP-seq QC workflow python scripts/workflow_generator.py chipseq_qc -o qc_workflow.sh \ --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \ --genome-size 2913022398 # Make executable and run chmod +x qc_workflow.sh ./qc_workflow.sh ``` ### 3. Most Common Operations See `assets/quick_reference.md` for frequently used commands and parameters. ## Installation ```bash uv pip install deeptools ``` ## Core Workflows deepTools workflows typically follow this pattern: **QC → Normalization → Comparison/Visualization** ### ChIP-seq Quality Control Workflow When users request ChIP-seq QC or quality assessment: 1. **Generate workflow script** using `scripts/workflow_generator.py chipseq_qc` 2. 
**Key QC steps**: - Sample correlation (multiBamSummary + plotCorrelation) - PCA analysis (plotPCA) - Coverage assessment (plotCoverage) - Fragment size validation (bamPEFragmentSize) - ChIP enrichment strength (plotFingerprint) **Interpreting results:** - **Correlation**: Replicates should cluster together with high correlation (>0.9) - **Fingerprint**: Strong ChIP shows steep rise; flat diagonal indicates poor enrichment - **Coverage**: Assess if sequencing depth is adequate for analysis Full workflow details in `references/workflows.md` → "ChIP-seq Quality Control Workflow" ### ChIP-seq Complete Analysis Workflow For full ChIP-seq analysis from BAM to visualizations: 1. **Generate coverage tracks** with normalization (bamCoverage) 2. **Create comparison tracks** (bamCompare for log2 ratio) 3. **Compute signal matrices** around features (computeMatrix) 4. **Generate visualizations** (plotHeatmap, plotProfile) 5. **Enrichment analysis** at peaks (plotEnrichment) Use `scripts/workflow_generator.py chipseq_analysis` to generate template. Complete command sequences in `references/workflows.md` → "ChIP-seq Analysis Workflow" ### RNA-seq Coverage Workflow For strand-specific RNA-seq coverage tracks: Use bamCoverage with `--filterRNAstrand` to separate forward and reverse strands. **Important:** NEVER use `--extendReads` for RNA-seq (would extend over splice junctions). Use normalization: CPM for fixed bins, RPKM for gene-level analysis. Template available: `scripts/workflow_generator.py rnaseq_coverage` Details in `references/workflows.md` → "RNA-seq Coverage Workflow" ### ATAC-seq Analysis Workflow ATAC-seq requires Tn5 offset correction: 1. **Shift reads** using alignmentSieve with `--ATACshift` 2. **Generate coverage** with bamCoverage 3. **Analyze fragment sizes** (expect nucleosome ladder pattern) 4. 
**Visualize at peaks** if available Template: `scripts/workflow_generator.py atacseq` Full workflow in `references/workflows.md` → "ATAC-seq Workflow" ## Tool Categories and Common Tasks ### BAM/bigWig Processing **Convert BAM to normalized coverage:** ```bash bamCoverage --bam input.bam --outFileName output.bw \ --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \ --binSize 10 --numberOfProcessors 8 ``` **Compare two samples (log2 ratio):** ```bash bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \ --operation log2 --scaleFactorsMethod readCount ``` **Key tools:** bamCoverage, bamCompare, multiBamSummary, multiBigwigSummary, correctGCBias, alignmentSieve Complete reference: `references/tools_reference.md` → "BAM and bigWig File Processing Tools" ### Quality Control **Check ChIP enrichment:** ```bash plotFingerprint -b input.bam chip.bam -o fingerprint.png \ --extendReads 200 --ignoreDuplicates ``` **Sample correlation:** ```bash multiBamSummary bins --bamfiles *.bam -o counts.npz plotCorrelation -in counts.npz --corMethod pearson \ --whatToShow heatmap -o correlation.png ``` **Key tools:** plotFingerprint, plotCoverage, plotCorrelation, plotPCA, bamPEFragmentSize Complete reference: `references/tools_reference.md` → "Quality Control Tools" ### Visualization **Create heatmap around TSS:** ```bash # Compute matrix computeMatrix reference-point -S signal.bw -R genes.bed \ -b 3000 -a 3000 --referencePoint TSS -o matrix.gz # Generate heatmap plotHeatmap -m matrix.gz -o heatmap.png \ --colorMap RdBu --kmeans 3 ``` **Create profile plot:** ```bash plotProfile -m matrix.gz -o profile.png \ --plotType lines --colors blue red ``` **Key tools:** computeMatrix, plotHeatmap, plotProfile, plotEnrichment Complete reference: `references/tools_reference.md` → "Visualization Tools" ## Normalization Methods Choosing the correct normalization is critical for valid comparisons. Consult `references/normalization_methods.md` for comprehensive guidance. **Quick selection guide:** - **ChIP-seq coverage**: Use RPGC or CPM - **ChIP-seq comparison**: Use bamCompare with log2 and readCount - **RNA-seq bins**: Use CPM - **RNA-seq genes**: Use RPKM (accounts for gene length) - **ATAC-seq**: Use RPGC or CPM **Normalization methods:** - **RPGC**: 1× genome coverage (requires --effectiveGenomeSize) - **CPM**: Counts per million mapped reads - **RPKM**: Reads per kb per million (accounts for region length) - **BPM**: Bins per million - **None**: Raw counts (not recommended for comparisons) Full explanation: `references/normalization_methods.md` ## Effective Genome Sizes RPGC normalization requires effective genome size. Common values: | Organism | Assembly | Size | Usage | |----------|----------|------|-------| | Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` | | Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` | | Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` | | *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` | | *C. 
elegans* | ce10/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` | Complete table with read-length-specific values: `references/effective_genome_sizes.md` ## Common Parameters Across Tools Many deepTools commands share these options: **Performance:** - `--numberOfProcessors, -p`: Enable parallel processing (always use available cores) - `--region`: Process specific regions for testing (e.g., `chr1:1-1000000`) **Read Filtering:** - `--ignoreDuplicates`: Remove PCR duplicates (recommended for most analyses) - `--minMappingQuality`: Filter by alignment quality (e.g., `--minMappingQuality 10`) - `--minFragmentLength` / `--maxFragmentLength`: Fragment length bounds - `--samFlagInclude` / `--samFlagExclude`: SAM flag filtering **Read Processing:** - `--extendReads`: Extend to fragment length (ChIP-seq: YES, RNA-seq: NO) - `--centerReads`: Center at fragment midpoint for sharper signals ## Best Practices ### File Validation **Always validate files first** using `scripts/validate_files.py` to check: - File existence and readability - BAM indices present (.bai files) - BED format correctness - File sizes reasonable ### Analysis Strategy 1. **Start with QC**: Run correlation, coverage, and fingerprint analysis before proceeding 2. **Test on small regions**: Use `--region chr1:1-10000000` for parameter testing 3. **Document commands**: Save full command lines for reproducibility 4. **Use consistent normalization**: Apply same method across samples in comparisons 5. **Verify genome assembly**: Ensure BAM and BED files use matching genome builds ### ChIP-seq Specific - **Always extend reads** for ChIP-seq: `--extendReads 200` - **Remove duplicates**: Use `--ignoreDuplicates` in most cases - **Check enrichment first**: Run plotFingerprint before detailed analysis - **GC correction**: Only apply if significant bias detected; never use `--ignoreDuplicates` after GC correction ### RNA-seq Specific - **Never extend reads** for RNA-seq (would span splice junctions) - **Strand-specific**: Use `--filterRNAstrand forward/reverse` for stranded libraries - **Normalization**: CPM for bins, RPKM for genes ### ATAC-seq Specific - **Apply Tn5 correction**: Use alignmentSieve with `--ATACshift` - **Fragment filtering**: Set appropriate min/max fragment lengths - **Check nucleosome pattern**: Fragment size plot should show ladder pattern ### Performance Optimization 1. **Use multiple processors**: `--numberOfProcessors 8` (or available cores) 2. **Increase bin size** for faster processing and smaller files 3. **Process chromosomes separately** for memory-limited systems 4. **Pre-filter BAM files** using alignmentSieve to create reusable filtered files 5. **Use bigWig over bedGraph**: Compressed and faster to process ## Troubleshooting ### Common Issues **BAM index missing:** ```bash samtools index input.bam ``` **Out of memory:** Process chromosomes individually using `--region`: ```bash bamCoverage --bam input.bam -o chr1.bw --region chr1 ``` **Slow processing:** Increase `--numberOfProcessors` and/or increase `--binSize` **bigWig files too large:** Increase bin size: `--binSize 50` or larger ### Validation Errors Run validation script to identify issues: ```bash python scripts/validate_files.py --bam *.bam --bed regions.bed ``` Common errors and solutions explained in script output. 
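To tie the shared options together, here is a minimal sketch of a ChIP-seq coverage call that combines the filtering, read-processing, and performance flags listed above; the BAM name is a placeholder, and `--region chr1` restricts the run to one chromosome for testing (drop it for the full-genome run):

```bash
# Sketch only: combines the shared deepTools options described above.
# chip.bam is a placeholder for your own alignment file.
bamCoverage --bam chip.bam --outFileName chip.filtered.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8 \
    --ignoreDuplicates --minMappingQuality 10 \
    --extendReads 200 \
    --region chr1
```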
## Reference Documentation This skill includes comprehensive reference documentation: ### references/tools_reference.md Complete documentation of all deepTools commands organized by category: - BAM and bigWig processing tools (9 tools) - Quality control tools (6 tools) - Visualization tools (3 tools) - Miscellaneous tools (2 tools) Each tool includes: - Purpose and overview - Key parameters with explanations - Usage examples - Important notes and best practices **Use this reference when:** Users ask about specific tools, parameters, or detailed usage. ### references/workflows.md Complete workflow examples for common analyses: - ChIP-seq quality control workflow - ChIP-seq complete analysis workflow - RNA-seq coverage workflow - ATAC-seq analysis workflow - Multi-sample comparison workflow - Peak region analysis workflow - Troubleshooting and performance tips **Use this reference when:** Users need complete analysis pipelines or workflow examples. ### references/normalization_methods.md Comprehensive guide to normalization methods: - Detailed explanation of each method (RPGC, CPM, RPKM, BPM, etc.) - When to use each method - Formulas and interpretation - Selection guide by experiment type - Common pitfalls and solutions - Quick reference table **Use this reference when:** Users ask about normalization, comparing samples, or which method to use. ### references/effective_genome_sizes.md Effective genome size values and usage: - Common organism values (human, mouse, fly, worm, zebrafish) - Read-length-specific values - Calculation methods - When and how to use in commands - Custom genome calculation instructions **Use this reference when:** Users need genome size for RPGC normalization or GC bias correction. ## Helper Scripts ### scripts/validate_files.py Validates BAM, bigWig, and BED files for deepTools analysis. Checks file existence, indices, and format. **Usage:** ```bash python scripts/validate_files.py --bam sample1.bam sample2.bam \ --bed peaks.bed --bigwig signal.bw ``` **When to use:** Before starting any analysis, or when troubleshooting errors. ### scripts/workflow_generator.py Generates customizable bash script templates for common deepTools workflows. **Available workflows:** - `chipseq_qc`: ChIP-seq quality control - `chipseq_analysis`: Complete ChIP-seq analysis - `rnaseq_coverage`: Strand-specific RNA-seq coverage - `atacseq`: ATAC-seq with Tn5 correction **Usage:** ```bash # List workflows python scripts/workflow_generator.py --list # Generate workflow python scripts/workflow_generator.py chipseq_qc -o qc.sh \ --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \ --genome-size 2913022398 --threads 8 # Run generated workflow chmod +x qc.sh ./qc.sh ``` **When to use:** Users request standard workflows or need template scripts to customize. ## Assets ### assets/quick_reference.md Quick reference card with most common commands, effective genome sizes, and typical workflow pattern. **When to use:** Users need quick command examples without detailed documentation. ## Handling User Requests ### For New Users 1. Start with installation verification 2. Validate input files using `scripts/validate_files.py` 3. Recommend appropriate workflow based on experiment type 4. Generate workflow template using `scripts/workflow_generator.py` 5. Guide through customization and execution ### For Experienced Users 1. Provide specific tool commands for requested operations 2. Reference appropriate sections in `references/tools_reference.md` 3. Suggest optimizations and best practices 4. 
Offer troubleshooting for issues ### For Specific Tasks **"Convert BAM to bigWig":** - Use bamCoverage with appropriate normalization - Recommend RPGC or CPM based on use case - Provide effective genome size for organism - Suggest relevant parameters (extendReads, ignoreDuplicates, binSize) **"Check ChIP quality":** - Run full QC workflow or use plotFingerprint specifically - Explain interpretation of results - Suggest follow-up actions based on results **"Create heatmap":** - Guide through two-step process: computeMatrix → plotHeatmap - Help choose appropriate matrix mode (reference-point vs scale-regions) - Suggest visualization parameters and clustering options **"Compare samples":** - Recommend bamCompare for two-sample comparison - Suggest multiBamSummary + plotCorrelation for multiple samples - Guide normalization method selection ### Referencing Documentation When users need detailed information: - **Tool details**: Direct to specific sections in `references/tools_reference.md` - **Workflows**: Use `references/workflows.md` for complete analysis pipelines - **Normalization**: Consult `references/normalization_methods.md` for method selection - **Genome sizes**: Reference `references/effective_genome_sizes.md` Search references using grep patterns: ```bash # Find tool documentation grep -A 20 "^### toolname" references/tools_reference.md # Find workflow grep -A 50 "^## Workflow Name" references/workflows.md # Find normalization method grep -A 15 "^### Method Name" references/normalization_methods.md ``` ## Example Interactions **User: "I need to analyze my ChIP-seq data"** Response approach: 1. Ask about files available (BAM files, peaks, genes) 2. Validate files using validation script 3. Generate chipseq_analysis workflow template 4. Customize for their specific files and organism 5. Explain each step as script runs **User: "Which normalization should I use?"** Response approach: 1. Ask about experiment type (ChIP-seq, RNA-seq, etc.) 2. Ask about comparison goal (within-sample or between-sample) 3. Consult `references/normalization_methods.md` selection guide 4. Recommend appropriate method with justification 5. Provide command example with parameters **User: "Create a heatmap around TSS"** Response approach: 1. Verify bigWig and gene BED files available 2. Use computeMatrix with reference-point mode at TSS 3. Generate plotHeatmap with appropriate visualization parameters 4. Suggest clustering if dataset is large 5. Offer profile plot as complement ## Key Reminders - **File validation first**: Always validate input files before analysis - **Normalization matters**: Choose appropriate method for comparison type - **Extend reads carefully**: YES for ChIP-seq, NO for RNA-seq - **Use all cores**: Set `--numberOfProcessors` to available cores - **Test on regions**: Use `--region` for parameter testing - **Check QC first**: Run quality control before detailed analysis - **Document everything**: Save commands for reproducibility - **Reference documentation**: Use comprehensive references for detailed guidance
README
📢 Featured in awesome-claude-skills (5.7k ⭐)
🧬 186+ purpose-built Claude AI skills for faculty, researchers & academicians
Organized across 13 research domains — from bioinformatics to digital humanities
Research Pipeline · Scientific Databases · Bioinformatics · Data Science · Visualization · Clinical Research · and more
Explore Skills » · Quick Start · Domain Overview · Contributing · Report Bug
Built by AlterLab Creative Technologies Laboratory
Not tied to any specific university — these skills work for any researcher, anywhere.

| 🎯 Plug & Play | 🧠 Domain Expert | 🔬 Real Frameworks | 🌐 Universal |
|---|---|---|---|
| Drop a SKILL.md file into a Claude Project | Each skill transforms Claude into a domain-specific expert | Built on real scientific frameworks and methods | Works for any researcher, anywhere |
📋 Table of Contents
🎯 What Is This?
A comprehensive suite of 186+ purpose-built Claude AI skills for faculty members, academicians, and researchers — organized into 13 domain categories spanning the full academic research lifecycle.
Each skill transforms Claude into a domain-specific expert assistant tailored to academic research, scientific computing, and scholarly publishing workflows.
> [!TIP]
> **How it works:** Each skill is a structured `.md` prompt file. Drop it into a Claude Project or Claude Code, and Claude instantly becomes your research expert — with real scientific frameworks, professional output templates, and deep domain knowledge.
✨ Key Features
| | Feature | Description |
|---|---|---|
| 🔬 | Research-Ready | Skills built on real scientific methods, databases, and professional frameworks used by working researchers |
| 🤖 | Multi-Agent Pipelines | Core skills chain together: Research → Write → Review → Publish in a seamless workflow |
| 📊 | 39 Database Integration Skills | Instant access to PubMed, ChEMBL, UniProt, ClinicalTrials.gov, COSMIC, and more |
| 🧬 | Deep Domain Coverage | From single-cell RNA-seq analysis to quantum computing, from clinical trials to digital humanities |
| 📝 | Publication-Quality Output | LaTeX papers, conference posters, grant proposals, scientific visualizations — all formatted to professional standards |
| 🔄 | Mix & Match | Combine multiple skills in one Claude Project for a multi-expert research team |
🗂️ Domain Overview
| | Domain | Skills | Focus Areas |
|---|---|---|---|
| 🔄 | Core Pipeline | 6 | Multi-agent research → write → review → publish pipeline + teaching + thesis |
| 🗄️ | Databases | 39 | Connectors to scientific databases — PubMed, ChEMBL, UniProt, ClinicalTrials.gov, COSMIC, and more |
| 🧬 | Bioinformatics | 25 | Genomics, proteomics, molecular biology — Scanpy, BioPython, ESM, single-cell analysis |
| ⚗️ | Cheminformatics | 12 | Chemistry and drug discovery — RDKit, molecular dynamics, docking, ADMET |
| 🏥 | Clinical Research | 10 | Clinical decision support, treatment planning, medical imaging, regulatory |
| 📊 | Data Science | 22 | ML/statistics — scikit-learn, PyTorch Lightning, SHAP, transformers |
| 📈 | Visualization | 8 | Scientific plotting — Matplotlib, Seaborn, Plotly, schematics, infographics |
| ✍️ | Writing Tools | 13 | Scientific writing, citations, grants, posters, academic career |
| 🔧 | Lab Integrations | 9 | Laboratory platforms — Benchling, DNAnexus, Opentrons, Protocols.io |
| 🌍 | Domain-Specific | 17 | Quantum computing, geospatial, materials science, social science methods, digital humanities |
| 📄 | Document Tools | 6 | File format handling — DOCX, PDF, PPTX, XLSX, Markdown |
| 🔍 | Research Tools | 12 | Search, discovery, Zotero, qualitative methods, ethics, surveys, open science |
| 💰 | Finance & Economics | 7 | FRED, Alpha Vantage, SEC EDGAR, market research |
🚀 Quick Start
Option 1 — Claude Projects (Recommended)
1. Go to claude.ai → Projects → Create Project
2. Upload SKILL.md files from your domain folder into the project's Knowledge section
3. Start chatting — Claude now has your skills loaded
Option 2 — Claude Code CLI
git clone https://github.com/AlterLab-IEU/AlterLab-Academic-Skills.git
cd AlterLab-Academic-Skills
claude "help me research the latest findings on CRISPR gene editing"
Option 3 — Pick Individual Skills
Browse the `skills/` folder and download only the ones you need. Every skill is a standalone `.md` file.
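If you work in Claude Code, one way to wire up a single downloaded skill is to copy its SKILL.md into the project-level skills directory. This is a minimal sketch, assuming Claude Code discovers project skills under `.claude/skills/<skill-name>/SKILL.md`; adjust the paths to your setup:

```bash
# Sketch: install one skill for Claude Code (assumes .claude/skills/ discovery)
mkdir -p .claude/skills/alterlab-biopython
cp skills/bioinformatics/alterlab-biopython/SKILL.md \
   .claude/skills/alterlab-biopython/SKILL.md
```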
⚡ Core Pipeline — 6 Skills
The heart of the system — a multi-agent research-to-publication pipeline with 39 specialized agents, plus teaching and thesis supervision tools.
| # | Skill | Agents | What It Does |
|---|---|---|---|
| 1 | 🔬 Deep Research | 13 | Multi-mode research with systematic review, Socratic dialogue, fact-checking |
| 2 | 📝 Paper Writer | 12 | Academic paper authoring with LaTeX, bilingual support, 9 writing modes |
| 3 | 🔍 Paper Reviewer | 7 | Multi-perspective peer review with Devil's Advocate, 0–100 quality rubrics |
| 4 | 🔄 Research Pipeline | 7 | 10-stage orchestrator with integrity verification and material passports |
| 5 | 🎓 Teaching Design | — | Course design, syllabi, rubrics, Bloom's taxonomy, backward design |
| 6 | 📋 Thesis Supervisor | — | Dissertation guidance, defense prep, committee management |
📚 All 186+ Skills
🗄️ Databases — Scientific Database Connectors (39 Skills)
Click to expand full database skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | AlphaFold DB | Protein structure predictions from AlphaFold |
| 2 | arXiv | Preprint search and discovery |
| 3 | BindingDB | Binding affinity data for drug-target interactions |
| 4 | bioRxiv | Biology preprint search and monitoring |
| 5 | BRENDA | Enzyme functional data |
| 6 | cBioPortal | Cancer genomics data exploration |
| 7 | ChEMBL | Bioactive molecules with drug-like properties |
| 8 | ClinicalTrials.gov | Clinical trial registry search |
| 9 | ClinPGx | Clinical pharmacogenomics data |
| 10 | ClinVar | Genomic variation and human health |
| 11 | COSMIC | Catalogue of somatic mutations in cancer |
| 12 | Data Commons | Google's open knowledge graph |
| 13 | DepMap | Cancer dependency mapping |
| 14 | DrugBank | Drug and drug target information |
| 15 | ENA | European Nucleotide Archive |
| 16 | Ensembl | Genome annotation and variation |
| 17 | FDA | FDA drug and device data |
| 18 | Gene DB | Gene-level data aggregation |
| 19 | GEO | Gene Expression Omnibus datasets |
| 20 | gnomAD | Genome aggregation and variant frequency |
| 21 | GTEx | Tissue-specific gene expression |
| 22 | GWAS Catalog | Genome-wide association studies |
| 23 | HMDB | Human Metabolome Database |
| 24 | Imaging Data Commons | Cancer imaging data |
| 25 | InterPro | Protein families and domains |
| 26 | JASPAR | Transcription factor binding profiles |
| 27 | KEGG | Biological pathways and networks |
| 28 | Metabolomics Workbench | Metabolomics data repository |
| 29 | Monarch Initiative | Disease-gene associations |
| 30 | OpenAlex | Open scholarly metadata |
| 31 | Open Targets | Drug target identification |
| 32 | PDB | Protein 3D structure database |
| 33 | PubChem | Chemical information database |
| 34 | PubMed | Biomedical literature search |
| 35 | Reactome | Biological pathway database |
| 36 | STRING | Protein-protein interaction networks |
| 37 | UniProt | Protein sequence and function |
| 38 | USPTO | Patent search and analysis |
| 39 | ZINC | Commercially-available compounds for docking |
🧬 Bioinformatics — Genomics, Proteomics & Molecular Biology (25 Skills)
Click to expand full bioinformatics skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | AnnData | Annotated data matrices for single-cell |
| 2 | Arboreto | Gene regulatory network inference |
| 3 | BioPython | General-purpose bioinformatics toolkit |
| 4 | BioServices | Programmatic access to biological web services |
| 5 | CellxGene | Interactive single-cell data exploration |
| 6 | COBRApy | Constraint-based metabolic modeling |
| 7 | deepTools | NGS data analysis and visualization |
| 8 | ESM | Protein language models |
| 9 | ETE Toolkit | Phylogenetic tree analysis and visualization |
| 10 | FlowIO | Flow cytometry data handling |
| 11 | gget | Query genomic databases from Python |
| 12 | Glycoengineering | Glycan analysis and engineering |
| 13 | HistoLab | Computational histopathology |
| 14 | LaminDB | Data lineage and biological data management |
| 15 | Neuropixels | Neural probe data processing |
| 16 | PathML | Machine learning for pathology |
| 17 | Phylogenetics | Evolutionary tree construction |
| 18 | PyDESeq2 | Differential gene expression analysis |
| 19 | pyOpenMS | Mass spectrometry data analysis |
| 20 | pysam | SAM/BAM file manipulation |
| 21 | Scanpy | Single-cell analysis in Python |
| 22 | scikit-bio | Bioinformatics algorithms and data structures |
| 23 | scVelo | RNA velocity analysis |
| 24 | scvi-tools | Deep generative models for single-cell |
| 25 | TileDB-VCF | Population-scale genomic variant storage |
⚗️ Cheminformatics — Chemistry & Drug Discovery (12 Skills)
Click to expand full cheminformatics skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Datamol | Molecular data manipulation |
| 2 | DeepChem | Deep learning for chemistry |
| 3 | DiffDock | Diffusion-based molecular docking |
| 4 | matchms | Mass spectra matching and similarity |
| 5 | MedChem | Medicinal chemistry analysis |
| 6 | Molecular Dynamics | MD simulation setup and analysis |
| 7 | MolFeat | Molecular featurization |
| 8 | PrimeKG | Precision medicine knowledge graph |
| 9 | PyTDC | Therapeutics Data Commons access |
| 10 | RDKit | Core cheminformatics toolkit |
| 11 | Rowan | Computational chemistry workflows |
| 12 | TorchDrug | Graph neural networks for drug discovery |
🏥 Clinical Research — Clinical Decision Support & Medical Tools (10 Skills)
Click to expand full clinical research skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Clinical Decision | Evidence-based clinical decision support |
| 2 | Clinical Reports | Structured clinical report generation |
| 3 | Consciousness Council | Multi-perspective medical ethics deliberation |
| 4 | DHDNA Profiler | Digital health DNA profiling |
| 5 | ISO 13485 | Medical device quality management |
| 6 | NeuroKit2 | Neurophysiological signal processing |
| 7 | PyDicom | DICOM medical image handling |
| 8 | PyHealth | Healthcare ML pipelines |
| 9 | Treatment Plans | Treatment planning and protocol design |
| 10 | What-If Oracle | Counterfactual clinical reasoning |
📊 Data Science — ML, Statistics & Data Analysis (22 Skills)
Click to expand full data science skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Dask | Parallel computing and out-of-core data |
| 2 | EDA | Exploratory data analysis |
| 3 | NetworkX | Network/graph analysis |
| 4 | Polars | High-performance DataFrames |
| 5 | PufferLib | Reinforcement learning environments |
| 6 | PyMC | Bayesian statistical modeling |
| 7 | pymoo | Multi-objective optimization |
| 8 | PyTorch Lightning | Structured deep learning training |
| 9 | scikit-learn | Classical machine learning |
| 10 | scikit-survival | Survival analysis |
| 11 | SHAP | Model interpretability and feature importance |
| 12 | SimPy | Discrete-event simulation |
| 13 | Stable-Baselines3 | Reinforcement learning algorithms |
| 14 | Statistical Analysis | Classical statistical tests and methods |
| 15 | statsmodels | Statistical models and econometrics |
| 16 | SymPy | Symbolic mathematics |
| 17 | TimesFM | Foundation model for time series |
| 18 | PyTorch Geometric | Graph neural networks |
| 19 | Transformers | Hugging Face transformer models |
| 20 | UMAP | Dimensionality reduction |
| 21 | Vaex | Out-of-core DataFrames for big data |
| 22 | Zarr | Chunked, compressed N-dimensional arrays |
📈 Visualization — Scientific Plotting & Graphics (8 Skills)
Click to expand full visualization skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Generate Image | AI image generation for research figures |
| 2 | Infographics | Research infographic design |
| 3 | Matplotlib | Publication-quality 2D plots |
| 4 | Mermaid | Diagrams and flowcharts as code |
| 5 | Plotly | Interactive scientific visualizations |
| 6 | Scientific Schematics | Technical diagrams and schematics |
| 7 | Scientific Viz | Advanced scientific visualization |
| 8 | Seaborn | Statistical data visualization |
✍️ Writing Tools — Scientific Writing, Citations & Publishing (13 Skills)
Click to expand full writing tools skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Academic Career | Academic CV, research statements, tenure dossier |
| 2 | Citation Management | Reference formatting and management |
| 3 | Hypothesis Generator | Research hypothesis development |
| 4 | LaTeX Posters | Conference poster design in LaTeX |
| 5 | Literature Review | Systematic literature review assistance |
| 6 | Paper-to-Web | Convert papers to web-friendly formats |
| 7 | Peer Review | Peer review writing assistance |
| 8 | PPTX Posters | Conference posters in PowerPoint |
| 9 | Research Grants | Grant proposal writing |
| 10 | Scholar Eval | Academic output evaluation |
| 11 | Scientific Slides | Research presentation creation |
| 12 | Scientific Writing | Academic writing style and structure |
| 13 | Venue Templates | Journal/conference formatting templates |
🔧 Lab Integrations — Laboratory Platform Connectors (9 Skills)
Click to expand full lab integration skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Benchling | Molecular biology data platform |
| 2 | DNAnexus | Genomic data analysis platform |
| 3 | Ginkgo Cloud | Synthetic biology platform |
| 4 | LabArchive | Electronic lab notebook |
| 5 | LatchBio | Bioinformatics workflow platform |
| 6 | OMERO | Biological image management |
| 7 | Opentrons | Lab automation and robotics |
| 8 | Protocols.io | Protocol sharing and management |
| 9 | PyLabRobot | Lab robotics programming |
🌍 Domain-Specific — Quantum, Geospatial, Materials, Social Science & More (17 Skills)
Click to expand full domain-specific skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Adaptyv | Adaptive experimental design |
| 2 | Aeon | Time series classification |
| 3 | AstroPy | Astronomy and astrophysics |
| 4 | Cirq | Quantum circuit design (Google) |
| 5 | FluidSim | Fluid dynamics simulation |
| 6 | GeniML | Genomic interval ML |
| 7 | GeoMaster | Geospatial analysis mastery |
| 8 | GeoPandas | Geospatial data analysis |
| 9 | GTARS | Genomic tool for annotation |
| 10 | HypoGenic | Hypothesis generation from data |
| 11 | Modal | Cloud compute for research |
| 12 | PennyLane | Quantum machine learning |
| 13 | Pymatgen | Materials science analysis |
| 14 | Qiskit | Quantum computing (IBM) |
| 15 | QuTiP | Quantum dynamics simulation |
| 16 | Social Science Methods | Discourse analysis, QCA, Delphi, process tracing |
| 17 | Digital Humanities | Text mining, corpus linguistics, stylometry, OCR |
📄 Document Tools — File Format Handling (6 Skills)
Click to expand full document tools skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | DOCX | Word document generation and manipulation |
| 2 | MarkItDown | Convert documents to Markdown |
| 3 | Open Notebook | Open-format research notebooks |
| 4 | PDF | PDF generation and processing |
| 5 | PPTX | PowerPoint presentation creation |
| 6 | XLSX | Excel spreadsheet handling |
🔍 Research Tools — Search, Discovery, Methods & Reference Management (12 Skills)
Click to expand full research tools skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | BGPT Search | AI-powered research search |
| 2 | Mixed Methods | Mixed-methods research design and integration |
| 3 | Open Science | Preregistration, FAIR data, open access publishing |
| 4 | Parallel Web | Multi-source parallel web search |
| 5 | Perplexity | Perplexity-powered research queries |
| 6 | PyZotero | Zotero reference manager integration |
| 7 | Qualitative Methods | Thematic analysis, grounded theory, IPA, coding |
| 8 | Research Ethics | IRB applications, informed consent, GDPR |
| 9 | Research Lookup | Quick research paper discovery |
| 10 | Scientific Brainstorm | Structured research ideation |
| 11 | Scientific Thinking | Critical scientific reasoning frameworks |
| 12 | Survey Design | Questionnaire construction and validation |
💰 Finance & Economics — Financial Data & Analysis (7 Skills)
Click to expand full finance & economics skills list
| # | Skill | What It Does |
|---|---|---|
| 1 | Alpha Vantage | Stock and financial market data |
| 2 | Denario | Financial data processing |
| 3 | EDGAR Tools | SEC filing search and analysis |
| 4 | FRED | Federal Reserve economic data |
| 5 | Hedge Fund Monitor | Hedge fund tracking and analysis |
| 6 | Market Research | Market analysis and intelligence |
| 7 | US Fiscal Data | US government fiscal data |
🏗️ Project Structure
AlterLab-Academic-Skills/
├── 📁 skills/
│ ├── 🔄 core/ # 6 pipeline + teaching + thesis skills
│ ├── 🗄️ databases/ # 39 database connectors
│ ├── 🧬 bioinformatics/ # 25 bio/genomics tools
│ ├── ⚗️ cheminformatics/ # 12 chemistry/drug discovery
│ ├── 🏥 clinical-research/ # 10 clinical/medical tools
│ ├── 📊 data-science/ # 22 ML/statistics tools
│ ├── 📈 visualization/ # 8 plotting/charting tools
│ ├── ✍️ writing-tools/ # 13 scientific writing & career tools
│ ├── 🔧 lab-integrations/ # 9 lab platform connectors
│ ├── 🌍 domain-specific/ # 17 specialized field tools
│ ├── 📄 document-tools/ # 6 file format tools
│ ├── 🔍 research-tools/ # 12 search, methods & ethics tools
│ └── 💰 finance-economics/ # 7 financial/economic tools
├── 📁 .claude/
│ └── CLAUDE.md # Project-level Claude config
├── 📄 README.md # This file
├── 📄 CLAUDE.md # Project instructions
├── 📄 CONTRIBUTING.md # Contribution guidelines
└── 📄 LICENSE # MIT License
⚙️ How Skills Work
Each `.md` skill file follows a consistent structure:

```markdown
---
name: skill-name
description: When to activate this skill...
---

# Skill Title

You are **RoleName**, a [role description]...

## Your Identity & Memory
## Your Core Mission
## Frameworks & Methods
## Output Templates
## Quality Standards
```
> [!NOTE]
> **Pro tip:** Combine multiple skills in one Claude Project for a multi-expert team. For example, load Deep Research + Paper Writer + Paper Reviewer for a complete research-to-publication workflow.
💡 Usage Examples
Skills activate automatically based on user intent:
| You say... | Skill activated |
|---|---|
| "Help me research the latest findings on CRISPR gene editing" | alterlab-deep-research |
| "Write an academic paper on machine learning in education" | alterlab-paper-writer |
| "Review my manuscript for methodology issues" | alterlab-paper-reviewer |
| "Search PubMed for recent studies on Alzheimer's biomarkers" | alterlab-pubmed |
| "Analyze my RNA-seq data" | alterlab-scanpy + alterlab-pydeseq2 |
| "Create a scientific poster for my conference" | alterlab-latex-posters |
| "Design a survey for my social science study" | alterlab-survey-design |
| "Help me with my IRB ethics application" | alterlab-research-ethics |
| "Build a Bayesian model for my clinical trial data" | alterlab-pymc |
| "Guide my PhD student's thesis writing" | alterlab-thesis-supervisor |
🔗 Sister Projects
- AlterLab-FC-Skills — 72 agentic skills for communication students
- AlterLab_GameForge — 34 game dev skills from concept to launch
🤝 Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Quick ways to contribute:
- 🛠️ Improve an existing skill with better frameworks or templates
- ✨ Create a new skill following the structure above
- 🐛 Report issues or suggest improvements
- 📚 Add examples or use cases to documentation
📜 License
This project is licensed under the MIT License.
MIT License — Copyright (c) 2026 AlterLab Creative Technologies Laboratory
🙏 Credits
Built with ❤️ by AlterLab Creative Technologies Laboratory
186+ skills · 13 domains · 1 prompt away from expert-level research
If you find this project useful, please consider giving it a ⭐