Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
skypilot-org

skypilot

Quality
9.0

SkyPilot provides a unified framework for AI teams to run, manage, and scale AI workloads on over 25 clouds, Kubernetes, and Slurm clusters. It simplifies infrastructure management for machine learning, enabling cost optimization, GPU utilization, and seamless job orchestration across diverse compute environments.

USP

Unlike single-cloud or single-cluster solutions, SkyPilot offers a single interface to manage AI compute across 25+ clouds, Kubernetes, and Slurm, maximizing GPU utilization and enabling intelligent failover for cost savings and availability.

Use cases

  • 01 Managing compute resources on any cloud, Slurm, or Kubernetes cluster
  • 02 Launching CPU/GPU/TPU workloads (e.g., H100) on diverse infrastructure
  • 03 Running training, fine-tuning, or batch inference jobs
  • 04 Deploying inference servers with autoscaling and multi-cloud replicas
  • 05 Finding the cheapest or most available GPU across multiple cloud providers

Detected files (2)

  • agent/skills/skypilot/SKILL.md (skill, 18500 bytes)
    ---
    name: skypilot
    description: "Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads, training or fine-tuning models on cloud GPUs, deploying inference servers (vllm, TGI, etc.) with autoscaling, writing or debugging SkyPilot task YAML files, using spot/preemptible instances for cost savings, comparing GPU prices across clouds, managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them, troubleshooting resource availability or SkyPilot errors, or optimizing cost and GPU availability."
    ---
    
    # SkyPilot Skill
    
    SkyPilot is a unified framework to run AI workloads on any cloud, Slurm, or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, CoreWeave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.
    
    ## When to Use SkyPilot
    
    **Use SkyPilot when you need to:**
    - Manage compute resources on any cloud, Slurm, or Kubernetes cluster
    - Launch CPU/GPU/TPU (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes or Slurm
    - Run training, fine-tuning, or batch inference jobs
    - Serve models with autoscaling and multi-cloud replicas (SkyServe)
    - Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
    - Find the cheapest or most available GPU across clouds
    
    **Don't use SkyPilot for:**
    - Local-only workloads (use Docker/conda directly)
    
    ## Capabilities: When to Use What
    
    SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
    
    **1. SkyPilot Clusters** (`sky launch` / `sky exec`) — Interactive development and debugging
    - Use during initial development, debugging, and experimentation
    - Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), iterate quickly
    - Cluster stays up until you stop/down it or autostop triggers
    - Best for: prototyping, debugging, short experiments
    
    **2. Managed Jobs** (`sky jobs launch`) — Long-running training and batch jobs
    - Use when submitting long-running jobs that should run unattended
    - Manages the full lifecycle: provisioning, execution, recovery, and teardown
    - Automatically recovers from spot preemptions, quota limits, and transient failures
    - Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
    - Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
    
    **3. SkyServe** (`sky serve up`) — Production model serving
    - Use when serving models at scale with autoscaling
    - Start with `sky launch` + open port to test your serving setup, then use `sky serve up` to scale
    - Provides load balancing, autoscaling, and multi-cloud replicas
    - Best for: model serving endpoints, API services
    
    ## Before You Start (Agent Bootstrap)
    
    Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.
    
    **Step 1: Check installation and API server connectivity**
    
    ```bash
    sky api info
    ```
    | Output contains | Meaning | Next action |
    |-----------------|---------|-------------|
    | Server version and status | Server is running and connected | **Bootstrap done.** Skip to user's task. |
    | `No SkyPilot API server is connected` | No server connected | Go to "Start or connect a server" below. |
    | `Could not connect to SkyPilot API server` | Remote server unreachable or auth expired | Tell the user and suggest `sky api login --relogin -e <endpoint>` to reconnect. |
    | `command not found: sky` | SkyPilot not installed | Go to "Install SkyPilot" below. |
    
    **Install SkyPilot** (only if `sky` command not found):
    ```bash
    pip install "skypilot[aws,gcp,kubernetes]"  # Pick clouds the user needs
    ```
    Ask the user which clouds they need if unclear, then re-run `sky api info`.
    
    **Start or connect a server** (only if no server is connected):
    
    Ask the user:
    > Do you have an existing SkyPilot API server to connect to, or should I start one locally?
    
    - **Connect to existing server:** `sky api login -e <API_SERVER_URL>` — get the URL from the user.
    - **Start locally:** `sky api start`
    
    After either path, re-run `sky api info` to confirm the server is reachable.
    
    **Step 2: Check cloud credentials** (only for fresh setups — skip if the server was already running)
    ```bash
    sky check -o json
    ```
    This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see [Troubleshooting](references/troubleshooting.md#1-installation-and-credentials)).
    
    ## Essential Commands
    
    Use `-o json` with status/query commands to get structured JSON output instead of tables.
    
    **Clusters** — interactive development and debugging:
    
    | Command | Description |
    |---------|-------------|
    | `sky launch -c NAME task.yaml` | Launch a cluster or run a task |
    | `sky exec NAME task.yaml` | Run task on existing cluster (skips provisioning); syncs workdir each time |
    | `sky exec NAME task.yaml -d` | Same, but detach immediately (don't stream logs) |
    | `sky status -o json` | Show all clusters |
    | `sky logs NAME` | Stream job logs from a cluster |
    | `sky logs NAME --no-follow` | Print existing logs and exit immediately |
    | `sky logs NAME --tail 50` | Print last 50 lines of logs and exit |
    | `sky logs NAME --status` | Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
    | `sky queue NAME -o json` | List jobs on a cluster with status (structured JSON) |
    | `sky stop NAME` / `sky start NAME` | Stop/restart to save costs (preserves disk) |
    | `sky down NAME` | Tear down a cluster completely |
    | `sky gpus list -o json` | List available GPU types across clouds |
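
    The `--status` exit codes above lend themselves to a small scripting helper. A minimal Python sketch (the `job_status` helper and its name are illustrative, not part of SkyPilot):

    ```python
    # Map `sky logs CLUSTER JOB_ID --status` exit codes (documented in the
    # table above) to human-readable job states for use in scripts.
    import subprocess

    EXIT_CODE_STATUS = {
        0: "succeeded",
        100: "failed",
        101: "not finished",
        102: "not found",
        103: "cancelled",
    }

    def job_status(cluster: str, job_id: int) -> str:
        """Run `sky logs --status` and translate its exit code."""
        proc = subprocess.run(
            ["sky", "logs", cluster, str(job_id), "--status"],
            capture_output=True,
        )
        return EXIT_CODE_STATUS.get(proc.returncode, f"unknown ({proc.returncode})")
    ```

    Because `--status` blocks until the job finishes, a call like `job_status("mycluster", 1)` doubles as a wait-for-completion step.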
    
    **Managed Jobs** — long-running unattended workloads:
    
    | Command | Description |
    |---------|-------------|
    | `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) |
    | `sky jobs queue -o json` | Show all managed jobs and their status |
    | `sky jobs logs JOB_ID` | Stream logs from a managed job |
    | `sky jobs cancel JOB_ID` | Cancel a managed job |
    
    **SkyServe** — model serving with autoscaling:
    
    | Command | Description |
    |---------|-------------|
    | `sky serve up serve.yaml -n NAME` | Start a model serving service |
    | `sky serve status NAME` | Show service status and endpoint URL |
    | `sky serve update NAME new.yaml` | Update a running service (rolling) |
    | `sky serve down NAME` | Tear down a service |
    
    For complete CLI reference, see [CLI Reference](references/cli-reference.md).
    
    ## Quick Start
    
    ```bash
    # Launch a GPU cluster
    sky launch -c mycluster --gpus H100 -- nvidia-smi
    
    # Run a task from YAML
    sky launch -c mycluster task.yaml
    
    # SSH into cluster
    ssh mycluster
    
    # Connect VSCode or Cursor to the cluster for interactive development
    code --remote ssh-remote+mycluster /home/user/sky_workdir
    # or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir
    
    # Tear down
    sky down mycluster
    ```
    
    ## Task YAML Structure
    
    The task YAML is SkyPilot's primary interface. All fields are optional.
    
    ```yaml
    # task.yaml
    name: my-training-job
    
    # Local directory to sync to remote ~/sky_workdir
    workdir: .
    
    # Number of nodes (for distributed training)
    num_nodes: 1
    
    resources:
      # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
      accelerators: H200:8
      # Optional: pin to a specific cloud/region/infra
      # infra: aws  # or aws/us-east-1, k8s, ssh/my-pool
      # If infra is left out, SkyPilot automatically fails over across all
      # enabled clouds/regions to find the cheapest available option.
      # Use spot instances for cost savings
      use_spot: false
      # Disk size in GB
      disk_size: 256
      # Open ports for serving
      ports: 8080
    
    # Environment variables (accessible in file_mounts, setup, and run)
    envs:
      MODEL_NAME: my-model
      BATCH_SIZE: 32
    
    # Setup: runs once on cluster creation, cached on reuse
    setup: |
      pip install torch transformers
    
    # Run: the main command
    run: |
      python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
    ```
    
    For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see [YAML Specification](references/yaml-spec.md).
    
    ## GPU and Cloud Selection
    
    **IMPORTANT: Let SkyPilot choose the cloud and region.** Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.
    
    **Default behavior (recommended):** Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:
    
    ```yaml
    resources:
      accelerators: H200:8  # SkyPilot picks the cheapest cloud/region with H200:8
    ```
    
    If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them — present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:
    
    ```yaml
    # Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
    resources:
      any_of:
        - accelerators: H100:8
        - accelerators: A100-80GB:8
        - accelerators: A100:8
    ```
    
    Use `ordered` only when the user has a strict preference:
    
    ```yaml
    # Try H100 first on AWS, fall back to GCP, then A100
    resources:
      ordered:
        - infra: aws/us-east-1
          accelerators: H100:8
        - infra: gcp/us-central1
          accelerators: H100:8
        - infra: aws/us-west-2
          accelerators: A100-80GB:8
    ```
    
    Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":
    
    ```yaml
    resources:
      infra: aws             # User asked for AWS specifically
      accelerators: H100:8
    ```
    
    ## Cluster Lifecycle
    
    ```bash
    # Launch and run a task
    sky launch -c mycluster task.yaml
    
    # Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
    sky launch -c mycluster task.yaml -i 30        # stop after 30 min idle
    sky launch -c mycluster task.yaml -i 30 --down # tear down after 30 min idle
    
    # Override or pass environment variables via CLI
    sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64
    
    # Re-run a different task on the same cluster (fast, skips provisioning)
    sky exec mycluster another_task.yaml
    
    # Run an inline command
    sky exec mycluster -- python train.py --epochs 10
    
    # Set autostop after launch (use if you forgot to set -i at launch time)
    sky autostop mycluster -i 30        # stop after 30 min idle, preserving disk (can restart with sky start)
    sky autostop mycluster -i 30 --down # tear down after 30 min idle (disk is deleted, cannot restart)
    
    # Stop to save costs, restart later
    sky stop mycluster
    sky start mycluster
    
    # Tear down completely
    sky down mycluster
    ```
    
    ## Workdir Sync Behavior
    
    `workdir:` is synced to `~/sky_workdir` on the remote via `rsync` before every `sky exec`. **rsync is additive — deleted local files are NOT removed from the remote.** This can cause experiments to run against stale build artifacts or old configs.
    
    To ensure a clean slate, SSH and wipe before `sky exec`:
    ```bash
    ssh mycluster "rm -rf ~/sky_workdir"
    sky exec mycluster task.yaml
    ```
    
    Or clean inside `run:` if only specific artifacts need removal:
    ```yaml
    run: |
      find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
      cd ~/sky_workdir && make
    ```
    
    ## Managed Jobs
    
    Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle — provisioning, execution, recovery from preemptions/quota/failures, and teardown:
    
    ```yaml
    # managed-job.yaml
    name: training-job
    
    resources:
      accelerators: A100:8
    
    run: |
      python train.py --resume-from-checkpoint
    ```
    
    ```bash
    # Launch as managed job
    sky jobs launch managed-job.yaml
    
    # Check status
    sky jobs queue -o json
    
    # Stream logs
    sky jobs logs <job_id>
    
    # Cancel
    sky jobs cancel <job_id>
    ```
    
    **Checkpoint pattern**: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
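
    The resume side of this pattern can be sketched as follows — a minimal Python example, where the checkpoint directory and `ckpt-N.pt` naming are illustrative assumptions, not a SkyPilot convention:

    ```python
    # On (re)start, look for the newest checkpoint in a persistent directory
    # (e.g., a mounted cloud bucket) and resume from it if one exists.
    from pathlib import Path

    def latest_checkpoint(ckpt_dir: str) -> Path | None:
        """Return the highest-numbered checkpoint, or None on a fresh start."""
        ckpts = sorted(
            Path(ckpt_dir).glob("ckpt-*.pt"),
            key=lambda p: int(p.stem.split("-")[1]),
        )
        return ckpts[-1] if ckpts else None

    # In the training script, resume logic might look like:
    #   ckpt = latest_checkpoint("/mnt/checkpoints")
    #   start_step = restore(ckpt) if ckpt else 0
    ```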
    
    ## SkyServe: Model Serving
    
    ```yaml
    # serve.yaml
    resources:
      accelerators: A100:1
      ports: 8080
    
    run: |
      python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --port 8080
    
    service:
      readiness_probe: /v1/models
      replica_policy:
        min_replicas: 1
        max_replicas: 3
        target_qps_per_replica: 5
    ```
    
    ```bash
    # Start service
    sky serve up serve.yaml -n my-llm
    
    # Check status / get endpoint
    sky serve status my-llm
    sky serve status my-llm --endpoint
    
    # Update (rolling)
    sky serve update my-llm new-serve.yaml
    
    # Tear down
    sky serve down my-llm
    ```
    
    ## Common Workflows
    
    ### Fine-Tuning Workflow
    1. Write task YAML with `setup` (install deps) and `run` (training command)
    2. Use `file_mounts` or `workdir` to sync code
    3. `sky launch -c train task.yaml` to launch
    4. `sky logs train` to monitor
    5. `sky exec train -- python eval.py` to evaluate on same cluster
    6. `sky down train` when done
    
    ### Hyperparameter Sweep
    1. Create parameterized YAML with `envs`
    2. Launch multiple managed jobs:
       ```bash
       for lr in 1e-4 1e-5 1e-6; do
         sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
       done
       ```
    3. Monitor with `sky jobs queue -o json`
    
    ### Model Serving Deployment
    1. Write serve YAML with `service:` section
    2. `sky serve up serve.yaml -n my-service`
    3. Get endpoint: `sky serve status my-service --endpoint`
    4. Update model: `sky serve update my-service updated.yaml`
    
    ### Parallel Experiment Submission
    
    Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:
    
    ```bash
    # Submit all experiments (detached, returns after job is queued)
    for i in 1 2 3 4; do
      sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
    done
    
    # Get the latest job ID from a cluster
    job_id=$(sky queue exp-vm-01 -o json \
      | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")
    
    # Wait for a specific job and fetch last 50 lines
    sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50
    
    # Check all jobs across a cluster at once
    sky queue exp-vm-01 -o json
    ```
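
    The job-ID extraction from the one-liner above can also live in a reusable function. A sketch assuming the same `sky queue -o json` shape used there (a mapping from cluster name to a list of job records with a numeric `job_id`):

    ```python
    # Parse `sky queue CLUSTER -o json` output and return the latest job ID.
    import json

    def latest_job_id(queue_json: str, cluster: str) -> int | None:
        """Return the highest job_id on a cluster, or None if it has no jobs."""
        jobs = json.loads(queue_json).get(cluster, [])
        return max(j["job_id"] for j in jobs) if jobs else None
    ```

    Typical use: capture `sky queue exp-vm-01 -o json` output, then call `latest_job_id(output, "exp-vm-01")`.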
    
    ## Agent Feedback Loop
    
    When using SkyPilot programmatically, follow this loop:
    
    1. **Validate**: `sky launch --dryrun task.yaml` (check resource availability/cost)
    2. **Launch**: `sky launch -c mycluster task.yaml`
    3. **Monitor**: `sky status -o json` and `sky queue mycluster -o json`
    4. **Wait for completion**: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
    5. **Inspect output**: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
    6. **Debug**: `ssh mycluster` (interactive)
    7. **Iterate**: `sky exec mycluster updated_task.yaml` (run on existing cluster)
    8. **Cleanup**: `sky down mycluster`
    
    > **Never poll with `sleep` + `sky queue`** — use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.
    
    ## Common Agent Mistakes
    
    | Mistake | Why it's wrong | Do this instead |
    |---------|---------------|-----------------|
    | Manually picking cloud/region from `sky gpus list` output | SkyPilot optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose |
    | Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` for unattended work |
    | Forgetting `sky down` or autostop after work is done | Wastes money on idle clusters | Always clean up, or use `-i <minutes> --down` at launch |
    | Hardcoding `infra: aws` without user asking | Limits availability and increases cost | Only set `infra:` when user explicitly requests a cloud |
    | Not using `envs:` for configurable values | Hard to reuse or override from CLI | Use `envs:` in YAML + `--env KEY=VAL` for parameterization |
    | Running `sky launch` without `-c <name>` | Creates randomly-named cluster, hard to reference | Always name clusters with `-c` |
    | Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use `-o json` for structured output |
    | Using deprecated `cloud:`/`region:`/`zone:` fields | Deprecated in favor of `infra:` | Use `infra: aws/us-east-1` instead |
    | Polling job status with `sleep` + `sky queue` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs CLUSTER JOB_ID --status` to block until done |
    | Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` calls | SSH and manually clean `~/sky_workdir`, or clean in `run:` script |
    | Not using `--tail` when only last output matters | Streaming full logs wastes tokens for long jobs | Use `sky logs CLUSTER JOB_ID --tail 50` for last N lines |
    
    ## Common Issues Quick Reference
    
    | Issue | Solution |
    |-------|----------|
    | GPU not available | Use `any_of` for fallback, or try different regions/clouds |
    | Setup takes too long | SkyPilot caches setup; use `sky exec` to skip it on reruns |
    | Task fails silently | Check `sky logs <cluster>` or `ssh <cluster>` to debug |
    | Cluster stuck in INIT | `sky down <cluster>` and relaunch |
    | Preemption/quota | Use `sky jobs launch` for automatic recovery and lifecycle management |
    | Port not accessible | Ensure `ports:` is set in resources and security groups allow traffic |
    | File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets |
    | Credentials error | Run `sky check -o json` and inspect which clouds are disabled |
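
    For the "file sync slow" case, a bucket mount in the task YAML might look like the following sketch (the bucket name and mount path are illustrative):

    ```yaml
    file_mounts:
      # Mount a cloud bucket instead of rsyncing a large dataset via workdir.
      /data:
        source: s3://my-dataset-bucket   # or gs://..., r2://...
        mode: MOUNT
    ```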
    
    ## References
    
    For detailed reference documentation:
    
    - [CLI Reference](references/cli-reference.md) — All commands and flags
    - [YAML Specification](references/yaml-spec.md) — Complete task YAML schema, file mounts, environment variables
    - [Python SDK](references/python-sdk.md) — Programmatic API and SDK usage
    - [Advanced Patterns](references/advanced-patterns.md) — Multi-cloud, distributed training, production patterns
    - [Troubleshooting](references/troubleshooting.md) — Error diagnosis and solutions
    - [Examples](references/examples.md) — Copy-paste task YAML examples
    
  • .claude-plugin/marketplace.json (marketplace, 756 bytes)
    {
      "name": "skypilot",
      "owner": {
        "name": "SkyPilot Team"
      },
      "metadata": {
        "description": "Official SkyPilot skills for AI coding agents",
        "version": "1.0.0"
      },
      "plugins": [
        {
          "name": "skypilot",
          "source": "./agent",
          "description": "SkyPilot agent skill for launching cloud VMs, Kubernetes pods, and Slurm jobs across 25+ clouds",
          "version": "1.0.0",
          "author": {
            "name": "SkyPilot Team"
          },
          "homepage": "https://docs.skypilot.co/",
          "repository": "https://github.com/skypilot-org/skypilot",
          "license": "Apache-2.0",
          "keywords": ["AI Infrastructure", "Multi-Cloud", "GPU", "Kubernetes", "Slurm", "Training", "Serving", "Cost Optimization", "SkyPilot"]
        }
      ]
    }
    

README

SkyPilot

Manage all your AI compute

SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.

SkyPilot gives AI teams a simple interface to run jobs on any infra. Infra teams get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.

SkyPilot Abstractions

:fire: News :fire:

  • [Apr 2026] Introducing GPU Compass: One dashboard to browse, compare pricing, and launch across every GPU cloud. Try it at gpus.skypilot.co.
  • [Apr 2026] Research-Driven Agents: Agents read arxiv papers before coding, landed 5 llama.cpp kernel fusions and +15% faster flash attention in ~3 hours for ~$29: blog, HackerNews
  • [Mar 2026] Scaling Karpathy's Autoresearch: Autoresearch runs 1 experiment at a time. We gave it 16 GPUs and let it run in parallel: blog, HackerNews
  • [Mar 2026] How H Company Unlocked Online RL and Unified their AI Platform: case study
  • [Mar 2026] SkyPilot v0.12 released: Slurm Support, Job Groups for RL, Agent Skill, Recipes, Pool Autoscaling for Batch Inference, 7x Data Mounting, and More: Release notes
  • [Mar 2026] SkyPilot Agent Skills: GPU access and job management for AI agents: docs
  • [Jan 2026] Shopify case study: Shopify runs all AI training workloads on SkyPilot: case study
  • [Dec 2025] SkyPilot v0.11 released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability. Release notes
  • [Dec 2025] Train an agent to use Google Search as a tool with RL on your Kubernetes or clouds: blog, example

Overview

SkyPilot is easy to use for AI users:

  • Quickly spin up compute on your own infra
  • Environment and job as code — simple and portable
  • Easy job management: queue, run, and auto-recover many jobs

SkyPilot makes Kubernetes easy for AI & Infra teams:

  • Slurm-like ease of use, cloud-native robustness
  • Local dev experience on K8s: SSH into pods, sync code, or connect IDE
  • Turbocharge your clusters: gang scheduling, multi-cluster, and scaling

SkyPilot unifies multiple clusters, clouds, and hardware:

  • One interface to use reserved GPUs, Kubernetes clusters, Slurm clusters, or 25+ clouds
  • Flexible provisioning of GPUs, TPUs, CPUs, with smart failover
  • Team deployment and resource sharing

SkyPilot maximizes GPU fleet utilization:

  • Autostop: automatic cleanup of idle resources
  • Binpacking: workload binpacking on shared clusters
  • Intelligent scheduler: automatically schedule on the most available infra

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Install with uv (also supported: pip, nightly, from source):

```bash
# Choose your clouds:
uv pip install "skypilot[kubernetes,aws,gcp,azure,oci,nebius,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,seeweb,shadeform,verda]"
```

To use SkyPilot directly with your agent (Claude Code, Codex, etc.), install the SkyPilot Skill. Tell your agent:

> Fetch and follow https://github.com/skypilot-org/skypilot/blob/HEAD/agent/INSTALL.md to install the skypilot skill

Currently supported infra: Kubernetes, Slurm, AWS, GCP, Azure, OCI, CoreWeave, Nebius, Lambda Cloud, RunPod, Fluidstack, Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Seeweb, Prime Intellect, Shadeform, Verda Cloud, VastData, Crusoe.

Getting started

Install SkyPilot in 1 minute. Then, launch your first cluster in 2 minutes in Quickstart.

SkyPilot is BYOC: Everything is launched within your cloud accounts, VPCs, and clusters.

Benefits of SkyPilot on Kubernetes

SkyPilot makes Kubernetes AI-native.

It turbocharges your existing Kubernetes clusters by accelerating AI/ML velocity:

  • AI-friendly interface to launch jobs and deployments
  • Much simplified interactive dev for K8s (SSH / sync code / connect IDE to pods)

...and optimizing GPU scheduling, utilization, and scaling:

  • Advanced scheduling: Gang scheduling, multi-node jobs, and queueing
  • Multi-cluster support: Bring all your clusters under one control plane
  • Multi-cloud support: One consistent interface to manage many providers

See SkyPilot vs Vanilla Kubernetes and this blog post for more details.

SkyPilot in 1 minute

A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.

Once written in this unified interface (YAML or Python API), the task can be launched on any available infra (Kubernetes, Slurm, cloud, etc.). This avoids vendor lock-in, and allows easily moving jobs to a different provider.

Paste the following into a file `my_task.yaml`:

```yaml
resources:
  accelerators: A100:8  # 8x NVIDIA A100 GPU

num_nodes: 1  # Number of VMs to launch

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: ~/torch_examples

# Commands to be run before executing the job.
# Typical use: pip install -r requirements.txt, git clone, etc.
setup: |
  cd mnist
  pip install -r requirements.txt

# Commands to run as a job.
# Typical use: launch the main program.
run: |
  cd mnist
  python main.py --epochs 1
```

Prepare the workdir by cloning:

```bash
git clone https://github.com/pytorch/examples.git ~/torch_examples
```

Launch with `sky launch` (note: access to GPU instances is needed for this example):

```bash
sky launch my_task.yaml
```

SkyPilot then performs the heavy-lifting for you, including:

  1. Find the cheapest & available infra across your clusters or clouds
  2. Provision the GPUs (pods or VMs), with auto-failover if the infra returned capacity errors
  3. Sync your local workdir to the provisioned cluster
  4. Auto-install dependencies by running the task's setup commands
  5. Run the task's run commands, and stream logs

See Quickstart to get started with SkyPilot.

Runnable examples

See SkyPilot examples that cover: development, training, serving, LLM models, AI apps, and common frameworks.

Latest featured examples:

| Task | Examples |
|------|----------|
| Training | Verl, Finetune Llama 4, TorchTitan, PyTorch, DeepSpeed, NeMo, Ray, Unsloth, Jax/TPU, OpenRLHF |
| Serving | vLLM, SGLang, Ollama |
| Models | DeepSeek-R1, Llama 4, Llama 3, CodeLlama, Qwen, Kimi-K2, Kimi-K2-Thinking, Mixtral |
| AI apps | RAG, vector databases (ChromaDB, CLIP) |
| Common frameworks | Airflow, Jupyter, marimo |

Source files can be found in llm/ and examples/.

Learn more

To learn more, see SkyPilot Overview, SkyPilot docs, and SkyPilot blog.

SkyPilot adopters: Testimonials and Case Studies

Partners and integrations: Community Spotlights

Questions and feedback

We are excited to hear your feedback:

For general discussions, join us on the SkyPilot Slack.

Contributing

We welcome all contributions to the project! See CONTRIBUTING for how to get involved.