USP
Unlike single-cloud or single-cluster solutions, SkyPilot offers a single interface to manage AI compute across 25+ clouds, Kubernetes, and Slurm, maximizing GPU utilization and enabling intelligent failover for cost savings and availabili…
Use cases
- 01Managing compute resources on any cloud, Slurm, or Kubernetes cluster
- 02Launching CPU/GPU/TPU workloads (e.g., H100) on diverse infrastructure
- 03Running training, fine-tuning, or batch inference jobs
- 04Deploying inference servers with autoscaling and multi-cloud replicas
- 05Finding the cheapest or most available GPU across multiple cloud providers
Detected files (2)
agent/skills/skypilot/SKILL.mdskillShow content (18500 bytes)
--- name: skypilot description: "Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads, training or fine-tuning models on cloud GPUs, deploying inference servers (vllm, TGI, etc.) with autoscaling, writing or debugging SkyPilot task YAML files, using spot/preemptible instances for cost savings, comparing GPU prices across clouds, managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them, troubleshooting resource availability or SkyPilot errors, or optimizing cost and GPU availability." --- # SkyPilot Skill SkyPilot is a unified framework to run AI workloads on any cloud, Slurm or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, Coreweave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters. ## When to Use SkyPilot **Use SkyPilot when you need to:** - Manage compute resources on any cloud, Slurm, or Kubernetes cluster - Launch CPU/GPU/TPU (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes or Slurm - Run training, fine-tuning, or batch inference jobs - Serve models with autoscaling and multi-cloud replicas (SkyServe) - Run long-running jobs with automatic lifecycle management and recovery (managed jobs) - Find the cheapest or most available GPU across clouds **Don't use SkyPilot for:** - Local-only workloads (use Docker/conda directly) ## Capabilities: When to Use What SkyPilot has three core abstractions. Use the right one for each stage of your workflow: **1. SkyPilot Clusters** (`sky launch` / `sky exec`) — Interactive development and debugging - Use during initial development, debugging, and experimentation - Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), iterate quickly - Cluster stays up until you stop/down it or autostop triggers - Best for: prototyping, debugging, short experiments **2. Managed Jobs** (`sky jobs launch`) — Long-running training and batch jobs - Use when submitting long-running jobs that should run unattended - Manages the full lifecycle: provisioning, execution, recovery, and teardown - Automatically recovers from spot preemptions, quota limits, and transient failures - Works across clouds, Kubernetes, and Slurm (handles preemptions and quota) - Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference **3. SkyServe** (`sky serve up`) — Production model serving - Use when serving models at scale with autoscaling - Start with `sky launch` + open port to test your serving setup, then use `sky serve up` to scale - Provides load balancing, autoscaling, and multi-cloud replicas - Best for: model serving endpoints, API services ## Before You Start (Agent Bootstrap) Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task. **Step 1: Check installation and API server connectivity** ```bash sky api info ``` | Output contains | Meaning | Next action | |-----------------|---------|-------------| | Server version and status | Server is running and connected | **Bootstrap done.** Skip to user's task. | | `No SkyPilot API server is connected` | No server connected | Go to "Start or connect a server" below. | | `Could not connect to SkyPilot API server` | Remote server unreachable or auth expired | Tell the user and suggest `sky api login --relogin -e <endpoint>` to reconnect. | | `command not found: sky` | SkyPilot not installed | Go to "Install SkyPilot" below. | **Install SkyPilot** (only if `sky` command not found): ```bash pip install "skypilot[aws,gcp,kubernetes]" # Pick clouds the user needs ``` Ask the user which clouds they need if unclear, then re-run `sky api info`. **Start or connect a server** (only if "not running"): Ask the user: > Do you have an existing SkyPilot API server to connect to, or should I start one locally? - **Connect to existing server:** `sky api login -e <API_SERVER_URL>` — get the URL from the user. - **Start locally:** `sky api start` After either path, re-run `sky api info` to confirm the server is reachable. **Step 2: Check cloud credentials** (only for fresh setups — skip if the server was already running) ```bash sky check -o json ``` This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see [Troubleshooting](references/troubleshooting.md#1-installation-and-credentials)). ## Essential Commands Use `-o json` with status/query commands to get structured JSON output instead of tables. **Clusters** — interactive development and debugging: | Command | Description | |---------|-------------| | `sky launch -c NAME task.yaml` | Launch a cluster or run a task | | `sky exec NAME task.yaml` | Run task on existing cluster (skips provisioning); syncs workdir each time | | `sky exec NAME task.yaml -d` | Same, but detach immediately (don't stream logs) | | `sky status -o json` | Show all clusters | | `sky logs NAME` | Stream job logs from a cluster | | `sky logs NAME --no-follow` | Print existing logs and exit immediately | | `sky logs NAME --tail 50` | Print last 50 lines of logs and exit | | `sky logs NAME --status` | Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled | | `sky queue NAME -o json` | List jobs on a cluster with status (structured JSON) | | `sky stop NAME` / `sky start NAME` | Stop/restart to save costs (preserves disk) | | `sky down NAME` | Tear down a cluster completely | | `sky gpus list -o json` | List available GPU types across clouds | **Managed Jobs** — long-running unattended workloads: | Command | Description | |---------|-------------| | `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) | | `sky jobs queue -o json` | Show all managed jobs and their status | | `sky jobs logs JOB_ID` | Stream logs from a managed job | | `sky jobs cancel JOB_ID` | Cancel a managed job | **SkyServe** — model serving with autoscaling: | Command | Description | |---------|-------------| | `sky serve up serve.yaml -n NAME` | Start a model serving service | | `sky serve status NAME` | Show service status and endpoint URL | | `sky serve update NAME new.yaml` | Update a running service (rolling) | | `sky serve down NAME` | Tear down a service | For complete CLI reference, see [CLI Reference](references/cli-reference.md). ## Quick Start ```bash # Launch a GPU cluster sky launch -c mycluster --gpus H100 -- nvidia-smi # Run a task from YAML sky launch -c mycluster task.yaml # SSH into cluster ssh mycluster # Connect VSCode or Cursor to the cluster for interactive development code --remote ssh-remote+mycluster /home/user/sky_workdir # or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir # Tear down sky down mycluster ``` ## Task YAML Structure The task YAML is SkyPilot's primary interface. All fields are optional. ```yaml # task.yaml name: my-training-job # Local directory to sync to remote ~/sky_workdir workdir: . # Number of nodes (for distributed training) num_nodes: 1 resources: # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region) accelerators: H200:8 # Optional: pin to a specific cloud/region/infra # infra: aws # or aws/us-east-1, k8s, ssh/my-pool # If infra is left out, SkyPilot automatically fails over across all # enabled clouds/regions to find the cheapest available option. # Use spot instances for cost savings use_spot: false # Disk size in GB disk_size: 256 # Open ports for serving ports: 8080 # Environment variables (accessible in file_mounts, setup, and run) envs: MODEL_NAME: my-model BATCH_SIZE: 32 # Setup: runs once on cluster creation, cached on reuse setup: | pip install torch transformers # Run: the main command run: | python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE ``` For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see [YAML Specification](references/yaml-spec.md). ## GPU and Cloud Selection **IMPORTANT: Let SkyPilot choose the cloud and region.** Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region. **Default behavior (recommended):** Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically: ```yaml resources: accelerators: H200:8 # SkyPilot picks the cheapest cloud/region with H200:8 ``` If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them — present options and let the user decide, or use `any_of` to let SkyPilot maximize availability: ```yaml # Let SkyPilot choose from multiple acceptable GPU types (cheapest wins) resources: any_of: - accelerators: H100:8 - accelerators: A100-80GB:8 - accelerators: A100:8 ``` Use `ordered` only when the user has a strict preference: ```yaml # Try H100 first on AWS, fall back to GCP, then A100 resources: ordered: - infra: aws/us-east-1 accelerators: H100:8 - infra: gcp/us-central1 accelerators: H100:8 - infra: aws/us-west-2 accelerators: A100-80GB:8 ``` Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1": ```yaml resources: infra: aws # User asked for AWS specifically accelerators: H100:8 ``` ## Cluster Lifecycle ```bash # Launch and run a task sky launch -c mycluster task.yaml # Launch with autostop at launch time (preferred: saves cost, no follow-up command needed) sky launch -c mycluster task.yaml -i 30 # stop after 30 min idle sky launch -c mycluster task.yaml -i 30 --down # tear down after 30 min idle # Override or pass environment variables via CLI sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64 # Re-run a different task on the same cluster (fast, skips provisioning) sky exec mycluster another_task.yaml # Run an inline command sky exec mycluster -- python train.py --epochs 10 # Set autostop after launch (use if you forgot to set -i at launch time) sky autostop mycluster -i 30 # stop after 30 min idle, preserving disk (can restart with sky start) sky autostop mycluster -i 30 --down # tear down after 30 min idle (disk is deleted, cannot restart) # Stop to save costs, restart later sky stop mycluster sky start mycluster # Tear down completely sky down mycluster ``` ## Workdir Sync Behavior `workdir:` is synced to `~/sky_workdir` on the remote via `rsync` before every `sky exec`. **rsync is additive — deleted local files are NOT removed from the remote.** This can cause experiments to run against stale build artifacts or old configs. To ensure a clean slate, SSH and wipe before `sky exec`: ```bash ssh mycluster "rm -rf ~/sky_workdir" sky exec mycluster task.yaml ``` Or clean inside `run:` if only specific artifacts need removal: ```yaml run: | find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true cd ~/sky_workdir && make ``` ## Managed Jobs Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle — provisioning, execution, recovery from preemptions/quota/failures, and teardown: ```yaml # managed-job.yaml name: training-job resources: accelerators: A100:8 run: | python train.py --resume-from-checkpoint ``` ```bash # Launch as managed job sky jobs launch managed-job.yaml # Check status sky jobs queue -o json # Stream logs sky jobs logs <job_id> # Cancel sky jobs cancel <job_id> ``` **Checkpoint pattern**: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery. ## SkyServe: Model Serving ```yaml # serve.yaml resources: accelerators: A100:1 ports: 8080 run: | python -m vllm.entrypoints.openai.api_server \ --model meta-llama/Llama-3.1-8B-Instruct \ --port 8080 service: readiness_probe: /v1/models replica_policy: min_replicas: 1 max_replicas: 3 target_qps_per_replica: 5 ``` ```bash # Start service sky serve up serve.yaml -n my-llm # Check status / get endpoint sky serve status my-llm sky serve status my-llm --endpoint # Update (rolling) sky serve update my-llm new-serve.yaml # Tear down sky serve down my-llm ``` ## Common Workflows ### Fine-Tuning Workflow 1. Write task YAML with `setup` (install deps) and `run` (training command) 2. Use `file_mounts` or `workdir` to sync code 3. `sky launch -c train task.yaml` to launch 4. `sky logs train` to monitor 5. `sky exec train -- python eval.py` to evaluate on same cluster 6. `sky down train` when done ### Hyperparameter Sweep 1. Create parameterized YAML with `envs` 2. Launch multiple managed jobs: ```bash for lr in 1e-4 1e-5 1e-6; do sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr done ``` 3. Monitor with `sky jobs queue -o json` ### Model Serving Deployment 1. Write serve YAML with `service:` section 2. `sky serve up serve.yaml -n my-service` 3. Get endpoint: `sky serve status my-service --endpoint` 4. Update model: `sky serve update my-service updated.yaml` ### Parallel Experiment Submission Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results: ```bash # Submit all experiments (detached, returns after job is queued) for i in 1 2 3 4; do sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d done # Get the latest job ID from a cluster job_id=$(sky queue exp-vm-01 -o json \ | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')") # Wait for a specific job and fetch last 50 lines sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50 # Check all jobs across a cluster at once sky queue exp-vm-01 -o json ``` ## Agent Feedback Loop When using SkyPilot programmatically, follow this loop: 1. **Validate**: `sky launch --dryrun task.yaml` (check resource availability/cost) 2. **Launch**: `sky launch -c mycluster task.yaml` 3. **Monitor**: `sky status -o json` and `sky queue mycluster -o json` 4. **Wait for completion**: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success). 5. **Inspect output**: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100` 6. **Debug**: `ssh mycluster` (interactive) 7. **Iterate**: `sky exec mycluster updated_task.yaml` (run on existing cluster) 8. **Cleanup**: `sky down mycluster` > **Never poll with `sleep` + `sky queue`** — use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion. ## Common Agent Mistakes | Mistake | Why it's wrong | Do this instead | |---------|---------------|-----------------| | Manually picking cloud/region from `sky gpus list` output | SkyPilot optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose | | Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` for unattended work | | Forgetting `sky down` or autostop after work is done | Wastes money on idle clusters | Always clean up, or use `-i <minutes> --down` at launch | | Hardcoding `infra: aws` without user asking | Limits availability and increases cost | Only set `infra:` when user explicitly requests a cloud | | Not using `envs:` for configurable values | Hard to reuse or override from CLI | Use `envs:` in YAML + `--env KEY=VAL` for parameterization | | Running `sky launch` without `-c <name>` | Creates randomly-named cluster, hard to reference | Always name clusters with `-c` | | Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use `-o json` for structured output | | Using deprecated `cloud:`/`region:`/`zone:` fields | Deprecated in favor of `infra:` | Use `infra: aws/us-east-1` instead | | Polling job status with `sleep` + `sky queue` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs CLUSTER JOB_ID --status` to block until done | | Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` calls | SSH and manually clean `~/sky_workdir`, or clean in `run:` script | | Not using `--tail` when only last output matters | Streaming full logs wastes tokens for long jobs | Use `sky logs CLUSTER JOB_ID --tail 50` for last N lines | ## Common Issues Quick Reference | Issue | Solution | |-------|----------| | GPU not available | Use `any_of` for fallback, or try different regions/clouds | | Setup takes too long | SkyPilot caches setup; use `sky exec` to skip it on reruns | | Task fails silently | Check `sky logs <cluster>` or `ssh <cluster>` to debug | | Cluster stuck in INIT | `sky down <cluster>` and relaunch | | Preemption/quota | Use `sky jobs launch` for automatic recovery and lifecycle management | | Port not accessible | Ensure `ports:` is set in resources and security groups allow traffic | | File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets | | Credentials error | Run `sky check -o json` and inspect which clouds are disabled | ## References For detailed reference documentation: - [CLI Reference](references/cli-reference.md) — All commands and flags - [YAML Specification](references/yaml-spec.md) — Complete task YAML schema, file mounts, environment variables - [Python SDK](references/python-sdk.md) — Programmatic API and SDK usage - [Advanced Patterns](references/advanced-patterns.md) — Multi-cloud, distributed training, production patterns - [Troubleshooting](references/troubleshooting.md) — Error diagnosis and solutions - [Examples](references/examples.md) — Copy-paste task YAML examples.claude-plugin/marketplace.jsonmarketplaceShow content (756 bytes)
{ "name": "skypilot", "owner": { "name": "SkyPilot Team" }, "metadata": { "description": "Official SkyPilot skills for AI coding agents", "version": "1.0.0" }, "plugins": [ { "name": "skypilot", "source": "./agent", "description": "SkyPilot agent skill for launching cloud VMs, Kubernetes pods, and Slurm jobs across 25+ clouds", "version": "1.0.0", "author": { "name": "SkyPilot Team" }, "homepage": "https://docs.skypilot.co/", "repository": "https://github.com/skypilot-org/skypilot", "license": "Apache-2.0", "keywords": ["AI Infrastructure", "Multi-Cloud", "GPU", "Kubernetes", "Slurm", "Training", "Serving", "Cost Optimization", "SkyPilot"] } ] }
README
Manage all your AI compute
SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
SkyPilot gives AI teams a simple interface to run jobs on any infra. Infra teams get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
:fire: News :fire:
- [Apr 2026] Introducing GPU Compass: One dashboard to browse, compare pricing, and launch across every GPU cloud. Try it at gpus.skypilot.co.
- [Apr 2026] Research-Driven Agents: Agents read arxiv papers before coding, landed 5 llama.cpp kernel fusions and +15% faster flash attention in ~3 hours for ~$29: blog, HackerNews
- [Mar 2026] Scaling Karpathy's Autoresearch: Autoresearch runs 1 experiment at a time. We gave it 16 GPUs and let it run in parallel: blog, HackerNews
- [Mar 2026] How H Company Unlocked Online RL and Unified their AI Platform: case study
- [Mar 2026] SkyPilot v0.12 released: Slurm Support, Job Groups for RL, Agent Skill, Recipes, Pool Autoscaling for Batch Inference, 7x Data Mounting, and More: Release notes
- [Mar 2026] SkyPilot Agent Skills: GPU access and job management for AI agents: docs
- [Jan 2026] Shopify case study: Shopify runs all AI training workloads on SkyPilot: case study
- [Dec 2025] SkyPilot v0.11 released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability. Release notes
- [Dec 2025] Train an agent to use Google Search as a tool with RL on your Kubernetes or clouds: blog, example
Overview
SkyPilot is easy to use for AI users:
- Quickly spin up compute on your own infra
- Environment and job as code — simple and portable
- Easy job management: queue, run, and auto-recover many jobs
SkyPilot makes Kubernetes easy for AI & Infra teams:
- Slurm-like ease of use, cloud-native robustness
- Local dev experience on K8s: SSH into pods, sync code, or connect IDE
- Turbocharge your clusters: gang scheduling, multi-cluster, and scaling
SkyPilot unifies multiple clusters, clouds, and hardware:
- One interface to use reserved GPUs, Kubernetes clusters, Slurm clusters, or 20+ clouds
- Flexible provisioning of GPUs, TPUs, CPUs, with smart failover
- Team deployment and resource sharing
SkyPilot maximizes GPU fleet utilization:
- Autostop: automatic cleanup of idle resources
- Binpacking: workload binpacking on shared clusters
- Intelligent scheduler: automatically schedule on the most available infra
SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
Install with uv (also supported: pip, nightly, from source)
# Choose your clouds:
uv pip install "skypilot[kubernetes,aws,gcp,azure,oci,nebius,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,seeweb,shadeform,verda]"
To use SkyPilot directly with your agent (Claude Code, Codex, etc.), install the SkyPilot Skill. Tell your agent:
Fetch and follow https://github.com/skypilot-org/skypilot/blob/HEAD/agent/INSTALL.md to install the skypilot skill
Current supported infra: Kubernetes, Slurm, AWS, GCP, Azure, OCI, CoreWeave, Nebius, Lambda Cloud, RunPod, Fluidstack, Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Seeweb, Prime Intellect, Shadeform, Verda Cloud, VastData, Crusoe.
Getting started
Install SkyPilot in 1 minute. Then, launch your first cluster in 2 minutes in Quickstart.
SkyPilot is BYOC: Everything is launched within your cloud accounts, VPCs, and clusters.
Benefits of SkyPilot on Kubernetes
SkyPilot makes Kubernetes AI-native.
It turbocharges your existing Kubernetes clusters by accelerating AI/ML velocity:
- AI-friendly interface to launch jobs and deployments
- Much simplified interactive dev for K8s (SSH / sync code / connect IDE to pods)
...and optimizing GPU scheduling, utilization, and scaling:
- Advanced scheduling: Gang scheduling, multi-node jobs, and queueing
- Multi-cluster support: Bring all your clusters under one control plane
- Multi-cloud support: One consistent interface to manage many providers
See SkyPilot vs Vanilla Kubernetes and this blog post for more details.
SkyPilot in 1 minute
A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.
Once written in this unified interface (YAML or Python API), the task can be launched on any available infra (Kubernetes, Slurm, cloud, etc.). This avoids vendor lock-in, and allows easily moving jobs to a different provider.
Paste the following into a file my_task.yaml:
resources:
accelerators: A100:8 # 8x NVIDIA A100 GPU
num_nodes: 1 # Number of VMs to launch
# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: ~/torch_examples
# Commands to be run before executing the job.
# Typical use: pip install -r requirements.txt, git clone, etc.
setup: |
cd mnist
pip install -r requirements.txt
# Commands to run as a job.
# Typical use: launch the main program.
run: |
cd mnist
python main.py --epochs 1
Prepare the workdir by cloning:
git clone https://github.com/pytorch/examples.git ~/torch_examples
Launch with sky launch (note: access to GPU instances is needed for this example):
sky launch my_task.yaml
SkyPilot then performs the heavy-lifting for you, including:
- Find the cheapest & available infra across your clusters or clouds
- Provision the GPUs (pods or VMs), with auto-failover if the infra returned capacity errors
- Sync your local
workdirto the provisioned cluster - Auto-install dependencies by running the task's
setupcommands - Run the task's
runcommands, and stream logs
See Quickstart to get started with SkyPilot.
Runnable examples
See SkyPilot examples that cover: development, training, serving, LLM models, AI apps, and common frameworks.
Latest featured examples:
| Task | Examples |
|---|---|
| Training | Verl, Finetune Llama 4, TorchTitan, PyTorch, DeepSpeed, NeMo, Ray, Unsloth, Jax/TPU, OpenRLHF |
| Serving | vLLM, SGLang, Ollama |
| Models | DeepSeek-R1, Llama 4, Llama 3, CodeLlama, Qwen, Kimi-K2, Kimi-K2-Thinking, Mixtral |
| AI apps | RAG, vector databases (ChromaDB, CLIP) |
| Common frameworks | Airflow, Jupyter, marimo |
Source files can be found in llm/ and examples/.
Learn more
To learn more, see SkyPilot Overview, SkyPilot docs, and SkyPilot blog.
SkyPilot adopters: Testimonials and Case Studies
Partners and integrations: Community Spotlights
Follow updates:
Questions and feedback
We are excited to hear your feedback:
- For issues and feature requests, please open a GitHub issue.
- For questions, please use GitHub Discussions.
For general discussions, join us on the SkyPilot Slack.
Contributing
We welcome all contributions to the project! See CONTRIBUTING for how to get involved.