SkyPilot: Run AI/ML Anywhere, Cheapest

Intercloud broker that automatically selects the cheapest cloud for your GPU workload, handles spot instance failover, and accounts for data gravity. Covers Q48–Q50.

GPU prices vary dramatically between cloud providers, regions, and time of day. An A100 instance on GCP might be 30% cheaper than on AWS at this moment, but that could flip tomorrow. SkyPilot solves this by acting as an intercloud broker: you describe what you need, and it finds the cheapest option transparently and automatically.

💡 SkyPilot's core value proposition

Zero code change. Your ML training code stays unchanged; you only change the infrastructure config (a YAML file). SkyPilot handles provisioning, data transfer, spot failover, and cleanup across any supported cloud. This is the “write once, run anywhere (cheapest)” promise for AI workloads.
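To make the "infrastructure config" concrete, here is a minimal sketch of what such a task description might contain. The field names (`resources`, `accelerators`, `use_spot`, `setup`, `run`) are written from memory of SkyPilot's task YAML and should be checked against the official documentation; the task name, the commands, and the `yaml_lines` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical task description mirroring the fields of a SkyPilot task YAML.
# Field names are recalled from SkyPilot's docs; verify them before use.
task = {
    "name": "train-demo",                      # hypothetical job name
    "resources": {
        "accelerators": "A100:4",              # 4x A100 GPUs
        "use_spot": True,                      # spot instances are acceptable
    },
    "setup": "pip install -r requirements.txt",  # hypothetical setup command
    "run": "python train.py --epochs 100",       # your unchanged training code
}


def yaml_lines(node, indent=0):
    """Render nested dicts of scalars as block-style YAML lines."""
    pad = " " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(yaml_lines(value, indent + 2))
        elif isinstance(value, bool):
            # YAML booleans are lowercase
            lines.append(f"{pad}{key}: {str(value).lower()}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


task_yaml = "\n".join(yaml_lines(task))
print(task_yaml)
```

Note that nothing in this description names a cloud provider or region: the training command is the same wherever the job lands, which is the point of the "zero code change" claim.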

SkyPilot Intercloud Broker Architecture

SkyPilot sits between your application and the cloud providers, handling selection, provisioning, and execution. Built on Ray for distributed task execution.

[Diagram: your application submits a job ↓ SkyPilot selects cheapest → provisions → executes on the chosen cloud]

As an intercloud broker, SkyPilot abstracts away cloud-specific APIs and pricing. You describe what you need (e.g., “4× A100 GPUs, spot okay, budget $50”), and it finds the cheapest available option across all supported clouds.

Interactive Cost Optimizer

Configure your GPU type, count, and training duration. SkyPilot automatically selects the cheapest cloud option.

Interactive Cost Estimator (example: 8-hour job)

| Cloud | Hourly rate | Total (8 h) | Selected |
|---|---|---|---|
| AWS | $39.32/hr | $315 | |
| GCP | $32.64/hr | $261 | SkyPilot picks this ✓ |
| Azure | $36.24/hr | $290 | |

SkyPilot selects: GCP, 4× A100 (spot)
Total cost: $261 over 8 hours. Savings vs. most expensive: 17%
Prices are illustrative (2024 estimates). Actual prices vary by region and time.
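The selection above comes down to computing total cost per option and taking the minimum. The sketch below reproduces that arithmetic with the illustrative hourly rates from the estimator; the `Offer` class and the function names are hypothetical and are not part of SkyPilot, whose real optimizer considers many more factors (regions, availability, instance types).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One candidate placement (hypothetical type, for illustration only)."""
    cloud: str
    hourly_usd: float  # illustrative hourly price for the whole workload


def total_cost(offer: Offer, hours: float) -> float:
    """Total compute cost of running on this offer for the given duration."""
    return offer.hourly_usd * hours


def pick_cheapest(offers, hours):
    """Return the offer with the lowest total cost."""
    return min(offers, key=lambda o: total_cost(o, hours))


offers = [Offer("AWS", 39.32), Offer("GCP", 32.64), Offer("Azure", 36.24)]
hours = 8

best = pick_cheapest(offers, hours)
worst = max(offers, key=lambda o: total_cost(o, hours))
savings = 1 - total_cost(best, hours) / total_cost(worst, hours)

print(f"Cheapest: {best.cloud} at ${total_cost(best, hours):.2f}")
print(f"Savings vs {worst.cloud}: {savings:.0%}")
```

With these rates the sketch selects GCP at about $261 for 8 hours, roughly 17% below the most expensive option, matching the estimator's figures.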

The Data Gravity Problem

Moving large training datasets between clouds can cost more than the compute savings. SkyPilot accounts for egress fees when comparing cloud options.

Training data: 10 TB in S3 (us-east-1), ~$0.023/GB/month storage

Where to train?

  • AWS us-east-1: data transfer $0. ✓ Best if spot capacity is available
  • GCP / Azure: egress 10 TB × $0.08/GB = $800. ✗ Data transfer kills the savings

Data gravity in practice:

  • Large datasets “attract” computation — moving data is expensive
  • Inter-region egress costs can exceed compute savings from a cheaper cloud
  • SkyPilot accounts for data gravity by adding the egress cost to every cross-cloud or cross-region option, so such an option wins only when its compute savings exceed the transfer cost
  • Solution: keep data and compute co-located, or use object stores with multi-cloud replication
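The trade-off in these bullets can be sketched as a cost comparison that includes egress. The example below combines the 10 TB dataset and the $0.08/GB egress rate from this section with the illustrative AWS and GCP hourly rates from the cost estimator; pairing those numbers is an assumption made for this illustration, and the function names are hypothetical rather than SkyPilot APIs.

```python
def egress_cost(dataset_gb: float, usd_per_gb: float) -> float:
    """One-time cost of moving the dataset out of its home cloud."""
    return dataset_gb * usd_per_gb


def effective_cost(hourly_usd, hours, dataset_gb=0.0, egress_usd_per_gb=0.0):
    """Compute cost plus any data transfer needed to run there."""
    return hourly_usd * hours + egress_cost(dataset_gb, egress_usd_per_gb)


DATASET_GB = 10_000        # 10 TB, using 1 TB = 1000 GB
EGRESS_USD_PER_GB = 0.08   # illustrative egress rate from this section
AWS_HOURLY = 39.32         # illustrative rates reused from the cost estimator
GCP_HOURLY = 32.64
HOURS = 8

options = {
    "AWS": effective_cost(AWS_HOURLY, HOURS),  # data is already local
    "GCP": effective_cost(GCP_HOURLY, HOURS, DATASET_GB, EGRESS_USD_PER_GB),
}
best = min(options, key=options.get)

# Hours of training needed before the cheaper hourly rate repays the egress fee.
break_even_hours = egress_cost(DATASET_GB, EGRESS_USD_PER_GB) / (AWS_HOURLY - GCP_HOURLY)

for cloud, cost in options.items():
    print(f"{cloud}: ${cost:,.2f}")
print(f"Best: {best}; break-even after about {break_even_hours:.0f} hours")
```

Under these assumptions the 8-hour job stays on AWS: the $800 transfer fee dwarfs the roughly $53 of compute savings, and moving the data would only pay off for a job running on the order of 120 hours.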

Spot Instance Failover

Spot instances can be up to 90% cheaper than on-demand, but can be preempted with little warning. SkyPilot handles failover automatically — the user only needs to implement checkpointing.

1. Job starts on spot instance. SkyPilot provisions the cheapest spot GPU. The job starts training and saves periodic checkpoints to S3/GCS.
2. Cloud preempts the instance. The cloud provider reclaims the spot instance (2-minute warning on AWS). Training stops. Last checkpoint: epoch 42.
3. SkyPilot detects preemption. SkyPilot monitors instance health. On preemption, it queries the optimizer for the next best option (same or different cloud).
4. Provision fallback instance. SkyPilot provisions the next cheapest option (e.g., GCP spot). This may be the same GPU type on a different cloud.
5. Resume from checkpoint. Training resumes from the epoch 42 checkpoint. Total extra time: ~5 minutes for re-provisioning plus checkpoint load.
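The user's side of this contract is checkpointing: the training loop must save progress and resume from it on start. The sketch below simulates the five steps locally, with a JSON file standing in for S3/GCS and an exception standing in for a preemption. Every name here (`Preempted`, `train`, `run_with_failover`, the candidate labels) is hypothetical; it illustrates the resume pattern, not SkyPilot's actual recovery mechanism.

```python
import json
import tempfile
from pathlib import Path


class Preempted(Exception):
    """Raised by the simulation when the spot instance is reclaimed."""


def load_epoch(ckpt: Path) -> int:
    """Return the last completed epoch, or 0 when no checkpoint exists."""
    if ckpt.exists():
        return json.loads(ckpt.read_text())["epoch"]
    return 0


def train(ckpt: Path, total_epochs: int, preempt_after=None) -> None:
    """Train from the last checkpoint; optionally simulate a preemption."""
    start = load_epoch(ckpt)
    for epoch in range(start + 1, total_epochs + 1):
        # ... one epoch of real training would run here ...
        ckpt.write_text(json.dumps({"epoch": epoch}))  # stand-in for S3/GCS
        if preempt_after is not None and epoch == preempt_after:
            raise Preempted(f"instance reclaimed after epoch {epoch}")


def run_with_failover(ckpt: Path, total_epochs: int, candidates):
    """Try candidates in price order, resuming from the checkpoint each time."""
    history = []  # (candidate, resumed_from_epoch, reached_epoch)
    for name, preempt_after in candidates:
        resumed_from = load_epoch(ckpt)
        try:
            train(ckpt, total_epochs, preempt_after)
        except Preempted:
            history.append((name, resumed_from, load_epoch(ckpt)))
            continue  # fall through to the next cheapest candidate
        history.append((name, resumed_from, load_epoch(ckpt)))
        return history
    raise RuntimeError("all candidates were preempted")


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    # The first candidate is preempted after epoch 42; the second finishes.
    history = run_with_failover(
        checkpoint, 50, [("aws-spot", 42), ("gcp-spot", None)]
    )

for name, resumed_from, reached in history:
    print(f"{name}: resumed from epoch {resumed_from}, reached epoch {reached}")
```

Because the checkpoint is written after every completed epoch, the fallback instance repeats no finished work: it starts at epoch 43. With less frequent checkpoints, the work since the last save would be lost on each preemption.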

Manual vs SkyPilot Cloud Management

| Task | Manual (you) | With SkyPilot |
|---|---|---|
| Find cheapest GPU | Check 3 cloud pricing pages, do math | Automatic (queries Service Catalog) |
| Provision instance | Cloud-specific CLI / console | sky launch (cloud-agnostic) |
| Handle spot preemption | Write custom monitoring + retry logic | Automatic failover to next best option |
| Data gravity | Manual egress cost calculation | Factored into cost comparison |
| Multi-cloud portability | Rewrite config per cloud | Single YAML, any cloud |
| Cost tracking | Cloud billing dashboard (1 day lag) | sky cost-report (real-time) |

SkyPilot Exam Questions (Q48–Q50)