Distributed AI Systems: From Training to Deployment
How distributed systems power modern AI — from data-parallel training with AllReduce to intercloud cost optimization with SkyPilot. TensorFlow, Ray, and SkyPilot deep dives.
Training large neural networks (GPT-4: ~1T parameters, Gemini: ~1.5T) requires distributed systems at unprecedented scale. A single GPU with 80 GB VRAM can hold at most a few billion parameters — and training data is far too large to fit on any single machine. This section covers the systems built to solve these problems.
💡 Why distributing AI is hard
The Evolution of Distributed AI Training
Train small models on one GPU. Limited by VRAM (~24 GB). No distribution needed.
Same model on all GPUs; split the data. Gradient sync via AllReduce. Scales well to ~thousands of GPUs.
Split the model across GPUs (e.g., different layers). Needed when model > GPU memory. Sequential dependency limits throughput.
Combine model + data parallelism. Split model into stages across GPUs; pipeline mini-batches through stages. GPipe / Megatron-LM.
Distribute training across multiple cloud providers. SkyPilot selects cheapest available cloud, handles failover and spot instances.
AI Systems at a Glance
| System | Purpose | Built On | Key Innovation |
|---|---|---|---|
| TensorFlow | ML training + inference framework | C++ runtime + Python frontend | Dataflow graph execution, TPU support, SavedModel format |
| Ray | General distributed execution for AI/ML | Python + Plasma object store | Unified task + actor model; GCS for global state |
| SkyPilot | Intercloud AI workload broker | Ray (execution) + Python | Cost-optimal cloud selection, spot instance failover, zero code change |
Gradient Synchronization: Parameter Server vs AllReduce
The central challenge in data-parallel training: all workers must agree on the same model weights after each mini-batch. Two dominant approaches emerged.
Server
- Workers push gradients to PS, pull updated params
- PS is the bottleneck — easy to overload
- Supports asynchronous updates
- Used in: early TF DistributedStrategy
- Gradients passed around the ring (reduce + broadcast)
- No bottleneck — each GPU talks to 2 neighbors
- Bandwidth optimal: O(n) total communication
- Used in: NCCL, Horovod, TF MirroredStrategy