Distributed AI Systems: From Training to Deployment

How distributed systems power modern AI — from data-parallel training with AllReduce to intercloud cost optimization with SkyPilot. TensorFlow, Ray, and SkyPilot deep dives.

Training large neural networks (GPT-4: ~1T parameters, Gemini: ~1.5T) requires distributed systems at unprecedented scale. A single GPU with 80 GB VRAM can hold at most a few billion parameters — and training data is far too large to fit on any single machine. This section covers the systems built to solve these problems.

💡 Why distributing AI is hard

Unlike stateless data processing, deep learning training is stateful and iterative: gradients from each batch must be synchronized before the next forward pass, creating tight coupling between workers. The communication-to-computation ratio is the key performance bottleneck.

The Evolution of Distributed AI Training

Single GPU

Train small models on one GPU. Limited by VRAM (~24 GB). No distribution needed.

Data Parallelism

Same model on all GPUs; split the data. Gradient sync via AllReduce. Scales well to ~thousands of GPUs.

Model Parallelism

Split the model across GPUs (e.g., different layers). Needed when model > GPU memory. Sequential dependency limits throughput.

Pipeline Parallelism

Combine model + data parallelism. Split model into stages across GPUs; pipeline mini-batches through stages. GPipe / Megatron-LM.

Intercloud

Distribute training across multiple cloud providers. SkyPilot selects cheapest available cloud, handles failover and spot instances.

AI Systems at a Glance

TensorFlow vs Ray vs SkyPilot: purpose and innovations
System	Purpose	Built On	Key Innovation
TensorFlow	ML training + inference framework	C++ runtime + Python frontend	Dataflow graph execution, TPU support, SavedModel format
Ray	General distributed execution for AI/ML	Python + Plasma object store	Unified task + actor model; GCS for global state
SkyPilot	Intercloud AI workload broker	Ray (execution) + Python	Cost-optimal cloud selection, spot instance failover, zero code change

Gradient Synchronization: Parameter Server vs AllReduce

The central challenge in data-parallel training: all workers must agree on the same model weights after each mini-batch. Two dominant approaches emerged.

Parameter Server Architecture

Parameter
Server

↑↓

push gradients / pull params

Workers push gradients to PS, pull updated params
PS is the bottleneck — easy to overload
Supports asynchronous updates
Used in: early TF DistributedStrategy

Ring-AllReduce Architecture

Gradients passed around the ring (reduce + broadcast)
No bottleneck — each GPU talks to 2 neighbors
Bandwidth optimal: O(n) total communication
Used in: NCCL, Horovod, TF MirroredStrategy

Deep Dives

TensorFlow

Dataflow graphs, PS training, AllReduce, TPUs

Deep dive →

Ray

Tasks vs Actors, GCS, Plasma store, RL use case

Deep dive →

SkyPilot

Intercloud broker, cost optimizer, spot failover

Deep dive →