TensorFlow: Distributed Deep Learning
Google's open-source ML framework — from the dataflow graph model to multi-GPU and multi-worker distributed training strategies. Covers Q46–Q47.
TensorFlow was open-sourced by Google in 2015 and became the dominant deep learning framework for production workloads. Its core abstraction is the dataflow graph: a directed acyclic graph where nodes are operations and edges are tensors. This representation enables optimization (common subexpression elimination, constant folding) and distributed execution.
💡 Key design choice: separate definition from execution
TensorFlow Layered Architecture
TF is organized as a stack of layers, from high-level training libraries down to hardware kernels. Click each layer to explore its role.
Select a layer to learn about its role in the TensorFlow stack.
This layered design means TF can run on CPUs, GPUs, and TPUs without changing user code — the device placement and kernel selection happens automatically in the runtime layer.
Dataflow Graph: Nodes = Ops, Edges = Tensors
Every computation in TensorFlow is expressed as a dataflow graph. Operations (matmul, relu, add) are nodes; the tensors flowing between them are directed edges. The graph representation enables automatic differentiation (backprop) and distributed execution.
Nodes = operations (MatMul, Add, ReLU). Edges = tensors(multi-dimensional arrays flowing between ops). The graph is defined once in Python, then executed repeatedly by the C++ runtime.
Decorate with @tf.function to trace Python code into a static graph. Enables XLA compilation, constant folding, and parallel op scheduling. Critical for production performance.
TF traverses the dataflow graph in reverse to compute gradients via the chain rule. Each op registers its gradient function. tf.GradientTape records operations for gradient computation in eager mode.
Data-Parallel Training: Parameter Server
The classical approach to distributed training: dedicated parameter servershold model weights, workers compute gradients on their local batch, push gradients to PS, and pull updated weights.
params θ
params θ
mini-batch
mini-batch
mini-batch
AllReduce: Ring-Based Gradient Aggregation
Ring-AllReduce avoids the PS bottleneck by having all workers communicate directly in a ring topology. Used by NCCL (Nvidia), Horovod, and TF's MirroredStrategy.
How Ring-AllReduce works:
- Scatter-reduce: Each GPU sends a chunk of its gradient to the next GPU in the ring, receiving a partial sum. After N-1 rounds, each GPU holds the full sum for one chunk.
- AllGather: Each GPU broadcasts its complete chunk around the ring. After N-1 rounds, all GPUs have all chunks → complete summed gradient.
- Result: All GPUs hold the same aggregated gradient. No bottleneck — each GPU communicates with only 2 neighbors.
Distributed Training Code
1# TensorFlow distributed training
2import tensorflow as tf
3
4# MirroredStrategy: data parallel, all-reduce on GPUs
5strategy = tf.distribute.MirroredStrategy()
6
7with strategy.scope():
8 model = tf.keras.Sequential([
9 tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
10 tf.keras.layers.Dropout(0.2),
11 tf.keras.layers.Dense(128, activation='relu'),
12 tf.keras.layers.Dense(10, activation='softmax'),
13 ])
14 model.compile(
15 optimizer='adam',
16 loss='sparse_categorical_crossentropy',
17 metrics=['accuracy'],
18 )
19
20# Dataset must be created inside strategy.scope() or passed as a tf.data.Dataset
21dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)) \
22 .shuffle(10000) \
23 .batch(64 * strategy.num_replicas_in_sync) # scale batch size!
24
25# Automatically distributes across all available GPUs
26model.fit(dataset, epochs=10, validation_data=(x_test, y_test))1# Multi-worker distributed training (MultiWorkerMirroredStrategy)
2import os, json
3
4# TF_CONFIG tells each worker its role and cluster topology
5os.environ['TF_CONFIG'] = json.dumps({
6 'cluster': {
7 'worker': ['worker0:12345', 'worker1:12346', 'worker2:12347']
8 },
9 'task': {'type': 'worker', 'index': 0} # different on each machine
10})
11
12strategy = tf.distribute.MultiWorkerMirroredStrategy()
13
14with strategy.scope():
15 model = build_model()
16 model.compile(...)
17
18model.fit(dataset, epochs=10)
19# Each worker runs the same code; TF_CONFIG differentiates their rolesDistribution Strategies
| Strategy | Scope | Gradient Sync | Use Case |
|---|---|---|---|
| MirroredStrategy | Single machine, multiple GPUs | NCCL AllReduce (GPU-to-GPU) | Most common GPU training setup |
| MultiWorkerMirroredStrategy | Multiple machines | gRPC + CollectiveOps | Scale beyond single machine |
| TPUStrategy | Single TPU pod | Mesh AllReduce via XLA | Google TPU training |
| ParameterServerStrategy | Multiple machines | Push/pull to PS | Large embedding layers (recommender systems) |
| CentralStorageStrategy | Single machine | Central CPU storage | When GPU memory is limited |