TensorFlow: Distributed Deep Learning

Google's open-source ML framework — from the dataflow graph model to multi-GPU and multi-worker distributed training strategies. Covers Q46–Q47.

TensorFlow was open-sourced by Google in 2015 and became the dominant deep learning framework for production workloads. Its core abstraction is the dataflow graph: a directed acyclic graph where nodes are operations and edges are tensors. This representation enables optimization (common subexpression elimination, constant folding) and distributed execution.

💡 Key design choice: separate definition from execution

In TF 1.x (static graphs), you define the entire computation graph in Python first, then execute it repeatedly with Session.run(). This allows the C++ runtime to optimize the graph globally. TF 2.x adds eager execution by default (Pythonic, debug-friendly) but tf.function compiles functions to static graphs when performance matters.

TensorFlow Layered Architecture

TF is organized as a stack of layers, from high-level training libraries down to hardware kernels. Click each layer to explore its role.

TensorFlow Layered Architecture
Click any layer to learn about it
↑ User-facing     System / Hardware ↓

Select a layer to learn about its role in the TensorFlow stack.

Warm colors (top) = user-facing APIs and libraries
Blue = core runtime (graph execution, scheduling)
Cool colors (bottom) = hardware abstraction layers

This layered design means TF can run on CPUs, GPUs, and TPUs without changing user code — the device placement and kernel selection happens automatically in the runtime layer.

Dataflow Graph: Nodes = Ops, Edges = Tensors

Every computation in TensorFlow is expressed as a dataflow graph. Operations (matmul, relu, add) are nodes; the tensors flowing between them are directed edges. The graph representation enables automatic differentiation (backprop) and distributed execution.

TensorFlow Dataflow Graph: y = ReLU(Wx + b)
x(Placeholder)W(Variable)MatMulb(Variable)AddReLULosstensor [batch, 784]tensor [batch, 256]tensor [batch, 256]scalar

Nodes = operations (MatMul, Add, ReLU). Edges = tensors(multi-dimensional arrays flowing between ops). The graph is defined once in Python, then executed repeatedly by the C++ runtime.

tf.function: Graph Mode

Decorate with @tf.function to trace Python code into a static graph. Enables XLA compilation, constant folding, and parallel op scheduling. Critical for production performance.

Automatic Differentiation

TF traverses the dataflow graph in reverse to compute gradients via the chain rule. Each op registers its gradient function. tf.GradientTape records operations for gradient computation in eager mode.

Data-Parallel Training: Parameter Server

The classical approach to distributed training: dedicated parameter servershold model weights, workers compute gradients on their local batch, push gradients to PS, and pull updated weights.

Parameter Server Architecture
PS 0
params θ
PS 1
params θ
↑ push ∇θ
↓ pull θ
↑ push ∇θ
↓ pull θ
↑ push ∇θ
↓ pull θ
Worker 0
mini-batch
Worker 1
mini-batch
Worker 2
mini-batch
Bottleneck: All workers push/pull to/from PS on every step. With many workers, PS becomes the bandwidth bottleneck. Solution: use AllReduce (no PS) or hierarchical PS.

AllReduce: Ring-Based Gradient Aggregation

Ring-AllReduce avoids the PS bottleneck by having all workers communicate directly in a ring topology. Used by NCCL (Nvidia), Horovod, and TF's MirroredStrategy.

Ring-AllReduce: Gradient Aggregation
GPU 0∇θ_0GPU 1∇θ_1GPU 2∇θ_2GPU 3∇θ_3

How Ring-AllReduce works:

  1. Scatter-reduce: Each GPU sends a chunk of its gradient to the next GPU in the ring, receiving a partial sum. After N-1 rounds, each GPU holds the full sum for one chunk.
  2. AllGather: Each GPU broadcasts its complete chunk around the ring. After N-1 rounds, all GPUs have all chunks → complete summed gradient.
  3. Result: All GPUs hold the same aggregated gradient. No bottleneck — each GPU communicates with only 2 neighbors.
Bandwidth efficiency: Each GPU sends/receives 2(N-1)/N × data ≈ 2× gradient data total, regardless of N workers. Optimal!

Distributed Training Code

Single Machine, Multiple GPUs (MirroredStrategy)
tf_mirrored.py
1# TensorFlow distributed training
2import tensorflow as tf
3
4# MirroredStrategy: data parallel, all-reduce on GPUs
5strategy = tf.distribute.MirroredStrategy()
6
7with strategy.scope():
8    model = tf.keras.Sequential([
9        tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
10        tf.keras.layers.Dropout(0.2),
11        tf.keras.layers.Dense(128, activation='relu'),
12        tf.keras.layers.Dense(10, activation='softmax'),
13    ])
14    model.compile(
15        optimizer='adam',
16        loss='sparse_categorical_crossentropy',
17        metrics=['accuracy'],
18    )
19
20# Dataset must be created inside strategy.scope() or passed as a tf.data.Dataset
21dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)) \
22    .shuffle(10000) \
23    .batch(64 * strategy.num_replicas_in_sync)  # scale batch size!
24
25# Automatically distributes across all available GPUs
26model.fit(dataset, epochs=10, validation_data=(x_test, y_test))
Multi-Worker Training (MultiWorkerMirroredStrategy)
tf_multiworker.py
1# Multi-worker distributed training (MultiWorkerMirroredStrategy)
2import os, json
3
4# TF_CONFIG tells each worker its role and cluster topology
5os.environ['TF_CONFIG'] = json.dumps({
6    'cluster': {
7        'worker': ['worker0:12345', 'worker1:12346', 'worker2:12347']
8    },
9    'task': {'type': 'worker', 'index': 0}  # different on each machine
10})
11
12strategy = tf.distribute.MultiWorkerMirroredStrategy()
13
14with strategy.scope():
15    model = build_model()
16    model.compile(...)
17
18model.fit(dataset, epochs=10)
19# Each worker runs the same code; TF_CONFIG differentiates their roles

Distribution Strategies

TensorFlow distribution strategies and their use cases
StrategyScopeGradient SyncUse Case
MirroredStrategySingle machine, multiple GPUsNCCL AllReduce (GPU-to-GPU)Most common GPU training setup
MultiWorkerMirroredStrategyMultiple machinesgRPC + CollectiveOpsScale beyond single machine
TPUStrategySingle TPU podMesh AllReduce via XLAGoogle TPU training
ParameterServerStrategyMultiple machinesPush/pull to PSLarge embedding layers (recommender systems)
CentralStorageStrategySingle machineCentral CPU storageWhen GPU memory is limited

TensorFlow Exam Questions (Q46–Q47)