TensorFlow: Distributed Deep Learning

Google's open-source ML framework — from the dataflow graph model to multi-GPU and multi-worker distributed training strategies. Covers Q46–Q47.

TensorFlow was open-sourced by Google in 2015 and became the dominant deep learning framework for production workloads. Its core abstraction is the dataflow graph: a directed acyclic graph where nodes are operations and edges are tensors. This representation enables optimization (common subexpression elimination, constant folding) and distributed execution.

💡 Key design choice: separate definition from execution

In TF 1.x (static graphs), you define the entire computation graph in Python first, then execute it repeatedly with Session.run(). This allows the C++ runtime to optimize the graph globally. TF 2.x adds eager execution by default (Pythonic, debug-friendly) but tf.function compiles functions to static graphs when performance matters.

TensorFlow Layered Architecture

TF is organized as a stack of layers, from high-level training libraries down to hardware kernels. Click each layer to explore its role.

TensorFlow Layered Architecture

Click any layer to learn about it

↑ User-facing System / Hardware ↓

Select a layer to learn about its role in the TensorFlow stack.

Warm colors (top) = user-facing APIs and libraries

Blue = core runtime (graph execution, scheduling)

Cool colors (bottom) = hardware abstraction layers

This layered design means TF can run on CPUs, GPUs, and TPUs without changing user code — the device placement and kernel selection happens automatically in the runtime layer.

Dataflow Graph: Nodes = Ops, Edges = Tensors

Every computation in TensorFlow is expressed as a dataflow graph. Operations (matmul, relu, add) are nodes; the tensors flowing between them are directed edges. The graph representation enables automatic differentiation (backprop) and distributed execution.

TensorFlow Dataflow Graph: y = ReLU(Wx + b)

Nodes = operations (MatMul, Add, ReLU). Edges = tensors(multi-dimensional arrays flowing between ops). The graph is defined once in Python, then executed repeatedly by the C++ runtime.

tf.function: Graph Mode

Decorate with @tf.function to trace Python code into a static graph. Enables XLA compilation, constant folding, and parallel op scheduling. Critical for production performance.

Automatic Differentiation

TF traverses the dataflow graph in reverse to compute gradients via the chain rule. Each op registers its gradient function. tf.GradientTape records operations for gradient computation in eager mode.

Data-Parallel Training: Parameter Server

The classical approach to distributed training: dedicated parameter servershold model weights, workers compute gradients on their local batch, push gradients to PS, and pull updated weights.

Parameter Server Architecture

PS 0
params θ

PS 1
params θ

↑ push ∇θ

↓ pull θ

↑ push ∇θ

↓ pull θ

↑ push ∇θ

↓ pull θ

Worker 0
mini-batch

Worker 1
mini-batch

Worker 2
mini-batch

Bottleneck: All workers push/pull to/from PS on every step. With many workers, PS becomes the bandwidth bottleneck. Solution: use AllReduce (no PS) or hierarchical PS.

AllReduce: Ring-Based Gradient Aggregation

Ring-AllReduce avoids the PS bottleneck by having all workers communicate directly in a ring topology. Used by NCCL (Nvidia), Horovod, and TF's MirroredStrategy.

Ring-AllReduce: Gradient Aggregation

How Ring-AllReduce works:

Scatter-reduce: Each GPU sends a chunk of its gradient to the next GPU in the ring, receiving a partial sum. After N-1 rounds, each GPU holds the full sum for one chunk.
AllGather: Each GPU broadcasts its complete chunk around the ring. After N-1 rounds, all GPUs have all chunks → complete summed gradient.
Result: All GPUs hold the same aggregated gradient. No bottleneck — each GPU communicates with only 2 neighbors.

Bandwidth efficiency: Each GPU sends/receives 2(N-1)/N × data ≈ 2× gradient data total, regardless of N workers. Optimal!

Distributed Training Code

Single Machine, Multiple GPUs (MirroredStrategy)

tf_mirrored.py

1# TensorFlow distributed training
2import tensorflow as tf
3
4# MirroredStrategy: data parallel, all-reduce on GPUs
5strategy = tf.distribute.MirroredStrategy()
6
7with strategy.scope():
8    model = tf.keras.Sequential([
9        tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
10        tf.keras.layers.Dropout(0.2),
11        tf.keras.layers.Dense(128, activation='relu'),
12        tf.keras.layers.Dense(10, activation='softmax'),
13    ])
14    model.compile(
15        optimizer='adam',
16        loss='sparse_categorical_crossentropy',
17        metrics=['accuracy'],
18    )
19
20# Dataset must be created inside strategy.scope() or passed as a tf.data.Dataset
21dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)) \
22    .shuffle(10000) \
23    .batch(64 * strategy.num_replicas_in_sync)  # scale batch size!
24
25# Automatically distributes across all available GPUs
26model.fit(dataset, epochs=10, validation_data=(x_test, y_test))

Multi-Worker Training (MultiWorkerMirroredStrategy)

tf_multiworker.py

1# Multi-worker distributed training (MultiWorkerMirroredStrategy)
2import os, json
3
4# TF_CONFIG tells each worker its role and cluster topology
5os.environ['TF_CONFIG'] = json.dumps({
6    'cluster': {
7        'worker': ['worker0:12345', 'worker1:12346', 'worker2:12347']
8    },
9    'task': {'type': 'worker', 'index': 0}  # different on each machine
10})
11
12strategy = tf.distribute.MultiWorkerMirroredStrategy()
13
14with strategy.scope():
15    model = build_model()
16    model.compile(...)
17
18model.fit(dataset, epochs=10)
19# Each worker runs the same code; TF_CONFIG differentiates their roles

Distribution Strategies

TensorFlow distribution strategies and their use cases
Strategy	Scope	Gradient Sync	Use Case
MirroredStrategy	Single machine, multiple GPUs	NCCL AllReduce (GPU-to-GPU)	Most common GPU training setup
MultiWorkerMirroredStrategy	Multiple machines	gRPC + CollectiveOps	Scale beyond single machine
TPUStrategy	Single TPU pod	Mesh AllReduce via XLA	Google TPU training
ParameterServerStrategy	Multiple machines	Push/pull to PS	Large embedding layers (recommender systems)
CentralStorageStrategy	Single machine	Central CPU storage	When GPU memory is limited

TensorFlow Exam Questions (Q46–Q47)

View all exam questions →