Distributed Data Processing: MapReduce to Real-Time Streams

From batch MapReduce on HDFS to sub-millisecond stream processing with Flink. Understand the paradigm shifts, trade-offs, and architectures behind large-scale data processing.

Processing petabytes of data requires distributing computation across thousands of machines. The field has evolved from batch-only MapReduce (minutes to hours) through micro-batch Spark Streaming (seconds) to true streaming with Apache Flink (milliseconds).

💡 The fundamental trade-off

Throughput vs Latency. Batch systems amortize coordination overhead over large chunks of data, achieving high throughput. Streaming systems process each event immediately at the cost of per-record overhead. Lambda Architecture tries to get both — Kappa Architecture argues you only need one.

Lambda Architecture

Proposed by Nathan Marz (creator of Storm), Lambda Architecture separates concerns into three layers: batch (accuracy), speed (low latency), and serving (query merging).

Lambda Architecture

New Data (Events / Streams)

Batch Layer

Stores the master dataset (immutable, append-only). Recomputes batch views periodically.

MapReduce / Spark (hours)

Speed Layer

Processes only the most recent data to compensate for batch latency.

Spark Streaming / Flink (ms)

Serving Layer

Merges batch views + real-time views to answer queries with low latency.

HBase / Cassandra / Druid

Query Response

Kappa Architecture simplifies this by removing the batch layer — process everything as a stream.

MapReduce: Word Count Walk-Through

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. Every MapReduce job has exactly three phases: Map, Shuffle & Sort, and Reduce.

Each input line is split into (word, 1) pairs independently.

"the cat sat on the mat"

(the, 1)(cat, 1)(sat, 1)(on, 1)(the, 1)(mat, 1)

"the cat in the hat"

(the, 1)(cat, 1)(in, 1)(the, 1)(hat, 1)

Word count — the canonical MapReduce exampleClick phases to explore

Map

User-defined function. Reads (key, value) input pairs, emits intermediate (key, value) pairs. Runs in parallel across all input splits.

O(n) per split

Shuffle & Sort

Framework handles this automatically. Groups all values by key across all mapper outputs. The most network-intensive phase.

O(n log n) sort

Reduce

User-defined function. Receives (key, list[values]) and emits final output. Also runs in parallel — one reducer per unique key group.

O(n) per key

Processing Paradigm Comparison

Three generations of distributed processing engines, each addressing limitations of the previous.

MapReduce vs Spark vs Flink: feature comparison
Feature	MapReduce	Apache Spark	Apache Flink
Abstraction	Files (HDFS splits)	RDD / Dataset / DataFrame	DataStream / DataSet API
Processing model	Batch only	Batch + Micro-batch streaming	Native streaming + batch
Latency	Minutes – hours	Seconds – minutes	Milliseconds
State management	Stateless (HDFS between jobs)	RDD lineage graph	Rich keyed state (RocksDB)
Fault tolerance	HDFS replication + task retry	RDD lineage recomputation	Barrier-based checkpoints
In-memory?	No (disk I/O between phases)	Yes (RDDs in memory)	Yes (managed memory)
Languages	Java, Python (Streaming)	Scala, Python, Java, R	Java, Scala, Python
Windowing	N/A (batch only)	Window functions on DStreams	True windowing with watermarks

Deep Dives

Apache Spark

RDDs, DAG execution, streaming

Unified analytics engine with in-memory processing. RDD lineage for fault tolerance, Spark Streaming with D-Streams, and the full transformation/action API.

Explore Spark →

Apache Flink

True streaming, watermarks, barriers

Streaming-first engine with native event-time processing, watermarks for out-of-order events, and barrier-based exactly-once checkpointing.

Explore Flink →

Exam Question Preview

Preview questions from the processing section. Click to reveal hints.

View all processing exam questions →