Distributed Data Processing: MapReduce to Real-Time Streams

From batch MapReduce on HDFS to sub-millisecond stream processing with Flink. Understand the paradigm shifts, trade-offs, and architectures behind large-scale data processing.

Processing petabytes of data requires distributing computation across thousands of machines. The field has evolved from batch-only MapReduce (minutes to hours) through micro-batch Spark Streaming (seconds) to true streaming with Apache Flink (milliseconds).

๐Ÿ’ก The fundamental trade-off

Throughput vs Latency. Batch systems amortize coordination overhead over large chunks of data, achieving high throughput. Streaming systems process each event immediately at the cost of per-record overhead. Lambda Architecture tries to get both โ€” Kappa Architecture argues you only need one.

Lambda Architecture

Proposed by Nathan Marz (creator of Storm), Lambda Architecture separates concerns into three layers: batch (accuracy), speed (low latency), and serving (query merging).

Lambda Architecture
New Data (Events / Streams)
Batch Layer
Stores the master dataset (immutable, append-only). Recomputes batch views periodically.
MapReduce / Spark (hours)
Speed Layer
Processes only the most recent data to compensate for batch latency.
Spark Streaming / Flink (ms)
Serving Layer
Merges batch views + real-time views to answer queries with low latency.
HBase / Cassandra / Druid
Query Response

Kappa Architecture simplifies this by removing the batch layer โ€” process everything as a stream.

MapReduce: Word Count Walk-Through

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. Every MapReduce job has exactly three phases: Map, Shuffle & Sort, and Reduce.

Each input line is split into (word, 1) pairs independently.

"the cat sat on the mat"
(the, 1)(cat, 1)(sat, 1)(on, 1)(the, 1)(mat, 1)
"the cat in the hat"
(the, 1)(cat, 1)(in, 1)(the, 1)(hat, 1)
Word count โ€” the canonical MapReduce exampleClick phases to explore
Map

User-defined function. Reads (key, value) input pairs, emits intermediate (key, value) pairs. Runs in parallel across all input splits.

O(n) per split
Shuffle & Sort

Framework handles this automatically. Groups all values by key across all mapper outputs. The most network-intensive phase.

O(n log n) sort
Reduce

User-defined function. Receives (key, list[values]) and emits final output. Also runs in parallel โ€” one reducer per unique key group.

O(n) per key

Processing Paradigm Comparison

Three generations of distributed processing engines, each addressing limitations of the previous.

MapReduce vs Spark vs Flink: feature comparison
FeatureMapReduceApache SparkApache Flink
AbstractionFiles (HDFS splits)RDD / Dataset / DataFrameDataStream / DataSet API
Processing modelBatch onlyBatch + Micro-batch streamingNative streaming + batch
LatencyMinutes โ€“ hoursSeconds โ€“ minutesMilliseconds
State managementStateless (HDFS between jobs)RDD lineage graphRich keyed state (RocksDB)
Fault toleranceHDFS replication + task retryRDD lineage recomputationBarrier-based checkpoints
In-memory?No (disk I/O between phases)Yes (RDDs in memory)Yes (managed memory)
LanguagesJava, Python (Streaming)Scala, Python, Java, RJava, Scala, Python
WindowingN/A (batch only)Window functions on DStreamsTrue windowing with watermarks

Deep Dives

Exam Question Preview

Preview questions from the processing section. Click to reveal hints.