Distributed Data Processing: MapReduce to Real-Time Streams
From batch MapReduce on HDFS to sub-millisecond stream processing with Flink. Understand the paradigm shifts, trade-offs, and architectures behind large-scale data processing.
Processing petabytes of data requires distributing computation across thousands of machines. The field has evolved from batch-only MapReduce (minutes to hours) through micro-batch Spark Streaming (seconds) to true streaming with Apache Flink (milliseconds).
๐ก The fundamental trade-off
Lambda Architecture
Proposed by Nathan Marz (creator of Storm), Lambda Architecture separates concerns into three layers: batch (accuracy), speed (low latency), and serving (query merging).
Kappa Architecture simplifies this by removing the batch layer โ process everything as a stream.
MapReduce: Word Count Walk-Through
MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. Every MapReduce job has exactly three phases: Map, Shuffle & Sort, and Reduce.
Each input line is split into (word, 1) pairs independently.
User-defined function. Reads (key, value) input pairs, emits intermediate (key, value) pairs. Runs in parallel across all input splits.
O(n) per splitFramework handles this automatically. Groups all values by key across all mapper outputs. The most network-intensive phase.
O(n log n) sortUser-defined function. Receives (key, list[values]) and emits final output. Also runs in parallel โ one reducer per unique key group.
O(n) per keyProcessing Paradigm Comparison
Three generations of distributed processing engines, each addressing limitations of the previous.
| Feature | MapReduce | Apache Spark | Apache Flink |
|---|---|---|---|
| Abstraction | Files (HDFS splits) | RDD / Dataset / DataFrame | DataStream / DataSet API |
| Processing model | Batch only | Batch + Micro-batch streaming | Native streaming + batch |
| Latency | Minutes โ hours | Seconds โ minutes | Milliseconds |
| State management | Stateless (HDFS between jobs) | RDD lineage graph | Rich keyed state (RocksDB) |
| Fault tolerance | HDFS replication + task retry | RDD lineage recomputation | Barrier-based checkpoints |
| In-memory? | No (disk I/O between phases) | Yes (RDDs in memory) | Yes (managed memory) |
| Languages | Java, Python (Streaming) | Scala, Python, Java, R | Java, Scala, Python |
| Windowing | N/A (batch only) | Window functions on DStreams | True windowing with watermarks |
Deep Dives
Unified analytics engine with in-memory processing. RDD lineage for fault tolerance, Spark Streaming with D-Streams, and the full transformation/action API.
Streaming-first engine with native event-time processing, watermarks for out-of-order events, and barrier-based exactly-once checkpointing.
Exam Question Preview
Preview questions from the processing section. Click to reveal hints.