Apache Spark: Unified Analytics Engine
In-memory distributed processing with RDD lineage, DAG-based scheduling, Spark Streaming D-Streams, and the full transformation/action API. Covers Q31–Q37.
Apache Spark was created at UC Berkeley in 2009 to overcome MapReduce's fundamental limitation: writing intermediate results to disk between every map and reduce phase. Spark keeps data in memory (in RDDs) and chains operations into a DAG that is optimized before execution, achieving 10–100× speedups over MapReduce for iterative workloads.
💡 RDD: Resilient Distributed Dataset
Resilient = fault tolerant via lineage (not replication).
Distributed = partitioned across many nodes.
Dataset = a collection of data.
Transformations are lazy — the DAG is only executed when an action is called.
Spark Architecture
A Spark application has a Driver process (your main() function) and multiple Executor processes on worker nodes. The Cluster Manager (YARN, Kubernetes, etc.) allocates resources.
Hosts the main() function. Converts RDD transformations into a DAG of stages. Negotiates resources with the Cluster Manager.
Allocates executor processes across the cluster
RDD Lineage DAG
Every RDD remembers which transformations produced it (the lineage graph). If a partition is lost, Spark recomputes only that partition by replaying its lineage — no need to store redundant copies of data.
Click any node to see details about the operation, data type, and partition count.
Fault Tolerance: Lineage vs Replication
- Every RDD tracks its parent RDDs and the transformation applied
- On failure: recompute the lost partition from its parents
- Pros: No storage overhead, works for any immutable data
- Cons: Long lineage chains are expensive to recompute
- Solution:
checkpoint()materializes RDD to HDFS, cuts lineage
- All intermediate data written to HDFS with 3× replication
- On failure: simply read a replica from HDFS
- Pros: Simple, reliable, fast recovery
- Cons: Huge I/O overhead — 3× data written between every phase
- This is why Spark is 10–100× faster for iterative jobs
Spark Streaming: D-Streams
Spark Streaming processes live data by dividing it into small batches (Discretized Streams = D-Streams). Each batch is processed as a regular Spark RDD job. This reuses all Spark optimizations but introduces latency equal to the batch interval.
Batch vs Streaming Code
1// Batch Spark: word count
2import org.apache.spark.{SparkConf, SparkContext}
3
4val conf = new SparkConf().setAppName("WordCount")
5val sc = new SparkContext(conf)
6
7val textFile = sc.textFile("hdfs://namenode/data/input/*.txt")
8
9val counts = textFile
10 .flatMap(line => line.split(" ")) // narrow: split each line
11 .map(word => (word, 1)) // narrow: emit (word, 1)
12 .reduceByKey(_ + _) // wide: shuffle by key
13
14counts.saveAsTextFile("hdfs://namenode/data/output/")
15sc.stop()1// Spark Streaming: word count over D-Streams
2import org.apache.spark.streaming._
3import org.apache.spark.streaming.StreamingContext._
4
5val ssc = new StreamingContext(sc, Seconds(2)) // 2-second batches
6
7val lines = ssc.socketTextStream("localhost", 9999)
8
9val wordCounts = lines
10 .flatMap(_.split(" "))
11 .map(word => (word, 1))
12 .reduceByKey(_ + _)
13
14// Print top 10 results every batch
15wordCounts.print()
16
17ssc.start()
18ssc.awaitTermination()Transformations vs Actions
| Type | Operation | Description | Returns |
|---|---|---|---|
| Transformation | map(f) | Apply f to each element | New RDD |
| Transformation | filter(f) | Keep elements where f returns true | New RDD |
| Transformation | flatMap(f) | Apply f, flatten result | New RDD |
| Transformation | reduceByKey(f) | Aggregate values per key (wide) | New RDD |
| Transformation | groupByKey() | Group all values per key (wide, slow) | New RDD |
| Transformation | join(other) | Join two RDDs by key (wide) | New RDD |
| Action | collect() | Return all elements to driver | Array |
| Action | count() | Count elements | Long |
| Action | reduce(f) | Aggregate all elements | Single value |
| Action | saveAsTextFile(path) | Write to HDFS | Unit (side effect) |
| Action | countByValue() | Count occurrences of each value | Map[T, Long] |
Narrow vs Wide Transformations
The key to Spark's DAG optimization. Narrow transformations can be pipelined together in a single stage. Wide transformations require a shuffle and create stage boundaries.
Each output partition depends on exactly one input partition. No data shuffle required. Can be pipelined together.
Each output partition depends on multiple input partitions. Requires a shuffle — data moves across the network. Stage boundary in DAG.
💡 Stage boundaries and the DAG Scheduler
Spark Exam Questions (Q31–Q37)
These questions cover the core Spark concepts tested in the exam.