Apache Spark: Unified Analytics Engine

In-memory distributed processing with RDD lineage, DAG-based scheduling, Spark Streaming D-Streams, and the full transformation/action API. Covers Q31–Q37.

Apache Spark was created at UC Berkeley in 2009 to overcome MapReduce's fundamental limitation: writing intermediate results to disk between every map and reduce phase. Spark keeps data in memory (in RDDs) and chains operations into a DAG that is optimized before execution, achieving 10–100× speedups over MapReduce for iterative workloads.

💡 RDD: Resilient Distributed Dataset

An RDD is an immutable, partitioned collection of records distributed across the cluster.
Resilient = fault tolerant via lineage (not replication).
Distributed = partitioned across many nodes.
Dataset = a collection of data.
Transformations are lazy — the DAG is only executed when an action is called.

Spark Architecture

A Spark application has a Driver process (your main() function) and multiple Executor processes on worker nodes. The Cluster Manager (YARN, Kubernetes, etc.) allocates resources.

Spark Architecture
Driver Program
SparkContext
DAG Scheduler
Task Scheduler

Hosts the main() function. Converts RDD transformations into a DAG of stages. Negotiates resources with the Cluster Manager.

Cluster Manager
YARN / Mesos / Kubernetes / Standalone

Allocates executor processes across the cluster

Executor 1
Task A
Task B
Cache / RDD
Executor 2
Task A
Task B
Cache / RDD
Executor 3
Task A
Task B
Cache / RDD
Key: Executors stay alive for the lifetime of the application, enabling in-memory caching of RDDs across operations. Workers cache RDD partitions in their JVM heap or optionally on disk.

RDD Lineage DAG

Every RDD remembers which transformations produced it (the lineage graph). If a partition is lost, Spark recomputes only that partition by replaying its lineage — no need to store redundant copies of data.

Spark RDD Lineage DAG
textFile()RDD[String] · 8plines RDDRDD[String] · 8pfilter(ERROR)RDD[String] · 8pcheckpoint()RDD[String] · 8pmap(parseURL)RDD[String] · 8pcountByValue()Map[String, Long] · 1p

Click any node to see details about the operation, data type, and partition count.

Purple node = Checkpoint (cuts lineage)
Amber node = Action (triggers execution)
Teal node = Selected
Red node = Simulated failure
Green glow = Recomputing from lineage
Dashed purple line = checkpoint boundary (lineage reset). Click nodes to explore. Use “Simulate Failure” to see lineage recomputation.

Fault Tolerance: Lineage vs Replication

Spark: Lineage Recomputation
  • Every RDD tracks its parent RDDs and the transformation applied
  • On failure: recompute the lost partition from its parents
  • Pros: No storage overhead, works for any immutable data
  • Cons: Long lineage chains are expensive to recompute
  • Solution: checkpoint() materializes RDD to HDFS, cuts lineage
Hadoop MapReduce: HDFS Replication
  • All intermediate data written to HDFS with 3× replication
  • On failure: simply read a replica from HDFS
  • Pros: Simple, reliable, fast recovery
  • Cons: Huge I/O overhead — 3× data written between every phase
  • This is why Spark is 10–100× faster for iterative jobs

Spark Streaming: D-Streams

Spark Streaming processes live data by dividing it into small batches (Discretized Streams = D-Streams). Each batch is processed as a regular Spark RDD job. This reuses all Spark optimizations but introduces latency equal to the batch interval.

D-Stream: Micro-batch Model
Live Stream
events flowing...
Micro-batch Intervals (Δt = 2s each)
t=[0,2)s
→ RDD_1
Processing
t=[2,4)s
→ RDD_2
t=[4,6)s
→ RDD_3
t=[6,8)s
→ RDD_4
t=[8,10)s
→ RDD_5
Key insight:Each interval's events are collected into a batch, then processed as a regular Spark RDD. This “micro-batch” approach reuses all Spark batch optimizations but adds latency equal to the batch interval (typically 0.5–5s).

Batch vs Streaming Code

Batch Spark
WordCount.scala
1// Batch Spark: word count
2import org.apache.spark.{SparkConf, SparkContext}
3
4val conf = new SparkConf().setAppName("WordCount")
5val sc = new SparkContext(conf)
6
7val textFile = sc.textFile("hdfs://namenode/data/input/*.txt")
8
9val counts = textFile
10  .flatMap(line => line.split(" "))   // narrow: split each line
11  .map(word => (word, 1))             // narrow: emit (word, 1)
12  .reduceByKey(_ + _)                 // wide: shuffle by key
13
14counts.saveAsTextFile("hdfs://namenode/data/output/")
15sc.stop()
Spark Streaming
StreamingWordCount.scala
1// Spark Streaming: word count over D-Streams
2import org.apache.spark.streaming._
3import org.apache.spark.streaming.StreamingContext._
4
5val ssc = new StreamingContext(sc, Seconds(2))  // 2-second batches
6
7val lines = ssc.socketTextStream("localhost", 9999)
8
9val wordCounts = lines
10  .flatMap(_.split(" "))
11  .map(word => (word, 1))
12  .reduceByKey(_ + _)
13
14// Print top 10 results every batch
15wordCounts.print()
16
17ssc.start()
18ssc.awaitTermination()
Notice the structural similarity: D-Streams expose the same transformation API as RDDs. The key difference is the batch interval and the fact that streaming jobs run continuously.

Transformations vs Actions

Spark transformations (lazy) vs actions (trigger computation)
TypeOperationDescriptionReturns
Transformationmap(f)Apply f to each elementNew RDD
Transformationfilter(f)Keep elements where f returns trueNew RDD
TransformationflatMap(f)Apply f, flatten resultNew RDD
TransformationreduceByKey(f)Aggregate values per key (wide)New RDD
TransformationgroupByKey()Group all values per key (wide, slow)New RDD
Transformationjoin(other)Join two RDDs by key (wide)New RDD
Actioncollect()Return all elements to driverArray
Actioncount()Count elementsLong
Actionreduce(f)Aggregate all elementsSingle value
ActionsaveAsTextFile(path)Write to HDFSUnit (side effect)
ActioncountByValue()Count occurrences of each valueMap[T, Long]

Narrow vs Wide Transformations

The key to Spark's DAG optimization. Narrow transformations can be pipelined together in a single stage. Wide transformations require a shuffle and create stage boundaries.

Narrow Transformations

Each output partition depends on exactly one input partition. No data shuffle required. Can be pipelined together.

Parent RDD
P0
P1
P2
Child RDD
P0'
P1'
P2'
mapfilterflatMapunionmapPartitions
Wide Transformations (Shuffle)

Each output partition depends on multiple input partitions. Requires a shuffle — data moves across the network. Stage boundary in DAG.

Parent RDD
P0
P1
P2
Child RDD
P0'
P1'
P2'
groupByKeyreduceByKeyjoinrepartitionsortByKey

💡 Stage boundaries and the DAG Scheduler

The DAGScheduler splits the RDD lineage graph into stages at every wide transformation. Within a stage, all narrow transformations are pipelined (no disk I/O). Between stages, data is shuffled and optionally spilled to disk for fault tolerance.

Spark Exam Questions (Q31–Q37)

These questions cover the core Spark concepts tested in the exam.