Apache Spark: Unified Analytics Engine

In-memory distributed processing with RDD lineage, DAG-based scheduling, Spark Streaming D-Streams, and the full transformation/action API. Covers Q31–Q37.

Apache Spark was created at UC Berkeley in 2009 to overcome MapReduce's fundamental limitation: writing intermediate results to disk between every map and reduce phase. Spark keeps data in memory (in RDDs) and chains operations into a DAG that is optimized before execution, achieving 10–100× speedups over MapReduce for iterative workloads.

💡 RDD: Resilient Distributed Dataset

An RDD is an immutable, partitioned collection of records distributed across the cluster.
Resilient = fault tolerant via lineage (not replication).
Distributed = partitioned across many nodes.
Dataset = a collection of data.
Transformations are lazy — the DAG is only executed when an action is called.

Spark Architecture

A Spark application has a Driver process (your main() function) and multiple Executor processes on worker nodes. The Cluster Manager (YARN, Kubernetes, etc.) allocates resources.

Spark Architecture

Driver Program

SparkContext

DAG Scheduler

Task Scheduler

Hosts the main() function. Converts RDD transformations into a DAG of stages. Negotiates resources with the Cluster Manager.

↕

Cluster Manager

YARN / Mesos / Kubernetes / Standalone

Allocates executor processes across the cluster

↓

Executor 1

Task A

Task B

Cache / RDD

Executor 2

Task A

Task B

Cache / RDD

Executor 3

Task A

Task B

Cache / RDD

Key: Executors stay alive for the lifetime of the application, enabling in-memory caching of RDDs across operations. Workers cache RDD partitions in their JVM heap or optionally on disk.

RDD Lineage DAG

Every RDD remembers which transformations produced it (the lineage graph). If a partition is lost, Spark recomputes only that partition by replaying its lineage — no need to store redundant copies of data.

Spark RDD Lineage DAG

Click any node to see details about the operation, data type, and partition count.

Purple node = Checkpoint (cuts lineage)

Amber node = Action (triggers execution)

Teal node = Selected

Red node = Simulated failure

Green glow = Recomputing from lineage

Dashed purple line = checkpoint boundary (lineage reset). Click nodes to explore. Use “Simulate Failure” to see lineage recomputation.

Fault Tolerance: Lineage vs Replication

Spark: Lineage Recomputation

Every RDD tracks its parent RDDs and the transformation applied
On failure: recompute the lost partition from its parents
Pros: No storage overhead, works for any immutable data
Cons: Long lineage chains are expensive to recompute
Solution: checkpoint() materializes RDD to HDFS, cuts lineage

Hadoop MapReduce: HDFS Replication

All intermediate data written to HDFS with 3× replication
On failure: simply read a replica from HDFS
Pros: Simple, reliable, fast recovery
Cons: Huge I/O overhead — 3× data written between every phase
This is why Spark is 10–100× faster for iterative jobs

Spark Streaming: D-Streams

Spark Streaming processes live data by dividing it into small batches (Discretized Streams = D-Streams). Each batch is processed as a regular Spark RDD job. This reuses all Spark optimizations but introduces latency equal to the batch interval.

D-Stream: Micro-batch Model

Live Stream

events flowing...

Micro-batch Intervals (Δt = 2s each)

t=[0,2)s

→ RDD_1

Processing

t=[2,4)s

→ RDD_2

t=[4,6)s

→ RDD_3

t=[6,8)s

→ RDD_4

t=[8,10)s

→ RDD_5

Key insight:Each interval's events are collected into a batch, then processed as a regular Spark RDD. This “micro-batch” approach reuses all Spark batch optimizations but adds latency equal to the batch interval (typically 0.5–5s).

Batch vs Streaming Code

Batch Spark

WordCount.scala

1// Batch Spark: word count
2import org.apache.spark.{SparkConf, SparkContext}
3
4val conf = new SparkConf().setAppName("WordCount")
5val sc = new SparkContext(conf)
6
7val textFile = sc.textFile("hdfs://namenode/data/input/*.txt")
8
9val counts = textFile
10  .flatMap(line => line.split(" "))   // narrow: split each line
11  .map(word => (word, 1))             // narrow: emit (word, 1)
12  .reduceByKey(_ + _)                 // wide: shuffle by key
13
14counts.saveAsTextFile("hdfs://namenode/data/output/")
15sc.stop()

Spark Streaming

StreamingWordCount.scala

1// Spark Streaming: word count over D-Streams
2import org.apache.spark.streaming._
3import org.apache.spark.streaming.StreamingContext._
4
5val ssc = new StreamingContext(sc, Seconds(2))  // 2-second batches
6
7val lines = ssc.socketTextStream("localhost", 9999)
8
9val wordCounts = lines
10  .flatMap(_.split(" "))
11  .map(word => (word, 1))
12  .reduceByKey(_ + _)
13
14// Print top 10 results every batch
15wordCounts.print()
16
17ssc.start()
18ssc.awaitTermination()

Notice the structural similarity: D-Streams expose the same transformation API as RDDs. The key difference is the batch interval and the fact that streaming jobs run continuously.

Transformations vs Actions

Spark transformations (lazy) vs actions (trigger computation)
Type	Operation	Description	Returns
Transformation	map(f)	Apply f to each element	New RDD
Transformation	filter(f)	Keep elements where f returns true	New RDD
Transformation	flatMap(f)	Apply f, flatten result	New RDD
Transformation	reduceByKey(f)	Aggregate values per key (wide)	New RDD
Transformation	groupByKey()	Group all values per key (wide, slow)	New RDD
Transformation	join(other)	Join two RDDs by key (wide)	New RDD
Action	collect()	Return all elements to driver	Array
Action	count()	Count elements	Long
Action	reduce(f)	Aggregate all elements	Single value
Action	saveAsTextFile(path)	Write to HDFS	Unit (side effect)
Action	countByValue()	Count occurrences of each value	Map[T, Long]

Narrow vs Wide Transformations

The key to Spark's DAG optimization. Narrow transformations can be pipelined together in a single stage. Wide transformations require a shuffle and create stage boundaries.

Narrow Transformations

Each output partition depends on exactly one input partition. No data shuffle required. Can be pipelined together.

Parent RDD

→

Child RDD

P0'

P1'

P2'

mapfilterflatMapunionmapPartitions

Wide Transformations (Shuffle)

Each output partition depends on multiple input partitions. Requires a shuffle — data moves across the network. Stage boundary in DAG.

Parent RDD

Child RDD

P0'

P1'

P2'

groupByKeyreduceByKeyjoinrepartitionsortByKey

💡 Stage boundaries and the DAG Scheduler

The DAGScheduler splits the RDD lineage graph into stages at every wide transformation. Within a stage, all narrow transformations are pipelined (no disk I/O). Between stages, data is shuffled and optionally spilled to disk for fault tolerance.

Spark Exam Questions (Q31–Q37)

These questions cover the core Spark concepts tested in the exam.

View all exam questions →