APACHE SPARK ARCHITECTURE 101
Spark sounds intimidating. Distributed computing, RDDs, DAGs, Executors - the vocabulary alone is a wall. Let's knock it down, brick by brick, with zero jargon and zero fluff.
Why Does Spark Even Exist?
Let's say you have 1 billion rows of data. Your laptop has 16GB of RAM. Those billion rows won't fit - at even a few dozen bytes per row, that's tens of gigabytes. You can't even open the file.
What if instead of one laptop, you had 100 of them - all working on 10 million rows each, simultaneously? That's the core idea behind Spark.
Apache Spark is a distributed computing engine. It takes your big data problem, splits it across many machines, runs them in parallel, and brings the result back to you. All while you write code that looks like it's working on a single machine.
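The split-work-combine idea can be sketched in a few lines of plain Python - a toy model of the Spark pattern, not Spark itself, with hypothetical data and worker counts:

```python
from concurrent.futures import ThreadPoolExecutor

def count_even(chunk):
    """The work one 'Executor' does on its slice of the data."""
    return sum(1 for row in chunk if row % 2 == 0)

def distributed_count(rows, workers=4):
    """Split the data into chunks, process them in parallel, combine."""
    chunk_size = (len(rows) + workers - 1) // workers  # ceiling division
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_even, chunks)  # the "Executors" work
    return sum(partials)  # the "Driver" combines partial results

print(distributed_count(list(range(1_000_000))))  # 500000
```

Swap the threads for 100 machines and the list for a data lake, and you have the mental model - your code still reads like it runs in one place.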
Before Spark, There Was Hadoop
Hadoop MapReduce was the original big data framework - but it was painfully slow because it wrote everything to disk between each step. Spark keeps data in memory between steps, making it 10–100x faster for iterative workloads like ML training. Spark didn't kill Hadoop; it learned from its mistakes.
The Cast of Characters
Every Spark job involves a handful of players. Let's meet them:
The Driver - Your Manager
When you run your Spark script, the first thing that starts is the Driver. It's the brain of your job. It reads your code, figures out the plan, coordinates the workers, and collects the final result. There's only one Driver per job.
Think of the Driver as a project manager. It doesn't do the actual work - it delegates.
The Executors - Your Workers
Executors are the machines that actually do the computation. Each one gets a slice of your data and processes it. They run for the lifetime of your Spark job and report results back to the Driver.
Each Executor has a fixed amount of CPU cores and RAM. More Executors = more parallel power.
The Cluster Manager - The Hiring Agency
The Cluster Manager is responsible for allocating resources. The Driver asks: "I need 10 Executors with 4 cores each." The Cluster Manager finds those machines and hands them over. Common options: YARN (Hadoop), Kubernetes, or Spark's built-in Standalone mode.
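That "I need 10 Executors with 4 cores each" request is just configuration on the session. A minimal sketch - the numbers are hypothetical, and which settings are honored depends on your cluster manager:

```python
from pyspark.sql import SparkSession

# Ask the Cluster Manager for 10 Executors, each with 4 cores and 8GB RAM
spark = (
    SparkSession.builder
    .appName("ResourceRequestDemo")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```

The same settings can be passed as `--conf` flags to `spark-submit` instead.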
Driver vs Executor Memory
Never run collect() on a billion-row DataFrame without filtering first.
collect() pulls ALL data from all Executors back to the Driver.
If the data is bigger than Driver memory - your job will crash gloriously.
Always filter → aggregate → then collect.
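The same discipline in plain Python: aggregate while streaming instead of materializing everything first. A toy analogy with made-up data, not PySpark:

```python
def orders():
    """Pretend this streams a billion rows back from the Executors."""
    for i in range(1_000_000):
        yield {"country": "India" if i % 5 == 0 else "Other", "amount": i}

# Bad:  rows = list(orders())  -> materializes everything "on the Driver"
# Good: filter, then aggregate, and only bring back the tiny final number
india_count = sum(1 for row in orders() if row["country"] == "India")
print(india_count)  # 200000
```

In Spark terms: the filter and the aggregation run out on the Executors, and only one number crosses back to the Driver.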
Your Data in Spark: The DataFrame
When you load data into Spark, it becomes a DataFrame - a distributed table spread across your Executors. You can't see it sitting on one machine because it isn't - it's partitioned across many.
from pyspark.sql import SparkSession
# Start Spark - this creates the Driver and connects to the cluster
spark = SparkSession.builder \
    .appName("MyFirstSparkJob") \
    .getOrCreate()
# Load 1 billion rows from S3 - doesn't actually read it yet
df = spark.read.parquet("s3://my-lake/orders/")
# This is a "transformation" - Spark plans it but does nothing yet
df_india = df.filter(df.country == "India")
# This is an "action" - NOW Spark executes everything
india_count = df_india.count()
print(f"India orders: {india_count}")
spark.stop()
The Lazy Secret - Transformations vs Actions
Here's the concept that confuses beginners the most, but it's actually clever: Spark is lazy.
When you write df.filter(...) or df.groupBy(...), Spark doesn't execute anything. It just notes it down. Only when you call an Action - like .count(), .show(), or .write() - does Spark actually run the whole thing.
- Transformations (lazy): filter, select, groupBy, join, withColumn
- Actions (trigger execution): count, show, collect, write, first
Why? Because before executing, Spark builds an optimized plan. The more transformations it has, the better it can optimize the whole pipeline in one go. It's the difference between planning your entire road trip vs. driving to each waypoint one-by-one.
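Laziness can be mimicked in a few lines of plain Python - a toy model where transformations only record a plan and an action runs it, not Spark's actual internals:

```python
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []  # recorded transformations, nothing executed

    def filter(self, predicate):
        # Transformation: note it down, return a new plan, do no work
        return LazyFrame(self.rows, self.plan + [("filter", predicate)])

    def count(self):
        # Action: NOW run the whole recorded plan in one go
        result = self.rows
        for op, fn in self.plan:
            if op == "filter":
                result = [r for r in result if fn(r)]
        return len(result)

df = LazyFrame(range(100))
small = df.filter(lambda x: x < 10).filter(lambda x: x % 2 == 0)
print(small.count())  # 5 - nothing ran until count() was called
```

Because the whole plan is visible before anything runs, a real engine like Spark can fuse both filters into a single pass - that's the payoff of being lazy.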
The DAG - Spark's Blueprint
When an Action is triggered, Spark converts your code into a DAG (Directed Acyclic Graph) - a blueprint of every step required to produce the result.
Think of a DAG like a recipe broken into individual steps, where some steps can run in parallel and others depend on previous results.
Spark then breaks this DAG into Stages, and each Stage into Tasks. Tasks are what actually run on your Executors - one task per partition of the data.
Partitions - The Unit of Parallelism
When Spark reads your data, it splits it into partitions - small chunks that can be processed independently. If you have 100 partitions and 10 Executors with 4 cores each (40 slots total), Spark runs 40 tasks at once, then another 40, then the remaining 20.
# Check how many partitions your DataFrame has
print(df.rdd.getNumPartitions()) # e.g., 200
# Increase partitions - useful when joining large tables
df_repartitioned = df.repartition(400)
# Decrease partitions - useful when writing small output files
# (fewer, bigger files = better for downstream reads)
df_small = df.coalesce(10)
# Default: Spark creates 200 shuffle partitions after groupBy/join
# For small datasets, this creates tiny tasks. Tune it:
spark.conf.set("spark.sql.shuffle.partitions", "50")
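The scheduling arithmetic from the example above (100 partitions, 40 slots) is just ceiling division - a plain-Python sketch with hypothetical cluster numbers:

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor):
    """How many rounds of tasks a job needs, given the cluster's slots."""
    slots = num_executors * cores_per_executor  # tasks running at once
    waves = math.ceil(num_partitions / slots)   # full rounds + a partial one
    return slots, waves

print(task_waves(100, 10, 4))  # (40, 3) -> waves of 40 + 40 + 20 tasks
```

Note the last wave runs only 20 tasks while 20 slots sit idle - one reason partition counts are often chosen as a multiple of the total core count.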
The Golden Rule of Partitions
Aim for partitions between 100MB–1GB each. Too many tiny partitions = scheduling overhead. Too few large partitions = some Executors sit idle while one struggles. The "small files problem" is real and it will slow you down.
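The golden rule turns into simple arithmetic when picking a partition count. A sketch - the 256MB target and dataset size are hypothetical, chosen to sit inside the 100MB–1GB sweet spot:

```python
def suggest_partitions(dataset_bytes, target_partition_bytes=256 * 1024**2):
    """Pick a partition count aiming at ~256MB per partition."""
    return max(1, -(-dataset_bytes // target_partition_bytes))  # ceiling div

# A 100GB dataset at ~256MB per partition
print(suggest_partitions(100 * 1024**3))  # 400
```

Feed a number like this into df.repartition(...) rather than guessing.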
Shuffles - The Expensive Necessary Evil
Some operations require data to move between Executors - this is called a shuffle.
When you do a groupBy or a join, all rows with the same key need to end up on the same Executor. That means network transfer. Disk writes. Slowness.
Shuffles are expensive. The tricks to minimize them:
- Filter before joining (smaller data = less to shuffle)
- Use broadcast joins when one table is small (Spark auto-broadcasts below spark.sql.autoBroadcastJoinThreshold, 10MB by default; manually broadcasting tables up to a few hundred MB is common)
- Partition your data on the join key before reading
from pyspark.sql.functions import broadcast
# Without broadcast - Spark shuffles the big "orders" table
result = orders.join(countries, "country_code")
# With broadcast - small "countries" table is copied to every Executor
# No shuffling of the big table. Much faster.
result = orders.join(broadcast(countries), "country_code")
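Conceptually, a shuffle routes every row to a target Executor by hashing its key - a plain-Python simplification of Spark's exchange, with made-up rows:

```python
def shuffle(rows, key, num_partitions):
    """Route each row so that equal keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        target = hash(row[key]) % num_partitions  # same key -> same target
        partitions[target].append(row)  # in real Spark: network + disk
    return partitions

rows = [{"country_code": c, "amount": i} for i, c in enumerate("ININUS")]
parts = shuffle(rows, "country_code", 4)
# All "I" rows end up in one partition, all "N" rows in another, etc.
```

Every row physically moving to its target is exactly the network-and-disk cost the tips above try to shrink.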
Putting It All Together
Here's what actually happens when you hit run on a Spark job:
- Step 1 - Driver receives your code. Builds a logical plan.
- Step 2 - Catalyst Optimizer (Spark's internal engine) optimizes it. Pushes filters down, reorders joins, etc.
- Step 3 - Logical plan becomes a Physical plan (actual DAG of stages and tasks).
- Step 4 - Driver asks Cluster Manager for Executors.
- Step 5 - Tasks are sent to Executors. Each processes its partition.
- Step 6 - Shuffle happens if needed. Results assembled.
- Step 7 - Final result returned to Driver (or written to storage).
The Main Takeaway
Spark's magic is in step 2 - the optimizer. You write clean, readable PySpark or SQL. Spark figures out how to run it efficiently across 100 machines. You get the result. That abstraction - thinking as if you have one huge machine - is what makes Spark worth learning.
You've Completed the Trilogy
You've now seen the full picture: how data moves → where it lives in a warehouse → how Spark processes it at scale. That's the foundation of modern data engineering. Everything else is just building on top of these three ideas.