A Spark cluster has various components and subsystems that run in the background while a job is being executed. These components are described below. Don’t worry if you have trouble imagining how they are created or implemented in code; that will become clear in later entries.
Driver Program – The process that runs the main() function of the application and creates the SparkContext. It is also the program/job, written by developers, that is submitted to Spark for processing.
Spark Context – The Spark Context is the entry point to Spark Core services and features. It sets up internal services and establishes a connection to a Spark execution environment. Every Spark application creates a SparkContext object before it can do any processing.
Cluster Manager – Spark uses a cluster manager to acquire resources across the cluster for executing a job. Spark is agnostic to the underlying cluster manager, however, and does not care how the cluster resources are obtained. It supports the following cluster managers:
- Spark standalone cluster manager
Worker Node – Worker nodes are the nodes that actually do the heavy lifting of data processing.
Executor – Executors are independent processes that run on the worker nodes, each in its own JVM. The data processing is actually done by these executor processes.
Cache – Data stored in physical memory. Jobs can cache intermediate data so that Spark does not need to recompute RDDs, which improves performance.
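To make the benefit of caching concrete, here is a minimal sketch in plain Python. It is only a toy model of the idea, not the Spark API: the class `LazyDataset`, its `cache()` and `collect()` methods, and the `compute_count` counter are hypothetical names invented for this illustration.

```python
# Toy model of caching in plain Python. This is NOT the Spark API;
# LazyDataset and compute_count are hypothetical names for illustration.

class LazyDataset:
    """Stand-in for an RDD: holds a recipe and computes data on demand."""

    def __init__(self, compute):
        self._compute = compute       # function that rebuilds the data from scratch
        self._cache_enabled = False   # set to True by cache()
        self._cached = None           # in-memory copy, filled on first collect()
        self.compute_count = 0        # how many times the recipe actually ran

    def cache(self):
        """Ask for the data to be kept in memory after it is first computed."""
        self._cache_enabled = True
        return self

    def collect(self):
        """Return the data, recomputing it only if no cached copy exists."""
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.compute_count += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data
        return data


def expensive_squares():
    # Pretend this is a costly chain of transformations.
    return [x * x for x in range(5)]


# Without caching, every use recomputes the data.
uncached = LazyDataset(expensive_squares)
uncached.collect()
uncached.collect()

# With caching, the second use is served from memory.
cached = LazyDataset(expensive_squares).cache()
cached.collect()
cached.collect()

print(uncached.compute_count, cached.compute_count)
```

The uncached dataset runs its recipe on every use, while the cached one runs it once and serves later uses from memory. That is the same trade-off a Spark job makes when it caches intermediate data.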
Task – A task is a unit of work performed independently by an executor on one partition.
Partition – Spark manages its data by splitting it into manageable chunks, called partitions, across the nodes in a cluster. The data is split in a way that reduces network traffic and optimises the operations to be performed on it.
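The relationship between partitions and tasks can be sketched in plain Python. This is a simplified model under stated assumptions, not how Spark is implemented: the helper names `split_into_partitions` and `sum_values_task` are hypothetical, a thread pool stands in for the executors, and hashing by key is just one possible way to split data.

```python
# Toy model of partitions and tasks in plain Python. This is NOT the Spark API;
# the function names are hypothetical and a thread pool stands in for executors.
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key.

    Records with the same key always land in the same partition, so a
    per-key aggregation never has to move data between partitions.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def sum_values_task(partition):
    """One task: sums values per key for a single partition, independently."""
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals


records = [(1, 10), (2, 20), (3, 30), (1, 5), (2, 1), (4, 7)]
partitions = split_into_partitions(records, num_partitions=2)

# Run one task per partition; the tasks do not depend on each other.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(sum_values_task, partitions))

# The "driver" merges the per-partition results. Keys never overlap
# between partitions, so a plain dictionary update is sufficient.
totals = {}
for partial in partial_results:
    totals.update(partial)

print(totals)
```

Because all records for a given key sit in the same partition, each task can finish its part of the aggregation without consulting any other partition. This illustrates why a good split of the data reduces network traffic.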