Spark RDDs – Introduction

Resilient Distributed Datasets (RDDs for short) are at the heart of a Spark cluster. An RDD is an immutable, distributed collection of data elements that can be operated upon in parallel across the various nodes in a cluster. RDDs hide/abstract how data is partitioned across the cluster, which simplifies application development with the Spark API.

One way to understand RDDs is to unpack the name itself:

  • Resilient – RDDs are fault tolerant. They are also lazy by nature and may never be materialised at all. An RDD stores the information needed to derive it from other data sources and/or other RDDs, so in a sense each RDD maintains its own lineage (see the sketch after this list).
  • Distributed – an RDD is split across cluster nodes as data partitions. Lost partitions can be recomputed from the lineage, and when the underlying data lives in a replicated store such as HDFS (or the RDD is persisted with a replicated storage level), those copies add further resilience. An application can also control how partitioning takes place so as to optimize processing of data.
  • Datasets – collections of data elements on which the Spark API operates.
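
As a quick illustration of lineage, here is a minimal sketch in Scala (assuming a SparkContext named sc is available, e.g. in spark-shell; the names and values are made up). It builds an RDD through two transformations and prints the lineage Spark has recorded:

    // Assuming a SparkContext `sc` is available (e.g. in spark-shell).
    val numbers = sc.parallelize(1 to 10)      // base RDD, built from a local collection
    val doubled = numbers.map(_ * 2)           // derived RDD – remembers it came from `numbers`
    val evens   = doubled.filter(_ % 4 == 0)   // derived again – remembers it came from `doubled`

    // Each RDD records how it can be re-derived; toDebugString prints that lineage.
    println(evens.toDebugString)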

RDDs can hold different record types, so there can be RDDs of any valid type – e.g. RDD[String], RDD[(Int, String)] and RDD[(String, String)] are some very simple element types, but we can have RDDs of XML or JSON records as well.
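
For instance, a small sketch of what such RDDs might look like (again assuming an existing SparkContext sc; the input path is hypothetical):

    import org.apache.spark.rdd.RDD

    // Assuming an existing SparkContext `sc`; "data/input.txt" is a hypothetical path.
    val lines: RDD[String]           = sc.textFile("data/input.txt")
    val pairs: RDD[(Int, String)]    = sc.parallelize(Seq((1, "one"), (2, "two")))
    val kv:    RDD[(String, String)] = sc.parallelize(Seq(("greeting", "hello"), ("farewell", "bye")))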

Lazy Evaluation in Spark

RDDs are lazy by nature, which means they are materialised only when they are needed. This allows Spark to use physical memory optimally: if an RDD has been defined but never used in any data processing, it occupies no memory at all. Effectively, this saves time and unwanted processing power. In practice it works like this:

  • You tell Spark to operate on a set of data.
  • Spark listens to what you ask it to do, writes down some shorthand for itself so it doesn’t forget, and then does absolutely nothing.
  • Spark continues to do nothing until you ask it for the final answer.
  • Spark always looks to limit how much work it has to do.
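
A small sketch of this behaviour (assuming a SparkContext sc; the log file path is hypothetical): the textFile and filter calls below merely record work to be done, and nothing is read or computed until count is called.

    // Assuming a SparkContext `sc`; "data/server.log" is a hypothetical path.
    val lines  = sc.textFile("data/server.log")      // nothing is read yet
    val errors = lines.filter(_.contains("ERROR"))   // still nothing is computed

    // Only this action forces Spark to read the file and apply the filter:
    println(s"error lines: ${errors.count()}")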

That brings us to the question: when does Spark materialise an RDD? See below.

RDD Operations

There are two types of RDD operations

  • Transformations
    • Functions that take an RDD as input and produce one or more new RDDs as output.
    • They do not change the input RDD (RDDs are immutable).
    • All transformations in Spark are lazy, in that they do not compute their results right away.
  • Actions
    • Operations which trigger the evaluation of an RDD’s chain of transformations and return a result to the driver program or write data to external storage – e.g. count, collect or saveAsTextFile (see the sketch after this list).
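
To make the distinction concrete, here is a minimal sketch (once more assuming a SparkContext sc): map and filter are transformations that merely describe new RDDs, while collect and count are actions that force the whole chain to run.

    // Assuming a SparkContext `sc` is available.
    val numbers = sc.parallelize(1 to 5)

    // Transformations: lazily describe new RDDs; `numbers` itself is untouched.
    val squared = numbers.map(n => n * n)
    val bigOnes = squared.filter(_ > 4)

    // Actions: trigger evaluation of the whole lineage and return results to the driver.
    println(bigOnes.collect().mkString(", "))   // prints: 9, 16, 25
    println(bigOnes.count())                    // prints: 3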

In the next entry we’ll look at RDD operations in more detail.
