Spark Partitioning

Introduction: Transparently partitioning data across the worker nodes is one of the best features of Spark, and it has made writing distributed code simpler. Partitioning done poorly leads to data skew and, eventually, performance problems. So having the right number of partitions, with data in the right place, is important for reducing performance issues and data shuffling. This post covers the various types of partitioning and shows how you can quickly use them in your Spark jobs. IMPORTANT: Partitioning APIs are only available for pair RDDs, i.e. RDDs based on key-value pairs. Before diving into partitioning, let's look at a couple of highlights about partitions. Each partition resides on exactly one worker node; it cannot span nodes. Every worker node in a Spark cluster has at least one partition and can have more. Whenever you specify a partitioning strategy (standard or custom), make sure you persist/cache … Read more
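Below is a minimal sketch of the kind of standard partitioning the post discusses: a pair RDD repartitioned with a HashPartitioner and then persisted. The sample data, app name, local master and partition count are made up for illustration only.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for a quick local run.
    val sc = new SparkContext(new SparkConf().setAppName("PartitioningSketch").setMaster("local[*]"))

    // Partitioning APIs apply to pair RDDs only, i.e. RDDs of (key, value).
    val sales = sc.parallelize(Seq(("UK", 10), ("US", 20), ("UK", 5), ("IN", 7)))

    // Standard partitioning strategy: hash the keys into 4 partitions, then
    // persist so the partitioned layout is reused rather than recomputed.
    val partitioned = sales.partitionBy(new HashPartitioner(4)).persist()

    println(partitioned.partitioner)       // Some(...) once a partitioner is set
    println(partitioned.getNumPartitions)  // 4

    sc.stop()
  }
}
```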

Spark Shuffle

Introduction We cannot prevent shuffling. We can only reduce it. Ask a spark developer about performance issues and two things he/she will talk about is shuffling & partitioning. In this entry, we will focus just on shuffling. It would be a good idea to read one of the earlier posts on spark job execution model. When data is distributed across a spark cluster, it is not always where it should be. Therefore, from time to time spark may need to move data across the various nodes to complete specific computations.  Shuffling is a process of redistribution of data across the various partitions. Data is moved across nodes. Technically a shuffle consists of Network I/O, Disk I/O and Data serialisation. Network I/O is the most expensive operation and is usually the focus when everyone talks about shuffling. Disk I/O is probably the surprising thing for everyone but is interesting as well … Read more
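As a small illustration of reducing (rather than preventing) a shuffle, the sketch below contrasts reduceByKey with groupByKey on the same pair RDD: both redistribute data by key, but reduceByKey combines values on each partition first, so less data is serialised and sent over the network. The sample data and app name are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShuffleSketch").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

    // Both operations shuffle data by key. reduceByKey combines values within
    // each partition before the shuffle, so less data crosses the network.
    val viaReduce = pairs.reduceByKey(_ + _)
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)

    println(viaReduce.collect().toMap)  // Map(a -> 3, b -> 2, c -> 1)
    println(viaGroup.collect().toMap)   // same result, more data shuffled

    sc.stop()
  }
}
```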

Spark Job Execution Model

Introduction: I have been wanting to write this entry for a long time. The Spark job execution model, i.e. how Spark works internally, is an important topic of discussion, and knowledge of the internal execution engine is a real help when doing performance tuning. Let's look at Spark's execution model. The flow of execution of any Spark program can be explained using the following diagram. Spark surfaces all of this information quite nicely in its chatty logs; if you get a chance, take a look at them, as they offer very good information. More on the logs later. Now let's see how jobs are actually executed. We already know the runtime components of a Spark cluster, so we are familiar with some of the pieces involved. In this entry we will look at how these components interact, in what sequence events happen and where they happen. We also introduce two more … Read more
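A tiny sketch of how this plays out in code: transformations only build a lineage, a wide operation like reduceByKey introduces a stage boundary, and the action is what actually submits a job to the scheduler. toDebugString is just a quick way to peek at the lineage; the sample data and app name are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutionModelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExecutionModelSketch").setMaster("local[*]"))

    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)  // wide dependency: the scheduler puts a stage boundary here

    // Nothing has run yet. toDebugString prints the lineage the DAG scheduler
    // will split into stages (indentation marks the shuffle/stage boundary).
    println(counts.toDebugString)

    // The action submits a job; tasks are then scheduled per partition.
    counts.collect()

    sc.stop()
  }
}
```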

Spark – Shared Variables

Shared variables are a feature of Spark that allows it to maintain variables across the entire cluster while executing a job. Not every variable can be shared, and there are some rules around what you can and cannot do, and around how to create and manage shared variables. There are only two types of shared variables: broadcast variables and accumulator variables. Broadcast Variables: Broadcast variables send read-only data from the driver node to all the worker nodes. Makes me wonder why they are called variables 🙂 if they are read-only. There are numerous use-cases where the same copy of some data is required across all the worker nodes. Quoting from the Spark documentation: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of …" Read more
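A minimal sketch of both kinds of shared variables, using a made-up lookup map and counter: the broadcast ships read-only data to every worker once, while the accumulator lets tasks add to a value that the driver can read after an action has run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariablesSketch").setMaster("local[*]"))

    // Broadcast variable: read-only lookup data shipped once to every worker.
    val countryNames = sc.broadcast(Map("UK" -> "United Kingdom", "IN" -> "India"))

    // Accumulator variable: tasks only add to it; the driver reads the total.
    val missing = sc.longAccumulator("codes without a name")

    val resolved = sc.parallelize(Seq("UK", "IN", "XX")).map { code =>
      val name = countryNames.value.get(code)
      if (name.isEmpty) missing.add(1)
      name.getOrElse(code)
    }

    println(resolved.collect().toList)  // List(United Kingdom, India, XX)
    println(missing.value)              // 1, readable on the driver after the action

    sc.stop()
  }
}
```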

Spark Core – Caching

Introduction: In the computing world, a cache is a hardware or software component that stores data so that future requests can be served faster. A cache stores either pre-computed data or a copy of data stored somewhere else. Reading from a cache is usually faster than recomputing a result or reading from a slower data store, which improves software performance. The process of seeding/populating the cache is called caching. Spark is very good at in-memory processing, and we also know it is lazy when it comes to processing. Guess what – Spark caching is also lazy in nature. So let's take an example with the following assumptions: Spark has to perform two different actions on the same data, and there is a common intermediate RDD in the lineage. Without caching, Spark computes all the RDDs in the DAG to generate each result. However, once the action … Read more
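A small sketch of that scenario with made-up data: the common intermediate RDD is cached and then hit by two different actions, so the second action reuses the cached partitions instead of recomputing the lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingSketch").setMaster("local[*]"))

    // The common intermediate RDD that both actions below depend on.
    val squares = sc.parallelize(1 to 10000).map(n => n * n).cache()  // caching is lazy too

    // First action: computes the lineage and populates the cache as a side effect.
    println(squares.count())

    // Second action: served from the cached partitions, the map is not recomputed.
    println(squares.sum())

    sc.stop()
  }
}
```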

Spark Core – RDD Operations – Transformations

Transformations are lazy operations performed on RDDs which result in new RDDs. In the previous section we already used some transformations, like flatMap. Let's go ahead and look at some more: map, flatMap, filter, join, distinct, reduceByKey and groupByKey. The first two, map and flatMap, are probably the most used of all transformations. In this entry we will go through each of the above operations one by one and see the lazy nature of Spark. It's a loooooooooooong entry! 🙂 In all the examples the following build.sbt was used. map is the most used transformation. It allows the driver program to pass a function (anonymous or named) to the executors to operate on an RDD; when applied, it returns another RDD with the same number of elements. The following is the signature. The following is an example of using the map transformation on an RDD. The following … Read more
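A condensed sketch of a few of these transformations chained together on made-up data; everything stays lazy until the two actions at the end force the computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TransformationsSketch").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("spark is fast", "spark is lazy"))

    val words  = lines.flatMap(_.split(" "))   // one line -> many words
    val upper  = words.map(_.toUpperCase)      // same number of elements out
    val kept   = upper.filter(_ != "IS")       // drop unwanted elements
    val counts = kept.map(w => (w, 1)).reduceByKey(_ + _)

    // Everything above is lazy; only the actions below trigger computation.
    counts.collect().foreach(println)  // (SPARK,2), (FAST,1), (LAZY,1) in some order
    println(words.distinct().count())  // 4 distinct words

    sc.stop()
  }
}
```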

Spark Core – RDD Operations – Actions

Actions trigger the creation of the actual data. When Spark encounters an action it creates a list of all the steps required to achieve the result. These steps are nothing but the various RDDs the developer created by applying transformations to achieve the desired result, and together they form a Directed Acyclic Graph, or DAG for short. It is important to note that when Spark builds the DAG it only considers RDDs which contribute to the action, not all the RDDs; any RDDs which are just dangling and do not lead to the result are left out. Thus laziness is used to be efficient. Operations are performed on all the partitions in parallel. Many of these examples use anonymous functions (lambdas / function literals); if you want to refresh your knowledge of Scala functions, please have a look at these blog entries. In case you are … Read more
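A short sketch exercising several common actions on a made-up RDD; the single map transformation does nothing on its own, and each action below triggers its own job built from the DAG.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ActionsSketch").setMaster("local[*]"))

    val doubled = sc.parallelize(1 to 10).map(_ * 2)  // transformation: nothing runs yet

    // Each action makes Spark build a DAG from the lineage and run a job.
    println(doubled.reduce(_ + _))     // 110
    println(doubled.count())           // 10
    println(doubled.collect().toList)  // List(2, 4, ..., 20)
    println(doubled.take(3).toList)    // List(2, 4, 6)

    sc.stop()
  }
}
```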

Spark Core – RDD Operations

Spark operations are well documented at this link. Many of the Spark RDD operations belong to the org.apache.spark.rdd.RDD class, and they are all performed on all the partitions in parallel. I have listed some of the most used RDD operations below. Actions: Actions trigger the creation of the actual data. When Spark encounters an action it creates a list of all the steps required to achieve the result. Some of the actions are reduce, count, collect, saveAsTextFile and take. The first two, reduce and count, are among the most used actions; there are various others and I advise you to have a look at the Spark documentation. Transformations: Transformations are methods which produce a new RDD (lazily, though). All transformations are lazy. By applying a transformation and creating a new RDD we only provide lineage to Spark, which uses this information only when an action method is called. map … Read more
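As a quick illustration of the transformation/action split, the sketch below records one lazy transformation and then runs saveAsTextFile as the action; the output path is hypothetical and must not already exist.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddOperationsSketch").setMaster("local[*]"))

    // Transformation: only lineage is recorded, nothing executes yet.
    val squares = sc.parallelize(1 to 100).map(n => n * n)

    // Action: Spark walks the lineage and runs the job on all partitions in
    // parallel, writing one part file per partition to the (hypothetical) path.
    squares.saveAsTextFile("/tmp/squares-output")

    sc.stop()
  }
}
```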

Spark RDDs – Laziness & Lineage

In the introduction to RDDs we saw that there are two types of operations: actions and transformations. All transformations are lazy by nature, and only when there is an action does Spark do anything. Lazy Operations: Before going further, let's see the lazy nature of transformations. Let's modify our Spark Hello World program and comment out step 4. See below. I have attached the log from executing this Spark program: you will see that it fires up Spark, but Spark does not launch any operation and the log is quite short for a framework with such chatty logging. So no action = nothing happens – no executors are launched. See below. Now, uncomment the last line with the println and run the code again. The resulting log will show that Spark tries to read the file, … Read more
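For reference, a sketch of roughly what the modified Hello World might look like with the action (step 4) commented out; the input file name is hypothetical. With the last line commented, only lineage is built, so nothing is read or executed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazinessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazinessSketch").setMaster("local[*]"))

    // Steps 1-3: transformations only. With step 4 commented out the log stays
    // short: the file is never read and no tasks are launched.
    val lines  = sc.textFile("input.txt")  // hypothetical input file
    val words  = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Step 4: the action. Uncomment it and Spark reads the file, builds the DAG
    // from the lineage above and launches tasks.
    // println(counts.count())

    sc.stop()
  }
}
```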

Spark – Hello World

Now that we have some idea of how the components work, we can write a small program using Apache Spark and do something with it. Before we start, let's see which tools we will be using: IntelliJ Community Edition (IDE), Scala, SBT (the Scala build tool) and Apache Spark. For this walkthrough we will use Ubuntu Desktop; I already have an Ubuntu desktop running in VirtualBox, but you could use a MacBook and the process would still be the same. Launch the IntelliJ IDE, click Create New Project, select SBT and click Next, then provide the following information and click Finish: Project Name – SparkHelloWorld, sbt version – 0.13.17, Scala version – 2.11.8. This will create an sbt project. Add the Spark libraries to the project: open build.sbt, which is available in the root of the project. Visible … Read more
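For reference, a minimal build.sbt along the lines the post describes. The exact Spark version below is an assumption; any release built for Scala 2.11 should behave the same for this Hello World.

```scala
// build.sbt, a minimal sketch. The Spark version is an assumption; use any
// release built for Scala 2.11.
name := "SparkHelloWorld"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
```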