Spark DataSets Performance Tuning – Resource Planning

Introduction In this post we will focus on Datasets, see how they can be tuned, and check whether they are more efficient than working on RDDs even though Datasets are converted to RDDs for processing. More on that later. To set the stage we will use the same resource plans as were used in Spark RDDs Performance Tuning – Resource Planning and will also answer the same queries, using the same data and a cluster with exactly the same components & resources. That way we get a nice comparison as well. If you haven’t read it – then you should :). So let’s dive in, run our tests and see how everything stacks up. Step – 1 Modify spark-defaults.conf Make the following changes to spark-defaults.conf. In Amazon EMR it is found in the /etc/spark/conf/ directory. Step – 2 Cluster Configuration For … Read more
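To make the comparison concrete, here is a minimal sketch of working with a typed Dataset, assuming a hypothetical Trip schema and a placeholder input path – the actual dataset, schema and resource plan come from the earlier resource-planning post.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema used only for illustration; not from the original post.
case class Trip(id: Long, city: String, distanceKm: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-tuning-example")
      .getOrCreate()
    import spark.implicits._

    // Read a CSV file into a DataFrame, then convert to a typed Dataset.
    // The path and column names are placeholders.
    val trips = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://your-bucket/trips/")
      .as[Trip]

    // A simple aggregation; Catalyst/Tungsten optimise this even though
    // the physical execution ultimately runs over RDDs.
    trips.groupByKey(_.city)
      .count()
      .show()

    spark.stop()
  }
}
```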

Spark RDDs Performance Tuning – Partitioning

Now that we have gone through resource planning, shuffling & caching to tune our Spark application, we have still not looked at a couple of areas which can give us some additional performance gains and make our application a bit more efficient. One of these areas is how Spark partitions its data and processes those partitions. If the partitions are too small, Spark will spend more time scheduling and starting tasks than processing data, and this also leads to additional shuffling. So a very granular approach may not be the answer. In the same way, having very large partitions may lead to inefficient use of cluster resources. The answer lies somewhere between these two extremes. If you have run the code in any of the previous posts you will have … Read more
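As a rough illustration of the two extremes, here is a small sketch (with a placeholder input path and arbitrary partition counts) showing how coalesce() and repartition() change the partition count of an RDD.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSizing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder input path; the real post uses its own dataset.
    val lines = sc.textFile("s3://your-bucket/input/")
    println(s"Initial partitions: ${lines.getNumPartitions}")

    // Too many tiny partitions: scheduling overhead dominates.
    // coalesce() merges partitions without a full shuffle.
    val fewer = lines.coalesce(8)

    // Too few huge partitions: poor parallelism. repartition() forces
    // a shuffle but spreads the data evenly across the cluster.
    val more = lines.repartition(64)

    println(s"After coalesce: ${fewer.getNumPartitions}, after repartition: ${more.getNumPartitions}")
    spark.stop()
  }
}
```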

Spark RDDs Performance Tuning – Shuffling & Caching

Let’s look at the next step in our performance tuning journey. This one involves changing which APIs are called and how data is stored in distributed memory. Both involve code changes – not too many, but they do help. Both these topics have been covered to an extent in some of my previous blog posts. For a quick recap – refer to the links below: Resource Planning, Spark Caching, Spark Shuffling. We will use the same dataset and the same transformations (as in the previous post) but with minor changes, which are highlighted, and see how we can reduce shuffling and bring down the runtimes further. Reducing Shuffling There are two simple ways of reducing shuffling: reduce the dataset on which the shuffle occurs, or switch to a more efficient API. Reduce dataset size When doing data analytics it is usually observed that not all the attributes which … Read more
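A minimal sketch of both ideas, using made-up word-count data and a placeholder path: shrink the data before the shuffle, and prefer reduceByKey (which combines map-side) over groupByKey.

```scala
import org.apache.spark.SparkContext

// Sketch of the two ideas from the excerpt, with made-up data.
def wordCounts(sc: SparkContext): Unit = {
  val pairs = sc.textFile("s3://your-bucket/text/")   // placeholder path
    .flatMap(_.split("\\s+"))
    .filter(_.nonEmpty)                                // reduce the dataset first
    .map(word => (word, 1))

  // groupByKey shuffles every record; reduceByKey combines locally
  // on each partition before shuffling, moving far less data.
  val slow = pairs.groupByKey().mapValues(_.sum)
  val fast = pairs.reduceByKey(_ + _)

  println(fast.take(10).mkString(", "))
}
```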

Spark RDDs Performance Tuning – Resource Planning

Introduction Tuning Spark jobs can dramatically increase performance and help squeeze more from the resources at hand. This post is a high-level overview of how to tune Spark jobs and talks about the various tools at your disposal. Performance tuning may be black magic to some, but to most engineers it comes down to: how many resources are provided, how well the provided resources are used, and how you write your code. Before you can tune a Spark job it is important to identify where the potential performance bottlenecks are: resources – memory, CPU cores and executors; partitioning & parallelism; long-running straggling tasks; caching. To help with the above areas Spark provides (and has access to) some tools which can be used for tuning: Resource Manager UI (for example – YARN), Spark Web UI & History Server, Tungsten & Catalyst Optimizers, Explain Plan. For the purpose of this post, a pre-cleaned dataset … Read more
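As a hedged example of what a resource plan can look like, the same properties that go into spark-defaults.conf can also be set through the SparkSession builder. The actual values in the post depend on the cluster, so treat these numbers as placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource plan; all values are placeholders, not the
// numbers used in the original post.
val spark = SparkSession.builder()
  .appName("resource-planning-example")
  .config("spark.executor.instances", "6")   // number of executors
  .config("spark.executor.cores", "4")       // CPU cores per executor
  .config("spark.executor.memory", "8g")     // heap per executor
  .config("spark.default.parallelism", "48") // default partitions for wide RDD ops
  .getOrCreate()
```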

Spark Partitioning

Introduction Transparently partitioning data across the worker nodes is one of the best features of Spark. It has made writing code simpler. Partitioning done improperly leads to data skew and eventually to performance issues. So having the right number of partitions, with data in the right place, is important to reduce performance issues and data shuffling. This post will cover the various types of partitioning and show how you can quickly utilize this in your Spark jobs. IMPORTANT: Partitioning APIs are only available for pair RDDs. Pair RDDs are RDDs which are based on key, value pairs. Before diving into partitioning let’s look at a couple of highlights around partitions. Each partition resides on only one worker node; it cannot span across nodes. Every worker node in a Spark cluster has at least one partition and can have more. Whenever you specify a partitioning strategy (Standard or Custom) make sure you persist/cache … Read more
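A minimal sketch of the point about pair RDDs, using made-up key/value data: hash-partition by key and persist the result so later key-based operations reuse the partitioning instead of shuffling again.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def partitionedLookup(sc: SparkContext): Unit = {
  // Pair RDD with placeholder data; partitioning APIs only apply to key/value RDDs.
  val events = sc.parallelize(Seq(("user-1", 10), ("user-2", 5), ("user-1", 7)))

  // Hash-partition by key and persist, so subsequent joins/aggregations on the
  // same key reuse the partitioning rather than shuffling again.
  val byUser = events
    .partitionBy(new HashPartitioner(8))
    .persist(StorageLevel.MEMORY_ONLY)

  println(byUser.reduceByKey(_ + _).collect().mkString(", "))
}
```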

Spark Shuffle

Introduction We cannot prevent shuffling; we can only reduce it. Ask a Spark developer about performance issues and the two things he/she will talk about are shuffling & partitioning. In this entry, we will focus just on shuffling. It would be a good idea to read one of the earlier posts on the Spark job execution model. When data is distributed across a Spark cluster, it is not always where it should be. Therefore, from time to time Spark may need to move data across the various nodes to complete specific computations. Shuffling is the process of redistributing data across the various partitions; data is moved across nodes. Technically a shuffle consists of network I/O, disk I/O and data serialisation. Network I/O is the most expensive operation and is usually the focus when everyone talks about shuffling. Disk I/O is probably surprising to many but is interesting as well … Read more
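A small sketch, with made-up data, of where a shuffle appears in a lineage: the narrow map stays local, while reduceByKey redistributes records by key, and toDebugString shows the resulting stage boundary.

```scala
import org.apache.spark.SparkContext

// A narrow map does not move data; reduceByKey redistributes it by key
// and creates a new stage. toDebugString prints the lineage with the boundary.
def showShuffle(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(1 to 1000)        // placeholder data
    .map(n => (n % 10, n))                     // narrow: no data movement

  val sums = pairs.reduceByKey(_ + _)          // wide: network + disk I/O + serialisation

  // Lines marked "ShuffledRDD" in the lineage are where the shuffle happens.
  println(sums.toDebugString)
}
```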

Spark Job Execution Model

Introduction I have been wanting to write this entry for a long time. The Spark Job Execution Model – how Spark works internally – is an important topic of discussion. Knowledge of the internal execution engine provides additional help when doing performance tuning. Let’s look at Spark’s execution model. The flow of execution of any Spark program can be explained using the following diagram. Spark provides all this information quite nicely in its chatty logs. If you are free, please take a look at the Spark logs; they offer quite good information. More on the logs later. Now let’s see how jobs are actually executed. We know the runtime components of a Spark cluster, so we are already aware of some of the components. In this entry we will look at how these components interact, what the sequence of events is and where it all happens. We also introduce two more … Read more
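A tiny sketch, with made-up data, that you can run and compare against the Spark Web UI: each action launches a job, and the reduceByKey shuffle splits that job into stages.

```scala
import org.apache.spark.sql.SparkSession

// Each action launches a job; each shuffle boundary splits a job into stages.
// Run this and compare against the Jobs/Stages tabs in the Spark Web UI.
object ExecutionModelDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("execution-model-demo").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(1 to 100000)     // placeholder data
      .map(n => (n % 100, 1))
      .reduceByKey(_ + _)                        // shuffle => stage boundary

    counts.count()                               // action #1 => first job
    counts.take(5).foreach(println)              // action #2 => second job

    spark.stop()
  }
}
```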

Spark Core – Caching

Introduction A cache in the computing world is a hardware or software component that stores data so that future requests can be served faster. A cache stores either some pre-computed data or a copy of data stored somewhere else. Reading data from the cache is usually faster than recomputing a result or reading from a slower data store, and hence software performance is increased. The process of seeding/populating the cache is called caching. Spark is very good at in-memory processing. We also know it is lazy when it comes to processing. Guess what – Spark caching is also lazy in nature. So let’s take an example with the following assumptions: Spark has to perform two different actions on the same data, and there is a common intermediate RDD in the lineage. Without caching, Spark computes all the RDDs in the DAG to generate each result. However, once the action … Read more
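A minimal sketch of the caching scenario described above, assuming a placeholder input path: two actions share a common intermediate RDD, which is persisted so the second action does not recompute the lineage.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Two actions share a common intermediate RDD. Without caching the lineage
// is recomputed for each action; with persist() the second action reads the
// materialised partitions from memory.
def cachedPipeline(sc: SparkContext): Unit = {
  val cleaned = sc.textFile("s3://your-bucket/logs/")   // placeholder path
    .filter(_.nonEmpty)
    .map(_.toLowerCase)
    .persist(StorageLevel.MEMORY_ONLY)                  // lazy: nothing cached yet

  println(cleaned.count())              // first action materialises and caches
  println(cleaned.distinct().count())   // second action reuses the cached data
}
```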