Spark Partitioning
Introduction

Transparently partitioning data across worker nodes is one of Spark's best features, and it makes writing distributed code much simpler. Partitioning done poorly, however, leads to data skew and, eventually, performance problems. Having the right number of partitions, with the data in the right place, is essential to minimize shuffling and avoid those problems. This post will cover the various types of partitioning and show how you can use them in your Spark jobs quickly.

IMPORTANT: Partitioning APIs are only available on pair RDDs, i.e. RDDs of key-value pairs.

Before diving into partitioning, let's look at a couple of highlights about partitions:

- Each partition resides on exactly one worker node; it cannot span across nodes.
- Every worker node in a Spark cluster has at least one partition, and can have more.
- Whenever you specify a partitioning strategy (standard or custom), make sure you persist/cache the resulting RDD; otherwise the shuffle that placed the data is re-executed every time the RDD is reused. The sketch below illustrates this.
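Here is a minimal Scala sketch tying these points together: it builds a pair RDD (since, as noted above, partitioning APIs live on key-value RDDs), applies a standard hash-partitioning strategy, and persists the result. The app name, local master, sample data, and choice of 4 partitions are illustrative assumptions, not from the original post.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholder assumptions for a runnable demo.
    val spark = SparkSession.builder()
      .appName("partitioning-sketch")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A pair RDD: partitionBy and friends are defined on RDD[(K, V)]
    // via PairRDDFunctions, not on ordinary RDDs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // Apply a standard strategy (hash partitioning into 4 partitions)
    // and persist the result. Without persist/cache, the shuffle that
    // routes each key to its target partition would run again every
    // time `partitioned` is reused.
    val partitioned = pairs
      .partitionBy(new HashPartitioner(4))
      .persist(StorageLevel.MEMORY_ONLY)

    println(s"Number of partitions: ${partitioned.getNumPartitions}")

    // Keys with the same hash land in the same partition, so this
    // reduceByKey needs no additional shuffle.
    partitioned.reduceByKey(_ + _).collect().foreach(println)

    spark.stop()
  }
}
```

Note the design choice here: `partitionBy` returns a new RDD whose partitioning is known to Spark, so later key-based operations on it can skip the shuffle, but only if the partitioned RDD is persisted and reused rather than recomputed.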