Spark DataSets Performance Tuning – Resource Planning
Introduction
In this post we focus on Datasets: how they can be tuned, and whether they are more efficient than working with RDDs, even though Datasets are converted to RDDs for processing. More on that later.

To set the stage, we will use the same resource plans that were used in Spark RDDs Performance Tuning – Resource Planning, and we will answer the same queries, using the same data, on a cluster with exactly the same components and resources. That gives us a direct comparison as well. If you haven’t read that post, you should :).

So let’s dive in, run our tests, and see how everything stacks up.

Step – 1 Modify spark-defaults.conf
Make the following changes to spark-defaults.conf. On Amazon EMR it is found in the /etc/spark/conf/ directory.

Step – 2 Cluster Configuration
For …