Spark & Redis

One of the fantastic use cases of Redis is using it alongside the Apache Spark in-memory computation engine. In some sense, you can use it as a backend to persist Spark objects – DataFrames, Datasets or RDDs – in the Redis cache alongside other cached objects. A very handy library called Spark-Redis enables this, and it offers both Scala and Python APIs. Because Redis persists the data and acts as a backend, it can be used to share common data between various jobs rather than loading the same data again and again. This makes Redis an invaluable tool for big data developers. In this blog post, we will use both the Scala and Python APIs to read and write DataFrames and RDDs to/from Redis. Using Scala API In this section, we will read and write to a Redis cluster using Scala and … Read more
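For a flavour of what the Scala side looks like, here is a minimal sketch using the spark-redis DataFrame source. The Redis host, the tiny sample data and the "people" table name are assumptions made for illustration, not values taken from the post.

```scala
import org.apache.spark.sql.SparkSession

object RedisExample {
  def main(args: Array[String]): Unit = {
    // Redis host/port and the sample data are placeholder values
    val spark = SparkSession.builder()
      .appName("spark-redis-sketch")
      .master("local[*]")
      .config("spark.redis.host", "localhost")
      .config("spark.redis.port", "6379")
      .getOrCreate()

    import spark.implicits._
    val people = Seq(("John", 35), ("Jane", 32)).toDF("name", "age")

    // Write the DataFrame into Redis under the "people" table prefix
    people.write
      .format("org.apache.spark.sql.redis")
      .option("table", "people")
      .option("key.column", "name")
      .mode("overwrite")
      .save()

    // Another job can read it back instead of reloading the original source
    val fromRedis = spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "people")
      .option("key.column", "name")
      .load()
    fromRedis.show()
  }
}
```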

Spark DataSets Performance Tuning – Resource Planning

Introduction In this post we will focus on Datasets and see how they can be tuned, and also see whether they are more efficient than working with RDDs even though Datasets are converted to RDDs for processing. More on that later. To set the stage for this post we will use the same resource plans as were used in Spark RDDs Performance Tuning – Resource Planning, and will answer the same queries, using the same data and a cluster with exactly the same components & resources. That way we will have a nice comparison as well. If you haven’t read it – then you should :). So let’s dive in, run our tests and see how everything stacks up. Step – 1 Modify spark-defaults.conf Make the following changes to spark-defaults.conf. In Amazon EMR it is found in the /etc/spark/conf/ directory Step – 2 Cluster Configuration For … Read more
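As a rough illustration of the kind of properties that step touches, here is a hedged spark-defaults.conf fragment. The property names are standard Spark settings, but the values are placeholders and not the actual resource plan used in the post.

```
# /etc/spark/conf/spark-defaults.conf -- illustrative values only
spark.executor.instances          4
spark.executor.cores              2
spark.executor.memory             4g
spark.driver.memory               2g
spark.dynamicAllocation.enabled   false
```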

Spark RDDs Performance Tuning – Partitioning

Now that we have gone through resource planning, shuffling & caching to tune our Spark application, we still have not looked at a couple of areas which can give us some additional performance gains and make our application a bit more efficient. One of these areas is how Spark partitions its data and processes those partitions. If the partitions are too small, Spark will spend more time scheduling and starting tasks than it spends processing data, and of course this leads to additional shuffling. So the very granular approach may not be the answer. In the same way, very big partitions may lead to inefficient use of cluster resources. The answer lies somewhere in between these two extremes. If you have run the code in any of the previous posts you will have … Read more
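To make the partition-count discussion concrete, here is a small sketch showing how the number of partitions can be inspected and adjusted with standard RDD calls. The input path is a placeholder, not the dataset used in the post.

```scala
import org.apache.spark.sql.SparkSession

object PartitioningExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-tuning-sketch")
      .master("local[*]")
      .getOrCreate()

    // "input/data.csv" is a placeholder path
    val rdd = spark.sparkContext.textFile("input/data.csv")
    println(s"Initial partitions: ${rdd.getNumPartitions}")

    // Too many tiny partitions: task scheduling overhead dominates processing
    val tooGranular = rdd.repartition(1000)

    // Too few large partitions: cores sit idle and memory pressure grows
    val tooCoarse = rdd.coalesce(2)

    // Somewhere in between, e.g. a small multiple of the available parallelism
    val balanced = rdd.repartition(spark.sparkContext.defaultParallelism * 2)
    println(s"Balanced partitions: ${balanced.getNumPartitions}")
  }
}
```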

Spark RDDs Performance Tuning – Shuffling & Caching

Let’s look at the next step in our performance tuning journey. This one involves changing which APIs are called and how data is stored in distributed memory. Both involve code changes – not too many, but they do help. Both these topics have been covered to an extent in some of my previous blog posts. For a quick recap – refer to the links below Resource Planning Spark Caching Spark Shuffling We will use the same dataset and the same transformations (as in the previous post) but with minor changes, which are highlighted, and see how we can reduce shuffling and bring down the runtimes further. Reducing Shuffling There are two simple ways of reducing shuffling. Reduce the dataset on which the shuffle occurs. Change the code to use a more efficient API. Reduce dataset size When doing data analytics it is usually observed that not all the attributes which … Read more
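As a hedged sketch of those two ideas – shrinking the data before the shuffle and preferring a shuffle-efficient API such as reduceByKey, then caching the reused result – consider the following. The tiny inline dataset is only a stand-in for the dataset used in the post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ShuffleCachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-caching-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Placeholder (state, amount) records standing in for the real dataset
    val records = sc.parallelize(Seq(("NY", 10), ("CA", 20), ("NY", 5), ("CA", 1)))

    // 1. Reduce the dataset before the shuffle: drop rows/attributes early
    val trimmed = records.filter { case (_, amount) => amount > 1 }

    // 2. Prefer reduceByKey (map-side combine) over groupByKey + map
    val totals = trimmed.reduceByKey(_ + _)

    // Cache the result if several actions reuse it, so the shuffle and the
    // upstream work are not repeated
    totals.persist(StorageLevel.MEMORY_ONLY_SER)
    println(totals.count())
    totals.collect().foreach(println)
  }
}
```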

Spark RDDs Performance Tuning – Resource Planning

Introduction Tuning Spark jobs can dramatically increase performance and help squeeze more from the resources at hand. This post is a high-level overview of how to tune Spark jobs and talks about the various tools at your disposal. Performance tuning may be black magic to some, but to most engineers it comes down to How many resources are provided How well the provided resources are used How you write your code Before you can tune a Spark job it is important to identify where the potential performance bottlenecks are. Resources – Memory, CPU Cores and Executors Partitioning & Parallelism Long-running straggling tasks Caching To help with the above areas Spark provides (and has access to) some tools which can be used for tuning Resource Manager UI (for example – YARN) Spark Web UI & History Server Tungsten & Catalyst Optimizers Explain Plan For the purpose of this post, a pre-cleaned dataset … Read more
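For context, the resource knobs mentioned above are typically handed to Spark at submit time. The spark-submit invocation below is only illustrative – the flags are standard options, but the class name, jar and numbers are placeholders rather than the plan used in this post.

```bash
# Class, jar and resource numbers are placeholders for illustration
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  --class com.example.ResourcePlanningJob \
  resource-planning-job.jar
```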

Spark & Ad-hoc Querying

Till now we have written programs which are essentially compiled jars running on some Spark cluster. This works fine if the use case is batch processing data that is consumed by some downstream applications. But what if we (as users) want to use the raw computing power of Spark for some ad-hoc SQL queries? Aah, the good old world of an SQL command line or an SQL editor, firing a bunch of queries. What about JDBC connections from some BI applications? Well, look no further – all that is possible! In this blog post, we will look at how multiple users can interface with Spark to do ad-hoc querying using the Spark Thrift Server, by creating a JDBC connection to it and firing some queries on data stored in Hive. Before we start on this let’s see what all is required to get going Access to a … Read more
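As a rough sketch of what such a JDBC client could look like once the Thrift Server is running (it ships with Spark and is started via $SPARK_HOME/sbin/start-thriftserver.sh), here is a minimal Scala example using the Hive JDBC driver. The host, port, user and table name are assumptions, and the hive-jdbc dependency is assumed to be on the classpath.

```scala
import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // Host, port, user and table name are placeholders; the Thrift Server
    // listens on port 10000 by default.
    val url = "jdbc:hive2://localhost:10000/default"
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    val conn = DriverManager.getConnection(url, "hadoop", "")
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery("SELECT count(*) FROM some_hive_table")
      while (rs.next()) println(s"row count = ${rs.getLong(1)}")
    } finally {
      conn.close()
    }
  }
}
```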

Spark & XML

XML is another well-known document format which is worth exploring when using Spark. It is used in various systems and is supposed to be “human readable” – though I have my doubts when I look at some really big XML documents. Having said that, it is still possible to read, parse and understand an XML document in Spark. Though Spark does not have native support for XML as it does for JSON, things are not all that bad. There is a library called Spark-XML for parsing XML documents, provided and actively maintained by Databricks. Reading XML documents To make it easier to understand how to read XML documents, this blog post is divided into two parts Simple XML documents Nested XML documents Before we can read any XML documents we need to add the spark-xml library to our IntelliJ development environment. Add the following line to build.sbt … Read more
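As a hedged preview of the spark-xml usage covered in the post, the sketch below reads a hypothetical people.xml file with a person row tag. The file name, row tag and library version are assumptions made for illustration.

```scala
// build.sbt – the version number is a placeholder
// libraryDependencies += "com.databricks" %% "spark-xml" % "0.11.0"

import org.apache.spark.sql.SparkSession

object XmlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-xml-sketch")
      .master("local[*]")
      .getOrCreate()

    // "people.xml" and the "person" row tag are illustrative assumptions
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("people.xml")

    df.printSchema()
    df.show()
  }
}
```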

Spark & JSON

JSON is a widely used data-interchange format. Spark provides native processing for JSON documents, so no additional setup is required. Reading JSON Documents To make this section easy to follow, I have divided this post into three sub-sections Simple JSON documents Nested JSON documents Nested JSON documents with arrays inside them. As we go from simple to more complex cases, we will see how the API usage increases in complexity. However, we will also see how the Spark API keeps it easy to understand. Simple JSON Documents This is the simplest of all documents and may contain one or more sets of attributes. For example, below is a document with data about just one person. This file is in the project as simple.json See the code below to parse and read this JSON file Let’s analyse the code Step 1 – Creates a spark … Read more
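As a minimal sketch of reading such a simple document, the example below loads a file named simple.json (the name mentioned above); whether the multiLine option is needed depends on how the file is laid out.

```scala
import org.apache.spark.sql.SparkSession

object SimpleJsonExample {
  def main(args: Array[String]): Unit = {
    // Step 1 – create a SparkSession
    val spark = SparkSession.builder()
      .appName("spark-json-sketch")
      .master("local[*]")
      .getOrCreate()

    // By default Spark expects one JSON record per line; for a single
    // pretty-printed object add .option("multiLine", true)
    val df = spark.read.json("simple.json")

    df.printSchema()
    df.show()
  }
}
```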

Spark & RDBMS

Spark interacts with various traditional open-source and proprietary RDBMSs via JDBC connectivity. If there is a JDBC driver available for the RDBMS, then it can be used as a source or a sink. In this blog post, we look at how Spark integrates with the open-source Postgres database. The process would be the same, or very similar, for other RDBMSs as well. Setup JDBC Driver Libraries Spark needs to access the Postgres database via a JDBC driver. To enable this, the Postgres driver needs to be added to the development environment. To do this, update build.sbt to include the Postgres driver libraries in addition to the Spark libraries. See below. The build.sbt for this blog entry should look like this. Database Setup To read from the database we have created two tables in the public schema of a Postgres database. This completes the setup for this blog entry. … Read more
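As a hedged sketch of the JDBC read path described here, the example below assumes a local Postgres instance with placeholder credentials and a hypothetical public.employees table; the driver version is also a placeholder.

```scala
// build.sbt – driver version is a placeholder
// libraryDependencies += "org.postgresql" % "postgresql" % "42.2.5"

import org.apache.spark.sql.SparkSession

object PostgresExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-jdbc-sketch")
      .master("local[*]")
      .getOrCreate()

    // URL, credentials and table name are placeholders
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb")
      .option("dbtable", "public.employees")
      .option("user", "postgres")
      .option("password", "postgres")
      .option("driver", "org.postgresql.Driver")
      .load()

    df.show()
  }
}
```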

Spark & Cassandra

Cassandra is a well-known open source NoSQL database. It is a highly scalable, highly available and performant NoSQL database with no single point of failure. These features make it one of the most widely adopted open source technologies. This post is about how Spark can leverage Cassandra as a data source, reading from and writing to it. Before we move any further – there are a couple of assumptions for this blog post Have a working copy of Cassandra. If not, then please take a look at this link Knowledge of CQL commands. If not, then please take a look at this link. Setup for Cassandra Setting up libraries to access Cassandra is relatively easy and can be done by including the DataStax libraries in your project. If you are using IntelliJ and SBT, it is as simple as adding a line to build.sbt, as shown below For this blog … Read more
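As a rough sketch of what reading and writing through the DataStax connector looks like, consider the following; the connector version, contact point, keyspace and table names are all assumptions for illustration.

```scala
// build.sbt – connector version is a placeholder
// libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.0"

import org.apache.spark.sql.SparkSession

object CassandraExample {
  def main(args: Array[String]): Unit = {
    // Contact point, keyspace and table names are placeholders
    val spark = SparkSession.builder()
      .appName("spark-cassandra-sketch")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    // Read a Cassandra table into a DataFrame
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test_ks", "table" -> "users"))
      .load()

    df.show()

    // Writing back uses the same format
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test_ks", "table" -> "users_copy"))
      .mode("append")
      .save()
  }
}
```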