Spark & Redis

One of the more interesting use cases for Redis is pairing it with the Apache Spark in-memory computation engine. You can, in a sense, use Redis as a backend to persist Spark objects – DataFrames, Datasets or RDDs – in the Redis cache alongside other cached data. A very handy library called Spark-Redis enables this, and it offers both Scala and Python APIs. Because Redis persists the data, it can also act as a backend for sharing common data between jobs, rather than loading the same data again and again. This makes Redis an invaluable tool for big data developers. In this blog post, we will use both the Scala and Python APIs to read and write DataFrames and RDDs to/from Redis. Using the Scala API In this section, we will read and write to a Redis cluster using Scala and … Read more
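The excerpt above describes persisting Spark structures in Redis via the Spark-Redis library. A minimal sketch of writing and reading a DataFrame might look like the following – the `org.apache.spark.sql.redis` format and the `spark.redis.*` configuration keys follow the Spark-Redis documentation, while the host, table and column names here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object RedisSketch extends App {
  // Point Spark at the Redis instance via spark-redis configuration keys.
  val spark = SparkSession.builder()
    .appName("spark-redis-sketch")
    .master("local[*]")
    .config("spark.redis.host", "localhost") // hypothetical host
    .config("spark.redis.port", "6379")
    .getOrCreate()

  import spark.implicits._
  val people = Seq(("john", 35), ("jane", 32)).toDF("name", "age")

  // Persist the DataFrame as Redis hashes under the "people" table prefix.
  people.write
    .format("org.apache.spark.sql.redis")
    .option("table", "people")
    .option("key.column", "name")
    .mode("overwrite")
    .save()

  // A different job could read the same data back without reloading the source.
  val loaded = spark.read
    .format("org.apache.spark.sql.redis")
    .option("table", "people")
    .option("key.column", "name")
    .load()
  loaded.show()
}
```

Because the data lives in Redis rather than in one job's executor memory, a second Spark application can issue the same `read` without recomputing or re-ingesting anything.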

Spark & Ad-hoc Querying

So far we have written programs that are essentially compiled JARs running on some Spark cluster. This works fine if the use case is batch processing data that is consumed by downstream applications. But what if we (as users) want to use the raw computing power of Spark for some ad-hoc SQL queries? Aah, the good old world of an SQL command line or an SQL editor, firing off a bunch of queries. What about JDBC connections from some BI applications? Well, look no further – all of that is possible!! In this blog post, we will look at how multiple users can interface with Spark for ad-hoc querying by creating a JDBC connection to the Spark Thrift Server and firing queries against data stored in Hive. Before we start, let's see what is required to get going Access to a … Read more
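As a rough sketch of the JDBC path described above: the Spark Thrift Server speaks the HiveServer2 protocol, so a client connects with a standard `jdbc:hive2://` URL through the Hive JDBC driver (which must be on the classpath). The host, port, credentials and table name below are placeholders:

```scala
import java.sql.DriverManager

object ThriftSketch extends App {
  // Default Thrift Server port is 10000; host/credentials are placeholders.
  val url = "jdbc:hive2://localhost:10000/default"
  val conn = DriverManager.getConnection(url, "user", "")
  try {
    val stmt = conn.createStatement()
    // Any SQL the Thrift Server can plan runs here, just as from an SQL editor.
    val rs = stmt.executeQuery("SELECT count(*) FROM some_hive_table")
    while (rs.next()) println(s"rows = ${rs.getLong(1)}")
  } finally conn.close()
}
```

A BI tool follows the same shape: it is just another JDBC client pointed at the same URL, which is what lets multiple users share one running Spark application.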

Spark & XML

XML is another well-known document format that is worth exploring with Spark. It is used in various systems and is supposed to be “human readable” – though I have my doubts when looking at some really big XML documents. Having said that, it is still possible to read, parse and understand an XML document in Spark. Although Spark does not have native support for XML as it does for JSON, things are not all that bad: Databricks provides a library called Spark-XML for parsing XML documents, and it is actively maintained. Reading XML documents To make it easier to understand how to read XML documents, this blog post is divided into two parts: simple XML documents and nested XML documents. Before we can read any XML documents, we need to add the spark-xml library to our IntelliJ development environment. Add the following line to build.sbt … Read more
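The excerpt cuts off at the build.sbt change; for reference, a spark-xml dependency line typically looks like the one below. The version number is illustrative – check the Databricks spark-xml releases for a current one:

```scala
// build.sbt – illustrative version number, not an endorsement of a release
libraryDependencies += "com.databricks" %% "spark-xml" % "0.12.0"
```

Once the dependency resolves, reading goes through `spark.read.format("com.databricks.spark.xml")` with a `rowTag` option naming the XML element that maps to one row of the resulting DataFrame.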

Spark & JSON

JSON is a widely used data-interchange format, and Spark processes JSON documents natively, so no additional setup is required. Reading JSON Documents To keep things easy to follow, I have divided this post into three sub-sections: simple JSON documents, nested JSON documents, and nested JSON documents with arrays inside them. As we move from the simple to the more complex cases, we will see how use of the API grows in complexity – yet also how the Spark API keeps things easy to understand. Simple JSON Documents This is the simplest kind of document and may contain one or more sets of attributes. For example, below is an example with data about just one person. This file is in the project as simple.json See the code below to parse and read this JSON file Let's analyse the code Step 1 – Creates a spark … Read more
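The simple case described above can be sketched as follows – the file path is hypothetical, and note that `spark.read.json` expects JSON Lines (one object per line) by default; a pretty-printed file needs `option("multiLine", "true")`:

```scala
import org.apache.spark.sql.SparkSession

object SimpleJsonSketch extends App {
  // Step 1 – create a SparkSession (the entry point to the DataFrame API).
  val spark = SparkSession.builder()
    .appName("json-sketch")
    .master("local[*]")
    .getOrCreate()

  // Step 2 – read the file; Spark infers the schema from the attributes.
  val df = spark.read.json("src/main/resources/simple.json")

  // Step 3 – inspect what Spark inferred.
  df.printSchema()
  df.show()
}
```

For a one-person document such as `{"name": "john", "age": 35}` on a single line, the inferred schema would simply be one column per attribute.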

Spark & RDBMS

Spark interacts with various traditional open-source and proprietary RDBMSs via JDBC connectivity. If a JDBC driver is available for the RDBMS, it can be used as a source or a sink. In this blog post, we look at how Spark integrates with the open-source Postgres database; the process is the same, or nearly so, for other RDBMSs. Setup JDBC Driver Libraries Spark needs to access the Postgres database via a JDBC driver, so a Postgres driver must be added to the development environment. To do this, update build.sbt to include the Postgres driver libraries in addition to the Spark libraries. See below – the build.sbt for this blog entry should look like this. Database Setup To read from the database, we have created two tables in the public schema of a Postgres database. This completes the setup for this blog entry. … Read more
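With the Postgres driver on the classpath, a JDBC read follows Spark's built-in `jdbc` data source. A minimal sketch – the connection URL, credentials and table name are placeholders for whatever the setup above created:

```scala
import org.apache.spark.sql.SparkSession

object PostgresSketch extends App {
  val spark = SparkSession.builder()
    .appName("postgres-sketch")
    .master("local[*]")
    .getOrCreate()

  // Read one of the tables from the public schema; all values are placeholders.
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "public.some_table")
    .option("user", "postgres")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
  df.show()
}
```

Writing is symmetric: `df.write.format("jdbc")` with the same options (plus a save mode) makes the database a sink, which is what makes the JDBC route so uniform across RDBMSs.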

Spark & Cassandra

Cassandra is a well-known open-source NoSQL database. It is highly scalable, highly available and performant, with no single point of failure. These features make it one of the most widely adopted open-source technologies. This post is about how Spark can leverage Cassandra as a data source, reading from it and writing to it. Before we move any further, there are a couple of assumptions for this blog post Have a working copy of Cassandra. If not, then please take a look here on this link Knowledge of CQL commands. If not, then please take a look here on this link. Setup for Cassandra Setting up the libraries to access Cassandra is relatively easy and can be done by including the DataStax libraries in your project. If you are using IntelliJ and SBT, it is as simple as adding a line to build.sbt below For this blog … Read more
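Once the DataStax connector is on the classpath, reading a Cassandra table registers through the `org.apache.spark.sql.cassandra` format, as in the sketch below. The connection host, keyspace and table names are placeholders, and the configuration key follows the spark-cassandra-connector documentation:

```scala
import org.apache.spark.sql.SparkSession

object CassandraSketch extends App {
  val spark = SparkSession.builder()
    .appName("cassandra-sketch")
    .master("local[*]")
    .config("spark.cassandra.connection.host", "localhost") // placeholder host
    .getOrCreate()

  // Keyspace and table names are placeholders for whatever CQL created.
  val df = spark.read
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "my_keyspace")
    .option("table", "my_table")
    .load()
  df.show()
}
```

Writing mirrors this with `df.write.format("org.apache.spark.sql.cassandra")` and the same keyspace/table options.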

Spark & Amazon S3

Introduction Till now we have concentrated only on reading data from local file systems, which may be fine for some use cases but does not apply to big data and/or cloud-based environments. Everyone knows about Amazon Web Services and the hundreds of services it offers. One of its earliest and most used services is Simple Storage Service, or simply S3. You can read more about S3 on this link In this blog entry, we try to see how to develop a Spark-based application which reads from and/or writes to AWS S3, and which can later be deployed on the AWS cloud. But before we do that, we need to write a program that works. Before we begin, there are a couple of assumptions here – Understand the basics of AWS Identity & Access Management – like creating a user, access key and secret access key. If not, check this link Understand how … Read more
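A minimal sketch of the S3 setup described above, assuming the `hadoop-aws` module (which provides the `s3a://` filesystem) is on the classpath. The `fs.s3a.*` keys are the standard Hadoop S3A configuration names; the credential values, bucket and object key are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object S3Sketch extends App {
  val spark = SparkSession.builder()
    .appName("s3-sketch")
    .master("local[*]")
    .getOrCreate()

  // Supply the IAM user's keys to the s3a connector; values are placeholders.
  val hc = spark.sparkContext.hadoopConfiguration
  hc.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
  hc.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")

  // Read directly from the bucket; bucket and file names are placeholders.
  val df = spark.read.option("header", "true").csv("s3a://my-bucket/data.csv")
  df.show()
}
```

When the application is later deployed on AWS, hard-coded keys give way to instance roles or environment variables; the read/write code itself stays unchanged.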