Spark & XML

XML is another popular document type worth exploring with Spark. It is used in many systems and is supposed to be “human readable” – though that is debatable once you look at some really big XML documents. Having said that, it is still possible to read, parse and understand an XML document in Spark. Although Spark does not have native support for XML as it does for JSON, things are not all that bad: Databricks provides a library called spark-xml to parse XML documents, and it is actively maintained.

Reading XML documents. To make it easier to understand how to read XML documents, this blog post is divided into two parts: simple XML documents and nested XML documents. Before we can read any XML documents we need to add the spark-xml library to our IntelliJ development environment. Add the following line to build.sbt … Read more
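A minimal sketch of what that setup and read could look like follows; the spark-xml version, the `rowTag` value and the file name are assumptions, not taken from the truncated post.

```scala
// build.sbt – spark-xml coordinates; the version shown here is an assumption
// libraryDependencies += "com.databricks" %% "spark-xml" % "0.12.0"

import org.apache.spark.sql.SparkSession

object ReadSimpleXml {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-xml-example")
      .master("local[*]")
      .getOrCreate()

    // "person" as the row tag and "simple.xml" as the file name are hypothetical
    val personsDf = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("simple.xml")

    personsDf.printSchema()
    personsDf.show()

    spark.stop()
  }
}
```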

Spark & JSON

JSON is a widely used data-interchange format. Spark processes JSON documents natively, so no additional setup is required.

Reading JSON documents. To keep this section easy to follow, I have divided this post into three sub-sections: simple JSON documents, nested JSON documents, and nested JSON documents with arrays inside them. As we move from simple to more complex cases, we will see how the API grows in complexity, yet remains easy to understand.

Simple JSON documents. This is the simplest kind of document and may contain one or more sets of attributes. For example, below is an example with data about just one person. This file is in the project as simple.json. See the code below to parse and read this JSON file. Let’s analyse the code. Step 1 – Creates a spark … Read more
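A minimal sketch of reading such a file could look like this; the file name simple.json comes from the post, everything else is illustrative.

```scala
import org.apache.spark.sql.SparkSession

object ReadSimpleJson {
  def main(args: Array[String]): Unit = {
    // Step 1 – create a Spark session
    val spark = SparkSession.builder()
      .appName("spark-json-example")
      .master("local[*]")
      .getOrCreate()

    // Step 2 – read the JSON file; spark.read.json infers the schema automatically
    val personDf = spark.read.json("simple.json")

    // Step 3 – inspect the inferred schema and the data
    personDf.printSchema()
    personDf.show()

    spark.stop()
  }
}
```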

Spark & RDBMS

Spark interacts with various traditional open-source and proprietary RDBMSs via JDBC connectivity. If a JDBC driver is available for an RDBMS, it can be used as a source or a sink. In this blog post, we look at how Spark integrates with the open-source Postgres database; the process is the same or very similar for other RDBMSs.

Setup JDBC driver libraries. Spark needs to access the Postgres database via a JDBC driver, so a Postgres driver needs to be added to the development environment. To do this, update build.sbt to include the Postgres driver libraries in addition to the Spark libraries. See below; the build.sbt for this blog entry should look like this.

Database setup. To read from the database we have created two tables in the public schema of a Postgres database. This completes the setup for this blog entry. … Read more
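As a rough sketch of the pieces involved, the snippet below adds the Postgres driver and reads a table over JDBC; the driver version, connection URL, table name and credentials are all placeholders, not values from the original post.

```scala
// build.sbt – Postgres JDBC driver alongside the Spark libraries (version is an assumption)
// libraryDependencies += "org.postgresql" % "postgresql" % "42.2.5"

import org.apache.spark.sql.SparkSession

object ReadFromPostgres {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-jdbc-example")
      .master("local[*]")
      .getOrCreate()

    // Connection details below (host, database, table, credentials) are placeholders
    val ordersDf = spark.read
      .format("jdbc")
      .option("driver", "org.postgresql.Driver")
      .option("url", "jdbc:postgresql://localhost:5432/sparkdb")
      .option("dbtable", "public.orders")
      .option("user", "spark_user")
      .option("password", "spark_password")
      .load()

    ordersDf.show()
    spark.stop()
  }
}
```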

Spark & Cassandra

Cassandra is a well-known open-source NoSQL database. It is highly scalable, highly available and performant, with no single point of failure. These features make it one of the most widely adopted open-source technologies. This post is about how Spark can leverage Cassandra as a data source for reading and writing.

Before we move any further, there are a couple of assumptions to this blog post: a working copy of Cassandra (if not, please take a look at this link) and knowledge of CQL commands (if not, please take a look at this link).

Setup for Cassandra. Setting up the libraries to access Cassandra is relatively easy and can be done by including the DataStax libraries in your project. If you are using IntelliJ and SBT, it is as simple as adding a line to build.sbt, shown below. For this blog … Read more
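A minimal sketch of that setup and a read could look like the following; the connector version, host, keyspace and table names are assumptions for illustration only.

```scala
// build.sbt – DataStax Spark Cassandra connector (version is an assumption)
// libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.0"

import org.apache.spark.sql.SparkSession

object ReadFromCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-cassandra-example")
      .master("local[*]")
      // Host of the running Cassandra node – a placeholder value
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Keyspace and table names are hypothetical
    val usersDf = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "demo")
      .option("table", "users")
      .load()

    usersDf.show()
    spark.stop()
  }
}
```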

Spark & Amazon S3

Introduction. Until now we have concentrated on reading data from local file systems, which may be fine for some use cases but does not apply to big data and cloud-based environments. Everyone knows about Amazon Web Services and the hundreds of services it offers. One of its earliest and most used services is the Simple Storage Service, or simply S3. You can read more about S3 on this link.

In this blog entry we look at how to develop a Spark-based application which reads from and/or writes to AWS S3, and which can later be deployed on the AWS cloud. But before we do that, we need to write a program that works. Before we begin, there are a couple of assumptions here: understand the basics of AWS Identity & Access Management, such as creating a user, an access key and a secret access key (if not, check this link), and understand how … Read more
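As a sketch under stated assumptions: the hadoop-aws module provides the s3a:// filesystem, credentials are read from environment variables here purely for illustration, and the bucket, object path and library version are hypothetical.

```scala
// build.sbt – the S3A filesystem comes from hadoop-aws; the version must match your Hadoop build
// libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"

import org.apache.spark.sql.SparkSession

object ReadFromS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-s3-example")
      .master("local[*]")
      .getOrCreate()

    // Credentials would normally come from IAM roles or a credentials provider;
    // reading them from environment variables here is an illustrative shortcut.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
    hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

    // Bucket and object names are hypothetical
    val salesDf = spark.read
      .option("header", "true")
      .csv("s3a://my-example-bucket/data/sales.csv")

    salesDf.show()
    spark.stop()
  }
}
```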

DataSet Operations – Multi-Dimensional Aggregation

Spark’s multi-dimensional aggregation methods, cube and rollup, allow the creation of aggregates across various combinations of columns. In a sense, they are an extension of groupBy with the features of a union operator rolled in. They can help you create pretty great datasets very quickly and easily and save you from writing an awful lot of complicated code to achieve the same result. Let’s unpick both these methods one by one and look at examples.

Rollup. Let’s assume that you have a dataset with product type, year, quarter and quantity columns. You have been asked to create a single dataset containing the following combinations of aggregations: total quantity by product type, year and quarter; by product type and year; by product type; and finally the total quantity across all product types, years and quarters. One way would be to create three datasets using groupBy … Read more
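The single-dataset alternative is a rollup; below is a minimal sketch using in-memory data, where the column names and values are illustrative, chosen to match the description above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RollupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rollup-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small in-memory dataset with the columns described in the post
    val sales = Seq(
      ("Laptop", 2018, "Q1", 100),
      ("Laptop", 2018, "Q2", 150),
      ("Phone",  2018, "Q1", 200),
      ("Phone",  2019, "Q1", 250)
    ).toDF("productType", "year", "quarter", "quantity")

    // rollup produces subtotals for (productType, year, quarter), (productType, year),
    // (productType) and the grand total, all in a single dataset
    val rolledUp = sales
      .rollup($"productType", $"year", $"quarter")
      .agg(sum($"quantity").as("totalQuantity"))

    rolledUp.orderBy($"productType", $"year", $"quarter").show()
    spark.stop()
  }
}
```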

DataSet Operations – Aggregation – Window Functions

Before we dive headfirst into window functions, it is a good idea to review the groupBy aggregate functions. An important thing to note: when groupBy aggregations are applied they reduce the number of rows returned, whereas window functions do not reduce the rows in the result set. A window describes the set of rows on which the window function operates; the function returns a value which is shown in all the rows of the window. One more thing: window functions also help us compare the current row with other rows.

Simple aggregations. Below is an example showing two operations to contrast the results: the avg function when a groupBy is applied, and the avg function when a window is applied. Let’s look at steps 4 and 5, which are the most important in this example. Steps 1, 2, 3 – creating a spark context, reading a data file … Read more
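A minimal sketch of that contrast could look like this; the column names and sample rows are illustrative, not taken from the post’s data file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

object WindowVsGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("window-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical order data – column names are illustrative only
    val orders = Seq(
      ("CLOSED",   100.0),
      ("CLOSED",   200.0),
      ("COMPLETE", 300.0),
      ("COMPLETE", 500.0)
    ).toDF("order_status", "order_price")

    // groupBy + avg: one row per status – the row count is reduced
    orders.groupBy($"order_status")
      .agg(avg($"order_price").as("avg_price"))
      .show()

    // avg over a window partitioned by status: every input row is kept,
    // and the average for its partition is attached to each row
    val byStatus = Window.partitionBy($"order_status")
    orders.withColumn("avg_price", avg($"order_price").over(byStatus))
      .show()

    spark.stop()
  }
}
```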

DataSet Operations – Pivoting Data

Pivoting data is the process of re-arranging data from rows into columns based on one or more columns; it is also sometimes called re-shaping data. Pivoting in Spark is quite simple and can be accomplished via the pivot method. More importantly, before you pivot data make sure the dataset has only the columns that are required; if there are columns you do not need in the pivot, make sure they are not in the dataset. Pivoting in Spark requires the following: groupBy, which is passed the columns used for aggregating and not pivoted; pivot, the column(s) whose values are converted into columns; and agg, the aggregation method applied during the pivot.

Pivoting a single column. Let’s take a look at an example, see below. Let’s analyse the above example. Step 1 – Creates a spark session. Step 2 – Creates a custom schema. Step 3 – … Read more
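A minimal sketch of a single-column pivot, assuming hypothetical product/quarter/quantity columns rather than the post’s own data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object PivotExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pivot-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sales data – only the columns that take part in the pivot
    val sales = Seq(
      ("Laptop", "Q1", 100),
      ("Laptop", "Q2", 150),
      ("Phone",  "Q1", 200),
      ("Phone",  "Q2", 250)
    ).toDF("productType", "quarter", "quantity")

    // groupBy keeps productType as rows, pivot turns each quarter value into a column,
    // and agg defines the aggregation applied to the pivoted cells
    val pivoted = sales
      .groupBy($"productType")
      .pivot("quarter")
      .agg(sum($"quantity"))

    pivoted.show()
    spark.stop()
  }
}
```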

DataSet Operations – Aggregation – Group Functions

In this entry, we look at how we can aggregate data in Spark datasets to create new datasets. If you have seen the previous post on selecting and filtering data, you will have realised that these APIs are quite similar to SQL operations. That similarity continues in this entry as well. Let’s start with the count API. There are a couple of ways to apply it; see below. Step 5 – Returns the count of rows in a dataset. The relevant log is below. Step 6 – Uses the groupBy and agg methods to count the orders by order status. The relevant log is below. Multiple aggregations can be performed via a single groupBy and agg method call. In the next example we see sum, avg, min and max being applied to the order price. The relevant portion of the log is shown below. Hope you have found this post interesting. … Read more
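A minimal sketch of those aggregations follows; the column names and sample rows are assumptions made for illustration, not the post’s actual orders dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, max, min, sum}

object GroupByAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groupby-agg-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical orders data – column names are illustrative only
    val orders = Seq(
      ("CLOSED",   100.0),
      ("CLOSED",   250.0),
      ("COMPLETE", 300.0),
      ("PENDING",  120.0)
    ).toDF("order_status", "order_price")

    // Count of all rows in the dataset
    println(s"Total orders: ${orders.count()}")

    // Count of orders by order status
    orders.groupBy($"order_status")
      .agg(count("*").as("order_count"))
      .show()

    // Several aggregations in a single groupBy/agg call
    orders.groupBy($"order_status")
      .agg(
        sum($"order_price").as("total_price"),
        avg($"order_price").as("avg_price"),
        min($"order_price").as("min_price"),
        max($"order_price").as("max_price"))
      .show()

    spark.stop()
  }
}
```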

DataSet Operations – Date & Time

The Spark Dataset date and time APIs allow extensive data transformations. The API can be roughly divided into the following categories: data extraction from dates (extract years, quarters, months, weeks, days, the day of the week, the week of the year), epoch and Unix time (conversion to and from epoch), date calculations (date difference, add days, next day), current date and timestamps, and different date formats. All of these categories are discussed in this post.

Data extraction from dates. There are various APIs available to extract data from a given date. They are all listed below and can be applied directly to a date column, as we will see. year – extracts the calendar year from a given date. quarter – extracts the calendar quarter from a given date. month – extracts the calendar month from a given date. weekofyear – extracts the week number from a given date. dayofmonth – … Read more
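A minimal sketch of the extraction and calculation functions applied to a date column follows; the sample dates and column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_date, date_add, datediff,
  dayofmonth, month, quarter, to_date, weekofyear, year}

object DateFunctionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("date-functions-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A couple of sample dates – the values and column names are illustrative only
    val dates = Seq("2019-01-15", "2019-07-04")
      .toDF("date_str")
      .withColumn("order_date", to_date(col("date_str"), "yyyy-MM-dd"))

    // Extraction functions applied directly to the date column
    dates.select(
      col("order_date"),
      year(col("order_date")).as("year"),
      quarter(col("order_date")).as("quarter"),
      month(col("order_date")).as("month"),
      weekofyear(col("order_date")).as("week_of_year"),
      dayofmonth(col("order_date")).as("day_of_month")
    ).show()

    // A couple of the date-calculation functions mentioned in the post
    dates.select(
      col("order_date"),
      date_add(col("order_date"), 7).as("plus_seven_days"),
      datediff(current_date(), col("order_date")).as("days_since")
    ).show()

    spark.stop()
  }
}
```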