Spark & Amazon S3

Introduction So far we have only concentrated on reading data from local file systems, which may be fine in some use cases but does not apply to big data and/or cloud-based environments. Everyone knows about Amazon Web Services and the hundreds of services it offers. One of its earliest and most used services is Simple Storage Service, or simply S3. You can read more about S3 on this link. In this blog entry, we see how to develop a Spark-based application which reads from and/or writes to AWS S3. It can later be deployed on the AWS cloud, but before we do that we need to write a program that works. Before we begin, there are a couple of assumptions here – understand the basics of AWS Identity & Access Management, like creating a user, an access key and a secret access key (if not, check this link); understand how … Read more
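As a rough sketch of the idea (not the code from the full post), reading from and writing to S3 with Spark might look like the snippet below. It assumes the hadoop-aws (s3a) connector is on the classpath, and the bucket name, paths and credentials are placeholders; in practice credentials would come from IAM rather than being hard-coded.

```scala
import org.apache.spark.sql.SparkSession

object S3ReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("S3ReadWriteSketch")
      .master("local[*]")
      .getOrCreate()

    // Placeholder credentials created via IAM (access key + secret access key)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")

    // Read a CSV file from a hypothetical S3 bucket
    val df = spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/input/orders.csv")

    df.show(5)

    // Write the result back to S3 as Parquet
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/orders-parquet")

    spark.stop()
  }
}
```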

DataSet Operations – Multi-Dimensional Aggregation

Spark multi-dimensional aggregation methods like cube and rollup allow the creation of aggregates across various combinations of columns. So, in a sense, they are an extension of groupBy with features of the union operator rolled in. They can help you create rich aggregated datasets very quickly and easily, and save you from writing an awful lot of complicated code to achieve the same result. Let’s unpick both these methods one by one and look at examples. Rollup Let’s assume that you have a dataset which has product type, year, quarter and quantity columns. You have been asked to create a single dataset in which you have the following combinations of aggregations – total quantity by product type, year and quarter; by product type and year; by product type; and finally the total quantity across all product types, years & quarters. One way would be to create three datasets using groupBy … Read more
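A minimal sketch of rollup on the kind of dataset described above, using made-up sample data (the column names and values are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("RollupSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: product type, year, quarter, quantity
val sales = Seq(
  ("Laptop", 2019, "Q1", 100),
  ("Laptop", 2019, "Q2", 150),
  ("Phone",  2019, "Q1", 200),
  ("Phone",  2020, "Q1", 250)
).toDF("productType", "year", "quarter", "quantity")

// rollup produces subtotals for (productType, year, quarter),
// (productType, year), (productType) and the grand total in one pass
val rolledUp = sales
  .rollup($"productType", $"year", $"quarter")
  .agg(sum($"quantity").as("totalQuantity"))
  .orderBy($"productType", $"year", $"quarter")

rolledUp.show()
```

Rows where a column shows null are the subtotal rows for that level of the rollup, which is what would otherwise take several groupBy calls and a union.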

DataSet Operations – Aggregation – Window Functions

Before we dive headfirst into window functions, it would be a good idea to review the groupBy aggregate functions. An important thing to note: when groupBy functions are applied they reduce the number of rows returned; however, when window functions are applied they do not reduce the rows in the result set. A window describes the set of rows on which the window function operates, and it returns a value which is shown in all the rows of the window. One more thing – window functions also help us compare the current row with other rows. Simple Aggregations Below is an example showing two operations to give a contrast in the results: the avg function when a groupBy is applied, and the avg function when a window is applied. Let’s look at steps 4 & 5, which are the most important in this example. Steps 1, 2, 3 – creating a Spark context, reading a data file … Read more
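A small sketch of the contrast described above, using hypothetical order data (column names and values are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("WindowSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical orders: status and price
val orders = Seq(
  ("CLOSED",   100.0),
  ("CLOSED",   200.0),
  ("COMPLETE",  50.0),
  ("COMPLETE", 150.0)
).toDF("orderStatus", "orderPrice")

// groupBy reduces the rows: one row per order status
orders.groupBy($"orderStatus").agg(avg($"orderPrice").as("avgPrice")).show()

// A window keeps every row and repeats the aggregate within each partition
val byStatus = Window.partitionBy($"orderStatus")
orders.withColumn("avgPrice", avg($"orderPrice").over(byStatus)).show()
```

The first output has one row per status, while the second keeps all four rows with the per-status average attached to each, so the current row can be compared against its group.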

DataSet Operations – Pivoting Data

Pivoting data is the process of re-arranging data from rows into columns based on one or more columns. It is also sometimes called re-shaping data. Pivoting in Spark is quite simple and can be accomplished via the pivot method. More importantly, before you pivot data make sure it has only the columns which are required; if there are columns you do not need in the pivot, make sure they are not in the dataset. Pivoting in Spark requires the following: groupBy – is passed the columns which are used for aggregating and are not pivoted; pivot – the column(s) whose values are converted into columns; agg – the aggregation method which is applied during the pivot. Pivoting a Single Column Let’s take a look at an example, see below. Let’s analyse the above example. Step 1 – creates a Spark session. Step 2 – creates a custom schema. Step 3 – … Read more
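A minimal sketch of the groupBy / pivot / agg pattern, with assumed sample data and column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("PivotSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sales data
val sales = Seq(
  ("Laptop", "Q1", 100),
  ("Laptop", "Q2", 150),
  ("Phone",  "Q1", 200),
  ("Phone",  "Q2", 250)
).toDF("productType", "quarter", "quantity")

// groupBy the columns that stay as rows, pivot the column whose
// values become new columns, then apply the aggregation
val pivoted = sales
  .groupBy($"productType")
  .pivot("quarter")
  .agg(sum($"quantity"))

pivoted.show()
// One row per productType, with Q1 and Q2 as columns
```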

DataSet Operations – Aggregation – Group Functions

In this entry, we look at how we can aggregate data in Spark datasets to create new datasets. If you have seen the previous post on selecting and filtering data, you would have realised that these APIs are quite similar to SQL operations. That similarity continues in this entry as well. Let’s start with the count API. There are a couple of ways to apply this API, see below. Step 5 – returns the count of rows in a dataset. The relevant log is below. Step 6 – uses the groupBy & agg methods to count the orders by order status. The relevant log is below. Multiple aggregations can be performed via a single groupBy & agg method call. In the next example we see sum, avg, min and max being performed on the order price. The relevant portion of the log is shown below. Hope you have found this post interesting. … Read more
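A compact sketch of the aggregations mentioned above, using hypothetical order data (the dataset and column names are assumptions, not the post's actual TPC-H steps):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum, avg, min, max}

val spark = SparkSession.builder().appName("GroupBySketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical orders dataset
val orders = Seq(
  ("CLOSED",   100.0),
  ("CLOSED",   200.0),
  ("COMPLETE",  50.0),
  ("PENDING",  150.0)
).toDF("orderStatus", "orderPrice")

// Simple count of rows in the dataset
println(orders.count())

// Count of orders by order status
orders.groupBy($"orderStatus").agg(count("*").as("orderCount")).show()

// Multiple aggregations in a single groupBy & agg call
orders.groupBy($"orderStatus")
  .agg(
    sum($"orderPrice").as("totalPrice"),
    avg($"orderPrice").as("avgPrice"),
    min($"orderPrice").as("minPrice"),
    max($"orderPrice").as("maxPrice")
  )
  .show()
```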

DataSet Operations – Date & Time

Spark DataSet date & time APIs allow extensive data transformations. The API can be roughly divided into the following categories: data extraction from dates – extract years, quarters, months, weeks, days, the day of the week, the week of the year; epoch & Unix time – conversion to & from epoch; date calculations – date difference, add days, next day; current date and timestamps; different date formats. All of these categories are discussed in this post. Data Extraction from Dates There are various APIs available to extract data from a given date. They are all listed below and can be applied directly on a date column, as we will see. year – extracts the calendar year from a given date; quarter – extracts the calendar quarter from a given date; month – extracts the calendar month from a given date; weekOfYear – extracts the week number from a given date; dayOfMonth – … Read more
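A short sketch of these extraction and calculation functions applied to a date column. The sample dates are made up, and it assumes Spark 2.2+ so that to_date accepts a format string; note the built-in functions are spelled year, quarter, month, weekofyear, dayofmonth, datediff, date_add and current_date in org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, year, quarter, month, weekofyear,
  dayofmonth, datediff, date_add, current_date}

val spark = SparkSession.builder().appName("DateSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical order dates as strings, parsed into a date column
val orders = Seq("2019-03-15", "2020-11-02").toDF("orderDateStr")
  .withColumn("orderDate", to_date(col("orderDateStr"), "yyyy-MM-dd"))

orders.select(
  col("orderDate"),
  year(col("orderDate")).as("year"),                       // calendar year
  quarter(col("orderDate")).as("quarter"),                 // calendar quarter
  month(col("orderDate")).as("month"),                     // calendar month
  weekofyear(col("orderDate")).as("weekOfYear"),           // week number
  dayofmonth(col("orderDate")).as("dayOfMonth"),           // day of the month
  datediff(current_date(), col("orderDate")).as("daysSinceOrder"),
  date_add(col("orderDate"), 30).as("plus30Days")
).show()
```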

DataSet Operations – Selecting Data

Introduction DataSet operations are used for transforming data and are quite similar to SQL: selecting data; filtering data – applying conditions; ordering data; aggregating data – count, sum, max, min, avg; joins – right outer, left outer, full outer; set operations – union, minus, intersect. In all the following examples, I have used the TPC-H datasets, which are an industry standard for various benchmarks. The data generators are open source and can be downloaded from GitHub. The structure of the data is available on the TPC-H website, or you can directly click here and look at page 13. I have provided samples of the data files at the end of the post. DataSet – Read options Before we get into the details of the actual DataFrame API, let’s understand some of the configurations to read data from a simple CSV file. There are various options available to read files. Some of them … Read more
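A minimal sketch of reading a delimited file with a few of the common options. The file path, the pipe delimiter and the column positions are assumptions made for a TPC-H style orders file, not the post's exact setup:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadCsvSketch").master("local[*]").getOrCreate()

// Hypothetical path to a TPC-H style pipe-delimited orders file
val orders = spark.read
  .option("header", "false")      // TPC-H .tbl files have no header row
  .option("delimiter", "|")       // columns are pipe-separated
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv("/path/to/orders.tbl")

orders.printSchema()

// Without a header the columns are named _c0, _c1, ...; selecting and
// filtering here assumes _c2 holds the order status
orders.select("_c0", "_c2").filter("_c2 = 'O'").show(5)
```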

Spark SQL & Datasets – Hello World

This post introduces you to a simple Spark SQL & datasets example. It assumes that you are comfortable with the Spark Core API. Before we start writing a program, let’s see which tools we would be using: IntelliJ Community Edition – IDE; Scala; SBT – Scala Build Tool; Apache Spark. For the purpose of this post, we would be using Ubuntu Desktop. I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process would still be the same. Launch the IntelliJ IDE, click on Create New Project, select SBT & click Next, then provide the following information and click Finish: Project Name – SparkHelloWorldDataSet; sbt version – 0.13.17; Scala version – 2.11.8. This will create an sbt project. Add the Spark libraries to the project. Open build.sbt; it is available in the root of the project and visible in the screenshot. Add the following entry to build.sbt. This will import all … Read more
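The excerpt cuts off before the build.sbt entry itself, so the following is only a sketch of what such an entry might look like for Scala 2.11.8; the Spark version shown (2.3.0) is an assumption, not necessarily the one used in the full post.

```scala
// build.sbt - minimal sketch for a Spark SQL hello-world project
name := "SparkHelloWorldDataSet"

version := "0.1"

scalaVersion := "2.11.8"

// Spark Core plus Spark SQL, which provides SparkSession and Datasets
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" %% "spark-sql"  % "2.3.0"
)
```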

Spark SQL & DataSets

Spark SQL is built on top of Spark Core. It is used to handle structured & semi-structured data – for example, data in a database organised as rows and columns. It stores the data in data structures called datasets. A Dataset in Spark is a distributed data structure which has named columns, similar to pandas in Python or result sets in Java. Datasets have an API which is very similar to pandas in Python or data frames in R, and they have some distinct advantages over both. Some of them are listed below. Spark SQL is built on top of the Spark Core API and is able to exploit the distributed capabilities of Spark. Spark datasets are lazily evaluated and immutable, similar to RDDs. Spark SQL supports a subset of the SQL language which is evolving at a fast pace. It supports a wide variety of integrations with RDBMS and NoSQL databases, for example … Read more
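A tiny sketch of what a typed Dataset with named columns looks like in practice; the case class and sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

// A case class gives the Dataset its named, typed columns
case class Customer(id: Long, name: String, country: String)

object DataSetIntroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataSetIntroSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory data turned into a distributed Dataset
    val customers = Seq(
      Customer(1, "Alice", "UK"),
      Customer(2, "Bob",   "US")
    ).toDS()

    // Transformations are lazily evaluated and immutable; show() triggers execution
    customers.filter(_.country == "UK").show()

    spark.stop()
  }
}
```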

Spark Partitioning

Introduction Transparently partitioning data across the worker nodes is one of the best features of Spark and has made writing code simpler. Partitioning not done properly leads to data skew and, eventually, performance issues. So having the right number of partitions, with data in the right place, is important to reduce performance problems and data shuffling. This post will cover the various types of partitioning and show how you can utilise them for your Spark jobs quickly. IMPORTANT: Partitioning APIs are only available for pair RDDs. Pair RDDs are RDDs which are based on key-value pairs. Before diving into partitioning, let’s look at a couple of highlights around partitions. Each partition resides on only one worker node; it cannot span across nodes. Every worker node in a Spark cluster has at least one partition, and can have more. Whenever you specify a partitioning strategy (standard or custom) make sure you persist/cache … Read more
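A brief sketch of partitioning a pair RDD with the standard HashPartitioner and persisting the result, as the excerpt recommends; the data and the choice of 4 partitions are assumptions for illustration:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitioningSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical pair RDD of (customerId, orderAmount)
val orders = sc.parallelize(Seq((1, 100.0), (2, 50.0), (1, 75.0), (3, 20.0)))

// Hash-partition on the key into 4 partitions and persist,
// so the shuffle is not repeated for every downstream action
val partitioned = orders.partitionBy(new HashPartitioner(4)).persist()

println(partitioned.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
println(partitioned.getNumPartitions)  // 4

// Key-based operations can now reuse the partitioning without another shuffle
val totals = partitioned.reduceByKey(_ + _)
totals.collect().foreach(println)
```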