DataSet Operations – Selecting Data

Introduction

DataSet operations are used for transforming data and are quite similar to SQL:

- Selecting data
- Filtering data – applying conditions
- Ordering data
- Aggregating data – count, sum, max, min, avg
- Joins – right outer, left outer, full outer
- Set operations – union, minus, intersect

In all the following examples, I have used the TPC-H datasets, which are an industry standard for various benchmarks. The data generators are open source and can be downloaded from GitHub. The structure of the data is available on the TPC-H website, or you can directly click here and look at page 13. I have provided samples of the data files at the end of the posts.

DataSet – Read options

Before we get into the details of the actual DataFrame API, let's understand some of the configurations for reading data from a simple CSV file. There are various options available to read files. Some of them …
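As a preview of the read options the post goes on to discuss, here is a minimal Scala sketch of reading a pipe-delimited TPC-H file with a few common CSV options. The file path is an illustrative assumption, not a path from the post:

```scala
import org.apache.spark.sql.SparkSession

object ReadCsvExample {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for experimentation; on a cluster the master would differ
    val spark = SparkSession.builder()
      .appName("ReadCsvExample")
      .master("local[*]")
      .getOrCreate()

    // Common CSV read options: header row, field delimiter, and schema inference
    val lineitem = spark.read
      .option("header", "false")      // TPC-H .tbl files have no header row
      .option("delimiter", "|")       // TPC-H uses '|' as the field separator
      .option("inferSchema", "true")  // let Spark infer column types from the data
      .csv("/path/to/lineitem.tbl")   // hypothetical path to a generated TPC-H file

    lineitem.printSchema()
    spark.stop()
  }
}
```

Each `.option(...)` call is just a key-value configuration on the reader; the read itself is lazy and only runs when an action is invoked.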

Spark SQL & Datasets – Hello World

This post introduces you to a simple Spark SQL & Datasets example. It assumes that you are comfortable with the Spark Core API. Before we start writing a program, let's see which tools we will be using:

- IntelliJ Community Edition – IDE
- Scala
- SBT – Scala Build Tool
- Apache Spark

For this walkthrough we will be using Ubuntu Desktop. I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process would still be the same.

- Launch the IntelliJ IDE.
- Click on Create New Project.
- Select SBT & click Next.
- Provide the following information and then click Finish:
  - Project Name – SparkHelloWorldDataSet
  - sbt version – 0.13.17
  - Scala version – 2.11.8

This will create an sbt project. Add the Spark libraries to the project: open build.sbt, which is available in the root of the project (visible in the screenshot), and add the following entry to build.sbt. This will import all …
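The excerpt cuts off before showing the actual build.sbt entry. A typical dependency line for an sbt project pairing Spark SQL with Scala 2.11 would look something like the sketch below; the Spark version number is an assumption for illustration, not taken from the post:

```scala
// build.sbt – hypothetical example entry; the exact version used in the post is not shown in this excerpt
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
```

The `%%` operator tells sbt to append the project's Scala binary version (here `_2.11`) to the artifact name, so this resolves to `spark-sql_2.11`.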

Spark SQL & DataSets

Spark SQL is built on top of Spark Core. It is used to handle structured & semi-structured data – for example, data organised as rows and columns in a database. It stores the data in data structures called Datasets. A Dataset in Spark is a distributed data structure which has named columns, similar to pandas in Python or result sets in Java. Datasets have an API which is very similar to pandas in Python or data frames in R, but they have some distinct advantages over both. Some of them are listed below:

- Spark SQL is built on top of the Spark Core API and is able to exploit the distributed capabilities of Spark.
- Spark Datasets are lazily evaluated and immutable, similar to RDDs.
- They support a subset of the SQL language which is evolving at a fast pace.
- They support a wide variety of integrations with RDBMS and NoSQL databases, for example …
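To make the "distributed data structure with named columns" idea concrete, here is a minimal Scala sketch of building and transforming a typed Dataset. The `Person` case class and its values are illustrative assumptions, not taken from the post:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for illustration
case class Person(name: String, age: Int)

object DatasetHelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetHelloWorld")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset is a distributed collection with named, typed columns
    val people = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

    // Transformations are lazy and return a new (immutable) Dataset;
    // nothing runs until an action such as show() or count() is called
    val adults = people.filter(_.age >= 40)
    adults.show()

    spark.stop()
  }
}
```

Note how `filter` takes an ordinary Scala predicate over the typed `Person` rows, while the same data is also queryable by column name through the SQL-style API.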