Spark SQL & DataSets

Spark SQL is built on top of Spark Core. It is used to handle structured and semi-structured data – for example, data organised as rows and columns in a database. It stores this data in data structures called Datasets.

A Dataset in Spark is a distributed data structure with named columns, similar to a pandas DataFrame in Python or a result set in Java. The Dataset API will feel very familiar to anyone who has used pandas in Python or DataFrames in R.
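To make this concrete, here is a minimal Scala sketch of creating a Dataset. The `Employee` case class, its sample rows, and the local-mode SparkSession are all invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DatasetHello {
  // The case class supplies the Dataset's named, typed columns.
  case class Employee(name: String, department: String, salary: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetHello")
      .master("local[*]")      // local mode, just for experimentation
      .getOrCreate()
    import spark.implicits._   // brings toDS() and the $"col" syntax into scope

    // A small in-memory Dataset; on a cluster the rows would be
    // partitioned across executors.
    val employees = Seq(
      Employee("Alice", "Engineering", 95000),
      Employee("Bob", "Sales", 60000)
    ).toDS()

    employees.printSchema()    // named columns with their inferred types
    employees.show()

    spark.stop()
  }
}
```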

Datasets have some distinct advantages over pandas and R DataFrames. Some of them are listed below.

  1. Spark SQL is built on top of the Spark Core API and is able to exploit the distributed capabilities of Spark.
  2. Like RDDs, Spark Datasets are lazily evaluated and immutable.
  3. Supports a subset of the SQL language, which is evolving at a fast pace.
  4. Supports a wide variety of integrations with RDBMS and NoSQL databases – for example MySQL, Postgres, Hive, and Cassandra.
  5. Wide support for a variety of file formats – Parquet, Avro, ORC, XML, and JSON.
  6. It is easy to convert between RDDs and Datasets (see the sketch after this list).
  7. Supported across several languages: Scala, Java, and Python.
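As a sketch of point 6, converting in either direction takes a single call. The `Person` case class and the sample rows are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object RddDatasetConversion {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddDatasetConversion")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Start from a plain RDD of case-class instances.
    val rdd = spark.sparkContext.parallelize(
      Seq(Person("Ann", 30), Person("Raj", 25)))

    // RDD -> Dataset: toDS() is provided by spark.implicits._
    val ds = rdd.toDS()

    // Dataset -> RDD: every Dataset exposes its underlying RDD.
    val backToRdd = ds.rdd

    println(ds.count())         // 2
    println(backToRdd.count())  // 2

    spark.stop()
  }
}
```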

Dataset Operations

Transformations can be applied to Datasets just as they can to RDDs, but they offer a much richer and easier API. Like their RDD counterparts, Dataset transformations are lazily evaluated. Some common ones are listed below, with a short sketch after the list.

  • Aggregate operations – sum, min, max, count
  • Date/Time operations
  • Window operations
  • Pivot functions
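Here is a sketch of the first three categories on a made-up sales DataFrame; the column names and rows are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TransformationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TransformationSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("Engineering", "2021-01-15", 100.0),
      ("Engineering", "2021-02-10", 250.0),
      ("Sales",       "2021-01-20", 300.0)
    ).toDF("department", "sale_date", "amount")

    // Aggregate operations: sum, max, count per department.
    val totals = sales.groupBy($"department")
      .agg(sum($"amount").as("total"),
           max($"amount").as("largest"),
           count("*").as("num_sales"))

    // Date operation: pull the month out of the sale date.
    val withMonth = sales.withColumn("month", month(to_date($"sale_date")))

    // Window operation: rank sales within each department by amount.
    val byDept = Window.partitionBy($"department").orderBy($"amount".desc)
    val ranked = sales.withColumn("rank", rank().over(byDept))

    // Nothing has executed yet -- these are lazy transformations.
    // show() is an action and triggers the actual computation.
    totals.show()
    withMonth.show()
    ranked.show()

    spark.stop()
  }
}
```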

Actions are similar to the ones in RDDs, plus a few more. Whenever Spark writes data to a storage system or to the console, it is an action. So writing the transformed data to any of the storage technologies below is an action (a brief sketch follows the list):

  • Hadoop – CSV, Avro, Parquet, ORC
  • Hive
  • Cassandra
  • Any RDBMS
  • Object stores – AWS S3, Google Cloud Storage
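A brief sketch of two of these: writing to the console and writing Parquet files. The output path, and the commented-out JDBC URL and credentials, are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteActions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WriteActions")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Writing to the console is an action.
    df.show()

    // Writing files is also an action; Parquet here, but csv(), json()
    // and orc() follow the same pattern.
    df.write.mode(SaveMode.Overwrite).parquet("/tmp/demo-output")

    // Writing to an RDBMS over JDBC works the same way; the URL and
    // credentials below are placeholders.
    // df.write.mode(SaveMode.Append)
    //   .format("jdbc")
    //   .option("url", "jdbc:mysql://localhost:3306/demo")
    //   .option("dbtable", "demo_table")
    //   .option("user", "demo")
    //   .option("password", "secret")
    //   .save()

    spark.stop()
  }
}
```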

In the next blog entry we will see a good old “Hello World” example for Spark SQL and Datasets.
