Spark SQL & Datasets – Hello World

This post introduces a simple Spark SQL & Datasets example. It assumes that you are comfortable with the Spark Core API. Before we start writing the program, let's look at the tools we will be using: IntelliJ Community Edition (the IDE), Scala, SBT (the Scala build tool) and Apache Spark. For this walkthrough we will use an Ubuntu Desktop; I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process would still be the same. Launch the IntelliJ IDE, click Create New Project, select SBT and click Next, then provide the following information and click Finish: Project Name – SparkHelloWorldDataSet, sbt version – 0.13.17, Scala version – 2.11.8. This will create an sbt project. Next, add the Spark libraries to the project: open build.sbt, which is available in the root of the project (visible in the screenshot), and add the following entry to build.sbt. This will import all … Read more
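
The excerpt is cut off before the build.sbt entry itself. As a rough sketch, an entry along those lines could look like the following – note that the Spark version here is an assumption, since the post only pins down sbt 0.13.17 and Scala 2.11.8:

    // build.sbt sketch – the Spark version is an assumption, not taken from the post
    name := "SparkHelloWorldDataSet"
    version := "0.1"
    scalaVersion := "2.11.8"

    // spark-sql pulls in spark-core transitively, but both are listed for clarity
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.3.0",
      "org.apache.spark" %% "spark-sql"  % "2.3.0"
    )

After saving build.sbt, letting sbt refresh the project makes the Spark classes available on the classpath.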

Spark SQL & DataSets

Spark SQL is built on top of Spark Core. It is used to handle structured and semi-structured data – for example, data organised as rows and columns in a database. It stores the data in data structures called Datasets. A Dataset in Spark is a distributed data structure with named columns, similar to pandas in Python or result sets in Java, and its API is very similar to pandas in Python or data frames in R. Datasets also have some distinct advantages over pandas or R data frames; some of them are listed below. Spark SQL is built on top of the Spark Core API and is able to exploit the distributed capabilities of Spark. Spark Datasets are lazily evaluated and immutable, similar to RDDs. They support a subset of the SQL language which is evolving at a fast pace, and a wide variety of integrations with RDBMS and NoSQL databases, for example … Read more
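
To make this concrete, here is a minimal sketch, assuming a Spark 2.x SparkSession in local mode (the Employee case class and its values are purely illustrative), that builds a typed Dataset and queries it through Spark SQL:

    import org.apache.spark.sql.SparkSession

    object DataSetSketch {

      // The case class gives the Dataset its named, typed columns
      case class Employee(name: String, department: String, salary: Double)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SparkSQLDataSetSketch")
          .master("local[*]") // local mode, for illustration only
          .getOrCreate()
        import spark.implicits._

        // Datasets are immutable and lazily evaluated, like RDDs
        val employees = Seq(
          Employee("Asha", "Engineering", 95000.0),
          Employee("Ravi", "Sales", 60000.0)
        ).toDS()

        // Register a temporary view and query it with the SQL subset Spark supports
        employees.createOrReplaceTempView("employees")
        spark.sql("SELECT department, avg(salary) AS avg_salary FROM employees GROUP BY department").show()

        spark.stop()
      }
    }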

Spark – Hello World

Now that we have some idea of how the components work, we can write a small program using Apache Spark and do something with it. Before we start writing the program, let's look at the tools we will be using: IntelliJ Community Edition (the IDE), Scala, SBT (the Scala build tool) and Apache Spark. For this walkthrough we will use an Ubuntu Desktop; I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process would still be the same. Launch the IntelliJ IDE, click Create New Project, select SBT and click Next, then provide the following information and click Finish: Project Name – SparkHelloWorld, sbt version – 0.13.17, Scala version – 2.11.8. This will create an sbt project. Next, add the Spark libraries to the project: open build.sbt, which is available in the root of the project. Visible … Read more
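
The excerpt stops before the program itself. A minimal word-count style "hello world", assuming a Spark 2.x dependency and local mode (the names and sample data are illustrative, not the post's exact code), might look like this:

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkHelloWorld {
      def main(args: Array[String]): Unit = {
        // Driver configuration: an application name plus a local master, for illustration
        val conf = new SparkConf().setAppName("SparkHelloWorld").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // A tiny in-memory dataset turned into an RDD
        val lines = sc.parallelize(Seq("hello world", "hello spark"))

        // Classic word count: split into words, pair each word with 1, sum the counts
        val counts = lines
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.collect().foreach(println)
        sc.stop()
      }
    }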

Spark RDDs – Introduction

Resilient Distributed Datasets (RDDs for short) are at the heart of a Spark cluster. An RDD is an immutable, distributed collection of data elements which can be operated upon in parallel across the various nodes in a cluster. RDDs hide/abstract data partitioning across the cluster, making application development with the Spark API simpler. One way to understand the definition of RDDs is to unpack the name itself. Resilient – RDDs are fault tolerant. They are also lazy by nature and may not be materialized; an RDD stores information on how it can be derived from other data sources and/or RDDs, so in a sense it maintains its lineage. Distributed – RDDs are distributed across cluster nodes as data partitions, and each partition has copies across the nodes in the cluster based on a replication factor, which contributes to fault tolerance and resilience. An application can indicate how partitioning takes place so as to optimize processing of … Read more
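
A small sketch, assuming a Spark 2.x setup in local mode (the numbers and partition count are illustrative), shows the lazy, lineage-keeping behaviour described above:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddIntroSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddIntroSketch").setMaster("local[*]"))

        // Four partitions, chosen purely for illustration
        val numbers = sc.parallelize(1 to 100, numSlices = 4)

        // Transformations are lazy: these lines only record lineage, nothing executes yet
        val evens   = numbers.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)

        // The lineage the RDD maintains, which it can replay to recover lost partitions
        println(squares.toDebugString)

        // An action finally materializes the computation across the partitions
        println(s"sum = ${squares.sum()}")

        sc.stop()
      }
    }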

Spark Runtime Components

A Spark cluster has various components and sub-systems which run in the background when a job is being executed. All of these components are described below. Don't worry if you have trouble imagining how these are created or implemented in code; that will become clear in the later entries. Driver Program – the process running the main() function of the application and creating the SparkContext. It is also the program/job, written by the developer, which is submitted to Spark for processing. Spark Context – the SparkContext is the entry point to Spark Core services and features. It sets up internal services and establishes a connection to a Spark execution environment. Every Spark job creates a SparkContext object before it can do any processing. Cluster Manager – Spark uses a cluster manager to acquire resources across the cluster for executing a job. However, Spark is also agnostic of cluster managers and does not really care how … Read more
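
A minimal driver sketch, assuming a Spark 2.x setup (the master URLs in the comments are standard examples, not taken from the post), shows how these pieces relate:

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverProgramSketch {
      // The driver is simply the process running this main() function
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("RuntimeComponentsSketch")
          // The master URL decides which cluster manager acquires resources:
          // "local[*]" runs in-process, while "spark://host:7077" (standalone),
          // "yarn" or "mesos://host:5050" would hand the job to a real cluster manager.
          .setMaster("local[*]")

        // The SparkContext is the entry point to Spark Core services
        val sc = new SparkContext(conf)
        println(s"Connected to master: ${sc.master}")
        sc.stop()
      }
    }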

Spark – Introduction

Apache Spark is a general-purpose, in-memory computing engine for clustered environments. It was developed at UC Berkeley AMP Labs by Matei Zaharia in 2009. Since then, it has grown tremendously and is now one of the major computational engines for processing large datasets. Over the years it has grown to handle various use-cases: batch processing, stream processing, graph processing and machine learning. Apache Spark was developed in response to the limitations of the Map-Reduce framework. It is said to be up to 100 times faster than Map-Reduce, a claim also highlighted on its website. The Spark engine has various components which help it generate optimized code; specific components like Catalyst and Tungsten help it optimize execution plans. There are various other advantages: Spark has rich data representations like RDDs, DataFrames and graphs; it is tightly integrated with its other components; and its layered architecture means optimisation at the core benefits higher-level APIs like DataFrames, Spark SQL etc. … Read more