Spark – Introduction

Apache Spark is a general-purpose, in-memory computing engine for clustered environments. It was developed at UC Berkeley's AMPLab by Matei Zaharia in 2009. Since then, it has grown tremendously and is now one of the major computational engines for processing large datasets. Over the years it has grown to handle various use cases: batch processing, stream processing, graph processing and machine learning.

Apache Spark was developed in response to the limitations of the MapReduce framework. It is claimed to be up to 100 times faster than MapReduce, a figure highlighted on its website. The Spark engine has various components that help it generate optimized code; in particular, Catalyst and Tungsten help it optimize execution plans.

There are various other advantages:

  • Spark has rich data representations like RDDs, DataFrames and Graphs
  • It is tightly integrated with its other components
  • Layered architecture: optimisation at the core benefits higher-level APIs like DataFrames, Spark SQL etc.
  • Additional infrastructure can be added or removed without any outage, and Spark scales linearly
  • If required, all components can work with each other at the same time
  • Applications see a unified framework rather than individual components
  • Application developers do not need to know how the data is distributed across the Spark cluster; in most cases, code does not need to handle distribution explicitly
  • Its API is available across various languages – Scala, Java and Python

Spark can be divided into 5 major components, described below.

Let’s look at these high-level/major components. Don’t worry if you do not understand the sub-components yet.

Spark Core

Spark Core consists of various sub-components, sub-systems and data structures:

  • Memory Management
  • Task Management & Scheduling
  • Fault Management & Recovery
  • Interaction with Storage Systems
  • Finally, Resilient Distributed Datasets or RDDs

RDDs are probably the most well-known component of Spark Core and are at the heart of the Spark technology stack. An RDD is an immutable collection of elements partitioned and distributed across the Spark cluster.
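To make RDDs a little more concrete, here is a minimal sketch in Scala of creating an RDD from a local collection and running a transformation and an action. The app name, master setting and numbers are just placeholders for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session just for illustration; app name and master are arbitrary
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Distribute a local collection across the cluster as an RDD
    val numbers = sc.parallelize(1 to 10)

    // Transformations are lazy; the action (reduce) triggers the computation
    val squares = numbers.map(n => n * n)
    println(squares.reduce(_ + _)) // sum of squares of 1..10 = 385

    spark.stop()
  }
}
```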

Spark SQL

SQL makes it easy to manipulate structured data in the RDBMS world. Spark SQL (which supports a subset of ANSI SQL) allows you to use a SQL API to query the data in the Spark cluster. This interface, along with Spark DataFrames, has been one of the main reasons for the increased adoption of Spark as a computational engine replacing traditional ETL tools.
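As a small illustration of the SQL API, the sketch below builds a DataFrame from an in-memory collection, registers it as a temporary view and queries it with SQL. The table name, columns and data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory data turned into a DataFrame
    val people = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")

    // Register a temporary view so it can be queried with plain SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```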

While using Spark SQL you can work with RDDs and convert between the two easily. Spark also supports various file formats like Parquet, Avro, text etc. It also supports Apache Hive and HQL, which allows it to use Hadoop/HDFS as a source of data or as a sink for storing data.
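The following sketch shows, under the same assumptions (local session, made-up data), how an RDD can be converted to a DataFrame and back, and how a DataFrame can be written to and read from Parquet. The output path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object FormatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("formats-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Convert an RDD of tuples to a DataFrame, and get the underlying RDD back
    val rdd = spark.sparkContext.parallelize(Seq(("books", 12), ("games", 7)))
    val df  = rdd.toDF("category", "count")
    val backToRdd = df.rdd

    // Write and re-read the data as Parquet; the path is a placeholder
    df.write.mode("overwrite").parquet("/tmp/categories.parquet")
    spark.read.parquet("/tmp/categories.parquet").show()

    spark.stop()
  }
}
```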

Spark Streaming

Spark Streaming is the part of the Spark API that allows developers to process live streaming data. It can pick up streaming data from messaging systems and apply Spark Core APIs to create RDDs, making the data available across the Spark cluster with all the features of Spark Core.
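Here is a minimal sketch of the classic DStream-based Spark Streaming API: it reads lines from a socket (for example one opened with `nc -lk 9999`) and counts words in 5-second micro-batches. The host, port and batch interval are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Micro-batches every 5 seconds; master and app name are arbitrary
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Read lines from a socket and count words per batch
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```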

Structured Streaming

Structured Streaming is a newer addition to Spark's streaming stack, available as a stable component from Spark version 2.2.x onwards. It brings the features of DataFrames and structured data to the world of streaming.
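The same word count written against Structured Streaming looks like ordinary DataFrame code; the sketch below treats lines arriving on a socket as an unbounded table. Again, the host and port are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structured-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Treat lines arriving on a socket as an unbounded DataFrame
    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999)
      .load()

    // The same DataFrame/Dataset operations used on static data work here
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    // Continuously print the updated counts to the console
    counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
  }
}
```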

MLlib

MLlib is Spark’s machine learning library. It provides APIs for machine learning tasks such as clustering, classification, regression and model evaluation. Like all the other APIs, MLlib builds on Spark Core and is therefore able to run its algorithms in a distributed manner.
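As an illustrative sketch of the DataFrame-based ML API, the code below clusters a tiny made-up dataset with k-means. The feature columns, values and number of clusters are assumptions for the example.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy data: two numeric features per row (purely illustrative values)
    val df = Seq((1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)).toDF("x", "y")

    // Assemble the feature columns into the single vector column the ML API expects
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y")).setOutputCol("features")
      .transform(df)

    // Fit a k-means model with 2 clusters and print the cluster centres
    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```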

GraphX

GraphX is a library for manipulating graph data and performing graph computing in a distributed manner. 
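A minimal GraphX sketch, assuming a tiny made-up graph: vertices and edges are plain RDDs, and the resulting Graph object exposes distributed operations such as in-degree counts.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graphx-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Vertices are (id, attribute) pairs; edges connect source and destination ids
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)

    // A simple distributed graph computation: how many followers each vertex has
    graph.inDegrees.collect().foreach(println)

    spark.stop()
  }
}
```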

Getting Started

There are various distributions of Apache Spark that you can use. Spark can also be installed locally on your PC and will work just as well; however, keep in mind to test your application on a distributed cluster setup too. A few of the available distributions are:

  • Cloudera
  • Hortonworks (it has since merged with Cloudera)
  • Amazon Elastic MapReduce (EMR)
  • Databricks Community Edition

I like Databricks if you are just getting started with Apache Spark, as their Community Edition allows you to write code in a browser without having to install anything. That is where you should head before moving on to creating your own environment on a PC or a cluster.

This was a very brief overview of Apache Spark. We will look at the various cluster components in the next entry. Hope this was useful for you!

Till then…..bye!
