Spark – Hello World

Now that we have some idea of how the components work, we can write a small program using Apache Spark and run it.

Before we start writing the program, let’s look at the tools we will be using:

  • IntelliJ Community Edition – IDE
  • Scala 
  • SBT – Scala Build Tool
  • Apache Spark

For this walkthrough we will use Ubuntu Desktop. I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process will be the same.

Launch IntelliJ IDE.

Click on Create New Project

Select SBT & click Next

Provide the following information, then click Finish:

  • Project Name – SparkHelloWorld
  • sbt version – 0.13.17
  • Scala version – 2.11.8

This will create an sbt project.

Add the Spark libraries to the project. 

  • Open build.sbt, which is in the root of the project (visible in the screenshot).
  • Add the following entry to build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"

This pulls in spark-core 2.2.1, together with all the libraries it depends on, for this project.
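For reference, a complete build.sbt consistent with the project settings chosen earlier might look roughly like the sketch below. The `version` value is an assumption (use whatever IntelliJ generated for you); the `%%` operator appends the Scala binary version, so this resolves to the spark-core artifact built for Scala 2.11.

```scala
// build.sbt - a minimal sketch; the version value is an assumption
name := "SparkHelloWorld"

version := "0.1"

// Must match the Scala version selected when creating the project
scalaVersion := "2.11.8"

// %% appends the Scala binary version (2.11) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
```

The sbt version itself is not set in build.sbt; it lives in project/build.properties.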

We are now ready to create our first Spark job. Before we move any further, a quick recap from Spark Runtime Components: any Spark job needs a Spark context object, which tells the job the configuration of the cluster.

Let’s write some code. Create a file SparkHelloWorld.scala using the following steps:
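To make the recap a little more concrete, here is a minimal sketch of how a SparkConf carries the settings that a Spark context is later built from. The object name InspectSparkConf is hypothetical and exists only for illustration; spark.master and spark.app.name are the standard property keys that setMaster and setAppName populate. No Spark context is started here, so nothing actually runs on a cluster.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object, for illustration only
object InspectSparkConf {
  // Build the kind of configuration a SparkContext would be given
  def build(): SparkConf = new SparkConf()
    .setMaster("local[*]")          // run locally, one worker thread per core
    .setAppName("SparkHelloWorld")  // name shown in logs and the Spark UI

  def main(args: Array[String]): Unit = {
    val conf = build()
    println(conf.get("spark.master"))    // the master URL set above
    println(conf.get("spark.app.name"))  // the application name set above
    println(conf.toDebugString)          // all explicitly set key=value pairs
  }
}
```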

  • Go to the src -> main -> scala folder in the project navigator.
  • Right-click it
  • Select Scala Class

Enter the following

  • Enter Name – SparkHelloWorld
  • Kind – Select Object

You should finally see the following screen

Now we are ready to add some code. Make sure you read the comments in the code as you go through it. The code has four steps:

  • Create a Spark Configuration object
  • Create a Spark Context using the Spark Configuration object
  • Create an RDD from a data file using Spark Context
  • Count the number of lines in the RDD and print a message.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SparkHelloWorld {
  def main(args: Array[String]): Unit = {
    // Step 1 - Create a Spark configuration object
    val sparkConf = new SparkConf()
      .setMaster("local[*]")         // Run locally, using all available cores
      .setAppName("SparkHelloWorld") // Name of our Spark app

    // Step 2 - Create a Spark context using the Spark configuration object
    val sparkContext = new SparkContext(sparkConf)

    // Step 3 - Read a text file using the Spark context and create an RDD
    val someRdd = sparkContext.textFile("some_data.txt")

    // Step 4 - Print the number of lines in the RDD
    println("Number of line is " + someRdd.count())

    // Release the resources held by the Spark context
    sparkContext.stop()
  }
}

Also add the file some_data.txt, shown below, to the root folder of the project:

This is a a text
This is also a line of text
Some more text
Still Some more text
I think this is enough
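If you would rather generate the file than type it in, a minimal plain-Scala sketch (no Spark required) follows. The object name CreateSampleData is hypothetical. It assumes the program runs with the project root as its working directory, so that the file lands where the Spark job looks for it.

```scala
import java.io.PrintWriter
import scala.io.Source

// Hypothetical helper object, for illustration only
object CreateSampleData {
  // The same five lines as the sample file shown above
  val lines: Seq[String] = Seq(
    "This is a a text",
    "This is also a line of text",
    "Some more text",
    "Still Some more text",
    "I think this is enough"
  )

  // Write the sample lines to the given path, one per line
  def write(path: String): Unit = {
    val writer = new PrintWriter(path, "UTF-8")
    try lines.foreach(line => writer.println(line))
    finally writer.close()
  }

  // Count the lines in a file without Spark, as a cross-check
  def countLines(path: String): Int = {
    val source = Source.fromFile(path, "UTF-8")
    try source.getLines().size
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    write("some_data.txt")
    println("Lines written: " + countLines("some_data.txt"))
  }
}
```

The count printed here should match the number the Spark job reports for the same file.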

The project folder should look like this

Finally, it’s time to run your program! Go to Run -> Run. If you have not created a run configuration yet, IntelliJ should ask you to edit one. Select SparkHelloWorld.

In the Edit Configuration window, set Main Class to SparkHelloWorld and press OK.

Go to Run and select Run again. It should run your project. If you maximize the Run area of the window, it should look something like this:

Observe that "Number of line is 5" appears in the log. You have just written your first Spark Hello World program. In practice, we can write results like this to a variety of storage technologies such as HDFS, NoSQL databases, or a classic RDBMS over JDBC, among many others.
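As a rough illustration of writing results out rather than printing them, the sketch below saves one record per input line to a local directory using the RDD method saveAsTextFile. The object name SparkSaveLineLengths and the output directory line_lengths_output are made up for this example. Pointing the path at an hdfs:// URI is the usual way to target HDFS, assuming a cluster is available. Note that saveAsTextFile fails if the output directory already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, for illustration only
object SparkSaveLineLengths {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("SparkSaveLineLengths")
    val sparkContext = new SparkContext(sparkConf)

    try {
      // Read the same sample file used by the Hello World job
      val someRdd = sparkContext.textFile("some_data.txt")

      // Prefix every line with its length, separated by a tab
      val withLengths = someRdd.map(line => s"${line.length}\t$line")

      // Writes part files into the directory (assumed not to exist yet)
      withLengths.saveAsTextFile("line_lengths_output")
    } finally {
      // Always release the Spark context, even if the job fails
      sparkContext.stop()
    }
  }
}
```

The output is a directory of part files rather than a single file, because each partition of the RDD is written separately.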

In the next entries we will explore more features of Spark Core API!

Hope this has been helpful.
