Spark SQL & Datasets – Hello World

This post walks you through a simple Spark SQL & Datasets example. It assumes that you are comfortable with the Spark Core API.

Before we start writing the program, let's look at the tools we will be using:

  • IntelliJ Community Edition – IDE
  • Scala 
  • SBT – Scala Build Tool
  • Apache Spark

For this walkthrough, we will be using Ubuntu Desktop. I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process would still be the same.

Launch IntelliJ IDE

Click on Create New Project

Select SBT & click Next

Provide the following information and then click Finish

  • Project Name – SparkHelloWorldDataSet
  • sbt version – 0.13.17
  • Scala version – 2.11.8

This will create an sbt project.

Add the Spark libraries to the project. 

  • Open build.sbt; it is available in the root of the project, as shown in the screenshot.
  • Add the following entry to build.sbt
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

This will download the spark-sql 2.4.0 library and all of its dependencies for this project. See below.
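After adding the dependency, the full build.sbt should look roughly like the sketch below. The name, version and scalaVersion values come from what we entered while creating the project; adjust them if yours differ.

name := "SparkHelloWorldDataSet"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"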

We are now ready to create our first Spark Job.

Any Spark SQL job needs a SparkSession object, which tells the job about the configuration of the cluster.
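For example, the builder below is a minimal sketch of what we will write in the full program; "local[*]" tells Spark to run locally using all available cores, and any additional .config(...) calls become part of the job's configuration (the shuffle partitions setting here is only an illustrative option).

import org.apache.spark.sql.SparkSession

// Minimal sketch: local[*] runs Spark on the local machine using all cores.
// On a real cluster the master would instead point at YARN or a standalone master URL.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Spark Dataset Hello World")
  .config("spark.sql.shuffle.partitions", "4") // illustrative configuration option
  .getOrCreate()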

Let’s write some code. Create a file SparkDatasetHelloWorld.scala by doing the following steps

  • Go to the src->main->scala folder in the project navigator.
  • Right-click
  • Select Scala Class

Enter the following

  • Enter Name – SparkDatasetHelloWorld
  • Kind – Select Object

You should finally see the following screen

Now we are ready to add some code. Make sure you read the comments in the code as you go through it. The code has five steps:

  • Step 1 – Create a Spark session
  • Step 2 – Read the file some_data.csv
  • Step 3 – Show a sample of the data
  • Step 4 – Print the class of the dataset
  • Step 5 – Count the number of rows

import org.apache.spark.sql.SparkSession

object SparkDatasetHelloWorld {
  def main(args: Array[String]): Unit = {

    //Step - 1 - Create a Spark Session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark Dataset Hello World")
      .getOrCreate

    //Step - 2 - Read the CSV file and create a dataset
    val ds = spark.read
      .option("header",true)
      .csv("some_data.csv")

    //Step - 3 - Show a sample of data from the dataset
    ds.show

    //Step - 4 - Print the class of the object ds
    println(s"class of the ds is ${ds.getClass}")

    //Step - 5 - Count the number of rows and print it
    println(s" Number of rows is ${ds.count}")
  }
}
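A side note: spark.read.csv returns an untyped Dataset[Row] (a DataFrame), which is what Step 4 prints. If you want a strongly typed Dataset, you can map it onto a case class with as[...]. The sketch below assumes hypothetical columns name and age in some_data.csv; adapt the case class to whatever header your file actually has.

import org.apache.spark.sql.SparkSession

object TypedDatasetSketch {

  // Hypothetical schema: the field names must match the header row of some_data.csv
  case class Person(name: String, age: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Typed Dataset Sketch")
      .getOrCreate()

    // Needed for the implicit encoders used by as[Person]
    import spark.implicits._

    // CSV columns are read as strings unless a schema is supplied,
    // so the case class fields are declared as String here
    val people = spark.read
      .option("header", true)
      .csv("some_data.csv")
      .as[Person]

    people.show()
    println(s"class of the typed dataset is ${people.getClass}")

    spark.stop()
  }
}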

Also add a small CSV file named some_data.csv, shown below, to the root folder of the project.


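The exact contents do not matter much; any small CSV with a header row will do. For illustration, a hypothetical some_data.csv with a header and five data rows could look like this:

id,name,city
1,Alice,London
2,Bob,Paris
3,Carol,Berlin
4,Dave,Madrid
5,Eve,Rome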

The project folder should look like this

Finally, it's time to run your program! Go to Run->Run. It should ask you to edit the run configuration if you have not already created one. Select SparkDatasetHelloWorld.

In the Edit Configuration window, select Main Class – SparkDatasetHelloWorld and press OK.

Go to Run and select Run. It should run your project. If you look at the Run area of the window, it should look something like this
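If you prefer the command line instead of the IDE, running sbt run from a terminal in the project root should compile the project and let you launch the same main method, producing similar output.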

and like this

Observe that "Number of rows is 5" appears in the log. You have just written your first Spark Dataset program. In practice we can write this data out to a variety of technologies such as HDFS, NoSQL databases, or a classic RDBMS over JDBC, and many more.
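For instance, the sketch below writes the same dataset out as Parquet files and, alternatively, to a relational database over JDBC; the output path, JDBC URL, table name and credentials are placeholders, not real endpoints.

// Write the dataset out as Parquet files (works on the local file system or on HDFS)
ds.write.mode("overwrite").parquet("output/some_data_parquet")

// Or push it to a relational database over JDBC.
// The URL, table name and credentials below are placeholders,
// and a JDBC driver for the target database must be on the classpath.
ds.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "some_data")
  .option("user", "username")
  .option("password", "password")
  .mode("append")
  .save()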

In the next entries we will explore more features of Spark SQL API!

Hope this has been helpful……
