This post introduces you to a simple Spark SQL & Datasets example. It assumes that you are comfortable with the Spark Core API.
Before we start writing the program, let's look at the tools we will be using:
- IntelliJ Community Edition – IDE
- Scala
- SBT – Scala Build Tool
- Apache Spark
For this walkthrough, we will be using Ubuntu Desktop. I already have an Ubuntu desktop running in VirtualBox, but you can use a MacBook and the process would still be the same.
Launch IntelliJ IDE
Click on Create New Project

Select SBT & click Next

Provide the following information and then click Finish:
- Project Name – SparkHelloWorldDataSet
- sbt version – 0.13.17
- Scala version – 2.11.8

This will create an sbt project.

Add the Spark libraries to the project.
- Open build.sbt. It is available in the root of the project, as shown in the screenshot.
- Add the following entry to build.sbt.
This will import all the libraries that are required to use Spark.
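A minimal sketch of that entry is below; the Spark version (2.3.0) is an assumption on my part – use whichever Spark 2.x release you have, built for Scala 2.11:

// build.sbt – pulls in Spark SQL (and, transitively, Spark Core)
// NOTE: the version 2.3.0 is illustrative; match it to your Spark installation
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"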

We are now ready to create our first Spark Job.
Let's write some code. Create a file SparkDatasetHelloWorld.scala by doing the following steps:
- Go to the src -> main -> scala folder in the project navigator.
- Right-click.
- Select Scala Class.

Enter the following:
- Enter Name – SparkDatasetHelloWorld
- Kind – Select Object

You should finally see the following screen

Now we are ready to add some code. Make sure you read the comments in the code as you go through it. The code has five steps:
- Step 1 – Create a Spark session
- Step 2 – Read the file some_data.csv
- Step 3 – Show a sample of data
- Step 4 – Print the class of data set
- Step 5 – Count the number of rows
import org.apache.spark.sql.SparkSession

object SparkDatasetHelloWorld {
  def main(args: Array[String]): Unit = {
    // Step - 1 - Create a Spark Session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark Dataset Hello World")
      .getOrCreate()

    // Step - 2 - Read the CSV file and create a dataset
    val ds = spark.read
      .option("header", true)
      .csv("some_data.csv")

    // Step - 3 - Show a sample of data from the dataset
    ds.show()

    // Step - 4 - Print the class of the object ds
    println(s"class of the ds is ${ds.getClass}")

    // Step - 5 - Count the number of rows and print it
    println(s"Number of rows is ${ds.count}")
  }
}
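A quick note on Step 4: in Spark 2.x, spark.read.csv returns a DataFrame, which is simply a type alias for Dataset[Row]. That is why the class printed in Step 4 is org.apache.spark.sql.Dataset.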
Also add the file some_data.csv, shown below, to the root folder of the project.
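The contents below are illustrative – the column names and values are my own assumptions; any CSV with a header row and five data rows will reproduce the output in this post:

name,age,city
Alice,34,London
Bob,28,Paris
Carol,45,Berlin
Dave,23,Madrid
Eve,31,Dublin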
The project folder should look like this

Finally – it's time to run your program!
Go to Run -> Edit Configurations. In the Edit Configurations window, select Main Class – SparkDatasetHelloWorld and press OK.

Go to Run and select Run. It should run your project. If you look at the Run area of the window, it should look something like this

and like this

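With the illustrative some_data.csv from above, the log would contain roughly the following (your table will differ if your CSV does):

+-----+---+------+
| name|age|  city|
+-----+---+------+
|Alice| 34|London|
|  Bob| 28| Paris|
|Carol| 45|Berlin|
| Dave| 23|Madrid|
|  Eve| 31|Dublin|
+-----+---+------+

class of the ds is class org.apache.spark.sql.Dataset
Number of rows is 5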
Observe that "Number of rows is 5" is in the log. You have just written your first Spark Dataset program.
In the next entries, we will explore more features of the Spark SQL API!
Hope this has been helpful!