JSON is a widely used data-interchange format. Spark supports JSON documents natively, so no additional setup is required to process them.
Reading JSON Documents
To keep things easy to follow, I have divided this post into three sub-sections:
- Simple JSON documents
- Nested JSON documents
- Nested JSON documents with arrays inside them.
As we move from simple documents to more complex ones, the code grows a little, but the Spark API keeps it easy to understand.
Simple JSON Documents
This is the simplest kind of document and may contain one or more sets of attributes. For example, below is a document with data about just one person.
{
  "firstName": "John",
  "lastName": "Simth",
  "age": 32,
  "departmentName": "HR"
}
This file is saved in the project as simple.json.
The code below parses and reads this JSON file:
import org.apache.spark.sql.SparkSession

object SparkJSON {
  def main(args: Array[String]): Unit = {
    // Step 1 - Create a Spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark JSON")
      .getOrCreate()
    // Step 2 - Read data from a local JSON file
    val baseDS = spark.read
      .option("multiline", true)
      .json("simple.json")
    // Step 3 - Show the schema
    baseDS.printSchema()
    // Step 4 - Show the data
    baseDS.show()
  }
}
Let’s analyse the code:
- Step 1 – Creates a Spark session.
- Step 2 – Reads the JSON document.
  - Make sure you enable the multiline option so that multi-line JSON documents can be read (see the note after this list).
  - The json method takes the file path.
- Step 3 – Prints the JSON schema.
- Step 4 – Shows the data from the JSON document as a dataset.
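A quick note on the multiline option: by default, spark.read.json expects JSON Lines input, i.e. one complete JSON object per physical line, so a pretty-printed document such as simple.json will typically not be parsed correctly without it. In JSON Lines form the same record would be written on a single line, roughly like this:
{"firstName": "John", "lastName": "Simth", "age": 32, "departmentName": "HR"}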
Here are the relevant parts of the output when the program is run.
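Given the single record in simple.json, the printed schema and dataset should look roughly like this (Spark's JSON schema inference lists the columns in alphabetical order):
root
 |-- age: long (nullable = true)
 |-- departmentName: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)

+---+--------------+---------+--------+
|age|departmentName|firstName|lastName|
+---+--------------+---------+--------+
| 32|            HR|     John|   Simth|
+---+--------------+---------+--------+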
The same program can also be run against a JSON document that contains data for more than one person, like the one below:
[
  {
    "firstName": "John",
    "lastName": "Simth",
    "age": 32,
    "departmentName": "HR"
  },
  {
    "firstName": "Tim",
    "lastName": "Hunter",
    "age": 55,
    "departmentName": "Sales"
  },
  {
    "firstName": "Mark",
    "lastName": "Kent",
    "age": 23,
    "departmentName": "Production"
  }
]
When the same program is now run, it gives the following output to the console.
The schema has not changed and is the same as the schema printed in the example above.
As you can see, we now have data about three people in our dataset, and we can apply the Spark API to process the data (a small example follows below).
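For instance, here is a minimal sketch of the kind of processing that could be applied at this point, reusing the baseDS from the program above (the column names come from the inferred schema):
// Keep only people older than 30, using a SQL-style expression (no extra imports needed)
baseDS.filter("age > 30").show()
// Count people per department
baseDS.groupBy("departmentName").count().show()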
Nested JSON Documents
Let’s make things more complex by adding some nested information into the mix. Suppose our JSON document needs to group each employee’s name attributes together in a tree-like structure:
[
  {
    "name": {
      "firstName": "John",
      "lastName": "Simth"
    },
    "age": 32,
    "departmentName": "HR"
  },
  {
    "name": {
      "firstName": "Tim",
      "lastName": "Hunter"
    },
    "age": 55,
    "departmentName": "Sales"
  },
  {
    "name": {
      "firstName": "Mark",
      "lastName": "Kent"
    },
    "age": 23,
    "departmentName": "Production"
  }
]
If you run the same program given above against this document, it should give you the following results.
The schema should look like the sketch below. Observe that the attribute name is of type struct.
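A rough sketch of the inferred schema (again, fields are listed alphabetically):
root
 |-- age: long (nullable = true)
 |-- departmentName: string (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- firstName: string (nullable = true)
 |    |-- lastName: string (nullable = true)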
When the data is shown, the dataset has a single name column holding the whole struct, alongside the age and departmentName columns.
But hang on a minute: we wanted to see firstName and lastName as two separate attributes, not as parts of the name attribute. To enable that, we need to add some code that extracts them as two separate columns. It is easy to do, and the revised code is below.
import org.apache.spark.sql.SparkSession

object SparkJSON {
  def main(args: Array[String]): Unit = {
    // Step 1 - Create a Spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark JSON")
      .getOrCreate()
    // Step 2 - Read data from a local JSON file
    val baseDS = spark.read
      .option("multiline", true)
      .json("nested.json")
    // Step 3 - Show the schema
    baseDS.printSchema()
    // Step 4 - Show the data
    baseDS.show()
    // Step 5 - Extract firstName and lastName as separate columns
    import spark.implicits._
    val resultDS = baseDS
      .withColumn("firstName", $"name.firstName")
      .withColumn("lastName", $"name.lastName")
    // Step 6 - Show the data
    resultDS.show()
  }
}
Let’s analyse the code above.
- Step 1 – Creates a Spark session.
- Step 2 – Reads the JSON document and creates baseDS.
  - Make sure you enable the multiline option so that multi-line JSON documents can be read.
  - The json method takes the file path.
- Step 3 – Shows the JSON schema of baseDS.
- Step 4 – Shows the data from the JSON document as a dataset.
- Step 5 – Extracts firstName and lastName from baseDS and adds them as two new columns to the dataset.
  - Uses the withColumn method to add the two columns (firstName and lastName) to the baseDS dataset.
  - Uses import spark.implicits._ so that the $"column" syntax can be used to reference the nested fields.
- Step 6 – Shows the resulting dataset.
  - The name attribute has not been removed.
The output from step 6 shows that firstName and lastName are now two separate columns in the dataset, extracted from the nested name column. The standard Spark API can now be applied to process these columns.
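If the original nested column is no longer needed, a small optional addition (not part of the original program; the flatDS name is just for illustration) would be to drop it after the extraction:
// Keep only the flattened columns by dropping the original nested struct
val flatDS = resultDS.drop("name")
flatDS.show()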
Nested Documents with Arrays
In this section, we will deal with a slightly more complex example and introduce a new function called explode. Let’s assume we want to group all employees by their department name in the JSON document. It would look something like this:
[
  {
    "departmentName": "HR",
    "employees": [
      {
        "name": {
          "firstName": "John",
          "lastName": "Simth"
        },
        "age": 32
      },
      {
        "name": {
          "firstName": "Tim",
          "lastName": "Hunter"
        },
        "age": 55
      },
      {
        "name": {
          "firstName": "Mark",
          "lastName": "Kent"
        },
        "age": 23
      }
    ]
  },
  {
    "departmentName": "Sales",
    "employees": [
      {
        "name": {
          "firstName": "Amanda",
          "lastName": "Miles"
        },
        "age": 29
      },
      {
        "name": {
          "firstName": "Lewis",
          "lastName": "Hayes"
        },
        "age": 42
      }
    ]
  }
]
The above JSON document can be easily parsed using the code below. It uses the explode function to turn each element of the nested employees array into a separate row of data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkJSON {
  def main(args: Array[String]): Unit = {
    // Step 1 - Create a Spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark JSON")
      .getOrCreate()
    // Step 2 - Read data from a local JSON file
    val baseDS = spark.read
      .option("multiline", true)
      .json("nestedWithArrays.json")
    // Step 3 - Show the schema
    baseDS.printSchema()
    // Step 4 - Show the data
    baseDS.show()
    // Step 5 - Extract firstName, lastName, age as separate columns
    import spark.implicits._
    val resultDS = baseDS
      .withColumn("employee", explode($"employees"))
      .withColumn("firstName", $"employee.name.firstName")
      .withColumn("lastName", $"employee.name.lastName")
      .withColumn("age", $"employee.age")
    // Step 6 - Show atomic data
    resultDS.select("departmentName", "firstName", "lastName", "age").show()
  }
}
Let’s analyse the code:
- Step 1 – Create a Spark session.
- Step 2 – Read data from a local JSON file.
- Step 3 – Show the nested schema (a rough sketch of it appears after this list).
  - Observe that the employees element is an array.
  - Inside employees, each element is a struct type, and name is itself a struct type.
  - It is possible to have further nested arrays and/or struct types.
- Step 4 – Show the data as understood by Spark.
- Step 5 – Extract firstName, lastName and age as separate columns. The explode function creates one row for each element of the employees array; the subsequent withColumn calls then pull the atomic fields out of each exploded employee struct.
- Step 6 – Show the atomic type columns only.
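A rough sketch of the nested schema referred to in step 3, based on the sample document above:
root
 |-- departmentName: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- age: long (nullable = true)
 |    |    |-- name: struct (nullable = true)
 |    |    |    |-- firstName: string (nullable = true)
 |    |    |    |-- lastName: string (nullable = true)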
The relevant parts of the output are shown below.
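The final select produces one row per exploded employee. Given the sample data above, the result should look roughly like this:
+--------------+---------+--------+---+
|departmentName|firstName|lastName|age|
+--------------+---------+--------+---+
|            HR|     John|   Simth| 32|
|            HR|      Tim|  Hunter| 55|
|            HR|     Mark|    Kent| 23|
|         Sales|   Amanda|   Miles| 29|
|         Sales|    Lewis|   Hayes| 42|
+--------------+---------+--------+---+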
This brings us to the end of this blog entry. I hope it has been helpful to you.