XML is another popular document format that is worth exploring when using Spark. It is used in a wide variety of systems and is supposed to be “human readable” – though that claim is doubtful when you look at some really big XML documents. Having said that, it is still possible to read, parse and understand an XML document in Spark.
Though Spark does not have native support for XML as it does for JSON, things are not all that bad. There is a library available to parse XML documents: spark-xml, provided by Databricks.
Reading XML documents
To make it easier to understand how to read XML documents, this blog post is divided into two parts
- Simple XML documents
- Nested XML documents
Before we can read any XML documents we need to include the spark-xml library as a dependency in our project.
For this blog entry your build.sbt should at least have these two libraries.
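A minimal build.sbt sketch is shown below. I am assuming the two dependencies in question are spark-sql and spark-xml, and the Scala and library versions are only illustrative, so substitute the ones that match your Spark installation.

name := "spark-xml-example"

scalaVersion := "2.12.15"

// Spark SQL for the DataFrame/Dataset API and spark-xml for the XML data source (versions are assumptions)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.0",
  "com.databricks" %% "spark-xml" % "0.15.0"
)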

Simple XML documents
Let’s divide this section into two parts.
- Simple XML document with just one record
- Simple XML document with multiple records of the same type
Simple XML Document with just one record
This is probably the simplest form of XML document, and it may look something like this:
<employee>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>32</age>
<departmentName>HR</departmentName>
</employee>
The above is a simple XML document with one row of data about an employee. Let’s see how we can write a simple program to parse this record.
import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._

object SparkXML {
  def main(args: Array[String]): Unit = {
    //Step 1 - Create a spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark XML")
      .getOrCreate
    //Step 2 - Read data file
    val baseDS = spark.read
      .option("rowTag", "employee")
      .xml("employee.xml")
    //Step 3 - Print Schema
    baseDS.printSchema
    //Step 4 - Show the data
    baseDS.show
  }
}
Let’s analyse this code
- Step 1 – Creates a spark session
- Step 2 – Reads an XML file. It passes an option called rowTag telling Spark which XML tag can be used to identify a row of data.
- Step 3 – Prints the schema of the XML file using the standard API. Spark uses schema inference if a schema is not provided; a sketch of supplying an explicit schema is shown after the log output below.
- Step 4 – Shows the data of the XML file using the standard API
The relevant parts of the log are shown below


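As mentioned in Step 3, Spark infers the schema when one is not supplied. If you would rather skip the inference pass, you can pass a schema explicitly. Below is a rough sketch assuming the employee fields shown earlier; the field types are my assumptions, not something dictated by the file.

import org.apache.spark.sql.types._

//An explicit schema for the employee record shown above (types assumed)
val employeeSchema = StructType(Seq(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("age", LongType),
  StructField("departmentName", StringType)
))

//Read the same file again, this time without schema inference
val typedDS = spark.read
  .option("rowTag", "employee")
  .schema(employeeSchema)
  .xml("employee.xml")
typedDS.printSchema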
Simple XML Document with multiple records of the same type
Let’s now make the XML document slightly more complex with a set of rows rather than just one row. Below is a sample XML file:
<employees>
<employee>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>32</age>
<departmentName>HR</departmentName>
</employee>
<employee>
<firstName>Tim</firstName>
<lastName>Hunter</lastName>
<age>55</age>
<departmentName>Sales</departmentName>
</employee>
<employee>
<firstName>Mark</firstName>
<lastName>Kent</lastName>
<age>23</age>
<departmentName>Production</departmentName>
</employee>
</employees>
If the same program is run, it will give the following output in the logs.


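As a side note, it is the rowTag option that decides what counts as a row. If you were to point rowTag at the root element instead, spark-xml would, as far as I understand its behaviour, return a single row whose employee column is an array of structs, which is rarely what you want for tabular analysis. A quick sketch:

//rowTag pointed at the root element instead of the repeating element
val wholeDocDS = spark.read
  .option("rowTag", "employees")
  .xml("employee.xml")
wholeDocDS.printSchema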
Simple XML documents with attributes
Let’s now make our document a bit more complicated by adding attributes to the XML elements, like below:
<employee empId="1001">
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>32</age>
<departmentName>HR</departmentName>
</employee>
<employee empId="1002">
<firstName>Tim</firstName>
<lastName>Hunter</lastName>
<age>55</age>
<departmentName>Sales</departmentName>
</employee>
<employee empId="1003">
<firstName>Mark</firstName>
<lastName>Kent</lastName>
<age>23</age>
<departmentName>Production</departmentName>
</employee>
</employees>
If the same program is run again, it will give the following output in the logs.

Observe how a new column is added to the schema of the dataset. spark-xml prefixes attribute names with an underscore by default, so the empId attribute appears as a column named _empId.
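Because attributes get the underscore prefix, the new column can be selected like any other column. A small sketch, assuming the default attribute prefix:

//The empId attribute is exposed as the _empId column by default
baseDS.select("_empId", "firstName", "departmentName").show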

Nested XML documents
Nested XML documents can be understood and converted into atomic types via Spark APIs similar to those used for JSON documents.
Let’s look at an XML document with multiple rows and nested XML structures.
<employee empId="1001">
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>32</age>
<departmentName>HR</departmentName>
<contactDetails>
<phone phoneType="mobile">20010715</phone>
<phone phoneType="landline">20010715</phone>
</contactDetails>
</employee>
<employee empId="1002">
<firstName>Tim</firstName>
<lastName>Hunter</lastName>
<age>55</age>
<departmentName>Sales</departmentName>
<contactDetails>
<phone phoneType="mobile">40010715</phone>
<phone phoneType="landline">40010715</phone>
</contactDetails>
</employee>
<employee empId="1003">
<firstName>Mark</firstName>
<lastName>Kent</lastName>
<age>23</age>
<departmentName>Production</departmentName>
<contactDetails>
<phone phoneType="mobile">60010715</phone>
<phone phoneType="landline">60010715</phone>
</contactDetails>
</employee>
</employees>
Let’s look at the program below to see how this data can be parsed and presented in a dataset.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.databricks.spark.xml._

object SparkXML {
  def main(args: Array[String]): Unit = {
    //Step 1 - Create a spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark XML")
      .getOrCreate
    //Step 2 - Read data file
    val baseDS = spark.read
      .option("rowTag", "employee")
      .xml("employee.xml")
    //Step 3 - Print Schema
    baseDS.printSchema
    //Step 4 - Extract atomic elements
    import spark.implicits._
    val resultDS = baseDS
      .withColumn("phone", explode($"contactDetails.phone")) //Explode the array of phone structs into one row per phone
      .withColumn("phoneType", $"phone._phoneType")           //phoneType is an attribute, hence the "_" prefix
      .withColumn("phone", $"phone._VALUE")                   //_VALUE holds the text content of the element
    //Step 5 - Show the data
    resultDS.select($"_empId" as "empId", $"firstName", $"lastName", $"phoneType", $"phone").show
  }
}
Let’s analyse the code above.
- Step 1 – Creates a spark session
- Step 2 – Reads the XML documents
- Step 3 – Prints the schema as inferred by Spark
- Step 4 – Extracts the atomic elements from the array of struct type using the explode and withColumn APIs, which is similar to the approach used for extracting JSON elements.
- Step 5 – Shows the data.
The relevant parts of the log are shown below

Observe that contactDetails is a struct type and phone is an array of struct type, which contains _phoneType and _VALUE as two atomic types.

Using the logic mentioned above, it is possible to evaluate and extract data from XML documents into Spark DataSets. Once in Spark DataSets, all the existing Spark APIs can be applied without any additional modification.
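For example, a quick aggregation over the dataset read earlier, just a sketch to show that nothing XML-specific is needed once the data is loaded:

//Ordinary DataFrame operations apply unchanged once the XML is loaded
baseDS.groupBy("departmentName").count().show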
Hope this entry has been helpful in increasing your understanding of Spark and XML documents. Till then byeeeeeee!!!