Spark & XML

XML is another popular document format that is worth exploring when using Spark. It is used in many systems and is supposed to be “human readable”, though I have my doubts when I look at some really big XML documents. Having said that, it is still possible to read, parse and understand an XML document in Spark.

Though Spark does not have native support for XML the way it does for JSON, things are not all that bad. There is a library called spark-xml, provided by Databricks, that parses XML documents and is actively maintained by them.

Reading XML documents

To make it easier to understand how to read XML documents, this blog post is divided into two parts

  • Simple XML documents
  • Nested XML documents

Before we can read any XML documents we need to add the spark-xml library to our IntelliJ development environment. Add the following line to build.sbt:

libraryDependencies += "com.databricks" %% "spark-xml" % "0.5.0"

For this blog entry your build.sbt should have at least these two libraries.

build.sbt
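
If you are setting this up from scratch, a minimal build.sbt might look something like the sketch below. The Scala version and the spark-sql version shown here are assumptions; use whatever matches your environment.

scalaVersion := "2.11.12"

// Spark SQL for the DataFrame/Dataset API (version is an assumption; match your setup)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

// Databricks spark-xml for reading XML
libraryDependencies += "com.databricks" %% "spark-xml" % "0.5.0"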

Simple XML documents

Let’s divide this section into three parts.

  • Simple XML document with just one record
  • Simple XML document with multiple records of the same type
  • Simple XML document with attributes

Simple XML Document with just one record

This is probably the simplest form of XML document and may look something like this:

<employee>
    <firstName>John</firstName>
    <lastName>Smith</lastName>
    <age>32</age>
    <departmentName>HR</departmentName>
</employee>

The above is a simple XML document with one row of data about an employee. Let’s see how we can write a simple program to parse this record.

import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._

object SparkXML {
  def main(args: Array[String]): Unit = {
    //Step 1 - Create a spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark XML")
      .getOrCreate

    //Step 2 - Read data file
    val baseDS = spark.read
      .option("rowTag","employee")
      .xml("employee.xml")

    //Step 3 - Print Schema
    baseDS.printSchema

    //Step 4 - Show the data
    baseDS.show
  }
}

Let’s analyse this code

  • Step 1 – Creates a Spark session
  • Step 2 – Reads the XML file. The rowTag option tells Spark which XML tag identifies a row of data.
  • Step 3 – Prints the schema of the dataset using the standard API. Spark infers the schema since none is provided.
  • Step 4 – Shows the data from the XML file using the standard API

The relevant parts of the log are shown below

Step 3 – Prints the schema
Step 4 – Show data from XML file
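
Roughly speaking, the inferred schema (Step 3) and the data (Step 4) should look something like this; the schema is inferred, so the exact types on your machine may differ:

root
 |-- age: long (nullable = true)
 |-- departmentName: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)

+---+--------------+---------+--------+
|age|departmentName|firstName|lastName|
+---+--------------+---------+--------+
| 32|            HR|     John|   Smith|
+---+--------------+---------+--------+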

Simple XML document with multiple records of the same type

Let’s now make the XML document slightly more complex with a set of rows rather than just one row. Below is a sample XML file:

<employees>
    <employee>
        <firstName>John</firstName>
        <lastName>Smith</lastName>
        <age>32</age>
        <departmentName>HR</departmentName>
    </employee>
    <employee>
        <firstName>Tim</firstName>
        <lastName>Hunter</lastName>
        <age>55</age>
        <departmentName>Sales</departmentName>
    </employee>
    <employee>
        <firstName>Mark</firstName>
        <lastName>Kent</lastName>
        <age>23</age>
        <departmentName>Production</departmentName>
    </employee>
</employees>

If the same program is run, it gives the following output in the logs:

Step 3 – Prints the Schema
Step 4 – Show data from XML file
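
The inferred schema stays the same as before, and the show output now has three rows, roughly like this:

+---+--------------+---------+--------+
|age|departmentName|firstName|lastName|
+---+--------------+---------+--------+
| 32|            HR|     John|   Smith|
| 55|         Sales|      Tim|  Hunter|
| 23|    Production|     Mark|    Kent|
+---+--------------+---------+--------+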

Simple XML documents with attributes

Let’s now make our document a bit more complicated by adding attributes to the XML elements, like below:

<employees>
    <employee empId="1001">
        <firstName>John</firstName>
        <lastName>Smith</lastName>
        <age>32</age>
        <departmentName>HR</departmentName>
    </employee>
    <employee empId="1002">
        <firstName>Tim</firstName>
        <lastName>Hunter</lastName>
        <age>55</age>
        <departmentName>Sales</departmentName>
    </employee>
    <employee empId="1003">
        <firstName>Mark</firstName>
        <lastName>Kent</lastName>
        <age>23</age>
        <departmentName>Production</departmentName>
    </employee>
</employees>

If the same program is run again, it gives the following output in the logs:

Step 3 – Prints the schema

Observe how a new column, _empId, is added to the schema of the dataset. An underscore prefix is added when Spark encounters attributes alongside elements in an XML document. This column can be accessed like any other column in the dataset, as the small snippet below shows.
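
For example, sticking with the program above, an illustrative selection of the attribute column might look like this:

//The attribute column _empId can be selected by name like any element column
baseDS.select("_empId", "firstName", "departmentName").show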

Step 4 – Show data from XML file

Nested XML documents

Nested XML documents can also be understood and converted into atomic types via the Spark API, similar to the approach used for JSON documents.

Let’s look at an XML document with multiple rows and nested XML structures.

<employees>
    <employee empId="1001">
        <firstName>John</firstName>
        <lastName>Smith</lastName>
        <age>32</age>
        <departmentName>HR</departmentName>
        <contactDetails>
            <phone phoneType="mobile">20010715</phone>
            <phone phoneType="landline">20010715</phone>
        </contactDetails>
    </employee>
    <employee empId="1002">
        <firstName>Tim</firstName>
        <lastName>Hunter</lastName>
        <age>55</age>
        <departmentName>Sales</departmentName>
        <contactDetails>
            <phone phoneType="mobile">40010715</phone>
            <phone phoneType="landline">40010715</phone>
        </contactDetails>
    </employee>
    <employee empId="1003">
        <firstName>Mark</firstName>
        <lastName>Kent</lastName>
        <age>23</age>
        <departmentName>Production</departmentName>
        <contactDetails>
            <phone phoneType="mobile">60010715</phone>
            <phone phoneType="landline">60010715</phone>
        </contactDetails>
    </employee>
</employees>

Let’s look at the program below to see how this data can be parsed and presented in a dataset.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.databricks.spark.xml._

object SparkXML {
  def main(args: Array[String]): Unit = {
    //Step 1 - Create a spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark XML")
      .getOrCreate

    //Step 2 - Read data file
    val baseDS = spark.read
      .option("rowTag","employee")
      .xml("employee.xml")

    //Step 3 - Print Schema
    baseDS.printSchema

    //Step 4 - extract atomic elements
    import spark.implicits._
    val resultDS = baseDS
      .withColumn("phone",explode($"contactDetails.phone")) //Creates an array of struct type
      .withColumn("phoneType",$"phone._phoneType") //There is an attribute of phoneType
      .withColumn("phone",$"phone._VALUE") // _VALUE is the value contained in the element

    //Step 5 - Show the data
    resultDS.select($"_empId" as "empId", $"firstName", $"lastName", $"phoneType", $"phone").show
  }
}

Let’s analyse the code above.

  • Step 1 – Creates a Spark session
  • Step 2 – Reads the XML document
  • Step 3 – Prints the schema as inferred by Spark
  • Step 4 – Extracts the atomic elements from the array of structs using the explode and withColumn APIs, similar to the approach used for extracting JSON elements
  • Step 5 – Shows the data

The relevant parts of the log are shown below

Step 3 – Print the inferred schema
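
For reference, the inferred schema should look roughly like this (again, inference may produce slightly different types on your data):

root
 |-- _empId: long (nullable = true)
 |-- age: long (nullable = true)
 |-- contactDetails: struct (nullable = true)
 |    |-- phone: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _VALUE: long (nullable = true)
 |    |    |    |-- _phoneType: string (nullable = true)
 |-- departmentName: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)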

Observe that contactDetails is a struct and phone is an array of structs, each of which contains _phoneType and _VALUE as two atomic fields.

Step 5 – Display atomic elements from the XML document
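
The final select should then produce something along these lines, using the values from the sample file:

+-----+---------+--------+---------+--------+
|empId|firstName|lastName|phoneType|   phone|
+-----+---------+--------+---------+--------+
| 1001|     John|   Smith|   mobile|20010715|
| 1001|     John|   Smith| landline|20010715|
| 1002|      Tim|  Hunter|   mobile|40010715|
| 1002|      Tim|  Hunter| landline|40010715|
| 1003|     Mark|    Kent|   mobile|60010715|
| 1003|     Mark|    Kent| landline|60010715|
+-----+---------+--------+---------+--------+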

Using the logic described above it is possible to extract data from XML documents into Spark Datasets. Once in a Dataset, all the existing Spark APIs can be applied without any additional modifications.

Hope this entry has been helpful in increasing your understanding of Spark and XML documents. Till then byeeeeeee!!!
