Spark & Amazon S3

Introduction

Till now we have only concentrated on reading data from local file systems, which may be fine in some use cases but does not apply to big data and/or cloud-based environments. Everyone knows about Amazon Web Services and the hundreds of services it offers. One of its earliest and most widely used services is Simple Storage Service, or simply S3. You can read more about S3 on this link.

In this blog entry, we try to see how to develop a Spark-based application which reads and/or writes to AWS S3, so that it can later be deployed on the AWS cloud. But before we do that, we need to write a program that works.

Before we begin, there are a couple of assumptions here –

  • You understand the basics of AWS Identity & Access Management – like creating a user, an access key and a secret access key. If not, check this link.
  • You understand how to work with AWS S3 via command-line utilities. If not, check this link.

Setup for AWS S3

Reading/writing data from/to S3 during Spark application development requires two things.

  • AWS credentials – access key ID and secret access key
  • Appropriate libraries

Let’s see how it is done. Create a Scala project in IntelliJ. Check this link if you need a quick recap.

Step – 1 – Add the library dependencies.

Open your build.sbt and add the following. Library dependencies can be obtained from this link.

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

The hadoop-aws library is the one which enables Spark to access S3 on AWS. Your build.sbt should look something like this below

Step – 2 – Setup AWS credentials in IntelliJ

In this step we will provide the AWS credentials so that our Spark application can access the AWS S3 service.

Go to Run -> Edit Configurations; it should open the following window.

Click on Environment Variables -> Press +

Add the following

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

You can get these credentials from the AWS admin console, or your AWS administrators will be able to provide you with them. Press OK on both the dialogue boxes and you are ready to go.
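If setting environment variables in the IDE is not an option, the same credentials can also be supplied programmatically through the Hadoop configuration that Spark uses for the s3a filesystem. Below is a minimal sketch, assuming the standard fs.s3a.* properties; the placeholder values are obviously not real keys, and hardcoding credentials like this is only acceptable for quick local experiments.

import org.apache.spark.sql.SparkSession

object SparkS3CredentialsExample {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3 Credentials")
      .getOrCreate

    //Alternative to environment variables: set the S3A credential
    //properties directly on the Hadoop configuration used by Spark.
    //Replace the placeholders with your own keys (or, better, load
    //them from a secure location at runtime).
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<YOUR_ACCESS_KEY_ID>")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<YOUR_SECRET_ACCESS_KEY>")
  }
}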

Read data via S3

We are now ready to write a simple program to read the file and show the data. For this example, I have already created an S3 bucket and uploaded a file.
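For reference, products.csv is assumed to be a plain comma-delimited file with a header row. A hypothetical sample is shown below – your actual file may have more columns, but the later examples expect at least Product, Year and Quantity.

Product,Year,Quantity
Notebook,2018,150
Pencil,2019,320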

import org.apache.spark.sql.SparkSession

object SparkS3Example {
  def main(args: Array[String]): Unit = {

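    //Step 1 - Create a Spark session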
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3")
      .getOrCreate

    //Step 2 - Read the products.csv
    val ds = spark.read
      .option("header",true)
      .option("delimiter",",")
      .csv("s3a://cloudwalker-sparks3/input/products.csv")

    //Step 3 - Show the data
    ds.show
  }
}

Let’s analyse the program

  • Step 1 – Create a Spark session
  • Step 2 – Read the file from S3. Observe how the location of the file is given. From an application development perspective, it is as easy as any other file path. Once Spark has access to the data, the remaining APIs remain the same – see the short sketch after this list.
  • Step 3 – Show the data
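To illustrate that last point, here is a minimal sketch of a few ordinary DataFrame operations on the S3-backed dataset. The column names Product, Year and Quantity are assumed from the write example later in this post, and the where clause is purely illustrative.

import org.apache.spark.sql.SparkSession

object SparkS3ReadMoreExample {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3 Read More")
      .getOrCreate

    //Same read as above - the S3 path is the only S3-specific part
    val ds = spark.read
      .option("header",true)
      .option("delimiter",",")
      .csv("s3a://cloudwalker-sparks3/input/products.csv")

    //From here on these are ordinary DataFrame operations
    ds.printSchema()
    ds.where("Year = '2019'")      //columns read from CSV are strings by default
      .select("Product","Quantity")
      .show
  }
}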

The relevant portion of the log is shown below.

Write data to S3

Another thing we have not done till now is save our data anywhere. All we have done is show our data in the console. It is time to see how to save our data on S3.

Writing to an S3 bucket is quite easy. Below is the program which does that. Make sure you have applied the Setup for AWS S3 described above.

import org.apache.spark.sql.SparkSession

object SparkS3Example {
  def main(args: Array[String]): Unit = {

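    //Step 1 - Create a Spark session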
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3")
      .getOrCreate

    //Step 2 - Read the products.csv
    val ds = spark.read
      .option("header",true)
      .option("delimiter",",")
      .csv("s3a://cloudwalker-sparks3/input/products.csv")

    //Step 3 - Select some data and save it to S3.
    ds.select("Product","Year","Quantity")
      .write
      .option("header",true)
      .csv("s3a://cloudwalker-sparks3/output/mydata")
  }
}

Let’s analyse the program

  • Step 1 – Create a Spark session
  • Step 2 – Read the file from S3. Observe how the location of the file is given.
  • Step 3 – Select some data columns and write them to a folder. Keep in mind that we are not mentioning the name of the file – there is a reason for this. In Spark, as we know, the executors do all the hard work; they are independent processes running on separate nodes in a cluster, and they are assigned tasks by the driver. Each task writes its own CSV file. In my cluster there were three tasks, and hence three files were generated. Keep in mind that when you run your application it may have a different number of tasks and executors; you can check that in the generated log. If you need a single output file, see the sketch after this list.
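If you really do need a single output file, one common workaround is to coalesce the data down to a single partition before writing, as sketched below. Note the trade-off: with one partition only one task writes the data, which gives up the parallelism for large datasets. The output folder name used here is just an example.

import org.apache.spark.sql.SparkSession

object SparkS3SingleFileExample {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3 Single File")
      .getOrCreate

    val ds = spark.read
      .option("header",true)
      .option("delimiter",",")
      .csv("s3a://cloudwalker-sparks3/input/products.csv")

    //coalesce(1) forces a single partition, so only one task writes
    //and exactly one part file appears in the output folder.
    //Spark still chooses the part file's name.
    ds.select("Product","Year","Quantity")
      .coalesce(1)
      .write
      .option("header",true)
      .csv("s3a://cloudwalker-sparks3/output/mydata-single")
  }
}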

Once the application is run, check the logs for any errors. If there are no errors, head over to the AWS S3 console. See below

If you download and open any of the files, you should be able to see your data. See below

Another variation of the write API is shown below. It allows you to pass the format as an option, so you can generate different file formats like Parquet, Avro, etc. See below

import org.apache.spark.sql.SparkSession

object SparkS3Example {
  def main(args: Array[String]): Unit = {

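    //Step 1 - Create a Spark session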
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3")
      .getOrCreate

    //Step 2 - Read the products.csv
    val ds = spark.read
      .option("header",true)
      .option("delimiter",",")
      .csv("s3a://cloudwalker-sparks3/input/products.csv")

    //Step 3 - Select some data and save it to S3 as parquet
    ds.select("Product","Year","Quantity")
      .write
      .option("header",true)
      .format("parquet")
      .save("s3a://cloudwalker-sparks3/output/mydata")
  }
}

The files in the S3 bucket are shown below. Compare the size of the data with the CSV files from the earlier program.

Parquet files in AWS Console

You can now configure how you save data in various data stores. It is not necessary to read and write in the same format – you can read from any data store or format and write your data to any other data store or format.
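As a small sketch of that idea, the program below reads the Parquet output produced by the previous example and writes the same data back to S3 as JSON; the output folder name is just an example.

import org.apache.spark.sql.SparkSession

object SparkS3FormatConversionExample {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark S3 Format Conversion")
      .getOrCreate

    //Step 1 - Read the parquet files written by the earlier program
    val ds = spark.read
      .format("parquet")
      .load("s3a://cloudwalker-sparks3/output/mydata")

    //Step 2 - Write the same data back to S3 as JSON
    ds.write
      .format("json")
      .save("s3a://cloudwalker-sparks3/output/mydata-json")
  }
}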

Hope this has been useful for you.

Take care byeeeeeeeeeeee!
