Spark & Ad-hoc Querying

Till now we have written programs that are compiled into jars and run on some Spark cluster. This works fine when the use case is batch processing and the output is being consumed by downstream applications.

But what if we, as users, want to use the raw computing power of Spark for some ad-hoc SQL queries? Aah, the good old world of an SQL command line or an SQL editor, firing a bunch of queries. What about JDBC connections to some BI applications? Well, look no further – all of that is possible!!

In this blog post, we will look at how multiple users can interface with Spark for ad-hoc querying through the Spark Thrift Server, by creating a JDBC connection to it and firing queries against data stored in Hive.

Before we start, let’s see what is required to get going:

  • Access to a Hadoop cluster with Hive running.
  • A simple SQL IDE which supports JDBC, like DBeaver.

I have created a small three-node cluster on AWS EMR to get started. I have access to this cluster via SSH and the Hue browser.
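For reference, a similar cluster can be spun up from the AWS CLI with something along the following lines. This is only a sketch – the release label, instance type, and key name are placeholders for your own values.

aws emr create-cluster \
  --name "spark-adhoc-demo" \
  --release-label emr-6.9.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Hue \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair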

Setup

Hive Setup

Step – 1 – Log on to the Hue Browser

Hue Browser

Step – 2 – Execute the following script

CREATE DATABASE testdb;

DROP TABLE IF EXISTS testdb.products;

CREATE EXTERNAL TABLE testdb.products(
    retailercountry VARCHAR(100),
    ordermethodtype VARCHAR(100),
    retailertype VARCHAR(100),
    productline VARCHAR(100),
    producttype VARCHAR(100),
    product VARCHAR(100),
    year VARCHAR(100),
    quarter VARCHAR(100),
    revenue FLOAT,
    quantity FLOAT,
    grossmargin FLOAT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ","
)
STORED AS TEXTFILE
-- the data file we upload below starts with a header row; skip it
TBLPROPERTIES ("skip.header.line.count" = "1");
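To sanity-check the setup, you can confirm the table exists and inspect its columns:

SHOW TABLES IN testdb;
DESCRIBE testdb.products;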

Data Upload

A data file with the following sample content is uploaded to /user/hive/warehouse/testdb.db/products/products_v2.csv:

retailercountry,ordermethodtype,retailertype,productline,producttype,product,year,quarter,revenue,quantity,grossmargin
United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.34754797
United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.4742745
United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.35277197
United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.28293788
United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145017
United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.39814629
United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.33560737
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896022
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.43420523
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.46149254
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Butane,2012,Q1 2012,6940.03,109,0.36186587

File in HDFS via Hue Browser
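If you prefer the terminal over the Hue file browser, the same upload can be done over SSH:

hdfs dfs -put products_v2.csv /user/hive/warehouse/testdb.db/products/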

Let’s query the table via Hive using the following query:

SELECT * FROM testdb.products LIMIT 10;

Data via Hive

Spark Thrift Server

Start Thrift Server

Now that we have the data and the setup is complete, it is time to start the Thrift server. Starting a Spark Thrift server is simple. You need to execute the following steps:

  • Go to $SPARK_HOME/sbin
  • Fire the following command:
./start-thriftserver.sh

The Thrift server will start with default settings. It should give you something like this:

Start Thrift Server

We can also go to the YARN Resource Manager UI and see that a Thrift Server application is running, ready to accept queries. Below is a screenshot.

Resource Manager UI

If you go to the Spark Web UI, it will show the following:

Spark Web UI

In case you want to configure the Spark Thrift server, use the following command to list all the configuration parameters:

./start-thriftserver.sh --help

This should give you a long list of configurable parameters.
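The server accepts the usual spark-submit options plus Hive configuration via --hiveconf. As a sketch, a more tuned startup might look like this – the port number and resource sizes are purely illustrative:

./start-thriftserver.sh \
  --master yarn \
  --num-executors 2 \
  --executor-memory 2g \
  --hiveconf hive.server2.thrift.port=10001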

Query Using Beeline

Once the Thrift server is up and running, it is time to do some querying. Let’s connect to our Hive and query the products table.

Step – 1 – Start beeline

Start beeline

Step – 2 – Connect to the Thrift server

Use the following command in beeline:

!connect jdbc:hive2://localhost:10000

This should connect to the Spark context, which will in turn connect to Hive. When prompted for a username and password, just press ENTER.
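As a shortcut, beeline can also take the JDBC URL directly at launch:

beeline -u jdbc:hive2://localhost:10000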

Step – 3 – Fire queries to the Thrift server

Let’s fire the same simple query we ran earlier:

SELECT * FROM testdb.products LIMIT 10;

It should return something like the output below.

data in beeline via thrift server

Querying using SQL JDBC client

But not everyone is geared up to use SSH, and some users may not even have an account on the cluster. Let’s look at connecting to the Thrift server via a SQL JDBC client like DBeaver on your PC/laptop.

Step – 1 – Create a Hive JDBC connection

  • Go to Database -> New Connection
  • Select Hadoop -> Spark Hive
  • Press Next and fill in the following connection details:
    • URL – jdbc:hive2://<EMR master node public DNS>:10000
    • Database – testdb

Press Next and finish the connection wizard.

If all the settings are correct, it should now connect to the Thrift server. (On EMR, make sure the master node’s security group allows inbound traffic on port 10000 from your machine.)

Step – 2 – Fire queries to the Thrift server
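Run the same query as before, this time from the DBeaver SQL editor:

SELECT * FROM testdb.products LIMIT 10;

The results should match what we saw in beeline.

And if you would rather query the Thrift server from code than from a GUI, the same JDBC URL works programmatically. Below is a minimal Java sketch, assuming the Hive JDBC driver (hive-jdbc) is on the classpath; the hostname is a placeholder for your EMR master node’s public DNS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder host – replace with your EMR master node's public DNS.
        String url = "jdbc:hive2://<emr-master-public-dns>:10000/testdb";

        // Empty username/password, same as pressing ENTER in beeline.
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, revenue FROM testdb.products LIMIT 10")) {
            while (rs.next()) {
                // OpenCSVSerde-backed columns come back as strings,
                // so we read everything with getString.
                System.out.println(rs.getString("product")
                        + "\t" + rs.getString("revenue"));
            }
        }
    }
}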

This brings us to the end of this blog entry. Hope you found it helpful! Till next time…..byeeeeeeeee!
