Until now, we have written programs that are essentially compiled JARs running on some Spark cluster. This works fine if the use case is batch processing of data that is then consumed by downstream applications.
But what if we, as users, want to use the raw computing power of Spark for some ad-hoc SQL queries? Ah, the good old world of an SQL command line or an SQL editor, firing off a bunch of queries. What about JDBC connections to some BI applications? Well, look no further – all of that is possible!
In this blog post, we will look at how multiple users can interface with Spark for ad-hoc querying using the Spark Thrift Server, by creating a JDBC connection to it and firing queries on data stored in Hive.
Before we start, let’s see what is required to get going:
- Access to a Hadoop cluster with Hive running.
- A simple SQL IDE that supports JDBC, such as DBeaver.
I have created a small three-node cluster on AWS EMR to get started. I have access to this cluster via SSH and the Hue browser.
Step – 1 – Log on to the Hue Browser
Step – 2 – Execute the following script
DROP TABLE IF EXISTS testdb.products;
-- column names below are inferred from the sample data that follows; adjust them to match your file
CREATE EXTERNAL TABLE testdb.products(
  country STRING, order_method STRING, retailer_type STRING, product_line STRING,
  product_type STRING, product STRING, order_year STRING, order_quarter STRING,
  revenue STRING, quantity STRING, gross_margin STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ","
)
STORED AS TEXTFILE;
A data file with the following sample content is uploaded to
- /user/hive/warehouse/testdb.db/products/products_v2.csv
United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489,0.34754797
United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame,2012,Q1 2012,35950.32,252,0.4742745
United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome,2012,Q1 2012,89940.48,147,0.35277197
United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303,0.28293788
United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415,0.29145017
United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352,0.39814629
United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Camp Cot,2012,Q1 2012,41837.46,426,0.33560737
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Lite,2012,Q1 2012,8268.41,577,0.52896022
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,Firefly Extreme,2012,Q1 2012,9393.3,189,0.43420523
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Single,2012,Q1 2012,19396.5,579,0.46149254
United States,Fax,Outdoors Shop,Camping Equipment,Lanterns,EverGlow Butane,2012,Q1 2012,6940.03,109,0.36186587
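If you prefer the command line over the Hue file browser, one way to push the file into that location from an SSH session (assuming products_v2.csv is in your current directory on the master node) is:

hdfs dfs -mkdir -p /user/hive/warehouse/testdb.db/products
hdfs dfs -put products_v2.csv /user/hive/warehouse/testdb.db/products/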
Let’s query the file via Hive using the following query.
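For example, a simple aggregation over the table we just created (a representative query based on the column names defined above):

SELECT product_type, COUNT(*) AS product_count
FROM testdb.products
GROUP BY product_type;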
Spark Thrift Server
Start Thrift Server
Now that we have the data and have completed the setup, it is time to start the Thrift Server. Starting a Spark Thrift Server is simple; you need to execute the following steps:
- Go to $SPARK_HOME/
- Fire the following command
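A minimal sketch, assuming Spark is installed in the standard location and spark-defaults.conf already points at YARN (as EMR configures it):

cd $SPARK_HOME
./sbin/start-thriftserver.sh
# the script accepts the usual spark-submit options, e.g. --master yarn --executor-memory 2g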
The Thrift Server will start with default settings and run as a background daemon; the start script prints the path of the log file where its startup output is written.
We can also go to the YARN ResourceManager UI and see that the Thrift Server application is running, ready to accept queries.
If you go to the Spark Web UI, it will show the Thrift Server running as a Spark application.
Its Environment tab should also give you a long list of the configuration parameters in effect.
Query Using Beeline
Once the Thrift Server is up and running, it is time to do some querying. Let’s connect to Hive and query the products table.
Step – 1 – Start beeline
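Spark ships with its own copy of Beeline, so from an SSH session on the master node you can start it with:

$SPARK_HOME/bin/beeline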
Step – 2 – Connect to thrift server
Use the following command in Beeline.
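Assuming the Thrift Server is running on the same host on its default port (10000), the connect string looks like this:

!connect jdbc:hive2://localhost:10000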
This should connect to the Spark context, which will in turn connect to Hive. When prompted for a username and password, just press ENTER.
Step – 3 – Fire Queries to thrift server
Let’s fire the same simple query we ran earlier from Hive.
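For example, the same aggregation we ran from the Hive editor:

SELECT product_type, COUNT(*) AS product_count
FROM testdb.products
GROUP BY product_type;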
It should return one row per product type along with its count.
Querying Using a SQL JDBC Client
But not everyone is geared up to use SSH, and some users may not even have a shell account on the cluster. Let’s look at connecting to the Thrift Server via a SQL JDBC client like DBeaver on your PC or laptop.
Step – 1 – Create a Hive JDBC connection
- Go to Database -> New Connection
- Select Hadoop -> Spark Hive
- Press Next and fill in the following connection details
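A minimal set of values, assuming the Thrift Server is listening on its default port on the EMR master node (replace the host with your master node’s public DNS):
- Host: <EMR master node public DNS>
- Port: 10000
- Database/Schema: default
- Username / Password: leave blank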
Press Next and then Finish to complete the connection setup.
If all the settings are correct, it should now connect to the Thrift Server.
Step – 2 – Fire Queries to thrift server
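Open a SQL editor on the new connection and run the same query as before to confirm that everything works end to end:

SELECT product_type, COUNT(*) AS product_count
FROM testdb.products
GROUP BY product_type;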