Analytics Using Spark and HBase

In this assignment I used Spark to connect to data stored in HBase tables and ran analytical queries. Since HBase is not available on the UTD cluster, I used Cloudera's Docker container. The assignment has the following steps.

Step I:

  1. Download the bike sharing dataset from: http://www.utdallas.edu/~axn112530/cs6350/data/bikeShare/201508_trip_data.csv
    Hint: On the UNIX shell, you can run the following
    curl -o 201508_trip_data.csv http://www.utdallas.edu/~axn112530/cs6350/data/bikeShare/201508_trip_data.csv
  2. Analyze the data and look at the fields. Check whether it has a header. Create a table with at least one column family in HBase so that this data can be imported. You can do this using the command line or the Hue GUI.
  3. Import the data into the table that you created in step 2. You can do this using any of the Hadoop technologies, such as Pig or Spark. An example of this was shown in class.
  4. Make sure that the data has been imported correctly by looking at it on the Hue GUI.
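The table creation in step 2 can be sketched in the HBase shell as follows, assuming the table name `trip_data` and a single column family `data` (the names used in the Pig script below):

```shell
# Open the HBase shell and create the table with one column family.
# Table and column family names are the ones used elsewhere in this writeup.
echo "create 'trip_data', 'data'" | hbase shell

# Confirm the table exists and inspect its schema
echo "describe 'trip_data'" | hbase shell
```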

Step II:

  1. In this step, I used Spark to connect to the HBase table that I created in step I. Below are some hints:
    • Download the Spark HBase connector jar file from: https://github.com/nerdammer/spark-hbase-connector
      The above page also contains helpful hints and code snippets. You can download the jar directly as:
      curl -o spark-hbase-connector.jar http://central.maven.org/maven2/it/nerdammer/bigdata/spark-hbase-connector_2.10/0.9.2/spark-hbase-connector_2.10-0.9.2.jar
    • When starting Spark shell use the following command: spark-shell --jars spark-hbase-connector.jar
    • On the first line of the Spark shell, import the library as: import it.nerdammer.spark.hbase._
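Once the shell is up with the connector on the classpath, the imported table can be read back as an RDD. A minimal sketch, assuming the `trip_data` table and `data` column family from Step I (the column selection here is just an example):

```scala
import it.nerdammer.spark.hbase._

// Read selected columns from the 'data' column family of the trip_data table.
// Each tuple is (rowKey, duration, start_station, end_station); the row key
// is always the first element.
val trips = sc.hbaseTable[(String, String, String, String)]("trip_data")
  .select("duration", "start_station", "end_station")
  .inColumnFamily("data")

// Spot-check a few rows
trips.take(5).foreach(println)
```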

Commands used to load the data, store it into HBase, and connect Spark to HBase:

-- Load the CSV; the first field (trip_id) becomes the HBase row key.
-- If the file still contains its header row, strip it first (e.g. tail -n +2).
T = LOAD '/user/cloudera/201508_trip_data.csv' USING PigStorage(',') AS (
		trip_id:chararray,
		duration:chararray,
		start_date:chararray,
		start_station:chararray,
		start_terminal:chararray,
		end_date:chararray,
		end_station:chararray,
		end_terminal:chararray,
		bikeno:chararray,
		subscriber_type:chararray,
		zipcode:chararray);

-- Store into the 'trip_data' HBase table; the remaining fields map, in order,
-- to columns in the 'data' column family.
STORE T INTO 'hbase://trip_data' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
		'data:duration,
		data:start_date,
		data:start_station,
		data:start_terminal,
		data:end_date,
		data:end_station,
		data:end_terminal,
		data:bikeno,
		data:subscriber_type,
		data:zipcode'
);
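One way to run the script above and spot-check the import from the command line (a sketch; the script filename `load_trips.pig` is an assumption):

```shell
# Run the Pig script (assumed saved as load_trips.pig) against the cluster
pig -x mapreduce load_trips.pig

# Scan a couple of rows in the HBase shell to verify the import
echo "scan 'trip_data', {LIMIT => 2}" | hbase shell
```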

curl -o spark-hbase-connector.jar http://central.maven.org/maven2/it/nerdammer/bigdata/spark-hbase-connector_2.10/0.9.2/spark-hbase-connector_2.10-0.9.2.jar 

spark-shell --jars spark-hbase-connector.jar

The following queries were then answered: