Apache Hadoop is a framework for storing and processing large-scale data using distributed computing. While base Hadoop includes four modules (HDFS, MapReduce, YARN, and Hadoop Common), we’ll only concern ourselves with the first two:
- Hadoop Distributed File System (HDFS) is Hadoop’s system for distributed storage. It splits up files and distributes them across servers in a distributed computing cluster.
- MapReduce is Hadoop’s data processing engine. It provides a programming model for processing the data that HDFS has split and distributed across the cluster.
While MapReduce is a powerful tool for processing large-scale data, it’s a completely new processing model for data scientists to learn. Fortunately, Apache Hive can be deployed alongside Hadoop and provides a query language (called Hive QL) that is very similar to SQL. Ultimately, the query we submit through Hive is translated to MapReduce (or Tez or Spark) and run – but this complexity is completely abstracted away from us. We’re left with a super easy-to-use system for creating and querying databases and tables, with mostly the same functionality and syntax that we use in SQL.
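If you’re curious about this translation, Hive can show it to you: prefixing a query with EXPLAIN prints the execution plan (the map and reduce stages) instead of running the query. For example, once we’ve created the salary table later in this tutorial, you could run:

hive> EXPLAIN SELECT MAX(salary) FROM salary;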
In this tutorial, we’ll spin up a Hadoop cluster with Hive using AWS’s EMR service.
OK, let’s get started.
Spin Up EMR Hadoop Cluster with Hive and Connect through SSH
To set up the cluster, follow the same steps you did for Get Started with Apache Spark in 20 Minutes, but for “Instance type”, choose m1.large, and for “Applications” choose the option with both Hive and Pig.
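If you’d rather script this step, the AWS CLI’s emr create-cluster command does roughly the same thing. The line below is only a sketch: the cluster name, key pair name, and release label are placeholders, and you’ll need to pick a release label that actually supports your chosen instance type in your region:

aws emr create-cluster --name "hive-tutorial" --release-label emr-5.36.0 --applications Name=Hive Name=Pig --instance-type m1.large --instance-count 3 --ec2-attributes KeyName=your-key-name --use-default-roles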
To connect to the master node using ssh, follow the same steps from Get Started with Apache Spark in 20 Minutes.
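As a refresher, the connection command looks like the line below, where the key file and the master node’s public DNS are placeholders for your own values:

ssh -i ~/your-key-pair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com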
Prep CSV Files in AWS S3
Follow the steps from Get Started with Apache Spark in 20 Minutes to copy the Salaries.csv and Master.csv onto S3, with two differences:
- Go into your bucket and create two folders: one called Salaries and one called Master.
- Now copy Salaries.csv into the “Salaries” folder and Master.csv into the “Master” folder.
Why do we need to do this step? The Hive command we’ll use to import the CSV files will import all files in the folder we pass to it. Since we want to produce two separate Hive tables with different structures with each table only including data from a single file, we need to separate our two target files into their own separate directories.
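If you prefer the AWS CLI to the S3 console for this step, the two copies look like this (your-bucket-name is a placeholder and should match the bucket you reference in the CREATE TABLE statements later):

aws s3 cp Salaries.csv s3://your-bucket-name/Salaries/Salaries.csv
aws s3 cp Master.csv s3://your-bucket-name/Master/Master.csv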
Launch Hive Command Line Interface
On the terminal window showing the ssh session with your master node, you should see the following prompt:
[hadoop@ip-address ~]$
Enter the command “hive”:
[hadoop@ip-address ~]$ hive
Now you should see the Hive command line interface prompt:
hive>
For the rest of the tutorial in Hive, we’ll follow steps very similar to those in Get Started with PostgreSQL in 20 Minutes. You’ll notice that Hive QL syntax is also very similar to SQL syntax.
Create Database and Initialize Tables
Let’s create a database called “baseball”, connect to the “baseball” database, and create two tables called “salary” and “master”:
Create “baseball” database:
hive> CREATE DATABASE baseball;
Connect to “baseball” database:
hive> USE baseball;
Create table called “salary” in “baseball” database:
hive> CREATE TABLE salary (yearID STRING, teamID STRING, lgID STRING, playerID STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3n://your-bucket-name/Salaries';
Create table called “master” in “baseball” database:
hive> CREATE TABLE master (playerID STRING, birthYear STRING, birthMonth STRING, birthDay STRING, birthCountry STRING, birthState STRING, birthCity STRING, deathYear STRING, deathMonth STRING, deathDay STRING, deathCountry STRING, deathState STRING, deathCity STRING, nameFirst STRING, nameLast STRING, nameGiven STRING, weight STRING, height STRING, bats STRING, throws STRING, debut STRING, finalGame STRING, retroID STRING, bbrefID STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3n://your-bucket-name/Master';
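One caveat: if your copies of the CSV files still include their header row, Hive will read that row as data (in the salary table, the header text can’t be parsed as an INT, so it will show up as NULL). On Hive 0.13 and later, you can tell Hive to skip the first line of each file with a table property, for example:

hive> ALTER TABLE salary SET TBLPROPERTIES ('skip.header.line.count'='1');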
After each of these statements, Hive should respond with “OK” in the terminal window.
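If you want to verify the setup, SHOW TABLES lists the tables in the current database and DESCRIBE prints a table’s columns and types:

hive> SHOW TABLES;
hive> DESCRIBE salary;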
Join Tables and Find Player with Highest Annual Salary
The syntax to join the Hive tables and pull the player with the highest annual salary is the same as the SQL syntax in the Get Started with PostgreSQL in 20 Minutes tutorial with one exception: the syntax for the subquery (SELECT MAX…) requires that we use “IN” instead of “=”. So, you can see that Hive QL is very similar to SQL, but it’s not exactly the same.
Here’s the command:
hive> SELECT master.nameFirst, master.nameLast, salary.yearID, salary.salary
FROM master, salary
WHERE master.playerID = salary.playerID AND salary.salary IN (SELECT MAX(salary) FROM salary);
After running the query, Hive will launch a MapReduce job and then print the result row: the first and last name of the player with the highest annual salary, along with the year and the salary amount.
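As an aside, Hive also supports explicit ANSI JOIN ... ON syntax, so the same query could be written like this:

hive> SELECT master.nameFirst, master.nameLast, salary.yearID, salary.salary
FROM master JOIN salary ON master.playerID = salary.playerID
WHERE salary.salary IN (SELECT MAX(salary) FROM salary);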
Congratulations, you’ve just completed your first data analysis using Hive on a distributed Hadoop cluster!
Please remember to log back into your AWS account and terminate your EMR cluster. From the AWS console, click on “Services” then “EMR” then click on your cluster name. Then click the “Terminate” button. Then click the red “Terminate” button in the pop-up window. If you don’t do this, the EMR cluster and the 3 underlying EC2 instances will keep running indefinitely and will accrue very large charges on your AWS account.
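Alternatively, you can terminate the cluster from the AWS CLI. The cluster ID below is a placeholder; yours starts with “j-” and is shown at the top of your cluster’s detail page:

aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX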
Continue to our next tutorial: Get Started with Apache Pig in 20 Minutes.