
Apache Pig is similar to Apache Hive in that it is an abstraction layer over Hadoop: it provides an easy-to-use programming model and language (called Pig Latin) that ultimately translates into MapReduce (or Tez or Spark) jobs to be run.

For more information about Hadoop, MapReduce, and Hive, check out the first section of our Get Started with Apache Hive in 20 Minutes tutorial.

A discussion of the major differences between Pig and Hive is beyond the scope of this tutorial. However, one difference you can see played out in these tutorials is that HiveQL is similar to SQL: we use the SELECT…FROM…WHERE paradigm to pull data, and if we want to limit our pull using analysis, we try to do so with subqueries.

The processing steps we’ll use with Pig Latin, on the other hand, are closer to those we used in the Get Started with R in 20 Minutes tutorial. As you go through this tutorial, note how we create interim data sets and then use those data sets in later steps of the analysis, just like we did with R.

Ok, let’s get started with Pig.

Spin Up EMR Hadoop Cluster with Pig and Connect through SSH

To set up the cluster, follow the same steps you did for Get Started with Apache Spark in 20 Minutes, but for “Instance type”, choose m1.large, and for “Applications” choose the option with both Hive and Pig.

To connect to the master node using ssh, follow the same steps from Get Started with Apache Spark in 20 Minutes.

Copy CSV Files to S3

Copy the “Salaries.csv” and “Master.csv” files to AWS S3 following the steps in our Get Started with AWS in 20 Minutes tutorial.
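If you prefer the command line, you can also copy the files using the AWS CLI, assuming it’s installed and configured on your local machine (replace “your-bucket-name” with your actual bucket name):

$ aws s3 cp Salaries.csv s3://your-bucket-name/
$ aws s3 cp Master.csv s3://your-bucket-name/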

Launch Pig Command Line Interface

On the terminal window showing the ssh session with your master node, you should see the following prompt:

[hadoop@ip-address ~]$

Enter the command “pig”:

[hadoop@ip-address ~]$ pig

Now you should see the Pig command line interface prompt:

grunt>
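(For later reference: when you’re finished, you can exit the Grunt shell by entering the following at the prompt. Don’t do that yet, though, since we’ll use the shell for the rest of this tutorial.)

grunt> quit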

Import CSV Files into Pig Data Structures

We won’t go into detail about Pig’s data structures in this tutorial, but suffice it to say that Pig is much more flexible than Hive or SQL: it doesn’t require every row in a table to have the same number of fields, or that the values in a given column be of the same data type in every row.
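To make that concrete, here’s a minimal sketch (the relation names “raw” and “years” are just our own choices): if you LOAD a file without declaring a schema, Pig accepts it anyway, and you refer to fields by position rather than by name:

grunt> raw = LOAD 's3n://your-bucket-name/Salaries.csv' USING PigStorage(',');
grunt> years = FOREACH raw GENERATE $0;

Here $0 means “the first field of each row,” whatever that field happens to contain.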

Pig has its own naming convention for each element in its data structures (it calls them relations, tuples, and fields), but for the sake of simplicity we’ll just adopt the convention of saying tables, rows, and columns.

Ok, let’s import the CSV files into Pig:

grunt> salary = LOAD 's3n://your-bucket-name/Salaries.csv' USING PigStorage(',') AS (yearID:chararray, teamID:chararray, lgID:chararray, playerID:chararray, salary:int);


grunt> master = LOAD 's3n://your-bucket-name/Master.csv' USING PigStorage(',') AS (playerID:chararray, birthYear:chararray, birthMonth:chararray, birthDay:chararray, birthCountry:chararray, birthState:chararray, birthCity:chararray, deathYear:chararray, deathMonth:chararray, deathDay:chararray, deathCountry:chararray, deathState:chararray, deathCity:chararray, nameFirst:chararray, nameLast:chararray, nameGiven:chararray, weight:chararray, height:chararray, bats:chararray, throws:chararray, debut:chararray, finalGame:chararray, retroID:chararray, bbrefID:chararray);

So, what just happened? We created two Pig tables called “salary” and “master” using Pig’s LOAD command. In the LOAD command, we told Pig where to find the data and that the data is comma-delimited, and we used “AS” to tell Pig the structure of our data so it can assign names and data types to the columns.
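If you want to double-check what you loaded, Pig’s DESCRIBE command prints the schema of a relation:

grunt> DESCRIBE salary;

You should see the column names and data types we declared in the “AS” clause.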

Merge the Tables and Find the Player with the Highest Salary

First, let’s merge the data:

grunt> mergeddata = JOIN salary BY playerID, master BY playerID;
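One thing worth knowing about JOIN: in the merged table, Pig disambiguates column names by prefixing them with the table they came from (for example, salary::playerID and master::playerID). You can see this for yourself:

grunt> DESCRIBE mergeddata;

Columns whose names are unique across the two tables, like nameFirst, can still be referenced without the prefix, which is what we’ll do below.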

Now, let’s filter to find the player with the highest salary. Note that we cheat a bit and use the fact that we already know the highest annual salary ever recorded in this data set is $33,000,000, rather than using Pig to calculate this value:

grunt> highestsalary = FILTER mergeddata BY salary == 33000000;
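If you’d rather not hard-code the number, here’s a quick sketch of how you could have Pig compute the maximum itself (the relation names “allsalaries” and “maxsalary” are our own):

grunt> allsalaries = GROUP salary ALL;
grunt> maxsalary = FOREACH allsalaries GENERATE MAX(salary.salary);
grunt> DUMP maxsalary;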

Now let’s prune the table so the final output only shows the player’s name, the year, and the salary:

grunt> highestsalarynew = FOREACH highestsalary GENERATE nameFirst, nameLast, yearID, salary;

Now let’s print out “highestsalarynew”, which should contain data about the player with the highest salary ever recorded:

grunt> DUMP highestsalarynew;

You should see output that looks like this:

[Screenshot: output of the DUMP command showing the highest-salary record]
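If you’d rather save the results than print them to the screen, you can write them back to S3 with Pig’s STORE command (the output folder name here is just an example, and it must not already exist):

grunt> STORE highestsalarynew INTO 's3n://your-bucket-name/highest-salary-output' USING PigStorage(',');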

Congratulations, you’ve just completed your first data analysis using Pig on a distributed Hadoop cluster!

Please remember to log back into your AWS account and terminate your EMR cluster. From the AWS console, click on “Services”, then “EMR”, then click on your cluster name, click the “Terminate” button, and confirm by clicking the red “Terminate” button in the pop-up window. If you don’t do this, the EMR cluster and the 3 underlying EC2 instances will keep running indefinitely and will accrue very large charges on your AWS account.

Continue to our next tutorial: Get Started with MongoDB in 20 Minutes.
