Using SQL to work with relational databases is a core skill for anyone doing serious data analysis. Today we’ll get started with SQL using PostgreSql. We’ll follow the same format and produce the same output as we did in the Get Started with R in 20 Minutes post. Specifically, we’ll do the following:
- Import two CSV (comma-separated value) files
- Merge the two CSV files
- Find a value in the merged data set
We’ll run this tutorial using an AWS EC2 (Amazon Web Services – Elastic Compute Cloud) instance. One of the challenges of these tutorials is that we don’t all have the same computing environment – some of us use Windows, some Mac OS, and some Linux. Doing the tutorial on an AWS EC2 instance allows us to remove most of this complexity since we can all set up and work from the exact same computing environment. There will still be one difference between Windows and MAC/Linux users because we’ll need access to a terminal which comes already available on MAC/Linux but will need to be downloaded for Windows.
Please use our Get Started with AWS tutorial to create and prep your EC2 instance.
All work in this tutorial from here on will happen in the terminal window that is connected to your EC2 instance (the one with the prompt that says ubuntu@ip-address:~$).
Note that in the commands I ask you to run, I’ve kept some info about the prompt for reference. I did this because different prompts tell you about what you are currently connected to. You shouldn’t retype the prompt when you issue the command. So, for example, in:
postgres@ip-address:~$ psql
The command you should type is just “psql”. I’ve only kept the “postgres@ip-address:~$” so that you can verify you are connected properly. In this case, the prompt indicates that you are connected as a user called “postgres” instead of the original “ubuntu” user that you originally connected as.
Ok, let’s get started.
Install PostgreSQL
Ok, we need to install PostgreSQL onto our EC2 instance. Go to your EC2 terminal window and run the command:
$ sudo apt-get install postgresql-9.5
You may get a warning telling you how much disk space will be used for the install and asking if you want to continue. Type “y” to continue.
Create Postgres database
Now let’s create a database. Go back to the terminal window. Now, you need login as the root user: “postgres”. Issue the following command:
$ sudo su – postgres
Let’s start the Postgres interactive terminal (called psql). Run:
postgres@ip-address:~$ psql
Now, create a database called baseball (please note that after every command you issue in psql, you need to put a semi-colon or it won’t run):
postgres=# CREATE DATABASE baseball;
Now, connect to the baseball database:
postgres=# \c baseball;
Now, create a table for the salaries data:
baseball=# CREATE TABLE salaries (yearID char(10), teamID char(10), lgID char(10), playerID char(30), salary integer);
Let’s verify that our new table exists:
baseball=# \dt
What’s happening here? Very simply, we are creating a table and telling Postgres what the columns will be and what type of data each column will hold. For instance, the column yearID is being set to hold character values (i.e. a text field instead of, for instance, an integer field).
Now, import the data from the “Salaries.csv” file into the new “salaries” table:
baseball# COPY salaries FROM ‘/home/ubuntu/Salaries.csv’ DELIMITER ‘,’ CSV HEADER;
Notice that this “copy” command has 4 pieces of information. First, we say that we want to copy to the salaries table, then we say the path of the file we want to copy, then we say what the delimiter is, then we say that the file has a header for the first row.
Ok, let’s repeat the process to create a table for the “master” data:
baseball# CREATE TABLE master (playerID char(30), birthyear char(10), birthmonth char(10), birthday char(10), birthcountry char(30), birthstate char(30), birthcity char(30), deathyear char(10), deathmonth char(10), deathday char(10), deathcountry char(30), deathstate char(30), deathcity char(30), namefirst char(30), namelast char(30), namegiven char(100), weight char(10), height char(10), bats char(10), throws char(10), debut char(30), finalgame char(30), retroID char(30), bbrefID char(30));
Now let’s import the data from “Master.csv” into the new “master” table:
baseball# COPY master FROM ‘/home/ubuntu/Master.csv’ DELIMITER ‘,’ CSV HEADER;
Finally, let’s run a query to see who the player with the highest annual salary of all time is:
baseball# SELECT master.namefirst, master.namelast, salaries.salary
FROM master, salaries
WHERE master.playerID = salaries.playerID and salaries.salary = (select max(salary) from salaries);
You should see something like this:
Ok, so what’s going on with this query? First, note the general syntax: SELECT…FROM…WHERE… This is the standard format for SQL queries. After “SELECT” we are telling the database what columns we want to select, after “FROM” we are telling the database what tables to pull the columns from, after “WHERE” we are telling the database any conditions to apply before giving us the data.
Notice in the SELECT statement that we use the syntax table.column. Since we are pulling data from multiple tables, we need to not only tell the database which tables to use in the FROM statement but also which table to pull each variable from in the SELECT statement – this is a best practice since sometimes multiple tables will have the same variable names in it.
The first part of the WHERE statement is telling the database to join the data by playerID while the second part is telling the database just to give us just the observations with the maximum salary.
Note in the second part of the WHERE statement, we’ve embedded a separate SQL query. Very simply, this second query: SELECT max(salary) FROM salaries will return the maximum observation for the salary variable and then the first query will use that value to limit the data it returns to us.
Congratulations, you’ve just used PostgreSQL to create a database and a couple tables. You’ve then used it to run a query that merges the two tables and returns output based on a mathematical function (max).
To exit out of psql, type:
baseball=# \q
Then logout of the “postgres” root account:
postgres@ip-address:~$ exit
Then close your ssh session to the EC2 instance:
ubuntu@ip-address:~$ exit
Please remember to log back into your AWS account, click on “Instances” then go to “Actions>Instance State>Terminate” to terminate your EC2 instance. If you don’t do this, the EC2 instance will keep running indefinitely and will eventually start to accrue charges to your account.
Continue to our next tutorial: Get Started with Python in 20 Minutes.