MongoDB is one of a class of databases called “NoSQL” databases. Basically, NoSQL refers to any database that is not a relational database. However, “NoSQL” is often thought of as “Not only SQL” because some NoSQL databases do actually support SQL-style querying.
MongodDB is a sub-class of NoSQL called a “document-oriented” database. In a “document-oriented” database, a document is basically just an individual record without a strict structure. For example, a Mongo database with documents corresponding to individual people might have two documents as follows:
{
name: “Bob Jones”,
phone: “555-555-5555”,
address: “123 Nowhere Street”,
city: “Boston”,
state: “MA”
}
{
name: “Sally Jones”,
phone: “555-555-5555”
address: “123 Nowhere Street”
city: “Boston”
state: “MA”
email: “sally_jones@fake-email.com”
priorname: { first: “Sally”, last: “Roberts” }
}
Note the flexibility of the data structure:
- The “Sally Jones” document has an additional field that the “Bob Jones” document doesn’t have. This isn’t possible in a relational data which requires that all records in a table have exactly the same fields.
- The “Sally Jones” document has another document in it showing her previous name. This is called an “embedded document”.
One more key concept about Mongo documents: Take for example “city: Boston” – this is an example of something called a key-value pair, where “city” is the “key” and “Boston” is the “value”. We’ll use this key-value structure later in the tutorial to query the database for specific documents.
Finally, MongoDB organizes “documents” into something it calls “collections” and organizes “collections” into databases.
Ok, now that we understand the basics of NoSQL document-oriented databases let’s get started with MongoDB.
Commission EC2 Instance and Copy CSV Files
Follow the steps in our Get Started with AWS in 20 Minutes tutorial to commission an EC2 server instance and ssh into it. However, instead of choosing the Ubuntu Server 16.04 LTS (HVM), SSD Volume Type machine image, choose the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type image (this will save us a step when we install MongoDB). Then ssh into the instance.
Now, use scp (Mac or Linux) or pscp (Windows) to copy the Salaries.csv and Master.csv files over to your EC2 instance following the same instructions in the Get Started with AWS in 20 Minutes tutorial.
Install MongoDB
At the Ubuntu command line prompt on your EC2 ssh session, issue the following commands to install MongoDB (note these steps are taken directly from the installation guide published by MongoDB at https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/):
ubuntu@an-ip-address:~$ sudo apt-key adv –keyserver hkp://keyserver.ubuntu.com:80 –recv 0C49F3730359A14518585931BC711F9BA15703C6
ubuntu@an-ip-address:~$ echo “deb [ arch=amd64 ] http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.4 multiverse” | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
ubuntu@an-ip-address:~$ sudo apt-get update
ubuntu@an-ip-address:~$ sudo apt-get install -y mongodb-org
Create Mongo Database and Collections, and Import CSV Files
We’ll issue the command for Mongo to import the CSV files while still at the Ubuntu command line prompt:
ubuntu@an-ip-address:~$ mongoimport –db baseball –collection salaries –type csv –headerline –file /home/ubuntu/Salaries.csv
ubuntu@an-ip-address:~$ mongoimport –db baseball –collection master –type csv –headerline –file /home/ubuntu/Master.csv
Notice that when we use this “mongoimport” command, we are telling Mongo to create and use a database called “baseball” (Mongo creates it since it doesn’t already exist) and to put the data into a “collection” called “master” or “salaries”. We are also telling Mongo that the raw file type is “csv”, that it includes a header for the first line, and that the path to the file is “/home/ubuntu” on our EC2 instance.
Launch the MongoDB Shell and Find The Player with the Highest Salary
MongoDB provides an interactive shell that we can use to work with databases. Let’s launch the Mongo shell:
ubuntu@an-ip-address:~$ mongo
Now you should see a prompt like this:
>
Now, let’s see what database Mongo is using by default:
> db
You should see the result:
test
Ok, let’s switch to the “baseball” database:
> use baseball
Now let’s make sure our collections are there:
> show collections
You should see the two collections listed:
master
salaries
Now to get comfortable querying data in Mongo, let’s do a simple query for the document that has the highest salary number, which we know to be $33,000,000 from prior tutorials:
> db.salaries.find( { “salary” : 33000000} ).pretty()
You should see output that looks like this:
{
“_id” : ObjectId(“57f1caec4db4d5fdd0bfa636”),
“yearID” : 2009,
“teamID” : “NYA”,
“lgID” : “AL”,
“playerID” : “rodrial01”,
“salary” : 33000000
}
{
“_id” : ObjectId(“57f1caec4db4d5fdd0bfa96d”),
“yearID” : 2010,
“teamID” : “NYA”,
“lgID” : “AL”,
“playerID” : “rodrial01”,
“salary” : 33000000
}
Ok, so we see data in a format similar to what we expected based on the discussion in the first section of the tutorial. Note a few things:
- We used a command called find() to query the database.
- Mongo looked in the current database because we used “db” at the start of the command – when we told Mongo earlier to “use baseball”, Mongo changed the value of “db” which is the variable in the Mongo shell that references the current database.
- Mongo looked in the collection named “salaries” because we told it to do so when we typed “db.salaries” in front of find().
- Finally, we passed as a parameter to find(), a key-value pair corresponding to the documents we wanted. In this case, the key-value pair was { “salary” : 33000000 } which was present in two documents, hence Mongo showed us two documents.
- We tacked the command pretty() on the end to make the output formatting better so it’s easier to read.
- You can see that in importing the CSV data, Mongo created a document for every record (i.e. row) and then within each document created a key-value pair corresponding to the fields and observed values for that record. It also created a field called “_id” which is a unique identifier (or “key”) for each document in the collection.
Ok, now let’s do what we’ve done in the other tutorials and produce the output with the two tables (called “collections” in this case) joined.
To do this, we’ll make use of a capability in MongoDB called an aggregation pipeline. This basically means that we’ll issue one aggregation command that will process our collections in a series of steps to produce the aggregated output we want. Issue this command:
> db.salaries.aggregate([
{
$match : { “salary” : 33000000 }
},
{
$lookup :
{
from: “master”,
localField: “playerID”,
foreignField: “playerID”,
as: “output_test”
}
}
]).pretty()
You should see this output:
{
“_id” : ObjectId(“57f1caec4db4d5fdd0bfa636”),
“yearID” : 2009,
“teamID” : “NYA”,
“lgID” : “AL”,
“playerID” : “rodrial01”,
“salary” : 33000000,
“output_test” : [
{
“_id” : ObjectId(“57f1caf84db4d5fdd0bff366”),
“playerID” : “rodrial01”,
“birthYear” : 1975,
“birthMonth” : 7,
“birthDay” : 27,
“birthCountry” : “USA”,
“birthState” : “NY”,
“birthCity” : “New York”,
“deathYear” : “”,
“deathMonth” : “”,
“deathDay” : “”,
“deathCountry” : “”,
“deathState” : “”,
“deathCity” : “”,
“nameFirst” : “Alex”,
“nameLast” : “Rodriguez”,
“nameGiven” : “Alexander Enmanuel”,
“weight” : 225,
“height” : 75,
“bats” : “R”,
“throws” : “R”,
“debut” : “1994-07-08”,
“finalGame” : “2015-10-04”,
“retroID” : “rodra001”,
“bbrefID” : “rodrial01”
}
]
}
{
“_id” : ObjectId(“57f1caec4db4d5fdd0bfa96d”),
“yearID” : 2010,
“teamID” : “NYA”,
“lgID” : “AL”,
“playerID” : “rodrial01”,
“salary” : 33000000,
“output_test” : [
{
“_id” : ObjectId(“57f1caf84db4d5fdd0bff366”),
“playerID” : “rodrial01”,
“birthYear” : 1975,
“birthMonth” : 7,
“birthDay” : 27,
“birthCountry” : “USA”,
“birthState” : “NY”,
“birthCity” : “New York”,
“deathYear” : “”,
“deathMonth” : “”,
“deathDay” : “”,
“deathCountry” : “”,
“deathState” : “”,
“deathCity” : “”,
“nameFirst” : “Alex”,
“nameLast” : “Rodriguez”,
“nameGiven” : “Alexander Enmanuel”,
“weight” : 225,
“height” : 75,
“bats” : “R”,
“throws” : “R”,
“debut” : “1994-07-08”,
“finalGame” : “2015-10-04”,
“retroID” : “rodra001”,
“bbrefID” : “rodrial01”
}
]
}
Ok, so what happened here? Now, instead of find(), we are using aggregate() but are still telling Mongo to use the database named “baseball” (which is the current value of “db”) and the collection named “salaries”.
Then in the first step of the pipeline, we are telling Mongo to filter “salaries” and only give us the documents where the value of “salary” is $33,000,000 (this is the $match step).
Then we are telling Mongo to take that subset of “salaries” and create a new field in each document called “output_test” and to set the value of “output_test” to be the document from “master” that corresponds to that player.
Remember that each document in “salaries” corresponds to a salary observation for a given player in a given year. So the first $match step pulled the documents corresponding to Alex Rodriguez’s 2009 and 2010 seasons where he earned $33,000,000. But if you recall from the earlier query we ran, those documents don’t have his name, they just have his playerID. His name is in the “master” collection. So the $lookup step does the following:
- Goes into the “master” collection (i.e. from: master)
- Looks up which document corresponds to Alex Rodriguez (i.e. localField: playerID, foreignField: playerID which is basically merging on “playerID”)
- Creates a new field called “output_test” in “salaries” to house those documents from “master” (i.e. as: output_test)
Congratulations, you’ve just used Mongo DB to import two CSV files and then run an aggregation pipeline to filter a collection and merge documents from another collection!
Please remember to log back into your AWS account, click on “Instances” then go to “Actions>Instance State>Terminate” to terminate your EC2 instance. If you don’t do this, the EC2 instance will keep running indefinitely and will eventually start to accrue charges to your account.