Want a super fast introduction to data science technologies? Our Data Science Technology Sprint will get you working with 8 essential data science technologies in 20 minutes each.
You’ll get started with two standard data science programming languages (Python and R), a cutting-edge analytics processing engine (Apache Spark), two abstraction layers for data processing over Hadoop (Hive and Pig), a relational database (PostgreSQL), a NoSQL database (MongoDB), and a system for commissioning remote servers and spinning up distributed computing clusters (Amazon Web Services a.k.a. AWS).
The best way to learn any data science tech is to start using it, but the challenge is often getting the tech installed, configured, and running with some real data.
Many available tutorials solve this by abstracting away all of the complexity and just providing a window through a web browser so that you can issue simple commands. The problem is that while you do learn some commands, you never feel like you really “own” the technology, because once the tutorial is over you can’t use that browser console to explore or do any real analysis.
The tutorials in this Sprint aren’t meant to give you a full introduction to each system but rather to get you started using each system so that you can accelerate your learning.
In each Sprint, we’ll show you the exact steps to install, configure, import data, and do a simple piece of analysis with each technology. Afterwards, you can start to play around with each of the systems and start learning more by doing your own analysis.
For each technology, we’ll use exactly the same raw data and analysis plan so that you spend your time working with the tech rather than learning about a bunch of different data sets. This will get you learning really fast.
The tutorials in this Sprint build on one another, so we recommend you do them in the following order:
- Get Started with R in 20 Minutes
- Get Started with AWS in 20 Minutes
- Get Started with PostgreSQL in 20 Minutes
- Get Started with Python in 20 Minutes
- Get Started with Apache Spark in 20 Minutes
- Get Started with Apache Hive in 20 Minutes
- Get Started with Apache Pig in 20 Minutes
- Get Started with MongoDB in 20 Minutes
If you’re already familiar with any of the technologies, feel free to skip doing that tutorial, but we recommend that you still read through it, since the tutorials build on one another.
Enjoy!
Get Started with R in 20 Minutes
R is an open source software application for data science. It’s a powerful language and has a ton of packages that streamline everything from performing operations on data sets to running machine learning algorithms. Anyone in an “analyst” role in business that...
Get Started with AWS in 20 Minutes
Amazon Web Services (AWS) is a robust set of cloud computing services that Amazon makes available for use on demand. With AWS, we can do anything from commissioning a cheap, lightweight server running Ubuntu for simple processing to commissioning a large...
Get Started with PostgreSQL in 20 Minutes
Using SQL to work with relational databases is a core skill for anyone doing serious data analysis. Today we’ll get started with SQL using PostgreSQL. We’ll follow the same format and produce the same output as we did in the Get Started with R in 20 Minutes...
Get Started with Python in 20 Minutes
Python is a general-purpose programming language. While it’s often taught as an introductory programming language for beginners because its syntax is clear and easy to use, it’s also a very powerful language. Many web applications are written in Python. And Python...
Get Started with Apache Spark in 20 Minutes
Apache Spark is a super fast distributed computing framework for big data analytics. Tests of Spark show that it’s anywhere from 10 to 100 times faster than Hadoop’s MapReduce processing framework. Because of this, it’s become one of the hottest technologies in...
Get Started with Apache Hive in 20 Minutes
Apache Hadoop is a framework for storing and processing large-scale data using distributed computing. While base Hadoop includes four modules — HDFS, MapReduce, YARN, and Hadoop Common — we’ll only concern ourselves with the first two: Hadoop Distributed File System...
Get Started with Apache Pig in 20 Minutes
Apache Pig is similar to Apache Hive in that it is an abstraction layer over Hadoop that provides a programming model and language (called Pig Latin) that is easy to use and ultimately translates into MapReduce (or Tez or Spark) to be run. For more information about...
Get Started with MongoDB in 20 Minutes
MongoDB belongs to a class of databases called “NoSQL” databases. Broadly, NoSQL refers to any database that is not a relational database. However, “NoSQL” is often read as “Not only SQL,” because some NoSQL databases do actually...