AWS's Elastic Map Reduce Offering

A company I used to work at had weekly company-wide catered lunches where people were given the opportunity to present on interesting topics relevant to the contracts they were working on. I took the opportunity to give a presentation on Amazon Web Services' Elastic Map Reduce offering, as the migration I spent the majority of my employment working on relied heavily on taking the client's Hadoop-based SaaS product and getting it to work on EMR.

This AWS service provides dynamically scalable, distributed big data processing clusters built on several open source Apache projects like Spark, Hadoop, Hive, HBase, and ZooKeeper. It's used to store, manipulate, and reduce large-scale data sets to help provide new context, understanding, and insights into what the data means. For a long time, big data users of these technologies have been locked into expensive, long-term hardware contracts with their data-center providers for the number of servers they think they need at the time of the original signing. Below are some extracts from my original set of PowerPoint slides.

Who would want to actually use this stuff? Why?

Amazon Web Services' EMR brings these industry-leading technologies out of rigid on-premises data centers and into the affordable, scalable cloud. Now that the necessary technologies are managed by AWS, those same expensive "on-prem" data-center customers get the cost savings and scalability of ephemeral EC2 clusters as well as the reliability of S3/EBS storage. Amazon currently sells it as:

Petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. For short-running jobs, you can spin up and spin down clusters and pay per second for the instances used. For long-running workloads, you can create highly available clusters that automatically scale to meet demand.

https://aws.amazon.com/emr/?whats-new-cards.sort-by=item.additionalFields.postDateTime
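To make the "spin up and spin down" model concrete, here is a minimal boto3 sketch (not the exact setup from the migration I worked on) of an ephemeral cluster that runs a single Spark step and then shuts itself down. The region, EMR release, instance sizing, IAM roles, and S3 paths below are hypothetical placeholders.

```python
# Hypothetical ephemeral EMR cluster: run one Spark step, then terminate.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="ephemeral-spark-job",
    ReleaseLabel="emr-6.10.0",  # assumed EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down after the last step
    },
    Steps=[
        {
            "Name": "spark-word-count",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/wordcount.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Started cluster:", response["JobFlowId"])
```

Because KeepJobFlowAliveWhenNoSteps is false, the cluster terminates as soon as the step finishes, so you only pay per second for the instances while the job is actually running.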

How does the data processing really happen?

When jobs are submitted to an EMR cluster, the work is split up into stages, and those stages are split into containers. The whole point is that the primary node(s) can evenly distribute all of the containers in a stage across the secondary nodes living in the core and task groups, increasing the parallelization of a job's execution and decreasing the overall time it takes to finish the job. There are currently two main paradigms for structuring the jobs you can submit to an EMR cluster: Map Reduce and the directed acyclic graph (DAG). Map Reduce jobs have a rigid structure with a predefined set of stages that can't be changed, whereas directed acyclic graphs allow engineers to use their imagination when writing jobs to accomplish their needs.

Hadoop: Map Reduce

All of the containers in a submitted Map Reduce job are split into four predefined phases: Splitting, Mapping, Shuffling, & Reducing. One famous, simple-to-understand example of a Map Reduce job is a word count program.
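For the curious, here is a minimal sketch of that word count written for Hadoop Streaming in Python. The file name and the map/reduce command-line switch are my own hypothetical choices, but the flow matches the phases above: the mapper emits a count of one per word, Hadoop shuffles and sorts the pairs by word, and the reducer sums the counts.

```python
#!/usr/bin/env python3
# wordcount.py - hypothetical Hadoop Streaming job for EMR.
# Run with "wordcount.py map" as the mapper and "wordcount.py reduce" as the
# reducer; Hadoop sorts the mapper's output by key in between (the Shuffle).
import sys


def mapper():
    # Mapping phase: emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Reducing phase: input arrives grouped by word, so sum consecutive counts.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```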

Spark: Directed Acyclic Graph

All of the containers in a submitted Spark job are used to construct a free-form DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operations where possible during the job's execution.
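Here is the same word count as a minimal PySpark sketch, with hypothetical S3 paths. The chained transformations only describe the graph; nothing runs until the final action, at which point the optimized DAG is executed across the cluster.

```python
# Hypothetical PySpark word count expressed as a free-form DAG.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-dag").getOrCreate()

counts = (
    spark.sparkContext.textFile("s3://my-bucket/input/")  # hypothetical input path
    .flatMap(lambda line: line.split())                   # one record per word
    .map(lambda word: (word, 1))                          # pair each word with a count of one
    .reduceByKey(lambda a, b: a + b)                       # combine the counts per word
)

# The action below is what actually triggers the DAG's execution.
counts.saveAsTextFile("s3://my-bucket/output/")            # hypothetical output path
spark.stop()
```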

Where does all of the data end up living?

Jobs of either type run against the Hadoop Distributed File System (HDFS) as the cluster's underlying storage system, because the processing engines expect a POSIX-like file system that can reliably read and write data on the cluster's local disks. HDFS uses block-based storage like other file systems, but with some notable differences. For one, it's designed to use a much larger block size than a standard operating system's file system. Secondly, it creates copies of all of its blocks and replicates them to other servers in the cluster, both for fault tolerance and for an increased chance that CPU power is available to work with a block's data locally.

AWS has also written its own EMR File System (EMRFS) as an abstraction around its S3 service for storing your data long term, so that you don't need to keep EMR clusters up 24/7 to persist it. S3 doesn't behave like a POSIX file system, so to work around that, AWS's Hadoop software pulls your data into the EMR cluster from S3 before it's processed and writes the output back out upon completion. This was a real game changer for customers looking to easily take advantage of short-lived, ephemeral EMR clusters.
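As a rough sketch of how the two file systems typically divide the work, the PySpark snippet below reads durable input from S3 through EMRFS (s3:// paths), parks an intermediate result on the cluster's HDFS (which disappears with the cluster), and writes the final output back to S3 so it outlives the cluster. The bucket names, paths, and column names are hypothetical.

```python
# Hypothetical split between EMRFS (S3) for durable data and HDFS for
# cluster-local intermediate data on an EMR cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-vs-hdfs").getOrCreate()

# Pull the long-lived source data from S3 into the cluster for processing.
events = spark.read.json("s3://my-bucket/raw-events/")

# Park an intermediate result on HDFS; it goes away when the cluster does.
cleaned = events.dropna(subset=["user_id"])
cleaned.write.mode("overwrite").parquet("hdfs:///tmp/cleaned-events/")

# Write the final output back to S3 so it outlives this ephemeral cluster.
summary = cleaned.groupBy("user_id").count()
summary.write.mode("overwrite").parquet("s3://my-bucket/summaries/")

spark.stop()
```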

Amazon Web Services' EMR is an amazing offering that really helps bring big data companies out of their on-premises data centers and into the dynamic cloud. I've seen it revolutionize SaaS products and help cut monthly operating costs to less than a tenth of what a company was used to paying for hardware and maintenance. Even if you're not part of a multi-million dollar company, this service makes it easier than ever to start learning about what it takes to process huge amounts of data.