EMR supports many different big data products including Spark, HBASE, Presto, Flink. Storage can use regular good old fashioned HDFS or EMRFS which is storage on S3. EMR runs in a VPC but within a single AZ.
EMR includes the following components:
- Master node - Manages data distribution to core and slave nodes; backs up log files to S3 every 5 minutes if configured upon creation; long running (so good candidates for RI)
- Core Nodes - store data on HDFS; managed by master node; long running (so good candidates for RI)
- Task Nodes - Performs data tasks; no HDFS; short running and good for Spot instances
- EMR Studio - managed Jupyter notebooks used as an integrated development environment; notebooks are backed up to S3 not on cluster with availability exclusively on the console
It’s also common to run EMR in a transient cluster (temporary usage). Capacity is related to instances configuration; uniform instances groups have single instances types, purchasing options, and autoscaling while instances fleets mix instance types and don’t have autoscaling (like Spot for EMR).
The big idea is distribute high velocity, high volume and lots of variety of data across tons of servers… using an Apache project, Hadoop. Tasks include log analysis, web indexing, machine learning, financial analysis, scientific simulation, and bioinformatics. Hadoop is a data storage and batch processing framework that is compromised of four major components:
- Hadoop Common - the utilities and libraries part
- Hadoop HDFS - three copies of the data is written here… not a file system (can’t mount it) = more of a data structure storage device
- Hadoop YARN - Yet Another Resource Negotiator; new in Hadoop 2; enables applications (Pig, Hive, HBase) to be plugged in
- Hadoop Map Reduce - the engine - an implementation of the MapReduce programming model for large scale data processing.
Parallel processing framework that works with Hadoop. It moves data out to processing location which essentially scales-out the architecture; fault tolerant; runs on commodity hardware; many different languages are supported; works with Hive in V2 and Pig (V1 & V2)
Data warehouse based on Hadoop (Petabyte scale; distributed on commodity hardware; highly latent queries) Hive uses HQL so you can leverage your SQL skills and works on structured and unstructured data and has the nifty feature of defining a schema on read. Hive can be configured to read data from DynamoDB.
Components of Hive include:
- Databases - named bunch of data
- Tables - files in subdirectories; can be managed internally in HDFS or external of Hive.
- Partitions - a directory used to reduce the size of a table which makes data reads/filters faster
- Buckets - a file in a directory table used to manage tables; more efficient that 1000s of small partitions
- Views & indexes - same as RDBmS
- Hive Metastore - stores information about partitions, schemas, tables & locations, and columns/column types
Pig is a language for analyzing large datasets and is more efficient than Java and similar to SQL. Pig Latin is easily controlled, data agnostic, flexible, and fast. Use Pig when external services are needed for ETL, and you need to debug it… and you don’t like Java. It has all the basic data types, complex data types and relations.
Other Hadoop Projects
- HBase - an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.
- Cassandra - a database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
- Spark is super quick because it’s an in-memory solution and can be used with the Spark MLLib, the defacto machine learning library for this platform. MLLib enables pretty much all ML functions including modeling, ETL/feature transformations, model evaluation, and ML persistence.
- Get Data in realtime? Kinesis
- Import Data? Snowball or Import export
- Visualize Data? Quicksight
- On Demand Analytics = S3 => EMR (sort, aggregate, join) => Redshift (data storage) => Quicksight
- Realtime Analytics = Kinesis => DynamoDB
- Data Warehousing = S3 => EMR (ETL) => S3 => Redshift => Quicksight
- Clickstream = Kinesis => Hadoop/Kinesis Client Library -> Personalized recommendation
- Long batch processing jobs (log parsing, ETL)? Hive
- Real-time analytics? Redshift
- Spot instances long running/data-critical? master, core using on-demand/resource; task using spot
- Spot instances cost sensitive/app testing? spot