Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
EMR instance groups
An important consideration when you create an EMR cluster is how you configure Amazon EC2 instances and network options. EC2 instances in an EMR cluster are organized into three node types: the master node, core nodes, and task nodes. Each node type performs a set of roles defined by the distributed applications that you install on the cluster.
The collection of EC2 instances that host each node type is called either an instance fleet or an instance group. The instance fleets or instance groups configuration is a choice you make when you create a cluster. It applies to all node types, and it can't be changed later.
Master instance group
The master node manages the cluster and typically runs master components of distributed applications.
It also tracks the status of jobs submitted to the cluster and monitors the health of the instance groups. Because there is only one master node, the instance group consists of a single EC2 instance.
We recommend using the On-Demand lifecycle for master instance groups.
Core instance group
Core nodes are managed by the master node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS).
They also run the Task Tracker daemon and perform the other parallel computation tasks that installed applications require.
As with the master node, at least one core node is required per cluster. Unlike the master node, however, the core instance group can contain multiple core nodes, and therefore multiple EC2 instances.
With instance groups, you can add and remove EC2 instances while the cluster is running or set up automatic scaling.
We recommend using the On-Demand lifecycle for core instance groups.
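Resizing a running core instance group, as described above, might look like the following minimal sketch. It only builds the request arguments, so it runs without AWS access. The helper name build_resize_request and the IDs are hypothetical. The field names ClusterId, InstanceGroups, InstanceGroupId, and InstanceCount are assumed to match boto3's EMR modify_instance_groups call.

```python
def build_resize_request(cluster_id, instance_group_id, new_count):
    """Build keyword arguments for an EMR instance group resize.

    Hypothetical helper: it only assembles a dict and makes no AWS call.
    """
    # At least one core node is required per cluster.
    if new_count < 1:
        raise ValueError("a core instance group needs at least one instance")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


# Placeholder IDs, for illustration only.
request = build_resize_request("j-EXAMPLECLUSTER", "ig-EXAMPLECORE", 4)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("emr").modify_instance_groups(**request)
```

Keeping the request construction separate from the API call makes the minimum-size check easy to verify before anything is sent to AWS.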
Task instance group
Task nodes are optional. You can use them to add compute capacity for parallel computation tasks on data. Task nodes don't run the Data Node daemon, nor do they store data in HDFS. As with core nodes, you can add task nodes to a cluster by adding EC2 instances to an existing instance group.
Clusters with the instance group configuration can have up to a total of 48 task instance groups.
We recommend using the Spot lifecycle for task instance groups.
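The layout and lifecycle recommendations in this section can be expressed as a small configuration builder. This is a sketch under stated assumptions, not a definitive implementation. The helper names, group names, and instance types are illustrative. The dictionary keys (Name, InstanceRole, Market, InstanceType, InstanceCount) are assumed to follow the InstanceGroups structure accepted by boto3's EMR run_job_flow call.

```python
MAX_TASK_GROUPS = 48  # limit stated in this section for instance group clusters


def instance_group(name, role, instance_type, count):
    """Describe one instance group, applying the recommended lifecycle."""
    # Recommendation from this section: Spot for task groups,
    # On-Demand for master and core groups.
    market = "SPOT" if role == "TASK" else "ON_DEMAND"
    return {
        "Name": name,
        "InstanceRole": role,
        "Market": market,
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def build_instance_groups(master_type, core_type, core_count, task_groups=()):
    """Return a list of instance group dicts for one cluster.

    task_groups is an iterable of (instance_type, count) pairs.
    """
    task_groups = list(task_groups)
    if core_count < 1:
        raise ValueError("at least one core node is required per cluster")
    if len(task_groups) > MAX_TASK_GROUPS:
        raise ValueError("too many task instance groups")

    # The master instance group always consists of a single EC2 instance.
    groups = [
        instance_group("master", "MASTER", master_type, 1),
        instance_group("core", "CORE", core_type, core_count),
    ]
    for index, (instance_type, count) in enumerate(task_groups, start=1):
        groups.append(instance_group(f"task-{index}", "TASK", instance_type, count))
    return groups


# Illustrative instance types and counts only.
groups = build_instance_groups("m5.xlarge", "m5.xlarge", 2, [("m5.xlarge", 4)])
```

The resulting list could then be supplied as Instances={"InstanceGroups": groups} when creating a cluster with run_job_flow, alongside the other arguments that call needs.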