When you change the capacity of an instance group in an EMR cluster, the group might get stuck in the resizing state. In this state, the requested number of instances doesn't match the number of instances actually running, so more instances still need to be launched or terminated to complete the capacity change.
To prevent instance groups from getting stuck in the resizing state, we created an EMR auto-recovery process.
Auto Recovery Process
When a change in instance group capacity is applied, a process monitors the group's status. If the resize exceeds the 30-minute time limit, Spotinst Elastigroup automatically stops the resizing process on that instance group and creates a new instance group with the same configuration to fall back to.
The original instance group is "banned" for 2 hours, and all launches of new instances are applied to the new instance group.
For example, if the original instance group was missing 3 instances that had been requested, those 3 instances are launched as part of the new instance group.
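The recovery flow described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not Elastigroup's actual implementation: the InstanceGroup class, the recover_if_stuck function, the "-fallback" naming, and the instance type are all hypothetical, and "stopping" the resize is modeled here simply as resetting the requested count to the running count. Only the 30-minute limit, the 2-hour ban, and the fallback to a new group with the same configuration come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESIZE_TIMEOUT = timedelta(minutes=30)  # time limit stated in the text
BAN_DURATION = timedelta(hours=2)       # ban period stated in the text


@dataclass
class InstanceGroup:
    """Hypothetical in-memory model of an EMR instance group."""
    name: str
    instance_type: str
    requested: int
    running: int
    resize_started_at: Optional[datetime] = None
    banned_until: Optional[datetime] = None


def recover_if_stuck(group: InstanceGroup, now: datetime) -> Optional[InstanceGroup]:
    """Return a fallback group if the resize exceeded the time limit, else None."""
    if group.resize_started_at is None:
        return None  # no resize in progress
    if now - group.resize_started_at <= RESIZE_TIMEOUT:
        return None  # still within the 30-minute limit

    # Instances that were requested but never launched in the original group.
    missing = max(group.requested - group.running, 0)

    # Assumption: stopping the resize is modeled as keeping what already runs.
    group.requested = group.running
    group.resize_started_at = None
    group.banned_until = now + BAN_DURATION

    # New group with the same configuration; it takes over the missing launches.
    return InstanceGroup(
        name=group.name + "-fallback",
        instance_type=group.instance_type,
        requested=missing,
        running=0,
        resize_started_at=now if missing else None,
    )


start = datetime(2024, 1, 1, 12, 0)
core = InstanceGroup("core", "m5.xlarge", requested=5, running=2,
                     resize_started_at=start)
fallback = recover_if_stuck(core, start + timedelta(minutes=31))
print(fallback.requested, core.banned_until)
```

In this run the original group was missing 3 instances, so the fallback group is created with a request for those 3, and the original group is marked as banned until two hours after the recovery.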
Please note: to prevent Elastigroups from scaling rapidly while a resize is in progress, all scaling operations are suspended as long as any instance group in the cluster is in the resizing state.
Once the resizing process finishes, scaling operations resume.
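The suspension rule can be expressed as a single check over the cluster's instance groups. The following is a minimal sketch: the function name scaling_allowed is hypothetical, and the state strings are assumed to match the names Amazon EMR reports for instance groups (such as "RUNNING" and "RESIZING").

```python
from typing import Iterable


def scaling_allowed(group_states: Iterable[str]) -> bool:
    """Scaling is allowed only when no instance group is resizing."""
    return all(state != "RESIZING" for state in group_states)


# One group is still resizing, so scaling operations stay suspended.
print(scaling_allowed(["RUNNING", "RESIZING", "RUNNING"]))
# Every group has finished resizing, so scaling operations resume.
print(scaling_allowed(["RUNNING", "RUNNING"]))
```

The first call prints False and the second prints True, mirroring the suspend and resume behavior described above.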