AWS EC2 Status Checks and Recovery Options
During our training, we discussed the importance of understanding and managing status checks for EC2 instances in AWS. Here's a summary of the key points:
EC2 Status Checks
AWS performs automatic status checks on EC2 instances to identify potential hardware and software issues. There are two main types of status checks:
System Status Checks
- Monitors the AWS systems that host the instance for problems in the underlying infrastructure.
- Issues can include hardware failures or loss of power on the host.
- To review these issues, use the AWS Health Dashboard (formerly the Personal Health Dashboard).
- Remediation typically involves stopping and starting the instance, which migrates it to a new host.
Instance Status Checks
- Monitors the software and network configuration of the individual instance.
- Issues can include invalid network configurations or exhausted memory.
- Remediation can be as simple as rebooting the instance or changing its configuration.
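To make the two check types concrete, here is a minimal Python sketch that maps a status result to the remediation steps listed above. It assumes the response shape of the boto3 EC2 `describe_instance_status` call, where each entry carries a `SystemStatus` and an `InstanceStatus` with a `Status` such as `ok` or `impaired`. The function names `suggest_remediation` and `check_instances` are hypothetical, and the suggestion text is only a restatement of the notes, not an official AWS recommendation.

```python
from typing import Dict, List


def suggest_remediation(status: Dict) -> List[str]:
    """Map one InstanceStatuses entry to the remediation steps from the notes."""
    suggestions = []
    # System status check: the problem is on the AWS side (host hardware, power).
    if status.get("SystemStatus", {}).get("Status") == "impaired":
        suggestions.append("stop and start the instance to move it to a new host")
    # Instance status check: the problem is inside the instance itself.
    if status.get("InstanceStatus", {}).get("Status") == "impaired":
        suggestions.append("reboot the instance or change its configuration")
    return suggestions


def check_instances(instance_ids: List[str], region_name: str) -> Dict[str, List[str]]:
    """Fetch live status checks and return suggestions per instance ID."""
    import boto3  # imported lazily so suggest_remediation works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.describe_instance_status(
        InstanceIds=instance_ids,
        IncludeAllInstances=True,  # also report instances that are not running
    )
    return {
        entry["InstanceId"]: suggest_remediation(entry)
        for entry in response["InstanceStatuses"]
    }
```

Calling `check_instances(["i-0123456789abcdef0"], "us-east-1")` (a placeholder instance ID and region) would need valid AWS credentials. `suggest_remediation` can be tried on a hand-written dictionary without any AWS access.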
Automated Recovery Options
To automate the recovery process, you can leverage CloudWatch metrics and alarms:
Option 1: CloudWatch Alarms
- Set up CloudWatch Alarms for metrics like StatusCheckFailed_System and StatusCheckFailed_Instance.
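- Configure the StatusCheckFailed_System alarm to perform the "recover instance" action, which maintains the same IP addresses, metadata, and placement group. The recover action is available only for system status check failures; for StatusCheckFailed_Instance, use the reboot action instead.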
- Optionally, send notifications via Amazon SNS.
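As an illustration of Option 1, the following Python sketch builds the parameters for a recovery alarm and passes them to the boto3 CloudWatch `put_metric_alarm` call. The alarm name, the one-minute period, the two evaluation periods, and the helper names `build_recovery_alarm` and `create_recovery_alarm` are assumptions chosen for the example, not values from the training. The SNS topic ARN is optional and must point to a topic you have already created.

```python
from typing import Dict, Optional


def build_recovery_alarm(
    instance_id: str, region: str, sns_topic_arn: Optional[str] = None
) -> Dict:
    """Build put_metric_alarm arguments for a system-status recovery alarm."""
    # Built-in EC2 action ARN that triggers instance recovery.
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if sns_topic_arn:
        actions.append(sns_topic_arn)  # optional notification via Amazon SNS
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # assumed: evaluate the metric every minute
        "EvaluationPeriods": 2,   # assumed: two consecutive failing periods
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


def create_recovery_alarm(
    instance_id: str, region: str, sns_topic_arn: Optional[str] = None
) -> None:
    """Create the alarm in CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_recovery_alarm works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(
        **build_recovery_alarm(instance_id, region, sns_topic_arn)
    )
```

Keeping the parameter construction separate from the API call makes it possible to inspect or review the alarm definition before anything is created in the account.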
Option 2: Auto Scaling Group