Disaster recovery basics: RPO and RTO

Why AWS? AWS is great for DR because it is flexible, uses an OpEx cost model, and makes automation easy.

  • Recovery point objective (RPO) - the amount of data, measured in time, that the business can afford to lose; the shorter the RPO, the more expensive the solution
  • Recovery time objective (RTO) - the time it takes after a disruption to restore a business process to its service level, including the time to restore data
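
As a rough worked example of the two definitions, the sketch below computes the data-loss window and the total recovery time for a single incident. The function names, the timestamps, and the split of recovery into detection, provisioning, and restore steps are illustrative assumptions, not part of any AWS API.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, disaster_at: datetime) -> timedelta:
    """Data written after the last usable backup is lost, so the gap is the achieved RPO."""
    if disaster_at < last_good_backup:
        raise ValueError("disaster time precedes the last backup")
    return disaster_at - last_good_backup


def achieved_rto(*recovery_steps: timedelta) -> timedelta:
    """Sequential recovery steps (detect, provision, restore data, ...) add up to the RTO."""
    return sum(recovery_steps, timedelta())


# Hypothetical incident: nightly backup at 02:00, failure at 14:30 the same day.
loss_window = achieved_rpo(datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 14, 30))
# Hypothetical recovery: 15 min to detect, 45 min to provision, 2 h to restore data.
downtime = achieved_rto(timedelta(minutes=15), timedelta(minutes=45), timedelta(hours=2))

print(loss_window)  # 12:30:00
print(downtime)     # 3:00:00
```

Comparing these achieved values with the agreed objectives shows whether a given backup schedule and recovery runbook are good enough.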

There are four main scenarios: backup and restore, pilot light, warm standby, and multi-site. How do these map to RPO and RTO?
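
The mapping these notes use (hours, tens of minutes, minutes, near zero) can be sketched as a small lookup. The exact thresholds below are assumptions chosen for illustration; the notes only give orders of magnitude, and `pick_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# Upper bound of each band, strictest first. The numbers are illustrative assumptions.
STRATEGIES = [
    (timedelta(minutes=1), "multi-site active-active"),  # near zero
    (timedelta(minutes=10), "warm standby"),             # minutes
    (timedelta(hours=1), "pilot light"),                 # tens of minutes
]


def pick_strategy(rpo: timedelta, rto: timedelta) -> str:
    """Return the least demanding strategy whose band covers the tighter objective."""
    tightest = min(rpo, rto)
    for ceiling, name in STRATEGIES:
        if tightest <= ceiling:
            return name
    return "backup and restore"  # objectives measured in hours


print(pick_strategy(rpo=timedelta(hours=4), rto=timedelta(hours=8)))       # backup and restore
print(pick_strategy(rpo=timedelta(minutes=5), rto=timedelta(minutes=30)))  # warm standby
```

The tighter of the two objectives drives the choice, because a strategy has to satisfy both.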

RPO/RTO in hours = Backup and restore

Back up your data and applications into the recovery Region.

  • Using automated or continuous backups enables point-in-time recovery, which can lower RPO to as low as 5 minutes in some cases. In the event of a disaster, you deploy your infrastructure (using infrastructure as code to reduce RTO), deploy your code, and restore the backed-up data in the recovery Region.
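
For RDS, a point-in-time restore might be prepared roughly as below. This is a minimal sketch that only builds the keyword arguments for the RDS `restore_db_instance_to_point_in_time` call; the helper name and the instance identifiers are hypothetical. Restoring in a different Region would identify the source by a replicated automated backup (`SourceDBInstanceAutomatedBackupsArn`) instead, which this sketch leaves out.

```python
from datetime import datetime, timezone
from typing import Optional


def pitr_restore_params(source_id: str, target_id: str,
                        restore_time: Optional[datetime] = None) -> dict:
    """Build keyword arguments for RDS restore_db_instance_to_point_in_time."""
    params = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
    }
    if restore_time is None:
        # No explicit time given: restore to the most recent restorable point.
        params["UseLatestRestorableTime"] = True
    else:
        if restore_time.tzinfo is None:
            raise ValueError("restore_time must be timezone-aware")
        params["RestoreTime"] = restore_time.astimezone(timezone.utc)
    return params


params = pitr_restore_params("prod-db", "prod-db-restored")  # hypothetical identifiers
# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("rds").restore_db_instance_to_point_in_time(**params)
```

The restore always creates a new instance, so the application has to be pointed at the new endpoint afterwards.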

Implementation

Just back up the data and restore it later. Backup data; services idle.

  • RDS - automated backups with a maximum retention of 35 days; no backup of read replicas; on-premises RDS replication
  • ElastiCache - Redis can be backed up (very similar to RDS); Memcached cannot
  • Redshift - snapshots with continuous S3 backup for nodes; snapshot cross-region copy feature; restore = new cluster complete with configuration
  • EBS - snapshots are NOT automatic; snapshots of encrypted volumes are encrypted
  • S3/Glacier - S3 is a great target for backup; Glacier retrieval times lead to a long RTO
  • Storage Gateway - continuous backup
  • Snowball & Import/Export Snowball - moves the initial chunk of data, or big chunks of data every once in a while, to S3
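
Because EBS snapshots are not taken automatically, something has to create and prune them on a schedule. The sketch below covers only the pruning decision. It assumes each item looks like an entry in the `Snapshots` list returned by EC2 `describe_snapshots` (with `SnapshotId` and a timezone-aware `StartTime`); the helper name and the 7-day retention are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshot_ids(snapshots: list, retention_days: int, now: datetime) -> list:
    """Return the IDs of snapshots that started before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
snapshots = [  # hypothetical data in the describe_snapshots shape
    {"SnapshotId": "snap-old", "StartTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2024, 1, 30, tzinfo=timezone.utc)},
]
print(expired_snapshot_ids(snapshots, retention_days=7, now=now))  # ['snap-old']
```

Passing `now` in explicitly keeps the function easy to test; a scheduled job would pass the current UTC time.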

RPO/RTO in tens of minutes = Pilot light

  • Provision a copy of your core workload infrastructure in the recovery Region.
  • Replicate your data into the recovery Region and create backups of it there.
  • Resources required to support data replication and backup, such as databases and object storage, are always on.
  • Application servers or serverless compute are not deployed, but can be created when needed with the necessary configuration and application code.

Pilot light Implementation

Live data; services idle

  • Route53 - health checks
  • ASG - Auto Scaling min/max adjustment & stored launch configuration
  • EC2 - AMI as a backup of system configurations
  • RDS - Multi-AZ - synchronous
  • Possible use of DRS
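
The ASG min/max adjustment can be sketched as follows: the group sits at zero instances while idle and is raised to production size on failover. The function only builds keyword arguments for the Auto Scaling `update_auto_scaling_group` call; the helper name, the group name, and the sizes are hypothetical.

```python
def asg_capacity_params(group_name: str, failover: bool,
                        production_min: int, production_max: int) -> dict:
    """Keyword arguments for Auto Scaling update_auto_scaling_group.

    Idle pilot light keeps the group at zero instances; failover raises it
    to the production bounds and starts at the production minimum.
    """
    if not 0 < production_min <= production_max:
        raise ValueError("expected 0 < production_min <= production_max")
    minimum, maximum = (production_min, production_max) if failover else (0, 0)
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": minimum,
        "MaxSize": maximum,
        "DesiredCapacity": minimum,
    }


print(asg_capacity_params("app-dr", failover=False, production_min=4, production_max=12))
print(asg_capacity_params("app-dr", failover=True, production_min=4, production_max=12))
# With boto3: boto3.client("autoscaling").update_auto_scaling_group(**params)
```

Keeping the launch configuration and AMI current is what makes the scale-up from zero usable when it is needed.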

RPO/RTO in minutes = Warm standby

Maintain a scaled-down but fully functional version of your workload always running in the recovery Region. When fully scaled, this is known as hot standby. The more scaled-up the warm standby is, the lower the RTO and the reliance on the control plane will be.

  • Data is replicated and live in the recovery Region.
  • Business-critical systems are fully duplicated and are always on, but with a scaled-down fleet.
  • When the time comes for recovery, the system is scaled up quickly to handle the production load.

Warm Standby Implementation

Live data; services small

  • Route53 - weighted routing; health checks
  • RDS - cross-Region read replicas - asynchronous; engine support varies (Oracle and MS SQL Server were historically unsupported); encryption, option groups, and parameter groups are challenging
  • Possible use of DRS
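
Weighted routing with health checks might be set up roughly like this. The sketch builds a `ChangeBatch` in the shape accepted by Route 53 `change_resource_record_sets`; the helper names, DNS names, health check IDs, and the 25/230 weight split are hypothetical.

```python
def weighted_record(name: str, set_id: str, target: str, weight: int,
                    health_check_id: str, ttl: int = 60) -> dict:
    """One UPSERT change for a weighted CNAME record tied to a health check."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights range from 0 to 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def warm_standby_change_batch(name: str, primary: tuple, standby: tuple,
                              standby_weight: int) -> dict:
    """primary and standby are (dns_target, health_check_id) pairs."""
    return {"Changes": [
        weighted_record(name, "primary", primary[0], 255 - standby_weight, primary[1]),
        weighted_record(name, "standby", standby[0], standby_weight, standby[1]),
    ]}


batch = warm_standby_change_batch(
    "app.example.com",
    primary=("primary.example.com", "hc-primary"),
    standby=("standby.example.com", "hc-standby"),
    standby_weight=25,
)
# With boto3: route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)
```

Traffic is split in proportion to each weight over the sum of weights, so here the standby would receive 25/255 of requests while both endpoints are healthy.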

RPO/RTO near zero = Multi-Region

Multi-site (active-active)

  • Synchronize data across Regions. Possible conflicts caused by writes to the same record in two different regional replicas must be avoided or handled, which can be complex. Data replication is useful for data synchronization and will protect you against some types of disaster, but it will not protect you against data corruption or destruction unless your solution also includes options for point-in-time recovery.
  • Your workload is deployed to, and actively serving traffic from, multiple AWS Regions.
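
One common way to handle conflicting writes to the same record is last-writer-wins, sketched below. This is an illustration of the idea only, not any service's actual implementation; the `Version` type, its fields, and the tie-break on Region name are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    updated_at: float  # epoch seconds of the write
    region: str        # Region that accepted the write


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the newer write; ties are broken by Region name so every replica agrees."""
    return max(a, b, key=lambda v: (v.updated_at, v.region))


east = Version("blue", updated_at=100.0, region="us-east-1")
west = Version("green", updated_at=100.5, region="us-west-2")
print(last_writer_wins(east, west).value)  # green
```

The losing write is silently discarded, which is one reason the notes pair replication with point-in-time recovery.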

Multi-Site Implementation

Live data; live services load balanced between sites

  • Route53 - latency-based routing with health checks
  • DynamoDB - Global Tables
  • Aurora - Global Database
  • Redshift - automated cross-region snapshot copy
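
The effect of latency-based routing with health checks can be simulated in a few lines: each client is sent to the healthy Region with the lowest latency. This models the behaviour only and is not Route 53 code; the function name and the latency figures are hypothetical.

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Choose the healthy Region with the lowest measured latency for one client."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy Region available")
    return min(candidates, key=candidates.get)


latencies = {"us-east-1": 20, "eu-west-1": 95}  # hypothetical measurements for one client
print(pick_region(latencies, healthy={"us-east-1", "eu-west-1"}))  # us-east-1
print(pick_region(latencies, healthy={"eu-west-1"}))               # eu-west-1
```

When the nearest Region fails its health check, traffic moves to the next best one without any manual failover step.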

Elastic Disaster Recovery (DRS)

Very much like Application Migration Service (MGN), except for DR; it used to be called CloudEndure Disaster Recovery. DRS minimizes downtime and data loss with fast, reliable recovery of on-prem and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. It appears to use the same “AWS Replication Agent” as Application Migration Service to do block-level replication for servers; failover happens in minutes.

DRS enables failover and failback of the servers themselves, but it does not handle the networking side of failover (for example, redirecting traffic).

Triage

“Disaster resilient”? Think multi-Region using S3 Cross-Region Replication (CRR).
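
A CRR rule might look roughly like the sketch below, which builds the `ReplicationConfiguration` argument for S3 `put_bucket_replication`. The helper name, role ARN, bucket name, and rule ID are hypothetical. Versioning has to be enabled on both the source and destination buckets before replication can be configured.

```python
def crr_configuration(role_arn: str, destination_bucket: str,
                      rule_id: str = "dr-replication") -> dict:
    """ReplicationConfiguration for S3 put_bucket_replication (single rule, all objects)."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [{
            "ID": rule_id,
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix matches every object
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
        }],
    }


config = crr_configuration(
    "arn:aws:iam::123456789012:role/crr-role",  # hypothetical role ARN
    "my-dr-bucket",                             # hypothetical bucket in the recovery Region
)
# With boto3: s3.put_bucket_replication(Bucket="my-source-bucket",
#                                       ReplicationConfiguration=config)
```

Replication copies changes as they happen, including unwanted overwrites, so versioning or backups are still needed to recover from corruption.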

Resources

Reliability Pillar of the Well-Architected Framework