Relational Database Service enables numerous RDBMS systems to run as a service. These transactional storage engines are recommended for ACID transactions (Atomicity, Consistency, Isolated, Durability). RDS instances with Multi-AZ perform more predictably than single AZ - backups, restores and all housekeeping activities are performed on non-primary instance.Storage for RDS is auto-scaling EBS (gp2 or io1). RDS events get published via SNS.
These days there are two classes of RDS - Aurora and RDS… well, RDS engines.
MySQL/PostgreSQL compatible, relational DB that starts with 10Gb and scales in 10Gb increments to 128 Tb and up to 32 vCPUs and 244 Gb of RAM. Aurora must exist in a VPC, AES-256 for data in transit, use KMS keys or other keys to encrypt storage, snapshots, backup, and replicas. Includes HA capabilities with 2 copies in 3 AZ so you get 6 copies of your data and it is self-healing through disk and data block data scanning and errors are fixed. Data must be written to 4 of the 6 copies and 3 of 6 copies for a read. In case of failure, a read replica is promoted to primary in less than 30 seconds.
There is automatic, continuos incremental backups with a point-in-time restore within a second and are stored for 35 days by default. There are four types of Aurora logs which can be downloaded or sent to CW logs: Error, Slow query, general, and audit. Getting data from S3 to Aurora is straightforward - it can read directly from S3 without client tools! Moving to Aurora from a different RDS engine is simple: create an Aurora read-replica and then promote the read-replica OR restore a snapshot to new Aurora instance.
There are 4 types of Aurora endpoints:
- Read replica endpoints which is load balanced with up to 15 Aurora read replicas that sits in behind a single reader endpoint simplifying access.
- Cluster/Writer Endpoints connect to the primary DB instance in the cluster. Sort of obvious - it is possible to have more than one database on Aurora Cluster; access to all databases happen through the same Writer endpoint.
- Custom Endpoints represent a subset of DB instances with a different configuration; generally the reader endpoints are NOT used after the custom end-point has been defined.
- Instance Endpoints enable connect to a single instance for diagnosis or tuning.
Need Aurora Global? Cross region RR copy the entire database across - not select tables. Aurora Global DB has a single primary region and up to 5 read only regions with less than 1 second replication lag (and up to 16 RR per region). RTO of less than 1 minute for regional failure; ability to manage RPO in Aurora for PostgreSQL. Write forwarding transmits write made to secondary cluster to primary cluster; the writes are made to the primary cluster then replicated to secondary cluster. Write forwarding reduces the number of endpoints to manage….
Aurora Auto Scaling dynamically adjusts the number of Aurora Replicas provisioned for an Aurora DB cluster using single-master replication for MySQL and PostgreSQL. Auto Scaling requires a target metrics (available in CloudWatch), a min/max capacity, and a cool down period.
Love this product - I wrote the launch tutorial and was on the launch call for the initial release of this service. Includes the Data API endpoint which keeps you out of managing DB connections; must have permissions to Data API and Secrets Manager where credentials are checked.
Oracle on RDS
Version of Oracle available include Oracle EE, Oracle SE, and Oracle SE One. Support Multi-AZ & manual snapshots. Oracle does NOT support multi-region replication.
Importing uses two different tools… Oracle SQL Developer for small amounts and Oracle Data Pump for large amounts of data.
Recovery Manager (RMAN) can backup an Oracle RDS database with the intention of restoring on-prem - the backup can not be restored to RDS. RDS backup & restore for RDS based Oracle databases ONLY.
HA: Oracle RAC is not supported on RDS but can be implemented using EC2 placement groups and using Ntop N2n VPN software to tunnel between nodes. In addition, Oracle Data Guard can be implemented as well.
MS SQL on RDS
Supports point-in-time backups using snapshots, Multi-AZ deployments using the native MS Mirroring technology but read replicas are not supported. There is no support for multi-region or Cloud to on-prem replication in RDS based MS SQL but native tools can be used.
On-prem to Cloud migration is super ugly… create RDS empty tables, disable backups/key constraints, import flat files. In addition, FILESTREAM functions are not supported and there is no ability to restore data from file.
Backup & Restoration
Backups are stored on S3. By default, backups are enabled and have a max retention of 35 days. A backup retention period of zero days will disable automated backups.
Deleting an RDS instance deletes ALL the automated backups… but not the manual ones. =» “I acknowledge that upon instance deletion, automated backups, including system snapshots and point-in-time recovery, will no longer be available.”
In a multi-AZ configuration, backups are done from the standby. Backups without multi-AZ are done on the primary and I/O is suspended for about 1 minute.
Every 5 minutes, RDS backs up the transaction log. Combined with regular backups the “Latest Restore Time” is computed and “Restore to point in Time” option allows you to pinpoint a time to restore from.
Restores manifest themselves as a new RDS instance with a new endpoint. Restoration only works with the DB parameter and security groups are associated with the snapshot. You maybe can change storage engines - if they are related.
Snapshots are a full copy of RDS database that are independent of scheduled backups. In a multi-AZ configuration, snapshots are done from the standby. Snapshots are a manual process. Snapshots can be copied across region.
All RDS instance types support encryption at rest and every RDS instance also includes a SSL Certificate for data encryption in flight. Encryption at rest must be setup at instance creation time and uses KMS to generate Master keys then use envelope encryption to generate data keys for the RDS instance. Once enabled, EVERYTHING is encrypted including read replicas, logs, and snapshots. The keys can’t be changed after installation.
Amazon RDS creates an SSL certificate and installs the certificate on the DB instance when Amazon RDS provisions the instance.
To copy an encrypted snapshot from one region to another, you must specify the KMS key identifier of the destination region. This is because KMS encryption keys are specific to the region that they are created in.
Oracle and MS SQL Server support Transparent Data Encryption. Oracle does NOT integrate with KMS but will work with CloudHSM and the Oracle Wallet. SQL Server requires a key but that key is managed by RDS after enabling TDE.
IAM AuthN for MySQL and PostgreSQL; AuthZ still happens in the engine
A multi-AZ failover process is key in the event of an AZ failure (so disaster recovery/high availability) but is NOT a scaling solution. Multi-AZ systems are setup within the same region - that would be obvious from the name.
For MySQL, Oracle and PostgreSQL replication from one zone to another causes a lot of higher write and commit latency - this is a synchronous physical replication using AWS technology - provisioned IOPS is recommended. Native replication, called Mirroring, is used for MS SQLServer.
Patching is another benefit of standby instances - patch the standby, failover, then patch the primary. Backups can be created from the standby instance and that can eliminate I/O locking and latency spikes from backups on the primary.
Failover is automatically triggers when:
- the AZ fails
- the underlying hardware fails
- Loss of connectivity
- Or you can force a fail over using a Reboot with failover using the console or with
- Software patching
A slow server and corrupt data will NOT lead to a failover. Getting notification about a failover as you would expect - RDS events. Failover is implemented as a DNS change so the application servers have to re-establish connections with the failover instance.
RDS Read Replicas
Read Replicas are used to scale RDS by creating a READ ONLY copy of your database with a new DNS address. Automatic backups must be enabled for the RR process to work; can be configured to be multi-AZ or cross-region. Can be done from the console OR using the
CreateDBInstanceReadReplica API call. All RDS engines (except SQL Server) support read replicas. Replicas can be promoted to a primary but this breaks the replication link.
Possible use cases include:
- Scale beyond single instance I/O capabilities
- Serve read-only traffic during times the primary is unavailable
- Business reporting against almost live data without affecting the performance of the primary instance
A transactional DB engine must be used to support read replicas - so MySQL/MariaDB must have InnoDB engine installed. PostgreSQL requires 9.3.5 and in this configuration uses the native PostgreSQL asynchronous streaming replication engine is used.
MySQL, PostgreSQL, and MariaDB support having read replicas in a different region and/or multi-AZ. Using SSL for cross region replication is a great idea seeing that the data moves across the open Internet. Encryption is supported but most RR must match be encrypted if the DB cluster is encrypted.
Replica lag is a key metric - keeping the read replica on a similar, in fact, exact same configuration of instances can help keep this metric inline.
Multi-AZ RDS is synchronous while RDS Read Replica are asynchronous
Read Replica Setup
Read replicas are created from a snapshot. Auto backup must be enabled (this is done by turning the backup retention period to greater than 0) then RDS takes a backup copy of the database then enables replication. Without multi-AZ operations, IO is suspended while RDS snapshots the database; with multi-AZ, the snapshot is taken from the slave. This can be enabled from the console OR using the
CreateDBInstanceReadReplica call from the API.
A DB subnet defines the subnets for a multi-AZ RDS setup; it is best practice to set one up in subnets in multiple regions. The high level steps to create a DB subnet are as follows:
- Create 2 new subnets in a VPC
- In RDS Dashboard create subnet group & add the subnets you just created
- Now launch an RDS instance and select Multi-AZ Deployment
- In “configure advanced settings” select subnet group you created
Presto, you have a multi-AZ DB subnet!
Reserved RDS Instances
Each reservation is associated to a specific DB Engine, instance class, deployment type, license model and region (NOT AZ). And yes, you can have a reserved read replicas.
If any of these attributes are changed then pricing switches to on-demand pricing. Switch your instance BACK to the specs of the reservation OR create a new instance with the same specs, reservation pricing will apply.
Troubleshooting & Monitoring
Scaling the instance can happen immediately, if the checkbox is checked, or will happen in your maintenance window.
Resizing will result in downtime if in a single AZ configuration. In a multi-az, the standby gets resized, there is a fail-over then the master is resized.
Performance Insights to find issues by waits, SQL statements, hosts and users. Use CloudWatch to monitor CPU, memory, swap usage. Enhanced monitoring generates host level, process view, and per-second metrics
There are some non-critical metrics that can be monitored like
More critical metrics and their solutions include:
|FreeStorageSpace||Increase disk space|
|ReplicaLag||Increase IOPS or instance size|
In addition to CloudWatch you can use the RDS service console to monitor RDS events. Event Subscriptions can be created as well.
- Multi-AZ? For DR not scaling, sync replication
- Read replica? Scaling not DR; async replication
- Index/Query focused data? DynamoDB
- Lots of BLOBS? Pointer to S3
- Push button scalability? DynamoDB
- DB2, Informix, Sybase? EC2
- Complete control? EC2
- Copy unencrypted data into encrypted DB? yes!
- CloudTrail track queries? No.
- Distribute traffic across replicas? Route53 Weighted Record Set.
- Access RDS from Lambda? RDS Proxy for AWS Lambda
- Access Aurora from Lambda? RDS Proxy for Aurora (could connect only to read replicas)
- Cross-region failover? Healthchecks/CW Alarm fires Lambda to update Route53 & promote read-replica
- troubleshoot waits, sql statements, hosts and users? Performance Insights
- CPU, Memory, Swap metrics? CloudWatch
- More detailed host level metrics at second level? Enhanced monitoring
- Database performance? Check out the Slow Query log
- Secrets Manager for password? create secret in secret manager, have application read from Secrets manager, reference