Introduction
Aurora automatically fails over to a replica when the primary instance fails. The failover process includes detecting the failure, promoting a replica, updating DNS, and re-establishing connections. Slow failovers cause extended downtime during outages.
Symptoms
Failover taking too long:
```bash
# --source-type is required whenever --source-identifier is given
$ aws rds describe-events \
    --source-type db-cluster \
    --source-identifier my-cluster \
    --start-time 2024-01-15T10:00:00Z

"Event": "Failover to DB instance: my-replica-1 completed after 120 seconds"
# Expected: < 30 seconds for typical Aurora
# Actual: 120+ seconds
```
Application connection errors during failover:
FATAL: the database system is starting up
FATAL: connection refused
DNS resolution delay:
```bash
$ nslookup my-cluster.cluster-xyz.region.rds.amazonaws.com

# Returns old primary IP for minutes after failover
# DNS TTL causing delay in client reconnect
```
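To put a number on the DNS delay described above, one option is to poll the endpoint during a failover test and time how long the old IP keeps coming back. The sketch below is a minimal illustration rather than an established tool: `wait_for_dns_change` and its parameters are hypothetical names, and the resolver, clock, and sleep functions are injected so the logic can be exercised without a real lookup.

```python
import time
from typing import Callable, Optional


def wait_for_dns_change(
    resolve: Callable[[], str],
    old_ip: str,
    timeout: float = 120.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until resolve() stops returning old_ip, or None on timeout."""
    start = clock()
    while True:
        current = resolve()
        elapsed = clock() - start
        if current != old_ip:
            return elapsed
        if elapsed >= timeout:
            return None
        sleep(interval)


# Example wiring (performs real DNS lookups, so it is left commented out;
# the old_ip value is a placeholder):
# import socket
# endpoint = "my-cluster.cluster-xyz.region.rds.amazonaws.com"
# delay = wait_for_dns_change(lambda: socket.gethostbyname(endpoint), old_ip="10.0.1.15")
```

A result well above the endpoint's TTL suggests that a resolver or client-side cache between the application and Route 53 is holding the stale record.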
Common Causes
1. Wrong promotion tier - A non-optimal replica is prioritized
2. Large buffer pool - Memory initialization takes time
3. Cross-AZ failover - A different availability zone adds latency
4. DNS caching - Clients cache old endpoint IPs
5. Connection pool exhaustion - The pool does not handle failover gracefully
6. Instance type mismatch - The replica is smaller than the primary
7. Long-running transactions - Rollback is required during failover
Step-by-Step Fix
1. Check logs for specific error messages
2. Verify configuration settings
3. Test network connectivity
4. Review recent changes
5. Apply corrective action
6. Verify the fix
Step 1: Check Current Failover Configuration
```bash
# Get cluster configuration
# (the member list in the API response is named DBClusterMembers)
aws rds describe-db-clusters --db-cluster-identifier my-cluster \
    --query 'DBClusters[*].DBClusterMembers[*].[DBInstanceIdentifier,IsClusterWriter,PromotionTier]'

# Promotion tier: 0 (highest priority) to 15 (lowest)
# Writer instance has IsClusterWriter: true
```
Step 2: Set Promotion Tiers
```bash
# Set priority for failover candidate
# Lower tier number = higher priority for promotion

# Set the preferred replica to tier 0
aws rds modify-db-instance \
    --db-instance-identifier my-replica-1 \
    --promotion-tier 0 \
    --apply-immediately

# Set backup replicas to higher tiers
aws rds modify-db-instance \
    --db-instance-identifier my-replica-2 \
    --promotion-tier 5 \
    --apply-immediately

# Verify configuration
aws rds describe-db-clusters --db-cluster-identifier my-cluster \
    --query 'DBClusters[*].DBClusterMembers[*].[DBInstanceIdentifier,PromotionTier]'
```
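The tier rule above (lower number wins) can be checked offline against the member list before running a failover test. This is a minimal sketch, assuming input dicts shaped like the cluster member records returned by `describe-db-clusters` (`DBInstanceIdentifier`, `IsClusterWriter`, `PromotionTier`); the function name `likely_failover_targets` is hypothetical.

```python
from typing import Dict, List


def likely_failover_targets(cluster_members: List[Dict]) -> List[str]:
    """Return the reader instance IDs that share the lowest promotion tier."""
    readers = [m for m in cluster_members if not m.get("IsClusterWriter", False)]
    if not readers:
        return []
    best_tier = min(m["PromotionTier"] for m in readers)
    return sorted(
        m["DBInstanceIdentifier"] for m in readers if m["PromotionTier"] == best_tier
    )


members = [
    {"DBInstanceIdentifier": "my-primary", "IsClusterWriter": True, "PromotionTier": 1},
    {"DBInstanceIdentifier": "my-replica-1", "IsClusterWriter": False, "PromotionTier": 0},
    {"DBInstanceIdentifier": "my-replica-2", "IsClusterWriter": False, "PromotionTier": 5},
]
print(likely_failover_targets(members))
```

If the function returns more than one ID, several replicas share the best tier and Aurora chooses among them itself, so giving the intended target a tier of its own makes the outcome predictable.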
Step 3: Use Same Instance Types
```bash
# Check instance classes in cluster
# (cluster member records do not include the instance class, so query the instances)
aws rds describe-db-instances \
    --filters Name=db-cluster-id,Values=my-cluster \
    --query 'DBInstances[*].[DBInstanceIdentifier,DBInstanceClass]'

# Replicas should match primary instance class for:
# - Same memory size (buffer pool)
# - Same CPU capacity
# - Faster recovery after promotion

# If mismatch, upgrade replicas
aws rds modify-db-instance \
    --db-instance-identifier my-replica-1 \
    --db-instance-class db.r6g.xlarge \
    --apply-immediately
```
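The class comparison can be automated once the instance list is available. The following is a rough sketch, assuming records with the `DBInstanceIdentifier` and `DBInstanceClass` fields that `describe-db-instances` returns; `mismatched_replicas` is a hypothetical helper name, and the instance classes in the sample data are illustrative.

```python
from typing import Dict, List


def mismatched_replicas(instances: List[Dict], writer_id: str) -> Dict[str, str]:
    """Map each replica whose class differs from the writer's to its class."""
    classes = {i["DBInstanceIdentifier"]: i["DBInstanceClass"] for i in instances}
    writer_class = classes[writer_id]
    return {
        instance_id: instance_class
        for instance_id, instance_class in classes.items()
        if instance_id != writer_id and instance_class != writer_class
    }


instances = [
    {"DBInstanceIdentifier": "my-primary", "DBInstanceClass": "db.r6g.xlarge"},
    {"DBInstanceIdentifier": "my-replica-1", "DBInstanceClass": "db.r6g.large"},
    {"DBInstanceIdentifier": "my-replica-2", "DBInstanceClass": "db.r6g.xlarge"},
]
print(mismatched_replicas(instances, "my-primary"))
```

An empty result means every replica matches the writer; anything else is a candidate for the `modify-db-instance` call shown above.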
Step 4: Configure Fast Failover Parameters
```bash
# Aurora MySQL parameters affecting failover
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-aurora-params \
    --parameters ParameterName=aurora_parallel_query,ParameterValue=OFF,ApplyMethod=immediate

# Key parameters:
# - innodb_buffer_pool_size: Pre-allocated on replica
# - aurora_parallel_query: Can slow failover if ON
# - max_connections: Should be same on all instances

# Verify the cluster parameter group is in sync on each instance
aws rds describe-db-clusters --db-cluster-identifier my-cluster \
    --query 'DBClusters[*].DBClusterMembers[*].[DBInstanceIdentifier,DBClusterParameterGroupStatus]'
```
Step 5: Test Failover Performance
```bash
# Perform controlled failover test
aws rds failover-db-cluster \
    --db-cluster-identifier my-cluster \
    --target-db-instance-identifier my-replica-1

# Monitor failover events from the last 10 minutes
# (--source-type is required whenever --source-identifier is given)
aws rds describe-events \
    --source-type db-cluster \
    --source-identifier my-cluster \
    --event-categories failover \
    --duration 10

# Measure failover time
# Typical Aurora: 20-30 seconds
# Aurora Serverless v2: 30-60 seconds
# If > 60 seconds, investigate further
```
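Measuring the failover time by hand from the event list is error-prone, so a small helper can compute it from the event timestamps. This is a sketch under assumptions: it expects event dicts with the `Date` and `Message` fields that `describe-events` returns, and it assumes the messages contain the words "started" and "completed" alongside "failover". Check the exact wording in your own event output; `failover_duration_seconds` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def failover_duration_seconds(events: List[Dict]) -> Optional[float]:
    """Seconds between the first 'started' and the last 'completed' failover event."""
    started = None
    completed = None
    for event in sorted(events, key=lambda e: e["Date"]):
        message = event["Message"].lower()
        if "failover" not in message:
            continue
        if "started" in message and started is None:
            started = event["Date"]
        elif "completed" in message and started is not None:
            completed = event["Date"]
    if started is None or completed is None:
        return None
    return (completed - started).total_seconds()


# Sample events; the message text is an assumed format, not captured output.
sample_events = [
    {"Date": datetime(2024, 1, 15, 10, 0, 0, tzinfo=timezone.utc),
     "Message": "Started cross AZ failover to DB instance: my-replica-1"},
    {"Date": datetime(2024, 1, 15, 10, 0, 27, tzinfo=timezone.utc),
     "Message": "Completed failover to DB instance: my-replica-1"},
]
print(failover_duration_seconds(sample_events))
```

Note that this measures the server-side window only; client reconnect time comes on top of it.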
Step 6: Optimize Client Connection Handling
```python
# Use connection pooling with failover awareness
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Use cluster endpoint (not instance endpoint)
CLUSTER_ENDPOINT = "my-cluster.cluster-xyz.region.rds.amazonaws.com"

# Configure pool for failover
engine = create_engine(
    f"mysql+pymysql://user:pass@{CLUSTER_ENDPOINT}/db",
    poolclass=QueuePool,
    pool_size=10,
    pool_pre_ping=True,  # Test each connection on checkout; replace ones broken by failover
    pool_recycle=1800,   # Recycle connections every 30 min
    pool_timeout=30,     # Seconds to wait for a free connection from the pool
    connect_args={
        "connect_timeout": 5,
        "read_timeout": 30,
        "write_timeout": 30,
    },
)
```
Step 7: Configure DNS TTL for Faster Reconnect
```bash
# Aurora endpoints have TTL of ~5 seconds
# But some clients/DNS resolvers cache longer

# Check DNS TTL
dig my-cluster.cluster-xyz.region.rds.amazonaws.com

# Use the cluster endpoint (CNAME) not instance endpoint
# Cluster endpoint automatically points to new writer after failover

# Application should:
# 1. Use cluster endpoint for writes
# 2. Use reader endpoint for reads
# 3. Implement retry logic with exponential backoff
```
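The third item in the application checklist, retry logic with exponential backoff, could look roughly like the sketch below. It is one possible shape, not a prescribed implementation: `retry_with_backoff` and its defaults are hypothetical, and it uses "full jitter" (a random sleep between zero and the current cap) so that many clients do not reconnect in lockstep once the new writer comes up.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on the given exceptions with capped backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # Full jitter within the current cap
    raise AssertionError("unreachable")
```

With SQLAlchemy, the exception to retry on would typically be `sqlalchemy.exc.OperationalError`, passed as `retry_on=(OperationalError,)`. Only wrap operations that are safe to repeat, since a write interrupted by failover may or may not have been committed.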
Step 8: Monitor Failover Metrics
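```bash
# Check failover duration via CloudWatch
# Confirm the metric is published for your cluster before relying on it:
#   aws cloudwatch list-metrics --namespace AWS/RDS --metric-name FailoverDuration
# (--start-time and --end-time are required; adjust to the test window)
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name FailoverDuration \
    --dimensions Name=DBClusterIdentifier,Value=my-cluster \
    --statistics Maximum \
    --period 60 \
    --start-time 2024-01-15T10:00:00Z \
    --end-time 2024-01-15T11:00:00Z

# Set alert for slow failovers (--statistic is required)
aws cloudwatch put-metric-alarm \
    --alarm-name aurora-failover-slow \
    --namespace AWS/RDS \
    --metric-name FailoverDuration \
    --dimensions Name=DBClusterIdentifier,Value=my-cluster \
    --statistic Maximum \
    --threshold 60 \
    --comparison-operator GreaterThanThreshold \
    --period 60 \
    --evaluation-periods 1
```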
Step 9: Handle Transaction Rollback
```bash
# Long-running transactions slow failover
# Confirm the writer instance is available before inspecting its transactions
aws rds describe-db-instances --db-instance-identifier my-primary \
    --query 'DBInstances[*].DBInstanceStatus'

# Before a failover can complete, Aurora must:
# 1. Abort all transactions
# 2. Rollback uncommitted changes
# 3. Apply any in-flight changes from storage

# Monitor long-running transactions (run in a mysql client session)
mysql> SELECT * FROM information_schema.innodb_trx ORDER BY trx_started ASC;

mysql> SHOW ENGINE INNODB STATUS\G
```
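Once the `innodb_trx` rows are fetched, filtering for transactions that have been open too long is straightforward. The sketch below assumes each row is a dict with at least the `trx_id` and `trx_started` columns from that table; the function name and the five-minute default threshold are illustrative choices, not values from Aurora.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def long_running_transactions(
    rows: List[Dict],
    now: datetime,
    threshold: timedelta = timedelta(minutes=5),
) -> List[Dict]:
    """Return rows open longer than threshold, oldest first."""
    old = [row for row in rows if now - row["trx_started"] > threshold]
    return sorted(old, key=lambda row: row["trx_started"])


# Sample rows with made-up transaction IDs.
now = datetime(2024, 1, 15, 10, 30, 0)
rows = [
    {"trx_id": "1002", "trx_started": datetime(2024, 1, 15, 10, 29, 0)},
    {"trx_id": "1001", "trx_started": datetime(2024, 1, 15, 10, 10, 0)},
    {"trx_id": "1000", "trx_started": datetime(2024, 1, 15, 9, 0, 0)},
]
for row in long_running_transactions(rows, now):
    print(row["trx_id"], now - row["trx_started"])
```

Running a check like this before a planned failover gives a chance to let large transactions finish or to cancel them, rather than paying for their rollback during promotion.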
Step 10: Use Aurora Global Database for Cross-Region
For cross-region failover:
```bash
# Create global cluster
aws rds create-global-cluster \
    --global-cluster-identifier my-global-cluster \
    --engine aurora-mysql

# Add primary region
aws rds create-db-cluster \
    --db-cluster-identifier my-cluster \
    --global-cluster-identifier my-global-cluster \
    --engine aurora-mysql \
    --region us-east-1

# Add secondary region
aws rds create-db-cluster \
    --db-cluster-identifier my-cluster-secondary \
    --global-cluster-identifier my-global-cluster \
    --engine aurora-mysql \
    --region us-west-2

# Cross-region failover: 1-2 minutes (vs seconds for same-region)
```
Aurora Failover Timeline
| Phase | Duration | Description |
|---|---|---|
| Detection | 10-30s | Detect primary failure |
| Election | 5-10s | Select new primary |
| Promotion | 10-30s | Promote replica |
| DNS Update | 5-10s | Update endpoint DNS |
| Recovery | Variable | Client reconnect |
Total expected: 30-60 seconds for a well-configured Aurora cluster, excluding the variable client reconnect phase
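The per-phase ranges in the table can serve as a budget when reviewing a slow failover. The sketch below assumes you already have per-phase timings from your own instrumentation (how to obtain them is left open); `PHASE_BOUNDS` copies the table values, and `phases_over_budget` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Lower and upper bounds in seconds, copied from the timeline table.
PHASE_BOUNDS: Dict[str, Tuple[int, int]] = {
    "Detection": (10, 30),
    "Election": (5, 10),
    "Promotion": (10, 30),
    "DNS Update": (5, 10),
}


def phases_over_budget(measured: Dict[str, float]) -> Dict[str, float]:
    """Return {phase: seconds over the upper bound} for phases that ran long."""
    over = {}
    for phase, seconds in measured.items():
        if phase not in PHASE_BOUNDS:
            continue  # e.g. "Recovery", which the table lists as variable
        upper = PHASE_BOUNDS[phase][1]
        if seconds > upper:
            over[phase] = seconds - upper
    return over


# Illustrative measurements, not real data.
measured = {"Detection": 45, "Election": 8, "Promotion": 30, "DNS Update": 12, "Recovery": 90}
print(phases_over_budget(measured))
```

The phase that overruns points at where to look first: detection and promotion overruns usually trace back to the cluster side, while DNS and recovery overruns point at client configuration.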
Verification
```bash
# After configuration changes, test failover
aws rds failover-db-cluster \
    --db-cluster-identifier my-cluster \
    --target-db-instance-identifier my-replica-1

# Monitor duration (events from the last 5 minutes)
aws rds describe-events \
    --source-type db-cluster \
    --source-identifier my-cluster \
    --duration 5

# Check new writer
aws rds describe-db-clusters --db-cluster-identifier my-cluster \
    --query 'DBClusters[*].DBClusterMembers[?IsClusterWriter].DBInstanceIdentifier'
```
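In an automated failover drill, the final check (did the intended replica become the writer?) can be expressed as a small assertion over the cluster description. This is a sketch assuming a cluster dict shaped like one entry of the `DBClusters` list from `describe-db-clusters`, with its `DBClusterMembers` list; `verify_failover` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple


def verify_failover(cluster: Dict, expected_writer: str) -> Tuple[bool, Optional[str]]:
    """Return (matches_expected, actual_writer_id) for a cluster description."""
    writers = [
        member["DBInstanceIdentifier"]
        for member in cluster.get("DBClusterMembers", [])
        if member.get("IsClusterWriter")
    ]
    # Exactly one writer is expected; anything else is reported as unknown.
    actual = writers[0] if len(writers) == 1 else None
    return actual == expected_writer, actual


cluster = {
    "DBClusterIdentifier": "my-cluster",
    "DBClusterMembers": [
        {"DBInstanceIdentifier": "my-primary", "IsClusterWriter": False, "PromotionTier": 1},
        {"DBInstanceIdentifier": "my-replica-1", "IsClusterWriter": True, "PromotionTier": 0},
    ],
}
print(verify_failover(cluster, "my-replica-1"))
```

Returning the actual writer alongside the boolean makes a failed drill easier to diagnose, since it shows which instance was promoted instead.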
Related Issues
- [Fix AWS RDS Read Replica Lag High](/articles/fix-aws-rds-read-replica-lag-high)
- [Fix AWS RDS Aurora Global Database Lag](/articles/fix-aws-rds-aurora-global-database-lag)
- [Fix AWS RDS Connection Limit Exceeded](/articles/fix-aws-rds-connection-limit-exceeded)