Introduction
RDS read replicas asynchronously replicate data from the primary instance. When replication can't keep up with write traffic, the replica falls behind. Queries on the replica return stale data, and the replica may become unusable for read scaling.
Symptoms
Replication lag metrics:
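```bash
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=my-replica \
    --statistics Maximum \
    --period 300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Lag of minutes or hours shows up as large values (ReplicaLag is reported in seconds):
# "Datapoints": [{"Maximum": 300.0}]   # 5 minutes of lag
```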
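To turn that output into a single number, the datapoints can be reduced to their peak. The following is a minimal sketch, assuming the response has been parsed into a dictionary shaped like the CLI output above; the helper name `peak_lag_seconds` and the sample values are hypothetical.

```python
def peak_lag_seconds(response):
    """Return the largest 'Maximum' value among the datapoints, or None if there are none."""
    values = [dp["Maximum"] for dp in response.get("Datapoints", []) if "Maximum" in dp]
    return max(values) if values else None


# Hypothetical sample shaped like the get-metric-statistics output
sample = {
    "Label": "ReplicaLag",
    "Datapoints": [{"Maximum": 12.0}, {"Maximum": 300.0}, {"Maximum": 45.0}],
}

peak = peak_lag_seconds(sample)
print(f"peak lag: {peak:.0f} s ({peak / 60:.1f} min)")
```

Returning `None` for an empty datapoint list keeps "no data" distinct from "zero lag", which matters because a stopped replica may report no datapoints at all.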
Application reads stale data:
-- Primary: Latest transaction ID = 5000
-- Replica query shows:
SELECT MAX(transaction_id) FROM orders; -- Returns 4800 (missing 200 transactions)
Replica status:
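```bash
$ aws rds describe-db-instances --db-instance-identifier my-replica \
    --query 'DBInstances[*].[DBInstanceStatus,ReadReplicaSourceDBInstanceIdentifier]'

# The replica can report status "available" while still being far behind.
# The lag itself comes from the CloudWatch ReplicaLag metric, for example:
#   ReplicaLag = 3600   # 1 hour behind
```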
Common Causes
1. High write throughput - More writes than replication can process
2. Replica undersized - Instance class can't handle the replication workload
3. Long-running transactions - Large transactions block replication
4. Network latency - Cross-region or poor network connectivity
5. Binary log contention - MySQL binlog processing bottleneck
6. IOPS limit reached - Storage performance limits replication
7. Replication thread issues - Single replication thread in MySQL
Step-by-Step Fix
1. Check logs for specific error messages
2. Verify configuration settings
3. Test network connectivity
4. Review recent changes
5. Apply corrective action
6. Verify the fix
Step 1: Monitor Replication Lag
```bash
# Check current lag over the last hour
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=my-replica \
    --statistics Maximum \
    --period 60 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Set up a lag alert: fires when lag stays above 300 s for 3 consecutive 60 s periods
aws cloudwatch put-metric-alarm \
    --alarm-name rds-replica-lag-high \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=my-replica \
    --statistic Maximum \
    --threshold 300 \
    --comparison-operator GreaterThanThreshold \
    --period 60 \
    --evaluation-periods 3
```
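The alarm settings above (threshold 300, `GreaterThanThreshold`, 3 evaluation periods) can be read as "three consecutive datapoints strictly above 300 seconds". The sketch below models only that rule; it is a simplification that ignores CloudWatch's missing-data handling, and the function name is hypothetical.

```python
def alarm_would_fire(lag_values, threshold=300, evaluation_periods=3):
    """Return True if `evaluation_periods` consecutive values are strictly above `threshold`."""
    consecutive = 0
    for value in lag_values:
        # A value at or below the threshold resets the streak
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= evaluation_periods:
            return True
    return False


# One-minute lag samples (seconds): a short spike does not trip the alarm
print(alarm_would_fire([310, 320, 100, 400]))
# Three breaching minutes in a row do
print(alarm_would_fire([100, 301, 302, 303]))
```

Requiring several consecutive periods is what keeps a brief spike, for example during a batch job, from paging anyone.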
Step 2: Check Primary Instance Write Load
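```bash
# Time window shared by the queries below (last hour, UTC)
START="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)"
END="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check write operations per second on the primary
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name WriteIOPS \
    --dimensions Name=DBInstanceIdentifier,Value=my-primary \
    --statistics Average \
    --period 300 \
    --start-time "$START" --end-time "$END"

# Check write volume in bytes per second
# (AWS/RDS has no "Transactions" metric; WriteThroughput is the closest built-in signal)
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name WriteThroughput \
    --dimensions Name=DBInstanceIdentifier,Value=my-primary \
    --statistics Average \
    --period 300 \
    --start-time "$START" --end-time "$END"

# Rule of thumb: sustained writes above roughly 1000/sec may overwhelm the replica
```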
Step 3: Upgrade Replica Instance Class
```bash
# Compare primary and replica instance classes
aws rds describe-db-instances --db-instance-identifier my-primary \
    --query 'DBInstances[*].DBInstanceClass'

aws rds describe-db-instances --db-instance-identifier my-replica \
    --query 'DBInstances[*].DBInstanceClass'

# Replica should match or exceed primary for compute/storage
# Upgrade replica
aws rds modify-db-instance \
    --db-instance-identifier my-replica \
    --db-instance-class db.r6g.xlarge \
    --apply-immediately
```
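The comparison of the two instance classes can be automated with a rough check on the size suffix. This is a heuristic sketch only: it assumes both instances are in comparable families, ignores the family part of the class name (for example `r6g` versus `t3`), and the size ordering list is an assumption that may not cover every class.

```python
# Assumed ordering of size suffixes, smallest to largest
SIZE_ORDER = [
    "micro", "small", "medium", "large", "xlarge", "2xlarge",
    "4xlarge", "8xlarge", "12xlarge", "16xlarge", "24xlarge", "32xlarge",
]


def size_rank(instance_class):
    """Rank an instance class such as 'db.r6g.xlarge' by its size suffix."""
    size = instance_class.rsplit(".", 1)[-1]
    return SIZE_ORDER.index(size)  # raises ValueError for an unknown size


def replica_is_smaller(primary_class, replica_class):
    """Return True if the replica's size suffix ranks below the primary's."""
    return size_rank(replica_class) < size_rank(primary_class)


print(replica_is_smaller("db.r6g.xlarge", "db.r6g.large"))
```

A `True` result suggests the replica is a candidate for the `modify-db-instance` upgrade shown above.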
Step 4: Increase Replica IOPS
```bash
# Check replica IOPS utilization over the last hour
# (repeat with WriteIOPS: applying replicated changes is write-heavy)
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReadIOPS \
    --dimensions Name=DBInstanceIdentifier,Value=my-replica \
    --statistics Maximum \
    --period 300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# If hitting the Provisioned IOPS limit, increase it
aws rds modify-db-instance \
    --db-instance-identifier my-replica \
    --iops 10000 \
    --storage-type io1 \
    --apply-immediately

# Or switch to gp3 with a higher baseline
aws rds modify-db-instance \
    --db-instance-identifier my-replica \
    --storage-type gp3 \
    --apply-immediately
```
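Whether the replica is "hitting the limit" comes down to comparing observed IOPS with the provisioned figure. A small sketch of that arithmetic follows; the 90% cutoff and the function names are assumptions chosen for illustration, not AWS guidance.

```python
def iops_utilization(read_iops, write_iops, provisioned_iops):
    """Fraction of provisioned IOPS consumed by combined reads and writes."""
    return (read_iops + write_iops) / provisioned_iops


def is_iops_bound(read_iops, write_iops, provisioned_iops, limit=0.9):
    """Return True when utilization reaches the (assumed) 90% cutoff."""
    return iops_utilization(read_iops, write_iops, provisioned_iops) >= limit


# Hypothetical peak values read from the ReadIOPS and WriteIOPS metrics
print(f"{iops_utilization(2000, 7500, 10000):.0%}")
print(is_iops_bound(2000, 7500, 10000))
```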
Step 5: Optimize Long-Running Transactions
```sql
-- On the primary, check for long-running transactions (oldest first)
SELECT * FROM information_schema.innodb_trx ORDER BY trx_started ASC;

-- Kill blocking transactions if needed
-- (replace THREAD_ID with the trx_mysql_thread_id value from the query above)
CALL mysql.rds_kill(THREAD_ID);

-- Avoid:
--   Large batch inserts without chunking
--   ALTER TABLE on large tables
--   Transactions holding locks for extended periods
```
Chunk large operations:
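```python
# Instead of a single large insert, which replicates as one huge transaction:
#   cursor.execute("INSERT INTO target_table SELECT * FROM source")  # BAD

# Chunk the operation. Assumes `connection` is an open DB-API connection,
# `total` is the source row count, and `source` has a primary key `id`.
batch_size = 1000
cursor = connection.cursor()
for offset in range(0, total, batch_size):
    # ORDER BY makes LIMIT/OFFSET deterministic, so no row is skipped or copied twice
    cursor.execute(
        f"INSERT INTO target_table SELECT * FROM source "
        f"ORDER BY id LIMIT {batch_size} OFFSET {offset}"
    )
    connection.commit()  # Commit each chunk
```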
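The same chunking idea can be tried end to end without a MySQL server. The sketch below uses Python's built-in `sqlite3` module as a stand-in and swaps OFFSET for keyset pagination (`WHERE id > last_id`), which avoids rescanning skipped rows on every batch. Table and column names are hypothetical, and the SQL may need adjusting for MySQL.

```python
import sqlite3


def copy_in_chunks(conn, batch_size):
    """Copy rows from `source` to `target` in id order, one commit per batch."""
    last_id = 0
    batches = 0
    while True:
        # Find the highest id in the next batch; None means nothing is left
        upper = conn.execute(
            "SELECT MAX(id) FROM "
            "(SELECT id FROM source WHERE id > ? ORDER BY id LIMIT ?)",
            (last_id, batch_size),
        ).fetchone()[0]
        if upper is None:
            break
        conn.execute(
            "INSERT INTO target (id, payload) "
            "SELECT id, payload FROM source WHERE id > ? AND id <= ?",
            (last_id, upper),
        )
        conn.commit()  # each batch is its own small transaction
        last_id = upper
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)", [(i, f"row-{i}") for i in range(1, 2501)]
)
conn.commit()

batches = copy_in_chunks(conn, batch_size=1000)
copied = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(batches, copied)
```

Small transactions matter for replication because the replica can only apply a transaction once it has received it in full, so one very large transaction shows up as a lag spike.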
Step 6: Check Network Bandwidth
```bash
# For cross-region replicas, network bandwidth matters
# Check network throughput over the last hour
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name NetworkReceiveThroughput \
    --dimensions Name=DBInstanceIdentifier,Value=my-replica \
    --statistics Maximum \
    --period 300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Cross-region replication is limited by inter-region bandwidth
# Consider a same-region replica for better performance
```
Step 7: MySQL Replication Optimization
```bash
# Check replica status in MySQL (SHOW REPLICA STATUS on MySQL 8.0.22+)
mysql> SHOW SLAVE STATUS;

# Key fields:
#   Seconds_Behind_Master: Lag in seconds
#   Relay_Master_Log_File: Current binlog file
#   Exec_Master_Log_Pos: Position in binlog

# Increase parallel replication threads
# (replica_parallel_workers exists in MySQL 8.0.26+; older versions use slave_parallel_workers)
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-mysql-params \
    --parameters "ParameterName=replica_parallel_workers,ParameterValue=4,ApplyMethod=immediate"
```
Important MySQL parameters:
```bash
# Increase binlog buffer
aws rds modify-db-parameter-group \
    --db-parameter-group-name mysql-params \
    --parameters "ParameterName=binlog_cache_size,ParameterValue=131072,ApplyMethod=pending-reboot"

# Enable parallel replication
aws rds modify-db-parameter-group \
    --db-parameter-group-name mysql-params \
    --parameters "ParameterName=slave_parallel_workers,ParameterValue=4,ApplyMethod=pending-reboot"

# Check sync_binlog setting
# sync_binlog=0: Faster but less durable
# sync_binlog=1: Slower but safe
```
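The replica status fields can also be checked programmatically. The sketch below assumes one `SHOW SLAVE STATUS` row has already been fetched as a dictionary (for example with a dict-style cursor); the function name and the 30-second default are illustrative choices. `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master` are standard columns of that statement.

```python
def replication_state(row, max_lag=30):
    """Classify a SHOW SLAVE STATUS row as 'stopped', 'lagging', or 'healthy'."""
    io_ok = row.get("Slave_IO_Running") == "Yes"
    sql_ok = row.get("Slave_SQL_Running") == "Yes"
    lag = row.get("Seconds_Behind_Master")
    # Seconds_Behind_Master is NULL (None) when replication is not running
    if not (io_ok and sql_ok) or lag is None:
        return "stopped"
    return "lagging" if lag > max_lag else "healthy"


# Hypothetical row values
row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 420}
print(replication_state(row))
```

Checking the two thread flags first is deliberate: a stopped SQL thread reports no lag value at all, which could otherwise be mistaken for a healthy replica.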
Step 8: PostgreSQL Replication Optimization
```sql
-- Check replication status (run on the primary)
SELECT * FROM pg_stat_replication;

-- Key fields:
--   sent_lsn, write_lsn, flush_lsn, replay_lsn
--   Lag = sent_lsn - replay_lsn

-- Calculate lag in bytes and megabytes
SELECT client_addr,
       state,
       sent_lsn - replay_lsn AS lag_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;
```
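To see what the LSN subtraction computes, note that an LSN such as `16/B374D848` is two hexadecimal numbers: the high 32 bits and the low 32 bits of a 64-bit WAL position. The sketch below reproduces the byte difference client-side; the helper names and sample LSNs are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN like '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(sent_lsn, replay_lsn):
    """Bytes of WAL sent to the replica but not yet replayed."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


print(lag_bytes("16/B374D848", "16/B374D000"))
```

Running the SQL on the server remains the simpler option; this form is mainly useful when LSNs arrive as strings in logs or monitoring output.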
Optimize PostgreSQL parameters:
```bash
# Increase max_wal_senders
aws rds modify-db-parameter-group \
    --db-parameter-group-name postgres-params \
    --parameters "ParameterName=max_wal_senders,ParameterValue=10,ApplyMethod=pending-reboot"

# Adjust wal_keep_size (value in MB; PostgreSQL 13+, older versions use wal_keep_segments)
aws rds modify-db-parameter-group \
    --db-parameter-group-name postgres-params \
    --parameters "ParameterName=wal_keep_size,ParameterValue=2048,ApplyMethod=pending-reboot"
```
Step 9: Restart Replication If Stalled
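```bash
# For MySQL - restart replication
# (on RDS the master user typically cannot run STOP SLAVE / START SLAVE; use the RDS procedures)
mysql> CALL mysql.rds_stop_replication;
mysql> CALL mysql.rds_start_replication;

# For PostgreSQL - reconnect the replica by rebooting it
aws rds reboot-db-instance --db-instance-identifier my-replica

# If lag is severe, recreate the replica
aws rds delete-db-instance --db-instance-identifier my-replica --skip-final-snapshot

# Wait until the old replica is gone before reusing its identifier
aws rds wait db-instance-deleted --db-instance-identifier my-replica

aws rds create-db-instance-read-replica \
    --db-instance-identifier my-replica \
    --source-db-instance-identifier arn:aws:rds:region:account:db:my-primary
```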
Step 10: Consider Aurora for Better Replication
```bash
# Aurora replicas share storage with the primary, so replica lag is typically milliseconds

# "password" is a placeholder; use a real secret
aws rds create-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --engine aurora-mysql \
    --master-username admin \
    --master-user-password password

# The first instance in a new cluster becomes the writer;
# instances added after it are Aurora Replicas
aws rds create-db-instance \
    --db-instance-identifier my-aurora-replica \
    --db-cluster-identifier my-aurora-cluster \
    --db-instance-class db.r6g.large \
    --engine aurora-mysql
```
Aurora advantages:
- Replicas can be promoted quickly during failover
- Very low replica lag for reads (shared storage), typically well under a second
- Up to 15 replicas per cluster
Replication Lag Thresholds
| Lag | Impact | Action |
|---|---|---|
| < 1 second | Excellent | Normal operation |
| 1-30 seconds | Acceptable | Monitor closely |
| 30-60 seconds | Concerning | Investigate cause |
| 1-5 minutes | Poor | Scale resources, optimize |
| > 5 minutes | Critical | Restart/recreate replica |
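The table can be encoded directly for use in monitoring scripts. The sketch below follows the rows above; because the table's ranges touch at their edges (30 and 60 seconds), it assumes each boundary value belongs to the more severe row, except that exactly 5 minutes is still treated as "Poor". The function name is hypothetical.

```python
def classify_lag(seconds):
    """Map replica lag in seconds to the (impact, action) pair from the table."""
    if seconds < 1:
        return ("Excellent", "Normal operation")
    if seconds < 30:
        return ("Acceptable", "Monitor closely")
    if seconds < 60:
        return ("Concerning", "Investigate cause")
    if seconds <= 300:
        return ("Poor", "Scale resources, optimize")
    return ("Critical", "Restart/recreate replica")


for lag in (0.2, 12, 45, 180, 3600):
    impact, action = classify_lag(lag)
    print(f"{lag:>6} s  {impact:<10}  {action}")
```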
Verification
```bash
# Check lag after the fix
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=my-replica \
    --statistics Maximum \
    --period 60 \
    --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Should show lag < 30 seconds

# Query the replica and verify it has caught up
mysql> SHOW SLAVE STATUS;
# Seconds_Behind_Master should be small
```
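A single good reading does not prove the fix has held, so verification is better done over several consecutive checks. The following is a minimal polling sketch: `get_lag` is a hypothetical callable that returns the current lag in seconds (for example by wrapping the CloudWatch query above), and the defaults other than the 30-second target are assumptions.

```python
import time


def wait_for_lag_below(get_lag, threshold=30, checks=3, interval=60,
                       max_polls=60, sleep=time.sleep):
    """Return True once lag stays under `threshold` for `checks` consecutive polls."""
    streak = 0
    for _ in range(max_polls):
        lag = get_lag()
        # A missing reading (None) or a high reading resets the streak
        streak = streak + 1 if lag is not None and lag < threshold else 0
        if streak >= checks:
            return True
        sleep(interval)
    return False


# Simulated readings; sleeping is disabled so the demo finishes immediately
readings = iter([400, 120, 25, 40, 20, 10, 5])
recovered = wait_for_lag_below(lambda: next(readings), sleep=lambda s: None)
print(recovered)
```

Passing `sleep` as a parameter is only there to make the loop testable without waiting; in real use the default `time.sleep` applies.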
Related Issues
- [Fix AWS RDS Connection Limit Exceeded](/articles/fix-aws-rds-connection-limit-exceeded)
- [Fix AWS RDS Aurora Failover Slow](/articles/fix-aws-rds-aurora-failover-slow)
- [Fix AWS RDS Parameter Group Not Applying](/articles/fix-aws-rds-parameter-group-not-applying)