Ansible Replica Rebuild Never Catches Up

PostgreSQL replica in Ansible Tower/AWX cannot catch up after rebuilding from a base backup that followed a massive DELETE operation on job events or inventory tables.

Published: Mar 7, 2026 | 9 min read | By FixWikiHub Editorial Team

Introduction

When you rebuild a PostgreSQL replica in Ansible Tower or AWX after performing a large DELETE operation (such as purging millions of job events or cleaning up old inventory records), the replica may never catch up to the primary. The WAL replay process gets stuck processing the DELETE operations in the WAL stream, and the replica continues falling further behind despite having adequate hardware resources.

This issue occurs because physical streaming replication replays WAL in a single process on the replica, while the primary can generate WAL from many sessions at once. Each deleted row produces its own WAL record, and every record must be replayed in order, creating a bottleneck that can last for days.

Symptoms

The replica will show ever-increasing lag despite being rebuilt:

```
# Initial replica status after rebuild (run on the primary)
$ sudo -u postgres psql -c "SELECT client_addr, state, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag FROM pg_stat_replication;"

 client_addr |   state   | lag
-------------+-----------+------
 10.0.1.52   | streaming | 2 GB

# 1 hour later
 client_addr |   state   | lag
-------------+-----------+------
 10.0.1.52   | streaming | 5 GB

# 6 hours later - lag continues growing
 client_addr |   state   | lag
-------------+-----------+-------
 10.0.1.52   | streaming | 12 GB

# PostgreSQL logs on replica
2024-03-15 10:15:23.456 UTC [12345] LOG: received redo request at 0/2A000B90
2024-03-15 10:15:25.789 UTC [12345] LOG: redo done at 0/2A001230
2024-03-15 10:16:45.123 UTC [12345] LOG: checkpoint starting: time
2024-03-15 10:45:23.456 UTC [12345] LOG: checkpoint complete: wrote 45234 buffers (27.6%)

# Replication slot falling behind (run on the primary)
$ sudo -u postgres psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"

   slot_name   | active | retained_wal
---------------+--------+--------------
 tower_replica | t      | 50 GB

# Check WAL files accumulating in pg_wal
$ ls -la /var/lib/postgresql/14/main/pg_wal/ | wc -l
1247

# Replay process using high CPU but low disk I/O (run on the replica)
# WAL is replayed by the startup process; the WAL receiver only writes incoming WAL to disk
$ top -p "$(pgrep -f 'postgres:.*startup')"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
18423 postgres  20   0  245624  45124  12345 S  98.7   1.2 145:23.45 postgres: startup recovering ...
```

In Tower logs:

2024-03-15 10:30:45,123 WARNING  awx.main.tasks Cluster heartbeat delayed by 180 seconds
2024-03-15 10:31:12,456 ERROR    awx.main.dispatch Instance tower-replica is lagging behind primary by 180+ seconds
2024-03-15 10:32:01,789 CRITICAL awx.main.models.ha Failover capacity degraded

Common Causes

The replica cannot catch up due to the inherent inefficiency of replaying DELETE operations:

1. WAL records each deleted tuple individually: When you delete 10 million rows from main_jobevent, PostgreSQL generates 10 million WAL records, one for each tuple deletion. The replica must read and apply each record sequentially.
2. Index cleanup adds to the work: DELETE itself only marks heap tuples as dead and leaves the indexes alone. The VACUUM that follows removes the dead entries from the heap and from every index, and those changes are WAL-logged as well. On a table with 5 indexes, the replica replays cleanup for the heap plus all 5 indexes on top of the 10 million delete records.
3. Base backup taken after DELETE started: If the base backup was initiated while DELETE was still in progress, the replica must replay all WAL generated since the backup's starting checkpoint, which includes the remainder of the DELETE.
4. FPI (Full Page Images) overhead: With full_page_writes enabled, the first change to a page after each checkpoint logs the entire page. A large DELETE touches many pages, which can increase WAL volume by 2-4x.
5. Random I/O during replay: Even with fast storage, the replay process reads the heap and index pages it needs one at a time, so scattered page reads bottleneck the process.
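
To see how much WAL a purge really produces, one option is to record `pg_current_wal_lsn()` on the primary before and after a small test batch and subtract the two values. The sketch below does that subtraction in plain bash. It assumes the standard `X/Y` hexadecimal text form of an LSN (high 32 bits, a slash, low 32 bits). `lsn_to_bytes` and `wal_bytes_between` are hypothetical helper names, not PostgreSQL tools.

```bash
#!/usr/bin/env bash
# Convert a PostgreSQL LSN such as "0/2A000B90" to a byte position.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( (16#$hi << 32) + 16#$lo ))
}

# Bytes of WAL written between two LSNs (arguments: start end).
wal_bytes_between() {
  echo $(( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ))
}
```

With the two LSNs from the replica log excerpt under Symptoms, `wal_bytes_between 0/2A000B90 0/2A001230` prints `1696`. Inside the database, `pg_wal_lsn_diff()` performs the same subtraction.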

Step-by-Step Fix

Step 1: Assess the Situation

Determine if catch-up is possible or if a fresh approach is needed:

```bash
# Check current lag and trend (run on the primary)
sudo -u postgres psql -c "
SELECT client_addr, state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS current_lag,
       sent_lsn, replay_lsn
FROM pg_stat_replication;"

# Calculate lag growth rate (two samples, 5 minutes apart)
# -At strips headers and padding so each file holds only the number;
# LIMIT 1 assumes a single replica
# Run 1:
sudo -u postgres psql -At -c "SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn)::bigint FROM pg_stat_replication LIMIT 1;" > /tmp/lag1.txt
sleep 300
# Run 2:
sudo -u postgres psql -At -c "SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn)::bigint FROM pg_stat_replication LIMIT 1;" > /tmp/lag2.txt

# Calculate growth rate (a negative value means the replica is catching up)
echo "Lag growth per minute: $(( ($(cat /tmp/lag2.txt) - $(cat /tmp/lag1.txt)) / 5 )) bytes"

# Check how much WAL the slot is retaining for the replica
sudo -u postgres psql -c "
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS total_wal_to_replay
FROM pg_replication_slots;"

# Estimate time to catch up
# time = current_lag / (replay_rate - wal_generation_rate)
# Catch-up is only possible while replay_rate exceeds wal_generation_rate.
```
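
The catch-up estimate can be worked out from the same two lag samples. The function below is a minimal sketch, assuming the lag is measured in bytes and keeps shrinking at the rate seen between the samples. `estimate_catchup_seconds` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Estimate seconds until the replica catches up, from two lag samples.
# Arguments: lag1_bytes lag2_bytes interval_seconds
# Prints "never" when the lag is flat or growing.
estimate_catchup_seconds() {
  local lag1=$1 lag2=$2 interval=$3
  local shrink=$(( lag1 - lag2 ))   # bytes the replica gained during the interval
  if (( shrink <= 0 )); then
    echo "never"
    return 0
  fi
  # remaining lag / (shrink / interval), reordered to stay in integer arithmetic
  echo $(( lag2 * interval / shrink ))
}
```

For example, a lag that drops from 5 GiB to 4 GiB over 300 seconds, `estimate_catchup_seconds 5368709120 4294967296 300`, prints `1200`. A result of `never` suggests moving on to a fresh base backup instead of waiting.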

Step 2: Stop the Primary from Generating More WAL (If Possible)

Temporarily reduce WAL generation:

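```bash
# On the primary, stop Tower services to reduce activity
sudo ansible-tower-service stop

# Or keep Tower running and disable schedules instead.
# Do not launch cleanup system job templates at this point; they issue more DELETEs.
# List the schedules that are currently enabled
curl -s -k -u admin:password \
  "https://localhost/api/v2/schedules/?enabled=true" | jq '.results[] | {id, name}'

# Disable each schedule temporarily.
# PATCH targets a single schedule, not the list endpoint; replace <id> with each ID from the list above
curl -X PATCH -k -u admin:password \
  https://localhost/api/v2/schedules/<id>/ \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
```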

Step 3: Speed Up WAL Replay on Replica

Optimize replica configuration for faster replay:

```bash
# Edit postgresql.conf on replica
sudo vim /etc/postgresql/14/main/postgresql.conf

# Increase these settings for faster replay
checkpoint_timeout = 1h               # Less frequent restartpoints
max_wal_size = 8GB                    # Allow more WAL between restartpoints
checkpoint_completion_target = 0.9    # Spread checkpoint writes
wal_decode_buffer_size = 64MB         # Larger decode buffer (PG 15+ only; leave it out on PG 14, where an unknown setting blocks startup)

# Disable query execution during catch-up
hot_standby = off                     # Focus all resources on replay; the replica accepts no connections until this is turned back on

# Restart PostgreSQL on replica
sudo systemctl restart postgresql
```

Step 4: Take a Fresh Base Backup (Fastest Solution)

If catch-up will take more than a few hours, rebuild from a fresh backup taken once the purge on the primary has finished:

```bash
# On the replica: stop PostgreSQL and empty the old data directory
sudo systemctl stop postgresql
sudo -u postgres find /var/lib/postgresql/14/main -mindepth 1 -delete

# On the replica: pull a fresh base backup from the primary.
# pg_basebackup starts and stops the backup itself, so no pg_start_backup()/pg_stop_backup() calls are needed.
# (A manual rsync copy needs one session held open for the whole backup; separate psql -c calls end it at once.)
# -C -S creates the new slot; -R writes primary_conninfo and primary_slot_name to postgresql.auto.conf
sudo -u postgres pg_basebackup \
  -h tower-primary \
  -D /var/lib/postgresql/14/main \
  -U replication_user \
  -P -v -R -X stream -C -S tower_replica_new

# On the replica: confirm the slot name was written
sudo grep primary_slot_name /var/lib/postgresql/14/main/postgresql.auto.conf

# Start replica
sudo systemctl start postgresql

# On the primary: once the new replica is streaming, drop the old slot so it stops retaining WAL
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('tower_replica');"
```

Step 5: Prevent Future Issues with Batched Deletions

Delete records in smaller batches so the primary never produces WAL faster than the replica can replay it:

```bash
# Create a playbook for batched cleanup
# (postgresql_query ships in the community.postgresql collection)
cat > /opt/ansible/cleanup_job_events.yml << 'EOF'
---
- name: Cleanup old job events in batches
  hosts: localhost
  gather_facts: false
  become: true
  become_user: postgres
  vars:
    retention_days: 30
    batch_size: 10000
    sleep_between_batches: 2
  tasks:
    - name: Delete job events batch
      postgresql_query:
        db: awx
        query: >
          DELETE FROM main_jobevent
          WHERE id IN (
            SELECT id FROM main_jobevent
            WHERE created < now() - interval '{{ retention_days }} days'
            LIMIT {{ batch_size }}
          )
      register: delete_result
      until: delete_result.rowcount == 0
      retries: 10000
      delay: "{{ sleep_between_batches }}"
      changed_when: delete_result.rowcount > 0

    - name: Run analyze after cleanup
      postgresql_query:
        db: awx
        query: "ANALYZE main_jobevent"
      changed_when: false
EOF

# Run via cron daily at 2 AM
(crontab -l 2>/dev/null; echo "0 2 * * * ansible-playbook /opt/ansible/cleanup_job_events.yml") | crontab -
```
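
The same batching idea can be sketched as a plain shell loop for hosts where running a playbook is not convenient. This is an illustration under assumptions, not Tower's own cleanup mechanism. It reuses the `awx` database and `main_jobevent` table from this guide. The function names and the `BATCH_SIZE`, `RETENTION_DAYS` and `PAUSE_SECONDS` variables are hypothetical. The `RETURNING` clause is used so that psql prints only the number of rows removed.

```bash
#!/usr/bin/env bash
# Sketch: purge old job events in small batches, pausing between batches
# so the replica has time to replay each burst of WAL.
BATCH_SIZE=${BATCH_SIZE:-10000}
RETENTION_DAYS=${RETENTION_DAYS:-30}
PAUSE_SECONDS=${PAUSE_SECONDS:-2}

# Delete one batch and print how many rows were removed.
delete_batch() {
  sudo -u postgres psql -d awx -At -c "
    WITH deleted AS (
      DELETE FROM main_jobevent
      WHERE id IN (
        SELECT id FROM main_jobevent
        WHERE created < now() - interval '${RETENTION_DAYS} days'
        LIMIT ${BATCH_SIZE}
      )
      RETURNING 1
    )
    SELECT count(*) FROM deleted;"
}

# Repeat until a batch removes nothing; prints the total number of rows deleted.
purge_in_batches() {
  local total=0 n
  while :; do
    n=$(delete_batch) || return 1
    [ -n "$n" ] || return 1        # empty output means psql failed to report a count
    [ "$n" -eq 0 ] && break
    total=$((total + n))
    sleep "$PAUSE_SECONDS"
  done
  echo "$total"
}
```

After sourcing the file, `purge_in_batches` runs the loop and prints the total. Raising `PAUSE_SECONDS` trades a longer purge for a lower WAL rate on the primary.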

Step 6: Use VACUUM FULL or pg_repack Instead of DELETE

For large purges, use more efficient methods:

```bash
# Option 1: VACUUM FULL (holds an ACCESS EXCLUSIVE lock while it rewrites the table; practical only for small tables)
# First, delete the rows
sudo -u postgres psql -d awx -c "DELETE FROM main_jobevent WHERE created < now() - interval '30 days';"

# Then reclaim space
sudo -u postgres psql -d awx -c "VACUUM FULL main_jobevent;"

# Option 2: pg_repack (rewrites the table online with only brief locks; requires the extension)
sudo -u postgres psql -d awx -c "CREATE EXTENSION pg_repack;"
sudo -u postgres pg_repack -t main_jobevent -d awx

# Option 3: Partition the table (best long-term solution)
sudo -u postgres psql -d awx << 'EOF'
-- Create partitioned table structure
CREATE TABLE main_jobevent_partitioned (
    LIKE main_jobevent INCLUDING DEFAULTS INCLUDING CONSTRAINTS
) PARTITION BY RANGE (created);

-- Create partitions by month
CREATE TABLE main_jobevent_2024_01 PARTITION OF main_jobevent_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE main_jobevent_2024_02 PARTITION OF main_jobevent_partitioned
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Catch rows outside the monthly ranges so the migration INSERT cannot fail
CREATE TABLE main_jobevent_default PARTITION OF main_jobevent_partitioned DEFAULT;

-- Migrate data (do this during maintenance window)
BEGIN;
INSERT INTO main_jobevent_partitioned SELECT * FROM main_jobevent;
DROP TABLE main_jobevent;
ALTER TABLE main_jobevent_partitioned RENAME TO main_jobevent;
COMMIT;

-- Later purges drop a whole partition instead of deleting rows, for example:
-- DROP TABLE main_jobevent_2024_01;
EOF
```
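
Monthly partitions have to exist before rows arrive for them, so the `CREATE TABLE ... PARTITION OF` statements are worth generating rather than typing by hand. The function below is a small sketch that prints the statement for one month, following the naming pattern used above. `monthly_partition_ddl` is a hypothetical helper name, and `main_jobevent_partitioned` is the parent table name from the example.

```bash
#!/usr/bin/env bash
# Print the DDL for one monthly partition of main_jobevent_partitioned.
# Arguments: year month (for example: 2024 3). Uses only bash arithmetic.
monthly_partition_ddl() {
  local year=$((10#$1)) month=$((10#$2))   # 10# avoids octal parsing of "08" and "09"
  local next_year=$year next_month=$((month + 1))
  if (( next_month > 12 )); then
    next_month=1
    next_year=$((year + 1))
  fi
  printf "CREATE TABLE main_jobevent_%04d_%02d PARTITION OF main_jobevent_partitioned FOR VALUES FROM ('%04d-%02d-01') TO ('%04d-%02d-01');\n" \
    "$year" "$month" "$year" "$month" "$next_year" "$next_month"
}
```

A loop such as `for m in 3 4 5; do monthly_partition_ddl 2024 "$m"; done` prints three statements that can be reviewed and then piped to `sudo -u postgres psql -d awx`.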

Verification

Confirm the replica is catching up or has caught up:

```bash
# Monitor lag decreasing over time (run on the primary)
watch -n 10 'sudo -u postgres psql -t -c "SELECT pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) FROM pg_stat_replication;"'

# Expected: lag should decrease over time
512 MB
508 MB
495 MB
...

# Verify replication slot is healthy
sudo -u postgres psql -c "
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"

# Expected output (retained_wal should be small):
   slot_name   | active | retained_wal
---------------+--------+--------------
 tower_replica | t      | 64 MB

# Check Tower HA status
curl -s -k -u admin:password https://localhost/api/v2/instances/ | jq '.results[] | {hostname: .hostname, heartbeat: .heartbeat}'

# Check dead tuples and vacuum status after catch-up (run on the primary, where autovacuum runs)
sudo -u postgres psql -d awx -c "
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'main_jobevent';"
```

[ansible-replica-falls-behind-after-an-expensive-scan](/articles/ansible-replica-falls-behind-after-an-expensive-scan)
[ansible-job-event-table-bloat-causes-slow-queries](/articles/ansible-job-event-table-bloat-causes-slow-queries)
[ansible-database-partitioning-strategy](/articles/ansible-database-partitioning-strategy)

Written by

FixWikiHub Editorial Team

Our editorial team consists of experienced DevOps engineers, systems administrators, and cloud architects with hands-on experience in production environments across AWS, Azure, GCP, and on-premises infrastructure.

Every guide undergoes technical review for accuracy and is updated when software versions, commands, or best practices change.

Last updated: Mar 7, 2026
