Introduction

When you rebuild a PostgreSQL replica in Ansible Tower or AWX after performing a large DELETE operation (such as purging millions of job events or cleaning up old inventory records), the replica may never catch up to the primary. The WAL replay process gets stuck processing the DELETE operations in the WAL stream, and the replica continues falling further behind despite having adequate hardware resources.

This happens because physical streaming replication replays WAL record by record. Each deleted row produces its own WAL record, and a single startup process on the replica applies those records sequentially. After a purge of millions of rows, that serialized replay becomes a bottleneck that can last for days.

Symptoms

The replica will show ever-increasing lag despite being rebuilt:

```
# Initial replica status after rebuild
$ sudo -u postgres psql -c "SELECT client_addr, state, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag FROM pg_stat_replication;"

 client_addr |   state   | lag
-------------+-----------+------
 10.0.1.52   | streaming | 2 GB

# 1 hour later
 client_addr |   state   | lag
-------------+-----------+------
 10.0.1.52   | streaming | 5 GB

# 6 hours later - lag continues growing
 client_addr |   state   | lag
-------------+-----------+-------
 10.0.1.52   | streaming | 12 GB

# PostgreSQL logs on replica
2024-03-15 10:15:23.456 UTC [12345] LOG: received redo request at 0/2A000B90
2024-03-15 10:15:25.789 UTC [12345] LOG: redo done at 0/2A001230
2024-03-15 10:16:45.123 UTC [12345] LOG: checkpoint starting: time
2024-03-15 10:45:23.456 UTC [12345] LOG: checkpoint complete: wrote 45234 buffers (27.6%)

# Replication slot falling behind
$ sudo -u postgres psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"

   slot_name   | active | retained_wal
---------------+--------+--------------
 tower_replica | t      | 50 GB

# Check WAL files accumulating in pg_wal
$ ls -la /var/lib/postgresql/14/main/pg_wal/ | wc -l
1247

# Replay (startup) process using high CPU but low disk I/O
$ top -p "$(pgrep -f 'postgres: startup')"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18423 postgres  20   0  245624  45124  12345 S  98.7  1.2 145:23.45 postgres: startup
```
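The lag figures above come from pg_wal_lsn_diff, which subtracts one LSN from another. As a minimal sketch of what that subtraction means, assuming the usual textual LSN form of two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits), the helpers below convert an LSN to a byte position and compute the difference. The function names are hypothetical; the sample LSNs are the two that appear in the replica log excerpt.

```bash
#!/bin/sh
# Hypothetical helpers illustrating what pg_wal_lsn_diff computes.
# An LSN such as 0/2A000B90 is <high 32 bits>/<low 32 bits>, both in hex.

# Convert a textual LSN to an absolute byte position in the WAL stream.
lsn_to_bytes() (
  hi=${1%%/*}   # part before the slash
  lo=${1##*/}   # part after the slash
  echo $(( (0x$hi << 32) + 0x$lo ))
)

# Bytes of WAL between two LSNs (first minus second).
lsn_diff_bytes() (
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
)

# Distance between the two LSNs from the log excerpt
lsn_diff_bytes 0/2A001230 0/2A000B90
```

In practice you would let PostgreSQL do this with pg_wal_lsn_diff; the sketch is only useful when all you have is LSNs copied out of a log file.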

In Tower logs:

```
2024-03-15 10:30:45,123 WARNING  awx.main.tasks Cluster heartbeat delayed by 180 seconds
2024-03-15 10:31:12,456 ERROR    awx.main.dispatch Instance tower-replica is lagging behind primary by 180+ seconds
2024-03-15 10:32:01,789 CRITICAL awx.main.models.ha Failover capacity degraded
```

Common Causes

The replica cannot catch up due to the inherent inefficiency of replaying DELETE operations:

  1. WAL records each deleted tuple individually: When you delete 10 million rows from main_jobevent, PostgreSQL generates roughly 10 million heap WAL records, one per deleted tuple. The replica must read and apply each record sequentially.
  2. Index cleanup multiplies the work: DELETE itself only marks heap tuples as dead. The matching index entries are removed later by VACUUM, which writes additional WAL for every index on the table. With 5 indexes, cleaning up after 10 million deletions touches pages in all 5 indexes on top of the heap, and the replica must replay that WAL as well.
  3. Base backup taken while the DELETE was running: The replica replays all WAL generated after the backup's starting checkpoint. If the base backup began while the DELETE was still in progress, the replica must replay the remainder of the DELETE and the VACUUM activity that follows it.
  4. FPI (full-page image) overhead: With full_page_writes enabled, the first change to each page after a checkpoint logs the entire page. A large DELETE touches many pages, which can increase WAL volume by 2-4x.
  5. Serialized replay with random I/O: WAL is applied by a single startup process, so even with fast storage, random reads of heap and index pages during replay bottleneck the process.
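As a rough illustration of causes 1 and 4, the sketch below estimates the WAL produced by the heap phase of a purge. The function name, the bytes-per-record figure, and the rows-per-page figure are assumptions for illustration, not measured values. It assumes every affected page is logged once as a full-page image at the default 8192-byte block size, and it ignores index and VACUUM WAL entirely.

```bash
#!/bin/sh
# Rough estimate of heap-phase WAL for a large DELETE.
# Hypothetical helper; replace every input with numbers measured on your system.
#   $1 = rows deleted
#   $2 = average rows per heap page (assumption)
#   $3 = bytes per heap-delete WAL record (assumption)
estimate_delete_wal_bytes() (
  rows=$1; rows_per_page=$2; record_bytes=$3
  page_bytes=8192                                           # default PostgreSQL block size
  pages=$(( (rows + rows_per_page - 1) / rows_per_page ))   # ceiling division
  # one WAL record per row, plus one full-page image per touched page
  echo $(( rows * record_bytes + pages * page_bytes ))
)

# Example: 10 million rows, 50 rows per page, 60 bytes per record (all assumed)
estimate_delete_wal_bytes 10000000 50 60
```

To get a real number instead of an estimate, compare pg_current_wal_lsn() on the primary before and after a small test batch and scale up.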

Step-by-Step Fix

Step 1: Assess the Situation

Determine if catch-up is possible or if a fresh approach is needed:

```bash
# Check current lag and trend
sudo -u postgres psql -c "
SELECT client_addr, state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS current_lag,
       sent_lsn, replay_lsn
FROM pg_stat_replication;"

# Calculate lag growth rate (two samples, 5 minutes apart)
# -At prints the bare number; this assumes a single replica row
# Run 1:
sudo -u postgres psql -At -c "SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn) FROM pg_stat_replication;" > /tmp/lag1.txt
sleep 300
# Run 2:
sudo -u postgres psql -At -c "SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn) FROM pg_stat_replication;" > /tmp/lag2.txt

# Calculate growth rate
echo "Lag growth per minute: $(( ($(cat /tmp/lag2.txt) - $(cat /tmp/lag1.txt)) / 5 )) bytes"

# Check how much WAL the slot is retaining
sudo -u postgres psql -c "
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS total_wal_to_replay
FROM pg_replication_slots;"

# Estimate time to catch up (only meaningful when replay_rate > write_rate)
# time = current_lag / (replay_rate - write_rate)
```
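The catch-up estimate can be turned into a small calculation. The sketch below is a hypothetical helper, not part of Tower or PostgreSQL. It assumes you have already measured the replay rate on the replica and the WAL write rate on the primary in bytes per second, and that both rates stay constant.

```bash
#!/bin/sh
# Hypothetical helper: seconds until the replica catches up.
#   $1 = current lag in bytes
#   $2 = replay rate on the replica (bytes/second, measured by you)
#   $3 = WAL write rate on the primary (bytes/second, measured by you)
estimate_catchup_seconds() (
  lag=$1; replay_rate=$2; write_rate=$3
  if [ "$replay_rate" -le "$write_rate" ]; then
    # The replica applies WAL no faster than the primary produces it
    echo "never"
  else
    echo $(( lag / (replay_rate - write_rate) ))
  fi
)

# Example with assumed rates: 12 GiB lag, 20 MiB/s replay, 5 MiB/s write
estimate_catchup_seconds 12884901888 20971520 5242880
```

If the result is "never" or runs to many hours, reducing WAL generation (Step 2) or taking a fresh base backup (Step 4) is likely the better path.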

Step 2: Stop the Primary from Generating More WAL (If Possible)

Temporarily reduce WAL generation:

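```bash
# On primary, stop Tower services to reduce activity
sudo ansible-tower-service stop

# Or keep the API up and stop an instance from taking new jobs
# (instance ID 1 is a placeholder; list IDs at /api/v2/instances/)
curl -X PATCH -k -u admin:password \
  https://localhost/api/v2/instances/1/ \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

# Disable scheduled jobs temporarily
# (PATCH each schedule by ID; the list endpoint does not accept PATCH)
for id in $(curl -s -k -u admin:password \
    "https://localhost/api/v2/schedules/?enabled=true&page_size=200" | jq -r '.results[].id'); do
  curl -X PATCH -k -u admin:password \
    "https://localhost/api/v2/schedules/${id}/" \
    -H "Content-Type: application/json" \
    -d '{"enabled": false}'
done
```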

Step 3: Speed Up WAL Replay on Replica

Optimize replica configuration for faster replay:

```bash
# Edit postgresql.conf on replica
sudo vim /etc/postgresql/14/main/postgresql.conf

# Increase these settings for faster replay
checkpoint_timeout = 1h              # Less frequent restartpoints
max_wal_size = 8GB                   # Fewer WAL-volume-triggered restartpoints
checkpoint_completion_target = 0.9   # Spread checkpoint writes
#wal_decode_buffer_size = 64MB       # Larger decode buffer (PG 15+ only; unrecognized on PG 14)

# Disable query execution during catch-up
hot_standby = off                    # Focus all resources on replay

# Restart PostgreSQL on replica
sudo systemctl restart postgresql
```

Step 4: Take a Fresh Base Backup (Fastest Solution)

If catch-up will take more than a few hours, rebuild from a fresh backup:

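```bash
# On primary, make sure the DELETE and any follow-up VACUUM have finished, then checkpoint
sudo -u postgres psql -c "CHECKPOINT;"

# On replica, stop PostgreSQL and move the stale data directory aside
sudo systemctl stop postgresql
sudo mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.old

# On replica, pull a fresh copy with pg_basebackup (recommended; it starts and stops the backup itself)
sudo -u postgres pg_basebackup \
  -h tower-primary \
  -D /var/lib/postgresql/14/main \
  -U replication_user \
  -P -v -R -X stream -C -S tower_replica_new

# Manual alternative, run on primary instead of pg_basebackup (PG 14 low-level API).
# pg_start_backup and pg_stop_backup must run in the same session,
# so the copy happens inside one psql session via \!
sudo -u postgres psql << 'EOF'
SELECT pg_start_backup('fresh_backup', true, false);
\! rsync -avz --delete --exclude=postmaster.pid /var/lib/postgresql/14/main/ tower-replica:/var/lib/postgresql/14/main/
SELECT labelfile FROM pg_stop_backup(false, true);
EOF
# Save the returned labelfile text as backup_label in the replica's data directory.
# A manual copy also needs standby.signal and primary_conninfo on the replica.

# On replica, confirm the slot name that -R wrote (add it by hand after a manual copy)
sudo grep primary_slot_name /var/lib/postgresql/14/main/postgresql.auto.conf

# Start replica
sudo systemctl start postgresql

# On primary, once the new replica is streaming, drop the old slot so it stops retaining WAL
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('tower_replica');"
```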

Step 5: Prevent Future Issues with Batched Deletions

Delete records in smaller batches with a scheduled playbook, so that WAL is produced at a rate the replica can keep up with:

```bash
# Create a playbook for batched cleanup
cat > /opt/ansible/cleanup_job_events.yml << 'EOF'
---
- name: Cleanup old job events in batches
  hosts: localhost
  gather_facts: false
  become: true
  become_user: postgres
  vars:
    retention_days: 30
    batch_size: 10000
    sleep_between_batches: 2
  tasks:
    # The until loop re-runs the DELETE until a batch removes zero rows
    - name: Delete job events batch
      postgresql_query:
        db: awx
        query: >
          DELETE FROM main_jobevent
          WHERE id IN (
            SELECT id FROM main_jobevent
            WHERE created < now() - interval '{{ retention_days }} days'
            LIMIT {{ batch_size }}
          )
      register: delete_result
      until: delete_result.rowcount == 0
      retries: 10000
      delay: "{{ sleep_between_batches }}"
      changed_when: delete_result.rowcount > 0

    - name: Run analyze after cleanup
      postgresql_query:
        db: awx
        query: "ANALYZE main_jobevent"
      changed_when: false
EOF

# Run via cron daily at 2 AM
(crontab -l 2>/dev/null; echo "0 2 * * * ansible-playbook /opt/ansible/cleanup_job_events.yml") | crontab -
```
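Before scheduling the playbook, it helps to check how many batches a purge needs and whether that fits under the retries limit. The sketch below is a hypothetical planning helper. It assumes the loop needs one extra attempt for the final empty batch, and it counts only the pauses between attempts, not the time each DELETE takes.

```bash
#!/bin/sh
# Hypothetical planner for the batched cleanup.
#   $1 = rows to purge
#   $2 = batch_size
#   $3 = sleep_between_batches (seconds)
#   $4 = retries limit from the playbook
plan_batches() (
  rows=$1; batch_size=$2; sleep_s=$3; retries=$4
  batches=$(( (rows + batch_size - 1) / batch_size ))   # ceiling division
  attempts=$(( batches + 1 ))                           # last attempt deletes 0 rows and ends the loop
  pause=$(( batches * sleep_s ))                        # one pause between consecutive attempts
  if [ "$attempts" -le "$retries" ]; then fits=yes; else fits=no; fi
  echo "batches=$batches attempts=$attempts min_pause_seconds=$pause fits_in_retries=$fits"
)

# Example: 10 million rows with the playbook's values
plan_batches 10000000 10000 2 10000
```

With these values, a purge larger than roughly 100 million rows would exhaust the retries limit before finishing, so raise batch_size or retries for bigger backlogs.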

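Step 6: Use VACUUM FULL, pg_repack, or Partitioning for Large Purges

VACUUM FULL and pg_repack reclaim the space a purge leaves behind. Partitioning goes further and lets you remove old data by dropping partitions instead of deleting rows: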

```bash
# Option 1: VACUUM FULL (holds an exclusive lock while it rewrites the table; best for small tables)
# First, delete the rows (use the batched approach from Step 5 for large purges)
sudo -u postgres psql -d awx -c "DELETE FROM main_jobevent WHERE created < now() - interval '30 days';"

# Then reclaim space
sudo -u postgres psql -d awx -c "VACUUM FULL main_jobevent;"

# Option 2: pg_repack (online rebuild with only brief locks; requires the extension
# and a primary key or unique not-null index on the table)
sudo -u postgres psql -d awx -c "CREATE EXTENSION pg_repack;"
sudo -u postgres pg_repack -t main_jobevent -d awx

# Option 3: Partition the table (best long-term solution)
sudo -u postgres psql -d awx << 'EOF'
-- Create partitioned table structure
-- (this LIKE clause does not copy indexes; recreate the ones you need)
CREATE TABLE main_jobevent_partitioned (
  LIKE main_jobevent INCLUDING DEFAULTS INCLUDING CONSTRAINTS
) PARTITION BY RANGE (created);

-- Create partitions by month
CREATE TABLE main_jobevent_2024_01 PARTITION OF main_jobevent_partitioned
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE main_jobevent_2024_02 PARTITION OF main_jobevent_partitioned
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Catch-all for rows outside the ranges above, so the migration INSERT cannot fail on them
CREATE TABLE main_jobevent_default PARTITION OF main_jobevent_partitioned DEFAULT;

-- Migrate data in one transaction (do this during maintenance window)
BEGIN;
INSERT INTO main_jobevent_partitioned SELECT * FROM main_jobevent;
DROP TABLE main_jobevent;
ALTER TABLE main_jobevent_partitioned RENAME TO main_jobevent;
COMMIT;
EOF
```
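Monthly partitions have to be created ahead of time, and writing the bounds by hand is error-prone around year boundaries. The sketch below is a hypothetical generator that prints the CREATE TABLE statement for one month, following the naming used in the example above. It assumes the parent table is called main_jobevent_partitioned; adjust the name if you have already renamed it.

```bash
#!/bin/sh
# Hypothetical generator for one monthly partition statement.
#   $1 = year (e.g. 2024)
#   $2 = month (1-12; a single leading zero is accepted)
monthly_partition_ddl() (
  year=$1
  month=${2#0}   # strip a leading zero so 08 and 09 are not parsed as octal
  if [ "$month" -eq 12 ]; then
    next_year=$(( year + 1 )); next_month=1
  else
    next_year=$year; next_month=$(( month + 1 ))
  fi
  printf "CREATE TABLE main_jobevent_%04d_%02d PARTITION OF main_jobevent_partitioned FOR VALUES FROM ('%04d-%02d-01') TO ('%04d-%02d-01');\n" \
    "$year" "$month" "$year" "$month" "$next_year" "$next_month"
)

# Print the statements for March and December 2024
monthly_partition_ddl 2024 3
monthly_partition_ddl 2024 12
```

The output can be piped into psql -d awx. Old data is then removed by dropping a whole partition, which avoids per-row DELETE records in the WAL.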

Verification

Confirm the replica is catching up or has caught up:

```bash
# Monitor lag decreasing over time
watch -n 10 'sudo -u postgres psql -t -c "SELECT pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) FROM pg_stat_replication;"'

# Expected: lag should decrease over time
512 MB
508 MB
495 MB
...

# Verify replication slot is healthy
sudo -u postgres psql -c "
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"

# Expected output (retained_wal should be small):
   slot_name   | active | retained_wal
---------------+--------+--------------
 tower_replica | t      | 64 MB

# Check Tower HA status
curl -k -u admin:password https://localhost/api/v2/instances/ | jq '.results[] | {hostname: .hostname, heartbeat: .heartbeat}'

# Check dead tuples and vacuum status for main_jobevent after catch-up
# (run on the primary; a standby does not track vacuum statistics)
sudo -u postgres psql -d awx -c "
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'main_jobevent';"
```

  • [ansible-replica-falls-behind-after-an-expensive-scan](/articles/ansible-replica-falls-behind-after-an-expensive-scan)
  • [ansible-job-event-table-bloat-causes-slow-queries](/articles/ansible-job-event-table-bloat-causes-slow-queries)
  • [ansible-database-partitioning-strategy](/articles/ansible-database-partitioning-strategy)
