Introduction
When running Ansible Tower or AWX with a PostgreSQL replica for read scaling or high availability, you may encounter severe replication lag after expensive database operations such as large inventory scans, massive job event queries, or analytics reports run against the Tower database. The replica can fall hours behind the primary, causing stale reads, failed failover checks, and potential split-brain scenarios during HA failover.
This issue is particularly common in Tower installations with:
- Large inventories (50,000+ hosts)
- Long-running jobs generating millions of job events
- Custom reporting queries against the main_jobevent table
- Missing or outdated table statistics after bulk operations
Symptoms
In Ansible Tower logs and PostgreSQL replication monitoring:
```
# Tower task manager logs
2024-03-15 14:23:45,123 WARNING awx.main.tasks cluster_node_heartbeat() took 45.2s (threshold: 30s)
2024-03-15 14:24:12,456 ERROR awx.main.dispatch Task execution failed: database is read-only
2024-03-15 14:25:01,789 WARNING awx.main.models.ha Instance heartbeat overdue by 120 seconds

# Replication status as seen from the primary
$ sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"

 client_addr |   state   |  sent_lsn  | replay_lsn | lag_bytes
-------------+-----------+------------+------------+-----------
 10.0.1.52   | streaming | 0/5000A890 | 0/3000B120 | 536868720

# Lag in human-readable format
$ sudo -u postgres psql -c "SELECT client_addr, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag FROM pg_stat_replication;"

 client_addr |  lag
-------------+--------
 10.0.1.52   | 512 MB

# Check replication slot status
$ sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots;"

   slot_name   | active | restart_lsn |  lag
---------------+--------+-------------+--------
 tower_replica | t      | 0/3000B120  | 512 MB

# Long-running query on replica
$ sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';"

  pid  |    duration     |                          query                           | state
-------+-----------------+----------------------------------------------------------+--------
 18423 | 00:45:12.345678 | SELECT * FROM main_jobevent WHERE job_id = 12345 ORDER.. | active
```
In the Tower UI, you may see:
- Dashboard showing stale job counts
- Inventory sync status not updating
- "Database is read-only" errors during HA checks
- Delayed or missing job events in job output
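The lag figures in the replication queries above come from `pg_wal_lsn_diff`, which subtracts one WAL position from another. As a rough illustration of that subtraction, the sketch below converts the `HI/LO` hexadecimal LSN format into a byte offset in plain shell arithmetic. The names `lsn_to_bytes` and `lsn_diff` are made up for this example and are not part of PostgreSQL or Tower; the sketch assumes a shell with 64-bit arithmetic.

```bash
# Illustrative helpers (hypothetical names, not part of PostgreSQL or Tower).

# Convert an LSN such as 0/5000A890 into an absolute byte offset.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
lsn_to_bytes() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Byte distance between two LSNs, the same subtraction pg_wal_lsn_diff(a, b) performs.
lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Using the sent_lsn and replay_lsn values from the sample output:
lsn_diff 0/5000A890 0/3000B120
```

For those two LSNs the difference works out to roughly 512 MB, which is what the human-readable lag column reports.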
Common Causes
The replica falls behind due to several interconnected factors:
1. Single-threaded WAL replay in PostgreSQL: A PostgreSQL replica applies Write-Ahead Log (WAL) records sequentially in a single process. A single large transaction (like deleting millions of job events or updating a massive inventory) creates a WAL backlog that must be applied one record at a time.
2. Long-running queries block vacuum on primary: When an expensive scan runs on the replica, it holds open transaction snapshots. With hot_standby_feedback = on, the replica reports those snapshots to the primary, which prevents VACUUM from cleaning up dead tuples there, causing table bloat and generating more WAL traffic. Without feedback, WAL replay instead waits on the conflicting query for up to max_standby_streaming_delay, so the replica falls behind while the scan runs.
3. Job events table growth: The main_jobevent table in Tower grows rapidly with every job run. A typical playbook run generates 10-100 events per host, so a single run against a 10,000-host inventory can generate 100,000 to 1,000,000 events.
4. Missing table statistics: After bulk operations, PostgreSQL's query planner may choose inefficient plans (sequential scans instead of index scans), making queries much slower on both primary and replica.
5. Network throughput limits: If the replica is in a different availability zone or region, network bandwidth can become a bottleneck for WAL streaming, especially during peak activity.
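To make the job event growth described above concrete, here is a small sketch that multiplies host count by events per host. The function name `estimate_job_events` is hypothetical, and the inputs are simply the 10 and 100 events-per-host bounds quoted in the list.

```bash
# Rough event count for a single playbook run: hosts multiplied by events per host.
# estimate_job_events is a hypothetical helper name used only for this illustration.
estimate_job_events() {
    local hosts="$1" events_per_host="$2"
    echo $(( hosts * events_per_host ))
}

estimate_job_events 10000 10    # low end of the 10-100 events-per-host range
estimate_job_events 10000 100   # high end of the range
```

Multiplying the result by the number of runs per day gives a first estimate of how quickly main_jobevent grows between cleanups.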
Step-by-Step Fix
Step 1: Identify the Lag Source
First, determine what's causing the replication lag:
```bash
# Check current replication lag
sudo -u postgres psql -c "
SELECT client_addr, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replication_lag,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) AS write_lag,
       pg_size_pretty(pg_wal_lsn_diff(write_lsn, flush_lsn)) AS flush_lag
FROM pg_stat_replication;"

# Check for replication slot issues
sudo -u postgres psql -c "
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"

# Find long-running queries on primary
sudo -u postgres psql -c "
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC
LIMIT 10;"

# Check for blocked VACUUM on primary (oldest open transactions)
sudo -u postgres psql -c "
SELECT pid, now() - xact_start AS xact_duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_duration DESC
LIMIT 5;"

# Estimate table bloat from the statistics views (no extension required)
sudo -u postgres psql -d awx -c "
SELECT schemaname, relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_dead_tup, n_live_tup,
       ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
```
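Once the four LSN columns of pg_stat_replication are known, comparing the gaps between them hints at where the lag sits. The sketch below is one possible way to do that comparison; `largest_lag_stage` and `lsn_to_bytes` are hypothetical helper names, the LSNs in the example are made up, and the one-line interpretations are rules of thumb rather than a diagnosis.

```bash
# Hypothetical helpers: given sent, write, flush and replay LSNs from
# pg_stat_replication, report which stage holds the largest share of the lag.
lsn_to_bytes() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( (0x$hi << 32) + 0x$lo ))
}

largest_lag_stage() {
    local sent write flush replay
    sent=$(lsn_to_bytes "$1"); write=$(lsn_to_bytes "$2")
    flush=$(lsn_to_bytes "$3"); replay=$(lsn_to_bytes "$4")

    local write_gap=$(( sent - write ))     # sent by the primary, not yet written on the replica
    local flush_gap=$(( write - flush ))    # written, not yet flushed to disk
    local replay_gap=$(( flush - replay ))  # flushed, not yet applied

    if [ "$replay_gap" -ge "$write_gap" ] && [ "$replay_gap" -ge "$flush_gap" ]; then
        echo "replay ($replay_gap bytes): WAL has arrived but is not being applied"
    elif [ "$write_gap" -ge "$flush_gap" ]; then
        echo "write ($write_gap bytes): WAL is not reaching the replica fast enough"
    else
        echo "flush ($flush_gap bytes): replica storage is slow to persist WAL"
    fi
}

# Example with made-up LSNs: nearly all WAL is flushed, little has been replayed.
largest_lag_stage 0/5000A890 0/5000A000 0/50009000 0/3000B120
```

A dominant replay gap points toward replay conflicts or a large transaction being applied, while a dominant write gap points toward the network path between primary and replica.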
Step 2: Stop Queries Blocking the Replica
If a long-running query is holding back the replica:
```bash
# On the replica, find long-running queries
sudo -u postgres psql -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '30 minutes';"

# Terminate specific query (replace PID)
sudo -u postgres psql -c "SELECT pg_terminate_backend(18423);"

# Or terminate all queries running longer than 30 minutes
sudo -u postgres psql -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '30 minutes'
  AND pid <> pg_backend_pid();"
```
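Terminating every long query at once is a blunt tool, so it can help to preview the statement before running it and to try a cancel first. The sketch below only prints SQL and never executes it; `build_stop_sql` is a hypothetical helper name. It relies on two PostgreSQL functions: `pg_cancel_backend`, which stops the running query but keeps the session, and `pg_terminate_backend`, which closes the session.

```bash
# Hypothetical helper: print (but do not run) the SQL that stops queries older
# than a given number of minutes.
#   cancel    -> pg_cancel_backend: stops the running query, keeps the session
#   terminate -> pg_terminate_backend: closes the session
build_stop_sql() {
    local mode="$1" minutes="$2" func
    case "$mode" in
        cancel)    func="pg_cancel_backend" ;;
        terminate) func="pg_terminate_backend" ;;
        *) echo "mode must be cancel or terminate" >&2; return 1 ;;
    esac
    # Accept only a whole number so nothing unexpected lands in the SQL.
    case "$minutes" in
        ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 1 ;;
    esac
    printf "SELECT %s(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '%s minutes' AND pid <> pg_backend_pid();\n" "$func" "$minutes"
}

# Review the statement first, then pipe it to psql once it looks right:
build_stop_sql cancel 30
# build_stop_sql cancel 30 | sudo -u postgres psql
```

Keeping the generation and execution steps separate makes it harder to terminate sessions by accident on a busy replica.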
Step 3: Speed Up WAL Replay
Reduce the time WAL replay spends waiting on queries running on the replica:
```bash
# Edit postgresql.conf on replica
sudo vim /etc/postgresql/14/main/postgresql.conf

# Settings that control how replay interacts with replica queries
hot_standby = on
max_standby_streaming_delay = 30s    # Default is 30s; lower it so conflicting queries are cancelled sooner
wal_receiver_status_interval = 1s    # Report status more frequently (default is 10s)
hot_standby_feedback = on            # Fewer query cancellations, but can delay VACUUM cleanup on the primary

# max_wal_senders is a primary-side setting; increase it there if needed
max_wal_senders = 10

# Restart PostgreSQL on replica
sudo systemctl restart postgresql
```
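Whether the replica can drain its backlog at all depends on how fast it replays WAL compared with how fast the primary produces it. The sketch below uses a deliberately simplified model in which the backlog shrinks at the replay rate minus the WAL generation rate. `catchup_seconds` is a hypothetical helper, and all three inputs are values you would have to measure yourself (for example by sampling replay_lsn and pg_current_wal_lsn() twice, a known interval apart).

```bash
# Hypothetical helper: estimate how long a replica needs to drain a WAL backlog.
# Simplified model: backlog shrinks at (replay rate - WAL generation rate).
# Inputs are assumptions you measure yourself: bytes, bytes/second, bytes/second.
catchup_seconds() {
    local lag_bytes="$1" replay_rate="$2" generate_rate="$3"
    local net=$(( replay_rate - generate_rate ))
    if [ "$net" -le 0 ]; then
        echo "replica cannot catch up at these rates" >&2
        return 1
    fi
    # Round up so a partial second still counts.
    echo $(( (lag_bytes + net - 1) / net ))
}

# Example: 512 MiB backlog, replay at 20 MiB/s, primary generating 4 MiB/s of WAL.
catchup_seconds 536870912 20971520 4194304
```

If the function reports that the replica cannot catch up, tuning the replica alone will not help; the write load on the primary has to drop first.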
Step 4: Optimize the Tower Database
Apply Tower-specific optimizations:
```bash
# Update table statistics after bulk operations
sudo -u postgres psql -d awx -c "ANALYZE main_jobevent;"
sudo -u postgres psql -d awx -c "ANALYZE main_inventory;"
sudo -u postgres psql -d awx -c "ANALYZE main_host;"

# Rebuild bloated indexes (REINDEX blocks writes to the table while it runs;
# PostgreSQL 12 and later also accept REINDEX TABLE CONCURRENTLY)
sudo -u postgres psql -d awx -c "REINDEX TABLE main_jobevent;"

# Configure Tower's cleanup settings in settings.py
sudo vim /etc/tower/settings.py

# Add or modify (confirm the setting name against your Tower version's documentation):
AWX_CLEANUP_POLICY = {
    'job_events': {'days': 30, 'batch_size': 10000},
    'system_tracking': {'days': 30, 'batch_size': 10000}
}

# Restart Tower services
sudo ansible-tower-service restart
```
Step 5: Implement Batched Cleanup
Create a custom cleanup job to prevent future lag:
```yaml
# /etc/tower/conf.d/cleanup.yml
- name: Tower Job Event Cleanup
  hosts: localhost
  gather_facts: false
  # Run psql as the postgres OS user, matching the sudo -u postgres commands above
  become: true
  become_user: postgres
  vars:
    retention_days: 30
    batch_size: 50000
  tasks:
    - name: Get total job events to delete
      command: >
        psql -d awx -t -c "SELECT COUNT(*) FROM main_jobevent
        WHERE created < now() - interval '{{ retention_days }} days'"
      register: total_to_delete
      changed_when: false

    # Each attempt deletes one batch; the loop ends when a batch deletes nothing
    - name: Delete job events in batches
      command: >
        psql -d awx -c "DELETE FROM main_jobevent WHERE id IN
        (SELECT id FROM main_jobevent
        WHERE created < now() - interval '{{ retention_days }} days'
        LIMIT {{ batch_size }})"
      register: delete_result
      until: delete_result.stdout == "DELETE 0"
      retries: 100
      delay: 5
      # Changed only if at least one batch ran before the final empty pass
      changed_when: delete_result.attempts | default(1) > 1

    - name: Analyze table after cleanup
      command: psql -d awx -c "ANALYZE main_jobevent;"
      changed_when: false
```
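Because the delete loop only stops when a batch removes nothing, it is worth checking up front how many passes a given backlog needs. The sketch below does that arithmetic; `batches_needed` is a hypothetical helper, and the 1,200,000-row figure is an invented example rather than a measured value.

```bash
# Hypothetical helper: number of DELETE passes needed to clear a backlog,
# counting the final pass that returns "DELETE 0" and ends the loop.
batches_needed() {
    local total_rows="$1" batch_size="$2"
    if [ "$batch_size" -le 0 ]; then
        echo "batch_size must be greater than zero" >&2
        return 1
    fi
    echo $(( (total_rows + batch_size - 1) / batch_size + 1 ))
}

# Example: 1,200,000 expired rows with the playbook's batch_size of 50000.
batches_needed 1200000 50000
```

If the result is larger than the number of attempts the `retries` setting allows, the task fails before the backlog is cleared; raising `retries`, increasing `batch_size`, or running the playbook more than once are the obvious options.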
Step 6: Configure Replication Slots Properly
Ensure replication slots don't cause WAL accumulation:
```bash
# Check if slot is causing issues
sudo -u postgres psql -c "
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"

# If retained WAL is excessive (> 100GB), recreate the slot
# ON PRIMARY (stop the replica first: an active slot cannot be dropped).
# Dropping the slot lets the primary recycle WAL the replica has not received yet;
# if that happens, the replica has to be rebuilt from a fresh base backup.
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('tower_replica');"
sudo -u postgres psql -c "SELECT pg_create_physical_replication_slot('tower_replica');"

# Then reconfigure replica to use the new slot
# In the replica's postgresql.conf or postgresql.auto.conf (recovery.conf before PostgreSQL 12):
primary_slot_name = 'tower_replica'

# Restart replica
sudo systemctl restart postgresql
```
Verification
After applying the fix, verify replication is healthy:
```bash
# Check replication lag is minimal (< 1MB for healthy state)
sudo -u postgres psql -c "
SELECT client_addr, state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag
FROM pg_stat_replication;"

# Expected output:
#  client_addr |   state   |  lag
# -------------+-----------+--------
#  10.0.1.52   | streaming | 128 kB

# Verify Tower HA status is healthy
curl -k -u admin:password https://localhost/api/v2/ping/ | jq .

# Expected output:
# {
#   "ha": true,
#   "version": "4.3.0",
#   "active_node": "tower-primary",
#   "instances": [
#     {"node": "tower-primary", "heartbeat": "2024-03-15T14:30:00Z"},
#     {"node": "tower-replica", "heartbeat": "2024-03-15T14:30:01Z"}
#   ]
# }

# Check job events are flowing
sudo -u postgres psql -d awx -c "
SELECT COUNT(*) FROM main_jobevent WHERE created > now() - interval '1 hour';"

# Run a test inventory listing, then re-check the lag query above
ansible-inventory -i /etc/tower/inventory --list | jq '._meta'
```
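A single lag reading can be misleading right after a fix, so one option is to poll until the value stays under the 1 MB mark mentioned above. The sketch below is a minimal polling loop under stated assumptions: `wait_for_lag_below` and `current_lag_bytes` are hypothetical names, the lag query assumes a single replica, and the threshold, check count, and delay are placeholders to adjust.

```bash
# Hypothetical helper: poll a command that prints the current lag in bytes and
# succeed once the value drops below a threshold.
# Usage: wait_for_lag_below THRESHOLD_BYTES MAX_CHECKS DELAY_SECONDS COMMAND [ARGS...]
wait_for_lag_below() {
    local threshold="$1" max_checks="$2" delay="$3"
    shift 3
    local lag="" i=1
    while [ "$i" -le "$max_checks" ]; do
        lag="$("$@")" || return 2
        # Strip whitespace and refuse anything that is not a plain integer.
        lag="$(printf '%s' "$lag" | tr -d '[:space:]')"
        case "$lag" in
            ''|*[!0-9]*) echo "unexpected lag value: '$lag'" >&2; return 2 ;;
        esac
        if [ "$lag" -lt "$threshold" ]; then
            echo "lag is $lag bytes after $i check(s)"
            return 0
        fi
        i=$(( i + 1 ))
        if [ "$i" -le "$max_checks" ]; then
            sleep "$delay"
        fi
    done
    echo "lag is still $lag bytes after $max_checks check(s)" >&2
    return 1
}

# Assumed lag query for a single replica; -A and -t make psql print the bare value.
current_lag_bytes() {
    sudo -u postgres psql -At -c \
        "SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn) FROM pg_stat_replication LIMIT 1;"
}

# Wait up to 10 checks, 30 seconds apart, for lag under 1 MiB:
# wait_for_lag_below 1048576 10 30 current_lag_bytes
```

Passing the lag command as an argument keeps the loop independent of how the lag is measured, so the same function works with a different query or a monitoring endpoint.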
Related Issues
- [ansible-job-event-table-bloat-causes-slow-queries](/articles/ansible-job-event-table-bloat-causes-slow-queries)
- [ansible-tower-ha-failover-fails-due-to-replication-lag](/articles/ansible-tower-ha-failover-fails-due-to-replication-lag)
- [ansible-database-vacuum-blocks-job-execution](/articles/ansible-database-vacuum-blocks-job-execution)