Ansible Replica Falls Behind After an Ex

In Ansible Tower/AWX, PostgreSQL replica lag can spike after a large inventory scan or job event query, leaving the replica significantly behind the primary database.

Published: Feb 12, 2026 | 9 min read | By FixWikiHub Editorial Team

Introduction

When running Ansible Tower or AWX with a PostgreSQL replica for read scaling or high availability, you may encounter severe replication lag after expensive database operations such as large inventory scans, massive job event queries, or analytics reports run against the Tower database. The replica can fall hours behind the primary, causing stale reads, failed failover checks, and potential split-brain scenarios during HA failover.

This issue is particularly common in Tower installations with:

- Large inventories (50,000+ hosts)
- Long-running jobs generating millions of job events
- Custom reporting queries against the main_jobevent table
- Missing or outdated table statistics after bulk operations

Symptoms

In Ansible Tower logs and PostgreSQL replication monitoring:

```
# Tower task manager logs
2024-03-15 14:23:45,123 WARNING awx.main.tasks cluster_node_heartbeat() took 45.2s (threshold: 30s)
2024-03-15 14:24:12,456 ERROR awx.main.dispatch Task execution failed: database is read-only
2024-03-15 14:25:01,789 WARNING awx.main.models.ha Instance heartbeat overdue by 120 seconds

# PostgreSQL replica status
$ sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"

# Lag in human-readable format
$ sudo -u postgres psql -c "SELECT client_addr, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag FROM pg_stat_replication;"

 client_addr  |   lag
--------------+---------
 10.0.1.52    | 512 MB

# Check replication slot status
$ sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots;"

   slot_name    | active | restart_lsn  |  lag
----------------+--------+--------------+--------
 tower_replica  | t      | 0/3000B120   | 512 MB

# Long-running query on replica
$ sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';"
```

In the Tower UI, you may see:

- Dashboard showing stale job counts
- Inventory sync status not updating
- "Database is read-only" errors during HA checks
- Delayed or missing job events in job output
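The lag figures in the monitoring queries come from subtracting two log sequence numbers (LSNs). As a rough illustration of what pg_wal_lsn_diff() computes, the sketch below treats an LSN of the form `high/low` as a 64-bit byte position. The helper names (`lsn_to_bytes`, `lsn_diff`, `pretty_size`) and the second LSN are made up for this example, and the size formatter is only an approximation of pg_size_pretty().

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '0/3000B120' to an absolute byte position."""
    high, low = lsn.split("/")
    # Both halves are hexadecimal; the left half is the upper 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(newer: str, older: str) -> int:
    """Bytes of WAL between two LSNs (newer minus older)."""
    return lsn_to_bytes(newer) - lsn_to_bytes(older)


def pretty_size(num_bytes: int) -> str:
    """Rough binary-unit formatter; rounding may differ from pg_size_pretty()."""
    size = float(num_bytes)
    for unit in ("bytes", "kB", "MB", "GB"):
        if abs(size) < 1024:
            return f"{size:.0f} {unit}"
        size /= 1024
    return f"{size:.0f} TB"


if __name__ == "__main__":
    restart_lsn = "0/3000B120"   # restart_lsn from the slot output shown earlier
    current_lsn = "0/5000B120"   # hypothetical current WAL position on the primary
    print(pretty_size(lsn_diff(current_lsn, restart_lsn)))
```

Working in raw bytes like this is mainly useful when feeding lag values into alerting thresholds, where the human-readable string is awkward to compare.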

Common Causes

The replica falls behind due to several interconnected factors:

1. Single-threaded WAL replay in PostgreSQL: PostgreSQL replicas replay Write-Ahead Log (WAL) entries sequentially in a single process. A single large transaction (like deleting millions of job events or updating a massive inventory) creates a WAL backlog that must be applied one operation at a time.
2. Long-running queries on the replica: An expensive scan on the replica holds a snapshot open. WAL records that remove rows the query still needs cannot be replayed until the query finishes or max_standby_streaming_delay expires, so replay stalls. With hot_standby_feedback = on, the replica instead reports its oldest snapshot to the primary, which prevents VACUUM from cleaning up dead tuples there, causing table bloat and generating more WAL traffic.
3. Job events table growth: The main_jobevent table in Tower grows rapidly with every job run. A typical playbook run generates 10-100 events per host, so a single run against a 10,000-host inventory can generate between 100,000 and 1,000,000 events.
4. Missing table statistics: After bulk operations, PostgreSQL's query planner may choose inefficient plans (sequential scans instead of index scans), making queries much slower on both primary and replica.
5. Network throughput limits: If the replica is in a different availability zone or region, network bandwidth can become a bottleneck for WAL streaming, especially during peak activity.
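Two of the causes above come down to simple arithmetic: how many job events a run produces, and how long a replica needs to work through a backlog. The sketch below is a back-of-the-envelope model only. The function names and the example rates are assumptions chosen for illustration, not measurements from a real Tower installation.

```python
from typing import Optional


def estimate_job_events(hosts: int, events_per_host: int) -> int:
    """Events written to main_jobevent for one playbook run."""
    return hosts * events_per_host


def catch_up_seconds(lag_bytes: float,
                     replay_bytes_per_s: float,
                     wal_bytes_per_s: float) -> Optional[float]:
    """Seconds until the replica catches up, or None if it never will.

    The backlog only shrinks when replay is faster than new WAL arrives.
    """
    net_rate = replay_bytes_per_s - wal_bytes_per_s
    if net_rate <= 0:
        return None
    return lag_bytes / net_rate


if __name__ == "__main__":
    MB = 1024 * 1024
    print(estimate_job_events(10_000, 10))    # low end of the 10-100 range
    print(estimate_job_events(10_000, 100))   # high end of the 10-100 range
    # Hypothetical rates: 512 MB backlog, 20 MB/s replay, 4 MB/s new WAL.
    print(catch_up_seconds(512 * MB, 20 * MB, 4 * MB))
```

If the second function returns None for your observed rates, the replica cannot recover on its own and the WAL-generating workload has to be reduced first.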

Step-by-Step Fix

Step 1: Identify the Lag Source

First, determine what's causing the replication lag:

```bash
# Check current replication lag (run on the primary)
sudo -u postgres psql -c "
SELECT client_addr, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replication_lag,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) AS write_lag,
       pg_size_pretty(pg_wal_lsn_diff(write_lsn, flush_lsn)) AS flush_lag
FROM pg_stat_replication;"

# Check for replication slot issues
sudo -u postgres psql -c "
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"

# Find long-running queries on primary
sudo -u postgres psql -c "
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC
LIMIT 10;"

# Find long-open transactions that can hold back VACUUM on the primary
sudo -u postgres psql -c "
SELECT pid, now() - xact_start AS xact_duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_duration DESC
LIMIT 5;"

# Estimate table bloat from dead-tuple statistics (no extension required)
sudo -u postgres psql -d awx -c "
SELECT schemaname, relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_dead_tup, n_live_tup,
       ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
```
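The three lag columns in the first query can be combined to see which stage of replication is slowest. The following sketch assumes you run that query without pg_size_pretty() so the values arrive as raw byte counts. The function name and the stage labels are illustrative, and the mapping of stages to causes is a rule of thumb rather than a definitive diagnosis.

```python
def slowest_stage(replication_lag: int, write_lag: int, flush_lag: int) -> str:
    """Name the stage holding the most WAL, given byte counts.

    replication_lag: sent_lsn  -> replay_lsn (total)
    write_lag:       sent_lsn  -> write_lsn  (network / WAL receiver)
    flush_lag:       write_lsn -> flush_lsn  (replica disk)
    The remainder is flush_lsn -> replay_lsn (WAL replay on the replica).
    """
    replay_lag = replication_lag - write_lag - flush_lag
    stages = {"network": write_lag, "disk": flush_lag, "replay": replay_lag}
    name, lag = max(stages.items(), key=lambda item: item[1])
    return name if lag > 0 else "none"


if __name__ == "__main__":
    MB = 1024 * 1024
    # Hypothetical sample: almost all of the backlog is waiting to be replayed.
    print(slowest_stage(512 * MB, 2 * MB, 10 * MB))
```

A result of `replay` points toward single-threaded replay or query conflicts on the replica, while `network` points toward bandwidth between the two servers.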

Step 2: Stop Queries Blocking the Replica

If a long-running query is holding back the replica:

```bash
# On the replica, find and terminate long-running queries
sudo -u postgres psql -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '30 minutes';"

# Terminate specific query (replace PID)
sudo -u postgres psql -c "SELECT pg_terminate_backend(18423);"

# Or terminate all queries running longer than 30 minutes
sudo -u postgres psql -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '30 minutes'
  AND pid <> pg_backend_pid();"
```
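Before terminating backends in bulk, it can help to preview exactly which PIDs the filter would select. The sketch below reproduces the same selection rule (active, older than 30 minutes, not the current session) over rows you have already fetched from pg_stat_activity. The `Activity` type, the function name, and the sample PIDs are hypothetical, and nothing here connects to a database.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class Activity(NamedTuple):
    """The subset of pg_stat_activity columns the filter needs."""
    pid: int
    state: str
    query_start: datetime


def pids_to_terminate(rows: List[Activity],
                      now: datetime,
                      max_age: timedelta = timedelta(minutes=30),
                      own_pid: Optional[int] = None) -> List[int]:
    """Return PIDs of active queries that started more than max_age ago."""
    cutoff = now - max_age
    return [
        row.pid
        for row in rows
        if row.state == "active"
        and row.query_start < cutoff
        and row.pid != own_pid   # never select the session doing the cleanup
    ]


if __name__ == "__main__":
    now = datetime(2024, 3, 15, 14, 30)
    sample = [
        Activity(18423, "active", now - timedelta(minutes=45)),
        Activity(18500, "idle", now - timedelta(hours=2)),
        Activity(18600, "active", now - timedelta(minutes=5)),
    ]
    print(pids_to_terminate(sample, now))
```

Reviewing the returned list first makes it less likely that a legitimate long-running job is killed along with the offending scan.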

Step 3: Speed Up WAL Replay

Increase the replica's replay speed:

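```bash
# Edit postgresql.conf on the replica
sudo vim /etc/postgresql/14/main/postgresql.conf

# Standby settings that affect how long replay can be held up
hot_standby = on
max_standby_streaming_delay = 30s    # Default is 30s; lower it so conflicting queries are cancelled sooner and replay resumes
wal_receiver_status_interval = 1s    # Report status more frequently (default is 10s)
hot_standby_feedback = on            # Avoids query conflicts, but can delay VACUUM on the primary

# Increase max_wal_senders if needed (it limits WAL sender processes on the primary;
# the standby's value must not be lower than the primary's)
max_wal_senders = 10

# Restart PostgreSQL on the replica
sudo systemctl restart postgresql
```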
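After editing the file it is easy to miss a setting that is still at its old value. The sketch below is one possible way to compare the text of a postgresql.conf against the values you intended to set. It is deliberately simplistic: it only understands `name = value` lines, ignores include directives, and the helper names `parse_conf` and `settings_drift` are invented for this example. On a live server, `SHOW <setting>;` in psql remains the authoritative check.

```python
from typing import Dict, Optional, Tuple


def parse_conf(text: str) -> Dict[str, str]:
    """Parse simple 'name = value  # comment' lines into a dict."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings


def settings_drift(text: str,
                   wanted: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each mismatched setting to (current value or None, wanted value)."""
    current = parse_conf(text)
    return {
        name: (current.get(name), value)
        for name, value in wanted.items()
        if current.get(name) != value
    }


if __name__ == "__main__":
    conf_text = """
    hot_standby = on
    max_standby_streaming_delay = 30s   # Default is 30s
    hot_standby_feedback = off
    """
    wanted = {
        "hot_standby": "on",
        "hot_standby_feedback": "on",
        "wal_receiver_status_interval": "1s",
    }
    for name, (have, want) in settings_drift(conf_text, wanted).items():
        print(f"{name}: have {have!r}, want {want!r}")
```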

Step 4: Optimize the Tower Database

Apply Tower-specific optimizations:

```bash
# Update table statistics after bulk operations
sudo -u postgres psql -d awx -c "ANALYZE main_jobevent;"
sudo -u postgres psql -d awx -c "ANALYZE main_inventory;"
sudo -u postgres psql -d awx -c "ANALYZE main_host;"

# Rebuild bloated indexes (a plain REINDEX blocks writes to the table while it runs;
# PostgreSQL 12+ also accepts REINDEX TABLE CONCURRENTLY)
sudo -u postgres psql -d awx -c "REINDEX TABLE main_jobevent;"

# Configure Tower's cleanup settings in settings.py
sudo vim /etc/tower/settings.py

# Add or modify (confirm that your Tower/AWX version reads this setting before relying on it):
AWX_CLEANUP_POLICY = {
    'job_events': {'days': 30, 'batch_size': 10000},
    'system_tracking': {'days': 30, 'batch_size': 10000},
}

# Restart Tower services
sudo ansible-tower-service restart
```

Step 5: Implement Batched Cleanup

Create a custom cleanup job to prevent future lag:

```yaml
# /etc/tower/conf.d/cleanup.yml
- name: Tower Job Event Cleanup
  hosts: localhost
  gather_facts: false
  vars:
    retention_days: 30
    batch_size: 50000
  tasks:
    - name: Get total job events to delete
      command: >
        psql -d awx -t -c "SELECT COUNT(*) FROM main_jobevent
        WHERE created < now() - interval '{{ retention_days }} days'"
      register: total_to_delete
      changed_when: false

    # Small batches keep each transaction, and the WAL burst it produces, short
    - name: Delete job events in batches
      command: >
        psql -d awx -c "DELETE FROM main_jobevent WHERE id IN
        (SELECT id FROM main_jobevent
        WHERE created < now() - interval '{{ retention_days }} days'
        LIMIT {{ batch_size }})"
      register: delete_result
      until: delete_result.stdout == "DELETE 0"
      retries: 100
      delay: 5
      # The final attempt always prints "DELETE 0", so report a change based on the initial count
      changed_when: total_to_delete.stdout | trim | int > 0

    - name: Analyze table after cleanup
      command: psql -d awx -c "ANALYZE main_jobevent;"
      changed_when: false
```
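The batching pattern in the playbook (delete a limited set of ids, commit, repeat until nothing is left) can be tried out safely before pointing it at a production database. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL. The table mirrors only the two main_jobevent columns the playbook touches, timestamps are stored as ISO-8601 strings, and the function name is made up for this example.

```python
import sqlite3


def delete_in_batches(conn: sqlite3.Connection, cutoff: str, batch_size: int) -> int:
    """Delete rows older than cutoff in batches; return the total deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM main_jobevent WHERE id IN ("
            "  SELECT id FROM main_jobevent WHERE created < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()            # one short transaction per batch
        if cur.rowcount == 0:    # nothing left to delete
            return total
        total += cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE main_jobevent (id INTEGER PRIMARY KEY, created TEXT)")
    old_rows = [("2024-01-01T00:00:00",)] * 25
    new_rows = [("2024-03-10T00:00:00",)] * 5
    conn.executemany("INSERT INTO main_jobevent (created) VALUES (?)", old_rows + new_rows)
    conn.commit()

    deleted = delete_in_batches(conn, "2024-02-14T00:00:00", batch_size=10)
    remaining = conn.execute("SELECT COUNT(*) FROM main_jobevent").fetchone()[0]
    print(deleted, remaining)
```

Committing after every batch is the point of the technique: each transaction stays small, so the replica receives the deletes as a steady stream instead of one large burst of WAL.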

Step 6: Configure Replication Slots Properly

Ensure replication slots don't cause WAL accumulation:

```bash
# Check if slot is causing issues
sudo -u postgres psql -c "
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"

# If retained WAL is excessive (> 100GB), recreate the slot.
# Dropping the slot lets the primary recycle WAL the replica has not yet received,
# so the replica may need to be rebuilt from a fresh base backup afterwards.
# An active slot cannot be dropped; stop the replica first.
# ON PRIMARY:
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('tower_replica');"
sudo -u postgres psql -c "SELECT pg_create_physical_replication_slot('tower_replica');"

# Then reconfigure replica to use the new slot
# In postgresql.conf or postgresql.auto.conf (PostgreSQL 12+), or recovery.conf (11 and earlier):
primary_slot_name = 'tower_replica'

# Restart replica
sudo systemctl restart postgresql
```

Verification

After applying the fix, verify replication is healthy:

```bash
# Check replication lag is minimal (< 1MB for healthy state)
sudo -u postgres psql -c "
SELECT client_addr, state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag
FROM pg_stat_replication;"

# Expected output:
#  client_addr  |   state   |  lag
# --------------+-----------+--------
#  10.0.1.52    | streaming | 128 kB

# Verify Tower HA status is healthy
curl -k -u admin:password https://localhost/api/v2/ping/ | jq .

# Expected output:
# {
#   "ha": true,
#   "version": "4.3.0",
#   "active_node": "tower-primary",
#   "instances": [
#     {"node": "tower-primary", "heartbeat": "2024-03-15T14:30:00Z"},
#     {"node": "tower-replica", "heartbeat": "2024-03-15T14:30:01Z"}
#   ]
# }

# Check job events are flowing
sudo -u postgres psql -d awx -c "
SELECT COUNT(*) FROM main_jobevent WHERE created > now() - interval '1 hour';"

# Run a test inventory scan and verify replica receives updates
ansible-inventory -i /etc/tower/inventory --list | jq '._meta'
```
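The health criterion used in the first check (state is `streaming` and lag is under 1 MB) is simple enough to encode for a monitoring script. The sketch below assumes the lag has been fetched as a raw byte count rather than through pg_size_pretty(). The function name and the default threshold constant are illustrative, and the 1 MB limit is the guide's rule of thumb rather than a PostgreSQL default.

```python
ONE_MB = 1024 * 1024


def replication_healthy(state: str, lag_bytes: int, max_lag_bytes: int = ONE_MB) -> bool:
    """True when the replica is streaming and its lag is below the threshold."""
    return state == "streaming" and 0 <= lag_bytes < max_lag_bytes


if __name__ == "__main__":
    print(replication_healthy("streaming", 128 * 1024))   # the healthy example (128 kB)
    print(replication_healthy("streaming", 512 * ONE_MB)) # the lagging example (512 MB)
    print(replication_healthy("catchup", 0))              # connected but still catching up
```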

[ansible-job-event-table-bloat-causes-slow-queries](/articles/ansible-job-event-table-bloat-causes-slow-queries)
[ansible-tower-ha-failover-fails-due-to-replication-lag](/articles/ansible-tower-ha-failover-fails-due-to-replication-lag)
[ansible-database-vacuum-blocks-job-execution](/articles/ansible-database-vacuum-blocks-job-execution)



Written by

FixWikiHub Editorial Team

Our editorial team consists of experienced DevOps engineers, systems administrators, and cloud architects with hands-on experience in production environments across AWS, Azure, GCP, and on-premises infrastructure.

Every guide undergoes technical review for accuracy and is updated when software versions, commands, or best practices change.

Last updated: Feb 12, 2026

Important Notice

Disclaimer & Safety Guidelines

The troubleshooting steps in this guide are provided for educational and informational purposes. Before applying any changes to production systems:

Test in a staging environment first — Always verify commands and configurations in a non-production environment before deploying to live systems.
Create backups — Ensure you have current backups of databases, configurations, and critical files before making changes.
Understand the impact — Review how each step may affect your specific environment, dependencies, and users.
Consult official documentation — This guide supplements, but does not replace, official vendor documentation and best practices.

FixWikiHub is not responsible for any damages arising from the use of this content. See our Terms of Use for more information.

