Introduction
When running large-scale Ansible Tower or AWX deployments with hundreds of hosts executing tasks simultaneously, you may notice gaps in the audit trail. Job events, task results, and host status changes fail to appear in the Tower UI or API responses despite the playbook completing successfully. This occurs when the callback plugin's event buffer overflows, when PostgreSQL cannot keep up with the write volume, or when async tasks complete without their results being captured by the event logging system.
The missing events create compliance and debugging problems: you cannot determine which tasks ran on which hosts, and failed tasks may not appear in the job detail view. The issue is most common in environments that run more than 500 hosts concurrently or execute more than 50 tasks per playbook.
Symptoms
The Tower job output shows gaps where task events should appear:
```
TASK [Deploy application] ***********
changed: [server-001]
changed: [server-002]
changed: [server-003]
...
[output truncated - 347 hosts missing from output]

TASK [Verify deployment] ***********
ok: [server-001]
```
Checking the API for job events shows fewer events than expected:
```bash
$ curl -s -H "Authorization: Bearer $TOKEN" \
    "https://tower.example.com/api/v2/jobs/12345/events/?page_size=1000" | jq '.count'
487

# But you ran against 500 hosts with 10 tasks each
# Expected: 5000+ events (excluding skipped)
```
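The size of the gap in the example above can be worked out with a small helper. This is a minimal sketch: the function name `event_shortfall` is hypothetical, and the estimate of one event per host per task is only a lower bound, since a real job also emits playbook-level and task-level events.

```python
def event_shortfall(host_count: int, task_count: int, captured: int) -> dict:
    """Compare captured events against a one-event-per-host-per-task estimate."""
    expected = host_count * task_count
    missing = max(expected - captured, 0)
    return {
        "expected": expected,
        "captured": captured,
        "missing": missing,
        # Treat an empty job as fully captured to avoid dividing by zero.
        "captured_ratio": captured / expected if expected else 1.0,
    }


if __name__ == "__main__":
    # Figures from the example: 500 hosts, 10 tasks each, 487 events returned.
    print(event_shortfall(500, 10, 487))
```

With the figures from the example, fewer than 10% of the estimated events reached the API.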
PostgreSQL logs show write throttling or connection timeouts:
2026-04-11 14:32:15.123 UTC [awx] LOG: could not receive data from client: Connection timed out
2026-04-11 14:32:15.456 UTC [awx] ERROR: remaining connection slots are reserved for non-replication superuser connections
Tower task container logs reveal callback issues:
$ kubectl logs -n awx deployment/awx-task | grep -i callback
awx.main.utils.handlers WARNING Event queue is full, dropping events
awx.main.utils.handlers ERROR Failed to save job event: connection already closed
The job detail page in Tower shows incomplete host summaries:
Host Summary:
OK: 12
Changed: 8
Failed: 0
Unreachable: 0
Total: 20 # But you targeted 100 hosts
Common Causes
1. Callback Plugin Event Buffer Overflow
The default awx_display callback plugin uses an internal buffer to collect events before sending them to the Tower API. Under burst load, this buffer fills faster than events can be dispatched:
```python
# From awx/plugins/callback/awx_display.py
MAX_EVENT_QUEUE_SIZE = 10000  # Default buffer size

# When buffer is full, events are dropped
if self.event_queue.qsize() >= MAX_EVENT_QUEUE_SIZE:
    logger.warning("Event queue full, dropping event")
    return
```
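The drop-on-full behaviour can be reproduced in isolation. The following is a toy model built on Python's standard `queue` module, not the callback plugin's actual code; the class name `BoundedEventBuffer` and its methods are hypothetical and exist only to show how a burst that outpaces the consumer turns into lost events.

```python
import queue


class BoundedEventBuffer:
    """Toy event buffer that drops new events once it is full."""

    def __init__(self, max_size: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_size)
        self.dropped = 0

    def emit(self, event: dict) -> bool:
        """Enqueue an event; return False and count a drop when the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, limit: int) -> list:
        """Remove up to `limit` events, as a slow consumer would."""
        drained = []
        while len(drained) < limit:
            try:
                drained.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return drained


if __name__ == "__main__":
    buf = BoundedEventBuffer(max_size=3)
    for i in range(5):  # burst of 5 events into a 3-slot buffer
        buf.emit({"counter": i})
    print(f"dropped: {buf.dropped}")
```

The model makes the trade-off visible: a larger buffer absorbs a longer burst, but events are still lost whenever the producer stays ahead of the consumer for long enough.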
2. PostgreSQL Connection Pool Exhaustion
Tower uses a connection pool for database writes. During burst loads, all connections are consumed waiting for slow writes:
# Check PostgreSQL connection count
$ psql -U awx -d awx -c "SELECT count(*) FROM pg_stat_activity WHERE datname='awx';"
count
-------
98 # At or near the max_connections limit (typically 100)
3. Async Task Event Loss
Async tasks (with async: and poll: directives) generate separate events for job initiation and completion. If the async job completes before the poll interval, the completion event may not be captured:
# Async task where events can be lost
- name: Long running task
  command: /opt/app/long_process.sh
  async: 3600
  poll: 60  # If task completes in 30s, completion event may be missed
4. Tower Task Worker Threading Limits
Tower task workers are configured with a maximum number of concurrent playbook threads. When exceeded, event processing is delayed:
# Tower settings.py (default)
AWX_TASK_ENV = {
    'MAX_EVENT_WORKERS': 4,  # Limited concurrent event processors
}
Step-by-Step Fix
Step 1: Diagnose the Event Loss Source
Check current Tower configuration and identify bottlenecks:
```bash
# Check Tower task settings
docker exec tower_task cat /etc/tower/settings.py | grep -i event

# For AWX on Kubernetes
kubectl exec -n awx deployment/awx-task -- cat /etc/tower/settings.py | grep -i event

# Monitor PostgreSQL during job execution
watch -n 1 "psql -U awx -d awx -c \"SELECT count(*), state FROM pg_stat_activity WHERE datname='awx' GROUP BY state;\""

# Check event queue metrics
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/metrics/" | jq '.task_manager_queue_size'
```
Step 2: Increase Event Buffer and Worker Limits
Update Tower settings to handle higher event volumes:
```python
# /etc/tower/settings.py or via Tower Configuration

# Increase event queue buffer size
CALLBACK_QUEUE_SIZE = 50000  # Default is 10000

# Increase event worker threads
AWX_TASK_ENV['MAX_EVENT_WORKERS'] = 16  # Default is 4

# Increase database connection pool
DATABASES['default']['CONN_MAX_AGE'] = 60
DATABASES['default']['OPTIONS']['connect_timeout'] = 30
```
For AWX on Kubernetes, update the AWX custom resource:
# awx-deployment.yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_env:
    - name: MAX_EVENT_WORKERS
      value: "16"
    - name: CALLBACK_QUEUE_SIZE
      value: "50000"
Apply the changes:
```bash
# Restart Tower services
ansible-tower-service restart

# Or for AWX
kubectl apply -f awx-deployment.yaml
kubectl rollout restart deployment/awx-task -n awx
```
Step 3: Configure PostgreSQL for Higher Write Throughput
Increase PostgreSQL connection limits and tune for write-heavy workloads:
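```sql
-- Run these statements as a superuser, for example from: psql -U postgres

-- Increase max connections (takes effect only after a server restart)
ALTER SYSTEM SET max_connections = 200;

-- Increase work memory for sorting during event writes
-- (allocated per sort operation, so size it against available RAM)
ALTER SYSTEM SET work_mem = '64MB';

-- Increase checkpoint completion target for smoother writes
ALTER SYSTEM SET checkpoint_completion_target = 0.9;

-- Increase WAL buffers (takes effect only after a server restart)
ALTER SYSTEM SET wal_buffers = '64MB';

-- Reload configuration; this applies work_mem and checkpoint_completion_target,
-- while max_connections and wal_buffers need a full PostgreSQL restart
SELECT pg_reload_conf();
```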
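When choosing a value for `max_connections`, keep in mind that PostgreSQL holds back `superuser_reserved_connections` slots (3 by default), which is why ordinary clients see the "remaining connection slots are reserved" error before the raw limit is reached. The sketch below estimates the remaining headroom; the function name `connection_headroom` is hypothetical, and the connection counts are the figures used in this article.

```python
def connection_headroom(current: int, max_connections: int,
                        superuser_reserved: int = 3) -> int:
    """Return how many more ordinary (non-superuser) connections can be opened."""
    usable = max_connections - superuser_reserved
    return max(usable - current, 0)


if __name__ == "__main__":
    # 98 connections against a limit of 100 leaves no slots for ordinary clients.
    print(connection_headroom(98, 100))
    # The same load after raising max_connections to 200.
    print(connection_headroom(98, 200))
```

A headroom of zero during a job run suggests that event writers are competing for the same slots as the web and scheduler processes.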
Update Tower's database connection pool in settings:
# /etc/tower/settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'awx',
        'USER': 'awx',
        'PASSWORD': 'password',
        'HOST': 'localhost',
        'PORT': '5432',
        'CONN_MAX_AGE': 60,
        'OPTIONS': {
            'sslmode': 'prefer',
            'connect_timeout': 30,
            'options': '-c statement_timeout=60000'
        },
    }
}
Step 4: Optimize Playbooks to Reduce Event Volume
Split large batch operations into smaller chunks with explicit event flushing:
```yaml
# playbook.yml - Optimized for audit logging
- name: Deploy with audit trail preservation
  hosts: all
  serial: 50  # Process 50 hosts at a time instead of all at once
  gather_facts: no

  tasks:
    - name: Deploy application
      block:
        - name: Copy application files
          copy:
            src: app/
            dest: /opt/app/
          notify: Restart application

        - name: Run deployment script
          shell: /opt/app/deploy.sh
          args:
            creates: /opt/app/.deployed
      rescue:
        - name: Log failure for this host
          debug:
            msg: "Deployment failed on {{ inventory_hostname }}"
          # This ensures the failure is captured in audit

    - name: Force event flush between batches
      meta: clear_host_errors
      when: ansible_loop.last | default(false)

  handlers:
    - name: Restart application
      systemd:
        name: app
        state: restarted
```
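The effect of `serial: 50` on burst size can be illustrated with a short sketch. It models only how a fixed batch size partitions the inventory, not Ansible's scheduler; the function name `serial_batches` and the generated host names are hypothetical.

```python
def serial_batches(hosts: list, batch_size: int) -> list:
    """Split hosts into consecutive batches, the way a fixed `serial` value does."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


if __name__ == "__main__":
    hosts = [f"server-{n:03d}" for n in range(1, 501)]
    batches = serial_batches(hosts, 50)
    # Each task now produces at most one batch worth of host events at a time.
    print(len(batches), max(len(batch) for batch in batches))
```

For 500 hosts, each task then emits events for at most 50 hosts at once instead of 500, at the cost of running the play ten times in sequence.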
For async tasks, ensure polling catches completion events:
```yaml
# Improved async task configuration
- name: Long running process
  command: /opt/app/process.sh
  async: 3600
  poll: 0  # Start in the background; the async_status task below does the polling
  register: async_result

- name: Check async task status
  async_status:
    jid: "{{ async_result.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 240  # 240 * 15 seconds = 1 hour max wait
  delay: 15
```
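The `until`, `retries`, and `delay` keywords describe a bounded polling loop. The sketch below is a simplified model of such a loop, not Ansible's implementation, and Ansible's exact attempt count may differ; `poll_until_finished` and its parameters are hypothetical names, and the `sleep` argument is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def poll_until_finished(check: Callable[[], bool], retries: int, delay: float,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `check` until it returns True; return the number of attempts used.

    Raises TimeoutError if `check` is still False after `retries` attempts.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            sleep(delay)  # wait before the next status check
    raise TimeoutError(f"not finished after {retries} attempts")


if __name__ == "__main__":
    state = {"checks": 0}

    def finished_on_third_check() -> bool:
        state["checks"] += 1
        return state["checks"] >= 3

    # A no-op sleep keeps the demonstration instant.
    print(poll_until_finished(finished_on_third_check, retries=240, delay=15,
                              sleep=lambda seconds: None))
```

A shorter delay records the completion sooner, while the retry count bounds how long the loop waits before giving up.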
Step 5: Enable Event Persistence Logging
Configure Tower to log dropped events for later reconstruction:
```python
# /etc/tower/settings.py
# Enable event persistence logging
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'logging.FileHandler',
            'filename': '/var/log/tower/event_debug.log',
        },
    },
    'loggers': {
        'awx.main.utils.handlers': {
            'handlers': ['file'],
            'level': 'DEBUG',
            'propagate': True,
        },
        'awx.main.models.jobs': {
            'handlers': ['file'],
            'level': 'DEBUG',
            'propagate': True,
        },
    },
}

# Log all events to file for audit reconstruction
AWX_LOGGING_HANDLERS = ['console', 'file']
```
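Once the log file exists, a short script can tally how often events were dropped or failed to save. This is a sketch under an assumption: it expects lines in the `logger LEVEL message` layout shown in the Symptoms section, which may not match the format your handler actually writes. The name `summarize_event_problems` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   awx.main.utils.handlers WARNING Event queue is full, dropping events
LOG_LINE = re.compile(r"^(?P<logger>\S+)\s+(?P<level>WARNING|ERROR)\s+(?P<message>.+)$")


def summarize_event_problems(lines) -> Counter:
    """Count WARNING and ERROR messages per (level, message) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match["level"], match["message"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "awx.main.utils.handlers WARNING Event queue is full, dropping events",
        "awx.main.utils.handlers ERROR Failed to save job event: connection already closed",
    ]
    for (level, message), count in summarize_event_problems(sample).items():
        print(f"{count:>4}  {level:<7}  {message}")
```

To run it against the real file, pass an open file object, which Python iterates line by line.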
Verification
Run a test deployment and verify event completeness:
```bash
# Start a job and capture the job ID
JOB_ID=$(curl -s -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "https://tower.example.com/api/v2/job_templates/15/launch/" | jq '.id')

# Wait for the job to reach a terminal state
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $TOKEN" \
    "https://tower.example.com/api/v2/jobs/$JOB_ID/" | jq -r '.status')
  echo "Job status: $STATUS"
  case "$STATUS" in
    successful|failed|error|canceled) break ;;
  esac
  sleep 5
done

# Count events
EVENT_COUNT=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/jobs/$JOB_ID/events/?page_size=1" | jq '.count')

# Compare to expected (hosts * tasks); set both counts to match your job
HOST_COUNT=500
TASK_COUNT=10
EXPECTED_EVENTS=$((HOST_COUNT * TASK_COUNT))
echo "Events captured: $EVENT_COUNT / $EXPECTED_EVENTS"

# To list the distinct hosts that produced events, pipe an events response through:
#   jq -r '.results[].host' | sort -u
```
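A raw event count does not show which hosts are missing from the audit trail. The sketch below compares the hosts seen in a list of event records against the hosts you targeted. It assumes each record is a dictionary carrying a `host` field, following the `.results[].host` filter used in this section; the function name `hosts_missing_events` and the sample data are hypothetical.

```python
def hosts_missing_events(events: list, expected_hosts: set, key: str = "host") -> set:
    """Return the expected hosts that do not appear in any event record."""
    seen = {event.get(key) for event in events if event.get(key) is not None}
    return set(expected_hosts) - seen


if __name__ == "__main__":
    events = [
        {"host": "server-001", "event": "runner_on_ok"},
        {"host": "server-002", "event": "runner_on_ok"},
        {"host": None, "event": "playbook_on_task_start"},  # no host attached
    ]
    expected = {"server-001", "server-002", "server-003"}
    print(sorted(hosts_missing_events(events, expected)))
```

Any host in the result either produced no events or had all of its events dropped, which makes it a good candidate for a targeted rerun.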
Check PostgreSQL can handle the load:
```bash
# Monitor connections during job
watch -n 1 "psql -U awx -d awx -c \"SELECT count(*) FROM pg_stat_activity WHERE datname='awx';\""

# Should stay below max_connections
```
Verify event queue isn't backing up:
# Check Tower metrics
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/metrics/" | jq '.event_queue_size'
# Should be 0 or very small when no jobs running
Related Issues
- [ansible-background-worker-stuck-in-a-retry-loop](/articles/ansible-background-worker-stuck-in-a-retry-loop) - Tower worker processing issues
- [ansible-queue-backlog-grows-because-ack-never-reaches-the-broker](/articles/ansible-queue-backlog-grows-because-ack-never-reaches-the-broker) - Message queue issues in Tower
- [ansible-duplicate-execution-starts-after-failover](/articles/ansible-duplicate-execution-starts-after-failover) - Job execution tracking problems