Introduction
When Ansible Tower or AWX recovers from a partial infrastructure outage (such as a network partition, database connectivity loss, or capacity limit), a retry storm can occur where hundreds of pending jobs, webhook callbacks, and scheduled tasks all attempt to execute simultaneously. This retry storm can overwhelm the recovered infrastructure, causing secondary outages and degraded performance that extends far beyond the original incident.
The retry storm occurs because:
- Queued jobs retry automatically when connectivity returns
- Webhook callbacks have exponential backoff that aligns across many triggers
- Scheduled jobs that missed their window all execute at once
- Callback receivers retry failed job event submissions
- Auto-scaling groups launch new executors that all try to register
Symptoms
During the retry storm, you'll observe resource exhaustion:
```
# Tower task manager logs
2024-03-15 14:23:45,123 WARNING awx.main.dispatch Task queue depth: 450 (threshold: 100)
2024-03-15 14:24:12,456 ERROR awx.main.tasks Celery task timeout after 300 seconds
2024-03-15 14:25:01,789 CRITICAL awx.main.models.instance Instance tower-executor-15 at capacity (100% CPU, 95% memory)
2024-03-15 14:26:34,012 ERROR awx.main.dispatch Redis connection pool exhausted: QueuePool limit 50 reached

# Tower web logs
2024-03-15 14:27:15,345 ERROR awx.api.views Rate limit exceeded for job launch endpoint
2024-03-15 14:28:45,678 WARNING django.request Bad Request: /api/v2/jobs/ - Connection refused

# PostgreSQL connection exhaustion
$ sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname='awx';"
 count
-------
   487
(1 row)

# Max connections is typically 100-200
$ sudo -u postgres psql -c "SHOW max_connections;"
 max_connections
-----------------
 100

# Celery/Redis queue overflow
$ redis-cli LLEN tower_job_queue
(integer) 892

# System resources
$ uptime
14:30:45 up 45 days, load average: 89.23, 75.12, 45.34  # Extremely high load

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           32Gi        30Gi       512Mi       2.0Gi       1.5Gi       512Mi
Swap:         8.0Gi       7.8Gi       200Mi
```
API responses show degraded performance:
```bash
$ curl -k -u admin:password https://tower.example.com/api/v2/jobs/ -w "\nTime: %{time_total}s\n"
HTTP 503 Service Unavailable
Time: 30.05s

$ curl -k -u admin:password https://tower.example.com/api/v2/ping/
{
  "error": "Service temporarily unavailable",
  "detail": "Database connection pool exhausted"
}
```
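The individual checks above can be folded into one summary of which resources look saturated. The following is a minimal sketch, not part of Tower: the `Snapshot` fields, the function name, the 90% connection ratio, and the sample values (including the 16-core node) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One sample of the metrics shown above (field names are hypothetical)."""
    queue_depth: int         # e.g. from `redis-cli LLEN tower_job_queue`
    queue_threshold: int     # alerting threshold for queue depth
    db_connections: int      # count(*) from pg_stat_activity
    db_max_connections: int  # value of SHOW max_connections
    load_1m: float           # 1-minute load average from `uptime`
    cpu_cores: int           # CPU cores on the node


def exhausted_resources(s: Snapshot, db_ratio: float = 0.9) -> List[str]:
    """Return the names of resources that look saturated in this snapshot."""
    problems = []
    if s.queue_depth > s.queue_threshold:
        problems.append("queue")
    if s.db_connections >= db_ratio * s.db_max_connections:
        problems.append("database")
    if s.load_1m > s.cpu_cores:
        problems.append("cpu")
    return problems


# Hypothetical sample values for a node in the middle of a storm.
storm = Snapshot(queue_depth=892, queue_threshold=100,
                 db_connections=95, db_max_connections=100,
                 load_1m=89.23, cpu_cores=16)
print(exhausted_resources(storm))  # ['queue', 'database', 'cpu']
```

Seeing all three resources flagged at once suggests a storm rather than a single slow component, which is why the mitigation below starts by reducing inbound work instead of tuning one subsystem.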
Common Causes
The retry storm is caused by several factors converging after recovery:
1. Synchronized retry timers: Jobs and callbacks that failed during the outage all have similar retry timestamps. When the outage ends, they all retry nearly simultaneously.
2. Webhook exponential backoff alignment: External systems sending webhooks to Tower implement exponential backoff. After a 30-minute outage, multiple systems may retry within the same minute.
3. Missed scheduled jobs: Tower's scheduler runs missed jobs upon recovery. If 100 schedules missed their window during a 1-hour outage, all 100 will queue immediately.
4. Callback buffer overflow: Execution nodes buffer job events when Tower is unreachable. Upon reconnection, they submit all buffered events at once, overwhelming the API.
5. Auto-scaling cooldown expiration: If Tower is in an auto-scaled environment, new instances may launch during the outage and all register simultaneously upon recovery.
6. Connection pool exhaustion: Database and Redis connection pools reach limits as hundreds of tasks compete for resources.
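The first two causes come down to many clients computing the same backoff delay from the same failure time. The sketch below illustrates this under stated assumptions: 200 hypothetical clients, a 1-second base delay, and "full jitter" (a uniform random delay between zero and the plain backoff) as one common way to break the alignment. It does not reproduce the retry logic of Tower or of any specific webhook sender.

```python
import random
from collections import Counter


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 600.0) -> float:
    """Plain exponential backoff: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def jittered_delay(attempt: int, rng: random.Random,
                   base: float = 1.0, cap: float = 600.0) -> float:
    """Full jitter: a uniform delay between 0 and the plain backoff."""
    return rng.uniform(0.0, backoff_delay(attempt, base, cap))


def peak_per_second(delays) -> int:
    """Largest number of retries that land in the same one-second bucket."""
    return max(Counter(int(d) for d in delays).values())


rng = random.Random(42)      # fixed seed so the run is repeatable
clients, attempt = 200, 8    # 200 clients, each waiting 2**8 = 256 s

aligned = [backoff_delay(attempt) for _ in range(clients)]
spread = [jittered_delay(attempt, rng) for _ in range(clients)]

print("peak without jitter:", peak_per_second(aligned))  # all 200 in one second
print("peak with jitter:   ", peak_per_second(spread))
```

Without jitter every client retries in the same second; with jitter the same retries are spread across the whole backoff window, so the peak the server sees is a small fraction of the client count.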
Step-by-Step Fix
Step 1: Immediate Mitigation - Stop the Storm
Halt incoming requests to allow the system to recover:
```bash
# Option 1: Temporarily rate-limit job launches at the nginx level
sudo vim /etc/nginx/conf.d/tower.conf

# Add rate limiting (limit_req_zone belongs in the http context,
# the location block inside the Tower server block)
limit_req_zone $binary_remote_addr zone=job_launch:10m rate=1r/s;

location /api/v2/job_templates/ {
    limit_req zone=job_launch burst=5 nodelay;
    proxy_pass http://tower_api;
}

# Validate the configuration, then reload nginx
sudo nginx -t && sudo nginx -s reload

# Option 2: Pause the Tower dispatcher
# (confirm this subcommand exists in your release with `awx-manage help`)
sudo -u awx awx-manage job_queue_pause

# Option 3: Stop accepting new jobs via API
# Disabling an instance stops new jobs from being assigned to it.
# Replace 1 with the instance ID listed under /api/v2/instances/
curl -X PATCH -k -u admin:password \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}' \
  https://localhost/api/v2/instances/1/

# Option 4: Emergency - stop Tower services briefly
sudo ansible-tower-service stop
sleep 60  # Let queues drain
sudo ansible-tower-service start
```
Step 2: Clear Backlogged Queues
Process or cancel pending tasks:
```bash
# Check queue depths
redis-cli --scan --pattern "celery-task-meta-*" | wc -l
redis-cli LLEN celery
redis-cli LLEN tower_job_queue

# View pending jobs
curl -s -k -u admin:password "https://localhost/api/v2/jobs/?status=pending" | jq '.count'

# Cancel pending jobs (keep only critical; 42 is the job template to keep)
# The API paginates results, so rerun until the pending count stops dropping
curl -s -k -u admin:password "https://localhost/api/v2/jobs/?status=pending" | \
  jq -r '.results[] | select(.job_template != 42) | .id' | \
  while read -r job_id; do
    curl -X POST -k -u admin:password \
      "https://localhost/api/v2/jobs/$job_id/cancel/"
  done

# Or cancel all pending jobs via management command
# (confirm with `awx-manage help` that these subcommands exist in your release)
sudo -u awx awx-manage cancel_jobs --status=pending --older-than="1 hour"

# Clear stuck callbacks
sudo -u awx awx-manage clear_stuck_callbacks --older-than="30 minutes"

# Purge Celery queues if completely stuck
sudo -u awx celery -A awx purge -f
```
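The selection step in the cancel loop above (the `jq` filter) can also be written as a small pure function, which makes it easier to review before anything is cancelled. This is a sketch under assumptions: `jobs_to_cancel` is a hypothetical helper, and the sample data only mimics the `id`, `status`, and `job_template` fields of the `results` array returned by `/api/v2/jobs/`.

```python
from typing import Iterable, List, Set


def jobs_to_cancel(results: Iterable[dict], keep_templates: Set[int]) -> List[int]:
    """Return IDs of pending jobs whose template is not in `keep_templates`."""
    return [
        job["id"]
        for job in results
        if job.get("status") == "pending"
        and job.get("job_template") not in keep_templates
    ]


# Hypothetical sample shaped like the `results` array of /api/v2/jobs/.
sample = [
    {"id": 101, "status": "pending", "job_template": 42},
    {"id": 102, "status": "pending", "job_template": 7},
    {"id": 103, "status": "running", "job_template": 7},
    {"id": 104, "status": "pending", "job_template": None},
]
print(jobs_to_cancel(sample, keep_templates={42}))  # [102, 104]
```

Passing a set of template IDs rather than a single ID lets several critical templates be protected at once; jobs whose template is null are treated as non-critical here, which is a design choice you may want to reverse.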
Step 3: Throttle Job Execution
Configure Tower to process jobs more slowly:
```bash
# Edit Tower settings
sudo vim /etc/tower/settings.py

# Reduce concurrent job capacity
AWX_TASK_ENV = {
    'MAX_CONCURRENT_JOBS': 10,      # Default may be 50+
    'JOB_QUEUE_POLL_INTERVAL': 30,  # Increase polling interval
}

# Add jitter to retries
AWX_TASK_CALLBACKS = {
    'RETRY_BACKOFF': True,
    'RETRY_BACKOFF_MAX': 600,  # Max 10 minute delay
    'RETRY_JITTER': True,      # Add random jitter to prevent alignment
}

# Restart Tower
sudo ansible-tower-service restart
```
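Lowering concurrency trades throughput for stability, so it helps to estimate how long the backlog will take to clear at the reduced setting. The arithmetic below is a rough sketch: it assumes a fixed average job duration (5 minutes here, a made-up figure) and a constant arrival rate, and `drain_minutes` is a hypothetical helper rather than anything Tower provides.

```python
import math


def drain_minutes(backlog: int, concurrency: int, avg_job_minutes: float,
                  arrivals_per_minute: float = 0.0) -> float:
    """Estimate minutes to clear `backlog` jobs at a fixed concurrency.

    Throughput is concurrency / avg_job_minutes jobs per minute. Returns
    math.inf when new arrivals meet or exceed that throughput.
    """
    throughput = concurrency / avg_job_minutes
    net = throughput - arrivals_per_minute
    if net <= 0:
        return math.inf
    return backlog / net


# 892 queued jobs, 10 concurrent jobs, 5-minute average duration (assumed).
print(drain_minutes(backlog=892, concurrency=10, avg_job_minutes=5))  # 446.0
# Same backlog while one new job arrives every minute.
print(drain_minutes(backlog=892, concurrency=10, avg_job_minutes=5,
                    arrivals_per_minute=1))                           # 892.0
```

If the estimate comes out in hours, cancelling non-critical pending jobs first is usually a better option than waiting for the throttled queue to drain.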
Step 4: Implement Rate Limiting
Configure Tower to handle retries gracefully:
```python
# Edit /etc/tower/conf.d/custom_settings.py (settings files use Python syntax)
BROKER_TRANSPORT_OPTIONS = {
    'max_retries': 3,
    'interval_start': 10,  # Start at 10 seconds
    'interval_step': 30,   # Add 30 seconds each retry
    'interval_max': 300,   # Max 5 minute interval
}

# Bound timeouts and retries for external calls
AWX_EXTERNAL_REQUEST_TIMEOUT = 60
AWX_EXTERNAL_REQUEST_RETRIES = 2

# Configure database connection pooling
DATABASES = {
    'default': {
        'CONN_MAX_AGE': 60,
        'OPTIONS': {
            'pool': {
                'max_connections': 50,
                'recycle': 300,
            }
        }
    }
}
```
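To see what the retry options above amount to in wall-clock time, the schedule can be worked out directly. This sketch follows the meaning given in the comments (start at 10 seconds, add 30 seconds per retry, never exceed 300 seconds); `retry_intervals` is a hypothetical helper, and the broker library may differ in detail.

```python
from typing import List


def retry_intervals(max_retries: int, interval_start: float,
                    interval_step: float, interval_max: float) -> List[float]:
    """Delay before each retry: start, start+step, ... capped at interval_max."""
    return [
        min(interval_max, interval_start + n * interval_step)
        for n in range(max_retries)
    ]


schedule = retry_intervals(max_retries=3, interval_start=10,
                           interval_step=30, interval_max=300)
print(schedule, "total wait:", sum(schedule), "s")  # [10, 40, 70] total wait: 120 s
```

With these values a task gives up after about two minutes of waiting, so an outage longer than that still produces failures that have to be retried by some other mechanism.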
Step 5: Increase Resource Limits
Scale resources to handle the load:
```bash
# Increase PostgreSQL connections
sudo vim /etc/postgresql/14/main/postgresql.conf
max_connections = 200
shared_buffers = 2GB

sudo systemctl restart postgresql

# Increase Redis memory and connections
sudo vim /etc/redis/redis.conf
maxmemory 4gb
maxclients 2000

sudo systemctl restart redis

# Increase Celery worker processes (raise --concurrency from the default)
sudo vim /etc/supervisord.d/tower-worker.ini
[program:tower-worker]
command=/var/lib/awx/venv/bin/celery worker
    --app=awx
    --concurrency=4
    --max-tasks-per-child=100
    --time-limit=3600

sudo supervisorctl restart tower-worker
```
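Raising `max_connections` only helps if the total of all client pools still fits underneath it. The check below is simple arithmetic under assumptions: the node count and the 10 connections reserved for administrative sessions are made-up figures, and `connection_budget` is a hypothetical helper.

```python
def connection_budget(max_connections: int, reserved: int,
                      nodes: int, pool_per_node: int) -> int:
    """Connections left after every node fills its pool (negative = overcommitted)."""
    return max_connections - reserved - nodes * pool_per_node


# 200 server connections, 10 reserved for admin sessions (assumed),
# 3 Tower nodes (assumed), each allowed a pool of 50.
print(connection_budget(200, 10, 3, 50))  # 40 spare
print(connection_budget(200, 10, 4, 50))  # -10: a fourth node overcommits
```

A negative result means the database can still be exhausted during a storm, so either the per-node pool or the node count has to come down before the next incident.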
Step 6: Implement Request Throttling in nginx
Add proper rate limiting at the load balancer:
```nginx
# /etc/nginx/nginx.conf
http {
    # Define rate limit zones
    limit_req_zone $binary_remote_addr zone=api_general:10m rate=10r/s;
    limit_req_zone $binary_remote_addr zone=job_launch:10m rate=1r/s;
    limit_req_zone $binary_remote_addr zone=webhook:10m rate=5r/s;

    # Connection limits
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    server {
        location /api/v2/ {
            limit_req zone=api_general burst=20 nodelay;
            limit_conn conn_limit 50;
            proxy_pass http://tower_api;
        }

        location /api/v2/job_templates/ {
            limit_req zone=job_launch burst=10 nodelay;
            proxy_pass http://tower_api;
        }

        location /api/v2/webhook/ {
            limit_req zone=webhook burst=20 nodelay;
            proxy_pass http://tower_api;
        }
    }
}
```
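The interaction between `rate`, `burst`, and `nodelay` is easier to reason about with a small model. The class below is a simplified sketch of leaky-bucket accounting for a single client key, using the `job_launch` values (`rate=1r/s`, `burst=10`); it is an illustration of the idea, not nginx's actual implementation, and `LeakyBucket` is a hypothetical name.

```python
class LeakyBucket:
    """Simplified model of `limit_req ... burst=N nodelay` for one client key."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = 0.0   # requests currently counted against the burst
        self.last = None    # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:   # first request from this client
            self.last = now
            return True
        # Drain for the elapsed time, then count this request.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1)
        if excess > self.burst:
            return False        # rejected (nginx answers 503 by default)
        self.excess, self.last = excess, now
        return True


bucket = LeakyBucket(rate_per_sec=1, burst=10)
accepted = sum(bucket.allow(0.0) for _ in range(25))  # 25 launches at once
print("accepted immediately:", accepted)              # 11
print("one second later:", bucket.allow(1.0))         # True
```

In this model a client that fires 25 launches at once gets 11 through and the rest rejected, after which it is held to one request per second. During a retry storm that is the behaviour you want: a bounded burst, then a steady trickle.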
Step 7: Create a Retry Storm Prevention Script
Automate detection and mitigation:
```python
#!/usr/bin/env python3
# /usr/local/bin/retry_storm_monitor.py

import time
from datetime import datetime

import requests

TOWER_URL = "https://localhost"
AUTH = ("admin", "password")
MAX_PENDING_JOBS = 50
MAX_QUEUE_DEPTH = 100


def get_job_count(status):
    """Return the number of jobs in the given status, as reported by the API."""
    response = requests.get(
        f"{TOWER_URL}/api/v2/jobs/",
        params={"status": status},
        auth=AUTH,
        verify=False,  # mirrors the curl -k examples
        timeout=10,    # do not hang while the API is overloaded
    )
    response.raise_for_status()
    return response.json().get("count", 0)


def get_pending_job_count():
    return get_job_count("pending")


def get_running_job_count():
    return get_job_count("running")


def pause_dispatcher():
    print(f"{datetime.now()}: Pausing job dispatcher due to retry storm")
    # Add your pause mechanism here
    return True


def main():
    consecutive_high_load = 0

    while True:
        try:
            pending = get_pending_job_count()
            running = get_running_job_count()
        except requests.RequestException as exc:
            # The API is often unreachable mid-storm; log and try again
            print(f"{datetime.now()}: API check failed: {exc}")
            time.sleep(60)
            continue

        print(f"{datetime.now()}: Pending: {pending}, Running: {running}")

        if pending > MAX_PENDING_JOBS:
            consecutive_high_load += 1
            if consecutive_high_load >= 3:
                pause_dispatcher()
                # Send alert
                # Exit or continue monitoring
        else:
            consecutive_high_load = 0

        time.sleep(60)


if __name__ == "__main__":
    main()
```
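The "three consecutive high readings" rule in the monitor can be pulled out into a small class so it can be tested without a running Tower. This is a sketch only: `StormDetector` is a hypothetical name, and the defaults mirror `MAX_PENDING_JOBS = 50` and the three-sample requirement used in the script.

```python
class StormDetector:
    """Flags a storm after `required` consecutive samples above `threshold`."""

    def __init__(self, threshold: int = 50, required: int = 3):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def observe(self, pending: int) -> bool:
        """Record one sample; return True when the dispatcher should be paused."""
        self.streak = self.streak + 1 if pending > self.threshold else 0
        return self.streak >= self.required


detector = StormDetector()
samples = [20, 80, 90, 40, 75, 120, 300]   # made-up pending-job counts
print([detector.observe(p) for p in samples])
# [False, False, False, False, False, False, True]
```

Requiring a streak rather than a single high reading keeps a brief, legitimate burst of launches from pausing the dispatcher.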
Verification
Confirm the system has stabilized:
```bash
# Monitor queue depths - should be decreasing
watch -n 5 'redis-cli LLEN tower_job_queue'

# Check pending job count
curl -k -u admin:password "https://localhost/api/v2/jobs/?status=pending" | jq '.count'
# Should be < 50 and stable

# Check system load
uptime
# Load average should be returning to normal (< number of CPU cores)

# Verify PostgreSQL connections are healthy
sudo -u postgres psql -c "
SELECT count(*) as active_connections
FROM pg_stat_activity
WHERE datname='awx' AND state='active';"
# Should be < 50

# Test job execution
curl -X POST -k -u admin:password \
  https://localhost/api/v2/job_templates/1/launch/ | jq '.status'
# Should return "pending" or "running" quickly

# Check API response times
curl -k -u admin:password https://localhost/api/v2/ping/ -w "\nTime: %{time_total}s\n"
# Should be < 1 second
```
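"Should be decreasing" can be made precise by comparing successive queue-depth samples. The helper below is a minimal sketch with a hypothetical name and made-up sample values; the `tolerance` parameter is an assumption that allows small upward wobbles between samples.

```python
from typing import Sequence


def is_draining(depths: Sequence[int], tolerance: int = 0) -> bool:
    """True when no sample rises more than `tolerance` above the previous one
    and the last sample is strictly lower than the first."""
    if len(depths) < 2:
        return False
    steady = all(b <= a + tolerance for a, b in zip(depths, depths[1:]))
    return steady and depths[-1] < depths[0]


print(is_draining([892, 640, 410, 415, 220], tolerance=10))  # True
print(is_draining([892, 900, 950]))                          # False
```

Feeding it the values printed by the `watch` command over a few minutes gives a yes/no answer that is easier to alert on than a visual trend.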
Related Issues
- [ansible-tower-job-queue-backlog](/articles/ansible-tower-job-queue-backlog)
- [ansible-celery-worker-exhaustion](/articles/ansible-celery-worker-exhaustion)
- [ansible-database-connection-pool-exhausted](/articles/ansible-database-connection-pool-exhausted)