Introduction
During Ansible Tower or AWX rolling updates, instance health checks may report "green" status while background consumers like Celery workers, callback receivers, and event processors fail to reconnect to the new instance. This creates a silent failure where the Tower UI and API work correctly, but jobs never execute, callbacks go unprocessed, and scheduled tasks don't run. The health check passes because it only verifies the web layer, not the background worker connections.
This typically occurs after:

- Tower version upgrades with rolling deployment
- Scaling instances up or down
- Network or firewall changes during maintenance
- Redis or RabbitMQ broker changes
- Changing the Tower instance hostname
Symptoms
Tower UI shows healthy status but jobs don't run:
```
# Tower UI Dashboard
# Instances: All green checkmarks
# Jobs: Stuck in "Pending" state indefinitely

# API health check returns OK
$ curl -k -u admin:password https://tower.example.com/api/v2/ping/ | jq
{
  "ha": true,
  "version": "4.3.0",
  "active_node": "tower-primary",
  "status": "healthy"
}
```
But background workers show connection errors:
```
# Celery worker logs
2024-03-15 14:23:45,123 ERROR celery.worker Loss of connection to Redis broker
2024-03-15 14:24:12,456 WARNING celery.worker Trying to reconnect in 2.00 seconds...
2024-03-15 14:25:01,789 ERROR celery.consumer Cannot connect to redis://redis.internal:6379/0: Error 111 connecting to redis.internal:6379. Connection refused.
2024-03-15 14:26:34,012 CRITICAL celery.worker Stopping worker due to excessive connection failures

# Callback receiver logs
2024-03-15 14:27:15,345 ERROR awx.main.callbacks Callback receiver disconnected from dispatcher
2024-03-15 14:28:45,678 WARNING awx.main.dispatch No active callback receivers registered

# Dispatcher logs
2024-03-15 14:29:01,234 ERROR awx.main.dispatch Job dispatched but no workers available
2024-03-15 14:30:12,567 WARNING awx.main.tasks Job 12345 waiting for capacity (0 workers available)
```
Checking worker status shows they're not connected:
```bash
# Check Celery worker status
$ sudo -u awx celery -A awx inspect active
Error: No nodes replied within time window.

$ sudo -u awx celery -A awx inspect ping
Error: No nodes replied within time window.

# Check Tower instance capacity
$ curl -k -u admin:password https://localhost/api/v2/instances/ | jq '.results[] | {hostname: .hostname, capacity: .capacity, consumed_capacity: .consumed_capacity}'
{
  "hostname": "tower-primary",
  "capacity": 0,  # Should be > 0
  "consumed_capacity": 0
}

# Check job queue depth growing
$ redis-cli LLEN celery
(integer) 1523  # Jobs stuck in queue

# Tower dispatcher shows no workers
$ curl -k -u admin:password https://localhost/api/v2/instances/localhost/ | jq '.related.jobs'
"https://tower.example.com/api/v2/instances/localhost/jobs/"

$ curl -k -u admin:password https://tower.example.com/api/v2/instances/localhost/jobs/ | jq '.count'
0  # No jobs running despite 1523 in queue
```
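The symptoms above combine into one signal: work is queued, the backlog is not shrinking, and nothing is running. A minimal sketch of that test follows, assuming you have already collected the three numbers. The function name `queue_is_stalled` and the earlier depth sample of 1400 are illustrative and do not come from Tower.

```shell
# Hypothetical helper: succeed when the queue looks stalled.
# Usage: queue_is_stalled PREV_DEPTH CURR_DEPTH RUNNING_JOBS
queue_is_stalled() {
    prev_depth=$1
    curr_depth=$2
    running_jobs=$3
    # Stalled means: work is waiting, the backlog did not shrink, nothing runs.
    [ "$curr_depth" -gt 0 ] &&
        [ "$curr_depth" -ge "$prev_depth" ] &&
        [ "$running_jobs" -eq 0 ]
}

# 1523 and 0 are the figures from the output above; 1400 is a made-up earlier sample.
if queue_is_stalled 1400 1523 0; then
    echo "queue stalled: backlog growing with no running jobs"
fi
```

The depth values would come from `redis-cli LLEN celery` and the running count from the jobs endpoint, both shown above.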
Common Causes
Background consumers fail to reconnect due to several factors:
1. Health check only verifies web layer: Tower's default health check endpoint (`/api/v2/ping/`) only verifies the web service and database connection, not Celery workers or callback receivers.
2. Stale Celery worker registration: Workers register themselves in Redis with their hostname. After a rollout with a hostname change, the old registrations remain, and dispatchers try to route to non-existent workers.
3. Redis connection pool exhaustion: During reconnect attempts, workers may exhaust the Redis connection pool, preventing new connections.
4. Broker URL mismatch: Celery workers may have a cached or misconfigured broker URL after the rollout, connecting to the old Redis instance.
5. Firewall rules not updated: New instance IPs may not be allowed to connect to Redis/RabbitMQ.
6. Worker process not restarted: Rolling deployments may not properly restart worker processes, leaving them in a disconnected state.
7. Instance capacity not recalculated: Tower's capacity tracking is based on worker registration. If workers don't reconnect, capacity shows as 0.
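To illustrate the stale-registration cause, the sketch below filters a list of worker node names of the form `celery@hostname` and prints those whose host part differs from the current hostname. The function name `stale_workers` and the old hostname `tower-old` are hypothetical. How you obtain the list of registered names depends on your deployment.

```shell
# Hypothetical helper: read worker node names (one per line) on stdin and
# print the ones that do not belong to the given current hostname.
# Usage: stale_workers CURRENT_HOSTNAME
stale_workers() {
    current_host=$1
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        host=${node#*@}    # strip the "celery@" prefix
        if [ "$host" != "$current_host" ]; then
            printf '%s\n' "$node"
        fi
    done
}

# Illustrative input: one worker registered under an old (made-up) hostname.
printf '%s\n' 'celery@tower-old' 'celery@tower-primary' | stale_workers tower-primary
# prints: celery@tower-old
```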
Step-by-Step Fix
Step 1: Diagnose Worker State
Identify which workers are connected:
```bash
# Check Celery worker registration
redis-cli --scan --pattern "celery*" | head -20

# View unacknowledged messages held by the broker transport
# (unacked is a hash, unacked_index is a sorted set)
redis-cli HGETALL "unacked"
redis-cli ZRANGE "unacked_index" 0 -1

# Check Celery worker inspect (--json makes the output parseable by jq)
sudo -u awx celery -A awx inspect --json active 2>/dev/null | jq 'keys'
sudo -u awx celery -A awx inspect --json registered 2>/dev/null | jq

# Check Tower instance capacity in database
sudo -u postgres psql -d awx -c "
SELECT hostname, capacity, consumed_capacity, enabled, managed
FROM main_instance;"

# Check worker processes
ps aux | grep -E "celery|awx-worker"

# Check supervisor status
sudo supervisorctl status

# Check Redis connections
redis-cli CLIENT LIST | grep -c "celery"
```
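When the supervisor listing is long, it helps to reduce it to the programs that are not running. The sketch below assumes the usual `supervisorctl status` layout, with the program name in the first column and the state in the second. The function name `not_running` is illustrative.

```shell
# Hypothetical helper: read `supervisorctl status` output on stdin and print
# "NAME STATE" for every program whose state is not RUNNING.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2 }'
}

# Usage (not executed here):
#   sudo supervisorctl status | not_running
```

Empty output means every listed program reports RUNNING. That alone does not prove the workers are connected to the broker.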
Step 2: Clear Stale Worker Registrations
Remove old worker registrations from Redis:
```bash
# Connect to Redis
redis-cli

# List all Celery-related keys
# (KEYS blocks the server while it scans; on a busy instance prefer SCAN)
KEYS celery*

# Delete the default task queue
# (Be careful - this discards every queued message)
DEL celery

# Or more targeted - inspect the unacknowledged message keys first
KEYS unacked*

# Purge unacknowledged messages (these keys are shared, not per worker)
DEL unacked
DEL unacked_index

# Clear queue bindings
KEYS _kombu.binding.celery*
DEL _kombu.binding.celery
# DEL takes literal key names, not patterns: delete each
# _kombu.binding.celeryev.* key listed by KEYS above by its full name

# Exit Redis
EXIT
```
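Because `DEL` accepts literal key names only, removing keys by pattern needs a scan-then-delete loop. The following is a sketch under stated assumptions, with a dry run as the default. `delete_matching`, `REDIS_CLI`, and `DRY_RUN` are names introduced here for illustration. `REDIS_CLI` must be a bare command name with no embedded options.

```shell
# Assumed defaults: plain redis-cli on the default host, and dry-run mode on.
REDIS_CLI=${REDIS_CLI:-redis-cli}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: list (or delete) every key matching a glob pattern.
# Usage: delete_matching '_kombu.binding.celeryev.*'
delete_matching() {
    pattern=$1
    # --scan iterates with SCAN, so it does not block the server like KEYS.
    "$REDIS_CLI" --scan --pattern "$pattern" | while IFS= read -r key; do
        if [ "$DRY_RUN" = "1" ]; then
            printf 'would delete %s\n' "$key"
        else
            # </dev/null keeps the client from consuming the key list on stdin.
            "$REDIS_CLI" DEL "$key" </dev/null >/dev/null &&
                printf 'deleted %s\n' "$key"
        fi
    done
}
```

Run it once with the default `DRY_RUN=1` and review the list, then rerun with `DRY_RUN=0` to delete.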
Step 3: Restart Worker Processes
Force workers to re-register:
```bash
# Restart all Tower workers
sudo supervisorctl restart tower-worker:*

# Or restart specific worker groups
sudo supervisorctl restart tower-worker-callbacks
sudo supervisorctl restart tower-worker-default
sudo supervisorctl restart tower-dispatcher

# For Tower installed via setup.sh
sudo ansible-tower-service restart

# Wait for workers to start
sleep 30

# Verify workers are running
sudo supervisorctl status | grep -E "RUNNING|FATAL"

# Check worker logs for successful registration
sudo tail -f /var/log/tower/celery.log | grep -i "ready"
# Should see: "celery@tower-primary ready."
```
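A fixed `sleep 30` can be too short on a slow host and needlessly long on a fast one. One alternative is to poll a readiness command until it succeeds or a deadline passes. The helper name `wait_until` below is an illustration and is not part of Tower.

```shell
# Hypothetical helper: rerun a command until it succeeds or time runs out.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # deadline reached, command never succeeded
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

For example, `wait_until 60 5 workers_ready` would poll every 5 seconds for up to a minute. Here `workers_ready` is a hypothetical function wrapping `sudo -u awx celery -A awx inspect ping`, and the sketch assumes that command exits non-zero when no nodes reply.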
Step 4: Update Tower Instance Registration
Refresh instance capacity in Tower:
```bash
# Update instance heartbeat manually
curl -X POST -k -u admin:password \
  https://localhost/api/v2/instances/localhost/heartbeat/

# Or via management command
sudo -u awx awx-manage update_instance_capacity

# Check instance status
curl -k -u admin:password https://localhost/api/v2/instances/ | jq '.results[] | {hostname: .hostname, capacity: .capacity}'

# Expected output:
{
  "hostname": "tower-primary",
  "capacity": 100  # Should be > 0
}
```
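With several instances, scanning the JSON by eye gets error-prone. The sketch below prints only the hosts reporting zero capacity. It assumes its input is one `hostname capacity` pair per line, and `zero_capacity_hosts` is an illustrative name.

```shell
# Hypothetical helper: read "HOSTNAME CAPACITY" lines on stdin and print the
# hostnames whose capacity is 0.
zero_capacity_hosts() {
    awk 'NF >= 2 && $2 + 0 == 0 { print $1 }'
}

# Usage (not executed here), feeding it from the instances endpoint:
#   curl -sk -u admin:password https://localhost/api/v2/instances/ |
#     jq -r '.results[] | "\(.hostname) \(.capacity)"' | zero_capacity_hosts
```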
Step 5: Verify Broker Configuration
Ensure workers are connecting to the correct broker:
```bash
# Check Tower settings for broker URL
grep -E "CELERY_BROKER|BROKER_URL" /etc/tower/settings.py

# Should match your actual broker (-i because the setting name is upper case)
grep -i broker /etc/tower/settings.py
# BROKER_URL = 'redis://redis.internal:6379/0'

# Verify Redis is accessible
redis-cli -h redis.internal ping
# Should return: PONG

# Check network connectivity
telnet redis.internal 6379
# Should connect successfully

# Verify firewall rules allow connection
sudo iptables -L -n | grep 6379
# Or for firewalld
sudo firewall-cmd --list-ports
```
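To test the broker that is actually configured rather than a hostname typed from memory, you can derive the host and port from the broker URL. The following is a minimal sketch with an illustrative function name, `broker_host_port`. It handles plain `redis://[user:password@]host[:port][/db]` URLs only, so IPv6 literals and passwords containing `@` are out of scope.

```shell
# Hypothetical helper: print "HOST PORT" for a Redis broker URL.
# Usage: broker_host_port 'redis://redis.internal:6379/0'
broker_host_port() {
    url=$1
    rest=${url#*://}     # drop the scheme
    rest=${rest#*@}      # drop credentials, if present
    rest=${rest%%/*}     # drop the /db suffix
    host=${rest%%:*}
    case $rest in
        *:*) port=${rest##*:} ;;
        *)   port=6379 ;;    # Redis default port
    esac
    printf '%s %s\n' "$host" "$port"
}

broker_host_port 'redis://redis.internal:6379/0'
# prints: redis.internal 6379
```

The two fields can then be passed to `redis-cli -h HOST -p PORT ping`.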
Step 6: Fix Health Check to Include Workers
Add worker health to your monitoring:
```bash
# Create custom health check script
cat > /usr/local/bin/tower_health_check.sh << 'EOF'
#!/bin/bash
set -e

TOWER_URL="https://localhost"
AUTH="admin:password"

# Check API health
API_STATUS=$(curl -sk -u "$AUTH" "$TOWER_URL/api/v2/ping/" | jq -r '.status // "error"')
if [ "$API_STATUS" != "healthy" ]; then
    echo "FAIL: API status is $API_STATUS"
    exit 1
fi

# Check worker capacity
CAPACITY=$(curl -sk -u "$AUTH" "$TOWER_URL/api/v2/instances/localhost/" | jq -r '.capacity // 0')
if [ "$CAPACITY" -eq 0 ]; then
    echo "FAIL: Worker capacity is 0"
    exit 1
fi

# Check Celery workers
# grep -c already prints 0 when nothing matches but exits non-zero, so use
# "|| true"; "|| echo 0" would add a second 0 and break the numeric test
WORKER_COUNT=$(sudo -u awx celery -A awx inspect ping --timeout=5 2>/dev/null | grep -c "OK" || true)
if [ "$WORKER_COUNT" -eq 0 ]; then
    echo "FAIL: No Celery workers responding"
    exit 1
fi

echo "OK: Tower is healthy with $WORKER_COUNT workers and capacity $CAPACITY"
exit 0
EOF

chmod +x /usr/local/bin/tower_health_check.sh

# Test the health check
/usr/local/bin/tower_health_check.sh
```
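Counting responders with `grep -c` has a subtlety worth isolating. When nothing matches, `grep -c` still prints `0` but exits with status 1, which matters under `set -e` and when chaining a fallback. The sketch below wraps the count in a small function, and `count_ok` is an illustrative name. The sample ping output used in the usage comment is an assumed text format, not captured from a live system.

```shell
# Hypothetical helper: count lines containing "OK" on stdin, always exiting 0.
count_ok() {
    # grep -c prints the count itself (including 0); "|| true" only
    # neutralizes the non-zero exit status grep returns on zero matches.
    grep -c 'OK' || true
}

# Usage (not executed here):
#   sudo -u awx celery -A awx inspect ping --timeout=5 2>/dev/null | count_ok
```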
Step 7: Configure Load Balancer Health Checks
Update your load balancer to check worker health:
```nginx
# nginx health check configuration
# /etc/nginx/conf.d/tower-health.conf

server {
    listen 8080;
    server_name localhost;

    location /health {
        access_log off;

        # Check API
        set $api_up 0;
        proxy_pass http://127.0.0.1:8081/api/v2/ping/;
        proxy_intercept_errors on;

        # Return 503 if workers not available
        # This requires custom logic - use external script
    }
}
```
For HAProxy:
```haproxy
# /etc/haproxy/haproxy.cfg

backend tower_servers
    balance roundrobin
    option httpchk GET /api/v2/instances/localhost/ HTTP/1.1\r\nHost:\ localhost
    http-check expect status 200
    # Match a non-zero capacity; a plain '"capacity":' string match would
    # also pass when capacity is 0
    http-check expect rstring '"capacity":[ ]*[1-9]'
    server tower01 10.0.1.10:8080 check
    server tower02 10.0.1.11:8080 check
```
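A load balancer can only act on an HTTP status, so an external health script needs its exit code translated into a response. The sketch below does only that translation. The function name `http_health_response` is illustrative, and how the response gets served is left open: a small listener such as socat or xinetd is an assumption, not something the configurations above provide.

```shell
# Hypothetical helper: print a minimal HTTP/1.1 response for a check result.
# Usage: http_health_response EXIT_STATUS BODY_TEXT
http_health_response() {
    status=$1
    body=$2
    if [ "$status" -eq 0 ]; then
        line='200 OK'
    else
        line='503 Service Unavailable'
    fi
    # ${#body} counts characters, which equals bytes for ASCII messages.
    printf 'HTTP/1.1 %s\r\nContent-Type: text/plain\r\nContent-Length: %s\r\nConnection: close\r\n\r\n%s' \
        "$line" "${#body}" "$body"
}

# Usage (not executed here):
#   body=$(/usr/local/bin/tower_health_check.sh 2>&1)
#   http_health_response "$?" "$body"
```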
Step 8: Trigger Worker Reconnection
Force workers to reconnect to the current instance:
```bash
# Ask all workers to restart their pools
# (pool_restart only works when the worker_pool_restarts setting is enabled)
sudo -u awx celery -A awx control pool_restart

# Or restart workers via supervisor with grace period
sudo supervisorctl restart tower-worker:*
sleep 10
sudo supervisorctl restart tower-callback-receiver:*
sleep 10
sudo supervisorctl restart tower-dispatcher:*

# Verify with inspect (--json makes the output parseable by jq)
sudo -u awx celery -A awx inspect --json stats | jq '.'

# Test job execution
curl -X POST -k -u admin:password \
  -H "Content-Type: application/json" \
  https://localhost/api/v2/job_templates/1/launch/

# Check job started running
curl -k -u admin:password "https://localhost/api/v2/jobs/?status=running" | jq '.count'
# Should be > 0
```
Verification
Confirm all workers are connected and processing:
```bash
# Check worker registration (--json makes the output parseable by jq)
sudo -u awx celery -A awx inspect --json ping | jq
# Expected: {"celery@tower-primary": {"ok": "pong"}}

# Check worker stats
sudo -u awx celery -A awx inspect --json stats | jq '.["celery@tower-primary"].pool'

# Check instance capacity
curl -k -u admin:password https://localhost/api/v2/instances/localhost/ | jq '{capacity, consumed_capacity}'
# Expected: {"capacity": 100, "consumed_capacity": 0-100}

# Check jobs are executing
curl -k -u admin:password "https://localhost/api/v2/jobs/?status=running" | jq '.count'
# Should match expected running jobs

# Monitor job queue processing
watch -n 5 'redis-cli LLEN celery'
# Queue depth should be decreasing as jobs execute

# Run health check script
/usr/local/bin/tower_health_check.sh
# Expected: OK: Tower is healthy with N workers and capacity M

# Check callback receiver is processing (URL quoted so the shell ignores "?")
curl -k -u admin:password "https://localhost/api/v2/jobs/?order_by=-created" | \
  jq '.results[0] | {id, status, job_events_count}'
# job_events_count should be increasing for running jobs
```
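Watching the queue depth by eye works interactively. For an unattended check, the same idea can be reduced to comparing the first and last of a few samples. The following is a sketch with an illustrative function name, `queue_trend`. It looks only at the endpoints, so a brief spike in the middle goes unnoticed.

```shell
# Hypothetical helper: read queue depth samples (one number per line) on
# stdin and print draining, growing, flat, or unknown (fewer than 2 samples).
queue_trend() {
    awk 'NR == 1 { first = $1 + 0 }
         { last = $1 + 0 }
         END {
             if (NR < 2) print "unknown"
             else if (last < first) print "draining"
             else if (last > first) print "growing"
             else print "flat"
         }'
}

# Usage (not executed here): three samples, five seconds apart
#   for i in 1 2 3; do redis-cli LLEN celery; sleep 5; done | queue_trend
```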
Related Issues
- [ansible-celery-worker-connection-failures](/articles/ansible-celery-worker-connection-failures)
- [ansible-job-stuck-pending-no-workers](/articles/ansible-job-stuck-pending-no-workers)
- [ansible-tower-ha-instance-not-accepting-jobs](/articles/ansible-tower-ha-instance-not-accepting-jobs)