Ansible Rollout Health Looks Green but B

Tower rollout shows healthy status but background workers, callback receivers, and event consumers fail to reconnect to the new instance.

Published: Feb 12, 2026 | 10 min read | By FixWikiHub Editorial Team


Introduction

During Ansible Tower or AWX rolling updates, instance health checks may report "green" status while background consumers like Celery workers, callback receivers, and event processors fail to reconnect to the new instance. This creates a silent failure where the Tower UI and API work correctly, but jobs never execute, callbacks go unprocessed, and scheduled tasks don't run. The health check passes because it only verifies the web layer, not the background worker connections.

This typically occurs after:

- Tower version upgrades with rolling deployment
- Scaling instances up or down
- Network or firewall changes during maintenance
- Redis or RabbitMQ broker changes
- Changing the Tower instance hostname

Symptoms

Tower UI shows healthy status but jobs don't run:

```
# Tower UI Dashboard
# Instances: All green checkmarks
# Jobs: Stuck in "Pending" state indefinitely

# API health check returns OK
$ curl -k -u admin:password https://tower.example.com/api/v2/ping/ | jq
{
  "ha": true,
  "version": "4.3.0",
  "active_node": "tower-primary",
  "status": "healthy"
}
```

But background workers show connection errors:

```
# Celery worker logs
2024-03-15 14:23:45,123 ERROR celery.worker Loss of connection to Redis broker
2024-03-15 14:24:12,456 WARNING celery.worker Trying to reconnect in 2.00 seconds...
2024-03-15 14:25:01,789 ERROR celery.consumer Cannot connect to redis://redis.internal:6379/0: Error 111 connecting to redis.internal:6379. Connection refused.
2024-03-15 14:26:34,012 CRITICAL celery.worker Stopping worker due to excessive connection failures

# Callback receiver logs
2024-03-15 14:27:15,345 ERROR awx.main.callbacks Callback receiver disconnected from dispatcher
2024-03-15 14:28:45,678 WARNING awx.main.dispatch No active callback receivers registered

# Dispatcher logs
2024-03-15 14:29:01,234 ERROR awx.main.dispatch Job dispatched but no workers available
2024-03-15 14:30:12,567 WARNING awx.main.tasks Job 12345 waiting for capacity (0 workers available)
```

Checking worker status shows they're not connected:

```bash
# Check Celery worker status
$ sudo -u awx celery -A awx inspect active
Error: No nodes replied within time window.

$ sudo -u awx celery -A awx inspect ping
Error: No nodes replied within time window.

# Check Tower instance capacity
$ curl -k -u admin:password https://localhost/api/v2/instances/ | jq '.results[] | {hostname: .hostname, capacity: .capacity, consumed_capacity: .consumed_capacity}'
{
  "hostname": "tower-primary",
  "capacity": 0,  # Should be > 0
  "consumed_capacity": 0
}

# Check job queue depth growing
$ redis-cli LLEN celery
(integer) 1523  # Jobs stuck in queue

# Tower dispatcher shows no workers
$ curl -k -u admin:password https://localhost/api/v2/instances/localhost/ | jq '.related.jobs'
"https://tower.example.com/api/v2/instances/localhost/jobs/"

$ curl -k -u admin:password https://tower.example.com/api/v2/instances/localhost/jobs/ | jq '.count'
0  # No jobs running despite 1523 in queue
```
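The three numbers gathered above (instance capacity, queue depth, and running job count) can be combined into a single verdict. The following is a minimal sketch, not part of Tower or AWX: the function name `classify_stall` and its three labels are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the rollout state from three numbers
# collected with the commands above. Not part of Tower or AWX.
#   $1 = instance capacity   (from /api/v2/instances/)
#   $2 = queue depth         (from `redis-cli LLEN celery`)
#   $3 = running job count   (from the instance jobs endpoint)
classify_stall() {
    local capacity=$1 queue_depth=$2 running=$3

    if [ "$capacity" -eq 0 ] && [ "$queue_depth" -gt 0 ]; then
        # Work is queued but nothing can pick it up: the silent failure.
        echo "stalled"
    elif [ "$queue_depth" -gt 0 ] && [ "$running" -eq 0 ]; then
        # Capacity is reported, yet nothing is executing; keep watching.
        echo "suspect"
    else
        echo "ok"
    fi
}

# Values from the example output above.
classify_stall 0 1523 0
```

With the example values (capacity 0, 1523 queued, 0 running) the helper prints `stalled`, which is the state this guide addresses.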

Common Causes

Background consumers fail to reconnect due to several factors:

1. Health check only verifies web layer: Tower's default health check endpoint (/api/v2/ping/) only verifies the web service and database connection, not Celery workers or callback receivers.
2. Stale Celery worker registration: Workers register themselves in Redis with their hostname. After a rollout with hostname change, the old registrations remain, and dispatchers try to route to non-existent workers.
3. Redis connection pool exhaustion: During reconnect attempts, workers may exhaust the Redis connection pool, preventing new connections.
4. Broker URL mismatch: Celery workers may have a cached or misconfigured broker URL after the rollout, connecting to the old Redis instance.
5. Firewall rules not updated: New instance IPs may not be allowed to connect to Redis/RabbitMQ.
6. Worker process not restarted: Rolling deployments may not properly restart worker processes, leaving them in a disconnected state.
7. Instance capacity not recalculated: Tower's capacity tracking is based on worker registration. If workers don't reconnect, capacity shows as 0.

Step-by-Step Fix

Step 1: Diagnose Worker State

Identify which workers are connected:

```bash
# Check Celery-related keys in Redis
redis-cli --scan --pattern "celery*" | head -20

# View delivered-but-unacknowledged messages
# ("unacked" is a hash, "unacked_index" is a sorted set)
redis-cli HGETALL "unacked"
redis-cli ZRANGE "unacked_index" 0 -1

# Check Celery worker inspect (--json so that jq can parse the output)
sudo -u awx celery -A awx inspect active --json 2>/dev/null | jq 'keys'
sudo -u awx celery -A awx inspect registered --json 2>/dev/null | jq .

# Check Tower instance capacity in database
sudo -u postgres psql -d awx -c "
SELECT hostname, capacity, consumed_capacity, enabled, managed
FROM main_instance;"

# Check worker processes
ps aux | grep -E "celery|awx-worker"

# Check supervisor status
sudo supervisorctl status

# Check Redis connections
redis-cli CLIENT LIST | grep -c "celery"
```
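The `supervisorctl status` output is easier to act on when it is reduced to the programs that are not running. A small sketch, assuming the usual two-column layout (program name, then state); the function name `not_running` and the sample program names are illustrative only.

```bash
#!/usr/bin/env bash
# Print the names of supervisor programs that are not in the RUNNING state.
# Reads `supervisorctl status` output on stdin: column 1 is the program
# name, column 2 is the state.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}

# Sample input; the program names are illustrative.
not_running <<'EOF'
tower-dispatcher                 RUNNING   pid 2101, uptime 0:12:44
tower-worker:tower-worker_00     FATAL     Exited too quickly (process log may have details)
tower-callback-receiver          STOPPED   Mar 15 02:23 PM
EOF
```

On a live system the same function can be fed directly: `sudo supervisorctl status | not_running`. An empty result means every listed program is running.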

Step 2: Clear Stale Worker Registrations

Remove old worker registrations from Redis:

```bash
# Connect to Redis
redis-cli

# List all Celery-related keys
KEYS celery*

# Delete the default "celery" queue
# (Be careful - this discards every task still waiting in that queue)
DEL celery

# Or more targeted - inspect the unacknowledged-message keys first
KEYS unacked*

# Purge unacked messages
DEL unacked
DEL unacked_index

# Clear kombu bindings
KEYS _kombu.binding.celery*
DEL _kombu.binding.celery

# Exit Redis
EXIT

# DEL does not expand wildcards, so remove the celeryev bindings from the shell
redis-cli --scan --pattern '_kombu.binding.celeryev.*' | xargs -r redis-cli DEL
```
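Because deleting keys is destructive, it can help to preview what a pattern matches before removing anything. The sketch below is a hypothetical wrapper (the name `del_by_pattern` and the `REDIS_CLI` override are assumptions, not Tower tooling) that lists matching keys by default and deletes them only when asked.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete Redis keys matching a glob pattern.
# Dry-run by default; pass "apply" as the second argument to delete.
# REDIS_CLI can be overridden, e.g. REDIS_CLI="redis-cli -h redis.internal".
REDIS_CLI=${REDIS_CLI:-redis-cli}

del_by_pattern() {
    local pattern=$1 mode=${2:-dry-run} key
    # --scan iterates with SCAN, which does not block the server like KEYS.
    $REDIS_CLI --scan --pattern "$pattern" | while IFS= read -r key; do
        if [ "$mode" = "apply" ]; then
            # </dev/null keeps the DEL call from consuming the key list.
            $REDIS_CLI DEL "$key" </dev/null >/dev/null && echo "deleted $key"
        else
            echo "would delete $key"
        fi
    done
}
```

A typical sequence is `del_by_pattern '_kombu.binding.celeryev.*'` to review the list, then the same call with `apply` appended once the list looks right.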

Step 3: Restart Worker Processes

Force workers to re-register:

```bash
# Restart all Tower workers
sudo supervisorctl restart tower-worker:*

# Or restart specific worker groups
sudo supervisorctl restart tower-worker-callbacks
sudo supervisorctl restart tower-worker-default
sudo supervisorctl restart tower-dispatcher

# For Tower installed via setup.sh
sudo ansible-tower-service restart

# Wait for workers to start
sleep 30

# Verify workers are running
sudo supervisorctl status | grep -E "RUNNING|FATAL"

# Check worker logs for successful registration
sudo tail -f /var/log/tower/celery.log | grep -i "ready"
# Should see: "celery@tower-primary ready."
```
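A fixed `sleep 30` either waits too long or not long enough. One alternative is to poll a readiness check until it succeeds or a deadline passes. This is a generic bash sketch; the name `wait_until` is an assumption, and the readiness command you pass to it is up to you.

```bash
#!/usr/bin/env bash
# Poll a command until it succeeds or a deadline passes.
#   $1 = timeout in seconds, $2 = interval in seconds, rest = command
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    # SECONDS is a bash builtin counting seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, `wait_until 60 5 sudo -u awx celery -A awx inspect ping --timeout=5` retries the ping every 5 seconds for up to a minute and returns non-zero if no worker ever answers.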

Step 4: Update Tower Instance Registration

Refresh instance capacity in Tower:

```bash
# Update instance heartbeat manually
curl -X POST -k -u admin:password \
  https://localhost/api/v2/instances/localhost/heartbeat/

# Or via management command
sudo -u awx awx-manage update_instance_capacity

# Check instance status
curl -k -u admin:password https://localhost/api/v2/instances/ | jq '.results[] | {hostname: .hostname, capacity: .capacity}'

# Expected output:
{
  "hostname": "tower-primary",
  "capacity": 100  # Should be > 0
}
```

Step 5: Verify Broker Configuration

Ensure workers are connecting to the correct broker:

```bash
# Check Tower settings for broker URL
grep -E "CELERY_BROKER|BROKER_URL" /etc/tower/settings.py

# Should match your actual broker (-i because the setting name is upper case)
grep -i broker /etc/tower/settings.py
# BROKER_URL = 'redis://redis.internal:6379/0'

# Verify Redis is accessible
redis-cli -h redis.internal ping
# Should return: PONG

# Check network connectivity
telnet redis.internal 6379
# Should connect successfully

# Verify firewall rules allow connection
sudo iptables -L -n | grep 6379
# Or for firewalld
sudo firewall-cmd --list-ports
```
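To test exactly the broker that the settings file names, rather than a host typed from memory, the URL can be split into its parts first. A minimal sketch, assuming the plain `redis://host:port/db` form shown above; `parse_broker_url` is a hypothetical name, and credentials or query strings are not handled.

```bash
#!/usr/bin/env bash
# Split a redis:// broker URL into "host port db".
# Handles the plain form used in this guide: redis://host:port/db
# (credentials and query strings are not handled in this sketch).
parse_broker_url() {
    local url=$1 rest hostport host port db
    rest=${url#redis://}          # strip the scheme
    hostport=${rest%%/*}          # everything before the first slash
    case $rest in
        */*) db=${rest#*/} ;;     # text after the first slash
        *)   db=0 ;;
    esac
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport##*:} ;;
        *)   port=6379 ;;         # Redis default port
    esac
    echo "$host $port ${db:-0}"
}

parse_broker_url 'redis://redis.internal:6379/0'
```

The three fields can then drive the connectivity test, for example `read -r host port db <<< "$(parse_broker_url "$url")"` followed by `redis-cli -h "$host" -p "$port" -n "$db" ping`.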

Step 6: Fix Health Check to Include Workers

Add worker health to your monitoring:

```bash
# Create custom health check script
cat > /usr/local/bin/tower_health_check.sh << 'EOF'
#!/bin/bash
set -e

TOWER_URL="https://localhost"
AUTH="admin:password"

# Check API health
API_STATUS=$(curl -sk -u "$AUTH" "$TOWER_URL/api/v2/ping/" | jq -r '.status // "error"')
if [ "$API_STATUS" != "healthy" ]; then
    echo "FAIL: API status is $API_STATUS"
    exit 1
fi

# Check worker capacity (treat an empty response as 0)
CAPACITY=$(curl -sk -u "$AUTH" "$TOWER_URL/api/v2/instances/localhost/" | jq -r '.capacity // 0')
CAPACITY=${CAPACITY:-0}
if [ "$CAPACITY" -eq 0 ]; then
    echo "FAIL: Worker capacity is 0"
    exit 1
fi

# Check Celery workers
# grep -c already prints 0 when nothing matches, so only mask its exit status
WORKER_COUNT=$(sudo -u awx celery -A awx inspect ping --timeout=5 2>/dev/null | grep -c "OK" || true)
if [ "$WORKER_COUNT" -eq 0 ]; then
    echo "FAIL: No Celery workers responding"
    exit 1
fi

echo "OK: Tower is healthy with $WORKER_COUNT workers and capacity $CAPACITY"
exit 0
EOF

chmod +x /usr/local/bin/tower_health_check.sh

# Test the health check
/usr/local/bin/tower_health_check.sh
```

Step 7: Configure Load Balancer Health Checks

Update your load balancer to check worker health:

```nginx
# nginx health check configuration
# /etc/nginx/conf.d/tower-health.conf

server {
    listen 8080;
    server_name localhost;

    location /health {
        access_log off;

        # Check API
        set $api_up 0;
        proxy_pass http://127.0.0.1:8081/api/v2/ping/;
        proxy_intercept_errors on;

        # Return 503 if workers not available
        # This requires custom logic - use external script
    }
}
```

For HAProxy:

```haproxy
# /etc/haproxy/haproxy.cfg

backend tower_servers
    balance roundrobin
    option httpchk GET /api/v2/instances/localhost/ HTTP/1.1\r\nHost:\ localhost
    http-check expect status 200
    # Matching '"capacity":' alone also passes when capacity is 0,
    # so require a non-zero leading digit
    http-check expect rstring '"capacity":[ ]*[1-9]'
    server tower01 10.0.1.10:8080 check
    server tower02 10.0.1.11:8080 check
```
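The nginx configuration above leaves the worker check to an external script. One way to connect the two is to run the health check on a schedule and record the result in a small status file that the web server or load balancer can read. The sketch below shows that idea only; `publish_health` and the status file path in the usage note are assumptions.

```bash
#!/usr/bin/env bash
# Run a health check command and record the result in a status file that
# a load balancer or web server can poll. The file is written under a
# temporary name first and then renamed, so readers never see a partial
# write.
#   $1 = status file path, rest = health check command
publish_health() {
    local status_file=$1
    shift
    local tmp="${status_file}.tmp.$$" state
    if "$@" >/dev/null 2>&1; then
        state="ok"
    else
        state="fail"
    fi
    printf '%s\n' "$state" > "$tmp" && mv -f "$tmp" "$status_file"
}
```

Called from cron, for example as `publish_health /var/run/tower_health.status /usr/local/bin/tower_health_check.sh`, the file always holds either `ok` or `fail`, and the result lags real state by at most one scheduling interval.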

Step 8: Trigger Worker Reconnection

Force workers to reconnect to the current instance:

```bash
# Ask all workers to restart their pools
# (requires worker_pool_restarts to be enabled in the Celery configuration)
sudo -u awx celery -A awx control pool_restart

# Or restart workers via supervisor with grace period
sudo supervisorctl restart tower-worker:*
sleep 10
sudo supervisorctl restart tower-callback-receiver:*
sleep 10
sudo supervisorctl restart tower-dispatcher:*

# Verify with inspect (--json so that jq can parse the output)
sudo -u awx celery -A awx inspect stats --json | jq '.'

# Test job execution
curl -X POST -k -u admin:password \
  -H "Content-Type: application/json" \
  https://localhost/api/v2/job_templates/1/launch/

# Check job started running
curl -k -u admin:password "https://localhost/api/v2/jobs/?status=running" | jq '.count'
# Should be > 0
```

Verification

Confirm all workers are connected and processing:

```bash
# Check worker registration (--json so that jq can parse the output)
sudo -u awx celery -A awx inspect ping --json | jq .
# Expected: {"celery@tower-primary": {"ok": "pong"}}

# Check worker stats
sudo -u awx celery -A awx inspect stats --json | jq '.["celery@tower-primary"].pool'

# Check instance capacity
curl -k -u admin:password https://localhost/api/v2/instances/localhost/ | jq '{capacity, consumed_capacity}'
# Expected: {"capacity": 100, "consumed_capacity": 0-100}

# Check jobs are executing
curl -k -u admin:password "https://localhost/api/v2/jobs/?status=running" | jq '.count'
# Should match expected running jobs

# Monitor job queue processing
watch -n 5 'redis-cli LLEN celery'
# Queue depth should be decreasing as jobs execute

# Run health check script
/usr/local/bin/tower_health_check.sh
# Expected: OK: Tower is healthy with N workers and capacity M

# Check callback receiver is processing (URL quoted so the shell leaves "?" alone)
curl -k -u admin:password "https://localhost/api/v2/jobs/?order_by=-created" | \
  jq '.results[0] | {id, status, job_events_count}'
# job_events_count should be increasing for running jobs
```
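Watching `LLEN` by eye works, but a scripted check is easier to reuse in monitoring. The sketch below compares the first and last of several queue-depth samples and names the trend; `queue_trend` is a hypothetical helper and the sample values are illustrative.

```bash
#!/usr/bin/env bash
# Read queue-depth samples (one integer per line, oldest first) and report
# whether the queue is draining, growing, or flat between first and last.
queue_trend() {
    awk '
        NR == 1 { first = $1 + 0 }
        { last = $1 + 0 }
        END {
            if (NR < 2)            print "insufficient"
            else if (last < first) print "draining"
            else if (last > first) print "growing"
            else                   print "flat"
        }'
}

# Illustrative samples, oldest first.
printf '%s\n' 1523 1490 1402 | queue_trend
```

Live samples can be piped in the same way, for example `for i in 1 2 3; do redis-cli LLEN celery; sleep 5; done | queue_trend`. A result of `draining` after the fix suggests workers are consuming the backlog again.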

[ansible-celery-worker-connection-failures](/articles/ansible-celery-worker-connection-failures)
[ansible-job-stuck-pending-no-workers](/articles/ansible-job-stuck-pending-no-workers)
[ansible-tower-ha-instance-not-accepting-jobs](/articles/ansible-tower-ha-instance-not-accepting-jobs)


FAQ

Ansible Troubleshooting FAQs

Common questions about troubleshooting and preventing similar issues

How do I know if this troubleshooting guide applies to my situation?

This guide applies when the Tower UI and API report healthy instances but jobs stay in "Pending" and background workers log connection errors, as described under Symptoms. Start with the diagnostics in Step 1 and work through the common causes in order.

Is it safe to follow these troubleshooting steps?

The diagnostic commands are read-only, but several steps are disruptive: Step 2 deletes Redis keys, including queued tasks, and Steps 3 and 8 restart worker services. Create backups before making changes, and test each step in a non-production environment first.

How long does it typically take to resolve this type of issue?

Most issues of this kind can be resolved within 30 minutes to 2 hours, depending on the complexity and root cause. Follow the troubleshooting flow to identify and fix the problem efficiently.

How can I prevent this issue from happening again?

Include worker state in your health checks (Steps 6 and 7) so that a rollout is not reported healthy until workers have reconnected. Alerting on zero instance capacity or a steadily growing queue helps catch a recurrence early.

Written by

FixWikiHub Editorial Team

Our editorial team consists of experienced DevOps engineers, systems administrators, and cloud architects with hands-on experience in production environments across AWS, Azure, GCP, and on-premises infrastructure.

Every guide undergoes technical review for accuracy and is updated when software versions, commands, or best practices change.

Last updated: Feb 12, 2026


Important Notice

Disclaimer & Safety Guidelines

The troubleshooting steps in this guide are provided for educational and informational purposes. Before applying any changes to production systems:

Test in a staging environment first — Always verify commands and configurations in a non-production environment before deploying to live systems.
Create backups — Ensure you have current backups of databases, configurations, and critical files before making changes.
Understand the impact — Review how each step may affect your specific environment, dependencies, and users.
Consult official documentation — This guide supplements, but does not replace, official vendor documentation and best practices.

FixWikiHub is not responsible for any damages arising from the use of this content. See our Terms of Use for more information.

