Introduction
Ansible Tower and AWX use background workers (Celery workers) to process asynchronous tasks like job launches, project updates, inventory imports, and system maintenance. When these workers encounter errors that they interpret as transient, they may enter an infinite retry loop, continuously attempting to process the same failed task. This consumes worker slots, blocks the task queue, and prevents new jobs from being scheduled.
The issue typically manifests after infrastructure changes (database migrations, network topology updates, credential rotations) that create conditions the retry logic doesn't account for. Workers retry indefinitely because the error condition is permanent, but the exception handling treats it as recoverable.
Symptoms
Tower task logs show repeated attempts to process the same task:
$ kubectl logs -n awx deployment/awx-task --tail=100 | grep -i retry
[2026-04-11 14:36:22] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)
[2026-04-11 14:36:25] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)
[2026-04-11 14:36:28] awx.main.tasks WARNING Retrying task projects_15_update (attempt 3/3)
[2026-04-11 14:36:31] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)
[2026-04-11 14:36:34] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)
The Tower UI shows jobs stuck in "Pending" or "Running" state indefinitely:
Jobs Dashboard:
Running: 5 (one for 3+ hours)
Pending: 127
Successful: 842
PostgreSQL shows a long-running statement left by the stuck transaction:
```sql
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active' AND query LIKE '%main_job%';

  pid  | state  |          query_start          |               query
-------+--------+-------------------------------+-----------------------------------
 28451 | active | 2026-04-11 11:22:15.123456+00 | UPDATE main_job SET status = 'r...
(1 row)
-- Running for 3+ hours
```
Celery worker status shows tasks in infinite retry:
$ celery -A awx inspect active
-> awx-task-1: OK
* {"id": "a1b2c3d4-uuid", "name": "awx.main.tasks.run_job_launch",
   "retries": 847, "state": "RETRY"}
The RabbitMQ or Redis queue depth grows without bound:
```bash
# For Redis-backed Celery
$ redis-cli LLEN celery
(integer) 1547

# For RabbitMQ
$ rabbitmqctl list_queues name messages
celery 1547
```
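The log symptom in this section, where the attempt counter reaches its limit and then starts again at 1, can be detected mechanically. The following is a minimal sketch that assumes the log line format shown in the sample task log; `find_looping_tasks` is a hypothetical helper, not part of AWX or Tower.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [2026-04-11 14:36:22] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)
RETRY_LINE = re.compile(
    r"Retrying task (?P<task>\S+) \(attempt (?P<n>\d+)/(?P<max>\d+)\)"
)


def find_looping_tasks(lines):
    """Return {task_name: reset_count} for tasks whose attempt counter restarted."""
    last_attempt = {}
    resets = defaultdict(int)
    for line in lines:
        match = RETRY_LINE.search(line)
        if not match:
            continue
        task, attempt = match["task"], int(match["n"])
        # A counter that drops back (for example 3/3 followed by 1/3) means the
        # task was queued again after exhausting its retries.
        if task in last_attempt and attempt <= last_attempt[task]:
            resets[task] += 1
        last_attempt[task] = attempt
    return dict(resets)


sample = [
    "[2026-04-11 14:36:22] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)",
    "[2026-04-11 14:36:25] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)",
    "[2026-04-11 14:36:28] awx.main.tasks WARNING Retrying task projects_15_update (attempt 3/3)",
    "[2026-04-11 14:36:31] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)",
    "[2026-04-11 14:36:34] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)",
]
print(find_looping_tasks(sample))  # {'projects_15_update': 1}
```

Any task that appears in the result has restarted its retry cycle at least once, which separates a genuine loop from a task that simply used up its retries and failed.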
Common Causes
1. Misconfigured Retry Decorator Without Maximum
Task definitions may have retry logic that doesn't enforce a maximum retry count:
# From awx/main/tasks.py (problematic pattern)
@task(bind=True, max_retries=None) # No limit!
def update_project(self, project_id):
    try:
        project = Project.objects.get(id=project_id)
        project.update()
    except Exception as exc:
        # Retries indefinitely because max_retries=None
        raise self.retry(exc=exc)
2. Permanent Errors Classified as Transient
The retry logic catches all exceptions and retries, but some errors (like "project not found" after deletion) are permanent:
@task(bind=True, max_retries=3)
def run_job(self, job_id):
    job = Job.objects.get(id=job_id)
    if not job.project:
        raise Exception("Project deleted")  # Permanent error, but caught by retry
3. Database Lock Contention Creating Phantom Failures
When multiple workers try to update the same job record, one gets a lock timeout and retries:
# Lock timeout causes retry loop
DatabaseError: lock wait timeout exceeded
    at awx/main/models/jobs.py:245 in update_status
4. Celery Beat Periodic Task Overlap
Scheduled tasks queue faster than they can complete, creating a backlog:
# Celery beat schedule
CELERYBEAT_SCHEDULE = {
    'cleanup-old-jobs': {
        'task': 'awx.main.tasks.cleanup_jobs',
        'schedule': 60.0,  # Every 60 seconds
    },
}
# If cleanup takes 90 seconds, tasks queue up
Step-by-Step Fix
Step 1: Identify the Stuck Task
Find which task is in the retry loop:
```bash
# Check Celery worker active tasks
celery -A awx inspect active

# For Tower installed via package
sudo rabbitmqctl list_queues name messages consumers

# For AWX on Kubernetes with Redis
kubectl exec -n awx deployment/awx-task -- redis-cli --scan --pattern "celery-task-meta-*" | head -20

# Check PostgreSQL for jobs stuck in running or pending
# (name, status, and started are stored on main_unifiedjob, the parent table of main_job)
psql -U awx -d awx -c "
SELECT id, name, status, started
FROM main_unifiedjob
WHERE status IN ('running', 'pending')
ORDER BY started;
"
```
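Once the active-task records are collected, the runaway task is usually the one with an implausible retry count. The sketch below assumes the records have already been loaded into dictionaries shaped like the sample in the Symptoms section (`id`, `name`, `retries`, `state`); it does not parse CLI output, and `flag_runaway_tasks` and the second sample record are hypothetical.

```python
def flag_runaway_tasks(active_tasks, max_retries=5):
    """Return (id, name, retries) for tasks retried more than max_retries times, worst first."""
    runaway = [task for task in active_tasks if task.get("retries", 0) > max_retries]
    runaway.sort(key=lambda task: task["retries"], reverse=True)
    return [(task["id"], task["name"], task["retries"]) for task in runaway]


active = [
    {"id": "a1b2c3d4-uuid", "name": "awx.main.tasks.run_job_launch",
     "retries": 847, "state": "RETRY"},
    # Hypothetical healthy task, for contrast
    {"id": "e5f6-example", "name": "awx.main.tasks.cleanup_jobs",
     "retries": 1, "state": "STARTED"},
]
print(flag_runaway_tasks(active))
# [('a1b2c3d4-uuid', 'awx.main.tasks.run_job_launch', 847)]
```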
Step 2: Terminate the Stuck Task
Remove the poisoned task from the queue:
```bash
# Revoke the specific task by ID
celery -A awx control revoke <task-uuid> --terminate

# Or through Tower API
curl -X POST -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/jobs/12345/cancel/"

# Clear all pending tasks from Redis
# (destructive: FLUSHDB deletes every key in the selected database, including healthy queued work)
kubectl exec -n awx deployment/awx-task -- redis-cli FLUSHDB

# For RabbitMQ, purge the queue
rabbitmqctl purge_queue celery
```
Kill the stuck database transaction:
```sql
-- Find the blocking PID
SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%';

-- Terminate the connection
SELECT pg_terminate_backend(<pid>);
```
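The query above returns every active statement, including healthy ones that started a moment ago, so it helps to narrow the candidates before calling `pg_terminate_backend`. The sketch below assumes the `pg_stat_activity` rows have already been fetched into dictionaries with timezone-aware `query_start` values. The helper name, the one-hour threshold, and the second sample row are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def long_running_pids(rows, now, threshold=timedelta(hours=1)):
    """Return PIDs of active queries that started more than `threshold` before `now`."""
    return [
        row["pid"]
        for row in rows
        if row["state"] == "active" and now - row["query_start"] > threshold
    ]


rows = [
    # Mirrors the stuck statement from the Symptoms section
    {"pid": 28451, "state": "active",
     "query_start": datetime(2026, 4, 11, 11, 22, 15, tzinfo=timezone.utc)},
    # Hypothetical healthy statement that started seconds ago
    {"pid": 30112, "state": "active",
     "query_start": datetime(2026, 4, 11, 14, 35, 50, tzinfo=timezone.utc)},
]
now = datetime(2026, 4, 11, 14, 36, 31, tzinfo=timezone.utc)
print(long_running_pids(rows, now))  # [28451]
```

Only the statement that has been active for more than three hours is selected; the recent one is left alone.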
Step 3: Configure Proper Retry Limits
Update Tower task settings to enforce retry limits:
```python
# /etc/tower/settings.py or custom settings

# Configure Celery task retry limits
CELERY_TASK_SOFT_TIME_LIMIT = 3600  # 1 hour soft limit
CELERY_TASK_TIME_LIMIT = 7200       # 2 hour hard limit

# Maximum retries for all tasks
CELERY_TASK_MAX_RETRIES = 5

# Retry delay with exponential backoff
CELERY_TASK_DEFAULT_RETRY_DELAY = 60  # Start at 60 seconds
CELERY_TASK_MAX_RETRY_DELAY = 300     # Max 5 minutes between retries
```
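To see what these limits amount to, the worst-case wait can be worked out directly. The sketch below assumes the delay doubles after each retry, which the settings above do not state explicitly; `retry_schedule` is a hypothetical helper used only for the arithmetic.

```python
def retry_schedule(max_retries=5, base_delay=60, max_delay=300):
    """Return the delay in seconds before each retry, doubling up to max_delay."""
    return [min(base_delay * 2 ** attempt, max_delay) for attempt in range(max_retries)]


delays = retry_schedule()
print(delays)       # [60, 120, 240, 300, 300]
print(sum(delays))  # 1020
```

With 5 retries, a 60 second initial delay, and a 300 second cap, a failing task spends at most 1020 seconds (17 minutes) waiting between attempts before it gives up, which is well inside the 1 hour soft time limit.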
For AWX deployed via operator, set environment variables:
# awx.yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_env:
    - name: CELERY_TASK_MAX_RETRIES
      value: "5"
    - name: CELERY_TASK_DEFAULT_RETRY_DELAY
      value: "60"
Step 4: Implement Idempotent Error Handling
Create a custom task base class that distinguishes permanent from transient errors:
```python
# custom_tasks.py - Add to Tower plugins
import logging

import requests
from celery import Task, shared_task
from django.core.exceptions import ObjectDoesNotExist, PermissionDenied
from django.db import DatabaseError

from awx.main.models import Inventory

logger = logging.getLogger(__name__)


class RetryableTask(Task):
    """Task base class with smart retry logic."""

    # Define which exceptions are retryable
    RETRYABLE_EXCEPTIONS = (
        ConnectionError,
        TimeoutError,
        requests.exceptions.RequestException,
        DatabaseError,  # But only certain database errors
    )

    # Define which exceptions should never retry
    PERMANENT_EXCEPTIONS = (
        ValueError,
        KeyError,
        TypeError,
        ObjectDoesNotExist,
        PermissionDenied,
    )

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        # Celery calls this hook after a retry is already scheduled, so it
        # only logs; the retry decision is made in retry_or_fail()
        logger.warning(
            f"Task {self.name} retry {self.request.retries}/{self.max_retries}: {exc}"
        )

    def retry_or_fail(self, exc):
        """Retry transient errors with backoff; re-raise everything else."""
        # Check if this is a permanent (or unclassified) error
        if isinstance(exc, self.PERMANENT_EXCEPTIONS) or not isinstance(
            exc, self.RETRYABLE_EXCEPTIONS
        ):
            logger.error(f"Task {self.name} hit permanent error: {exc}")
            raise exc  # Don't retry permanent errors

        # Check if we've exceeded retries
        if self.max_retries is not None and self.request.retries >= self.max_retries:
            logger.error(f"Task {self.name} exceeded max retries, giving up")
            raise exc  # Don't retry again

        raise self.retry(exc=exc, countdown=self.get_retry_delay())

    def get_retry_delay(self):
        """Exponential backoff: 60s, 120s, 240s, 300s, 300s"""
        return min(60 * (2 ** self.request.retries), 300)


# Usage in custom tasks
@shared_task(bind=True, base=RetryableTask, max_retries=5)
def update_inventory(self, inventory_id):
    try:
        inventory = Inventory.objects.get(id=inventory_id)
        inventory.update_sources()
    except Inventory.DoesNotExist:
        logger.error(f"Inventory {inventory_id} does not exist - not retrying")
        raise  # Permanent error: propagates without a retry
    except requests.exceptions.ConnectionError as exc:
        logger.warning(f"Connection error updating inventory {inventory_id}")
        self.retry_or_fail(exc)  # Transient error: retried with backoff
```
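The same permanent-versus-transient decision can be exercised without a broker or a Tower install, which is useful for checking the logic before deploying it. The sketch below uses only the standard library; `run_with_retries`, the two exception tuples, and the `flaky` function are hypothetical stand-ins for the task code above, not Celery APIs.

```python
import time

# Errors that retrying cannot fix
PERMANENT = (ValueError, KeyError, TypeError, PermissionError)
# Errors that may clear up on their own
TRANSIENT = (ConnectionError, TimeoutError)


def run_with_retries(func, max_retries=5, base_delay=60, max_delay=300, sleep=time.sleep):
    """Call func(); retry only transient errors, with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except PERMANENT:
            raise  # Fail immediately, no retry
        except TRANSIENT:
            if attempt == max_retries:
                raise  # Retry budget exhausted
            sleep(min(base_delay * 2 ** attempt, max_delay))


calls = {"count": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("broker unreachable")
    return "ok"


waits = []
result = run_with_retries(flaky, sleep=waits.append)  # Record delays instead of sleeping
print(result)  # ok
print(waits)   # [60, 120]
```

Passing `sleep=waits.append` records the delays instead of waiting, so the backoff sequence can be inspected instantly. Exceptions outside both tuples are not caught and propagate on the first failure, which keeps unknown errors from being retried by accident.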
Step 5: Add Health Checks and Automatic Recovery
Configure Tower to detect and recover stuck workers:
```python
# /etc/tower/settings.py

# Worker health check interval
AWX_WORKER_HEARTBEAT = 60  # Check every 60 seconds

# Kill workers that haven't checked in
AWX_WORKER_TIMEOUT = 300  # 5 minutes without heartbeat = dead

# Enable task result backend for tracking
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'
CELERY_TASK_TRACK_STARTED = True
CELERY_TASK_REPORTS = True
```
Create a monitoring playbook to detect stuck tasks:
```yaml
# monitor_workers.yml
- name: Monitor Tower worker health
  hosts: localhost
  vars:
    tower_url: "https://tower.example.com"
    max_job_duration: 7200  # 2 hours

  tasks:
    - name: Get running jobs
      uri:
        url: "{{ tower_url }}/api/v2/jobs/?status=running"
        method: GET
        headers:
          Authorization: "Bearer {{ tower_token }}"
        return_content: yes
      register: running_jobs

    - name: Find stuck jobs (running too long)
      set_fact:
        stuck_jobs: "{{ running_jobs.json.results | selectattr('started', 'defined') | selectattr('elapsed', '>', max_job_duration) | list }}"

    - name: Alert on stuck jobs
      debug:
        msg: "Job {{ item.id }} ({{ item.name }}) has been running for {{ item.elapsed }} seconds"
      loop: "{{ stuck_jobs }}"
      when: stuck_jobs | length > 0

    - name: Cancel stuck jobs
      uri:
        url: "{{ tower_url }}/api/v2/jobs/{{ item.id }}/cancel/"
        method: POST
        headers:
          Authorization: "Bearer {{ tower_token }}"
        status_code: [200, 202]  # The cancel endpoint typically answers 202 Accepted
      loop: "{{ stuck_jobs }}"
      when: stuck_jobs | length > 0
```
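The stuck-job filter in the playbook can also be expressed as a plain function, for use in a script or a unit test. This is a sketch that assumes the job list has already been fetched from `/api/v2/jobs/?status=running` and decoded from JSON; `find_stuck_jobs` and the sample records are hypothetical. Unlike the `defined` test in the playbook, it also skips jobs whose `started` value is null.

```python
def find_stuck_jobs(results, max_job_duration=7200):
    """Return jobs that have started and have run longer than max_job_duration seconds."""
    return [
        job
        for job in results
        if job.get("started") is not None and job.get("elapsed", 0) > max_job_duration
    ]


# Hypothetical API results, trimmed to the fields the filter reads
results = [
    {"id": 101, "name": "Deploy app", "started": "2026-04-11T11:22:15Z", "elapsed": 11656.0},
    {"id": 102, "name": "Patch hosts", "started": "2026-04-11T14:30:00Z", "elapsed": 391.0},
    {"id": 103, "name": "Sync inventory", "started": None, "elapsed": 0.0},
]
print([job["id"] for job in find_stuck_jobs(results)])  # [101]
```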
Verification
Run tests to confirm retry limits are working:
```bash
# Check Celery configuration
celery -A awx inspect conf | grep -i retry

# Start a test job that will fail
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "https://tower.example.com/api/v2/job_templates/1/launch/" \
  -d '{"extra_vars": "{\"fail_test\": true}"}'

# Watch the retry count
watch -n 5 "celery -A awx inspect active | grep retries"
# Should show retries incrementing to max_retries, then stopping

# Check worker logs for proper retry behavior
kubectl logs -n awx deployment/awx-task -f | grep -i retry
# Should see: "exceeded max retries, giving up"
```
Verify tasks complete or fail gracefully:
# Check job status after retries
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/jobs/" | jq '.results[] | {id, status, failed}'
Related Issues
- [ansible-dead-letter-queue-fills-because-a-poison-message-is-never-quarantined](/articles/ansible-dead-letter-queue-fills-because-a-poison-message-is-never-quarantined) - Poison message handling
- [ansible-job-runner-replays-completed-work-after-lease-expiry](/articles/ansible-job-runner-replays-completed-work-after-lease-expiry) - Job replay issues
- [ansible-queue-backlog-grows-because-ack-never-reaches-the-broker](/articles/ansible-queue-backlog-grows-because-ack-never-reaches-the-broker) - Queue management problems