WordPress troubleshooting: Ansible Background Worker Gets Stuck in

Fix Ansible Tower/AWX background workers stuck in infinite retry loops caused by transient error mishandling, missing retry limits, or poisoned task queues.

Published: Feb 12, 2026 | 9 min read | By FixWikiHub Editorial Team


Introduction

Ansible Tower and AWX use background workers (Celery workers) to process asynchronous tasks like job launches, project updates, inventory imports, and system maintenance. When these workers encounter errors that they interpret as transient, they may enter an infinite retry loop, continuously attempting to process the same failed task. This consumes worker slots, blocks the task queue, and prevents new jobs from being scheduled.

The issue typically manifests after infrastructure changes (database migrations, network topology updates, credential rotations) that create conditions the retry logic doesn't account for. Workers retry indefinitely because the error condition is permanent, but the exception handling treats it as recoverable.

Symptoms

Tower task logs show repeated attempts to process the same task:

bash

$ kubectl logs -n awx deployment/awx-task --tail=100 | grep -i retry
[2026-04-11 14:36:22] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)
[2026-04-11 14:36:25] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)
[2026-04-11 14:36:28] awx.main.tasks WARNING Retrying task projects_15_update (attempt 3/3)
[2026-04-11 14:36:31] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)
[2026-04-11 14:36:34] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)
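The telltale sign in the excerpt above is the attempt counter dropping from 3/3 back to 1/3. A minimal sketch for spotting that in a saved log follows. It assumes the line format matches the excerpt (not a documented AWX log schema), and `find_looping_tasks` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern derived from the sample lines above; the format is an assumption.
RETRY_LINE = re.compile(
    r"Retrying task (?P<task>\S+) \(attempt (?P<n>\d+)/(?P<max>\d+)\)"
)


def find_looping_tasks(log_lines):
    """Return {task_name: reset_count} for tasks whose attempt counter restarted."""
    last_attempt = {}
    resets = defaultdict(int)
    for line in log_lines:
        match = RETRY_LINE.search(line)
        if not match:
            continue
        task, attempt = match["task"], int(match["n"])
        # A counter that drops back (3/3 followed by 1/3) means the task was
        # queued again after exhausting its retries.
        if task in last_attempt and attempt <= last_attempt[task]:
            resets[task] += 1
        last_attempt[task] = attempt
    return dict(resets)


sample = [
    "[2026-04-11 14:36:22] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)",
    "[2026-04-11 14:36:25] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)",
    "[2026-04-11 14:36:28] awx.main.tasks WARNING Retrying task projects_15_update (attempt 3/3)",
    "[2026-04-11 14:36:31] awx.main.tasks WARNING Retrying task projects_15_update (attempt 1/3)",
    "[2026-04-11 14:36:34] awx.main.tasks WARNING Retrying task projects_15_update (attempt 2/3)",
]
print(find_looping_tasks(sample))  # {'projects_15_update': 1}
```

Any task with a non-zero reset count is a candidate for the stuck task identified in Step 1.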

The Tower UI shows jobs stuck in "Pending" or "Running" state indefinitely:

text

Jobs Dashboard:
  Running: 5 (one for 3+ hours)
  Pending: 127
  Successful: 842

PostgreSQL shows long-running active queries from the stuck transaction:

```sql
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE '%main_job%';
```

Celery worker status shows tasks in infinite retry:

bash

$ celery -A awx inspect active
-> awx-task-1: OK
    * {"id": "a1b2c3d4-uuid", "name": "awx.main.tasks.run_job_launch",
       "retries": 847, "state": "RETRY"}

The RabbitMQ/Redis queue depth grows unbounded:

```bash
# For Redis-backed Celery
$ redis-cli LLEN celery
(integer) 1547

# For RabbitMQ
$ rabbitmqctl list_queues name messages
celery 1547
```
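A single queue-depth reading says little; what matters is whether the depth keeps rising between readings. The sketch below assumes you collect depth samples yourself (for example by running `redis-cli LLEN celery` once a minute). `is_growing_unbounded` is a hypothetical helper, and the sample values other than 1547 are invented for illustration.

```python
def is_growing_unbounded(depth_samples, min_samples=3):
    """Return True when every sample is strictly larger than the previous one."""
    if len(depth_samples) < min_samples:
        # Too few readings to call it a trend
        return False
    pairs = zip(depth_samples, depth_samples[1:])
    return all(later > earlier for earlier, later in pairs)


# Hypothetical readings taken one minute apart
print(is_growing_unbounded([1200, 1380, 1547]))  # True
print(is_growing_unbounded([1547, 1500, 1547]))  # False
```

A queue that drains between readings, even partially, points to slow workers; a strictly rising queue points to workers that are blocked.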

Common Causes

1. Misconfigured Retry Decorator Without Maximum

Task definitions may have retry logic that doesn't enforce a maximum retry count:

python

# Illustrative problematic pattern (modeled on awx/main/tasks.py, not verbatim source)
@task(bind=True, max_retries=None)  # No limit!
def update_project(self, project_id):
    try:
        project = Project.objects.get(id=project_id)
        project.update()
    except Exception as exc:
        # Retries indefinitely because max_retries=None
        raise self.retry(exc=exc)
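To see why the missing limit matters, here is a small Celery-free simulation. It is a sketch only: `run_with_retries` and `always_fails` are hypothetical names, and `safety_cap` exists purely so the unlimited case terminates in a demonstration.

```python
def run_with_retries(operation, max_retries, safety_cap=1000):
    """Call operation until it succeeds, retries run out, or safety_cap is hit.

    Returns (succeeded, attempts). max_retries=None mimics "no limit".
    """
    attempts = 0
    while attempts < safety_cap:
        attempts += 1
        try:
            operation()
            return True, attempts
        except Exception:
            retries_used = attempts - 1
            if max_retries is not None and retries_used >= max_retries:
                return False, attempts
    return False, attempts


def always_fails():
    # Stands in for a permanent failure such as a deleted project
    raise RuntimeError("project no longer exists")


print(run_with_retries(always_fails, max_retries=3))     # (False, 4)
print(run_with_retries(always_fails, max_retries=None))  # (False, 1000)
```

With a limit of 3 the task stops after one initial attempt plus three retries. Without a limit, only the artificial cap ends the loop.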

2. Permanent Errors Classified as Transient

The retry logic catches all exceptions and retries, but some errors (like "project not found" after deletion) are permanent:

python

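@task(bind=True, max_retries=3)
def run_job(self, job_id):
    try:
        job = Job.objects.get(id=job_id)
        if not job.project:
            # Permanent error: retrying cannot bring the project back
            raise Exception("Project deleted")
    except Exception as exc:
        # Catch-all handler retries the permanent error as if it were transient
        raise self.retry(exc=exc)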
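One way to avoid this is to decide explicitly which exception types are worth retrying before any retry is scheduled. The sketch below uses only built-in exception classes. The two tuples are illustrative assumptions, not a list taken from AWX, and unknown errors are treated as non-retryable so they fail fast.

```python
# Illustrative groupings; adjust to the errors your tasks actually raise.
PERMANENT_ERRORS = (LookupError, ValueError, TypeError, PermissionError)
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def should_retry(exc):
    """Return True only for errors that a later attempt could plausibly fix."""
    if isinstance(exc, PERMANENT_ERRORS):
        return False
    return isinstance(exc, TRANSIENT_ERRORS)


print(should_retry(ConnectionResetError("peer reset")))  # True
print(should_retry(KeyError("project")))                 # False
print(should_retry(RuntimeError("unexpected")))          # False
```

Checking the permanent group first keeps an error that belongs to both groups from being retried.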

3. Database Lock Contention Creating Phantom Failures

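When multiple workers try to update the same job record, one hits a lock timeout and retries. If the lock is held by a stuck transaction, every retry times out the same way. The exact message depends on the database; PostgreSQL reports `canceling statement due to lock timeout`.

text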

# Lock timeout causes retry loop
DatabaseError: lock wait timeout exceeded
    at awx/main/models/jobs.py:245 in update_status

4. Celery Beat Periodic Task Overlap

Scheduled tasks queue faster than they can complete, creating a backlog:

python

# Celery beat schedule
CELERYBEAT_SCHEDULE = {
    'cleanup-old-jobs': {
        'task': 'awx.main.tasks.cleanup_jobs',
        'schedule': 60.0,  # Every 60 seconds
    },
}
# If cleanup takes 90 seconds, tasks queue up
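The arithmetic behind that comment can be sketched as follows. This is a rough steady-state model under stated assumptions: tasks are enqueued at a fixed interval, each run takes a fixed time, and `backlog_after` is a hypothetical helper.

```python
def backlog_after(elapsed_seconds, schedule_interval, task_duration, workers=1):
    """Estimate how many scheduled runs have not finished after elapsed_seconds."""
    enqueued = elapsed_seconds // schedule_interval
    # Each worker finishes one run per task_duration; it cannot finish
    # more runs than were enqueued.
    completed = min(enqueued, workers * (elapsed_seconds // task_duration))
    return enqueued - completed


# Schedule every 60 s, cleanup takes 90 s, one worker, one hour of runtime
print(backlog_after(3600, 60, 90))             # 20
# A second worker keeps up with the schedule
print(backlog_after(3600, 60, 90, workers=2))  # 0
```

With one worker, 60 runs are enqueued in an hour but only 40 complete, so the backlog grows by about 20 per hour until the schedule is relaxed or the task gets faster.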

Step-by-Step Fix

Step 1: Identify the Stuck Task

Find which task is in the retry loop:

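```bash
# Check Celery worker active tasks
celery -A awx inspect active

# For Tower installed via package
sudo rabbitmqctl list_queues name messages consumers

# For AWX on Kubernetes with Redis
kubectl exec -n awx deployment/awx-task -- redis-cli --scan --pattern "celery-task-meta-*" | head -20

# Check PostgreSQL for jobs that are still running or pending
# (on AWX the shared job columns may live in main_unifiedjob; adjust the
# table name to your schema)
psql -U awx -d awx -c "
SELECT j.id, j.name, j.status, j.started
FROM main_job j
WHERE j.status IN ('running', 'pending')
ORDER BY j.started;
"

# List sessions waiting on a lock. pg_locks.transactionid is a transaction ID,
# not a job ID, so it cannot be joined to the job table directly.
psql -U awx -d awx -c "
SELECT l.pid, l.locktype, l.mode, a.query_start, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
"
```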

Step 2: Terminate the Stuck Task

Remove the poisoned task from the queue:

```bash
# Revoke the specific task by ID
celery -A awx control revoke <task-uuid> --terminate

# Or through Tower API
curl -X POST -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/jobs/12345/cancel/"

# Clear all pending tasks from Redis
# WARNING: FLUSHDB deletes every key in the selected Redis database,
# including queued tasks that are healthy
kubectl exec -n awx deployment/awx-task -- redis-cli FLUSHDB

# For RabbitMQ, purge the queue (this also discards every queued message)
rabbitmqctl purge_queue celery
```

Kill the stuck database transaction:

```sql
-- Find the blocking PID
SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query NOT LIKE '%pg_stat_activity%';

-- Terminate the connection
SELECT pg_terminate_backend(<pid>);
```

Step 3: Configure Proper Retry Limits

Update Tower task settings to enforce retry limits:

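```python
# /etc/tower/settings.py or custom settings
# Configure Celery task time limits
CELERY_TASK_SOFT_TIME_LIMIT = 3600  # 1 hour soft limit
CELERY_TASK_TIME_LIMIT = 7200       # 2 hour hard limit

# Maximum retries for all tasks
# NOTE: this name and the two retry-delay names below do not appear to be
# standard Celery settings. Celery itself sets retry limits per task
# (max_retries, default_retry_delay), so confirm your Tower/AWX build
# actually reads these values.
CELERY_TASK_MAX_RETRIES = 5

# Retry delay with exponential backoff
CELERY_TASK_DEFAULT_RETRY_DELAY = 60  # Start at 60 seconds
CELERY_TASK_MAX_RETRY_DELAY = 300     # Max 5 minutes between retries
```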

For AWX deployed via operator, set environment variables:

yaml

# awx.yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_extra_env: |
    - name: CELERY_TASK_MAX_RETRIES
      value: "5"
    - name: CELERY_TASK_DEFAULT_RETRY_DELAY
      value: "60"

Step 4: Implement Idempotent Error Handling

Create a custom task mixin that distinguishes permanent from transient errors:

```python
# custom_tasks.py - Add to Tower plugins
import logging

import requests
from celery import Task, shared_task
from django.core.exceptions import ObjectDoesNotExist, PermissionDenied
from django.db import DatabaseError

from awx.main.models import Inventory

logger = logging.getLogger(__name__)


class RetryableTask(Task):
    """Task base class with smart retry logic."""

    # Define which exceptions are retryable
    RETRYABLE_EXCEPTIONS = (
        ConnectionError,
        TimeoutError,
        requests.exceptions.RequestException,
        DatabaseError,  # But only certain database errors
    )

    # Define which exceptions should never retry
    PERMANENT_EXCEPTIONS = (
        ValueError,
        KeyError,
        TypeError,
        ObjectDoesNotExist,
        PermissionDenied,
    )

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        # Celery calls this hook once a retry has been scheduled, so it only
        # logs; the retry decision is made in retry_or_fail() below
        logger.warning(
            f"Task {self.name} retry {self.request.retries}/{self.max_retries}: {exc}"
        )

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Runs when the task has failed for good
        if self.max_retries is not None and self.request.retries >= self.max_retries:
            logger.error(f"Task {self.name} exceeded max retries, giving up")

    def retry_or_fail(self, exc):
        """Re-raise permanent errors; schedule a retry only for retryable ones."""
        # Check permanent errors first so they are never retried
        if isinstance(exc, self.PERMANENT_EXCEPTIONS):
            logger.error(f"Task {self.name} hit permanent error: {exc}")
            raise exc
        if isinstance(exc, self.RETRYABLE_EXCEPTIONS):
            raise self.retry(exc=exc, countdown=self.get_retry_delay())
        raise exc  # Unknown errors fail fast instead of looping

    def get_retry_delay(self):
        """Exponential backoff: 60s, 120s, 240s, 300s, 300s"""
        return min(60 * (2 ** self.request.retries), 300)


# Usage in custom tasks
@shared_task(bind=True, base=RetryableTask, max_retries=5)
def update_inventory(self, inventory_id):
    try:
        inventory = Inventory.objects.get(id=inventory_id)
        inventory.update_sources()
    except Inventory.DoesNotExist:
        logger.error(f"Inventory {inventory_id} does not exist - not retrying")
        raise  # Permanent error: re-raised without calling self.retry()
    except requests.exceptions.ConnectionError as exc:
        logger.warning(f"Connection error updating inventory {inventory_id}")
        self.retry_or_fail(exc)  # Retries with backoff until max_retries
```

Step 5: Add Health Checks and Automatic Recovery

Configure Tower to detect and recover stuck workers:

```python
# /etc/tower/settings.py
# Worker health check interval
AWX_WORKER_HEARTBEAT = 60  # Check every 60 seconds

# Kill workers that haven't checked in
AWX_WORKER_TIMEOUT = 300  # 5 minutes without heartbeat = dead

# Enable task result backend for tracking
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'
CELERY_TASK_TRACK_STARTED = True
CELERY_TASK_REPORTS = True
```

Create a monitoring playbook to detect stuck tasks:

```yaml
# monitor_workers.yml
- name: Monitor Tower worker health
  hosts: localhost
  vars:
    tower_url: "https://tower.example.com"
    max_job_duration: 7200  # 2 hours

  tasks:
    - name: Get running jobs
      uri:
        url: "{{ tower_url }}/api/v2/jobs/?status=running"
        method: GET
        headers:
          Authorization: "Bearer {{ tower_token }}"
        return_content: yes
      register: running_jobs

    - name: Find stuck jobs (running too long)
      set_fact:
        stuck_jobs: "{{ running_jobs.json.results | selectattr('started', 'defined') | selectattr('elapsed', '>', max_job_duration) | list }}"

    - name: Alert on stuck jobs
      debug:
        msg: "Job {{ item.id }} ({{ item.name }}) has been running for {{ item.elapsed }} seconds"
      loop: "{{ stuck_jobs }}"
      when: stuck_jobs | length > 0

    - name: Cancel stuck jobs
      uri:
        url: "{{ tower_url }}/api/v2/jobs/{{ item.id }}/cancel/"
        method: POST
        headers:
          Authorization: "Bearer {{ tower_token }}"
      loop: "{{ stuck_jobs }}"
      when: stuck_jobs | length > 0
```
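The filter at the heart of the playbook can be tried offline before pointing it at a live system. The sketch below mirrors the `started` and `elapsed` checks in plain Python. `find_stuck_jobs` is a hypothetical helper, and the sample records are invented, shaped like the `results` list the playbook reads.

```python
def find_stuck_jobs(results, max_job_duration=7200):
    """Return jobs that have started and have run longer than max_job_duration seconds."""
    return [
        job
        for job in results
        if job.get("started") is not None
        and job.get("elapsed", 0) > max_job_duration
    ]


# Hypothetical API results for illustration only
sample_results = [
    {"id": 1, "name": "deploy", "started": "2026-04-11T11:30:00Z", "elapsed": 11000.0},
    {"id": 2, "name": "patch", "started": "2026-04-11T14:30:00Z", "elapsed": 420.5},
    {"id": 3, "name": "queued", "started": None, "elapsed": 0.0},
]
for job in find_stuck_jobs(sample_results):
    print(f"Job {job['id']} ({job['name']}) has been running for {job['elapsed']} seconds")
```

Only job 1 is reported: job 2 is within the two-hour limit and job 3 has not started.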

Verification

Run tests to confirm retry limits are working:

```bash
# Check Celery configuration
celery -A awx inspect conf | grep -i retry

# Start a test job that will fail
# (the Content-Type header is needed for the API to parse the JSON body)
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "https://tower.example.com/api/v2/job_templates/1/launch/" \
  -d '{"extra_vars": "{\"fail_test\": true}"}'

# Watch the retry count
watch -n 5 "celery -A awx inspect active | grep retries"
# Should show retries incrementing to max_retries, then stopping

# Check worker logs for proper retry behavior
kubectl logs -n awx deployment/awx-task -f | grep -i retry
# Should see: "exceeded max retries, giving up"
```

Verify tasks complete or fail gracefully:

bash

# Check job status after retries
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://tower.example.com/api/v2/jobs/" | jq '.results[] | {id, status, failed}'
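To turn that output into a pass/fail check, the job list can be summarized by status. A minimal sketch, assuming the `id` and `status` fields selected by the `jq` filter above; `summarize_jobs` is a hypothetical helper and the sample records are invented.

```python
from collections import Counter


def summarize_jobs(results):
    """Return (count per status, IDs of jobs that are still pending or running)."""
    counts = Counter(job["status"] for job in results)
    unfinished = [
        job["id"] for job in results if job["status"] in ("pending", "running")
    ]
    return dict(counts), unfinished


# Hypothetical records for illustration only
sample_jobs = [
    {"id": 101, "status": "failed", "failed": True},
    {"id": 102, "status": "successful", "failed": False},
    {"id": 103, "status": "running", "failed": False},
]
counts, unfinished = summarize_jobs(sample_jobs)
print(counts)      # {'failed': 1, 'successful': 1, 'running': 1}
print(unfinished)  # [103]
```

After the retry limit takes effect, the deliberately failing test job should land in the failed count rather than stay in the unfinished list.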

[ansible-dead-letter-queue-fills-because-a-poison-message-is-never-quarantined](/articles/ansible-dead-letter-queue-fills-because-a-poison-message-is-never-quarantined) - Poison message handling
[ansible-job-runner-replays-completed-work-after-lease-expiry](/articles/ansible-job-runner-replays-completed-work-after-lease-expiry) - Job replay issues
[ansible-queue-backlog-grows-because-ack-never-reaches-the-broker](/articles/ansible-queue-backlog-grows-because-ack-never-reaches-the-broker) - Queue management problems




Last updated: Feb 12, 2026

