🧵 One Celery Worker, 10k+ Tasks, and a Lesson in Prioritization

In a Django web application, Celery was initially configured with a single worker running inside a Docker container. The system relied on real-time data pulled from an external platform. However, due to occasional API rate limits, network instability, or platform-side issues, real-time data could sometimes fail to arrive.

To ensure data consistency, a full synchronization process was scheduled twice daily during non-working hours. On certain occasions, this full sync had to be triggered during business hours, resulting in over 10,000 Celery tasks being queued at once.
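
For reference, a twice-daily schedule like this is typically wired up through Celery Beat. A minimal sketch, assuming a `full_sync` task; the task path, app name, and exact hours below are illustrative, not the project's actual values:

```python
# celery.py -- illustrative sketch; names and hours are assumptions.
from celery import Celery
from celery.schedules import crontab

app = Celery("project")

app.conf.beat_schedule = {
    # Run the full synchronization twice a day, outside business hours.
    "full-sync-early-morning": {
        "task": "sync.tasks.full_sync",
        "schedule": crontab(hour=2, minute=0),
    },
    "full-sync-late-evening": {
        "task": "sync.tasks.full_sync",
        "schedule": crontab(hour=22, minute=0),
    },
}
```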

This setup led to a critical issue: feature-related tasks were stuck behind the backlog of sync jobs. As a result, users experienced delays and partial downtime in key functionality.

Initial Solution: Queue Prioritization

To solve this, the Celery configuration was updated to use multiple queues (a configuration sketch follows the list):

  • high_priority for real-time and user-facing features
  • default for standard background tasks
  • low_priority for resource-heavy synchronization jobs
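
A minimal sketch of that split, using standard Celery routing settings; the task module paths are placeholders, not the project's real names:

```python
# Celery settings sketch -- module paths are illustrative placeholders.
from kombu import Queue

app.conf.task_queues = (
    Queue("high_priority"),
    Queue("default"),
    Queue("low_priority"),
)
app.conf.task_default_queue = "default"

# Route real-time / user-facing tasks to high_priority and the heavy
# sync jobs to low_priority; everything else falls back to default.
app.conf.task_routes = {
    "features.tasks.*": {"queue": "high_priority"},
    "sync.tasks.*": {"queue": "low_priority"},
}
```

At this point a single worker still consumed all three queues, e.g. `celery -A project worker -Q high_priority,default,low_priority` (the app name is a placeholder).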

With this setup, Celery was able to prioritize high-importance tasks first and defer the low-priority workload. This significantly improved responsiveness during load spikes — until another bottleneck emerged.

New Bottleneck: External API Failure

An unexpected issue occurred when the external data source began presenting expired SSL certificates. Instead of failing quickly, API calls started hanging indefinitely. Because all queues were still handled by the same worker, even high_priority tasks became delayed, and core features degraded, with response times stretching up to five minutes.
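
The hang itself comes down to outbound calls having no upper bound: a `requests` call without an explicit `timeout` will wait on a stalled TLS handshake indefinitely, and with a single worker everything queued behind it waits too. A sketch of how such a call could be bounded; the task name, URL, and limit values are illustrative, not taken from the project:

```python
import requests
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(soft_time_limit=60, time_limit=90)
def fetch_remote_data(resource_id):
    try:
        # An explicit timeout makes a stalled TLS handshake or slow response
        # fail fast instead of blocking the worker process indefinitely.
        response = requests.get(
            f"https://platform.example/api/resources/{resource_id}",
            timeout=(5, 30),  # (connect, read) timeouts in seconds
        )
        response.raise_for_status()
        return response.json()
    except SoftTimeLimitExceeded:
        # The worker interrupts the task after 60 seconds; give up cleanly.
        return None
```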

Final Fix: Isolated Workers per Queue

To mitigate this, separate Celery workers were deployed for each queue:

  • A dedicated worker for high_priority tasks
  • A separate worker for default tasks
  • An isolated worker for low_priority sync jobs

This ensured that delays in one queue could not cascade into others. For example, if the sync worker is blocked by a slow or unresponsive external API, real-time tasks continue to be processed independently.
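
In a Docker-based deployment this typically means one worker container per queue, each started with its own `-Q` flag, for example `celery -A project worker -Q high_priority`, `celery -A project worker -Q default`, and `celery -A project worker -Q low_priority` (the `project` app name is a placeholder); concurrency and resource limits can then be tuned per queue.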

Outcome

With isolated workers and clear queue prioritization, the system became significantly more resilient. User-facing features remained stable even under load, and synchronization issues no longer impacted core performance. The architecture now provides a more fault-tolerant and user-friendly experience.

What About Your Setup?

  • Have similar issues been encountered with Celery or background task processing?
  • How is task prioritization handled in your system?
  • Are workers isolated by queue, or is a different strategy used?
  • What tools or techniques have helped increase system resilience under load?

Feel free to share lessons learned, improvements made, or challenges faced. Feedback is always welcome! Join the discussion on Telegram.