Everything looked fine—at least, until it didn’t.
Our Django app ran on Celery with a single worker inside Docker. Most of the time, things worked smoothly: real-time data flowed in from an external platform, users got their updates instantly, and life was good.
But reality isn’t always that kind. The external API occasionally hit us with rate limits, unstable connections, or simply failed to deliver. To keep data consistent, we scheduled a full sync twice a day, outside business hours.
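For context, a twice-daily off-hours sync like that is usually wired up with Celery beat. The sketch below is illustrative, not our actual code: the task path, the import location of the Celery app, and the run times are all assumptions.

```python
# Sketch of a twice-daily, off-hours full sync via Celery beat.
# "myproject.celery" and "sync.tasks.full_sync" are assumed names.
from celery.schedules import crontab

from myproject.celery import app  # assumed location of the Celery instance

app.conf.beat_schedule = {
    "full-sync-early-morning": {
        "task": "sync.tasks.full_sync",
        "schedule": crontab(hour=4, minute=0),   # before business hours
    },
    "full-sync-late-evening": {
        "task": "sync.tasks.full_sync",
        "schedule": crontab(hour=22, minute=0),  # after business hours
    },
}
```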
And then came the moment that broke things.
One day, a full sync had to be triggered during business hours. Suddenly, 10,000+ tasks flooded the queue. Everything—every click, every feature—was forced to wait behind that mountain of jobs. What used to respond in under 15 seconds now took five minutes. For users, that felt like downtime.
We split the work into three queues (the routing sketch after this list shows the idea):

- high_priority for real-time, user-facing features
- default for everyday background tasks
- low_priority for heavy sync jobs
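Here is a minimal sketch of that split in Celery configuration. The broker URL and task paths (send_realtime_update, full_sync) are placeholders, not our real module names.

```python
# celery.py: declare the three queues and route tasks to them.
from celery import Celery
from kombu import Queue

app = Celery("myproject", broker="redis://localhost:6379/0")

# Declare the three queues and make "default" the fallback for unrouted tasks.
app.conf.task_queues = (
    Queue("high_priority"),
    Queue("default"),
    Queue("low_priority"),
)
app.conf.task_default_queue = "default"

# Send user-facing tasks to high_priority and the heavy sync to low_priority.
app.conf.task_routes = {
    "notifications.tasks.send_realtime_update": {"queue": "high_priority"},
    "sync.tasks.full_sync": {"queue": "low_priority"},
}
```

Anything not explicitly routed falls through to default, so only the tasks we name end up in the high- and low-priority queues.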
This worked—at first. High-priority jobs jumped ahead, and sync tasks politely waited their turn. Response times improved.
But then another storm rolled in.
The external data source started serving requests with expired SSL certificates. Calls didn’t fail fast; they just hung. And since all queues were handled by the same worker, everything slowed down again. High-priority tasks that should have been instant now dragged out for minutes.
We gave each queue its own worker (see the compose sketch after this list):

- One dedicated to high_priority tasks
- One for default jobs
- One isolated for low_priority syncs
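Roughly, that looks like one worker container per queue. In the excerpt below, the service names, the "myproject" app module, and the concurrency values are illustrative, and the web app and broker services are omitted.

```yaml
# docker-compose.yml (excerpt): one Celery worker container per queue.
services:
  worker-high:
    build: .
    command: celery -A myproject worker -Q high_priority -n high@%h --concurrency=4

  worker-default:
    build: .
    command: celery -A myproject worker -Q default -n default@%h --concurrency=2

  worker-low:
    build: .
    command: celery -A myproject worker -Q low_priority -n low@%h --concurrency=1
```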
Now, if the sync worker gets stuck waiting on a bad API call, it doesn’t matter—real-time features keep running smoothly.
The architecture became far more resilient. User-facing features stayed fast and stable, even under heavy load. And sync jobs, while still heavy, no longer had the power to bring everything else down with them.
The lesson? Don’t let a single slow queue hold your entire system hostage.