❄️ Suddenly Main Features Froze… Here’s What Happened and How We Fixed It

Everything looked fine—at least, until it didn’t.

Our Django app ran on Celery with a single worker inside Docker. Most of the time, things worked smoothly: real-time data flowed in from an external platform, users got their updates instantly, and life was good.

But reality isn’t always that kind. The external API occasionally hit us with rate limits, unstable connections, or simply failed to deliver. To keep data consistent, we scheduled a full sync twice a day, outside business hours.

And then came the moment that broke things.

One day, a full sync had to be triggered during business hours. Suddenly, 10,000+ tasks flooded the queue. Everything—every click, every feature—was forced to wait behind that mountain of jobs. What used to respond in under 15 seconds now took five minutes. For users, that felt like downtime.

The First Fix: Prioritizing Tasks

We split the work into three queues:

  • high_priority for real-time, user-facing features

  • default for everyday background tasks

  • low_priority for heavy sync jobs

This worked—at first. High-priority jobs jumped ahead, and sync tasks politely waited their turn. Response times improved.

But then another storm rolled in.

The New Bottleneck: Hanging API Calls

The external data source started serving requests with expired SSL certificates. Calls didn’t fail fast; they just hung. And since all queues were handled by the same worker, everything slowed down again. High-priority tasks that should have been instant now dragged out for minutes.

The Final Fix: Isolated Workers

We gave each queue its own worker:

  • One dedicated to high_priority tasks

  • One for default jobs

  • One isolated for low_priority syncs

Now, if the sync worker gets stuck waiting on a bad API call, it doesn’t matter—real-time features keep running smoothly.

The Outcome

The architecture became far more resilient. User-facing features stayed fast and stable, even under heavy load. And sync jobs, while still heavy, no longer had the power to bring everything else down with them.

The lesson? Don’t let a single slow queue hold your entire system hostage.