Handling Chaos: How We Debugged a Memory Leak in Prod
It started on a quiet Friday. We noticed memory usage on one of our Node.js services creeping up — slowly at first, then alarmingly fast. It wasn't crashing, but it was on track to eat the entire container within hours. Something was leaking.
Step 1: Confirming It Was a Leak
First, we ruled out spikes from high traffic. Our dashboards showed a steady user load. The memory graph, however, looked like a staircase climbing to doom. We restarted the service — memory dropped. But within minutes, the pattern resumed. Classic memory leak behavior.
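If you want a second signal besides the dashboards, a couple of lines inside the process will confirm the trend. Here's a minimal sketch (not our actual tooling) that logs Node's `process.memoryUsage()` every 30 seconds so you can watch the staircase form in real time:

```js
// memory-watch.js: log heap usage periodically to confirm a steady climb.
// Minimal sketch; the interval and output format are illustrative only.
const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);

setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(`[mem] rss=${toMB(rss)}MB heapUsed=${toMB(heapUsed)}MB heapTotal=${toMB(heapTotal)}MB`);
}, 30_000);
```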
Step 2: Capturing a Heap Snapshot
We enabled the inspector with the --inspect flag, attached Chrome DevTools to the running container, and captured a heap snapshot. Ten minutes later we took a second snapshot and compared the two.
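If attaching DevTools to a container feels fiddly, Node can also write snapshots from inside the process using the built-in v8 module. This is an alternative sketch rather than a transcript of what we did; the SIGUSR2 trigger is just one way to wire it up:

```js
// snapshot.js: write a heap snapshot you can open in Chrome DevTools (Memory tab).
// Sketch only; requires Node 11.13+ for v8.writeHeapSnapshot().
const v8 = require('v8');

process.on('SIGUSR2', () => {
  // Writes a .heapsnapshot file to the current working directory
  // and returns its filename.
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});
```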
The culprit? A growing array tied to cached responses. We were accidentally pushing responses into an in-memory store that never expired, so our quick caching layer had turned into a slow-motion memory leak.
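For context, the shape of the bug looked roughly like this (simplified, with hypothetical names, and not our real handler): every response got appended to a module-level array that nothing ever pruned.

```js
// Simplified illustration of the leak (hypothetical names, not our actual code).
const responseCache = []; // module-level, lives for the life of the process

async function handleRequest(req, res) {
  const data = await fetchData(req.url); // hypothetical upstream call
  // The "cache": every response is appended and nothing is ever evicted,
  // so the array (and the response bodies it holds) grows forever.
  responseCache.push({ url: req.url, data, cachedAt: Date.now() });
  res.end(JSON.stringify(data));
}
```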
Step 3: Reproducing Locally
We isolated the logic in a local test file and used autocannon to simulate high load. Within minutes, memory ballooned. Confirmed.
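autocannon can be driven from the command line or from a script. A rough sketch of the programmatic form (the URL, connection count, and duration are placeholders, not our exact settings):

```js
// load-test.js: hammer the local endpoint and watch memory grow.
// Sketch only; tune connections/duration to your setup.
const autocannon = require('autocannon');

autocannon(
  { url: 'http://localhost:3000/cached-endpoint', connections: 100, duration: 120 },
  (err, results) => {
    if (err) throw err;
    console.log(`requests/sec: ${results.requests.average}`);
  }
);
```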
Step 4: Fixing the Root Cause
We replaced the naive cache array with a bounded, Map-backed cache that evicted entries based on age and size.
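Conceptually, the replacement looked like the sketch below: a Map with a max entry count and a TTL, dropping the oldest entry when full. The limits and names here are illustrative, and in practice a battle-tested package like lru-cache does the same job with fewer sharp edges.

```js
// bounded-cache.js: a Map-backed cache with TTL and size-based eviction.
// Conceptual sketch; defaults are illustrative.
class BoundedCache {
  constructor({ maxEntries = 500, ttlMs = 60_000 } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.store = new Map(); // Map preserves insertion order, so the oldest key is first
  }

  set(key, value) {
    if (this.store.size >= this.maxEntries) {
      // Evict the oldest entry (first key in insertion order).
      const oldestKey = this.store.keys().next().value;
      this.store.delete(oldestKey);
    }
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired: drop it so stale entries can't pile up
      return undefined;
    }
    return entry.value;
  }
}

module.exports = BoundedCache;
```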
We also added metrics around cache hit rate and memory usage, so we could catch this earlier next time.
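The metrics side was standard counter-and-gauge work. A hedged sketch using prom-client, assuming a Prometheus setup (your stack may differ, and the metric names are illustrative):

```js
// cache-metrics.js: expose cache hit rate and entry count to Prometheus.
// Assumes prom-client and a /metrics endpoint wired up elsewhere.
const client = require('prom-client');

const cacheHits = new client.Counter({
  name: 'response_cache_hits_total',
  help: 'Number of cache hits',
});
const cacheMisses = new client.Counter({
  name: 'response_cache_misses_total',
  help: 'Number of cache misses',
});
const cacheEntries = new client.Gauge({
  name: 'response_cache_entries',
  help: 'Current number of cached entries',
});

function recordLookup(hit, currentSize) {
  (hit ? cacheHits : cacheMisses).inc();
  cacheEntries.set(currentSize);
}

module.exports = { recordLookup };
```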
Step 5: Deploying and Verifying
The fix went live Monday morning. Memory usage flattened. We let it run for 48 hours with synthetic traffic before considering it solved. We also added a custom alert: if memory climbs steadily for 10 mins, notify the team. Better safe than firefighting.
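How you express that alert depends on your monitoring stack; ours lives in the alerting layer, but the rule is simple enough to sketch in-process. Everything below is a hypothetical illustration, including the thresholds and the placeholder notification:

```js
// mem-alert.js: warn if heap usage rises across every sample in a 10-minute window.
// In-process sketch only; a real setup would encode this rule in the alerting layer.
const SAMPLE_MS = 60_000; // one sample per minute
const WINDOW = 10;        // ten consecutive samples = ten minutes
const samples = [];

setInterval(() => {
  samples.push(process.memoryUsage().heapUsed);
  if (samples.length > WINDOW) samples.shift();

  const steadilyClimbing =
    samples.length === WINDOW &&
    samples.every((v, i) => i === 0 || v > samples[i - 1]);

  if (steadilyClimbing) {
    // Placeholder for a webhook/pager call in your own stack.
    console.warn('[alert] heap usage has climbed for 10 straight minutes');
  }
}, SAMPLE_MS);
```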
Lessons We Learned
- 🚨 Not all bugs crash your app. Some bleed it dry silently.
- 🔍 Heap snapshots are gold when logs aren't enough.
- 🧠 "Temporary hacks" often live forever — document them properly.
- 📊 Metrics and alerts should evolve with your codebase, not lag behind.
Debugging memory leaks in production is not just technical — it's emotional. It demands calm under pressure, clear thinking, and collaboration.
In the end, it wasn't a heroic save. It was methodical diagnosis, boring testing, and careful deployment. And honestly? That's the kind of hero I want on my team.