The trials and tribulations of "Heisenbugs"

[ Mood: Fed Up With Life ]
We have been struggling with a memory leak in OpManager that seems to occur only under relatively rare circumstances. It is documented in

http://forums.adventnet.com/viewtopic.php?t=1641&highlight=

and in

http://forums.adventnet.com/viewtopic.php?p=4825

It was damn hard to reproduce this bug, and only after a customer kindly shared their OpManager database were we able to reproduce it. But that hasn’t made it any easier to pin down. First, it takes a few hours to manifest itself, so the turnaround time is excruciatingly slow. Second, memory profiling tools have been ineffective because, sure enough, the leak won’t happen when you are looking for it - the classic definition of a Heisenbug. When a bug disappears under observation like that, the most likely culprit is a timing- and thread-related issue, since attaching a profiler changes how the threads are scheduled.
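As an illustration only (not a description of the tooling actually used on this bug), one lower-impact alternative to attaching a profiler is to log used-heap figures at a coarse interval and fit a trend to them afterwards. The sketch below assumes the samples have already been collected as (seconds, used bytes) pairs; the function name `growth_rate` and the sample format are hypothetical.

```python
from typing import Sequence, Tuple


def growth_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of used bytes over time, in bytes per second.

    `samples` is a sequence of (timestamp_seconds, used_bytes) pairs,
    an assumed format for periodically logged heap readings.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n

    # Spread of the timestamps; zero means every sample shares one timestamp.
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")

    cov = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    return cov / var_t


# Made-up readings taken once a minute; the dips stand in for garbage collection.
readings = [(0, 100e6), (60, 130e6), (120, 110e6), (180, 140e6), (240, 120e6)]
print(f"{growth_rate(readings):.0f} bytes/second")
```

A slope that stays positive over several hours, despite the garbage-collection dips, suggests a leak. Because the readings are taken only occasionally, this should disturb thread timing far less than a full profiler, though it shows only that memory is growing, not where.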

We think we are close to swatting this nasty bug, but its nature makes it hard to be certain. Customers are anxiously waiting - and that adds to the tension for the developers.

And the sad reality is that there is no silver bullet here. The combinatorial explosion of potential pathways in software means that no matter how comprehensive our testing is, something like this is likely to happen once in a while. Fortunately, it is rare, but when it happens, it is just utter misery.
