Thursday, October 01, 2009

Why you should Blame your Tools, sometimes.

As software developers, we would all wish to have 3rd party libraries that works magically immediately after installation. While this is to be expected and holds true for popular and widely used libraries, people do tend to forget that it is still more of a rule of thumb than an iron-cast fact.

Like the urban myth goes, "a good workman never blames his tools", this mindset is so ingrained in the software development culture, that it is always bad form to assume that bug may reside in a library rather than in your code. After all, it is more likely that you are the offending culprit, given it is less likely a widely used code base will make a mistake when compared to code just seen by you alone. But with everything statistical, there is always a chance that an outlier event can happen.

And it did for me.

Internally, we rely on log4cxx as our logging library, which a peculiar bug seems to only be triggered upon the termination of our application. After spending more than 2 days of debugging, the error seemed to have consistently been at the line where the logger object is being accessed. And like all true developers, I bit my lip and kept soldiering on, reading through my own code, checking the same lines over and over again, trying to see if I had missed defining a variable, or if I had forgotten to allocate memory for it. Naturally, after while of doing so, I began to start questioning my personal sanity rather than the lines of code I've written, amid this 'impossibility' that the error could actually be from the library.

And then I popped.

I turned over to my colleague and mentioned to him that the code kept failing on the logging line. The team had known that I've been stuck on this problem for the last day. Even as I started, I could see the mild derision on his face, "it can't be the library. Maybe gdb is not showing what actually is going on."

So on and on the circle I went. Then someone else suggested using valgrind.

Since I was running out of options, I didn't see any harm trying. At most I'll waste an hour more to the day I've already wasted, anyway.

Valgrind didn't really show anything more different to what gdb was showing. Basically saying the same thing - that log4cxx is accessing an invalid memory location. Given I wasn't an expert on valgrind, I got someone else to come over and discussed what we were seeing on the stack trace.

No luck still.

It's gone past the point of frustration, into the point of desperation that I decided to just fucking comment out the offending statement and see if the problem goes away.

After recompling and re-running the application 30+ times without the bug re-surfacing, I was very certain to point at the logger library being the culprit. But of course if I did, I couldn't definitively prove it, and worse, I had no workaround; the entire code base is reliant on it for logging, and to take it away would mean a major re-engineering exercise.

Sometimes talking to others help. Sometimes they offer a difference in perspective, and other times, they may have caught something you hadn't.

Given that we had multiple threads that were still operating when a shutdown signal is received, I thought maybe it'll be worth trying to stop those threads safely before the termination signal is processed - since it's likely that some portions or memory may be freed while the threads are still utilising them. But I was just shooting wildly - it was a very long shot, given how the threading interactions were designed, it was difficult to believe that it could have been the problem. But trying something is better than doing nothing.

So I sat down with another colleague of mine, and talked him through what I wanted to do, and invariably the discussion lead to another where we talked about the valgrind backtrace that I had generated, which showed memory destruction caused by exit(). After a bit of theorising back and forth about the behaviour, and on how I'd would write a quick hack to test my hypothesis, my colleague decided to do a ubiquity search on 'log4cxx exit crash^H^H^H^H', which then our eyes popped when we saw the preview result before it changed into something else that he was meaning to type.

"Go back again", I said excitedly. He was already as excited as I was, so there wasn't really the need for any prompting.

It turns out that there is a known problem in log4cxx found months ago, and someone had even posted a patch on it already. But here's the thing about software: unless you are willing to roll your own fixes, be prepared to wait for a long time before someone eventually fixes it. And even if someone did fix it, the upstream maintainer will still take a while before committing it into the main trunk. Or even worse, he may just choose to ignore it. And yes it does happen - ask Ulrich Drepper. :P

On how to fix it, and package it up nicely for use, I'll write about that in a later post, for now I want to share an invaluable lesson when it comes to software development, which also equally applies in life: while established conventions are more often right than wrong, it never hurts to question them just in case they aren't. It may not be okay to blame your tools all the time, but it sometimes worth to know when to call a shoddy spade when you see one.