Friday, August 13, 2004

Frustrating Week at Work

It was one of those weeks where I felt incompetent, and even worse, thought other people felt I was incompetent. ;)

We are trying to hit a deadline of zero Priority 1 bugs by next Monday. Those of you who work in the software industry realize bug priority is a bit arbitrary. But anyway, I was given a bug to investigate. The bug touches on an area I'm not as familiar with, and involves a second product.

It went as follows:

Day 1

Read the bug report, and spent at least half the day trying to reproduce the bug. The bug was reported by another group inside the company that was trying to use a library my group publishes. The bug is an error message, "credentials invalid", when resubmitting some information.

So, I find a copy of their program and install it. Hours later, nothing. I can't get anything to work with this program. All I get is an error "can't scan remote object." Remote object? I'm trying to find a file on my own computer! I call and talk to somebody on the phone.

Them: "Which build did you install?"
Me: "The latest, the 2041 build."
Them: "The dot zero build or the dot one build?"
Me: "The dot one, it had a later timestamp."
Them: "Ah, that's probably it. Dot zero builds have all the features enabled for testing. Dot one builds require full licensing info."
Me: "(thinking) @#$&%*&!. (speaking) Okay, I'll install the dot zero build."
Them: "Also, you have to join a domain for the reporting option to work."
Me: "(thinking) #!@%^**, %%@#!*!)"

Well, the reporting option was what I needed working to repro the bug, so it is important that the option actually works. After more time installing, I still can't get their program to work, even after joining the domain.

I fish around and drag up some test code; maybe I can use that to repro the problem. After a false start, I am set up with the test code, so I no longer need the actual product that reported the bug.

I can't repro the bug, after spending a huge amount of effort trying.

Day 2

I have some test code running that does everything required, but I don't see the bug. Essentially, the other team reports a problem when resubmitting some information that was already submitted. The test code submits with no problem, and resubmits with no problems.

I step through the info submitting code, and examine the program state. Everything looks fine. The one thing that looks odd is the user credentials: they are empty. Aha! A flag goes up in my mind. After all, the error message has to do with invalid credentials. But, after reading our source code for a while, I determine that empty credentials actually mean to use the default credentials (i.e. the credentials supplied when our product is installed).
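In rough terms, the rule I found in the source works like the sketch below. This is Python purely for illustration, not our product's code, and the names `Credentials`, `DEFAULT_CREDENTIALS`, and `resolve_credentials` are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credentials:
    user: str
    password: str


# Hypothetical stand-in for the credentials supplied at install time.
DEFAULT_CREDENTIALS = Credentials(user="install-user", password="install-secret")


def resolve_credentials(supplied: Optional[Credentials]) -> Credentials:
    """Return the credentials to use; empty ones fall back to the defaults."""
    if supplied is None or (not supplied.user and not supplied.password):
        return DEFAULT_CREDENTIALS
    return supplied
```

So an empty credentials structure is not an error by itself, which is why seeing it in the debugger turned out to be a false alarm.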

I exchange email with the other group. I also ask about remote access to a machine which exhibits this problem, because I am just not seeing it.

Me: "I still don't see this. I've installed a half dozen times, tried flipping all sorts of config options, and the product errors out before I am far enough to see the bug. With the test code, I submit with no problem, and resubmit off the properties menu with no problem."
Them: "Oh! You have to go to [other location], double-click, and submit from there. Then it fails."
Me: "(thinking) Well why the !#$@%^*# didn't you write that in the bug report?"

So finally, I am able to reproduce the problem. This is good. I fire up the debugger and start to investigate. It is most convenient for me to examine the test code with the debugger, so I spend some time examining the information the test code sends off to the product.

I'm not getting far; everything looks the same. The information supplied the first two ways (original submit, and property submit) matches the info supplied the third way ([other location] submit).

I set a few breakpoints in the debugger, thinking I'll just trace execution and see what happens. I pay extra attention to the empty credentials, as they seem related to the error message. Unfortunately, as I trace in the debugger and confirm by reading source code, the credentials pass through untouched! I was expecting them to be modified somewhere along the way, causing the failure. Something doesn't add up though: if it is a credential problem, why did it work two ways but not the third?

At this point, I've mined the test code as much as possible. Time to start debugging on the product. So I check around, set a few breakpoints, and then go to reproduce the problem. The program should halt on one of my breakpoints, and I can try to see if there is a problem.

Unfortunately, the program executes right through and shows the error. Argh, none of the breakpoints was hit! As an analogy for non-programmers, imagine the police setting up a roadblock to catch somebody. He gets away, and later the police realize they forgot to block one road out. Or, in this case, had set up the roadblock in the wrong neighborhood.

Day 3

At this point, my boss drags somebody else over to help me. Well that's good, I'm not too familiar with this area of the code.

My co-worker quickly notices the credentials are empty. I point out yes, the test code purposely sets that and empty credentials work for the job submit and the resubmit. Plus, empty credentials mean to use the default credentials.

Eventually, we bring in a third person who is directly familiar with the affected area. He notices what the problem is: the machine name needs a double backslash in front!

I work on a Windows product, and in Windows, computers have what is called a UNC name, which is "\\" preceding the computer name. The UNC name is used mostly to refer to files on a remote computer. For example, if a file on my computer (KBARRUS-N) is named info.txt, I can refer to it remotely as (\\KBARRUS-N\path\info.txt).

Earlier when I had checked the information submitted to our product, I dumped the computer name variable. Sure enough, it was correct, but without the preceding backslashes. I didn't think this was a problem, since the submit and resubmit worked. But the third way to submit a job allows the user to edit various fields, so that likely involves stricter validation of the information before continuing. Actually, the info was submitted anyway and somehow the missing backslashes led to a credential error message. In any case, the third guy, the one most familiar with the area, said that was the problem: the computer name needs to be in UNC form, and the fact it worked through two other methods was a coincidence.
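To make the naming concrete, here is a tiny sketch of how a UNC-style path is put together. It is Python for illustration only, and the helper name `remote_path` is invented for this example.

```python
def remote_path(computer_name: str, *parts: str) -> str:
    """Build a UNC-style path: two backslashes, the computer name, then the rest."""
    # "\\\\" in Python source is two literal backslash characters.
    return "\\".join(["\\\\" + computer_name, *parts])


print(remote_path("KBARRUS-N", "path", "info.txt"))
# prints \\KBARRUS-N\path\info.txt
```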

We resolved the bug back to the other group, telling them to add the backslashes before calling our code. In reality, our code should either convert names to UNC form, or refuse to accept names that aren't in UNC form. Since we're in crunch mode and don't want to change computer name parsing in the product right now, it is easier for the other group to fix their code. Perhaps later we'll fix ours to make it consistent.
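For what it's worth, either option would be only a few lines. The sketch below shows both, in Python rather than our actual code; `ComputerNameError`, `normalize_computer_name`, and `require_unc_name` are hypothetical names, and the checks are deliberately minimal (they do not validate which characters a computer name may contain).

```python
class ComputerNameError(ValueError):
    """Raised when a computer name cannot be used as a UNC name."""


def normalize_computer_name(name: str) -> str:
    """Option 1: accept a bare or UNC computer name and return the UNC form."""
    bare = name.strip().lstrip("\\")
    # Reject empty names and anything that looks like a path, not a name.
    if not bare or "\\" in bare:
        raise ComputerNameError(f"Error in computer name: {name!r}")
    return "\\\\" + bare


def require_unc_name(name: str) -> str:
    """Option 2: refuse any name that is not already in UNC form."""
    rest = name[2:]
    if not name.startswith("\\\\") or not rest or "\\" in rest:
        raise ComputerNameError(
            f"Error in computer name: {name!r} is not in UNC form (expected \\\\NAME)"
        )
    return name
```

Either way, the caller gets an error that mentions the computer name instead of the credentials.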

Anyway, that was three days searching for basically the wrong naming format. Of course, it would help if our error message were actually useful, such as "Error in computer name". The red herring about credentials wasted a lot of our time!

I'm glad we found the problem, because that means I didn't have to stay late last night, don't have to stay late tonight, and won't be working as much as possible over the weekend. All the same, I would rather have found it on my own. But given how subtle the problem was, it may have taken me a long time of eliminating other possible failures before finding the root cause.
