
August 04, 2008

Sloppy programming

Hell, it seems that the Apple incident inspired me after all and I got our SBP2 bug down :-)

That was a tough nut to crack. The SBP2 driver was failing with a sample 4TB external hard drive, actually giving me a BSOD. However this was not a nice BSOD like most others. The kernel crashed because it detected a corrupt doubly linked list, and of course I had no idea where the corruption came from.

After cleaning up a lot of stuff I found the culprit... just by looking at the debug messages. We had a failed kernel mode sanity check that led to a buffer overrun. However, the buffer sat at the very end of a struct, so the overrun overwrote whatever struct came next in memory and led to crazy behaviour when that OTHER struct got used, much later on.

Anyway, to cut a long story short the buffer should be 16 bytes instead of 12 bytes. I changed that and looked carefully through the code, then stepped through the initialization code and all looked nice. Sure enough there was no crash any more, however nothing else worked either :-D

So I embarked on a journey to find what the heck could be causing this and here is my finding.

Suppose you have a struct like this:

struct ORB_DATA
{
    // Some members here...

    // At the VERY END of the ORB_DATA struct.
    struct ORB_CMD orb;
};


NEVER, EVER ASSUME that:

(sizeof(ORB_DATA) - sizeof(ORB_CMD)) == FIELD_OFFSET(ORB_DATA, orb)

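The compiler may align things as it sees fit. In particular it may add padding after the last member so that the total size of the struct is a multiple of its alignment, and then the subtraction gives you the wrong offset. So just use FIELD_OFFSET and leave the neat tricks alone.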
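
To make this concrete, here is a minimal sketch in standard C. The struct names and members are made up for illustration (they are not the real driver structs), offsetof from <stddef.h> stands in for the Windows FIELD_OFFSET macro, and the 8-byte alignment is forced with _Alignas (C11) only so that the numbers do not depend on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the command block: 12 bytes, 4-byte alignment. */
struct orb_cmd {
    uint32_t words[3];
};

/* Hypothetical stand-in for the outer struct. The first member forces
 * 8-byte alignment for the whole struct, so the compiler appends 4 bytes
 * of tail padding after `orb` to round the size up from 20 to 24. */
struct orb_data {
    _Alignas(8) uint64_t address;  /* offset 0, 8 bytes   */
    struct orb_cmd orb;            /* offset 8, 12 bytes  */
};

/* The "neat trick": assumes `orb` ends exactly where the struct ends. */
static size_t orb_offset_naive(void)
{
    return sizeof(struct orb_data) - sizeof(struct orb_cmd);
}

/* The reliable way: ask the compiler where the member really is. */
static size_t orb_offset_real(void)
{
    size_t offset = offsetof(struct orb_data, orb);
    assert(offset + sizeof(struct orb_cmd) <= sizeof(struct orb_data));
    return offset;
}
```

With this layout the trick is off by exactly the 4 bytes of tail padding, which is the kind of error that makes a driver write its data to the wrong place without crashing right away.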

Have fun!
Dimitris Staikos

Blunders from Apple? No never!

I am sick and tired of hearing all the newly enlightened Apple enthusiasts go on about how cool Apple and Macs are. Everything is so cute! Everything "just works"! Everything is so fast! Everything is so well designed!

Hallucinations can be really great while they last, so just beat the heck out of it man!

Anyway, while I was working late tonight trying to unravel the mysteries of SBP2 all of a sudden the Apple installation toolkit decides to give me a break and tells me that I should install something which I can't remember. It was a mere 2MB download so I said what the heck.

Next thing I know, it asks me to reboot and I kindly decline. However it seems that it couldn't make up its mind whether it was done with me or not, so it started checking again for additional software updates! So it comes up with a new window informing me that I should "update" QuickTime 7.5 and Safari, which of course I didn't have installed in the first place, so how could I possibly update them? Just for the record, a mere 28MB and 22MB download each, while the Opera browser that I use is a mere 8.5MB, and that is if you get the international version.

One thing struck me as a bit odd. What the hell is "Safari"?
Of course I do know that it is a browser, but shouldn't they at least qualify it with a subtitle, like "Safari browser"? I mean, I DON'T have it installed on my machine and still they expect me to KNOW what "Safari" is? That's some confidence I must admit, but they did go overboard this time!

Being the curious guy that I am, I clicked on "Safari" in order to see the description they provided for it. And here's what awaited me:

Safari

Wow wow wow!!! Safari is "WhichDescription()". Now I am enlightened too :-D

Needless to say, I unchecked Safari because I don't browse the jungle in the first place and I thought of giving QuickTime 7.5 a try. After a minute or so of downloading, I get a beautiful error message telling me that "There were installation errors".

Think about it, isn't it just so cute? It could have said "There WILL be installation errors" but they didn't want to sound almighty I guess :-D

Anyway, I was too quick to dismiss that dialog, so I ran Apple Update once again to repro the error and show it to you. Magically enough... it installed correctly this time :-) Apple nirvana, keep trying dudes and you will get it!

Have fun!
Dimitris Staikos

P.S: I did like the folding dialog though... So cute...
P.P.S: Yeah I know, MS is to blame since I am running Winddoze in the first place so what did I think?

August 01, 2008

Solving an NMI crash

As I described in one of my previous posts, there is a special kind of fatal error called NMI error, a form of ultra fatal error that Windows users don't encounter as often as the infamous Blue Screen of Death (BSOD).

It is caused by some erroneous hardware behavior. To put it in 'common sense' terms, let's say that an NMI crashes the motherboard, while a BSOD crashes the operating system.

Of course, it is the software (operating system and device drivers) that programs the hardware, so if the hardware is not at fault (true hardware damage) then it is the software that is to blame once again :-).
Even if the chips have bugs (hardware bugs) the device driver developer is required to write code that circumvents the hardware bug so that the device operates correctly.

I have been working with Windows device driver code for a total of about 6 years in my career and of course I have encountered and solved dozens of BSODs in our drivers. But so far I had met only one NMI error caused by our driver. A couple of days ago I encountered my second NMI error and I was truly excited!!!

Why was I excited instead of disappointed and thinking that our code is crappy? For starters, I KNOW our code is not crappy (famous last words) so I couldn't possibly worry about that.
There are two reasons why I was excited:
(1) One NMI error in 6 years means that you don't get the chance to debug such beasts very often. The more often you get to debug such errors, the more experienced you become in the sorts of things that cause NMIs and how to debug them. That kind of experience is extremely hard to get and extremely valuable.
(2) Secondly, as I explained in my previous NMI post, the software was running smoothly everywhere but was causing an NMI on some semi-exotic platform. This was exactly the case once again, an exotic Xeon-based super server. If we have clients that decide to run our software even on such exotic machines then (a) we must be having a lot of clients (since only a small percentage use exotic machines overall) and (b) they trust our software enough to use it in extremely demanding applications that require exotic machines.

In both NMIs the actual case was that the client, let's say, started using our software on Series 3 of the server, then upgraded to Series 4, then Series 5, and when they tried to upgrade to Series 6 (in their labs of course) they found out that an NMI was occurring.
So this is not a matter of trust in our code and our company. Our clients KNOW that our software works correctly and they obviously realize that we don't have the latest and greatest version of every exotic server available in our lab to test with, so they tell us about the error, we get it fixed and both parties are totally happy and excited!

So if there is a common sense lesson here for ordinary users it is this one: If you decide to build your own super exotic PC with a uniquely super cool combination of the latest-greatest-fanciest hardware, don't act all that surprised if you have driver problems (BSODs or more likely NMIs). If you want to stay out of trouble then shoot for something that is less exotic. Or at least ask your PC provider to build your PC and then test it a bit before you pay for it and take it home.

OK then, just for the record, let me tell you what was causing the NMI.

Initially I thought that it was a cache-coherency issue that made the 1394 adapter read garbage for the context program of its DMA context. A cache-coherency problem means that the code updates some memory, but the new contents are still sitting inside the CPU cache because that portion of the cache has not been flushed to main memory yet.
However, devices read their context programs from main memory, so if the data has not reached there yet the device will read garbage and act accordingly. Too bad that devices can't pop up error message boxes to the user :-D
By examining the code carefully for that kind of error I located a little well-hidden window of opportunity where it could occur. So I fixed this bug, but lucky me, it was not the cause of the NMI. The NMI persisted.

Then I fired up my 1394 Bus Analyzer and after several experiments and server crashes one thing was evident. The 1394 adapter would always transmit exactly 51 packets before NMIing. Now, *always* and *exactly 51* are a strong indication that something a little special must be happening on the 52nd packet.
I knew of course that each packet uses 5 DMA descriptors in the DMA context program, each descriptor being 16 bytes, so I did some simple math: 5*16*51=4080. BINGO!!!
Each physical page of memory has 4096 bytes, so the 5 descriptors of the 52nd packet (80 bytes starting at byte 4080) were sitting right on a physical page boundary. That's always a good start, although the DMA context program is being written in 'physically contiguous' pages so crossing the boundary shouldn't result in any surprises.
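
The arithmetic above can be sketched in a few lines of C. This is only an illustration, not our driver code: the names are made up, and it assumes the context program starts at the beginning of a page and that packets are laid out back to back.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DESCRIPTOR_SIZE        = 16,    /* bytes per DMA descriptor        */
    DESCRIPTORS_PER_PACKET = 5,     /* descriptors used by each packet */
    PAGE_SIZE_BYTES        = 4096   /* bytes per physical page         */
};

/* Byte offset, from the start of the context program, where the
 * descriptors of packet number `packet` (counting from 1) begin. */
static size_t packet_start_offset(size_t packet)
{
    assert(packet >= 1);
    return (packet - 1) * DESCRIPTORS_PER_PACKET * DESCRIPTOR_SIZE;
}

/* Nonzero if the packet's descriptors begin in one page and end in the next. */
static int packet_straddles_page(size_t packet)
{
    size_t first = packet_start_offset(packet);
    size_t last  = first + DESCRIPTORS_PER_PACKET * DESCRIPTOR_SIZE - 1;
    return first / PAGE_SIZE_BYTES != last / PAGE_SIZE_BYTES;
}

/* Number of the first packet whose descriptors cross a page boundary. */
static size_t first_straddling_packet(void)
{
    size_t packet = 1;
    while (!packet_straddles_page(packet))
        packet++;
    return packet;
}
```

Under those assumptions packets 1 to 51 fit entirely inside the first page, and packet 52 is the first one whose descriptors begin in one page and end in the next.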

The NMI was caused by what we call 'isochronous transmit'. But there is also 'Isochronous receive' and it also uses 5 descriptors of 16 bytes each, but DID NOT crash on the 52nd packet, on the same machine of course.
How could that possibly be?

I studied the chip specs closely for any mention of anything related to physical page boundaries for the DMA context programs and sure enough... there was nothing. No restrictions whatsoever were mentioned.

Mind you, the code that is preparing the isochronous transmit context program was originally written back in 1999 so it has been literally tested on thousands of computers since then. This was sure a neat thing that was going down here. (Note: 1999 is more than 6 years ago, but I didn't work with drivers all the time ;-))

Then I studied our code again and soon I found out that the isochronous transmit was using 5 descriptors but the first one was a bit special, in fact something like a "double" descriptor. Since 4096-4080=16, it became evident that this "double" descriptor was getting split across two physical pages (always physically contiguous in memory).

Hmmm, I thought, maybe this machine is not too happy with this fact and crashes with the NMI at the moment the 1394 DMA chip tries to read the 32 bytes of the next descriptor in one operation from two different physical pages.
This sounded plausible enough in my mind, so I decided to give it a try.

Of course I was not sure that it was the correct reason, I mean it works everywhere else right?

I would have to change the code first, then run it and see if the NMI goes away. But I estimated the correction to be at least a week's worth of coding, because too many things had to change in order to accommodate the required 'holes' in the context program.
Classic case of a Catch-22. I can't put in 4-5 days of work just to test an idea! What if it was not the correct one?

Then something else came to mind... a quick and dirty solution... If I add 3 nop descriptors (nop="no op"="no operation") to each packet then I will have 8*16 bytes = 128 bytes. And sure enough 4096 is divisible by 128, so I would never have the special descriptor on a page boundary.
Of course this means 48 wasted bytes of physical memory for each packet, which is a very precious resource. The requests we deal with may contain thousands of packets, so that would be a non-trivial waste.
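
Here is a rough sketch in C of why the padding trick works. Again this is not the real driver code: the names are made up, and it assumes a page-aligned context program in which every packet starts with the 32-byte "double" descriptor.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DESCRIPTOR_SIZE        = 16,    /* bytes per ordinary descriptor      */
    DOUBLE_DESCRIPTOR_SIZE = 32,    /* the special first descriptor       */
    PAGE_SIZE_BYTES        = 4096,  /* bytes per physical page            */
    NOP_DESCRIPTORS        = 3      /* padding added to each packet       */
};

/* Returns the number (counting from 1) of the first packet whose leading
 * 32-byte descriptor is split across two pages, or 0 if none of the
 * first `packets` packets has that problem. */
static size_t first_split_packet(size_t descriptors_per_packet, size_t packets)
{
    assert(descriptors_per_packet >= 2);
    size_t stride = descriptors_per_packet * DESCRIPTOR_SIZE;
    for (size_t i = 0; i < packets; i++) {
        size_t first = i * stride;
        size_t last  = first + DOUBLE_DESCRIPTOR_SIZE - 1;
        if (first / PAGE_SIZE_BYTES != last / PAGE_SIZE_BYTES)
            return i + 1;
    }
    return 0;
}

/* Physical memory spent on nop descriptors for a request of `packets` packets. */
static size_t wasted_bytes(size_t packets)
{
    return packets * NOP_DESCRIPTORS * DESCRIPTOR_SIZE;
}
```

With 5 descriptors per packet the first split shows up at packet 52. With 8 descriptors per packet every packet starts on a multiple of 128, so the 32-byte descriptor can never cross a 4096-byte boundary, at a cost of 48 bytes per packet.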

I didn't have to think twice before I decided that this was not an acceptable solution, but it was just fine in order to try out my theory!

And then came that precious moment of glory!!!
It worked like a charm :-)
My theory was right and bye-bye NMI #2.

Then I embarked on a 4-day effort to implement the proper solution that doesn't waste memory and works too.

Isn't it just so cool being a driver developer in your spare time? :-D

Have fun!
Dimitris Staikos