As I described in one of my previous posts, there is a special kind of fatal error called an NMI (non-maskable interrupt) error, an ultra-fatal error that Windows users don't encounter as often as the infamous Blue Screen of Death (BSOD).
It is caused by some erroneous hardware behavior which, to put it in 'common sense' terms, let's say crashes the motherboard, while BSODs crash the operating system.
Of course, it is the software (operating system and device drivers) that programs the hardware, so if the hardware is not at fault (true hardware damage) then it is the software that is to blame once again :-).
Even if the chips have bugs (hardware bugs) the device driver developer is required to write code that circumvents the hardware bug so that the device operates correctly.
I have been working with Windows device driver code for a total of about 6 years in my career and of course I have encountered and solved dozens of BSODs in our drivers. But so far I had met only one NMI error caused by our driver. A couple of days ago I encountered my second NMI error and I was truly excited!!!
Why was I excited, instead of being disappointed and thinking that our code is crappy? For starters, I KNOW our code is not crappy (famous last words), so I couldn't possibly worry about that.
There are two reasons why I was excited:
(1) One NMI error in 6 years means that you don't get the chance to debug such beasts very often. The more often you get to debug such errors, the more experienced you become in the sorts of things that cause NMIs and how to debug them. That kind of experience is extremely hard to get and extremely valuable.
(2) As I explained in my previous NMI post, the software was running smoothly everywhere but was causing an NMI on some semi-exotic platform. This was exactly the case once again: an exotic Xeon-based super server. If we have clients that decide to run our software even on such exotic machines, then (a) we must have a lot of clients (since only a small percentage use exotic machines overall) and (b) they trust our software enough to use it in extremely demanding applications that require exotic machines.
In both NMIs, the actual case was that the client, let's say, started using our software on Series 3 of the server, then upgraded to Series 4, then Series 5, and when they tried to upgrade to Series 6 (in their labs of course) they found out that an NMI was occurring.
So there is no question of trust in our code and our company. Our clients KNOW that our software works correctly and they obviously realize that we don't have the latest and greatest version of every exotic server available in our lab to test with, so they tell us about the error, we get it fixed and both parties are totally happy and excited!
So if there is a common-sense lesson here for ordinary users, it is this one: if you decide to build your own super exotic PC with a uniquely super cool combination of the latest-greatest-fanciest hardware, don't act too surprised if you have driver problems (BSODs or, more likely, NMIs). If you want to stay out of trouble, then shoot for something less exotic. Or at least ask your PC provider to build your PC and test it a bit before you pay for it and take it home.
OK then, just for the record, let me tell you what was causing the NMI.
Initially I thought that it was a cache-coherency issue that made the 1394 adapter read garbage for the context program of its DMA context. A cache-coherency problem means that the code updates some memory, but the new contents are still sitting in the CPU cache because that portion of the cache has not been flushed to main memory.
However, devices read their context programs from main memory, so if the data has not reached there yet, the device will read garbage and act accordingly. Too bad that devices can't pop up error message boxes to the user :-D
By examining the code carefully for that kind of error I located a little well-hidden window of opportunity where it could occur. So I fixed this bug, but lucky me, it was not the cause of the NMI. The NMI persisted.
Then I fired up my 1394 Bus Analyzer and after several experiments and server crashes one thing was evident. The 1394 adapter would always transmit exactly 51 packets before NMIing. Now, *always* and *exactly 51* are a strong indication that something a little special must be happening on the 52nd packet.
I knew of course that each packet uses 5 DMA descriptors in the DMA context program, each descriptor being 16 bytes, so I did some simple math: 5*16*51=4080. BINGO!!!
Each physical page of memory has 4096 bytes, so the 52nd packet starts just 16 bytes before the end of the first page and its 5 descriptors straddle a physical page boundary. That's always a good start, although the DMA context program is written in 'physically contiguous' pages, so crossing the boundary shouldn't result in any surprises.
The NMI was caused by what we call 'isochronous transmit'. But there is also 'isochronous receive', which also uses 5 descriptors of 16 bytes each, yet it DID NOT crash on the 52nd packet, on the same machine of course.
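The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: it assumes the context program starts on a page boundary and that packets are laid out back to back, and the names (`packet_span`, `crosses_page`) are made up for the illustration.

```python
PAGE_SIZE = 4096             # bytes per physical page
DESCRIPTOR_SIZE = 16         # bytes per DMA descriptor
DESCRIPTORS_PER_PACKET = 5
PACKET_SIZE = DESCRIPTOR_SIZE * DESCRIPTORS_PER_PACKET  # 80 bytes


def packet_span(index):
    """Return (first_byte, last_byte) offsets of the 0-based packet `index`.

    Assumption: the context program begins at offset 0 of a page.
    """
    start = index * PACKET_SIZE
    return start, start + PACKET_SIZE - 1


def crosses_page(index):
    """True if the packet's descriptors span two physical pages."""
    start, end = packet_span(index)
    return start // PAGE_SIZE != end // PAGE_SIZE


# Find the first packet whose descriptors cross a page boundary.
first_crossing = next(i for i in range(1000) if crosses_page(i))
print(first_crossing + 1)           # 1-based packet number: 52
print(packet_span(first_crossing))  # (4080, 4159)
```

So the first 51 packets fit entirely inside the first page, and the 52nd is the first one that touches the second page.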
How could that possibly be?
I studied the chip specs closely for any mention of anything related to physical page boundaries for the DMA context programs and sure enough... there was nothing. No restrictions whatsoever were mentioned.
Mind you, the code that prepares the isochronous transmit context program was originally written back in 1999, so it has literally been tested on thousands of computers since then. Something really neat was going on here. (Note: 1999 is more than 6 years ago, but I didn't work with drivers all the time ;-))
Then I studied our code again and soon I found out that the isochronous transmit was using 5 descriptors but the first one was a bit special, in fact something like a "double" descriptor. Since 4096-4080=16, it became evident that this "double" descriptor was getting split across two physical pages (always physically contiguous in memory).
Hmmm, I thought, maybe this machine is not too happy with this fact and crashes with the NMI at the moment the 1394 DMA chip is trying to read the 32 bytes of the next descriptor in one operation from two different physical pages.
This sounded plausible enough in my mind, so I decided to give it a try.
Of course I was not sure that it was the correct reason. I mean, it works everywhere else, right?
I would have to change the code first, then run it and see if the NMI goes away. But I estimated the correction to be at least a week's worth of coding, because too many things had to change in order to accommodate the required 'holes' in the context program.
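To make the theory concrete, here is a small Python sketch that lists which packets would have their first descriptor split across two pages. It assumes a page-aligned context program, 80-byte packets laid out back to back, and a first descriptor that is read as one 32-byte block for transmit versus a plain 16-byte one for receive. The function name `split_packets` is invented for this example.

```python
PAGE_SIZE = 4096
PACKET_SIZE = 5 * 16  # 80 bytes of descriptors per packet


def split_packets(first_descriptor_size, packet_count):
    """Return 0-based indices of packets whose first descriptor
    does not fit in what is left of its physical page."""
    result = []
    for i in range(packet_count):
        offset_in_page = (i * PACKET_SIZE) % PAGE_SIZE
        if offset_in_page + first_descriptor_size > PAGE_SIZE:
            result.append(i)
    return result


# Transmit: the first descriptor is a 32-byte "double" descriptor.
print(split_packets(32, 600))  # [51, 307, 563]

# Receive (assumed to use only plain 16-byte descriptors): never split,
# because every descriptor starts on a multiple of 16.
print(split_packets(16, 600))  # []
```

Index 51 is the 52nd packet, which matches what the bus analyzer showed, and the all-16-byte case would explain why receive never hit the problem.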
Classic case of a Catch-22. I can't put in 4-5 days of work just to test an idea! What if it was not the correct one?
Then something else came to mind... a quick and dirty solution... If I add 3 nop descriptors (nop="no op"="no operation") to each packet, then each packet will take 8*16 bytes = 128 bytes. And sure enough, 4096 is divisible by 128, so every packet would start at a multiple of 128 bytes and I would never have the special descriptor on a page boundary.
Of course this means 48 wasted bytes of physical memory for each packet, and physical memory is a very precious resource. The requests we deal with may contain thousands of packets, so that would be a non-trivial waste.
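Both halves of that trade-off are easy to check numerically. The sketch below assumes the same layout as before (page-aligned start, packets back to back, a 32-byte first descriptor); the helper names are hypothetical.

```python
PAGE_SIZE = 4096
DESCRIPTOR_SIZE = 16
DOUBLE_DESCRIPTOR_SIZE = 32  # the special first descriptor of each packet


def split_packets(descriptors_per_packet, packet_count):
    """Indices of packets whose 32-byte first descriptor straddles a page."""
    stride = descriptors_per_packet * DESCRIPTOR_SIZE
    return [
        i for i in range(packet_count)
        if (i * stride) % PAGE_SIZE + DOUBLE_DESCRIPTOR_SIZE > PAGE_SIZE
    ]


def wasted_bytes(padded, original, packet_count):
    """Memory spent on nop descriptors for `packet_count` packets."""
    return (padded - original) * DESCRIPTOR_SIZE * packet_count


print(split_packets(5, 100))       # [51]  -> the original layout
print(split_packets(8, 10000))     # []    -> padded to 8 descriptors
print(wasted_bytes(8, 5, 1))       # 48 bytes per packet
print(wasted_bytes(8, 5, 10000))   # 480000 bytes for a 10000-packet request
```

With 128-byte packets, exactly 32 packets fit in each page, so no packet ever starts near the end of one.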
I didn't have to think twice before I decided that this was not an acceptable solution, but it was just fine in order to try out my theory!
And then came that precious moment of glory!!!
It worked like a charm :-)
My theory was right and bye-bye NMI #2.
Then I embarked on a 4-day effort to implement the proper solution that doesn't waste memory and works too.
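For the curious, here is one way the 'holes' idea could be sketched in Python. This is NOT the actual driver code, just an illustration under assumptions: the context program starts on a page boundary, each packet is 80 bytes with a 32-byte first descriptor, and a hole is left only when that first descriptor would not fit in the rest of the page. How the adapter is told to skip over a hole is not covered here.

```python
PAGE_SIZE = 4096
PACKET_SIZE = 5 * 16          # 80 bytes of descriptors per packet
DOUBLE_DESCRIPTOR_SIZE = 32   # first descriptor, read as one block


def layout_packets(packet_count):
    """Place packets back to back, skipping to the next page only when
    the 32-byte first descriptor would otherwise straddle a boundary.

    Returns (list of packet start offsets, total bytes lost to holes).
    """
    offsets = []
    hole_bytes = 0
    offset = 0
    for _ in range(packet_count):
        room_in_page = PAGE_SIZE - offset % PAGE_SIZE
        if room_in_page < DOUBLE_DESCRIPTOR_SIZE:
            # Leave a hole and start this packet on the next page.
            hole_bytes += room_in_page
            offset += room_in_page
        offsets.append(offset)
        offset += PACKET_SIZE
    return offsets, hole_bytes


offsets, holes = layout_packets(1000)
print(offsets[50], offsets[51])  # 4000 4096
print(holes)                     # 304 bytes lost, versus 48000 with nop padding
```

Under these assumptions the holes cost 16 bytes once every 51 packets, instead of 48 bytes on every single packet.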
Isn't it just so cool being a driver developer in your spare time? :-D
Have fun!
Dimitris Staikos