Monday, January 29, 2007

Remembering the pioneers of space

It was 40 years ago that Virgil Grissom, Ed White and Roger Chaffee died in the Apollo launchpad fire. Many test pilots died before and after them.

Many have stood on all their shoulders since - this is the way of engineering progress. Mistakes are made, unaccounted-for situations occur, systems fail and people die as a result. We add new data points, make our notes and pray we learn something from the errors. One of the worst things for an engineer is the mystery failure - the one you can't explain.

As a young whelp engineer at Rockwell working on the Space Shuttle, I had the honor of working with one of the cockpit flight computer systems (the DEUs that control the screens and keypads) that exhibited just such a "mystery failure" with disturbing regularity. For a number of years before I got there, the machines had been locking up randomly. No particular pattern or sequence of events led to it. Nobody could figure out why. It was being written off as cosmic rays, gremlins, etc. Whatever was causing it, we were losing one about once a month to this issue. With only about 20 units in existence in the whole Shuttle program (in labs and orbiters combined), the chance of losing one on a flight was significantly greater than zero.

I resolved to find that bug and explain it, even if it couldn't be fixed (there were only 20 words of patch space left in the box, so a repair might not be possible). An explained failure is far preferable to NASA and flight crews than a mystery failure. Known devils are OK. Unknown devils are scary.

After several days of sliding a mental window over the assembler code (I suspected an interrupt-related window of vulnerability), I did find the cause: there was indeed a ~43 microsecond window in which the thing would go nuts if an interrupt occurred. The DEU is a really slow machine, so a 43 microsecond window on it was only a few instructions.
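To make the idea of a few-instruction window concrete, here is a minimal sketch in Python. It is not the DEU code (that isn't reproduced here). It is a generic, hypothetical load/add/store update of a shared word, with a made-up interrupt handler that touches the same word. The names `run` and `vulnerable_positions` are invented for illustration.

```python
# Hypothetical illustration only: a generic read-modify-write race between
# main-line code and an interrupt handler, NOT the actual DEU sequence.

def run(interrupt_at):
    """Run a 3-instruction main-line update of a shared word.

    The simulated interrupt fires just before instruction `interrupt_at`
    (0, 1 or 2), or after the last instruction when `interrupt_at` is 3.
    Returns the final value of the shared word.
    """
    shared = {"count": 0}   # word shared by main-line code and the handler
    regs = {"acc": 0}       # single accumulator used by main-line code

    def interrupt_handler():
        # The handler performs its own complete update of the shared word.
        shared["count"] += 10

    def load():
        regs["acc"] = shared["count"]

    def add():
        regs["acc"] += 1

    def store():
        shared["count"] = regs["acc"]

    steps = [load, add, store]
    for i, step in enumerate(steps):
        if i == interrupt_at:
            interrupt_handler()
        step()
    if interrupt_at == len(steps):
        interrupt_handler()
    return shared["count"]


def vulnerable_positions():
    """Return the interrupt positions at which the handler's update is lost."""
    expected = 11  # main-line adds 1, handler adds 10
    return [i for i in range(4) if run(i) != expected]


if __name__ == "__main__":
    print(vulnerable_positions())
```

In this toy version, an interrupt that lands after the load but before the store has its update silently overwritten. An interrupt before the load or after the store is harmless, which is why such bugs look random from the outside.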

Essentially, the issue was a classic mutual exclusion problem, the kind one would solve with a semaphore or similar mechanism. Problem was, the CPU was very primitive and didn't have any low-level hardware capability for implementing a semaphore the way modern CPUs do, i.e. it had no notion of an exchange instruction. There's always Lamport's Bakery algorithm, but that wasn't going to fit in 20 words of patch space, so we lived with it for STS-1/2/3, and I don't know what happened after that. But at least it was explained now. Knowledge had been increased. The unknown had become known.
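For a sense of what a software-only lock costs, here is a minimal sketch of Lamport's bakery algorithm, which gets mutual exclusion from plain reads and writes with no exchange instruction. It is written for two Python threads purely as an illustration, and it assumes a standard CPython build where the global interpreter lock keeps list reads and writes in program order. It is not what ran on the DEU, where the conflict was with an interrupt rather than a second thread. `BakeryLock` and `run_demo` are hypothetical names.

```python
import threading
import time


class BakeryLock:
    """Lamport's bakery lock for a fixed number of threads (plain reads/writes only)."""

    def __init__(self, n):
        self.choosing = [False] * n  # True while thread i is picking a ticket
        self.number = [0] * n        # ticket per thread, 0 means "not waiting"

    def acquire(self, i):
        # Take a ticket one higher than any ticket currently visible.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        for j in range(len(self.number)):
            if j == i:
                continue
            # Wait until thread j has finished picking its ticket.
            while self.choosing[j]:
                time.sleep(0)
            # Wait while thread j holds a ticket that beats ours
            # (ties are broken in favour of the lower thread index).
            while self.number[j] != 0 and (self.number[j], j) < (self.number[i], i):
                time.sleep(0)

    def release(self, i):
        self.number[i] = 0


def run_demo(rounds=200):
    """Two threads each increment a shared counter `rounds` times under the lock."""
    lock = BakeryLock(2)
    shared = {"count": 0}

    def worker(i):
        for _ in range(rounds):
            lock.acquire(i)
            value = shared["count"]       # read
            time.sleep(0)                 # invite a thread switch mid-update
            shared["count"] = value + 1   # write back
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


if __name__ == "__main__":
    print(run_demo())
```

Even in this stripped-down form, the lock needs a flag and a ticket per participant plus two wait loops, which gives some feel for why it would not squeeze into 20 words of patch space.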

3 comments:

Mike said...

Sounds like fun. More fun than working on web services...

Purple Avenger said...

Yeah, aerospace/defense systems are a whole different world than commercial stuff.

The lab toys are certainly a lot more exotic and expensive.

The Shuttle cockpit display video hardware was incredibly advanced architecturally for stuff built in the early '70s.

The CPU that drove that hardware was incredibly primitive though - a single accumulator architecture with no protection mechanism at all. When the program went nuts, there were no GP faults, invalid instruction exceptions, etc. - it just kept executing and wasting the core (these machines used actual magnetic core memory).

The video hardware had a notion of sprites and display lists, and objects on the list could have properties like direction and speed, rotation, and intensity variability.

So you could take some symbology element, slap it onto the display list and the video processor itself would slide it across the screen, rotate it, and vary its intensity without the wimpy SP0 integer CPU having to do anything.

That allowed a very wimpy and slow CPU to do very impressive display manipulations and not get saturated. We didn't see those sorts of architectural features in a PC video card until well into the SVGA era.
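A rough sketch of the display-list idea described above, in Python. The field names, units and the `advance_frame` function are all hypothetical, since the real hardware's list format isn't described here. The point is only that the host sets the properties once and the video side does the per-frame work.

```python
from dataclasses import dataclass


@dataclass
class DisplayItem:
    """One symbology element on a hypothetical display list."""
    symbol: str
    x: float
    y: float
    dx: float = 0.0         # horizontal speed, screen units per frame
    dy: float = 0.0         # vertical speed, screen units per frame
    angle: float = 0.0      # current rotation in degrees
    spin: float = 0.0       # rotation rate, degrees per frame
    intensity: float = 1.0  # brightness, clamped to 0.0 .. 1.0
    fade: float = 0.0       # brightness change per frame


def advance_frame(display_list):
    """What the video processor would do each frame, with no host CPU involvement."""
    for item in display_list:
        item.x += item.dx
        item.y += item.dy
        item.angle = (item.angle + item.spin) % 360.0
        item.intensity = min(1.0, max(0.0, item.intensity + item.fade))


if __name__ == "__main__":
    # The host CPU's only job: put the element on the list once.
    marker = DisplayItem("marker", x=0.0, y=0.0, dx=2.0, dy=1.0,
                         spin=90.0, intensity=0.5, fade=0.25)
    display_list = [marker]
    for _ in range(3):
        advance_frame(display_list)
    print(marker)
```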

In a way it reminds me a little of the old IBM PGA video adapter that had a ton of hardware assist and an onboard 8088 video processor. You didn't manipulate the high-res screen on a PGA directly, you sent it directions through a small buffer, and its onboard 8088 handled an internal display list.

Cost killed PGA (~$2,000 back in '83 or '84, and another $2,000 for the monitor) and the EGA took hold instead. If the PGA had taken hold as a video standard, PC video systems would have advanced a LOT more quickly.

Purple Avenger said...

"In fact, I was told once that it was the EXACT processor used in a Commodore 64, or was it VIC-20? Something like that."

That's not true.

The SP0s (cockpit display/keypads) and AP101s (the 5 main flight computers everyone's heard about) are not even microprocessors.

They're backplane-based systems with a slew of boards that plug into the backplane. The SP0/DEU "CPU" consisted of somewhere around (if I remember correctly) 19 boards, each built with discrete transistor logic - i.e. not an IC anywhere on them.

We're talking seriously "old school" here. We're also talking seriously radiation hard. The discrete logic and magnetic core made the things virtually impervious to any sort of radiation, and the magnetic core is by nature non-volatile.

One of the guys had to write a memory scrubber for whatever the first military mission was, to wipe the core boards clean. You can't just erase it; there are hysteresis effects that remain after a simple erasure. You gotta hit it with many patterns over and over to really get it clean.
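The "many patterns over and over" idea can be sketched like this in Python. The patterns, the pass count and the 16-bit word width are assumptions for illustration only. The real scrubber's details aren't given here, and `scrub` is a made-up name.

```python
# Assumed patterns for a hypothetical 16-bit word: all zeros, all ones,
# and the two alternating-bit patterns. The real scrubber's are unknown.
PATTERNS = (0x0000, 0xFFFF, 0xAAAA, 0x5555)


def scrub(memory, passes=3, patterns=PATTERNS):
    """Overwrite every word with each pattern, repeated `passes` times.

    `memory` is a mutable list of ints standing in for a core board.
    It is left all zeros at the end and returned for convenience.
    """
    for _ in range(passes):
        for pattern in patterns:
            for addr in range(len(memory)):
                memory[addr] = pattern
    # Final pass: leave the memory in a known blank state.
    for addr in range(len(memory)):
        memory[addr] = 0x0000
    return memory


if __name__ == "__main__":
    core = [0x1234, 0xBEEF, 0x0007]
    print(scrub(core))
```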