It was 40 years ago that Virgil Grissom, Ed White and Roger Chaffee died in the Apollo launchpad fire. Many test pilots died before and after them.
Many have stood on all their shoulders since - this is the way of engineering progress. Mistakes are made, unaccounted-for situations occur, systems fail and people die as a result. We add new data points, make our notes and pray we learn something from the errors. One of the worst things for an engineer is the mystery failure - the one you can't explain.
As a young whelp engineer at Rockwell working on the Space Shuttle, I had the honor of working with one of the cockpit flight computer systems (the DEUs that control the screens and keypads) that exhibited just such a "mystery failure" with disturbing regularity. For a number of years before I got there, the machines had been locking up randomly. No particular pattern or sequence of events led to it. Nobody could figure out why. It was being written off as cosmic rays, gremlins, etc. Whatever was causing it, we were losing one about once a month to this issue. With only about 20 units in existence in the whole Shuttle program (in labs and orbiters combined), the chance of losing one on a flight was significantly greater than zero.
I resolved to find that bug and explain it, even if it couldn't be fixed (there were only 20 words of patch space left in the box, so a repair might not be possible). To NASA and flight crews, an explained failure is far preferable to a mystery failure. Known devils are OK. Unknown devils are scary.
After several days of sliding a mental window over the assembler code (I suspected an interrupt-related window of vulnerability), I did find the cause - there was indeed a ~43 microsecond window in which the thing would go nuts if an interrupt occurred. The DEU is a really slow machine, so a 43 microsecond window on it was only a few instructions.
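For anyone who hasn't chased one of these, here is a toy illustration of what a few-instruction window of vulnerability looks like. It is not the DEU code (none of that is reproduced here). The record layout, the three-step update and every name in it are made up for the example, and it's in Python only because that's easy to read. It fires a single simulated interrupt at each possible point around a three-step update and reports which positions leave the shared data corrupted.

```python
def consistent_after_interrupt_at(boundary):
    """Run a 3-step main-line update, firing one interrupt just before step `boundary`.

    boundary == 3 means the interrupt fires after the update has finished.
    Returns True if the shared record is still consistent at the end.
    """
    record = {"count": 0, "items": []}   # invariant: count == len(items)
    scratch = {}                          # stands in for a CPU register

    def interrupt_handler():
        # Hypothetical handler that touches the same record as the main line.
        record["items"].append("irq")
        record["count"] += 1

    def load_count():
        scratch["count"] = record["count"]

    def append_item():
        record["items"].append("main")

    def store_count():
        # Writes back a value computed from the (possibly stale) loaded copy.
        record["count"] = scratch["count"] + 1

    steps = [load_count, append_item, store_count]
    for index, step in enumerate(steps):
        if index == boundary:
            interrupt_handler()
        step()
    if boundary == len(steps):
        interrupt_handler()
    return record["count"] == len(record["items"])


def vulnerable_boundaries():
    """List every interrupt position that leaves the record corrupted."""
    return [b for b in range(4) if not consistent_after_interrupt_at(b)]


if __name__ == "__main__":
    print(vulnerable_boundaries())
```

In this toy, only an interrupt landing between the load and the store does damage; one that arrives before the load or after the store is harmless. That is why such a bug can look random for years: the interrupt has to hit a handful of instructions out of everything the machine executes.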
Essentially, the issue was a classic mutual exclusion problem, the kind one would solve with a semaphore or similar mechanism. Problem was, the CPU was very primitive and didn't have any low-level hardware capability for implementing a semaphore the way modern CPUs do; that is, it had no notion of an exchange instruction. There's always the Bakery algorithm, but that wasn't going to fit in 20 words of patch space, so we lived with it for STS-1/2/3, and I don't know what happened after that. But at least it was explained now. Knowledge had been increased. The unknown had become known.
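To give a feel for what mutual exclusion costs when there is no exchange instruction, here is a minimal sketch of one well-known software-only lock, Lamport's Bakery algorithm, which needs nothing but ordinary loads and stores. It is an illustration in Python, not anything that ran on the DEU. The thread count, the names and the test harness are all assumptions for the example, and it relies on CPython executing these loads and stores in program order.

```python
import threading
import time

N = 3                       # number of competing threads (arbitrary for the demo)
choosing = [False] * N      # True while thread i is picking its ticket
ticket = [0] * N            # 0 means "not waiting for the lock"
counter = 0                 # shared data protected by the lock
inside = 0                  # how many threads are in the critical section now
max_inside = 0              # highest value `inside` ever reached


def lock(i):
    """Bakery lock: take a ticket, then wait for every smaller ticket to finish."""
    choosing[i] = True
    ticket[i] = 1 + max(ticket)
    choosing[i] = False
    for j in range(N):
        if j == i:
            continue
        while choosing[j]:              # wait until thread j has its ticket
            time.sleep(0)               # yield so the other thread can run
        # Ties on ticket number are broken by thread index.
        while ticket[j] != 0 and (ticket[j], j) < (ticket[i], i):
            time.sleep(0)


def unlock(i):
    ticket[i] = 0


def worker(i, rounds):
    global counter, inside, max_inside
    for _ in range(rounds):
        lock(i)
        inside += 1
        max_inside = max(max_inside, inside)
        value = counter                 # deliberately split read-modify-write
        time.sleep(0)                   # invite a thread switch mid-update
        counter = value + 1
        inside -= 1
        unlock(i)


def run(rounds=50):
    """Run N workers and return (final counter, peak occupancy of the critical section)."""
    global counter, inside, max_inside
    counter = inside = max_inside = 0
    threads = [threading.Thread(target=worker, args=(i, rounds)) for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, max_inside


if __name__ == "__main__":
    print(run())
```

Even in this compact form the lock needs two shared arrays and a pair of nested wait loops, which gives a sense of why a software-only scheme is a poor fit for 20 words of patch space.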