Friday, November 17, 2006

It was 20+ years ago this code comment was written

Going through some piles of ancient floppy disks, I rediscovered some code. Usually ancient code isn't very interesting in modern terms, but this code embodied an IDEA that is still very interesting.

The comment below in green text was written back around 1984 on a 4.77 MHz 8088. The program it was part of was absolutely unique in its day -- it took already-compiled object modules produced by a commercial C compiler (Lattice), tore them apart, completely analyzed the logic flow in them, then rewrote the code to prevent the application being processed from accidentally overwriting DOS, the IVT (interrupt vector table) or its own code due to some pointer bug.

When a bug was caught, the modified code produced a register dump and issued a message indicating exactly what module and line number the error occurred on. The trapper code was of course replaceable and could just as well have performed a warm start and reinitialization of the app for something that simply couldn't be allowed to crash. The ability to have a self-diagnosing and potentially self-healing C code application in 1984 was a pretty heady notion. In a real mode environment with no hardware protections available at all, it was a near miraculous notion. It wasn't until the 386 chip and its V86 mode came along that outfits like Numega even bothered to attempt producing a similar result.
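To make the idea concrete, here is a rough sketch of the kind of check a rewritten store instruction could call before writing through a pointer. This is not the 1984 code. The function names and the CODE range are made up for illustration, and the real tool's checks may have looked nothing like this. The only hard facts used are real mode address arithmetic (segment * 16 + offset, wrapping at 1 MB on an 8088) and the IVT living in the first 1 KB of memory.

```python
# Hypothetical sketch; names and the CODE range are illustrative assumptions.

def linear(seg, off):
    """Real mode 20-bit physical address: segment * 16 + offset, wrapping at 1 MB."""
    return ((seg << 4) + off) & 0xFFFFF

# Half-open [start, end) ranges of memory the app must never write to.
IVT = (0x00000, 0x00400)    # 256 vectors * 4 bytes, fixed by the architecture
CODE = (0x10000, 0x18000)   # assumed load range of the app's own code

def write_allowed(seg, off, size, protected):
    """Return False if a write of `size` bytes at seg:off overlaps a protected range."""
    start = linear(seg, off)
    end = start + size      # simplification: ignores a write that straddles the 1 MB wrap
    for lo, hi in protected:
        if start < hi and lo < end:
            return False
    return True
```

A trap handler would be invoked whenever write_allowed() returns False, for example for a stray write through a null far pointer (0000:0000), which lands squarely on the first interrupt vector.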

Similar code analysis magic constructed to build this 20+ year old app could be applied to the output from mediocre compilers today (e.g. GCC). Once you've scanned and identified all the straight-line blocks of code, classified all the instructions as to what registers, flags, etc. they touch, you're in a good position to rewrite those blocks with semantic equivalents that are smaller or faster than the stock compiler output. Think peephole optimizing. GCC's x86 peephole optimization capability is weak. A good assembler programmer can ALWAYS beat it handily.
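A toy sketch of those two steps, under loud assumptions: instructions are simplified (op, dst, src) tuples rather than real opcodes, jump targets are list indices, and the flag classification covers only a handful of mnemonics. It splits code into straight-line blocks, then rewrites `mov reg, 0` as `xor reg, reg` (two bytes instead of three for a 16-bit register) only when the flag classification proves the flags are overwritten before anything reads them.

```python
# Toy model only: (op, dst, src) tuples; a jump's dst is an index into the
# list, not a real 8086 encoding. The sets below are deliberately tiny.
JUMPS = {"jmp", "jz", "jnz"}
TERMINATORS = JUMPS | {"ret"}
FLAG_WRITERS = {"add", "sub", "cmp", "xor"}   # overwrite the arithmetic flags
FLAG_READERS = {"jz", "jnz"}                  # consume the flags

def split_blocks(code):
    """Split code into straight-line blocks at jump targets and after transfers."""
    leaders = {0} if code else set()
    for i, (op, dst, _) in enumerate(code):
        if op in JUMPS:
            leaders.add(dst)                  # a jump target starts a block
        if op in TERMINATORS and i + 1 < len(code):
            leaders.add(i + 1)                # so does whatever follows a transfer
    bounds = sorted(leaders) + [len(code)]
    return [code[s:e] for s, e in zip(bounds, bounds[1:])]

def flags_dead_after(block, pos):
    """True only if the flags are rewritten before being read, inside this block."""
    for op, _, _ in block[pos + 1:]:
        if op in FLAG_READERS:
            return False
        if op in FLAG_WRITERS:
            return True
    return False                              # unknown past the block end: be conservative

def peephole(block):
    """Rewrite mov reg,0 as xor reg,reg where clobbering the flags is harmless."""
    out = []
    for i, (op, dst, src) in enumerate(block):
        if op == "mov" and src == 0 and flags_dead_after(block, i):
            out.append(("xor", dst, dst))
        else:
            out.append((op, dst, src))
    return out

example = [
    ("mov", "ax", 0),        # 0: flags rewritten by the cmp below, safe to rewrite
    ("cmp", "bx", 1),        # 1
    ("jz", 5, None),         # 2
    ("mov", "cx", 0),        # 3: block ends before any flag writer, left alone
    ("jmp", 6, None),        # 4
    ("add", "ax", "bx"),     # 5
    ("ret", None, None),     # 6
]
optimized = [peephole(b) for b in split_blocks(example)]
```

The example splits into four blocks. Only the first `mov` is rewritten; the second is left alone because the analysis can't see past the end of its block. A real tool would need liveness across blocks and a complete instruction table to do better.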

Of course, once all straight-line blocks are identified, it's a simple matter to insert things like real-time profiling counters. It's also a trivial task to insert code path coverage counters.
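As a hedged illustration of what that insertion involves, the sketch below drops a counter probe at the head of every block in a toy instruction list and fixes up the jump targets that shift as a result. The (op, arg) instruction format, the "probe" pseudo-instruction and the little interpreter are all invented for the example; a real rewriter would patch actual machine code and relocation records.

```python
# Toy model only: (op, arg) tuples; a jump's arg is an index into the list.
JUMPS = {"jmp", "jz"}

def find_leaders(code):
    """Indices where a straight-line block starts."""
    leaders = {0} if code else set()
    for i, (op, arg) in enumerate(code):
        if op in JUMPS:
            leaders.add(arg)
        if (op in JUMPS or op == "ret") and i + 1 < len(code):
            leaders.add(i + 1)
    return sorted(leaders)

def instrument(code):
    """Insert ("probe", block_id) before each block and retarget the jumps."""
    leaders = set(find_leaders(code))
    new_index = {}                        # old leader index -> index of its probe
    out = []
    for i, ins in enumerate(code):
        if i in leaders:
            new_index[i] = len(out)
            out.append(("probe", len(new_index) - 1))
        out.append(ins)
    # Jumps land on the probe itself, so entering a block is always counted.
    return [(op, new_index[arg]) if op in JUMPS else (op, arg) for op, arg in out]

def run(code, zero_flag):
    """Minimal interpreter: returns {block_id: hit count} for one execution."""
    counts, pc = {}, 0
    while True:
        op, arg = code[pc]
        if op == "probe":
            counts[arg] = counts.get(arg, 0) + 1
        elif op == "ret":
            return counts
        elif op == "jmp" or (op == "jz" and zero_flag):
            pc = arg
            continue
        pc += 1

program = [("jz", 3), ("nop", None), ("jmp", 4), ("nop", None), ("ret", None)]
probed = instrument(program)
hits = run(probed, zero_flag=True)
never_hit = set(range(4)) - set(hits)     # blocks this run did not reach
```

With zero_flag=True the run skips block 1, so never_hit contains just that block. Accumulating the counts over a whole test suite gives the coverage picture, and the same counters double as a crude profile.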

How many testing groups would like to know what parts of a program have and have not been hit during testing?

How many developers would like to know exactly what parts of a program people are actually using in real life as opposed to what the programmer imagines they might be using?

If an app contains a blob of code that is never executed by customers in the field or during forced error and unit testing, one necessarily has to question why that piece of code is even in the application -- it's effectively "dead".

There are some tools around today that can do some of this, but some require source changes, while others run the app under a debugger or rely on other noxious schemes.

I believe I have a new programming project to fiddle with. We have come so far, yet some things at the lowest levels are still as bad as they always were. Maybe I can fix some of that. Not many care to play at this gritty a level, but it's something I've always loved. It's real, very real. No hand waving. No bullshit. The bits don't lie. It appeals to my INTJ nature a lot.

Some will say, why not just fix GCC and other mediocre compilers? I could in the case of GCC, can't in the case of closed source stuff. Anyway -- I choose not to. How many versions of patches/mods for all the different compiler releases in the field do I care to maintain? The answer to that question is zero. People need older releases for various reasons (e.g. the Linux 2.4 kernels refuse to compile with the GCC 4.1 compiler. Great coordination there guys! Siskel and Ebert give that clusterf**k two thumbs down). My approach is relatively compiler version independent, and if the compiler guys ever start producing the same code my scheme would, then my scheme essentially degenerates into a build-time NOP when it can't find anything to improve.

I've seen the future. It happened over 20 years ago ;->


CHANGE this code at your own risk. It works, and works well. It could also be broken very easily by naive hacking.

This piece of code performs a complicated function and uses a recursive algorithm to do it. Pay particular attention to the storage classes of the variables in the scancode() routine!!

Imbedded in this code are HEX constants which are MAGIC numbers dependent on 8086/80186/80286 opcodes. Some pointer arithmetic is also very INTEL specific. INTEL isn't ever going to change these so I don't feel too guilty about hard wiring them in! Needless to say, any attempt to alter this code without a SOLID knowledge of INTEL's architecture at it's lowest level would be pure folly.



MikeT said...

Ummm where's the code?

Purple Avenger said...

When I figure out how to, I'll probably start a SourceForge project and check in the '84-vintage stuff as a historical reference.