Tuesday, July 18, 2006

Debunking "Debunking the Myth of High-level Languages"

My first English blog post. Yippee! Those of you who read Hebrew are welcome to visit my original blog of the same name. This blog will be English only.

In a recent five-part article, David Chisnall made the claim that the common wisdom, namely that "high level" languages such as Java result in slower-executing code than C, is plain wrong. In a nutshell, David's argument is that, on the one hand, C long ago stopped accurately describing what the hardware actually does, while on the other hand runtime (Just In Time) compilation allows optimizations that are not available during static compilation.

While I do agree that the difference between C and Java performance characteristics is much smaller than it used to be, I will try to argue that the notion that we should all switch to a higher level language to gain performance is wrong. Here's why.

The premise of David's claims is that the era in which CPUs were simple tools carrying out one instruction at a time is long gone. The CISC processors of yore died in favor of agile RISC processors. The idea behind RISC is that you expose more of the CPU's internal architecture to the machine code. In return, you allow the compiler to perform better hardware-dependent optimizations. While you're at it, you identify the handful of instructions that most programs execute 98% of the time, and make those very, very fast. Everything else you leave for the compiler to implement on its own.

The result is machines whose machine code is practically impossible for a human to write directly. One example: the instruction immediately after a branch operation (the branch's "delay slot") is executed regardless of whether the branch is taken or not. You now have to figure out whether the code preceding the branch has a dependency that would cause a pipeline bubble, and, based on the result, either move the last instruction from before the branch into the slot after it, or leave it where it is. It's something a compiler can do reasonably well, but a human will find much harder.
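
To make the scheduling problem a bit more concrete, here is a rough C-level sketch (my own illustration, not David's; the real decision is made on the generated MIPS instructions, not on the C source, and the function names are made up for the example):

    /*
     * Summing a linked list: which statement could a MIPS compiler move
     * into the slot right after the loop-back branch?
     */
    #include <stdio.h>

    struct node { int value; struct node *next; };

    int sum_list(const struct node *p)
    {
        int total = 0;
        while (p != NULL) {
            total += p->value;  /* needed on every iteration and does not feed
                                   the branch condition - a natural candidate
                                   for the delay slot */
            p = p->next;        /* the branch tests the new value of p, so this
                                   one cannot simply be pushed past the branch */
        }
        return total;
    }

    int main(void)
    {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        printf("%d\n", sum_list(&a));   /* prints 6 */
        return 0;
    }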

David is thus wrong in stating that the machine code is an abstraction step away from what actually runs on the machine. On RISC machines, machine code is not converted to microcode. That was a CISC thing, and is now long dead.

There are two problems with the RISC vision. The first is that cross-compatibility between different generations of the same CPU is now, for all intents and purposes, impossible. If a new version of the CPU adds a data path that eliminates a pipeline bubble where one used to be, code written for the old CPU will not run correctly on the new one. While this does result in faster code when compiled for the most recent CPU, it also results in a management nightmare as far as keeping your software and hardware aligned.

The second problem is called "Intel". While Intel did design its CPUs as RISC machines on the inside (they had a dual-pipeline design as far back as the first Pentium chip), their machine code is a CISC relic, mixing instruction encodings from the register-register generation and the register-memory generation, with some relics dating back to the time when accumulator-based programming was the way to go. In short, the Intel machine code is a royal mess. Intel made the strategic decision not to fall into problem #1 above (and when they did try to break away, with the Itanium chip, the result was a sad, pathetic commercial failure). This means all the mess described above needs to be supported somehow, cheaply, and while running fast.

So how do you do that? Intel's answer - you put an optimizer inside the CPU! Intel chips from the Pentium Pro onward decode the x86 CISC instructions, on the fly, into internal RISC-like micro-operations, and the Pentium 4 goes as far as caching the decoded micro-operations in its trace cache, so that subsequent runs over the same code skip the decoding work. This is how they manage to fit this mess of a machine code into a pipelined design, and a multiple-pipeline design at that.

The phenomenon is so pervasive that some people claim that optimizing code for size, rather than for speed, yields the best performance, simply because more code fits into the instruction cache, avoiding the need to re-decode it. This is also the reason that flushing the instruction cache is such a horribly expensive operation on Intel.
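
For the curious, here is a minimal sketch of how one might test the size-versus-speed claim with gcc. The file name and the workload are made up for the example, and the outcome obviously depends on the CPU and on the code in question:

    /*
     *   gcc -O3 -o fast hot_loop.c    # optimize for speed
     *   gcc -Os -o small hot_loop.c   # optimize for size
     *   size fast small               # compare the code (.text) sizes
     *   time ./fast ; time ./small    # compare the running times
     *
     * If the speed-optimized version grows past what the instruction cache
     * can hold, the size-optimized one can come out ahead.
     */
    #include <stdio.h>

    /* a branchy little function the compiler is tempted to inline and unroll */
    static int collatz_steps(unsigned long n)
    {
        int steps = 0;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            steps++;
        }
        return steps;
    }

    int main(void)
    {
        unsigned long i;
        long total = 0;
        for (i = 1; i <= 500000; i++)
            total += collatz_steps(i);
        printf("%ld\n", total);
        return 0;
    }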

Going back to David's article: yes, C is no longer a borderline assembly language. A very sophisticated optimizer needs to analyze the code for loop unrolling, interdependencies, and other hairy concepts before it can reasonably produce hardware-efficient machine code. Please keep in mind, however, that the reason this optimizer is needed is that C is no longer the near-hardware language it used to be. In other words, no language is future proof.
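
To illustrate the kind of transformation involved, here is a hand-written sketch of loop unrolling. This is my own example; a real optimizer performs it on the generated code, not by rewriting your C source:

    #include <stdio.h>

    /* straightforward version: one add, one compare, one branch per element */
    int sum_plain(const int *a, int n)
    {
        int total = 0;
        int i;
        for (i = 0; i < n; i++)
            total += a[i];
        return total;
    }

    /* unrolled by four: the same result, but only one compare and branch per
       four additions, and the four independent accumulators give the pipeline
       more work it can overlap */
    int sum_unrolled(const int *a, int n)
    {
        int t0 = 0, t1 = 0, t2 = 0, t3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            t0 += a[i];
            t1 += a[i + 1];
            t2 += a[i + 2];
            t3 += a[i + 3];
        }
        for (; i < n; i++)      /* leftover elements */
            t0 += a[i];
        return t0 + t1 + t2 + t3;
    }

    int main(void)
    {
        int a[] = { 1, 2, 3, 4, 5, 6, 7 };
        printf("%d %d\n", sum_plain(a, 7), sum_unrolled(a, 7));  /* 28 28 */
        return 0;
    }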

So what about the claim that current high level languages are better suited to current CPUs, then? I think a broader picture must be considered when selecting your programming language. The main question is this: how long do you want your program to live? If another CPU technology breakthrough is probable within your program's lifetime, wouldn't you rather have a language whose compiler's optimizer at least has a chance of catching up? One important point to keep in mind is that the simpler the language (and not many commonly used languages today are simpler than C), the easier it is for the optimizer to understand what your code is doing and to construct the data-flow analysis required for performing optimizations. In other words, the simpler the language, the more sophisticated the optimizer can be.
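
As a small illustration of the data-flow point (again my own example, not from the article): in the first function below the compiler cannot prove that dst and scale point to different memory, so it must assume every store through dst may change *scale and reload it on every iteration. In the second, C99's restrict qualifier rules the aliasing out, the analysis becomes trivial, and the load of *scale can be hoisted out of the loop.

    void scale_maybe_aliased(int *dst, const int *src, const int *scale, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] * *scale;   /* *scale reloaded each iteration */
    }

    void scale_no_alias(int * restrict dst, const int * restrict src,
                        const int * restrict scale, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] * *scale;   /* *scale can be loaded just once */
    }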

David didn't just shoot his opinion out into the air; he also mentioned an actual benchmark. The benchmark compared statically compiled MIPS code with runtime-compiled MIPS code, both running on a MIPS machine. The latter ran 10% to 20% faster than the former. I believe it is no accident that the test was not conducted (at least, not successfully) on an Intel CPU. MIPS is a RISC machine in the very strict, old-fashioned, pre-Intel sense of the word. It is not obvious that you could repeat the same exercise on a machine that does the kind of aggressive hardware optimization that Intel CPUs do.

Last, but not least, the most commonly used JIT language today is Java, with .Net as the closest runner-up. For both, the "object form" you get is not a high level language. In that respect, there is no difference between a Java compiler (one that compiles into bytecode, i.e., not GCJ) and a C compiler. Both compile a high level language into machine code. In Java's case, the machine code is for a made-up machine that no actual CPU will understand, but it looks exactly like any other machine code would. There is no magic "special information" that can be gleaned from the Java bytecode. If a JIT can perform aggressive optimizations on Java bytecode, there is no reason a JIT cannot perform those same optimizations on whatever native machine code you use.
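
To make that point concrete, here is a toy sketch (entirely made up, and deliberately not real JVM bytecode): "bytecode" is nothing more than machine code for an imaginary machine, in this case a tiny stack machine, and a few lines of C are enough to execute it.

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

    /* "machine code" for the made-up machine: computes and prints (2 + 3) * 4 */
    static const int program[] = {
        OP_PUSH, 2,
        OP_PUSH, 3,
        OP_ADD,
        OP_PUSH, 4,
        OP_MUL,
        OP_PRINT,
        OP_HALT
    };

    int main(void)
    {
        int stack[16], sp = 0, pc = 0;

        for (;;) {
            switch (program[pc++]) {
            case OP_PUSH:  stack[sp++] = program[pc++];       break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
            case OP_MUL:   sp--; stack[sp - 1] *= stack[sp];  break;
            case OP_PRINT: printf("%d\n", stack[--sp]);       break;   /* 20 */
            case OP_HALT:  return 0;
            }
        }
    }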

Shachar