Thursday, February 15, 2007

The solution to comment spam

I think I have the solution for comment spam. You know, those annoying automated "comments" that every moderately popular blog receives in bulk, advertising, well, all the usual things spam advertises.

I think most of today's solutions are severely lacking. Counting links produces both false positives and false negatives, while asking users to answer CAPTCHAs is annoying, problematic for blind people, and often machine readable anyway (and, when not, tends to be barely human readable either).

The solution I'm proposing isn't new, as such. It's just that I have never seen anyone apply it to comment spam, and I think it might work. It is, in a very abstract way, based on Merkle's puzzles. Here it is.

Introduce into every form that is meant for sending a comment to the blog a hidden field with a "seriousness proof number". This number will be different for each post in the blog, and will change once every few days (with several hours of grace time during which the old number remains active). If a comment is posted with an incorrect number, dump it without asking the moderator.

So far, I have not said anything really new. Spammers are already fairly adept at parsing the incoming HTML, identifying the authentication number, and making sure it appears in the spam they send. So far, then, this method is not very effective.

So, the next step in making this system more effective is to encrypt the authentication number. We'll send the authentication number AES encrypted, along with a small JavaScript program that decrypts the number and places it in the form.

Any of you who know anything about cryptography will twitch in pain at me calling this "encryption". After all, what defines encryption is the (un)availability of the key, much more than the actual algorithm used. In order for a legitimate user to be able to post a comment, the JavaScript must provide the key for decryption. The spammers, after a couple of days, will simply teach their parsers to extract the key from the JavaScript and use it to send their spam. We see that this twist can be effective for a single blogger protecting his own blog (security by anonymity), but not when implemented in a standard platform, such as Blogger or WordPress.

So we introduce a third modification to our plan. We still supply the authentication token encrypted, but we do not provide the key! Instead, we provide a JavaScript program that brute-forces the key. Of course, the encrypted text needs to contain a piece of known plaintext, so that the program can tell when it has found the right key. Also, the key must not be longer than the authentication token, or it will be easier to brute-force the token directly rather than the key. Still, this method will surely keep the spammers out.

Wait, don't call me insane just yet. I'm not really serious. While it will keep the spammers out (who can afford to brute-force a 128-bit key just to send spam?), it will also keep the legitimate commenters out (who can afford to brute-force a 128-bit key just to send a comment?). However, we can now turn it into a competition over who has more resources available for posting a comment.

What I suggest is that we give each commenter all the bits required to decrypt the authentication token BUT, say, 16. Brute-forcing 2^16 keys is not beyond the ability of any modern computer, and should not take too long either. For a spammer, however, this raises the cost of each comment sent, and thus reduces the number (and, therefore, the economic appeal) of spams sent.

Of course, the number 16 can be tweaked as necessary. However, I do believe that a number exists such that legitimate users will not find posting comments onerous, while spammers will.

A few points to keep in mind:
  • This solution does not require user registration in order to comment.
  • There is no need for the actual user to do anything. Everything is done, automatically, by the computer.
  • In particular, this solution poses no problems for blind users or other accessibility needs.
On the other hand, this solution does require JavaScript to be enabled, and does require a significant investment of CPU time. It's probably a good idea to write the JavaScript in such a way that it only solves the puzzle when the user actually wants to post a comment, and not on every view of the blog.

Now, all that is missing is for someone who has the time to implement this to do so....


Sunday, January 14, 2007

Preventing a window from receiving focus on click in Windows

Ok, you may ask yourself how come I'm writing a Win32 programming howto in a blog labeled "Linux and Free Software". The answer is quite simple:
I ran into this problem, and Google could not find the answer. Since I like the "share alike" ideas, I'm documenting the solution somewhere Google-visible.

Warning - highly technical post ahead:

Explanation of problem:
Sometimes you may want to create a window that does not receive focus when clicked. The simplest example is an on-screen virtual keyboard: an applet that displays a keyboard on screen, where pressing the keys with the mouse produces the relevant key, as if pressed on a real keyboard.

If such an applet got focus when it was clicked, then the key produced would go to the window that has the focus - the on-screen keyboard itself. Such an application must receive the mouse clicks without activating its own window.

As I said above, I searched all over the web for a solution without much success, despite knowing for a fact that Microsoft's own "On-Screen Keyboard" (Start - Programs - Accessories - Accessibility on Windows 2000 and XP) does just that.

After a lot of looking around, I found the solution. Starting with Windows 2000, CreateWindowEx can receive an extended style called WS_EX_NOACTIVATE. Guess what it does. No, guess.... :-)
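For concreteness, here is a minimal sketch of the call. The window class name "VKeybdClass" and the geometry are hypothetical; class registration, the window procedure, and the message loop are omitted:

```c
#include <windows.h>

/* Create a topmost pop-up window that never becomes the active window.
 * WS_EX_NOACTIVATE (Windows 2000 and later) keeps the keyboard focus
 * where it was, so clicks on our virtual keyboard reach us while key
 * events still go to the previously active application.
 * "VKeybdClass" is a hypothetical, already-registered window class. */
static HWND create_keyboard_window(HINSTANCE instance)
{
    return CreateWindowEx(
        WS_EX_NOACTIVATE | WS_EX_TOPMOST,  /* no activation, stay on top */
        TEXT("VKeybdClass"),
        TEXT("On-screen keyboard"),
        WS_POPUP | WS_VISIBLE,
        100, 100, 400, 150,                /* x, y, width, height */
        NULL, NULL, instance, NULL);
}
```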

Let's hope that the internet just got smarter by one tiny little bit.


Tuesday, November 14, 2006

How to ask me for technical support

Note: If I sent you a link to this post, please please please don't take it personally. It is just my way of trying to keep up with everything I do. Please read this post through and follow the instructions in it.

As some of my readers (why do I bother with the plural form?) know, I manage several free software projects. Most notable of these are PgOleDb and rsyncrypto. These are the projects that have actual users, but there are several other pieces of code I decided to post on the net for everyone to use, as well as projects managed by others that I have contributed to.

As you may also know, I am the founder and CEO of a small open source consulting company called Lingnu Open Source Consulting. This company makes money from providing services, such as support, for open source projects. These include general FOSS projects, such as Linux or Wine, but also the FOSS projects written by me, on company time, mentioned in the previous paragraph.

Every once in a while, someone will send me an email, to a personal email address, with a question or a request for support for one of the projects I manage. This page is up mainly so I can explain how (and when) I answer such support requests.

In a nutshell, there are two types of support requests I will take.
  1. Each and every project I manage has a mailing list. The list is pointed to from the project's home page. Community based support (including by myself) is available, free of charge or commitment, on the lists.

  2. Current or prospective customers of Lingnu are free to approach me in private for specific requests for my time.

If I sent you a reply pointing you to this page, it usually means that you contacted me, in private, asking for support without offering to pay for my time. Such requests will simply not be answered. Either re-send your request to the mailing list (and, yes, most of the mailing lists I manage require you to be a subscriber, for spam filtering reasons), or specify that you are looking for commercial support.

One common mistake people make is to reply to me, in private, to answers I send to the mailing list. In other words, someone asks a question on the list, I answer on the list, and they hit "reply" and ask a follow-up in private. When you answer emails I send to a mailing list, please be sure to hit "reply to all", so that the list gets a copy of your email. Failing to do so will cause me not to answer your email.

I hope this makes things clearer.

And Now the Why....

I know that this policy may seem arrogant. Alternatively, it may seem that my intention is to "dry out" the community support for my FOSS projects, so that people find themselves compelled to pay for my time. I am really truly sorry about that. Please allow me to assure you all that neither of these is my intention.

I would very much like for a community to develop around all of the FOSS projects I started, and thus, I do not wish to discourage using the community channel to get free of charge support. However, when a question is asked on the public mailing list, two things happen:
  1. All people subscribed to the list see the question, and get a chance to answer

  2. The question, and answer, get public recording in the list's archive, and will subsequently get picked up by search engines

The above two are crucial for the development of a community that relies on more than my (very limited) time for support. Since asking me a question in private effectively prevents both from happening, I refuse to spend time on questions sent in private.

What Should You Do if You Were Directed Here?

The answer to that is very simple. Simply open your original email for editing, set the "to" to that of the relevant mailing list (please do NOT CC me personally), and hit "send". That's all. I promise you that I do my best to answer all questions arriving on the mailing lists for all of my projects.

If the email you originally sent was a reply to an email from me sent to the list, it's ok for you to CC me. Just make sure that the mailing list address is in there, and make sure to hit "reply to all" the next time you want to reply to my public emails regarding community support.

If you feel that you need me to dedicate time to your problem, and you are willing to pay for this time, and you still got a reply pointing you here, then allow me first to apologize. Please re-send the email stating, at the start of it, that you are seeking commercial support, and ask for a quote.

I'm hoping that this page serves to create better communication.


Saturday, November 11, 2006

Project Alky is dead

Or - why reinventing the wheel is not an automatically good thing to try and do.

A while ago, the internet (ok, just the FOSS world) was temporarily ablaze with talk of a new project. The project was called "Alky", and it was to do what Wine has not managed to do, even after more than eleven years of development. It was to allow Linux users to run Windows binaries.

The reason I'm not giving a link to the Alky project's web site is not that I like Wine and find the Alky developers' approach naive (and slightly arrogant). The real reason is that the project's web site is dead. On IRC, on the Freenode network, the #alky channel holds the following topic: " - The alky project appears to be dead and cremated; the web site is gone, as was the channel for a while. If anyone knows anything else, please add it to the topic."

When Alky launched, it spread healthy amounts of great promises. Its developers said that Wine, while doing a decent enough job in its own niche, is not fast enough to run games. They claimed that the reason is the need for each Wine process to communicate certain operations to the wineserver, and that they would preprocess the binaries so that this is not necessary.

Except, of course, they didn't. Instead they closed the project down, public interest notwithstanding.

And the sad truth is that Alky was pretty much doomed from the start. Here's why: it failed to build on the experience of its predecessors.

The wineserver, while indeed a performance bottleneck, has a function. Others (such as Transgaming) did offer different solutions to the problem the wineserver solves, but the simple point of the matter is that some solution needs to be offered. Alky never got around to saying what their solution was going to be.

In a nutshell, Wine is composed of three major sections. One is the PE loader, capable of taking a Windows executable or DLL, mapping it into memory, and resolving dependencies the same way that Windows does. The second is the wineserver, which allows synchronization between different processes. The third is the Win32 implementation itself. The last part is, by far, the greatest one: over 90% of the code in Wine is taken up by the implementation of the various DLLs that compose the thing amorphously called the "Win32 API".

Alky got as far as implementing an alternative to the PE loader (done, in Alky, as a preprocessor for the files). It then declared the problem 90% solved, and went public. As you can understand, the problem was, at best, only 10% solved.

So the project is now dead. Though it may not sound so from my post, I'm actually sorry to see it go like that. At the very least, it would have been nice to still have access to the work already done, and see whether Wine can benefit from it.

Alky, June 2006 - Oct 2006.


Tuesday, July 18, 2006

Debunking "Debunking the Myth of High-level Languages"

My first English blog post. Yippee! Those of you who read Hebrew are welcome to visit my original blog of the same name. This blog will be English only.

In a recent five-part article, David Chisnall made the claim that the common wisdom stating that "high level" languages, such as Java, result in slower executing code than C is plain wrong. In a nutshell, David's argument revolves around the idea that, on the one hand, C long ago stopped accurately describing what the hardware knows how to do, while on the other, runtime (Just In Time) compilation allows optimizations not available when doing static compilation.

While I do agree that the difference between C and Java performance characteristics is much smaller than it once was, I will try to argue that the notion that we should all switch to a higher level language to gain performance is wrong. Here's why.

At the base of David's claims is the fact that the era in which CPUs were simple tools carrying out one instruction at a time is now long gone. The CISC processors of yore died in favor of agile RISC processors. The idea behind RISC is that you expose more of the internal CPU architecture through the machine code; in return, you allow the compiler to perform better hardware-dependent optimizations. While doing so, you identify the five instructions that most programs execute 98% of the time, and make those very, very fast. Everything else you leave for the compiler to implement itself.

The result is machines with a machine code that is practically impossible for a human to code in directly. One example: the instruction immediately after a branch operation is executed regardless of whether the branch is taken, but only if no pipeline bubble occurs. You now have to figure out whether the code preceding the branch has a dependency that causes a bubble, and based on the result, decide whether to move the last instruction executed before the branch into the slot after it. It's something a compiler can do reasonably well, but a human will find much harder.

David is thus wrong in stating that the machine code is an abstraction step away from what actually runs on the machine. On RISC machines, machine code is not converted to microcode. That was a CISC thing, and is now long dead.

There are two problems with the RISC vision. The first is that cross-compatibility between different generations of the same CPU is now, for all intents and purposes, impossible. If a new version of the CPU adds a data path that prevents a pipeline bubble where one used to be, code written for the old CPU will not run correctly on the new one. While this does result in faster code when compiling for the most recent CPU, it also results in a management nightmare as far as keeping your software and hardware aligned.

The second problem is called "Intel". While Intel did design its CPUs as RISC machines on the inside (they had a dual pipeline design as far back as the first Pentium chip), their machine code is a CISC relic, mixing instruction codes from the register-register generation and the register-memory generation, with some relics dating back to the time when accumulator based programming was the way to go. In short, the Intel machine code is a royal mess. Intel made the strategic decision to keep backward compatibility, and so avoid problem #1 above (and when they did try to break away, with the Itanium chip, the result was a sad, pathetic commercial failure). This means all the mess described above needs to be supported somehow, cheaply, and while running fast.

So how do you do that? Intel's answer - you put an optimizer inside the CPU! Any Intel chip from the original Pentium onward will, the first time it executes a piece of code, analyze the code while putting it into the instruction cache. In effect, Pentium chips convert the i386 CISC instruction set into a RISC instruction set the first time they run it, and then run this RISC version on any subsequent pass over the same code. This is how they manage to fit this mess of a machine code into a pipelined design - a multiple-pipeline design, while at it.

The phenomenon is so pervasive that some people claim that optimizing code for size, rather than speed, yields the best performance, simply because more code fits into the instruction cache, avoiding the need to reconvert it. This is also the reason that flushing the instruction cache is such a horribly expensive operation on Intel.

Going back to David's article. Yes, C is no longer a borderline assembly language. A very sophisticated optimizer needs to analyze the code for loop unrolling, interdependencies, and other hairy concepts before it can reasonably produce hardware-efficient machine code. Please keep in mind, however, that the reason this optimizer is needed is that C is no longer what it used to be. In other words, no language is future proof.

So what about the claim that current high level languages are better suited to current CPUs, then? I think a broader picture must be considered when selecting your programming language. The main question is this: how long do you want your program to live? If another CPU technology breakthrough is probable within your program's lifetime, wouldn't you rather have a language where at least the compiler's optimizer has a chance of catching up? One important point to keep in mind is that the simpler the language (and not many commonly used languages today are simpler than C), the easier it is for the optimizer to understand what your code is doing and construct the data-flow analysis required for performing optimizations. In other words, the simpler the language, the more sophisticated the optimizer can be.

David didn't just shoot out his opinion. He also mentioned an actual benchmark, which compared statically compiled MIPS code with runtime-compiled MIPS code on the same machine. The latter ran 10% to 20% faster than the former. I believe it is no accident that the test was not conducted (at least, not successfully) on an Intel. MIPS is a RISC machine in the very strict, old-fashioned, pre-Intel sense of the word. It is not obvious that you could repeat the same exercise on a machine that does the aggressive hardware optimization that Intel CPUs do.

Last, but not least, the most commonly used JIT language today is Java. The closest runner-up is .Net. For both languages, the "object form" you get is not a high level language. In that respect, there is no difference between a Java compiler (one that compiles into bytecode, unlike GCJ) and a C compiler: both compile a high level language into machine code. In Java's case, the machine is a made-up one that no actual CPU will understand, but the result looks exactly like any other machine code. There is no magic "special information" that can be gleaned from Java bytecode. If a JIT can perform aggressive optimizations on Java bytecode, there is no reason a JIT cannot perform those same optimizations on whatever native machine code you use.