Ramblings of Narc

When the issue isn't confused enough.

Spam, ReCAPTCHA, and Stuff

So if you’re a visitor here who’s ever at least thought of posting a comment, you’ll probably know I recently (about half a year ago?) switched from Akismet to reCAPTCHA for my spam-blocking needs. It’s a nice service, and the fact that it also lets humans help where OCR fails is a big bonus in its favour.

However, the fact that it’s one of the most common types of CAPTCHA means that it’s also the one under the heaviest attack, and that means there are spambots that have learned how to crack it.

As evidence of this, I offer the (dozens of) spam comments I just deleted from my queue (as I was typing this, another one just showed up). The major difference between this spam and the stuff that used to pass through Akismet is the length — these new spam comments are very long. This works in my favour, of course, since it makes it easy to figure out what to delete: if it takes a tap of the Page Down key to get to the end of the comment, it’s very likely spam.
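For what it’s worth, the “Page Down” rule of thumb is easy to sketch in code. This is only an illustration: the function name and the 2000-character threshold are my own assumptions, not anything reCAPTCHA or a blogging platform provides, and a length check is trivially easy for a spammer to sidestep.

```python
# Hypothetical threshold: roughly one screenful of text. Tune it for your own blog.
PAGE_DOWN_CHARS = 2000


def looks_like_long_spam(comment_text: str, threshold: int = PAGE_DOWN_CHARS) -> bool:
    """Return True when a comment is long enough to need a Page Down to read."""
    return len(comment_text) > threshold
```

A flag like this is best used to sort the moderation queue, not to delete anything automatically.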

However, if we disregard the content of the spam (which is easily changeable), we can see that it’s really quite a bad idea to rely on any kind of CAPTCHA by itself. It seems I have to echo the many others who have said that spam is a machine-generated problem with only human solutions.

Ultimately, every kind of anti-spam solution has drawbacks:

  • statistical analysis solutions (think Bayesian filters) will have false positives and false negatives sometimes.
  • distributed blacklists (like Akismet) fail because they’re blacklists — and enumerating badness is a failure waiting to happen[1]. On top of that, open blacklists are easy to poison, leading to… false positives, of course.
  • CAPTCHAs, as a special class of solutions, fail because they rely on computers not being able to “read” as well as humans can — the problem being that some humans cannot read as well as a computer can; and also that computers are getting smarter all the time.

I’m sure there are other types of anti-spam solutions I haven’t enumerated, and likely they all fail on one point or another.

One of the best approaches to such problems is a whitelist-based approach, or enumerating goodness. This is much easier to do, since the number of honest commenters is likely much lower and much more stable than the number of potential spammers out there.

“But wait, narc, doesn’t that mean that I have to keep an eye on my moderation queue, to whitelist the allowed commenters?” Well, obviously, but you’d have to keep an eye there anyway to check for false positives, and to delete all the spam you’re getting. So there are no savings either way.
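A whitelist-based queue could look something like the sketch below. Everything in it is hypothetical: the `ModerationQueue` class, its method names, and the choice to identify commenters by e-mail address are assumptions made for illustration (an address is easy to fake, so a real setup would want something sturdier). It does not reflect how any particular blogging platform works.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ModerationQueue:
    """Whitelist moderation: known commenters publish directly, others wait."""

    approved_authors: set[str] = field(default_factory=set)
    pending: list[tuple[str, str]] = field(default_factory=list)
    published: list[tuple[str, str]] = field(default_factory=list)

    @staticmethod
    def _key(author_email: str) -> str:
        # Normalise so "Alice@Example.com " and "alice@example.com" match.
        return author_email.strip().lower()

    def submit(self, author_email: str, text: str) -> str:
        """Publish comments from whitelisted authors; hold everything else."""
        key = self._key(author_email)
        if key in self.approved_authors:
            self.published.append((key, text))
            return "published"
        self.pending.append((key, text))
        return "pending"

    def approve(self, author_email: str) -> None:
        """Whitelist an author and release their held comments."""
        key = self._key(author_email)
        self.approved_authors.add(key)
        still_pending = []
        for author, text in self.pending:
            if author == key:
                self.published.append((author, text))
            else:
                still_pending.append((author, text))
        self.pending = still_pending
```

The moderator still has to look at the pending list, but each honest commenter only needs approving once, after which their comments skip the queue.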

With that said, using a solution like reCAPTCHA can reduce the immense size of the moderation queue, which is a good enough reason to use it. But you still need to keep your eye on the ball, and you also shouldn’t forget that CAPTCHAs will keep out parts of your (potential) audience. I try to keep both in mind, and if I ever seem to have failed, I strongly encourage you to contact me and remind me.

Update: I’ve had to close comments on this post due to the fact that it got targeted by a bunch of spammers (who don’t seem to have much trouble with the reCAPTCHA). Meh.

