Cracking SolidBlue Spam Interceptor's Authentication CAPTCHA
Introduction
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is an image containing some items, possibly obscured, which a web site will use to check that a user trying to use its systems is a human. They're generally used when signing up with an email site, in a bid to stop spam.
Earlier today, Slashdot posted a link to a story about the spam problem. The first poster to the follow-up discussion (who I later discovered worked for SolidBlue) didn't believe CAPTCHAs could be cracked, and challenged someone to find an automated way of cracking the SolidBlue Spam Interceptor Authentication. I was a tad bored, so I gave it a go.
How to do it
There are a number of points to note about the SI CAPTCHA:
- It uses a reasonably high-contrast colour scheme (dark letters on a pale background)
- It doesn't use overlapping letters
- It uses a small subset of the alphabet
Pre-processing
When I went to the referenced link at Spam Interceptor, I was given the following CAPTCHA:
I loaded this image into the GIMP. The first thing we want to do when we process this image is to remove the chroma (colour) information and leave only the luma (brightness) information. In order to do that, we just convert the image to greyscale. Doing so yielded the following image:
After this, we need to get rid of all the pale background and leave the darker foreground in place. I was lucky to stumble upon the Layer->Colours->Threshold function. That converted the image to black and white.
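If you'd rather not use the GIMP, the greyscale and threshold steps can be approximated in a few lines of C. This is only a sketch: the weights are the common Rec. 601 luma approximation (the GIMP's own conversion may differ), threshold_rgb is a hypothetical helper name, and the cutoff has to be chosen by eye for the CAPTCHA in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Approximate Rec. 601 luma from 8-bit RGB.
 * The integer weights 77 + 150 + 29 sum to 256, so white maps to 255. */
static uint8_t luma(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
}

/* Hypothetical helper: rgb holds npixels packed R,G,B triples.
 * Writes 1 to out[i] for a dark (foreground) pixel, 0 for a pale one. */
static void threshold_rgb(const uint8_t *rgb, size_t npixels,
                          uint8_t cutoff, uint8_t *out)
{
    for (size_t i = 0; i < npixels; i++) {
        uint8_t y = luma(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
        out[i] = (y < cutoff) ? 1 : 0;
    }
}
```

A cutoff that is too low drops the fainter strokes and one that is too high lets background through, so it's worth checking the result visually before going any further.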
I saved this image as an X bitmap (XBM) file, which is handy for compiling into C programs.
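The reason XBM is handy is that the file is itself a fragment of C: two #defines for the width and height, and an array of bytes holding one bit per pixel. Each row is padded to a whole number of bytes, and the left-most pixel of each byte sits in the least significant bit. Here's a sketch of reading a pixel back out, using a made-up 10x3 image. The demo_ names and xbm_pixel are illustrative and not taken from authimage3.c.

```c
/* Stand-in for the contents of an .xbm file (a hypothetical 10x3 image). */
#define demo_width 10
#define demo_height 3
static unsigned char demo_bits[] = {
    0x01, 0x00, /* row 0: only x = 0 is set         */
    0x00, 0x02, /* row 1: only x = 9 is set         */
    0xff, 0x03, /* row 2: all ten pixels are set    */
};

/* Return 1 if the pixel at (x, y) is set (foreground), 0 otherwise.
 * The caller is responsible for keeping x and y inside the image. */
static int xbm_pixel(const unsigned char *bits, int width, int x, int y)
{
    int stride = (width + 7) / 8; /* bytes per row, rounded up */
    return (bits[y * stride + x / 8] >> (x % 8)) & 1;
}
```

With the file #included into a program, xbm_pixel(demo_bits, demo_width, x, y) is all the image access the rest of the processing needs.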
Building a set of characters
The individual characters in the CAPTCHA use the same font throughout. Any letter "L" will look exactly like any other letter "L". This means that if we grab one letter L from our CAPTCHA, we can do a simple match against our copy and tell whether the letter exists in the image. I tried a few different seeds to the CAPTCHA-generating script, and got a set of 8 different glyphs. This seems to be all I can get hold of from this address. It's possible that other hosts may see more glyphs.
Search algorithm
The search algorithm is pretty simple: Start at the left-hand end of the CAPTCHA and work top-to-bottom until you see a "black" pixel. At this point, you have the top-most pixel on the left-most column of the character. Some characters, like the S and the J, don't have a pixel in the top-left corner, so an offset needs to be added to the y-coordinate to determine where the top of the glyph would start. Then just loop through all the glyphs to see whether there's a match. If there is, output the character and skip right by the width of the glyph.
Proof of concept code
Obviously this isn't going to be a fully automatic solution, but you can see how I used freely available (with source) code to do the pre-processing of the image, and you can see where the processing works. Linking it all together isn't rocket science.
Here's the code:
- authimage3.c, the XBM.
- glyphs.c, the set of glyphs.
- main.c, the matching code.
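For anyone who just wants the shape of the search algorithm without opening the files, here's a minimal sketch of it. It is not the contents of main.c or glyphs.c: the struct glyph layout, the function names, the one-byte-per-pixel image and the two tiny 3x3 demo glyphs are all assumptions made up for illustration. It also assumes each glyph is cropped so that its left-most column contains at least one black pixel.

```c
#include <stddef.h>

/* Hypothetical glyph record (not the layout used in glyphs.c). */
struct glyph {
    char ch;                  /* character the glyph stands for           */
    int w, h;                 /* size in pixels                           */
    int top_offset;           /* row of the first black pixel in column 0 */
    const unsigned char *pix; /* w*h values, row-major, 1 = black         */
};

/* Exact pixel-for-pixel comparison with the glyph's top-left at (x0, y0). */
static int glyph_matches(const unsigned char *img, int iw, int ih,
                         const struct glyph *g, int x0, int y0)
{
    if (x0 < 0 || y0 < 0 || x0 + g->w > iw || y0 + g->h > ih)
        return 0;
    for (int y = 0; y < g->h; y++)
        for (int x = 0; x < g->w; x++)
            if (img[(y0 + y) * iw + x0 + x] != g->pix[y * g->w + x])
                return 0;
    return 1;
}

/* Scan columns left to right; returns the number of characters written. */
static size_t decode(const unsigned char *img, int iw, int ih,
                     const struct glyph *glyphs, size_t nglyphs,
                     char *out, size_t outsz)
{
    size_t n = 0;
    int x = 0;
    if (outsz == 0)
        return 0;
    while (x < iw) {
        int y = 0;
        int advance = 1; /* no match: just move on one column */
        while (y < ih && !img[y * iw + x])
            y++;         /* find the top-most black pixel in this column */
        if (y < ih) {
            for (size_t i = 0; i < nglyphs; i++) {
                const struct glyph *g = &glyphs[i];
                /* Shift up by the offset to find where the glyph's top is. */
                if (glyph_matches(img, iw, ih, g, x, y - g->top_offset)) {
                    if (n + 1 < outsz)
                        out[n++] = g->ch;
                    advance = g->w; /* skip right by the glyph's width */
                    break;
                }
            }
        }
        x += advance;
    }
    out[n] = '\0';
    return n;
}

/* Made-up demo data: an "L" and a "J" drawn in a 9x5 image. */
static const unsigned char pix_L[] = {1,0,0, 1,0,0, 1,1,1};
static const unsigned char pix_J[] = {0,0,1, 0,0,1, 1,1,1};
static const struct glyph demo_glyphs[] = {
    {'L', 3, 3, 0, pix_L},
    {'J', 3, 3, 2, pix_J}, /* column 0 of the J is empty until row 2 */
};
static const unsigned char demo_img[9 * 5] = {
    0,0,0,0,0,0,0,0,0,
    0,1,0,0,0,0,0,1,0,
    0,1,0,0,0,0,0,1,0,
    0,1,1,1,0,1,1,1,0,
    0,0,0,0,0,0,0,0,0,
};
```

Calling decode(demo_img, 9, 5, demo_glyphs, 2, buf, sizeof buf) on a small char buffer walks the demo image and picks out both letters. An exact comparison like this only works because the CAPTCHA reuses identical, non-overlapping glyphs; any noise or distortion would need a fuzzier match.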
Notes
- Please don't accuse me of helping spammers by doing this. It's taken me as long to write this HTML as it did to write the C code. Any idiot can do it.
- If you're one of the 12 or so people who came and looked at the images when I linked them from here, you may note that the 3rd image is different. That's because I only realised when I was testing the code that I'd taken the default threshold from the GIMP and not the one that actually kept all the characters.
- If you're from SolidBlue, sorry I broke your CAPTCHA. You should make it harder to beat. Otherwise your users will think there's security there when there isn't.
- If you're from the Positive Internet Company, wondering why you're being Slashdotted, I'm sorry. Think of it as a test :-)