Muahahahahahah: http://www.kirkouimet.com/files/images/escaptcha.gif >:]
Yes it is 4:13 AM, but it was totally worth it. My solution is completely homebrewed in PHP and was largely inspired by Mac Newbold's post. I changed my algorithm several times - in the initial stages I was getting 100% accuracy with 40 seconds of processing, but I managed to play around with arrays and got it down to 100% accuracy in under 4 seconds every time. Needless to say, my script is back up and running again and I am happily auto-scaling my Linux VServer's resources based on demand.

Thanks for all of the helpful replies!

Kirk Ouimet
[email protected]

-----Original Message-----
From: Mac Newbold [mailto:[email protected]]
Sent: Saturday, March 07, 2009 4:22 PM
To: Kirk Ouimet
Cc: [email protected]
Subject: Re: [UPHPU] Breaking Captchas

Today at 12:45am, Kirk Ouimet said:

> Hi List,
>
> My web host allows me to control how much RAM is available on my hosted
> Linux VServer and charges me $1 for every 10 MB allocated. I wrote a script
> this week that uses information from the Linux command "top" to scale
> resources available based on current demand. Running the script ends up
> saving me about $40/month. Everything was going great until they put a
> Captcha on the page that my script uses to set my allocated resources.
>
> Here's an example of their Captcha image:
>
> http://www.kirkouimet.com/top/captcha.png
>
> I want to defeat it. Using PHP preferably. Anyone have any tips for me?

I don't want to talk a whole lot about it on a publicly archived list ;), but I've done some Optical Character Recognition (OCR) on images similar to that, using PHP and the GD library. The images I was doing the OCR for used the same font for all their images, so all I had to do was recognize each of the 26 letters, and now it never misses a beat.
Here's a snippet that might get you started:

    // $file is the path to the captcha PNG
    $size_arr = getimagesize($file);
    #print_r($size_arr);
    $sx = $size_arr[0];   // width in pixels
    $sy = $size_arr[1];   // height in pixels
    $im = imagecreatefrompng($file);
    // Two-space pad so the column headers line up with the
    // two-character row labels printed below
    $out = "($sx,$sy)\n  ";
    if ($im) {
        // Header row 1: tens digit of each x coordinate
        foreach (range(0,$sx-1) as $x) {
            $out .= ($x/10>=1 ? floor($x/10)%10 : " ");
        }
        $out .= "\n  ";
        // Header row 2: ones digit of each x coordinate
        foreach (range(0,$sx-1) as $x) {
            $out .= $x%10;
        }
        $out .= "\n";
        foreach (range(0,$sy-1) as $y) {
            #echo "Row Y=$y\n";
            // Row label: tens digit (or a space), then ones digit
            $out .= ($y/10>=1 ? floor($y/10)%10 : " ").$y%10;
            foreach (range(0,$sx-1) as $x) {
                $color = imagecolorat($im,$x,$y);
                $bg = ($color==16777164);   // 16777164 = 0xFFFFCC
                $fg = !$bg;
                $char = ($fg ? "*" : " ");
                $out .= $char;
                #echo "X=$x char=$char fg=$fg, bg=$bg, color=$color\n";
            }
            #echo "**********\n";
            $out .= "\n";
        }
        echo $out;
    }

Since this one was a consistent font, I just had to recognize the pattern of pixels in the first few columns of each letter, and I could tell what it was, and would skip over x columns to the start of the next letter. It didn't need to do anything probabilistically by making guesses based on percentages or anything, so it was totally deterministic, which makes it really nice and much easier.

For yours, you'd want to start by taking off the border, then by running a noise reduction algorithm over it to take out the stray dots. Basically, you'll load the pixels into a two dimensional array, and go through it row by row, column by column. You'll want to take out any pixel that is dark when everything around it is light. Like this:

    (x-1,y+1) (x,y+1) (x+1,y+1)
    (x-1,y  ) (x,y  ) (x+1,y  )
    (x-1,y-1) (x,y-1) (x+1,y-1)

If the pixel at (x,y) is black, and the other 8 are white, set (x,y) to white. Give that a try and see if that cleans up the image enough to just find the letters. A variant on this that is a little more aggressive is to reset it to white if it only has 0 black neighbors OR 1 black neighbor. That can erase fine lines though, so you want to be more careful with it.
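
The stray-dot cleanup described above could be sketched roughly as follows. This is only a sketch under assumptions, not code from the original thread: it assumes the image has already been reduced to a two dimensional array of booleans ($pixels[$y][$x] is true for a dark pixel), it treats anything outside the image as light, and the names remove_stray_pixels and $maxNeighbors are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): turns a dark pixel
// light when at most $maxNeighbors of its 8 neighbors are dark.
// $pixels[$y][$x] is true for a dark pixel and false for a light one.
// Positions outside the image are treated as light (an assumption).
function remove_stray_pixels(array $pixels, int $maxNeighbors = 0): array
{
    $height = count($pixels);
    $width  = $height > 0 ? count($pixels[0]) : 0;
    $clean  = $pixels; // write to a copy so every decision uses the original image

    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            if (!$pixels[$y][$x]) {
                continue; // already light, nothing to do
            }
            // Count dark pixels among the 8 surrounding positions
            $dark = 0;
            for ($dy = -1; $dy <= 1; $dy++) {
                for ($dx = -1; $dx <= 1; $dx++) {
                    if ($dx === 0 && $dy === 0) {
                        continue; // skip the pixel itself
                    }
                    $ny = $y + $dy;
                    $nx = $x + $dx;
                    if ($ny >= 0 && $ny < $height && $nx >= 0 && $nx < $width
                        && $pixels[$ny][$nx]) {
                        $dark++;
                    }
                }
            }
            if ($dark <= $maxNeighbors) {
                $clean[$y][$x] = false; // isolated enough to count as noise
            }
        }
    }
    return $clean;
}
```

Calling remove_stray_pixels($pixels) gives the strict version (0 dark neighbors), and remove_stray_pixels($pixels, 1) gives the more aggressive variant. With the aggressive setting, a two-pixel fragment disappears entirely because each pixel has only one dark neighbor, which is exactly how fine lines can get erased.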
Anyway, it's a topic I am very interested in, and I think your problem can be solved. This month is crazy busy at work, but if by April you don't have it solved but still want to solve it, I'd love to spend a few hours playing with it with you. It might be as little as a couple hours to crack the hardest part of the problem.

One thing you'll need to do though is get a large sample of captchas, so you make sure you have at least a few instances of every letter. The first step, after any image preprocessing, is to make sure you can recognize each letter correctly, then you can tackle the whole image problem much more easily.

Thanks,
Mac

--
Mac Newbold                     Code Greene, LLC
CTO/Chief Technical Officer     44 Exchange Place
Office: 801-582-0148            Salt Lake City, UT 84111
Cell: 801-694-6334              www.codegreene.com
_______________________________________________ UPHPU mailing list [email protected] http://uphpu.org/mailman/listinfo/uphpu IRC: #uphpu on irc.freenode.net
