Muahahahahahah:

http://www.kirkouimet.com/files/images/escaptcha.gif >:]

Yes it is 4:13 AM, but it was totally worth it.

My solution is completely homebrewed in PHP and was largely inspired by Mac
Newbold's post. I changed my algorithm several times - in the initial stages
I was getting 100% accuracy with 40 seconds of processing, but I managed to
play around with arrays and got it down to 100% accuracy in under 4 seconds
every time.

Needless to say, my script is back up and running again and I am happily
auto-scaling my Linux VServer's resources based on demand.

Thanks for all of the helpful replies!

Kirk Ouimet
[email protected]

-----Original Message-----
From: Mac Newbold [mailto:[email protected]] 
Sent: Saturday, March 07, 2009 4:22 PM
To: Kirk Ouimet
Cc: [email protected]
Subject: Re: [UPHPU] Breaking Captchas

Today at 12:45am, Kirk Ouimet said:

> Hi List,
>
> My web host allows me to control how much RAM is available on my hosted
> Linux VServer and charges me $1 for every 10 MB allocated. I wrote a script
> this week that uses information from the Linux command "top" to scale
> resources available based on current demand. Running the script ends up
> saving me about $40/month. Everything was going great until they put a
> Captcha on the page that my script uses to set my allocated resources.
>
> Here's an example of their Captcha image:
>
> http://www.kirkouimet.com/top/captcha.png
>
> I want to defeat it. Using PHP preferably. Anyone have any tips for me?

I don't want to talk a whole lot about it on a publicly archived list ;), 
but I've done some Optical Character Recognition (OCR) on images similar 
to that, using PHP and the GD library.

The images I was doing the OCR for used the same font for all their 
images, so all I had to do was recognize each of the 26 letters, and now 
it never misses a beat.

Here's a snippet that might get you started:

   # $file is the path to the captcha PNG
   $size_arr = getimagesize($file);
   #print_r($size_arr);

   $sx = $size_arr[0];   # image width in pixels
   $sy = $size_arr[1];   # image height in pixels

   $im = ImageCreateFromPNG($file);

   $out="($sx,$sy)\n  ";

   if ($im) {
     # Header: tens digit of each column number, then the ones digit
     foreach (range(0,$sx-1) as $x) { $out.=($x/10>=1 ? floor($x/10)%10 : " ");}
     $out.="\n  ";
     foreach (range(0,$sx-1) as $x) { $out .= $x%10; }
     $out.="\n";
     foreach (range(0,$sy-1) as $y) {
       #echo "Row Y=$y\n";
       # Two-character row label, then one character per pixel
       $out .= ($y/10>=1 ? floor($y/10)%10 : " ").$y%10;
       foreach (range(0,$sx-1) as $x) {
         $color = imagecolorat($im,$x,$y);
         $bg = ($color==16777164);   # background color of these images
         $fg = !$bg;
         $char = ($fg ? "*" : " ");
         $out .= $char;
         #echo "X=$x   char=$char   fg=$fg, bg=$bg, color=$color\n";
       }
       #echo "**********\n";
       $out .= "\n";
     }
     echo $out;
   }
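
The constant 16777164 in the snippet above is 0xFFFFCC, a pale yellow. As a minimal sketch of where that number comes from, here is how the integer that imagecolorat() returns for a truecolor image can be split into its red, green, and blue parts. The helper name unpack_rgb() is made up for illustration and is not part of GD.

```php
<?php
// Hypothetical helper (not a GD function): split a packed truecolor value,
// as returned by imagecolorat() on a truecolor image, into R, G and B.
function unpack_rgb(int $color): array
{
    return [
        'r' => ($color >> 16) & 0xFF,
        'g' => ($color >> 8) & 0xFF,
        'b' => $color & 0xFF,
    ];
}

$bg = unpack_rgb(16777164);
printf("#%02X%02X%02X\n", $bg['r'], $bg['g'], $bg['b']); // prints "#FFFFCC"
```

For a different captcha the background value would have to be read off a sample image first; this only shows how to interpret the number.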

Since this one was a consistent font, I just had to recognize the pattern 
of pixels in the first few columns of pixels for each letter, and I could 
tell what it was, and would skip over x columns to the start of the next 
letter. It didn't need to do anything probabilistically by making guesses 
based on percentages or anything, so it was totally deterministic, which 
makes it really nice and much easier.
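
The column-matching idea described above could be sketched roughly as follows. This is not the original code: it assumes the image has already been reduced to a grid of 0/1 pixels ($grid[$y][$x]), and the $templates table (the first few columns of each letter plus the letter's width) is a hypothetical structure with a toy 3-pixel-tall font.

```php
<?php
// Turn a grid of 0/1 pixels ($grid[$y][$x]) into one string per column,
// e.g. "011" for a 3-pixel-tall column read top to bottom.
function column_signatures(array $grid): array
{
    $cols = [];
    $width = count($grid[0]);
    for ($x = 0; $x < $width; $x++) {
        $cols[] = implode('', array_column($grid, $x));
    }
    return $cols;
}

// $templates is hypothetical: letter => ['prefix' => [...], 'width' => int].
// Each prefix is assumed to start with a column that contains ink.
function read_letters(array $grid, array $templates): string
{
    $cols = column_signatures($grid);
    $text = '';
    $x = 0;
    while ($x < count($cols)) {
        $matched = false;
        foreach ($templates as $letter => $t) {
            $n = count($t['prefix']);
            if (array_slice($cols, $x, $n) === $t['prefix']) {
                $text .= $letter;
                $x += $t['width'];   // skip to where the next letter can start
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $x++;                    // blank or unknown column: move on
        }
    }
    return $text;
}

// Toy example: an "L" and a "T" drawn in a made-up 3-pixel-tall font.
$grid = [
    [1, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0],
];
$templates = [
    'L' => ['prefix' => ['111', '001'], 'width' => 3],
    'T' => ['prefix' => ['100', '111'], 'width' => 4],
];
echo read_letters($grid, $templates), "\n"; // prints "LT"
```

Exact matching like this only works when every letter is always drawn with identical pixels, which is the situation described above.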

For yours, you'd want to start by taking off the border, then by running a 
noise reduction algorithm over it to take out the stray dots. Basically, 
you'll load the pixels into a two-dimensional array, and go through it row 
by row, column by column. You'll want to take out any pixel that is dark 
when everything around it is light. Like this:

   (x-1,y+1)  (x,y+1)  (x+1,y+1)
   (x-1,y  )  (x,y  )  (x+1,y  )
   (x-1,y-1)  (x,y-1)  (x+1,y-1)

If the pixel at (x,y) is black, and the other 8 are white, set (x,y) to 
white.

Give that a try and see if that cleans up the image enough to just find 
the letters. A variant on this that is a little more aggressive is to reset 
it to white if it only has 0 black neighbors OR 1 black neighbor. That can 
erase fine lines though, so you want to be more careful with it.
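
A minimal sketch of that noise filter, assuming the border is already cropped and the pixels sit in a grid of 0/1 values ($grid[$y][$x], 1 = dark). The function name despeckle() and its $maxNeighbors parameter are made up for illustration: 0 gives the conservative rule, 1 the more aggressive variant.

```php
<?php
// Clear dark pixels that have at most $maxNeighbors dark pixels among
// their 8 neighbors. Neighbor counts are taken from the original grid,
// so clearing one pixel does not affect the decision for the next.
function despeckle(array $grid, int $maxNeighbors = 0): array
{
    $h = count($grid);
    $w = count($grid[0]);
    $out = $grid;                       // read from $grid, write to $out
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            if ($grid[$y][$x] !== 1) {
                continue;
            }
            $dark = 0;
            for ($dy = -1; $dy <= 1; $dy++) {
                for ($dx = -1; $dx <= 1; $dx++) {
                    if ($dx === 0 && $dy === 0) {
                        continue;
                    }
                    // Pixels outside the image count as light.
                    $dark += $grid[$y + $dy][$x + $dx] ?? 0;
                }
            }
            if ($dark <= $maxNeighbors) {
                $out[$y][$x] = 0;
            }
        }
    }
    return $out;
}

// One stray dot at the top left, one thin vertical stroke.
$grid = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
];
$clean = despeckle($grid);      // removes the dot, keeps the stroke
$harsh = despeckle($grid, 1);   // also eats both ends of the stroke
```

In the toy grid the aggressive setting shortens the three-pixel stroke to a single pixel, which is the fine-line erosion mentioned above.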

Anyway, it's a topic I am very interested in, and I think your problem can 
be solved. This month is crazy busy at work, but if by April you don't 
have it solved but still want to solve it, I'd love to spend a few hours 
playing with it with you. It might be as little as a couple hours to crack 
the hardest part of the problem.

One thing you'll need to do though is get a large sample of captchas, so 
you make sure you have at least a few instances of every letter. The 
first step, after any image preprocessing, is to make sure you can 
recognize each letter correctly, then you can tackle the whole image 
problem much more easily.
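
One way to check that a sample set covers every letter, as a rough sketch: it assumes each saved captcha has been labeled by hand with its answer and that the alphabet is A to Z (the real captcha may use a different character set). The function name undersampled_letters() is made up for illustration and needs PHP 7.4 or later.

```php
<?php
// Count how often each letter appears across hand-labeled answers and
// return the letters that still have fewer than $min examples.
function undersampled_letters(array $labels, int $min = 3): array
{
    $counts = array_fill_keys(range('A', 'Z'), 0);   // assumed alphabet
    foreach ($labels as $label) {
        foreach (str_split(strtoupper($label)) as $ch) {
            if (isset($counts[$ch])) {
                $counts[$ch]++;
            }
        }
    }
    return array_keys(array_filter($counts, fn($n) => $n < $min));
}

// Hypothetical labels for three saved captcha images.
$missing = undersampled_letters(['QXKD', 'ABQD', 'DQMA'], 2);
echo implode(' ', $missing), "\n";   // letters that need more samples
```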

Thanks,
Mac

--
Mac Newbold                     Code Greene, LLC
CTO/Chief Technical Officer     44 Exchange Place
Office: 801-582-0148            Salt Lake City, UT  84111
Cell:   801-694-6334            www.codegreene.com
_______________________________________________

UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
