Today at 12:45am, Kirk Ouimet said:

> Hi List,
>
> My web host allows me to control how much RAM is available on my hosted
> Linux VServer and charges me $1 for every 10 MB allocated. I wrote a script
> this week that uses information from the Linux command "top" to scale
> resources available based on current demand. Running the script ends up
> saving me about $40/month. Everything was going great until they put a
> Captcha on the page that my script uses to set my allocated resources.
>
> Here's an example of their Captcha image:
>
> http://www.kirkouimet.com/top/captcha.png
>
> I want to defeat it. Using PHP preferably. Anyone have any tips for me?

I don't want to talk a whole lot about it on a publicly archived list ;), but I've done some Optical Character Recognition (OCR) on images similar to that, using PHP and the GD library.

The images I was doing the OCR for used the same font for all their images, so all I had to do was recognize each of the 26 letters, and now it never misses a beat.

Here's a snippet that might get you started:

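  // $file is the path to the PNG you want to dump as text.
  $size_arr = getimagesize($file);
  #print_r($size_arr);

  $sx = $size_arr[0];  // width in pixels
  $sy = $size_arr[1];  // height in pixels

  $im = imagecreatefrompng($file);

  $out = "($sx,$sy)\n  ";

  if ($im) {
    // Column ruler: tens digit on the first line, ones digit on the second.
    foreach (range(0,$sx-1) as $x) { $out .= ($x>=10 ? intdiv($x,10)%10 : " "); }
    $out .= "\n  ";
    foreach (range(0,$sx-1) as $x) { $out .= $x%10; }
    $out .= "\n";
    foreach (range(0,$sy-1) as $y) {
      #echo "Row Y=$y\n";
      // Row label: tens digit (or a space), then ones digit.
      $out .= ($y>=10 ? intdiv($y,10)%10 : " ").($y%10);
      foreach (range(0,$sx-1) as $x) {
        $color = imagecolorat($im,$x,$y);
        // 16777164 is 0xFFFFCC, the background color in the images I was reading.
        // Change it for yours. On a palette PNG, imagecolorat() returns a
        // palette index rather than an RGB value.
        $bg = ($color==16777164);
        $fg = !$bg;
        $char = ($fg ? "*" : " ");
        $out .= $char;
        #echo "X=$x   char=$char   fg=$fg, bg=$bg, color=$color\n";
      }
      #echo "**********\n";
      $out .= "\n";
    }
    echo $out;
  }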

Since this one was a consistent font, I just had to recognize the pattern of pixels in the first few columns of each letter to tell what it was, then skip over x columns to the start of the next letter. It didn't need to do anything probabilistic, like making guesses based on percentages, so it was totally deterministic, which made it really nice and much easier.
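
To make the first-few-columns idea concrete, here is a rough sketch of it, not the actual recognizer. It works on rows of text like the dump above, except that '.' stands in for the blank background so the spacing is easier to see. The function names (columnSignature, readLetters), the glyph table layout, and the toy 3-row font are all invented for illustration; a real table would come from studying dumps of your own samples, and the signatures have to be long enough that no two letters share one.

```php
<?php
// Hypothetical sketch: fixed-font recognition by matching the first few
// pixel columns of each letter. $rows is an array of equal-length strings,
// '*' for foreground and '.' for background.

// Read column $x from top to bottom as a string such as "*..".
function columnSignature(array $rows, int $x): string {
    $sig = '';
    foreach ($rows as $row) {
        $sig .= $row[$x];
    }
    return $sig;
}

// $glyphs maps each letter to:
//   'cols'  => signatures of its leading columns,
//   'width' => how many columns to skip to reach the next letter.
function readLetters(array $rows, array $glyphs): string {
    $width = strlen($rows[0]);
    $text = '';
    $x = 0;
    while ($x < $width) {
        $matched = false;
        foreach ($glyphs as $letter => $glyph) {
            if ($x + count($glyph['cols']) > $width) {
                continue;  // not enough columns left for this letter
            }
            $ok = true;
            foreach ($glyph['cols'] as $i => $sig) {
                if (columnSignature($rows, $x + $i) !== $sig) {
                    $ok = false;
                    break;
                }
            }
            if ($ok) {
                $text .= $letter;
                $x += $glyph['width'];
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $x++;  // blank or unknown column: move on
        }
    }
    return $text;
}

// Toy font, made up for this example.
$glyphs = [
    'L' => ['cols' => ['***', '..*'], 'width' => 3],
    'T' => ['cols' => ['*..', '***'], 'width' => 4],
];
$rows = [
    '*..***.*..',
    '*...*..*..',
    '**..*..**.',
];
echo readLetters($rows, $glyphs), "\n";
```

Because every decision is an exact string comparison, the same image always gives the same answer, which is the deterministic property described above.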

For yours, you'd want to start by taking off the border, then by running a noise reduction algorithm over it to take out the stray dots. Basically, you'll load the pixels into a two-dimensional array, and go through it row by row, column by column. You'll want to take out any pixel that is dark when everything around it is light. Like this:

  (x-1,y+1)  (x,y+1)  (x+1,y+1)
  (x-1,y  )  (x,y  )  (x+1,y  )
  (x-1,y-1)  (x,y-1)  (x+1,y-1)

If the pixel at (x,y) is black, and the other 8 are white, set (x,y) to white.
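
Here is a minimal sketch of those two steps on a plain two-dimensional array, so it runs without GD. Everything named here (stripBorder, darkNeighbors, removeIsolatedPixels, the two demo helpers, and the 1-pixel border width) is an assumption for illustration. With a real image you would fill the array from imagecolorat() using whatever test separates dark from light in your captchas.

```php
<?php
// Sketch only: $grid[$y][$x] is true for a dark pixel, false for a light one.

// Drop $border pixels from every edge. The border width is an assumption;
// measure it on a real image.
function stripBorder(array $grid, int $border): array {
    $inner = array_slice($grid, $border, count($grid) - 2 * $border);
    foreach ($inner as $y => $row) {
        $inner[$y] = array_slice($row, $border, count($row) - 2 * $border);
    }
    return $inner;
}

// Count dark pixels among the 8 neighbors of ($x, $y).
// Anything off the edge of the image counts as light.
function darkNeighbors(array $grid, int $x, int $y): int {
    $count = 0;
    for ($dy = -1; $dy <= 1; $dy++) {
        for ($dx = -1; $dx <= 1; $dx++) {
            if ($dx === 0 && $dy === 0) {
                continue;  // skip the pixel itself
            }
            if (!empty($grid[$y + $dy][$x + $dx])) {
                $count++;
            }
        }
    }
    return $count;
}

// Set a dark pixel to light when all 8 of its neighbors are light.
// Decisions are made against the original grid, changes go into a copy.
function removeIsolatedPixels(array $grid): array {
    $clean = $grid;
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $dark) {
            if ($dark && darkNeighbors($grid, $x, $y) === 0) {
                $clean[$y][$x] = false;
            }
        }
    }
    return $clean;
}

// Demo helpers: '*' is dark, '.' is light.
function gridFromStrings(array $rows): array {
    return array_map(fn($row) => array_map(fn($c) => $c === '*', str_split($row)), $rows);
}
function gridToStrings(array $grid): array {
    return array_map(fn($row) => implode('', array_map(fn($d) => $d ? '*' : '.', $row)), $grid);
}

$grid = gridFromStrings([
    '*******',
    '**....*',   // stray dot touching the border
    '*..**.*',
    '*..**.*',
    '*.....*',
    '*******',
]);
echo implode("\n", gridToStrings(removeIsolatedPixels(stripBorder($grid, 1)))), "\n";
```

The order matters in this demo: the stray dot touches the border, so it only counts as isolated once the border is gone.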

Give that a try and see if that cleans up the image enough to just find the letters. A variant on this that is a little more aggressive is to reset it to white if it only has 0 black neighbors OR 1 black neighbor. That can erase fine lines though, so you want to be more careful with it.
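
A small sketch of the two rules side by side, using rows of text with '*' for black and '.' for white. The function name despeckle and its $maxDarkNeighbors parameter are made up for this example: 0 gives the gentle rule and 1 gives the aggressive one. Counting is done against the untouched input so that one erased pixel does not change the count for the next; an in-place version would behave differently.

```php
<?php
// Sketch: reset a dark pixel to light when it has at most $maxDarkNeighbors
// dark pixels among its 8 neighbors. $rows is an array of equal-length
// strings, '*' dark and '.' light.
function despeckle(array $rows, int $maxDarkNeighbors): array {
    $out = $rows;
    $h = count($rows);
    for ($y = 0; $y < $h; $y++) {
        $w = strlen($rows[$y]);
        for ($x = 0; $x < $w; $x++) {
            if ($rows[$y][$x] !== '*') {
                continue;
            }
            $dark = 0;
            for ($dy = -1; $dy <= 1; $dy++) {
                for ($dx = -1; $dx <= 1; $dx++) {
                    if ($dx === 0 && $dy === 0) {
                        continue;  // skip the pixel itself
                    }
                    $ny = $y + $dy;
                    $nx = $x + $dx;
                    if ($ny >= 0 && $ny < $h && $nx >= 0 && $nx < $w
                        && $rows[$ny][$nx] === '*') {
                        $dark++;
                    }
                }
            }
            if ($dark <= $maxDarkNeighbors) {
                $out[$y][$x] = '.';
            }
        }
    }
    return $out;
}

$rows = [
    '.......',
    '.**....',   // two-pixel speck
    '.......',
    '.*****.',   // thin line
    '.......',
];
echo implode("\n", despeckle($rows, 0)), "\n\n";  // gentle rule
echo implode("\n", despeckle($rows, 1)), "\n";    // aggressive rule
```

On this sample the gentle rule changes nothing, because no pixel is completely alone. The aggressive rule clears the two-pixel speck but also trims both ends off the thin line, and running it again would keep shortening that line, which is the risk mentioned above.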

Anyway, it's a topic I am very interested in, and I think your problem can be solved. This month is crazy busy at work, but if by April you don't have it solved but still want to solve it, I'd love to spend a few hours playing with it with you. It might be as little as a couple hours to crack the hardest part of the problem.

One thing you'll need to do though is get a large sample of captchas, so you make sure you have at least a few instances of every letter. The first step, after any image preprocessing, is to make sure you can recognize each letter correctly, then you can tackle the whole image problem much more easily.
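
One way to keep track of that is a quick tally, sketched below. It assumes you have typed in the correct answer for each saved sample and that the captchas use only the letters A to Z; both are guesses about your setup, and the function name underSampledLetters is made up.

```php
<?php
// Sketch: given the hand-typed answers for a batch of saved captchas, list
// the letters that appear fewer than $min times. Assumes A-Z only; anything
// else in an answer is ignored.
function underSampledLetters(array $answers, int $min): array {
    $counts = array_fill_keys(range('A', 'Z'), 0);
    foreach ($answers as $answer) {
        foreach (str_split(strtoupper($answer)) as $ch) {
            if (isset($counts[$ch])) {
                $counts[$ch]++;
            }
        }
    }
    // array_filter keeps the letter keys, so array_keys returns the letters.
    return array_keys(array_filter($counts, fn($n) => $n < $min));
}

// Hypothetical answers for three saved samples.
$answers = ['QXZKP', 'HELLO', 'WORLD'];
echo "Need more samples of: ", implode(' ', underSampledLetters($answers, 3)), "\n";
```

Keep collecting until the list comes back empty for whatever minimum you are comfortable with.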

Thanks,
Mac

--
Mac Newbold                     Code Greene, LLC
CTO/Chief Technical Officer     44 Exchange Place
Office: 801-582-0148            Salt Lake City, UT  84111
Cell:   801-694-6334            www.codegreene.com

_______________________________________________

UPHPU mailing list
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
