Today at 12:45am, Kirk Ouimet said:
> Hi List,
> My web host allows me to control how much RAM is available on my hosted
> Linux VServer and charges me $1 for every 10 MB allocated. I wrote a script
> this week that uses information from the Linux command "top" to scale
> resources available based on current demand. Running the script ends up
> saving me about $40/month. Everything was going great until they put a
> Captcha on the page that my script uses to set my allocated resources.
> Here's an example of their Captcha image:
> http://www.kirkouimet.com/top/captcha.png
> I want to defeat it. Using PHP preferably. Anyone have any tips for me?
I don't want to talk a whole lot about it on a publicly archived list ;),
but I've done some Optical Character Recognition (OCR) on images similar
to that, using PHP and the GD library.
The images I was doing the OCR for used the same font for all their
images, so all I had to do was recognize each of the 26 letters, and now
it never misses a beat.
Here's a snippet that might get you started:
// $file is the path to the PNG image you want to dump as text.
$size_arr = getimagesize($file);
$sx = $size_arr[0];  // width in pixels
$sy = $size_arr[1];  // height in pixels
$im = imagecreatefrompng($file);
// Two-space indent lines the column headers up with the two-character row labels.
$out = "($sx,$sy)\n  ";
if ($im) {
    // Header line 1: tens digit of each column number (blank below 10).
    foreach (range(0, $sx - 1) as $x) {
        $out .= ($x / 10 >= 1 ? floor($x / 10) % 10 : " ");
    }
    $out .= "\n  ";
    // Header line 2: ones digit of each column number.
    foreach (range(0, $sx - 1) as $x) {
        $out .= $x % 10;
    }
    $out .= "\n";
    foreach (range(0, $sy - 1) as $y) {
        // Row label: tens digit (or a space), then ones digit.
        $out .= ($y / 10 >= 1 ? floor($y / 10) % 10 : " ") . $y % 10;
        foreach (range(0, $sx - 1) as $x) {
            $color = imagecolorat($im, $x, $y);
            // 16777164 is 0xFFFFCC, the background color in my images.
            $bg = ($color == 16777164);
            $out .= ($bg ? " " : "*");
        }
        $out .= "\n";
    }
    echo $out;
}
Since this one used a consistent font, I only had to recognize the pattern
of pixels in the first few columns of each letter to tell which letter it
was, and then skip over x columns to the start of the next letter. It
didn't need to guess probabilistically or weigh percentages, so it was
totally deterministic, which made it really nice and much easier.
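To make the column-matching idea concrete, here is a minimal sketch. It is not my actual code. The two-letter, 3-pixel-tall font, the $glyphs table, and the function names columnAt and readFixedFont are all made up for illustration, and the image is given as rows of "*" and " " like the dump above prints.

```php
<?php
// Hypothetical two-letter font, 3 pixels tall. Each key is the first two
// pixel columns of a glyph (each read top to bottom, joined by "|"); each
// value is the letter and how many columns wide the glyph is.
$glyphs = [
    "***|  *" => ['L', 3],
    "*  |***" => ['T', 3],
];

// Return column $x of the image as a string, top to bottom.
// Columns past the right edge read as background.
function columnAt(array $rows, int $x): string {
    $col = '';
    foreach ($rows as $row) {
        $col .= $row[$x] ?? ' ';
    }
    return $col;
}

// Walk left to right, identify each glyph from its leading columns,
// then jump ahead by that glyph's width.
function readFixedFont(array $rows, array $glyphs, int $keyCols = 2): string {
    $width = strlen($rows[0]);
    $blank = str_repeat(' ', count($rows));
    $text = '';
    $x = 0;
    while ($x < $width) {
        if (columnAt($rows, $x) === $blank) {
            $x++;                 // gap between letters
            continue;
        }
        $parts = [];
        for ($i = 0; $i < $keyCols; $i++) {
            $parts[] = columnAt($rows, $x + $i);
        }
        $key = implode('|', $parts);
        if (!isset($glyphs[$key])) {
            $text .= '?';         // unknown pattern: flag it and move on
            $x++;
            continue;
        }
        [$letter, $glyphWidth] = $glyphs[$key];
        $text .= $letter;
        $x += $glyphWidth;
    }
    return $text;
}

$image = [
    "*   ***",
    "*    * ",
    "***  * ",
];
echo readFixedFont($image, $glyphs), "\n";
```

The lookup is an exact string match, so it only works when every letter is rendered pixel-for-pixel the same each time. Any noise or distortion in the leading columns shows up as a "?" in the output.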
For yours, you'd want to start by taking off the border, then run a noise
reduction algorithm over it to take out the stray dots. Basically, you'll
load the pixels into a two-dimensional array and go through it row by row,
column by column. You'll want to take out any pixel that is dark when
everything around it is light. Like this:
(x-1,y+1)  (x,y+1)  (x+1,y+1)
(x-1,y  )  (x,y  )  (x+1,y  )
(x-1,y-1)  (x,y-1)  (x+1,y-1)
If the pixel at (x,y) is black, and the other 8 are white, set (x,y) to
white.
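A rough sketch of that rule, assuming the pixels are already in an array $grid[$y][$x] holding 1 for dark and 0 for light. The function name removeIsolatedPixels and the sample data are made up, and building the array from imagecolorat() is left out.

```php
<?php
// $grid[$y][$x] is 1 for a dark pixel and 0 for a light one.
// Returns a copy in which every dark pixel with no dark pixel among its
// 8 neighbors has been set to light. Anything outside the image counts
// as light.
function removeIsolatedPixels(array $grid): array {
    $out = $grid;  // read from $grid, write to $out, so scan order doesn't matter
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $pixel) {
            if ($pixel !== 1) {
                continue;
            }
            $darkNeighbors = 0;
            for ($dy = -1; $dy <= 1; $dy++) {
                for ($dx = -1; $dx <= 1; $dx++) {
                    if ($dx === 0 && $dy === 0) {
                        continue;  // skip the pixel itself
                    }
                    $darkNeighbors += $grid[$y + $dy][$x + $dx] ?? 0;
                }
            }
            if ($darkNeighbors === 0) {
                $out[$y][$x] = 0;
            }
        }
    }
    return $out;
}

$noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],   // stray dot
    [0, 0, 0, 1, 1],   // part of a stroke
    [0, 0, 0, 1, 0],
];
$clean = removeIsolatedPixels($noisy);
```

Writing to a copy matters: if you clear pixels in the same array you are reading from, the result can depend on the order you visit them in.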
Give that a try and see if that cleans up the image enough to just find
the letters. A variant on this that is a little more aggressive is to set
the pixel to white if it has only 0 or 1 black neighbors. That can erase
fine lines though, so you want to be more careful with it.
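Here is a small sketch of why the aggressive version eats fine lines. It uses the same kind of array as before (1 for dark, 0 for light). The function name despeckle and its $maxDark parameter are made up for this example.

```php
<?php
// Clear every dark pixel (1) that has at most $maxDark dark pixels among
// its 8 neighbors. $maxDark = 0 is the plain isolated-pixel rule;
// $maxDark = 1 is the more aggressive one.
function despeckle(array $grid, int $maxDark): array {
    $out = $grid;
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $pixel) {
            if ($pixel !== 1) {
                continue;
            }
            // Start at -1 because the 3x3 loop below also counts the
            // pixel itself, which we know is dark.
            $dark = -1;
            for ($dy = -1; $dy <= 1; $dy++) {
                for ($dx = -1; $dx <= 1; $dx++) {
                    $dark += $grid[$y + $dy][$x + $dx] ?? 0;
                }
            }
            if ($dark <= $maxDark) {
                $out[$y][$x] = 0;
            }
        }
    }
    return $out;
}

// A one-pixel-wide horizontal stroke, three pixels long.
$thinLine = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
];
$gentle     = despeckle($thinLine, 0);  // stroke is untouched
$aggressive = despeckle($thinLine, 1);  // both end pixels are erased
```

Each end of the stroke has only one dark neighbor, so one aggressive pass trims both ends, and a second pass would erase the pixel that is left. That is the sort of damage to watch for on thin strokes.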
Anyway, it's a topic I am very interested in, and I think your problem can
be solved. This month is crazy busy at work, but if by April you don't
have it solved but still want to solve it, I'd love to spend a few hours
playing with it with you. It might be as little as a couple hours to crack
the hardest part of the problem.
One thing you'll need to do though is get a large sample of captchas, so
you can be sure you have at least a few instances of every letter. The
first step, after any image preprocessing, is to make sure you can
recognize each letter correctly. Then you can tackle the whole-image
problem much more easily.
Thanks,
Mac
--
Mac Newbold Code Greene, LLC
CTO/Chief Technical Officer 44 Exchange Place
Office: 801-582-0148 Salt Lake City, UT 84111
Cell: 801-694-6334 www.codegreene.com
_______________________________________________
UPHPU mailing list
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net