I got a mail with image spam today (I probably got quite a few, but Gmail blocks most of them nowadays):
http://www.webfast.com/~skip/thermometer.gif

I ran gocr 0.41 over it and got this output:

    > _'__o______ __ ____o______ ___ i__8 _____ 00,__ 0 0_,_> 0 __8 ___E3 __>_E3 __ E3_,__ _____ 0,__,_ _ _0______ _ 0 __0 _, ___ _E3____ E3 _ _ _ ____ ____ 'o__0____ ____ 0>,E3 _______ __ _________, _,______ _ 0 __________ ___,,_____, ____ ____',____ ____ ___ ___ _ 0 ___ >__ ____ ___ ____ _ ___E3_ ___e__ ___E3___ 0 ______

The latest version is 0.43, so I downloaded and built it (with a couple of slight tweaks needed). When fed the same image it spit out:

    _ _ _ _ _ _ _ X;niy_nha_ Technology Ltd qnb oI_ _ p_rce I1.SB lP 1_.6_ hb te: H_ts Il_ghs of I1._B TodJy .M_ rc _ Fxpected T _ rr _ Ini thc Izst 3 _ eks they ha_e ianded o_er I1.Z M II_on _n contracts. TJdays n _ Jnnounced anothe? huge cont_iact. Read all the n _ and set ycur buy fur_ mm f_rst cn_ng Tuesday nD rn_ng!

Pretty huge improvement. (I think you can see why I gave up on gocr before.) By comparison, with my latest massaging of the input fed to ocrad I get:

    X?nU?nha? TechnologU L_d glbol! _ p_rce __.58 LP _3.6_ __e: H__s H_ghs or __.78 Tod_V _re _ Expec_ed T_rr_ In _he las_ 3 _ehs _heV ha_e landed o_er t_.2 n?ll?on ?n con_roc_s, TodoVs n_ onnounced ono_her huge con_rac_, RPad all _he n_ and se_ Uour buU ror mM r?rs_ _h?ng TuesdaU nDrn?ng!

Without any massaging ocrad doesn't find any text. You have to give the --invert flag. Seems like it should automatically try to invert the image if its first attempt to extract text completely fails.

At any rate, gocr looks much better than it did. I'm going to install it and give your patch a try for a couple of days. It looks fine based on a simple skim of the changes. Go ahead and check it in so more people can play with it.

Skip
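
P.S. The retry-with-inversion idea mentioned above for ocrad could be approximated from the calling side. What follows is only a rough sketch, not anything ocrad or spambayes actually does: the helper names (run_ocrad, has_text, ocr_with_invert_fallback) are made up for illustration, it assumes ocrad is on the PATH and that the image is already in a format ocrad can read, and "found text" is crudely taken to mean "the output contains at least one letter or digit". The only ocrad option used is --invert.

```python
import subprocess


def run_ocrad(path, invert=False):
    """Run the ocrad command on an image file and return its stdout.

    Hypothetical helper: assumes `ocrad` is installed and on the PATH.
    """
    cmd = ["ocrad"]
    if invert:
        cmd.append("--invert")
    cmd.append(path)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout


def has_text(output):
    """Crude check: did the OCR pass produce any letter or digit?

    Underscores and whitespace (what a failed pass tends to leave
    behind) do not count as text here.
    """
    return any(ch.isalnum() for ch in output)


def ocr_with_invert_fallback(path, run=run_ocrad):
    """Try a normal pass first; retry inverted only if nothing was found."""
    text = run(path, invert=False)
    if has_text(text):
        return text
    return run(path, invert=True)
```

The runner is passed in as a parameter so the fallback logic can be exercised without ocrad installed, for example with a stub that returns canned output.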