-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

I am proud to announce a new release of FuzzyOcr with lots of new features and changes. You can download it at http://users.own-hero.net/~decoder/fuzzyocr/

Before installing this plugin, make sure to read the INSTALL file. If you run into problems, read the FAQ file before sending anything to me or the list.

NOTE: Although this version has been tested for quite a while by different people and seems stable, it might still have bugs. I am not responsible for any damage caused by this plugin.

Changes made since 2.2beta1:

1) FuzzyOcr now allows you to do more than one scan on a picture. This is useful for running several scans with different settings/programs on the same image. The results are combined: for every word in the wordlist, FuzzyOcr checks how many hits the word gets in each result and uses the highest match count as the total count for that word. For example, if a word is matched 2 times in the first scan and 3 times in the second, its total count is 3.

Here is how these "scansets" work: each scanset is a single program, or a chain of programs, that takes PNM image input and outputs text. Some examples:

Simplest scanset (only uses gocr, with default values):

    $gocr -i -

Another simple scanset (only uses gocr, but with a different grey threshold setting):

    $gocr -l 180 -i -

Advanced scanset (invokes pnmnorm and pnmquant to preprocess the image):

    pnmnorm | pnmquant 3 | pnmnorm | $gocr -l 180 -i -

The last scanset tries to reduce the number of colors in the image to 3 before running gocr on it. Note that $gocr is replaced by the actual path and binary name of gocr at runtime.

You can redirect the STDERR output of your custom programs to the errfile that FuzzyOcr uses. (This is useful because STDERR is printed to the logfile if a scanset fails.) Here is an example:

    pnmnorm 2>>$errfile | pnmquant 3 2>>$errfile | pnmnorm 2>>$errfile | $gocr -l 180 -i -

If this scanset now fails (which can happen if pnmquant is not able to reduce the colors properly to 3), you will see the errors in the logfile when using debug mode.
If not, tell me ;D

The default for this setting (focr_scansets) is to do 2 scans (see the config file for details). To get back to one, use something like:

    focr_scansets $gocr -i -

In the config file, you will also find the syntax for multiple scansets (comma separated) and more examples.

2) The whole tempfile system was rewritten. FuzzyOcr now uses the internal SpamAssassin functions for tmpfile/tmpdir generation (specifying a path for temporary files is no longer needed or possible). All files are now properly unlinked.

3) FuzzyOcr now supports interlaced gifs. They are converted to non-interlaced ones and then processed. If the interlaced image is corrupt, it is not scanned; instead, it is scored with the corrupt image score only. This is due to a limitation of giffix, which cannot fix interlaced gifs. The corrupt image score has therefore been increased to 5 points.

4) FuzzyOcr now supports animated gifs. It has two ways to check them. The first is used if the image contains fewer than x frames, where x is the value specified by "focr_gif_max_frames" in the config. The default is 5. In this method, ImageMagick's convert is invoked to combine all frames into one bigger image, which is then processed. The second method is used if the image has x or more frames. In this method, gifasm is used to split the image into files containing one frame each (this happens in a tmpdir), and the biggest file is picked for scanning. Corrupt animated gifs are handled exactly like corrupt interlaced gifs.

5) FuzzyOcr now supports external wordlists. It has both a global wordlist (which must be configured in the cf file) and a personal list for the user executing spamassassin/spamc. Both lists allow bash-style comments (#comment and wordhere #comment). The path and name of the personal list, relative to the home directory, can be configured in the cf file. The default is .spamassassin/fuzzyocr.words.
The global and personal lists are concatenated before scanning. A sample wordlist is shipped in the tarball.

6) Spaces are now stripped from wordlist words and OCR results before matching. This increases the chance of a hit, because gocr sometimes recognizes lots of spaces where there are none (depending on the font).

7) The logfile is now locked for exclusive writing when a message is logged. The same applies to tmpfiles. This ensures that spamd children running at the same time don't interfere with each other.

8) An experimental MD5 database feature has been added (disabled by default). It allows you to save MD5 hashes of already recognized images in a database for faster processing if the same image reaches you again.

9) Millions of bugfixes and rewrites ;) I can't enumerate them all :P

TODO:

- The second test for animated gifs is still a bit hacky... it works for me but I don't know how well it works with different mails :) So, keep me informed :)
- The effectiveness of the MD5 db is unknown. Please tell me if this feature catches enough mails to be worth using :)

Bug reports and support requests can be sent to me or the mailing lists, or you can contact me on IRC directly:

Server: irc.own-hero.net
Port: 6667 (or 6697 for SSL)
Channel: #fuzzyocr

Special thanks for their help to Howard Kash, Burnie Pettersen, Ken Bass, Maximilian Grothusmann, Robert LeBlanc, Matthias Keller, Alex the Ninja and everyone else I might have missed :D

Best regards,

Chris

P.S.: Starting September 01, I will be busy with university stuff again and won't be able to do much for this project until I have more time again :)

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE7LFtJQIKXnJyDxURAjN8AJ9w/DtR3IcOhGgJa9MMh4kTm8PgIgCeIc+O
FYmm7VkUu5kK9haDrdmpc5Q=
=4pnR
-----END PGP SIGNATURE-----