New major FuzzyOcr version: 2.3 (RC1)

decoder Wed, 23 Aug 2006 12:50:32 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,



I am proud to be able to announce a new release of FuzzyOcr with lots
of new features and changes.

You can download it at http://users.own-hero.net/~decoder/fuzzyocr/

Before installing this plugin, make sure to read the INSTALL file.

If you run into problems, make sure to read the FAQ file before
sending me or the list anything.

NOTE: Although this version has been tested for quite a while by
different people, and seems stable, it might still have bugs. I am not
responsible for any damage caused by this Plugin.

Changes made since 2.2beta1:

1) FuzzyOcr now allows you to do more than one scan on a picture.
    This is useful for running several scans with different
settings/programs on the same image.
    The results are combined.
    For every word in the wordlist, it checks how many hits it gets in
each result and picks the highest match count as the total count for
this word.
    Here is how these "scansets" work:

    Basically, each scanset is a single program, or a chain of
programs, that takes PNM image input and outputs text.

    Some examples:

        Simplest scanset (only uses gocr, with default values):
            $gocr -i -
        Another simple scanset (only uses gocr, but with a different
        grey threshold setting):
            $gocr -l 180 -i -
        Advanced scanset (invokes pnmnorm and pnmquant to preprocess
        the image):
            pnmnorm | pnmquant 3 | pnmnorm | $gocr -l 180 -i -


    The last scanset will try to reduce the colors in the image to 3,
before using gocr on it.
    Note that $gocr is replaced by the actual path+binary name of gocr
at runtime.

    You can redirect the STDERR output of your custom programs to the
errfile that FuzzyOcr uses.
    (This is useful because the STDERR output is written to the
logfile if a scanset fails.)
    Here is an example:

        pnmnorm 2>>$errfile | pnmquant 3 2>>$errfile | pnmnorm 2>>$errfile | $gocr -l 180 -i -


    If this scanset now fails (which can happen if pnmquant is not
able to reduce the colors properly to 3),
    then you'll see the errors in the logfile when using debug mode.
If not, tell me ;D

    The default for this setting (focr_scansets) is to do 2 scans
(see the config file for details). To get back to one, use something like:

        focr_scansets $gocr -i -

    In the config file, you will also see the syntax for multiple
scansets (comma separated) and more examples.
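
    To make the combining rule concrete, here is a rough Python sketch.
It is only an illustration and not the plugin's actual code: the
function names are made up, and plain substring counting stands in for
the fuzzy matching that FuzzyOcr really does.

```python
def count_hits(word, text):
    """Count non-overlapping, case-insensitive occurrences of word in text.

    Assumption: exact substring counting replaces the real fuzzy matching.
    """
    return text.lower().count(word.lower())


def combine_scan_results(wordlist, scan_outputs):
    """For each word, keep the highest hit count found in any single scan."""
    totals = {}
    for word in wordlist:
        totals[word] = max(
            (count_hits(word, output) for output in scan_outputs),
            default=0,  # no scans at all means no hits
        )
    return totals
```

    Note that the counts from different scansets are not added up; a
word that is found twice by one scanset and once by another counts
twice.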

2) The whole tempfile system was rewritten.
    FuzzyOcr now uses the internal SpamAssassin functions for
tmpfile/tmpdir generation (specification of a path for temporary files
is no longer needed/possible).
    All files are properly unlinked now.

3) FuzzyOcr now supports interlaced gifs. They get converted to
non-interlaced ones and then processed.
    If the interlaced image is corrupt, then it will not be scanned.
Instead, it will be scored with the corrupt image score only.
    This is due to giffix's limited ability to repair interlaced
gifs. The corrupt image score has therefore been increased to 5 points.

4) FuzzyOcr now supports animated gifs. It has two ways to check them.
    The first one is used if the image contains fewer than x frames,
where x is the value specified by "focr_gif_max_frames" in the config.
    The default is 5. In this method, ImageMagick's convert is invoked
to combine all frames into one larger image, which is then processed.
    The second method is used if the image has x or more frames.
    In this method, gifasm is used to split the image into files each
containing one frame (this happens in a tmpdir), then the biggest file
is picked for scanning.

Corrupt animated gifs are handled exactly as corrupt interlaced gifs.
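
    The choice between the two methods for animated gifs can be
sketched like this in Python. This is a simplified illustration under
assumptions: the function names and the "montage"/"split" labels are
invented here, and the actual calls to convert and gifasm are left out.

```python
import os


def choose_gif_method(frame_count, max_frames=5):
    """Pick the scan method for an animated gif.

    max_frames plays the role of focr_gif_max_frames (default 5).
    Fewer frames than the limit: combine all frames into one image.
    Otherwise: split into single-frame files and scan only one of them.
    """
    return "montage" if frame_count < max_frames else "split"


def pick_biggest_file(paths):
    """Return the path of the largest file on disk (second method)."""
    return max(paths, key=os.path.getsize)
```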

5) FuzzyOcr now supports external wordlists. It has both a global
wordlist (which must be configured in the cf file) and a list based on
the user executing spamassassin/spamc.
    Both lists allow comments in bash style (#comment and wordhere
#comment).
    The personal list's relative (to the homedir) path and name can be
configured in the cf file.
    The default is .spamassassin/fuzzyocr.words. The global and
personal lists are concatenated before scanning.
    A sample wordlist is shipped within the tarball.
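
    As an illustration of the wordlist format, here is a small Python
sketch of how such files could be parsed and merged. It is not the
plugin's own parser; the function names are hypothetical and the file
contents are passed in as strings to keep the example self-contained.

```python
def parse_wordlist(text):
    """Return the words from wordlist text, ignoring bash-style comments.

    Handles both full-line comments (#comment) and trailing comments
    (wordhere #comment), and skips empty lines.
    """
    words = []
    for line in text.splitlines():
        word = line.split("#", 1)[0].strip()
        if word:
            words.append(word)
    return words


def merge_wordlists(global_text, personal_text):
    """Concatenate the global list and the personal list, in that order."""
    return parse_wordlist(global_text) + parse_wordlist(personal_text)
```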

6) Spaces are now stripped from wordlist words and OCR results before
matching.
    This increases the chance of a hit, because gocr sometimes
recognizes lots of spaces where there are none (depends on font).

7) The logfile is now locked for exclusive writing when a message is
logged. The same applies to tmpfiles.
    This ensures that spamd children running at the same time don't
interfere.
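
    For readers unfamiliar with this kind of locking, here is a minimal
Python sketch of exclusive locking around a log write. It assumes a
Unix system and advisory flock() locks; the announcement does not say
which locking call the plugin uses, and the function name is made up.

```python
import fcntl


def log_message(path, message):
    """Append one line to the logfile while holding an exclusive lock."""
    with open(path, "a") as log:
        # Blocks until no other process holds a lock on this file.
        fcntl.flock(log, fcntl.LOCK_EX)
        try:
            log.write(message + "\n")
            log.flush()  # make sure the line is written before unlocking
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
```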

8) An experimental MD5 database feature has been added (disabled by
default).
    It allows you to save MD5 hashes of already recognized images in
a database for faster processing if the same image reaches you again.
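
    The idea behind the MD5 database can be sketched in Python as
follows. This is an illustration only: a plain dict stands in for the
database, and the function names and the scan callback are
hypothetical.

```python
import hashlib


def image_digest(image_bytes):
    """Return the MD5 hex digest of the raw image data."""
    return hashlib.md5(image_bytes).hexdigest()


def lookup_or_scan(image_bytes, db, scan):
    """Return the stored result for a known image, else scan and store it.

    db is any dict-like mapping from digest to result; scan is a
    function that does the expensive OCR work on the image bytes.
    """
    key = image_digest(image_bytes)
    if key in db:
        return db[key]  # same image seen before, skip the scan
    result = scan(image_bytes)
    db[key] = result
    return result
```

    This only helps if exactly the same image file arrives again,
since any change to the bytes gives a different hash.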

9) Millions of bugfixes and rewrites ;) I can't enumerate them all :P


TODO:

    -The second test for animated gifs is still a bit hacky... it
works for me but I don't know how well it works with different mails :)
     So, keep me informed :)

    -The effectiveness of the MD5 db is unknown. Please tell me if
this feature catches enough mails to be worth using :)



Bugreports and support requests can be sent to me, the mailing lists,
or you can contact me on IRC directly:

Server: irc.own-hero.net
Port: 6667 (or 6697 for SSL)
Channel: #fuzzyocr

Special thanks for their help go to Howard Kash, Burnie Pettersen, Ken
Bass, Maximilian Grothusmann, Robert LeBlanc, Matthias Keller,
Alex the Ninja and everyone else I might have missed :D


Best regards,


Chris


P.S.: Starting on September 1, I will be busy with university stuff
again and won't be able to do much for this project until I have more
time again :)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE7LFtJQIKXnJyDxURAjN8AJ9w/DtR3IcOhGgJa9MMh4kTm8PgIgCeIc+O
FYmm7VkUu5kK9haDrdmpc5Q=
=4pnR
-----END PGP SIGNATURE-----
