Re: A rant about FUZZY_OCR

Henrik K Sun, 26 Apr 2009 23:42:47 -0700

On Sun, Apr 26, 2009 at 02:37:06PM -0400, Adam Katz wrote:
> > On Fri, Apr 24, 2009 at 05:14:21PM -0400, Adam Katz wrote:
> >> I wouldn't trust FUZZY_OCR with anything.  12 points is *WAY* too high
> >> for any single thing.  I had to disable this plugin a year or three
> >> ago because it assigned 20+ points to legit screenshots in ham (and
> >> that was /after/ I trimmed its flagging words file down in size)!
> 
> Henrik K wrote:
> > You do realize that it's configurable? Who to blame if you just run
> > things blindly.
> 
> I expect the defaults to at least border on sane.  As noted before, I've
> tried and failed to configure it.  Could you point me at where the
> configuration options are specified, specifically focr_threshold?  All I
> see is the installation manual and the .cf file, neither of which is
> terribly informative (like say the perldoc pages for other plugins).


Unfortunately it's not a sane world. But if you have any logic, you will see
that focr_base_score and focr_add_score mainly make up the score. One can
argue that the popular "botnet" plugin also doesn't have sane defaults.
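
As a rough illustration of how those two settings could combine, here is
a minimal sketch. It assumes the score is a base value plus a fixed
increment per extra matched word; that formula, the function name, the
min_hits parameter, and the numbers are all assumptions made for
illustration, not taken from the plugin source.

```python
def fuzzy_ocr_score(hits, base_score, add_score, min_hits=1):
    """Hypothetical model of a dynamically scored rule.

    hits       -- number of word-list matches found in the image
    base_score -- stands in for focr_base_score
    add_score  -- stands in for focr_add_score
    min_hits   -- assumed minimum number of matches before the rule fires
    """
    if hits < min_hits:
        # Too few matches: the rule does not fire at all.
        return 0.0
    # Base score for reaching the minimum, plus an increment per extra match.
    return base_score + add_score * (hits - min_hits)


if __name__ == "__main__":
    # Illustrative values only; they are not the plugin's defaults.
    for hits in (0, 2, 5, 20):
        print(hits, fuzzy_ocr_score(hits, base_score=5.0, add_score=1.0, min_hits=2))
```

Under this model, lowering the two values is what keeps a burst of
spurious matches from piling up a large total.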

> I don't know if I still have the example of the bad hit from those years
> ago, but it made absolutely no sense, hitting dozens of "words found"
> that did not exist ... and this was a PNG screen capture, not even a
> photo or a JPEG-compressed image.  My company deals with screen captures
> a LOT, and I just can't afford for such a poorly designed plugin to run
> amok the way Fuzzy OCR does.

I'm sorry that you are disappointed with the design. If you need "mission
critical" code, then you must expect that code people generously write in
their spare time for free might have a few kinks. Were you on the fuzzyocr
mailing list a few years ago, and did you participate in the development
process?

> It's extremely disturbing that there are several tests (which is a good
> thing), but none of them are designed to test for false positives, or
> even to help you tweak the detection threshold.  You're left guessing
> what reasonable levels are, especially when the config file (the best
> docs I could find) points you at the manual (which I believe is the
> install guide, which doesn't even include the string "thresh").
> The last release was two years ago, and even on the svn trunk, the word
> list hasn't been updated ... ever (excepting minor tweaks like a
> threshold change from 0.1 to 0.01).  How is this fair?

The plugin was last needed a few years ago. Why should it still be updated
today when there has been no image spam since? There is not much point in
making general word lists. You put in what your mail flow sees. Someone from
a medical company could be using it and come screaming about the "bad
defaults".

> The claim that FUZZY_OCR can't use the Bayesian database is a weak one,
> too; just make a custom prefix to the tokens it creates (I don't know
> SA's bayes token syntax, but other implementations use things like
> "subject:foo" to indicate that the word "foo" in the subject differs
> from the word "foo" elsewhere, so you could have "fuzzyocr:foo"
> instead).  Implement the fuzziness by inserting a dozen tokens for each
> possible parsing.  This would solve the issue of stale or inappropriate
> word lists.

You are free to contribute code. If I remember right, someone may have
tried it; some discussion might be found in the mailing list archives.
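
For anyone who wants to experiment, the prefix idea quoted above could be
sketched roughly as follows. This is only an illustration: the function
name, the table of confusable characters, and the "fuzzyocr" prefix
format are assumptions, and nothing here uses SpamAssassin's actual Bayes
token syntax or API.

```python
from itertools import product

# Hypothetical table of characters that OCR output commonly mixes up.
CONFUSABLE = {
    "0": "0o", "o": "o0",
    "1": "1li", "l": "l1i", "i": "i1l",
    "5": "5s", "s": "s5",
}


def prefixed_tokens(word, prefix="fuzzyocr", limit=12):
    """Return up to `limit` prefixed tokens for one OCR'd word.

    Each token is one possible reading of the word, so a Bayes-style
    classifier could learn them separately from body or subject tokens.
    """
    # For every character, list the readings it could stand for.
    choices = [CONFUSABLE.get(ch, ch) for ch in word.lower()]
    tokens = []
    for combo in product(*choices):
        tokens.append(f"{prefix}:{''.join(combo)}")
        if len(tokens) >= limit:
            break  # cap the number of variants ("a dozen" in the quote)
    return tokens


if __name__ == "__main__":
    print(prefixed_tokens("foo"))
```

The cap keeps a long word with many ambiguous characters from flooding
the token database with variants.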

> Finally, I have no way of testing the thing live.  Since FUZZY_OCR is a
> dynamically scored rule, I can't just push it to 0.001 and see the hits,
> the way I can with the BAYES_XX thresholds for example.  (Sure, I can
> make all score-changing values 0.001, but I'm not sure that would
> properly test it, and given my past experiences, I wouldn't be surprised
> if this still causes problems.)

None of this makes sense. If you don't have a test server, too bad. If
you don't trust the "score-changing values", too bad. It all worked for me.

> It's a great idea, but I'd like to see it mature some first, especially
> with respect to its documentation, test emails, word list, and live testing.

It was quickly developed in response to an ongoing problem. The problem
disappeared years ago. It was mature enough for 99% of users at that time,
though it did add lots of complexity, and stricter MTA rules etc. handled
the job just fine as well.

Cheers,
Henrik
