Re: FuzzyOcr 3.5.1 released

John Scully Sun, 07 Jan 2007 18:15:42 -0800

I wonder if Vernon Schryver at Rhyolite could tie FuzzyOcr into the DCC (Distributed Checksum Clearinghouse) project. We operate one of the several hundred nodes in the DCC network, and it has been a great tool in spam control. For anyone who is not familiar with it, DCC is a network of public and private servers that exchange floods of millions of bulk mail "fingerprints" based both on "spamminess" and just general bulk of the mailings. Info at www.rhyolite.com

The advantage is that the DCC servers keep their checksum DB in memory and are lightning fast. The OCR check would be a lot more intensive than the current conversion of a mail body into a set of checksums... but it would allow the network of servers to exchange the fingerprints of spam images.

To give you an idea, our DCC server currently has these stats. The key items: 22,057,457 checksums in memory, using a little over 1.1G of RAM. We receive about 4,000 reports per minute from the network and send about 200 per minute from emails we process.
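For a rough sense of scale, those figures can be turned into per-checksum and per-day numbers. This is back-of-envelope arithmetic only, assuming "1.1G" means 1.1e9 bytes and that the per-minute rates hold steady around the clock; neither assumption comes from the stats themselves.

```python
# Back-of-envelope arithmetic on the DCC server stats quoted above.
# Assumptions (not from the post): 1.1G == 1.1e9 bytes, and the
# per-minute rates are constant over a full day.

CHECKSUMS_IN_MEMORY = 22_057_457
RAM_BYTES = 1.1e9
REPORTS_IN_PER_MINUTE = 4_000
REPORTS_OUT_PER_MINUTE = 200
MINUTES_PER_DAY = 24 * 60


def dcc_scale():
    """Return derived figures as a dict of plain numbers."""
    return {
        "bytes_per_checksum": RAM_BYTES / CHECKSUMS_IN_MEMORY,
        "reports_in_per_day": REPORTS_IN_PER_MINUTE * MINUTES_PER_DAY,
        "reports_out_per_day": REPORTS_OUT_PER_MINUTE * MINUTES_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in dcc_scale().items():
        print(f"{name}: {value:,.1f}")
```

On these assumptions each checksum costs on the order of 50 bytes of RAM, and since the post says "a little over 1.1G" the real figure is slightly higher.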

Of course, you only need to run your own DCC server if you are processing well over 100,000 emails per day.


John Scully
isupportisp.com

----- Original Message -----
From: "Andy Dills" <[EMAIL PROTECTED]>

To: "decoder" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <users@spamassassin.apache.org>
Sent: Sunday, January 07, 2007 5:42 PM
Subject: Re: FuzzyOcr 3.5.1 released


On Sun, 7 Jan 2007, Andy Dills wrote:

On Sun, 7 Jan 2007, decoder wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> Hello all,
>
>
> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
> testers and bug reporters :) so big thanks.


I have something I'm curious about, having run FuzzyOcr in a medium size
(3-400k messages per day) mail cluster for about a week now.

Why do you do database maintenance with every unmatched check?

From Hashing.pm:

        unless ($match) {
            my $then = time - ($conf->{focr_db_max_days}*86400);
--->        $sql = qq(select * from $db.$dbfile order by $dbfile.check);
            my $sth  = $ddb->prepare($sql); $sth->execute;
            while (my @row = $sth->fetchrow_array) {
                my $hash2 = $row[1] || "0:0:0:0";
                $hash2 .= "::$row[0]";
                if (within_threshold($digest,$hash2)) {
                    $txt   = 'Approx';
                    $key   = $row[0];
                    $next  = $row[5] + 1;
                    $when  = $row[7] || $now;
                    $ret   = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
                    $dinfo = $row[9] || '';
                    infolog("Found[$dbfile]: Score='$row[8]' Info:'$row[9]'");
                    last;
                }
            }
            # Expire old records...
--->        $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
            debuglog($sql,2);
            $ddb->do($sql);
        }

Those two queries are extremely expensive in a larger environment... I have
commented this code segment out on our cluster, and have written a quick
maintenance script that runs once per day... dropped the response time from
2-3s to .01-.05s on queries, and eliminated the suddenly large
and customer-annoying mailqueues.
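The expiry half of that change can be sketched on its own. The following is a minimal illustration, not the actual maintenance script: it is written in Python against an in-memory SQLite database as a stand-in for the MySQL tables, and the table name `image_hashes`, the columns `hash_key` and `check_time`, and the helper `expire_old_hashes` are all hypothetical names chosen for the example. The idea is simply to run the delete once per day (for instance from cron) instead of after every unmatched lookup, with an index on the timestamp column so the delete does not need a full scan.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86400


def expire_old_hashes(conn, table, max_days, now=None):
    """Delete rows last checked more than max_days ago; return rows removed.

    `table` is interpolated into the SQL, so it must come from trusted
    configuration, never from message content.
    """
    now = int(time.time()) if now is None else now
    cutoff = now - max_days * SECONDS_PER_DAY
    cur = conn.execute(f"DELETE FROM {table} WHERE check_time < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def demo():
    """Build a tiny hypothetical table and expire anything older than 30 days."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image_hashes (hash_key TEXT PRIMARY KEY, check_time INTEGER)"
    )
    # Index on the timestamp so the expiry query avoids a full table scan.
    conn.execute("CREATE INDEX idx_check_time ON image_hashes (check_time)")
    now = 1_168_000_000  # fixed timestamp so the demo is deterministic
    conn.executemany(
        "INSERT INTO image_hashes VALUES (?, ?)",
        [("old", now - 40 * SECONDS_PER_DAY), ("recent", now - SECONDS_PER_DAY)],
    )
    deleted = expire_old_hashes(conn, "image_hashes", max_days=30, now=now)
    remaining = [row[0] for row in conn.execute("SELECT hash_key FROM image_hashes")]
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

This only covers the delete query; the select that walks every stored hash is a separate cost.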


Sorry to follow up to my own post, but now that I read this segment a
little closer I realize that I'm basically commenting out the matching
capability of the Hashing mechanism, eliminating all value of the Hashing
in the first place.

So...I guess my point is, unless there is a better way of determining the
match than checking every single hash in the database (hoping that you
find one that is close enough along the way), it's more efficient (in
larger environments at least) to just scan each mail message without
hashing enabled.
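One possible "better way" is to bucket the stored hashes so that only nearby candidates are compared, rather than every row. The sketch below is an assumption-laden illustration in Python, not how FuzzyOcr or `within_threshold` actually works: it treats each hash as a tuple of integer features and defines a match as every feature differing by at most a fixed tolerance. The class `BucketIndex` and its methods are hypothetical. With a cell width equal to the tolerance, any match must sit in the same cell or an adjacent one, so a lookup probes 3^n cells for n features instead of scanning the whole table.

```python
from collections import defaultdict
from itertools import product


class BucketIndex:
    """Grid index for approximate matching of integer feature tuples.

    Assumption: two hashes match when every feature differs by at most
    `tolerance` (a positive integer). This is a stand-in rule, not the
    comparison used by within_threshold in Hashing.pm.
    """

    def __init__(self, tolerance):
        if tolerance < 1:
            raise ValueError("tolerance must be a positive integer")
        self.tolerance = tolerance
        self.buckets = defaultdict(list)

    def _cell(self, features):
        # Cell width equals the tolerance, so matches are at most one cell away.
        return tuple(f // self.tolerance for f in features)

    def add(self, key, features):
        self.buckets[self._cell(features)].append((key, tuple(features)))

    def find(self, features):
        """Return the key of the first stored hash within tolerance, or None."""
        cell = self._cell(features)
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbour = tuple(c + o for c, o in zip(cell, offset))
            for key, stored in self.buckets.get(neighbour, ()):
                if all(abs(a - b) <= self.tolerance
                       for a, b in zip(features, stored)):
                    return key
        return None


if __name__ == "__main__":
    index = BucketIndex(tolerance=5)
    index.add("spam-1", (100, 200, 50, 7))
    print(index.find((103, 198, 52, 9)))
    print(index.find((120, 200, 50, 7)))
```

The same idea maps onto a database by storing the quantized cell values in indexed columns and querying only the neighbouring cells; whether it applies depends on how the real threshold comparison is defined.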

Thoughts?

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
