I wonder if Vernon Schryver at rhyolite could tie fuzzy OCR into the DCC
(Distributed Checksum Clearinghouse) project. We operate one of the several
hundred nodes in the DCC network, and it has been a great tool for spam
control. For anyone who is not familiar with it, DCC is a network of public
and private servers that exchange floods of millions of bulk mail
"fingerprints" based both on "spamminess" and on the sheer bulk of the
mailings. More info is at www.rhyolite.com
The advantage is that the DCC servers keep their checksum DB in memory and
are lightning fast. The OCR check would be a lot more intensive than the
current conversion of a mail body into a set of checksums...but it would
allow the network of servers to exchange the fingerprints of spam images.
To give you an idea, here are the key stats from our DCC server:
22,057,457 checksums in memory, using a little over 1.1G of RAM. We
receive about 4,000 reports per minute from the network and send about 200
per minute from the emails we process.
Of course, you only need to run your own DCC server if you are processing
well over 100,000 emails per day.
John Scully
isupportisp.com
----- Original Message -----
From: "Andy Dills" <[EMAIL PROTECTED]>
To: "decoder" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <users@spamassassin.apache.org>
Sent: Sunday, January 07, 2007 5:42 PM
Subject: Re: FuzzyOcr 3.5.1 released
On Sun, 7 Jan 2007, Andy Dills wrote:
On Sun, 7 Jan 2007, decoder wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> Hello all,
>
>
> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
> testers and bug reporters :) so big thanks.
I have something I'm curious about, having run FuzzyOcr in a medium-sized
(300-400k messages per day) mail cluster for about a week now.
Why do you do database maintenance with every unmatched check?
From Hashing.pm:
unless ($match) {
    my $then = time - ($conf->{focr_db_max_days}*86400);
---> $sql = qq(select * from $db.$dbfile order by $dbfile.check);
    my $sth = $ddb->prepare($sql); $sth->execute;
    while (my @row = $sth->fetchrow_array) {
        my $hash2 = $row[1] || "0:0:0:0";
        $hash2 .= "::$row[0]";
        if (within_threshold($digest,$hash2)) {
            $txt = 'Approx';
            $key = $row[0];
            $next = $row[5] + 1;
            $when = $row[7] || $now;
            $ret = $dbfile eq $conf->{focr_mysql_hash} ?
                $row[8] : $row[5];
            $dinfo = $row[9] || '';
            infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
            last;
        }
    }
    # Expire old records...
---> $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
    debuglog($sql,2);
    $ddb->do($sql);
}
Those two queries are extremely expensive in a larger environment...I have
commented this code segment out on our cluster, and have written a quick
maintenance script that runs once per day...that dropped the response time
from 2-3s to .01-.05s on queries, and eliminated the suddenly large
and customer-annoying mail queues.
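A minimal sketch of what a once-a-day expiry job of that kind could look like. It covers only the delete (the second marked query), not the select loop that does the matching. The script itself, the function names, and the connection details are hypothetical and not part of FuzzyOcr. The cutoff arithmetic and the `check` column follow the Hashing.pm excerpt above; passing the cutoff as a bind value instead of interpolating it is a change from the original.

```perl
#!/usr/bin/perl
# Hypothetical daily expiry job; a sketch, not code shipped with FuzzyOcr.
use strict;
use warnings;

# Build the DELETE that Hashing.pm issues, with the cutoff as a placeholder.
# $db and $table are assumed to come from trusted configuration.
sub expire_statement {
    my ($db, $table, $max_days, $now) = @_;
    my $then = $now - ($max_days * 86400);   # same cutoff as Hashing.pm
    my $sql  = "delete from $db.$table where $table.check < ?";
    return ($sql, $then);
}

# Run the expiry once. $dbh is any connected DBI database handle.
sub expire_old_records {
    my ($dbh, $db, $table, $max_days) = @_;
    my ($sql, $then) = expire_statement($db, $table, $max_days, time);
    return $dbh->do($sql, undef, $then);     # rows deleted, "0E0" if none
}

# Example invocation (DSN, credentials, and names are placeholders):
#   use DBI;
#   my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'pass',
#                          { RaiseError => 1 });
#   expire_old_records($dbh, 'mydb', 'mytable', 30);
```

Run from cron once a day, this would take the delete out of the per-message path while leaving the approximate-match scan where it is.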
Sorry to follow up to my own post, but now that I read this segment a
little closer I realize that I'm basically commenting out the matching
capability of the Hashing mechanism, eliminating all value of the Hashing
in the first place.
So...I guess my point is, unless there is a better way of determining the
match than checking every single hash in the database (hoping that you
find one that is close enough along the way), it's more efficient (in
larger environments at least) to just scan each mail message without
hashing enabled.
Thoughts?
Andy
---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---