On Sun, 7 Jan 2007, Andy Dills wrote:

> On Sun, 7 Jan 2007, decoder wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Hello all,
> >
> > since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
> > testers and bug reporters :) so big thanks.
>
> I have something I'm curious about, having run FuzzyOcr in a medium size
> (3-400k messages per day) mail cluster for about a week now.
>
> Why do you do database maintenance with every unmatched check?
>
> From Hashing.pm:
>
>         unless ($match) {
>             my $then = time - ($conf->{focr_db_max_days}*86400);
> --->        $sql = qq(select * from $db.$dbfile order by $dbfile.check);
>             my $sth = $ddb->prepare($sql); $sth->execute;
>             while (my @row = $sth->fetchrow_array) {
>                 my $hash2 = $row[1] || "0:0:0:0";
>                 $hash2 .= "::$row[0]";
>                 if (within_threshold($digest,$hash2)) {
>                     $txt = 'Approx';
>                     $key = $row[0];
>                     $next = $row[5] + 1;
>                     $when = $row[7] || $now;
>                     $ret = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
>                     $dinfo = $row[9] || '';
>                     infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
>                     last;
>                 }
>             }
>             # Expire old records...
> --->        $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
>             debuglog($sql,2);
>             $ddb->do($sql);
>         }
>
> Those two queries are extremely expensive in a larger environment...I have
> commented this code segment out on our cluster, and have written a quick
> maintenance script that runs once per day...dropped the response time from
> 2-3s to .01-.05s on queries, and eliminated the suddenly large
> and customer-annoying mailqueues.
Sorry to follow up to my own post, but now that I read this segment a
little closer I realize that I'm basically commenting out the matching
capability of the Hashing mechanism, eliminating all value of the Hashing
in the first place.

So...I guess my point is, unless there is a better way of determining the
match than checking every single hash in the database (hoping that you
find one that is close enough along the way), it's more efficient (in
larger environments at least) to just scan each mail message without
hashing enabled.

Thoughts?

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
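
For reference, the expiry half of that segment (the delete, not the select, which is the matching itself) is the part that can safely move to a once-per-day job. What follows is only a rough sketch of such a job, not the script actually used on the cluster. The FOCR_* environment variables, the sub names, and the idea of passing table names as a comma-separated list are all assumptions made for illustration; only the `check` column and the `max_days * 86400` cutoff come from the Hashing.pm excerpt.

```perl
#!/usr/bin/perl
# Hypothetical once-per-day expiry job. Every FOCR_* variable name below is
# an assumption for illustration; none of them is part of FuzzyOcr.
use strict;
use warnings;

# Cutoff timestamp, computed the same way as in the Hashing.pm excerpt:
# time - (focr_db_max_days * 86400).
sub expiry_cutoff {
    my ($now, $max_days) = @_;
    return $now - $max_days * 86400;
}

# Delete rows whose `check` timestamp is older than $then.
# $dbh is a connected DBI handle. The cutoff is passed as a bind value;
# the backticks keep `check` from being parsed as an SQL keyword.
sub expire_old_records {
    my ($dbh, $db, $table, $then) = @_;
    my $sql = "DELETE FROM `$db`.`$table` WHERE `check` < ?";
    return $dbh->do($sql, undef, $then);
}

# Only touch a database when a DSN has been configured in the environment.
if (my $dsn = $ENV{FOCR_DSN}) {
    require DBI;
    my $db       = $ENV{FOCR_DB}       or die "FOCR_DB not set\n";
    my $max_days = $ENV{FOCR_MAX_DAYS} or die "FOCR_MAX_DAYS not set\n";
    my $dbh = DBI->connect($dsn, $ENV{FOCR_USER}, $ENV{FOCR_PASS},
                           { RaiseError => 1 });
    my $then = expiry_cutoff(time, $max_days);
    for my $table (split /,/, $ENV{FOCR_TABLES} || '') {
        my $rows = expire_old_records($dbh, $db, $table, $then);
        printf "%s: expired %d rows\n", $table, $rows + 0;
    }
    $dbh->disconnect;
}
```

Run from cron once a day, this takes the delete out of the per-message path. It does nothing about the cost of the select, which still walks every stored hash for each unmatched image.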