Hello!
I've decided to compare how the dspam and SpamAssassin
Bayes implementations perform, because speed is very
important for large installations and the author of dspam
says that his pure C implementation is much faster than
Perl. I've written two Perl scripts that run the dspam
agent and spamc over my ham corpus and measure the total,
min, max and average time to process a message. Dspam
was configured with MySQL storage optimized for
speed. For the SpamAssassin benchmark I used spamd with
only the Bayes rules enabled. Both were trained on exactly
the same spam and ham corpus. Here are the results
(all times in seconds):

# DSPAM

1630 messages processed.
Total time: 230.084 wallclock secs (30.42 cusr + 17.97 csys = 48.39 CPU)
Max message processing time: 13.3091468811035
Avg message processing time: 0.140900963947086
Min message processing time: 0.0444350242614746

# SpamAssassin

1630 messages processed.
Total time: 254.895 wallclock secs ( 3.54 cusr + 10.46 csys = 14.00 CPU)
Max message processing time: 3.65092492103577
Avg message processing time: 0.156147952606342
Min message processing time: 0.0727198123931885

It seems that SpamAssassin is not much slower than
dspam (about 11% on both total and average wallclock
time), although the results are biased against dspam
because:
1) dspam was configured with default settings, which
enable two algorithms (bayes and altbayes);
2) dspam was configured to attach signatures with
tokens for re-learning;
3) dspam uses chained tokens, which increase the volume
of data to be processed.
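
To put the totals on a per-message basis, here is a minimal sketch that
uses only the figures printed above. The %runs layout and the
per_message helper are illustrative names, not part of the original
scripts. One caveat when reading the CPU column: Benchmark only counts
child processes of the benchmark script (the shell plus the dspam or
spamc client), so any work done inside the spamd daemon or a separate
MySQL server is not included, and the CPU totals are probably not
directly comparable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Totals copied from the two runs above (seconds, message count).
my %runs = (
    'DSPAM'        => { wall => 230.084, cpu => 48.39, n => 1630 },
    'SpamAssassin' => { wall => 254.895, cpu => 14.00, n => 1630 },
);

# Return (wallclock seconds, child CPU seconds) per message for one run.
sub per_message {
    my ($run) = @_;
    return ($run->{wall} / $run->{n}, $run->{cpu} / $run->{n});
}

for my $name (sort keys %runs) {
    my ($wall, $cpu) = per_message($runs{$name});
    printf "%-13s %.4f s wall/msg, %.4f s child CPU/msg, %.1f msg/s\n",
        $name, $wall, $cpu, $runs{$name}{n} / $runs{$name}{wall};
}
```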

I'm also very surprised that dspam's maximum message
processing time is higher (13.3 s versus 3.65 s).
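
A single slow message can dominate the maximum and pull up the
average, so a median and a high percentile may say more about typical
behaviour than min/max/avg. A rough sketch, assuming the per-message
times are available in a list like @times in the scripts; the
percentile helper (nearest-rank method) and the sample values are
hypothetical and not measured data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(ceil);

# Nearest-rank percentile: the smallest value such that at least
# $p percent of the values are less than or equal to it.
sub percentile {
    my ($p, @values) = @_;
    return undef unless @values;
    my @sorted = sort { $a <=> $b } @values;
    my $rank = ceil($p * @sorted / 100);   # 1-based rank
    $rank = 1 if $rank < 1;
    return $sorted[$rank - 1];
}

# Hypothetical sample with one outlier; not taken from the runs above.
my @sample = (0.05, 0.06, 0.07, 0.08, 13.3);
printf "median %.3f  p95 %.3f  max %.3f\n",
    percentile(50, @sample), percentile(95, @sample), percentile(100, @sample);
```

In the benchmark scripts this could be called as percentile(50, @times)
after the loop, next to the existing max/avg/min lines.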

This is mostly a toy benchmark, but I would like to hear
suggestions on how the results can be improved.
Eugene

-- 
Email: jmv /at/ online.ru
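#!/usr/bin/perl -w
# Time dspam on every message in the ham corpus and report the
# total, max, average and min wallclock time per message.

use strict;
use Benchmark ':hireswallclock';
use List::Util qw(min max sum);

my $dir = "/home/sad/Mail/ham-corpus";
my @messages = <$dir/*>;
die "No messages found in $dir\n" unless @messages;

my $total_start = Benchmark->new;
my $count = 0;
my @times;
my ($t0, $t1, $td);
foreach my $message (@messages) {
  $t0 = Benchmark->new;
  # Runs through /bin/sh, so shell startup is included in each timing.
  system("dspam < '$message' >/dev/null");
  $t1 = Benchmark->new;
  $td = timediff($t1, $t0);
  push(@times, $td->[0]);  # element 0 is the wallclock (real) time
  $count++;
}
my $total_stop = Benchmark->new;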

print "\n";
print "$count messages processed.\n";
print "Total time: " . timestr(timediff($total_stop, $total_start), 'nop') . "\n";
print "Max message processing time: " . max(@times) . "\n";
print "Avg message processing time: " . sum(@times)/@times . "\n";
print "Min message processing time: " . min(@times) . "\n";
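#!/usr/bin/perl -w
# Time spamc (talking to spamd on port 4444) on every message in the
# ham corpus and report total, max, average and min wallclock time.

use strict;
use Benchmark ':hireswallclock';
use List::Util qw(min max sum);

my $dir = "/home/sad/Mail/ham-corpus";
my @messages = <$dir/*>;
die "No messages found in $dir\n" unless @messages;

my $total_start = Benchmark->new;
my $count = 0;
my @times;
my ($t0, $t1, $td);
foreach my $message (@messages) {
  $t0 = Benchmark->new;
  # Runs through /bin/sh, so shell startup is included in each timing.
  system("spamc -p 4444 -s 10000000 < '$message' >/dev/null");
  $t1 = Benchmark->new;
  $td = timediff($t1, $t0);
  push(@times, $td->[0]);  # element 0 is the wallclock (real) time
  $count++;
}
my $total_stop = Benchmark->new;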

print "\n";
print "$count messages processed.\n";
print "Total time: " . timestr(timediff($total_stop, $total_start), 'nop') . "\n";
print "Max message processing time: " . max(@times) . "\n";
print "Avg message processing time: " . sum(@times)/@times . "\n";
print "Min message processing time: " . min(@times) . "\n";
