On 04/03/06 09:56 PM, Gabriel Wachman wrote:
A colleague and I are writing a paper about a spam filter he developed.
We'd like to compare it against various open source filters, including
SpamAssassin. The methodology we are using is to train the filter on a
set of messages, and then test it on an independent set of messages. The
key is that the filter cannot update itself at all after training.

That's the key?!

The reality is SpamAssassin CAN update itself after initial training. Running one set of tests with bayes_auto_learn disabled is fine, but skipping a second set with it enabled is warping reality.

<snip config>

During testing, I can see spamassassin create a "bayes_journal" file and
write to it continuously. I understand this is spamassassin's way of
storing its updates to bayes_* temporarily until the updates are merged.
My concern is that it's using bayes_journal in addition to bayes_toks
and bayes_seen during testing, but I just want it to use the bayes_toks
and bayes_seen generated during training.

Even with bayes_auto_learn disabled, the tokens' atimes are still updated. That's the way SpamAssassin works, and it's part of what makes SpamAssassin's bayes implementation effective.


Can someone tell me how to run spamassassin in testing mode, without
making any updates or doing any learning, but only classifying messages?

Make the files read-only or hack at the code, or break both its legs and left arm. :)
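
For the read-only route, something along these lines might do as a rough sketch. The helper name freeze_bayes_db and the default directory are assumptions for illustration, not anything SpamAssassin ships; point it at wherever your bayes_path actually puts the files. It only strips write permission from bayes_toks and bayes_seen. Whether SpamAssassin then skips its journal and atime writes cleanly, or complains, is not something this sketch verifies.

```shell
#!/bin/sh
# Hypothetical helper: strip write permission from the trained Bayes files
# so that a later test run cannot modify them.
# The default directory below is an assumption; pass your own as $1.
freeze_bayes_db() {
    dir="${1:-$HOME/.spamassassin}"
    for f in "$dir/bayes_toks" "$dir/bayes_seen"; do
        # Only touch files that exist; a missing file is silently skipped.
        if [ -f "$f" ]; then
            chmod a-w "$f"
        fi
    done
}
```

Run it once after training and before testing, e.g. freeze_bayes_db /path/to/bayes/dir, and use chmod u+w on the same files to undo it.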

Just be sure your paper makes clear that you've intentionally hampered SpamAssassin's ability to classify mail.


Daryl