Is the SA Bayes implementation mathematically sound?

Damian Sat, 22 Dec 2018 15:39:46 -0800

Hi all,

Is there someone here who has a good grasp of the mathematics of Bayes
learning as it applies to SpamAssassin?


I assume that training a fresh BayesStore with a set of spam and ham
samples is mathematically sound. What bothers me a little is the
expiration logic.

The purpose of expiration seems to be a practical one: we don't want the
BayesStore to grow too much. But is there a conceptual counterpart? One
such concept could be:
Maintain the store as if it were trained from scratch with spam and ham
mails up to N days into the past.

However, if I am not mistaken, that is not what is implemented.

The nspam and nham magic counters mostly only increase. They will
decrease when a message is forgotten or relearnt, but they will not
decrease on expiration.
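
To make the concern concrete, here is a minimal toy sketch in Python. It
is not SpamAssassin's code or API: the ToyStore class, its method names,
the per-token last-seen day and the simple ratio
(s/nspam) / (s/nspam + h/nham) are all assumptions chosen for
illustration. It compares a store that expires old tokens but keeps its
nspam/nham counters with a store trained from scratch on recent mail only.

```python
from collections import Counter


class ToyStore:
    """Hypothetical toy model; not SpamAssassin's BayesStore API."""

    def __init__(self):
        self.nspam = 0           # number of spam messages learnt
        self.nham = 0            # number of ham messages learnt
        self.spam = Counter()    # token -> number of spam messages containing it
        self.ham = Counter()     # token -> number of ham messages containing it
        self.last_seen = {}      # token -> day it was last seen (assumed expiry key)

    def learn(self, tokens, is_spam, day):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] += 1
            self.last_seen[tok] = day

    def expire(self, cutoff_day):
        # Drop tokens not seen since cutoff_day.
        # nspam and nham are deliberately left untouched.
        for tok in [t for t, d in self.last_seen.items() if d < cutoff_day]:
            del self.last_seen[tok]
            self.spam.pop(tok, None)
            self.ham.pop(tok, None)

    def spam_ratio(self, tok):
        # Simplified per-token estimate; returns None for unknown tokens.
        s = self.spam[tok] / self.nspam if self.nspam else 0.0
        h = self.ham[tok] / self.nham if self.nham else 0.0
        return s / (s + h) if s + h else None


# Made-up sample: 8 old spam mails, then 2 spam and 2 ham recent mails.
mails = (
    [(["pills"], True, 0)] * 8
    + [(["offer"], True, 10)] * 2
    + [(["offer"], False, 10)] * 2
)

# Store A: learn everything, then expire tokens older than day 5.
expired = ToyStore()
for tokens, is_spam, day in mails:
    expired.learn(tokens, is_spam, day)
expired.expire(cutoff_day=5)

# Store B: trained from scratch on mail from day 5 onwards.
windowed = ToyStore()
for tokens, is_spam, day in mails:
    if day >= 5:
        windowed.learn(tokens, is_spam, day)

print(expired.nspam, windowed.nspam)
print(expired.spam_ratio("offer"), windowed.spam_ratio("offer"))
```

In this toy model both stores hold identical token counts for "offer",
yet the expired store still has nspam = 10 while the retrained one has
nspam = 2, so the ratio for "offer" comes out at about 0.17 instead of
0.5. Whether the real implementation is affected to a comparable degree
is exactly what I am unsure about.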

If I am not mistaken, there are also conceptual differences between some
BayesStore implementations. PgSQL will expire tokens if configured, but
it will not expire seen messages. Redis, on the other hand, expires both
tokens and seen messages (and, on top of that, with a huge TTL difference
between the two in the default configuration).

As a result, after some time, most BayesStores are probably in a state
that no set of mail samples could produce via training. Can such a state
still lead to statistically valid conclusions? Can both implementations
be correct?

Damian
