Hi,

I think it seldom pays to be too clever with Bayes.  If (and this is a
big if) you have a large enough sample of mail, in our experience it's
better just to shovel it all into Bayes than to be selective about
what you present to Bayes.  The Bayes algorithms are usually pretty
good at picking out the signal from the noise.

That's been my finding as well. It's too much work to try to outsmart the algorithm by picking and choosing what to feed it every time.

Bayes expiry is a tricky thing.  To do expiry in a way that can be justified
mathematically, you really should expire messages, not individual tokens.
Otherwise, you're skewing the probabilities.  Doing it properly is unwieldy
because you have to remember all the messages (or at least, all the tokens
in the messages) going back over your expiry window.
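
To check that I understand the distinction, here is a toy Python sketch of message-level expiry. It is my own illustration, not how SpamAssassin stores anything: the WindowedBayes class and its method names are made up, and it assumes messages are learned in timestamp order. The point is that dropping a whole message lowers the token counts and the message totals together, so the per-token fractions stay consistent.

```python
from collections import Counter, deque


class WindowedBayes:
    """Toy token store that expires whole messages (hypothetical sketch)."""

    def __init__(self):
        # Remembered messages, oldest first: (timestamp, is_spam, token set).
        self.messages = deque()
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, timestamp, is_spam, tokens):
        tokens = frozenset(tokens)  # count each token once per message
        self.messages.append((timestamp, is_spam, tokens))
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def expire_before(self, cutoff):
        # Remove every remembered message older than the cutoff, adjusting
        # token counts and message totals in the same step.
        while self.messages and self.messages[0][0] < cutoff:
            _, is_spam, tokens = self.messages.popleft()
            if is_spam:
                self.spam_tokens.subtract(tokens)
                self.nspam -= 1
            else:
                self.ham_tokens.subtract(tokens)
                self.nham -= 1
        # Unary plus drops tokens whose count has fallen to zero.
        self.spam_tokens = +self.spam_tokens
        self.ham_tokens = +self.ham_tokens

    def spam_fraction(self, token):
        """Fraction of remembered spam messages that contain the token."""
        return self.spam_tokens[token] / self.nspam if self.nspam else 0.0
```

The cost is exactly the one described above: the deque has to hold the token set of every message in the window.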

There's also no way I could do this, because I've enabled autolearning. I need as many tips as possible for automating this successfully.

What we do is twice a day, we build a brand new Bayes database from scratch
containing messages we've seen in the last 14 days.  The database
contains tokens from about 5.1 million spams and 4.5 million hams, totalling
about 18 million tokens.
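
If I'm reading this right, the rebuild amounts to something like the Python sketch below. This is only my guess at the idea, not your actual tooling: the rebuild_database function, the tuple layout of the corpus, and the dictionary keys are all hypothetical. Old messages are never expired; they simply never make it into the new database.

```python
from collections import Counter
from datetime import datetime, timedelta


def rebuild_database(corpus, now, window=timedelta(days=14)):
    """Build fresh counts from messages received within the window.

    corpus is an iterable of (received_at, is_spam, tokens) tuples,
    a made-up layout used only for this illustration.
    """
    cutoff = now - window
    db = {"spam": Counter(), "ham": Counter(), "nspam": 0, "nham": 0}
    for received_at, is_spam, tokens in corpus:
        if received_at < cutoff:
            continue  # older than the window: skipped, so nothing to expire later
        label = "spam" if is_spam else "ham"
        db[label].update(set(tokens))  # count each token once per message
        db["n" + label] += 1
    return db
```

Running this on a schedule and swapping the result in for the old database would give a sliding 14-day window without any per-token expiry logic.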

So instead of trying to figure out the proper expiry period, you just start over completely every two weeks?

Why is this more effective than continually learning as you go and expiring tokens older than two weeks?

That just sounds seriously labor-intensive. Is it because end-users are involved in the training and aren't doing it correctly that you dump the database twice a day?

What does that do to your "bayes seen" component? It doesn't get much time to learn. Or is that not necessary because you're only using the last two weeks of data?

Obviously, for this to work, you need a large message volume and a large
number of people marking stuff as ham vs. spam.  It's probably not a feasible
approach for small-to-medium SpamAssassin installations.

So there is no differentiation between domains or networks in your Bayes database?

Are the headers and attachments part of the learning, or does Bayes only consider the body?

Would it be helpful to have something that graphs the data to monitor the effect of learning changes? Does something already exist?

Thanks,
Alex
