Re: Skipping RBL checks for internal servers

Alex Regan Sun, 22 Mar 2015 09:45:56 -0700

Hi,

I think it seldom pays to be too clever with Bayes.  If (and this is a
big if) you have a large enough sample of mail, in our experience it's
better just to shovel it all into Bayes than to be selective about
what you present to Bayes.  The Bayes algorithms are usually pretty
good at picking out the signal from the noise.

That's been my finding as well - it's too involved to think that I cantry and outsmart the algorithm by picking and choosing what to use everytime.

Bayes expiry is a tricky thing.  To do expiry in a way that can be justified
mathematically, you really should expire messages, not individual tokens.
Otherwise, you're skewing the probabilities.  Doing it properly is unwieldy
because you have to remember all the messages (or at least, all the tokens
in the messages) going back over your expiry window.

There's also no way I could do this because I've enabled autolearning. Ineed as many tips to automate this successfully as possible.

What we do is twice a day, we build a brand new Bayes database from scratch
containing messages we've seen in the last 14 days.  The database
contains tokens from about 5.1 million spams and 4.5 million hams, totalling
about 18 million tokens.

So instead of trying to figure out the proper expiry period, you juststart over completely every two weeks?

Why is this most effective than continually learning as you go, expiringtokens older than two weeks?

That just sounds seriously labor-intensive. Is this because you haveend-users involved with training and they're not doing it correctly thatwould have you dump the database twice per day?

What does that do to your "bayes seen" component? It doesn't have muchtime to learn over time, or is that not necessary because you're usingthe last two weeks of data?

Obviously, for this to work, you need a large message volume and a large
number of people marking stuff as ham vs. spam.  It's probably not a feasible
approach for small-to-medium SpamAssassin installations.

So there is no differentiation between domains or networks in your bayesdatabase?

Is the header and attachments part of the learning, or does bayes onlyconsider the body?

Would it be helpful to have something that graphs the data to monitorthe effect of learning changes? Does something already exist?


Thanks,
Alex

Re: Skipping RBL checks for internal servers

Reply via email to