Re: Bayes - Balance of spam

Matt Kettler Tue, 12 Feb 2008 06:32:36 -0800

Arthur Dent wrote:

Hello All,


Please forgive my ignorance, but I don't fully understand just how Bayes
works.

I dutifully feed all the spam (and ham) I get into sa-learn and generally
Bayes works pretty well.

I am a little concerned however that at this moment I seem to be getting
bombarded with Russian spam. It currently outweighs all other spam by
about 100:1

My worry is that Bayes will eventually come to believe that only Russian
spam is *really* spam as it will, if the current trend continues,
overwhelm the other spam in the Bayes DB.

That won't happen unless there's such a massive flood of unique tokens (words) that they flush all other tokens out of your bayes DB.

Bombardment or not, it's highly unlikely that 100,000 unique Russian words are going to enter your bayes database, which is what it would take with the default bayes_expiry_max_db_size.
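For a rough sense of the numbers, here is a small Python sketch of the expiry arithmetic. It assumes the behaviour described in the SpamAssassin configuration documentation: a default bayes_expiry_max_db_size of 150,000 tokens, with an expiry run keeping the larger of 75% of that maximum or 100,000 tokens. Treat those figures as assumptions to check against the docs for your version; the function name is hypothetical and this is not SpamAssassin code.

```python
# Assumed values, taken from the SpamAssassin configuration docs as I read them.
DEFAULT_MAX_DB_SIZE = 150_000   # assumed default for bayes_expiry_max_db_size
MIN_TOKENS_KEPT = 100_000       # assumed floor on tokens kept by an expiry run


def tokens_kept_after_expiry(max_db_size=DEFAULT_MAX_DB_SIZE):
    """Hypothetical helper: how many tokens survive an expiry run."""
    # Keep 75% of the configured maximum, but never fewer than the floor.
    return max(max_db_size * 3 // 4, MIN_TOKENS_KEPT)


kept_default = tokens_kept_after_expiry()
kept_small = tokens_kept_after_expiry(100_000)
```

Under those assumptions the database always retains at least 100,000 tokens, so the flood would need to contribute unique tokens on that order before it could crowd everything else out.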

Odds are, this bombardment is mostly the same 1000 or so words over and over again. All that's going to do is raise the spam count on those tokens, which won't have any impact at all on other spam email.
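To make the "same words over and over" point concrete, here is a minimal Python sketch of a per-token counter. It is a toy model, not SpamAssassin's actual storage or scoring code: `TokenStore`, `learn_spam` and the sample messages are hypothetical, and it only tracks counts, not how counts are turned into probabilities.

```python
from collections import Counter


class TokenStore:
    """Toy Bayes token store: one spam counter per unique token."""

    def __init__(self):
        self.spam_counts = Counter()

    def learn_spam(self, message):
        # Count each token at most once per learned message.
        for token in set(message.lower().split()):
            self.spam_counts[token] += 1


store = TokenStore()
store.learn_spam("cheap pills online")      # one ordinary spam message
size_before = len(store.spam_counts)

# A flood of near-identical Russian spam that reuses the same few words.
for _ in range(100):
    store.learn_spam("купить дешево сейчас")

size_after = len(store.spam_counts)
```

After the flood, the store has grown by only three tokens, each with a spam count of 100, and the stored count for `pills` is still 1. Repeated words raise their own counters; they do not add new entries or touch the entries for other tokens.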

Really, a more realistic risk is that SA may learn that all Russian language email is spam, unless you actually get some Russian language nonspam (i.e., bayes will contain very few Russian words, but the ones it does contain will have strong spam scores). If you don't speak Russian, that's probably not a significant problem...

Am I worrying unnecessarily, or should I make efforts to "balance" the
spam I am feeding to bayes?

I would generally advise against trying to "balance" bayes. My own philosophy is that this is more likely to lead to self-poisoning than to any realistic benefit. I'd only try to "balance" in ways that actually make things closer to your real spam feed.
