Arthur Dent wrote:
Hello All,

Please forgive my ignorance, but I don't fully understand just how Bayes
works.

I dutifully feed all the spam (and ham) I get into sa-learn and generally
Bayes works pretty well.

I am a little concerned however that at this moment I seem to be getting
bombarded with Russian spam. It currently outweighs all other spam by
about 100:1

My worry is that Bayes will eventually come to believe that only Russian
spam is *really* spam as it will, if the current trend continues,
overwhelm the other spam in the Bayes DB.

That won't happen unless there's such a massive flood of unique tokens (words) that they flush all other tokens out of your bayes DB.

Bombardment or not, it's highly unlikely that 100,000 unique Russian words are going to enter your bayes database, which is what it would take with the default bayes_expiry_max_db_size.

Odds are, this bombardment is mostly the same 1000 or so words over and over again. All that's going to do is raise the spam count on those tokens, which won't have any impact at all on other spam email.
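
To make the counting argument concrete, here is a toy sketch in Python. It is not SpamAssassin's actual tokenizer, storage, or scoring; the learn_spam helper, the whitespace tokenizer, and the sample messages (including the transliterated stand-ins for Russian words) are all made up for illustration. It only shows that learning the same message repeatedly adds very few distinct tokens and leaves the counts of existing tokens alone.

```python
from collections import Counter

def learn_spam(spam_counts, message):
    # Hypothetical helper: count each distinct token once per message.
    # Toy tokenizer (lowercase + whitespace split), not what SA really does.
    spam_counts.update(set(message.lower().split()))

spam_counts = Counter()
learn_spam(spam_counts, "cheap pills online")
learn_spam(spam_counts, "cheap watches online")

# Snapshot of the per-token spam counts before the flood.
before = dict(spam_counts)

# Hypothetical flood: the same short message learned 100 times.
# The words are ASCII stand-ins for Russian vocabulary.
flood = "kupit deshevo seychas"
for _ in range(100):
    learn_spam(spam_counts, flood)

# Counts of the previously learned tokens are untouched by the flood.
unchanged = all(spam_counts[tok] == n for tok, n in before.items())

# The database grew only by the number of unique words in the flood.
new_tokens = len(spam_counts) - len(before)

print(unchanged, new_tokens, spam_counts["kupit"])
```

In this toy model, 100 messages add only three distinct tokens, so the pressure on the token limit comes from unique words, not from message volume.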

Really, a more realistic risk is that SA may learn that all Russian-language email is spam, unless you actually get some Russian-language nonspam (i.e., bayes will contain very few Russian words, but the ones it does contain will have strong spam scores). If you don't speak Russian, that's probably not a significant problem...


Am I worrying unnecessarily, or should I make efforts to "balance" the
spam I am feeding to bayes?

I would generally advise against trying to "balance" bayes. My own philosophy is that this is more likely to lead to self-poisoning than to any realistic benefit. I'd only try to "balance" in ways that actually bring what you train on closer to your real spam feed.



