Arthur Dent wrote:
> Hello All,
> Please forgive my ignorance, but I don't fully understand just how Bayes
> works.
> I dutifully feed all the spam (and ham) I get into sa-learn and generally
> Bayes works pretty well.
> I am a little concerned however that at this moment I seem to be getting
> bombarded with Russian spam. It currently outweighs all other spam by
> about 100:1
> My worry is that Bayes will eventually come to believe that only Russian
> spam is *really* spam as it will, if the current trend continues,
> overwhelm the other spam in the Bayes DB.
That won't happen unless there's such a massive flood of unique tokens
(words) that they flush all other tokens out of your bayes DB.
Bombardment or not, it's highly unlikely that 100,000 unique Russian
words are going to enter your bayes database, which is what it would
take with the default bayes_expiry_max_db_size.
Odds are, this bombardment is mostly the same 1000 or so words over and
over again. All that's going to do is raise the spam count on those
tokens, which won't have any impact at all on other spam email.
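To make the point about unique tokens concrete, here is a minimal
sketch. It is a toy model, not SpamAssassin's actual storage or
tokenizer: the learn_spam helper, the whitespace tokenization, and the
placeholder words are all assumptions made up for illustration. It only
shows that learning the same message over and over raises per-token
counts without adding new entries to the database.

```python
from collections import Counter


def learn_spam(db, message):
    """Record one spam message in a toy token database.

    db maps token -> number of spam messages containing it.
    Whitespace splitting stands in for real tokenization (assumption).
    """
    for token in set(message.lower().split()):
        db[token] += 1


db = Counter()

# Spam learned before the flood.
learn_spam(db, "cheap pills online")
size_before_flood = len(db)

# The flood: 1000 copies of the same three (placeholder) words.
for _ in range(1000):
    learn_spam(db, "kupit deshevo seychas")

size_after_flood = len(db)
new_entries = size_after_flood - size_before_flood
```

After the loop, new_entries is 3, not 3000, and the count stored for
"pills" has not changed. Only a flood of distinct tokens would push the
database toward its size limit.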
Really, a more realistic risk is that SA may learn that all
Russian-language email is spam, unless you actually get some
Russian-language nonspam (i.e., Bayes will contain very few Russian
words, but the ones it does contain will have strong spam scores). If
you don't speak Russian, that's probably not a significant problem...
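A rough sketch of why tokens seen only in spam end up with strong spam
scores. This is a simplified Robinson-style per-token estimate, not
SpamAssassin's exact code: the function name, the strength and prior
values, and the message counts are assumptions picked for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham,
                           strength=1.0, prior=0.5):
    """Estimate how spammy one token is (simplified sketch).

    spam_hits / ham_hits: messages of each kind containing the token.
    n_spam / n_ham: total messages learned of each kind.
    strength / prior: smoothing constants (assumed values).
    """
    seen = spam_hits + ham_hits
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if seen == 0 or spam_rate + ham_rate == 0.0:
        # No evidence for this token: fall back to the neutral prior.
        return prior
    raw = spam_rate / (spam_rate + ham_rate)
    # Pull the raw estimate toward the prior when evidence is thin.
    return (strength * prior + seen * raw) / (strength + seen)


# A Russian word seen in 500 spam messages and never in ham.
russian_only_spam = token_spam_probability(500, 0, 1000, 1000)

# A word seen equally often in spam and ham.
mixed_word = token_spam_probability(10, 10, 1000, 1000)

# A word Bayes has never seen.
unseen_word = token_spam_probability(0, 0, 1000, 1000)
```

With these made-up counts, russian_only_spam comes out above 0.99 while
mixed_word and unseen_word stay at 0.5. A little Russian ham would pull
those tokens back toward the middle.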
> Am I worrying unnecessarily, or should I make efforts to "balance" the
> spam I am feeding to bayes?
I would generally advise against trying to "balance" bayes. My own
philosophy is that this is more likely to lead to self-poisoning than
to any realistic benefit. I'd only try to "balance" in ways that
actually make things closer to your real spam feed.