On Fri, 16 May 2014 07:22:56 -0400 "David F. Skoll" <d...@roaringpenguin.com> wrote:
James> Is there any way to limit Bayes content checking to only the
James> first X characters of the message body? I ask this because it is
James> clear that the spam messages getting through contain text meant
James> to poison the tests but this gibberish always trails the main
James> message and is separated by a large white space in most cases.

David> In my experience, trying to be too clever with Bayes is
David> counter-productive. Those Bayes-poisoning attacks rarely work on
David> a well-trained corpus. You probably just need more training for
David> Bayes to figure out what's happening.

In the last few (~10) days, I have seen a marked increase in FNs,
usually with Bayes values in the 50s and 60s. By marked, I mean I do
pretty much nothing but adjust my various ad-hoc rules to keep from
being flooded ;-\

On close inspection, I see that the hash-busting garbage appended is
(faux) technical computing talk instead of the usual cookbooks or
classical literature :-p That is, scrambled Stack Overflow discussions
and the like. And of course that is what most of my ham is about, so
it makes very good sense that Bayes gets confused.

I include a magic dump just in case something is wrong with my
training. But if not, isn't this a situation where something like
James' suggestion would help?

[4+0]~$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       5593          0  non-token data: nspam
0.000          0       6190          0  non-token data: nham
0.000          0     148413          0  non-token data: ntokens
0.000          0 1384366530          0  non-token data: oldest atime
0.000          0 1400253567          0  non-token data: newest atime
0.000          0 1400253356          0  non-token data: last journal sync atime
0.000          0 1395423790          0  non-token data: last expiry atime
0.000          0   11059200          0  non-token data: last expire atime delta
0.000          0      25914          0  non-token data: last expire reduction count

-- 
Please *no* private copies of mailing list or newsgroup messages.
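
For anyone who wants to sanity-check a dump like the one above, the raw epoch values are easier to judge once converted. What follows is only a minimal Python sketch, not part of SpamAssassin: the helper names (parse_magic, to_utc, describe) are made up for illustration, and it assumes the value of each "non-token data" line sits in the third column, as it does in the output shown.

```python
from datetime import datetime, timezone

# Output of `sa-learn --dump magic`, copied from the message above.
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 5593 0 non-token data: nspam
0.000 0 6190 0 non-token data: nham
0.000 0 148413 0 non-token data: ntokens
0.000 0 1384366530 0 non-token data: oldest atime
0.000 0 1400253567 0 non-token data: newest atime
0.000 0 1400253356 0 non-token data: last journal sync atime
0.000 0 1395423790 0 non-token data: last expiry atime
0.000 0 11059200 0 non-token data: last expire atime delta
0.000 0 25914 0 non-token data: last expire reduction count
"""


def parse_magic(text):
    """Map each 'non-token data' label to its integer value.

    Assumption: the value is the third whitespace-separated column,
    as in the dump quoted in this thread.
    """
    values = {}
    for line in text.splitlines():
        fields, sep, label = line.partition("non-token data:")
        cols = fields.split()
        if not sep or len(cols) < 3:
            continue  # skip anything that is not a magic line
        values[label.strip()] = int(cols[2])
    return values


def to_utc(seconds):
    """Format an epoch timestamp as a UTC date and time."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


def describe(values):
    """Return one readable line per magic value."""
    out = []
    for label, value in values.items():
        if label.endswith("atime delta"):
            # A duration in seconds, not a point in time.
            out.append(f"{label}: {value / 86400:.0f} days")
        elif label.endswith("atime"):
            out.append(f"{label}: {to_utc(value)}")
        else:
            out.append(f"{label}: {value}")
    return out


if __name__ == "__main__":
    for line in describe(parse_magic(DUMP)):
        print(line)
```

For the dump above this reports 5593 spam and 6190 ham messages learned, and a last expire atime delta of 128 days. It only decodes the numbers; whether they indicate adequate training is still a judgment call.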