On Sun, 2014-01-05 at 01:56 +0000, Mark Tully wrote:
> One pattern of messages which I’ve noticed slip through are those which
> have a multipart and have a block of bayes poisoning text in the
> text/plain part, with the real spam payload in the text/html part.
> What I’m seeing is that the text/plain block manages to hit a few of
> my hammy-tokens and so has its bayes score tempered enough to allow it
> to slip through. Of course, I then teach it this is spam, but given
> the random nature of this text block, it just seems this is inserting
> noise in the bayes DB. I guess it would eventually average out, but
> still...
>
> So I’m wondering, given that most e-mail clients nowadays don’t show
> the text/plain part if there is a text/html part, why not have SA’s
> bayes filter just ignore the text/plain part if there is a text/html
> part and just focus on that? It’s just being used for noise after all?

First of all, SA uses all textual MIME parts for Bayes classification. In your example, that is the text/html payload as well as the text/plain decoy.

I am pretty sure that ignoring the text/plain sub-part of a multipart/alternative MIME part in favor of the text/html one will not magically boost Bayes results, because everyone's spam is different and there's no such thing as Bayes poison. ;)

"Bayes poison" here means that there are tokens with a very strong hammy score, and spammers inject those tokens into their spam in order to get a hammy-ish Bayes classification. However, if spammers do use such a token, it either is not hammy in the first place, or will quickly cease to be a strong ham indicator.

Moreover, this silently assumes there are tokens that are hammy for each and every user, which is just not the case, even when limited to a given language. The strongest ham tokens depend highly on the user: they are the tiny, often overlooked details that differentiate that one user from the majority. The name of the small town, the local sports club, common interests, or anything with a rather local spatial (shops, places) or temporal distribution. Exactly the tokens that are not ham for the majority. Tokens that can be used to spoil Bayes only if specially crafted for a target.

As you mentioned yourself, the result of that "poisonous" blob is to lower the spamminess and get (closer to) BAYES_50, which is by definition a big fat shrug: neither spammy nor hammy.

That matches my observations. Even the most effective result I have ever seen from a non-personal attack merely got the Bayes classification to neutral. And that did not rely on "regular" text tokens but on mail headers, combined with a Bayes database biased towards some specific mail headers that the spam run happened to use...

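To put rough numbers on the "will quickly cease to be a strong ham indicator" point, here is a minimal Python sketch. It assumes the crudest possible per-token estimate (relative frequency in trained spam versus trained ham). This is not SA's actual Bayes implementation, and the function name and all counts are made up for illustration.

```python
def spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Naive per-token spam probability from training counts.

    Illustrative only: a plain relative-frequency ratio, not the
    formula SpamAssassin uses.
    """
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical database: 1000 ham and 1000 spam messages trained.
# The token was seen in 40 ham messages and never in spam.
N_HAM, N_SPAM, HAM_HITS = 1000, 1000, 40

# Train an increasing number of "poisoned" spams containing the token.
curve = []
for poisoned in (0, 10, 40, 120):
    p = spam_probability(poisoned, HAM_HITS, N_SPAM + poisoned, N_HAM)
    curve.append(p)
    print(f"{poisoned:4d} poisoned spams trained -> p(spam|token) = {p:.2f}")
```

With these made-up counts, the token starts out as a perfect ham indicator and drifts past neutral once about as many poisoned spams as ham occurrences have been trained. The more a spammer uses the token, the faster it stops helping him.
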
> Of course, the counter argument would be spammers would then just stop
> using multi part and dump the poisoning block into the text/html part
> instead - so maybe this is just a stupid suggestion :)

Like, say, put the eye-catching payload at the top for the user to spot immediately, and dump the "everybody loves raymond" poison below?

Using a text/plain part that is commonly not displayed, as you described, is just one way of getting "average" tokens into spam.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}