On Sun, 2014-01-05 at 01:56 +0000, Mark Tully wrote:
> One pattern of messages which I’ve noticed slip through are those which
> have a multipart and have a block of bayes poisoning text in the
> text/plain part, with the real spam payload in the text/html part. 
> What I’m seeing is that the text/plain block manages to hit a few of
> my hammy-tokens and so has its bayes score tempered enough to allow it
> to slip through. Of course, I then teach it this is spam, but given
> the random nature of this text block, it just seems this is inserting
> noise in the bayes DB. I guess it would eventually average out, but
> still...
> 
> So I’m wondering, given that most e-mail clients nowadays don’t show
> the text/plain part if there is a text/html part, why not have SA’s
> bayes filter just ignore the text/plain part if there is a text/html
> part and just focus on that? It’s just being used for noise after all?

First of all, SA uses all textual MIME parts for Bayes classification.
That is, in your example, both the text/html payload and the text/plain
decoy.
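
To illustrate what "all textual MIME parts" means, here is a minimal
Python sketch using the standard email module. It is not SA's code (SA
is written in Perl); the sample message and the textual_parts() helper
are made up for this example.

```python
from email import message_from_string

# Hypothetical sample: text/plain decoy plus text/html payload.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XX"

--XX
Content-Type: text/plain; charset="us-ascii"

raymond loves the local sports club

--XX
Content-Type: text/html; charset="us-ascii"

<html><body>Buy cheap pills now</body></html>

--XX--
"""


def textual_parts(raw):
    """Return (content_type, body) for every text/* part of a message."""
    msg = message_from_string(raw)
    parts = []
    for part in msg.walk():
        # walk() also yields the multipart container itself; skip it.
        if part.get_content_maintype() == "text":
            parts.append((part.get_content_type(), part.get_payload()))
    return parts


parts = textual_parts(RAW)
for content_type, body in parts:
    print(content_type, "->", body.strip())
```

Both the decoy and the payload come out of that loop, so a tokenizer fed
from it sees both.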

I am pretty sure ignoring the text/plain sub-part of a
multipart/alternative MIME part in favor of the text/html one will not
magically boost Bayes results, because everyone's spam is different and
there's no such thing as Bayes poison. ;)

"Bayes poison" here means that there are tokens with a very strong
hammy score, and that spammers inject such tokens into their spam in
order to get a hammy-ish Bayes classification. However, if spammers do
use such a token, it either is not hammy in the first place, or will
quickly cease to be a strong ham indicator.
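
A rough sketch of that effect in Python, assuming a crude
relative-frequency estimate per token. This is a simplification for
illustration, not the formula SA actually uses, and all counts are
hypothetical.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Crude per-token spam probability from relative frequencies."""
    spam_freq = spam_hits / n_spam if n_spam else 0.0
    ham_freq = ham_hits / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical corpus: the token appears in 40 of 1000 learned ham
# messages and in none of the 1000 learned spam messages.
HAM_HITS, N_HAM, BASE_SPAM = 40, 1000, 1000

# Learn k spam messages that carry the token as "poison".
trajectory = [
    (k, token_spam_prob(k, HAM_HITS, BASE_SPAM + k, N_HAM))
    for k in (0, 10, 40, 200)
]

for k, p in trajectory:
    print(f"{k:4d} poisoned spams learned -> p(spam | token) = {p:.3f}")
```

Under these assumptions the token starts out as a pure ham indicator
and climbs past neutral once it has been learned in enough spam.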

Moreover, the idea silently assumes there are tokens that are hammy for
each and every user, which is just not the case, even when limited to a
given language. The strongest ham tokens highly depend on the user --
they are the tiny, often overlooked details that differentiate that one
user from the majority.

Think of the name of the small town, the local sports club, common
interests, or anything with a rather local spatial (shops, places) or
temporal distribution. These are exactly the tokens that are not ham for
the majority, and that can be used to spoil Bayes only if specially
crafted for a target.


As you mentioned yourself: the result of that "poisonous" blob is to
lower the spamminess and get (closer to) BAYES_50, which is by
definition a big fat shrug -- neither spammy nor hammy.
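
As a toy illustration of that pull towards neutral, the following
Python sketch combines per-token probabilities with a plain naive-Bayes
product in log-odds form. That combiner is an assumption chosen for
brevity, not SA's actual combining method, and the token values are
invented so that the poison exactly mirrors the payload.

```python
import math


def combine(probs):
    """Naive-Bayes combination of per-token spam probabilities."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical per-token spam probabilities.
PAYLOAD = [0.90, 0.80, 0.85]  # tokens from the real spam payload
POISON = [0.10, 0.20, 0.15]   # hammy tokens from the decoy part

payload_only = combine(PAYLOAD)
with_poison = combine(PAYLOAD + POISON)

print(f"payload only: {payload_only:.3f}")
print(f"with poison:  {with_poison:.3f}")
```

Even with perfectly matched hammy tokens, the best this toy attack does
is cancel the payload out and land in the middle, not make the message
look like ham.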

That matches my observations.

Even the most effective result I have ever seen from a non-personal
attack merely got the Bayes classification to neutral. And that did not
involve a "regular" text token, but mail headers, together with a Bayes
database biased towards some specific mail headers that spam run
happened to use...


> Of course, the counter argument would be spammers would then just stop
> using multi part and dump the poisoning block into the text/html part
> instead - so maybe this is just a stupid suggestion :)

Like, say, put the eye-catching payload at the top for the user to spot
immediately, and dump the "everybody loves raymond" poison below?

Using a commonly not displayed text/plain part as you described is just
one attempt to get "average" tokens into spam.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
