Re: ADDRESS_IN_SUBJECT et al

Karsten Bräckelmann Wed, 24 Jul 2013 20:16:11 -0700

On Wed, 2013-07-24 at 21:53 -0400, Ian Turner wrote:
> They are moderately low-scoring, sadly (I wouldn't have noticed otherwise!), 
> mainly due to bayes poison. A typical message looks like this:


Do you manually train them as spam?

>  -1.9 BAYES_00               BODY: Bayes spam probability is 0 to 1%
>                              [score: 0.0000]

Ouch. A probability score of < 0.00005 -- which pretty much equals no
token learned as spammy. Seriously? How often do you see "Funds" (mind
the uppercase!) or "funds" in ham? How many of them do have that word in
the Subject (which in addition gets treated specially by SA)?

See where I am heading? Any chance your Bayes DB is completely borked?
  sa-learn --dump magic

Might be worth putting a sample or three up on a pastebin of your choice,
to see more of the text.

And for further digging, which are the top hammy / spammy tokens? See
M::SA::Conf [1], section Template Tags.


> Looking at the code for check_for_to_in_subject, it looks like the regular 
> expression used for LOCALPART_IN_SUBJECT is rather different (much more 
> specific) than the one used for ADDRESS_IN_SUBJECT. Presumably that's why
> this rule doesn't match.
> 
> An example subject from this spam (address changed to protect the innocent):
> <some...@example.com>_Need Approval for Fast Funds? July 24th 2013_

Do the Subjects strictly follow that pattern? Including the angle
brackets AND the underscore? Dead easy target for a local rule to squash
them.
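
If the Subjects do follow that shape, the pattern can be prototyped outside
SA first. A minimal sketch in Python, assuming the shape is exactly an
address in angle brackets, an underscore, arbitrary text, and a trailing
underscore. The function name and the sample address are made up for
illustration. The same regex should carry over to a local header rule on
Subject largely unchanged, but check it against real samples before giving
it a score.

```python
import re

# Assumed shape, taken from the single example quoted above:
#   <address>_some text_
BRACKETED_ADDR_SUBJECT = re.compile(r"^<[^<>\s@]+@[^<>\s@]+>_.+_$")


def looks_like_bracketed_addr_subject(subject: str) -> bool:
    """Return True if the Subject has the <address>_text_ shape."""
    return BRACKETED_ADDR_SUBJECT.search(subject.strip()) is not None


if __name__ == "__main__":
    # Hypothetical samples; replace with Subjects from the actual spam.
    for s in (
        "<user@example.com>_Need Approval for Fast Funds? July 24th 2013_",
        "Need Approval for Fast Funds?",
    ):
        print(looks_like_bracketed_addr_subject(s), s)
```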

BTW, don't get me wrong, I am not trying to prevent the old eval() rule
from re-appearing. It's just that such a pattern hasn't been mentioned as
an issue in ages, so my focus is on helping with your issue first.


> For "address" mode, the regex is this one: /\b\Q$full_to\E\b/i
> But for "user" mode, the regex is this one:
>     /^(?:
>         (?:re|fw):\s*(?:\w+\s+)?\Q$to\E$
>         |(?-i:\Q$to\E)\s*[,:;!?-](?:$|\s)
>         |\Q$to\E$
>         |,\s*\Q$to\E[,:;!?-]$
>     )/ix
> 
> Among other restrictions, this regex seems to only match the username at the 
> beginning or end of the subject.

It does accept quite a bit more, including a leading Re: with an optional,
arbitrary word following. Some restrictions are definitely necessary,
since the "local part" often resembles a user's first name, company
name, generic role...

It does not match /^<localpart/ with a single opening angle bracket,
though.
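
To see what the "user" mode regex quoted above does and does not accept,
here is a rough Python port. It is a sketch, not SA's code: re.escape
stands in for \Q...\E, the function names are hypothetical, and the
subjects in the usage lines are invented.

```python
import re


def user_mode_pattern(to: str) -> re.Pattern:
    """Build a Python port of the quoted "user" mode regex for a local part."""
    q = re.escape(to)  # stand-in for Perl's \Q...\E
    return re.compile(
        rf"""^(?:
            (?:re|fw):\s*(?:\w+\s+)?{q}$      # Re:/Fw:, optional word, local part at end
            |(?-i:{q})\s*[,:;!?-](?:$|\s)     # case-sensitive local part, then punctuation
            |{q}$                             # Subject is just the local part
            |,\s*{q}[,:;!?-]$                 # leading comma, local part, punctuation at end
        )""",
        re.IGNORECASE | re.VERBOSE,
    )


def matches_user_mode(subject: str, to: str) -> bool:
    """Return True if the ported regex matches the Subject for this local part."""
    return user_mode_pattern(to).search(subject) is not None


if __name__ == "__main__":
    for s in ("joe: hello", "Re: hi joe", "<joe@example.com>_Need Approval_"):
        print(matches_user_mode(s, "joe"), s)
```

Every alternative is anchored at the start of the Subject, so a leading
angle bracket is enough to keep all four branches from matching.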


[1] http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
