Hello sckot,

Wednesday, January 21, 2004, 2:09:51 PM, you wrote:

s>    I've noticed several spam mails with a lot of quoted text (quotes from
s> Dave Barry, some of Moby Dick, that sort of thing. Usually all
s> punctuation is stripped out, but not always.) included within brackets or
s> an HTML title. It's likely being used to counterweight the
s> message against a Bayesian filter, since most of the words generally
s> also appear in ham. I made two rules to catch this. It doesn't seem like
s> it'd bring up false positives (perhaps increasing the title length past
s> 80), and works quite well against my corpus, but are there any problems
s> I'm overlooking with this approach?

s> rawbody L_Text_Padding_In_Html      /<(title>)?[ '-.,?!\w]{50,}>/
s> describe L_Text_Padding_In_Html  Text padding within brackets or HTML title to fool Bayesian filter
s> score L_Text_Padding_In_Html 3.0

s> rawbody L_Very_Long_Title  /<title>[ '-.,?!\w]{80,}<\/title>/
s> describe L_Very_Long_Title HTML title longer than 80 characters to fool Bayesian filter
s> score L_Very_Long_Title 1.0

I tested your rules against my corpus:

L_Text_Padding_In_Html -- 985s/84h of 91714 corpus (74113s/17601h) 01/21/04

The ham hits appear to come from web pages with legitimate comments that
were sent as email, and from people who enclose comments or instructions in
angle brackets, e.g.:
>  <Google Search backup file date OR time OR access groupaix.htm>
>  <snip personal reasons for separation, as immaterial>
>  < Thankyou, [name removed]!!!  What do you think of my ramblings?>
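
To see why text like this trips the rule, here is a minimal sketch that
runs the same pattern over two of the samples above. It uses Python's re
module rather than Perl, on the assumption that the two behave the same
for this pattern on plain ASCII input; the names PADDING, ham_samples and
hits are only illustrative. One thing it makes visible: inside the
character class, '-. is parsed as a range (apostrophe through period), so
the class also admits ( ) * and +, which may be broader than intended.

```python
import re

# Same pattern as L_Text_Padding_In_Html, written as a Python raw string.
# Inside the class, '-. is a range from ' (0x27) to . (0x2E), so it also
# admits ( ) * + in addition to the apostrophe, comma, hyphen and period.
PADDING = re.compile(r"<(title>)?[ '-.,?!\w]{50,}>")

# Two of the ham samples quoted above, copied verbatim.
ham_samples = [
    "<Google Search backup file date OR time OR access groupaix.htm>",
    "<snip personal reasons for separation, as immaterial>",
]


def hits(text):
    """Return True if the padding pattern matches anywhere in text."""
    return PADDING.search(text) is not None


for sample in ham_samples:
    print(hits(sample), sample)
```

Anything a person types between angle brackets that runs to 50 or more
letters, spaces and basic punctuation will hit, whereas ordinary HTML tags
with attributes will not, since = and " are outside the class.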

L_Very_Long_Title -- 100s/1h of 91714 corpus (74113s/17601h) 01/21/04

The sole ham hit was a tech support response, and the "title" apparently
included the entire contents of my original trouble report, or much of it:
>  <title>Getting error 182 while trying to download magazine. Delivery
>  Manager claims ...</title>
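
For comparing the two rules, the raw counts can be turned into hit rates.
The following is only a sketch of that arithmetic using the figures from
the two result lines above; the names rates, results, CORPUS_SPAM and
CORPUS_HAM are made up for illustration and are not part of any
SpamAssassin tool.

```python
# Corpus sizes and hit counts copied from the two result lines
# (s = spam messages, h = ham messages).
CORPUS_SPAM = 74113
CORPUS_HAM = 17601

results = {
    "L_Text_Padding_In_Html": (985, 84),
    "L_Very_Long_Title": (100, 1),
}


def rates(spam_hits, ham_hits):
    """Return (spam hit rate, ham hit rate, share of all hits that are spam)."""
    spam_rate = spam_hits / CORPUS_SPAM
    ham_rate = ham_hits / CORPUS_HAM
    spam_share = spam_hits / (spam_hits + ham_hits)
    return spam_rate, ham_rate, spam_share


for name, (s, h) in results.items():
    spam_rate, ham_rate, spam_share = rates(s, h)
    print(f"{name}: spam {spam_rate:.2%}, ham {ham_rate:.2%}, "
          f"spam share of hits {spam_share:.1%}")
```

The ham hit rate is the number to watch when deciding on a score: a rule
that hits a noticeable fraction of ham probably should not carry 3.0
points on its own.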

Bob Menschel





_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
