Hello sckot,

Wednesday, January 21, 2004, 2:09:51 PM, you wrote:

s> I've noticed several spam mails with a lot of quoted text (quotes from
s> Dave Barry, some of Moby Dick, that sort of thing. Usually all
s> punctuation is stripped out, but not always.) included within brackets or
s> an HTML title. It's likely being used to counterweight the
s> message against a Bayesian filter, since most of the words generally
s> also appear in ham. I made two rules to catch this. It doesn't seem like
s> it'd bring up false positives (perhaps increasing the title length past
s> 80), and works quite well against my corpus, but are there any problems
s> I'm overlooking with this approach?

s> rawbody L_Text_Padding_In_Html /<(title>)?[ '-.,?!\w]{50,}>/
s> describe L_Text_Padding_In_Html Text padding within brackets or HTML title to fool bayesian filter
s> score L_Text_Padding_In_Html 3.0

s> rawbody L_Very_Long_Title /<title>[ '-.,?!\w]{80,}<\/title>/
s> describe L_Very_Long_Title HTML title longer than 80 characters to fool bayesian filter
s> score L_Very_Long_Title 1.0

I tested your rules against my corpus:

L_Text_Padding_In_Html -- 985s/84h of 91714 corpus (74113s/17601h) 01/21/04

The ham hits appear to come from web pages with valid comments that were
sent as email, and from people who enclose comments or instructions in
angle brackets, e.g.:
> <Google Search backup file date OR time OR access groupaix.htm>
> <snip personal reasons for separation, as immaterial>
> < Thankyou, [name removed]!!! What do you think of my ramblings?>

L_Very_Long_Title -- 100s/1h of 91714 corpus (74113s/17601h) 01/21/04

The sole ham hit was a tech support response, and the "title" apparently
included the entire contents of my original trouble report, or much of it:
> <title>Getting error 182 while trying to download magazine. Delivery
> Manager claims ...</title>

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
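
For anyone who wants to try the two patterns outside SpamAssassin, here is
a rough Python sketch. It is only an approximation: Python's re module is
not Perl, re.ASCII is my assumption for how \w should behave on raw body
text, and the helper name hits() and the synthetic long title are made up
for illustration. The two ham samples are the bracketed lines quoted above.
Note that inside the character class, '-. is a range from the apostrophe to
the period, so it also admits ( ) * + , and - in addition to what is listed.

```python
import re

# Python approximations of the two rawbody patterns from the thread.
# re.ASCII limits \w to [A-Za-z0-9_]; this is an assumption about the
# intended behaviour, not a statement about how SpamAssassin evaluates it.
TEXT_PADDING = re.compile(r"<(title>)?[ '-.,?!\w]{50,}>", re.ASCII)
LONG_TITLE = re.compile(r"<title>[ '-.,?!\w]{80,}</title>", re.ASCII)

RULES = {
    "L_Text_Padding_In_Html": TEXT_PADDING,
    "L_Very_Long_Title": LONG_TITLE,
}


def hits(text):
    """Return the names of the rules whose pattern occurs anywhere in text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


# Ham lines quoted in the corpus report above.
ham_samples = [
    "<Google Search backup file date OR time OR access groupaix.htm>",
    "<snip personal reasons for separation, as immaterial>",
]

# Synthetic title (hypothetical, not taken from any corpus): 100 characters
# of filler between the title tags.
long_title = "<title>" + "word " * 20 + "</title>"

if __name__ == "__main__":
    for sample in ham_samples + [long_title, "<b>hello</b>"]:
        print(hits(sample), sample[:40])
```

Running a handful of known-good messages through something like this before
assigning a score of 3.0 is a cheap way to spot the bracketed-comment style
of ham described above.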