At 13:28 2004/03/19, Don Anthony wrote:
My point was simply, why is the URL not used that exists in the msg body of these spams to help flag it, and instead all the attention placed on the header?
It's important to distinguish, first of all, between two goals: "blocking" spam and "reporting" spam. Most of the people running spam filters these days are only (or at least primarily) concerned with blocking--they just want to shield themselves from the spam onslaught, since that's Problem Number One. Once that problem is under control, they have the luxury of worrying about Problem Number Two, namely reporting spam.
For those who want to *block* spam, rules and methods that identify spam features are desirable. URLs contained in an e-mail *might* be spam symptoms, if they happen to be on a list of known spam suppliers, such as Chris Santerre's BigEvil list (a set of rules you can download and use with SpamAssassin; see http://wiki.apache.org/spamassassin/CustomRulesets). It's also worth noting that SpamAssassin examines more than just the mail header. It looks for features throughout the mail item, in both headers and body, and it also runs network-based tests that check the peer's IP address against various DNSBLs and compare the hashed mail contents against collaborative databases like Razor, Pyzor, and DCC. The "attention" is not focused in any one place in particular, and that's what makes SpamAssassin as effective as it is--it's a broad-spectrum analysis tool.
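To make the "broad-spectrum" idea concrete, here is a rough Python sketch of several independent tests each adding points to a total score. It is not SpamAssassin's code or rule syntax; the domain list, the test functions, and the point values are all made up for illustration.

```python
import re

# Hypothetical stand-in for a list of known spamvertised domains
# (the kind of data a ruleset like BigEvil would supply).
KNOWN_SPAM_DOMAINS = {"pills.example.net"}


def subject_all_caps(headers, body):
    """Header test: the Subject contains letters and all of them are uppercase."""
    letters = [c for c in headers.get("Subject", "") if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def body_has_listed_url(headers, body):
    """Body test: some URL in the body points at a listed domain."""
    hosts = re.findall(r"https?://([^/\s\"'<>:]+)", body, re.IGNORECASE)
    return any(host.lower() in KNOWN_SPAM_DOMAINS for host in hosts)


# (test, points) pairs; the point values are invented for this example.
TESTS = [(subject_all_caps, 1.5), (body_has_listed_url, 3.0)]


def score(headers, body):
    """Sum the points of every test that fires on this message."""
    return sum(points for test, points in TESTS if test(headers, body))
```

A message would then be flagged when its total crosses whatever threshold you pick, so a listed URL in the body counts toward the verdict but does not decide it alone.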
Those who want to *report* spam, on the other hand, need a post-processing script of some sort (one was posted to this list recently, if I recall correctly) which extracts a list of the URLs contained in an e-mail. That way you can collect your spam items, post-process them with such a script, and end up with a list of URLs you can use for reporting purposes (or as fodder for a RHSBL, etc.). The snag here is that legitimate URLs can sometimes end up in this list, either accidentally (e.g. the www.w3.org domain frequently shows up in the <!DOCTYPE> declaration of HTML-based mail) or on purpose (spammers have been known to include some legitimate URLs in their mailings in the hope of getting past spam filters, or at least poisoning automated RHSBL collectors). In other words, these lists should really be built by hand, or in a semi-automated manner that requires human confirmation (e.g. SpamCop, Maia Mailguard, etc.).
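For what it's worth, a post-processing script along those lines might look something like the Python sketch below. This is not the script posted to the list earlier; the function name, the URL pattern, and the ignore list are my own assumptions, and the output is only a list of candidates for a human to confirm.

```python
import re
from email import message_from_string
from urllib.parse import urlsplit

# Hosts that routinely show up in legitimate markup; illustrative only.
IGNORED_HOSTS = {"www.w3.org"}

# Deliberately simple URL pattern: stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"""https?://[^\s<>"')]+""", re.IGNORECASE)


def extract_hosts(raw_message):
    """Return the sorted hostnames of URLs found in the text parts of a message."""
    msg = message_from_string(raw_message)
    hosts = set()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset label in the message
            text = payload.decode("utf-8", errors="replace")
        for url in URL_RE.findall(text):
            try:
                host = urlsplit(url).hostname
            except ValueError:  # malformed URL, e.g. a broken IPv6 literal
                continue
            if host and host not in IGNORED_HOSTS:
                hosts.add(host)
    return sorted(hosts)
```

The ignore list only handles the accidental case. Deliberately planted legitimate URLs will still come through, which is why the result should go to a person for review rather than straight into a blocklist.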
As a kind of "P.S." to this thread, note that "SPAM" is Hormel's trademark, and they've been very tolerant of its use in the context of unsolicited bulk e-mail. All they've asked, really, is that we use "spam" and not "SPAM" to refer to this e-mail plague. As http://www.spam.com/ci/ci_in.htm puts it, "if the term is to be used, it should be used in all lower-case letters to distinguish it from our trademark SPAM, which should be used with all uppercase letters."
Robert LeBlanc <[EMAIL PROTECTED]>
Renaissoft, Inc.
Maia Mailguard <http://www.renaissoft.com/maia/>
