At 12:11 PM 8/6/2004, Daulton, Douglas wrote:
I am curious, is there a master list of the words and phrases SA
considers "spammy"?  Also, is there a similar list of the measures used
to evaluate HTML design?

Really, it would be highly inaccurate to characterize it as "phrases" or anything so simple.


SA works on regex patterns. Many of these match headers generated as servers pass the email along, some match text and phrases in the body, some match elements of HTML, and others match encoding formats. Honestly speaking, most of the weight falls on headers and on invalid HTML encodings.
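To make the idea concrete, here is a rough sketch in Python of regex rules matched against different parts of a message, with each hit adding to a total score. The rule names, patterns, and scores are all made up for illustration; none of them are real SpamAssassin rules, and SA itself is written in Perl.

```python
import re

# Hypothetical rules: (name, message part, compiled pattern, score).
# These are illustrative only, not real SpamAssassin rules or scores.
RULES = [
    ("EXAMPLE_HDR_MAILER", "headers",
     re.compile(r"^X-Mailer: .*BulkSender", re.M | re.I), 2.5),
    ("EXAMPLE_BODY_PHRASE", "body",
     re.compile(r"\bclick here now\b", re.I), 0.5),
    ("EXAMPLE_HTML_TAG", "body",
     re.compile(r'<font\s+size="?0', re.I), 1.5),
]


def score_message(headers, body):
    """Return (total score, list of rule names that matched)."""
    parts = {"headers": headers, "body": body}
    hits = [(name, score) for name, part, pattern, score in RULES
            if pattern.search(parts[part])]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]


total, hits = score_message(
    "X-Mailer: SuperBulkSender 2.0\nSubject: hi",
    "Please click here now",
)
print(total, hits)
```

In this toy version the header rule carries most of the weight, which mirrors the point above: a harmless-looking body phrase contributes little by itself.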

However, none of this reflects the behavior of the bayes subsystem within SpamAssassin. That part is entirely up to each site's training. My bayes DB will have a very different concept of spam and ham compared with Theo's or Justin's.

Probably the best place to get a "master list" is to go directly to the rules themselves. If you download the tarball or zipfile of SA, they're all in the .cf files in the rules subdirectory. Everything is in ordinary Perl regex format.

Most of the simple "phrase" type rules will be in 20_phrases.cf

Most of the HTML evaluation rules are in 20_html_tests.cf, but many of these use "eval" functions, which are implemented as code. Most of them should at least give you a rough idea what they look for based on their names.
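If you want to skim the rules quickly, something like the following Python sketch could pull the rule lines out of a .cf file and separate plain regex rules from "eval" ones. It assumes the usual one-rule-per-line layout (rule type, rule name, definition) and only looks at a handful of rule types; the sample rules in it are hypothetical, not copied from the real files.

```python
# Rule-type keywords to look for. This is a subset chosen for the sketch.
RULE_TYPES = {"header", "body", "rawbody", "full", "uri"}


def list_rules(cf_text):
    """Return (type, name, definition) for each rule line in cf_text."""
    rules = []
    for line in cf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] in RULE_TYPES:
            rules.append((parts[0], parts[1], parts[2]))
    return rules


def is_eval_rule(definition):
    """Eval rules call code instead of giving a regex directly."""
    return definition.startswith("eval:")


# Hypothetical sample written in the style of a .cf file.
SAMPLE = "\n".join([
    "# example rules, not real ones",
    r"body     EXAMPLE_PHRASE   /\bclick here now\b/i",
    "header   EXAMPLE_SUBJ     Subject =~ /free offer/i",
    "body     EXAMPLE_EVAL     eval:example_html_check()",
    "describe EXAMPLE_PHRASE   Body contains an example phrase",
    "score    EXAMPLE_PHRASE   0.5",
])

for rule_type, name, definition in list_rules(SAMPLE):
    kind = "eval" if is_eval_rule(definition) else "regex"
    print(rule_type, name, kind, definition)
```

To run it over a real file, read the file's contents into a string and pass that to list_rules in place of SAMPLE.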

 I'd like to see and publish that master list to my
team so we can proactively clean the spam from our ham.

I think many spammers would want that list as well to clean the "spam" from their spam, so I think you'll understand why it's not available in a pre-simplified form.







