The scores are assigned by a genetic algorithm (GA). Essentially, two piles of
email are created: one of spam, one of nonspam. A SpamAssassin mass-check
is run to generate a set of one-line reports showing which rules each email in
each pile matches. The GA then has the task of examining these rule-match
sets and trying to assign scores that correctly categorize the most mail.
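To make that concrete, here is a rough Python sketch of the idea. It is not SpamAssassin's actual GA: the rule names, the mass-check results, the 5.0 threshold and the false-positive weight are all made up for illustration, and the search is a crude mutate-and-select loop. It does show how a rule can end up needing a negative score just to rescue a nonspam message that hits other high-scoring rules.

```python
import random

THRESHOLD = 5.0  # assumed: mail scoring at or above this is tagged as spam
FP_WEIGHT = 5.0  # assumed: one false positive costs as much as 5 missed spams

# Hypothetical mass-check results: (is_spam, set of rules that matched).
RESULTS = [
    (True, {"RULE_A", "RULE_B"}),
    (True, {"RULE_A", "RULE_C"}),
    (True, {"RULE_B", "RULE_C"}),
    (False, {"RULE_C"}),
    (False, {"RULE_A", "RULE_B", "RULE_D"}),
    (False, {"RULE_D"}),
]
RULES = sorted({rule for _, hits in RESULTS for rule in hits})


def cost(scores):
    """Weighted count of misclassified mail for a given score set."""
    total = 0.0
    for is_spam, hits in RESULTS:
        tagged = sum(scores[rule] for rule in hits) >= THRESHOLD
        if tagged and not is_spam:
            total += FP_WEIGHT  # nonspam tagged as spam
        elif is_spam and not tagged:
            total += 1.0        # spam that slipped through
    return total


def evolve(generations=200, population=20, seed=0):
    """Crude mutate-and-select search; never returns a worse score set."""
    rng = random.Random(seed)
    best = {rule: 1.0 for rule in RULES}
    for _ in range(generations):
        children = []
        for _ in range(population):
            child = dict(best)
            child[rng.choice(RULES)] += rng.uniform(-1.0, 1.0)
            children.append(child)
        # The parent goes last so ties favor a child, which lets the
        # search drift across plateaus of equal cost.
        best = min(children + [best], key=cost)
    return best


start = {rule: 1.0 for rule in RULES}
naive = {"RULE_A": 3.0, "RULE_B": 3.0, "RULE_C": 2.5, "RULE_D": 1.0}
hand_picked = {"RULE_A": 3.0, "RULE_B": 3.0, "RULE_C": 2.5, "RULE_D": -2.0}

print("all rules at 1.0:", cost(start))        # 3.0 (all three spams missed)
print("RULE_D positive: ", cost(naive))        # 5.0 (one false positive)
print("RULE_D negative: ", cost(hand_picked))  # 0.0
print("after search:    ", cost(evolve()))
```

RULE_A and RULE_B need high scores to catch the first spam, so the only way to keep the nonspam message that hits RULE_A, RULE_B and RULE_D under the threshold is to push RULE_D below zero.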
Often a rule which sounds like it should be a sign of spam gets a negative
score. There are several causes of this. The score set for 2.40/2.41 seemed
to be plagued with a lot of them. I know the devs have recently done a
"pruning" of poor-performing rules and re-run the GA with much better
results. I think the new scores will be in 2.42.
As for cases that tend to cause unexpected negative scores, here are a few I
can think of:
1) Something you thought only spammers did is done by lots of nonspammers
too. This is probably the case in FROM_HAS_MIXED_NUMS. All those
[EMAIL PROTECTED] email addresses that people use for their personal
chatter aren't spam. Idi0ts maybe, but a lot of people who aren't spammers
use these as "disposable" addresses.
2) Something you think at casual glance is a spam feature is also a feature
of a few MUAs that spammers generally don't use. SUPERLONG_LINE is in
this category, I think. Some spams match it, but some obscure MUAs also do
this to all emails (i.e., some MUAs tend to send emails as one single line
per paragraph). Also, most spam consists of lots of single-line messages
("buy now!") without a lot of lengthy paragraphs, but conversational emails
tend to have very long paragraphs in them.
3) A typo or bug in a rule makes it match some common non-spam expression
instead of the spam phrase. One such bug was in a rule meant to match "no
credit" and some other common credit-repair phrases, which also matched
"notice: your credit card will be billed when your order is shipped". It
wasn't requiring a space or word break after the "no" part :)
4) Sometimes a rule gets "weighed down on" to correct a common,
particularly high-scoring false-positive (FP) case. If there's a common set of
rules causing FPs, generally the one with the fewest spam matches will wind
up being pushed negative to compensate.
5) Some spam, or reports of spam, slip into the nonspam pile during
evaluation. Most of the time this is pretty low-impact, but if the rule
doesn't have a lot of hits in general, a few misplaced emails can wildly
swing the score. (The ratio of misplaced to correctly placed emails needs to
be less than the degree to which the GA favors avoiding tagging nonspam, at
the expense of missing a little spam.)
6) Yes, there are some glitches in the GA itself, but those are getting better.
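The bug in case 3 is easy to reproduce. The patterns below are hypothetical stand-ins I made up for illustration, not the real SpamAssassin rule, but they show how a missing word break after "no" lets the rule hit the billing notice:

```python
import re

# Hypothetical patterns for illustration; not the real SpamAssassin rule.
BUGGY = re.compile(r"no.{0,20}credit", re.IGNORECASE)       # no word break after "no"
FIXED = re.compile(r"\bno\b.{0,20}credit", re.IGNORECASE)   # "no" must be a whole word

ham = "notice: your credit card will be billed when your order is shipped"
spam = "No credit check required!"

for name, pattern in (("buggy", BUGGY), ("fixed", FIXED)):
    print(name, "ham:", bool(pattern.search(ham)), "spam:", bool(pattern.search(spam)))
# buggy ham: True spam: True
# fixed ham: False spam: True
```

A rule like the buggy one hits a lot of ordinary order confirmations, so the GA sees it mostly in the nonspam pile and pushes its score down.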
At 10:07 PM 9/26/2002 -0600, Danita Zanre wrote:
>I'm admittedly new to this stuff, so please bear with me. I just got a
>message with the following explanations:
>
>Trying to understand the "negative" values here - why would a line longer
>than 199 characters "decrease" the score? Also, why would the "From"
>lines having mixed numbers/no real name decrease the value?
>
>I realize I can change these values for myself if I choose, but I guess
>before I start messing with the values I'd like to understand the logic
>behind these settings.
>
>Thanks.
>
>Danita
>
>
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Spamassassin-talk mailing list
>[EMAIL PROTECTED]
>https://lists.sourceforge.net/lists/listinfo/spamassassin-talk