From: "John Rudd" <[EMAIL PROTECTED]>
On Jul 30, 2006, at 4:37 PM, jdow wrote:
From: "John Rudd" <[EMAIL PROTECTED]>
On Jul 26, 2006, at 5:23 PM, jdow wrote:
I am a bit of a heretic in this group because I take the nasty step
of taking rules that are almost always right (one error per thousand
or more hits) and making sure the score on each such rule is designed
to push the total score AWAY from 5.0 in the appropriate direction.
Do you have your score variations published anywhere?
Might be a useful "alternate score set" or something.
Not the scoring variations. There is WAY too much "ymmv" involved.
Yeah, I kind of like the idea of a more distinct separation between
spam and ham though, and wouldn't mind seeing how the results differ.
Wouldn't mind seeing a published "alternate" version of the scores
which embodies the "brassiere" curve, so to speak.
The chief changes are to BAYES_95 and BAYES_99. Keep raising those
scores a little at a time until you see false positives (ham being
tagged as spam) with BAYES_95 or BAYES_99 in their hit lists. Then
back off a step or two.
The other part HERE is having the luxury of being able to run about
40 of the SARE rule sets. They provide a BIG boost to the spam
scores as a general rule. But they are not so good for some of the
spam that BAYES_99 HERE (YMMV) seems to catch very nicely. With a
stock score of something like 3.5 I had a fair number of spams leaking
through. I (YMMV) was able to boost the score to a teeny smidgen over
5.0 and see no mismarked messages (in MY mailbox) with a BAYES_99
tag.
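In local.cf terms that tuning is nothing fancier than overriding the
stock scores. A rough sketch, assuming a site-wide local.cf (a
single-value score line applies to all four score sets; the exact
numbers here are illustrative, not a recommendation):

    # raise these a little at a time; back off a step or two as soon
    # as ham starts showing these rules in its hit list
    score BAYES_95 3.8
    score BAYES_99 5.0001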
A secondary part of this is the trick I use for mailing lists like
apcupsd (APC UPSs), LKML, FreeBSD users, and a few others that allow
postings by people off the list and do not rigorously spam filter.
(They do spam filter. But they leak a little, some more than others.)
As it happens these lists have a second bad trait. The patches and
bug reports often trigger some of the otherwise quite good SARE
crazy-letter-combo or crazy-punctuation-combo rules. So I hacked up a
solution that is awkward but effective for both the leaky spam and the
bug reports:
1) Create a rule for each of the lists that unambiguously detects
email from that list. Each list manager is a bit different in this
regard.
2) Create a meta rule to combine all these into one "ITS_A_LEAKY_LIST"
rule. (Of course, give it a sensible name. {^_-})
3) Create a set of rules for the higher and lower BAYES_XX scores that
trigger on that score plus an "ITS_A_LEAKY_LIST" hit. These rules
make low scores even lower and high scores even higher. I also have
"ITS_A_LEAKY_LIST" itself scored a little negative because these lists
seem to score a little high in general. Tweak the actual values to
match your needs. (A rough sketch of all three steps follows below.)
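Roughly, in local.cf, the whole thing looks something like the sketch
below. The list-detection headers and patterns are illustrative
guesses (check what headers your lists actually stamp on their mail),
and all the scores are mine to tweak, not gospel:

    # 1) Unambiguous per-list detection. The __ prefix marks these as
    #    unscored sub-rules that exist only to feed the meta rule.
    header __FROM_APCUPSD_LIST List-Id =~ /apcupsd-users/
    header __FROM_LKML        List-Id =~ /linux-kernel\.vger\.kernel\.org/

    # 2) One meta rule to gather them all, scored a little negative
    #    because these lists tend to score a bit high overall.
    meta  ITS_A_LEAKY_LIST    (__FROM_APCUPSD_LIST || __FROM_LKML)
    score ITS_A_LEAKY_LIST    -0.1

    # 3) Push the Bayes extremes further out for list mail.
    meta  LEAKY_LIST_HAMMY    (ITS_A_LEAKY_LIST && BAYES_00)
    score LEAKY_LIST_HAMMY    -2.0
    meta  LEAKY_LIST_SPAMMY   (ITS_A_LEAKY_LIST && BAYES_99)
    score LEAKY_LIST_SPAMMY   2.0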
With this set of rules in place and BAYES_99 set to 5.0001 (just to
thumb a nose or two; yes, I am obnoxious <sigh>) I am getting no
false markups as a result.
Both changes knocked out about half the false markups I'd been having
before I installed them. Now, for me, false markups are mercifully rare
except for some tradezine mailings. I've been too lazy to build special
discriminatory rules to tell EDN magazine articles from their spams.
So I leave their "stuff" marked as spam because I prefer to read the
dead trees edition anyway. It gets me away from the computer for a while.
SA is handy that way. You can elect to score something that is
technically not spam as spam and not be bothered by it unless you
choose to be bothered in specific cases. (You DO review your spam
mailbox before tossing the spam, don't you? And you DO tweak the
headers so that you can easily sort on the score for all the spams.
Looking at scores over 10 here is usually pointless. But I look at
those below 10 because some of them are nice fodder for manual Bayes
training. Stuff that already hits BAYES_99 isn't worth the CPU cycles
to train on, though, IMAO.)
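(For what it's worth, the header tweak can be as simple as
SpamAssassin's add_header directive with the _SCORE_ template tag;
this assumes your mail client can sort on an arbitrary header:)

    # put the numeric score in its own header (X-Spam-Score) on
    # every message so the spam folder can be sorted on it
    add_header all Score _SCORE_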
{^_-}