
Matt Kettler replied:

>John Tice wrote:
>>
>> Greetings,
>> This is my first post after having lurked some. So, I'm getting these
>> same "RE: good" spams but they're hitting eight rules and typically
>> scoring between 30 and 40. I'm really unsophisticated compared to you
>> guys, which raises the question: what am I doing wrong? All I use is a
>> tweaked user_prefs wherein I have gradually raised the scores on
>> standard rules found in spam that slips through over a period of time.
>> These particular spams are over the top on bayesian (1.0), have
>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>> flags them for me. I'm trying to understand the reason you would not
>> want to have these types of rules set high enough? I must be way
>> over-optimized: what am I not getting?
>
>
>BAYES_99, by definition, has a 1% false positive rate.
>

        Matt,

        If we were to presume a uniform distribution of estimates between
99% and 100%, then the FP rate would be 0.5%, not 1% (the expected ham
probability being the average of 1 - p over that interval).  For large sites
(i.e. tens of thousands of messages a day or more), that may be roughly what
occurs; but what I see, and what I assume many other small sites see, is a
very non-uniform distribution.  Over the last 30 hours, the average estimate
(i.e. the value reported in the "bayes=xxx" clause) for spam hitting the
BAYES_99 rule was .999941898013269, with about two-thirds of them reporting
bayes=1 and the lowest value being bayes=0.998721756590216.

        While SA is quite robust largely because of the design principle
that no single reason/cause/rule should by itself mark a message as spam, I
have to guess that the FP rate the majority of users actually see for
BAYES_99 is far below 1%.  From the estimates reported above, I would have
expected roughly a .003% FP rate over the last day and a bit - if only I had
received 100,000 or so spam messages, enough to actually observe it :).
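
        (A rough sketch for anyone who wants to check their own numbers -
it assumes you have already extracted the per-message bayes estimates from
your own mail, one value per line; how you collect them depends entirely on
your setup, so feed it whatever grep/sed extraction matches the way your
installation records the "bayes=xxx" clause.)

    # Sketch: read one bayes estimate per line from stdin and report the
    # mean estimate plus the expected FP rate among BAYES_99 hits, i.e.
    # the average of (1 - p).  A uniform spread over [0.99, 1.00] would
    # give 0.5% here; a distribution piled up near 1.0 gives far less.
    import sys

    values = [float(line) for line in sys.stdin if line.strip()]
    hits = [p for p in values if p >= 0.99]   # only messages hitting BAYES_99
    if not hits:
        sys.exit("no BAYES_99 hits found")

    mean_p = sum(hits) / len(hits)
    expected_fp = sum(1.0 - p for p in hits) / len(hits)

    print("mean estimate:    %.15f" % mean_p)
    print("expected FP rate: %.5f%%" % (expected_fp * 100))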

        I don't change the scoring from the defaults, but people who want
to could change the rules (or add a rule) for a BAYES_99_99 test that fires
only on scores higher than bayes=.9999 and that (again assuming a uniform
distribution) would have an expected FP rate of .005% - then either score it
just under (but still below) the spam threshold, or have it add a point or
fraction thereof so the total lands just under the spam threshold.  Adding a
new rule avoids having to edit the distributed files and is thus probably
the "better" method.
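
        (For the curious, such a rule could look roughly like the lines
below in a site-wide file such as local.cf.  The name BAYES_99_99, the range
and the score are only illustrative, and I'm assuming the same check_bayes
eval test the stock BAYES_* rules are built on; pick the score so your
bayes-heavy spam still lands just under your own threshold.)

    # hypothetical extra bayes bucket, patterned on the stock BAYES_99 rule
    body     BAYES_99_99   eval:check_bayes('0.9999', '1.00')
    describe BAYES_99_99   Bayes spam probability is 99.99 to 100%
    tflags   BAYES_99_99   learn
    # a point "or fraction thereof"; tune against your own spam threshold
    score    BAYES_99_99   1.0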

        Anyway, to better address the OP's questions:  The system is more
robust if, instead of changing the weighting of existing rules (assuming
they were correctly established to begin with), you add more possible inputs
- preferably independent ones, i.e. rules whose false positives have a low
correlation with one another.  Simply increasing scores will improve your
spam "capture" rate, just as decreasing the spam threshold will, but both
methods also increase the likelihood of false positives.  Look at the
distributed documentation for the expected FP rates at different spam
threshold levels; those numbers drive this point home.  Changing specific
rules' scores is just like changing the threshold, only in a non-uniform
fashion - unless you actually measure the values for your own site's mail
and recompute scores that are a better estimate for your local traffic.
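
        (A trivial back-of-the-envelope illustration of the independence
point - the 1% figures below are made up, not measured:)

    # Toy numbers only: two rules whose false positives are uncorrelated.
    fp_a = 0.01   # assumed FP rate of rule A (1%)
    fp_b = 0.01   # assumed FP rate of rule B (1%)

    # Pushing rule A's score up to the threshold exposes you to its full FP
    # rate; requiring two independent rules to fire together only mis-tags
    # ham at roughly fp_a * fp_b.
    print("ham wrongly tagged by A alone:      %.4f%%" % (fp_a * 100))
    print("ham wrongly tagged by A and B both: %.4f%%" % (fp_a * fp_b * 100))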

        Paul Shupak
        [EMAIL PROTECTED]
