Re: [SAtalk] Default Bayes scoring, and default cutoff value - too many false positives

Robert Menschel Thu, 14 Aug 2003 11:32:34 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Gary,


Tuesday, August 5, 2003, 8:00:20 AM, you wrote:

GF> I've been running SA with Bayes enabled only the past few days. Bayes
GF> has been auto-learned on two rather large corpuses, which yielded
GF> about 1100 auto-learn messages (per the Bayes journal file). I've
GF> noticed the number of false negatives (ie, spam mis-classified as
GF> ham) have dropped to almost zero, but I'm seeing maybe half a dozen
GF> false positives (ham mis-classified as spam) per day. I'm having to
GF> white list friends and newsletters that previously went through just
GF> fine.

Terminology confusion (possibly mine).  Auto-learn is what happens to
emails one at a time as they pass through SA.  Learning from a corpus
would normally be done by running sa-learn directly (manually, not
automatically).

Of those 1100 messages, how many were spam, and how many were ham? I
don't think I've seen more than a half dozen FPs in any *month*, much
less a day.

GF> Generally, I'm using SA in local mode, and backing out to network
GF> mode only when local says no ham was found.

So you're running SA against your rule set and Bayes without DNSBL
checks, and then if these do not scream SPAM (high score) or HAM
(negative score), you then check DNSBL to see if they will give a spam
score?

GF> Given my ham to spam ratio (roughly 1 to 5) that's been okay, but it
GF> probably leads to a surprising result where spam is over-aggressively
GF> mis-classified. I'm using 2.60 cvs (6/30) at the moment, but I think
GF> the same problem would come up on version 2.55.

Very possibly not. 2.60 doesn't yet have statistically determined rule
scores. Its rule set is more advanced than 2.55's, and to my knowledge
it hasn't yet been run against the giant SA corpus available to the
developers. After that run, the default rule scores are adjusted to
minimize FPs. Again to my knowledge, that FP minimization step hasn't yet
taken place for 2.60.

GF> The problem is that I'm seeing these misclassified spams as having
GF> only, or nearly only, BAYES_99 asserted. ...

I don't remember ever seeing BAYES_99 on anything that wasn't spam,
and I've only seen BAYES_90 on non-spam once in three months. That leads
me to question the accuracy of your original corpus.  How was it built
and classified?  What are the chances that persons A and B classified
some emails as spam, and Bayes learned them as spam, while persons C and
D would say those same emails are not spam?

GF> Using BAYES_99 as an example, it will be scored 5.2 with Bayes
GF> enabled, while running in local (non-network) mode, and only 3.008
GF> when networking is enabled. Trouble is, that 5.2 exceeds the default
GF> cut off of 5. ...   

GF> What I'm working up to here: For those of you using Bayes, did you
GF> also move your threshold value up (to say, 7 or above), or do you
GF> simply tolerate more false positives? (I'd have to say that the
GF> four/five false positives I'm now seeing per day, and didn't see
GF> before is too high a number for my tastes).

I rely heavily on Bayes. I run with a required hits of 9.0, and I run
with BAYES_99 set at 9.0, and with BAYES_90 set at 7.5 (83% of
threshold). I think I got one FP in all of July, and it had a low Bayes
score.
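
To make the arithmetic concrete, here is a rough Python sketch using only
the figures quoted in this thread. It assumes a message is tagged as spam
when its summed rule scores reach the threshold; the function names are
made up for illustration and are not part of SA.

```python
# Illustrative only: scores and thresholds are the figures quoted in this
# thread. is_spam() and share_of_threshold() are hypothetical helpers.

def is_spam(rule_scores, required_hits):
    """Assume a message is spam when its summed scores reach the threshold."""
    return sum(rule_scores) >= required_hits


def share_of_threshold(score, required_hits):
    """Express one rule's score as a percentage of the threshold."""
    return round(score / required_hits * 100)


# Gary's case: BAYES_99 is the only rule that fires, default threshold of 5.
local_only = is_spam([5.2], 5.0)       # local (non-network) score
with_network = is_spam([3.008], 5.0)   # score when networking is enabled

# My setup: required hits 9.0, BAYES_99 at 9.0, BAYES_90 at 7.5.
bayes_99_alone = is_spam([9.0], 9.0)
bayes_90_alone = is_spam([7.5], 9.0)
bayes_90_share = share_of_threshold(7.5, 9.0)

print(local_only, with_network, bayes_99_alone, bayes_90_alone, bayes_90_share)
```

Under that assumption, BAYES_99 by itself tags a message in local mode at
the default threshold but not in network mode, and in my setup BAYES_90
by itself still needs help from other rules.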

So in summary, no, I don't think your Bayes *scores* are the problem. I
think the main problem is that Bayes learned ham as spam. I would suggest
checking through your spam corpus and relearning any misclassified emails
as ham.

A second and less critical problem may be your use of 2.60 and its not
yet statistically validated scores. This will remain the lesser issue as
long as you have ham with Bayes scores of 90% and over.

Good luck.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPzB155ebK8E4qh1HEQLN2gCgpg1vEiUcvTJ+4HwVeuLn/XFGDz4An06f
q0sQMBbXgnA0Cr+5DLVHNnyS
=Wz4N
-----END PGP SIGNATURE-----



