Re: [SAtalk] how to change the bayes auto_learn threshold to zero or above?

2004-01-30 Thread Kris Deugau
Brett Dikeman wrote:
> I tried setting bayes_auto_learn_threshold_nonspam to a positive
> value- almost all legitimate email we get on the particular system is
> marked somewhere between 0 and 2- rarely any lower.  Ever (save for
> whitelisting).

You've found the right setting, and I'm not aware of any particular
problems with setting it to a positive value.  (The default value is 0.1.)
Staring at the man page reveals no official limits in this respect.
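
For reference, a minimal local.cf sketch of that change. The 2.0 is only an illustration taken from the 0-to-2 range mentioned above, and 12.0 is, as far as I know, the documented default for the spam side; adjust both for your own mail.

```
# Illustrative values only, not recommendations.
# Turn autolearning on.
bayes_auto_learn                    1
# Messages scoring below this are fed to Bayes as ham.
bayes_auto_learn_threshold_nonspam  2.0
# Messages scoring above this are fed to Bayes as spam.
bayes_auto_learn_threshold_spam     12.0
```

If spamd is in use, it needs a restart to pick up changes to local.cf.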

I vaguely recall that there are a few rules or types of rule that are
ignored when calculating the score used to decide whether to
autolearn, but I don't recall the details.  Among other things, however,
SA will use one of the two NON-Bayes score sets for that calculation,
which in some cases can result in a *very* different score.
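
A sketch of what such an excluded rule looks like, assuming the `noautolearn` test flag from the Mail::SpamAssassin::Conf documentation is what is being recalled here. The rule name and header pattern are made up for illustration. As far as I know the stock whitelist rules carry the same flag, which would explain why whitelisting alone never pushes a message under the ham threshold.

```
# Hypothetical local rule (name and pattern are placeholders).
# 'nice' marks it as a score-lowering rule; 'noautolearn' asks SA to
# leave its score out of the total used for the autolearn decision.
header   LOCAL_EXAMPLE_LIST  List-Id =~ /announce\.example\.com/
score    LOCAL_EXAMPLE_LIST  -3.0
tflags   LOCAL_EXAMPLE_LIST  nice noautolearn
```

With this flag the -3.0 still counts toward the final spam score, but not toward the score that autolearning looks at.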

> That setting appeared to have no effect. I just sent a test, it got a
> score of 0, and was tagged autolearn=no. Even after I removed the
> whitelists.  I need to change this.

Very strange.  Even with the default thresholds, I was still seeing
messages autolearned as ham on the system I'm maintaining- but I was
also seeing spam autolearned as ham.  I *dropped* the autolearn-as-ham
threshold down to -0.5 (-2 for a brief time), which seemed to stop spam
from being autolearned as ham.  In my case, autolearning spam as ham is
Really Bad, because I have no way to easily counterbalance such messages.

> Also- maybe it's just me, but it seems rather silly to not allow the
> user to auto-learn messages that have been whitelisted, either
> sitewide or user-specific.  Could someone a) explain the reasoning
> here

I think part of the reasoning is that whitelisting is a last-resort
option (before processing outside of SA, of course <g>) to get a message
to NOT be tagged.  Someone sending legit-but-very-spammy mail whose
messages are autolearned will reduce the effectiveness of the Bayes
score (by how much I'm not sure) because you're effectively declaring
spammy tokens to be less spammy.

It would definitely be nice to be able to override this limitation
though.

-kgd
-- 
Sendmail administration is not black magic.  There are legitimate
technical reasons why it requires the sacrificing of a live chicken.
   - Unknown


---
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
___
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk


Re: [SAtalk] how to change the bayes auto_learn threshold to zero or above?

2004-01-29 Thread Brett Dikeman
Martin Radford wrote:

> It might be because you get the occasional false positive that you
> want to avoid (but all the rest come under your threshold).  You
> probably would want these autolearned as ham.

Actually, at the moment the bayes engine thinks 99% of the messages
going through it are spam, simply because it's auto-learning spam
messages but never auto-learning ham...because messages never get
negative scores.

> Or it might be because the messages are from a mailing list like this
> one, where the messages may well contain extracts from spam.  In this
> case you positively *don't* want to autolearn them as ham, because
> it'll adversely affect the Bayes database's training.

I read this in the archives while researching the problem before asking
the list (Gasp, yes!  A user did research before posting!) and I think
it's such an obscure problem that it doesn't affect us.  In our
specific use, in fact, this particular circumstance will never, ever,
ever happen; nobody forwards spam to this particular mail server (but it
does get deluges of spam on its own).  As for the people on SA-user
getting copies of spams...believe it or not, SpamAssassin users vastly-
and I do mean vastly- outnumber spamassassin-* list members.  I also
don't know many people that forward each other spams (at least not people
that want to keep their friends).

Further- as I recall, last time someone said "what if a spam gets quoted
on -list", several people on the SA list pointed out that even on the
SA list, such occurrences are rare.

So yes- I think your argument is rather obscure and moot for 99% of your 
users.

Did you consider that the occasional spam auto-learned as ham really
isn't that bad, if you're auto-learning many more legitimate messages?
SA tends to grossly tip the scales towards auto-learning spam versus
ham, all for the sake of not accidentally learning a rather
theoretical (for most users) case.  Left to its own devices, the bayes
engine will eventually mark more and more messages as spam, and the
engine becomes completely useless- which is much worse than a slight
inaccuracy from the occasional spam that gets auto-learned as ham.

Developers are always well-meaning when they institute rules (that cannot
be overridden) to address specific circumstances.  However, these little
rules often end up causing a lot of people a lot of grief and solving a
problem that really wasn't that big in the first place.  It's like not
giving your dinner guests steak knives because there's the slim chance
they might poke themselves in the ear with one.  Yeah, your dinner guests
will be safe- but they're going to have a hell of a time enjoying the
steak with that butter knife you gave them instead.  Another example
would be the infamous crash involving that Airbus plane that overrode
the pilot's command for more throttle.  The computer (and its programmers)
had good intentions, but failed to realize that in the end, there has to
be a magic red button somewhere that puts someone with situational
knowledge back in control of things.  SA has many such restrictions and
few Magic Red Buttons.

In several cases, SpamAssassin assumes it knows better than I do, and
overrides my config directives (and further, doesn't warn me it's doing
so).  If you want to warn me in the install/config/whatever docs that
turning on auto-learning of messages above X score or turning on 
auto-learning of whitelisted messages is dangerous, fine.  So be it. 
Some people might not instantly realize the implication.  But give us 
the OPTION of doing it.

So here's my suggestion, and it's two-part:

a) Strip the min+max limit controls from the two auto-learn params.  If I
want to be a moron and set my auto-learn-spam to 2 (i.e., below the magic
number 6), that's my bloody business, not yours ;-)

b) Add an auto_learn_whitelist option with four settings:
off (nothin'), auto (i.e., let bayes auto-learn messages that were
auto-whitelisted), manual (i.e., config-file whitelist rules) and all (both
auto and manual, mwuaha).  OK, so they're not intelligently named, but
that combo will make just about anybody happy.

Make the default 'off' if you REALLY, really think the whole
subversive-spam thing is a problem for the MAJORITY OF YOUR USERS.
Chances are 'manual' is the next-safest option, since generally users
have to be smarter than the average bear to set up their own rules (or
their admins had good reasons for adding global rules- as I did on our
system, whitelisting our biggest customers).  'auto' and 'all' would be
the least safe.

> You could work around the problem by creating your own rules to
> identify these messages, and give them a negative score.

The messages in question have no common element.  They come from
virtually anyone; in most cases, they're initiated by the user out of
the blue, so we can't even inject headers or taglines to look for later.

Brett


Re: [SAtalk] how to change the bayes auto_learn threshold to zero or above?

2004-01-29 Thread Martin Radford
At Fri Jan 23 19:44:34 2004, Brett Dikeman wrote:
 
> Also- maybe it's just me, but it seems rather silly to not allow the
> user to auto-learn messages that have been whitelisted, either sitewide
> or user-specific.  Could someone a) explain the reasoning here and b) tell
> me how to change this?  It is almost completely contrary to what we
> want- any addresses we've whitelisted are guaranteed to be sending
> legitimate email and we would ABSOLUTELY want them auto-learned as ham,
> NOT the other way around...

One of the problems is that there's no way of telling Bayes *why*
you've whitelisted these messages.  

It might be because you get the occasional false positive that you
want to avoid (but all the rest come under your threshold).  You
probably would want these autolearned as ham.

Or it might be because the messages are from a mailing list like this
one, where the messages may well contain extracts from spam.  In this
case you positively *don't* want to autolearn them as ham, because
it'll adversely affect the Bayes database's training.

You could work around the problem by creating your own rules to
identify these messages, and give them a negative score.  These *will*
cause Bayes to pick the messages up, as long as the rules take them below
the autolearn-as-ham threshold.
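
A minimal sketch of that workaround for local.cf, assuming the trusted mail can be recognised by its From address. The rule name, domain and score are placeholders, not anything from this thread.

```
# Hypothetical rule: lower the score of mail from a known-good sender so
# that it can fall below bayes_auto_learn_threshold_nonspam.
header   LOCAL_TRUSTED_SENDER  From =~ /\@bigcustomer\.example\.com/i
describe LOCAL_TRUSTED_SENDER  Mail from a sender we trust
score    LOCAL_TRUSTED_SENDER  -5.0
tflags   LOCAL_TRUSTED_SENDER  nice
```

Running `spamassassin --lint` after editing should catch syntax mistakes.  Bear in mind that From headers are trivially forged, so a rule like this is only as trustworthy as the header it matches.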

Martin
-- 
Martin Radford  |   Only wimps use tape backup: _real_ 
[EMAIL PROTECTED] | men just upload their important stuff  -o)
Registered Linux user #9257 |  on ftp and let the rest of the world  /\\
- see http://counter.li.org |   mirror it ;)  - Linus Torvalds _\_V

