Hello Tony,

Tuesday, March 16, 2004, 11:08:43 PM, you wrote:

>> TB> Hey cool, done that now.  Just looked at the headers of a message
>> TB> received which says "autolearn=ham" This was a message from the SA
>> TB> group funnily enough - presumably that is correct?
>>
>> Unless that message included spam samples, that's no problem.
>>
>> I suggest you set your non-spam auto-learn threshold to -0.01 to make
>> sure that spam that hits no rules is not accidentally learned as ham.

tac> errr...what?  who? where? how???

I use:
> auto_learn_threshold_nonspam       -2
> bayes_auto_learn_threshold_nonspam -2
I forget which name applies to 2.5x and which to 2.6x. Adjust the value
to the score you want to use as a threshold, and put the setting into
your local config file (e.g. local.cf or whatever).

tac> Also, looking at more headers - if they say "autolearn=no", does that
tac> just mean SA had no idea if it was spam or ham, or does it just mean that
tac> autolearn is off and I was looking at an old message? ;)

No, autolearn=no simply means the email didn't score high enough (as
spam) or low enough (as non-spam) to be auto-learned. It means auto-learn
is on, but the email message didn't qualify.
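
To check this across a batch of saved messages, something like the
following might help. It is only a rough sketch: it assumes the value
shows up as "autolearn=<word>" somewhere in the message headers (as in
the X-Spam-Status header quoted above), and the function name
autolearn_status is made up for illustration.

```shell
# Print the autolearn value (e.g. "ham" or "no") of one message on stdin.
# Assumption: the headers contain a field of the form autolearn=<word>.
autolearn_status() {
    # sed quits at the first blank line, so only the headers are searched
    # and spam samples quoted in the body cannot confuse the result.
    sed '/^$/q' | grep -o 'autolearn=[a-z]*' | head -n 1 | cut -d= -f2
}
```

Usage would be along the lines of: autolearn_status < some-message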

>> My understanding is that each domain with a $HOME will have one
>> $HOME/.spamassassin directory, and the bayes database built there will
>> apply to all [EMAIL PROTECTED] for that domain.

tac> Cool.  That does indeed seem to be the case - my mailboxes were
tac> refreshingly free of spam this morning - hurrah!

>> cp /dev/null $file
>> or
>> cat </dev/null >$file
>> are two methods I've used to empty files.

tac> okay will do that - is there any advantage of one over the other apart
tac> from less typing? ;)

None that we can measure.
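
For the record, here are the two methods wrapped up as small shell
functions. The names empty_file and empty_file_cp are just illustrative.
Both should leave the file in place with zero length rather than deleting
and recreating it.

```shell
# Empty a file using the redirection form.
# The shell truncates "$1" when it opens it for writing, and cat then
# copies nothing into it.
empty_file() {
    cat </dev/null >"$1"
}

# Empty a file using cp; copying the empty /dev/null over an existing
# file truncates it to zero length.
empty_file_cp() {
    cp /dev/null "$1"
}
```

Quoting "$1" keeps either version safe for paths that contain spaces.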

>> TB> - is my first ever shell script!:
tac> [..]
>> TB> Any obvious flaws there guys, or something I could do better?   It
>>
>> Looks good to me.  I wouldn't cat them all into one file first, since my
>> understanding is that the shorter/quicker sa-learn runs are better (less
>> chance they'll block bayes update by incoming email and auto-learn).

tac> okay, thanks m8.  You cat them in your script though don't you?

No. My commands are:
> sa-learn --spam --mbox sa.learn.spam      # do the sa-learn
> ls -lF `pwd`/sa.learn.spam                # record this file in my log
> cat sa.learn.spam >>~/mail/cw-spam/inbox  # append to my corpus
> cat ~/mynull      > sa.learn.spam         # empty the mailbox
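
If you want the same sequence as a reusable function, a minimal sketch
might look like this. It is not exactly what I run: the function name
learn_and_archive and the SA_LEARN override variable are hypothetical,
added so the steps can be tried without touching a real Bayes database,
and unlike the commands above it skips the archive and empty steps when
sa-learn fails.

```shell
# learn_and_archive MBOX CORPUS
# Learn MBOX as spam, append it to CORPUS, then empty MBOX.
# SA_LEARN (hypothetical) may name a stand-in command; default is sa-learn.
learn_and_archive() {
    mbox=$1
    corpus=$2
    # Nothing to do if the mailbox is missing or already empty.
    [ -s "$mbox" ] || return 0
    # Stop here if learning fails, so the mailbox is kept for a retry.
    ${SA_LEARN:-sa-learn} --spam --mbox "$mbox" || return 1
    # Append the learned mail to the corpus before emptying the mailbox.
    cat "$mbox" >>"$corpus" || return 1
    : >"$mbox"
}
```

Note that mail delivered to the mailbox between the sa-learn step and
the final truncation would be lost, so this assumes nothing else is
writing to that file while the function runs.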

>> TB> If the former, then presumably my script would be better off
>> TB> concatenating the spam and ham files before passing them to a
>> TB> single run of sa-learn?
>>
>> I run my scripts once an hour.

tac> Blimey - do you get THAT much spam? ;)

7-8k spam a week, and will probably hit 9k around June.

>> You need 200+ spams and 200+ hams before Bayes takes effect and starts
>> applying its scores to your emails. It then remains effective unless you
>> drop below those numbers (such as by deleting the database files and
>> starting over). That has nothing to do with sa-learn. The more often
>> sa-learn runs, the more current your bayes database is.
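
One way to check whether a database has reached those numbers is
sketched below. It rests on assumptions: that your version supports
"sa-learn --dump magic", and that its output has lines ending in "nspam"
and "nham" with the count in the third column. The function name
bayes_ready is made up.

```shell
# Read "sa-learn --dump magic" style output on stdin and succeed only if
# both the spam and ham counts have reached 200.
# Assumption: the count is the third field on lines ending in nspam/nham.
bayes_ready() {
    awk '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END { exit !(nspam >= 200 && nham >= 200) }
    '
}
```

Usage would be something like:
sa-learn --dump magic | bayes_ready && echo "Bayes should be active"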

tac> Okay, thanks.
tac> For ham, do you just copy everything from your inbox (apart from spam not
tac> caught) or is there stuff you WOULDN'T put through the spam filter? eg,
tac> all the posts to this list?

I do not sa-learn the SA mailing lists, nor any other mail which contains
samples of spam, nor discussions of spam.  Otherwise I sa-learn
everything.

tac> I am on a number of lists and the volume would make perfect ham material,
tac> but I'd be worried sometimes that the content wouldn't and certain
tac> characteristics - eg being sent to a large no. of people, me not being
tac> explicitly set as a recipient.

If you (or your people) ever get non-spam from the same people who use
those lists, discussing the same topics, then learning them will be a big
help (it avoids FPs).

tac> For spam, is there any value in passing already identified spam (sent to
tac> the spambox) thru sa-learn?

a) If you do multiple domains as I do, then it's easier to simply feed
all spam into all three domains rather than figure out where it has
already been learned.

b) For simplicity of management, I dump spam from all three domains into
one spamtrap. It's easier for me then to sa-learn all of them, rather
than keep the already learned spam separate.

Bob Menschel


