Hello FH,

Tuesday, February 15, 2005, 3:40:43 PM, you wrote:

>> Next time you get one of those spam that sneaks through, run
>>    spamassassin -D <email >output 2>debug.out

F> There must be a disconnect somewhere. I just did this w/ a "drugs
F> online" spam I just received.  When it first came in it had a
F> rating of 1.9, I saved it as a file (not an mbox) on the server and
F> ran the above command and it reported a 12.5!!!

What were the rule hit changes? Depending on how much time passed
between the first scan and the second, some of that jump could be
network tests (DNSBLs, Razor, and the like) having since been taught
that spam. The more time that passed, the more likely such a score
increase would be. Bayes could also have been involved, since learning
other emails in the meantime could have raised this one's Bayes score.
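If you still have both copies, you can diff the test lists from the
two X-Spam-Status headers to see exactly which rules changed. A quick
sketch (the header text and test names below are made-up examples, not
your actual hits):

```shell
# Hypothetical X-Spam-Status lines from the two scans (test names invented):
first='X-Spam-Status: No, score=1.9 tests=HTML_MESSAGE,BAYES_50'
second='X-Spam-Status: Yes, score=12.5 tests=HTML_MESSAGE,BAYES_99,URIBL_SBL,RCVD_IN_XBL'

# List the tests that hit only on the second scan
comm -13 <(echo "$first"  | sed 's/.*tests=//' | tr ',' '\n' | sort) \
         <(echo "$second" | sed 's/.*tests=//' | tr ',' '\n' | sort)
```

If everything that shows up is a network or Bayes rule, the timing
explanation above fits; anything else points at a real disconnect.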

However, if the rule hit differences included non-trainable rules,
then yes, you have a serious disconnect.

F> After running sa-learn on the mbox I saved the email to it didn't
F> change anything (the above still reported a 12.5).

Assuming part of that first 12.5 was already a BAYES_99 hit, no
additional amount of learning could push the Bayes component any
higher.

F> I then "bounced" the message back to myself and when it hit the
F> incoming mailbox again this time it was autolearned as ham and
F> rated as 0.4.

Your bounce was a different email, with different headers, through a
different email path, and so it was filtered differently.

Auto-learning it as ham is IMO a problem. I think that auto-learning
anything with a positive score as ham is asking for trouble. I have my
ham auto-learn thresholds set at -2. (I have several negative scoring
rules specific to my domains.)
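For reference, that knob lives in local.cf; mine looks roughly like
this (the exact threshold is a local policy choice, not a
recommendation):

```
# local.cf -- only auto-learn as ham when the total score is clearly negative
bayes_auto_learn_threshold_nonspam -2
```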

F> Running it back through the above command again that only scored a
F> 7.9 :( ?!?

Reasonable -- by bouncing the email back into the system you caused it
to be auto-learned as ham, which confused Bayes, so Bayes gave it a
lower score, dropping your total score from 12.5 to 7.9.  I see
nothing wrong with that result (besides the ham auto-learn).

F> So just to double check I'm doing this right:

F> - Mail comes in to the server and is picked up by postfix (running
F> as postfix). 
F> - It's passed off to procmail via "mailbox_command = /bin/procmail"
F> in the postfix/main.cf file. 
F> - Procmail calls spamc which passed off the mail to spamd (running
F> as spamd and started via an init.d script that runs "spamd -d -u
F> spamd" at startup).  
F> - That runs it through spamassassin and marks it up if
F> appropriate and then dumps it into the mailbox.

So far so good. I'm not that familiar with procmail/postfix/spamd, but
it sounds reasonable.
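For what it's worth, the usual procmail glue looks something like the
below (path and size limit are examples; the size guard keeps huge
mails out of spamd):

```
# /etc/procmailrc (or ~/.procmailrc) -- hand messages under ~250 KB to spamc
:0fw
* < 256000
| /usr/bin/spamc
```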

F> - Not getting into what the other users are doing, if I get an
F> unmarked spam I save it to a mailbox (I use [PC-]Pine btw in case
F> that makes a difference) and occasionally run "sa-learn --showdots
F> --spam --mbox spam" as root on that file.

That's what I do, except that I do it for ALL spam, marked or not,
learned or not.  And I do the same with ALL ham, marked or not,
learned or not, but with --ham.
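If you keep those mboxes in fixed places, it's easy to automate; a
hypothetical crontab entry (the folder names are examples):

```
# Retrain nightly from the saved spam and ham corpora
30 3 * * * sa-learn --spam --mbox $HOME/mail/spam-corpus && sa-learn --ham --mbox $HOME/mail/ham-corpus
```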

F> This is how it's supposed to work right?  I did a "find" for journal,
F> seen and toks and only came back w/ those in the expected place
F> (/var/spool/spamassassin).  The only other spamassassin files I
F> found that looked "out of place" (aka not the config file or the
F> share/rules files) were in the ~root/.spamassassin

By ~root/.spamassassin, do you mean root's home directory
specifically, or each user's home directory with a .spamassassin
directory under it? And in your config files, do you specify a Bayes
database path (bayes_path)?

Normally, if you don't override anything, and if you're using the -u
parameter to spamd as you're doing, each user should have a
$HOME/.spamassassin/bayes_* set of files. These are the files used
during filtering, and these are the files updated during learning
(whether auto or manual).
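Since your spamd is started with "-u spamd" but your manual sa-learn
runs as root, the two sides can easily end up on different databases.
A trivial sketch of the mismatch, with both paths assumed from your
description:

```shell
# Under the setup you describe (paths are assumptions, not facts):
filter_db=~spamd/.spamassassin   # spamd -u spamd filters with this user's files
train_db=~root/.spamassassin     # sa-learn run as root trains these instead
[ "$filter_db" = "$train_db" ] || echo "training and filtering use different Bayes databases"
```

If that's what's happening, your root-side training would never affect
the scores spamd hands back.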

If you also have these bayes files in /var/spool/spamassassin, then why
are they there? Are they being updated?  I'm wondering whether you're
training the $HOME/.spamassassin/bayes_* files but filtering on a
central set of files.

Though from the directory you listed, it looks like you have
individual auto-whitelist files, but no individual bayes files, in
which case you should not have that conflict.  If everyone is
filtering in and learning to one central database, then except for
timing you should have no disconnect there.

Bob Menschel


