On 29 Dec 2015, at 13:24, RW wrote:

On Mon, 28 Dec 2015 23:42:03 -0500
Bill Cole wrote:


Using these facts, my learning script that runs as root and reads
from multiple real users' Maildirs does this to learn ham:

 for AFILE in $HAMS ; do formail < $AFILE ; done | sudo -H -u $SAUSER sa-learn --ham --mbox

Where $HAMS is the list of ham message files and $SAUSER is the user
handling the system-wide BayesDB. I use formail there just to give
each message a leading 'From ' line (i.e. mbox format) so that the
whole bunch can be piped into a single sa-learn invocation.

IIRC when you do that sa-learn just creates a temporary file and then
runs on that.

Yes, with the advantage of using Mail::SpamAssassin::Util::secure_tmpfile() rather than whatever I happen to roll up in a bit of Q&D shell that I never get around to reviewing for edge cases...

The main reason to do something like that is to avoid the heavyweight sudo & load of a Perl script for each message.


The alternative without formail would be to pipe each raw message into
its own sa-learn.
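
For comparison, that per-message alternative could be sketched roughly as below. This is only an illustration: `learn_each` is a made-up helper name, and it assumes sa-learn reads a single message from stdin when given no file arguments. The reading happens in the calling (root) shell, so the learning user never needs access to the Maildirs, but every message pays for its own sudo and Perl startup.

```shell
# learn_each CMD [ARGS...]   (hypothetical helper, not part of SpamAssassin)
# Reads message file names from stdin, one per line, and runs CMD once per
# message with that message on its standard input. Only the calling shell
# needs read access to the files; CMD may run as a different user.
learn_each() {
    while IFS= read -r msgfile; do
        "$@" < "$msgfile" || echo "learn_each: failed on $msgfile" >&2
    done
}

# Assumed usage, with $HAMS and $SAUSER as in the script above:
#   printf '%s\n' $HAMS | learn_each sudo -H -u "$SAUSER" sa-learn --ham
```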

The alternative is to give it a directory.

Sure, one can reimplement Mail::SpamAssassin::Util::secure_tmpfile and/or Mail::SpamAssassin::Util::secure_tmpdir and use that. One can copy files from multiple user Maildirs and maybe error out before cleaning up or maybe forget to set perms right or maybe make some mistake I haven't thought of.

Or, I could use a tool that's been at least nominally open to review for many years across many versions and which stands a strong chance of having had at least one set of more competent eyes run across it looking for flaws to fix. I'm lazy...

It can work out for itself
whether it's maildir or just a directory of files. If you need to train
an arbitrary selection of files, you could symlink them into a
temporary directory.

Not if the user you want to train as can't read the real files. Symlinks don't confer permission to read their targets (that would be very bad.)

If you run spamd it's also possible to train via
spamc.

Yes. IF you run spamd and it's how your system-wide SA filtering is done already, that's arguably the best way to do ad hoc (re)training since you can be sure it's hitting the right DB and you can feed it in parallel.
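
A rough sketch of what parallel training through spamc might look like follows. Treat the details as assumptions: `train_via_spamc` and the `SPAMC` override are invented names, `spamc -L ham` is the learn option as I understand it, spamd has to be started with learning allowed (--allow-tell, as far as I know), and `-0`/`-P` need a GNU or BSD xargs.

```shell
# train_via_spamc CLASS JOBS   (hypothetical helper)
# Reads message file names from stdin, one per line, and feeds each message
# to spamd via "spamc -L CLASS", running up to JOBS clients at a time.
# CLASS is one of the learn types spamc accepts, e.g. ham or spam.
# SPAMC may be set to an alternate client path; it defaults to "spamc".
train_via_spamc() {
    class=$1
    jobs=$2
    tr '\n' '\0' |
        SPAMC=${SPAMC:-spamc} CLASS=$class \
        xargs -0 -n 1 -P "$jobs" sh -c '"$SPAMC" -L "$CLASS" < "$1"' _
}

# Assumed usage, with $HAMS as in the script above:
#   printf '%s\n' $HAMS | train_via_spamc ham 4
```

As with the per-message loop, the files are opened by the calling user and only their contents travel to spamd, so the daemon's user needs no access to the Maildirs.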

Personally I'd avoid the unforced use of mbox around Bayes without
being sure that "From-escaping" is taken into account. The problem is
that formail will replace a "From" at the beginning of a body line with
">From" which changes the msgid hash and prevents the correct
retraining of mail that was trained without going through formail -
e.g. the correction of autotraining.

An excellent point, which I had not considered. I'm mildly surprised that sa-learn doesn't s/^>From /From / each message when disassembling the mbox, but only mildly. It seems I've got a script to fix...

I just had a quick look and I can't see any support for this in
SpamAssassin. It's not a major problem, but in this case it's an easily
avoidable one.

Yes. Only a small fraction of messages need the escaping at all, but it's enough to not use formail & mbox.
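
To make the escaping problem concrete, here is a small self-contained demonstration. It does not call formail or sa-learn: the sed commands merely stand in for the escaping described above and for a naive reversal of it, and the sample message is invented. It only shows that the bytes of the message change, which is enough to change any hash computed over them.

```shell
# Invented sample message with a body line that begins with "From ".
msg=$(printf 'Subject: test\n\nHello,\nFrom here on it differs.\n')

# Stand-in for mbox From-escaping of body lines.
escaped=$(printf '%s\n' "$msg" | sed 's/^From />From /')

# Naive reversal. It cannot tell an escaped line from one that
# originally began with ">From ", so it is not safe in general.
restored=$(printf '%s\n' "$escaped" | sed 's/^>From /From /')

orig_sum=$(printf '%s\n' "$msg" | cksum)
esc_sum=$(printf '%s\n' "$escaped" | cksum)
rest_sum=$(printf '%s\n' "$restored" | cksum)

[ "$orig_sum" != "$esc_sum" ] && echo "escaping changed the message bytes"
[ "$orig_sum" = "$rest_sum" ] && echo "naive unescaping restored this message"
```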

There's also the option of using inherited ACLs on Maildirs if they are supported on the filesystem being used.
