On 29 Dec 2015, at 13:24, RW wrote:
On Mon, 28 Dec 2015 23:42:03 -0500
Bill Cole wrote:
Using these facts, my learning script that runs as root and reads
from multiple real users' Maildirs does this to learn ham:
for AFILE in $HAMS ; do formail < "$AFILE" ; done |
  sudo -H -u "$SAUSER" sa-learn --ham --mbox
Where $HAMS is the list of ham message files and $SAUSER is the user
handling the system-wide BayesDB. I use formail there just to give
each message a leading 'From ' line (i.e. mbox format) so that the
whole bunch can be piped into a single sa-learn invocation.
IIRC when you do that sa-learn just creates a temporary file and then
runs on that.
Yes, with the advantage of using
Mail::SpamAssassin::Util::secure_tmpfile() rather than whatever I happen
to roll up in a bit of Q&D shell that I never get around to reviewing
for edge cases...
The main reason to do something like that is to avoid the heavyweight
sudo & load of a Perl script for each message.
The alternative without formail would be to pipe each raw message into
its own sa-learn.
The alternative is to give it a directory.
Sure, one can reimplement Mail::SpamAssassin::Util::secure_tmpfile
and/or Mail::SpamAssassin::Util::secure_tmpdir and use that. One can
copy files from multiple user Maildirs and maybe error out before
cleaning up or maybe forget to set perms right or maybe make some
mistake I haven't thought of.
Or, I could use a tool that's been at least nominally open to review for
many years across many versions and which stands a strong chance of
having had at least one set of more competent eyes run across it looking
for flaws to fix. I'm lazy...
It can work out for itself whether it's maildir or just a directory of
files. If you need to train an arbitrary selection of files, you could
symlink them into a temporary directory.
Not if the user you want to train as can't read the real files. Symlinks
don't confer permission to read their targets (that would be very bad.)
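For the case where the learning user can already read the targets, the
symlink staging suggested above might look roughly like the sketch
below. The function name stage_links and the msg.N link names are made
up for illustration. Note that mktemp -d creates a directory private to
its creator, so it would still have to be made accessible to the
learning user, which is exactly the kind of step that is easy to get
wrong.

```shell
#!/bin/sh
# Hypothetical helper: stage symlinks to selected message files in a
# fresh temporary directory and print that directory's name on stdout.
# The caller is responsible for removing the directory afterwards.
stage_links() {
    stage=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        # Skip anything that is not an existing regular file.
        [ -f "$f" ] || continue
        # Make relative names absolute so the link does not dangle.
        case $f in
            /*) ;;
            *) f=$PWD/$f ;;
        esac
        n=$((n + 1))
        # Number the links so same-named files from different Maildirs
        # cannot collide.
        ln -s "$f" "$stage/msg.$n" || return 1
    done
    printf '%s\n' "$stage"
}

# Possible use (not run here; assumes $SAUSER can read both the staging
# directory and the link targets):
#   dir=$(stage_links $HAMS) &&
#     sudo -H -u "$SAUSER" sa-learn --ham "$dir"
#   rm -rf "$dir"
```

The links only name the files; every read still goes through the
permissions on the real message and on each directory leading to it.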
If you run spamd it's also possible to train via
spamc.
Yes. IF you run spamd and it's how your system-wide SA filtering is
done already, that's arguably the best way to do ad hoc (re)training
since you can be sure it's hitting the right DB and you can feed it in
parallel.
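A minimal sketch of training through spamc, assuming spamd was started
with learning enabled (--allow-tell). It relies on spamc's -L option
and on the documented exit codes 5 (learned) and 6 (already learned).
The function name learn_via_spamc and the SPAMC override variable are
assumptions added here so the command can be swapped out for testing.

```shell
#!/bin/sh
# Hypothetical helper: feed each file to spamd for learning via spamc.
# Usage: learn_via_spamc ham|spam|forget file...
# Prints the number of messages that were newly learned.
learn_via_spamc() {
    class=$1
    shift
    learned=0
    for f in "$@"; do
        rc=0
        # SPAMC is an assumed override; it defaults to the real spamc.
        "${SPAMC:-spamc}" -L "$class" < "$f" > /dev/null || rc=$?
        case $rc in
            5) learned=$((learned + 1)) ;;   # message was learned
            6) ;;                            # message was already learned
            *) echo "spamc returned $rc for $f" >&2 ;;
        esac
    done
    echo "$learned"
}

# Possible use (not run here):
#   learn_via_spamc ham $HAMS
```

Each message goes in raw, one per spamc call, so no mbox wrapping is
involved and the messages can be fed from several processes at once.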
Personally I'd avoid the unforced use of mbox around Bayes without
being sure that "From-escaping" is taken into account. The problem is
that formail will replace a "From" at the beginning of a body line with
">From", which changes the msgid hash and prevents the correct
retraining of mail that was trained without going through formail,
e.g. the correction of autotraining.
An excellent point, which I had not considered. I'm mildly surprised
that sa-learn doesn't s/^>From /From / each message when disassembling
the mbox, but only mildly. It seems I've got a script to fix...
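To see which messages would actually be altered by the escaping, a
check along these lines could be used. It is only a sketch: the
function name needs_from_escaping is made up, and it assumes the usual
mbox rule that a body line starting with "From " is the one that gets
escaped.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit 0) if the message in file $1 has a
# body line starting with "From ", i.e. one that mbox escaping would
# rewrite to ">From ". Header lines are ignored.
needs_from_escaping() {
    awk '
        inbody && /^From / { found = 1; exit }
        /^$/ { inbody = 1 }     # first empty line ends the headers
        END { exit !found }
    ' "$1"
}

# Possible use (not run here): list the affected ham files.
#   for AFILE in $HAMS ; do
#       needs_from_escaping "$AFILE" && echo "$AFILE"
#   done
```

Simply reversing the escape afterwards is not quite safe either, since
a message whose body originally contained ">From " would come back
changed.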
I just had a quick look and I can't see any support for this in
SpamAssassin. It's not a major problem, but in this case it's an easily
avoidable one.
Yes. Only a small fraction of messages need the escaping at all, but
it's enough to not use formail & mbox.
There's also the option of using inherited ACLs on Maildirs if they are
supported on the filesystem being used.
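A sketch of what the ACL approach might look like with the POSIX ACL
tools (setfacl), assuming the filesystem has ACL support enabled. The
function name grant_read_acl and the SETFACL override variable are
assumptions for illustration. Whether newly delivered messages end up
readable also depends on the mode the delivery agent creates them with,
since a restrictive create mode can mask out inherited entries.

```shell
#!/bin/sh
# Hypothetical helper: give one user read access to a Maildir tree and
# make that grant the default for entries created later.
# Usage: grant_read_acl user maildir
grant_read_acl() {
    user=$1
    dir=$2
    # Existing entries: read, plus search on directories (capital X).
    "${SETFACL:-setfacl}" -R -m "u:$user:rX" "$dir" || return 1
    # Default ACL on the directories, inherited by new files and subdirs.
    "${SETFACL:-setfacl}" -R -d -m "u:$user:rX" "$dir"
}

# Possible use (not run here; the path is an example only):
#   grant_read_acl "$SAUSER" /home/alice/Maildir
```

With that in place the learning user could read messages in the
Maildirs directly, with no copying or staging step.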