Re: Bayes FP/FN Training Procedures

Louis LeBlanc 6 Jan 2005 15:13:38 -0000

On 01/06/05 08:41 AM, Jeff Koch sat at the `puter and typed:
> 
> Has anyone come up with a script or method that would allow users to 
> forward their false positive and false negative emails back to an address 
> on the mailserver where they can be used to train the Bayes database. I 
> understand that Bayes needs the email in its original format so the script 
> has to strip off the forwarding enclosure.
> 
> Thanks in advance.


Cool idea.  I have one that allows a user to send an email with a list
of addresses to whitelist or blacklist.  They send it to their own
address with a +whitelist or +blacklist extension.  Frinstance, I
could send to [EMAIL PROTECTED] and whitelist an
address.  Naturally, it requires a password in there as well, but it
works.  This really only boils down to a procmail recipe at the server
end, but I did write a quick mutt macro that uses formail to parse the
From address out of the message and autosend it using a script with
about 20 lines of Perl code.  It also assumes your MTA can handle
plussed folders, but this can be worked around with a subject scan or
something similar.

I wonder if the same thing could work with this idea.  One would have
to be careful what was passed into bayes.  Anyone know exactly what
and how this would need to be encapsulated?  I'm guessing it would
require some perlish at the server end to be called from procmail, but
it would have to be encapsulated carefully at the client end to avoid
piping the encapsulation headers through the learner.
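
For what it's worth, here is a minimal sketch of the server-side unwrapping step, in Python rather than Perl, using only the standard `email` package.  It assumes the client forwards the message as an attachment (a `message/rfc822` part); an inline forward has already lost the original headers, so there is nothing to recover in that case.  The function name is made up for illustration, and the re-serialized message is not guaranteed to be byte-identical to what originally arrived.

```python
from email import message_from_bytes, policy


def extract_forwarded_originals(raw: bytes) -> list[bytes]:
    """Return every message/rfc822 attachment found in a forwarded mail.

    Only the attached originals are returned; the wrapper's own headers
    and cover text are dropped, so they never reach the learner.
    """
    wrapper = message_from_bytes(raw, policy=policy.default)
    originals: list[bytes] = []

    def collect(part) -> None:
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the embedded message object.
            originals.append(part.get_payload(0).as_bytes())
            return  # do not descend into the original itself
        if part.is_multipart():
            for sub in part.get_payload():
                collect(sub)

    collect(wrapper)
    return originals
```

Each returned item could then be handed to whatever does the learning.  A wrapper with no attached message yields an empty list, which is a reasonable cue to bounce the mail back with instructions.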

Just because it's remotely relevant, I use maildir now with my mail
server.  This allows easy confirmation of spam by providing a
different subdirectory for new and read email.  So anything in the
.../cur directory is marked as read, and anything read in the spam
folder is treated as confirmed spam.  Autolearned spam goes into a
different folder altogether.  In my years with SA, autolearning has
had a 0% FP rate, so I don't feel I even have to check that folder
anymore.

I wrote a script that uses Mail::SpamAssassin to parse the confirmed
spam, then move it to a spamdump folder.  I did some shameless
borrowing from sa-learn, giving credit in the script, of course.  By
default, the spamdump is recreated each month, leaving the old one to
be purged at the user's will.  I made the script's configuration very
flexible, so you can pretty much configure anything of consequence.

The reason I did this is that I wanted to be able to confirm spam and
have it learned as spam, then moved away.  The configuration uses a
list of directories expected to contain confirmed spam.

I also wanted to have autolearned spam moved out without trying to
relearn it.  This is done with another list of directories, containing
autolearned spam.  I wanted to include both read and unread
autolearned spam - remember, I'm getting 100% accuracy in this set -
so I simply included both directories in the list.
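
To make the two lists concrete, here is a rough Python sketch (not the Perl script itself) of how they might drive the run: messages from the confirmed list are learned and then moved, while messages from the autolearned list are only moved.  The names `Step` and `plan_spam_steps` are invented for illustration, and the actual learning and moving are left out.

```python
from pathlib import Path
from typing import Iterable, Iterator, NamedTuple


class Step(NamedTuple):
    message: Path        # file to move into the spamdump
    learn_as_spam: bool  # True only for confirmed spam


def plan_spam_steps(confirmed_dirs: Iterable[Path],
                    autolearned_dirs: Iterable[Path]) -> Iterator[Step]:
    """Yield one Step per message file, confirmed spam first."""
    for dirs, learn in ((confirmed_dirs, True), (autolearned_dirs, False)):
        for directory in dirs:
            for entry in sorted(Path(directory).iterdir()):
                if entry.is_file():  # skip any nested directories
                    yield Step(entry, learn)
```

Listing both the new and cur subdirectories of the autolearned folder in `autolearned_dirs` gives the "read and unread" behaviour described above.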

Naturally, it will also use a list of directories that contain
confirmed ham, and learn them as such, but these will be left where
they are.  No good hiding the user's real mail, right?  At some point I
hope to keep track of the last time the script was run and use that
here to parse only files with a last mod or create time since the last
run.  Whether that approach is better than just rechecking all of them
may be debatable.
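
The "since the last run" idea could look something like the following Python sketch.  It assumes a small stamp file holding the time of the previous run; the file layout and the function names are hypothetical, and it compares modification time only.

```python
from pathlib import Path


def read_last_run(stamp: Path) -> float:
    """Return the recorded time of the previous run, or 0.0 if none."""
    try:
        return float(stamp.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0.0  # first run, or unreadable stamp: scan everything


def record_run(stamp: Path, when: float) -> None:
    """Store the start time of this run for the next one to read."""
    stamp.write_text(f"{when}\n")


def files_changed_since(directory: Path, since: float) -> list[Path]:
    """List message files whose mtime is later than `since`."""
    return sorted(p for p in directory.iterdir()
                  if p.is_file() and p.stat().st_mtime > since)
```

Recording the time the run started, rather than the time it finished, avoids missing mail delivered while the script was working.  One thing to check first is whether moving a message between folders updates its mtime on your system; if it doesn't, a message filed into the ham folder long after delivery would be skipped.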

There is a configuration switch to autoreport all learned spam.  This
is off by default, and I haven't used it yet.

Once a month (when the new spamdump is created) the script will force
a sync and expire.  This can be done every time the script runs by
turning on a config switch.

Anyone interested in checking it out to provide feedback?  There are a
couple things that might be considered downsides or TODO items:

* The configuration method is a bit technical (has to be valid perl),
  but it's pretty powerful if you use your imagination.  At some
  point, I hope to find a way to do configuration through the
  Mail::SpamAssassin::Conf module for consistency, but I'm not sure how
  it will handle list definition, or even if that module was written
  to be used by other scripts.

* It is limited to directory based mail, no mbox or mbx files - it was
  written solely with maildir in mind.

* New spam archive folders are created with a system call - to
  maildirmake by default, but that can be changed to a mkdir -p
  command if necessary.  I've done a quick scan for a perl module to
  create the maildir, but haven't found one yet.  Courier IMAP doesn't
  have one, it uses a C/C++ utility to do it.

* Just because a file winds up in the confirmed spam directory doesn't
  guarantee it will be learned, but it will be scanned.  It isn't
  uncommon to see a message come through that has enough in common
  with a message already learned as spam to be skipped.  The script
  doesn't forget and relearn by default, so it might not catch the
  case of an autolearned FN.  To do this, I may need to duplicate the
  Mail::SpamAssassin::ArchiveIterator object and use one to forget all
  messages, then use the other to relearn them as spam.  I haven't
  found a way to tell Mail::SpamAssassin->learn() to force a relearn
  yet.

* There's a LOT of commentary in the script, but it's not a real POD
  yet.
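
On the maildirmake item: for a plain maildir the job amounts to creating three subdirectories, so the system call may be avoidable.  A minimal Python sketch of the idea follows (the same few lines translate directly to Perl).  The monthly `spamdump-YYYY-MM` naming is purely an assumption for illustration, and this does not cover anything extra that Courier's maildirmake may set up.

```python
from datetime import date
from pathlib import Path


def spamdump_name(today: date) -> str:
    """Hypothetical naming scheme: one archive folder per month."""
    return f"spamdump-{today:%Y-%m}"


def make_maildir(path: Path) -> Path:
    """Create path/tmp, path/new and path/cur if they are missing."""
    for sub in ("tmp", "new", "cur"):
        # mode applies to the leaf directory; parents get default perms
        (path / sub).mkdir(mode=0o700, parents=True, exist_ok=True)
    return path
```

Calling `make_maildir(base / spamdump_name(date.today()))` on every run is harmless, since existing directories are left alone, and the month rollover then needs no special handling.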

There's still quite a bit to do, but it's been working great on my
system for about a week now.  I have the verbosity turned up a bit,
and the nightly crons send me the output.  So far so good.  I hope to
make it worthy of submission to the SA project, but it still requires
some work.

Lou
-- 
Louis LeBlanc          [EMAIL PROTECTED]
Fully Funded Hobbyist, KeySlapper Extrordinaire :)
http://www.keyslapper.org

Not one hundred percent efficient, of course ... but nothing ever is.
    -- Kirk, "Metamorphosis", stardate 3219.8
