On Sat, 20 Apr 2013, Joe Acquisto-j4 wrote:
On 4/20/2013 at 2:00 PM, John Hardin <jhar...@impsec.org> wrote:
On Sat, 20 Apr 2013, Joe Acquisto-j4 wrote:
In order to send the samples, the user will forward the messages, as an
attachment. Each is an individual message to either ham or spam, with
the (hopefully) correct attachment.
Are you extracting the attachments off those messages to feed to sa-learn?
Or are you feeding in the entire forwarded message including the
attachment?
If the latter, you're training stuff you shouldn't be (the headers of the
submission to the training folders) and you'll see every user's submission
of the same multi-recipient spam as being learned separately.
This is one reason it's better, if possible, to have global training
folders that users can just move/copy messages into. If training
submissions pass through your mail system again, things get complicated.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
. . .
Well, err . . . umm.
Looks as if I misunderstood something here. I thought it was OK to
forward, as an attachment and SA/Bayes would "figure it out".
SA is smart enough to strip the markup it has added to the spam message
when it was scanned, which can include wrapping the original message as
an attachment rather than modifying it. It's not smart enough to figure
out that the message you are really interested in is attached to the
message you're giving it absent that markup.
This is a pretty common situation. I've added a feature request for an
--attachment option for sa-learn, to tell it to learn from an attachment
rather than the entire message.
I did think that curious, but, hey, what do I know? That's obvious now
. . . Anyway it made it easier for me to feed bayes that way.
Too bad it does not work.
Looks like I gotta learn to Samba.
Oh, no - I didn't mean to imply copying to a shared *file* folder for
training. What I meant by "if possible" is if you have server-side message
storage and your users are reading their mail via IMAP or are using a mail
server (e.g. MSExchange) that allows you to set up a single shared
server-side *mail* folder that everyone can access. (Well, two folders,
one for training misclassified spam and the other for training
misclassified ham.)
If you are using IMAP rather than POP and you want to offer greater
privacy to your users, you can set up per-user ham and spam training
folders in their local mail directories, have them copy messages to those
mail folders for training, and your training script can just directly read
those folders.
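
A rough sketch of what "directly read those folders" could look like,
assuming Maildir storage under a common home root. The layout
(<home_root>/<user>/Maildir) and the folder names .LearnSpam and .LearnHam
are made up for illustration; adjust them to whatever your IMAP server
actually uses. It only builds the sa-learn command lines, so you can
inspect them before running anything.

```python
from pathlib import Path


def training_commands(home_root, spam_folder=".LearnSpam", ham_folder=".LearnHam"):
    """Build one sa-learn command per non-empty per-user training directory.

    Assumed (hypothetical) layout: <home_root>/<user>/Maildir/<folder>/{cur,new}
    """
    commands = []
    for maildir in sorted(Path(home_root).glob("*/Maildir")):
        for flag, folder in (("--spam", spam_folder), ("--ham", ham_folder)):
            # Check both Maildir subdirectories that can hold delivered mail.
            for sub in ("cur", "new"):
                msg_dir = maildir / folder / sub
                if msg_dir.is_dir() and any(msg_dir.iterdir()):
                    commands.append(["sa-learn", flag, str(msg_dir)])
    return commands
```

Each returned list can be handed to subprocess.run() as-is, under whichever
account owns the Bayes database.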
However, your use of forwarding suggests that you're not using server-side
storage.
You're *almost* there with your current setup. Rather than learning Samba,
you need to learn some mail tools that will let you step through the
messages in your existing training folder, pull out the RFC822 attachment
from each and feed that to sa-learn.
It might be easier to use two separate scripts: one that processes a mail
folder full of forwarded messages, extracts the attached message from each
and adds that to a second mail folder, and another (very simple) script to
learn from that second folder. You already have the second script, so you
just need to write the first one.
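
For what it's worth, here is a minimal sketch of that first script using
Python's standard mailbox and email modules. It assumes both folders are
mbox files (the paths are whatever you pass in), and the function names are
my own. It takes only the outermost message/rfc822 attachment from each
forwarded message, so anything nested inside the original stays with the
original. It does not lock the mailboxes, so run it while nothing else is
delivering to them.

```python
import mailbox


def attached_messages(msg):
    """Yield the outermost message/rfc822 attachments contained in msg."""
    if msg.get_content_type() == "message/rfc822":
        payload = msg.get_payload()
        # A parsed message/rfc822 part holds the embedded message as a
        # one-element list.
        if isinstance(payload, list) and payload:
            yield payload[0]
        return  # do not descend into the attached message itself
    if msg.is_multipart():
        for part in msg.get_payload():
            yield from attached_messages(part)


def extract_folder(src_path, dst_path):
    """Append every attached message found in mbox src_path to mbox dst_path.

    Returns the number of messages written.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        for forwarded in src:
            for original in attached_messages(forwarded):
                dst.add(original)
                count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

You would run extract_folder() once for the spam submissions and once for
the ham submissions, then point your existing learning script at the two
output folders.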
I suppose I should clear bayes and start over, then?
Yes, but you don't need to discard your existing corpora of messages that
your users have submitted (assuming you kept them).
Once you have your training script extracting the attachments and have
reset the bayes database, just drop your entire corpus of forwarded
learn-as-spam messages back into the spam training folder and the entire
corpus of forwarded learn-as-ham messages back into the ham training
folder and you should be good.
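
As a rough illustration of the reset-and-retrain step, assuming the
extracted messages end up in two mbox files (the paths are placeholders you
supply): sa-learn's --clear, --spam, --ham and --mbox options are the real
ones, but the helper names below are mine. Run it as the user that owns the
Bayes database.

```python
import subprocess


def retrain_commands(spam_mbox, ham_mbox):
    """Return the sa-learn invocations for a clean retrain, in order."""
    return [
        ["sa-learn", "--clear"],                        # wipe the Bayes database
        ["sa-learn", "--spam", "--mbox", spam_mbox],    # relearn extracted spam
        ["sa-learn", "--ham", "--mbox", ham_mbox],      # relearn extracted ham
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Printing the result of retrain_commands() before calling run_all() is a
cheap way to double-check the paths, since --clear cannot be undone.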
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
My sidearm is a piece of emergency equipment. It absolutely must
be reliable, not "smart".
-----------------------------------------------------------------------
3 days until Max Planck's 155th birthday