On Sat, 20 Apr 2013, Joe Acquisto-j4 wrote:

On 4/20/2013 at 2:00 PM, John Hardin <jhar...@impsec.org> wrote:
On Sat, 20 Apr 2013, Joe Acquisto-j4 wrote:

In order to send the samples, the user will forward the messages, as an
attachment.  Each is an individual message to either ham or spam, with
the (hopefully) correct attachment.

Are you extracting the attachments off those messages to feed to sa-learn?
Or are you feeding in the entire forwarded message including the
attachment?

If the latter, you're training stuff you shouldn't be (the headers of the
submission to the training folders) and you'll see every user's submission
of the same multi-recipient spam as being learned separately.

This is one reason it's better, if possible, to have global training
folders that users can just move/copy messages into. If training
submissions pass through your mail system again, things get complicated.

--
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
. . .

Well, err . . . umm.

Looks as if I misunderstood something here. I thought it was OK to forward, as an attachment and SA/Bayes would "figure it out".

SA is smart enough to strip the markup it has added to the spam message when it was scanned, which can include wrapping the original message as an attachment rather than modifying it. It's not smart enough to figure out that the message you are really interested in is attached to the message you're giving it absent that markup.

This is a pretty common situation. I've added a feature request for an --attachment option for sa-learn, to tell it to learn from an attachment rather than the entire message.

I did think that curious, but, hey, what do I know? That's obvious now . . . Anyway it made it easier for me to feed bayes that way.

Too bad it does not work.

Looks like I gotta learn to Samba.

Oh, no - I didn't mean to imply copying to a shared *file* folder for training. What I meant by "if possible" is if you have server-side message storage and your users are reading their mail via IMAP or are using a mail server (e.g. MSExchange) that allows you to set up a single shared server-side *mail* folder that everyone can access. (Well, two folders, one for training misclassified spam and the other for training misclassified ham.)

If you are using IMAP rather than POP and you want to offer greater privacy to your users, you can set up per-user ham and spam training folders in their local mail directories, have them copy messages to those mail folders for training, and your training script can just directly read those folders.

However, your use of forwarding suggests that you're not using server-side storage.

You're *almost* there with your current setup. Rather than learning Samba, you need to learn some mail tools that will let you step through the messages in your existing training folder, pull out the RFC822 attachment from each and feed that to sa-learn.

It might be easier to use two separate scripts: one that processes a mail folder full of forwarded messages, extracts the attached message from each and adds that to a second mail folder, and another (very simple) script to learn from that second folder. You already have the second script, so you just need to write the first one.
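
As a rough sketch of that first script, assuming your training folders are mbox files and that each forward carries the original as a message/rfc822 part, something like the following Python might do. The function names and file names are made up for illustration; it only uses the standard-library mailbox and email modules, and I haven't run it against your mail.

```python
import mailbox
from email.message import Message


def extract_attached_messages(msg: Message) -> list:
    """Return the messages attached to msg as message/rfc822 parts.

    Does not descend into an attached message, so a forward nested
    inside the attachment is left alone as part of the original.
    """
    if msg.get_content_type() == "message/rfc822":
        payload = msg.get_payload()
        # The parser stores the attached message as a one-element list.
        return list(payload) if isinstance(payload, list) else []
    found = []
    if msg.is_multipart():
        for part in msg.get_payload():
            found.extend(extract_attached_messages(part))
    return found


def copy_attachments(src_path: str, dst_path: str) -> int:
    """Append every attached message found in src_path to dst_path.

    Both paths are assumed to be mbox files. Returns the number of
    messages written to dst_path.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # created if it does not exist yet
    count = 0
    try:
        for forwarded in src:
            for original in extract_attached_messages(forwarded):
                dst.add(original)
                count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

Calling copy_attachments("forwarded-spam.mbox", "extracted-spam.mbox") (hypothetical file names) would fill the second folder with just the original messages, and your existing learning script can then read that one. Forwards that were sent inline rather than as an attachment produce nothing here, so the returned count is worth comparing against the number of submissions.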

I suppose I should clear bayes and start over, then?

Yes, but you don't need to discard your existing corpora of messages that your users have submitted (assuming you kept them).

Once you have your training script extracting the attachments and have reset the bayes database, just drop your entire corpus of forwarded learn-as-spam messages back into the spam training folder and the entire corpus of forwarded learn-as-ham messages back into the ham training folder and you should be good.
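
For the reset-and-relearn step, here is a minimal sketch in Python that just wraps sa-learn. It assumes the extracted originals sit in two mbox files (the names passed in are hypothetical) and that it runs as the user who owns the Bayes database. The --clear, --spam, --ham and --mbox options are standard sa-learn options; everything else is illustration.

```python
import subprocess


def retrain_commands(spam_mbox: str, ham_mbox: str) -> list:
    """Build the sa-learn invocations for a full reset and retrain."""
    return [
        ["sa-learn", "--clear"],                      # wipe the Bayes database
        ["sa-learn", "--spam", "--mbox", spam_mbox],  # relearn extracted spam
        ["sa-learn", "--ham", "--mbox", ham_mbox],    # relearn extracted ham
    ]


def retrain(spam_mbox: str, ham_mbox: str, dry_run: bool = True) -> None:
    """Print the commands, or run them in order when dry_run is False."""
    for cmd in retrain_commands(spam_mbox, ham_mbox):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
```

With dry_run left at its default, retrain("extracted-spam.mbox", "extracted-ham.mbox") only prints the three commands, which is a cheap way to check the paths before actually clearing anything.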

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  My sidearm is a piece of emergency equipment. It absolutely must
  be reliable, not "smart".
-----------------------------------------------------------------------
 3 days until Max Planck's 155th birthday
