Re: re-learning ? was - bayes - large message

John Hardin Sat, 20 Apr 2013 14:55:25 -0700

On Sat, 20 Apr 2013, Joe Acquisto-j4 wrote:

On 4/20/2013 at 2:00 PM, John Hardin <jhar...@impsec.org> wrote:

On Sat, 20 Apr 2013, Joe Acquisto-j4 wrote:

In order to send the samples, the user will forward the messages, as an
attachment.  Each is an individual message to either ham or spam, with
the (hopefully) correct attachment.


Are you extracting the attachments off those messages to feed to sa-learn?
Or are you feeding in the entire forwarded message including the
attachment?

If the latter, you're training stuff you shouldn't be (the headers of the
submission to the training folders) and you'll see every user's submission
of the same multi-recipient spam as being learned separately.

This is one reason it's better, if possible, to have global training
folders that users can just move/copy messages into. If training
submissions pass through your mail system again, things get complicated.

--
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/

. . .

Well, err . . . umm.

Looks as if I misunderstood something here. I thought it was OK to forward as an attachment and SA/Bayes would "figure it out".

SA is smart enough to strip the markup it has added to the spam message when it was scanned, which can include wrapping the original message as an attachment rather than modifying it. It's not smart enough to figure out that the message you are really interested in is attached to the message you're giving it absent that markup.

This is a pretty common situation. I've added a feature request for an --attachment option for sa-learn, to tell it to learn from an attachment rather than the entire message.

I did think that curious, but, hey, what do I know? That's obvious now. . . Anyway it made it easier for me to feed bayes that way.
Too bad it does not work.

Looks like I gotta learn to Samba.

Oh, no - I didn't mean to imply copying to a shared *file* folder for training. What I meant by "if possible" is if you have server-side message storage and your users are reading their mail via IMAP or are using a mail server (e.g. MS Exchange) that allows you to set up a single shared server-side *mail* folder that everyone can access. (Well, two folders, one for training misclassified spam and the other for training misclassified ham.)

If you are using IMAP rather than POP and you want to offer greater privacy to your users, you can set up per-user ham and spam training folders in their local mail directories, have them copy messages to those mail folders for training, and your training script can just directly read those folders.

However, your using forwarding suggests that you're not using server-side storage.

You're *almost* there with your current setup. Rather than learning Samba, you need to learn some mail tools that will let you step through the messages in your existing training folder, pull out the RFC822 attachment from each and feed that to sa-learn.

It might be easier to use two separate scripts: one that processes a mail folder full of forwarded messages, extracts the attached message from each and adds that to a second mail folder, and another (very simple) script to learn from that second folder. You already have the second script, so you just need to write the first one.
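
As a rough illustration of that first script, here is a minimal sketch in Python using only the standard library's mailbox and email modules. It assumes both training folders are stored in mbox format; the function names and file paths are made up for the example, so adjust them to your own layout. It takes only the outermost message/rfc822 attachment from each forwarded message and does not look inside the attached message itself.

```python
import mailbox


def extract_attached_messages(msg):
    """Return the message/rfc822 attachments found directly in msg.

    The search stops at each attached message, so anything nested
    inside the original (e.g. a bounce it carried) is left alone.
    """
    found = []
    if msg.get_content_type() == "message/rfc822":
        payload = msg.get_payload()
        # A parsed message/rfc822 part holds a one-element list
        # containing the attached message.
        if isinstance(payload, list) and payload:
            found.append(payload[0])
        return found
    if msg.is_multipart():
        for part in msg.get_payload():
            found.extend(extract_attached_messages(part))
    return found


def copy_attachments(src_path, dst_path):
    """Copy every attached message from the mbox at src_path into the
    mbox at dst_path. Returns the number of messages copied."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # created if it does not exist
    count = 0
    try:
        for forwarded in src:
            for original in extract_attached_messages(forwarded):
                dst.add(original)
                count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    n = copy_attachments("forwarded-spam.mbox", "extracted-spam.mbox")
    print(f"extracted {n} message(s)")
```

Forwarded messages with no message/rfc822 part (for example, ones forwarded inline) are skipped silently, so it may be worth comparing the returned count with the number of submissions. If anything else can write to the destination folder while this runs, the script should also call dst.lock() before adding and dst.unlock() afterwards.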

I suppose I should clear bayes and start over, then?

Yes, but you don't need to discard your existing corpora of messages that your users have submitted (assuming you kept them).

Once you have your training script extracting the attachments and have reset the bayes database, just drop your entire corpus of forwarded learn-as-spam messages back into the spam training folder and the entire corpus of forwarded learn-as-ham messages back into the ham training folder and you should be good.
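
If you would rather drive the reset and re-learn by hand than wait for the training script, the sequence can be sketched roughly as below. This is only an illustration: it assumes the attached messages have already been extracted into two mbox files (the paths are placeholders), that sa-learn is on the PATH and run as the user that owns the bayes database, and the function names are invented for the example. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def relearn_commands(spam_mbox, ham_mbox):
    """Build the sa-learn invocations: wipe the bayes database, then
    learn the spam and ham corpora from mbox files."""
    return [
        ["sa-learn", "--clear"],
        ["sa-learn", "--spam", "--mbox", spam_mbox],
        ["sa-learn", "--ham", "--mbox", ham_mbox],
    ]


def relearn(spam_mbox, ham_mbox, dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    for argv in relearn_commands(spam_mbox, ham_mbox):
        if dry_run:
            print(shlex.join(argv))
        else:
            # check=True stops the sequence if any step fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    relearn("extracted-spam.mbox", "extracted-ham.mbox")
```

Note that --clear discards everything bayes has learned so far, so review the dry-run output before calling relearn(..., dry_run=False).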


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  My sidearm is a piece of emergency equipment. It absolutely must
  be reliable, not "smart".
-----------------------------------------------------------------------
 3 days until Max Planck's 155th birthday
