Andrew Sykes wrote:
> Hi,
>
> I'm writing some code to integrate SpamAssassin with Apache JAMES.
>
> I want to setup an address to allow me to pipe spam into sa-learn. I
> have a prototype of this working fine, but would like to allow various
> webmail client users to be able to forward spam messages to this
> address.
>
> As I have very limited understanding of how SA works, I don't want to
> end up blocking the forwarding addresses.
>
> If I whitelist the forwarding addresses, can I then simply pipe a
> forwarded spam from that address into sa-learn or is there more to it?
>

There's MUCH more to it. In fact, whitelisting won't really affect what sa-learn does at all.

Generally speaking, forwarded messages are mostly useless to sa-learn. Exactly how useless depends a bit on the mail client.

SA tokenizes MANY mail headers, including Received:, not just From: and To:. All the headers in a forwarded message are completely new, so sa-learn will be learning the headers generated by the forwarding, not those of the spam.

SA also tokenizes the body of the message. However, most mail clients substantially modify the body when you forward. Generally speaking, they preserve only one of the MIME sections of a multipart/alternative message. Spammers FREQUENTLY use text/plain sections that are dissimilar from the text/html. By forwarding you're losing all but one MIME section (generally text/html is kept). On top of this, most mail clients also insert "Forwarded message:" type text into the body and add Fwd: to the subject.

SA also tokenizes the in-body MIME headers describing how the message was encoded. However, when you forward, the mail client doing the forward re-encodes things its own way. What might have been base64 encoded may now be quoted-printable, 8 bit, or 7 bit.

So, fundamentally, as far as Bayes is concerned, the forwarded message is a completely different message from the original spam.

You can try this sometime by taking an original spam and a forwarded version of it and feeding them both to spamassassin or sa-learn with "-D bayes" added. This will cause the debug output to list all the tokens used. Take a look at the tokens. Some are the same, but many are different.
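
If you want a feel for the effect without a SpamAssassin install handy, here is a rough Python sketch. To be clear about the assumptions: crude_tokens() is a made-up stand-in and NOT SA's real Bayes tokenizer, the addresses and message text are invented, and build_forwarded() only imitates what a typical client might do (new headers, Fwd: subject, HTML part only, different transfer encoding). It just shows how few tokens survive a forward.

```python
from email.message import EmailMessage


def crude_tokens(msg):
    """Hypothetical, very rough token extractor (not SpamAssassin's)."""
    tokens = set()
    # Header tokens: one per whitespace-separated word, tagged with the header name.
    for name, value in msg.items():
        for word in str(value).split():
            tokens.add(f"H{name}:{word}")
    # Per-part tokens: MIME type, transfer encoding, and decoded body words.
    for part in msg.walk():
        if part.is_multipart():
            continue
        tokens.add(f"MIME:{part.get_content_type()}")
        tokens.add(f"CTE:{part.get('Content-Transfer-Encoding', '7bit')}")
        payload = part.get_payload(decode=True) or b""
        for word in payload.decode("utf-8", errors="replace").split():
            tokens.add(word.lower())
    return tokens


def build_original():
    """An invented spam with mismatched text/plain and text/html parts."""
    msg = EmailMessage()
    msg["From"] = "spammer@example.net"
    msg["To"] = "victim@example.com"
    msg["Subject"] = "Cheap pills"
    msg["Received"] = "from relay.example.net by mx.example.com"
    msg.set_content("innocent looking text about weather")
    msg.add_alternative(
        "<html><body>buy cheap pills now</body></html>", subtype="html"
    )
    return msg


def build_forwarded(original):
    """Imitate a client forward: new headers, HTML part only, re-encoded."""
    html = original.get_body(preferencelist=("html",)).get_content()
    msg = EmailMessage()
    msg["From"] = "user@example.com"
    msg["To"] = "spam-learn@example.com"
    msg["Subject"] = "Fwd: " + original["Subject"]
    msg.set_content(
        "---- Forwarded message ----\n" + html, subtype="html", cte="base64"
    )
    return msg


original = build_original()
orig_tokens = crude_tokens(original)
fwd_tokens = crude_tokens(build_forwarded(original))

shared = orig_tokens & fwd_tokens
only_orig = orig_tokens - fwd_tokens
only_fwd = fwd_tokens - orig_tokens

print("shared:            ", sorted(shared))
print("only in original:  ", sorted(only_orig))
print("only in forwarded: ", sorted(only_fwd))
```

In this toy case the Received: tokens, the whole text/plain part and the original encoding all disappear, and the forward adds its own headers and "Forwarded message" text. The "-D bayes" debug output remains the real authority on what SA actually sees.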