Andrew Sykes wrote:
> Hi,
>
> I'm writing some code to integrate SpamAssassin with Apache JAMES.
>
> I want to setup an address to allow me to pipe spam into sa-learn. I
> have a prototype of this working fine, but would like to allow various
> webmail client users to be able to forward spam messages to this
> address.
>
> As I have very limited understanding of how SA works, I don't want to
> end up blocking the forwarding addresses.
>
> If I whitelist the forwarding addresses, can I then simply pipe a
> forwarded spam from that address into sa-learn or is there more to it?
>   

There's MUCH more to it. In fact, whitelisting won't really affect what
sa-learn does at all.

Generally speaking, forwarded messages are mostly useless to sa-learn.
Exactly how useless depends a bit on the mail client.

SA tokenizes MANY mail headers, including Received:, not just From: and
To:. All the headers in a forwarded message are completely new, so
sa-learn will be learning the headers generated by the forwarding
client, not those of the spam.
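To make the header point concrete, here is a minimal sketch using
Python's standard email module. Both messages are invented sample data,
and header_pairs is a hypothetical helper, not anything SA provides. It
only shows that the two header sets have nothing in common; it does not
reproduce SA's tokenizer.

```python
from email import message_from_string

# Invented sample data: an "original" spam and a forwarded copy of it.
ORIGINAL = """\
Received: from mail.spammer.example (unknown [203.0.113.7])
From: "Cheap Pills" <sales@spammer.example>
To: victim@example.org
Subject: Best prices

Buy now
"""

FORWARDED = """\
Received: from webmail.example.org (localhost [127.0.0.1])
From: Helpful User <user@example.org>
To: spam-report@example.org
Subject: Fwd: Best prices

---- Forwarded message ----
Buy now
"""


def header_pairs(raw):
    """Return the set of (lowercased name, value) header pairs."""
    msg = message_from_string(raw)
    return {(name.lower(), value) for name, value in msg.items()}


orig = header_pairs(ORIGINAL)
fwd = header_pairs(FORWARDED)
shared = orig & fwd

print("headers shared by both messages:", sorted(shared))
print("headers only in the original:", sorted(orig - fwd))
```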

SA also tokenizes the body of the message. However, most mail clients
substantially modify the body of the message when you forward.
Generally speaking they only preserve one of the MIME sections in a
multipart/alternative message. Spammers FREQUENTLY have text/plain
sections which are dissimilar from the text/html. By forwarding you're
losing all but one MIME section (generally text/html is kept).
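A rough illustration of the lost MIME section, again with Python's
standard email module. The message content is made up, part_types is a
hypothetical helper, and the "forward" here is just an assumption about
what a typical client does (keep the text/html part and drop the rest).

```python
from email.message import EmailMessage

# Invented spam with two alternatives that say different things.
spam = EmailMessage()
spam["Subject"] = "Best prices"
spam.set_content("Innocent looking filler text about the weather.")
spam.add_alternative(
    "<html><body><b>Buy pills now</b></body></html>", subtype="html"
)


def part_types(msg):
    """List the content types of all leaf (non-container) MIME parts."""
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


# Simulated forward: keep only the HTML alternative, as many clients do.
html_part = spam.get_body(preferencelist=("html",))
forwarded = EmailMessage()
forwarded["Subject"] = "Fwd: " + str(spam["Subject"])
forwarded.set_content(html_part.get_content(), subtype="html")

print("original parts: ", part_types(spam))
print("forwarded parts:", part_types(forwarded))
```

The text/plain section, and every token it would have contributed, is
simply gone from the forwarded copy.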

On top of this, most mail clients also insert "Forwarded message:" type
text into the body, and add Fwd: to the subject.

SA also tokenizes the in-body MIME headers describing how the message
was encoded. However, when you forward, the mail client doing the
forward re-encodes things its own way. What might have been base64
encoded may now be quoted-printable, 8bit, or 7bit.
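The re-encoding effect can be sketched the same way. The body text and
the build helper below are invented for illustration; the point is only
that identical text can travel under different
Content-Transfer-Encoding headers, which changes the in-body MIME
headers without changing what a human reads.

```python
from email.message import EmailMessage

# Invented body text; the accented character forces a non-ASCII charset.
BODY = "Buy pills now, 50% off caf\u00e9 prices\n"


def build(cte):
    """Build a single-part text message using the given transfer encoding."""
    msg = EmailMessage()
    msg.set_content(BODY, cte=cte)
    return msg


original = build("base64")             # how the spammer might have sent it
forwarded = build("quoted-printable")  # how a forwarding client might resend it

print("original: ", original["Content-Transfer-Encoding"])
print("forwarded:", forwarded["Content-Transfer-Encoding"])
print("same decoded text:", original.get_content() == forwarded.get_content())
```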

So, fundamentally, as far as bayes is concerned the forwarded message is
a completely different message than the original spam.

You can try this sometime by taking an original spam and a forwarded
version of it and feeding them both to spamassassin or sa-learn with "-D
bayes" added. This will cause the debug output to list all the tokens
used. Take a look at the tokens. Some are the same, but many are different.