I have been using SpamAssassin for a while now and must say it really is an extremely fine piece of software. I have a question about sa-learn. I run SA on a mail gateway. I intend to write a shell script that lets me send an email to the mail gateway with a command, e.g. 'learn', and the text of the spam mail copy-pasted into the body of the mail. What exactly does SA learn? Does it learn only what's in the body, or also the header contents?
Both body and header are broken down into "tokens" and those tokens are learned individually.
In body text, a token is usually just a word. In headers, it gets a little more complicated.
SA 2.6 stores the tokens in a way you can dump them in human readable form. 3.0 stores SHA hashes of those tokens for privacy and speed reasons...
An example header token dumped from 2.6x would be: 0.978 2 0 1108156490 N:H*M:NNANNNN
Which matches a Message-Id header (H means header token, and *M: is shorthand for Message-Id). It is looking for one that contains 2 digits, the letter 'A', and 4 digits.
The first number is the "spam probability" of the token, in this case 97.8%. The 2 is the number of times it was seen in learned spam, and 0 is the number of times it was seen in learned ham.
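To make the field layout concrete, here is a minimal Python sketch that splits one dump line into its parts. The names DumpEntry and parse_dump_line are made up for illustration, and the fourth field is assumed to be a Unix timestamp, since it isn't explained above.

```python
from typing import NamedTuple


class DumpEntry(NamedTuple):
    probability: float  # "spam probability" of the token, 0.0 to 1.0
    spam_count: int     # times seen in learned spam
    ham_count: int      # times seen in learned ham
    timestamp: int      # assumption: a Unix timestamp
    token: str          # the token text itself


def parse_dump_line(line: str) -> DumpEntry:
    # Split on whitespace at most 4 times so the token stays in one piece.
    prob, spam, ham, stamp, token = line.split(None, 4)
    return DumpEntry(float(prob), int(spam), int(ham), int(stamp), token.strip())


entry = parse_dump_line("0.978 2 0 1108156490 N:H*M:NNANNNN")
print(f"{entry.token}: {entry.probability:.1%} spam, "
      f"seen in {entry.spam_count} spam / {entry.ham_count} ham")
```

The same function should work unchanged on body-token lines, since they use the same column order.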
Some header tokens don't have any of the fancy encoding, and are more readable: 0.958 1 0 1108285937 HLocation:battelle
Which looks for a "Location:" header containing "battelle".
A sample body token would be: 0.018 36 90 1108415529 minimize
Body tokens can also use some of the same "N = number" encodings as headers, but you get the idea..
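The "N = number" idea can be illustrated roughly like this in Python. This is only a sketch of the concept, not SA's actual tokenizer; mask_digits and the sample strings are made up.

```python
import re


def mask_digits(text: str) -> str:
    # Replace every decimal digit with 'N'; all other characters stay as they are.
    return re.sub(r"[0-9]", "N", text)


# Two different (hypothetical) values collapse to the same pattern.
print(mask_digits("12A3456"))
print(mask_digits("98A7654"))
```

Both calls yield NNANNNN, which is why one learned token can match many messages whose IDs differ only in their digits.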
SA 3.x uses the same tokens, but dumps them into SHA1, so you just get numeric gibberish. You can't tell what the token is, but you can tell if another token is the same.
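A quick Python sketch of why hashed tokens can still be compared but not read. It uses the standard hashlib module; token_digest is a made-up helper, and SA's real storage format may well differ from a plain hex digest.

```python
import hashlib


def token_digest(token: str) -> str:
    # SHA1 of the token text, shown as hex. One-way: the word can't be read back out.
    return hashlib.sha1(token.encode("utf-8")).hexdigest()


a = token_digest("minimize")
b = token_digest("minimize")
c = token_digest("minimise")

print(a == b)  # the same token always gives the same digest
print(a == c)  # a different token gives a different digest
```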