Re: sa-learn does learn what exactly ?

Matt Kettler 16 Feb 2005 18:22:46 -0000

At 01:01 PM 2/16/2005, Philipp Snizek, seaan.net ag wrote:

I have used SpamAssassin for a while now and must say it really is an
extremely fine piece of software.
I have a question about sa-learn. I run SA on a mail gateway. I intend
to write a shell script that enables me to send an email to the mail
gateway with a command, e.g. 'learn' and the text of the spam mail
copy-pasted to the body of the mail.
What exactly does SA learn? Does it learn what's in the body only or
also the header contents?

Both body and header are broken down into "tokens" and those tokens are learned individually.

In body text, a token is usually just a word. In headers, it gets a little more complicated.

SA 2.6 stores the tokens in a form you can dump as human-readable text. SA 3.0 stores SHA hashes of those tokens for privacy and speed reasons.


An example header token dumped from 2.6x would be:
0.978          2          0 1108156490  N:H*M:NNANNNN

This matches a Message-Id header (H means header token; *M: is shorthand for Message-Id). It looks for one that contains 2 digits, the letter 'A', and 4 digits.

The first number is the "spam probability" of the token, in this case 97.8%. The 2 is the number of times it was seen in learned spam, and the 0 is the number of times it was seen in learned ham.
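
If you want to post-process a 2.6x dump in a script, the columns described above can be pulled apart with a few lines of Python. This is only a rough sketch: the names TokenRecord and parse_dump_line are made up for illustration, and the fourth column, which is not explained above, is assumed here to be a Unix timestamp because that is what it looks like.

```python
from typing import NamedTuple


class TokenRecord(NamedTuple):
    """One line of a human-readable token dump (hypothetical helper type)."""
    probability: float  # "spam probability" of the token
    spam_count: int     # times seen in learned spam
    ham_count: int      # times seen in learned ham
    timestamp: int      # ASSUMPTION: fourth column treated as a Unix timestamp
    token: str          # the token text itself


def parse_dump_line(line: str) -> TokenRecord:
    # Split on runs of whitespace into at most five fields, so the token
    # (the last field) is kept in one piece.
    prob, spam, ham, stamp, token = line.split(None, 4)
    return TokenRecord(float(prob), int(spam), int(ham), int(stamp), token.strip())


record = parse_dump_line("0.978          2          0 1108156490  N:H*M:NNANNNN")
print(record.token, record.probability, record.spam_count, record.ham_count)
```

The same function should work for any dump line with this five-column layout, header token or body token alike.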

Some header tokens don't have any of the fancy encoding, and are more readable:
0.958          1          0 1108285937  HLocation:battelle

This looks for a "Location:" header containing "battelle".

A sample body token would be:
0.018         36         90 1108415529  minimize

Body tokens can also use some of the same "N = number" encodings as headers, but you get the idea.

SA 3.x uses the same tokens but stores them as SHA1 hashes, so a dump just shows numeric gibberish. You can't tell what the token is, but you can tell whether another token is the same.
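
To illustrate that last point, here is a small Python sketch using the standard hashlib module. It only demonstrates the general property of SHA1: equal tokens give equal hashes, and the hash does not reveal the token. The helper name hash_token is hypothetical, and exactly how SA 3.x encodes or stores the digest is not covered here, so do not expect these values to match a real database.

```python
import hashlib


def hash_token(token: str) -> str:
    # Plain SHA1 hex digest of the token text. ASSUMPTION: this is a generic
    # illustration, not SpamAssassin's actual on-disk format.
    return hashlib.sha1(token.encode("utf-8")).hexdigest()


first = hash_token("minimize")
second = hash_token("minimize")
other = hash_token("HLocation:battelle")

print(first)            # 40 hex characters, unreadable as a word
print(first == second)  # same token, same hash
print(first == other)   # different token, different hash
```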
