On Monday 16 October 2006 01:21, Magnus Holmgren took the opportunity to say:
> What I'm saying is that
>
> $ sa-learn --spam < testmessage
>
> and
>
> $ sa-learn --spam testmessage
>
> give different results. I forgot to mention the version, 3.1.4 (Debian
> Etch). 3.0.3 (Debian Sarge) doesn't exhibit this behaviour, but there seems
> to be some other fishiness going on. I'll investigate further.

I just realised that on the first SpamAssassin pass the top Received: line is 
a preliminary one, different from the real one added by Exim upon writing the 
message to the queue (Date:, the *bottom* Received:, and a bit of the body 
are used to generate a message ID), but this doesn't explain the strange 
results in 3.1.4.

Now, in Mail::SpamAssassin::Bayes, get_msgid() returns the contents of the 
Message-ID field (if it exists) as well as a generated ID. But learn_trapped() 
and forget_trapped() always use the generated ID (which, as I mentioned, is 
a hash of the Date field, the first/bottom Received line, and a bit of the 
body). I think the message identification could be improved.
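
To illustrate why the top Received: line doesn't affect the generated ID while 
the bottom one does, here is a rough Python sketch of that kind of hash. It is 
not the SpamAssassin code: the function name generated_id, the choice of SHA-1, 
the NUL separator, and the 64-character body prefix are all assumptions made 
for the example.

```python
import hashlib
from email import message_from_string


def generated_id(raw_message: str, body_chars: int = 64) -> str:
    """Hash Date, the bottom Received line and the start of the body.

    Hypothetical helper: the digest, separator and body_chars value are
    illustrative assumptions, not SpamAssassin's actual choices.
    """
    msg = message_from_string(raw_message)
    date = msg.get("Date", "")
    # Received headers are prepended, so the last one in the list is the
    # oldest ("bottom") one, added by the first relay to see the message.
    received = msg.get_all("Received") or []
    bottom_received = received[-1] if received else ""
    # Take the body as everything after the first blank line.
    body = raw_message.partition("\n\n")[2]
    material = "\0".join([date, bottom_received, body[:body_chars]])
    return hashlib.sha1(material.encode("utf-8", "replace")).hexdigest()
```

With a hash like this, rewriting or adding a Received: line at the top leaves 
the ID unchanged, but anything a sender controls (Date, the bottom Received 
line, the body) changes it.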

Using the Message-ID field isn't optimal, since spammers can't be trusted to 
put unique IDs in their spams. But how about the local message ID assigned by 
the first internal MTA? One could argue that a message arriving twice (e.g. 
directly and via a mailing list) might then be learned twice. But what about 
multiple copies of the same spam? Anyway, using the bottom Received line 
isn't much better than using Message-ID.
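
As a sketch of the local-ID idea, the snippet below pulls the "id" clause out 
of a Received: header in Python. It assumes, purely for illustration, that the 
topmost Received: line was added by the internal MTA and carries an "id" 
clause (as Exim's normally do); the name local_queue_id and the regular 
expression are hypothetical, and picking the right header in a multi-hop 
internal setup would need more care.

```python
import re
from email import message_from_string
from typing import Optional

# Matches the "id <token>" clause of a Received header, stopping at
# whitespace or the ";" that precedes the timestamp.
_ID_CLAUSE = re.compile(r"\bid\s+([^\s;]+)")


def local_queue_id(raw_message: str) -> Optional[str]:
    """Return the id clause of the topmost Received header, if any.

    Assumption: the topmost Received header is the one written by the
    local (internal) MTA.
    """
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    match = _ID_CLAUSE.search(received[0])
    return match.group(1) if match else None
```

Two deliveries of the same message would get different local IDs, which is 
exactly the learned-twice case mentioned above.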

-- 
Magnus Holmgren        [EMAIL PROTECTED]
                       (No Cc of list mail needed, thanks)
