On Monday 16 October 2006 01:21, Magnus Holmgren took the opportunity to say:
> What I'm saying is that
>
> $ sa-learn --spam < testmessage
>
> and
>
> $ sa-learn --spam testmessage
>
> give different results. I forgot to mention the version, 3.1.4 (Debian
> Etch). 3.0.3 (Debian Sarge) doesn't exhibit this behaviour, but there seems
> to be some other fishiness going on. I'll investigate further.

I just realised that on the first SpamAssassin pass the top Received: line is a preliminary one, different from the real one that Exim adds when it writes the message to the queue. (Date:, the *bottom* Received:, and a bit of the body are used to generate a message ID.) But this doesn't explain the strange results in 3.1.4.

Now, in Mail::SpamAssassin::Bayes, get_msgid() returns the contents of the Message-ID field (if it exists) as well as a generated ID. But learn_trapped() and forget_trapped() always use the generated ID (which, as I mentioned, is a hash of the Date field, the first/bottom Received line, and a bit of the body).

I think the message identification could be improved. Using the Message-ID field isn't optimal, since spammers can't be trusted to put unique IDs in their spams. But how about the local message ID from the first internal MTA? Admittedly, if a message arrives twice (e.g. directly and via a mailing list), it might then be learned twice. But what about multiple copies of the same spam? Anyway, using the bottom Received line isn't much better than using Message-ID.

-- 
Magnus Holmgren [EMAIL PROTECTED]
(No Cc of list mail needed, thanks)
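
To illustrate the generated-ID scheme described above, here is a rough Python sketch. It is not SpamAssassin's code (which is Perl); the function name generated_id, the SHA-1 digest, the 128-character body sample, and the "@generated.example" suffix are all assumptions made for illustration. It only shows why the ID is unaffected by a changing top Received: line but depends on the bottom one.

```python
import hashlib
from email import message_from_string


def generated_id(raw: str, body_sample: int = 128) -> str:
    """Hypothetical helper: hash Date, the bottom Received line and a body sample."""
    msg = message_from_string(raw)
    date = msg.get("Date", "")
    # Received headers are listed top-down, so the last one is the
    # oldest ("bottom") hop.
    received = msg.get_all("Received") or []
    bottom = received[-1] if received else ""
    payload = msg.get_payload()
    # Only plain (non-multipart) bodies are sampled in this sketch.
    body = payload if isinstance(payload, str) else ""
    material = "\0".join([date, bottom, body[:body_sample]])
    digest = hashlib.sha1(material.encode("utf-8", "replace")).hexdigest()
    # The suffix is an assumed marker, not a real SpamAssassin value.
    return digest + "@generated.example"
```

With this scheme, two copies of a message that differ only in their top Received: line get the same ID, while copies that took a different first hop get different ones.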