> I get the same message-id behavior from our Exchange server.

Ah - so do I, I realise now.  I was looking just at the headers that
SpamBayes shows in "Show Clues", but I see I get the invalid token even
without a message-id header showing there.  Looking at the tokenizing
code, if there is no message-id header, then the 'invalid' token is
generated.  (I suppose the thought is that not having the header is not
a valid case.)

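A rough sketch of that behaviour, to make it concrete.  This is not the
actual tokenizer code: the function name, the regular expression, and
the 'message-id:@host' token for the valid case are assumptions made for
illustration.  Only the 'message-id:invalid' token for a missing header
comes from the clues quoted in this thread.

```python
import re
from email.message import Message

# Assumed pattern: "<local-part@host>" with optional surrounding whitespace.
MESSAGE_ID_RE = re.compile(r'\s*<[^@]+@([^>]+)>\s*')

def message_id_tokens(msg):
    """Yield one token describing the message-id header (illustrative)."""
    # A missing header is treated like an empty one, so it cannot match.
    msgid = msg.get("message-id", "")
    m = MESSAGE_ID_RE.match(msgid)
    if m:
        # Hypothetical token for the valid case: keep only the host part.
        yield 'message-id:@%s' % m.group(1)
    else:
        # No header (or an unparseable one) ends up here.
        yield 'message-id:invalid'

# An Exchange-style message reconstructed without a message-id header.
exchange_msg = Message()
exchange_msg["From"] = "Some User"

# An ordinary Internet message with a well-formed message-id.
internet_msg = Message()
internet_msg["Message-ID"] = "<1234@mail.example.com>"

print(list(message_id_tokens(exchange_msg)))
print(list(message_id_tokens(internet_msg)))
```

With logic like this, every message reconstructed without a message-id
header yields the same 'invalid' token, whether or not the header was
ever present in Outlook.
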
> Notice that there is no message id header of any sort, and that
> the From and To fields do not use Internet standard address format.
> The following tokens were included among the clues, and are typical for
> most if not all of my Exchange mail:
>
> """
> token                      spamprob   #ham  #spam
> 'message-id:invalid'       0.214766     19      9
> 'x-mailer:none'            0.622068     88    258
> 'from:no real name:2**0'   0.642539     29     93
> """
>
> Maybe there's a property in the Outlook message object
> somewhere that we need to retrieve and add to the headers when we
> reconstruct the message?

Maybe we ought to be making an attempt to generate headers for all those
in the safe_headers option (or, alternatively, changing the default
value of safe_headers for Outlook users).  The headers that we could
probably generate include "date", "from" (we could be smarter about how
it is presented), "importance", "in-reply-to" (?), "message-id",
"organization" (?), "received" (maybe too much effort), "reply-to", "to"
(smarter), and "user-agent".  We could generate "x-mailer", which is
tokenized separately, too.

None of this is hard - it's just a case of running
Outlook2000/sandbox/dumpprops.py on one of these messages, looking up
the appropriate property names, and then modifying the function to get &
format the appropriate data.  I guess (but do not know) that getting a
few extra properties as well as the ones we already get would not
significantly affect the time required.

However, there is the question of whether this will help or hinder.  At
the moment, we get a whole bunch of "I'm an Exchange message" tokens,
which I suspect are significant ham clues for most people.  If we
replace those with more data, maybe it'll be harder to nail Exchange
messages (I would guess not, but stupid beats smart, etc).

We could add an (experimental?) Outlook option
"synthesised_exchange_headers", which lists the headers (like those
above) to try to synthesise (the current situation being
"to,from,subject").

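Something along these lines is what I have in mind for the option.  This
is only a sketch under loud assumptions: the property dictionary and its
key names (sender_name, delivery_time, internet_message_id, and so on)
are made up for illustration and are not real Outlook/MAPI property
names, and synthesise_headers() is not an existing SpamBayes function.
The real property names would have to come from dumpprops.py output.

```python
from email.message import Message
from email.utils import formataddr, formatdate

# Hypothetical builders: header name -> function over a plain dict of
# already-extracted properties.  All key names are invented.
HEADER_BUILDERS = {
    "from": lambda p: formataddr((p["sender_name"], p["sender_email"])),
    "to": lambda p: ", ".join(formataddr(pair) for pair in p["recipients"]),
    "subject": lambda p: p["subject"],
    "date": lambda p: formatdate(p["delivery_time"], usegmt=True),
    "message-id": lambda p: p.get("internet_message_id"),
}

def synthesise_headers(props, option_value):
    """Build headers for the names listed in a comma-separated option."""
    wanted = [n.strip().lower() for n in option_value.split(",") if n.strip()]
    msg = Message()
    for name in wanted:
        builder = HEADER_BUILDERS.get(name)
        if builder is None:
            continue  # no builder for this header name; skip it
        try:
            value = builder(props)
        except KeyError:
            value = None  # the needed property is missing on this message
        if value:
            msg[name] = value
    return msg

# Made-up properties standing in for one Exchange message.
sample_props = {
    "sender_name": "Alice Example",
    "sender_email": "alice@example.com",
    "recipients": [("Bob Example", "bob@example.com")],
    "subject": "Quarterly report",
    "delivery_time": 0.0,  # seconds since the epoch
}

# "to,from,subject" corresponds to the current situation.
print(synthesise_headers(sample_props, "to,from,subject").as_string())
```

Headers that cannot be built are simply left out, so a message with no
usable message-id property would still produce the same "missing header"
tokens as it does now.
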
With an option like that, users could at least change the situation
relatively easily (e.g. revert to the 1.0.x behaviour), although
retraining would probably be necessary for it to have much effect.

I'll try to find time to whip up something like this and run some test
scripts with it, and see what happens (although I imagine the proportion
of Exchange mail in the test data will have a big influence on the
results).  Probably not until the end of the week, or the start of the
next one, though.

At least, since it's Outlook, if we make the situation worse, Tim will
probably notice and yell at us <wink>.

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/spambayes-dev