Here are a couple of ideas for tokenising/scoring messages that someone might like to experiment with. I have no time in the next few months, but if I send them here, they won't just disappear into the vagaries of my long-term memory. <wink>

multipart/alternative: When confronted by a multipart/alternative, score each alternative separately and keep only the highest score, discarding the scores from the lower-scoring part(s). I'm seeing a _lot_ of spam with pure word-salad text/plain, and the spam text in the HTML part only.

stylesheet interpretation: There are probably some moderate wins in parsing (to a small degree) inline CSS in text/html, at least to remove the stuff which has been styled 'hidden'.

Got your own ideas for tokenising tricks that are worth trying? Post them, and we can collect them somewhere for people who want to experiment...

-- 
Anthony Baxter <[EMAIL PROTECTED]>
It's never too late to have a happy childhood.
_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
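
For anyone who wants to try the multipart/alternative idea, here is a minimal sketch using only the standard library's email package. It is not spambayes code: score_text is a hypothetical callable that maps a string to a spam score (higher means spammier), and text_of and score_message are names made up for this illustration. It only special-cases a top-level multipart/alternative; an alternative nested inside multipart/mixed would need a recursive walk.

```python
from email import message_from_string
from email.message import Message
from typing import Callable


def text_of(part: Message) -> str:
    """Return the decoded text of every text/* leaf underneath `part`."""
    chunks = []
    for leaf in part.walk():
        if leaf.get_content_maintype() != "text":
            continue  # skips multipart containers and non-text attachments
        payload = leaf.get_payload(decode=True) or b""
        charset = leaf.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, "replace"))
        except LookupError:
            # Unknown or bogus charset label: fall back to latin-1.
            chunks.append(payload.decode("latin-1", "replace"))
    return "\n".join(chunks)


def score_message(msg: Message, score_text: Callable[[str], float]) -> float:
    """Score a message, keeping only the highest-scoring alternative.

    `score_text` is a stand-in for whatever scorer is being tested;
    it is an assumption of this sketch, not an existing API.
    """
    if msg.is_multipart() and msg.get_content_type() == "multipart/alternative":
        # Score each alternative on its own; the lower scores are discarded.
        scores = [score_text(text_of(part)) for part in msg.get_payload()]
        return max(scores, default=0.0)
    # Anything else is scored as a single blob of text.
    return score_text(text_of(msg))
```

The point of taking the maximum is that a word-salad text/plain part can no longer dilute the score of the HTML part that carries the actual pitch.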

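The stylesheet idea could start out as small as the sketch below, which drops text inside any element whose inline style attribute sets display:none or visibility:hidden before the text reaches the tokeniser. It uses the standard library's html.parser; VisibleTextExtractor and visible_text are hypothetical names, and the two CSS properties checked are an assumption about what "styled hidden" covers. It ignores <style> blocks, external stylesheets and other hiding tricks.

```python
import re
from html.parser import HTMLParser

# Inline declarations treated as "hidden" in this sketch (an assumption).
HIDDEN_RE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Void elements never get a closing tag, so they must not be pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside an element styled as hidden."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._stack = []   # (tag, hidden) for each currently open element
        self.chunks = []   # visible text fragments, in document order

    def _hidden(self):
        return bool(self._stack) and self._stack[-1][1]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # An element is hidden if it or any open ancestor is hidden.
        hidden = self._hidden() or bool(HIDDEN_RE.search(style))
        self._stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if not self._hidden():
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return whitespace-normalised text with hidden elements removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Whether the hidden text should be thrown away or turned into a token of its own (the presence of hidden text may itself be a clue) is something to settle by experiment.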