[spambayes-dev] some tokenising ideas for someone who wants to experiment

Anthony Baxter Thu, 16 Jun 2005 00:05:43 -0700

Here are a couple of ideas for tokenising/scoring messages that someone might 
like to experiment with. I have no time in the next few months, but if I send 
them here, they won't just disappear into the vagaries of my long term 
memory. <wink>


multipart/alternative:

   When confronted by a multipart/alternative, score each alternative 
separately, and keep the highest score only. Discard the scoring from the 
lower-scoring part(s). I'm seeing a _lot_ of spam with pure word-salad 
text/plain, and the actual spam text in the text/html part only. 
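
   A minimal sketch of the "score each alternative, keep the highest" idea, 
using only the stdlib email package. The names score_alternatives, leaf_text 
and toy_score are made up for illustration; score_text stands in for whatever 
classifier call you actually use, and none of this is SpamBayes' own API. It 
only handles a multipart/alternative at the top level of the message.

```python
import email
from email.message import Message
from typing import Callable


def leaf_text(msg: Message) -> str:
    """Join the decoded text of every text/* leaf part under msg."""
    chunks = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name the codecs module does not know
            chunks.append(payload.decode("latin-1", errors="replace"))
    return "\n".join(chunks)


def score_alternatives(msg: Message, score_text: Callable[[str], float]) -> float:
    """Score msg; for a top-level multipart/alternative, score each
    alternative on its own and keep only the highest score."""
    if msg.get_content_type() == "multipart/alternative" and msg.is_multipart():
        alternatives = msg.get_payload()
        if alternatives:
            return max(score_text(leaf_text(alt)) for alt in alternatives)
    # Anything else: score all the text together (assumed fallback policy).
    return score_text(leaf_text(msg))


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer: 'spammy' if the text mentions pills."""
    return 0.99 if "pills" in text.lower() else 0.01


RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XX"

--XX
Content-Type: text/plain

quantum heron lattice umbrella

--XX
Content-Type: text/html

<p>buy cheap pills now</p>
--XX--
"""

print(score_alternatives(email.message_from_string(RAW), toy_score))
```

   With the toy scorer, the word-salad text/plain part scores low and the 
text/html part scores high, so the message as a whole gets the html part's 
score. Whether taking the max actually helps on real mail is exactly the 
thing that would need testing.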

stylesheet interpretation:

   There are probably some moderate wins in parsing (to a small degree) inline 
CSS in text/html - at least to remove the stuff which has been styled 
'hidden'.
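
   A rough sketch of what "parsing to a small degree" might look like, using 
the stdlib html.parser. It only looks at inline style attributes for 
display:none or visibility:hidden and drops the text inside those elements. 
VisibleTextExtractor and visible_text are made-up names, and the approach 
assumes reasonably well-nested tags; an unclosed tag inside a hidden element 
will throw the depth count off, which matters for real-world spam HTML.

```python
import re
from html.parser import HTMLParser

# Inline style declarations treated as "hidden" (an assumed, incomplete list).
HIDDEN_RE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside an element styled as hidden."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once hidden, every nested element deepens the hidden region.
        if self.hidden_depth or HIDDEN_RE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the whitespace-normalised text outside hidden elements."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


SAMPLE = ('<p>Buy <span style="display: none">zebra lattice</span>pills '
          '<b style="visibility:hidden">heron <i>umbrella</i></b>now</p>')
print(visible_text(SAMPLE))  # Buy pills now
```

   This ignores <style> blocks, classes, zero font sizes and 
text-matches-background tricks; it's only the cheapest possible first cut at 
the inline case.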

Got your own ideas for tokenising tricks that are worth trying? Post them, and 
we can collect them somewhere for people who want to experiment... 

-- 
Anthony Baxter     <[EMAIL PROTECTED]>
It's never too late to have a happy childhood.
