Matt> The Trac[1] project has resurrected work on a SpamBayes plugin for
Matt> filtering Wiki and ticket edits after finding the current Akismet
Matt> system to be unreliable. Tony Meyer added some comments[2] to the
Matt> Wiki suggesting that we write a custom tokenizer instead of using
Matt> the built-in email-centric tokenizer.
Why not just create an "email message" out of the input? If the headers are
identical in every message they won't generate any useful tokens and the
message body will be all that yields useful clues. OTOH, if you have login
or IP address information for the spammers, you might suitably populate the
From: field.
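A rough sketch of what I mean, using only the standard library's email package (Python 3 spelling; adapt it to whatever Python your SpamBayes install runs on). The helper name edit_to_message, the placeholder addresses, and the author/ip parameters are all made up for illustration and are not part of Trac or SpamBayes:

```python
from email.message import EmailMessage


def edit_to_message(text, author=None, ip=None):
    """Wrap a wiki or ticket edit in a minimal email message (hypothetical helper)."""
    msg = EmailMessage()
    # These headers are the same on every message, so they should not
    # contribute any discriminating tokens.
    msg["To"] = "wiki@example.invalid"
    msg["Subject"] = "wiki edit"
    # Vary From: only with what is actually known about the submitter
    # (login name and/or IP address); otherwise use a fixed placeholder.
    msg["From"] = "%s@%s" % (author or "anonymous", ip or "unknown.invalid")
    # The edit text becomes the message body, where the useful clues live.
    msg.set_content(text)
    return msg


msg = edit_to_message("Buy cheap pills", author="bob", ip="192.0.2.7")
print(msg.as_string())
```

The resulting message (or its string form) is what you would then hand to the existing email tokenizer, for both training and scoring.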
Matt> Are there examples from other people that have written custom
Matt> tokenizers that may be helpful, or do you have any hints on what to
Matt> take into account for writing an effective tokenizer for Wiki text?
So far, I think most of us have bent our input to look like email. I think
that would be a lot easier than writing and debugging a new tokenizer.
Skip