On Tue, 2006-10-31 at 13:51 +1300, Tony Meyer wrote:
> [Matt]
> >> Yes, I think it would be fine to start testing the filter that
> >> way, but I figured since the custom tokenizer had been suggested
> >> it was worth looking into what would be required and what the
> >> advantages might be.
>
> [Skip]
> > Maybe subclass tokenizer.Tokenizer and override the tokenize method?
>
> That's all that's needed.  Just changing:
>
> ...snip...
>
> should be enough to skip header tokenization (and not have to worry
> about putting headers or a blank line in front of the content) and
> skip the decoding parts of the tokenization (I assume the wiki
> content will be plain text and not application/octet, base64, qp, etc).
>
> The code that deals with HTML should probably be replaced with code
> that deals with Trac's wiki formatting.  For email, SpamBayes gets
> rid of all tags, so Trac could similarly dump formatting characters
> ('', ''', and the like), or keep them (you'd have to test to see
> whether they were useful or not).  Probably the code above that deals
> with uuencode, HTML styles, HTML comments, and breaking entities
> could be dropped as well.
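Just to make sure I'm reading you right, something like this is roughly
what I'm picturing as a first cut (untested sketch -- I'm going from a
quick read of spambayes/tokenizer.py, so the tokenize_word helper and
the exact signatures may need adjusting):

    from spambayes import tokenizer

    class WikiTokenizer(tokenizer.Tokenizer):
        """Tokenize raw Trac wiki text instead of an email message."""

        def tokenize(self, text):
            # Skip get_message()/tokenize_headers() and MIME decoding;
            # the wiki content arrives as plain text already.
            for word in text.split():
                # First cut: keep Trac markup ('', ''', etc.) in place
                # and test later whether stripping it helps.
                for token in tokenizer.tokenize_word(word):
                    yield token

Trac could then score a page with something along the lines of
bayes.spamprob(WikiTokenizer().tokenize(page_text)), after training the
classifier with learn() on known ham and spam edits.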
Thanks, that should give me a good starting point.  I'll check back if
I have any more questions.

-- Matt Good