On Tue, 2006-10-31 at 13:51 +1300, Tony Meyer wrote:
> [Matt]
> >> Yes, I think it would be fine to start testing the filter that
> >> way, but I figured since the custom tokenizer had been suggested
> >> it was worth looking into what would be required and what the
> >> advantages might be.
>
> [Skip]
> > Maybe subclass tokenizer.Tokenizer and override the tokenize method?
>
> That's all that's needed. Just changing:
>
...snip...
>
> should be enough to skip header tokenization (and not have to worry
> about putting headers or a blank line in front of the content) and
> skip the decoding parts of the tokenization (I assume the wiki
> content will be plain text rather than application/octet-stream,
> base64, quoted-printable, etc.).
>
> The code that deals with HTML should probably be replaced with code
> that deals with Trac's wiki formatting. For email, SpamBayes gets
> rid of all tags, so Trac could similarly dump formatting characters
> ('', ''', and the like), or keep them (you'd have to test to see
> whether they were useful or not). Probably the code above that deals
> with uuencode, HTML styles, HTML comments, and breaking entities
> could be dropped as well.
Thanks, that should give me a good starting point. I'll check back if I
have any more questions.
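
For reference, here's the rough, untested sketch I'm starting from.  The
WikiTokenizer name, the markup regex, and the 3-to-12-character word rule
are my own guesses at what's worth keeping, and the import path may need
adjusting depending on the SpamBayes layout:

import re

# Import path assumption: on some installs this is just
# "from tokenizer import Tokenizer".
from spambayes.tokenizer import Tokenizer

# Trac wiki markup to drop before splitting into words ('' / ''' emphasis,
# = headings, {{{ }}} blocks, [[macros]]).  Whether dropping or keeping
# these characters helps classification is something to test both ways.
WIKI_MARKUP = re.compile(r"'{2,}|=+|\{\{\{|\}\}\}|\[\[|\]\]")

class WikiTokenizer(Tokenizer):
    """Tokenize plain-text Trac wiki content: no headers, no MIME decoding."""

    def tokenize(self, text):
        # The input is raw wiki text, not an email message, so skip header
        # tokenization and the base64/quoted-printable decoding entirely.
        text = WIKI_MARKUP.sub(' ', text)
        for word in text.lower().split():
            n = len(word)
            if 3 <= n <= 12:
                yield word
            elif n > 12:
                # Rough stand-in for the stock tokenizer's handling of very
                # long "words": summarize rather than use them verbatim.
                yield "skip:%d" % (n // 10 * 10)

The filter would then feed the wiki text straight to
WikiTokenizer().tokenize() and pass the tokens to the classifier, instead
of wrapping the content in a fake email message first.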
--
Matt Good