On Tue, 2006-10-31 at 13:51 +1300, Tony Meyer wrote:
> [Matt]
> >> Yes, I think it would be fine to start testing the filter that
> >> way, but I figured since the custom tokenizer had been suggested
> >> it was worth looking into what would be required and what the
> >> advantages might be.
> 
> [Skip]
> > Maybe subclass tokenizer.Tokenizer and override the tokenize method?
> 
> That's all that's needed.  Just changing:
> 
...snip...
> 
> should be enough to skip header tokenization (and not have to worry  
> about putting headers or a blank line in front of the content) and  
> skip the decoding parts of the tokenization (I assume the wiki  
> content will be plain text and not application/octet, base64, qp, etc).
> 
> The code that deals with HTML should probably be replaced with code  
> that deals with Trac's wiki formatting.  For email, SpamBayes gets  
> rid of all tags, so Trac could similarly dump formatting characters  
> ('', ''', and the like), or keep them (you'd have to test to see  
> whether they were useful or not).  Probably the code above that deals  
> with uuencode, HTML styles, HTML comments, and breaking entities  
> could be dropped as well.

Thanks, that should give me a good starting point.  I'll check back if I
have any more questions.
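
For reference, here is the rough shape of what I'm picturing.  It's a
minimal sketch only, assuming SpamBayes is importable and that the wiki
text arrives as an already-decoded plain string; the markup regex and
the word-length cutoff are placeholders to tune against real Trac
content, not anything taken from the SpamBayes source:

    import re

    from spambayes import tokenizer

    # Trac wiki markup to strip before tokenizing: '''bold''', ''italic'',
    # = headings =, [[Macros]] and {{{ code blocks }}}.  Keeping these as
    # tokens instead is the other option Tony mentioned testing.
    wiki_markup_re = re.compile(r"'{2,3}|={1,5}|\[\[.*?\]\]|\{\{\{|\}\}\}")

    class WikiTokenizer(tokenizer.Tokenizer):
        """Tokenize Trac wiki text instead of an email message."""

        def tokenize(self, text):
            # Wiki pages have no headers and need no MIME decoding, so go
            # straight to splitting the plain-text content into word tokens.
            text = wiki_markup_re.sub(" ", text)
            for word in text.split():
                if 2 < len(word) < 13:
                    yield word
                # The real tokenizer does something smarter with very long
                # and very short words; that could be pulled in later.

The classifier would then train and score on the tokens from
WikiTokenizer().tokenize(page_text), and stripping versus keeping the
markup characters is exactly the kind of thing I'll test both ways.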

-- Matt Good

