[Matt]
>> Yes, I think it would be fine to start testing the filter that
>> way, but I figured since the custom tokenizer had been suggested
>> it was worth looking into what would be required and what the
>> advantages might be.
[Skip]
> Maybe subclass tokenizer.Tokenizer and override the tokenize method?
That's all that's needed. Just changing:
    def tokenize(self, obj):
        msg = self.get_message(obj)
        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok
to
    def tokenize(self, obj):
        text = obj
        # The rest of this is from tokenize_body.
        # Replace numeric character entities (like &#97; for the
        # letter 'a').
        text = numeric_entity_re.sub(numeric_entity_replacer, text)
        # Normalize case.
        text = text.lower()
        if options["Tokenizer", "replace_nonascii_chars"]:
            # Replace high-bit chars and control chars with '?'.
            text = text.translate(non_ascii_translate_tab)
        for t in find_html_virus_clues(text):
            yield "virus:%s" % t
        # Get rid of uuencoded sections, embedded URLs, <style gimmicks,
        # and HTML comments.
        for cracker in (crack_uuencode,
                        crack_urls,
                        crack_html_style,
                        crack_html_comment,
                        crack_noframes):
            text, tokens = cracker(text)
            for t in tokens:
                yield t
        # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags
        # should create a space too.
        text = breaking_entity_re.sub(' ', text)
        # It's important to eliminate HTML tags rather than, e.g.,
        # replace them with a blank (as this code used to do), else
        # simple tricks like
        #     Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
        # can be used to disguise words.  <br> and <p> were special-
        # cased just above (because browsers break text on those,
        # they can't be used to hide words effectively).
        text = html_re.sub('', text)
        # Tokenize everything in the body.
        for w in text.split():
            n = len(w)
            # Make sure this range matches in tokenize_word().
            if 3 <= n <= maxword:
                yield w
            elif n >= 3:
                for t in tokenize_word(w):
                    yield t
should be enough to skip header tokenization (so there's no need to
worry about putting headers or a blank line in front of the content)
and to skip the decoding parts of the tokenization (I assume the wiki
content will be plain text rather than application/octet-stream,
base64, quoted-printable, etc.).
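Following Skip's suggestion, wiring that into a subclass would be
roughly this (untested, and "WikiTokenizer" is only a placeholder
name; the real tokenize() body would be the modified version above,
but the skeleton shows where it plugs in):

    from spambayes import tokenizer

    class WikiTokenizer(tokenizer.Tokenizer):
        def tokenize(self, obj):
            # obj is the raw wiki text, so there's no get_message() /
            # tokenize_headers() step and no MIME decoding to do.
            text = obj.lower()
            for w in text.split():
                n = len(w)
                if 3 <= n <= 12:   # 12 is the default
                                   # skip_max_word_size, I believe
                    yield w
                elif n >= 3:
                    for t in tokenizer.tokenize_word(w):
                        yield t

You'd then call WikiTokenizer().tokenize(page_text) directly instead
of going through the email machinery.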
The code that deals with HTML should probably be replaced with code
that deals with Trac's wiki formatting. For email, SpamBayes gets rid
of all tags, so Trac could similarly dump the formatting characters
('' for italics, ''' for bold, and the like), or keep them; you'd
have to test to see whether they're useful clues or not. The code
above that deals with uuencode, HTML styles, HTML comments, and
breaking entities could probably be dropped as well.
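For example, a first stab at a wiki-markup "cracker", following the
same convention as crack_uuencode and friends (return the stripped
text plus a sequence of tokens to generate), might look like this; the
pattern is just a guess at a few common Trac markers and is completely
untested:

    import re

    # A few common Trac wiki markers: ''' (bold), '' (italics),
    # {{{ / }}} (preformatted blocks), and = Heading = lines.
    wiki_markup_re = re.compile(r"'''|''|\{\{\{|\}\}\}|^=+ | =+$",
                                re.MULTILINE)

    def crack_wiki_markup(text):
        # No synthetic tokens to generate, just the stripped text.
        return wiki_markup_re.sub(' ', text), ()

That could then be added to the cracker tuple in tokenize(), or left
out entirely if testing shows the markers are useful clues.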
=Tony.Meyer