[Matt]
>> Yes, I think it would be fine to start testing the filter that
>> way, but I figured since the custom tokenizer had been suggested
>> it was worth looking into what would be required and what the
>> advantages might be.
[Skip]
> Maybe subclass tokenizer.Tokenizer and override the tokenize method?
That's all that's needed. Just changing:
    def tokenize(self, obj):
        msg = self.get_message(obj)
        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok
to
    def tokenize(self, obj):
        text = obj
        # The rest of this is from tokenize_body.
        # Replace numeric character entities (like &#97; for the
        # letter 'a').
        text = numeric_entity_re.sub(numeric_entity_replacer, text)
        # Normalize case.
        text = text.lower()
        if options["Tokenizer", "replace_nonascii_chars"]:
            # Replace high-bit chars and control chars with '?'.
            text = text.translate(non_ascii_translate_tab)
        for t in find_html_virus_clues(text):
            yield "virus:%s" % t
        # Get rid of uuencoded sections, embedded URLs, <style gimmicks,
        # and HTML comments.
        for cracker in (crack_uuencode,
                        crack_urls,
                        crack_html_style,
                        crack_html_comment,
                        crack_noframes):
            text, tokens = cracker(text)
            for t in tokens:
                yield t
        # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags
        # should create a space too.
        text = breaking_entity_re.sub(' ', text)
        # It's important to eliminate HTML tags rather than, e.g.,
        # replace them with a blank (as this code used to do), else
        # simple tricks like
        #     Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
        # can be used to disguise words.  <br> and <p> were special-
        # cased just above (because browsers break text on those,
        # they can't be used to hide words effectively).
        text = html_re.sub('', text)
        # Tokenize everything in the body.
        for w in text.split():
            n = len(w)
            # Make sure this range matches in tokenize_word().
            if 3 <= n <= maxword:
                yield w
            elif n >= 3:
                for t in tokenize_word(w):
                    yield t
should be enough to skip header tokenization (so there's no need to
worry about putting headers or a blank line in front of the content)
and to skip the decoding parts of the tokenization (I assume the wiki
content will be plain text rather than application/octet-stream,
base64, quoted-printable, etc.).
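Following Skip's suggestion, wiring that into a subclass would be
roughly this (untested, and "WikiTokenizer" is only a placeholder
name; the real tokenize() body would be the modified version above,
but the skeleton shows where it plugs in):

    from spambayes import tokenizer

    class WikiTokenizer(tokenizer.Tokenizer):
        def tokenize(self, obj):
            # obj is the raw wiki text, so there's no get_message() /
            # tokenize_headers() step and no MIME decoding to do.
            text = obj.lower()
            for w in text.split():
                n = len(w)
                if 3 <= n <= 12:   # 12 is the default
                                   # skip_max_word_size, I believe
                    yield w
                elif n >= 3:
                    for t in tokenizer.tokenize_word(w):
                        yield t

You'd then call WikiTokenizer().tokenize(page_text) directly instead
of going through the email machinery.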
The code that deals with HTML should probably be replaced with code
that deals with Trac's wiki formatting. For email, SpamBayes gets rid
of all tags, so Trac could similarly dump the formatting characters
('' for italics, ''' for bold, and the like), or keep them; you'd
have to test to see whether they're useful clues or not. The code
above that deals with uuencode, HTML styles, HTML comments, and
breaking entities could probably be dropped as well.
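For example, a first stab at a wiki-markup "cracker", following the
same convention as crack_uuencode and friends (return the stripped
text plus a sequence of tokens to generate), might look like this; the
pattern is just a guess at a few common Trac markers and is completely
untested:

    import re

    # A few common Trac wiki markers: ''' (bold), '' (italics),
    # {{{ / }}} (preformatted blocks), and = Heading = lines.
    wiki_markup_re = re.compile(r"'''|''|\{\{\{|\}\}\}|^=+ | =+$",
                                re.MULTILINE)

    def crack_wiki_markup(text):
        # No synthetic tokens to generate, just the stripped text.
        return wiki_markup_re.sub(' ', text), ()

That could then be added to the cracker tuple in tokenize(), or left
out entirely if testing shows the markers are useful clues.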
=Tony.Meyer