[Skip]
> Why not just create an "email message" out of the input?  If the
> headers are identical in every message they won't generate any useful
> tokens and the message body will be all that yields useful clues.
> OTOH, if you have login or IP address information for the spammers,
> you might suitably populate the From: field.
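For what it's worth, that approach is only a few lines of Python.  A rough
sketch (the function name and the choice of headers are mine, not anything
in SpamBayes): wrap the page text in a minimal email.message.Message,
populating headers only where there is per-edit metadata, so the stock
email tokenizer can consume it unchanged.

```python
# Hypothetical wiki-page-to-message shim; the name and the metadata
# fields below are illustrative, not part of the SpamBayes API.
from email.message import Message

def wiki_page_to_message(body, author=None, comment=None, ip=None):
    """Wrap a wiki page's text in a minimal message so a standard
    email tokenizer can chew on it unchanged."""
    msg = Message()
    # Only populate headers that vary per edit; headers identical in
    # every message would generate no useful tokens.
    if author:
        msg["From"] = author
    if comment:
        msg["Subject"] = comment
    if ip:
        msg["X-Originating-IP"] = ip
    msg.set_payload(body)
    return msg
```

You'd then feed the result to the trainer/classifier exactly as you would
a real mail message.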
ISTM that it would be just as little work to write a "wiki-page to
email" module as to create a Tokenizer subclass that tokenizes wiki
pages.  You can then skip all of the header tokenization (and any
email-specific tokenization in the body, if there is any, but I can't
think of any) and generate any additional tokens out of any metadata
that might be available (maybe comment, author, etc?).

[Matt]
>> Are there examples from other people that have written custom
>> tokenizers that may be helpful, or do you have any hints on what to
>> take into account for writing an effective tokenizer for Wiki text?

What exactly gets passed to the tokenizer?  Anything more than just the
content (complete? diff?) of the wiki page?  If it's just the
content/diff, then other than the words themselves, URLs are probably
the most useful content.  You could try enabling (or improving) the URL
slurping code, perhaps.

> So far, I think most of us have bent our input to look like email.
> I think that would be a lot easier than writing and debugging a new
> tokenizer.

A tokenizer's pretty simple, really - all it has to do is take the
object you want to tokenize and yield a series of strings.  It's been a
couple of years, but I wrote some non-email tokenizers at one point.

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
