> We would like to understand what interactions the tokenizer has with
> the different modules.

The tokenizer reads the options to know what tokenizing to do.  Any of
the other modules that need to tokenize a message use the tokenizer.
That's about it*.

> Is there any documentation available that describes the different
> modules?

There's README-DEVEL.txt in the source, and the (extensive) comments in
the code.  Feel free to ask questions here.

> We are interested in what the email representation is after email is
> tokenized and going into the learner and classifier.

The email is an iterable (a generator in this case, but any iterable
would do) of strings.

> In addition, we would like to isolate the tokenizer.

Already done - tokenizer.py is isolated from the rest of SpamBayes,
other than the options (which control what tokenization is done).

=Tony.Meyer

* Ok, not quite all.  The experimental URL slurping option imports the
classifier, because it only generates tokens if the score is already
known to be unsure, and the tokenizer doesn't otherwise know anything
about the score.  If this became non-experimental, a tidier way would
be found to do this.  The experimental image tokenization also uses the
ImageStripper module.  And the tokenizer uses mboxutils.get_message so
that you can pass a string, a file-like object, or an email.Message
object to tokenize() (this is just a convenience, really).
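
To make the above concrete, here is a minimal sketch of how a caller
might drive the tokenizer.  The sample message is mine, and the exact
module-level names (tokenizer.tokenize, mboxutils.get_message, the
option name) are assumptions to be checked against the source; the
intent is only to illustrate the interactions described above.

    # Minimal sketch: tokenize a raw message and look at the token stream.
    # Assumes the spambayes package is importable and that
    # tokenizer.tokenize / mboxutils.get_message behave as described above.
    from spambayes import mboxutils, tokenizer
    from spambayes.Options import options

    raw = ("From: someone@example.com\n"
           "Subject: hello\n"
           "\n"
           "Click here for a great offer!\n")

    # The tokenizer reads its behaviour from the options, e.g.:
    # print(options["Tokenizer", "mine_received_headers"])
    # (option name here is illustrative).

    # get_message() accepts a string, a file-like object, or an
    # email.Message and returns an email.Message; tokenize() calls it
    # itself, so any of those forms can be passed straight in.
    msg = mboxutils.get_message(raw)

    # tokenize() returns an iterable (a generator) of token strings --
    # this is the representation the learner and classifier consume.
    for token in tokenizer.tokenize(msg):
        print(token)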
The tokenizer reads the options to know what tokenizing to do. Any of the other modules that need to tokenize a message use the tokenizer. That's about it*. > Is there any documentation available that describes the different > modules? There's README-DEVEL.txt in the source, and the (extensive) comments in the code. Feel free to ask questions here. > We are interested in what the email representation is after email is > tokenized and going into the learner and classifier. The email is an iterable (generator in this case, but any iterable would do) of strings. > In addition, we would like to isolate the tokenizer. Already done - tokenizer.py is already isolated from the rest of SpamBayes, other than the options (which control what tokenization is done). =Tony.Meyer * Ok, not quite all. The experimental URL slurping option imports the classifier, because it only generates tokens if the score is already known to be unsure, and the tokenizer doesn't otherwise know anything about score. If this became non-experimental a tidier way would be found for this. The experimental image tokenization also uses the ImageStripper module. And the tokenizer uses mboxutils.get_message so that you can pass a string, file, or something like that, or a email.Message object, to tokenize (this is just convenience, really). _______________________________________________ spambayes-dev mailing list spambayes-dev@python.org http://mail.python.org/mailman/listinfo/spambayes-dev