> IMHO, I would tell your users to replace the default_bayes_customize.ini
> with my settings below *smile*, as I found that more tokens the better for
> my mail stream.
This certainly falls outside the realm of simple, however. I'm also not convinced that all of these options are good for everyone (or they would default to 'on').

Enabling all the options is also not something I'd recommend. As part of the 2005 TREC spam track, one of the SpamBayes runs submitted included enabling all boolean options (except the slurping ones). Results from TREC aren't complete yet, but initial testing indicates that this run performs worse than running with defaults.

[...]

> [Classifier]
>
> x-use_bigrams: True

*Most* tests have indicated that the windowing bi-gram scheme is better than straight unigrams. There have been a few cases where this was not true, however. It also vastly increases the database size. It's certainly a good technique (and in 1.1 is no longer experimental), and if people are going to go to the effort of customizing the tokenization/classification options, this would be the best choice.

> max_discriminators: 150

This is already the default, so it isn't needed (it has no effect).

> replace_nonascii_chars: True
> record_header_absence: True

These two are specifically enabled in the Outlook plug-in by default (although including them in the file is necessary if it's replaced). replace_nonascii_chars is probably a bad idea for anyone who receives non-English ham, and probably a good idea for anyone else. record_header_absence has also had mixed results.

> x-fancy_url_recognition: True
> x-pick_apart_urls: True

These are experimental options; as such they haven't had the extensive testing that other options have. It's not clear yet whether these are a good idea for any user or not.

> x-reduce_habeas_headers: True
> x-search_for_habeas_headers: True

It's pretty clear that Habeas's headers are a failed experiment. These options probably aren't worth including, and are likely to be removed in a future release.
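To make the database-size remark about x-use_bigrams above concrete, here is a minimal sketch of what adding bigrams does to a token stream. It is an illustration only, not the SpamBayes implementation: the helper name with_bigrams and the "bi:" token prefix are assumptions made for this example, and it covers token generation only; whatever the classifier does to choose between overlapping clues when scoring is not shown.

```python
def with_bigrams(tokens):
    """Yield each unigram, plus a 'bi:' token for each adjacent pair.

    Hypothetical helper for illustration; the name and the 'bi:' prefix
    are assumptions, not taken from the SpamBayes source.
    """
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # Pair the current token with the one immediately before it.
            yield "bi:%s %s" % (previous, token)
        previous = token


words = "cheap meds shipped overnight".split()
enhanced = list(with_bigrams(words))
print(len(words), "unigrams ->", len(enhanced), "tokens")
for token in enhanced:
    print(token)
```

A message of n words yields 2n - 1 tokens under this sketch, and distinct word pairs are far more numerous than distinct words, which is why a database trained this way grows so much.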
> basic_header_tokenize: True
> basic_header_skip: date x-.* domainkey-signature

Testing hasn't shown that basic_header_tokenize is a good idea. Is there a reason you turned it on?

> octet_prefix_size: 5

This is the default; it will have no effect.

> mine_received_headers: True

As long as the training data is from the user, this should help.

> address_headers: from sender reply-to errors-to

I don't have any testing to hand about this, but I doubt that removing "to" and "cc" from the headers that are tokenized is a good idea. For me, at least, the data in the "to" and "cc" headers is definitely a good indicator of whether the message is ham/spam; I would expect this would be the case for many people. Adding errors-to might help; I don't know if any testing has been done on that.

> generate_long_skips: True

This is the default; it will have no effect.

> skip_max_word_size: 50

I believe that (in the early days) there was a lot of testing to determine what the best minimum and maximum token sizes were. 50 is a *lot* larger than the default 12 - do you really have many strong tokens longer than 12?

> [URLRetriever]
>
> x-cache_directory: url-cache
> x-cache_expiry_days: 31
> x-only_slurp_base: True
> x-slurp_urls: True
> x-web_prefix:web:

I would not recommend enabling these without understanding what they do. The main issue is that as a result of enabling them, SpamBayes will be downloading a lot of extra material - for those where connection speed or bandwidth are issues, this might not be a good step. It's also not at all clear that they are beneficial - without the only_slurp_base option, testing generally indicates good results, but that means that any 'bugs' will be triggered. With the only_slurp_base option, results are mixed, leaning towards negative.

=Tony.Meyer

--
Please always include the list (spambayes at python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
