Bill, I would say that the training advice has not changed. Since you last asked Tony about training, no major modifications have been made (or needed). There have been a lot of tweaks to catch more spam, and they are enabled by modifying the default_bayes_customize.ini file.
IMHO, I would tell your users to replace the default_bayes_customize.ini with my settings below *smile*, as I have found that the more tokens, the better for my mail stream. All you have to say is, "train only on unsures with no initial training". This way they will only train on current messages, and with a buffed .ini file that is all they would ever need. With the settings below I am able to distinguish PayPal phishing scams in 3 trains...

-----------------------------------------
[Classifier]
x-use_bigrams: True
max_discriminators: 150

[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True
x-fancy_url_recognition: True
x-pick_apart_urls: True
x-reduce_habeas_headers: True
x-search_for_habeas_headers: True
basic_header_tokenize: True
basic_header_skip: date x-.* domainkey-signature
check_octets: True
octet_prefix_size: 5
mine_received_headers: True
address_headers: from sender reply-to errors-to
generate_long_skips: True
summarize_email_prefixes: True
summarize_email_suffixes: True
skip_max_word_size: 50

[URLRetriever]
x-cache_directory: url-cache
x-cache_expiry_days: 31
x-only_slurp_base: True
x-slurp_urls: True
x-web_prefix:web:
-----------------------------------------

Erik Brown

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hely Holdings Pty Ltd (Sales Dept.)
Sent: Sunday, September 18, 2005 1:32 AM
To: [email protected]
Subject: [Spambayes] Tony Meyer - Training question

Hi Tony.

Back in August 2004 you kindly critiqued a spam chapter for me from my security book "The Hacker's Nightmare". I am gearing up for a new edition of THN and will be expanding the spam section a fair bit in the process. I deal only with the Outlook plug-in. At this time I would like to know if you have changed your opinion on training since then. Here's what you said in a message to me on August 10, 2004 after reading my draft chapter.

---------- BEGIN QUOTE ----------
Training is a difficult issue to write about.
The problem is that not enough is yet known about the best ways to train, and that the Outlook plug-in really only facilitates a couple of different methods. However, it is almost certain that 'train on everything' is a bad idea, that smaller databases are generally better than large ones, and that imbalances are bad. These are not hard rules. Your training described has a huge imbalance, and is a pretty large database, and is (at least initially) train-on-everything, and yet I presume you have had good results or you wouldn't be writing this. In general, though, based on both testing and feedback from users, the above is true.

I believe that the best training method to recommend to people using the plug-in is:

* Don't do *any* initial training. (Everything will now end up in the 'unsure' folder.)
* Train on *everything* that ends up in the 'unsure' folder. At first, this will be a lot of mail, but it will rapidly reduce.
* Train on *all* mistakes (at first, there may be some false positives/false negatives, but these will even more rapidly reduce).

Once 10-20 mails of each type have been trained, the system should be very accurate.
---------- END QUOTE ----------

For my target audience I need to make all explanations and instructions as simple as possible. If I started describing techniques like Seth Goodman's "Recursive Training Set Selection For Outlook" I'd have them throwing up out of fear and confusion. I basically distilled your advice down to "do no pre-training at all - train only on the UNSURE folder". While that seems to work fine and has been well received, it was after all a year and several releases ago.

Where do you stand on training these days, for people who simply will not or cannot follow a complicated set of instructions?

Best regards,
- Bill H.

--
We take security very seriously. All outgoing mail is certified Virus Free. To boost YOUR security visit The Hacker's Nightmare: http://HackersNightmare.com.
Checked by AVG Anti-Virus.
Version: 7.0.344 / Virus Database: 267.11.1/104 - Release Date: 16/09/2005
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
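
For anyone who wants to sanity-check a customized .ini like the one Erik posted before dropping it in place, here is a minimal sketch. It assumes the file can be read with Python's standard configparser module, which accepts the same "key: value" form; this is not SpamBayes' own option loader, and the names SETTINGS and load_settings are made up for illustration. Only a subset of the posted settings is shown.

```python
import configparser

# A subset of the settings quoted in the message above, in "key: value" form.
SETTINGS = """
[Classifier]
x-use_bigrams: True
max_discriminators: 150

[Tokenizer]
mine_received_headers: True
skip_max_word_size: 50
basic_header_skip: date x-.* domainkey-signature
"""


def load_settings(text):
    """Parse .ini text with the standard library parser (hypothetical helper)."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


cfg = load_settings(SETTINGS)

# Typed getters raise ValueError on a mistyped value such as "Ture" or "15O".
use_bigrams = cfg.getboolean("Classifier", "x-use_bigrams")
max_disc = cfg.getint("Classifier", "max_discriminators")

# Space-separated options split into a plain list of patterns.
skip_headers = cfg.get("Tokenizer", "basic_header_skip").split()

print(use_bigrams, max_disc, skip_headers)
```

A check like this only catches typos in values and missing sections. It says nothing about whether SpamBayes recognizes a given option name, so the plug-in's own documentation remains the reference for that.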
