I forgot to mention that they must also train on all false negatives and false positives.
Erik Brown

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hely Holdings Pty Ltd (Sales Dept.)
Sent: Sunday, September 18, 2005 1:32 AM
To: [email protected]
Subject: [Spambayes] Tony Meyer - Training question

Hi Tony.

Back in August 2004 you kindly critiqued a spam chapter for me from my security book "The Hacker's Nightmare". I am gearing up for a new edition of THN and will be expanding the spam section a fair bit in the process. I deal only with the Outlook plug-in.

At this time I would like to know if you have changed your opinion on training since then. Here's what you said in a message to me on August 10, 2004 after reading my draft chapter.

---------- BEGIN QUOTE ----------

Training is a difficult issue to write about. The problem is that not enough is yet known about the best ways to train, and that the Outlook plug-in really only facilitates a couple of different methods. However, it is almost certain that 'train on everything' is a bad idea, that smaller databases are generally better than large ones, and that imbalances are bad.

These are not hard rules. Your training described has a huge imbalance, and is a pretty large database, and is (at least initially) train-on-everything, and yet I presume you have had good results or you wouldn't be writing this. In general, though, based on both testing and feedback from users, the above is true.

I believe that the best training method to recommend to people using the plug-in is:

* Don't do *any* initial training. (Everything will now end up in the 'unsure' folder.)
* Train on *everything* that ends up in the 'unsure' folder. At first, this will be a lot of mail, but it will rapidly reduce.
* Train on *all* mistakes (at first, there may be some false positives/false negatives, but these will even more rapidly reduce).

Once 10-20 mails of each type have been trained, the system should be very accurate.
---------- END QUOTE ----------

For my target audience I need to make all explanations and instructions as simple as possible. If I started describing techniques like Seth Goodman's "Recursive Training Set Selection For Outlook" I'd have them throwing up out of fear and confusion.

I basically distilled your advice down to "do no pre-training at all - train only on the UNSURE folder". While that seems to work fine and has been well received, it was after all a year and several releases ago.

Where do you stand on training these days, for people who simply will not or cannot follow a complicated set of instructions?

Best regards,
- Bill H.

--
We take security very seriously. All outgoing mail is certified Virus Free.
To boost YOUR security visit The Hacker's Nightmare: http://HackersNightmare.com.

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
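
The routine discussed in this thread (no initial training, train on everything that lands in 'unsure', and train on every false positive and false negative) can be written as a small decision rule. What follows is only a rough sketch and is not SpamBayes code: the names `classify` and `should_train` are hypothetical, and the cutoff values 0.20 and 0.90 are assumed for illustration rather than taken from the messages above.

```python
# Illustrative sketch only; not part of SpamBayes or the Outlook plug-in.
# The thresholds below are assumptions chosen for the example.
HAM_CUTOFF = 0.20   # scores below this are filed as ham (assumed value)
SPAM_CUTOFF = 0.90  # scores at or above this are filed as spam (assumed value)


def classify(score: float) -> str:
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(score: float, true_label: str) -> bool:
    """Decide whether a message should be trained on.

    true_label is what the user says the message really is: 'ham' or 'spam'.
    Train when the filter was unsure, or when it was confidently wrong
    (a false positive or a false negative). Skip correctly filed mail.
    """
    verdict = classify(score)
    if verdict == "unsure":
        return True
    return verdict != true_label
```

With these assumed cutoffs, a ham message scored 0.5 is trained on because it was unsure, a ham message scored 0.95 is trained on because it is a false positive, and a ham message scored 0.05 is skipped because it was already filed correctly.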
