on Fri Jul 27 2007, "Mark Hammond" <mhammond-AT-skippinet.com.au> wrote:

>> That is high relative to the conventional wisdom, but I'm questioning
>> the correctness of that wisdom.
>
> Check out this thread, which should give you a reasonable idea:
>
> http://mail.python.org/pipermail/spambayes-dev/2003-November/001578.html
>
>> Perhaps it's time to re-evaluate that statement?
>
> Google also shows anecdotal reports of poor results after an imbalance as
> low as 2:1, so I don't think it would be responsible to re-evaluate that
> statement until clear evidence was presented to the contrary.

Because those tests don't have all the same real-world constraints as I
do, I'm still trying to figure out whether they answer my question:

  Is it better to withhold data (some previously-misclassified spams)
  from the system when training in order to keep ham and spam balanced,
  or will I get better results if I let it see all the
  previously-misclassified spam despite the imbalance?

In my admittedly not-rigorously-tested experience, it's generally better
to let the system see more data (at least with train-to-exhaustion).

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com