on Sun Jul 29 2007, skip-AT-pobox.com wrote:

>     Amedee> I have the same experience:
>
>     Amedee> [EMAIL PROTECTED] { ~ }$ ./spamstats
>     Amedee> Spam: 2415 Ham: 651
>
>     Amedee> That's 3.7:1, and it's increasing.
>
> One of the reasons I can keep a nearly 1:1 ratio is that when it gets a bit
> out of whack I simply delete some old spam.  In my experience the nature of
> spam changes over time while the nature of ham rarely does.  I also use
> train-to-exhaustion which only trains in fixed ratios.

No longer.  There's the --unbalanced option.  Also, I've been using this
very simple patch, which, instead of insanely barreling ahead with the
ratio specified even if the corpora are closer to 1:1, reverts to using
the ratio in the corpora.  Thus the ratio parameter becomes a ratio
/limit/ and, along with using --reverse, the oldest spam that falls
outside the limit tends to be ignored.

Index: tte.py
===================================================================
--- tte.py	(revision 3156)
+++ tte.py	(working copy)
@@ -114,10 +114,11 @@
         hambone_ = list(reversed(hambone_))
         spamcan_ = list(reversed(spamcan_))
 
+    nspam,nham = len(spamcan_),len(hambone_)
     if ratio:
         rspam,rham = ratio
-    else:
-        rspam,rham = len(spamcan_),len(hambone_)
+        if (rspam > rham) == (rspam * nham > rham * nspam):
+            rspam,rham = nspam,nham
 
     # define some indexing constants
     ham = 0
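
To make the comparison in the patch easier to follow, here is a minimal standalone sketch of the same test. The function name `effective_ratio` is hypothetical and not part of tte.py. Returning the corpus counts when no ratio is given is an assumption borrowed from the `else` branch that the patch removes.

```python
def effective_ratio(ratio, nspam, nham):
    """Hypothetical helper mirroring the ratio test in the patch.

    ratio        -- requested (spam, ham) pair, or None
    nspam, nham  -- number of messages in the spam and ham corpora
    """
    if not ratio:
        # Assumption: no ratio requested, so use the corpus counts,
        # as the pre-patch else branch did.
        return nspam, nham

    rspam, rham = ratio
    # rspam * nham > rham * nspam is a cross-multiplied form of
    # rspam/rham > nspam/nham, so no division is needed.
    # The two sides agree when the requested ratio leans further from
    # 1:1 than the corpora do; in that case use the corpus counts.
    if (rspam > rham) == (rspam * nham > rham * nspam):
        return nspam, nham
    return rspam, rham


# Corpora are more spam-heavy than 2:1, so the 2:1 limit applies.
print(effective_ratio((2, 1), 2415, 651))   # (2, 1)
# Corpora are closer to 1:1 than 2:1, so the corpus ratio is used.
print(effective_ratio((2, 1), 700, 651))    # (700, 651)
```

One property of the comparison as written: with an exactly 1:1 ratio and more spam than ham, both sides are false, so the corpus counts are used.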

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com
_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html