on Sun Jul 29 2007, skip-AT-pobox.com wrote:

>     Amedee> I have the same experience:
>
>     Amedee> [EMAIL PROTECTED] { ~ }$ ./spamstats
>     Amedee>  Spam: 2415 Ham: 651
>
>     Amedee> That's 3.7:1, and it's increasing.
>
> One of the reasons I can keep a nearly 1:1 ratio is that when it gets a bit
> out of whack I simply delete some old spam.  In my experience the nature of
> spam changes over time while the nature of ham rarely does.  I also use
> train-to-exhaustion which only trains in fixed ratios.

No longer.  There's the --unbalanced option.  Also, I've been using
this very simple patch.  Instead of insanely barreling ahead with the
specified ratio even when the corpora are closer to 1:1, it reverts to
using the ratio in the corpora.  Thus the ratio parameter becomes a
ratio /limit/ and, when combined with --reverse, the oldest spam that
falls outside the limit tends to be ignored.
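
For anyone who wants to try the comparison outside of tte.py, here is a
rough standalone sketch of the test the patch makes.  The function name
effective_ratio and the fallback used when no ratio is given are
assumptions for illustration only; they are not part of tte.py.

```python
def effective_ratio(ratio, nspam, nham):
    """Pick the (spam, ham) training ratio the way the patch does.

    ratio is the requested (rspam, rham) pair or None; nspam and nham
    are the corpus sizes.  Hypothetical helper, not part of tte.py.
    """
    if not ratio:
        # Assumption: with no requested ratio, use the corpus sizes,
        # as the pre-patch code did.
        return nspam, nham

    rspam, rham = ratio
    # Cross-multiplied form of rspam/rham > nspam/nham, which avoids
    # division.  When this agrees with (rspam > rham), the patch swaps
    # in the corpus counts instead of the requested ratio.
    if (rspam > rham) == (rspam * nham > rham * nspam):
        return nspam, nham
    return rspam, rham


# Corpus sizes from the quoted spamstats output (about 3.7:1).
print(effective_ratio((4, 1), 2415, 651))  # (2415, 651)
print(effective_ratio((2, 1), 2415, 651))  # (2, 1)
```

With the 2415:651 corpus quoted above, a requested 4:1 is looser than
what the corpora hold, so the corpus counts are used; a requested 2:1 is
tighter, so it stays in force as the limit.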

Index: tte.py
===================================================================
--- tte.py	(revision 3156)
+++ tte.py	(working copy)
@@ -114,10 +114,11 @@
         hambone_ = list(reversed(hambone_))
         spamcan_ = list(reversed(spamcan_))
     
+    nspam,nham = len(spamcan_),len(hambone_)
     if ratio:
         rspam,rham = ratio
-    else:
-        rspam,rham = len(spamcan_),len(hambone_)
+        if (rspam > rham) == (rspam * nham > rham * nspam):
+            rspam,rham = nspam,nham
 
     # define some indexing constants
     ham = 0
-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com
