> The difficulty is that there's no way to prune the database, either to
> adjust the imbalance or to simply decrease the database's size. You have
> to start again from scratch. The Spambayes establishment doesn't
> consider this to be much of an issue, since (as Seth points out)
> Spambayes does a good job of starting from scratch and building an
> acceptable scoring system after seeing surprisingly little data.

I'm not sure that I'd say that it's not considered much of an issue. The problem is that pruning a database is difficult. As I understand it, the only safe way to do this is to remove/add entire messages, rather than individual tokens. However, the SpamBayes database only keeps track of individual tokens and their ham/spam counts, so we don't have enough information to remove a set of added tokens unless the original message is available. (IIRC, Skip created an enhanced database that kept enough information to do this at some point; the code is probably around somewhere.)

I'm still mostly of the opinion that using some sort of 'train to exhaustion' regime would work best. This would allow both expiry and balancing (it essentially does pruning), and still deliver excellent results. However, it would mean keeping cached mail around for a while, at least. I just don't have the time at the moment (as the failure to get 1.1a2 out demonstrates) to implement this for the Outlook plug-in or sb_server. (I did do a partial sb_server implementation some time ago, but I don't recall how far I got.)

> Another point (I've made it before, but I guess it bears repeating) is
> that the database imbalance is absolutely inherent in the current
> implementation of the Spambayes algorithm, at least in the Outlook
> plugin. Because users set the cutoffs to avoid false positives (you have
> to if the program is going to be useful), virtually all of Spambayes's
> mistakes are false negatives. Since mistakes are all you train on after
> the initial startup, virtually all new entries into the database are
> spam. The better job Spambayes does, the worse the imbalance becomes.

Training should be done on all unsure messages, too. When I was using the Outlook plug-in, I commonly had ham end up as (low scoring) unsure. Training on those should reduce the imbalance somewhat. Theoretically, once SpamBayes starts making mistakes, the number of ham-as-unsure would increase, thus helping the balance.

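To illustrate the earlier point about why pruning has to work on whole messages, here is a rough sketch. It is a toy, not the real SpamBayes classifier or database: the `TokenCounts` class, its method names, and the choice to count each token once per message are all assumptions made for the example. The thing to notice is that `untrain` needs the message's token list as an argument. The store itself only holds per-token totals, so it cannot tell which counts came from which message.

```python
class TokenCounts:
    """Toy per-token ham/spam counter (hypothetical, not the SpamBayes API)."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.counts = {}  # token -> [ham_count, spam_count]

    def train(self, tokens, is_spam):
        """Add one whole message to the totals."""
        idx = 1 if is_spam else 0
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        # Assumption: each distinct token counts once per message.
        for tok in set(tokens):
            self.counts.setdefault(tok, [0, 0])[idx] += 1

    def untrain(self, tokens, is_spam):
        """Remove one whole message; requires the original token list."""
        idx = 1 if is_spam else 0
        distinct = set(tokens)
        # Validate first so a bad call leaves the totals untouched.
        for tok in distinct:
            if self.counts.get(tok, [0, 0])[idx] == 0:
                raise ValueError("token %r was never trained this way" % tok)
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1
        for tok in distinct:
            rec = self.counts[tok]
            rec[idx] -= 1
            if rec == [0, 0]:
                del self.counts[tok]  # drop tokens nobody references


db = TokenCounts()
spam_msg = ["cheap", "pills", "now"]
ham_msg = ["meeting", "now"]
db.train(spam_msg, is_spam=True)
db.train(ham_msg, is_spam=False)

# Expiring the spam is only possible because we kept its tokens around.
db.untrain(spam_msg, is_spam=True)
```

Keeping cached mail around for a while, as 'train to exhaustion' would require, is one way of making sure that token list is still available when a message needs to be expired.
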
Something that I think would help is not training on every false negative/spam-as-unsure: something along the lines of training on one, then rescoring the others to see if they still need training. However, the plug-in does not make this a simple task, at least at the moment.

> [...] it's a problem that has yet to be solved.

I certainly agree that this is true. ISTM that the 'imbalance' problem is one that is shared by other filters as well (cf. the discussion of the problem in the TREC Spam Track papers). Anyone know of a good statistician with time to spare? <0.1 wink>

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
