On -0600, Tony Meyer wrote:

> I'm still mostly of the opinion that using some sort of 'train to
> exhaustion' regime would work best.  This would allow both expiry and
> balancing (it essentially does pruning), and still deliver excellent
> results.

I agree that train-to-exhaustion is very appealing.  How does it
accomplish expiry?

> However, it would mean keeping cached mail around for a
> while, at least.

Well, not cached mail, but the list of tokens that were trained from
that message.  In the case of train-to-exhaustion, you'd also need a
training count to tell you how many times you trained on the message.

<...>

> Training should be done on all unsure messages, too.  When I was
> using the Outlook plug-in, I commonly had ham end up as (low scoring)
> unsure.  That should reduce the imbalance somewhat.  Theoretically,
> once SpamBayes starts making mistakes, the number of ham-as-unsure
> would increase, thus helping the balance.

I use thresholds of 0.05 and 0.80, and the result is that virtually
every message in unsure is ham.  It is a convenience to not have the
same number of ham as spam classify as unsure.  So unless you're
willing to leave the ham threshold very low and tolerate ham showing up
in the unsure folder pretty regularly, the database will tend to become
unbalanced over time, in addition to growing faster than it otherwise
needs to.

> Something that I think would help is not training every false
> negative/spam-as-unsure.  Something along the lines of training one,
> then rescoring the others to see if they need training.  However, the
> plug-in does not make this a simple task, at least at the moment.

Yes, this is an alternative to just deleting unsure spam.  Here's a
scheme that would automate this and encourage users to avoid
overtraining.

- Create two new folders under the unsure folder called "reclassified
  as ham" and "reclassified as spam".

- Upon a training event, rescore the messages in the ham, spam and
  unsure folders.  If messages change classification, do as follows:
  move newly unsure messages into the unsure folder, move newly
  classified ham into "reclassified as ham" and newly classified spam
  into "reclassified as spam".

- Have an additional button for "accept training" that moves messages
  from "reclassified as ham" into ham and messages from "reclassified
  as spam" into spam, without doing incremental training.  After the
  operation is complete, the "accept training" button and the empty
  "reclassified as ..." folders would disappear.

The reason to delete the empty folders is that upon training a new
message, seeing one or both of the "reclassified as ..." folders appear
would draw the user's attention to any reclassifications, which are
probably mistakes that need to be corrected.

Here are some pros and cons.

pro:

1) Makes results of training a single message immediately obvious.

2) Removes unsures that now classify as ham or spam from the unsure
   folder.

3) Avoids leaving newly created false positives and false negatives in
   the ham and spam folders, where they are easy to miss.

4) Makes it more obvious when a user trains a message into the wrong
   classification, as several other messages will immediately move to
   the unsure or "reclassified as ..." folders.

5) Does not require the user to display spam scores and make decisions
   based on them.

6) Encourages the user to train on the smallest number of messages
   necessary to create correct classifications.

7) Compatible with train-to-exhaustion.  If a message is trained as ham
   or spam but still doesn't classify correctly, it automatically goes
   back to the unsure folder.

con:

1) Requires dynamically creating and deleting two other folders under
   unsure.

2) Requires a third button for the unsure folder that is context
   sensitive.

3) Will generate user questions.

-- 
Seth Goodman

_______________________________________________
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
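
To make the folder-moving rules of the scheme above concrete, here is a
rough sketch in Python.  It is not plug-in code: the Message class, the
score_fn callback, and the function names are all made up for
illustration, and the cutoff comparison (ham strictly below 0.05, spam
at or above 0.80) is an assumption.  Folders are modelled as a plain
string on each message, so an empty "reclassified as ..." folder simply
does not show up.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# Thresholds quoted in the message; the comparison direction is an assumption.
HAM_CUTOFF = 0.05
SPAM_CUTOFF = 0.80

# Hypothetical folder names for messages whose classification changed.
RECLASSIFIED = {"ham": "reclassified as ham", "spam": "reclassified as spam"}


@dataclass
class Message:
    """Hypothetical stand-in for a stored message and its current folder."""
    msg_id: str
    folder: str


def classify(score: float) -> str:
    """Map a spam score to 'ham', 'spam', or 'unsure'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def rescore_after_training(messages: List[Message],
                           score_fn: Callable[[Message], float]) -> None:
    """Rescore ham, spam and unsure; move any message that changed class."""
    for msg in messages:
        if msg.folder not in ("ham", "spam", "unsure"):
            continue  # messages already awaiting review are left alone
        new_class = classify(score_fn(msg))
        if new_class == msg.folder:
            continue  # classification unchanged, message stays put
        if new_class == "unsure":
            msg.folder = "unsure"
        else:
            msg.folder = RECLASSIFIED[new_class]


def accept_training(messages: List[Message]) -> None:
    """The 'accept training' button: file reclassified mail, no training."""
    for msg in messages:
        if msg.folder == RECLASSIFIED["ham"]:
            msg.folder = "ham"
        elif msg.folder == RECLASSIFIED["spam"]:
            msg.folder = "spam"


def visible_reclassified_folders(messages: List[Message]) -> Set[str]:
    """Only non-empty 'reclassified as ...' folders would be displayed."""
    return {m.folder for m in messages if m.folder in RECLASSIFIED.values()}
```

The score_fn argument would be whatever rescoring call the client
actually has available.  The accept step never calls it, which is how
the sketch keeps "accept training" from doing any incremental training.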
