on Thu Jan 03 2008, gpr <grp_eee-AT-yahoo.com> wrote:

> Hi,
>
> I have a large number of good and spam messages (a few thousand) collected
> over a year and would like to use these for initial training. But I know
> that it is preferable to train with only a small subset of these messages
> (maybe a thousand: 500 spam and 500 ham) to keep my training db minimal,
> fast and effective.
>
> My query is: do I need to manually pick out some thousand latest messages
> from this large corpus and input them to SpamBayes, or can SpamBayes
> automatically (in fact smartly) do this job for me when given the entire
> set and a required corpus size?
>
> If this feature is not available, would this not be a hell of a useful
> feature to support? OK, here is why I think manual selection (just picking
> the latest 1000 messages, for a corpus size of 1000, from my large corpus)
> may not be very effective:
>
> Not all the messages from the corpus may need to be trained (using the
> train on error+unsures strategy). For example, if the last hundred good
> messages I received are of the same type (e.g. a long-running thread about
> a specific topic), then SpamBayes can easily classify any future message
> of this type by training on just a small part of these messages. So to get
> to a corpus size of 1000 messages (and to train SpamBayes over a wide
> coverage of spam and ham message types), I may need to repeat the training
> multiple times with different subsets until I achieve an effective corpus.

I use the train-to-exhaustion script, contrib/tte.py, whose "prune" option
can remove from your training set the messages that don't make any
difference.

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
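
To make the train-to-exhaustion idea concrete, here is a minimal sketch of
the general technique. It is not the code in contrib/tte.py and does not use
the SpamBayes API: ToyClassifier, train_to_exhaustion, prune, and the 0.2 /
0.8 cutoffs are all made up for illustration. The loop trains only on
messages the classifier currently gets wrong or is unsure about, repeats
until a full pass produces no mistakes, and then keeps only the messages
that were ever used for training.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real spam classifier (not SpamBayes)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.totals = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        for tok in set(text.lower().split()):
            self.counts[label][tok] += 1
        self.totals[label] += 1

    def score(self, text):
        """Return a spam score in (0, 1); 0.5 means no evidence either way."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            sc = self.counts["spam"][tok]
            hc = self.counts["ham"][tok]
            n = sc + hc
            if n == 0:
                continue  # unseen tokens are treated as neutral
            sf = sc / max(self.totals["spam"], 1)
            hf = hc / max(self.totals["ham"], 1)
            p = sf / (sf + hf)
            f = (0.5 + n * p) / (1 + n)  # shrink toward 0.5 when evidence is thin
            log_odds += math.log(f / (1 - f))
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_to_exhaustion(messages, ham_cutoff=0.2, spam_cutoff=0.8, max_rounds=20):
    """Train on errors and unsures until a full pass makes no mistakes.

    messages is a list of (text, label) pairs, label being "ham" or "spam".
    Returns the classifier and the set of message indices that were trained on.
    """
    clf = ToyClassifier()
    used = set()
    for _ in range(max_rounds):
        mistakes = 0
        for i, (text, label) in enumerate(messages):
            p = clf.score(text)
            ok = p >= spam_cutoff if label == "spam" else p <= ham_cutoff
            if not ok:
                clf.train(text, label)
                used.add(i)
                mistakes += 1
        if mistakes == 0:
            break
    return clf, used


def prune(messages, used):
    """Keep only the messages that were needed for training."""
    return [m for i, m in enumerate(messages) if i in used]


messages = [
    ("cheap pills buy now", "spam"),
    ("boost build thread patch review", "ham"),
    ("boost build thread patch merged", "ham"),
    ("boost build thread patch tests", "ham"),
    ("buy cheap watches now", "spam"),
    ("lunch meeting tomorrow noon", "ham"),
]

clf, used = train_to_exhaustion(messages)
kept = prune(messages, used)
print(f"kept {len(kept)} of {len(messages)} messages")
```

In this toy corpus the follow-ups in the repeated thread and the second,
similar spam are never trained on, so pruning drops them. That is the
situation described in the quoted message: once one message of a type has
been learned, the rest of that type add nothing to the training set.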
