Boris,

Have you looked at online decision trees and the like?
http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf

I think ultimately the concept boils down to Temese's observation that there is some measure (in the paper's case, concept drift) that triggers re-training of the entire set.

C

On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:
> Hi Temese,
>
> thank you very much for this information.
>
> Boris
>
> On Tue, Mar 6, 2012 at 11:14, Temese Szalai <[email protected]> wrote:
>> Hi Boris -
>>
>> Unless Mahout has super-powers that I am not aware of, years of experience
>> in text classification tell me that - yes, you will have to rebuild the
>> classifier model regularly as new labeled data becomes available.
>>
>> If you are building a system that incorporates a user feedback loop as it
>> sounds like you are (i.e., "yes, this message is spam"), one thing that
>> might reduce the amount of classifier re-training would be to verify that
>> the new incoming labeled document is not already in your data set, i.e.,
>> not a dupe. Additionally, you probably want to wait to retrain until you
>> have some critical mass of newly labeled documents or else you have a
>> critical data point to include.
>>
>> If someone has the ability to say "no this is not spam", keeping that data
>> as labeled data to add to your anti-content/negative content set would be
>> valuable.
>>
>> Best,
>> Temese
>>
>> On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> is there a way to update a classifier model on the fly? Or do I need
>>> to recompute everything each time I add a document to a category in
>>> the training set?
>>>
>>> I would like to build something similar to some spam filters, where
>>> you can confirm that a message is a spam or not, and thus, train the
>>> classifier.
>>>
>>> regards,
>>> Boris
>>> --
>>> 42
>
> --
> 42
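For what it's worth, Temese's two suggestions (skip incoming labeled documents that are dupes, and retrain only once a critical mass of genuinely new documents has accumulated) can be sketched as a small trigger helper. This is a minimal illustration, not Mahout code; the class name, the hashing scheme, and the batch threshold of 100 are all assumptions for the sketch:

```python
import hashlib


class RetrainTrigger:
    """Decide when the classifier model should be rebuilt.

    Hypothetical helper combining the two ideas from the thread:
      1. dedupe: ignore labeled documents already in the training set;
      2. critical mass: signal a retrain only after enough new docs arrive.
    """

    def __init__(self, batch_size=100):
        self.batch_size = batch_size  # "critical mass" threshold (assumed value)
        self.seen = set()             # hashes of (label, text) pairs already stored
        self.pending = 0              # new non-duplicate docs since last retrain

    def add_labeled_doc(self, text, label):
        """Record one labeled document; return True if a retrain is due."""
        # Hash label together with text so the same text relabeled
        # (e.g. "no, this is not spam") still counts as new information.
        key = hashlib.sha1((label + "\x00" + text).encode("utf-8")).hexdigest()
        if key in self.seen:
            return False              # duplicate: nothing new to learn from
        self.seen.add(key)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.pending = 0
            return True               # caller should rebuild the model now
        return False
```

Usage would look like `if trigger.add_labeled_doc(msg, "spam"): rebuild_model()`, with the actual rebuild still being the full (batch) training run the rest of the thread describes.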
