Hi, thank you all for your help. I think I'll recompute the entire model after some files have been added to a category (threshold to be determined), because in some situations I may also want to add a new category. Computing the model doesn't take that long anyway.
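[The threshold-triggered full recompute Boris describes could be sketched like this. This is an illustrative sketch, not Mahout code; the names (RetrainTrigger, threshold, retrain) are assumptions, and retrain stands in for whatever actually rebuilds the model.]

```python
class RetrainTrigger:
    """Accumulate newly labeled documents; rebuild the whole model once enough arrive."""

    def __init__(self, threshold, retrain):
        self.threshold = threshold  # how many new labeled docs before recomputing
        self.retrain = retrain      # callback that recomputes the entire model
        self.pending = 0

    def add_document(self, doc, category):
        """Record a newly labeled document and recompute when the threshold is hit."""
        self.pending += 1
        if self.pending >= self.threshold:
            self.retrain()
            self.pending = 0

# usage: with a threshold of 3, seven new documents cause two full recomputes
calls = []
trigger = RetrainTrigger(threshold=3, retrain=lambda: calls.append("retrained"))
for i in range(7):
    trigger.add_document(f"doc{i}", "spam")
print(len(calls))  # 2
```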
cheers,
Boris

On Wed, Mar 7, 2012 at 03:39, Paritosh Ranjan <[email protected]> wrote:
> You can look into ClusterIterator. It requires prior information but is
> able to train on the fly.
>
> On 06-03-2012 22:14, Temese Szalai wrote:
>> One other thing to consider (and I don't know whether Mahout supports
>> this, because I am very new to Mahout, although very experienced with
>> text classification specifically) is that I have seen unsupervised or
>> semi-supervised learning approaches work for an "on the fly"
>> re-computation of a model. This can be particularly helpful for data
>> bootstrapping, i.e., cases where you have a small initial set of data
>> and want to put some kind of filter and feedback loop in place to
>> build a curated data set.
>>
>> This is different from classification, though, where you have a
>> labeled data set and train the classifier to identify things that
>> look like that data set.
>>
>> On the once or twice I've seen unsupervised or semi-supervised
>> learning applied to create a "model", I've seen it work OK when there
>> is only one category. So, if you are building a classic binary
>> classifier and only care about one category, and your system will
>> work just fine with that (i.e., "is this spam? y/n?"), this might be
>> worth looking into if your use cases and business needs really demand
>> something on the fly and can possibly handle lower precision and
>> recall while the system learns.
>>
>> I don't know if this is useful to you at all.
>>
>> Temese
>>
>> On Tue, Mar 6, 2012 at 8:32 AM, Boris Fersing <[email protected]> wrote:
>>> Thanks Charles, I'll have a look at it.
>>>
>>> cheers,
>>> Boris
>>>
>>> On Tue, Mar 6, 2012 at 11:25, Charles Earl <[email protected]> wrote:
>>>> Boris,
>>>> Have you looked at online decision trees and the like?
>>>> http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
>>>> I think the concept ultimately boils down to Temese's observation of
>>>> there being some measure (in the paper's case, concept drift) that
>>>> triggers re-training on the entire set.
>>>> C
>>>>
>>>> On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:
>>>>> Hi Temese,
>>>>>
>>>>> thank you very much for this information.
>>>>>
>>>>> Boris
>>>>>
>>>>> On Tue, Mar 6, 2012 at 11:14, Temese Szalai <[email protected]> wrote:
>>>>>> Hi Boris -
>>>>>>
>>>>>> Unless Mahout has super-powers that I am not aware of, years of
>>>>>> experience in text classification tell me that - yes, you will
>>>>>> have to rebuild the classifier model regularly as new labeled data
>>>>>> becomes available.
>>>>>>
>>>>>> If you are building a system that incorporates a user feedback
>>>>>> loop, as it sounds like you are (i.e., "yes, this message is
>>>>>> spam"), one thing that might reduce the amount of classifier
>>>>>> re-training would be to verify that the new incoming labeled
>>>>>> document is not already in your data set, i.e., not a dupe.
>>>>>> Additionally, you probably want to wait to retrain until you have
>>>>>> some critical mass of newly labeled documents, or a critical data
>>>>>> point to include.
>>>>>>
>>>>>> If someone has the ability to say "no, this is not spam", keeping
>>>>>> that data as labeled data to add to your anti-content/negative
>>>>>> content set would be valuable.
>>>>>> Best,
>>>>>> Temese
>>>>>>
>>>>>> On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing <[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> is there a way to update a classifier model on the fly? Or do I
>>>>>>> need to recompute everything each time I add a document to a
>>>>>>> category in the training set?
>>>>>>>
>>>>>>> I would like to build something similar to some spam filters,
>>>>>>> where you can confirm whether a message is spam or not, and thus
>>>>>>> train the classifier.
>>>>>>>
>>>>>>> regards,
>>>>>>> Boris
>>>>>>> --
>>>>>>> 42
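[Temese's dedup suggestion — don't count a user-confirmed document toward retraining if an identical one is already in the labeled set — could be sketched as below. Fingerprinting the whitespace-normalized, lowercased text is one simple assumption for "identical"; none of this is something Mahout provides out of the box.]

```python
import hashlib

class LabeledSet:
    """Labeled training documents with a simple exact-duplicate check."""

    def __init__(self):
        self.seen = set()
        self.docs = []

    @staticmethod
    def fingerprint(text):
        # Normalize whitespace and case, then hash, so trivially
        # reformatted copies of the same message count as dupes.
        normalized = " ".join(text.lower().split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def add(self, text, label):
        """Add a labeled doc; return False (no retrain credit) if it's a dupe."""
        fp = self.fingerprint(text)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.docs.append((text, label))
        return True

labeled = LabeledSet()
print(labeled.add("Buy cheap meds NOW", "spam"))   # True: new document
print(labeled.add("buy  cheap meds now", "spam"))  # False: dupe after normalization
```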
