Hi,

thank you all for your help. I think I'll recompute the entire model
once enough files have been added to a category (threshold to be
determined), because in some situations I may also want to add a new
category. Computing the model doesn't take that long anyway.
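
For illustration, the threshold idea could be sketched roughly as below
(Python rather than Mahout's Java, just to keep it short). The names
RetrainBuffer and train_fn and the default threshold of 50 are made up
for this sketch and are not anything Mahout provides; train_fn stands in
for whatever rebuilds the model from the full training set. The sketch
also skips duplicate documents, as Temese suggested.

```python
import hashlib
from typing import Callable, List, Set, Tuple

Example = Tuple[str, str]  # (document text, category label)


class RetrainBuffer:
    """Collects newly labeled documents and triggers a full rebuild
    once `threshold` new, non-duplicate documents have arrived."""

    def __init__(self, train_fn: Callable[[List[Example]], object],
                 threshold: int = 50):
        self.train_fn = train_fn      # hypothetical: rebuilds the whole model
        self.threshold = threshold    # placeholder value, to be tuned
        self.training_set: List[Example] = []
        self.seen: Set[str] = set()   # fingerprints of stored documents
        self.pending = 0              # new documents since the last rebuild
        self.model = None

    @staticmethod
    def _fingerprint(text: str) -> str:
        # Normalise case and whitespace so trivial variants count as dupes.
        normalised = " ".join(text.lower().split())
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    def add(self, text: str, label: str) -> bool:
        """Store a labeled document; return True if a rebuild was triggered."""
        key = self._fingerprint(text)
        if key in self.seen:
            return False              # dupe: nothing new to learn
        self.seen.add(key)
        self.training_set.append((text, label))
        self.pending += 1
        if self.pending >= self.threshold:
            # Full rebuild on the whole set, so new categories are picked up.
            self.model = self.train_fn(self.training_set)
            self.pending = 0
            return True
        return False
```

Since every rebuild sees the whole training set, a label that has never
appeared before simply becomes a new category at the next rebuild. Note
that duplicates are detected on the text alone, so a re-labeled copy of a
known document would be ignored in this sketch.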

cheers,
Boris

On Wed, Mar 7, 2012 at 03:39, Paritosh Ranjan <[email protected]> wrote:
> You can look into ClusterIterator. It requires prior information but is able
> to train on the fly.
>
>
> On 06-03-2012 22:14, Temese Szalai wrote:
>>
>> One other thing to consider (and I don't know if Mahout supports this,
>> because I am very new to Mahout, although very experienced with text
>> classification specifically) is that I have seen unsupervised or
>> semi-supervised learning approaches work for an "on the fly"
>> re-computation of a model. This can be particularly helpful for data
>> bootstrapping, i.e., cases where you have a small initial set of data
>> and want to put some kind of filter and feedback loop in place to build
>> a curated data set.
>>
>> This is different from classification, though, where you have a labeled
>> data set and train the classifier to identify things that look like that
>> data set.
>>
>> The once or twice I've seen unsupervised or semi-supervised learning
>> applied to create a "model", it worked OK when there was only one
>> category. So, if you are building a classic binary classifier, only care
>> about one category, and your system will work just fine with that (e.g.,
>> "is this spam? y/n?"), this might be worth looking into if your use
>> cases and business needs really demand something on the fly and can
>> tolerate lower precision and recall while the system learns.
>>
>> I don't know if this is useful to you at all.
>>
>> Temese
>>
>> On Tue, Mar 6, 2012 at 8:32 AM, Boris Fersing<[email protected]>  wrote:
>>
>>> Thanks Charles, I'll have a look at it.
>>>
>>> cheers,
>>> Boris
>>>
>>> On Tue, Mar 6, 2012 at 11:25, Charles Earl<[email protected]>  wrote:
>>>>
>>>> Boris,
>>>> Have you looked at online decision trees and the like?
>>>> http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
>>>> I think ultimately the concept boils down to Temese's observation of
>>>> there being some measure (in the paper's case, concept drift)
>>>> that triggers re-training of the entire set.
>>>> C
>>>> On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:
>>>>
>>>>> Hi Temese,
>>>>>
>>>>> thank you very much for this information.
>>>>>
>>>>> Boris
>>>>>
>>>>> On Tue, Mar 6, 2012 at 11:14, Temese Szalai <[email protected]> wrote:
>>>>>>
>>>>>> Hi Boris -
>>>>>>
>>>>>> Unless Mahout has super-powers that I am not aware of, years of
>>>>>> experience in text classification tell me that yes, you will have to
>>>>>> rebuild the classifier model regularly as new labeled data becomes
>>>>>> available.
>>>>>>
>>>>>> If you are building a system that incorporates a user feedback loop, as
>>>>>> it sounds like you are (i.e., "yes, this message is spam"), one thing
>>>>>> that might reduce the amount of classifier re-training would be to
>>>>>> verify that the new incoming labeled document is not already in your
>>>>>> data set, i.e., not a dupe. Additionally, you probably want to wait to
>>>>>> retrain until you have either some critical mass of newly labeled
>>>>>> documents or a critical data point to include.
>>>>>>
>>>>>> If someone has the ability to say "no, this is not spam", keeping that
>>>>>> data as labeled data to add to your anti-content/negative content set
>>>>>> would be valuable.
>>>>>> Best,
>>>>>> Temese
>>>>>>
>>>>>> On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> is there a way to update a classifier model on the fly? Or do I need
>>>>>>> to recompute everything each time I add a document to a category in
>>>>>>> the training set?
>>>>>>>
>>>>>>> I would like to build something similar to some spam filters, where
>>>>>>> you can confirm whether a message is spam or not, and thus train the
>>>>>>> classifier.
>>>>>>>
>>>>>>> regards,
>>>>>>> Boris
>>>>>>> --
>>>>>>> 42
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> 42
>>>
>>>
>>>
>>> --
>>> 42
>>>
>



-- 
42
