Boris,
Have you looked at online decision trees and the like? See, for example:
http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
I think the concept ultimately boils down to Temese's observation: there is
some measure (in the paper's case, concept drift) that triggers re-training
on the entire set.
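
To make that concrete, here is a rough Python sketch of such a trigger. It is
not the algorithm from the paper and not anything Mahout provides; it simply
combines Temese's two suggestions (drop dupes, wait for a critical mass) with
a crude drift measure, namely the recent error rate on user feedback. The
class name, method names, and thresholds are all made up for illustration.

```python
import hashlib
from collections import deque


class RetrainTrigger:
    """Hypothetical helper that decides when a full model rebuild is worthwhile."""

    def __init__(self, min_new_docs=100, window=50, max_error_rate=0.2):
        # All three thresholds are illustrative assumptions, not tuned values.
        self.min_new_docs = min_new_docs
        self.max_error_rate = max_error_rate
        self.seen = set()                     # fingerprints of documents already held
        self.pending = []                     # new labeled docs since the last rebuild
        self.outcomes = deque(maxlen=window)  # True where the model was wrong

    @staticmethod
    def _fingerprint(text):
        # Normalize case and whitespace so trivial variants count as dupes.
        normalized = " ".join(text.lower().split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def add_feedback(self, text, label, predicted_label):
        """Record one piece of user feedback; return False if it was a dupe."""
        self.outcomes.append(predicted_label != label)
        key = self._fingerprint(text)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.pending.append((text, label))
        return True

    def error_rate(self):
        """Fraction of recent feedback where the model's prediction was wrong."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self):
        # Rebuild on a critical mass of new documents, or when a full window
        # of recent feedback shows the model is wrong too often.
        enough_new = len(self.pending) >= self.min_new_docs
        window_full = len(self.outcomes) == self.outcomes.maxlen
        drifting = window_full and self.error_rate() > self.max_error_rate
        return enough_new or drifting

    def drain(self):
        """Hand back the pending batch and reset the counters after a rebuild."""
        batch, self.pending = self.pending, []
        self.outcomes.clear()
        return batch
```

You would call add_feedback() each time a user confirms or corrects a label,
and when should_retrain() returns True, append drain() to the training set
and rebuild the model from scratch.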
C
On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:

> Hi Temese,
> 
> thank you very much for this information.
> 
> Boris
> 
> On Tue, Mar 6, 2012 at 11:14, Temese Szalai <[email protected]> wrote:
>> Hi Boris -
>> 
>> Unless Mahout has super-powers that I am not aware of, years of experience
>> in text classification tell me that - yes, you will have to rebuild the
>> classifier model regularly as new labeled data becomes available.
>> 
>> If you are building a system that incorporates a user feedback loop as it
>> sounds like you are (i.e., "yes, this message is spam"), one thing that
>> might reduce the amount of classifier re-training would be to verify that
>> the new incoming labeled document is not already in your data set, i.e.,
>> not a dupe. Additionally, you probably want to wait to retrain until you
>> have some critical mass of newly labeled documents or else you have a
>> critical data point to include.
>> 
>> If someone has the ability to say "no this is not spam", keeping that data
>> as labeled data to add to your anti-content/negative content set would be
>> valuable.
>> Best,
>> Temese
>> 
>> On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing <[email protected]> wrote:
>> 
>>> Hi all,
>>> 
>>> is there a way to update a classifier model on the fly? Or do I need
>>> to recompute everything each time I add a document to a category in
>>> the training set?
>>> 
>>> I would like to build something similar to some spam filters, where
>>> you can confirm whether a message is spam or not, and thus train the
>>> classifier.
>>> 
>>> regards,
>>> Boris
>>> --
>>> 42
>>> 
> 
> 
> 
> -- 
> 42
