You can look into ClusterIterator. It requires prior information but is able to train on the fly.

On 06-03-2012 22:14, Temese Szalai wrote:
One other thing to consider (and I don't know whether Mahout supports this,
because I am very new to Mahout, although very experienced with text
classification specifically) is that I have seen unsupervised or
semi-supervised learning approaches work for "on the fly" re-computation of
a model. This can be particularly helpful for data bootstrapping, i.e.,
cases where you have a small initial set of data and want to put some kind
of filter and feedback loop in place to build a curated data set.

This is different from classification, though, where you have a labeled
data set and train the classifier to identify things that look like that
data set.

In the once or twice I've seen unsupervised or semi-supervised learning
applied to create a "model", it worked OK when there was only one category.
So if you are building a classic binary classifier, only care about one
category, and your system will work just fine with that (i.e., "is this
spam? y/n"), this might be worth looking into, provided your use cases and
business needs really demand something on the fly and can tolerate lower
precision and recall while the system learns.
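To make the one-category, on-the-fly idea concrete, here is a minimal
conceptual sketch (plain Python, not Mahout; all names here are
illustrative) of a binary classifier that updates itself one labeled
example at a time, so a user-feedback loop can train it incrementally:

```python
import math

class OnlineBinaryClassifier:
    """Tiny online logistic-regression sketch: one SGD step per example."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = [0.0] * num_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        # Sigmoid of the linear score: probability of the positive class.
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, label):
        # Single gradient step on one example; no batch retraining needed.
        err = label - self.predict_proba(x)
        self.b += self.lr * err
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

# Feedback loop: each user confirmation becomes one training step.
clf = OnlineBinaryClassifier(num_features=2)
for _ in range(200):
    clf.train([1.0, 0.0], 1)   # confirmed "spam"-like example
    clf.train([0.0, 1.0], 0)   # confirmed "not spam"-like example
```

As Temese notes, precision and recall will be low early on, while only a
few feedback examples have been absorbed.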

I don't know if this is useful to you at all.

Temese

On Tue, Mar 6, 2012 at 8:32 AM, Boris Fersing<[email protected]>  wrote:

Thanks Charles, I'll have a look at it.

cheers,
Boris

On Tue, Mar 6, 2012 at 11:25, Charles Earl<[email protected]>  wrote:
Boris,
Have you looked at online decision trees and the like?
http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
I think ultimately the concept boils down to Temese's observation of
there being some measure (in the paper's case, concept drift)
that triggers re-training of the entire set.
C
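The trigger Charles describes (some measure of drift prompting a full
retrain) might be sketched, outside of Mahout, roughly like this; the
class name, window size, and threshold are purely illustrative:

```python
from collections import deque

class DriftMonitor:
    """Track a rolling error rate and flag when it drifts too high."""

    def __init__(self, window=100, threshold=0.3):
        self.recent = deque(maxlen=window)  # 1 = misprediction, 0 = correct
        self.threshold = threshold

    def record(self, predicted, actual):
        self.recent.append(1 if predicted != actual else 0)

    def needs_retrain(self):
        # Only judge drift once the window has filled up.
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) > self.threshold
```

When `needs_retrain()` fires, the system would fall back to a full batch
rebuild of the model on the accumulated labeled set.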
On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:

Hi Temese,

thank you very much for this information.

Boris

On Tue, Mar 6, 2012 at 11:14, Temese Szalai<[email protected]>
wrote:
Hi Boris -

Unless Mahout has super-powers that I am not aware of, years of experience
in text classification tell me that, yes, you will have to rebuild the
classifier model regularly as new labeled data becomes available.

If you are building a system that incorporates a user feedback loop, as it
sounds like you are (i.e., "yes, this message is spam"), one thing that
might reduce the amount of classifier re-training would be to verify that
the new incoming labeled document is not already in your data set, i.e.,
not a dupe. Additionally, you probably want to wait to retrain until you
have some critical mass of newly labeled documents, or else a critical
data point to include.
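The dedupe-plus-critical-mass strategy above could be sketched as follows
(a hypothetical illustration, not a Mahout API; hashing stands in for
whatever duplicate check the real pipeline would use):

```python
import hashlib

class RetrainBuffer:
    """Collect newly labeled docs; skip dupes; signal when retrain is due."""

    def __init__(self, critical_mass=50):
        self.seen = set()        # digests of every document already kept
        self.pending = []        # (text, label) pairs awaiting a retrain
        self.critical_mass = critical_mass

    def add(self, text, label):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return False         # duplicate: no retraining value
        self.seen.add(digest)
        self.pending.append((text, label))
        return True

    def ready_to_retrain(self):
        return len(self.pending) >= self.critical_mass

    def drain(self):
        # Hand the accumulated batch to the batch-training job.
        batch, self.pending = self.pending, []
        return batch
```

A "critical data point" could simply bypass the threshold and force an
immediate `drain()`.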

If someone has the ability to say "no, this is not spam", keeping that data
as labeled data to add to your anti-content/negative-content set would be
valuable.
Best,
Temese

On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing<[email protected]>
wrote:
Hi all,

is there a way to update a classifier model on the fly? Or do I need
to recompute everything each time I add a document to a category in
the training set?

I would like to build something similar to some spam filters, where
you can confirm that a message is a spam or not, and thus, train the
classifier.

regards,
Boris
--
42






