You can look into ClusterIterator. It requires prior information but is able to train on the fly.

On 06-03-2012 22:14, Temese Szalai wrote:
One other thing to consider (and I don't know whether Mahout supports this,
because I am very new to Mahout, although very experienced with text
classification specifically) is that I have seen unsupervised or
semi-supervised learning approaches work for "on the fly" re-computation of
a model. This can be particularly helpful for data bootstrapping, i.e.,
cases where you have a small initial set of data and want to put some kind
of filter and feedback loop in place to build a curated data set.

This is different from classification, though, where you have a labeled
data set and train the classifier to identify things that look like that
data set.

In the once or twice I've seen unsupervised or semi-supervised learning
applied to create a "model", it worked OK when there was only one category.
So if you are building a classic binary classifier, only care about one
category, and your system will work just fine with that (i.e., "is this
spam? y/n"), this might be worth looking into, provided your use cases and
business needs really demand something on the fly and can tolerate lower
precision and recall while the system learns.
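To make the one-category, on-the-fly idea concrete, here is a minimal
conceptual sketch (plain Python, not Mahout; all names here are
illustrative) of a binary classifier that updates itself one labeled
example at a time, so a user-feedback loop can train it incrementally:

```python
import math

class OnlineBinaryClassifier:
    """Tiny online logistic-regression sketch: one SGD step per example."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = [0.0] * num_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        # Sigmoid of the linear score: probability of the positive class.
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, label):
        # Single gradient step on one example; no batch retraining needed.
        err = label - self.predict_proba(x)
        self.b += self.lr * err
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

# Feedback loop: each user confirmation becomes one training step.
clf = OnlineBinaryClassifier(num_features=2)
for _ in range(200):
    clf.train([1.0, 0.0], 1)   # confirmed "spam"-like example
    clf.train([0.0, 1.0], 0)   # confirmed "not spam"-like example
```

As Temese notes, precision and recall will be low early on, while only a
few feedback examples have been absorbed.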

I don't know if this is useful to you at all.

Temese

On Tue, Mar 6, 2012 at 8:32 AM, Boris Fersing<[email protected]>  wrote:

Thanks Charles, I'll have a look at it.

cheers,
Boris

On Tue, Mar 6, 2012 at 11:25, Charles Earl<[email protected]>  wrote:
Boris,
Have you looked at online decision trees and the like?
http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
I think ultimately the concept boils down to Temese's observation of
there being some measure (in the paper's case, concept drift)
that triggers re-training of the entire set.
C
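The trigger Charles describes (some measure of drift prompting a full
retrain) might be sketched, outside of Mahout, roughly like this; the
class name, window size, and threshold are purely illustrative:

```python
from collections import deque

class DriftMonitor:
    """Track a rolling error rate and flag when it drifts too high."""

    def __init__(self, window=100, threshold=0.3):
        self.recent = deque(maxlen=window)  # 1 = misprediction, 0 = correct
        self.threshold = threshold

    def record(self, predicted, actual):
        self.recent.append(1 if predicted != actual else 0)

    def needs_retrain(self):
        # Only judge drift once the window has filled up.
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) > self.threshold
```

When `needs_retrain()` fires, the system would fall back to a full batch
rebuild of the model on the accumulated labeled set.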
On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:

Hi Temese,

thank you very much for this information.

Boris

On Tue, Mar 6, 2012 at 11:14, Temese Szalai<[email protected]>
wrote:
Hi Boris -

Unless Mahout has super-powers that I am not aware of, years of experience
in text classification tell me that, yes, you will have to rebuild the
classifier model regularly as new labeled data becomes available.

If you are building a system that incorporates a user feedback loop, as it
sounds like you are (i.e., "yes, this message is spam"), one thing that
might reduce the amount of classifier re-training would be to verify that
the new incoming labeled document is not already in your data set, i.e.,
not a dupe. Additionally, you probably want to wait to retrain until you
have some critical mass of newly labeled documents, or else a critical
data point to include.
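The dedupe-plus-critical-mass strategy above could be sketched as follows
(a hypothetical illustration, not a Mahout API; hashing stands in for
whatever duplicate check the real pipeline would use):

```python
import hashlib

class RetrainBuffer:
    """Collect newly labeled docs; skip dupes; signal when retrain is due."""

    def __init__(self, critical_mass=50):
        self.seen = set()        # digests of every document already kept
        self.pending = []        # (text, label) pairs awaiting a retrain
        self.critical_mass = critical_mass

    def add(self, text, label):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return False         # duplicate: no retraining value
        self.seen.add(digest)
        self.pending.append((text, label))
        return True

    def ready_to_retrain(self):
        return len(self.pending) >= self.critical_mass

    def drain(self):
        # Hand the accumulated batch to the batch-training job.
        batch, self.pending = self.pending, []
        return batch
```

A "critical data point" could simply bypass the threshold and force an
immediate `drain()`.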

If someone has the ability to say "no, this is not spam", keeping that data
as labeled data to add to your anti-content/negative-content set would be
valuable.
Best,
Temese

On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing<[email protected]>
wrote:
Hi all,

is there a way to update a classifier model on the fly? Or do I need
to recompute everything each time I add a document to a category in
the training set?

I would like to build something similar to some spam filters, where
you can confirm that a message is a spam or not, and thus, train the
classifier.

regards,
Boris
--
42






