One other thing to consider (and I don't know whether Mahout supports
this, because I am very new to Mahout, although very experienced with
text classification specifically) is that I have seen unsupervised or
semi-supervised learning approaches work for "on the fly"
re-computation of a model. This can be particularly helpful for data
bootstrapping, i.e., cases where you have a small initial set of data
and want to put some kind of filter and feedback loop in place to
build a curated data set.

This is different from classification, though, where you have a
labeled data set and train the classifier to identify things that look
like that data set.

In the once or twice I've seen unsupervised or semi-supervised
learning applied to create a "model", it worked OK when there was only
one category. So, if you are building a classic binary classifier,
only care about one category, and your system will work just fine with
that (i.e., "is this spam? y/n?"), this might be worth looking into,
provided your use cases and business needs really demand something on
the fly and can tolerate lower precision and recall while the system
learns.
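To make the idea concrete, here is a minimal sketch in plain Python
(not Mahout; the class and method names are my own invention) of a
binary spam/ham classifier whose counts are updated in place on each
piece of user feedback, so no full retraining pass is needed:

```python
import math
from collections import Counter


class OnlineNaiveBayes:
    """Binary (spam/ham) Naive Bayes updated incrementally.

    Each user confirmation ("yes, this is spam") folds one labeled
    document into the running counts -- the model is never rebuilt
    from scratch.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        # One feedback event: add this document's words to the counts.
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        total_docs = sum(self.doc_counts.values()) or 1
        vocab = len(set(self.word_counts["spam"])
                    | set(self.word_counts["ham"])) or 1
        best_label, best_score = None, float("-inf")
        for label in ("spam", "ham"):
            prior = (self.doc_counts[label] + 1) / (total_docs + 2)
            score = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing so unseen words don't zero out
                # the whole score.
                score += math.log((self.word_counts[label][word] + 1)
                                  / (total_words + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The same update-in-place trick is the basis of most classic Bayesian
spam filters; the open question in this thread is whether Mahout
exposes an equivalent incremental API or requires a batch rebuild.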

I don't know if this is useful to you at all.

Temese

On Tue, Mar 6, 2012 at 8:32 AM, Boris Fersing <[email protected]> wrote:

> Thanks Charles, I'll have a look at it.
>
> cheers,
> Boris
>
> On Tue, Mar 6, 2012 at 11:25, Charles Earl <[email protected]> wrote:
> > Boris,
> > Have you looked at online decision trees and the like?
> > http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
> > I think ultimately the concept boils down to Temese's observation of
> there being some measure (in the paper's case, concept drift)
> > that triggers re-training on the entire set.
> > C
> > On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:
> >
> >> Hi Temese,
> >>
> >> thank you very much for this information.
> >>
> >> Boris
> >>
> >> On Tue, Mar 6, 2012 at 11:14, Temese Szalai <[email protected]>
> wrote:
> >>> Hi Boris -
> >>>
> >>> Unless Mahout has super-powers that I am not aware of, years of
> experience
> >>> in text classification tell me that - yes, you will have to rebuild the
> >>> classifier model regularly as new labeled data becomes available.
> >>>
> >>> If you are building a system that incorporates a user feedback loop as
> it
> >>> sounds like you are (i.e., "yes, this message is spam"), one thing that
> >>> might reduce the amount of classifier re-training would be to verify
> that
> >>> the
> >>> new incoming labeled document is not already in your data set, i.e.,
> not a
> >>> dupe. Additionally, you probably want to wait to retrain until you have
> >>> some critical mass of newly labeled documents or else you have a
> critical
> >>> data point to include.
> >>>
> >>> If someone has the ability to say "no this is not spam", keeping that
> data
> >>> as labeled data to add to your anti-content/negative content set would
> be
> >>> valuable.
> >>> Best,
> >>> Temese
> >>>
> >>> On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing <[email protected]>
> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> is there a way to update a classifier model on the fly? Or do I need
> >>>> to recompute everything each time I add a document to a category in
> >>>> the training set?
> >>>>
> >>>> I would like to build something similar to some spam filters, where
> >>>> you can confirm that a message is spam or not, and thus train the
> >>>> classifier.
> >>>>
> >>>> regards,
> >>>> Boris
> >>>> --
> >>>> 42
> >>>>
> >>
> >>
> >>
> >> --
> >> 42
> >
>
>
>
> --
> 42
>
