Yes for active learning, no for transduction. You can do semi-supervised clustering as well; it is in many ways a special case of transduction.
If you mean classify instead of cluster in your last sentence, then that is definitely a way to do transduction.

On Mon, Nov 21, 2011 at 6:12 PM, Lance Norskog <[email protected]> wrote:

> The active learning and transduction methods create candidates which you
> then check by hand, right? Would it work to cluster untagged items using
> the tagged items as seeds?
>
> Lance
>
> On Fri, Nov 18, 2011 at 11:46 PM, Ted Dunning <[email protected]> wrote:
>
> > This test plan is pretty reasonable. There is inherently going to be
> > some form of bias due to the time shift, but the bias is real and will
> > affect your test results the same way it will affect your operational
> > accuracy. It might be somewhat interesting to estimate the effect over
> > time by also testing on a sample from within the same time period, but
> > that is mostly of academic interest.
> >
> > What I would recommend is that you use some additional techniques to
> > increase your training set size. Active learning is a classic technique
> > which can help you build a relatively small training set that gives
> > performance comparable to what you would get without down-sampling.
> > Transduction would let you use the untagged data to improve your model
> > without increasing the number of tagged samples.
> >
> > One simple approach to active learning is to repeatedly take new
> > training samples of untagged messages, stratified on your first model's
> > score. In addition, it makes sense to sample messages that have
> > significant numbers of terms that do not appear in your positive
> > training examples. These methods are much simpler than doing active
> > learning by the book, but give similar results.
> >
> > For transduction, a very simple method is to tag the rest of your
> > training data with the model itself and then train a new model on this
> > larger training set. This helps because it effectively extends your
> > model using cooccurrences with known terms. Again, this is less
> > effective than formally defined transduction methods, but it can be
> > surprisingly effective.
> >
> > Finally, I would recommend that you consider alternatives to Naive
> > Bayes for your basic model. This is because you only have a small
> > training set, and Naive Bayes depends in part on having a relatively
> > large number of training examples to give a good model.
> >
> > On Fri, Nov 18, 2011 at 10:43 PM, Night Wolf <[email protected]>
> > wrote:
> >
> > > Hey all,
> > >
> > > Quick question regarding a potential source of in-sample bias for a
> > > text classification project. I'm developing a system which reads text
> > > messages (i.e. SMS) and tries to classify them into a number of
> > > categories. We have a few million messages. We built our training set
> > > from a ~2-month window of messages, randomly sampled from within that
> > > period.
> > >
> > > We would like to classify messages within this two-month period and
> > > also messages written after it. Does this raise problems with sample
> > > bias? From a set of over 600,000 SMSs we only sampled around 5000 and
> > > manually tagged them for sentiment. I can't see how this would skew
> > > our accuracy results when we are using a test set which is taken from
> > > an unseen period after these two months.
> > >
> > > But if anyone could add their 2c on whether it's a mistake to a)
> > > classify unseen SMS messages within this two-month sample period, or
> > > b) verify accuracy based on a test set from outside this two-month
> > > sample period.
> > >
> > > Cheers,
> > > /N
>
> --
> Lance Norskog
> [email protected]
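Taking Lance's "cluster untagged items using the tagged items as seeds" literally, here is a minimal sketch of one way to do that semi-supervised clustering, assuming scikit-learn: use the per-category centroids of the tagged messages to initialize k-means over the untagged pool. The toy data and variable names are illustrative assumptions, not anything from the thread.

    # Seeded clustering sketch: tagged-item centroids initialize k-means.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical toy data standing in for the tagged/untagged messages.
    tagged_texts = ["great service thanks", "awful waste of money"]
    tagged_labels = ["pos", "neg"]
    untagged_texts = ["thanks so much", "total waste"]

    vectorizer = TfidfVectorizer()
    X_tagged = vectorizer.fit_transform(tagged_texts).toarray()
    X_untagged = vectorizer.transform(untagged_texts).toarray()

    # One seed centroid per category: the mean vector of its tagged items.
    labels = sorted(set(tagged_labels))
    seeds = np.vstack([X_tagged[np.array(tagged_labels) == c].mean(axis=0)
                       for c in labels])

    # n_init=1 keeps the hand-built seeds instead of re-randomizing them.
    # Cluster i starts from the centroid of labels[i], though k-means may
    # still move it as it iterates.
    km = KMeans(n_clusters=len(labels), init=seeds, n_init=1)
    cluster_of = km.fit_predict(X_untagged)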
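A sketch of the score-stratified sampling Ted describes in the quoted advice, under the assumption that first_model is any fitted classifier with a predict_proba method (e.g. a scikit-learn pipeline, binary case for simplicity) and untagged_texts is a list of strings; both names are hypothetical.

    import numpy as np

    def stratified_sample(first_model, untagged_texts, bins=10, per_bin=50, seed=0):
        # Score every untagged message with the current model.
        scores = first_model.predict_proba(untagged_texts)[:, 1]
        edges = np.linspace(0.0, 1.0, bins + 1)
        edges[-1] += 1e-9  # make the top bin inclusive of score 1.0
        rng = np.random.default_rng(seed)
        picked = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            idx = np.where((scores >= lo) & (scores < hi))[0]
            # Equal-size draws per bin spread the hand-tagging effort
            # across the whole score range instead of letting it pile up
            # where scores are most common.
            picked.extend(rng.choice(idx, size=min(per_bin, len(idx)),
                                     replace=False))
        return [untagged_texts[i] for i in picked]

    def novel_term_count(text, positive_vocab):
        # Ted's second heuristic: prefer messages with many terms that
        # never appear in the positive training examples.
        return sum(1 for t in text.lower().split() if t not in positive_vocab)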
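And a sketch of the simple "tag the rest of the data and retrain" form of transduction (self-training), using logistic regression rather than Naive Bayes in line with the last paragraph of the quoted advice. The toy data and the 0.6 confidence threshold are illustrative assumptions only; the threshold is a knob to tune.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy data standing in for the hand-tagged sample.
    tagged_texts = ["love it", "really love this", "hate it", "really hate this"]
    tagged_labels = [1, 1, 0, 0]
    untagged_texts = ["love this one", "hate this one", "no idea what this is"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(tagged_texts, tagged_labels)

    # Pseudo-label the untagged pool, keeping only confident predictions.
    proba = model.predict_proba(untagged_texts)
    keep = proba.max(axis=1) >= 0.6
    pseudo = model.classes_[proba.argmax(axis=1)]

    # Retrain on the enlarged set; cooccurrence with known terms is what
    # lets the new model pick up vocabulary the hand-tagged set missed.
    texts = tagged_texts + [t for t, k in zip(untagged_texts, keep) if k]
    labels = tagged_labels + [int(l) for l, k in zip(pseudo, keep) if k]
    model.fit(texts, labels)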
