This test plan is pretty reasonable. There is inherently going to be some bias due to the time shift, but that bias reflects real operating conditions and will affect your test results the same way it affects your operational accuracy. It might be somewhat interesting to estimate the effect over time by also testing on a sample from within the same time period, but that is really mostly of academic interest.
What I would recommend is that you use some additional techniques to get more out of your limited tagged data. Active learning is a classic technique that can help you build a relatively small training set whose performance is comparable to what you would get without down-sampling. Transduction lets you use the untagged data to improve your model without increasing the number of tagged samples.

One simple approach to active learning is to repeatedly take new training samples of untagged messages, stratified on your first model's score. In addition, it makes sense to also sample messages that contain significant numbers of terms that do not appear in your positive training examples. These methods are much simpler than doing active learning by the book, but give similar results.

For transduction, a very simple method is to use your first model to tag the rest of your data and then train a new model on this larger training set. This helps because it effectively extends your model through co-occurrences with terms it already knows. Again, this is less effective than more formally defined transduction methods, but it can be surprisingly effective.

Finally, I would recommend that you consider alternatives to Naive Bayes for your basic model. You only have a small training set, and Naive Bayes depends in part on having a relatively large number of training examples in order to produce a good model. Rough sketches of all three ideas follow.
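For the sampling step, something like the following rough Python sketch is all it takes. Everything here (the score range, the bin counts, the whitespace tokenization) is a placeholder assumption rather than anything from your system:

import numpy as np

def stratified_sample(scores, n_bins=10, n_per_bin=50, rng=None):
    # Spread new tagging effort evenly across the first model's score range.
    # Assumes scores fall in [0, 1]; adjust the edges if yours do not.
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    picked = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if len(idx):
            picked.extend(rng.choice(idx, size=min(n_per_bin, len(idx)),
                                     replace=False))
    return picked  # indices of untagged messages to hand-tag next

def novel_term_sample(messages, positive_vocab, n=200, min_novel=3):
    # Favor messages with several terms never seen in the positive examples.
    novelty = [sum(1 for t in set(m.lower().split()) if t not in positive_vocab)
               for m in messages]
    order = sorted(range(len(messages)), key=lambda i: -novelty[i])
    return [i for i in order if novelty[i] >= min_novel][:n]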
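The transduction shortcut is only a few lines if you happen to be using something like scikit-learn (MultinomialNB here just stands in for whatever your current model is):

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def self_train(tagged_texts, tags, untagged_texts):
    # Tag the untagged pool with the first model, then retrain on everything.
    vec = CountVectorizer()
    X = vec.fit_transform(list(tagged_texts) + list(untagged_texts))
    X_tagged, X_untagged = X[:len(tagged_texts)], X[len(tagged_texts):]
    first = MultinomialNB().fit(X_tagged, tags)
    pseudo = first.predict(X_untagged)  # machine-generated tags
    X_all = sp.vstack([X_tagged, X_untagged])
    y_all = np.concatenate([np.asarray(tags), pseudo])
    # A common refinement is to keep only high-confidence pseudo-tags.
    return vec, MultinomialNB().fit(X_all, y_all)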
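On the last point, I won't prescribe a particular replacement, but a regularized linear model such as logistic regression is a common choice at ~5000 tagged examples, and it is cheap to check whether it beats Naive Bayes on your data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def compare(texts, labels):
    # Cross-validated head-to-head on the hand-tagged messages.
    for name, clf in [("naive bayes", MultinomialNB()),
                      ("logistic regression", LogisticRegression(max_iter=1000))]:
        pipe = make_pipeline(TfidfVectorizer(), clf)
        print(name, cross_val_score(pipe, texts, labels, cv=5).mean())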
On Fri, Nov 18, 2011 at 10:43 PM, Night Wolf <[email protected]> wrote:
> Hey all,
>
> Quick question regarding a potential source of in-sample bias for a text
> classification project. I'm developing a system which reads text messages
> (i.e. SMS) and tries to classify them into a number of categories. We have
> a few million messages. We built our training set from a window (~2 months)
> of messages randomly sampled from within this period.
>
> We would like to classify messages within this two month period and also
> messages written beyond it. Does this raise any problems with sample bias?
> From a set of over 600,000 SMSs we only sampled around 5000 and manually
> tagged them for sentiment. I can't see how this would skew our accuracy
> results when we are using a test set which is taken from an unseen period
> after these two months.
>
> But if anyone could add their 2c on whether it's a mistake to a) classify
> unseen SMS messages in this two month sample time period, or b) verify
> accuracy based on a test set outside this two month sample period.
>
> Cheers,
> /N