This test plan is pretty reasonable. There is inherently going to be some bias due to the time shift, but that bias reflects real operating conditions and will affect your test results the same way it affects your operational accuracy. It might be somewhat interesting to estimate the effect over time by also testing on a sample from within the same time period, but that is really mostly of academic interest.
What I would recommend is that you use some additional techniques to get more out of your limited tagged data. Active learning is a classic technique that can help you build a relatively small training set whose performance is comparable to what you would get without down-sampling. Transduction lets you use the untagged data to improve your model without increasing the number of tagged samples.

One simple approach to active learning is to repeatedly take new training samples of untagged messages, stratified on your first model's score. In addition, it makes sense to also sample messages that contain significant numbers of terms that do not appear in your positive training examples. These methods are much simpler than doing active learning by the book, but give similar results.

For transduction, a very simple method is to use your first model to tag the rest of your data and then train a new model on this larger training set. This helps because it effectively extends your model through co-occurrences with terms it already knows. Again, this is less effective than more formally defined transduction methods, but it can be surprisingly effective.

Finally, I would recommend that you consider alternatives to Naive Bayes for your basic model. You only have a small training set, and Naive Bayes depends in part on having a relatively large number of training examples in order to produce a good model. Rough sketches of all three ideas follow.
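For the sampling step, something like the following rough Python sketch is all it takes. Everything here (the score range, the bin counts, the whitespace tokenization) is a placeholder assumption rather than anything from your system:

import numpy as np

def stratified_sample(scores, n_bins=10, n_per_bin=50, rng=None):
    # Spread new tagging effort evenly across the first model's score range.
    # Assumes scores fall in [0, 1]; adjust the edges if yours do not.
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    picked = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if len(idx):
            picked.extend(rng.choice(idx, size=min(n_per_bin, len(idx)),
                                     replace=False))
    return picked  # indices of untagged messages to hand-tag next

def novel_term_sample(messages, positive_vocab, n=200, min_novel=3):
    # Favor messages with several terms never seen in the positive examples.
    novelty = [sum(1 for t in set(m.lower().split()) if t not in positive_vocab)
               for m in messages]
    order = sorted(range(len(messages)), key=lambda i: -novelty[i])
    return [i for i in order if novelty[i] >= min_novel][:n]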
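The transduction shortcut is only a few lines if you happen to be using something like scikit-learn (MultinomialNB here just stands in for whatever your current model is):

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def self_train(tagged_texts, tags, untagged_texts):
    # Tag the untagged pool with the first model, then retrain on everything.
    vec = CountVectorizer()
    X = vec.fit_transform(list(tagged_texts) + list(untagged_texts))
    X_tagged, X_untagged = X[:len(tagged_texts)], X[len(tagged_texts):]
    first = MultinomialNB().fit(X_tagged, tags)
    pseudo = first.predict(X_untagged)  # machine-generated tags
    X_all = sp.vstack([X_tagged, X_untagged])
    y_all = np.concatenate([np.asarray(tags), pseudo])
    # A common refinement is to keep only high-confidence pseudo-tags.
    return vec, MultinomialNB().fit(X_all, y_all)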
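On the last point, I won't prescribe a particular replacement, but a regularized linear model such as logistic regression is a common choice at ~5000 tagged examples, and it is cheap to check whether it beats Naive Bayes on your data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def compare(texts, labels):
    # Cross-validated head-to-head on the hand-tagged messages.
    for name, clf in [("naive bayes", MultinomialNB()),
                      ("logistic regression", LogisticRegression(max_iter=1000))]:
        pipe = make_pipeline(TfidfVectorizer(), clf)
        print(name, cross_val_score(pipe, texts, labels, cv=5).mean())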
On Fri, Nov 18, 2011 at 10:43 PM, Night Wolf <[email protected]> wrote:
> Hey all,
>
> Quick question regarding a potential source of in-sample bias for a text
> classification project. I'm developing a system which reads text messages
> (i.e. SMS) and tries to classify them into a number of categories. We have
> a few million messages. We built our training set from a window (~2 months)
> of messages randomly sampled from within this period.
>
> We would like to classify messages within this two month period and also
> messages written beyond it. Does this raise any problems with sample bias?
> From a set of over 600,000 SMSs we only sampled around 5000 and manually
> tagged them for sentiment. I can't see how this would skew our accuracy
> results when we are using a test set which is taken from an unseen period
> after these two months.
>
> But if anyone could add their 2c on whether it's a mistake to a) classify
> unseen SMS messages in this two month sample time period, or b) verify
> accuracy based on a test set outside this two month sample period.
>
> Cheers,
> /N