Hey all,

Quick question regarding a potential source of in-sample bias for a text
classification project. I'm developing a system which reads text messages
(i.e. SMS) and tries to classify them into a number of categories. We have
a few million messages. We built our training set from a window (~2
months) of messages, randomly sampled from within this period.

We would like to classify messages within this two-month period and also
messages written beyond it. Does this raise any problems with sample bias?
From a set of over 600,000 SMSs we only sampled around 5000 and manually
tagged them for sentiment. I can't see how this would skew our accuracy
results when we are using a test set which is taken from an unseen
period after these two months.

I'd appreciate it if anyone could add their 2c on whether it's a mistake to
a) classify unseen SMS messages within this two-month sample period, or
b) verify accuracy based on a test set outside this two-month sample period.
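
To make the setup concrete, here is a rough Python sketch of the split I
have in mind. It is not our actual pipeline: the function name
`split_by_window`, the toy dates, and the sample size are all made up for
illustration, and the real data is of course far larger.

```python
import random
from datetime import date, timedelta


def split_by_window(messages, start, end, n_train, seed=0):
    """Split (day, text) pairs into three groups:
    a random in-window sample (to be hand-tagged for training),
    the remaining in-window messages (unseen, to be classified),
    and messages written after the window (held out for testing)."""
    in_window = [m for m in messages if start <= m[0] <= end]
    after = [m for m in messages if m[0] > end]

    # Sample indices rather than messages so duplicate texts stay distinct.
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(in_window)), n_train))

    train = [m for i, m in enumerate(in_window) if i in picked]
    unseen_in_window = [m for i, m in enumerate(in_window) if i not in picked]
    return train, unseen_in_window, after


# Toy data (hypothetical): one message per day for 90 days,
# with the first 60 days playing the role of the sampling window.
first = date(2024, 1, 1)
messages = [(first + timedelta(days=i), f"msg {i}") for i in range(90)]

train, unseen, later = split_by_window(
    messages, first, first + timedelta(days=59), n_train=10
)
print(len(train), len(unseen), len(later))  # prints: 10 50 30
```

Here `train` stands in for the ~5000 tagged messages, `unseen` for case a)
and `later` for case b).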

Cheers,
/N
