Yes for active learning, no for transduction. You can do semi-supervised clustering as well; it is in many ways a special case of transduction.
If you mean classify instead of cluster in your last sentence, then that is definitely a way to do transduction.

On Mon, Nov 21, 2011 at 6:12 PM, Lance Norskog <[email protected]> wrote:

> The active learning and transduction methods create candidates which you
> then check by hand, right? Would it work to cluster untagged items using
> the tagged items as seeds?
>
> Lance
>
> On Fri, Nov 18, 2011 at 11:46 PM, Ted Dunning <[email protected]> wrote:
>
> > This test plan is pretty reasonable. There is inherently going to be
> > some form of bias due to the time shift, but the bias is real and will
> > affect your test results the same way it will affect your operational
> > accuracy. It might be somewhat interesting to estimate the effect over
> > time by also testing on a sample from within the same time period, but
> > that is mostly of academic interest.
> >
> > What I would recommend is that you use some additional techniques to
> > increase your training set size. Active learning is a classic technique
> > which can help you build a relatively small training set that gives
> > performance comparable to what you would get without down-sampling.
> > Transduction would let you use the untagged data to improve your model
> > without increasing the number of tagged samples.
> >
> > One simple approach to active learning is to repeatedly take new
> > training samples of untagged messages, stratified on your first model's
> > score. In addition, it makes sense to sample messages that have
> > significant numbers of terms that do not appear in your positive
> > training examples. These methods are much simpler than doing active
> > learning by the book, but give similar results.
> >
> > For transduction, a very simple method is to tag the rest of your
> > training data with the model itself and then train a new model on this
> > larger training set. This helps because it effectively extends your
> > model using cooccurrences with known terms. Again, this is less
> > effective than formally defined transduction methods, but it can be
> > surprisingly effective.
> >
> > Finally, I would recommend that you consider alternatives to Naive
> > Bayes for your basic model. This is because you only have a small
> > training set, and Naive Bayes depends in part on having a relatively
> > large number of training examples to give a good model.
> >
> > On Fri, Nov 18, 2011 at 10:43 PM, Night Wolf <[email protected]>
> > wrote:
> >
> > > Hey all,
> > >
> > > Quick question regarding a potential source of in-sample bias for a
> > > text classification project. I'm developing a system which reads text
> > > messages (i.e. SMS) and tries to classify them into a number of
> > > categories. We have a few million messages. We built our training set
> > > from a ~2-month window of messages, randomly sampled from within that
> > > period.
> > >
> > > We would like to classify messages within this two-month period and
> > > also messages written after it. Does this raise problems with sample
> > > bias? From a set of over 600,000 SMSs we only sampled around 5000 and
> > > manually tagged them for sentiment. I can't see how this would skew
> > > our accuracy results when we are using a test set which is taken from
> > > an unseen period after these two months.
> > >
> > > But if anyone could add their 2c on whether it's a mistake to a)
> > > classify unseen SMS messages within this two-month sample period, or
> > > b) verify accuracy based on a test set from outside this two-month
> > > sample period.
> > >
> > > Cheers,
> > > /N
>
> --
> Lance Norskog
> [email protected]
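Taking Lance's "cluster untagged items using the tagged items as seeds" literally, here is a minimal sketch of one way to do that semi-supervised clustering, assuming scikit-learn: use the per-category centroids of the tagged messages to initialize k-means over the untagged pool. The toy data and variable names are illustrative assumptions, not anything from the thread.

    # Seeded clustering sketch: tagged-item centroids initialize k-means.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical toy data standing in for the tagged/untagged messages.
    tagged_texts = ["great service thanks", "awful waste of money"]
    tagged_labels = ["pos", "neg"]
    untagged_texts = ["thanks so much", "total waste"]

    vectorizer = TfidfVectorizer()
    X_tagged = vectorizer.fit_transform(tagged_texts).toarray()
    X_untagged = vectorizer.transform(untagged_texts).toarray()

    # One seed centroid per category: the mean vector of its tagged items.
    labels = sorted(set(tagged_labels))
    seeds = np.vstack([X_tagged[np.array(tagged_labels) == c].mean(axis=0)
                       for c in labels])

    # n_init=1 keeps the hand-built seeds instead of re-randomizing them.
    # Cluster i starts from the centroid of labels[i], though k-means may
    # still move it as it iterates.
    km = KMeans(n_clusters=len(labels), init=seeds, n_init=1)
    cluster_of = km.fit_predict(X_untagged)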
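A sketch of the score-stratified sampling Ted describes in the quoted advice, under the assumption that first_model is any fitted classifier with a predict_proba method (e.g. a scikit-learn pipeline, binary case for simplicity) and untagged_texts is a list of strings; both names are hypothetical.

    import numpy as np

    def stratified_sample(first_model, untagged_texts, bins=10, per_bin=50, seed=0):
        # Score every untagged message with the current model.
        scores = first_model.predict_proba(untagged_texts)[:, 1]
        edges = np.linspace(0.0, 1.0, bins + 1)
        edges[-1] += 1e-9  # make the top bin inclusive of score 1.0
        rng = np.random.default_rng(seed)
        picked = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            idx = np.where((scores >= lo) & (scores < hi))[0]
            # Equal-size draws per bin spread the hand-tagging effort
            # across the whole score range instead of letting it pile up
            # where scores are most common.
            picked.extend(rng.choice(idx, size=min(per_bin, len(idx)),
                                     replace=False))
        return [untagged_texts[i] for i in picked]

    def novel_term_count(text, positive_vocab):
        # Ted's second heuristic: prefer messages with many terms that
        # never appear in the positive training examples.
        return sum(1 for t in text.lower().split() if t not in positive_vocab)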
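And a sketch of the simple "tag the rest of the data and retrain" form of transduction (self-training), using logistic regression rather than Naive Bayes in line with the last paragraph of the quoted advice. The toy data and the 0.6 confidence threshold are illustrative assumptions only; the threshold is a knob to tune.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy data standing in for the hand-tagged sample.
    tagged_texts = ["love it", "really love this", "hate it", "really hate this"]
    tagged_labels = [1, 1, 0, 0]
    untagged_texts = ["love this one", "hate this one", "no idea what this is"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(tagged_texts, tagged_labels)

    # Pseudo-label the untagged pool, keeping only confident predictions.
    proba = model.predict_proba(untagged_texts)
    keep = proba.max(axis=1) >= 0.6
    pseudo = model.classes_[proba.argmax(axis=1)]

    # Retrain on the enlarged set; cooccurrence with known terms is what
    # lets the new model pick up vocabulary the hand-tagged set missed.
    texts = tagged_texts + [t for t, k in zip(untagged_texts, keep) if k]
    labels = tagged_labels + [int(l) for l, k in zip(pseudo, keep) if k]
    model.fit(texts, labels)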
