The official solution is to assign outliers in the training set to 'other'. Outliers are defined as points with a high mean distance to other points. A hack to get this to work would be to perform a kNN-like distance comparison against all trained sets and classify as 'other' anything that exceeds the threshold distance. This is a variation of the same technique, which has already been mentioned.
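
A rough sketch of that hack in plain Python, to make the idea concrete. This is not Mahout code: the function names, the choice of Euclidean distance, the value of k, and the toy 2-D vectors are all assumptions for illustration, and the threshold would have to be tuned on real data.

```python
import math

def mean_knn_distance(point, training_points, k=3):
    """Mean Euclidean distance from point to its k nearest training points."""
    distances = sorted(math.dist(point, t) for t in training_points)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

def classify_with_other(point, training_sets, threshold, k=3):
    """Return the closest category, or "other" if every category is too far away."""
    best_label, best_distance = None, math.inf
    for label, points in training_sets.items():
        d = mean_knn_distance(point, points, k)
        if d < best_distance:
            best_label, best_distance = label, d
    # Anything farther than the threshold from all trained sets becomes "other".
    return best_label if best_distance <= threshold else "other"

# Hypothetical toy data: two trained categories in a 2-D feature space.
training_sets = {
    "sports": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "finance": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}

print(classify_with_other((0.5, 0.5), training_sets, threshold=2.0))
print(classify_with_other((5.0, 5.0), training_sets, threshold=2.0))
```

The first point sits inside the "sports" cluster and keeps that label; the second is far from both clusters, so it falls through to "other".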
Daniel.

On Wed, Apr 13, 2011 at 6:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I suspect but of the problem might be creating the training set for
> the 'other' since the documents are distinctly 'different' from
> anything else, including from each other.
> I guess the definition for the 'other' category is a 'low relevance
> for everything yet trained' but not 'high relevance to some category
> 'other' .
>
> As such, i think it is implied by definition that training for that
> stuff is not possible, but perhaps some cut-off threshold on the
> regressed posterior for all categories would help. But that's a
> surgery on the learner itself, i can't recollect if it is exposed by
> learner api?
>
>
> On Wed, Apr 13, 2011 at 8:34 AM, Ted Dunning <[email protected]> wrote:
>> I think that what you are doing is inventing an "other" category and
>> building a classifier for that category.
>>
>> Why not just train with those documents and put a category tag of "other" on
>> them and run normal categorization? If you can distinguish these documents
>> by word frequencies, then this should do the trick.
>
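
For reference, the posterior cut-off suggested in the quoted message could look roughly like the sketch below. It assumes the per-category posteriors are already available as a plain dict; whether the learner API actually exposes them is the open question from the thread, so the dict and the function name here are hypothetical stand-ins, not a real Mahout interface.

```python
def classify_with_cutoff(posteriors, cutoff):
    """Return the most probable category, or "other" if no posterior reaches the cutoff."""
    label = max(posteriors, key=posteriors.get)
    return label if posteriors[label] >= cutoff else "other"

# Hypothetical posteriors for two documents.
confident = {"sports": 0.85, "finance": 0.10, "politics": 0.05}
uncertain = {"sports": 0.40, "finance": 0.35, "politics": 0.25}

print(classify_with_cutoff(confident, cutoff=0.6))
print(classify_with_cutoff(uncertain, cutoff=0.6))
```

A document with low relevance for every trained category ends up as "other" without ever training an "other" class.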
