I suspect part of the problem might be creating the training set for 'other', since those documents are distinctly different from anything else, including from each other. I guess the definition of the 'other' category is 'low relevance to everything trained so far', not 'high relevance to some category called "other"'.
As such, I think it follows by definition that training for it is not possible, but perhaps a cut-off threshold on the regressed posterior across all categories would help. That would be surgery on the learner itself, though, and I can't recall whether it is exposed by the learner API.

On Wed, Apr 13, 2011 at 8:34 AM, Ted Dunning <[email protected]> wrote:
> I think that what you are doing is inventing an "other" category and
> building a classifier for that category.
>
> Why not just train with those documents and put a category tag of "other" on
> them and run normal categorization? If you can distinguish these documents
> by word frequencies, then this should do the trick.
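To make the cut-off idea from my reply concrete, here is a minimal sketch in plain Python. It is not the learner's API: the function name, the dict of per-category scores, and the 0.5 threshold are all hypothetical, chosen only for illustration. It assumes you can get a score per trained category for each document. If those scores are normalized to sum to 1 over the trained categories, the cut-off would need tuning, since the top score can be high even for a document that fits nothing well.

```python
def classify_with_reject(posteriors, threshold=0.5, reject_label="other"):
    """Return the best-scoring category, or reject_label if nothing scores high enough.

    posteriors: dict mapping category name -> score (hypothetical input shape).
    threshold:  arbitrary illustrative cut-off; would need tuning on held-out data.
    """
    if not posteriors:
        # No trained category produced a score, so nothing is relevant.
        return reject_label
    # Pick the category with the highest score.
    best = max(posteriors, key=posteriors.get)
    # 'other' means low relevance to every trained category.
    return best if posteriors[best] >= threshold else reject_label


print(classify_with_reject({"sports": 0.9, "politics": 0.05}))
print(classify_with_reject({"sports": 0.3, "politics": 0.25}))
```

With this approach no training documents are needed for 'other' at all; the category exists only as the fallback when every trained category falls below the cut-off.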
