Hand classify all the documents that you can into the categories that you know.
Classify the ones that don't fit into "other".

On Thu, Apr 14, 2011 at 12:51 AM, Claudia Grieco <[email protected]> wrote:

> Thanks to everyone :)
> So I should train the category "other" with some documents...but what
> documents?
> I should identify them first...that's a bit of a "chicken and egg" problem
> Maybe I should do it this way:
> -each day X new documents arrive to be classified
> -I find 10-11 docs with a low word freq with respect to the training set (but
> what is a "low" value?) and train them as other
> -classify everything with the updated classifier
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, April 13, 2011 19:29
> To: [email protected]
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco <[email protected]> wrote:
>
> > Thanks for the help :)
> > > Why not just train with those documents and put a category tag of "other"
> > > on them and run normal categorization? If you can distinguish these
> > > documents by word frequencies, then this should do the trick.
> > I don't know if this will help
>
> Only an experiment will tell you.
>
> > 1)I'm still not sure where to put the threshold (if a document has word
> > frequency less than X...how to choose X?)
>
> The classifier should handle that for you for the most part. Again,
> experimentation is the way to go here. My first cut would be to assign to
> the category with the highest score, possibly including the other category.
>
> > 2)The classifier is built incrementally: a document that would be classified
> > as "other" today may be classified as "new category the user has just added"
> > tomorrow. New docs in the training set and new categories are added from
> > time to time.
>
> That is pretty easy. Just retrain with the new category assignments.
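
To make the advice in this thread concrete, here is a minimal sketch in Python of the workflow: hand-label what you can, label the rest "other", assign each new document to the highest-scoring category, and retrain from scratch when the labels change. It is a toy multinomial naive Bayes written with the standard library only, not the API of any particular classifier package. The function names (train, scores, classify), the category names and the example documents are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a toy naive Bayes model from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocab = set()
    for text, category in labeled_docs:
        tokens = text.lower().split()
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def scores(model, text):
    """Return a log-score per category (add-one smoothing)."""
    tokens = text.lower().split()
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    result = {}
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # category prior
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        result[category] = score
    return result


def classify(model, text):
    """Assign the category with the highest score; "other" competes like any other."""
    s = scores(model, text)
    return max(s, key=s.get)


# Hypothetical hand-labeled data: known categories plus an explicit "other".
training = [
    ("invoice payment due amount", "billing"),
    ("payment receipt invoice total", "billing"),
    ("server crash error log", "support"),
    ("error login password reset", "support"),
    ("lunch menu friday pizza", "other"),
    ("holiday party parking notice", "other"),
]
model = train(training)

# Later the user adds an "events" category: relabel the affected documents,
# add any new examples, and retrain on the updated assignments.
relabeled = [
    (text, "events" if "party" in text.split() else category)
    for text, category in training
]
relabeled.append(("party invitation friday evening", "events"))
retrained = train(relabeled)
```

With this toy data, classify(model, "pizza party friday") lands in "other" before the relabeling, and classify(retrained, "party friday") lands in "events" afterwards. No threshold X appears anywhere; the "other" category simply competes on score. Whether that works on real documents is something only an experiment will tell.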
