Thinking of this in terms of clustering, outliers/misfits are one-vector clusters: items that are far away from all others. Clustering would be a slow way to find these outliers, but an interesting way to check them:
Cluster a sampled set of your items. Save the centroid and radius of each cluster. To verify an outlier, look at its distance to all centroids, and whether it falls within the radius of the closest one (or closest few). Given a clustering algorithm you like, and a different distance method than the categorization measure, this gives a good cross-check of the outliers.

On 4/14/11, Ted Dunning <[email protected]> wrote:
> Hand classify all the documents that you can into the categories that you
> know.
>
> Classify the ones that don't fit into "other".
>
> On Thu, Apr 14, 2011 at 12:51 AM, Claudia Grieco
> <[email protected]> wrote:
>
>> Thanks to everyone :)
>> So I should train the category "other" with some documents...but what
>> documents?
>> I should identify them first...that's a bit of a "chicken and egg" problem
>> Maybe I should do this way:
>> -each day X new documents arrive to be classified
>> -I find 10-11 docs with a low word freq in respect to the training set(but
>> what is a "low" value?) and train them as other
>> -classify everything with the updated classifier
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[email protected]]
>> Sent: Wednesday, April 13, 2011 19:29
>> To: [email protected]
>> Cc: Claudia Grieco
>> Subject: Re: Identify "less similar" documents
>>
>> On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco <[email protected]>
>> wrote:
>>
>> > Thanks for the help :)
>> > > Why not just train with those documents and put a category tag of
>> > > "other" on them and run normal categorization? If you can distinguish
>> > > these documents by word frequencies, then this should do the trick.
>> > I don't know if this will help
>> >
>>
>> Only an experiment will tell you.
>>
>>
>> > 1)I'm still not sure where to put the threshold (if a document has word
>> > frequency less than X...how to choose X?)
>> >
>>
>> The classifier should handle that for you for the most part. Again,
>> experimentation is the way to go here. My first cut would be to assign to
>> the category with the highest score, possibly including the other
>> category.
>>
>>
>> > 2)The classifier is built incrementally: a document who would be
>> > classified as "other" today may be classified as "new category the user
>> > has just added" tomorrow. New docs in the training set and new
>> > categories are added from time to time.
>> >
>>
>> That is pretty easy. Just retrain with the new category assignments.
>>
>

--
Lance Norskog
[email protected]
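
P.S. A minimal sketch of the centroid-and-radius cross-check described at the top of this message, in plain Python. Everything here is an assumption made for illustration: the names cluster_sample, is_outlier and slack are hypothetical, k-means stands in for "a clustering algorithm you like", and Euclidean distance stands in for whatever distance you pick (ideally one that differs from the measure your classifier uses).

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def assign(points, centroids):
    """Group each point under the index of its nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: euclidean(p, centroids[i]))
        groups[nearest].append(p)
    return groups


def cluster_sample(points, k, iterations=20, seed=0):
    """Run plain k-means on a sample; return a list of (centroid, radius)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    groups = assign(points, centroids)
    # Radius = distance from the centroid to its farthest member.
    return [(c, max((euclidean(p, c) for p in g), default=0.0))
            for c, g in zip(centroids, groups)]


def is_outlier(item, clusters, slack=1.0):
    """True if the item lies outside slack * radius of its closest centroid."""
    centroid, radius = min(clusters, key=lambda cr: euclidean(item, cr[0]))
    return euclidean(item, centroid) > slack * radius


# Toy sample: two tight groups of 2-d vectors.
sample = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
clusters = cluster_sample(sample, k=2)

print(is_outlier([5, 5], clusters))        # far from both groups
print(is_outlier([10.2, 10.4], clusters))  # inside the second group
```

The slack factor is only a knob for loosening or tightening the radius test. For real document vectors you would probably swap in a library clusterer and check the closest few centroids rather than only the nearest one.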
