Thanks for the suggestion. I'm currently trying this hack:
- I take the documents of the training set and build one cluster per category, containing all the docs of that category.
- I compute the centroid of each category cluster.
- I compute the distance of each new document to all centroids (I'm using CosineDistanceMeasure).
- I identify as "outlier" the documents whose distance is more than X.
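A minimal sketch of the steps above, written in plain Python rather than against Mahout's Java API. It assumes documents are already dense term-weight vectors of equal length, that cosine distance means 1 minus cosine similarity, and that "distance more than X" refers to the distance to the nearest centroid. The function names and the `threshold` parameter are hypothetical, introduced only for illustration.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def category_centroids(vectors, labels):
    """Group training vectors by category and average each group."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def nearest_centroid(doc, centroids):
    """Return (label, distance) for the closest category centroid."""
    return min(
        ((label, cosine_distance(doc, c)) for label, c in centroids.items()),
        key=lambda pair: pair[1],
    )


def is_outlier(doc, centroids, threshold):
    """Assumption: a doc is an outlier if even its nearest centroid is
    farther away than `threshold` (the X in the message)."""
    _, dist = nearest_centroid(doc, centroids)
    return dist > threshold
```

The threshold still has to be chosen by experiment, for example by looking at the distribution of nearest-centroid distances over held-out training documents.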
Do you think this makes sense?

Thanks,
Claudia

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Friday, April 15, 2011 4:27
To: [email protected]
Subject: Re: Identify "less similar" documents

Thinking of this in terms of clustering, outliers/misfits are one-vector
clusters: items that are far away from all others. Clustering would be a
slow system for finding these outliers, but an interesting way to check
them:

Cluster a sampled set of your items. Save the centroid and radius of
each cluster. To verify an outlier, look at its distance to all
centroids, and whether it is in the radius of the closest (few).

Given a clustering algorithm you like, and a different distance method
than the categorization measure, this gives a good cross-check of the
outliers.

On 4/14/11, Ted Dunning <[email protected]> wrote:
> Hand classify all the documents that you can into the categories that you
> know.
>
> Classify the ones that don't fit into "other".
>
> On Thu, Apr 14, 2011 at 12:51 AM, Claudia Grieco
> <[email protected]> wrote:
>
>> Thanks to everyone :)
>> So I should train the category "other" with some documents... but which
>> documents?
>> I should identify them first... that's a bit of a "chicken and egg" problem.
>> Maybe I should do it this way:
>> - each day X new documents arrive to be classified
>> - I find 10-11 docs with a low word freq with respect to the training set
>>   (but what is a "low" value?) and train them as "other"
>> - classify everything with the updated classifier
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[email protected]]
>> Sent: Wednesday, April 13, 2011 19:29
>> To: [email protected]
>> Cc: Claudia Grieco
>> Subject: Re: Identify "less similar" documents
>>
>> On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco <[email protected]> wrote:
>>
>> > Thanks for the help :)
>> > > Why not just train with those documents and put a category tag of "other"
>> > > on them and run normal categorization? If you can distinguish these
>> > > documents by word frequencies, then this should do the trick.
>> > I don't know if this will help
>>
>> Only an experiment will tell you.
>>
>> > 1) I'm still not sure where to put the threshold (if a document has word
>> > frequency less than X... how to choose X?)
>>
>> The classifier should handle that for you for the most part. Again,
>> experimentation is the way to go here. My first cut would be to assign to
>> the category with the highest score, possibly including the other
>> category.
>>
>> > 2) The classifier is built incrementally: a document that would be
>> > classified as "other" today may be classified as "new category the user
>> > has just added" tomorrow. New docs in the training set and new
>> > categories are added from time to time.
>>
>> That is pretty easy. Just retrain with the new category assignments.

--
Lance Norskog
[email protected]
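The cross-check Lance describes in this thread (save each cluster's centroid and radius, then test whether a suspected outlier falls inside the radius of its closest clusters) might be sketched as follows. This is only an illustration under stated assumptions: the clusters are taken as given from whatever clustering algorithm you like, the radius is defined here as the largest member-to-centroid distance, and Euclidean distance stands in for "a different distance method". The names `summarize_clusters`, `confirms_outlier` and the `closest` parameter are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize_clusters(clusters, distance=euclidean):
    """For each cluster (a non-empty list of vectors) keep (centroid, radius).

    Assumption: radius = largest distance from any member to the centroid.
    """
    summaries = []
    for members in clusters:
        centroid = [sum(col) / len(members) for col in zip(*members)]
        radius = max(distance(m, centroid) for m in members)
        summaries.append((centroid, radius))
    return summaries


def confirms_outlier(item, summaries, distance=euclidean, closest=1):
    """True if the item lies outside the radius of each of its `closest`
    nearest clusters, i.e. the cross-check agrees it is an outlier."""
    ranked = sorted(summaries, key=lambda s: distance(item, s[0]))
    return all(distance(item, c) > r for c, r in ranked[:closest])
```

Raising `closest` above 1 corresponds to checking "the closest (few)" clusters instead of only the nearest one.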
