Claudia,

The term to look up is 'one-class classifier'. It is built around exactly this problem and comes with a set of ready-made solutions. I don't know if anyone has put one into a general classifier before, but the theory is there.
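As a very rough sketch of the flavour of thing a one-class approach does (plain Java, nothing Mahout- or Lucene-specific, and every name below is made up for the illustration): keep a running sum of the training term-frequency vectors (same direction as their centroid, which is all cosine similarity cares about), score each incoming document against it, and refuse to classify anything that scores below a cutoff.

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative one-class style filter (not Mahout API; all names invented):
 * keep a running sum of the training term-frequency vectors and score a new
 * document by its cosine similarity to that sum. Documents scoring below a
 * cutoff are treated as "not one of ours" and skipped.
 */
public class CentroidNoveltyFilter {

  private final Map<String, Double> centroid = new HashMap<String, Double>();

  /** Fold one training document (term -> frequency) into the running sum. */
  public void addTrainingDoc(Map<String, Double> termFreqs) {
    for (Map.Entry<String, Double> e : termFreqs.entrySet()) {
      Double current = centroid.get(e.getKey());
      centroid.put(e.getKey(), (current == null ? 0.0 : current) + e.getValue());
    }
  }

  /** Cosine similarity between a document and the training centroid direction. */
  public double score(Map<String, Double> termFreqs) {
    double dot = 0.0, docNorm = 0.0, centroidNorm = 0.0;
    for (Map.Entry<String, Double> e : termFreqs.entrySet()) {
      Double c = centroid.get(e.getKey());
      if (c != null) {
        dot += e.getValue() * c;
      }
      docNorm += e.getValue() * e.getValue();
    }
    for (Double c : centroid.values()) {
      centroidNorm += c * c;
    }
    if (docNorm == 0.0 || centroidNorm == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(docNorm) * Math.sqrt(centroidNorm));
  }

  /** True if the document looks too far from the training data to classify. */
  public boolean isNovel(Map<String, Double> termFreqs, double cutoff) {
    return score(termFreqs) < cutoff;
  }
}

The cutoff plays the same role as the 10% figure below and still has to be chosen by hand or tuned on held-out data.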
Daniel.

On Wed, Apr 13, 2011 at 11:56 AM, Claudia Grieco <[email protected]> wrote:

> Thanks for the help :)
>
>> Why not just train with those documents and put a category tag of "other"
>> on them and run normal categorization? If you can distinguish these
>> documents by word frequencies, then this should do the trick.
>
> I don't know if this will help:
> 1) I'm still not sure where to put the threshold (if a document has a word
> frequency of less than X... how do I choose X?)
> 2) The classifier is built incrementally: a document that would be
> classified as "other" today may be classified as "new category the user
> has just added" tomorrow. New documents in the training set and new
> categories are added from time to time.
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, April 13, 2011 5:34 PM
> To: [email protected]
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> I think that what you are doing is inventing an "other" category and
> building a classifier for that category.
>
> Why not just train with those documents and put a category tag of "other"
> on them and run normal categorization? If you can distinguish these
> documents by word frequencies, then this should do the trick.
>
> On Wed, Apr 13, 2011 at 7:49 AM, Claudia Grieco <[email protected]> wrote:
>
>> Let's see if this approach makes sense:
>> I have the documents to classify in a Lucene index (Index A) and the
>> training set in another Lucene index (Index B).
>> With a VectorMapper I map the term-frequency vectors of Index A onto the
>> term-frequency vectors of Index B. In this way the transformed vectors
>> keep only the frequencies of the terms that appear in the training set.
>> By computing vector.zSum() I should get the total frequency of the
>> training-set terms in the document, right?
>> I compute vector.zSum() for all the documents to classify and exclude
>> from the classification those whose sum is less than 10% of the maximum
>> vector.zSum() => they mostly contain words never seen before and could be
>> classified wrongly.
>>
>> What do you think?
>>
>> -----Original Message-----
>> From: Claudia Grieco [mailto:[email protected]]
>> Sent: Wednesday, April 13, 2011 11:12 AM
>> To: [email protected]
>> Subject: Identify "less similar" documents
>>
>> Hi guys,
>>
>> I'm using SGD to classify a set of documents, but I have a problem: there
>> are some documents that are not related to any of the categories, and I
>> want to be able to identify them and exclude them from the classification.
>> My idea is to read the documents of the training set (which are currently
>> in a Lucene index) and identify the documents that have the fewest terms
>> in common with them. Any idea on how to do it?
>>
>> Thanks a lot
>>
>> Claudia
>>
>
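For what it's worth, the zSum filtering proposed in the quoted message boils down to something like the following plain-Java sketch. There are no Lucene or Mahout calls here: the term-frequency maps stand in for the vectors read from Index A, and the vocabulary set stands in for the terms of Index B.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the vocabulary-overlap filter described in the thread: for each
 * document, sum only the frequencies of terms that also occur in the training
 * set (the zSum of the mapped vector), then drop documents whose sum is below
 * some fraction (e.g. 10%) of the largest sum seen.
 */
public class VocabularyOverlapFilter {

  /** Frequency mass a document shares with the training vocabulary. */
  static double inVocabularyMass(Map<String, Double> docTermFreqs,
                                 Set<String> trainingVocabulary) {
    double sum = 0.0;
    for (Map.Entry<String, Double> e : docTermFreqs.entrySet()) {
      if (trainingVocabulary.contains(e.getKey())) {
        sum += e.getValue();
      }
    }
    return sum;
  }

  /** Returns the documents worth sending to the classifier; the rest mostly contain unseen words. */
  static List<Map<String, Double>> filter(List<Map<String, Double>> docs,
                                          Set<String> trainingVocabulary,
                                          double fractionOfMax) {
    double[] mass = new double[docs.size()];
    double max = 0.0;
    for (int i = 0; i < docs.size(); i++) {
      mass[i] = inVocabularyMass(docs.get(i), trainingVocabulary);
      if (mass[i] > max) {
        max = mass[i];
      }
    }
    List<Map<String, Double>> kept = new ArrayList<Map<String, Double>>();
    for (int i = 0; i < docs.size(); i++) {
      if (max > 0.0 && mass[i] >= fractionOfMax * max) {
        kept.add(docs.get(i));
      }
    }
    return kept;
  }
}

Whether 10% of the maximum is the right cutoff is exactly the open question in the thread; one way to make it less sensitive to document length might be to divide each sum by the document's total term count before comparing.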
