Claudia,

The term to look up is 'one-class classifier'. It is built around exactly this problem and comes with a set of ready-made solutions. I don't know if anyone has put one into a general classifier before, but the theory is there.
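As a very rough sketch of the flavour of thing a one-class approach does (plain Java, nothing Mahout- or Lucene-specific, and every name below is made up for the illustration): keep a running sum of the training term-frequency vectors (same direction as their centroid, which is all cosine similarity cares about), score each incoming document against it, and refuse to classify anything that scores below a cutoff.

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative one-class style filter (not Mahout API; all names invented):
 * keep a running sum of the training term-frequency vectors and score a new
 * document by its cosine similarity to that sum. Documents scoring below a
 * cutoff are treated as "not one of ours" and skipped.
 */
public class CentroidNoveltyFilter {

  private final Map<String, Double> centroid = new HashMap<String, Double>();

  /** Fold one training document (term -> frequency) into the running sum. */
  public void addTrainingDoc(Map<String, Double> termFreqs) {
    for (Map.Entry<String, Double> e : termFreqs.entrySet()) {
      Double current = centroid.get(e.getKey());
      centroid.put(e.getKey(), (current == null ? 0.0 : current) + e.getValue());
    }
  }

  /** Cosine similarity between a document and the training centroid direction. */
  public double score(Map<String, Double> termFreqs) {
    double dot = 0.0, docNorm = 0.0, centroidNorm = 0.0;
    for (Map.Entry<String, Double> e : termFreqs.entrySet()) {
      Double c = centroid.get(e.getKey());
      if (c != null) {
        dot += e.getValue() * c;
      }
      docNorm += e.getValue() * e.getValue();
    }
    for (Double c : centroid.values()) {
      centroidNorm += c * c;
    }
    if (docNorm == 0.0 || centroidNorm == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(docNorm) * Math.sqrt(centroidNorm));
  }

  /** True if the document looks too far from the training data to classify. */
  public boolean isNovel(Map<String, Double> termFreqs, double cutoff) {
    return score(termFreqs) < cutoff;
  }
}

The cutoff plays the same role as the 10% figure below and still has to be chosen by hand or tuned on held-out data.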
Daniel.

On Wed, Apr 13, 2011 at 11:56 AM, Claudia Grieco <[email protected]> wrote:

> Thanks for the help :)
>
>> Why not just train with those documents and put a category tag of "other"
>> on them and run normal categorization? If you can distinguish these
>> documents by word frequencies, then this should do the trick.
>
> I don't know if this will help:
> 1) I'm still not sure where to put the threshold (if a document has a word
> frequency of less than X... how do I choose X?)
> 2) The classifier is built incrementally: a document that would be
> classified as "other" today may be classified as "new category the user
> has just added" tomorrow. New documents in the training set and new
> categories are added from time to time.
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, April 13, 2011 5:34 PM
> To: [email protected]
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> I think that what you are doing is inventing an "other" category and
> building a classifier for that category.
>
> Why not just train with those documents and put a category tag of "other"
> on them and run normal categorization? If you can distinguish these
> documents by word frequencies, then this should do the trick.
>
> On Wed, Apr 13, 2011 at 7:49 AM, Claudia Grieco <[email protected]> wrote:
>
>> Let's see if this approach makes sense:
>> I have the documents to classify in a Lucene index (Index A) and the
>> training set in another Lucene index (Index B).
>> With a VectorMapper I map the term-frequency vectors of Index A onto the
>> term-frequency vectors of Index B. In this way the transformed vectors
>> keep only the frequencies of the terms that appear in the training set.
>> By computing vector.zSum() I should get the total frequency of the
>> training-set terms in the document, right?
>> I compute vector.zSum() for all the documents to classify and exclude
>> from the classification those whose sum is less than 10% of the maximum
>> vector.zSum() => they mostly contain words never seen before and could be
>> classified wrongly.
>>
>> What do you think?
>>
>> -----Original Message-----
>> From: Claudia Grieco [mailto:[email protected]]
>> Sent: Wednesday, April 13, 2011 11:12 AM
>> To: [email protected]
>> Subject: Identify "less similar" documents
>>
>> Hi guys,
>>
>> I'm using SGD to classify a set of documents, but I have a problem: there
>> are some documents that are not related to any of the categories, and I
>> want to be able to identify them and exclude them from the classification.
>> My idea is to read the documents of the training set (which are currently
>> in a Lucene index) and identify the documents that have the fewest terms
>> in common with them. Any idea on how to do it?
>>
>> Thanks a lot
>>
>> Claudia
>>
>
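For what it's worth, the zSum filtering proposed in the quoted message boils down to something like the following plain-Java sketch. There are no Lucene or Mahout calls here: the term-frequency maps stand in for the vectors read from Index A, and the vocabulary set stands in for the terms of Index B.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the vocabulary-overlap filter described in the thread: for each
 * document, sum only the frequencies of terms that also occur in the training
 * set (the zSum of the mapped vector), then drop documents whose sum is below
 * some fraction (e.g. 10%) of the largest sum seen.
 */
public class VocabularyOverlapFilter {

  /** Frequency mass a document shares with the training vocabulary. */
  static double inVocabularyMass(Map<String, Double> docTermFreqs,
                                 Set<String> trainingVocabulary) {
    double sum = 0.0;
    for (Map.Entry<String, Double> e : docTermFreqs.entrySet()) {
      if (trainingVocabulary.contains(e.getKey())) {
        sum += e.getValue();
      }
    }
    return sum;
  }

  /** Returns the documents worth sending to the classifier; the rest mostly contain unseen words. */
  static List<Map<String, Double>> filter(List<Map<String, Double>> docs,
                                          Set<String> trainingVocabulary,
                                          double fractionOfMax) {
    double[] mass = new double[docs.size()];
    double max = 0.0;
    for (int i = 0; i < docs.size(); i++) {
      mass[i] = inVocabularyMass(docs.get(i), trainingVocabulary);
      if (mass[i] > max) {
        max = mass[i];
      }
    }
    List<Map<String, Double>> kept = new ArrayList<Map<String, Double>>();
    for (int i = 0; i < docs.size(); i++) {
      if (max > 0.0 && mass[i] >= fractionOfMax * max) {
        kept.add(docs.get(i));
      }
    }
    return kept;
  }
}

Whether 10% of the maximum is the right cutoff is exactly the open question in the thread; one way to make it less sensitive to document length might be to divide each sum by the document's total term count before comparing.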
