Let's see if this approach makes sense:
I have the documents to classify in one Lucene index (Index A) and the
training set in another Lucene index (Index B).
With a VectorMapper I map the term-frequency vectors of Index A to the
term-frequency vectors of Index B. This way the transformed vectors contain
only the frequencies of terms that occur in the training set.
By computing vector.zSum() I should get, for each document, the total
frequency of the terms it shares with the training set, right?
I compute vector.zSum() for all the docs to classify and exclude from the
classification those whose sum is less than 10% of the maximum
vector.zSum(): they mostly contain words never seen before and could be
classified wrongly.
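
To make the filtering step concrete, here is a minimal sketch in plain Java.
It does not use the Lucene or Mahout APIs: each document is assumed to be a
map from term to frequency, the training vocabulary is assumed to be a set of
strings, and the names CoverageFilter, knownTermMass and belowThreshold are
made up for illustration. The sum over known terms stands in for what
vector.zSum() would return on the mapped vector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CoverageFilter {

    // Sum of the frequencies of the document's terms that also occur in the
    // training vocabulary (stand-in for zSum() on the mapped vector).
    static double knownTermMass(Map<String, Integer> termFreqs, Set<String> trainingVocab) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            if (trainingVocab.contains(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Indices of the documents whose score is below fraction * (max score).
    static List<Integer> belowThreshold(List<Map<String, Integer>> docs,
                                        Set<String> trainingVocab,
                                        double fraction) {
        double[] scores = new double[docs.size()];
        double max = 0.0;
        for (int i = 0; i < docs.size(); i++) {
            scores[i] = knownTermMass(docs.get(i), trainingVocab);
            max = Math.max(max, scores[i]);
        }
        List<Integer> excluded = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] < fraction * max) {
                excluded.add(i);
            }
        }
        return excluded;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary and documents, for illustration only.
        Set<String> vocab = Set.of("lucene", "index", "vector");
        List<Map<String, Integer>> docs = List.of(
                Map.of("lucene", 12, "index", 8, "pizza", 1),
                Map.of("vector", 1, "football", 30),
                Map.of("index", 5, "vector", 5));
        // Documents listed here would be skipped by the classifier.
        System.out.println(belowThreshold(docs, vocab, 0.10));
    }
}
```

The 0.10 passed to belowThreshold is the 10% cutoff described above.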

What do you think?

-----Original Message-----
From: Claudia Grieco [mailto:[email protected]]
Sent: Wednesday, April 13, 2011 11:12
To: [email protected]
Subject: Identify "less similar" documents

Hi guys,

I'm using SGD to classify a set of documents, but I have a problem: some
documents are not related to any of the categories, and I want to be able
to identify them and exclude them from the classification. My idea is to
read the documents of the training set (which are currently in a Lucene
index) and identify the docs that have the fewest terms in common with
them. Any ideas on how to do it?

Thanks a lot

Claudia 

