Let's see if this approach makes sense: I have the documents to classify in a Lucene index (Index A) and the training set in another Lucene index (Index B). With a VectorMapper I map the term-frequency vectors of Index A to term-frequency vectors over the terms of Index B. That way the transformed vectors contain only the frequencies of terms that occur in the training set. By computing vector.zSum() I should get, for each document, the total frequency of the training-set terms it contains, right? I compute vector.zSum() for all the docs to classify and exclude from the classification those whose sum is less than 10% of the maximum vector.zSum(): they mostly contain words never seen before and could be classified wrongly.
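To make the idea concrete, here is a minimal sketch of the filtering step in plain Java. It does not use Lucene or Mahout: a Map of term to count stands in for a term-frequency vector, a Set of strings stands in for the terms of Index B, and the class and method names are made up for illustration. The 10% cut-off is passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Mahout.
class VocabularyCoverageFilter {

    // Sum of the frequencies of the document's terms that also occur in the
    // training vocabulary. This plays the role of zSum() on the mapped vector.
    static double inVocabularySum(Map<String, Integer> termFreqs,
                                  Set<String> trainingVocabulary) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            if (trainingVocabulary.contains(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Returns the positions of the documents whose in-vocabulary sum is
    // strictly below fraction * (maximum sum over all documents).
    // If every sum is zero, nothing is excluded.
    static List<Integer> docsToExclude(List<Map<String, Integer>> docs,
                                       Set<String> trainingVocabulary,
                                       double fraction) {
        double[] sums = new double[docs.size()];
        double max = 0.0;
        for (int i = 0; i < docs.size(); i++) {
            sums[i] = inVocabularySum(docs.get(i), trainingVocabulary);
            max = Math.max(max, sums[i]);
        }
        List<Integer> excluded = new ArrayList<>();
        for (int i = 0; i < sums.length; i++) {
            if (sums[i] < fraction * max) {
                excluded.add(i);
            }
        }
        return excluded;
    }
}
```

Calling docsToExclude(docs, vocabulary, 0.10) would give the positions of the documents to leave out of the classification.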
What do you think?

-----Original Message-----
From: Claudia Grieco [mailto:[email protected]]
Sent: Wednesday, 13 April 2011 11:12
To: [email protected]
Subject: Identify "less similar" documents

Hi guys,
I'm using SGD to classify a set of documents, but I have a problem: some documents are not related to any of the categories, and I want to be able to identify them and exclude them from the classification.
My idea is to read the documents of the training set (which are currently in a Lucene index) and identify the docs that have the fewest terms in common with them.
Any idea on how to do it?
Thanks a lot
Claudia
