Identify "less similar" documents

Claudia Grieco Wed, 13 Apr 2011 02:13:31 -0700

Hi guys,

I'm using SGD to classify a set of documents, but I have a problem: some
documents are not related to any of the categories, and I want to identify
them and exclude them from the classification. My idea is to read the
documents of the training set (which are currently in a Lucene index) and
identify the docs that have the fewest terms in common with them. Any ideas
on how to do it?
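
To make the idea concrete, here is a rough sketch in plain Python. It is only
an illustration, not working code from my project: it does not touch Lucene or
the SGD classifier, the function names are made up, the tokenizer is a crude
stand-in for a real analyzer, and the 0.5 threshold is an arbitrary assumption
that would need tuning. It scores each document by the fraction of its
distinct terms that also occur somewhere in the training set and flags the
low-scoring ones.

```python
import re


def tokenize(text):
    # Crude tokenizer (assumption): lowercase, split on non-alphanumerics.
    # A real setup would reuse the analyzer that built the index.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def training_vocabulary(training_docs):
    # Collect every distinct term that appears in the training documents.
    vocab = set()
    for doc in training_docs:
        vocab.update(tokenize(doc))
    return vocab


def overlap_score(doc, vocab):
    # Fraction of the document's distinct terms found in the training vocabulary.
    terms = set(tokenize(doc))
    if not terms:
        return 0.0  # an empty document shares nothing with the training set
    return len(terms & vocab) / len(terms)


def find_unrelated(docs, vocab, threshold=0.5):
    # Hypothetical cut-off: documents scoring below the threshold are treated
    # as unrelated to every category and can be skipped before classification.
    return [d for d in docs if overlap_score(d, vocab) < threshold]
```

With this, find_unrelated(new_docs, training_vocabulary(training_docs)) would
return the documents to exclude. Counting raw term overlap ignores how common
each term is, so very frequent words can make unrelated documents look similar.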


Thanks a lot

Claudia
