I think that what you are doing is inventing an "other" category and
building a classifier for that category.

Why not just tag those documents with a category of "other", include them in
the training set, and run normal categorization?  If these documents can be
distinguished by word frequencies, then this should do the trick.
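
As a rough illustration of the suggestion above, here is a minimal sketch in plain Python (not Mahout's SGD classes) in which "other" is simply one more label for a bag-of-words logistic regression trained with SGD. The function names, the toy documents, and the "sports"/"politics" categories are all made up for the example; the epoch count and learning rate are arbitrary assumptions.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """Bag-of-words term frequencies for one document."""
    return Counter(text.lower().split())

def predict_proba(weights, x):
    """Softmax over the per-label linear scores."""
    scores = {label: sum(w.get(t, 0.0) * c for t, c in x.items())
              for label, w in weights.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def train_sgd(examples, epochs=50, rate=0.1):
    """Multiclass logistic regression trained with plain SGD.

    examples is a list of (text, label); "other" is treated like any label.
    """
    labels = sorted({label for _, label in examples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for text, gold in examples:
            x = features(text)
            probs = predict_proba(weights, x)
            for label in labels:
                error = (1.0 if label == gold else 0.0) - probs[label]
                for term, count in x.items():
                    weights[label][term] += rate * error * count
    return weights

def classify(weights, text):
    """Return the most probable label for a document."""
    probs = predict_proba(weights, features(text))
    return max(probs, key=probs.get)

# Hypothetical toy training set: unrelated documents carry the "other" tag.
training = [
    ("goal match football team score", "sports"),
    ("team wins football league match", "sports"),
    ("election vote parliament minister", "politics"),
    ("minister parliament vote law", "politics"),
    ("recipe oven flour sugar butter", "other"),
    ("butter sugar cake recipe oven", "other"),
]
model = train_sgd(training)
print(classify(model, "football team score"))
print(classify(model, "cake recipe butter"))
```

This only helps when the "other" documents share vocabulary with each other; a document made up entirely of unseen words gets equal scores for every label here.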

On Wed, Apr 13, 2011 at 7:49 AM, Claudia Grieco <[email protected]> wrote:

> Let's see if this approach makes sense:
> I have the documents to classify in a Lucene index (Index A) and the
> training set in another Lucene index (Index B).
> With a VectorMapper I map Term-Frequency Vectors of Index A to
> Term-Frequency Vectors of Index B. In this way the transformed vectors have
> only the frequency of the terms of the training set.
> By computing vector.zSum() I should get the frequency of the terms in the
> training set for the document, right?
> I compute vector.zSum() for all the docs to classify and exclude from the
> classification those whose sum is less than 10% of the maximum
> vector.zSum(): they mostly contain words never seen before and could be
> classified wrongly.
>
> What do you think?
>
> -----Original Message-----
> From: Claudia Grieco [mailto:[email protected]]
> Sent: Wednesday, April 13, 2011 11:12
> To: [email protected]
> Subject: Identify "less similar" documents
>
> Hi guys,
>
> I'm using SGD to classify a set of documents, but I have a problem: some
> documents are not related to any of the categories, and I want to be able
> to identify them and exclude them from the classification. My idea is to
> read the documents of the training set (currently in a Lucene index) and
> identify the docs that have fewer terms in common with them. Any idea on
> how to do it?
>
> Thanks a lot
>
> Claudia
>
>
>
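
For comparison, the filter described in the quoted message can be sketched as follows. This is plain Python rather than the Mahout/Lucene code from the thread: `in_vocabulary_mass` is a hypothetical stand-in for calling zSum() on a vector projected onto the training-set terms, and the vocabulary and documents are invented for the example.

```python
from collections import Counter

def in_vocabulary_mass(text, vocabulary):
    """Sum of term frequencies, counting only terms seen in training."""
    counts = Counter(text.lower().split())
    return sum(c for term, c in counts.items() if term in vocabulary)

def split_by_coverage(docs, vocabulary, fraction=0.10):
    """Split docs into (kept, excluded) with the 10%-of-max rule."""
    masses = [in_vocabulary_mass(d, vocabulary) for d in docs]
    cutoff = fraction * max(masses, default=0)
    kept = [d for d, m in zip(docs, masses) if m >= cutoff]
    excluded = [d for d, m in zip(docs, masses) if m < cutoff]
    return kept, excluded

# Hypothetical training vocabulary and documents to classify.
vocabulary = {"football", "team", "match", "vote", "minister"}
docs = [
    "football team match team football match football team match vote minister",
    "quantum entanglement of photons",
    "team",
]
kept, excluded = split_by_coverage(docs, vocabulary)
print(kept)
print(excluded)
```

Because the sum is a raw count, a short document that is entirely in-vocabulary (the one-word "team" document here) can still fall under a cutoff set by a much longer document. Dividing each sum by the document's total term count is one possible way to reduce that length bias.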
