Re: Identify "less similar" documents

Daniel McEnnis Wed, 13 Apr 2011 16:27:32 -0700

The official solution is to assign outliers in the training set to
"other".  Outliers are defined as points with a high mean distance to
the other points.  A hack to get this to work would be to perform a
kNN-like distance comparison against all trained sets and classify as
"other" anything that exceeds the threshold distance.  This is a
variation of the technique already mentioned.


Daniel.
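
For illustration, here is a minimal sketch of the distance-threshold
hack described above.  It is plain Python, not Mahout code; the function
names, the choice of Euclidean distance, and the values of k and
threshold are all assumptions made for the example and would need
tuning on real document vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_knn_distance(point, vectors, k):
    """Mean distance from point to its k nearest neighbours in vectors."""
    nearest = sorted(euclidean(point, v) for v in vectors)[:k]
    return sum(nearest) / len(nearest)


def classify_with_other(point, training, k=3, threshold=1.0):
    """Return the closest category, or "other" if even that is too far.

    training maps a category label to a list of vectors (hypothetical
    layout).  threshold is an assumed cut-off, not a recommended value.
    """
    best_label, best_dist = None, float("inf")
    for label, vectors in training.items():
        dist = mean_knn_distance(point, vectors, k)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else "other"


training = {
    "a": [(0, 0), (0, 1), (1, 0)],
    "b": [(10, 10), (10, 11), (11, 10)],
}
near_a = classify_with_other((0.2, 0.2), training)
far_from_all = classify_with_other((5, 5), training)
```

With the toy data above, a point sitting inside cluster "a" keeps that
label, while a point midway between the two clusters exceeds the
threshold for both and falls into "other".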

On Wed, Apr 13, 2011 at 6:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I suspect part of the problem might be creating the training set for
> the 'other' category, since those documents are distinctly 'different'
> from anything else, including from each other.
> I guess the definition of the 'other' category is 'low relevance
> for everything trained so far', not 'high relevance to some category
> "other"'.
>
> As such, I think it is implied by definition that training for that
> stuff is not possible, but perhaps some cut-off threshold on the
> regressed posterior for all categories would help. But that's
> surgery on the learner itself; I can't recollect whether it is
> exposed by the learner API.
>
>
> On Wed, Apr 13, 2011 at 8:34 AM, Ted Dunning <[email protected]> wrote:
>> I think that what you are doing is inventing an "other" category and
>> building a classifier for that category.
>>
>> Why not just train with those documents and put a category tag of "other" on
>> them and run normal categorization?  If you can distinguish these documents
>> by word frequencies, then this should do the trick.
>
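
The posterior cut-off suggested in the quoted message could look
roughly like the sketch below.  It assumes the learner can hand back a
per-category posterior as a plain dict, which is exactly the open
question in the thread; the function name and the 0.6 cut-off are
hypothetical and not part of any learner API.

```python
def label_or_other(posteriors, cutoff=0.6):
    """Pick the most probable category, or "other" if none is confident.

    posteriors maps category label to posterior probability (assumed
    to be available from the learner).  cutoff is an illustrative
    value only.
    """
    label, score = max(posteriors.items(), key=lambda kv: kv[1])
    return label if score >= cutoff else "other"


confident = label_or_other({"sports": 0.85, "politics": 0.10, "tech": 0.05})
unsure = label_or_other({"sports": 0.40, "politics": 0.35, "tech": 0.25})
```

Here "other" never needs its own training set: a document lands there
only because no trained category scores above the cut-off.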
