Re: Identify "less similar" documents

Ted Dunning Thu, 14 Apr 2011 01:49:06 -0700

Hand classify all the documents that you can into the categories that you
know.


Classify the ones that don't fit into "other".
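
To make this concrete, here is a minimal sketch of the idea, not the Mahout implementation. It assumes a small hand-rolled multinomial naive Bayes over word counts with add-one smoothing; the function names (train, scores, classify) and the toy documents are invented for illustration. "other" is trained like any other category, and each new document goes to the category with the highest score.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Build word-count statistics from hand-labeled (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocab = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def scores(model, text):
    """Return a log-score per category (log prior + smoothed log likelihoods)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    result = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                count = word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        result[category] = score
    return result

def classify(model, text):
    """Assign to the highest-scoring category, "other" included."""
    s = scores(model, text)
    return max(s, key=s.get)

# Toy hand-labeled training set (hypothetical data).
training = [
    ("stock market shares rise", "finance"),
    ("bank interest rates fall", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal in match", "sports"),
    ("recipe for chocolate cake", "other"),
    ("weather forecast rain tomorrow", "other"),
]
model = train(training)
print(classify(model, "shares fall as bank rates rise"))
print(classify(model, "chocolate cake recipe"))

# Later the user adds a "cooking" category: relabel and retrain from scratch.
relabeled = [(t, "cooking" if "recipe" in t else c) for t, c in training]
model2 = train(relabeled)
print(classify(model2, "chocolate cake recipe"))
```

With this setup there is no separate threshold to pick by hand; whether that works on real data is something only an experiment will show.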

On Thu, Apr 14, 2011 at 12:51 AM, Claudia Grieco <[email protected]> wrote:

> Thanks to everyone :)
> So I should train the category "other" with some documents...but what
> documents?
> I should identify them first...that's a bit of a "chicken and egg" problem
> Maybe I should do it this way:
> -each day X new documents arrive to be classified
> -I find 10-11 docs with a low word frequency with respect to the training
> set (but what is a "low" value?) and train on them as "other"
> -classify everything with the updated classifier
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, April 13, 2011 19:29
> To: [email protected]
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco <[email protected]> wrote:
>
> > Thanks for the help :)
> > > Why not just train with those documents and put a category tag of
> > > "other" on them and run normal categorization?  If you can distinguish
> > > these documents by word frequencies, then this should do the trick.
> > I don't know if this will help
> >
>
> Only an experiment will tell you.
>
>
> > 1)I'm still not sure where to put the threshold (if a document has word
> > frequency less than X...how to choose X?)
> >
>
> The classifier should handle that for you for the most part.  Again,
> experimentation is the way to go here.  My first cut would be to assign to
> the category with the highest score, possibly including the other category.
>
>
> > 2)The classifier is built incrementally: a document that would be
> > classified as "other" today may be classified as "new category the user
> > has just added" tomorrow. New docs in the training set and new categories
> > are added from time to time.
> >
>
> That is pretty easy.  Just retrain with the new category assignments.
>
>
