Thanks to everyone :)
So I should train the "other" category with some documents... but which documents?
I would have to identify them first, so it's a bit of a chicken-and-egg problem.
Maybe I should do it this way:
- each day, X new documents arrive to be classified
- I find the 10-11 docs with a low word frequency with respect to the training set
  (but what is a "low" value?) and train on them as "other"
- I classify everything with the updated classifier
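
A rough sketch of the selection step in that daily loop, in Python. It assumes one possible reading of "low word frequency with respect to the training set": the fraction of a document's tokens that also occur in the training vocabulary. The function names, the crude tokenizer, and the default k=10 are all made up for illustration. Ranking the documents and taking the bottom k avoids having to choose an absolute "low" value.

```python
import re

def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def vocabulary_coverage(doc, vocab):
    """Fraction of the document's tokens that also occur in the training vocabulary."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0  # assumption: an empty document counts as completely uncovered
    return sum(1 for t in tokens if t in vocab) / len(tokens)

def pick_other_candidates(new_docs, training_docs, k=10):
    """Return the k new documents least covered by the training vocabulary."""
    vocab = set()
    for d in training_docs:
        vocab.update(tokenize(d))
    # Ascending sort: the least covered documents come first.
    ranked = sorted(new_docs, key=lambda d: vocabulary_coverage(d, vocab))
    return ranked[:k]
```

The returned documents would then be added to the training set with the label "other" before retraining and classifying the rest of the day's batch. Whether that actually helps is something only an experiment can show.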

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Wednesday, 13 April 2011 19:29
To: [email protected]
Cc: Claudia Grieco
Subject: Re: Identify "less similar" documents

On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco <[email protected]> wrote:

> Thanks for the help :)
> > Why not just train with those documents and put a category tag of "other"
> > on them and run normal categorization?  If you can distinguish these
> > documents by word frequencies, then this should do the trick.
> I don't know if this will help
>

Only an experiment will tell you.


> 1)I'm still not sure where to put the threshold (if a document has word
> frequency less than X...how to choose X?)
>

The classifier should handle that for you for the most part.  Again,
experimentation is the way to go here.  My first cut would be to assign to
the category with the highest score, possibly including the other category.
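
As a minimal sketch of that first cut in Python: assume the classifier returns one score per category as a dictionary (the `scores` layout and the `assign_category` name are hypothetical, not any particular library's API). The "other" category competes like any other label, so no separate threshold is needed.

```python
def assign_category(scores):
    """Return the label with the highest score.

    `scores` maps category name -> score; "other" is just one more key.
    On a tie, the label that appears first in the dict wins.
    """
    return max(scores, key=scores.get)

# Hypothetical scores for one document.
print(assign_category({"sports": 0.2, "politics": 0.3, "other": 0.5}))
```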


> 2)The classifier is built incrementally: a document that would be classified
> as "other" today may be classified as "new category the user has just added"
> tomorrow. New docs in the training set and new categories are added from
> time to time.
>

That is pretty easy.  Just retrain with the new category assignments.
