> If you don't have truly massive volumes, then SGD is almost certainly a
> better choice because it is simpler.

By "simpler", do you mean "faster" or "easier to code"?

As for the multiple-categories problem, I was thinking of returning the top N
categories to the user, or those whose score exceeds a certain threshold.
Do you think that would be fine?
Thanks
Claudia

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Friday, January 14, 2011 17:32
To: user@mahout.apache.org
Subject: Re: Help with Mahout Classification

If you don't have truly massive volumes, then SGD is almost certainly a
better choice because it is simpler.

If you have more than 10 million training examples *per model* and
*after downsampling*, then you should consider alternatives, but even up to
about 50 million training examples SGD will do very well.  At present, SGD is
also mostly suited to sparse feature vectors.

Having multiple categories isn't a big deal.  The simplest solution is to
train a classifier per category.  There are more advanced arrangements,
though.  For instance, you can train one classifier per category (the first
level models), then train another classifier per category where the inputs
are the outputs of the first level models.  Which techniques will help is
highly dependent on your particular problem.
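
A minimal sketch of the per-category approach, in plain Python rather than Mahout's own API: one small logistic regression per category, trained online with SGD on sparse feature dictionaries, followed by the top-N or threshold selection raised in the question above. The names `BinarySGD`, `train_one` and `select_categories`, the learning rate, and the toy documents are all made up for illustration.

```python
import math


class BinarySGD:
    """Tiny logistic regression trained online; one instance per category."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}  # feature name -> weight; absent features count as 0
        self.learning_rate = learning_rate

    def score(self, features):
        # features: dict of feature name -> value, holding only nonzero entries.
        z = sum(self.weights.get(f, 0.0) * v for f, v in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, features, label):
        # One SGD step on log loss; only features present in the example move.
        error = label - self.score(features)
        for f, v in features.items():
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error * v


def train_one(models, features, categories):
    # Each category's model sees the document as a positive (1) or negative (0).
    for name, model in models.items():
        model.train(features, 1 if name in categories else 0)


def select_categories(models, features, top_n=None, threshold=None):
    # Rank categories by score, then apply the threshold and/or the top-N cut.
    ranked = sorted(((m.score(features), name) for name, m in models.items()),
                    reverse=True)
    if threshold is not None:
        ranked = [(s, name) for s, name in ranked if s >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [name for _, name in ranked]


# Toy data: a document may carry more than one category.
models = {"sport": BinarySGD(), "italy": BinarySGD()}
docs = [
    ({"football": 1.0, "rome": 1.0}, {"sport", "italy"}),
    ({"football": 1.0, "league": 1.0}, {"sport"}),
    ({"rome": 1.0, "pasta": 1.0}, {"italy"}),
]
for _ in range(50):
    for features, categories in docs:
        train_one(models, features, categories)

print(select_categories(models, {"football": 1.0, "rome": 1.0}, threshold=0.5))
print(select_categories(models, {"football": 1.0, "league": 1.0}, top_n=1))
```

Because each call to `train_one` is a single online update, newly arriving documents can be fed in as they come without retraining from scratch. Whether a fixed threshold or a top-N cut works better depends on the data.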

On Fri, Jan 14, 2011 at 7:10 AM, Claudia Grieco <gri...@crmpa.unisa.it> wrote:

> Do you think SGD will be a better choice? New documents are added to the
> training set very often and documents can belong to more than one category
> (e.g. "sport", "italy").
