Do you have thousands of labeled documents for each category? Are the categories groupable into very similar clusters?
Do categories come and go? What does "high accuracy" mean to you?

My first recommendation for text classification is almost always L_1-regularized logistic regression. Since your training data are small, I would start with glmnet in R using word-level features; a rough sketch follows the quoted message below. If you have additional meta-data, such as the source of the text or the time of day, encode it as distinctly labeled features and see whether including it helps.

Whether you want a single multinomial model or many one-vs-rest binomial models is an open question. Try each design if you can (glmnet can fit either through its family argument).

As an interesting tree-based alternative, your data are small enough for the standard random forest implementation (also sketched below).

If you have usable category nesting, you might train a top-level model, then take the top few super-categories and train a category-specific model within each (a sketch of that follows as well).

R should suffice as long as you have fewer than hundreds of thousands of documents. Some algorithms in R cope with larger data; most will not.

On Wed, Dec 26, 2012 at 8:01 AM, Magesh Sarma <[email protected]> wrote:

> Hi:
>
> Coming from the Weka world, I have a newbie question.
>
> My problem is straightforward: I have to label a given document. Each
> document will have only one label. I have hundreds of labels. I have a
> big training set (thousands of labeled documents). Accuracy is important.
> So is the ability to train incrementally, or alternatively to rebuild the
> model from scratch quickly.
>
> I have used the J48 (based on C4.5) algorithm in Weka with a good degree
> of success. Accuracy is high, but training is very slow. Plus, it does
> not support incremental training.
>
> Any recommendation on what algorithm(s) would be a good fit if I switch
> to Mahout?
>
> Cheers,
> Magesh
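
Here is a minimal sketch of the glmnet approach. The data frame "docs" and its columns "text" and "label" are hypothetical stand-ins for your own data, and the tm package is just one convenient way to build word-level features:

    library(glmnet)
    library(tm)    # one convenient way to get word-level features

    ## Hypothetical input: a data frame "docs" with a character column
    ## "text" and a factor column "label".
    corpus <- Corpus(VectorSource(docs$text))
    dtm <- DocumentTermMatrix(corpus)          # bag-of-words counts

    ## Convert tm's triplet representation to the sparse matrix glmnet expects.
    x <- Matrix::sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                              dims = dim(dtm), dimnames = dimnames(dtm))

    ## L1-regularized (alpha = 1) multinomial logistic regression;
    ## cv.glmnet picks the regularization strength by cross-validation.
    fit <- cv.glmnet(x, docs$label, family = "multinomial", alpha = 1)
    pred <- predict(fit, newx = x, s = "lambda.min", type = "class")

For the one-vs-rest design, fit family = "binomial" once per label against a 0/1 indicator vector instead.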
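
The tree-based alternative, reusing dtm from above. randomForest wants a dense matrix, so this assumes you first prune the vocabulary down to something manageable; the 0.99 sparsity threshold is an arbitrary illustrative choice:

    library(randomForest)

    ## Drop terms absent from 99% or more of documents so the dense
    ## matrix stays small; tune this threshold to your vocabulary.
    dtm_small <- removeSparseTerms(dtm, 0.99)
    x_dense <- as.matrix(dtm_small)

    rf <- randomForest(x_dense, docs$label, ntree = 500)
    rf_pred <- predict(rf, x_dense)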
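
And the nested design, assuming a hypothetical factor docs$super that maps each document to its super-category; for this to work, each super-category needs at least two labels and enough documents to cross-validate on:

    ## Top-level model over super-categories.
    top <- cv.glmnet(x, docs$super, family = "multinomial", alpha = 1)

    ## One model per super-category, trained only on its own documents.
    sub_fits <- lapply(levels(docs$super), function(s) {
      idx <- docs$super == s
      cv.glmnet(x[idx, ], droplevels(docs$label[idx]),
                family = "multinomial", alpha = 1)
    })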
