Do you have thousands of labeled documents for each category? Are the categories groupable into very similar clusters?
Do categories come and go? What does "high accuracy" mean to you?

My first recommendation for text classification is almost always L_1-regularized logistic regression. Since your training data are small, I would start with glmnet in R using word-level features; a rough sketch follows the quoted message below. If you have additional meta-data, such as the source of the text or the time of day, encode it as distinctly labeled features and see whether including it helps.

Whether you want a single multinomial model or many one-vs-rest binomial models is an open question. Try each design if you can (glmnet can fit either through its family argument).

As an interesting tree-based alternative, your data are small enough for the standard random forest implementation (also sketched below).

If you have usable category nesting, you might train a top-level model, then take the top few super-categories and train a category-specific model within each (a sketch of that follows as well).

R should suffice as long as you have fewer than hundreds of thousands of documents. Some algorithms in R cope with larger data; most will not.

On Wed, Dec 26, 2012 at 8:01 AM, Magesh Sarma <[email protected]> wrote:

> Hi:
>
> Coming from the Weka world, I have a newbie question.
>
> My problem is straightforward: I have to label a given document. Each
> document will have only one label. I have hundreds of labels. I have a
> big training set (thousands of labeled documents). Accuracy is important.
> So is the ability to train incrementally, or alternatively to rebuild the
> model from scratch quickly.
>
> I have used the J48 (based on C4.5) algorithm in Weka with a good degree
> of success. Accuracy is high, but training is very slow. Plus, it does
> not support incremental training.
>
> Any recommendation on what algorithm(s) would be a good fit if I switch
> to Mahout?
>
> Cheers,
> Magesh
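
Here is a minimal sketch of the glmnet approach. The data frame "docs" and its columns "text" and "label" are hypothetical stand-ins for your own data, and the tm package is just one convenient way to build word-level features:

    library(glmnet)
    library(tm)    # one convenient way to get word-level features

    ## Hypothetical input: a data frame "docs" with a character column
    ## "text" and a factor column "label".
    corpus <- Corpus(VectorSource(docs$text))
    dtm <- DocumentTermMatrix(corpus)          # bag-of-words counts

    ## Convert tm's triplet representation to the sparse matrix glmnet expects.
    x <- Matrix::sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                              dims = dim(dtm), dimnames = dimnames(dtm))

    ## L1-regularized (alpha = 1) multinomial logistic regression;
    ## cv.glmnet picks the regularization strength by cross-validation.
    fit <- cv.glmnet(x, docs$label, family = "multinomial", alpha = 1)
    pred <- predict(fit, newx = x, s = "lambda.min", type = "class")

For the one-vs-rest design, fit family = "binomial" once per label against a 0/1 indicator vector instead.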
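
The tree-based alternative, reusing dtm from above. randomForest wants a dense matrix, so this assumes you first prune the vocabulary down to something manageable; the 0.99 sparsity threshold is an arbitrary illustrative choice:

    library(randomForest)

    ## Drop terms absent from 99% or more of documents so the dense
    ## matrix stays small; tune this threshold to your vocabulary.
    dtm_small <- removeSparseTerms(dtm, 0.99)
    x_dense <- as.matrix(dtm_small)

    rf <- randomForest(x_dense, docs$label, ntree = 500)
    rf_pred <- predict(rf, x_dense)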
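
And the nested design, assuming a hypothetical factor docs$super that maps each document to its super-category; for this to work, each super-category needs at least two labels and enough documents to cross-validate on:

    ## Top-level model over super-categories.
    top <- cv.glmnet(x, docs$super, family = "multinomial", alpha = 1)

    ## One model per super-category, trained only on its own documents.
    sub_fits <- lapply(levels(docs$super), function(s) {
      idx <- docs$super == s
      cv.glmnet(x[idx, ], droplevels(docs$label[idx]),
                family = "multinomial", alpha = 1)
    })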
