I have talked to one user who had ~60,000 classes and they were able to use OLR with success.
The way that they did this was to arrange the output classes into a multi-level tree. Then they trained classifiers at each level of the tree. At any level, if there was a dominating result, then only that sub-tree would be searched. Otherwise, all of the top few sub-trees would be searched.

Thus, execution would proceed by evaluating the classifier at the root of the tree. One or more sub-trees would be selected. Each of the classifiers at the roots of these sub-trees would be evaluated. This would give a set of sub-sub-trees that eventually bottomed out with possible answers. These possible answers are combined to get a final set of categories.

The detailed meanings of "dominating" and "top few" and "answers are combined" are left as an exercise, but I think you can see the general outline. The detailed definitions are very likely application specific in any case.

On Thu, Aug 1, 2013 at 11:25 AM, yikes aroni <[email protected]> wrote:

> Say that I am trying to determine which customers buy particular candy
> bars. So I want to classify training data consisting of candy bar
> attributes (an N dimensional vector of variables) into customer attributes
> (an M dimensional vector of customer attributes).
>
> Is there a preferred method when N and M are large? That is, say 100 or
> more?
>
> I have done binary classification using AdaptiveLogisticRegression and
> OnlineLogisticRegression and small numbers of input features with relative
> success. As I'm trying to implement this for large N and M, I feel like I'm
> veering into the woods. Is there a code example anyone can point me to that
> uses mahout libraries to do multi-class classification when the number of
> classes is large?
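For what it's worth, the tree search described above can be sketched in a few lines. This is a hypothetical illustration, not Mahout code: the node layout, the `classify` function, and the 0.6 "dominating" threshold and top-2 fan-out are all my own choices, and the per-node classifiers are stand-in callables where a real system would plug in trained OLR models.

```python
def classify(node, features, threshold=0.6, top_k=2):
    """Recursively search a tree of classifiers.

    `node` is either a leaf {"label": str} or an internal node
    {"clf": callable, "children": {label: subtree}}, where `clf`
    returns a dict of child-label -> probability.  Both the node
    format and the combination rule (max of path products) are
    illustrative assumptions, not a fixed design.
    """
    if "label" in node:                      # leaf: a candidate category
        return {node["label"]: 1.0}
    scores = node["clf"](features)           # score this node's children
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:            # a "dominating" result:
        picked = [best]                      #   search only that sub-tree
    else:                                    # otherwise search the top few
        picked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    combined = {}
    for label in picked:                     # combine sub-tree answers
        sub = classify(node["children"][label], features, threshold, top_k)
        for leaf, p in sub.items():
            combined[leaf] = max(combined.get(leaf, 0.0), scores[label] * p)
    return combined

# Tiny demo tree with mock classifiers (a real one would have many levels).
tree = {
    "clf": lambda f: {"sweet": 0.9, "savory": 0.1},   # root dominates -> "sweet"
    "children": {
        "sweet": {
            "clf": lambda f: {"chocolate": 0.5, "caramel": 0.5},  # no dominator
            "children": {"chocolate": {"label": "chocolate"},
                         "caramel": {"label": "caramel"}},
        },
        "savory": {"label": "savory"},
    },
}
result = classify(tree, features=None)
# "sweet" dominates at the root, then both leaves are kept below it.
```

With ~60,000 leaf classes and a reasonable branching factor, each query only evaluates a handful of small classifiers along the selected paths instead of one 60,000-way model.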
