Dan / Ted, thanks for the responses. I think my confusion with the SGD implementation came from the combination of the seed randomization and the batch versus online nature of the classifier training. I had never used an online classifier before.
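For anyone else who hits the same confusion: "online" here means the model updates its weights one training example at a time, which is why the input order matters. Below is a minimal sketch of how I understand the flow, using Mahout's OnlineLogisticRegression (the feature values and hyperparameters are made up for illustration):

    import org.apache.mahout.classifier.sgd.L2;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class OnlineSgdSketch {
      public static void main(String[] args) {
        // 2 target categories, 3 features, L2 prior for regularization
        OnlineLogisticRegression model =
            new OnlineLogisticRegression(2, 3, new L2(1));

        // Each call to train() immediately updates the weights, so
        // presenting the same examples in a different order can yield
        // different final weights.
        Vector example = new DenseVector(new double[] {1.0, 0.5, -0.2});
        model.train(1, example); // observed category: 1

        // Score for category 1 on the (already seen) example
        System.out.println(model.classifyScalar(example));
      }
    }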
It turns out that the library I was comparing the results against (scikit-learn, whose logistic regression wraps LIBLINEAR) was doing batch L2-regularized regression, which I believe is not solved with gradient descent. So apples and oranges to some extent. I'll take a look at RandomUtils.useTestSeed() to see if I can reproduce some of my test results with this online training. Thanks again for the response.

On Mon, May 13, 2013 at 10:48 PM, Dan Filimon <[email protected]> wrote:

> Hi,
>
> On Tue, May 14, 2013 at 7:24 AM, Tom Marthaler <[email protected]> wrote:
>
> > Hi, looking at the org.apache.mahout.classifier.sgd.TrainNewsGroups
> > examples class, it seems that the online nature of the SGD logistic
> > regression will always be dependent on the order in which the
> > classifier is trained.
>
> Yes, gradient descent is dependent on the starting point, and SGD,
> given that it stochastically chooses a single point to compute the
> gradient with respect to, is also dependent on the order of the
> points. SGD and batch gradient descent have the same expected error,
> however.
>
> > There is a call to randomize the order in which the newsgroup files
> > are read in on line 112 of TrainNewsGroups (the
> > Collections.shuffle(files); call). This means that the output of the
> > TrainNewsGroups main method will be non-deterministic.
>
> Yes, and I can tell you from experience (sadly... :) that not
> shuffling the points will break everything.
>
> > I am specifically looking at the weights put into the
> > org.apache.mahout.classifier.sgd.ModelDissector class.
> >
> > Is there a way to make the feature weights deterministic, no matter
> > the order of the input training vectors?
>
> Not in general: the stochasticity is what gives SGD the same expected
> error as normal GD. It is just by nature a randomized algorithm.
> For testing, however, you can always set a fixed seed for the random
> number generator. This will always give you the same "random" points.
> There's a method, RandomUtils.useTestSeed(), that does just that.
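P.S. For the archives, here is the deterministic test setup I take away from this thread: call RandomUtils.useTestSeed() before any training code runs, and shuffle the input files with an explicitly seeded Random instead of the bare Collections.shuffle(files). This is only a sketch; the seed value and class name are placeholders I made up:

    import java.io.File;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.common.RandomUtils;

    public class DeterministicSetup {
      // Placeholder seed; any fixed value gives a reproducible order.
      private static final long SHUFFLE_SEED = 42L;

      public static void prepare(List<File> files) {
        // Fix Mahout's internal random number generation so repeated
        // runs see the same "random" values.
        RandomUtils.useTestSeed();

        // Collections.shuffle(files) uses its own unseeded Random, so
        // pass one with a fixed seed to make the input order (which
        // SGD is sensitive to) identical on every run.
        Collections.shuffle(files, new Random(SHUFFLE_SEED));
      }
    }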
