Hi,

On Tue, May 14, 2013 at 7:24 AM, Tom Marthaler <[email protected]> wrote:
> Hi, looking at the org.apache.mahout.classifier.sgd.TrainNewsGroups
> examples class, it seems that the online nature of the SGD logistic
> regression will always be dependent on the order in which the classifier
> is trained.

Yes, gradient descent is dependent on the starting point, and SGD, given that it stochastically chooses a single point to compute the gradient with respect to, is also dependent on the order of the points. SGD and batch gradient descent have the same expected error, however.

> There is a call to randomize the order in which the newsgroup files are
> read in on line 112 of TrainNewsGroups (the Collections.shuffle(files);
> call). This means that the output of the TrainNewsGroups main method will
> be non-deterministic.

Yes, and I can tell you from experience (sadly... :) that not shuffling the points will break everything.

> I am specifically looking at the weights put into the
> org.mahout.classifier.sgd.ModelDissector core class.
>
> Is there a way to make the feature weights deterministic, no matter the
> order of the input training vectors?

In general, the stochasticity is exactly what gives SGD the same expected error as plain GD; it is by nature a randomized algorithm. For testing, however, you can always set a fixed seed for the random number generator, which will give you the same "random" points on every run. There's a method, RandomUtils.useTestSeed(), that does just that.
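As a minimal sketch of what that looks like (not the full TrainNewsGroups flow; the directory, dimensions, and prior below are placeholders I'm assuming, not the example's defaults):

    import java.io.File;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.common.RandomUtils;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    public class DeterministicSgdSketch {
      public static void main(String[] args) {
        // Fix Mahout's RNG so every "random" choice repeats across runs.
        RandomUtils.useTestSeed();

        // Still shuffle the training order (SGD needs it), but with an
        // explicitly seeded Random so the order is reproducible.
        List<File> files = Arrays.asList(new File("data").listFiles());
        Collections.shuffle(files, new Random(42));

        // 20 categories, 10000-dimensional feature vectors (placeholder sizes).
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(20, 10000, new L1());

        // ... encode each file into a Vector and call
        // learner.train(target, vector) in the shuffled order ...
      }
    }

With the seed fixed, two runs over the same data produce the same weights, so you can compare ModelDissector output across runs.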
