Hi,

On Tue, May 14, 2013 at 7:24 AM, Tom Marthaler <[email protected]> wrote:

> Hi, looking at the org.apache.mahout.classifier.sgd.TrainNewsGroups
> examples class, it seems that the online nature of the SGD logistic
> regression will always be dependent on the order in which the classifier is
> trained.
>

Yes. Gradient descent is dependent on the starting point, and SGD, since it
stochastically chooses a single point at a time to compute the gradient with
respect to, is also dependent on the order of the points.
SGD and batch gradient descent have the same expected error, however.
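To make the order dependence concrete, here is a minimal sketch (plain Java, not Mahout code; the learning rate and data points are made up for illustration) showing that one SGD pass over the same two points, in two different orders, ends at different weights:

```java
import java.util.Arrays;
import java.util.List;

public class SgdOrderDemo {
    // One SGD pass of 1-D logistic regression over (x, label) pairs.
    static double sgdPass(List<double[]> points, double w) {
        double rate = 0.5; // arbitrary learning rate for the demo
        for (double[] p : points) {
            double x = p[0], label = p[1];
            double pred = 1.0 / (1.0 + Math.exp(-w * x)); // sigmoid
            w += rate * (label - pred) * x;               // gradient step
        }
        return w;
    }

    public static void main(String[] args) {
        List<double[]> order1 = Arrays.asList(
                new double[]{1.0, 1.0}, new double[]{-2.0, 0.0});
        List<double[]> order2 = Arrays.asList(
                new double[]{-2.0, 0.0}, new double[]{1.0, 1.0});
        // Same points, different order -> different weight after one pass.
        System.out.println(sgdPass(order1, 0.0));
        System.out.println(sgdPass(order2, 0.0));
    }
}
```

Over many passes with a decaying learning rate the two runs converge toward the same solution in expectation, which is the point about expected error above.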


> There is a call to randomize the order in which the newsgroup files are
> read in on line 112 of TrainNewsGroups (the Collections.shuffle(files);
> call). This means that the output of the TrainNewsGroups main method will
> be non-deterministic.
>

Yes, and I can tell you from experience (sadly... :) that not shuffling the
points will break everything.

> I am specifically looking at the weights put into the
> org.apache.mahout.classifier.sgd.ModelDissector class.
>
> Is there a way to make the feature weights deterministic, no matter the
> order of the input training vectors?
>

The stochasticity is what gives it the same expected error as batch GD; it's
by nature a randomized algorithm.
For testing, however, you can always set a fixed seed for the random number
generator, which will always give you the same "random" order. The method
RandomUtils.useTestSeed() does just that.
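The same idea works directly on the shuffle itself: a seeded Random makes the "random" file order repeatable. A plain-Java sketch (in Mahout you would instead call RandomUtils.useTestSeed() before training):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededShuffleDemo {
    public static void main(String[] args) {
        List<String> files1 = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        List<String> files2 = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        // Same seed -> identical shuffle -> deterministic training order.
        Collections.shuffle(files1, new Random(42));
        Collections.shuffle(files2, new Random(42));
        System.out.println(files1.equals(files2)); // prints "true"
    }
}
```

With the order fixed this way, the feature weights the ModelDissector reports become reproducible from run to run.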
