Dan / Ted, thanks for the responses.

I think my confusion with the SGD implementation came from the
combination of the randomized seeds and the batch versus online nature
of the classifier training. I had never used an online classifier
before.

It turns out that the library I was comparing the results against
(scikit-learn, which wraps LIBLINEAR for this classifier) was doing batch
L2-regularized logistic regression, which I believe does not use a
gradient descent calculation. So apples and oranges to some extent.

I'll take a look at RandomUtils.useTestSeed() to see if I can replicate
some of my test data using this online training.
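For anyone else trying to reproduce an online-SGD run, here is a minimal sketch of the underlying idea using only the JDK (not Mahout itself): shuffling with an explicitly seeded Random makes the "random" training order repeatable across runs, which is the same effect RandomUtils.useTestSeed() aims for in tests. The class name and seed value below are illustrative, not from Mahout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DeterministicShuffle {
    // Shuffle a copy of the input with a fixed seed so the "random"
    // training order is identical on every run. Mahout's
    // RandomUtils.useTestSeed() fixes the seed globally for tests
    // to get the same kind of reproducibility.
    static List<String> shuffled(List<String> files, long seed) {
        List<String> copy = new ArrayList<>(files);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.txt", "b.txt", "c.txt", "d.txt");
        List<String> first = shuffled(files, 42L);
        List<String> second = shuffled(files, 42L);
        // Same seed, same permutation.
        System.out.println(first.equals(second)); // prints true
    }
}
```

With the seed fixed, the weights the model learns (and hence what ModelDissector reports) should come out the same on every run over the same data.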

Thanks again for the response.



On Mon, May 13, 2013 at 10:48 PM, Dan Filimon
<[email protected]> wrote:

> Hi,
>
> On Tue, May 14, 2013 at 7:24 AM, Tom Marthaler <[email protected]>
> wrote:
>
> > Hi, looking at the org.apache.mahout.classifier.sgd.TrainNewsGroups
> > examples class, it seems that the online nature of the SGD logistic
> > regression will always be dependent on the order in which the classifier
> is
> > trained.
> >
>
> Yes, gradient descent is dependent on the starting point, and SGD, given
> that it stochastically chooses a single point to compute the gradient with
> respect to, is also dependent on the order of the points.
> SGD and batch gradient descent have the same expected errors, however.
>
>
> > There is a call to randomize the order in which the newsgroup files are
> > read in on line 112 of TrainNewsGroups (the Collections.shuffle(files);
> > call). This means that the output of the TrainNewsGroups main method will
> > be non-deterministic.
> >
>
> Yes, and I can tell you from experience (sadly... :) that not shuffling the
> points will break everything.
>
> > I am specifically looking at the weights put into the
> > org.mahout.classifier.sgd.ModelDissector core class.
> >
> > Is there a way to make the feature weights deterministic, no matter the
> > order of the input training vectors?
> >
>
> For it to work in general, the stochasticity is what gives it the same
> expected errors as normal GD. It's by nature a randomized algorithm.
> For testing, however, you can always set a fixed seed for the random number
> generator. This will always give you the same "random" points. The
> RandomUtils.useTestSeed() method does just that.
>
