Re: SGD didn't work well with high dimension by a random generated data test.

Ted Dunning Sun, 29 May 2011 16:05:37 -0700

I have done some unrelated tests and I think that SGD has suffered from some
unexplained decrease in accuracy.  20 newsgroups used to get to 86% accuracy
and now only gets to near 80%.  When I find time I will try to figure out
what has happened.


Your test results may be related.

On Mon, May 23, 2011 at 4:22 AM, Stanley Xu <[email protected]> wrote:

> It looks like if I set decay to 1 (no learning rate decay), remove the
> regularization, use the raw OnlineLogisticRegression and adjust the
> learning rate, the performance is much better.
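
To make the effect of the decay setting concrete, here is a minimal Python
sketch (not Mahout code) of an exponentially decayed step size. The function
name and the formula rate = initial_rate * decay ** step are assumptions made
for illustration; Mahout's actual schedule may include further terms.

```python
def decayed_rate(initial_rate: float, decay: float, step: int) -> float:
    """Hypothetical schedule: the rate shrinks by a factor of `decay` per step."""
    return initial_rate * decay ** step


# With decay = 1 the rate never shrinks ...
constant = [decayed_rate(0.5, 1.0, s) for s in (0, 1000, 100_000)]
# ... while a decay slightly below 1 drives it toward zero over many examples.
shrinking = [decayed_rate(0.5, 0.999, s) for s in (0, 1000, 100_000)]
```

Under this assumed schedule, a decay below 1 leaves almost no step size after
a few thousand examples, so later records barely move the model. That may be
one reason a decay of 1 helped on the larger training sets.
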
>
> Best wishes,
> Stanley Xu
>
>
>
> On Mon, May 23, 2011 at 4:18 PM, Stanley Xu <[email protected]> wrote:
>
> > Dear All,
> >
> > I am trying to evaluate the correctness of the SGD algorithm in Mahout. I
> > use a program to generate random weights, training data and test data,
> > and use OnlineLogisticRegression and AdaptiveLogisticRegression to train
> > and classify the result. But it looks like SGD doesn't work well. I am
> > wondering if I missed anything in using the SGD algorithm?
> >
> > I ran the test with the following data sets:
> >
> > 1. 10 feature dimensions, each value 0 or 1. The weights are generated
> > randomly in the range -5 to 5. The training data set is 10k records or
> > 100 records. The ratio of negative to positive targets is 1:1.
> > The classification on both the training data and the test data looks
> > fine to me. Both the false positives and the false negatives are fewer
> > than 100, which is less than 1%.
> >
> > 2. 100 feature dimensions, each value 0 or 1. The weights are generated
> > randomly in the range -5 to 5. The training data set is 100k to 1000k
> > records. The ratio of negative to positive targets is 1:1.
> > The classification on both the training data and the test data is not
> > very good. The false positive and false negative rates are both close to
> > 10%. But the AUC is pretty good: 90% with AdaptiveLogisticRegression and
> > 85% with raw OnlineLogisticRegression.
> >
> > 3. 100 feature dimensions, but with the ratio of negative to positive
> > targets changed to 10:1 to match the real training set we will get.
> > With the raw OnlineLogisticRegression, most of the positive targets (more
> > than 90%) are predicted as negative, and the AUC decreases to 60%. Even
> > worse, with AdaptiveLogisticRegression, all the positive targets are
> > predicted as negative, and the AUC decreases to 58%.
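
The setups above can be reproduced in miniature with a short, dependency-free
Python sketch. This is not the pastebin code and not Mahout: the function
names, the median-score labelling rule (chosen to get a 1:1 class ratio), the
constant learning rate and the sizes are all assumptions for illustration.

```python
import math
import random


def make_data(n_records, n_features, rng, weight_range=5.0):
    """Binary features, random true weights, labels from a linear score."""
    w = [rng.uniform(-weight_range, weight_range) for _ in range(n_features)]
    xs = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_records)]
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    # Assumption: threshold at the median score, giving roughly 1:1 classes.
    cut = sorted(scores)[n_records // 2]
    ys = [1 if s >= cut else 0 for s in scores]
    return xs, ys


def sigmoid(z):
    # tanh form avoids overflow for large |z|
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def train_sgd(xs, ys, rate=0.1, epochs=5):
    """Plain SGD on the logistic loss: no regularization, constant rate."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            b += rate * err
            for j, xj in enumerate(x):
                if xj:
                    w[j] += rate * err
    return w, b


def error_rate(w, b, xs, ys):
    """Fraction of records misclassified at a 0.5 probability threshold."""
    wrong = 0
    for x, y in zip(xs, ys):
        predicted = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
        wrong += predicted != y
    return wrong / len(ys)


rng = random.Random(42)
xs, ys = make_data(2000, 10, rng)
w, b = train_sgd(xs, ys)
train_error = error_rate(w, b, xs, ys)
```

Raising n_features to 100 or thinning out the positive class gives a rough
analogue of setups 2 and 3 for comparing against the Mahout results.
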
> >
> > The code to generate the data can be found here.
> > http://pastebin.com/GAA1di5z
> >
> > The code to train and classify the data can be found here.
> > http://pastebin.com/EjMpGQ1h
> >
> > The parameters there can be changed to generate different sets of data.
> >
> > I think the error rate is unacceptably high, especially for data that is
> > perfectly linearly separable. And the error rate is unusually high on
> > the training data set as well.
> >
> > I know SGD is an approximate solution rather than an exact one, but
> > isn't a 20% classification error too high?
> >
> > I understand that for an unbalanced ratio of positives to negatives in
> > the training set, we could add a weight to the training examples. I have
> > tried that, but it is hard to decide which weight to choose, and as I
> > understand it, the weight should also change dynamically with the current
> > learning rate, since a high learning rate combined with a high weight
> > will push the model in the wrong direction. We have tried some
> > strategies, but the results are not good. Any tips on how to set the
> > weight for SGD? Compared to other algorithms for logistic regression, it
> > is not a global convex optimization solution.
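
One common way to pick per-example weights for an unbalanced set is to make
both classes carry the same total weight. The Python sketch below illustrates
that idea only; it is not Mahout's API, and the names balanced_weights and
weighted_step are hypothetical.

```python
import math


def balanced_weights(ys):
    """Per-class weights so both classes carry the same total weight."""
    n_pos = sum(ys)
    n_neg = len(ys) - n_pos
    return {0: len(ys) / (2.0 * n_neg), 1: len(ys) / (2.0 * n_pos)}


def weighted_step(w, x, y, rate, weight):
    """One SGD step on the logistic loss, scaled by the example's weight."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 0.5 * (1.0 + math.tanh(0.5 * z))  # overflow-safe sigmoid
    step = rate * weight * (y - p)
    return [wi + step * xi for wi, xi in zip(w, x)]


ys = [0] * 10 + [1]  # 10:1 negatives to positives
weights = balanced_weights(ys)
w = weighted_step([0.0, 0.0], [1, 0], 1, rate=0.1, weight=weights[1])
```

The effective step size here is rate * weight, so a large weight acts like a
large learning rate on that example, which is the concern raised above. These
weights average to 1 over the training set, which keeps the overall step scale
unchanged, but the rate may still need to be lowered if the rare class causes
unstable updates.
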
> >
> > Thanks.
> >
> >
> > Best wishes,
> > Stanley Xu
> >
> >
>
