I have done some unrelated tests and I think that SGD has suffered a drop in accuracy from some unknown cause. 20 newsgroups used to get to 86% accuracy and now only gets to about 80%. When I find time I will try to figure out what has happened.
Your test results may be related.

On Mon, May 23, 2011 at 4:22 AM, Stanley Xu <[email protected]> wrote:

> It looks like if I set decay to 1 (no learning rate decay), remove the
> regularization, use the raw OnlineLogisticRegression and adjust the
> learning rate, the performance is much better.
>
> Best wishes,
> Stanley Xu
>
> On Mon, May 23, 2011 at 4:18 PM, Stanley Xu <[email protected]> wrote:
>
> > Dear All,
> >
> > I am trying to evaluate the correctness of the SGD algorithm in Mahout.
> > I use a program to generate random weights, training data and test data,
> > and use OnlineLogisticRegression and AdaptiveLogisticRegression to train
> > and classify the result. But it looks like SGD doesn't work well. I am
> > wondering if I missed anything in using the SGD algorithm.
> >
> > I did the test with the following data sets:
> >
> > 1. 10 feature dimensions, each value 0 or 1. The weights are generated
> > randomly, with values in the range -5 to 5. The training data set is
> > 10k records or 100 records. The ratio of negative to positive targets
> > is 1:1.
> > The classification on both the training data and the test data looks
> > fine to me. Both the false positives and the false negatives are fewer
> > than 100, which is less than 1%.
> >
> > 2. 100 feature dimensions, each value 0 or 1. The weights are generated
> > randomly, with values in the range -5 to 5. The training data set is
> > 100k records to 1000k records. The ratio of negative to positive
> > targets is 1:1.
> > The classification on both the training data and the test data is not
> > very good. The false positive and false negative rates are both close
> > to 10%. But the AUC is pretty good: 90% with
> > AdaptiveLogisticRegression, 85% with raw OnlineLogisticRegression.
> >
> > 3. 100 feature dimensions, but with the ratio of negative to positive
> > targets changed to 10:1 to match the real training set we will get.
> > With the raw OnlineLogisticRegression, most of the positive targets
> > (more than 90%) are predicted as negative, and the AUC decreases to
> > 60%. Even worse, with AdaptiveLogisticRegression, all the positive
> > targets are predicted as negative, and the AUC decreases to 58%.
> >
> > The code to generate the data can be found here:
> > http://pastebin.com/GAA1di5z
> >
> > The code to train and classify the data can be found here:
> > http://pastebin.com/EjMpGQ1h
> >
> > The parameters there can be changed to generate different sets of data.
> >
> > I think the error rate is unacceptably high, especially for data that
> > can be separated perfectly by a line. And the error rate is unusually
> > high on the training data set.
> >
> > I know SGD is an approximate solution rather than an exact one, but
> > isn't a 20% classification error too high?
> >
> > I understand that for an unbalanced ratio of positives to negatives in
> > the training set, we could add a weight to the training examples. I
> > have tried, but it is hard to decide which weight to choose, and as I
> > understand it, the weight should also change dynamically with the
> > current learning rate, since a high learning rate combined with a high
> > weight will mislead the model in an incorrect direction. We have tried
> > some strategies, but the results are not good. Any tips on how to set
> > the weight for SGD? Since it is not a global convex optimization
> > solution, unlike other algorithms for logistic regression.
> >
> > Thanks.
> >
> > Best wishes,
> > Stanley Xu
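
For anyone who wants to repeat the first experiment outside Mahout, the following is a minimal sketch in plain Python, not the pastebin code and not Mahout's OnlineLogisticRegression. It assumes labels come from a hidden linear rule thresholded at the median score (one way to get a roughly 1:1 split on separable data), and it trains with a constant learning rate and no regularization, as in the follow-up message. The names make_data, train_sgd, accuracy and pos_weight are hypothetical, and the row count, learning rate and epoch count are arbitrary choices for illustration.

```python
import math
import random


def make_data(n_rows, n_features, rng):
    """Random 0/1 features labelled by a hidden linear rule (separable data)."""
    true_w = [rng.uniform(-5.0, 5.0) for _ in range(n_features)]
    rows = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    scores = [sum(w * x for w, x in zip(true_w, row)) for row in rows]
    # Assumption: threshold at the median score so the classes are roughly 1:1.
    threshold = sorted(scores)[n_rows // 2]
    labels = [1 if s > threshold else 0 for s in scores]
    return rows, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(rows, labels, rate, epochs, rng, pos_weight=1.0):
    """Plain SGD on the log loss: constant learning rate, no regularization."""
    w = [0.0] * (len(rows[0]) + 1)  # last entry is the bias term
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = rows[i] + [1]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            # Hypothetical knob: scale the update for positive examples only.
            scale = pos_weight if labels[i] == 1 else 1.0
            step = rate * scale * (labels[i] - p)
            for j, xj in enumerate(x):
                w[j] += step * xj
    return w


def accuracy(w, rows, labels):
    """Fraction of rows whose predicted class (score > 0) matches the label."""
    hits = 0
    for row, y in zip(rows, labels):
        z = sum(wj * xj for wj, xj in zip(w, row + [1]))
        hits += int((z > 0) == (y == 1))
    return hits / len(rows)


rng = random.Random(42)
rows, labels = make_data(2000, 10, rng)
weights = train_sgd(rows, labels, rate=0.1, epochs=10, rng=rng)
train_acc = accuracy(weights, rows, labels)
print(f"positives: {sum(labels)} of {len(labels)}, training accuracy: {train_acc:.3f}")
```

The pos_weight argument is only a placeholder for the per-example weighting raised in the last question; a value above 1 makes each positive example pull harder on the weights, and this sketch makes no claim about which value suits a 10:1 class ratio.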
