Re: How to get the best performance with LogisticRegressionWithSGD?
This is really getting into an understanding of how optimization and GLMs work. I'd recommend reading some introductory ML or statistics literature on how generalized linear models are estimated, as well as on how convex optimization is used in ML. There are free online texts as well as MOOCs with good introductions. (There is also the upcoming ML with Spark MOOC!)

On Fri, May 29, 2015 at 3:11 AM, SparknewUser <melanie.galloi...@gmail.com> wrote:

I've tried several pairs of parameters for my LogisticRegressionWithSGD and here are my results. numIterations varies from 100 to 500 by 50 and stepSize varies from 0.1 to 1 by 0.1. The last row gives the maximum of each column and the last column the maximum of each row; the values rise and fall with no obvious pattern. What is the logic? My maximum is for the pair (numIterations, stepSize) = (200, 0.4).

numIter \ stepSize   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0  | row max
100                  0.67  0.69  0.50  0.48  0.50  0.69  0.70  0.50  0.66  0.55 | 0.70
150                  0.50  0.51  0.50  0.50  0.50  0.50  0.53  0.50  0.53  0.68 | 0.68
200                  0.67  0.71  0.64  0.74  0.50  0.70  0.71  0.71  0.50  0.50 | 0.74
250                  0.50  0.50  0.55  0.50  0.50  0.50  0.73  0.55  0.50  0.50 | 0.73
300                  0.67  0.50  0.50  0.67  0.50  0.67  0.72  0.48  0.66  0.67 | 0.72
350                  0.71  0.60  0.66  0.50  0.51  0.50  0.66  0.62  0.66  0.71 | 0.71
400                  0.51  0.54  0.71  0.67  0.62  0.50  0.50  0.50  0.51  0.50 | 0.71
450                  0.51  0.50  0.50  0.51  0.50  0.50  0.66  0.51  0.50  0.50 | 0.66
500                  0.51  0.64  0.50  0.50  0.51  0.49  0.66  0.67  0.54  0.51 | 0.67
column max           0.71  0.71  0.71  0.74  0.62  0.70  0.73  0.71  0.66  0.71
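For reference, a minimal Scala sketch of the kind of grid search described in the quoted message, scoring each (numIterations, stepSize) pair by AUC on a held-out split. The input path, the 70/30 split and the fixed miniBatchFraction of 1.0 are assumptions for illustration, not details from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("lr-sgd-grid"))

// Assumed: labeled data in LIBSVM format; the path is a placeholder.
val data = MLUtils.loadLibSVMFile(sc, "data/train.libsvm")
val Array(trainData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 11L)
trainData.cache(); testData.cache()

// Train one model and score it by AUC on the held-out split.
def aucFor(numIterations: Int, stepSize: Double): Double = {
  val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, 1.0)
  model.clearThreshold() // return raw scores instead of 0/1 predictions
  val scoreAndLabels = testData.map(p => (model.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}

// Sweep the same grid as in the table: numIterations 100..500 by 50, stepSize 0.1..1.0 by 0.1.
val results = for {
  numIter <- 100 to 500 by 50
  step    <- (1 to 10).map(_ / 10.0)
} yield (numIter, step, aucFor(numIter, step))

val (bestIter, bestStep, bestAuc) = results.maxBy(_._3)
println(s"best AUC $bestAuc at numIterations=$bestIter, stepSize=$bestStep")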
How to get the best performance with LogisticRegressionWithSGD?
I'm new to Spark and I'm getting bad performance with classification methods in Spark MLlib (worse than R in terms of AUC). I am trying to set my own parameters rather than use the defaults. Here is the method I want to use:

train(RDD<LabeledPoint> input, int numIterations, double stepSize, double miniBatchFraction, Vector initialWeights)

(see https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html, https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/regression/LabeledPoint.html and https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/linalg/Vector.html)

How should I choose numIterations and stepSize? What does miniBatchFraction mean? Is initialWeights necessary to get a good model? If so, how should I choose them?

Regards,
Mélanie Gallois
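As a point of reference, a minimal Scala sketch of calling this overload with explicit values; the input path, the particular hyperparameter values and the zero initial-weight vector are illustrative assumptions, not recommendations from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("lr-sgd-explicit-params"))

// Assumed: labeled data in LIBSVM format; the path is a placeholder.
val training = MLUtils.loadLibSVMFile(sc, "data/train.libsvm").cache()
val numFeatures = training.first().features.size

val numIterations     = 200                        // number of SGD iterations
val stepSize          = 0.5                        // SGD step size
val miniBatchFraction = 1.0                        // fraction of the data sampled per gradient step
val initialWeights    = Vectors.zeros(numFeatures) // optional starting point for the weights

val model = LogisticRegressionWithSGD.train(
  training, numIterations, stepSize, miniBatchFraction, initialWeights)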
Re: How to get the best performance with LogisticRegressionWithSGD?
The model is learned using an iterative convex optimization algorithm; numIterations, stepSize and miniBatchFraction are parameters of that optimization. You can see details here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
http://spark.apache.org/docs/latest/mllib-optimization.html

I would set miniBatchFraction to 1.0 and not touch it.

For LogisticRegressionWithSGD, to know whether the other two parameters are set correctly, try running with more iterations. If running with more iterations changes your result significantly, then:
- If the result is blowing up (very large model weights), decrease stepSize.
- If the result is not blowing up but keeps changing, increase numIterations.

You should not need to set initialWeights, but it can help if you already have an estimate calculated.

If you have access to a build of the current Spark master (or can wait for 1.4), the org.apache.spark.ml.classification.LogisticRegression implementation has been compared with R and should give very similar results.

Good luck!
Joseph
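To illustrate the check described above, a small Scala sketch (an illustration, not code from the thread) that trains twice, doubling the number of iterations, and compares the learned weights; the input path, the iteration counts and the step size are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("lr-sgd-convergence-check"))

// Assumed: labeled data in LIBSVM format; the path is a placeholder.
val trainData = MLUtils.loadLibSVMFile(sc, "data/train.libsvm").cache()

// Train twice with the same step size, doubling the number of iterations.
val stepSize = 0.5 // placeholder value
val modelA = LogisticRegressionWithSGD.train(trainData, 200, stepSize, 1.0)
val modelB = LogisticRegressionWithSGD.train(trainData, 400, stepSize, 1.0)

val wA = modelA.weights.toArray
val wB = modelB.weights.toArray
val normA = math.sqrt(wA.map(w => w * w).sum)
val normB = math.sqrt(wB.map(w => w * w).sum)
val diff  = math.sqrt(wA.zip(wB).map { case (a, b) => (a - b) * (a - b) }.sum)

// Rough reading of the numbers, following the advice above:
// - norms exploding with more iterations -> stepSize is too large, decrease it;
// - norms stable but weights still moving -> increase numIterations.
println(s"||w_200|| = $normA, ||w_400|| = $normB, ||w_200 - w_400|| = $diff")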