Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-30 Thread Joseph Bradley
This is really getting into an understanding of how optimization and GLMs
work.  I'd recommend reading some intro ML or stats literature on how
Generalized Linear Models are estimated, as well as how convex optimization
is used in ML.  There are some free online texts as well as MOOCs which
have good intros.  (There is also the upcoming ML with Spark MOOC!)

On Fri, May 29, 2015 at 3:11 AM, SparknewUser melanie.galloi...@gmail.com
wrote:

 I've tried several pairs of parameters for my
 LogisticRegressionWithSGD, and here are my results.
 numIterations varies from 100 to 500 by 50, and stepSize varies from
 0.1 to 1 by 0.1.
 The last line shows the maximum of each column and the last column the
 maximum of each line; the AUC rises and falls with no obvious pattern. What is the logic?

 My maximum (AUC = 0.74) is for the pair (numIterations, stepSize) = (200, 0.4)

 numIter \ stepSize   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   line max
 100                 0.67  0.69  0.50  0.48  0.50  0.69  0.70  0.50  0.66  0.55   0.70
 150                 0.50  0.51  0.50  0.50  0.50  0.50  0.53  0.50  0.53  0.68   0.68
 200                 0.67  0.71  0.64  0.74  0.50  0.70  0.71  0.71  0.50  0.50   0.74
 250                 0.50  0.50  0.55  0.50  0.50  0.50  0.73  0.55  0.50  0.50   0.73
 300                 0.67  0.50  0.50  0.67  0.50  0.67  0.72  0.48  0.66  0.67   0.72
 350                 0.71  0.60  0.66  0.50  0.51  0.50  0.66  0.62  0.66  0.71   0.71
 400                 0.51  0.54  0.71  0.67  0.62  0.50  0.50  0.50  0.51  0.50   0.71
 450                 0.51  0.50  0.50  0.51  0.50  0.50  0.66  0.51  0.50  0.50   0.66
 500                 0.51  0.64  0.50  0.50  0.51  0.49  0.66  0.67  0.54  0.51   0.67

 column max          0.71  0.71  0.71  0.74  0.62  0.70  0.73  0.71  0.66  0.71
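 A sweep like the one above is just an exhaustive grid search over the two
 parameters. Here is a minimal plain-Python sketch (not Spark code); the
 `evaluate_auc` callback is a hypothetical stand-in for "train
 LogisticRegressionWithSGD with these parameters and score AUC on a
 validation set":

```python
import itertools

def grid_search(num_iters_grid, step_size_grid, evaluate_auc):
    # Try every (numIterations, stepSize) pair and keep the best AUC.
    best = None
    for num_iters, step in itertools.product(num_iters_grid, step_size_grid):
        auc = evaluate_auc(num_iters, step)
        if best is None or auc > best[2]:
            best = (num_iters, step, auc)
    return best  # (numIterations, stepSize, auc)

# Toy stand-in for training + validation scoring: this fake AUC surface
# peaks at numIterations=200, stepSize=0.4.
def fake_auc(num_iters, step):
    return 0.74 - abs(num_iters - 200) / 1000.0 - abs(step - 0.4)

best = grid_search(range(100, 501, 50),          # 100..500 by 50
                   [i / 10 for i in range(1, 11)],  # 0.1..1.0 by 0.1
                   fake_auc)
```

 On a real problem you would hold out a validation set (or cross-validate)
 and score each pair the same way; picking the pair with the best AUC on
 the training data itself would overfit.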



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-best-performance-with-LogisticRegressionWithSGD-tp23053p23082.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread mélanie gallois
I'm new to Spark and I'm getting poor performance from the classification
methods in Spark MLlib (worse than R in terms of AUC).
I am trying to set my own parameters rather than use the defaults.
Here is the method I want to use:

train(RDD<LabeledPoint> input,
      int numIterations,
      double stepSize,
      double miniBatchFraction,
      Vector initialWeights)

API docs:
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/regression/LabeledPoint.html
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/linalg/Vector.html

How to choose numIterations and stepSize?
What does miniBatchFraction mean?
Is initialWeights necessary to have a good model? Then, how to choose them?


Regards,

Mélanie Gallois




Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
The model is learned with an iterative convex optimization algorithm;
numIterations, stepSize, and miniBatchFraction are parameters of that
algorithm. You can see details here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
http://spark.apache.org/docs/latest/mllib-optimization.html

I would set miniBatchFraction at 1.0 and not mess with it.
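
To make the meaning of miniBatchFraction concrete, here is a plain-Python
sketch of a single SGD step (this is illustrative, not the MLlib
implementation; the `(features, label)` data layout and the helper names
are assumptions):

```python
import math
import random

def logistic_grad(w, point):
    # Gradient of the logistic loss for one (features, label) point.
    x, y = point                     # y is 0 or 1
    margin = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-margin))
    return [(p - y) * xi for xi in x]

def sgd_step(w, data, step_size, mini_batch_fraction, rng):
    # Sample a miniBatchFraction-sized subset of the data and average the
    # gradients over it; a fraction of 1.0 uses every point each iteration,
    # i.e. plain (full-batch) gradient descent.
    k = max(1, int(round(len(data) * mini_batch_fraction)))
    batch = rng.sample(data, k)
    grad = [0.0] * len(w)
    for point in batch:
        for j, gj in enumerate(logistic_grad(w, point)):
            grad[j] += gj / k
    return [wj - step_size * gj for wj, gj in zip(w, grad)]

# Tiny example: feature vector [bias, x], label 1 when x > 0.
rng = random.Random(0)
data = [([1.0, x], 1 if x > 0 else 0) for x in (-2.0, -1.0, 1.0, 2.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = sgd_step(w, data, step_size=0.5, mini_batch_fraction=1.0, rng=rng)
```

Smaller fractions make each iteration cheaper but noisier, which is why
leaving it at 1.0 is the simplest choice.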
For LogisticRegressionWithSGD, to check whether the other two
parameters are set correctly, try running with more iterations.
If running with more iterations changes your result significantly, then:
 - If the result is blowing up (really big model weights), then you need to
decrease stepSize.
 - If the result is not blowing up but keeps changing, then you need to
increase numIterations.
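
That diagnostic can be sketched as a tiny helper (plain Python; `train_fn`
is a hypothetical closure that trains for n iterations and returns the
weight vector, and the thresholds are arbitrary illustrations):

```python
import math

def diagnose(train_fn, n_small, n_large, blowup_norm=1e3, tol=1e-3):
    # Train twice with different iteration counts and compare the weights.
    w_small, w_large = train_fn(n_small), train_fn(n_large)
    norm = math.sqrt(sum(wi * wi for wi in w_large))
    if norm > blowup_norm:
        return "decrease stepSize"       # weights are blowing up
    change = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_small, w_large)))
    if change > tol:
        return "increase numIterations"  # still moving: not yet converged
    return "parameters look OK"
```

For example, a weight sequence that settles toward a fixed point is
"OK", an exponentially growing one means the step size is too large, and
a steadily drifting one simply has not converged yet.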

You should not need to set initialWeights, but it can help if you have some
estimate already calculated.

If you have access to a build of the current Spark master (or can wait for
1.4), then the org.apache.spark.ml.classification.LogisticRegression
implementation has been compared with R and should get very similar results.

Good luck!
Joseph

On Wed, May 27, 2015 at 8:22 AM, SparknewUser melanie.galloi...@gmail.com
wrote:

 I'm new to Spark and I'm getting bad performance with classification
 methods
 on Spark MLlib (worse than R in terms of AUC).
 I am trying to put my own parameters rather than the default parameters.
 Here is the method I want to use :
 train(RDD<LabeledPoint> input,
       int numIterations,
       double stepSize,
       double miniBatchFraction,
       Vector initialWeights)
 How to choose numIterations and stepSize?
 What does miniBatchFraction mean?
 Is initialWeights necessary to have a good model? Then, how to choose them?



