Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
Hi

There are many variants of gradient descent, differing mostly in how the step 
size is chosen and how it is adjusted as the algorithm proceeds. If a stochastic 
variant is used (as opposed to batch descent), there are further variations. I 
don’t know off-hand what MLlib’s exact implementation is, but no doubt there are 
differences between the two; perhaps someone with more knowledge of the 
internals could comment.

As you can tell from playing around with the parameters, step size is vitally 
important to the performance of the algorithm.
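
To make the step-size point concrete, here is a minimal sketch in plain NumPy (not MLlib’s actual implementation) of batch gradient descent on data like the z = x + y example discussed below. It just shows how strongly a fixed step size affects convergence for a fixed iteration budget:

```python
import numpy as np

def batch_gd(X, z, step, iterations):
    # Batch gradient descent for least squares -- illustrative only,
    # not MLlib's implementation.
    w = np.zeros(X.shape[1])
    n = len(z)
    for _ in range(iterations):
        grad = (2.0 / n) * X.T @ (X @ w - z)  # gradient of the MSE
        w -= step * grad
    return w

def mse(X, z, w):
    return float(np.mean((X @ w - z) ** 2))

rng = np.random.default_rng(0)
X = rng.random((1000, 2))   # 1000 points, features in [0, 1)
z = X[:, 0] + X[:, 1]       # the simple linear function z = x + y

w_small = batch_gd(X, z, step=0.01, iterations=100)
w_large = batch_gd(X, z, step=0.5, iterations=100)
print(mse(X, z, w_small), mse(X, z, w_large))
```

On this data the larger step drives the MSE down orders of magnitude faster than the smaller one within the same 100 iterations, which is exactly the behaviour being discussed in this thread.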


On 22 Jan 2015, at 06:44, Jacques Heunis jaaksem...@gmail.com wrote:

 Ah I see, thanks!
 I was just confused because, given the same configuration, I would have 
 thought that Spark and Scikit would give more similar results, but I guess 
 this is simply not the case (as in your example, in order to get Spark to 
 give an MSE sufficiently close to Scikit's you have to give it a 
 significantly larger step size and iteration count).
 
 Would that then be a result of MLlib and Scikit differing slightly in their 
 exact implementations of the optimizer? Or rather a case of (as you say) 
 Scikit being a far more mature system (and therefore that MLlib would 'get 
 better' over time)? Surely it is far from ideal that to get the same results 
 you need more iterations (i.e. more computation), or do you think that that is 
 simply coincidence and that given a different model/dataset it may be the 
 other way around?
 
 I ask because I encountered this situation on other, larger datasets, so this 
 is not an isolated case (though being the simplest example I could think of, I 
 would imagine it's somewhat indicative of general behaviour).
 
 On Thu, Jan 22, 2015 at 1:57 AM, Robin East robin.e...@xense.co.uk wrote:
 I don’t get those results. I get:
 
 spark         0.14
 scikit-learn  0.85
 
 The scikit-learn MSE is due to the very low eta0 setting. Tweak that to 0.1 
 and push iterations to 400 and you get an MSE ~= 0; of course the coefficients 
 are both ~1 and the intercept ~0. Similarly, if you change the MLlib step size 
 to 0.5 and the number of iterations to 1200 you again get a very low MSE. One 
 of the issues with SGD is that you have to tweak these parameters to tune the 
 algorithm.
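
 For reference, the scikit-learn side of that tweak can be sketched as below. This uses the current SGDRegressor API (max_iter; 2015-era releases used n_iter), and it regenerates the z = x + y data rather than reusing the pastebin script, so the exact numbers will differ slightly:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Regenerate data matching the thread's description: 1000 points of z = x + y
rng = np.random.default_rng(42)
X = rng.random((1000, 2))
z = X[:, 0] + X[:, 1]

# eta0 = 0.1 with plenty of passes, as suggested above; other settings default
model = SGDRegressor(eta0=0.1, max_iter=400, random_state=0)
model.fit(X, z)

mse = mean_squared_error(z, model.predict(X))
print(model.coef_, model.intercept_, mse)
```

 With these settings the coefficients should come out close to 1, the intercept close to 0, and the MSE close to 0, consistent with the figures quoted in this thread.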
 
 FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLlib is 
 nowhere near as mature as scikit-learn. However, if you have large datasets 
 that won’t sensibly fit scikit-learn's in-core model, MLlib is one of the top 
 choices. Similarly, if you are running proofs of concept that you will 
 eventually scale up to production environments, there is a definite argument 
 for using MLlib at both the PoC and production stages.
 
 
 On 21 Jan 2015, at 20:39, JacquesH jaaksem...@gmail.com wrote:
 
  I've recently been trying to get to know Apache Spark as a replacement for
  Scikit Learn, however it seems to me that even in simple cases, Scikit
  converges to an accurate model far faster than Spark does.
  For example I generated 1000 data points for a very simple linear function
  (z=x+y) with the following script:
 
  http://pastebin.com/ceRkh3nb
 
  I then ran the following Scikit script:
 
  http://pastebin.com/1aECPfvq
 
  And then this Spark script: (with spark-submit filename, no other
  arguments)
 
  http://pastebin.com/s281cuTL
 
  Strangely though, the error given by Spark is an order of magnitude larger
  than that given by Scikit (0.185 and 0.045 respectively) despite the two
  models having a nearly identical setup (as far as I can tell).
  I understand that this is using SGD with very few iterations, so the
  results may differ, but I wouldn't have thought that it would be anywhere
  near such a large difference or such a large error, especially given the
  exceptionally simple data.
 
  Is there something I'm misunderstanding in Spark? Is it not correctly
  configured? Surely I should be getting a smaller error than that?
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 


