Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
Hi

There are many variants of gradient descent, differing mostly in how the step 
size is chosen and how it is adjusted as the algorithm proceeds. If a stochastic 
variant is used (as opposed to batch descent), there are further variations. I 
don’t know off-hand what MLlib’s exact implementation is, but no doubt there are 
differences between the two; perhaps someone with more knowledge of the 
internals could comment.

As you can tell from playing around with the parameters, step size is vitally 
important to the performance of the algorithm.
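
To make the step-size point concrete, here is a minimal sketch in plain NumPy (not MLlib’s actual implementation) of batch gradient descent on data like the z = x + y example discussed below. It just shows how strongly a fixed step size affects convergence for a fixed iteration budget:

```python
import numpy as np

def batch_gd(X, z, step, iterations):
    # Batch gradient descent for least squares -- illustrative only,
    # not MLlib's implementation.
    w = np.zeros(X.shape[1])
    n = len(z)
    for _ in range(iterations):
        grad = (2.0 / n) * X.T @ (X @ w - z)  # gradient of the MSE
        w -= step * grad
    return w

def mse(X, z, w):
    return float(np.mean((X @ w - z) ** 2))

rng = np.random.default_rng(0)
X = rng.random((1000, 2))   # 1000 points, features in [0, 1)
z = X[:, 0] + X[:, 1]       # the simple linear function z = x + y

w_small = batch_gd(X, z, step=0.01, iterations=100)
w_large = batch_gd(X, z, step=0.5, iterations=100)
print(mse(X, z, w_small), mse(X, z, w_large))
```

On this data the larger step drives the MSE down orders of magnitude faster than the smaller one within the same 100 iterations, which is exactly the behaviour being discussed in this thread.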


On 22 Jan 2015, at 06:44, Jacques Heunis jaaksem...@gmail.com wrote:

 Ah I see, thanks!
 I was just confused because, given the same configuration, I would have 
 thought that Spark and Scikit would give more similar results, but I guess 
 this is simply not the case (as in your example, in order to get Spark to 
 give an MSE sufficiently close to Scikit's you have to give it a 
 significantly larger step size and iteration count).
 
 Would that then be a result of MLlib and Scikit differing slightly in their 
 exact implementations of the optimizer? Or rather a case of (as you say) 
 Scikit being a far more mature system (and therefore that MLlib would 'get 
 better' over time)? Surely it is far from ideal that to get the same results 
 you need more iterations (i.e. more computation), or do you think that that is 
 simply coincidence and that given a different model/dataset it may be the 
 other way around?
 
 I ask because I encountered this situation on other, larger datasets, so this 
 is not an isolated case (though being the simplest example I could think of, I 
 would imagine it's somewhat indicative of general behaviour).
 
 On Thu, Jan 22, 2015 at 1:57 AM, Robin East robin.e...@xense.co.uk wrote:
 I don’t get those results. I get:
 
 spark         0.14
 scikit-learn  0.85
 
 The scikit-learn MSE is due to the very low eta0 setting. Tweak that to 0.1 
 and push iterations to 400 and you get an MSE ~= 0; of course the coefficients 
 are both ~1 and the intercept ~0. Similarly, if you change the MLlib step size 
 to 0.5 and the number of iterations to 1200 you again get a very low MSE. One 
 of the issues with SGD is that you have to tweak these parameters to tune the 
 algorithm.
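
 For reference, the scikit-learn side of that tweak can be sketched as below. This uses the current SGDRegressor API (max_iter; 2015-era releases used n_iter), and it regenerates the z = x + y data rather than reusing the pastebin script, so the exact numbers will differ slightly:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Regenerate data matching the thread's description: 1000 points of z = x + y
rng = np.random.default_rng(42)
X = rng.random((1000, 2))
z = X[:, 0] + X[:, 1]

# eta0 = 0.1 with plenty of passes, as suggested above; other settings default
model = SGDRegressor(eta0=0.1, max_iter=400, random_state=0)
model.fit(X, z)

mse = mean_squared_error(z, model.predict(X))
print(model.coef_, model.intercept_, mse)
```

 With these settings the coefficients should come out close to 1, the intercept close to 0, and the MSE close to 0, consistent with the figures quoted in this thread.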
 
 FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLlib is 
 nowhere near as mature as scikit-learn. However, if you have large datasets 
 that won’t sensibly fit scikit-learn's in-core model, MLlib is one of the top 
 choices. Similarly, if you are running proofs of concept that you will 
 eventually scale up to production environments, there is a definite argument 
 for using MLlib at both the PoC and production stages.
 
 
 On 21 Jan 2015, at 20:39, JacquesH jaaksem...@gmail.com wrote:
 
  I've recently been trying to get to know Apache Spark as a replacement for
  Scikit Learn, however it seems to me that even in simple cases, Scikit
  converges to an accurate model far faster than Spark does.
  For example I generated 1000 data points for a very simple linear function
  (z=x+y) with the following script:
 
  http://pastebin.com/ceRkh3nb
 
  I then ran the following Scikit script:
 
  http://pastebin.com/1aECPfvq
 
  And then this Spark script: (with spark-submit filename, no other
  arguments)
 
  http://pastebin.com/s281cuTL
 
  Strangely though, the error given by Spark is an order of magnitude larger
  than that given by Scikit (0.185 and 0.045 respectively) despite the two
  models having a nearly identical setup (as far as I can tell).
  I understand that this is using SGD with very few iterations, so the
  results may differ, but I wouldn't have thought that it would be anywhere
  near such a large difference or such a large error, especially given the
  exceptionally simple data.
 
  Is there something I'm misunderstanding in Spark? Is it not correctly
  configured? Surely I should be getting a smaller error than that?
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 


