I've recently been trying to get to know Apache Spark as a replacement for
scikit-learn; however, it seems to me that even in simple cases, scikit-learn
converges to an accurate model far faster than Spark does.
For example, I generated 1000 data points for a very simple linear function
(z = x + y) with the following script:

http://pastebin.com/ceRkh3nb
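(For anyone who doesn't want to open the link: the generator isn't reproduced here, but based on the description above, a minimal sketch of it might look like the following. This is my own reconstruction, not the pastebin contents.)

```python
import numpy as np

# Hypothetical reconstruction of the data generator described above:
# 1000 random (x, y) points in [0, 1]^2 with label z = x + y exactly.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))
z = X[:, 0] + X[:, 1]
```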

I then ran the following Scikit script:

http://pastebin.com/1aECPfvq
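(Again for convenience, the rough shape of what I'm doing on the scikit-learn side — a stand-in sketch, not the linked script, and assuming an SGD-based model since that's what I'm comparing against:)

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Same synthetic data as above: z = x + y with no noise.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))
z = X[:, 0] + X[:, 1]

# Fit an SGD-based linear model and measure training MSE.
model = SGDRegressor(max_iter=1000, tol=1e-6, random_state=0)
model.fit(X, z)
mse = np.mean((model.predict(X) - z) ** 2)
```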

And then this Spark script (run with spark-submit <filename>, no other
arguments):

http://pastebin.com/s281cuTL

Strangely, though, the error given by Spark is an order of magnitude larger
than that given by scikit-learn (0.185 and 0.045 respectively), despite the
two models having a nearly identical setup as far as I can tell.
I understand that this is using SGD with very few iterations, so the results
may differ, but I wouldn't have thought the difference would be anywhere
near this large, or the error this large, especially given the
exceptionally simple data.
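To make that intuition concrete, here's a plain NumPy sketch (not Spark code; the step size and iteration counts are illustrative, not MLlib's actual defaults) showing how much the iteration count alone changes the training error of gradient descent on this data:

```python
import numpy as np

# Same synthetic data: z = x + y exactly.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))
z = X[:, 0] + X[:, 1]

def gd_mse(n_iters, step=0.1):
    """Run n_iters of full-batch gradient descent on squared error,
    starting from w = 0, and return the resulting training MSE."""
    w = np.zeros(2)
    for _ in range(n_iters):
        grad = 2.0 / len(z) * X.T @ (X @ w - z)
        w -= step * grad
    return np.mean((X @ w - z) ** 2)

mse_short = gd_mse(10)    # stopped early: visibly large error
mse_long = gd_mse(1000)   # near-converged: error close to zero
```

So an order-of-magnitude gap from iteration count alone is plausible, which is why I'm asking whether that's all this is or whether something is misconfigured.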

Is there something I'm misunderstanding in Spark? Is it not correctly
configured? Surely I should be getting a smaller error than that?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
