I've recently been trying to get to know Apache Spark as a replacement for Scikit Learn. However, it seems to me that even in simple cases, Scikit converges to an accurate model far faster than Spark does. For example, I generated 1000 data points for a very simple linear function (z = x + y) with the following script:
http://pastebin.com/ceRkh3nb

I then ran the following Scikit script: http://pastebin.com/1aECPfvq

And then this Spark script (run with spark-submit <filename>, no other arguments): http://pastebin.com/s281cuTL

Strangely, though, the error given by Spark is an order of magnitude larger than that given by Scikit (0.185 and 0.045 respectively), despite the two models having a nearly identical setup as far as I can tell. I understand that this is using SGD with very few iterations, so the results may differ, but I wouldn't have thought the difference, or the error itself, would be anywhere near that large, especially given the exceptionally simple data.

Is there something I'm misunderstanding in Spark? Is it not configured correctly? Surely I should be getting a smaller error than that?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
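To show what I mean without the pastebins, here is a rough, self-contained pure-Python sketch of the experiment: full-batch gradient descent with a 1/sqrt(t) step-size decay on the same z = x + y data. This is only my loose reading of what LinearRegressionWithSGD's defaults do (no intercept, full-batch updates, decaying step) — it is not MLlib's actual code, and the uniform inputs and seed are my own assumptions, not the pastebin script:

```python
import random

# Assumed stand-in for the data-generation pastebin:
# 1000 points of the noiseless linear function z = x + y.
random.seed(42)
points = [(x, y, x + y)
          for x, y in ((random.random(), random.random()) for _ in range(1000))]

def fit_gd(points, step_size, num_iterations):
    """Full-batch gradient descent for z ~ w1*x + w2*y with a 1/sqrt(t)
    step-size decay -- a rough, illustrative sketch in the spirit of
    LinearRegressionWithSGD's defaults, not MLlib's implementation."""
    w1 = w2 = 0.0
    n = len(points)
    for t in range(1, num_iterations + 1):
        g1 = g2 = 0.0
        for x, y, z in points:
            err = w1 * x + w2 * y - z  # residual of the current model
            g1 += err * x
            g2 += err * y
        lr = step_size / t ** 0.5      # decaying step size
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    return w1, w2

def rmse(points, w1, w2):
    se = sum((w1 * x + w2 * y - z) ** 2 for x, y, z in points)
    return (se / len(points)) ** 0.5

few = rmse(points, *fit_gd(points, 1.0, 5))     # only a handful of iterations
many = rmse(points, *fit_gd(points, 1.0, 100))  # more iterations, same step size
print(few, many)
```

With only a few iterations the fit is noticeably worse, which is the kind of gap I'm seeing — so maybe the iteration count or step size is the culprit rather than Spark itself, but I'd like to confirm that.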