I’m one of the authors of this paper, and I just came across this thread. I’m glad that Ignacio Zendejas noticed our paper!
First off, let me post a link to the published version of the paper, which is likely slightly different from the version linked above: http://cmj4.web.rice.edu/performance.pdf

Next, I just want to quickly address a couple of comments made here.

rxin says:

> They only compared their own implementations of couple algorithms on
> different platforms rather than comparing the different platforms
> themselves (in the case of Spark -- PySpark). I can write two variants of
> an algorithm on Spark and make them perform drastically differently.

It’s a bit misleading to say that we tried just a “couple” of algorithms; the paper describes five different algorithms, along with multiple implementations of each. We tried a LOT of variants of each algorithm, as we detail in the paper.

Also, it is true that we used our own implementations; the point was to compare each platform as a programming and execution platform. The paper is clear that the benchmark was directed towards “a user who wants to run a specific ML inference algorithm over a large data set, but cannot find an existing implementation and thus must 'roll her own' ML code.” We specifically state that we are not interested in comparing canned libraries, which is a very different task. Both ease of use and speed were considered equally important. If you read the paper, you’ll see that we generally gave PySpark high marks as a programming platform.

Several of the posts here imply that all Spark experiments used PySpark. This is not true.

Matei Zaharia says:

> Just as a note on this paper, apart from implementing the algorithms in
> naive Python…

And Ignacio Zendejas says:

> Interesting that they chose to use Python alone…

In reality, the paper also describes pure Java implementations that ran on Spark, with no Python, for two models: a Gaussian mixture model and LDA.
For the GMM, Java ran in 40-50% of the time compared to Python (though, to be fair, a lot of that is because GMM inference is linear-algebra-intensive, and it’s not easy to do linear algebra on the JVM, for reasons I won’t get into here… it’s possible someone else could do a lot better). On LDA, Java ran in less than 10% of the Python time (that is, much faster).

Matei Zaharia also says:

> they also run it in a fairly inefficient way. In particular
> their implementations send the model out with every task closure, which is
> really expensive for a large model, and bring it back with collectAsMap().

It’s certainly possible our implementations were sub-optimal. All I can say is that, again, the goal of the paper was to chronicle our experiences using each platform as both a programming and execution platform. While doubtless Spark’s developers could have written better code for Spark, I hope we didn’t do too badly! And again, evaluating ease of programming was at least half of our goal.

That said, I doubt that sending out the model was much of a bottleneck, as Matei Zaharia implies, at least in the cases we tested. And even if it was a bottleneck, one could argue that it probably shouldn’t be. In the very worst case (LDA), the model is 100 components x 10^4 dictionary entries x 8 bytes per floating-point number, or 8 MB in all. Not too large. The smallest model is the GMM at 10 dimensions; in this case, the model has 10 components x (a 10 x 10 covariance matrix + a 10-dimensional mean vector) x 8 bytes, or roughly 10 KB. Tiny, even!

Matei Zaharia also says:

> Implementing ML algorithms well by hand is unfortunately difficult, and this
> is why we have MLlib. The hope is that you either get your desired algorithm
> out of the box or get a higher-level primitive (e.g. stochastic gradient
> descent) that you can plug some functions into, without worrying about
> the communication.

I couldn’t agree more.
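For anyone who wants to sanity-check the back-of-the-envelope model sizes I quoted above, here is a few lines of plain Python (just a sketch of the arithmetic, not code from the paper; the component counts and dimensions are the ones stated above, assuming 8-byte double-precision numbers):

```python
# Back-of-the-envelope model sizes, assuming 8-byte (double-precision) floats.
BYTES_PER_FP = 8

# Worst case (LDA): 100 components x 10^4 dictionary entries.
lda_bytes = 100 * 10**4 * BYTES_PER_FP
print(f"LDA model: {lda_bytes / 10**6:.0f} MB")   # 8 MB

# Smallest case (GMM at 10 dimensions): 10 components, each with a
# 10 x 10 covariance matrix plus a 10-dimensional mean vector.
gmm_bytes = 10 * (10 * 10 + 10) * BYTES_PER_FP
print(f"GMM model: {gmm_bytes / 10**3:.1f} KB")   # 8.8 KB, i.e. roughly 10 KB
```

Even the worst case is only a few megabytes, which is why I doubt shipping the model dominated the runtimes we measured.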
But again, the idea of our benchmark was specifically to consider the case of an expert who faces just such an implementation challenge: he or she needs a model for which a canned implementation does not exist.

Finally, let me say that we’d absolutely LOVE it if someone who is an active Spark developer would take the time to implement one or more of these algorithms and replicate our experiments (I’d be happy to help anyone who wants to do this; send me a message). It’s already been a year since we did all of this, and for that reason alone the results might be quite different today.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/A-Comparison-of-Platforms-for-Implementing-and-Running-Very-Large-Scale-Machine-Learning-Algorithms-tp7823p8326.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.