I’m one of the authors of this paper, and I just came across this thread.
I’m glad that Ignacio Zendejas noticed our paper!

First off, let me post a link to the published version of the paper, which is
likely slightly different from the version linked above:

http://cmj4.web.rice.edu/performance.pdf

Next, I just want to quickly address a couple of comments made here.

rxin says:

> They only compared their own implementations of couple algorithms on 
> different platforms rather than comparing the different platforms 
> themselves (in the case of Spark -- PySpark). I can write two variants of 
> an algorithm on Spark and make them perform drastically differently. 

It’s a bit misleading to say that we tried just a “couple” of algorithms;
the paper describes five different algorithms, with multiple implementations
of each; we tried a LOT of variants of each algorithm, as we detail in the
paper.

Also, it is true that we did use our own implementations; the point was to
compare each platform as a programming and execution platform. The paper is
clear that the benchmark was directed towards “a user who wants to run a
specific ML inference algorithm over a large data set, but cannot find an
existing implementation and thus must 'roll her own' ML code.” We
specifically state that we are not interested in comparing canned libraries,
which is a very different task. Ease-of-use and speed were considered
equally important. If you read the paper, you’ll see that we generally gave
PySpark high marks as a programming platform.

Several of the posts here imply that all Spark experiments were using
PySpark. This is not true. Matei Zaharia says:

> Just as a note on this paper, apart from implementing the algorithms in
> naive Python…

And Ignacio Zendejas says:

> Interesting that they chose to use Python alone…

In reality, the paper also describes pure Java implementations that ran on
Spark, with no Python, for two models: a Gaussian mixture model and LDA. 

For the GMM, Java ran in 40-50% of the Python time (though, to be fair, much
of that gap is because GMM inference is linear-algebra-intensive; it’s not
easy to do linear algebra efficiently on the JVM, for reasons I’ll not get
into here… it’s possible someone else could do a lot better). On LDA, Java
took less than 10% of the Python time (that is, it was much faster).

Matei Zaharia also says:

> they also run it in a fairly inefficient way. In particular 
> their implementations send the model out with every task closure, which is 
> really expensive for a large model, and bring it back with collectAsMap(). 

It’s certainly possible our implementations were sub-optimal. All I can say
is that again, the goal of the paper was to chronicle our experiences using
each platform as both a programming and execution platform. While doubtless
Spark’s developers could have done a better job of writing code for Spark, I
hope we didn’t do too badly! And again, evaluating ease-of-programming was
at least half of our goal.

That said, I doubt that sending out the model was much of a bottleneck, as
Matei Zaharia implies, at least in the cases we tested. And even if it
was a bottleneck, one could argue that it probably shouldn't be. In the very
worst case (LDA) the model is 100 components X 10^4 dictionary size X 8
bytes per FP number, or 8 MB in all. Not too large. The smallest model is
the GMM @10 dimension. In this case, the model has 10 components X (10 X 10
covariance matrix + 10 dim mean vector) X 8 bytes, or roughly 10KB.  Tiny
even!
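For anyone who wants to check the arithmetic, the two back-of-the-envelope
sizes above work out in a few lines of Python (the component counts and
dictionary size are the ones quoted in the paragraph above):

```python
# Back-of-the-envelope model sizes from the discussion above.
BYTES_PER_FLOAT = 8  # one double-precision FP number

# LDA (worst case): 100 components x 10^4 dictionary size
lda_bytes = 100 * 10**4 * BYTES_PER_FLOAT
print(lda_bytes / 10**6, "MB")  # 8.0 MB

# GMM at 10 dimensions: 10 components x (10x10 covariance + 10-dim mean)
gmm_bytes = 10 * (10 * 10 + 10) * BYTES_PER_FLOAT
print(gmm_bytes / 10**3, "KB")  # 8.8 KB, i.e. roughly 10 KB
```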

Matei Zaharia also says:

> Implementing ML algorithms well by hand is unfortunately difficult, and
> this 
> is why we have MLlib. The hope is that you either get your desired
> algorithm 
> out of the box or get a higher-level primitive (e.g. stochastic gradient 
> descent) that you can plug some functions into, without worrying about 
> the communication. 

I couldn’t agree more. But again, the idea of our benchmark was specifically
to consider the case of an expert who is facing just such an implementation
challenge: he/she needs a model for which a canned implementation does not
exist. 

Finally, let me say that we’d absolutely LOVE it if someone who is an active
Spark developer would take the time to implement one or more of these
algorithms and replicate our experiments (I’d be happy to help anyone out
who wants to do this—send me a message). It’s already been a year since we
did all of this, and for that reason alone the results might be quite
different.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/A-Comparison-of-Platforms-for-Implementing-and-Running-Very-Large-Scale-Machine-Learning-Algorithms-tp7823p8326.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
