Hi Jeremy,

Thanks for the reply.


Jeremy Freeman wrote
> That said, there's a performance hit. In my testing (v0.8.1) a simple
> algorithm, KMeans (the versions included with Spark), is ~2x faster per
> iteration in Scala than Python in our set up (private HPC, ~30 nodes, each
> with 128GB and 16 cores, roughly comparable to the higher-end EC2
> instances). I'm preparing more extensive benchmarks, esp. re: matrix
> calculations, where the difference may shrink (will post them to this
> forum when ready). For our purposes (purely research), things are fast
> enough already that the benefits of PySpark outweigh the costs, but will
> depend on your use case.

So you compared a Scala/Java KMeans implementation on Spark against a
numpy/scipy-based one on PySpark, right? Could you tell me which library you used?

A benchmark (or even a rough initial ballpark figure) for matrix calculations
would be awesome - that's exactly what I'm wondering about: whether the
performance difference evens out there. I'm still working on something else and
will get to Spark/PySpark in a couple of weeks, so if you guys can share the
results before then, it'll save me a great deal of time and toil.
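For context, the kind of local microbenchmark I'd run first, before involving
Spark at all, is just timing the dense numpy kernel that would dominate the
per-worker cost of a distributed matrix calculation. This is only a sketch;
the matrix size and repetition count are arbitrary placeholders:

```python
import time
import numpy as np

def time_matmul(n=512, reps=5):
    """Time a dense n x n matrix multiply, taking the best of `reps` runs
    to reduce noise from warm-up and other processes."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        c = a @ b  # numpy dispatches this to the underlying BLAS
        best = min(best, time.perf_counter() - start)
    return c, best

result, seconds = time_matmul()
print(f"512x512 matmul, best of 5: {seconds:.4f}s")
```

Since numpy hands this off to BLAS either way, I'd expect the per-kernel cost
to be similar on the PySpark workers; the open question is the serialization
overhead around it.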

Best,
Nilesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048p1051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.