Hi Jeremy,

Thanks for the reply.
Jeremy Freeman wrote:
> That said, there's a performance hit. In my testing (v0.8.1) a simple
> algorithm, KMeans (the versions included with Spark), is ~2x faster per
> iteration in Scala than Python in our set up (private HPC, ~30 nodes, each
> with 128GB and 16 cores, roughly comparable to the higher-end EC2
> instances). I'm preparing more extensive benchmarks, esp. re: matrix
> calculations, where the difference may shrink (will post them to this
> forum when ready). For our purposes (purely research), things are fast
> enough already that the benefits of PySpark outweigh the costs, but will
> depend on your use case.

So you measured a Scala/Java library on Spark against NumPy/SciPy on PySpark, right? Can you tell me which library you used?

A benchmark (or even an initial ballpark figure for the performance difference) on matrix calculations would be awesome - that's exactly what I'm wondering about: whether the difference evens out there.

I'm still working on something else and will arrive at Spark/PySpark in a couple of weeks. If you can share the results before then, it'll save me a great deal of time and toil.

Best,
Nilesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048p1051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
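P.S. For concreteness, this is roughly the kind of matrix-calculation kernel I'd want benchmarked - a minimal pure-NumPy sketch of the dense-matmul work a NumPy-backed PySpark partition would do (the function name and sizes are my own, just for illustration; it makes no claim about Jeremy's actual benchmark setup):

```python
import time
import numpy as np

def time_matmul(n=512, trials=5, seed=0):
    """Return the best-of-trials wall time (seconds) for an n x n dense
    matrix multiply - the kernel that dominates per-partition work in a
    NumPy-backed PySpark matrix job."""
    rng = np.random.RandomState(seed)
    a = rng.rand(n, n)
    b = rng.rand(n, n)
    best = float("inf")
    for _ in range(trials):
        start = time.time()
        a.dot(b)  # BLAS-backed multiply; this is what NumPy delegates to
        best = min(best, time.time() - start)
    return best

if __name__ == "__main__":
    print("best of 5 runs: %.4f s" % time_matmul())
```

Since NumPy hands this off to the same BLAS a JVM library would ultimately call into, I'd expect the Scala/Python gap to be narrower here than in the KMeans case - which is exactly what I'm hoping your benchmarks will show one way or the other.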
