Hi Nilesh,

We're building a data analysis library purely in PySpark that involves a fair bit of numerical computing (https://github.com/freeman-lab/thunder), and we faced the same decision as you when starting out.
We went with PySpark because of NumPy and SciPy. So many functions are included, with robust implementations: signal processing, optimization, matrix math, etc., and it's trivial to set up (there's a small sketch of what this looks like for us in the P.S. below). In Scala, we needed different libraries for specific problems, and many of those are still in their early days (bugs, missing features, etc.). The PySpark API is relatively complete, though a few bits of functionality aren't there yet; zipping is probably the one we miss most often, since it's useful for certain matrix operations (the workaround we fall back on is also sketched in the P.S.). It was definitely feasible to build a functional library entirely in PySpark.

That said, there's a performance hit. In my testing (v0.8.1), a simple algorithm, KMeans (the versions included with Spark), runs ~2x faster per iteration in Scala than in Python on our setup (private HPC, ~30 nodes, each with 128 GB and 16 cores, roughly comparable to the higher-end EC2 instances). I'm preparing more extensive benchmarks, especially for matrix calculations, where the difference may shrink (I'll post them to this forum when ready). For our purposes (purely research), things are already fast enough that the benefits of PySpark outweigh the costs, but that will depend on your use case.

I can't speak much to the current roadblocks and future plans for speed-ups, though I know Josh has mentioned he's working on new custom serializers.

-- Jeremy
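P.S. For concreteness, here's roughly what leaning on NumPy/SciPy from PySpark looks like for us. This is only a sketch; the file path, the parsing, and the choice of SciPy routine are all just placeholders:

    import numpy as np
    from scipy import signal
    from pyspark import SparkContext

    sc = SparkContext("local", "example")

    # each line of the (hypothetical) input file is a space-delimited time series
    data = sc.textFile("hdfs:///path/to/records.txt") \
             .map(lambda line: np.array([float(x) for x in line.split(' ')]))

    # run a SciPy routine (here, detrending) on every record in parallel
    detrended = data.map(lambda ts: signal.detrend(ts))

    # combine with NumPy, e.g. an element-wise sum across records
    total = detrended.reduce(lambda x, y: x + y)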
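And here's the kind of workaround we use in place of Scala's zip: key both RDDs by an index carried in the data and join on it. Again just a sketch with made-up data, and note that the join introduces a shuffle that a true zip would avoid:

    # two RDDs of (index, vector) pairs; in Scala we'd simply call rows.zip(otherRows)
    rows = sc.parallelize([(i, np.ones(5) * i) for i in range(10)])
    other_rows = sc.parallelize([(i, np.ones(5)) for i in range(10)])

    # pair records up by their shared index, then combine, e.g. element-wise add
    summed = rows.join(other_rows) \
                 .map(lambda kv: (kv[0], kv[1][0] + kv[1][1]))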
