Hi Nilesh,

We're building a data analysis library purely in PySpark that involves a fair
bit of numerical computing (https://github.com/freeman-lab/thunder), and we
faced the same decision as you when starting out.

We went with PySpark because of NumPy and SciPy. So many functions come built
in, with robust implementations: signal processing, optimization, matrix
math, etc., and it's trivial to set up. In Scala, we needed different
libraries for specific problems, and many are still in their early days
(bugs, missing features, etc.). The PySpark API is relatively complete,
though a few bits of functionality aren't there yet (zipping is probably the
only one we sometimes miss; it's useful for certain matrix operations). It
was definitely feasible to build a functional library entirely in PySpark.
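
For a sense of what that looks like, here's a minimal sketch of the pattern
we lean on: ordinary RDD transformations with NumPy doing the per-record math
inside them. The input path and record layout below are made up for
illustration, not from our actual code.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="numpy-in-pyspark")

# Hypothetical input: a text file with one whitespace-separated numeric
# record per line (e.g. one time series per row).
records = sc.textFile("hdfs:///path/to/records.txt")

# Parse each line into a NumPy array, then apply NumPy routines per record
# inside ordinary RDD transformations.
series = records.map(lambda line: np.array([float(x) for x in line.split()]))

# Example: z-score each record, then compute its Euclidean norm with NumPy.
normed = series.map(lambda v: (v - v.mean()) / v.std())
norms = normed.map(lambda v: float(np.linalg.norm(v)))

print(norms.take(5))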

That said, there's a performance hit. In my testing (on v0.8.1), a simple
algorithm, KMeans (the versions included with Spark), runs ~2x faster per
iteration in Scala than in Python on our setup (a private HPC cluster, ~30
nodes, each with 128 GB of RAM and 16 cores, roughly comparable to the
higher-end EC2 instances). I'm preparing more extensive benchmarks,
especially for matrix calculations, where the difference may shrink (I'll
post them to this forum when ready). For our purposes (purely research),
things are already fast enough that the benefits of PySpark outweigh the
costs, but that will depend on your use case.
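
If it helps, per-iteration timing can be measured with a loop along these
lines, in the spirit of the example KMeans that ships with Spark. This is a
rough sketch rather than the exact code we benchmarked; the input path, k,
and iteration count are placeholders.

import time
import numpy as np
from pyspark import SparkContext

def closest_center(point, centers):
    # Index of the nearest center under squared Euclidean distance.
    dists = [float(np.sum((point - c) ** 2)) for c in centers]
    return int(np.argmin(dists))

sc = SparkContext(appName="kmeans-timing")

# Hypothetical input: one point per line, whitespace-separated floats.
points = sc.textFile("hdfs:///path/to/points.txt") \
           .map(lambda line: np.array([float(x) for x in line.split()])) \
           .cache()

k = 10
centers = points.takeSample(False, k, 1)

for i in range(5):
    start = time.time()
    # Assign each point to its nearest center, then recompute the means.
    assigned = points.map(lambda p: (closest_center(p, centers), (p, 1)))
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    stats = sorted(sums.collect(), key=lambda kv: kv[0])
    centers = [s / n for (_, (s, n)) in stats]
    print("iteration %d took %.1f s" % (i, time.time() - start))

The per-iteration wall-clock times from a loop like this are the kind of
numbers I'm comparing against the Scala version.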

I can't speak much to the current roadblocks and future plans for speed-ups,
though I know Josh has mentioned he's working on new custom serializers.

-- Jeremy


