Hi there,

*Background:* I need to do some matrix multiplication inside the mappers, and I'm trying to choose between Python and Scala for writing the Spark MR jobs. I'm equally fluent in Python and Java, and find Scala pretty easy too, for what it's worth. Going with Python would let me use numpy + scipy, which are blazing fast compared to Java libraries like Colt. Configuring Java with BLAS also seems to be a pain compared to scipy (direct apt-get or pip installs).
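To make the workload concrete, here is a minimal sketch of the kind of per-record work meant above. It assumes each record is a (key, block) pair and that a small shared matrix is multiplied into every block; the names `multiply_block`, `weights`, and `records` are made up for illustration, and the built-in `map` stands in for the RDD's `map` so the snippet runs without a cluster.

```python
import numpy as np

def multiply_block(record, weights):
    """Mapper-style function: multiply one keyed matrix block by a shared matrix."""
    key, block = record
    # np.dot hands the multiplication to numpy's compiled routines,
    # so no Python-level loop runs over the matrix entries.
    return key, np.dot(block, weights)

# Hypothetical input: (key, block) pairs standing in for the records of an RDD.
weights = np.array([[1.0, 0.0],
                    [0.0, 2.0]])
records = [
    (0, np.array([[1.0, 2.0], [3.0, 4.0]])),
    (1, np.array([[0.0, 1.0], [1.0, 0.0]])),
]

# Locally, the built-in map plays the role of the mapper stage.
results = dict(map(lambda r: multiply_block(r, weights), records))
```

In an actual job the same function would presumably be passed to the RDD's `map`, with the framework serializing each block on the way in and out of the Python worker, which is the cost the question below is about.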
*Question:* I posted a couple of comments on this answer at Stack Overflow: http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python. Basically, it states that as of Spark 0.7.2, the Python API would be slower than Scala. What's the performance scenario now?

The fork issue seems to be fixed. How about serialization? Can it match the performance of Java/Scala Writable-like serialization (which knows the object type beforehand and so reduces I/O)?

Also, a probably silly question - loops seem to be slow in Python in general. Do you think this could turn out to be an issue?

Bottom line: should I choose Python for computation-intensive algorithms like PageRank? Scipy gives me an edge, but does the framework kill it?

Any help, insights, or benchmarks will be much appreciated. :)

Cheers,
Nilesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
