If you just need basic matrix operations: Spark depends on jblas (
http://mikiobraun.github.io/jblas/) for fast linear algebra routines
inside MLlib and GraphX. jblas does a nice job of avoiding boxing/unboxing
issues when calling out to BLAS, so it might be what you're looking for.
The programming patterns you'll be able to support with jblas (matrix ops
on local partitions) are very similar to what you'd get with numpy, etc.
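
To make "matrix ops on local partitions" a bit more concrete, here is a
rough sketch of the data flow only, not of Spark or jblas themselves. Plain
Python lists stand in for an RDD's partitions, and the helper names
(local_matmul, split_rows, partitioned_matmul) are made up for illustration.
In a real job the per-partition multiply would be a single BLAS-backed call
(numpy's dot in Python, or DoubleMatrix.mmul in jblas), not a Python loop.

```python
# Illustrative sketch: lists of rows stand in for partitions of a
# row-oriented dataset. All function names here are hypothetical.

def local_matmul(a, b):
    """Multiply two dense matrices given as lists of rows.

    Stand-in for a BLAS-backed call on one partition's block.
    """
    cols_b = list(zip(*b))  # columns of b, so each entry is a dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

def split_rows(rows, num_partitions):
    """Split rows into contiguous chunks, roughly one per partition."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def partitioned_matmul(a, b, num_partitions=2):
    """Compute the product a * b one row block at a time.

    Each block of `a` is multiplied by the full (small) matrix `b`
    independently, which is the work a single partition would do.
    """
    result = []
    for block in split_rows(a, num_partitions):
        result.extend(local_matmul(block, b))  # one local matrix op per block
    return result

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0], [0.0, 2.0]]
product = partitioned_matmul(a, b)
```

The point is that each block only needs its own rows plus the small matrix
b, so the heavy lifting stays inside one local call per partition.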

I agree that the Python libraries are more complete/feature rich, but if
you really crave high performance then I'd recommend staying pure Scala and
giving jblas a try.


On Thu, Jan 30, 2014 at 8:30 AM, nileshc <[email protected]> wrote:

> Hi there,
>
> *Background:*
> I need to do some matrix multiplication stuff inside the mappers, and
> trying to choose between Python and Scala for writing the Spark MR jobs.
> I'm equally fluent with Python and Java, and find Scala pretty easy too
> for what it's worth. Going with Python would let me use numpy + scipy,
> which is blazing fast when compared to Java libraries like Colt etc.
> Configuring Java with BLAS seems to be a pain when compared to scipy
> (direct apt-get installs, or pip).
>
> *Question:*
> I posted a couple of comments on this answer at StackOverflow:
>
> http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python
>
> Basically it states that as of Spark 0.7.2, the Python API would be slower
> than Scala. What's the performance scenario now? The fork issue seems to be
> fixed. How about serialization? Can it match Java/Scala Writable-like
> serialization (having knowledge of object type beforehand, reducing I/O)
> performance? Also, a probably silly question - loops seem to be slow in
> Python in general, do you think this can turn out to be an issue?
>
> Bottom line, should I choose Python for computation-intensive algorithms
> like PageRank? Scipy gives me an edge, but does the framework kill it?
>
> Any help, insights, benchmarks will be much appreciated. :)
>
> Cheers,
> Nilesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
