A much (much) better solution than Python (and also Scala, if that doesn't
upset you) is Julia <http://julialang.org/>.

Libraries like numpy and scipy look bloated next to Julia's C-like
performance. Julia comes with everything that numpy + scipy offer, and more,
without the performance hit.

I hope we see official support for Julia on Spark very soon.
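For comparison with the numpy route Nilesh asks about below: the usual trick, whatever the language, is to keep the matrix work inside BLAS rather than in interpreter-level loops. A minimal sketch (all names here are illustrative, and this is not a benchmark) of a per-partition mapper that batches its rows into a single numpy matrix multiply:

```python
import numpy as np

# A fixed weight matrix that every task would receive (illustrative values).
WEIGHTS = np.arange(6, dtype=np.float64).reshape(2, 3)

def multiply_partition(rows):
    """Multiply each incoming length-3 vector by WEIGHTS.

    In Spark this would be passed to rdd.mapPartitions(multiply_partition).
    Collecting the rows into one (n, 3) matrix turns n small products into
    a single BLAS call, which is where numpy's speed actually comes from.
    """
    batch = np.array(list(rows), dtype=np.float64)
    if batch.size == 0:
        return iter([])
    products = batch @ WEIGHTS.T  # shape (n, 2)
    return iter(products.tolist())

# Standalone check, no Spark cluster needed:
out = list(multiply_partition([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

The same batching idea applies in Scala (via netlib/Breeze) or Julia; the per-element loop is what kills you, not the language per se.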


On Thu, Jan 30, 2014 at 4:30 PM, nileshc <[email protected]> wrote:

> Hi there,
>
> *Background:*
> I need to do some matrix multiplication stuff inside the mappers, and
> trying
> to choose between Python and Scala for writing the Spark MR jobs. I'm
> equally fluent with Python and Java, and find Scala pretty easy too for
> what
> it's worth. Going with Python would let me use numpy + scipy, which is
> blazing fast when compared to Java libraries like Colt etc. Configuring
> Java
> with BLAS seems to be a pain when compared to scipy (direct apt-get
> installs, or pip).
>
> *Question:*
> I posted a couple of comments on this answer at StackOverflow:
>
> http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python
> .
> Basically it states that as of Spark 0.7.2, the Python API would be slower
> than Scala. What's the performance scenario now? The fork issue seems to be
> fixed. How about serialization? Can it match Java/Scala Writable-like
> serialization (having knowledge of object type beforehand, reducing I/O)
> performance? Also, a probably silly question - loops seem to be slow in
> Python in general, do you think this can turn out to be an issue?
>
> Bottom line: should I choose Python for computation-intensive algorithms
> like PageRank? Scipy gives me an edge, but does the framework kill it?
>
> Any help, insights, benchmarks will be much appreciated. :)
>
> Cheers,
> Nilesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
