Re: Python API Performance

Evan Sparks Sat, 01 Feb 2014 18:00:49 -0800

We used breeze in some early MLlib prototypes last year. It feels very "scala", 
which is a huge plus, but unfortunately we found that the object overhead and 
the difficulty of tracking down performance problems caused by heavy use of 
implicit conversions inside breeze made it hard to write high-performance 
matrix code with it. Further, at least for the early algorithms, we didn't need 
all the extra flexibility that breeze provides, since our use cases were pretty 
straightforward.


> On Feb 1, 2014, at 5:51 PM, 尹绪森 <[email protected]> wrote:
> 
> How about breeze (http://www.scalanlp.org/)? It is written in Scala and uses 
> netlib-java as the backend. 
> (https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra#wiki-performance)
> 
> I think breeze is closer to matlab and numpy/scipy in terms of ease of use. 
> That would also be a good aspect to test.
> 
> 
> 2014-02-02 Ankur Chauhan <[email protected]>:
>> How does Julia interact with Spark? I would be interested, mainly because I 
>> find Scala syntax a little obscure, and it would be great to see actual 
>> numbers comparing Scala, Python, and Julia workloads. 
>> 
>>> On Feb 1, 2014, at 16:08, Aureliano Buendia <[email protected]> wrote:
>>> 
>>> A much (much) better solution than Python (and also Scala, if that doesn't 
>>> make you upset) is Julia.
>>> 
>>> Libraries like numpy and scipy are bloated when compared with Julia's 
>>> C-like performance. Julia comes with everything that numpy + scipy come 
>>> with, plus more, minus the performance hit.
>>> 
>>> I hope we can see official support for Julia in Spark very soon.
>>> 
>>> 
>>>> On Thu, Jan 30, 2014 at 4:30 PM, nileshc <[email protected]> wrote:
>>>> Hi there,
>>>> 
>>>> *Background:*
>>>> I need to do some matrix multiplication inside the mappers, and I am
>>>> trying to choose between Python and Scala for writing the Spark MR jobs.
>>>> I'm equally fluent with Python and Java, and find Scala pretty easy too,
>>>> for what it's worth. Going with Python would let me use numpy + scipy,
>>>> which is blazing fast when compared to Java libraries like Colt etc.
>>>> Configuring Java with BLAS seems to be a pain when compared to scipy
>>>> (direct apt-get installs, or pip).
>>>> 
>>>> *Question:*
>>>> I posted a couple of comments on this answer at StackOverflow:
>>>> http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python.
>>>> Basically it states that as of Spark 0.7.2, the Python API would be slower
>>>> than Scala. What's the performance scenario now? The fork issue seems to be
>>>> fixed. How about serialization? Can it match Java/Scala Writable-like
>>>> serialization (having knowledge of object type beforehand, reducing I/O)
>>>> performance? Also, a probably silly question - loops seem to be slow in
>>>> Python in general, do you think this can turn out to be an issue?
>>>> 
>>>> Bottom line, should I choose Python for computation-intensive algorithms
>>>> like PageRank? Scipy gives me an edge, but does the framework kill it?
>>>> 
>>>> Any help, insights, benchmarks will be much appreciated. :)
>>>> 
>>>> Cheers,
>>>> Nilesh
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context: 
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 
> 
> -- 
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and 
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/
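
On the serialization question in the original post: one general way to reduce
per-object serialization overhead is to serialize records in batches rather
than one at a time. The sketch below only illustrates that idea with Python's
standard pickle module. It is not taken from Spark, and the function names,
the batch size of 100, and the sample rows are all assumptions made up for
this example.

```python
import pickle


def per_record_size(rows):
    """Total bytes when every row is pickled as its own message."""
    return sum(len(pickle.dumps(row)) for row in rows)


def batched_size(rows, batch_size=100):
    """Total bytes when rows are pickled in groups of batch_size.

    batch_size is an arbitrary illustrative value, not a Spark setting.
    """
    total = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        total += len(pickle.dumps(batch))
    return total


# Hypothetical data: 10,000 small rows of three floats each.
rows = [[float(i), float(i + 1), float(i + 2)] for i in range(10000)]

print("one pickle per row:", per_record_size(rows), "bytes")
print("pickled in batches:", batched_size(rows), "bytes")
```

Each standalone pickle carries its own header and terminator, so for small
rows like these the batched total comes out smaller. Actual numbers depend on
the Python version and on the shape of the data, so this says nothing
definitive about how the Python API performs in practice.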
