Hi, this one is quite interesting. Is it possible to share a few toy examples?
On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <assaf.mendel...@rsa.com> wrote:
> I am not aware of any official testing, but you can easily create your own.
>
> In tests I ran, I saw that Python UDFs were more than 10 times slower
> than Scala UDFs (and in some cases closer to 50 times slower).
>
> That said, it would depend on how you use your UDF.
>
> For example, let's say you have a 1 billion row table on which you do some
> aggregation and are left with a 10K-row table. If you apply the Python UDF
> at the beginning then it might take a hard hit, but if you apply it to the
> 10K-row table then the overhead might be negligible.
>
> Furthermore, you can always write the UDF in Scala and wrap it.
>
> This is something my team did. We have data scientists working on Spark in
> Python. Normally, they can use the existing functions to do what they need
> (Spark already has a pretty nice spread of functions which answer most of
> the common use cases). When they need a new UDF or UDAF they simply ask my
> team (which does the engineering) and we write them a Scala one and then
> wrap it to be accessible from Python.
>
> *From:* ayan guha [mailto:[hidden email]]
> *Sent:* Friday, September 02, 2016 12:21 AM
> *To:* kant kodali
> *Cc:* Mendelson, Assaf; user
> *Subject:* Re: Scala Vs Python
>
> Thanks all for your replies.
>
> Feature Parity:
>
> MLlib, RDD and DataFrame features are totally comparable. Streaming is
> now on par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think that's a deal breaker.
>
> Performance:
>
> I do get this bit when RDDs are involved, but not when the DataFrame is the
> only construct I am operating on. DataFrames are supposed to be
> language-agnostic in terms of performance. So why do people think Python is
> slower? Is it because of using UDFs? Any other reason?
>
> *Is there any kind of benchmarking/stats around a Python UDF vs Scala UDF
> comparison, like the ones out there between RDDs?*
>
> @Kant: I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why PySpark
> applications will not work. If you have any benchmarks, please share them
> if possible.
>
> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
>
> C'mon man, this is a no-brainer. Dynamically typed languages for large code
> bases or large-scale distributed systems make absolutely no sense. I can
> write a 10-page essay on why that wouldn't work so great. You might be
> wondering why Spark would have it then? Well, probably because of its ease
> of use for ML (that would be my best guess).
>
> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:
>
> I believe this would greatly depend on your use case and your familiarity
> with the languages.
>
> In general, Scala would have much better performance than Python, and not
> all interfaces are available in Python.
>
> That said, if you are planning to use DataFrames without any UDFs then the
> performance hit is practically nonexistent.
>
> Even if you need UDFs, it is possible to write those in Scala and wrap them
> for Python and still get away without the performance hit.
>
> Python does not have interfaces for UDAFs.
>
> I believe that if you have large structured data and do not generally need
> UDFs/UDAFs you can certainly work in Python without losing too much.
>
> *From:* ayan guha [mailto:[hidden email]]
> *Sent:* Thursday, September 01, 2016 5:03 AM
> *To:* user
> *Subject:* Scala Vs Python
>
> Hi Users
>
> Thought to ask (again and again) the question: While I am building any
> production application, should I use Scala or Python?
>
> I have read many if not most articles, but all seem to be pre-Spark 2. Has
> anything changed with Spark 2? Either in a pro-Scala or a pro-Python way?
>
> I am thinking of performance, feature parity and future direction, not so
> much in terms of skillset or ease of use.
>
> Or, if you think it is a moot point, please say so as well.
>
> Any real-life examples, production experience, anecdotes, personal taste,
> profanity, all are welcome :)
>
> --
> Best Regards,
> Ayan Guha
>
> ------------------------------
> View this message in context: RE: Scala Vs Python
> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
> --
> Best Regards,
> Ayan Guha
>
> ------------------------------
> View this message in context: RE: Scala Vs Python
> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27650.html>

--
Best Regards,
Ayan Guha
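
A toy illustration of the UDF-placement point made in the thread above (apply the expensive function to the 1-billion-row table, or to the 10K rows left after aggregation). This is only a sketch in plain Python, not Spark code and not anyone's benchmark from the thread. The names run and label and the row counts are made up for illustration. It only counts how many times the per-row function gets called. It does not measure the Python/JVM serialization cost that makes real Python UDFs slow, and the trick only applies when the function depends solely on the grouping key (or on the aggregated value).

```python
def run(rows, udf_first):
    """Sum values per labelled key; return (totals, number of 'UDF' calls)."""
    calls = 0

    def label(key):
        # Hypothetical stand-in for a Python UDF: a pure function of the key.
        nonlocal calls
        calls += 1
        return "group-" + str(key).upper()

    totals = {}
    if udf_first:
        # UDF on every input row, then aggregate on the UDF output.
        for key, value in rows:
            lab = label(key)
            totals[lab] = totals.get(lab, 0) + value
    else:
        # Aggregate on the raw key first, then UDF once per surviving row.
        by_key = {}
        for key, value in rows:
            by_key[key] = by_key.get(key, 0) + value
        for key, total in by_key.items():
            totals[label(key)] = total
    return totals, calls


# 100,000 input rows that collapse to 10 groups after aggregation.
rows = [("k%d" % (i % 10), 1) for i in range(100_000)]

early_result, early_calls = run(rows, udf_first=True)
late_result, late_calls = run(rows, udf_first=False)

print(early_result == late_result)  # True: same answer either way
print(early_calls, late_calls)      # 100000 10
```

Both orderings give the same totals, but the late placement calls the function once per group instead of once per input row. In Spark terms, that is the difference between paying the per-row UDF overhead on the full table and paying it only on the small aggregated one.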