Awesome, good points everyone. The ranking of the issues is super useful, and I'd also completely forgotten about the lack of built-in UDAF support, which is rather important. There is a PR to make it easier to call/register JVM UDFs from Python, which will hopefully help a bit there too. I'm getting on a flight to London for OSCON, but I want to continue to encourage users to chime in with their experiences (to that end, I'm trying to re-include user@ since this doesn't seem to have been posted there despite my initial attempt to do so).
On Thursday, October 13, 2016, assaf.mendelson <assaf.mendel...@rsa.com> wrote:

> Hi,
>
> We are actually using pyspark heavily.
>
> I agree with all of your points; for me, I see the following as the main hurdles:
>
> 1. Pyspark does not have support for UDAFs. We have had multiple needs for UDAFs and needed to go to java/scala to support these. Having python UDAFs would have made life much easier (especially at earlier stages, when we prototype).
>
> 2. Performance. I cannot stress this enough. Currently we have engineers who take python UDFs and convert them to scala UDFs for performance. We are currently even looking at writing UDFs and UDAFs in a more native way (e.g. using expressions) to improve performance, but working with pyspark can be really problematic.
>
> BTW, other than using jython or arrow, I believe there are a couple of other ways to improve performance:
>
> 1. Python provides tools to generate an AST for python code (https://docs.python.org/2/library/ast.html). This means we can use the AST to construct scala code, very similar to how expressions are built for native spark functions in scala. Of course, doing a full conversion is very hard, but at least handling simple cases should be simple (a sketch follows right after this message).
>
> 2. The above would of course be limited if we use python packages, but over time it is possible to add some "translation" tools (i.e. take python packages and find the appropriate scala equivalents; we could even let users supply their own conversions, so the code looks like regular python but is converted to scala code behind the scenes).
>
> 3. In scala, it is possible to use codegen to actually generate code from a string. There is no reason why we can't write the expression in python and provide a scala string. This would mean learning some scala, but would mean we do not have to create a separate code tree.
>
> BTW, the fact that all of the tools to access java are marked as private has me a little worried. Nearly all of our UDFs (and all of our UDAFs) are written in scala for performance. The wrapping to provide them in python uses way too many private elements for my taste (the second sketch below shows the kind of plumbing involved).
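To make the AST idea in Assaf's first point concrete, here is a minimal sketch, assuming only arithmetic over named arguments; the to_scala helper is hypothetical and not an existing Spark API:

    import ast

    # Map Python AST operator nodes to their Scala spellings.
    SCALA_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def to_scala(node):
        """Translate a tiny subset of Python expressions to a Scala string."""
        if isinstance(node, ast.BinOp):
            op = SCALA_OPS[type(node.op)]
            return "(%s %s %s)" % (to_scala(node.left), op, to_scala(node.right))
        if isinstance(node, ast.Name):   # an argument/column reference
            return node.id
        if isinstance(node, ast.Num):    # a numeric literal
            return repr(node.n)
        raise NotImplementedError("cannot translate %s" % type(node).__name__)

    # "x * 2 + 1" becomes "((x * 2) + 1)", a string that could then be fed
    # to Scala-side codegen, as Assaf's third point suggests.
    print(to_scala(ast.parse("x * 2 + 1", mode="eval").body))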
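And on the private-element point, the wrapping in question tends to look something like the sketch below; com.example.MyUDAF is a made-up Scala class extending UserDefinedAggregateFunction that is assumed to be on the classpath, and _jvm and _jsparkSession are precisely the private internals being relied on:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udaf-wrap").getOrCreate()

    # Instantiate the Scala UDAF through the Py4J gateway (private API).
    udaf = spark.sparkContext._jvm.com.example.MyUDAF()

    # Register it with the JVM-side UDFRegistration so SQL can call it
    # (again via a private handle).
    spark._jsparkSession.udf().register("my_udaf", udaf)

    df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
    df.createOrReplaceTempView("t")
    spark.sql("SELECT my_udaf(value) FROM t").show()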
> From: msukmanowsky [via Apache Spark Developers List]
> Sent: Thursday, October 13, 2016 3:51 AM
> To: Mendelson, Assaf
> Subject: Re: Python Spark Improvements (forked from Spark Improvement Proposals)
>
> As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon Canada on PySpark: https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.
>
> Being a Python shop, we were extremely pleased to learn about PySpark a few years ago, as our main ETL pipeline used Apache Pig at the time. I was one of the only folks who understood Pig and Java, so collaborating on this as a team was difficult.
>
> Spark provided a means for the entire team to collaborate, but we've hit our fair share of issues, all of which are enumerated in this thread.
>
> Besides giving a +1 here, if I were to force-rank these items for us, it'd be:
>
> 1. Configuration difficulties: we've lost literally weeks to troubleshooting memory issues for larger jobs. It took a long time to even understand *why* certain jobs were failing, since Spark would just report executors being lost. Finally, we tracked things down to understanding that spark.yarn.executor.memoryOverhead controls the portion of memory reserved for Python processes, but none of this is documented anywhere as far as I can tell; we discovered it via trial and error (see the first sketch after this message). Both documentation and better defaults for this setting when running a PySpark application are probably sufficient. We've also had a number of troubles with saving Parquet output as part of an ETL flow, but perhaps we'll save that for a blog post of its own.
>
> 2. Dependency management: I've tried to help move the conversation on https://issues.apache.org/jira/browse/SPARK-13587, but it seems we're a bit stalled. Installing the required dependencies for a PySpark application is a really messy ordeal right now (a workaround sketch follows below).
>
> 3. Development workflow: I'd combine both "incomprehensible error messages" and "difficulty using PySpark from outside of spark-submit / pyspark shell" here. When teaching PySpark to new users, I'm reminded of how much inside knowledge is needed to overcome esoteric errors. One example is hitting "PicklingError: Could not pickle object as excessively deep recursion required." errors. New users often do something innocent like try to pickle a global logging object, hit this, and begin the Google -> Stack Overflow search to try to comprehend what's going on (the last sketch below shows the usual failure and fix). You can lose days to errors like these; they completely kill the productivity flow and send you hunting for alternatives.
>
> 4. Speed/performance: we are trying to use DataFrames/Datasets where we can and do as much in Java as possible, but when we do move to Python, we're well aware that we're about to take a hit on performance. We're very keen to see what Apache Arrow does for things here.
>
> 5. API difficulties: I agree that, coming from Python, you'd expect to be able to do the same kinds of operations on DataFrames in Spark that you can with pandas, but I personally haven't been too bothered by this. Maybe I'm more used to this situation from using other frameworks that have similar concepts but incompatible implementations.
>
> We're big fans of PySpark and are happy to provide feedback and contribute wherever we can.
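On Mike's first point, for anyone hitting the same wall: a minimal sketch of setting the overhead explicitly. The value is in megabytes, and 2048 below is just a placeholder; the right number depends on what your Python workers actually allocate.

    from pyspark import SparkConf, SparkContext

    # Reserve extra off-heap room in each YARN container for the Python
    # worker processes; when this is too small, YARN kills the container
    # and Spark only reports the executor as lost.
    conf = (SparkConf()
            .setAppName("etl-job")  # app name is illustrative
            .set("spark.yarn.executor.memoryOverhead", "2048"))  # MB

    sc = SparkContext(conf=conf)

The same setting can be passed at submit time with spark-submit --conf spark.yarn.executor.memoryOverhead=2048.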
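On the dependency-management point, the usual stopgap today is to ship a zip of pure-Python dependencies alongside the job; a sketch follows (the path and module name are made up, and packages with C extensions still have to be pre-installed on every node):

    from pyspark import SparkContext

    sc = SparkContext(appName="deps-demo")

    # Distribute a zip of pure-Python packages to the driver and every
    # executor; its contents become importable via sys.path.
    sc.addPyFile("hdfs:///jobs/deps/mydeps.zip")

    import mymodule  # hypothetical package inside mydeps.zip

Passing --py-files mydeps.zip to spark-submit achieves the same thing at submit time.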
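And on the development-workflow point, the logging example Mike mentions usually reduces to the sketch below: the fix is to build the logger on the executor rather than capturing a module-level one in the closure.

    import logging

    logger = logging.getLogger("etl")  # module-level, captured by closures

    def bad_mapper(x):
        # Referencing the global logger pulls it into the pickled closure,
        # which is where the opaque deep-recursion PicklingError tends to
        # surface.
        logger.info("saw %r", x)
        return x

    def good_mapper(x):
        # Recreate the logger inside the function; nothing extra is pickled.
        log = logging.getLogger("etl")
        log.info("saw %r", x)
        return x

    # rdd.map(bad_mapper)   # fails with the PicklingError described above
    # rdd.map(good_mapper)  # serializes cleanly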
--
Cell: 425-233-8271
Twitter: https://twitter.com/holdenkarau