Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-11-01 Thread Holden Karau
On that note, there is some discussion on the JIRA -
https://issues.apache.org/jira/browse/SPARK-13534 :)



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-31 Thread Holden Karau
I believe Bryan is also working on this a little - and I'm a little busy
with the other stuff, but would love to stay in the loop on Arrow progress :)


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-31 Thread mariusvniekerk
So I've been working on some very early-stage Apache Arrow integration.
My current plan is to emulate some of how the R function execution works.
If there are any other people working on similar efforts, it would be a good
idea to combine them.

I can see how much effort is involved in converting that PR to a Spark
package so that people can try to use it. I think this is something we
want some more community iteration on, maybe?
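
For anyone curious, a minimal sketch of the core exchange this kind of Arrow
integration is aiming at - moving a whole batch of rows between pandas and
Arrow's columnar format instead of pickling row by row (purely illustrative,
assuming pyarrow and pandas are available; this is not the code in the PR):

    import pandas as pd
    import pyarrow as pa

    # A batch of rows as it might sit on the Python worker side.
    pdf = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.2, 0.5, 0.9]})

    # Columnar Arrow representation instead of pickled Row objects; this (or
    # a stream of such batches) is what would cross the JVM/Python boundary.
    batch = pa.RecordBatch.from_pandas(pdf)

    # On the receiving side it converts straight back to pandas.
    roundtripped = batch.to_pandas()
    print(roundtripped)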








Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-14 Thread mariusvniekerk
So, for the Jupyter integration pieces:

I've made a simple library (https://github.com/MaxPoint/spylon) which allows a
simpler way of creating a SparkContext (with all the parameters available to
spark-submit), as well as some usability enhancements: progress bars, tab
completion for Spark configuration properties, and easier loading of Scala
objects via py4j.
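
For comparison, the plain-PySpark baseline that spylon smooths over looks
roughly like the sketch below - setting spark-submit-style parameters
programmatically when building the context (values are illustrative; this is
not spylon's own API):

    from pyspark import SparkConf, SparkContext

    # The same knobs you would normally pass to spark-submit, set in code.
    conf = (
        SparkConf()
        .setMaster("yarn")
        .setAppName("notebook-session")
        .set("spark.executor.instances", "4")
        .set("spark.executor.memory", "4g")
    )

    sc = SparkContext(conf=conf)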






Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful,
and I'd also completely forgotten about the lack of built-in UDAF support,
which is rather important. There is a PR to make it easier to call/register
JVM UDFs from Python, which will hopefully help a bit there too. I'm getting
on a flight to London for OSCON, but I want to continue to encourage users to
chime in with their experiences (to that end I'm trying to re-include user@
since it doesn't seem to have been posted there despite my initial attempt
to do so).


RE: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread assaf.mendelson
Hi,
We are actually using PySpark heavily.
I agree with all of your points; for me, the following are the main
hurdles:

1. PySpark does not have support for UDAFs. We have had multiple needs for
UDAFs and had to go to Java/Scala to support these. Having Python UDAFs would
have made life much easier (especially at the earlier stages, when we prototype).

2. Performance. I cannot stress this enough. Currently we have engineers
who take Python UDFs and convert them to Scala UDFs for performance. We are
even looking at writing UDFs and UDAFs in a more native way (e.g., using
expressions) to improve performance, but working with PySpark can be
really problematic.

BTW, other than using Jython or Arrow, I believe there are a couple of other
ways to improve performance:

1. Python provides a tool for generating an AST of Python code
(https://docs.python.org/2/library/ast.html). This means we can use the AST to
construct Scala code, very similar to how expressions are built for native
Spark functions in Scala. Of course, doing a full conversion is very hard, but
at least handling simple cases should be simple (see the sketch after this list).

2. The above would of course be limited if we use Python packages, but over
time it is possible to add some "translation" tools (i.e., take Python
packages and find the appropriate Scala equivalents). We could even let users
supply their own conversions, so the code looks like regular Python but is
converted to Scala code behind the scenes.

3. In Scala, it is possible to use codegen to actually generate code from
a string. There is no reason why we can't write the expression in Python and
provide a Scala string. This would mean learning some Scala, but would mean we
do not have to create a separate code tree.
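
As a rough illustration of point 1 above (a sketch only, not code from any
existing PR): the ast module can parse a simple expression, and a small
visitor can emit an equivalent Scala-ish expression string for the easy cases:

    import ast

    # Translate a tiny subset of Python expressions (names, +, *) into a
    # Scala-style expression string. Illustrative only; a real translator
    # would cover far more of the language and fall back to a Python UDF.
    _OPS = {ast.Add: "+", ast.Mult: "*"}

    def to_scala_expr(node):
        if isinstance(node, ast.Expression):
            return to_scala_expr(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return "({} {} {})".format(
                to_scala_expr(node.left),
                _OPS[type(node.op)],
                to_scala_expr(node.right))
        if isinstance(node, ast.Name):
            return 'col("{}")'.format(node.id)
        if isinstance(node, ast.Num):
            return str(node.n)
        raise NotImplementedError(ast.dump(node))

    tree = ast.parse("x * 2 + y", mode="eval")
    print(to_scala_expr(tree))  # ((col("x") * 2) + col("y"))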

BTW, the fact that all of the tools to access Java are marked as private has me
a little worried. Nearly all of our UDFs (and all of our UDAFs) are written in
Scala for performance. The wrapping to provide them in Python uses way too many
private elements for my taste.
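
For reference, the kind of wrapping being described looks roughly like the
sketch below: a UDF implemented in Scala, registered into the SQL engine from
Python through PySpark's private py4j handles. The Scala class here is
hypothetical; _jvm and _jsparkSession are the private elements in question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scala-udfs-from-python").getOrCreate()

    # Hypothetical Scala side:
    #   object com.example.udfs.Registrar { def register(spark: SparkSession): Unit }
    # which calls spark.udf.register("parse_event", ...) on the JVM.
    spark.sparkContext._jvm.com.example.udfs.Registrar.register(spark._jsparkSession)

    # Once registered on the JVM, the UDF is usable from SQL in Python:
    spark.sql("SELECT parse_event(payload) FROM events").show()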



Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-12 Thread msukmanowsky
As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of
the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
Canada on PySpark: https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.

Being a Python shop, we were extremely pleased to learn about PySpark a few
years ago, as our main ETL pipeline used Apache Pig at the time. I was one of
the only folks who understood Pig and Java, so collaborating on this as a
team was difficult.

Spark provided a means for the entire team to collaborate, but we've hit our
fair share of issues, all of which are enumerated in this thread.

Besides giving a +1 here, I think if I were to force rank these items for
us, it'd be:

1. Configuration difficulties: we've lost literally weeks to troubleshooting
memory issues for larger jobs. It took a long time to even understand *why*
certain jobs were failing, since Spark would just report executors being
lost. Finally we tracked things down to understanding that
spark.yarn.executor.memoryOverhead controls the portion of memory reserved
for Python processes, but none of this is documented anywhere as far as I
can tell. We discovered this via trial and error. Both documentation and
better defaults for this setting when running a PySpark application are
probably sufficient (see the configuration sketch after this list). We've
also had a number of troubles with saving Parquet output as part of an ETL
flow, but perhaps we'll save that for a blog post of its own.

2. Dependency management: I've tried to help move the conversation on
https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a bit
stalled. Installing the required dependencies for a PySpark application is a
really messy ordeal right now.

3. Development workflow: I'd combine both "incomprehensible error messages"
and "difficulty using PySpark from outside of spark-submit / pyspark shell"
here. When teaching PySpark to new users, I'm reminded of how much inside
knowledge is needed to overcome esoteric errors. One example is hitting
"PicklingError: Could not pickle object as excessively deep recursion
required." errors. New users often do something innocent like trying to
pickle a global logging object, hit this, and begin the Google ->
stackoverflow search to try to comprehend what's going on. You can lose days
to errors like these, and they completely kill the productivity flow and send
you hunting for alternatives.

4. Speed/performance: we are trying to use DataFrames/Datasets where we can
and do as much in Java as possible, but when we do move to Python, we're well
aware that we're about to take a hit on performance. We're very keen to see
what Apache Arrow does for things here.

5. API difficulties: I agree that when coming from Python, you'd expect that
you can do the same kinds of operations on DataFrames in Spark that you can
with Pandas, but I personally haven't been too bothered by this. Maybe I'm
more used to this situation from using other frameworks that have similar
concepts but incompatible implementations.
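
To make point 1 above concrete, a minimal configuration sketch for explicitly
sizing the off-heap overhead that the Python workers live in on YARN (the
property name is the one mentioned above; the sizes are illustrative only):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Illustrative values: reserve 2 GB per executor outside the JVM heap,
    # which is where PySpark's Python worker processes run.
    conf = (
        SparkConf()
        .setAppName("etl-job")
        .set("spark.executor.memory", "8g")
        .set("spark.yarn.executor.memoryOverhead", "2048")  # MB on Spark 1.x/2.x
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()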

We're big fans of PySpark and are happy to provide feedback and contribute
wherever we can.


