Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Evgeniy Ignatiev
Note: forwarding to the list; I incorrectly hit "Reply" first instead of 
"Reply List"


Hello,

Does your code run without enabling fallback mode? Arrow vectorization 
might not actually be applied; check whether you still observe 
"javaToPython" stages in the Spark UI. Also, is the data skewed 
(partitions too large, so data parallelism can't be fully utilized), or 
is the logic simply too heavy-weight, so that using a Pandas UDF doesn't 
improve performance much?
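One way to make a silent fallback visible is to disable it, so Spark raises an error when Arrow cannot be applied instead of quietly reverting to row-at-a-time pickle SerDe. A minimal sketch, assuming Spark 3.x config key names; applying the settings to a real SparkSession is shown in comments:

```python
# Spark 3.x Arrow settings for PySpark. With fallback disabled, a schema
# Arrow cannot handle raises an error instead of silently falling back
# to pickle-based SerDe.
arrow_conf = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "false",
}

# Applying the settings (requires a Spark installation):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("arrow-check")
# for key, value in arrow_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

With fallback off, a job that was silently not using Arrow fails fast, which settles the question of whether vectorization is actually being applied.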


Best regards,
Evgenii Ignatev.

On 06.10.2020 19:44, Lian Jiang wrote:

Hi,

I used these settings but did not see an obvious improvement (190 minutes 
reduced to 170 minutes):


spark.sql.execution.arrow.pyspark.enabled: True
spark.sql.execution.arrow.pyspark.fallback.enabled: True

This job heavily uses pandas UDFs and runs on a 30-node xlarge EMR cluster. 
Any idea why the performance improvement is small after enabling Arrow? 
Could anything else be missing? Thanks.

On Sun, Oct 4, 2020 at 10:36 AM Lian Jiang wrote:


Please ignore this question.

https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow


shows that a pandas UDF should avoid JVM<->Python SerDe by
maintaining a single copy of the data in memory.
spark.sql.execution.arrow.enabled is false by default. I think I
missed enabling it. Thanks. Regards.
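A note on the key name: spark.sql.execution.arrow.enabled is the Spark 2.x setting; since Spark 3.0 it is deprecated in favor of spark.sql.execution.arrow.pyspark.enabled (the old key still works but maps to the new one). A small sketch of the renames:

```python
# Arrow-related SQL conf keys: Spark 2.x name -> Spark 3.x replacement.
# The old names still work in 3.x but are deprecated.
renamed_arrow_confs = {
    "spark.sql.execution.arrow.enabled":
        "spark.sql.execution.arrow.pyspark.enabled",
    "spark.sql.execution.arrow.fallback.enabled":
        "spark.sql.execution.arrow.pyspark.fallback.enabled",
}
```

This is why the later messages in this thread use the longer `arrow.pyspark.*` keys.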

On Sun, Oct 4, 2020 at 10:22 AM Lian Jiang <jiangok2...@gmail.com> wrote:

Hi,

I am using a PySpark Grouped Map pandas UDF
(https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
Functionality-wise it works great. However, SerDe causes a lot of
performance hits. To optimize this UDF, can I do either of the below:

1. Use a Java UDF to completely replace the Python Grouped Map
pandas UDF.
2. Have the Python Grouped Map pandas UDF call a Java function
internally.

Which way is more promising and how? Thanks for any pointers.
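On the two options: the JVM gateway (spark._jvm) that PySpark uses to reach Java is only available on the driver, so a pandas UDF body running on executors generally cannot call into Java, which rules out option 2. Option 1 means moving the whole grouped computation to the JVM, e.g. a Scala/Java Dataset.groupByKey(...).flatMapGroups(...). A stdlib-only sketch of the grouped-map contract itself, with the hypothetical JVM calls confined to comments:

```python
# The grouped-map contract: rows are partitioned by key, and a function
# receives each group's full data and returns result rows. This per-group
# logic is what would move to the JVM under option 1.

def summarize_group(rows):
    """Per-group computation: here, the mean of 'value' for one group."""
    values = [row["value"] for row in rows]
    return {"key": rows[0]["key"], "mean": sum(values) / len(values)}

# Option 2, roughly (hypothetical class name; driver-side only):
#   jvm = spark.sparkContext._jvm
#   result = jvm.com.example.GroupOps.summarize(...)
# Note: _jvm is a driver-side py4j gateway and is not reachable from
# inside a UDF executing on the workers, so option 2 generally cannot work.

groups = [
    [{"key": "a", "value": 1.0}, {"key": "a", "value": 3.0}],
    [{"key": "b", "value": 2.0}],
]
summaries = [summarize_group(group) for group in groups]
# summaries -> [{"key": "a", "mean": 2.0}, {"key": "b", "mean": 2.0}]
```

This sketch is illustrative only; the actual choice comes down to whether the per-group logic is worth rewriting in Scala/Java.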

Thanks
Lian







