Re: use java in Grouped Map pandas udf to avoid serDe

Evgeniy Ignatiev Tue, 06 Oct 2020 08:54:24 -0700

Note: forwarding to the list; I incorrectly hit "Reply" first instead of "Reply List".


Hello,

Does your code run without enabling fallback mode? Arrow vectorization might simply not be applied, which is the case if you still observe "javaToPython" stages in the Spark UI. Also, is the data perhaps skewed (partitions are too large and data parallelism can't be fully utilised), or is the logic simply too heavy-weight, so that using a Pandas UDF doesn't improve performance much?
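
One way to run the fallback check described above is to keep Arrow enabled but switch fallback off, so that a path that cannot use Arrow should fail loudly instead of silently falling back. This is only a minimal sketch: the dictionary name, the helper name and the application name are made up for illustration, and the configuration keys are the Spark 3.x names quoted later in this thread.

```python
# Arrow settings from this thread, with fallback disabled for the test run.
# Assumption: Spark 3.x key names, as quoted in the original question.
ARROW_STRICT_CONF = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "false",
}


def build_strict_arrow_session(app_name="arrow-fallback-check"):
    """Return a SparkSession with Arrow enabled and fallback disabled.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # Imported lazily so the settings above can be inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in ARROW_STRICT_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

If the job still completes with these settings, the fallback path was probably not what kept the runtime high, and skew or heavy per-group logic becomes the more likely explanation.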


Best regards,
Evgenii Ignatev.

On 06.10.2020 19:44, Lian Jiang wrote:

Hi,

I used these settings but did not see an obvious improvement (190 minutes reduced to 170 minutes):


spark.sql.execution.arrow.pyspark.enabled: True
spark.sql.execution.arrow.pyspark.fallback.enabled: True

This job heavily uses pandas UDFs and runs on a 30-node xlarge EMR cluster. Any idea
why the perf improvement is small after enabling Arrow? Could anything else be
missing? Thanks.
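
Besides looking at the Spark UI, the physical plan text can be scanned for the Python operators it contains. The sketch below is an illustration rather than anything proposed in the thread: the helper name is made up, and the operator names (FlatMapGroupsInPandas and ArrowEvalPython for Arrow-based pandas UDFs, BatchEvalPython for plain pickled Python UDFs) are what Spark 3.x prints in its plans, so they may differ in other versions.

```python
# Operator names as printed in Spark 3.x physical plans (assumption: other
# Spark versions may use different names).
ARROW_OPERATORS = ("FlatMapGroupsInPandas", "ArrowEvalPython")
PICKLE_OPERATORS = ("BatchEvalPython",)


def python_operators_in_plan(plan_text):
    """Report which Python-related operators appear in a plan string.

    Hypothetical helper: it only does substring matching on the plan text.
    """
    return {
        "arrow": [name for name in ARROW_OPERATORS if name in plan_text],
        "pickle": [name for name in PICKLE_OPERATORS if name in plan_text],
    }
```

The plan text itself would have to come from the job, for example by capturing what df.explain() prints. A non-empty "pickle" list would point at row-at-a-time Python UDFs that the Arrow settings do not speed up.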

On Sun, Oct 4, 2020 at 10:36 AM Lian Jiang <jiangok2...@gmail.com> wrote:


    Please ignore this question.
    
    https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow
    shows pandas udf should have avoided jvm<->Python SerDe by
    maintaining one data copy in memory.
    spark.sql.execution.arrow.enabled is false by default. I think I
    missed enabling spark.sql.execution.arrow.enabled. Thanks. Regards.

    On Sun, Oct 4, 2020 at 10:22 AM Lian Jiang <jiangok2...@gmail.com> wrote:

        Hi,

        I am using pyspark Grouped Map pandas UDF
        (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
        Functionality-wise it works great. However, SerDe causes a lot
        of perf hits. To optimize this UDF, can I do either of the following:

        1. use a java UDF to completely replace the python Grouped Map
        pandas UDF.
        2. The Python Grouped Map pandas UDF calls a java function
        internally.

        Which way is more promising and how? Thanks for any pointers.

        Thanks
        Lian

