Hi Luca, I see you pushed some code to the PR 3 hrs ago. That's awesome. If I can help out in any way, do let me know. I think that's an amazing feature, and it would be great if it could get into Spark.
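[Editor's note: downthread, Abdeali asks for lightweight numbers - how much time was spent inside the UDF and how many times it was called. Until SPARK-34265-style metrics land, one illustrative workaround is to wrap the UDF body in a small counting/timing decorator. This is a hedged sketch, not anything from the thread: all names are hypothetical, and on a cluster each executor process keeps its own counters, so cluster-wide totals would need Spark accumulators instead.]

```python
import time
from functools import wraps

def profiled(fn):
    """Record call count and cumulative wall time on the wrapped function.

    Caveat: under Spark this wrapper runs inside each executor process,
    so the counters are per-worker approximations; cluster-wide totals
    would need Spark accumulators.
    """
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.seconds += time.perf_counter() - start
    wrapper.calls = 0      # how many times the UDF body ran
    wrapper.seconds = 0.0  # total wall time spent inside it
    return wrapper

@profiled
def my_udf_logic(x):
    # stand-in for the real per-batch pandas logic
    return x * 2

for i in range(3):
    my_udf_logic(i)
print(my_udf_logic.calls)  # 3
```

The decorated function can then be wrapped as a pandas UDF as usual; the counters are inspected (or logged) from the worker side.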
On Fri, 26 Aug 2022, 12:41 Luca Canali, <luca.can...@cern.ch> wrote:

> @Abdeali as for "lightweight profiling", there is some work in progress on
> instrumenting Python UDFs with Spark metrics, see
> https://issues.apache.org/jira/browse/SPARK-34265
> However, it is a bit stuck at the moment and needs to be revived, I believe.
>
> Best,
> Luca
>
> *From:* Abdeali Kothari <abdealikoth...@gmail.com>
> *Sent:* Friday, August 26, 2022 06:36
> *To:* Subash Prabanantham <subashpraba...@gmail.com>
> *Cc:* Russell Jurney <russell.jur...@gmail.com>; Gourav Sengupta <gourav.sengu...@gmail.com>; Sean Owen <sro...@gmail.com>; Takuya UESHIN <ues...@happy-camper.st>; user <user@spark.apache.org>
> *Subject:* Re: Profiling PySpark Pandas UDF
>
> The Python profiler is pretty cool! I'll try it out to see what could be
> taking time within the UDF.
>
> I'm wondering if there is also some lightweight profiling (which does not
> slow down my processing) for me to get:
>
> - how much time the UDF took (i.e. how much time was spent inside the UDF)
> - how many times the UDF was called
>
> I can see the overall time a stage took in the Spark UI - it would be cool
> if I could find the time a UDF takes too.
>
> On Fri, 26 Aug 2022, 00:25 Subash Prabanantham, <subashpraba...@gmail.com> wrote:
>
> Wow, lots of good suggestions. I didn't know about the profiler either.
> Great suggestion @Takuya.
>
> Thanks,
> Subash
>
> On Thu, 25 Aug 2022 at 19:30, Russell Jurney <russell.jur...@gmail.com> wrote:
>
> YOU know what you're talking about and aren't hacking a solution. You are
> my new friend :) Thank you, this is incredibly helpful!
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com
> LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>
> datasyndrome.com
>
> On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st> wrote:
>
> Hi Subash,
>
> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
> - https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>
> Hope it can help you.
>
> Thanks.
>
> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>
> Subash, I'm here to help :)
>
> I started a test script to demonstrate a solution last night but got a
> cold and haven't finished it. Give me another day and I'll get it to you.
> My suggestion is that you run PySpark locally in pytest with a fixture to
> generate and yield your SparkContext and SparkSession, and then write tests
> that load some test data, perform some count operation and checkpoint to
> ensure that data is loaded, start a timer, run your UDF on the DataFrame,
> checkpoint again or write some output to disk to make sure it finishes,
> and then stop the timer and compute how long it takes. I'll show you some
> code; I have to do this for Graphlet AI's RTL utils and other tools to
> figure out how much overhead there is using Pandera and Spark together to
> validate data: https://github.com/Graphlet-AI/graphlet
>
> I'll respond by tomorrow evening with code in a fist! We'll see if it gets
> consistent, measurable and valid results! :)
>
> Russell Jurney
>
> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>
> It's important to realize that while pandas UDFs and pandas on Spark are
> both related to pandas, they are not themselves directly related. The first
> lets you use pandas within Spark; the second lets you use pandas on Spark.
> Hard to say with this info, but you want to look at whether you are doing
> something expensive in each UDF call and consider amortizing it with the
> scalar iterator UDF pattern. Maybe.
>
> A pandas UDF is not Spark code itself, so no, there is no tool in Spark to
> profile it. Conversely, any approach to profiling pandas or Python would
> work here.
>
> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> Maybe I am jumping to conclusions and making stupid guesses, but have you
> tried Koalas now that it is natively integrated with PySpark?
>
> Regards,
> Gourav
>
> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <subashpraba...@gmail.com> wrote:
>
> Hi All,
>
> I was wondering if we have any best practices on using pandas UDFs.
> Profiling a UDF is not an easy task, and our case requires some drilling
> down on the logic of the function.
>
> Our use case:
> We are using func(DataFrame) => DataFrame as the interface to the pandas
> UDF. When running only the function locally, it runs fast, but when
> executed in the Spark environment the processing time is more than
> expected. We have one column where the value is large (BinaryType -> 600KB);
> we wonder whether this could make the Arrow computation slower.
>
> Is there any profiling or best way to debug the cost incurred using pandas
> UDFs?
>
> Thanks,
> Subash
>
> --
> Takuya UESHIN
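[Editor's note: the "scalar iterator UDF pattern" Sean mentions amortizes per-call setup by doing it once per batch iterator instead of once per batch. A minimal sketch follows; the function is written as a plain generator so it runs without a cluster, and `expensive_model` is a hypothetical stand-in for costly initialization. In Spark you would decorate it with `pandas_udf` from `pyspark.sql.functions`.]

```python
# Scalar iterator pandas UDF pattern: setup runs once per task, not once
# per batch. Under Spark this would be declared as:
#   from pyspark.sql.functions import pandas_udf
#   @pandas_udf("double")
from typing import Iterator
import pandas as pd

def scaled(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    expensive_model = {"scale": 2.0}  # hypothetical costly initialization
    for batch in batches:             # each batch is one Arrow record batch
        yield batch * expensive_model["scale"]

# Local check: feed it two "batches" the way Spark would.
out = list(scaled(iter([pd.Series([1.0, 2.0]), pd.Series([3.0])])))
print([s.tolist() for s in out])  # [[2.0, 4.0], [6.0]]
```

If the real UDF loads a model, opens a connection, or parses a config on every call, moving that work above the `for` loop is often the single biggest win.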
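[Editor's note: the Spark 3.3 profiler Takuya links to is enabled with a single configuration flag, per the debugging page above. The sketch below is essentially a configuration fragment: it assumes a local `pyspark` installation and is not exercised here.]

```python
# Hedged sketch: enabling the Python/Pandas UDF profiler (Spark >= 3.3).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (SparkSession.builder
         .config("spark.python.profile", "true")  # turn the profiler on
         .getOrCreate())

@pandas_udf("long")
def add_one(s):
    return s + 1

spark.range(10).select(add_one("id")).collect()
spark.sparkContext.show_profiles()  # cProfile-style stats per UDF
```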