Hi Luca, I see you pushed some code to the PR 3 hrs ago. That's awesome. If I can help out in any way, do let me know. I think that's an amazing feature, and it would be great if it could get into Spark.
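[Editor's note: downthread, Abdeali asks for lightweight numbers - how much time was spent inside the UDF and how many times it was called. Until SPARK-34265-style metrics land, one illustrative workaround is to wrap the UDF body in a small counting/timing decorator. This is a hedged sketch, not anything from the thread: all names are hypothetical, and on a cluster each executor process keeps its own counters, so cluster-wide totals would need Spark accumulators instead.]

```python
import time
from functools import wraps

def profiled(fn):
    """Record call count and cumulative wall time on the wrapped function.

    Caveat: under Spark this wrapper runs inside each executor process,
    so the counters are per-worker approximations; cluster-wide totals
    would need Spark accumulators.
    """
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.seconds += time.perf_counter() - start
    wrapper.calls = 0      # how many times the UDF body ran
    wrapper.seconds = 0.0  # total wall time spent inside it
    return wrapper

@profiled
def my_udf_logic(x):
    # stand-in for the real per-batch pandas logic
    return x * 2

for i in range(3):
    my_udf_logic(i)
print(my_udf_logic.calls)  # 3
```

The decorated function can then be wrapped as a pandas UDF as usual; the counters are inspected (or logged) from the worker side.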
On Fri, 26 Aug 2022, 12:41 Luca Canali, <luca.can...@cern.ch> wrote:

> @Abdeali as for "lightweight profiling", there is some work in progress on
> instrumenting Python UDFs with Spark metrics, see
> https://issues.apache.org/jira/browse/SPARK-34265
> However, it is a bit stuck at the moment and needs to be revived, I believe.
>
> Best,
> Luca
>
> *From:* Abdeali Kothari <abdealikoth...@gmail.com>
> *Sent:* Friday, August 26, 2022 06:36
> *To:* Subash Prabanantham <subashpraba...@gmail.com>
> *Cc:* Russell Jurney <russell.jur...@gmail.com>; Gourav Sengupta <gourav.sengu...@gmail.com>; Sean Owen <sro...@gmail.com>; Takuya UESHIN <ues...@happy-camper.st>; user <user@spark.apache.org>
> *Subject:* Re: Profiling PySpark Pandas UDF
>
> The Python profiler is pretty cool! I'll try it out to see what could be
> taking time within the UDF.
>
> I'm wondering if there is also some lightweight profiling (which does not
> slow down my processing) for me to get:
>
> - how much time the UDF took (i.e. how much time was spent inside the UDF)
> - how many times the UDF was called
>
> I can see the overall time a stage took in the Spark UI - it would be cool
> if I could find the time a UDF takes too.
>
> On Fri, 26 Aug 2022, 00:25 Subash Prabanantham, <subashpraba...@gmail.com> wrote:
>
> Wow, lots of good suggestions. I didn't know about the profiler either.
> Great suggestion @Takuya.
>
> Thanks,
> Subash
>
> On Thu, 25 Aug 2022 at 19:30, Russell Jurney <russell.jur...@gmail.com> wrote:
>
> YOU know what you're talking about and aren't hacking a solution. You are
> my new friend :) Thank you, this is incredibly helpful!
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com
> LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>
> datasyndrome.com
>
> On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st> wrote:
>
> Hi Subash,
>
> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
> - https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>
> Hope it can help you.
>
> Thanks.
>
> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>
> Subash, I'm here to help :)
>
> I started a test script to demonstrate a solution last night but got a
> cold and haven't finished it. Give me another day and I'll get it to you.
> My suggestion is that you run PySpark locally in pytest with a fixture to
> generate and yield your SparkContext and SparkSession, and then write tests
> that load some test data, perform some count operation and checkpoint to
> ensure that data is loaded, start a timer, run your UDF on the DataFrame,
> checkpoint again or write some output to disk to make sure it finishes,
> and then stop the timer and compute how long it takes. I'll show you some
> code; I have to do this for Graphlet AI's RTL utils and other tools to
> figure out how much overhead there is using Pandera and Spark together to
> validate data: https://github.com/Graphlet-AI/graphlet
>
> I'll respond by tomorrow evening with code in a fist! We'll see if it gets
> consistent, measurable and valid results! :)
>
> Russell Jurney
>
> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>
> It's important to realize that while pandas UDFs and pandas on Spark are
> both related to pandas, they are not themselves directly related. The first
> lets you use pandas within Spark; the second lets you use pandas on Spark.
> Hard to say with this info, but you want to look at whether you are doing
> something expensive in each UDF call and consider amortizing it with the
> scalar iterator UDF pattern. Maybe.
>
> A pandas UDF is not Spark code itself, so no, there is no tool in Spark to
> profile it. Conversely, any approach to profiling pandas or Python would
> work here.
>
> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> Maybe I am jumping to conclusions and making stupid guesses, but have you
> tried Koalas now that it is natively integrated with PySpark?
>
> Regards,
> Gourav
>
> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <subashpraba...@gmail.com> wrote:
>
> Hi All,
>
> I was wondering if we have any best practices on using pandas UDFs.
> Profiling a UDF is not an easy task, and our case requires some drilling
> down on the logic of the function.
>
> Our use case:
> We are using func(DataFrame) => DataFrame as the interface to the pandas
> UDF. When running only the function locally, it runs fast, but when
> executed in the Spark environment the processing time is more than
> expected. We have one column where the value is large (BinaryType -> 600KB);
> we wonder whether this could make the Arrow computation slower.
>
> Is there any profiling or best way to debug the cost incurred using pandas
> UDFs?
>
> Thanks,
> Subash
>
> --
> Takuya UESHIN
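[Editor's note: the "scalar iterator UDF pattern" Sean mentions amortizes per-call setup by doing it once per batch iterator instead of once per batch. A minimal sketch follows; the function is written as a plain generator so it runs without a cluster, and `expensive_model` is a hypothetical stand-in for costly initialization. In Spark you would decorate it with `pandas_udf` from `pyspark.sql.functions`.]

```python
# Scalar iterator pandas UDF pattern: setup runs once per task, not once
# per batch. Under Spark this would be declared as:
#   from pyspark.sql.functions import pandas_udf
#   @pandas_udf("double")
from typing import Iterator
import pandas as pd

def scaled(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    expensive_model = {"scale": 2.0}  # hypothetical costly initialization
    for batch in batches:             # each batch is one Arrow record batch
        yield batch * expensive_model["scale"]

# Local check: feed it two "batches" the way Spark would.
out = list(scaled(iter([pd.Series([1.0, 2.0]), pd.Series([3.0])])))
print([s.tolist() for s in out])  # [[2.0, 4.0], [6.0]]
```

If the real UDF loads a model, opens a connection, or parses a config on every call, moving that work above the `for` loop is often the single biggest win.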
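[Editor's note: the Spark 3.3 profiler Takuya links to is enabled with a single configuration flag, per the debugging page above. The sketch below is essentially a configuration fragment: it assumes a local `pyspark` installation and is not exercised here.]

```python
# Hedged sketch: enabling the Python/Pandas UDF profiler (Spark >= 3.3).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (SparkSession.builder
         .config("spark.python.profile", "true")  # turn the profiler on
         .getOrCreate())

@pandas_udf("long")
def add_one(s):
    return s + 1

spark.range(10).select(add_one("id")).collect()
spark.sparkContext.show_profiles()  # cProfile-style stats per UDF
```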