Hi,

On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hence, what I mentioned initially does sound correct?
I don't agree at all - we've had a significant boost from moving from regular UDFs to pandas UDFs. YMMV, of course.

>
> On Mon, May 6, 2019 at 5:43 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:
>> >
>> > Thanks Gourav.
>> >
>> > Incidentally, since the regular UDF is row-wise, we could optimize that a bit by taking the convert() closure and simply making that the UDF.
>> >
>> > Since there's that MGRS object that we have to create too, we could probably optimize it further by applying the UDF via rdd.mapPartitions, which would allow the UDF to instantiate objects once per partition instead of per row and then iterate element-wise through the rows of the partition. (A sketch of this appears below, after the pandas UDF example.)
>> >
>> > All that said, having done the above on prior projects I find the pandas abstractions to be very elegant and friendly to the end user, so I haven't looked back :)
>> >
>> > (The common memory model via Arrow is a nice boost too!)
>>
>> And some tentative SPIPs that want to use columnar representations internally in Spark should also add some good performance in the future.
>>
>> Cheers
>> Andrew
>>
>> >
>> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>
>> >> The proof is in the pudding
>> >>
>> >> :)
>> >>
>> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>>
>> >>> Hi Patrick,
>> >>>
>> >>> super duper, thanks a ton for sharing the code. Can you please confirm that this runs faster than the regular UDFs?
>> >>>
>> >>> Interestingly I am also running the same transformations using another geospatial library in Python, where I am passing two fields and getting back an array.
>> >>>
>> >>> Regards,
>> >>> Gourav Sengupta
>> >>>
>> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>> >>>>
>> >>>> Human time is considerably more expensive than computer time, so in that regard, yes :)
>> >>>>
>> >>>> This took me one minute to write and ran fast enough for my needs. If you're willing to provide a comparable Scala implementation I'd be happy to compare them.
>> >>>>
>> >>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>> >>>> def generate_mgrs_series(lat_lon_str, level):
>> >>>>     import mgrs
>> >>>>     m = mgrs.MGRS()
>> >>>>
>> >>>>     precision_level = 0
>> >>>>     levelval = level[0]
>> >>>>     if levelval == 1000:
>> >>>>         precision_level = 2
>> >>>>     if levelval == 100:
>> >>>>         precision_level = 3
>> >>>>
>> >>>>     def convert(ll_str):
>> >>>>         lat, lon = ll_str.split('_')
>> >>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>> >>>>
>> >>>>     return lat_lon_str.apply(lambda x: convert(x))
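For illustration, a minimal sketch of how a scalar pandas UDF like the one above is typically applied. The column names ('lat_lon', 'level'), the sample rows, and the session setup are assumptions for the sketch, not taken from the thread; the F/T aliases are the usual pyspark imports that the decorator above relies on, and the mgrs package is assumed to be available on the executors (which is what the conda-environment discussion further down is about).

    # The usual aliases; they need to be in scope before the UDF above is defined.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: "<lat>_<lon>" strings plus the desired grid size.
    df = spark.createDataFrame(
        [("40.7128_-74.0060", 1000), ("34.0522_-118.2437", 100)],
        ["lat_lon", "level"],
    )

    # Each call receives pandas Series for both columns and returns a Series of strings.
    df.withColumn("mgrs", generate_mgrs_series(F.col("lat_lon"), F.col("level"))).show(truncate=False)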
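And a rough sketch of the rdd.mapPartitions variant Patrick describes above, in which the MGRS object is built once per partition rather than once per row. The function name, column names, and fixed precision are illustrative assumptions, not code from the thread.

    def convert_partition(rows, precision_level=2):
        # One MGRS instance per partition, then row-by-row conversion.
        import mgrs
        m = mgrs.MGRS()
        for row in rows:
            lat, lon = row["lat_lon"].split("_")
            yield (row["lat_lon"], m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level))

    # 'df' is the DataFrame from the previous sketch, with a 'lat_lon' column of "<lat>_<lon>" strings.
    converted = df.rdd.mapPartitions(convert_partition).toDF(["lat_lon", "mgrs"])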
>> >>>>
>> >>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>>>>
>> >>>>> And you found the PANDAS UDF more performant? Can you share your code and prove it?
>> >>>>>
>> >>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>> >>>>>>
>> >>>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala performance-wise, but for Python-based data scientists or others with a lot of Python expertise it allows one to do things that would otherwise be infeasible at scale.
>> >>>>>>
>> >>>>>> For instance, I recently had to convert latitude / longitude pairs to MGRS strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a pandas UDF (and putting the mgrs python package into a conda environment) was _significantly_ easier than any alternative I found.
>> >>>>>>
>> >>>>>> @Rishi - depending on how your network is constructed, some lag could come from just uploading the conda environment. If you load it from HDFS with --archives, does it improve?
>> >>>>>>
>> >>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> hi,
>> >>>>>>>
>> >>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use case of adding 1 to a field using a pandas UDF, which is pretty much pointless. So you go beyond the blog and realise that your actual use case is more than adding one :) and the reality hits you.
>> >>>>>>>
>> >>>>>>> Pandas UDF in certain scenarios is actually slow; try using apply with a custom or pandas function. In fact, in certain scenarios I have found general UDFs work much faster and use much less memory. Therefore test out your use case (with at least 30 million records) before trying to use the pandas UDF option.
>> >>>>>>>
>> >>>>>>> And when you start using GroupMap then you realise, after reading https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs, that "Oh!! now I can run into random OOM errors and the maxRecords option does not help at all".
>> >>>>>>>
>> >>>>>>> Excerpt from the above link:
>> >>>>>>> Note that all data for a group will be loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on groups and it is up to the user to ensure that the grouped data will fit into the available memory.
>> >>>>>>>
>> >>>>>>> Let me know about your use case if possible.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Gourav
>> >>>>>>>
>> >>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Thanks Patrick! I tried to package it according to these instructions; it got distributed on the cluster, however the same spark program that takes 5 mins without the pandas UDF has started to take 25 mins...
>> >>>>>>>>
>> >>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark 2.3 (according to the documentation, it should be fine)?
>> >>>>>>>>
>> >>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi Rishi,
>> >>>>>>>>>
>> >>>>>>>>> I've had success using the approach outlined here: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>> >>>>>>>>>
>> >>>>>>>>> Does this work for you?
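To make the GroupMap caveat Gourav quotes above concrete: a grouped-map pandas UDF hands each group to the function as a single in-memory pandas DataFrame, which is where skewed groups can blow up regardless of maxRecordsPerBatch. A minimal sketch along the lines of the subtract-mean example in the linked documentation; the toy DataFrame and column names are illustrative, and 'spark' is the session from the earlier sketch.

    import pyspark.sql.functions as F

    @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
    def subtract_mean(pdf):
        # pdf contains *every* row of one group as one pandas DataFrame,
        # so a heavily skewed group must fit in executor memory.
        return pdf.assign(v=pdf["v"] - pdf["v"].mean())

    df2 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
    df2.groupby("id").apply(subtract_mean).show()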
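On the packaging and shipping side discussed above and in the messages quoted below, a hedged sketch of one common route: pack an existing conda environment with conda-pack (an alternative to building a Cloudera parcel, not the same thing), upload the tarball to HDFS, and let YARN unpack it on the executors, roughly the in-code equivalent of passing --archives to spark-submit. The environment name, paths, and choice of configuration keys are assumptions for the sketch; depending on deploy mode, additional settings such as spark.yarn.appMasterEnv.PYSPARK_PYTHON may also be needed.

    # 1) Package an existing conda env (assumed to contain pyarrow, pandas, mgrs, ...)
    #    into a relocatable tarball. Requires the conda-pack package.
    import conda_pack
    conda_pack.pack(name="pyarrow_env", output="pyarrow_env.tar.gz", force=True)

    # 2) After uploading the tarball to HDFS, start the application so YARN ships
    #    and unpacks it on every executor ('#environment' is the unpack alias).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pandas-udf-with-conda-env")
        .config("spark.yarn.dist.archives",
                "hdfs:///user/me/envs/pyarrow_env.tar.gz#environment")
        # Point executor Python workers at the interpreter inside the archive.
        .config("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
        .getOrCreate()
    )

    # Optional on Spark 2.3+: Arrow-based toPandas()/createDataFrame() conversion.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")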
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Modified the subject & would like to clarify that I am looking to create an anaconda parcel with pyarrow and other libraries, so that I can distribute it on the cloudera cluster.
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi All,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have been trying to figure out a way to build an anaconda parcel with pyarrow included for distribution on my Cloudera-managed cluster, but this doesn't seem to work right. Could someone please help?
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have tried to install anaconda on one of the management nodes on the cloudera cluster and tarred the directory, but this directory doesn't include all the packages needed to form a proper parcel for distribution.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Any help is much appreciated!
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Regards,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Rishi Shah
>> >>>>>>>>>>
>> >>>>>>>>>> --
>> >>>>>>>>>> Regards,
>> >>>>>>>>>>
>> >>>>>>>>>> Rishi Shah
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Patrick McCarthy
>> >>>>>>>>> Senior Data Scientist, Machine Learning Engineering
>> >>>>>>>>> Dstillery
>> >>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> Regards,
>> >>>>>>>>
>> >>>>>>>> Rishi Shah
>> >>>>>>
>> >>>>>> --
>> >>>>>> Patrick McCarthy
>> >>>>>> Senior Data Scientist, Machine Learning Engineering
>> >>>>>> Dstillery
>> >>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>> >>>>
>> >>>> --
>> >>>> Patrick McCarthy
>> >>>> Senior Data Scientist, Machine Learning Engineering
>> >>>> Dstillery
>> >>>> 470 Park Ave South, 17th Floor, NYC 10016
>> >
>> > --
>> > Patrick McCarthy
>> > Senior Data Scientist, Machine Learning Engineering
>> > Dstillery
>> > 470 Park Ave South, 17th Floor, NYC 10016
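Finally, as a quick way to confirm that a distributed environment (parcel or archive) actually reaches the executors with pyarrow included before comparing runtimes, a small probe like this can help. It is a sketch, not from the thread; 'spark' is an assumed active SparkSession.

    def env_report(_):
        # Runs on the executors: report which interpreter and pyarrow version each one sees.
        import socket, sys
        import pyarrow
        yield (socket.gethostname(), sys.executable, pyarrow.__version__)

    # One probe task per partition.
    print(spark.sparkContext.parallelize(range(4), 4).mapPartitions(env_report).collect())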