Hi,

On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hence, what I mentioned initially does sound correct?
I don't agree at all - we've had a significant boost from moving from regular UDFs to pandas UDFs. YMMV, of course.

>
> On Mon, May 6, 2019 at 5:43 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:
>> >
>> > Thanks Gourav.
>> >
>> > Incidentally, since the regular UDF is row-wise, we could optimize that a bit by taking the convert() closure and simply making that the UDF.
>> >
>> > Since there's that MGRS object that we have to create too, we could probably optimize it further by applying the UDF via rdd.mapPartitions, which would allow the UDF to instantiate objects once per partition instead of per row and then iterate element-wise through the rows of the partition. (A sketch of this appears below, after the pandas UDF example.)
>> >
>> > All that said, having done the above on prior projects I find the pandas abstractions to be very elegant and friendly to the end user, so I haven't looked back :)
>> >
>> > (The common memory model via Arrow is a nice boost too!)
>>
>> And some tentative SPIPs that want to use columnar representations internally in Spark should also add some good performance in the future.
>>
>> Cheers
>> Andrew
>>
>> >
>> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>
>> >> The proof is in the pudding
>> >>
>> >> :)
>> >>
>> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>>
>> >>> Hi Patrick,
>> >>>
>> >>> super duper, thanks a ton for sharing the code. Can you please confirm that this runs faster than the regular UDFs?
>> >>>
>> >>> Interestingly I am also running the same transformations using another geospatial library in Python, where I am passing two fields and getting back an array.
>> >>>
>> >>> Regards,
>> >>> Gourav Sengupta
>> >>>
>> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>> >>>>
>> >>>> Human time is considerably more expensive than computer time, so in that regard, yes :)
>> >>>>
>> >>>> This took me one minute to write and ran fast enough for my needs. If you're willing to provide a comparable Scala implementation I'd be happy to compare them.
>> >>>>
>> >>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>> >>>> def generate_mgrs_series(lat_lon_str, level):
>> >>>>     import mgrs
>> >>>>     m = mgrs.MGRS()
>> >>>>
>> >>>>     precision_level = 0
>> >>>>     levelval = level[0]
>> >>>>     if levelval == 1000:
>> >>>>         precision_level = 2
>> >>>>     if levelval == 100:
>> >>>>         precision_level = 3
>> >>>>
>> >>>>     def convert(ll_str):
>> >>>>         lat, lon = ll_str.split('_')
>> >>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
>> >>>>
>> >>>>     return lat_lon_str.apply(lambda x: convert(x))
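For illustration, a minimal sketch of how a scalar pandas UDF like the one above is typically applied. The column names ('lat_lon', 'level'), the sample rows, and the session setup are assumptions for the sketch, not taken from the thread; the F/T aliases are the usual pyspark imports that the decorator above relies on, and the mgrs package is assumed to be available on the executors (which is what the conda-environment discussion further down is about).

    # The usual aliases; they need to be in scope before the UDF above is defined.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: "<lat>_<lon>" strings plus the desired grid size.
    df = spark.createDataFrame(
        [("40.7128_-74.0060", 1000), ("34.0522_-118.2437", 100)],
        ["lat_lon", "level"],
    )

    # Each call receives pandas Series for both columns and returns a Series of strings.
    df.withColumn("mgrs", generate_mgrs_series(F.col("lat_lon"), F.col("level"))).show(truncate=False)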
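And a rough sketch of the rdd.mapPartitions variant Patrick describes above, in which the MGRS object is built once per partition rather than once per row. The function name, column names, and fixed precision are illustrative assumptions, not code from the thread.

    def convert_partition(rows, precision_level=2):
        # One MGRS instance per partition, then row-by-row conversion.
        import mgrs
        m = mgrs.MGRS()
        for row in rows:
            lat, lon = row["lat_lon"].split("_")
            yield (row["lat_lon"], m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level))

    # 'df' is the DataFrame from the previous sketch, with a 'lat_lon' column of "<lat>_<lon>" strings.
    converted = df.rdd.mapPartitions(convert_partition).toDF(["lat_lon", "mgrs"])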
>> >>>>
>> >>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>>>>
>> >>>>> And you found the PANDAS UDF more performant? Can you share your code and prove it?
>> >>>>>
>> >>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>> >>>>>>
>> >>>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala performance-wise, but for Python-based data scientists or others with a lot of Python expertise it allows one to do things that would otherwise be infeasible at scale.
>> >>>>>>
>> >>>>>> For instance, I recently had to convert latitude / longitude pairs to MGRS strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a pandas UDF (and putting the mgrs python package into a conda environment) was _significantly_ easier than any alternative I found.
>> >>>>>>
>> >>>>>> @Rishi - depending on how your network is constructed, some lag could come from just uploading the conda environment. If you load it from HDFS with --archives, does it improve?
>> >>>>>>
>> >>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> hi,
>> >>>>>>>
>> >>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use case of adding 1 to a field using a pandas UDF, which is pretty much pointless. So you go beyond the blog and realise that your actual use case is more than adding one :) and the reality hits you.
>> >>>>>>>
>> >>>>>>> Pandas UDF in certain scenarios is actually slow; try using apply with a custom or pandas function. In fact, in certain scenarios I have found general UDFs work much faster and use much less memory. Therefore test out your use case (with at least 30 million records) before trying to use the pandas UDF option.
>> >>>>>>>
>> >>>>>>> And when you start using GroupMap then you realise, after reading https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs, that "Oh!! now I can run into random OOM errors and the maxRecords option does not help at all".
>> >>>>>>>
>> >>>>>>> Excerpt from the above link:
>> >>>>>>> Note that all data for a group will be loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on groups and it is up to the user to ensure that the grouped data will fit into the available memory.
>> >>>>>>>
>> >>>>>>> Let me know about your use case if possible.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Gourav
>> >>>>>>>
>> >>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Thanks Patrick! I tried to package it according to these instructions; it got distributed on the cluster, however the same spark program that takes 5 mins without the pandas UDF has started to take 25 mins...
>> >>>>>>>>
>> >>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark 2.3 (according to the documentation, it should be fine)?
>> >>>>>>>>
>> >>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi Rishi,
>> >>>>>>>>>
>> >>>>>>>>> I've had success using the approach outlined here: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>> >>>>>>>>>
>> >>>>>>>>> Does this work for you?
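To make the GroupMap caveat Gourav quotes above concrete: a grouped-map pandas UDF hands each group to the function as a single in-memory pandas DataFrame, which is where skewed groups can blow up regardless of maxRecordsPerBatch. A minimal sketch along the lines of the subtract-mean example in the linked documentation; the toy DataFrame and column names are illustrative, and 'spark' is the session from the earlier sketch.

    import pyspark.sql.functions as F

    @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
    def subtract_mean(pdf):
        # pdf contains *every* row of one group as one pandas DataFrame,
        # so a heavily skewed group must fit in executor memory.
        return pdf.assign(v=pdf["v"] - pdf["v"].mean())

    df2 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
    df2.groupby("id").apply(subtract_mean).show()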
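On the packaging and shipping side discussed above and in the messages quoted below, a hedged sketch of one common route: pack an existing conda environment with conda-pack (an alternative to building a Cloudera parcel, not the same thing), upload the tarball to HDFS, and let YARN unpack it on the executors, roughly the in-code equivalent of passing --archives to spark-submit. The environment name, paths, and choice of configuration keys are assumptions for the sketch; depending on deploy mode, additional settings such as spark.yarn.appMasterEnv.PYSPARK_PYTHON may also be needed.

    # 1) Package an existing conda env (assumed to contain pyarrow, pandas, mgrs, ...)
    #    into a relocatable tarball. Requires the conda-pack package.
    import conda_pack
    conda_pack.pack(name="pyarrow_env", output="pyarrow_env.tar.gz", force=True)

    # 2) After uploading the tarball to HDFS, start the application so YARN ships
    #    and unpacks it on every executor ('#environment' is the unpack alias).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pandas-udf-with-conda-env")
        .config("spark.yarn.dist.archives",
                "hdfs:///user/me/envs/pyarrow_env.tar.gz#environment")
        # Point executor Python workers at the interpreter inside the archive.
        .config("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
        .getOrCreate()
    )

    # Optional on Spark 2.3+: Arrow-based toPandas()/createDataFrame() conversion.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")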
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Modified the subject & would like to clarify that I am looking to create an anaconda parcel with pyarrow and other libraries, so that I can distribute it on the cloudera cluster.
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi All,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have been trying to figure out a way to build an anaconda parcel with pyarrow included for distribution on my Cloudera-managed cluster, but this doesn't seem to work right. Could someone please help?
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have tried to install anaconda on one of the management nodes on the cloudera cluster and tarred the directory, but this directory doesn't include all the packages needed to form a proper parcel for distribution.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Any help is much appreciated!
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Regards,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Rishi Shah
>> >>>>>>>>>>
>> >>>>>>>>>> --
>> >>>>>>>>>> Regards,
>> >>>>>>>>>>
>> >>>>>>>>>> Rishi Shah
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Patrick McCarthy
>> >>>>>>>>> Senior Data Scientist, Machine Learning Engineering
>> >>>>>>>>> Dstillery
>> >>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> Regards,
>> >>>>>>>>
>> >>>>>>>> Rishi Shah
>> >>>>>>
>> >>>>>> --
>> >>>>>> Patrick McCarthy
>> >>>>>> Senior Data Scientist, Machine Learning Engineering
>> >>>>>> Dstillery
>> >>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>> >>>>
>> >>>> --
>> >>>> Patrick McCarthy
>> >>>> Senior Data Scientist, Machine Learning Engineering
>> >>>> Dstillery
>> >>>> 470 Park Ave South, 17th Floor, NYC 10016
>> >
>> > --
>> > Patrick McCarthy
>> > Senior Data Scientist, Machine Learning Engineering
>> > Dstillery
>> > 470 Park Ave South, 17th Floor, NYC 10016
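Finally, as a quick way to confirm that a distributed environment (parcel or archive) actually reaches the executors with pyarrow included before comparing runtimes, a small probe like this can help. It is a sketch, not from the thread; 'spark' is an assumed active SparkSession.

    def env_report(_):
        # Runs on the executors: report which interpreter and pyarrow version each one sees.
        import socket, sys
        import pyarrow
        yield (socket.gethostname(), sys.executable, pyarrow.__version__)

    # One probe task per partition.
    print(spark.sparkContext.parallelize(range(4), 4).mapPartitions(env_report).collect())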