Thanks Patrick! I packaged it according to these instructions and it
got distributed on the cluster. However, the same Spark program that takes
5 minutes without the pandas UDF now takes 25 minutes...
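
For reference, here is a rough sketch of the submit options I understand
that approach to need: the zipped conda env is shipped with --archives and
PYSPARK_PYTHON points at the interpreter inside the unpacked alias. The
archive name (pyarrow_env.zip), the alias (PYENV), the script name (job.py)
and the helper function are placeholders of mine, not values from the
article or from my cluster.

```python
def conda_env_submit_args(archive, alias, python_rel):
    """Build spark-submit options that ship a zipped conda env on YARN.

    archive    -- zip of the conda env (hypothetical name in the example)
    alias      -- directory name the archive is unpacked under on each node
    python_rel -- path to the interpreter inside the unpacked archive
    """
    interpreter = "./{}/{}".format(alias, python_rel)
    return [
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # "archive#alias" makes YARN unpack the zip as ./alias in each container
        "--archives", "{}#{}".format(archive, alias),
        # interpreter for the driver (application master in cluster mode)
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + interpreter,
        # interpreter for the Python workers on the executors
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=" + interpreter,
    ]


# Placeholder names; adjust to however the env was actually zipped.
args = conda_env_submit_args("pyarrow_env.zip", "PYENV", "pyarrow_env/bin/python")
print("spark-submit " + " ".join(args) + " job.py")
```

If the executors end up on a different interpreter than the one in the
archive, the pandas UDF would run against whatever pandas/pyarrow that
interpreter has, so this is the first thing I am double-checking.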

Have you experienced anything like this? Also, is PyArrow 0.12 supported
with Spark 2.3? (According to the documentation, it should be fine.)
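
To rule out a version mismatch, something like the sketch below can be run
with the packaged interpreter. It only inspects the local Python, so it
would have to be run on a worker node (or inside a task) to say anything
about the executors. The 0.8.0 minimum is my reading of the Spark 2.3
documentation and should be treated as an assumption; the helper names are
made up for this example.

```python
import importlib
from itertools import takewhile

# Assumed minimum PyArrow version for Spark 2.3 (my reading of the docs).
ASSUMED_MIN_PYARROW = "0.8.0"


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def version_tuple(version):
    """Turn '0.12.1' into (0, 12, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '0.12' to (0, 12, 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


for name in ("pandas", "pyarrow"):
    print(name, installed_version(name))

found = installed_version("pyarrow")
if found is not None:
    print("pyarrow meets assumed minimum:", meets_minimum(found, ASSUMED_MIN_PYARROW))
```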

On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com>
wrote:

> Hi Rishi,
>
> I've had success using the approach outlined here:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>
> Does this work for you?
>
> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com>
> wrote:
>
>> I modified the subject and would like to clarify that I am looking to
>> create an Anaconda parcel with pyarrow and other libraries, so that I
>> can distribute it on the Cloudera cluster.
>>
>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I have been trying to figure out a way to build an Anaconda parcel with
>>> pyarrow included for distribution on my Cloudera-managed server, but
>>> this doesn't seem to work right. Could someone please help?
>>>
>>> I tried installing Anaconda on one of the management nodes of the
>>> Cloudera cluster and tarred the directory, but that directory doesn't
>>> include all the packages needed to form a proper parcel for distribution.
>>>
>>> Any help is much appreciated!
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


-- 
Regards,

Rishi Shah
