Re: Spark process locality

Mayur Rustagi Fri, 21 Feb 2014 08:40:33 -0800

No, you cannot force an RDD onto a particular node.

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




On Fri, Feb 21, 2014 at 8:30 AM, dachuan <hdc1...@gmail.com> wrote:

> Mayur, is there any way to pin each of an RDD's partitions to a specific node?
>
> The input data is usually stored in HDFS and has its own preferred
> locations. I am just curious whether we can force the RDD's partitions to
> be placed this way regardless of where the data is stored now.
>
> thanks.
>
>
> On Fri, Feb 21, 2014 at 11:00 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> The Storage tab of the Spark web UI shows both of those.
>> Compression will certainly help!
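
The same per-RDD figures can also be read programmatically. The sketch below is a minimal illustration and not something from this thread: it assumes a newer Spark release that exposes the monitoring REST API under `/api/v1` (the releases current when this thread was written had only the web UI), and it assumes the JSON entries carry the fields `name`, `numPartitions`, `numCachedPartitions`, `memoryUsed` and `diskUsed`. The base URL and application id are placeholders.

```python
import json
from urllib.request import urlopen


def fetch_cached_rdds(base_url, app_id):
    """Fetch the list of cached RDDs for one application.

    base_url is a placeholder such as "http://driver-host:4040";
    the endpoint path is an assumption about newer Spark releases.
    """
    url = f"{base_url}/api/v1/applications/{app_id}/storage/rdd"
    with urlopen(url) as response:
        return json.load(response)


def summarize(rdds):
    """Map each RDD name to (cached partitions, total partitions,
    bytes held in memory, bytes held on disk)."""
    return {
        rdd["name"]: (
            rdd["numCachedPartitions"],
            rdd["numPartitions"],
            rdd["memoryUsed"],
            rdd["diskUsed"],
        )
        for rdd in rdds
    }
```

Calling `summarize(fetch_cached_rdds(base_url, app_id))` against a running application would give one entry per cached RDD; `summarize` can also be fed any list of dictionaries with those keys.
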
>>
>> On Fri, Feb 21, 2014 at 12:09 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:
>>
>>> Hi Mayur,
>>>
>>> Thanks a lot for the very quick reply.
>>>
>>> I have a few questions regarding RDDs:
>>> 1) How do I find the RDD placement per machine, i.e. which RDD's data is
>>> cached at which location?
>>> 2) How do I find the total space taken by each RDD created by my
>>> program/module?
>>> 3) Does enabling compression on RDDs help?
>>>
>>> Thanks,
>>> Vinay
>>>
>>>
>>>
>>>
>>> On Thu, Feb 20, 2014 at 11:44 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>
>>>> It's highly likely that locality type will not become a bottleneck, since
>>>> Spark tries to schedule tasks where the data is cached. Two things might
>>>> help:
>>>> 1. Make sure you have enough memory to cache the whole dataset as an RDD.
>>>> Keep in mind that the cached RDD may be larger than the raw text, because
>>>> Java objects carry overhead.
>>>> 2. Try increasing the replication factor of the data, so that it is
>>>> available on all workers and is therefore faster to cache on workers that
>>>> don't already have it (in the non-local cases, per se).
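
Point 1 can be turned into a back-of-the-envelope check. The sketch below is only an estimate under stated assumptions: `overhead_factor` (how much larger the deserialized Java objects are than the raw text) and `storage_fraction` (the share of executor memory usable for caching) are illustrative guesses that must be measured or looked up for a given Spark version, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def fits_in_cache(raw_bytes, executor_memory_bytes, num_executors,
                  overhead_factor=3.0, storage_fraction=0.5):
    """Estimate whether a dataset fits in the cluster's cache.

    overhead_factor and storage_fraction are assumptions, not
    values taken from Spark; adjust them to your own setup.
    Returns (fits, estimated_bytes, available_bytes).
    """
    estimated = raw_bytes * overhead_factor
    available = executor_memory_bytes * num_executors * storage_fraction
    return estimated <= available, estimated, available


# Example: 50 GiB of raw logs on 4 executors with 60 GiB each
# (hypothetical figures).
fits, estimated, available = fits_in_cache(50 * GIB, 60 * GIB, 4)
```

If the estimate does not fit, a serialized storage level or compression reduces the cached size at the cost of extra CPU.
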
>>>>
>>>> Regards
>>>> Mayur
>>>>
>>>> On Thu, Feb 20, 2014 at 12:29 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:
>>>>
>>>>> Hi Mayur
>>>>>
>>>>> I am trying to analyse Apache logs that contain traffic details.
>>>>> Basically, I am trying to compute statistics on data points such as total
>>>>> views from each country and unique URLs. I have one cluster running with
>>>>> 4 workers and one master (total space 240GB and 96 cores). I was trying a
>>>>> few things to make it faster and got stuck on the locality types of the
>>>>> tasks.
>>>>>
>>>>> Regards
>>>>> Vinay Bajaj
>>>>>
>>>>>
>>>>> On Wed, Feb 19, 2014 at 11:34 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>
>>>>>> Process-local means the data is cached in the same JVM as the task;
>>>>>> node-local means it is cached on the same machine but not in the same JVM
>>>>>> (in another process, perhaps). Tuning the wait depends on your system
>>>>>> configuration (memory vs. disk vs. network). Frankly, I have never had to
>>>>>> modify it. Can you share the use case that requires you to do that?
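
How the wait interacts with the locality levels can be illustrated with a toy model. This is a simplified sketch, not Spark's scheduler code: it assumes the scheduler starts at the strictest level and relaxes one level each time a full wait period passes without launching a task, with a 3-second wait per level as the assumed default. The function name is hypothetical.

```python
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(idle_seconds, waits=(3.0, 3.0, 3.0)):
    """Return the most relaxed locality level permitted after
    idle_seconds have passed without a task launch.

    waits holds the assumed wait (in seconds) before giving up on
    PROCESS_LOCAL, NODE_LOCAL and RACK_LOCAL respectively.
    """
    remaining = idle_seconds
    index = 0
    for wait in waits:
        if remaining < wait:
            break  # still inside this level's wait window
        remaining -= wait
        index += 1  # wait expired, relax to the next level
    return LEVELS[index]
```

In this model a wait of zero skips a level entirely, so `waits=(0, 0, 0)` lets every task run anywhere immediately, while larger waits trade idle time for a better chance of reading cached data locally.
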
>>>>>>
>>>>>> On Wed, Feb 19, 2014 at 1:59 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> It would be very helpful if anyone could elaborate on
>>>>>>> spark.locality.wait and the multiple locality levels (process-local,
>>>>>>> node-local, rack-local and then any): what is the best configuration I
>>>>>>> can achieve by modifying this wait, and what is the difference between
>>>>>>> process-local and node-local?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Vinay Bajaj
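
For anyone experimenting with the setting named in this question, the sketch below assembles `spark.locality.wait` and its per-level overrides (`spark.locality.wait.process`, `.node`, `.rack`) into `--conf` arguments for `spark-submit`. The helper names are hypothetical, and the values are placeholders rather than recommendations. Time strings such as `3s` are assumed here; releases from the time of this thread expected plain milliseconds instead.

```python
def locality_conf(default_wait="3s", process=None, node=None, rack=None):
    """Build a dict of locality-wait settings.

    Per-level values left as None are omitted, so those levels
    fall back to spark.locality.wait.
    """
    settings = {"spark.locality.wait": default_wait}
    overrides = {"process": process, "node": node, "rack": rack}
    for level, value in overrides.items():
        if value is not None:
            settings[f"spark.locality.wait.{level}"] = value
    return settings


def as_submit_args(settings):
    """Render the settings as a list of spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

For example, `as_submit_args(locality_conf("1s", node="0s"))` yields the arguments for a one-second general wait with no waiting at the node-local level.
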
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>
