It's highly likely that locality type will not become a bottleneck, as Spark
tries to schedule tasks where the data is cached. Two things might help:
1. Make sure you have enough memory to cache the whole dataset as an RDD. Keep
in mind that the cached RDD can be larger than the raw text, since Java
objects carry overhead.
2. You can try increasing the replication factor of the data, so that it is
available on all workers and is therefore faster to cache on workers that
don't already have it (that is, in the non-local cases).
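
A minimal sketch of both points, under some assumptions: it reads
"replication factor" as the replication of cached partitions (the other
reading, HDFS block replication, is set on the HDFS side and is not shown),
it is launched through spark-submit so the master URL is supplied from
outside, and the object name and input path argument are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical application name and layout, for illustration only.
object CachedLogStats {
  def main(args: Array[String]): Unit = {
    // Hypothetical: the log location is passed as the first argument.
    val inputPath = args(0)

    // The master URL is assumed to come from spark-submit.
    val conf = new SparkConf().setAppName("cached-log-stats")
    val sc = new SparkContext(conf)

    val logs = sc.textFile(inputPath)

    // Point 1: the *_SER levels store partitions as serialized bytes, which
    // usually takes less memory than deserialized Java objects, at the cost
    // of extra CPU on access.
    // Point 2 (one reading): the _2 suffix keeps every cached partition on
    // two nodes, so more workers can read it locally.
    logs.persist(StorageLevel.MEMORY_ONLY_SER_2)

    // The first action materializes the cache; later jobs reuse it.
    val total = logs.count()
    println(s"lines: $total")

    sc.stop()
  }
}
```

Replicating cached partitions doubles the memory needed for the cache, so
it only makes sense if point 1 still holds afterwards.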

Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Thu, Feb 20, 2014 at 12:29 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:

> Hi Mayur
>
> I am trying to analyse Apache logs that contain traffic details,
> basically to compute statistics such as total views from each country
> and unique URLs. I have one cluster running with 4 workers and one
> master (total space 240GB and 96 cores). I was trying some things to
> make it faster and got stuck on these locality types.
>
> Regards
> Vinay Bajaj
>
>
> On Wed, Feb 19, 2014 at 11:34 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> Process-local implies the data is cached in the same JVM as the task;
>> node-local means it is cached on the same machine but not in the same JVM
>> (in another executor on that node, for example). How to tune the wait
>> depends on your system configuration (memory vs. disk vs. network).
>> Frankly, I have never had to modify it. Can you share the use case that
>> requires you to do that?
>>
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>>
>>
>> On Wed, Feb 19, 2014 at 1:59 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> It would be very helpful if anyone could elaborate on
>>> spark.locality.wait and the multiple locality levels (process-local,
>>> node-local, rack-local and then any): what is the best configuration I
>>> can achieve by modifying this wait, and what is the difference between
>>> process-local and node-local?
>>>
>>> Thanks
>>> Vinay Bajaj
>>>
>>>
>>>
>>
>
