It's highly likely that locality type will not become a bottleneck, since Spark tries to schedule tasks where the data is cached. Two things might help:

1. Make sure you have enough memory to cache the whole dataset as an RDD. Keep in mind that the cached RDD may be larger than the raw text, since Java objects carry overhead.
2. You can try increasing the replication factor of the data so that it is available on all workers. That makes it faster to cache on other workers that don't already have it (that is, in the non-local cases).
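
To make the wait from the quoted question concrete, here is a minimal sketch in plain Python of how a locality wait could relax the allowed level over time. It is a simplified model, not Spark's scheduler code: the function name `most_relaxed_level`, the level list, and the 3000 ms default are assumptions for illustration (check the configuration docs for your Spark version), and it assumes the same wait applies at every level.

```python
# Simplified model of locality fallback; NOT Spark's actual scheduler code.
# Level names follow the thread (process-local, node-local, rack-local, any).
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def most_relaxed_level(waited_ms, locality_wait_ms=3000):
    """Return the loosest locality level a task may launch at after
    waiting `waited_ms` without getting a slot at a tighter level.

    Assumption: the same wait applies at every level, so each full
    wait period unlocks the next level in LEVELS.
    """
    if locality_wait_ms <= 0:
        # A zero wait means no waiting for locality: any slot is acceptable.
        return LEVELS[-1]
    steps = waited_ms // locality_wait_ms
    return LEVELS[min(steps, len(LEVELS) - 1)]


if __name__ == "__main__":
    for waited in (0, 3000, 6500, 20000):
        print(waited, most_relaxed_level(waited))
```

Under this model, a longer wait keeps tasks at the process-local level for longer at the cost of idle slots, while a shorter wait lets them fall back to less local slots sooner. If the data is already cached where the tasks run, the fallback rarely triggers.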

Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi

On Thu, Feb 20, 2014 at 12:29 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:

> Hi Mayur
>
> I am trying to analyse Apache logs that contain traffic details,
> basically to work out statistics on data points such as total views
> from each country and unique URLs. I have one cluster running with
> 4 workers and one master (total space 240GB and 96 cores). I was trying
> some things to make it faster and got stuck on the locality types of
> the process.
>
> Regards
> Vinay Bajaj
>
> On Wed, Feb 19, 2014 at 11:34 PM, Mayur Rustagi
> <mayur.rust...@gmail.com> wrote:
>
>> Process local implies the data is cached in the same JVM as the task;
>> node local means it's cached on the same system but not in the same JVM
>> (on some other core perhaps). Modifying the wait is a tuning process
>> that depends on your system configuration (memory vs disk vs network).
>> I frankly never had to modify it. Can you share the use case that
>> requires you to do that?
>>
>> Mayur Rustagi
>>
>> On Wed, Feb 19, 2014 at 1:59 AM, vinay Bajaj <vbajaj2...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> It would be very helpful if anyone could elaborate on
>>> spark.locality.wait and the multiple locality levels (process-local,
>>> node-local, rack-local and then any): what is the best configuration I
>>> can achieve by modifying this wait, and what is the difference between
>>> process local and node local?
>>>
>>> Thanks
>>> Vinay Bajaj