Additionally, I've encountered some confusing situation where the locality level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache the data. I wonder some implicit caching happens even without the user specifying things.
On Thu, Jun 5, 2014 at 3:50 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Thanks Andrew, > > Is there a chance that even with full-caching, that modes other than > PROCESS_LOCAL will be used? E.g., let's say, an executor will try to > perform tasks although the data are cached on a different executor. > > What I'd like to do is to prevent such a scenario entirely. > > I'd like to know if setting 'spark.locality.wait' to a very high value > would guarantee that the mode will always be 'PROCESS_LOCAL'. > > > On Thu, Jun 5, 2014 at 3:36 PM, Andrew Ash <and...@andrewash.com> wrote: > >> The locality is how close the data is to the code that's processing it. >> PROCESS_LOCAL means data is in the same JVM as the code that's running, so >> it's really fast. NODE_LOCAL might mean that the data is in HDFS on the >> same node, or in another executor on the same node, so is a little slower >> because the data has to travel across an IPC connection. RACK_LOCAL is >> even slower -- data is on a different server so needs to be sent over the >> network. >> >> Spark switches to lower locality levels when there's no unprocessed data >> on a node that has idle CPUs. In that situation you have two options: wait >> until the busy CPUs free up so you can start another task that uses data on >> that server, or start a new task on a farther away server that needs to >> bring data from that remote place. What Spark typically does is wait a bit >> in the hopes that a busy CPU frees up. Once that timeout expires, it >> starts moving the data from far away to the free CPU. >> >> The main tunable option is how far long the scheduler waits before >> starting to move data rather than code. Those are the spark.locality.* >> settings here: http://spark.apache.org/docs/latest/configuration.html >> >> If you want to prevent this from happening entirely, you can set the >> values to ridiculously high numbers. The documentation also mentions that >> "0" has special meaning, so you can try that as well. >> >> Good luck! >> Andrew >> >> >> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung <coded...@cs.stanford.edu >> > wrote: >> >>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd >>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. >>> >>> When these happen things get extremely slow. >>> >>> Does this mean that the executor got terminated and restarted? >>> >>> Is there a way to prevent this from happening (barring the machine >>> actually going down, I'd rather stick with the same process)? >>> >> >> >