Hi Brad and Nick, Thanks for the comments! I opened a ticket to get a more thorough explanation of data locality into the docs here: https://issues.apache.org/jira/browse/SPARK-3526
If you could put any other unanswered questions you have about data locality on that ticket I'll try to incorporate answers to them in the final addition I send in. Andrew On Sun, Sep 14, 2014 at 6:47 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote: > Hi Andrew, > > I agree with Nicholas. That was a nice, concise summary of the > meaning of the locality customization options, indicators and default > Spark behaviors. I haven't combed through the documentation > end-to-end in a while, but I'm also not sure that information is > presently represented somewhere and it would be great to persist it > somewhere besides the mailing list. > > best, > -Brad > > On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas > <nicholas.cham...@gmail.com> wrote: > > Andrew, > > > > This email was pretty helpful. I feel like this stuff should be > summarized > > in the docs somewhere, or perhaps in a blog post. > > > > Do you know if it is? > > > > Nick > > > > > > On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash <and...@andrewash.com> wrote: > >> > >> The locality is how close the data is to the code that's processing it. > >> PROCESS_LOCAL means data is in the same JVM as the code that's running, > so > >> it's really fast. NODE_LOCAL might mean that the data is in HDFS on the > >> same node, or in another executor on the same node, so is a little > slower > >> because the data has to travel across an IPC connection. RACK_LOCAL is > even > >> slower -- data is on a different server so needs to be sent over the > >> network. > >> > >> Spark switches to lower locality levels when there's no unprocessed data > >> on a node that has idle CPUs. In that situation you have two options: > wait > >> until the busy CPUs free up so you can start another task that uses > data on > >> that server, or start a new task on a farther away server that needs to > >> bring data from that remote place. What Spark typically does is wait a > bit > >> in the hopes that a busy CPU frees up. Once that timeout expires, it > starts > >> moving the data from far away to the free CPU. > >> > >> The main tunable option is how far long the scheduler waits before > >> starting to move data rather than code. Those are the spark.locality.* > >> settings here: http://spark.apache.org/docs/latest/configuration.html > >> > >> If you want to prevent this from happening entirely, you can set the > >> values to ridiculously high numbers. The documentation also mentions > that > >> "0" has special meaning, so you can try that as well. > >> > >> Good luck! > >> Andrew > >> > >> > >> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung < > coded...@cs.stanford.edu> > >> wrote: > >>> > >>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd > >>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. > >>> > >>> When these happen things get extremely slow. > >>> > >>> Does this mean that the executor got terminated and restarted? > >>> > >>> Is there a way to prevent this from happening (barring the machine > >>> actually going down, I'd rather stick with the same process)? > >> > >> > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >