Hi Brad and Nick,

Thanks for the comments!  I opened a ticket to get a more thorough
explanation of data locality into the docs here:
https://issues.apache.org/jira/browse/SPARK-3526

If you could put any other unanswered questions you have about data
locality on that ticket, I'll try to incorporate answers to them in the
final addition I send in.

Andrew

On Sun, Sep 14, 2014 at 6:47 PM, Brad Miller <bmill...@eecs.berkeley.edu>
wrote:

> Hi Andrew,
>
> I agree with Nicholas.  That was a nice, concise summary of the
> meaning of the locality customization options, indicators, and default
> Spark behaviors.  I haven't combed through the documentation
> end-to-end in a while, but I'm also not sure that information is
> presently captured anywhere, and it would be great to persist it
> somewhere besides the mailing list.
>
> best,
> -Brad
>
> On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > Andrew,
> >
> > This email was pretty helpful.  I feel like this stuff should be
> > summarized in the docs somewhere, or perhaps in a blog post.
> >
> > Do you know if it is?
> >
> > Nick
> >
> >
> > On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash <and...@andrewash.com> wrote:
> >>
> >> The locality is how close the data is to the code that's processing
> >> it.  PROCESS_LOCAL means the data is in the same JVM as the code
> >> that's running, so it's really fast.  NODE_LOCAL might mean that the
> >> data is in HDFS on the same node, or in another executor on the same
> >> node, so it's a little slower because the data has to travel across
> >> an IPC connection.  RACK_LOCAL is even slower -- the data is on a
> >> different server, so it needs to be sent over the network.
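> >>
> >> As a rough illustration (a minimal sketch assuming a spark-shell
> >> session, where sc is the SparkContext, and a placeholder HDFS path):
> >> caching pulls partitions into the executor JVMs, so later jobs on the
> >> same RDD can run PROCESS_LOCAL instead of NODE_LOCAL.
> >>
> >>   // First pass: the blocks live only in HDFS, so tasks are at best
> >>   // NODE_LOCAL.
> >>   val logs = sc.textFile("hdfs:///path/to/logs")
> >>   logs.cache()
> >>   logs.count()  // materializes the cache
> >>
> >>   // Second pass: partitions are now cached in executor JVMs, so the
> >>   // UI should show these tasks as PROCESS_LOCAL.
> >>   logs.filter(_.contains("ERROR")).count()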
> >>
> >> Spark switches to lower locality levels when there's no unprocessed
> >> data on a node that has idle CPUs.  In that situation you have two
> >> options: wait until the busy CPUs free up so you can start another
> >> task that uses data on that server, or start a new task on a
> >> farther-away server that needs to bring data from that remote place.
> >> What Spark typically does is wait a bit in the hopes that a busy CPU
> >> frees up.  Once that timeout expires, it starts moving the data from
> >> far away to the free CPU.
> >>
> >> The main tunable option is how long the scheduler waits before it
> >> starts moving data rather than code.  Those are the spark.locality.*
> >> settings here: http://spark.apache.org/docs/latest/configuration.html
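> >>
> >> For example, a minimal sketch (the key names are real settings, but
> >> the values are milliseconds and purely illustrative; the default for
> >> spark.locality.wait is 3 seconds):
> >>
> >>   import org.apache.spark.{SparkConf, SparkContext}
> >>
> >>   val conf = new SparkConf()
> >>     .setAppName("locality-demo")
> >>     // Wait up to 10s for a free CPU at the current locality level
> >>     // before falling back to the next level out.
> >>     .set("spark.locality.wait", "10000")
> >>     // Per-level overrides are also available:
> >>     .set("spark.locality.wait.node", "10000")
> >>     .set("spark.locality.wait.rack", "10000")
> >>   val sc = new SparkContext(conf)  // master set via spark-submit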
> >>
> >> If you want to prevent this from happening entirely, you can set the
> >> values to ridiculously high numbers.  The documentation also mentions
> >> that "0" has special meaning, so you can try that as well.
> >>
> >> Good luck!
> >> Andrew
> >>
> >>
> >> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung
> >> <coded...@cs.stanford.edu> wrote:
> >>>
> >>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
> >>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
> >>>
> >>> When these switches happen, things get extremely slow.
> >>>
> >>> Does this mean that the executor got terminated and restarted?
> >>>
> >>> Is there a way to prevent this from happening (barring the machine
> >>> actually going down, I'd rather stick with the same process)?
> >>
> >>
> >
>