Hi Brad and Nick,
Thanks for the comments! I opened a ticket to get a more thorough
explanation of data locality into the docs here:
https://issues.apache.org/jira/browse/SPARK-3526
If you could put any other unanswered questions you have about data
locality on that ticket I'll try to incorporat
Hi Andrew,
I agree with Nicholas. That was a nice, concise summary of the
meaning of the locality customization options, indicators and default
Spark behaviors. I haven't combed through the documentation
end-to-end in a while, but I'm also not sure that information is
presently represented somew
Another observation: when reading from the local filesystem with "file://",
the locality was reported as PROCESS_LOCAL, which was confusing.
Regards,
Liming
On 13 Sep, 2014, at 3:12 am, Nicholas Chammas
wrote:
> Andrew,
>
> This email was pretty helpful. I feel like this stuff should be summarized in
>
Andrew,
This email was pretty helpful. I feel like this stuff should be summarized
in the docs somewhere, or perhaps in a blog post.
Do you know if it is?
Nick
On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash wrote:
> The locality is how close the data is to the code that's processing it.
> PROCE
Sent: Friday, June 06, 2014 6:53 AM
To: user@spark.apache.org
Subject: Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or
RACK_LOCAL?
Additionally, I've encountered a confusing situation where the locality
level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache
the data. I wonder whether some implicit caching happens even when the
user doesn't ask for it.
On Thu, Jun 5, 2014 at 3:50 PM, Sung Hwan Chung
wrote:
>
Thanks Andrew,
Is there a chance that, even with full caching, modes other than
PROCESS_LOCAL will be used? E.g., an executor might try to run a task
even though the data are cached on a different executor.
What I'd like to do is to prevent such a scenario entirely.
I'd like to kno
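
One knob that bears on the question above is the scheduler's locality wait.
What follows is only a minimal sketch, assuming a Spark build that honors
spark.locality.wait and its per-level variants (spark.locality.wait.process,
spark.locality.wait.node, spark.locality.wait.rack). The helper
locality_wait_flags is hypothetical and not part of Spark; it only turns the
chosen waits into spark-submit --conf arguments. A longer wait makes the
scheduler hold a task back longer before it falls to a less local level; it
is not a hard guarantee that fallback never happens.

```python
# Hypothetical helper (not part of Spark): build spark-submit --conf flags
# for the scheduler's locality-wait settings.

LOCALITY_WAIT_KEYS = {
    "default": "spark.locality.wait",
    "process": "spark.locality.wait.process",
    "node": "spark.locality.wait.node",
    "rack": "spark.locality.wait.rack",
}


def locality_wait_flags(**waits):
    """Return a flat list of spark-submit arguments.

    Keyword names must be keys of LOCALITY_WAIT_KEYS. Values are passed
    through unchanged, because the accepted format (plain milliseconds
    or a suffixed duration such as "30s") depends on the Spark version.
    """
    flags = []
    for level, value in waits.items():
        if level not in LOCALITY_WAIT_KEYS:
            raise ValueError(f"unknown locality level: {level!r}")
        flags += ["--conf", f"{LOCALITY_WAIT_KEYS[level]}={value}"]
    return flags


# Example (the value "30s" is an arbitrary illustration, not a recommendation):
# wait longer for a PROCESS_LOCAL slot before falling back to NODE_LOCAL.
example_flags = locality_wait_flags(process="30s")
```

The resulting list can be spliced into a spark-submit command line; the same
keys can equally be set in spark-defaults.conf or on a SparkConf.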
The locality is how close the data is to the code that's processing it.
PROCESS_LOCAL means data is in the same JVM as the code that's running, so
it's really fast. NODE_LOCAL might mean that the data is in HDFS on the
same node, or in another executor on the same node, so is a little slower
beca
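
The levels described above can be illustrated with a toy classifier. This is
a sketch for intuition only, not Spark's scheduler code: the Location class
and the locality_level function are hypothetical names, and RACK_LOCAL and
ANY are included here on the assumption that they follow the same "how far
away is the data" ordering as the two levels described in the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Where a block of data lives, or where a task would run (toy model)."""
    rack: str
    host: str
    # None means the data is on the host but not inside any executor JVM,
    # e.g. an HDFS block on local disk.
    executor_id: Optional[str] = None


def locality_level(data: Location, task: Location) -> str:
    """Classify how close the data is to the code that would process it."""
    same_host = data.rack == task.rack and data.host == task.host
    if same_host and data.executor_id is not None \
            and data.executor_id == task.executor_id:
        return "PROCESS_LOCAL"   # data is in the same JVM as the task
    if same_host:
        return "NODE_LOCAL"      # same node: local disk or another executor
    if data.rack == task.rack:
        return "RACK_LOCAL"      # different node on the same rack
    return "ANY"                 # anywhere else in the cluster


task_slot = Location(rack="r1", host="h1", executor_id="e1")
cached_here = Location(rack="r1", host="h1", executor_id="e1")
hdfs_block_same_node = Location(rack="r1", host="h1")
cached_in_neighbor = Location(rack="r1", host="h1", executor_id="e2")
```

In this model, locality_level(cached_here, task_slot) is PROCESS_LOCAL, while
both hdfs_block_same_node and cached_in_neighbor classify as NODE_LOCAL
relative to task_slot.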
On a related note, I'd also minimize any kind of executor movement. I.e.,
once an executor is spawned and data cached in the executor, I want that
executor to live all the way till the job is finished, or the machine fails
in a fatal manner.
What would be the best way to ensure that this is the ca