Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-15 Thread Andrew Ash
Hi Brad and Nick, Thanks for the comments! I opened a ticket to get a more thorough explanation of data locality into the docs here: https://issues.apache.org/jira/browse/SPARK-3526 If you could put any other unanswered questions you have about data locality on that ticket I'll try to incorporat

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew, I agree with Nicholas. That was a nice, concise summary of the meaning of the locality customization options, indicators and default Spark behaviors. I haven't combed through the documentation end-to-end in a while, but I'm also not sure that information is presently represented somew

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Tsai Li Ming
Another observation: when reading from the local filesystem with "file://", the locality level was reported as PROCESS_LOCAL, which was confusing. Regards, Liming On 13 Sep, 2014, at 3:12 am, Nicholas Chammas wrote: > Andrew, > > This email was pretty helpful. I feel like this stuff should be summarized in >

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Nicholas Chammas
Andrew, This email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps in a blog post. Do you know if it is? Nick On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash wrote: > The locality is how close the data is to the code that's processing it. > PROCE

RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
Sent: Friday, June 06, 2014 6:53 AM To: user@spark.apache.org Subject: Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL? Additionally, I've encountered a confusing situation where the locality level for a task showed up as 'PROCESS_LOCAL' even though I d

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
Additionally, I've encountered a confusing situation where the locality level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache the data. I wonder whether some implicit caching happens even without the user specifying it. On Thu, Jun 5, 2014 at 3:50 PM, Sung Hwan Chung wrote: >

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
Thanks Andrew, Is there a chance that, even with full caching, modes other than PROCESS_LOCAL will be used? For example, an executor might try to run tasks even though the data are cached on a different executor. What I'd like to do is to prevent such a scenario entirely. I'd like to kno

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Andrew Ash
Locality is how close the data is to the code that's processing it. PROCESS_LOCAL means the data is in the same JVM as the code that's running, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so it is a little slower beca
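
To make the fallback between locality levels concrete, here is a minimal sketch in plain Python. It is a simplified model, not Spark's scheduler code: the function name `allowed_locality` and the `waits` dictionary are hypothetical, and the 3-second values assume the default of `spark.locality.wait`, with the per-level entries standing in for `spark.locality.wait.process`, `spark.locality.wait.node` and `spark.locality.wait.rack`. The real scheduler also skips levels for which a task set has no candidate tasks, which this sketch leaves out.

```python
# Simplified model of locality fallback; NOT Spark's actual implementation.
# Levels are ordered from closest to the data to farthest from it.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_locality(elapsed_s, waits=None):
    """Return the loosest level allowed after `elapsed_s` seconds
    without launching a task at a better locality level."""
    if waits is None:
        # Assumed defaults: 3 seconds per level.
        waits = {"PROCESS_LOCAL": 3.0, "NODE_LOCAL": 3.0, "RACK_LOCAL": 3.0}
    remaining = elapsed_s
    for level in LEVELS[:-1]:
        wait = waits[level]
        if remaining < wait:
            # Still inside this level's wait window, so stay here.
            return level
        # This level's wait has expired; fall back to the next one.
        remaining -= wait
    return "ANY"


if __name__ == "__main__":
    for t in (0.0, 3.5, 7.0, 10.0):
        print(t, allowed_locality(t))
```

In this model, raising the wait for a level keeps tasks at that level longer before they fall back, and setting a wait to 0 skips the level immediately.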

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
On a related note, I'd also like to minimize any kind of executor movement. I.e., once an executor is spawned and data is cached in it, I want that executor to live until the job is finished or the machine fails in a fatal manner. What would be the best way to ensure that this is the ca