[jira] [Comment Edited] (SPARK-3526) Docs section on data locality

Andrew Ash (JIRA) Mon, 15 Sep 2014 01:14:57 -0700

    [ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133662#comment-14133662 ]


Andrew Ash edited comment on SPARK-3526 at 9/15/14 8:14 AM:
------------------------------------------------------------

Note: users have reported that reads from {{file://}} may be logged as 
{{PROCESS_LOCAL}}. Is that expected?

Edit: repro'd and filed as SPARK-3528


was (Author: aash):
Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

> Docs section on data locality
> -----------------------------
>
>                 Key: SPARK-3526
>                 URL: https://issues.apache.org/jira/browse/SPARK-3526
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.0.2
>            Reporter: Andrew Ash
>
> Several mailing-list threads have asked about data locality and how to 
> interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's add more 
> detail on this concept to the docs so we can point future questions there.
> A couple of people appreciated the description of locality below, so it 
> could be a good starting point:
> {quote}
> Locality is how close the data is to the code that's processing it.  
> PROCESS_LOCAL means the data is in the same JVM as the running code, so 
> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
> node, or in another executor on the same node, so it is a little slower 
> because the data has to travel across an IPC connection.  RACK_LOCAL is 
> slower still -- the data is on a different server in the same rack, so it 
> needs to be sent over the network.
> Spark switches to lower locality levels when there's no unprocessed data on a 
> node that has idle CPUs.  In that situation you have two options: wait until 
> the busy CPUs free up so you can start another task that uses data on that 
> server, or start a new task on a farther-away server that has to pull the 
> data from that remote place.  What Spark typically does is wait a bit in the 
> hope that a busy CPU frees up.  Once that timeout expires, it starts moving 
> the data from far away to the free CPU.
> {quote}
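
The wait-then-fall-back behaviour described in the quote can be sketched in a few lines of plain Python. This is only an illustrative model, not Spark's scheduler code: the names `allowed_locality` and `pick_slot` and the executor names are hypothetical, the ANY level is added as the final fallback, and every level is assumed to share one timeout. In Spark itself the timeout is governed by the `spark.locality.wait` setting (3 seconds by default), with per-level overrides `spark.locality.wait.process`, `spark.locality.wait.node`, and `spark.locality.wait.rack`.

```python
from enum import IntEnum


class Locality(IntEnum):
    """Locality levels, ordered from closest to farthest."""
    PROCESS_LOCAL = 0  # data is in the same JVM as the task
    NODE_LOCAL = 1     # data is on the same node (HDFS or another executor)
    RACK_LOCAL = 2     # data is on another server in the same rack
    ANY = 3            # data may be anywhere on the network


def allowed_locality(waited_s, wait_per_level_s=3.0):
    """Return the farthest level a task may launch at after waiting
    `waited_s` seconds without a closer CPU freeing up.

    Simplified assumption: every level shares one timeout, and one more
    level is unlocked each time that timeout expires.
    """
    if wait_per_level_s <= 0:
        return Locality.ANY  # no waiting: accept any free CPU immediately
    steps = int(waited_s // wait_per_level_s)
    return Locality(min(steps, Locality.ANY))


def pick_slot(free_slots, waited_s, wait_per_level_s=3.0):
    """Pick the closest free slot whose locality is currently allowed.

    `free_slots` maps a (hypothetical) executor name to the Locality its
    free CPU would have for the pending task. Returns None to keep waiting.
    """
    limit = allowed_locality(waited_s, wait_per_level_s)
    candidates = [(level, name) for name, level in free_slots.items()
                  if level <= limit]
    return min(candidates)[1] if candidates else None
```

With only a RACK_LOCAL slot free, `pick_slot` keeps returning None until two timeouts have passed, which mirrors the "wait a bit, then move the data" trade-off: a larger wait favours locality, while a wait of zero favours keeping CPUs busy.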



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

