I am running Spark 1.1.1 built against CDH4 and have a few questions about
Spark performance when co-locating Spark with HDFS nodes. 

I want to know whether (and how efficiently) Spark takes advantage of being
co-located with an HDFS DataNode. 
  
What I mean is: if a Spark executor reads a file whose blocks (or most of
them) live on an HDFS DataNode on the same machine as that Spark worker,
will it read directly from local disk, or does the data still have to
travel over the network in some way? Is there a distinct advantage to
putting HDFS and Spark on the same boxes when that is possible, or, given
how blocks are distributed across a cluster, are we so likely to be pulling
data over the network anyway that co-location doesn't really make that much
of a difference? 
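
For concreteness, here is a small check along these lines that can be run in
spark-shell (the HDFS path below is just a placeholder): it prints the hosts
Spark would prefer to schedule each partition on, which as far as I
understand come from the HDFS block locations of each input split:

  // Run in spark-shell, so `sc` already exists; the path is a placeholder.
  // hadoopFile returns a HadoopRDD, and preferredLocations() on it reports
  // the DataNode hostnames holding each input split.
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
    "hdfs:///user/zach/some-large-file.txt")

  rdd.partitions.take(5).foreach { p =>
    println(s"partition ${p.index} prefers: " +
      rdd.preferredLocations(p).mkString(", "))
  }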
  
Also, do you know of any papers, books, or other resources (other than
digging through the Spark code) that do a good job of explaining the
Spark/HDFS data workflow (i.e. how data moves from disk -> HDFS -> Spark -> HDFS)? 

Thanks! 
Zach



