I am running Spark 1.1.1 built against CDH4 and have a few questions about Spark performance when Spark is co-located with HDFS nodes.
I want to know whether (and how efficiently) Spark takes advantage of being co-located with an HDFS node. What I mean by this is: if a Spark executor reads a file, and that file (or most of its blocks) lives on an HDFS DataNode on the same machine as the Spark worker, will the executor read directly off of local disk, or does the data still have to travel through the network in some way?

Is there a distinct advantage to putting HDFS and Spark on the same boxes where possible, or, given the way blocks are distributed around a cluster, are we so likely to be moving blocks over the network anyway that co-location doesn't really make much of a difference?

Also, do you know of any papers/books/other resources (other than trying to dig through the Spark code) that do a good job of explaining the Spark/HDFS data workflow, i.e. how data moves from disk -> HDFS -> Spark -> HDFS?

Thanks!
Zach

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-and-HDFS-co-location-tp21070.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
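P.S. In case it helps with an answer: I have been watching the "Locality Level" column in the stage page of the web UI (PROCESS_LOCAL / NODE_LOCAL / RACK_LOCAL / ANY) to see how often tasks actually run next to their data. My understanding from the configuration docs is that the scheduler's willingness to wait for a local slot is tunable; a sketch of the relevant properties (property names are from the Spark docs; the values shown are just the documented defaults, in milliseconds, as far as I can tell for 1.1.x):

```
# spark-defaults.conf (illustrative; scheduler data-locality knobs)
# How long to wait for a free slot at the preferred locality level
# before falling back to the next (less local) level.
spark.locality.wait          3000
# Optional per-level overrides for each fallback step:
spark.locality.wait.process  3000
spark.locality.wait.node     3000
spark.locality.wait.rack     3000
```

So part of my question is whether, with co-located DataNodes and these defaults, most tasks should end up NODE_LOCAL or better.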