Hi Sunil,

Have you seen this fix in Spark 1.5? It may address the locality issue:
https://issues.apache.org/jira/browse/SPARK-4352

On Thu, Aug 20, 2015 at 4:09 AM, Sunil <sdhe...@gmail.com> wrote:

> Hello,
>
> I am seeing some unexpected issues with achieving HDFS data locality.
> I expect the tasks to be executed only on a node that holds the data,
> but this is not happening (unless, of course, that node is busy, in
> which case I understand tasks can go to some other node). Could anyone
> clarify what is wrong with my approach, or what I should do instead?
> Below are the cluster configuration and the experiments I have tried.
> Any help will be appreciated. If you would like to recreate the
> scenario below, you can use the JavaWordCount.java example shipped
> with Spark.
>
> *Cluster configuration:*
>
> 1. spark-1.4.0 and hadoop-2.7.1
> 2. Machines --> master node (master) and 6 worker nodes (node1 to node6)
> 3. master acts as --> Spark master, HDFS name node & secondary name
>    node, YARN resource manager
> 4. Each of the 6 worker nodes acts as --> Spark worker node, HDFS data
>    node, node manager
>
> *Data on HDFS:*
>
> A 20 MB text file is stored in a single block. With a replication
> factor of 3, the text file is stored on nodes 2, 3 & 4.
>
> *Test-1 (Spark standalone mode):*
>
> The application being run is the standard Java word count example,
> with the above text file in HDFS as input. On job submission, I see in
> the Spark web UI that stage 0 (i.e. mapToPair) is being run on random
> nodes (i.e. node1, node2, node6, etc.). By random I mean that stage 0
> executes on the very first worker node that gets registered to the
> application (this can be seen in the event timeline graph). Instead, I
> am expecting stage 0 to be run only on one of the three nodes 2, 3,
> or 4.
>
> *Test-2 (YARN cluster mode):*
>
> Same as above. No data locality seen.
>
> *Additional info:*
>
> No other Spark applications are running, and I have even tried setting
> spark.locality.wait to 10s, but still no difference.
>
> Thanks and regards,
> Sunil
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-with-HDFS-not-being-seen-tp24361.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
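
For anyone following the thread, the behaviour described in the quoted message (prefer a node that holds the block, and accept any other node only once the locality wait has elapsed) can be sketched as a small decision rule. This is a simplified, hypothetical model written for illustration only, not Spark's scheduler code: the names `LocalityModel` and `shouldLaunch` are invented here, and the node names and the 10 s wait are taken from the message above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LocalityModel {

    /**
     * Hypothetical, simplified locality rule (not Spark's actual implementation).
     *
     * @param preferredHosts hosts that hold a replica of the input block
     * @param offeredHost    host on which a free slot is being offered
     * @param waitedMs       how long the task has already waited for a local slot
     * @param localityWaitMs how long the task is willing to wait (cf. spark.locality.wait)
     * @return true if the task should be launched on the offered host
     */
    public static boolean shouldLaunch(Set<String> preferredHosts, String offeredHost,
                                       long waitedMs, long localityWaitMs) {
        // A slot on a host holding the data is always accepted.
        if (preferredHosts.contains(offeredHost)) {
            return true;
        }
        // A non-local slot is accepted only after the wait has run out.
        return waitedMs >= localityWaitMs;
    }

    public static void main(String[] args) {
        // Replica locations and wait value as described in the quoted message.
        Set<String> replicas = new HashSet<>(Arrays.asList("node2", "node3", "node4"));
        long localityWaitMs = 10_000L;

        System.out.println("node1 at 0 ms:  " + shouldLaunch(replicas, "node1", 0L, localityWaitMs));
        System.out.println("node3 at 0 ms:  " + shouldLaunch(replicas, "node3", 0L, localityWaitMs));
        System.out.println("node1 at 10 s:  " + shouldLaunch(replicas, "node1", 10_000L, localityWaitMs));
    }
}
```

One thing this model glosses over is that a scheduler can only wait for a node on which the application already has an executor. If stage 0 is scheduled before the executors on nodes 2, 3 or 4 have registered, that may be worth checking in the event timeline.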