Re: Data locality with HDFS not being seen

Sameer Farooqui Fri, 21 Aug 2015 08:35:55 -0700

Hi Sunil,

Have you seen this fix in Spark 1.5? It may address the locality issue:
https://issues.apache.org/jira/browse/SPARK-4352
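
As background on the spark.locality.wait setting mentioned in the quoted
message, here is a minimal sketch in plain Java of the delay-scheduling idea
behind it: a task prefers a host that holds a replica of its block, and only
after the wait has elapsed is it allowed to run anywhere. This is a simplified
model under stated assumptions, not Spark's scheduler code. The class
LocalityWaitSketch, its methods, and the two-level Level enum are hypothetical
names made up for illustration (Spark itself distinguishes more locality
levels and applies further conditions).

```java
import java.util.Arrays;
import java.util.List;

public class LocalityWaitSketch {
    // Simplified locality levels, ordered from most to least local.
    // Assumption: only two levels are modeled here.
    public enum Level { NODE_LOCAL, ANY }

    // Most relaxed level allowed after waiting waitedMs without a local launch.
    public static Level allowedLevel(long waitedMs, long localityWaitMs) {
        return waitedMs < localityWaitMs ? Level.NODE_LOCAL : Level.ANY;
    }

    // True if a task whose block lives on preferredHosts may be launched
    // on offeredHost at this point in time.
    public static boolean canLaunch(String offeredHost, List<String> preferredHosts,
                                    long waitedMs, long localityWaitMs) {
        if (preferredHosts.contains(offeredHost)) {
            return true; // node-local launch is always acceptable
        }
        // A non-local host is acceptable only once the wait has expired.
        return allowedLevel(waitedMs, localityWaitMs) == Level.ANY;
    }

    public static void main(String[] args) {
        // Replica hosts and the 10s wait are taken from the quoted message.
        List<String> replicas = Arrays.asList("node2", "node3", "node4");
        long localityWaitMs = 10_000;

        System.out.println(canLaunch("node2", replicas, 0, localityWaitMs));      // true
        System.out.println(canLaunch("node1", replicas, 0, localityWaitMs));      // false
        System.out.println(canLaunch("node1", replicas, 10_000, localityWaitMs)); // true
    }
}
```

Note that this model compares host names as plain strings, so in the sketch a
replica host only counts as local if the offered host is spelled exactly the
same way (for example "node2" would not match an IP address).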


On Thu, Aug 20, 2015 at 4:09 AM, Sunil <[email protected]> wrote:

> Hello,
>
> I am seeing some unexpected behavior when trying to achieve HDFS data
> locality. I expect tasks to be executed only on a node that holds the data,
> but this is not happening (unless, of course, that node is busy, in which
> case I understand tasks can go to some other node). Could anyone clarify
> what is wrong with my approach, or what I should do instead? Below are the
> cluster configuration and the experiments I have tried. Any help will be
> appreciated. If you would like to recreate the scenario below, you can use
> the JavaWordCount.java example that ships with Spark.
>
> *Cluster configuration:*
>
> 1. spark-1.4.0 and hadoop-2.7.1
> 2. Machines --> master node (master) and 6 worker nodes (node1 to node6)
> 3. master acts as --> Spark master, HDFS NameNode & secondary NameNode,
> YARN ResourceManager
> 4. Each of the 6 worker nodes acts as --> Spark worker, HDFS DataNode,
> YARN NodeManager
>
> *Data on HDFS:*
>
> A 20 MB text file is stored in a single block. With a replication factor
> of 3, the block is stored on nodes 2, 3 and 4.
>
> *Test-1 (Spark standalone mode):*
>
> The application is the standard Java word count example, with the above
> text file in HDFS as input. On job submission, I see in the Spark web UI
> that stage 0 (i.e. mapToPair) runs on seemingly random nodes (node1, node2,
> node6, etc.). By random I mean that stage 0 executes on the very first
> worker node that registers with the application (this can be seen in the
> event timeline graph). Instead, I expect stage 0 to run only on one of the
> three nodes 2, 3, or 4.
>
> *Test-2 (YARN cluster mode):*
>
> Same as above. No data locality seen.
>
> *Additional info:*
>
> No other Spark applications are running, and I have even tried setting
> spark.locality.wait to 10s, but it made no difference.
>
> Thanks and regards,
> Sunil
