[ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896854#comment-15896854 ]

Huangkaixuan edited comment on YARN-6289 at 3/6/17 7:54 AM:
------------------------------------------------------------

Experiment 1:
- 7-node Hadoop cluster (1 master, 6 data nodes / node managers)
- Node mapping: Simple37 = Master, Simple27 = Node1, Simple28 = Node2, Simple30 = Node3, Simple31 = Node4, Simple32 = Node5, Simple33 = Node6
- Configure HDFS with replication factor 2
- The test file occupies a single block in HDFS
- Configure Spark to use dynamic allocation
- Configure YARN with both the MapReduce shuffle service and the Spark shuffle service
- Add a single small file (a few bytes) to HDFS
- Run wordcount on the file (using Spark / MapReduce)
- Inspect whether the single map-stage task was scheduled on a node holding the data (the block placement itself can be double-checked with the sketch below)
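The "Data location" column records which datanodes hold the block's replicas. One way to verify this, besides the NameNode web UI, is the FileSystem#getFileBlockLocations API; below is a minimal sketch (the class name and the command-line path argument are placeholders for illustration, not part of the experiment scripts):

{code:java}
// Minimal sketch: print the datanodes holding each block of a given HDFS file.
// Assumes the Hadoop configuration (core-site.xml / hdfs-site.xml) is on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      FileStatus status = fs.getFileStatus(new Path(args[0]));
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("block offset=" + block.getOffset()
            + " len=" + block.getLength()
            + " hosts=" + String.join(",", block.getHosts()));
      }
    }
  }
}
{code}

For the single-block test file used here, this prints one line listing the two replica hosts; that pair is what the "Data location" column in the tables below refers to.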
  
The results (as observed in the web UI) are summarized below:

Result 1:
7-node cluster (1 master, 6 data nodes / node managers), 2x replication, single-block file
MapReduce wordcount

Run     Data location   Scheduled node  Hit     Time
1       Node3/Node4     Node6   No      20s
2       Node5/Node3     Node6   No      17s
3       Node3/Node5     Node1   No      21s
4       Node2/Node3     Node6   No      18s
5       Node1/Node2     Node1   Yes     15s
6       Node4/Node5     Node3   No      19s
7       Node2/Node3     Node2   Yes     14s
8       Node1/Node4     Node5   No      16s
9       Node1/Node6     Node6   Yes     15s
10      Node3/Node5     Node4   No      17s

7-node cluster (1 master, 6 data nodes / node managers), 2x replication, single-block file
Spark wordcount

Run     Data location   Scheduled node  Hit     Time
1       Node3/Node4     Node4   Yes     24s
2       Node2/Node3     Node5   No      30s
3       Node3/Node5     Node4   No      35s
4       Node2/Node3     Node2   Yes     24s
5       Node1/Node2     Node4   No      26s
6       Node4/Node5     Node2   No      25s
7       Node2/Node3     Node4   No      27s
8       Node1/Node4     Node1   Yes     22s
9       Node1/Node6     Node2   No      23s
10      Node1/Node2     Node4   No      33s
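A quick sanity check on these numbers: with one map task, 2 replicas, and 6 node managers, purely random placement would land on a replica node about 2/6 ≈ 33% of the time. The observed hit rate is 3/10 for both MapReduce and Spark, which is roughly what random placement would give, i.e. the scheduler does not appear to be favouring the data-local nodes at all.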


> yarn got little data locality
> -----------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread 
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2   Hadoop-2.7.1 
>            Reporter: Huangkaixuan
>            Priority: Minor
>
> When I ran this experiment with both Spark and MapReduce wordcount on the
> file, I noticed that the job did not get data locality every time. The
> placement of the tasks was seemingly random, even though no other job was
> running on the cluster. I expected the task to always be placed on a machine
> holding the data block, but that did not happen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
