[jira] [Updated] (YARN-6289) Fail to achieve data locality when running MapReduce and Spark on HDFS

Huangkaixuan (JIRA) Mon, 06 Mar 2017 22:45:09 -0800

     [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Huangkaixuan updated YARN-6289:
-------------------------------
    Description: 
When running a simple wordcount experiment on YARN, I noticed that tasks 
failed to achieve data locality, even though no other job was running on the 
cluster at the same time. The experiment was run on a 7-node cluster (1 master, 
6 data nodes/node managers). The input of the wordcount job (both Spark and 
MapReduce) was a single-block file in HDFS with two-way replication 
(replication factor = 2). I ran wordcount on YARN 10 times. The results show 
that only 30% of tasks achieved data locality, which looks like the result of 
random task placement. The experiment details are in the attachment; feel free 
to reproduce the experiments.


  was:
When I ran experiments with both Spark and MapReduce wordcount on YARN, I 
noticed that the task failed to achieve data locality, even though there is no 
other job running on the cluster. 
I adopted a 7 node (1 master, 6 data nodes/node managers) cluster and set 2x 
replication for HDFS. In the experiments, I run Spark/MapReduce wordcount on 
YARN for 10 times with a single data block. The results show that only 30% of 
tasks can achieve data locality, it seems like random in the placement of 
tasks. the experiment details are in the attachment, you can reproduce the 
experiments.



> Fail to achieve data locality when running MapReduce and Spark on HDFS
> ----------------------------------------------------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread 
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2   Hadoop-2.7.1 
>            Reporter: Huangkaixuan
>         Attachments: Hadoop_Spark_Conf.zip, YARN-DataLocality.docx
>
>
> When running a simple wordcount experiment on YARN, I noticed that tasks 
> failed to achieve data locality, even though no other job was running on the 
> cluster at the same time. The experiment was run on a 7-node cluster (1 
> master, 6 data nodes/node managers). The input of the wordcount job (both 
> Spark and MapReduce) was a single-block file in HDFS with two-way 
> replication (replication factor = 2). I ran wordcount on YARN 10 times. 
> The results show that only 30% of tasks achieved data locality, which looks 
> like the result of random task placement. The experiment details are in the 
> attachment; feel free to reproduce the experiments.
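
The "random placement" reading in the description can be checked with a small 
calculation. The sketch below is not taken from the attachment. It assumes each 
run produces one task for the single input block, that the scheduler picks one 
of the 6 node managers uniformly at random, and that "30%" means 3 local runs 
out of 10. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def random_locality_probability(num_nodes: int, replication: int) -> Fraction:
    """Chance that one task lands on a node holding a replica of its block,
    ASSUMING the node is chosen uniformly at random among num_nodes."""
    return Fraction(replication, num_nodes)


def prob_exactly_k_local(runs: int, k: int, p: Fraction) -> Fraction:
    """Binomial probability of exactly k node-local tasks in `runs`
    independent runs, each local with probability p."""
    return comb(runs, k) * p**k * (1 - p) ** (runs - k)


# Setup from the report: 6 data nodes/node managers, replication factor 2.
p_local = random_locality_probability(num_nodes=6, replication=2)
print(f"expected locality under random placement: {float(p_local):.1%}")

# Assumed reading of "30%": 3 node-local runs out of 10.
p_three = prob_exactly_k_local(runs=10, k=3, p=p_local)
print(f"P(exactly 3 of 10 local): {float(p_three):.3f}")
```

Under these assumptions the expected locality is 2/6, about 33%, so an observed 
30% over 10 runs is consistent with placement that ignores block location. Ten 
runs is a small sample, so this supports the suspicion rather than proving it.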



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

