[
https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Huangkaixuan updated YARN-6289:
-------------------------------
Description:
When I ran experiments with both Spark and MapReduce wordcount on YARN, I
noticed that the tasks failed to achieve data locality, even though no other
job was running on the cluster.
I used a 7-node cluster (1 master, 6 data nodes/node managers) and set 2x
replication for HDFS. In the experiments, I ran Spark/MapReduce wordcount on
YARN 10 times over a single data block. The results show that only 30% of the
tasks achieved data locality; task placement appears to be random. The
experiment details are in the attachment, so you can reproduce the
experiments.
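For anyone reproducing this, the locality split can be read directly from the
MapReduce job counters after each run. Below is a minimal sketch, assuming the
Hadoop 2.7.x client APIs shown in the imports; the class name and the job-ID
argument are illustrative, not part of the experiment:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.JobID;

public class LocalityCheck {
    public static void main(String[] args) throws Exception {
        // Look the finished job up on the cluster, e.g. job_1488888888888_0001
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(args[0]));
        if (job == null) {
            System.err.println("Job not found: " + args[0]);
            return;
        }
        Counters counters = job.getCounters();

        // Node-local vs. rack-local map tasks, as recorded by the framework
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long total = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();

        System.out.printf("node-local: %d/%d, rack-local: %d/%d%n",
                dataLocal, total, rackLocal, total);
    }
}
{code}

With a single-block input there is exactly one map task per run, so summing
DATA_LOCAL_MAPS over the 10 runs gives the 30% figure above.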
was:
When I ran experiments with both Spark and MapReduce wordcount on YARN
over a single file, I noticed that the job did not get data locality every
time. Task placement was seemingly random, even though no other job was
running on the cluster. I expected the tasks to always be placed on a machine
holding the data block, but that did not happen.
I ran the experiments on a 7-node cluster (1 master, 6 data nodes/node
managers) with 2x replication; the experiment details are in the patch so you
can recreate the result.
In the experiments, I ran Spark/MapReduce wordcount on YARN 10 times over a
single block, and the results show that only 30% of the tasks satisfied data
locality; task placement appears to be random.
Next, I will run two more experiments (the same 7-node cluster with 2x
replication, with 2 blocks and 4 blocks) to verify the results, and I plan to
do some optimization work (optimizing the scheduling policy) to improve data
locality.
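On the scheduling-policy point: two existing delay-scheduling settings already
govern how long each framework holds out for a node-local placement, and they
are the natural starting point for this kind of tuning. The sketch below is
illustrative only (the class name is hypothetical, and the values shown are
examples, not recommendations):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;

public class LocalityKnobs {
    public static void main(String[] args) {
        // CapacityScheduler (capacity-scheduler.xml): number of missed
        // scheduling opportunities after which the scheduler relaxes from
        // node-local to rack-local placement. Commonly sized on the order
        // of the number of nodes in the cluster.
        Configuration yarnConf = new Configuration();
        yarnConf.setInt("yarn.scheduler.capacity.node-locality-delay", 6);

        // Spark: how long a task waits for a node-local slot before falling
        // back to a less-local one (default 3s).
        SparkConf sparkConf = new SparkConf().set("spark.locality.wait", "6s");
    }
}
{code}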
> Fail to achieve data locality when running MapReduce and Spark on HDFS
> ---------------------------------------------------------------------
>
> Key: YARN-6289
> URL: https://issues.apache.org/jira/browse/YARN-6289
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2 Hadoop-2.7.1
> Reporter: Huangkaixuan
> Priority: Minor
> Attachments: YARN-6289.01.docx
>
>
> When I ran experiments with both Spark and MapReduce wordcount on YARN, I
> noticed that the tasks failed to achieve data locality, even though no other
> job was running on the cluster.
> I used a 7-node cluster (1 master, 6 data nodes/node managers) and set 2x
> replication for HDFS. In the experiments, I ran Spark/MapReduce wordcount on
> YARN 10 times over a single data block. The results show that only 30% of the
> tasks achieved data locality; task placement appears to be random. The
> experiment details are in the attachment, so you can reproduce the
> experiments.