liuxiaolong created HADOOP-16872:
------------------------------------

             Summary: Performance improvement when distcp files in large dir 
with -direct option
                 Key: HADOOP-16872
                 URL: https://issues.apache.org/jira/browse/HADOOP-16872
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: liuxiaolong


We use distcp with the -direct option to copy a file between two large 
directories. We found that it took a few minutes. If we launch too many distcp 
jobs at the same time, NameNode performance degrades seriously.

hadoop distcp -direct -skipcrccheck -update -prbugaxt -i -numListstatusThreads 1 
hdfs://cluster1:8020/source/100.log hdfs://cluster2:8020/target/100.jpg

 
|| ||Dir path||Count||
||Source dir||hdfs://cluster1:8020/source/||100k+ files||
||Target dir||hdfs://cluster2:8020/target/||100k+ files||

 

 

Checking the code in CopyCommitter.java, we found that the function 
deleteAttemptTempFiles() contains this call:

targetFS.globStatus(new Path(targetWorkPath, ".distcp.tmp." + 
jobId.replaceAll("job","attempt") + "*")); 

This call wastes a lot of time when distcp runs between two large dirs, since 
the glob has to list the large target dir to find matches. When we use distcp 
with the -direct option, it writes directly to the target file without 
generating a '.distcp.tmp' temp file, so there is nothing for the glob to find. 
So I think deleteAttemptTempFiles needs a check: if distcp is run with the 
-direct option, do nothing and return directly.
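The proposed check can be sketched roughly as below. This is only an illustration and not the CopyCommitter code: it uses no Hadoop classes, and the names DirectWriteCleanupSketch, TargetDir, listNames, findAttemptTempFiles and the directWrite parameter are all hypothetical. Only the prefix expression is taken from the line quoted above. A plain prefix match over a directory listing stands in for the glob.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DirectWriteCleanupSketch {

    /** Hypothetical stand-in for the target directory; not a Hadoop type. */
    interface TargetDir {
        /** Each call models one full listing of the (possibly huge) directory. */
        List<String> listNames();
    }

    /** Builds the temp-file prefix the same way the quoted globStatus line does. */
    static String tempFilePrefix(String jobId) {
        return ".distcp.tmp." + jobId.replaceAll("job", "attempt");
    }

    /**
     * Returns the names of leftover attempt temp files.
     * With directWrite set, no temp files were ever created, so the
     * directory is not listed at all (the early return proposed above).
     */
    static List<String> findAttemptTempFiles(TargetDir dir, String jobId,
                                             boolean directWrite) {
        List<String> matches = new ArrayList<>();
        if (directWrite) {
            return matches; // assumption: -direct never writes '.distcp.tmp' files
        }
        String prefix = tempFilePrefix(jobId);
        for (String name : dir.listNames()) { // cost grows with directory size
            if (name.startsWith(prefix)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up job ID and file names, for illustration only.
        String jobId = "job_1580000000000_0001";
        TargetDir dir = () -> Arrays.asList(
                "100.jpg",
                ".distcp.tmp.attempt_1580000000000_0001_m_000000_0");
        System.out.println(findAttemptTempFiles(dir, jobId, false));
        System.out.println(findAttemptTempFiles(dir, jobId, true));
    }
}
```

In this sketch the expensive listing is skipped entirely when directWrite is true. How the flag would actually reach deleteAttemptTempFiles (for example through the job configuration) is left open here.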

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
