Hello,

I have some directories that each contain a large number of files, and the file sizes vary widely.

When I run distcp, the smallest unit of transfer is a whole file, so some maps take much longer than others (or simply fail). I see that there is an open JIRA (https://issues.apache.org/jira/browse/MAPREDUCE-2257 - distcp.copy.by.chunk) to allow multiple maps to copy parts of the same file in parallel, which would get around this problem.

In the meantime, can anyone suggest a manual technique that I can use on the largest files in a directory to split them before running the distcp, and then concatenate them back into the original files at the other end?

Regards,
Nik
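
P.S. To make the request concrete, here is a rough Python sketch of the kind of split-and-rejoin I have in mind. It only works on local files, so it assumes each large file is first copied out of HDFS and the parts are copied back in before the distcp. The function names, the ".partNNNNN" suffix and the 64 MB part size are placeholders of my own, not anything provided by distcp or Hadoop.

```python
import shutil

CHUNK_BYTES = 64 * 1024 * 1024  # hypothetical part size, pick to suit the cluster
BUF_BYTES = 1024 * 1024         # read/write buffer so parts are never held whole in memory


def split_file(path, chunk_bytes=CHUNK_BYTES):
    """Split a local file into path.part00000, path.part00001, ...

    Returns the list of part paths in order. An empty input yields no parts.
    """
    parts = []
    with open(path, "rb") as src:
        while True:
            first = src.read(min(BUF_BYTES, chunk_bytes))
            if not first:
                break  # end of input reached
            part_path = "%s.part%05d" % (path, len(parts))
            with open(part_path, "wb") as dst:
                dst.write(first)
                written = len(first)
                # Keep copying until this part is full or the input runs out.
                while written < chunk_bytes:
                    buf = src.read(min(BUF_BYTES, chunk_bytes - written))
                    if not buf:
                        break
                    dst.write(buf)
                    written += len(buf)
            parts.append(part_path)
    return parts


def join_files(parts, dest):
    """Concatenate the parts, in the order given, into dest."""
    with open(dest, "wb") as dst:
        for part_path in parts:
            with open(part_path, "rb") as src:
                shutil.copyfileobj(src, dst, BUF_BYTES)
```

At the destination the parts would have to be joined in the same order, for example by passing the sorted part names to join_files. Something that avoids the round trip through local disk would obviously be preferable.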
