Try running the command from the namenode, or from another node that is not a datanode; the files should then be distributed across the cluster. As far as I know, when you copy a file to HDFS from a datanode, the first replica of each block is stored on that same datanode.
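
To illustrate why that leads to the imbalance you describe, here is a rough Python sketch of the placement rule. It is a simplification and not the actual HDFS code: it ignores rack awareness and free-space checks, assumes a replication factor of 3, and assumes every block is written by a single client process. The function name place_blocks and the node names are made up for the example.

```python
import random
from collections import Counter


def place_blocks(num_blocks, datanodes, writer, replication=3, seed=0):
    """Count replicas per datanode under a simplified placement rule.

    Assumption: the first replica goes to the writer if the writer is a
    datanode, otherwise to a random datanode. The remaining replicas go
    to distinct, randomly chosen other datanodes.
    """
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in datanodes})
    for _ in range(num_blocks):
        if writer in datanodes:
            first = writer  # local write: first replica stays on the writer
        else:
            first = rng.choice(datanodes)  # client outside the cluster
        others = [node for node in datanodes if node != first]
        for node in [first] + rng.sample(others, replication - 1):
            counts[node] += 1
    return counts


# Hypothetical 16-node cluster, matching the size mentioned in the thread.
nodes = [f"dn{i:02d}" for i in range(16)]

from_datanode = place_blocks(4500, nodes, writer="dn00")
from_client = place_blocks(4500, nodes, writer="edge-client")

print("writer is dn00   :", from_datanode["dn00"], "replicas on dn00,",
      max(v for k, v in from_datanode.items() if k != "dn00"),
      "at most on any other node")
print("writer is outside:", min(from_client.values()), "to",
      max(from_client.values()), "replicas per node")
```

Under these assumptions the writing datanode holds one replica of every block, while each of the other 15 nodes holds on average 2/15 of that, which is roughly the 7:1 ratio you are seeing. When the writer is not a datanode, the counts come out close to even.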

On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski <[email protected]> wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <[email protected]> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>
> OK good to know.
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>
> Understood.
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://10.10.0.198:54310/test/file
>
> where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDSF cluster. Am I
> wrong to expect this?
>
>> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
