So, the question is: do I or don't I need to run the YARN ResourceManager/NodeManager combination in addition to HDFS? My impression matches what you are saying - that HDFS is independent of the MR component.
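To illustrate Harsh's point that HDFS runs without YARN: a minimal sketch of bringing up only the HDFS daemons on a configured 2.2.0 install. This assumes `HADOOP_HOME` points at the install and the `etc/hadoop` config files (`core-site.xml`, `hdfs-site.xml`, `slaves`) are already set up; it is not runnable without a live cluster.

```shell
# Assumption: HADOOP_HOME points at a configured Hadoop 2.2.0 install.
# Start only the HDFS daemons (NameNode, DataNodes, SecondaryNameNode).
# No YARN ResourceManager/NodeManager is required for HDFS itself.
$HADOOP_HOME/sbin/start-dfs.sh

# Sanity check from a client node: this should print a true HDFS
# directory listing, not the local filesystem's root. If it shows
# local paths, the client's fs.defaultFS is likely misconfigured.
$HADOOP_HOME/bin/hdfs dfs -ls /
```

If `hdfs dfs -ls /` mirrors the local filesystem, the client is falling back to `file:///`, which would also explain files appearing only on the node where the command was run.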
Thanks! :)
Ognen

On Wed, Jan 29, 2014 at 6:37 AM, Ognen Duzlevski <[email protected]> wrote:
> Harsh,
>
> Thanks for your reply. What happens is this: I have about 70 files, all
> about 20GB in size, in an Amazon S3 bucket. I got them from the bucket in
> a for loop, file by file, using the distcp command from a single node.
>
> When I look at the distribution of space consumed on the HDFS cluster
> now, the node I ran the command on has 70% of its space taken up, while
> the rest of the nodes are at 10% local space usage. All of the nodes
> started out with the same local space of 1.6TB, mounted in the same exact
> partition /extra (ephemeral space on an Amazon instance put into a RAID0
> array).
>
> Hence, the distribution of space is not balanced.
>
> However, I did discover the start-balancer.sh script and ran it with
> -threshold 5. It has been running since yesterday; maybe the 5% balancing
> threshold is too much?
>
> Ognen
>
> On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <[email protected]> wrote:
>> I don't believe what you've been told is correct (IIUC). HDFS is an
>> independent component and does not require the presence of YARN (or MR)
>> to function correctly.
>>
>> What exactly do you mean when you say "files are only stored on the
>> node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
>> local FS / result list, or does it show a true HDFS directory listing?
>> Your problem may simply be configuring clients right, depending on
>> this.
>>
>> On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
>> <[email protected]> wrote:
>>> Hello,
>>>
>>> I have set up an HDFS cluster by running a name node and a bunch of
>>> data nodes. I ran into a problem where the files are only stored on
>>> the node that uses the hdfs command, and was told that this is because
>>> I do not have a job tracker and task nodes set up.
>>>
>>> However, the documentation for 2.2.0 does not mention any of these (at
>>> least not this page:
>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
>>> I browsed some of the earlier docs, and they do mention job tracker
>>> nodes etc.
>>>
>>> So, for 2.2.0 - what is the way to set this up? Do I need a separate
>>> machine to be the "job tracker"? Did this job tracker node change its
>>> name to something else in the current docs?
>>>
>>> Thanks,
>>> Ognen
>>
>> --
>> Harsh J
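For reference, the imbalance Ognen describes is expected behavior: when a client writing to HDFS runs on a datanode, the first replica of each block is placed on that local datanode, so copying everything from a single node fills that node first. The rebalancing step he mentions can be sketched as follows, assuming a running HDFS cluster with the `hdfs` CLI on the PATH; these commands need a live cluster and are shown here only as a sketch.

```shell
# Assumption: a running HDFS cluster, hdfs CLI on PATH.
# Show per-datanode capacity and "DFS Used%" to confirm the skew:
hdfs dfsadmin -report

# Run the balancer until each datanode's utilization is within 5
# percentage points of the cluster average. This is what
# start-balancer.sh -threshold 5 invokes under the hood; a large
# threshold finishes faster, a small one balances more tightly.
hdfs balancer -threshold 5
```

With ~70 files of ~20GB each, moving on the order of a terabyte between nodes can legitimately take many hours; the balancer also throttles itself via the `dfs.datanode.balance.bandwidthPerSec` setting, so a long run is not by itself a sign that the 5% threshold is wrong.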
