Harsh,

Thanks for your reply. What happens is this: I have about 70 files, each about 20GB in size, in an Amazon S3 bucket. I copied them from the bucket in a for loop, file by file, using the distcp command from a single node.
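For illustration only, here is a sketch of that kind of copy loop, written as a dry run that builds the commands without executing anything. The bucket name, file names, s3n:// scheme, and HDFS target directory are placeholders, not the actual values used.

```python
def build_distcp_commands(bucket, file_names, hdfs_dir):
    """Build one `hadoop distcp <src> <dst>` command per file; nothing is executed."""
    return [
        ["hadoop", "distcp", f"s3n://{bucket}/{name}", f"{hdfs_dir}/{name}"]
        for name in file_names
    ]


# Placeholder values (assumptions): 70 files in a bucket called "example-bucket".
file_names = [f"part-{i:02d}.dat" for i in range(70)]
commands = build_distcp_commands("example-bucket", file_names, "hdfs:///data")

# Show the first few commands that such a loop would run, one file at a time.
for cmd in commands[:3]:
    print(" ".join(cmd))
```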
When I look at the distribution of space consumed on the HDFS cluster now, the node I ran the command on has 70% of its space taken up, while the rest of the nodes are at 10% local space usage. All of the nodes started out with the same local space of 1.6TB, mounted on the exact same partition, /extra (ephemeral space on an Amazon instance put into a RAID0 array). Hence, the distribution of space is not balanced.

However, I did discover the start-balancer.sh script and ran it with -threshold 5. It has been running since yesterday. Maybe the 5% balancing threshold is too much?

Ognen

On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <[email protected]> wrote:

> I don't believe what you've been told is correct (IIUC). HDFS is an
> independent component and does not require presence of YARN (or MR) to
> function correctly.
>
> What do you exactly mean when you say "files are only stored on the
> node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
> local FS / result list or does it show a true HDFS directory listing?
> Your problem may simply be configuring clients right - depending on
> this.
>
> On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
> <[email protected]> wrote:
> > Hello,
> >
> > I have set up an HDFS cluster by running a name node and a bunch of data
> > nodes. I ran into a problem where the files are only stored on the node that
> > uses the hdfs command and was told that this is because I do not have a job
> > tracker and task nodes set up.
> >
> > However, the documentation for 2.2.0 does not mention any of these (at least
> > not this page:
> > http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
> > ).
> > I browsed some of the earlier docs and they do mention job tracker nodes
> > etc.
> >
> > So, for 2.2.0 - what is the way to set this up? Do I need a separate machine
> > to be the "job tracker"? Did this job tracker node change its name to
> > something else in the current docs?
> >
> > Thanks,
> > Ognen
>
> --
> Harsh J
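
A note on the -threshold question above: the balancer treats a DataNode as balanced when its utilization is within the threshold (in percentage points) of the cluster-wide average utilization. The sketch below applies that rule to the 70% and 10% figures from this message. The node count of 10 is an assumption (the message does not say how many DataNodes there are), and all nodes are assumed to have equal capacity so the cluster average is a plain mean.

```python
def nodes_outside_band(utilization_pct, threshold_pct):
    """Return (cluster average, nodes whose utilization is outside avg +/- threshold).

    Assumes equal capacity on every node, so the cluster average is the plain mean.
    """
    avg = sum(utilization_pct.values()) / len(utilization_pct)
    outside = {
        node: used
        for node, used in utilization_pct.items()
        if abs(used - avg) > threshold_pct
    }
    return avg, outside


# Hypothetical 10-node cluster: one node at 70%, nine nodes at 10%.
nodes = {"node0": 70.0}
nodes.update({f"node{i}": 10.0 for i in range(1, 10)})

avg, outside_at_5 = nodes_outside_band(nodes, threshold_pct=5.0)
_, outside_at_10 = nodes_outside_band(nodes, threshold_pct=10.0)

print(f"cluster average: {avg:.1f}%")
print(f"nodes outside the band at threshold 5:  {len(outside_at_5)}")
print(f"nodes outside the band at threshold 10: {len(outside_at_10)}")
```

With these assumed numbers the average is 16%, so at a threshold of 5 every node falls outside the band, while at 10 only the full node does. That would suggest the lower threshold leaves the balancer more data to move, though the real figures depend on the actual node count.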
