Hi, Ognen:
I noticed you asked this question before under a different subject line.
First, please tell us where you see the unbalanced space: is it on HDFS or
on the local disk?
1) HDFS is independent of MR. They are not related to each other.
2) Without MR1 or MR2 (YARN), HDFS should work on its own, which means all
HDFS commands and APIs will just work.
3) But when you copy files into HDFS using distcp, you need the MR component
(it doesn't matter whether it is MR1 or MR2), because distcp indeed uses
MapReduce to do the massively parallel file copying.
4) Your original problem is that when you ran the distcp command, you had
not started the MR component in your cluster, so distcp in fact copied your
files to the LOCAL file system, based on someone else's reply to your
original question. I haven't tested this myself, but I tend to believe it.
5) If the above is true, then on the node where you ran the distcp command
you should find those files in the local file system, under the path you
specified. You should check and verify that.
6) After you started YARN/the ResourceManager, you saw the imbalance when
you ran distcp again. Where is this imbalance: in HDFS or in the local file
system? List the commands and their outputs here, so we can understand your
problem more clearly instead of sometimes being misled by a description in
words.
7) My suggestion is that after you start YARN/the ResourceManagers, run some
of the example MR jobs that ship with Hadoop to make sure your cluster is
working normally, then try your distcp command.
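For example, a sanity check along the lines of step 7 might look like this
(the example jar path/version and the bucket/destination paths below are
assumptions; adjust them to your install):

```shell
# Run a bundled example MR job to confirm YARN/MR is actually working
# (jar path and version are assumptions for a 2.2.0 install).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 100

# If that succeeds, retry the copy; distcp launches a MapReduce job
# that spreads the copy work across the cluster's nodes.
hadoop distcp s3n://your-bucket/path hdfs:///data/

# Verify the files landed in HDFS, not on the local disk.
hdfs dfs -ls /data
```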
Thanks
Yong

Date: Wed, 29 Jan 2014 06:38:54 -0600
Subject: Re: Configuring hadoop 2.2.0
From: [email protected]
To: [email protected]

So, the question is: do I need to run the yarn/resource manager/node manager
combination in addition to HDFS, or not? My impression matched what you are
saying - that HDFS is independent of the MR component.


Thanks! :)
Ognen


On Wed, Jan 29, 2014 at 6:37 AM, Ognen Duzlevski <[email protected]> 
wrote:

Harsh,

Thanks for your reply. What happens is this: I have about 70 files, each
about 20GB in size, in an Amazon S3 bucket. I pulled them from the bucket in
a for loop, file by file, using the distcp command from a single node.
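(For context, the loop was roughly of this shape; the bucket name, file
names, and destination path are placeholders, not the actual values used:)

```shell
# Hypothetical reconstruction of the per-file copy loop;
# bucket, file names, and paths are placeholders.
for f in part-001 part-002 part-003; do
  hadoop distcp "s3n://my-bucket/${f}" "hdfs:///data/${f}"
done
```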



When I look at the distribution of space consumed on the HDFS cluster now, the 
node I ran the command on has 70% of its space taken up while the rest of the 
nodes are at 10% local space usage. All of the nodes started out with the same 
local space of 1.6TB mounted in the same exact partition /extra (ephemeral 
space on an Amazon instance put into a RAID0 array).



Hence, the distribution of space is not balanced.

However, I did discover the start-balancer.sh script and ran it with
-threshold 5. It has been running since yesterday; maybe the 5% balancing
threshold is too tight?
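For reference, the invocation was along these lines; the threshold is the
maximum allowed deviation, in percentage points, of each DataNode's disk
utilization from the cluster average:

```shell
# Start the HDFS balancer; -threshold 5 means each DataNode's usage
# must end up within 5 percentage points of the cluster average.
$HADOOP_HOME/sbin/start-balancer.sh -threshold 5

# A looser threshold (e.g. 10) finishes sooner, at the cost of a
# less even spread of blocks across the DataNodes.
```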



Ognen





On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <[email protected]> wrote:

I don't believe what you've been told is correct (IIUC). HDFS is an
independent component and does not require presence of YARN (or MR) to
function correctly.

What do you exactly mean when you say "files are only stored on the
node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
local FS result list or does it show a true HDFS directory listing?
Your problem may simply be configuring clients right - depending on
this.
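One quick way to act on Harsh's question is to compare the two listings
(the hostname and port in the config snippet are placeholders):

```shell
# If the client is configured correctly, these two should differ:
hdfs dfs -ls /     # should show the HDFS namespace
ls /               # the local root file system

# If they look the same, the client is likely falling back to the
# local file system; check fs.defaultFS in core-site.xml, e.g.:
#   <property>
#     <name>fs.defaultFS</name>
#     <value>hdfs://namenode-host:8020</value>
#   </property>
```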



On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
<[email protected]> wrote:
> Hello,
>
> I have set up an HDFS cluster by running a name node and a bunch of data
> nodes. I ran into a problem where the files are only stored on the node that
> uses the hdfs command and was told that this is because I do not have a job
> tracker and task nodes set up.
>
> However, the documentation for 2.2.0 does not mention any of these (at least
> not this page:
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
> I browsed some of the earlier docs and they do mention job tracker nodes
> etc.
>
> So, for 2.2.0 - what is the way to set this up? Do I need a separate machine
> to be the "job tracker"? Did this job tracker node change its name to
> something else in the current docs?
>
> Thanks,
> Ognen


--
Harsh J


