Re: HDFS Explained as Comics

2011-12-01 Thread Dieter Plaetinck
Very clear. The comic format indeed works quite well.
I never considered comics a serious (professional) way to get something
explained efficiently, but this shows people should think twice before they
start writing their next documentation.

One question though: if a DN has a corrupted block, why does the NN only
remove the bad DN from the block's list, and not the block from the DN's list?
(Also, does it really store the data in 2 separate tables? That looks to me
like 2 different views of the same data.)

Dieter

On Thu, 1 Dec 2011 08:53:31 +0100
Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

 Hi all,
 
 very cool comic!
 
 Thanks,
  Alex
 
 On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh
 manu.i...@gmail.com
  wrote:
 
  Hi,
 
  This is indeed a good way to explain; most of the improvements have
  already been discussed. Waiting for the sequel of this comic.
 
  Regards,
  Abhishek
 
  On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney
  mvarsh...@gmail.com
  wrote:
 
   Hi Matthew
  
   I agree with both you and Prashant. The strip needs to be modified to
   explain that these are default values that can optionally be overridden
   (which I will fix in the next iteration).

   However, from the 'understanding the concepts of HDFS' point of view, I
   still think that block size and replication factor are real strengths of
   HDFS, and learners should be exposed to them so that they get to see how
   HDFS is significantly different from conventional file systems.

   On a personal note: thanks for the first part of your message :)

   -Maneesh
  
  
   On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) 
   matthew.go...@monsanto.com wrote:
  
Maneesh,
   
Firstly, I love the comic :)
   
Secondly, I am inclined to agree with Prashant on this latest point. While
one code path could take us through the user defining command-line
overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a
person new to Hadoop. The most common flow would be using admin-determined
values from hdfs-site, and the only thing that would need to change is that
the conversation happens between client / server and not user / client.

Matt
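
For context on the -D route mentioned above: such overrides take effect because
hadoop fs, and any driver that goes through ToolRunner / GenericOptionsParser,
folds -D key=value pairs into the job Configuration before running. A minimal
sketch of such a driver, with a made-up class name and jar:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver. Run as: hadoop jar my.jar ShowReplication -D dfs.replication=2
public class ShowReplication extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();   // already contains any -D overrides
    System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new ShowReplication(), args));
  }
}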
   
-Original Message-
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics
   
Sure, it's just a case of how readers interpret it:

  1. The client is required to specify block size and replication factor
     each time.
  2. The client does not need to worry about it since an admin has set the
     properties in the default configuration files.

A client would not be allowed to override the default configs if they are
set final (well, there are ways to go around it as well, as you suggest, by
using create() :)

The information is great and helpful. I just want to make sure a beginner
who wants to write a WordCount in MapReduce does not worry about specifying
'block size' and 'replication factor' in his code.
   
Thanks,
Prashant
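
For readers who do want a per-file override, the create() route alluded to
above looks roughly like the following sketch through the public FileSystem
API; the path, buffer size, and values here are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithOverrides {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Per-file overrides: replication 2 and a 128 MB block size for this one file.
    FSDataOutputStream out = fs.create(
        new Path("/user/demo/sample.txt"),      // hypothetical path
        true,                                   // overwrite if it exists
        4096,                                   // io buffer size
        (short) 2,                              // replication factor
        128L * 1024 * 1024);                    // block size in bytes
    out.writeUTF("hello hdfs");
    out.close();
  }
}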
   
On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney
mvarsh...@gmail.com
wrote:
   
 Hi Prashant

 Others may correct me if I am wrong here..

 The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of block size
 and replication factor. In the source code, I see the following in the
 DFSClient constructor:

     defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
     defaultReplication = (short) conf.getInt("dfs.replication", 3);

 My understanding is that the client considers the following chain for the
 values:
 1. Manual values (the long-form constructor; when a user provides these
    values)
 2. Configuration file values (these are cluster-level defaults:
    dfs.block.size and dfs.replication)
 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

 Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol, the API to
 create a file is

     void create(..., short replication, long blocksize);

 I presume it means that the client already has knowledge of these values
 and passes them to the NameNode when creating a new file.
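
As a rough illustration of that fallback chain (this is not the actual
DFSClient code, just a sketch that uses the same keys and defaults):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeDefaults {
  // Hardcoded fallback, 64 MB here to match the old default.
  static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

  // 1. an explicit value wins, 2. then dfs.block.size from the config,
  // 3. then the hardcoded default.
  static long resolveBlockSize(Configuration conf, Long explicit) {
    if (explicit != null) {
      return explicit;
    }
    return conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();   // loads *-site.xml files found on the classpath
    System.out.println(resolveBlockSize(conf, null));                 // cluster default, else 64 MB
    System.out.println(resolveBlockSize(conf, 128L * 1024 * 1024));   // explicit 128 MB override
  }
}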

 Hope that helps.

 thanks
 -Maneesh

 On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi 
prash1...@gmail.com
 wrote:

  Thanks Maneesh.
 
  Quick question: does a client really need to know block size and
  replication factor? A lot of times the client has no control over these
  (they are set at the cluster level).
 
  -Prashant Kommireddi
 
  On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges 
   dejan.men...@gmail.com
  wrote:
 
   Hi 

RE: HDFS Explained as Comics

2011-12-01 Thread Ravi teja ch n v
That's indeed a great piece of work, Maneesh... Waiting for the MapReduce comic :)

Regards,
Ravi Teja

Re: HDFS Explained as Comics

2011-12-01 Thread maneesh varshney
Hi Dieter

Very clear.  The comic format works indeed quite well.
 I never considered comics as a serious (professional) way to get
 something explained efficiently,
 but this shows people should think twice before they start writing their
 next documentation.


Thanks! :)


 one question though: if a DN has a corrupted block, why does the NN only
 remove the bad DN from the block's list, and not the block from the DN list?


You are right. This needs to be fixed.


 (also, does it really store the data in 2 separate tables?  This looks to
 me like 2 different views of the same data?)


Actually, it's more than two tables... I have personally found the data
structures rather contrived.

In the org.apache.hadoop.hdfs.server.namenode package, information is kept
in multiple places:
- INodeFile, which has a list of blocks for a given file
- FSNamesystem, which has a map of block -> {inode, datanodes}
- BlockInfo, which stores information in a rather strange manner:

/**
 * This array contains triplets of references.
 * For each i-th data-node the block belongs to
 * triplets[3*i] is the reference to the DatanodeDescriptor
 * and triplets[3*i+1] and triplets[3*i+2] are references
 * to the previous and the next blocks, respectively, in the
 * list of blocks belonging to this data-node.
 */

private Object[] triplets;
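
To make the triplets layout concrete, here is a simplified, hypothetical
sketch of that indexing scheme (not the real BlockInfo class, just the
arithmetic described in the comment above):

// Simplified, hypothetical illustration of the triplets indexing scheme.
public class TripletsSketch {
  // For the i-th datanode holding this block:
  //   triplets[3*i]   = the datanode descriptor
  //   triplets[3*i+1] = the previous block in that datanode's block list
  //   triplets[3*i+2] = the next block in that datanode's block list
  private final Object[] triplets;

  public TripletsSketch(int replication) {
    this.triplets = new Object[3 * replication];
  }

  Object getDatanode(int i) { return triplets[3 * i]; }
  Object getPrevious(int i) { return triplets[3 * i + 1]; }
  Object getNext(int i)     { return triplets[3 * i + 2]; }

  void setDatanode(int i, Object dn) { triplets[3 * i] = dn; }
  void setPrevious(int i, Object b)  { triplets[3 * i + 1] = b; }
  void setNext(int i, Object b)      { triplets[3 * i + 2] = b; }
}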


Re: Issue : Hadoop mapreduce job to process S3 logs gets hung at INFO mapred.JobClient: map 0% reduce 0%

2011-12-01 Thread Nitika Gupta
Regarding the tasktracker user logs, there is nothing interesting there.
That is the thing: the tasktracker did not pick up the task that was
assigned to it.

Any idea why the tasktracker is not picking up the task?

Thanks

Nitika

On Mon, Nov 28, 2011 at 9:53 PM, Prashant Sharma
prashant.ii...@gmail.com wrote:
 Can you check your userlogs/xyz_attempt_xyz.log and also jobtracker and
 datanode logs.

 -P

 On Tue, Nov 29, 2011 at 4:17 AM, Nitika Gupta ngu...@rocketfuelinc.comwrote:

 Hi All,

 I am trying to run a mapreduce job to process the Amazon S3 logs.
 However, the code hangs at INFO mapred.JobClient: map 0% reduce 0% and
 does not even attempt to launch the tasks. The sample code for the job
 setup is given below:

 public int run(CommandLine cl) throws Exception
 {
     Configuration conf = getConf();
     String inputPath = "";
     String outputPath = "";
     try
     {
         Job job = new Job(conf, "Dummy");
         job.setNumReduceTasks(0);
         job.setMapperClass(Mapper.class);
         inputPath = cl.getOptionValue("input");    // "input" is an s3n path
         outputPath = cl.getOptionValue("output");
         FileInputFormat.setInputPaths(job, inputPath);
         FileOutputFormat.setOutputPath(job, new Path(outputPath));
         _log.info("Input path set as " + inputPath);
         _log.info("Output path set as " + outputPath);
         job.waitForCompletion(true);
         return 0;
     }
     catch (Exception ex)
     {
         _log.error(ex);
         return 1;
     }
 }
 The above code works on the staging machine. However, it fails on the
 production machine, which is the same as the staging machine but with more
 capacity.

 Job Run:
 11/11/22 16:13:38 INFO Driver: Input path being processed is
 s3n://abc//mm/dd/*
 11/11/22 16:13:38 INFO Driver: Output path being processed is
 s3n://xyz//mm/dd/00/
 11/11/22 16:13:51 INFO mapred.FileInputFormat: Total input paths to
 process : 399
 11/11/22 16:13:53 INFO mapred.JobClient: Running job:
 job_20151645_14535
 11/11/22 16:13:54 INFO mapred.JobClient:  map 0% reduce 0%

 --- At this point, it hangs. The job submission goes fine and I can see
 messages in the jobtracker logs that the task assignment has happened fine.
 By that I mean the log says
 "Adding task (MAP) 'attempt_20262339_1974_r_40_1' to tip
 task_20262339_1974_r_40, for tracker
 'tracker_xx.xx.xx:localhost/127.0.0.1:47937'"
 But if I go to the tasktracker logs (to which the task was assigned), I do
 not see any mention of this attempt, which hints that the tasktracker did
 not pick this task(?).
 We are using the fair scheduler, if that has something to do with it.

 I tried to validate whether it is an issue with the connection to S3. So I
 ran a distcp from S3 to HDFS and it went fine, which hints that there are
 no connectivity issues.

 Does anyone know what could be the possible reason for the error?

 Thanks in advance!

 Nitika



Utilizing multiple hard disks for hadoop HDFS ?

2011-12-01 Thread praveenesh kumar
Hi everyone,

So I have this blade server with 4x500 GB hard disks.
I want to use all these hard disks for hadoop HDFS.
How can I achieve this target ?

If I install Hadoop on 1 hard disk and use the other hard disks as normal
partitions, e.g. --

/dev/sda1, -- HDD 1 -- Primary partition -- Linux + Hadoop installed on it
/dev/sda2, -- HDD 2 -- Mounted partition -- /mnt/dev/sda2
/dev/sda3, -- HDD3  -- Mounted partition -- /mnt/dev/sda3
/dev/sda4, -- HDD4  -- Mounted partition -- /mnt/dev/sda4

And if I create a hadoop.tmp.dir on each partition say --
/tmp/hadoop-datastore/hadoop-hadoop

and on core-site.xml, if I configure like --
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-datastore/hadoop-hadoop,/mnt/dev/sda2/tmp/hadoop-datastore/hadoop-hadoop,/mnt/dev/sda3/tmp/hadoop-datastore/hadoop-hadoop,/mnt/dev/sda4/tmp/hadoop-datastore/hadoop-hadoop</value>
  <description>A base for other temporary directories.</description>
</property>

Will it work ??

Can I set the above property for dfs.data.dir also ?

Thanks,
Praveenesh


Re: Utilizing multiple hard disks for hadoop HDFS ?

2011-12-01 Thread Harsh J
You need to apply comma-separated lists only to dfs.data.dir (HDFS) and 
mapred.local.dir (MR) directly. Make sure the subdirectories are different for 
each, else you may accidentally wipe away your data when you restart MR 
services.

The hadoop.tmp.dir property does not accept multiple paths, and you should
avoid using it in production; it's more of a utility property that acts as a
default base path for other properties.
