Re: HDFS Explained as Comics
Very clear. The comic format indeed works quite well. I had never considered comics a serious (professional) way to explain something efficiently, but this shows people should think twice before they start writing their next piece of documentation.

One question, though: if a DN has a corrupted block, why does the NN only remove the bad DN from the block's list, and not the block from the DN's list? (Also, does it really store the data in two separate tables? It looks to me like two different views of the same data.)

Dieter

On Thu, 1 Dec 2011 08:53:31 +0100, Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

Hi all, very cool comic! Thanks, Alex

On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

Hi, this is indeed a good way to explain; most of the improvements have already been discussed. Waiting for the sequel of this comic. Regards, Abhishek

On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Matthew, I agree with both you and Prashant. The strip needs to be modified to explain that these are default values that can optionally be overridden (which I will fix in the next iteration). However, from an 'understanding the concepts of HDFS' point of view, I still think that block size and replication factor are among the real strengths of HDFS, and learners should be exposed to them so that they see how HDFS differs significantly from conventional file systems. On a personal note: thanks for the first part of your message :) -Maneesh

On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

Maneesh, firstly, I love the comic :) Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command-line overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a person new to Hadoop. The most common flow would be using admin-determined values from hdfs-site, and the only thing that would need to change is that the conversation happens between client / server rather than user / client. Matt

-----Original Message-----
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Sure, it's just a case of how readers interpret it:

1. The client is required to specify block size and replication factor each time.
2. The client does not need to worry about it, since an admin has set the properties in the default configuration files.

A client would not be allowed to override the default configs if they are set final (well, there are ways to get around that as well, as you suggest, by using create() :). The information is great and helpful. I just want to make sure a beginner who wants to write a WordCount in MapReduce does not worry about specifying block size and replication factor in his code. Thanks, Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Prashant, others may correct me if I am wrong here. The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block size and replication factor. In the source code, I see the following in the DFSClient constructor:

defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
defaultReplication = (short) conf.getInt("dfs.replication", 3);

My understanding is that the client considers the following chain for the values:

1. Manual values (the long-form constructor, when a user provides these values).
2. Configuration file values (these are cluster-level defaults: dfs.block.size and dfs.replication).
3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3).

Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to create a file is void create(..., short replication, long blocksize); I presume this means that the client already knows these values and passes them to the NameNode when creating a new file. Hope that helps. Thanks, -Maneesh

On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com wrote:

Thanks Maneesh. Quick question: does a client really need to know the block size and replication factor? A lot of the time the client has no control over these (they are set at the cluster level). -Prashant Kommireddi

On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com wrote: Hi
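[For anyone following along, here is a minimal sketch of the override chain Maneesh describes above, assuming the Hadoop 1.x-era FileSystem API. Only the property names (dfs.block.size, dfs.replication) and the existence of a long-form create() come from the discussion; the path, buffer size, and the specific block size / replication values are made up for illustration.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeOverrideExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1: override the cluster-level defaults that DFSClient reads.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks (illustrative)
        conf.setInt("dfs.replication", 2);                  // 2 replicas (illustrative)

        FileSystem fs = FileSystem.get(conf);

        // Option 2: pass the values explicitly via the long-form create(),
        // which sends them to the NameNode for this file only.
        FSDataOutputStream out = fs.create(
                new Path("/tmp/example.txt"),  // hypothetical path
                true,                          // overwrite if it exists
                4096,                          // I/O buffer size
                (short) 2,                     // replication factor for this file
                64L * 1024 * 1024);            // 64 MB block size for this file
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}

[If neither is set, the hardcoded defaults mentioned above (DEFAULT_BLOCK_SIZE and 3) apply, which is why a beginner writing WordCount never has to think about them.]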
RE: HDFS Explained as Comics
That's indeed a great piece of work, Maneesh... Waiting for the MapReduce comic :)

Regards,
Ravi Teja

From: Dieter Plaetinck [dieter.plaeti...@intec.ugent.be]
Sent: 01 December 2011 15:11:36
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Very clear. The comic format indeed works quite well. [...]
Re: HDFS Explained as Comics
Hi Dieter,

> Very clear. The comic format indeed works quite well. I had never considered comics a serious (professional) way to explain something efficiently, but this shows people should think twice before they start writing their next piece of documentation.

Thanks! :)

> One question, though: if a DN has a corrupted block, why does the NN only remove the bad DN from the block's list, and not the block from the DN's list?

You are right. This needs to be fixed.

> (Also, does it really store the data in two separate tables? It looks to me like two different views of the same data.)

Actually it's more than two tables... I have personally found the data structures rather contrived. In the org.apache.hadoop.hdfs.server.namenode package, the information is kept in multiple places:

- INodeFile, which has the list of blocks for a given file
- FSNamesystem, which has a map of block -> {inode, datanodes}
- BlockInfo, which stores information in a rather strange manner:

/**
 * This array contains triplets of references.
 * For each i-th data-node the block belongs to,
 * triplets[3*i] is the reference to the DatanodeDescriptor,
 * and triplets[3*i+1] and triplets[3*i+2] are references
 * to the previous and the next blocks, respectively, in the
 * list of blocks belonging to this data-node.
 */
private Object[] triplets;

On Thu, 1 Dec 2011 08:53:31 +0100, Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

Hi all, very cool comic! Thanks, Alex [...]
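[To make the triplets layout a bit more concrete, here is a small self-contained sketch in plain Java. SimpleBlockInfo is a made-up class for illustration, not the real org.apache.hadoop.hdfs.server.namenode.BlockInfo; it only shows the indexing scheme described in the comment above.]

/**
 * Simplified illustration of the "triplets" idea.
 * For the i-th datanode that holds this block:
 *   slots[3*i]   -> the datanode itself
 *   slots[3*i+1] -> the previous block in that datanode's block list
 *   slots[3*i+2] -> the next block in that datanode's block list
 */
class SimpleBlockInfo {
    private final Object[] slots;

    SimpleBlockInfo(int maxReplicas) {
        slots = new Object[3 * maxReplicas];
    }

    void setDatanode(int i, Object datanode)      { slots[3 * i] = datanode; }
    Object getDatanode(int i)                     { return slots[3 * i]; }

    void setPrevious(int i, SimpleBlockInfo prev) { slots[3 * i + 1] = prev; }
    SimpleBlockInfo getPrevious(int i)            { return (SimpleBlockInfo) slots[3 * i + 1]; }

    void setNext(int i, SimpleBlockInfo next)     { slots[3 * i + 2] = next; }
    SimpleBlockInfo getNext(int i)                { return (SimpleBlockInfo) slots[3 * i + 2]; }
}

[The effect is that one array serves both of the "views" Dieter asked about: the 3*i slots enumerate the datanodes holding a block, while following the prev/next references threads through all blocks stored on a given datanode, without keeping a second table.]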
Re: Issue : Hadoop mapreduce job to process S3 logs gets hung at INFO mapred.JobClient: map 0% reduce 0%
Regarding the user logs of the tasktracker, there is nothing interesting there. That is the thing: the tasktracker did not pick up the task that was assigned to it. Any idea why the mapper is not picking up the task?

Thanks,
Nitika

On Mon, Nov 28, 2011 at 9:53 PM, Prashant Sharma prashant.ii...@gmail.com wrote:

Can you check your userlogs/xyz_attempt_xyz.log, and also the jobtracker and datanode logs? -P

On Tue, Nov 29, 2011 at 4:17 AM, Nitika Gupta ngu...@rocketfuelinc.com wrote:

Hi All, I am trying to run a MapReduce job to process the Amazon S3 logs. However, the job hangs at "INFO mapred.JobClient: map 0% reduce 0%" and does not even attempt to launch the tasks. The sample code for the job setup is given below:

public int run(CommandLine cl) throws Exception {
    Configuration conf = getConf();
    String inputPath = "";
    String outputPath = "";
    try {
        Job job = new Job(conf, "Dummy");
        job.setNumReduceTasks(0);
        job.setMapperClass(Mapper.class);
        inputPath = cl.getOptionValue("input");   // input is an s3n path
        outputPath = cl.getOptionValue("output");
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        _log.info("Input path set as " + inputPath);
        _log.info("Output path set as " + outputPath);
        job.waitForCompletion(true);
        return 0;
    } catch (Exception ex) {
        _log.error(ex);
        return 1;
    }
}

The above code works on the staging machine. However, it fails on the production machine, which is the same as the staging machine but with more capacity.

Job run:

11/11/22 16:13:38 INFO Driver: Input path being processed is s3n://abc//mm/dd/*
11/11/22 16:13:38 INFO Driver: Output path being processed is s3n://xyz//mm/dd/00/
11/11/22 16:13:51 INFO mapred.FileInputFormat: Total input paths to process : 399
11/11/22 16:13:53 INFO mapred.JobClient: Running job: job_20151645_14535
11/11/22 16:13:54 INFO mapred.JobClient: map 0% reduce 0%

At this point, it hangs. The job submission goes fine and I can see messages in the jobtracker logs showing that task assignment has happened. By that I mean the log says:

Adding task (MAP) 'attempt_20262339_1974_r_40_1' to tip task_20262339_1974_r_40, for tracker 'tracker_xx.xx.xx:localhost/127.0.0.1:47937'

But if I go to the logs of the tasktracker to which the task was assigned, I do not see any mention of this attempt, which hints that the tasktracker did not pick up this task(?). We are using the fair scheduler, if that has something to do with it. I also tried to check whether it is an issue with the connection to S3, so I ran a distcp from S3 to HDFS and it went fine, which suggests there are no connectivity issues.

Does anyone know what could be the possible reason for the error? Thanks in advance!

Nitika
Utilizing multiple hard disks for hadoop HDFS ?
Hi everyone,

So I have this blade server with 4x500 GB hard disks. I want to use all of these hard disks for Hadoop HDFS. How can I achieve this?

Suppose I install Hadoop on one hard disk and use the other hard disks as normally mounted partitions, e.g.:

/dev/sda1 -- HDD 1 -- Primary partition -- Linux + Hadoop installed on it
/dev/sda2 -- HDD 2 -- Mounted partition -- /mnt/dev/sda2
/dev/sda3 -- HDD 3 -- Mounted partition -- /mnt/dev/sda3
/dev/sda4 -- HDD 4 -- Mounted partition -- /mnt/dev/sda4

And suppose I create a hadoop.tmp.dir on each partition, say /tmp/hadoop-datastore/hadoop-hadoop, and in core-site.xml I configure:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-datastore/hadoop-hadoop,/mnt/dev/sda2/tmp/hadoop-datastore/hadoop-hadoop,/mnt/dev/sda3/tmp/hadoop-datastore/hadoop-hadoop,/mnt/dev/sda4/tmp/hadoop-datastore/hadoop-hadoop</value>
  <description>A base for other temporary directories.</description>
</property>

Will it work? Can I set the above property for dfs.data.dir as well?

Thanks,
Praveenesh
Re: Utilizing multiple hard disks for hadoop HDFS ?
You need to apply comma-separated lists only to dfs.data.dir (HDFS) and mapred.local.dir (MR) directly. Make sure the subdirectories are different for each, or else you may accidentally wipe away your data when you restart the MR services. The hadoop.tmp.dir property does not accept multiple paths, and you should avoid using it in production; it's more of a utility property that acts as a default base path for other properties.

On 02-Dec-2011, at 11:05 AM, praveenesh kumar wrote:

Hi everyone, so I have this blade server with 4x500 GB hard disks. I want to use all of these hard disks for Hadoop HDFS. How can I achieve this? [...]
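[As a rough illustration of the advice above, the relevant entries might look something like the following. The directory names are illustrative choices, not values from this thread; the only point is that dfs.data.dir and mapred.local.dir take comma-separated lists, with a distinct subdirectory per disk for each service.]

<!-- hdfs-site.xml: one DataNode storage directory per disk (paths illustrative) -->
<property>
  <name>dfs.data.dir</name>
  <value>/tmp/hadoop-datastore/dfs/data,/mnt/dev/sda2/dfs/data,/mnt/dev/sda3/dfs/data,/mnt/dev/sda4/dfs/data</value>
</property>

<!-- mapred-site.xml: separate local scratch directories for MapReduce on the same disks -->
<property>
  <name>mapred.local.dir</name>
  <value>/tmp/hadoop-datastore/mapred/local,/mnt/dev/sda2/mapred/local,/mnt/dev/sda3/mapred/local,/mnt/dev/sda4/mapred/local</value>
</property>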