Re: Hadoop Comic
Hi Jaganadh

I am the author of this comic strip. Please feel free to redistribute it as you see fit; I release the content under the Creative Commons Attribution-ShareAlike license.

I would also like to thank everybody for the nice feedback and encouragement! This was a little experiment on my part to see whether a visual format can be used to explain protocols and algorithms. I am teaching myself WordPress these days so that I can host other comics on this and other topics in computer science.

Thanks
-Maneesh

On Wed, Dec 7, 2011 at 2:06 AM, JAGANADH G jagana...@gmail.com wrote:

Hi

Is it redistributable or sharable in a blog or the like?

--
JAGANADH G
http://jaganadhg.in
ILUGCBE
http://ilugcbe.psgkriya.org
Re: HDFS Explained as Comics
Hi Dieter

> Very clear. The comic format works indeed quite well. I never considered
> comics as a serious (professional) way to get something explained
> efficiently, but this shows people should think twice before they start
> writing their next documentation. Thanks! :)
>
> One question though: if a DN has a corrupted block, why does the NN only
> remove the bad DN from the block's list, and not the block from the DN
> list?

You are right. This needs to be fixed.

> (Also, does it really store the data in 2 separate tables? This looks to
> me like 2 different views of the same data?)

Actually, it's more than two tables... I have personally found the data structures rather contrived. In the org.apache.hadoop.hdfs.server.namenode package, information is kept in multiple places:

- INodeFile, which has the list of blocks for a given file
- FSNamesystem, which has a map of block -> {inode, datanodes}
- BlockInfo, which stores information in a rather strange manner:

    /**
     * This array contains triplets of references.
     * For each i-th data-node the block belongs to
     * triplets[3*i] is the reference to the DatanodeDescriptor
     * and triplets[3*i+1] and triplets[3*i+2] are references
     * to the previous and the next blocks, respectively, in the
     * list of blocks belonging to this data-node.
     */
    private Object[] triplets;

On Thu, 1 Dec 2011 08:53:31 +0100, Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

Hi all,

Very cool comic!

Thanks,
Alex

On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

Hi,

This is indeed a good way to explain; most of the improvements have already been discussed. Waiting for a sequel of this comic.

Regards,
Abhishek

On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Matthew

I agree with both you and Prashant. The strip needs to be modified to explain that these are default values that can be optionally overridden (which I will fix in the next iteration).

However, from the 'understanding the concepts of HDFS' point of view, I still think that block size and replication factor are the real strengths of HDFS, and learners must be exposed to them so that they get to see how HDFS is significantly different from conventional file systems.

On a personal note: thanks for the first part of your message :)

-Maneesh

On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

Maneesh,

Firstly, I love the comic :)

Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command-line overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a person new to Hadoop. The most common flow would be using admin-determined values from hdfs-site, and the only thing that would need to change is that the conversation happens between client / server rather than user / client.

Matt

-----Original Message-----
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Sure, it's just a case of how readers interpret it:

1. The client is required to specify the block size and replication factor each time, or
2. The client does not need to worry about them, since an admin has set the properties in the default configuration files.

A client would not be allowed to override the default configs if they are set final (well, there are ways to get around that as well, as you suggest, by using create() :)

The information is great and helpful. Just want to make sure a beginner who wants to write a WordCount in MapReduce does not have to worry about specifying block size and replication factor in his code.

Thanks,
Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Prashant

Others may correct me if I am wrong here... The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block size and replication factor. In the source code, I see the following in the DFSClient constructor:

    defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
    defaultReplication = (short) conf.getInt("dfs.replication", 3);

My understanding is that the client considers the following chain for the values:

1. Manual values (the long-form constructor, when a user provides these values)
2. Configuration file values (these are cluster-level defaults: dfs.block.size and dfs.replication)
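To make the triplets layout described above concrete, here is a minimal, hypothetical sketch of how such an array is indexed (the names BlockInfoSketch, datanode, previousBlock, and nextBlock are mine; the real BlockInfo class carries far more state and bookkeeping):

    // Hypothetical sketch, not the actual Hadoop class.
    class BlockInfoSketch {
        // For the i-th replica of this block:
        //   triplets[3*i]     -> the DatanodeDescriptor storing the replica
        //   triplets[3*i + 1] -> the previous block in that datanode's block list
        //   triplets[3*i + 2] -> the next block in that datanode's block list
        private final Object[] triplets;

        BlockInfoSketch(int replication) {
            triplets = new Object[3 * replication];
        }

        Object datanode(int i)      { return triplets[3 * i]; }
        Object previousBlock(int i) { return triplets[3 * i + 1]; }
        Object nextBlock(int i)     { return triplets[3 * i + 2]; }
    }

In other words, each block embeds, per replica, a node of a doubly-linked list of all blocks stored on that datanode, which is why the structure feels contrived at first sight.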
HDFS Explained as Comics
For your reading pleasure! PDF (3.3 MB) uploaded at (the mailing list has a cap of 1 MB on attachments):

https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1

I would appreciate it if you could spare some time to peruse this little experiment of mine to use comics as a medium to explain computer science topics. This particular issue explains the protocols and internals of HDFS.

I am eager to hear your opinions on the usefulness of this visual medium to teach complex protocols and algorithms.

[My personal motivations: I have always found text descriptions to be too verbose, as a lot of effort is spent putting the concepts in their proper time-space context (which can easily be avoided in a visual medium); sequence diagrams are unwieldy for non-trivial protocols, and they do not explain concepts; and finally, animations/videos happen too fast and do not offer a self-paced learning experience.]

All forms of criticism, comments (and encouragement) welcome :)

Thanks
Maneesh
Re: HDFS Explained as Comics
Hi Prashant

Others may correct me if I am wrong here... The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block size and replication factor. In the source code, I see the following in the DFSClient constructor:

    defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
    defaultReplication = (short) conf.getInt("dfs.replication", 3);

My understanding is that the client considers the following chain for the values:

1. Manual values (the long-form constructor, when a user provides these values)
2. Configuration file values (these are cluster-level defaults: dfs.block.size and dfs.replication)
3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol, the API to create a file is

    void create(..., short replication, long blocksize);

I presume this means that the client already has knowledge of these values and passes them to the NameNode when creating a new file.

Hope that helps.

Thanks
-Maneesh

On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com wrote:

Thanks Maneesh. Quick question: does a client really need to know the block size and replication factor? A lot of times the client has no control over these (they are set at the cluster level).

-Prashant Kommireddi

On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com wrote:

Hi Maneesh,

Thanks a lot for this! Just distributed it over the team and the comments are great :)

Best regards,
Dejan

On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com wrote:

For your reading pleasure! [...]
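As a small illustration of that chain, here is a minimal sketch (my own code, not the actual DFSClient; it assumes the era's property names dfs.block.size and dfs.replication and mirrors the 64 MB default of the time):

    import org.apache.hadoop.conf.Configuration;

    public class ClientDefaultsSketch {
        // Fallback mirroring DFSClient's hardcoded DEFAULT_BLOCK_SIZE (64 MB at the time)
        static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

        public static void main(String[] args) {
            // Loads core-site.xml / hdfs-site.xml from the classpath, if present
            Configuration conf = new Configuration();

            // Steps 2 and 3 of the chain: config file value, else hardcoded default.
            // Step 1 (manual values) would simply bypass this lookup.
            long blockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
            short replication = (short) conf.getInt("dfs.replication", 3);

            System.out.println("block size = " + blockSize
                    + ", replication = " + replication);
        }
    }

And to Matthew's point about command-line overrides: the same chain is what lets a user pass generic options on the command line, e.g.

    hadoop fs -D dfs.block.size=134217728 -D dfs.replication=2 -put foo bar

assuming the properties are not marked final in the cluster configuration.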
Hadoop Discrete Event Simulator
Hello,

Perhaps somebody can point out whether there have been efforts to simulate Hadoop clusters. What I mean is a discrete event simulator that models the hosts and the networks and runs the Hadoop algorithms on some synthetic workload, similar to network simulators (for example, ns-2).

If such a tool is available, I was hoping to use it for:

a. Getting a general sense of how the HDFS and MapReduce algorithms work. For example, if I were to store 1 TB of data over 100 nodes, how would the blocks get distributed?

b. Optimizing my configuration parameters. For example, studying the relationship between performance and the number of cluster nodes, or the number of replicas, and so on.

The need for point b. above is to be able to study/analyze the performance without (or before) actually running the algorithms on an actual cluster.

Thanks in advance,
Maneesh

PS: I apologize if this question has been asked earlier. I could not seem to locate the search feature in the mailing list archive.
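Lacking such a tool, a back-of-the-envelope model already gives a feel for point a.: at the 64 MB default block size, 1 TB is 16,384 blocks, or 49,152 replicas at replication factor 3, i.e. roughly 491 replicas (about 30 GB) per node if placement were uniform. A toy sketch of such a simulation (entirely hypothetical code of mine; real HDFS placement is rack-aware and prefers the writer's local node) might look like:

    import java.util.Random;

    // Toy model of HDFS block placement: each replica goes to a distinct,
    // uniformly random node. Real HDFS is rack-aware and writes the first
    // replica on the local node, so this is only a rough approximation.
    public class ToyHdfsPlacement {
        public static void main(String[] args) {
            final int nodes = 100;
            final long fileSize = 1L << 40;           // 1 TB
            final long blockSize = 64L * 1024 * 1024; // 64 MB
            final int replication = 3;

            long blocks = (fileSize + blockSize - 1) / blockSize; // 16,384
            int[] replicasPerNode = new int[nodes];
            Random rnd = new Random(42);

            for (long b = 0; b < blocks; b++) {
                boolean[] used = new boolean[nodes];
                for (int r = 0; r < replication; r++) {
                    int n;
                    do { n = rnd.nextInt(nodes); } while (used[n]);
                    used[n] = true;
                    replicasPerNode[n]++;
                }
            }

            int min = Integer.MAX_VALUE, max = 0;
            for (int c : replicasPerNode) {
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
            System.out.printf("blocks=%d, avg=%.1f, min=%d, max=%d replicas/node%n",
                    blocks, blocks * (double) replication / nodes, min, max);
        }
    }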