Re: namenode restart
Hi Rita, What kind of clients do you have? Please make sure that no jobs are running and, in particular, that no data is being copied while you do the restart. Also, are you restarting by using stop-dfs.sh and start-dfs.sh? Regards, Abhishek On Thu, Jun 7, 2012 at 3:29 AM, Rita rmorgan...@gmail.com wrote: Running Hadoop 0.22 and I need to restart the namenode so my new rack configuration will be put into place. I am thinking of doing a quick stop and start of the namenode, but what will happen to the current clients? Can they tolerate a 30 second hiccup by retrying? -- --- Get your facts first, then you can distort them as you please. --
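A quick, hedged sketch (mine, not from the thread) of one way to confirm that no MapReduce jobs are in flight before bouncing the namenode, using the old mapred JobClient API; it assumes the cluster's mapred-site.xml (with the JobTracker address) is on the classpath:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class SafeToRestartCheck {
  public static void main(String[] args) throws Exception {
    // Connects to the JobTracker configured in mapred-site.xml.
    JobClient client = new JobClient(new JobConf());
    JobStatus[] running = client.jobsToComplete(); // jobs still preparing or running
    if (running == null || running.length == 0) {
      System.out.println("No jobs in flight -- safer to restart the namenode now.");
    } else {
      System.out.println(running.length + " job(s) still running -- wait before restarting.");
    }
  }
}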
Re: Ideal file size
Almost all the answers have already been provided in this post. My 2 cents: try to keep the file size a multiple of the block size, so that fewer mappers are launched during processing and the job performs better. You can also merge files in HDFS later on for processing. Regards, Abhishek On Thu, Jun 7, 2012 at 7:29 AM, M. C. Srivas mcsri...@gmail.com wrote: On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas mcsri...@gmail.com wrote: There are many more factors to consider than just the size of the file. How long can you wait before you *have to* process the data? 5 minutes? 5 hours? 5 days? If you want good timeliness, you need to roll over faster. The longer you wait: 1. the lesser the load on the NN, 2. but the poorer the timeliness, 3. and the larger the chance of lost data (ie, the data is not saved until the file is closed and rolled over, unless you want to sync() after every write). To begin with, I was going to use Flume and specify a rollover file size. I understand the above parameters; I just want to ensure that too many small files don't cause problems on the NameNode. For instance, there would be times when we get GBs of data in an hour and at times only a few hundred MB. From what Harsh, Edward and you have described, it doesn't cause issues with the NameNode but rather an increase in processing time if there are too many small files. Looks like I need to find that balance. It would also be interesting to see how others solve this problem when not using Flume. They use NFS with MapR. Any and all log-rotators (like the one in log4j) simply work over NFS, and MapR does not have a NN, so the problems with small files or number of files do not exist. On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: We have a continuous flow of data into the sequence file. I am wondering what would be the ideal file size before the file gets rolled over. I know too many small files are not good, but could someone tell me what would be the ideal size such that it doesn't overload the NameNode.
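As a rough illustration of the "multiple of the block size" suggestion above, here is a hedged sketch (mine, not from the thread) that reads the cluster's configured default block size and derives a roll-over threshold from it; the multiplier of 4 and the idea of rolling at that size are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RollSizeHint {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Default HDFS block size configured for this cluster (e.g. 64 MB or 128 MB).
    long blockSize = fs.getDefaultBlockSize();
    // Illustrative choice: roll the sequence file once it reaches roughly 4 blocks.
    long rollOverBytes = 4 * blockSize;
    System.out.println("Roll files at roughly " + rollOverBytes
        + " bytes (" + (rollOverBytes >> 20) + " MB)");
  }
}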
Re: dfs.replication factor for MR jobs
Hi Aishwarya, The temporary output of the mappers is consumed by the reducers, and the number of reduce tasks has nothing to do with the replication factor. It is most likely writing to three nodes because at least three keys were generated by the mappers and the corresponding reduce tasks were assigned to three different nodes. Regards, Abhishek On Thu, May 17, 2012 at 2:06 PM, Aishwarya Venkataraman avenk...@cs.ucsd.edu wrote: Hello, I have a 4-node cluster: one namenode and 3 other datanodes. I want to explicitly set the dfs.replication factor to 1 in order to run some experiments. I tried setting this via the hdfs-site.xml file and via the command line as well (hadoop dfs -setrep -R -w 1 /). But I have a feeling that the replication factor that HDFS is seeing is 3. It seems to be writing the temporary mapper outputs to all the 3 datanodes. Is this the default configuration for MR jobs? If not, how can I set this to 1? Thanks, Aishwarya
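For reference, a small hedged sketch (mine, not from the thread) of the two common ways to force a replication factor of 1: per-client via the configuration, and retroactively on existing files via FileSystem.setReplication; the paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationToOne {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side override: files created through this configuration get one replica.
    conf.setInt("dfs.replication", 1);
    FileSystem fs = FileSystem.get(conf);
    // Retroactive change for data that already exists (similar in effect to
    // "hadoop dfs -setrep -w 1 <path>" for a single path).
    fs.setReplication(new Path("/user/aishwarya/experiment"), (short) 1);
  }
}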
Re: is hadoop suitable for us?
Hi, Regarding your question of whether Hadoop can be used without HDFS, the answer is yes: Hadoop can be used with other kinds of distributed file systems. But I am not able to understand the problem statement clearly enough to offer my point of view. Are you processing text files and saving them in a distributed database? Regards, Abhishek On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We want to distribute the processing of text files, run large machine learning tasks, and have a distributed database, as we have a big amount of data. The problem is that each VM can have up to 2TB of data (a limitation of the VM), and we have 20TB of data. So we have to distribute the processing, the database, etc., but all of that data will be in a shared, huge, central file system. We heard about myHadoop, but we are not sure why it is any different from Hadoop. If we run Hadoop/MapReduce without using HDFS, is that an option? best, PA 2012/5/17 Mathias Herberts mathias.herbe...@gmail.com Hadoop does not perform well with shared storage and VMs. The question should be asked first regarding what you're trying to achieve, not about your infra. On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: Hello, We have about 50 VMs and we want to distribute processing across them. However, these VMs share a huge data storage system and thus their virtual HDDs are all located in the same computer. Would Hadoop be useful for such a configuration? Could we use Hadoop without HDFS, so that we can retrieve and store everything in the same storage? Thanks, PA
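As a hedged illustration of running MapReduce without HDFS (not something worked out in detail in this thread), the default filesystem can be pointed at a path that every node sees, for example a shared mount; the mount point below is made up, and the performance caveat from Mathias's reply still applies:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class NoHdfsConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Use the raw local/shared file system instead of HDFS.
    conf.set("fs.default.name", "file:///");
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Working against: " + fs.getUri());
    // Jobs submitted with this configuration would then read and write paths like
    // file:///mnt/shared/input (a hypothetical mount visible to all 50 VMs)
    // instead of hdfs:// paths.
  }
}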
Re: Resource underutilization / final reduce task only uses half of the cluster (tasktracker map/reduce slots)
Hi JD, The number of reduce tasks that end up with work depends on the keys emitted once all the mappers are done. If the keys are all the same, then all the data will go to one node; likewise, how well the cluster's nodes are utilized during the reduce phase depends on the number of distinct keys handed to the reduce tasks. Regards, Abhishek On Fri, May 11, 2012 at 4:57 PM, Jeremy Davis jda...@upstreamsoftware.com wrote: I see mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum, but I'm wondering if there isn't another tuning parameter I need to look at. I can tune the task tracker so that when I have many jobs running, with many simultaneous maps and reduces, I utilize 95% of CPU and memory. Inevitably though, I end up with a huge final reduce task that only uses half of my cluster because I have reserved the other half for mapping. Is there a way around this problem? Seems like there should also be a maximum number of reducers conditional on no map tasks running. -JD
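To illustrate the point about keys deciding where reduce work lands, here is a hedged sketch (mine, not from the thread) of a Partitioner that spreads keys across the configured reducers; the class name is made up and the body simply mirrors the default hash-based behaviour:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each map output record.
// With a single distinct key, every record hashes to the same partition,
// so only one reducer does real work no matter how many are configured.
public class SpreadingPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Same idea as the default HashPartitioner: hash of the key modulo reducer count.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}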
Re: Moving files from JBoss server to HDFS
What about using Flume to consume the files and send them to HDFS? It depends on what security implications you are concerned about. It will be easier to advise if you can answer the questions below. Is a JBoss server process pushing the files to HDFS, or is some other process pushing them? Is the intermediate server doing any security checks on the data or on user access? What processing is done on the intermediate server? If none, how is it different from sending the files directly to HDFS? Regards, Abhishek On Sat, May 12, 2012 at 3:56 AM, samir das mohapatra samir.help...@gmail.com wrote: Hi financeturd financet...@yahoo.com, From my point of view, the second setup, like below, is the good approach: {Separate server} -- {JBoss server} and then {Separate server} -- HDFS thanks samir On Sat, May 12, 2012 at 6:00 AM, financeturd financeturd financet...@yahoo.com wrote: Hello, We have a large number of custom-generated files (not just web logs) that we need to move from our JBoss servers to HDFS. Our first implementation ran a cron job every 5 minutes to move our files from the output directory to HDFS. Is this recommended? We are being told by our IT team that our JBoss servers should not have access to HDFS for security reasons. The files must be pulled to HDFS by other servers that do not accept traffic from the outside. In essence, they are asking for a layer of indirection. Instead of: {JBoss server} -- {HDFS} it's being requested that it look like: {Separate server} -- {JBoss server} and then {Separate server} -- HDFS While I understand in principle what is being said, the security of having processes on JBoss servers writing files to HDFS doesn't seem any worse than having Tomcat servers access a central database, which they do. Can anyone comment on what a recommended approach would be? Should our JBoss servers push their data to HDFS, or should the data be pulled by another server and then placed into HDFS? Thank you! FT
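As a hedged sketch of the "pull" variant being discussed (the intermediate server fetches files from the JBoss machines and then writes them into HDFS), the snippet below covers only the HDFS side; the local staging directory is assumed to have been filled by whatever transfer mechanism you choose (rsync, scp, Flume), and all paths are illustrative:

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingToHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    File staging = new File("/data/staging");   // filled by rsync/scp/Flume from the JBoss hosts
    Path target = new Path("/logs/incoming");   // HDFS landing directory
    File[] files = staging.listFiles();
    if (files == null) {
      return; // staging directory missing or unreadable
    }
    for (File f : files) {
      // delSrc=false keeps the local copy; flip to true to clean up as you go.
      fs.copyFromLocalFile(false, new Path(f.getAbsolutePath()), target);
    }
  }
}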
Re: Number of Reduce Tasks
AFAIK, how many reducers end up doing work depends on the keys generated once the mappers are done. Maybe the join is producing only one key. Regards, Abhishek On Sun, May 13, 2012 at 10:48 PM, anwar shaikh anwardsha...@gmail.com wrote: Hi Everybody, I am executing a MapReduce job that performs a JOIN operation using org.apache.hadoop.contrib.utils.join. Four files are given as input. I think there are four map tasks running (based on the line marked in red). I have also set the number of reducers to 10 using job.setNumReduceTasks(10). But only one reduce task is performed (line marked in blue). So please, can you suggest how I can increase the number of reducers? Below are some of the last lines from the log.
12/05/14 10:32:46 INFO mapred.Task: Task 'attempt_local_0001_m_03_0' done.
12/05/14 10:32:46 INFO mapred.LocalJobRunner:
12/05/14 10:32:46 INFO mapred.Merger: Merging 4 sorted segments
12/05/14 10:32:46 INFO mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 8018 bytes
12/05/14 10:32:46 INFO mapred.LocalJobRunner:
12/05/14 10:32:46 INFO datajoin.job: key: 1 this.largestNumOfValues: 48
12/05/14 10:32:46 INFO mapred.Task: Task:attempt_local_0001_r_00_0 is done. And is in the process of commiting
12/05/14 10:32:46 INFO mapred.LocalJobRunner:
12/05/14 10:32:46 INFO mapred.Task: Task attempt_local_0001_r_00_0 is allowed to commit now
12/05/14 10:32:46 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_00_0' to file:/home/anwar/workspace/JoinLZOPfiles/OutLarge
12/05/14 10:32:49 INFO mapred.LocalJobRunner: actuallyCollectedCount 86 collectedCount 86 groupCount 25 reduce
12/05/14 10:32:49 INFO mapred.Task: Task 'attempt_local_0001_r_00_0' done.
12/05/14 10:32:50 INFO mapred.JobClient: map 100% reduce 100%
12/05/14 10:32:50 INFO mapred.JobClient: Job complete: job_local_0001
12/05/14 10:32:50 INFO mapred.JobClient: Counters: 17
12/05/14 10:32:50 INFO mapred.JobClient: File Input Format Counters
12/05/14 10:32:50 INFO mapred.JobClient: Bytes Read=1666
12/05/14 10:32:50 INFO mapred.JobClient: File Output Format Counters
12/05/14 10:32:50 INFO mapred.JobClient: Bytes Written=2421
12/05/14 10:32:50 INFO mapred.JobClient: FileSystemCounters
12/05/14 10:32:50 INFO mapred.JobClient: FILE_BYTES_READ=22890
12/05/14 10:32:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=194702
12/05/14 10:32:50 INFO mapred.JobClient: Map-Reduce Framework
12/05/14 10:32:50 INFO mapred.JobClient: Map output materialized bytes=8034
12/05/14 10:32:50 INFO mapred.JobClient: Map input records=106
12/05/14 10:32:50 INFO mapred.JobClient: Reduce shuffle bytes=0
12/05/14 10:32:50 INFO mapred.JobClient: Spilled Records=212
12/05/14 10:32:50 INFO mapred.JobClient: Map output bytes=7798
12/05/14 10:32:50 INFO mapred.JobClient: Map input bytes=1666
12/05/14 10:32:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=472
12/05/14 10:32:50 INFO mapred.JobClient: Combine input records=0
12/05/14 10:32:50 INFO mapred.JobClient: Reduce input records=106
12/05/14 10:32:50 INFO mapred.JobClient: Reduce input groups=25
12/05/14 10:32:50 INFO mapred.JobClient: Combine output records=0
12/05/14 10:32:50 INFO mapred.JobClient: Reduce output records=86
12/05/14 10:32:50 INFO mapred.JobClient: Map output records=106
-- Mr. Anwar Shaikh Delhi Technological University, Delhi +91 92 50 77 12 44
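One hedged observation, plus a sketch of my own (not from the thread): the job id job_local_0001 in the log indicates the job ran in the LocalJobRunner, and if I remember correctly the local runner of that era executes at most a single reduce task regardless of setNumReduceTasks, so the setting only takes effect when the job is submitted to a real cluster. The JobTracker address below is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JoinDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical JobTracker address -- without this (or the cluster's own
    // mapred-site.xml on the classpath) the job falls back to the LocalJobRunner.
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
    Job job = new Job(conf, "datajoin");
    job.setNumReduceTasks(10); // honoured when running on a real cluster
    // ... set mapper/reducer/input/output classes as in the original driver ...
    System.out.println("Reduce tasks requested: " + job.getNumReduceTasks());
  }
}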
Re: Multiple data centre in Hadoop
Thanks Bobby, I am looking for something like this. Now the question is what the best strategy is for doing hot/hot or hot/warm. I need to consider CPU and network bandwidth, and I also need to decide from which layer this replication should start. Regards, Abhishek On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote: Hi Abhishek, Manu is correct about High Availability within a single colo. I realize that in some cases you have to have failover between colos. I am not aware of any turnkey solution for things like that, but generally what you want to do is run two clusters, one in each colo, either hot/hot or hot/warm, and I have seen both depending on how quickly you need to fail over. In hot/hot the input data is replicated to both clusters and the same software is run on both. In this case, though, you have to be fairly sure that your processing is deterministic, or the results could be slightly different (i.e. no generating of random ids). In hot/warm the data is replicated from one colo to the other at defined checkpoints. The data is only processed on one of the grids, but if that colo goes down the other one can take up the processing from wherever the last checkpoint was. I hope that helps. --Bobby On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote: Hi Abhishek, 1. Use multiple directories for dfs.name.dir, dfs.data.dir, etc. Recommendation: write to two local directories on different physical volumes, and to an NFS-mounted directory - data will be preserved even in the event of a total failure of the NameNode machines. Recommendation: soft-mount the NFS directory - if the NFS mount goes offline, this will not cause the NameNode to fail. 2. Rack awareness: https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Thanks Robert. Is there a best practice or design that can address the High Availability to a certain extent? ~Abhishek On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote: No it does not. Sorry On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi All, Just wanted to know if hadoop supports more than one data centre. This is basically for DR purposes and High Availability, where if one centre goes down the other can take over. Regards, Abhishek -- Thanks Regards Manu S SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855 Skype: manuspkd www.opensourcetalk.co.in
Re: Multiple data centre in Hadoop
Thanks Robert. Is there a best practice or design that can address the High Availability to a certain extent? ~Abhishek On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote: No it does not. Sorry On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi All, Just wanted to know if hadoop supports more than one data centre. This is basically for DR purposes and High Availability, where if one centre goes down the other can take over. Regards, Abhishek
Zero Byte file in HDFS
Hi All, I was just going through an implementation scenario for avoiding or deleting zero byte files in HDFS. I am using a Hive partitioned table where the data in each partition comes from an INSERT OVERWRITE command using a SELECT from a few other tables. Sometimes 0 byte files are generated in those partitions, and over time the number of these files in HDFS will increase enormously, decreasing the performance of Hadoop jobs on that table / folder. I am looking for the best way to avoid generating, or to delete, the zero byte files. I can think of a few ways to implement this: 1) Programmatically, using the FileSystem object to clean up the zero byte files. 2) Using a combination of hadoop fs and Linux commands to identify the zero byte files and delete them. 3) LazyOutputFormat (applicable in Hadoop-based custom jobs). Kindly guide me on efficient ways to achieve the same. Regards, Abhishek
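For option 1, here is a hedged sketch (mine, not from the thread) that walks a partition directory with the FileSystem API and deletes files whose length is zero; the table path is illustrative, and you would want to run it only after the INSERT OVERWRITE has fully finished:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ZeroByteCleaner {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Illustrative partition path under the Hive warehouse.
    Path partition = new Path("/user/hive/warehouse/mytable/dt=2012-03-26");
    for (FileStatus status : fs.listStatus(partition)) {
      if (!status.isDir() && status.getLen() == 0) {
        // Second argument: recursive delete (irrelevant for a plain file).
        fs.delete(status.getPath(), false);
      }
    }
  }
}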
Re: Zero Byte file in HDFS
Thanks Bejoy for this input. Will this merging combine all the small files up to at least the block size before the very first mappers of the Hive job? Well, I will explore this. My interest in deleting the zero byte files from HDFS comes from reducing the cost of bookkeeping these files in the system: the metadata for each file in HDFS occupies roughly 150 bytes of NameNode memory, so thousands or millions of zero byte files in HDFS will cost a lot. Regards, Abhishek On Mon, Mar 26, 2012 at 2:27 PM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Abhishek I can propose a better solution: enable merge in Hive, so that the smaller files are merged to at least the HDFS block size (your choice), which benefits subsequent Hive jobs on the same data. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Abhishek Pratap Singh manu.i...@gmail.com Date: Mon, 26 Mar 2012 14:20:18 To: common-user@hadoop.apache.org; u...@hive.apache.org Reply-To: common-user@hadoop.apache.org Subject: Zero Byte file in HDFS Hi All, I was just going through an implementation scenario for avoiding or deleting zero byte files in HDFS. I am using a Hive partitioned table where the data in each partition comes from an INSERT OVERWRITE command using a SELECT from a few other tables. Sometimes 0 byte files are generated in those partitions, and over time the number of these files in HDFS will increase enormously, decreasing the performance of Hadoop jobs on that table / folder. I am looking for the best way to avoid generating, or to delete, the zero byte files. I can think of a few ways to implement this: 1) Programmatically, using the FileSystem object to clean up the zero byte files. 2) Using a combination of hadoop fs and Linux commands to identify the zero byte files and delete them. 3) LazyOutputFormat (applicable in Hadoop-based custom jobs). Kindly guide me on efficient ways to achieve the same. Regards, Abhishek
Re: Zero Byte file in HDFS
This sounds good as long as the output of the select query has at least one row, but in my case it can be zero rows. Thanks, Abhishek On Mon, Mar 26, 2012 at 2:48 PM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Abhishek, Merging happens as the last stage of Hive jobs. Say your Hive query is translated into n MR jobs; when you enable merge, you can set the size that files should be merged to (usually the block size). So after the n MR jobs there is a map-only job, automatically triggered by Hive, which merges the smaller files into larger ones. The intermediate output files are not retained in Hive, and hence only the final, large enough files remain in HDFS. Regards Bejoy KS Sent from handheld, please excuse typos. From: Abhishek Pratap Singh manu.i...@gmail.com Date: Mon, 26 Mar 2012 14:41:53 -0700 To: common-user@hadoop.apache.org; bejoy.had...@gmail.com Cc: u...@hive.apache.org Subject: Re: Zero Byte file in HDFS Thanks Bejoy for this input. Will this merging combine all the small files up to at least the block size before the very first mappers of the Hive job? Well, I will explore this. My interest in deleting the zero byte files from HDFS comes from reducing the cost of bookkeeping these files in the system: the metadata for each file in HDFS occupies roughly 150 bytes of NameNode memory, so thousands or millions of zero byte files in HDFS will cost a lot. Regards, Abhishek On Mon, Mar 26, 2012 at 2:27 PM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Abhishek I can propose a better solution: enable merge in Hive, so that the smaller files are merged to at least the HDFS block size (your choice), which benefits subsequent Hive jobs on the same data. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Abhishek Pratap Singh manu.i...@gmail.com Date: Mon, 26 Mar 2012 14:20:18 To: common-user@hadoop.apache.org; u...@hive.apache.org Reply-To: common-user@hadoop.apache.org Subject: Zero Byte file in HDFS Hi All, I was just going through an implementation scenario for avoiding or deleting zero byte files in HDFS. I am using a Hive partitioned table where the data in each partition comes from an INSERT OVERWRITE command using a SELECT from a few other tables. Sometimes 0 byte files are generated in those partitions, and over time the number of these files in HDFS will increase enormously, decreasing the performance of Hadoop jobs on that table / folder. I am looking for the best way to avoid generating, or to delete, the zero byte files. I can think of a few ways to implement this: 1) Programmatically, using the FileSystem object to clean up the zero byte files. 2) Using a combination of hadoop fs and Linux commands to identify the zero byte files and delete them. 3) LazyOutputFormat (applicable in Hadoop-based custom jobs). Kindly guide me on efficient ways to achieve the same. Regards, Abhishek
Regarding pointers for LZO compression in Hive and Hadoop
Hi, I am looking for some useful docs on enabling LZO on a Hadoop cluster. I tried a few of the blogs, but somehow it is not working. Here is my requirement: I have Hadoop 0.20.2 and Hive 0.6, with some tables holding 1.5 TB of data. I want to compress them using LZO and enable LZO in Hive as well as in Hadoop. Let me know if you have any useful docs or pointers for the same. Regards, Abhishek
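For what it's worth, here is a hedged sketch (mine, not a confirmed recipe from this list) of the client-side configuration typically described in hadoop-lzo setup guides, shown through the Java Configuration API rather than core-site.xml/mapred-site.xml; it assumes the hadoop-lzo jar and its native libraries are already installed on every node, which is the part that most often goes wrong:

import org.apache.hadoop.conf.Configuration;

public class LzoConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Register the LZO codecs alongside the built-in ones (normally done in core-site.xml).
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.DefaultCodec,"
      + "com.hadoop.compression.lzo.LzoCodec,"
      + "com.hadoop.compression.lzo.LzopCodec");
    conf.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");
    // Compress job output with LZO (normally mapred-site.xml or set per job).
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzopCodec");
    System.out.println("Codecs registered: " + conf.get("io.compression.codecs"));
  }
}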
Re: Running a job continuously
Hi Burak, The model of Hadoop is quite different: it is job based, or in simpler words a kind of batch model, where a MapReduce job is executed on a batch of data that is already present. For your requirement, the word count example doesn't make sense if the file is being written to continuously; however, word count per hour or per minute does make sense as a MapReduce-style program. I second what Bejoy mentioned: use Flume to aggregate the data and then run MapReduce. Hadoop can give you near real time by combining Flume with MapReduce, where you run a MapReduce job on the Flume-dumped data every few minutes. A second option is to see whether your problem can be solved by a Flume decorator itself, for a real-time experience. Regards, Abhishek On Mon, Dec 5, 2011 at 2:33 PM, burakkk burak.isi...@gmail.com wrote: Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to execute the MR job with the same algorithm but different files arrive at different velocities. Both Storm and Facebook's Hadoop are designed for that, but I want to use the Apache distribution. Bejoy Ks, I have a continuous inflow of data, but I think I need a near real-time system. Mike Spreitzer, both output and input are continuous. The output isn't relevant to the input. All I want is for all the incoming files to be processed by the same job and the same algorithm. For example, think about the wordcount problem. When you want to run wordcount, you implement this: http://wiki.apache.org/hadoop/WordCount But when the program reaches job.waitForCompletion(true);, the job will somehow end. When you want to make it continuous, what do you do in Hadoop without other tools? One more thing: you can assume that the input file's name is filename_timestamp (filename_20111206_0030).
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Burak If you have a continuous inflow of data, you can choose Flume to aggregate the files into larger sequence files or similar if they are small, and push that data onto HDFS once you have a substantial chunk of it (equal to the HDFS block size). Based on your SLAs, you need to schedule your jobs using Oozie or a simple shell script. In very simple terms:
- push input data (could be from a Flume collector) into a staging HDFS dir
- before triggering the job (hadoop jar), copy the input from staging to the main input dir
- execute the job
- archive the input and output into archive dirs (any other dirs)
- the output archive dir could be the source of output data
- delete the output dir and empty the input dir
Hope it helps! Regards Bejoy.K.S On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote: Hi everyone, I want to run an MR job continuously, because I have streaming data and I try to analyze it all the time in my own way (algorithm). For example, say you want to solve the wordcount problem; it's the simplest one :) If you have multiple files and new files keep arriving, how do you handle it? You could execute an MR job per file, but you would have to do it repeatedly. So what do you think?
Thanks, best regards... -- BURAK ISIKLI | http://burakisikli.wordpress.com
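To make Bejoy's staging/trigger/archive loop concrete, here is a hedged sketch (mine, not from the thread) of a simple driver that repeats those steps; all directory names are made up, the runWordCount helper is hypothetical, and a production setup would more likely use Oozie or a shell script plus cron as suggested above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContinuousDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path staging = new Path("/flume/staging");   // filled continuously by the Flume collector
    Path input   = new Path("/wordcount/input");
    Path archive = new Path("/wordcount/archive");
    while (true) {
      // 1. Move whatever has accumulated in staging into the job's input dir.
      for (FileStatus f : fs.listStatus(staging)) {
        fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
      }
      // 2. Run one batch (runWordCount would build and submit the usual wordcount Job).
      // runWordCount(conf, input, new Path("/wordcount/output"));
      // 3. Archive the processed input, then wait for the next batch.
      for (FileStatus f : fs.listStatus(input)) {
        fs.rename(f.getPath(), new Path(archive, f.getPath().getName()));
      }
      Thread.sleep(5 * 60 * 1000L);  // e.g. every five minutes
    }
  }
}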
Re: HDFS Explained as Comics
Hi, This is indeed a good way to explain it; most of the improvements have already been discussed. Waiting for the sequel to this comic. Regards, Abhishek On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney mvarsh...@gmail.com wrote: Hi Matthew I agree with both you and Prashant. The strip needs to be modified to explain that these can be default values that can be optionally overridden (which I will fix in the next iteration). However, from the 'understanding concepts of HDFS' point of view, I still think that block size and replication factors are the real strengths of HDFS, and learners must be exposed to them so that they get to see how HDFS is significantly different from conventional file systems. On a personal note: thanks for the first part of your message :) -Maneesh On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Maneesh, Firstly, I love the comic :) Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command line overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a person new to Hadoop. The most common flow would be using admin-determined values from hdfs-site, and the only thing that would need to change is that the conversation happens between client / server and not user / client. Matt -Original Message- From: Prashant Kommireddi [mailto:prash1...@gmail.com] Sent: Wednesday, November 30, 2011 3:28 PM To: common-user@hadoop.apache.org Subject: Re: HDFS Explained as Comics Sure, it's just a case of how readers interpret it. 1. Client is required to specify block size and replication factor each time 2. Client does not need to worry about it since an admin has set the properties in default configuration files A client would not be allowed to override the default configs if they are set final (well, there are ways to go around it as well, as you suggest, by using create() :) The information is great and helpful. Just want to make sure a beginner who wants to write a WordCount in MapReduce does not worry about specifying block size and replication factor in his code. Thanks, Prashant On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote: Hi Prashant Others may correct me if I am wrong here.. The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block size and replication factor. In the source code, I see the following in the DFSClient constructor: defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE); defaultReplication = (short) conf.getInt("dfs.replication", 3); My understanding is that the client considers the following chain for the values: 1. Manual values (the long form constructor; when a user provides these values) 2. Configuration file values (these are cluster level defaults: dfs.block.size and dfs.replication) 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3) Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to create a file is void create(..., short replication, long blocksize); I presume it means that the client already has knowledge of these values and passes them to the NameNode when creating a new file. Hope that helps. thanks -Maneesh On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Maneesh.
Quick question: does a client really need to know the block size and replication factor? A lot of times the client has no control over these (they are set at the cluster level). -Prashant Kommireddi On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com wrote: Hi Maneesh, Thanks a lot for this! Just distributed it over the team and the comments are great :) Best regards, Dejan On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com wrote: For your reading pleasure! PDF 3.3MB uploaded at (the mailing list has a cap of 1MB attachments): https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1 I would appreciate it if you could spare some time to peruse this little experiment of mine to use comics as a medium to explain computer science topics. This particular issue explains the protocols and internals of HDFS. I am eager to hear your opinions on the usefulness of this visual medium to teach complex protocols and algorithms. [My personal motivations: I have always found text descriptions to be too verbose, as a lot of effort is spent putting the concepts in proper time-space context (which can be easily avoided in a visual
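As a small hedged illustration of the point being debated (the client can pass block size and replication explicitly, or simply inherit the configured defaults), this sketch is mine and the paths and values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Usual case: defaults come from dfs.block.size / dfs.replication (or the hardcoded fallbacks).
    FSDataOutputStream defaults = fs.create(new Path("/tmp/with-defaults.txt"));
    defaults.close();
    // Explicit case: the client overrides both values for this one file.
    FSDataOutputStream explicit = fs.create(
        new Path("/tmp/with-overrides.txt"),
        true,                                        // overwrite
        conf.getInt("io.file.buffer.size", 4096),    // write buffer size
        (short) 2,                                   // replication for this file only
        128L * 1024 * 1024);                         // 128 MB block size for this file only
    explicit.close();
  }
}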
Capacity Planning using Dfsadmin command
Hi, I have a query about Hadoop that is very important for capacity planning and requesting new hardware. The Hadoop command hadoop dfsadmin -report shows the DFS usage and DFS remaining. I have checked that the DFS usage does not match the size reported by hadoop dfs -dus on the root or user directory. The difference is large: dfsadmin shows the occupied space as 5 TB, but dus shows it is 3.5 TB. Does anyone know why more occupied space is shown by dfsadmin? Regards, Abhishek
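One common explanation (hedged, since no answer appears in this post) is that dfsadmin -report counts the raw bytes consumed on the datanodes, including every replica, while dfs -dus reports the logical, pre-replication size of the files; the gap also shifts when different files carry different replication factors or when space is taken by trash and temporary directories. A small sketch of my own for checking both numbers for a path with the FileSystem API, path illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UsageComparison {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    ContentSummary summary = fs.getContentSummary(new Path("/")); // or a user directory
    // Logical size (what "hadoop dfs -dus" reports): bytes before replication.
    System.out.println("Logical size:   " + summary.getLength());
    // Space consumed: bytes actually occupied across all replicas,
    // which is closer to the "DFS Used" figure from dfsadmin -report.
    System.out.println("Space consumed: " + summary.getSpaceConsumed());
  }
}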