So we did a manual rebalance (followed the instructions at http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F), also reserved 30 GB of space for non-DFS usage via dfs.datanode.du.reserved, and restarted our apps.
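For anyone hitting the same problem, here is a minimal sketch of the reservation entry we added to hdfs-site.xml on each datanode. The property takes a per-volume value in bytes; the exact figure below (32212254720 bytes, i.e. 30 GiB) is illustrative:

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- Reserved space in bytes per disk volume for non-DFS use; 32212254720 bytes = 30 GiB (illustrative value) -->
  <value>32212254720</value>
  <description>Reserved space in bytes per volume for non-DFS usage.</description>
</property>

As far as I know, the datanodes need a restart for a changed reservation to take effect.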
Things have been going fine till now. Keeping fingers crossed :)

On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <[email protected]> wrote:

> I have a few points to make; these may not be very helpful for the said
> problem.
>
> + The "All datanodes are bad" exception does not necessarily point to a
>   disk-space-full problem.
> + hadoop.tmp.dir acts as the base location for other Hadoop-related
>   properties; not sure if any particular directory is created specifically.
> + Only one disk getting filled looks strange; the other disks were also
>   part of the setup when the NN was formatted.
>
> Would be interesting to know the reason for this.
> Please keep us posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <[email protected]> wrote:
>
>> From the snapshot, you have around 3 TB left for writing data.
>>
>> Can you check each individual datanode's storage health?
>> You said you have 80 servers writing to HDFS in parallel; I am not sure
>> whether that can be an issue.
>> As suggested in past threads, you can do a rebalance of the blocks, but
>> that will take some time to finish and will not solve your issue right
>> away.
>>
>> You can wait for others to reply. I am sure there will be far better
>> solutions from experts for this.
>>
>>
>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <[email protected]> wrote:
>>
>>> No, it's not a map-reduce job. We have a Java app running on around 80
>>> machines which writes to HDFS. The error that I'd mentioned is being
>>> thrown by the application, and yes, we have the replication factor set
>>> to 3. The current status of HDFS is:
>>>
>>> Configured Capacity : 16.15 TB
>>> DFS Used : 11.84 TB
>>> Non DFS Used : 872.66 GB
>>> DFS Remaining : 3.46 TB
>>> DFS Used% : 73.3 %
>>> DFS Remaining% : 21.42 %
>>> Live Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> : 10
>>> Dead Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD> : 0
>>> Decommissioning Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING> : 0
>>> Number of Under-Replicated Blocks : 0
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> When you say the application errors out, does that mean your MapReduce
>>>> job is erroring? In that case, apart from HDFS space, you will need to
>>>> look at the mapred tmp directory space as well.
>>>>
>>>> You have 400 GB * 4 * 10 = 16 TB of disk, and assuming a replication
>>>> factor of 3, at most you can hold around 5 TB of data. I am also
>>>> assuming you are not scheduling your program to run on the entire 5 TB
>>>> with just 10 nodes.
>>>>
>>>> I suspect your cluster's mapred tmp space is getting filled up while
>>>> the job is running.
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:
>>>>
>>>>> We are running a Hadoop cluster with 10 datanodes and a namenode.
>>>>> Each datanode is set up with 4 disks (/data1, /data2, /data3,
>>>>> /data4), with each disk having a capacity of 414 GB.
>>>>>
>>>>> hdfs-site.xml has the following property set:
>>>>>
>>>>> <property>
>>>>>   <name>dfs.data.dir</name>
>>>>>   <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>   <description>Data dirs for DFS.</description>
>>>>> </property>
>>>>>
>>>>> Now we are facing an issue where we find /data1 getting filled up
>>>>> quickly, and many times we see its usage running at 100% with just a
>>>>> few megabytes of free space.
>>>>> This issue is visible on 7 out of 10 datanodes at present.
>>>>>
>>>>> We have some Java applications which are writing to HDFS, and many
>>>>> times we are seeing the following errors in our application logs:
>>>>>
>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>
>>>>> I went through some old discussions, and it looks like manual
>>>>> rebalancing is what is required in this case, and we should also have
>>>>> dfs.datanode.du.reserved set up.
>>>>>
>>>>> However, I'd like to understand whether this issue, with one disk
>>>>> getting filled up to 100%, can result in the error which we are
>>>>> seeing in our application.
>>>>>
>>>>> Also, are there any other performance implications due to some of the
>>>>> disks running at 100% usage on a datanode?
>>>>>
>>>>> --
>>>>> Mayank Joshi
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>>
>>> --
>>> Mayank Joshi
>>
>>
>> --
>> Nitin Pawar
>

--
Mayank Joshi

Skype: mail2mayank
Mb.: +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tomorrow I was so worried about yesterday ...
