Thanks Mayank. Any clue as to why only one disk was getting all the writes?

Rahul
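(For reference, a minimal hdfs-site.xml sketch of the dfs.datanode.du.reserved setting Mayank mentions below; the property takes a value in bytes per volume, and the value here is just the 30 GB figure from this thread expressed in bytes, not a recommendation.)

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- space in bytes reserved per volume for non-DFS use; 30 GB ~= 32212254720 bytes -->
  <value>32212254720</value>
  <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>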
On Thu, Jun 13, 2013 at 11:47 AM, Mayank <[email protected]> wrote:

> So we did a manual rebalance (following the instructions at
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non-DFS usage via
> dfs.datanode.du.reserved, and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee
> <[email protected]> wrote:
>
>> I have a few points to make; these may not be very helpful for the said
>> problem.
>>
>> + The "All datanodes are bad" exception does not really point to a
>> disk-space-full problem.
>> + hadoop.tmp.dir acts as the base location for other Hadoop-related
>> properties; I am not sure if any particular directory is created
>> specifically.
>> + Only one disk getting filled looks strange, since the other disks were
>> also part of the configuration when the NN was formatted.
>>
>> Would be interesting to know the reason for this.
>> Please keep us posted.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <[email protected]> wrote:
>>
>>> From the snapshot, you have around 3 TB left for writing data.
>>>
>>> Can you check each individual datanode's storage health?
>>> You said you have 80 servers writing to HDFS in parallel; I am not sure
>>> whether that could be an issue.
>>> As suggested in past threads, you can do a rebalance of the blocks, but
>>> that will take some time to finish and will not solve your issue right
>>> away.
>>>
>>> You can wait for others to reply. I am sure there will be far better
>>> solutions from the experts for this.
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <[email protected]> wrote:
>>>
>>>> No, it's not a map-reduce job. We have a Java app running on around 80
>>>> machines which writes to HDFS. The error that I'd mentioned is being
>>>> thrown by the application, and yes, we have the replication factor set
>>>> to 3. The current status of HDFS is:
>>>>
>>>> Configured Capacity : 16.15 TB
>>>> DFS Used : 11.84 TB
>>>> Non DFS Used : 872.66 GB
>>>> DFS Remaining : 3.46 TB
>>>> DFS Used% : 73.3 %
>>>> DFS Remaining% : 21.42 %
>>>> Live Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> : 10
>>>> Dead Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD> : 0
>>>> Decommissioning Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING> : 0
>>>> Number of Under-Replicated Blocks : 0
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar
>>>> <[email protected]> wrote:
>>>>
>>>>> When you say the application errors out, does that mean your mapreduce
>>>>> job is erroring? In that case, apart from HDFS space, you will need to
>>>>> look at mapred tmp directory space as well.
>>>>>
>>>>> You have 400 GB * 4 * 10 = 16 TB of disk, and assuming a replication
>>>>> factor of 3, you can store at most about 5 TB of data.
>>>>> I am also assuming you are not scheduling your program to run on the
>>>>> entire 5 TB with just 10 nodes.
>>>>>
>>>>> I suspect your cluster's mapred tmp space is getting filled up while
>>>>> the job is running.
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:
>>>>>
>>>>>> We are running a hadoop cluster with 10 datanodes and a namenode.
>>>>>> Each datanode is set up with 4 disks (/data1, /data2, /data3, /data4),
>>>>>> with each disk having a capacity of 414 GB.
>>>>>>
>>>>>> hdfs-site.xml has the following property set:
>>>>>>
>>>>>> <property>
>>>>>>   <name>dfs.data.dir</name>
>>>>>>   <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>   <description>Data dirs for DFS.</description>
>>>>>> </property>
>>>>>>
>>>>>> Now we are facing an issue wherein /data1 gets filled up quickly, and
>>>>>> many times we see its usage running at 100% with just a few megabytes
>>>>>> of free space left. This is visible on 7 out of 10 datanodes at
>>>>>> present.
>>>>>>
>>>>>> We have some Java applications writing to HDFS, and many times we see
>>>>>> the following errors in our application logs:
>>>>>>
>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>
>>>>>> I went through some old discussions, and it looks like manual
>>>>>> rebalancing is what is required in this case, and we should also have
>>>>>> dfs.datanode.du.reserved set up.
>>>>>>
>>>>>> However, I'd like to understand whether this issue, with one disk
>>>>>> getting filled up to 100%, can result in the error we are seeing in
>>>>>> our application.
>>>>>>
>>>>>> Also, are there any other performance implications of some of the
>>>>>> disks running at 100% usage on a datanode?
>>>>>>
>>>>>> --
>>>>>> Mayank Joshi
>>>>>>
>>>>>> Skype: mail2mayank
>>>>>> Mb.: +91 8690625808
>>>>>>
>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>
>>>>>> Today is tomorrow I was so worried about yesterday ...
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>
>>>>
>>>> --
>>>> Mayank Joshi
>>>>
>>>> Skype: mail2mayank
>>>> Mb.: +91 8690625808
>>>>
>>>> Blog: http://www.techynfreesouls.co.nr
>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>
>>>> Today is tomorrow I was so worried about yesterday ...
>>>
>>>
>>> --
>>> Nitin Pawar
>>
>>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.: +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tomorrow I was so worried about yesterday ...
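(For reference, a rough shell sketch of the manual per-disk move described in the FAQ link at the top of this thread. It assumes a Hadoop 1.x datanode layout where block replicas live under <data dir>/current/ as blk_<id> files with a matching blk_<id>_<genstamp>.meta file, possibly inside subdir* directories; the block IDs below are made up for illustration, and the datanode must be stopped before moving anything.)

# stop the datanode on the affected host first
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

# move a block replica and its matching .meta file together from the full
# disk to a disk with free space, keeping the pair in the same relative
# location under current/ (block ID and genstamp here are hypothetical)
mv /data1/hadoopfs/current/blk_1234567890123456789 /data4/hadoopfs/current/
mv /data1/hadoopfs/current/blk_1234567890123456789_1001.meta /data4/hadoopfs/current/

# restart the datanode once enough blocks have been moved
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode

Note that dfs.datanode.du.reserved only keeps HDFS from consuming the reserved space going forward; it does not move blocks that are already on a full disk, which is why the manual move was still needed in this thread.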
