No, as of this moment we have no idea about the reason for that behavior.
On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <[email protected]> wrote:

> Thanks Mayank. Any clue as to why only one disk was getting all the
> writes?
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <[email protected]> wrote:
>
>> So we did a manual rebalance (followed the instructions at:
>> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
>> and also reserved 30 GB of space for non-DFS usage via
>> dfs.datanode.du.reserved, then restarted our apps.
>>
>> Things have been going fine so far.
>>
>> Keeping fingers crossed :)
>>
>>
>> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee
>> <[email protected]> wrote:
>>
>>> I have a few points to make; these may not be very helpful for the
>>> said problem.
>>>
>>> + The "All datanodes are bad" exception does not usually point to a
>>> problem related to disk space being full.
>>> + hadoop.tmp.dir acts as the base location for other Hadoop-related
>>> properties; I am not sure if any particular directory is created
>>> specifically.
>>> + Only one disk getting filled looks strange, since the other disks
>>> were also part of the configuration when the NN was formatted.
>>>
>>> Would be interesting to know the reason for this.
>>> Please keep us posted.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> From the snapshot, you have around 3 TB left for writing data.
>>>>
>>>> Can you check each individual datanode's storage health?
>>>> As you said, you have 80 servers writing in parallel to HDFS; I am
>>>> not sure whether that could be an issue.
>>>> As suggested in past threads, you can do a rebalance of the blocks,
>>>> but that will take some time to finish and will not solve your issue
>>>> right away.
>>>>
>>>> You can wait for others to reply. I am sure there will be far better
>>>> solutions from the experts.
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <[email protected]> wrote:
>>>>
>>>>> No, it's not a map-reduce job.
>>>>> We have a Java app running on around 80 machines which writes to
>>>>> HDFS. The error that I mentioned is being thrown by the
>>>>> application, and yes, we have the replication factor set to 3.
>>>>> Following is the status of HDFS:
>>>>>
>>>>> Configured Capacity          : 16.15 TB
>>>>> DFS Used                     : 11.84 TB
>>>>> Non DFS Used                 : 872.66 GB
>>>>> DFS Remaining                : 3.46 TB
>>>>> DFS Used%                    : 73.3 %
>>>>> DFS Remaining%               : 21.42 %
>>>>> Live Nodes                   : 10
>>>>> Dead Nodes                   : 0
>>>>> Decommissioning Nodes        : 0
>>>>> Under-Replicated Blocks      : 0
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> When you say the application errors out, does that mean your
>>>>>> mapreduce job is erroring? In that case, apart from HDFS space you
>>>>>> will need to look at the mapred tmp directory space as well.
>>>>>>
>>>>>> You have 400 GB * 4 * 10 = 16 TB of disk, and let's assume you
>>>>>> have a replication factor of 3, so at most you can hold about 5 TB
>>>>>> of data.
>>>>>> I am also assuming you are not scheduling your program to run on
>>>>>> the entire 5 TB with just 10 nodes.
>>>>>>
>>>>>> I suspect your cluster's mapred tmp space is getting filled while
>>>>>> the job is running.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:
>>>>>>
>>>>>>> We are running a Hadoop cluster with 10 datanodes and a namenode.
>>>>>>> Each datanode is set up with 4 disks (/data1, /data2, /data3,
>>>>>>> /data4), with each disk having a capacity of 414 GB.
>>>>>>>
>>>>>>> hdfs-site.xml has the following property set:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>dfs.data.dir</name>
>>>>>>>   <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>>   <description>Data dirs for DFS.</description>
>>>>>>> </property>
>>>>>>>
>>>>>>> Now we are facing an issue where we find /data1 getting filled up
>>>>>>> quickly, and many times we see its usage running at 100% with
>>>>>>> just a few megabytes of free space. This issue is visible on 7
>>>>>>> out of 10 datanodes at present.
>>>>>>>
>>>>>>> We have some Java applications which write to HDFS, and many
>>>>>>> times we see the following errors in our application logs:
>>>>>>>
>>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>>
>>>>>>> I went through some old discussions, and it looks like manual
>>>>>>> rebalancing is what is required in this case, and we should also
>>>>>>> have dfs.datanode.du.reserved set up.
>>>>>>>
>>>>>>> However, I'd like to understand whether this issue, with one disk
>>>>>>> getting filled up to 100%, can result in the errors which we are
>>>>>>> seeing in our application.
>>>>>>>
>>>>>>> Also, are there any other performance implications due to some of
>>>>>>> the disks running at 100% usage on a datanode?
>>>>>>>
>>>>>>> --
>>>>>>> Mayank Joshi
>>>>>>>
>>>>>>> Skype: mail2mayank
>>>>>>> Mb.: +91 8690625808
>>>>>>>
>>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>>
>>>>>>> Today is tomorrow I was so worried about yesterday ...
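For reference on the dfs.datanode.du.reserved setting discussed in this thread: it takes a per-volume byte count in hdfs-site.xml. A minimal sketch of the 30 GB reservation Mayank mentioned (the exact value shown here, 30 * 1024^3 bytes, is an assumption; the thread only gives the round figure) might look like:

```xml
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- 30 GB reserved per volume for non-DFS use: 30 * 1024^3 bytes -->
  <value>32212254720</value>
  <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>
```

Note that the reservation applies to each configured data directory separately, not to the datanode as a whole.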
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar

--
Mayank Joshi

Skype: mail2mayank
Mb.: +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tomorrow I was so worried about yesterday ...
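Nitin's capacity estimate earlier in the thread can be sketched as a quick calculation. The numbers are taken from the thread itself (4 disks of roughly 400 GB on each of 10 nodes, replication factor 3); the ~414 GB per disk is rounded down to 400 GB as Nitin did:

```python
# Back-of-the-envelope HDFS capacity estimate, using the
# figures quoted in this thread (assumed, not measured).
DISKS_PER_NODE = 4
DISK_GB = 400        # ~414 GB per disk, rounded down as in the thread
NODES = 10
REPLICATION = 3

# Total raw capacity across the cluster, in TB (decimal).
raw_tb = DISKS_PER_NODE * DISK_GB * NODES / 1000

# Effective capacity once every block is stored 3 times.
usable_tb = raw_tb / REPLICATION

print(f"raw: {raw_tb:.1f} TB, usable: {usable_tb:.2f} TB")
# raw: 16.0 TB, usable: 5.33 TB -- matching the "at most ~5 TB"
# estimate in the thread.
```

This also shows why the reported 11.84 TB "DFS Used" corresponds to only about 4 TB of actual data at replication factor 3.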
