No, as of this moment we have no idea about the reason for that behavior.
On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <[email protected]> wrote:

> Thanks Mayank. Any clue as to why only one disk was getting all the
> writes?
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <[email protected]> wrote:
>
>> So we did a manual rebalance (followed the instructions at:
>> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
>> and also reserved 30 GB of space for non-DFS usage via
>> dfs.datanode.du.reserved, then restarted our apps.
>>
>> Things have been going fine so far.
>>
>> Keeping fingers crossed :)
>>
>>
>> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee
>> <[email protected]> wrote:
>>
>>> I have a few points to make; these may not be very helpful for the
>>> said problem.
>>>
>>> + The "All datanodes are bad" exception does not usually point to a
>>> problem related to disk space being full.
>>> + hadoop.tmp.dir acts as the base location for other Hadoop-related
>>> properties; I am not sure if any particular directory is created
>>> specifically.
>>> + Only one disk getting filled looks strange, since the other disks
>>> were also part of the configuration when the NN was formatted.
>>>
>>> Would be interesting to know the reason for this.
>>> Please keep us posted.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> From the snapshot, you have around 3 TB left for writing data.
>>>>
>>>> Can you check each individual datanode's storage health?
>>>> As you said, you have 80 servers writing in parallel to HDFS; I am
>>>> not sure whether that could be an issue.
>>>> As suggested in past threads, you can do a rebalance of the blocks,
>>>> but that will take some time to finish and will not solve your issue
>>>> right away.
>>>>
>>>> You can wait for others to reply. I am sure there will be far better
>>>> solutions from the experts.
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <[email protected]> wrote:
>>>>
>>>>> No, it's not a map-reduce job.
>>>>> We have a Java app running on around 80 machines which writes to
>>>>> HDFS. The error that I mentioned is being thrown by the
>>>>> application, and yes, we have the replication factor set to 3.
>>>>> Following is the status of HDFS:
>>>>>
>>>>> Configured Capacity          : 16.15 TB
>>>>> DFS Used                     : 11.84 TB
>>>>> Non DFS Used                 : 872.66 GB
>>>>> DFS Remaining                : 3.46 TB
>>>>> DFS Used%                    : 73.3 %
>>>>> DFS Remaining%               : 21.42 %
>>>>> Live Nodes                   : 10
>>>>> Dead Nodes                   : 0
>>>>> Decommissioning Nodes        : 0
>>>>> Under-Replicated Blocks      : 0
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> When you say the application errors out, does that mean your
>>>>>> mapreduce job is erroring? In that case, apart from HDFS space you
>>>>>> will need to look at the mapred tmp directory space as well.
>>>>>>
>>>>>> You have 400 GB * 4 * 10 = 16 TB of disk, and let's assume you
>>>>>> have a replication factor of 3, so at most you can hold about 5 TB
>>>>>> of data.
>>>>>> I am also assuming you are not scheduling your program to run on
>>>>>> the entire 5 TB with just 10 nodes.
>>>>>>
>>>>>> I suspect your cluster's mapred tmp space is getting filled while
>>>>>> the job is running.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:
>>>>>>
>>>>>>> We are running a Hadoop cluster with 10 datanodes and a namenode.
>>>>>>> Each datanode is set up with 4 disks (/data1, /data2, /data3,
>>>>>>> /data4), with each disk having a capacity of 414 GB.
>>>>>>>
>>>>>>> hdfs-site.xml has the following property set:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>dfs.data.dir</name>
>>>>>>>   <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>>   <description>Data dirs for DFS.</description>
>>>>>>> </property>
>>>>>>>
>>>>>>> Now we are facing an issue where we find /data1 getting filled up
>>>>>>> quickly, and many times we see its usage running at 100% with
>>>>>>> just a few megabytes of free space. This issue is visible on 7
>>>>>>> out of 10 datanodes at present.
>>>>>>>
>>>>>>> We have some Java applications which write to HDFS, and many
>>>>>>> times we see the following errors in our application logs:
>>>>>>>
>>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>>
>>>>>>> I went through some old discussions, and it looks like manual
>>>>>>> rebalancing is what is required in this case, and we should also
>>>>>>> have dfs.datanode.du.reserved set up.
>>>>>>>
>>>>>>> However, I'd like to understand whether this issue, with one disk
>>>>>>> getting filled up to 100%, can result in the errors which we are
>>>>>>> seeing in our application.
>>>>>>>
>>>>>>> Also, are there any other performance implications due to some of
>>>>>>> the disks running at 100% usage on a datanode?
>>>>>>>
>>>>>>> --
>>>>>>> Mayank Joshi
>>>>>>>
>>>>>>> Skype: mail2mayank
>>>>>>> Mb.: +91 8690625808
>>>>>>>
>>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>>
>>>>>>> Today is tomorrow I was so worried about yesterday ...
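For reference on the dfs.datanode.du.reserved setting discussed in this thread: it takes a per-volume byte count in hdfs-site.xml. A minimal sketch of the 30 GB reservation Mayank mentioned (the exact value shown here, 30 * 1024^3 bytes, is an assumption; the thread only gives the round figure) might look like:

```xml
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- 30 GB reserved per volume for non-DFS use: 30 * 1024^3 bytes -->
  <value>32212254720</value>
  <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>
```

Note that the reservation applies to each configured data directory separately, not to the datanode as a whole.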
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar

--
Mayank Joshi

Skype: mail2mayank
Mb.: +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tomorrow I was so worried about yesterday ...
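Nitin's capacity estimate earlier in the thread can be sketched as a quick calculation. The numbers are taken from the thread itself (4 disks of roughly 400 GB on each of 10 nodes, replication factor 3); the ~414 GB per disk is rounded down to 400 GB as Nitin did:

```python
# Back-of-the-envelope HDFS capacity estimate, using the
# figures quoted in this thread (assumed, not measured).
DISKS_PER_NODE = 4
DISK_GB = 400        # ~414 GB per disk, rounded down as in the thread
NODES = 10
REPLICATION = 3

# Total raw capacity across the cluster, in TB (decimal).
raw_tb = DISKS_PER_NODE * DISK_GB * NODES / 1000

# Effective capacity once every block is stored 3 times.
usable_tb = raw_tb / REPLICATION

print(f"raw: {raw_tb:.1f} TB, usable: {usable_tb:.2f} TB")
# raw: 16.0 TB, usable: 5.33 TB -- matching the "at most ~5 TB"
# estimate in the thread.
```

This also shows why the reported 11.84 TB "DFS Used" corresponds to only about 4 TB of actual data at replication factor 3.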
