From the snapshot, you have around 3.5 TB remaining for writing data. Can you check each individual datanode's storage health? As you said, you have 80 servers writing to HDFS in parallel; I am not sure whether that by itself could be the issue. As suggested in past threads, you can rebalance the blocks, but that will take some time to finish and will not solve your issue right away. Example commands follow below, and a sketch of the dfs.datanode.du.reserved setting mentioned further down the thread is at the end of this mail.
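For reference, something like the following should show per-datanode usage and start a rebalance. The threshold is the allowed deviation of each datanode's utilization from the cluster average, in percent; 10 is the default, and a lower value rebalances more aggressively:

    hadoop dfsadmin -report
    hadoop balancer -threshold 10

Note that the balancer evens out usage across datanodes, not across the individual disks within one datanode, so a single hot volume like /data1 can still fill up even after a rebalance.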
You can wait for others to reply. I am sure there will be far better solutions from the experts here.

On Mon, Jun 10, 2013 at 3:18 PM, Mayank <[email protected]> wrote:

> No, it's not a map-reduce job. We have a Java app running on around 80
> machines which writes to HDFS. The error that I'd mentioned is being
> thrown by the application, and yes, we have the replication factor set
> to 3. The status of HDFS is as follows:
>
> Configured Capacity : 16.15 TB
> DFS Used : 11.84 TB
> Non DFS Used : 872.66 GB
> DFS Remaining : 3.46 TB
> DFS Used% : 73.3 %
> DFS Remaining% : 21.42 %
> Live Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> : 10
> Dead Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD> : 0
> Decommissioning Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING> : 0
> Number of Under-Replicated Blocks : 0
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <[email protected]> wrote:
>
>> When you say the application errors out, does that mean your mapreduce
>> job is erroring? In that case, apart from HDFS space, you will need to
>> look at the mapred tmp directory space as well.
>>
>> You have 400 GB * 4 * 10 = 16 TB of disk, and assuming a replication
>> factor of 3, at most you will have a data size of about 5 TB. I am also
>> assuming you are not scheduling your program to run on the entire 5 TB
>> with just 10 nodes.
>>
>> I suspect your cluster's mapred tmp space is getting filled up while
>> the job is running.
>>
>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:
>>
>>> We are running a Hadoop cluster with 10 datanodes and a namenode. Each
>>> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with
>>> each disk having a capacity of 414 GB.
>>>
>>> hdfs-site.xml has the following property set:
>>>
>>> <property>
>>>   <name>dfs.data.dir</name>
>>>   <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>   <description>Data dirs for DFS.</description>
>>> </property>
>>>
>>> Now we are facing an issue where we find /data1 getting filled up
>>> quickly, and many times we see its usage at 100% with just a few
>>> megabytes of free space. This issue is visible on 7 out of 10
>>> datanodes at present.
>>>
>>> We have some Java applications which write to HDFS, and we often see
>>> the following errors in our application logs:
>>>
>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>
>>> I went through some old discussions, and it looks like manual
>>> rebalancing is what is required in this case; we should also have
>>> dfs.datanode.du.reserved set up.
>>>
>>> However, I'd like to understand whether this issue, with one disk
>>> getting filled up to 100%, can result in the errors we are seeing in
>>> our application.
>>>
>>> Also, are there any other performance implications due to some of the
>>> disks running at 100% usage on a datanode?
>>> --
>>> Mayank Joshi
>>>
>>> Skype: mail2mayank
>>> Mb.: +91 8690625808
>>>
>>> Blog: http://www.techynfreesouls.co.nr
>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>
>>> Today is tomorrow I was so worried about yesterday ...
>>
>> --
>> Nitin Pawar

--
Nitin Pawar
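P.S. For reference, a minimal hdfs-site.xml sketch of the dfs.datanode.du.reserved setting discussed above. The value is the number of bytes reserved per volume for non-DFS use, so HDFS stops allocating blocks before a disk hits 100%; the 10 GB figure here is only an illustrative assumption, not a recommendation:

    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
      <description>Bytes reserved per volume for non-DFS use (10 GB, illustrative).</description>
    </property>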
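And a minimal sketch of the kind of HDFS write path where the exception above surfaces (the class name, fs.default.name URI, and file path are assumptions, not from the thread). The "All datanodes ... are bad" IOException is raised by the client's background DataStreamer once every datanode in the write pipeline has been marked failed, and it is reported to the application on a subsequent write() or on close():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed namenode RPC URI; substitute your own.
            conf.set("fs.default.name", "hdfs://hmaster.production.indix.tv:9000");
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/tmp/write-example.txt"));
            try {
                // Pipeline failures in the background DataStreamer are surfaced
                // here or at close() as:
                // java.io.IOException: All datanodes ... are bad. Aborting...
                out.writeBytes("sample record\n");
            } finally {
                out.close();
            }
            fs.close();
        }
    }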
