Thanks Sandeep. Yes, that's correct; I was more interested to know about
the uneven distribution within the DN.
Thanks,
Rahul

On Fri, Jun 14, 2013 at 6:12 PM, Sandeep L <[email protected]> wrote:

> Rahul,
>
> In general, Hadoop tries to compute data locally most of the time. That
> is, if you run a MapReduce task on a particular input, Hadoop will try
> to compute locally and write the output locally (the majority of the
> time this will happen), then replicate it to other nodes.
>
> In your scenario the majority of your input data may be coming from a
> single datanode, so Hadoop is trying to write the output data to the
> same datanode.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> From: [email protected]
> Date: Fri, 14 Jun 2013 17:50:46 +0530
> Subject: Re: Application errors with one disk on datanode getting filled up to 100%
> To: [email protected]
>
> Thanks Sandeep,
>
> I was thinking that the overall HDFS cluster might get unbalanced over
> time and the balancer might be useful in that case.
> I was more interested to know why only one disk out of the four
> configured disks of the DN is getting all the writes. From whatever I
> have read, writes should happen in a round-robin fashion, which should
> ideally lead to all the configured disks in the DN being similarly
> loaded.
>
> Not sure how the balancer is fixing this issue.
>
> Rgds,
> Rahul
>
> On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <[email protected]> wrote:
>
> Rahul,
>
> In general this issue happens sometimes in Hadoop; there is no single
> exact reason for it.
> To mitigate it you need to run the balancer at regular intervals.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> Date: Fri, 14 Jun 2013 16:39:02 +0530
> Subject: Re: Application errors with one disk on datanode getting filled up to 100%
> From: [email protected]
> To: [email protected]
>
> No, as of this moment we've no idea about the reasons for that behavior.
>
> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee
> <[email protected]> wrote:
>
> Thanks Mayank. Any clue on why only one disk was getting all the writes?
>
> Rahul
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <[email protected]> wrote:
>
> So we did a manual rebalance (followed the instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non-DFS usage via
> dfs.datanode.du.reserved, and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
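For reference, the reservation Mayank describes would look roughly like
this in hdfs-site.xml (a minimal sketch, assuming the Hadoop 1.x property
name; 32212254720 is 30 GB expressed in bytes, and the value applies per
configured volume):

  <property>
    <name>dfs.datanode.du.reserved</name>
    <!-- 30 GB = 30 * 1024^3 bytes, kept free on each volume for non-DFS use -->
    <value>32212254720</value>
  </property>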
On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee
<[email protected]> wrote:

> I have a few points to make; these may not be very helpful for the said
> problem.
>
> + The "All datanodes are bad" exception does not really point to a
> problem related to disk space being full.
> + hadoop.tmp.dir acts as the base location of other Hadoop-related
> properties; not sure if any particular directory is created specifically.
> + Only one disk getting filled looks strange. The other disks were part
> of the configuration when the NN was formatted.
>
> Would be interesting to know the reason for this.
> Please keep us posted.
>
> Thanks,
> Rahul
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <[email protected]> wrote:
>
> From the snapshot, you have around 3 TB left for writing data.
>
> Can you check the individual datanodes' storage health?
> As you said, you have 80 servers writing in parallel to HDFS; I am not
> sure whether that can be an issue.
> As suggested in past threads, you can do a rebalance of the blocks, but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply; I am sure there will be far better
> solutions from the experts for this.
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <[email protected]> wrote:
>
> No, it's not a map-reduce job. We've a Java app running on around 80
> machines which writes to HDFS. The error that I'd mentioned is being
> thrown by the application, and yes, we've the replication factor set to
> 3. Following is the status of HDFS:
>
> Configured Capacity: 16.15 TB
> DFS Used: 11.84 TB
> Non DFS Used: 872.66 GB
> DFS Remaining: 3.46 TB
> DFS Used%: 73.3 %
> DFS Remaining%: 21.42 %
> Live Nodes: 10
> Dead Nodes: 0
> Decommissioning Nodes: 0
> Number of Under-Replicated Blocks: 0
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <[email protected]> wrote:
>
> When you say the application errors out, does that mean your MapReduce
> job is erroring? In that case, apart from HDFS space, you will need to
> look at the mapred tmp directory space as well.
>
> You've got 400 GB * 4 * 10 = 16 TB of disk, and let's assume you have a
> replication factor of 3, so at most you can hold about 5 TB of data.
> I am also assuming you are not scheduling your program to run on the
> entire 5 TB with just 10 nodes.
>
> I suspect your cluster's mapred tmp space is getting filled while the
> job is running.
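The mapred tmp space Nitin refers to is governed by mapred.local.dir,
which in Hadoop 1.x defaults to a directory under hadoop.tmp.dir. A
minimal mapred-site.xml sketch spreading it across the same four disks
(the paths below are illustrative, not taken from this cluster):

  <property>
    <name>mapred.local.dir</name>
    <!-- Comma-separated local dirs for MapReduce intermediate data;
         these fill up independently of HDFS space. -->
    <value>/data1/mapred/local,/data2/mapred/local,/data3/mapred/local,/data4/mapred/local</value>
  </property>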
On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:

> We are running a Hadoop cluster with 10 datanodes and a namenode. Each
> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with
> each disk having a capacity of 414 GB.
>
> hdfs-site.xml has the following property set:
>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>     <description>Data dirs for DFS.</description>
>   </property>
>
> Now we are facing an issue wherein we find /data1 getting filled up
> quickly, and many times we see its usage running at 100% with just a
> few megabytes of free space. This issue is visible on 7 out of 10
> datanodes at present.
>
> We've some Java applications which are writing to HDFS, and many times
> we are seeing the following errors in our application logs:
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
> I went through some old discussions, and it looks like manual
> rebalancing is what is required in this case; we should also have
> dfs.datanode.du.reserved set up.
>
> However, I'd like to understand whether this issue, with one disk
> getting filled up to 100%, can result in the errors which we are seeing
> in our application.
>
> Also, are there any other performance implications due to some of the
> disks running at 100% usage on a datanode.
>
> --
> Mayank Joshi
