No, it's not a map-reduce job. We have a Java app running on around 80 machines which writes to HDFS. The error I mentioned is being thrown by the application, and yes, we have the replication factor set to 3. Following is the status of HDFS:

Configured Capacity : 16.15 TB
DFS Used : 11.84 TB
Non DFS Used : 872.66 GB
DFS Remaining : 3.46 TB
DFS Used% : 73.3 %
DFS Remaining% : 21.42 %
Live Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> : 10
Dead Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD> : 0
Decommissioning Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING> : 0
Number of Under-Replicated Blocks : 0
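On the dfs.datanode.du.reserved point from my earlier mail below, this is roughly what I'm planning to add to hdfs-site.xml on the datanodes. It's only a sketch; the 10 GB value is an assumed placeholder, not something we've tested:

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- Reserved space in bytes per volume for non-DFS use; 10737418240 = 10 GB (placeholder value) -->
  <value>10737418240</value>
  <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>

As far as I understand, this makes the datanode stop placing new blocks on a volume once its free space drops to the reserved amount, which should keep disks like /data1 from hitting 100%.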
On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <[email protected]> wrote:

> When you say the application errors out, does that mean your mapreduce job
> is erroring? In that case, apart from HDFS space, you will need to look at
> the mapred tmp directory space as well.
>
> You've got 400GB * 4 * 10 = 16TB of disk, and let's assume that you have a
> replication factor of 3, so at most you will have a data size of about 5TB.
> I am also assuming you are not scheduling your program to run on the entire
> 5TB with just 10 nodes.
>
> I suspect your cluster's mapred tmp space is getting filled while the
> job is running.
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <[email protected]> wrote:
>
>> We are running a hadoop cluster with 10 datanodes and a namenode. Each
>> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each
>> disk having a capacity of 414GB.
>>
>> hdfs-site.xml has the following property set:
>>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>   <description>Data dirs for DFS.</description>
>> </property>
>>
>> Now we are facing an issue wherein we find /data1 getting filled up
>> quickly, and many times we see its usage running at 100% with just a few
>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>> present.
>>
>> We have some Java applications which are writing to HDFS, and many times
>> we are seeing the following errors in our application logs:
>>
>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>
>> I went through some old discussions, and it looks like manual rebalancing is
>> what is required in this case; we should also have
>> dfs.datanode.du.reserved set up.
>>
>> However, I'd like to understand if this issue, with one disk getting
>> filled up to 100%, can result in the errors we are seeing in our
>> application.
>>
>> Also, are there any other performance implications due to some of the
>> disks running at 100% usage on a datanode?
>>
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.: +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tomorrow I was so worried about yesterday ...
>
>
> --
> Nitin Pawar

--
Mayank Joshi

Skype: mail2mayank
Mb.: +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tomorrow I was so worried about yesterday ...
