Eran,

For 0.90.7-SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated" to a value > 0 (default). This will help the RS survive transient HLog sync failures (with the local DN) by retrying a few times before the RS decides to shut itself down.
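For example, a minimal hbase-site.xml fragment might look like this (the value of 2 here is just an illustration; tune it for your cluster):

```xml
<!-- hbase-site.xml: tolerate a few transient HLog roll/sync failures
     before the region server aborts itself -->
<property>
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <value>2</value>
</property>
```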
Also worth investigating whether you had too much IO load etc. on the box, which led to the DN throwing an error during sync().

P.s. The fix from https://issues.apache.org/jira/browse/HBASE-4222 will also be in CDH3u4.

On Wed, Mar 28, 2012 at 8:39 PM, Eran Kutner <[email protected]> wrote:
> Hi Jimmy,
> HBase is built from the latest sources of the 0.90 branch (0.90.7-SNAPSHOT);
> I had the same problem with 0.90.4.
> Hadoop 0.20.2 from Cloudera CDH3u1.
>
> This failure happens during large M/R jobs. I have 10 servers and usually
> no more than 1 would fail like this, sometimes none.
> One thing worth mentioning is that the table it is trying to write to has
> over 5000 regions.
>
> -eran
>
>
> On Wed, Mar 28, 2012 at 16:17, Jimmy Xiang <[email protected]> wrote:
>
>> Which version of HDFS and HBase are you using?
>>
>> When the problem happens, can you access HDFS, for example, from
>> hadoop dfs?
>>
>> Thanks,
>> Jimmy
>>
>> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <[email protected]> wrote:
>> > Hi,
>> >
>> > We have region servers sporadically stopping under load, supposedly due
>> > to errors writing to HDFS. Things like:
>> >
>> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > while syncing
>> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
>> >
>> > It's happening with a different region server and data node every time,
>> > so it's not a problem with one specific server, and there doesn't seem
>> > to be anything really wrong with either of them. I've already increased
>> > the file descriptor limit, datanode xceivers, and data node handler
>> > count. Any idea what can be causing these errors?
>> >
>> > A more complete log is here: http://pastebin.com/wC90xU2x
>> >
>> > Thanks.
>> >
>> > -eran

-- 
Harsh J
