We had a NameNode go down due to a timeout against the HDFS HA QJM journal nodes:
2015-12-09 04:10:42,723 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits
2015-12-09 04:10:43,708 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.42.28.221:8485, 10.42.28.222:8485, 10.42.28.223:8485], stream=QuorumOutputStream starting at txid 8781293))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:490)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:55)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:486)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:581)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1695)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1669)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:409)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)

While this is disturbing in its own right, I'm further annoyed that HBase shut down two region servers.
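For reference, as far as I can tell the 20000 ms in that WARN/FATAL is the QJM edit-log write timeout, which I believe is controlled by dfs.qjournal.write-txns.timeout.ms in hdfs-site.xml. We have not overridden it, so we are effectively running with the default, i.e. roughly:

  <property>
    <!-- How long the NameNode waits for a quorum of JournalNodes to acknowledge an edit batch.
         20000 ms matches the "timeout=20000 ms" in the WARN above; this is the default, not something we set. -->
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>20000</value>
  </property>

I'm listing it mainly to make clear which knob I think was involved; if that's the wrong property, corrections welcome.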
Furthermore, we had to run hbck -fixAssignments to repair HBase (the commands we ran are at the end of this post), and I'm not sure whether the data from the shut-down regions was available, or whether our HBase service itself was available afterwards:

2015-12-09 04:10:44,320 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@hbase008r09.comp.prod.local,60020,1436412712133 reported a fatal error:
ABORTING region server hbase008r09.comp.prod.local,60020,1436412712133: IOE in log roller
Cause:
java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:716)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:663)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:595)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "hbase008r09.comp.prod.local/10.42.28.192"; destination host is: "hbasenn001.comp.prod.local":8020;
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:713)
    ... 4 more
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "hbase008r09.comp.prod.local/10.42.28.192"; destination host is: "hbasenn001.comp.prod.local":8020;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
    at org.apache.hadoop.ipc.Client.call(Client.java:1228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at com.sun.proxy.$Proxy14.create(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:192)
    at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at com.sun.proxy.$Proxy15.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1298)
    at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1317)
    at org.apache.hadoop.hdfs.DFSClient.primitiveCreate(DFSClient.java:1264)
    at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:97)
    at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:53)
    at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:554)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:663)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:660)
    at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333)
    at org.apache.hadoop.fs.FileContext.create(FileContext.java:660)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:502)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:469)
    at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
    ... 5 more
Caused by: java.io.IOException: Response is null.
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:940)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:835)
2015-12-09 04:10:44,387 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@hbase007r08.comp.prod.local,60020,1436412674179 reported a fatal error:
ABORTING region server hbase007r08.comp.prod.local,60020,1436412674179: IOE in log roller
Cause:
java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:716)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:663)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:595)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "hbase007r08.comp.prod.local/10.42.28.191"; destination host is: "hbasenn001.comp.prod.local":8020;
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:713)
    ... 4 more
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "hbase007r08.comp.prod.local/10.42.28.191"; destination host is: "hbasenn001.comp.prod.local":8020;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
    at org.apache.hadoop.ipc.Client.call(Client.java:1228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at com.sun.proxy.$Proxy14.create(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:192)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at com.sun.proxy.$Proxy15.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1298)
    at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1317)
    at org.apache.hadoop.hdfs.DFSClient.primitiveCreate(DFSClient.java:1264)
    at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:97)
    at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:53)
    at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:554)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:663)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:660)
    at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333)
    at org.apache.hadoop.fs.FileContext.create(FileContext.java:660)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:502)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:469)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
    ... 5 more
Caused by: java.io.IOException: Response is null.
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:940)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:835)
2015-12-09 04:11:01,444 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x44e6c2f20980003, closing socket connection and attempting reconnect
2015-12-09 04:11:34,636 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking getListing of class ClientNamenodeProtocolTranslatorPB. Trying to fail over immediately.
2015-12-09 04:11:34,687 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking getListing of class ClientNamenodeProtocolTranslatorPB after 1 fail over attempts. Trying to fail over after sleeping for 791ms.
2015-12-09 04:11:35,334 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50237,"call":"reportRSFatalError([B@3c97e50c, ABORTING region server hbase008r09.comp.prod.local,60020,1436412712133: IOE in log roller\nCause:\njava.io.IOException: cannot get log writer\n\tat org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:716)\n\tat org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:663)\n\tat org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:595)\n\tat org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)\n\tat java.lang.Thread.run(Thread.java:722)\nCaused by: java.io.IOException: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: \"hbase008r09.comp.prod.local/10.42.28.192\"; destination host is: \"hbasenn001.comp.prod.local\":8020; \n\tat org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)\n\tat org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:713)\n\t... 4 more\nCaused by: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: \"hbase008r09.comp.prod.local/10.42.28.192\"; destination host is: \"hbasenn001.comp.prod.local\":8020; \n\tat org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)\n\tat org.apache.hadoop.ipc.Client.call(Client.java:1228)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)\n\tat com.sun.proxy.$Proxy14.create(Unknown Source)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:192)\n\tat sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:601)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)\n\tat com.sun.proxy.$Proxy15.create(Unknown Source)\n\tat org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1298)\n\tat org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1317)\n\tat org.apache.hadoop.hdfs.DFSClient.primitiveCreate(DFSClient.java:1264)\n\tat org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:97)\n\tat org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:53)\n\tat org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:554)\n\tat org.apache.hadoop.fs.FileContext$3.next(FileContext.java:663)\n\tat org.apache.hadoop.fs.FileContext$3.next(FileContext.java:660)\n\tat org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333)\n\tat org.apache.hadoop.fs.FileContext.create(FileContext.java:660)\n\tat org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:502)\n\tat org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:469)\n\tat sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:601)\n\tat org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)\n\t... 5 more\nCaused by: java.io.IOException: Response is null.\n\tat org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:940)\n\tat org.apache.hadoop.ipc.Client$Connection.run(Client.java:835)\n), rpc version=1, client version=29, methodsFingerPrint=-525182806","client":" 10.42.28.192:52162 ","starttimems":1449659444320,"queuetimems":0,"class":"HMaster","responsesize":0,"method":"reportRSFatalError"}
2015-12-09 04:11:35,409 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hbase004r08.comp.prod.local/10.42.28.188:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2015-12-09 04:11:35,411 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hbase004r08.comp.prod.local/10.42.28.188:2181, initiating session
2015-12-09 04:11:35,413 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x44e6c2f20980003 has expired, closing socket connection
2015-12-09 04:11:35,413 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2015-12-09 04:11:35,414 INFO org.apache.hadoop.hbase.master.HMaster: Primary Master trying to recover from ZooKeeper session expiry.
2015-12-09 04:11:35,416 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hbase004r08.comp.prod.local:2181,hbase003r07.comp.prod.local:2181,hbase005r09.comp.prod.local:2181 sessionTimeout=1200000 watcher=master:60000

... and eventually:

2015-12-09 04:11:46,724 ERROR org.apache.zookeeper.ClientCnxn: Caught unexpected throwable
2015-12-09 04:11:46,724 ERROR org.apache.zookeeper.ClientCnxn: Caught unexpected throwable
java.lang.StackOverflowError
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.PrintWriter.<init>(PrintWriter.java:78)
    at java.io.PrintWriter.<init>(PrintWriter.java:62)
    at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:58)
    at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
    at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
    at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
    at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
    at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
    at org.apache.log4j.Category.callAppenders(Category.java:206)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:623)
    at org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:477)
    at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:640)
    at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:658)
    at org.apache.zookeeper.ClientCnxn.queuePacket(ClientCnxn.java:1286)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:975)
    at org.apache.hadoop.hbase.master.SplitLogManager.deleteNode(SplitLogManager.java:627)
    at org.apache.hadoop.hbase.master.SplitLogManager.access$1600(SplitLogManager.java:96)
    at org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback.processResult(SplitLogManager.java:1106)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
    at org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:477)
    at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:640)
    at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:658)
    at org.apache.zookeeper.ClientCnxn.queuePacket(ClientCnxn.java:1286)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:975)
    at org.apache.hadoop.hbase.master.SplitLogManager.deleteNode(SplitLogManager.java:627)
    at org.apache.hadoop.hbase.master.SplitLogManager.access$1600(SplitLogManager.java:96)
    at org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback.processResult(SplitLogManager.java:1106)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
    at org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:477)
    at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:640)
    at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:658)
    at org.apache.zookeeper.ClientCnxn.queuePacket(ClientCnxn.java:1286)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:975)
    at org.apache.hadoop.hbase.master.SplitLogManager.deleteNode(SplitLogManager.java:627)
    at org.apache.hadoop.hbase.master.SplitLogManager.access$1600(SplitLogManager.java:96)
    at org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback.processResult(SplitLogManager.java:1106)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
    at org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:477)
    at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:640)
    at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:658)
    at org.apache.zookeeper.ClientCnxn.queuePacket(ClientCnxn.java:1286)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:975)
    at org.apache.hadoop.hbase.master.SplitLogManager.deleteNode(SplitLogManager.java:627)
    at org.apache.hadoop.hbase.master.SplitLogManager.access$1600(SplitLogManager.java:96)
    at org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback.processResult(SplitLogManager.java:1106)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
    at org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:477)
    at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:640)
    at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:658)
    at org.apache.zookeeper.ClientCnxn.queuePacket(ClientCnxn.java:1286)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:975)
    at org.apache.hadoop.hbase.master.SplitLogManager.deleteNode(SplitLogManager.java:627)
    at org.apache.hadoop.hbase.master.SplitLogManager.access$1600(SplitLogManager.java:96)
    ...

Since the NameNode failover made the other NameNode active, why did my region servers decide to shut down? The HDFS service seems to have stayed up. How can I make the HBase service more resilient to NameNode failovers?

HBase version: 0.92.1-cdh4.1.3
Hadoop version: 2.0.0-cdh4.1.3
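For completeness, the hbck repair mentioned above was run roughly like this from one of the cluster nodes (the report-only pass is just how I normally check the state first; the -fixAssignments pass is the one we actually needed):

# report-only pass: list region/assignment inconsistencies without changing anything
hbase hbck

# reassign the regions left unassigned after the region servers aborted
hbase hbck -fixAssignments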