Hi St.Ack,

Yes, I do have an HA NN pair. I didn't fail it over, and it wasn't necessary to
clear the standby state or run fsck.
However, I'm a bit concerned about those logs. The logs that I sent to Enis show
that only the standby NN was having issues; the active NN was normal and I
didn't find any errors around that time.
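
To rule out the configuration angle Enis raised, a quick sanity check is that
hbase.rootdir points at the HA nameservice rather than a single NameNode host
(the nameservice and rootdir below come from the logs; the HDP config path is
an assumption):

# expect something like hdfs://nameservice-prod/apps/hbase/data,
# not a single NameNode such as hdfs://hbase-ms5.localdomain:8020/...
grep -A1 'hbase.rootdir' /etc/hbase/conf/hbase-site.xml
# the nameservice should also be visible to the HDFS client
hdfs getconf -confKey dfs.nameservices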

No crashes since then; all region servers are operating normally.
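
For completeness, this is the quick check I use to confirm that all region
servers are up (standard HBase shell status command):

# lists live and dead region servers; a healthy cluster shows 0 dead servers
echo "status 'simple'" | hbase shell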

Thanks
Cheers

On Wed, Feb 17, 2016 at 6:43 AM, Stack <st...@duboce.net> wrote:

> What @Enis said, your NN seems to be in standby mode. You have HA NN? You
> failed over? You need to clear standby state? Run fsck?
> St.Ack
>
> On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola <pedro.gand...@gmail.com>
> wrote:
>
> > Hi Enis,
> >
> > At the time the problem happened, the active NN log did not have any
> > errors or exceptions; however, the standby node has the following:
> >
> > > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid 4916546
> > > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 196 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.5.2.11:54684 Call#22619 Retry#1: org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not supported in state standby*
> > > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 161 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from 10.5.2.11:54684 Call#22620 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> >
> >
> > But the standby node log is full of messages like this:
> >
> > 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem
> > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > > returned error: Operation category READ is not supported in state
> standby
> >
> >
> > Is this normal, or shall I run fsck? I'm trusting Ambari, which says that
> > HDFS is totally OK :).
> >
> > Thanks
> > Cheers
> > Pedro
> >
> > On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar <enis....@gmail.com>
> > wrote:
> >
> > > > Operation category WRITE is not supported in state standby
> > > This looks to be coming from a NN that is in Standby state (or safe mode?).
> > > Did you check whether the underlying HDFS is healthy? Is this an HA
> > > configuration, and does hbase-site.xml contain the correct NN configuration?
> > >
> > > Enis
> > >
> > >
> > > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <
> pedro.gand...@gmail.com>
> > > wrote:
> > >
> > > > Just lost another RegionServer... The cluster was very stable until
> > > > yesterday; losing 2 region servers in less than 24 hours suggests something
> > > > might be wrong in my configuration.
> > > >
> > > > Any insights?
> > > >
> > > > Thank you
> > > > Pedro
> > > >
> > > > *(rs log)*
> > > >
> > > > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for
> > block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > > java.io.IOException: Bad response ERROR for block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from
> > > datanode
> > > > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod
> ]
> > > > > retry.RetryInvocationHandler: Exception while invoking renewLease
> of
> > > > class
> > > > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> > > > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over
> > after
> > > > > sleeping for 1261ms.
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > > > Operation category WRITE is not supported in state standby
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > > > > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > > > > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > > > > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > > > > at
> > org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > > > > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > > > > at java.lang.Thread.run(Thread.java:745)
> > > > > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > hdfs.DFSClient: Error Recovery for block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in
> > pipeline
> > > > > DatanodeInfoWithStorage[10.5.2.11:50010
> > > > ,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> > > > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> > > > > DatanodeInfoWithStorage[10.5.2.57:50010
> > > > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> > > > > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> > > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > > 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > retry.RetryInvocationHandler: Exception while invoking
> > > > > getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB
> > over
> > > > > hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over
> > > immediately.
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > > > Operation category READ is not supported in state standby
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > > 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > hdfs.DFSClient: DataStreamer Exception
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > > No lease on
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > (inode 641732): File is not open for writing. Holder
> > > > > DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > > 2016-02-16 14:22:38,134 INFO
> > > > >  [main-SendThread(zookeeper3.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > > > Opening socket connection to server zookeeper3.localdomain/
> > > > 10.5.2.166:2181.
> > > > > Will not attempt to authenticate using SASL (unknown error)
> > > > > 2016-02-16 14:22:38,137 INFO
> > > > >  [main-SendThread(zookeeper3.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > > > Socket connection established to zookeeper3.localdomain/
> > > 10.5.2.166:2181,
> > > > > initiating session
> > > > > 2016-02-16 14:22:38,144 INFO
> > > > >  [main-SendThread(zookeeper3.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > > > Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d
> > has
> > > > > expired, closing socket connection
> > > > > 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> > > > > regionserver.HRegionServer: ABORTING region server
> > > > > hbase-rs6.localdomain,16020,1455560986600:
> > > > > regionserver:16020-0x25278d348e90c8d,
> > > > >
> > > >
> > >
> >
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> > > > > baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received
> > expired
> > > > from
> > > > > ZooKeeper, aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > > > at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > >
> > > >
> > > > On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <
> > pedro.gand...@gmail.com>
> > > > wrote:
> > > >
> > > > > Just missed one of your questions.
> > > > >
> > > > >> Were you trying to restart it and it wouldn't come up? It kept doing the
> > > > >> NPE on each restart?
> > > > >> Or it happened once and killed the regionserver?
> > > > >
> > > > >
> > > > > I received an alarm for the RS, and when I looked I saw the cluster was not
> > > > > processing data, so I tried to restart the RS. Everything went back to
> > > > > normal, but the RS crashed again, so I decided to remove it completely until
> > > > > I know more about the problem. Now, looking into the restart code, I see
> > > > > that the problem might be something else.
> > > > >
> > > > > *(log.out) *
> > > > >
> > > > >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748 Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> > > > >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or directory
> > > > >
> > > > >
> > > > > *(log after restart)*
> > > > >
> > > > >> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >>
> > > >
> > >
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:java.io.tmpdir=/tmp
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:java.compiler=<NA>
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:os.name=Linux
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:os.arch=amd64
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:user.name=hbase
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:user.home=/home/hbase
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:user.dir=/home/hbase
> > > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper:
> Initiating
> > > > >> client connection,
> > > > >>
> > > >
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > > >> sessionTimeout=30000
> > > > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper:
> Initiating
> > > > >> client connection,
> > > > >>
> > > >
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > > >> sessionTimeout=30000
> > > > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > > >> 2016-02-15 22:21:23,754 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Opening socket connection to server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL
> > (unknown
> > > > >> error)
> > > > >> 2016-02-15 22:21:23,755 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Opening socket connection to server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL
> > (unknown
> > > > >> error)
> > > > >> 2016-02-15 22:21:23,767 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Socket connection established to zookeeper2.localdomain/
> > > 10.5.1.105:2181
> > > > ,
> > > > >> initiating session
> > > > >> 2016-02-15 22:21:23,767 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Socket connection established to zookeeper2.localdomain/
> > > 10.5.1.105:2181
> > > > ,
> > > > >> initiating session
> > > > >> 2016-02-15 22:21:23,777 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Session establishment complete on server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated
> timeout
> > =
> > > > >> 30000
> > > > >> 2016-02-15 22:21:23,777 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Session establishment complete on server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated
> timeout
> > =
> > > > >> 30000
> > > > >>
> > > > >
> > > > > I don't see any exception during the restart, but for some reason the
> > > > > process got killed, and I don't see any SIGTERM or OOM in the Linux journal.
> > > > >
> > > > > Cheers
> > > > > Pedro
> > > > >
> > > > >
> > > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <
> > > pedro.gand...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi St.Ack,
> > > > >>
> > > > >> Thank you for your help. I have attached ~5 minutes' worth of data before
> > > > >> the crash.
> > > > >>
> > > > >> I restarted the cluster to update some configurations; after that I moved
> > > > >> some regions around to balance the cluster, and just to ensure data
> > > > >> locality I ran a major compaction. After that I connected my app, and 2
> > > > >> hours later this happened. So by the time of the crash I was not
> > > > >> performing any operation on the cluster; it was running normally.
> > > > >>
> > > > >> Thank you,
> > > > >> Cheers
> > > > >> Pedro
> > > > >>
> > > > >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> > > > >>
> > > > >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> > > > pedro.gand...@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > Hi Guys,
> > > > >>> >
> > > > >>> > One of my region servers got into a state where it was unable
> to
> > > > start
> > > > >>> and
> > > > >>> > the cluster was not receiving traffic for some time:
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Were you trying to restart it and it wouldn't come up? It kept
> > doing
> > > > the
> > > > >>> NPE on each restart?
> > > > >>>
> > > > >>> Or it happened once and killed the regionserver?
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *(master log)*
> > > > >>> >
> > > > >>> > > 2016-02-15 22:04:33,186 ERROR
> > > > >>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > > > >>> > master.MasterRpcServices:
> > > > >>> > > Region server hbase-rs9.localdomain,16020,1455560991134
> > reported
> > > a
> > > > >>> fatal
> > > > >>> > > error:
> > > > >>> > > ABORTING region server
> > hbase-rs9.localdomain,16020,1455560991134:
> > > > >>> IOE in
> > > > >>> > > log roller
> > > > >>> > > Cause:
> > > > >>> > > java.io.IOException: java.lang.NullPointerException
> > > > >>> > > at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > > >>> > > at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > > >>> > > at java.lang.Thread.run(Thread.java:745)
> > > > >>> > > Caused by: java.lang.NullPointerException
> > > > >>> > > at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > > > >>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > > > >>> > > master.CatalogJanitor: Failed scan of catalog table
> > > > >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException:
> > Failed
> > > > >>> after
> > > > >>> > > attempts=351, exceptions:
> > > > >>> > > Mon Feb 15 22:05:45 UTC 2016, null,
> > > > java.net.SocketTimeoutException:
> > > > >>> > > callTimeout=60000, callDuration=68222: row '' on table
> > > 'hbase:meta'
> > > > >>> at
> > > > >>> > > region=hbase:meta,,1.1588230740,
> > > > >>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > > >>> > > at
> > > > >>> >
> > > > >>>
> > > > >>> You running with read replicas enabled?
> > > > >>>
> > > > >>>
> > > > >>> ....
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *(regionserver log)*
> > > > >>>
> > > > >>>
> > > > >>> Anything before this log message about rolling the WAL?
> > > > >>>
> > > > >>>
> > > > >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error
> syncing,
> > > > >>> request
> > > > >>> > close of wal
> > > > >>> > java.io.IOException: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > > >>> > Caused by: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >>> > ... 2 more
> > > > >>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync
> > cost:
> > > > 234
> > > > >>> ms,
> > > > >>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> > > > >>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > > > >>> > DatanodeInfoWithStorage[10.5.2.95:50010
> > > > >>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > > > >>> > DatanodeInfoWithStorage[10.5.2.57:50010
> > > > >>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > > > >>> > 2016-02-15 22:04:33,043 FATAL
> > > > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: ABORTING region server
> > > > >>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > > > >>> > java.io.IOException: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > > >>> > Caused by: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >>> > ... 2 more
> > > > >>> > 2016-02-15 22:04:33,043 FATAL
> > > > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: RegionServer abort: loaded
> > coprocessors
> > > > >>> are:
> > > > >>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > > > >>> >
> > > >
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > > > >>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > > > >>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > > > >>> >
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > > > >>> > org.apache.phoenix.hbase.index.Indexer,
> > > > >>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > > > >>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > > > >>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > > > >>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > > > >>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > > > >>> > 2016-02-15 22:04:33,136 INFO
> > > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" : "java.lang:type=Memory",
> > > > >>> >     "modelerType" : "sun.management.MemoryImpl",
> > > > >>> >     "Verbose" : true,
> > > > >>> >     "HeapMemoryUsage" : {
> > > > >>> >       "committed" : 15032385536,
> > > > >>> >       "init" : 15032385536,
> > > > >>> >       "max" : 15032385536,
> > > > >>> >       "used" : 3732843792
> > > > >>> >     },
> > > > >>> >     "ObjectPendingFinalizationCount" : 0,
> > > > >>> >     "NonHeapMemoryUsage" : {
> > > > >>> >       "committed" : 104660992,
> > > > >>> >       "init" : 2555904,
> > > > >>> >       "max" : -1,
> > > > >>> >       "used" : 103018984
> > > > >>> >     },
> > > > >>> >     "ObjectName" : "java.lang:type=Memory"
> > > > >>> >   } ],
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > > > >>> >     "modelerType" : "RegionServer,sub=IPC",
> > > > >>> >     "tag.Context" : "regionserver",
> > > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > > >>> >   } ],
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" :
> > > > "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > > > >>> >     "modelerType" : "RegionServer,sub=Replication",
> > > > >>> >     "tag.Context" : "regionserver",
> > > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > > >>> >   } ],
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" :
> "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > > > >>> >     "modelerType" : "RegionServer,sub=Server",
> > > > >>> >     "tag.Context" : "regionserver",
> > > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > > >>> >   } ]
> > > > >>> > }
> > > > >>> > 2016-02-15 22:04:33,194 INFO
> > > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: STOPPED: IOE in log roller
> > > > >>>
> > > > >>> Looks like we were in the log roller, closing the WAL to put up a new
> > > > >>> one, but a sync was trying to run at the same time and the stream went
> > > > >>> away under it... NPE... which caused the server abort.
> > > > >>>
> > > > >>> ...
> > > > >>>
> > > > >>>
> > > > >>> Any idea what happened? Today I moved some regions around to balance the
> > > > >>> cluster, ran a major compaction after that, and added more threads to run
> > > > >>> small and large compactions. Could this be related?
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Yeah, pass more of the regionserver log from before the NPE... to see
> > > > >>> what was going on at the time. I don't think the above changes are related.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> I see that in the current branch the class *ProtobufLogWriter:176* already
> > > > >>> contains a try..catch:
> > > > >>>
> > > > >>> try {
> > > > >>> >   return this.output.getPos();
> > > > >>> > } catch (NullPointerException npe) {
> > > > >>> >   // Concurrent close...
> > > > >>> >   throw new IOException(npe);
> > > > >>> > }
> > > > >>>
> > > > >>>
> > > > >>> Yeah. Looks like the thought is that an NPE could happen here because the
> > > > >>> stream is being closed out from under us by another thread (the syncers
> > > > >>> are off on their own, just worried about syncing; this is just to get the
> > > > >>> NPE up out of the syncer thread and into the general WAL subsystem). It
> > > > >>> looks like this caused the regionserver abort. I would think it should
> > > > >>> just provoke a WAL roll, not an abort. Put up more of the RS log, please.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> > But it would be nice to understand the root cause of this error and
> > > > >>> > whether there is any misconfiguration on my side.
> > > > >>> >
> > > > >>> >
> > > > >>> Not your config, I'd say (if your config brought it on, it's a bug).
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> > *Version: *HBase 1.1.2
> > > > >>> >
> > > > >>> >
> > > > >>>
> > > > >>> St.Ack
> > > > >>>
> > > > >>> P.S. It looks like a gentleman had a similar issue at the end of last year:
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3cca+nge+rofdtxbolmc0s4bcgwvtyecyjapzwxp5zju-gm4da...@mail.gmail.com%3E
> > > > >>> No response then.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> > Thank you
> > > > >>> > Cheers
> > > > >>> > Pedro
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>
