Re: NPE on ProtobufLogWriter.sync()
Hi St.Ack,

Yes, I have HA NN. I didn't fail it over, and it was not necessary to clear the standby state or run fsck. However, I'm a bit concerned about those logs. The logs that I sent to Enis show that only the standby NN was having issues; the active NN was normal and I didn't find errors around that time. No crashes since then; all region servers are operating normally.

Thanks
Cheers

On Wed, Feb 17, 2016 at 6:43 AM, Stack wrote:

> What @Enis said, your NN seems to be in standby mode. You have HA NN? You
> failed over? You need to clear standby state? Run fsck?
> St.Ack
>
> On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola wrote:
>
> > Hi Enis,
> >
> > At the time that the problem happened, the active NN log does not have any
> > error or exceptions however the standby node has the following:
> >
> > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid 4916546
> > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 196 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.5.2.11:54684 Call#22619 Retry#1: org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not supported in state standby*
> > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 161 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from 10.5.2.11:54684 Call#22620 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> >
> > But the standby node log is full of messages like this:
> >
> > 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> >
> > Is this normal? or shall I run a fsck? I'm trusting in Ambari that says
> > that HDFS is totally OK :).
> >
> > Thanks
> > Cheers
> > Pedro
> >
> > On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar wrote:
> >
> > > > Operation category WRITE is not supported in state standby
> > > This looks to be coming from a NN that is in Standby state (or safe mode?).
> > > Did you check whether underlying HDFS is healthy. Is this HA configuration
> > > and hbase-site.xml contains the correct NN configuration?
> > >
> > > Enis
> > >
> > > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <pedro.gand...@gmail.com> wrote:
> > >
> > > > Just lost another RegionServer... The cluster was very stable until
> > > > yesterday, 2 region servers in less than 24 hours something might be wrong
> > > > in my configuration.
> > > >
> > > > Any insights?
> > > >
> > > > Thank you
> > > > Pedro
> > > >
> > > > *(rs log)*
> > > >
> > > > > 2016-02-16 14:22:37,749 WARN [ResponseProcessor for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > > java.io.IOException: Bad response ERROR for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > > 2016-02-16 14:22:37,751 INFO [LeaseRenewer:hbase@nameservice-prod] retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after sleeping for 1261ms.
> > > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
> > > > > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > > at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > > at
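For anyone hitting the same symptoms: the quickest way to confirm the picture above is to ask HDFS directly which NameNode is active and whether it sees any corruption. A small sketch; `nameservice-prod` is taken from the logs in this thread, while the NameNode IDs `nn1`/`nn2` are placeholders for whatever your `dfs.ha.namenodes.*` setting lists.

  # Which NameNode currently holds the active role (nn1/nn2 are placeholder IDs)
  hdfs haadmin -getServiceState nn1
  hdfs haadmin -getServiceState nn2

  # Make sure neither NameNode is stuck in safe mode
  hdfs dfsadmin -safemode get

  # Filesystem-level check, scoped to the HBase root seen in the logs
  hdfs fsck /apps/hbase/data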
Re: NPE on ProtobufLogWriter.sync()
What @Enis said, your NN seems to be in standby mode. You have HA NN? You failed over? You need to clear standby state? Run fsck? St.Ack On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandolawrote: > Hi Enis, > > At the time that the problem happened, the active NN log does not have any > error or exceptions however the standby node has the following: > > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer > > (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from > txid > > 4916546 > > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks > > returned error: Operation category READ is not supported in state standby > > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks > > returned error: Operation category READ is not supported in state standby > > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC > > Server handler 196 on 8020, call > > org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from > > 10.5.2.11:54684 Call#22619 Retry#1: > > org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not > > supported in state standby* > > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC > > Server handler 161 on 8020, call > > org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from > > 10.5.2.11:54684 Call#22620 Retry#0: > > org.apache.hadoop.ipc.StandbyException: Operation category READ is not > > supported in state standby > > > But the standby node log is full of messages like this: > > 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks > > returned error: Operation category READ is not supported in state standby > > > Is this normal? or shall I run a fsck? I'm trusting in Ambari that says > that HDFS is totally OK :). > > Thanks > Cheers > Pedro > > On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar > wrote: > > > > Operation category WRITE is not supported in state standby > > This looks to be coming from a NN that is in Standby state (or safe > mode?). > > Did you check whether underlying HDFS is healthy. Is this HA > configuration > > and hbase-site.xml contains the correct NN configuration? > > > > Enis > > > > > > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola > > wrote: > > > > > Just lost another RegionServer... The cluster was very stable until > > > yesterday, 2 region servers in less than 24 hours something might be > > wrong > > > in my configuration. > > > > > > Any insights? 
> > > > > > Thank you > > > Pedro > > > > > > *(rs log)* > > > > > > > 2016-02-16 14:22:37,749 WARN [ResponseProcessor for block > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > > > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for > block > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 > > > > java.io.IOException: Bad response ERROR for block > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from > > datanode > > > > DatanodeInfoWithStorage[10.5.1.117:50010 > > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786) > > > > 2016-02-16 14:22:37,751 INFO [LeaseRenewer:hbase@nameservice-prod] > > > > retry.RetryInvocationHandler: Exception while invoking renewLease of > > > class > > > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/ > > > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over > after > > > > sleeping for 1261ms. > > > > > > > > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > > > > Operation category WRITE is not supported in state standby > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872) > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306) > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457) > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992) > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652) > > > > at > > > > > > > > > > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > > > at > > > > > > > > > >
Re: NPE on ProtobufLogWriter.sync()
Hi Enis, At the time that the problem happened, the active NN log does not have any error or exceptions however the standby node has the following: 2016-02-16 14:21:51,039 INFO ha.EditLogTailer > (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid > 4916546 > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks > returned error: Operation category READ is not supported in state standby > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks > returned error: Operation category READ is not supported in state standby > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC > Server handler 196 on 8020, call > org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from > 10.5.2.11:54684 Call#22619 Retry#1: > org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not > supported in state standby* > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC > Server handler 161 on 8020, call > org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from > 10.5.2.11:54684 Call#22620 Retry#0: > org.apache.hadoop.ipc.StandbyException: Operation category READ is not > supported in state standby But the standby node log is full of messages like this: 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks > returned error: Operation category READ is not supported in state standby Is this normal? or shall I run a fsck? I'm trusting in Ambari that says that HDFS is totally OK :). Thanks Cheers Pedro On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutarwrote: > > Operation category WRITE is not supported in state standby > This looks to be coming from a NN that is in Standby state (or safe mode?). > Did you check whether underlying HDFS is healthy. Is this HA configuration > and hbase-site.xml contains the correct NN configuration? > > Enis > > > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola > wrote: > > > Just lost another RegionServer... The cluster was very stable until > > yesterday, 2 region servers in less than 24 hours something might be > wrong > > in my configuration. > > > > Any insights? > > > > Thank you > > Pedro > > > > *(rs log)* > > > > > 2016-02-16 14:22:37,749 WARN [ResponseProcessor for block > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 > > > java.io.IOException: Bad response ERROR for block > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from > datanode > > > DatanodeInfoWithStorage[10.5.1.117:50010 > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > > > at > > > > > > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786) > > > 2016-02-16 14:22:37,751 INFO [LeaseRenewer:hbase@nameservice-prod] > > > retry.RetryInvocationHandler: Exception while invoking renewLease of > > class > > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/ > > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after > > > sleeping for 1261ms. 
> > > > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > > > Operation category WRITE is not supported in state standby > > > at > > > > > > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > > > at > > > > > > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872) > > > at > > > > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306) > > > at > > > > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457) > > > at > > > > > > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992) > > > at > > > > > > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652) > > > at > > > > > > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > > at > > > > > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) > > > at java.security.AccessController.doPrivileged(Native Method) > > > at javax.security.auth.Subject.doAs(Subject.java:422) > > > at > > > > > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > > > at
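A note on the recurring getCorruptFiles warning: it is typically triggered by a monitoring client (for example the NameNode web UI or Ambari's health checks) polling the standby NameNode, which refuses READ operations, so by itself it is noise rather than evidence of corruption. To check for real corruption explicitly (standard commands, no cluster-specific assumptions beyond HDFS superuser rights):

  # List any files with corrupt blocks, as seen by the active NameNode
  hdfs fsck / -list-corruptfileblocks

  # Summary counters for missing / corrupt / under-replicated blocks
  hdfs dfsadmin -report | grep -iE 'missing|corrupt|under replicated'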
Re: NPE on ProtobufLogWriter.sync()
Hi St.Ack I attached the log in my previous response. Thanks Cheers On Tue, Feb 16, 2016 at 11:51 PM, Stackwrote: > On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola > wrote: > > > Just missed one of your questions. > > > > Were you trying to restart it and it wouldn't come up? It kept doing the > > > NPE on each restart? > > > Or it happened once and killed the regionserver? > > > > > > I received an alarm for the RS and I went there I saw the cluster not > > processing data and I tried to restart the RS. Everything went to normal > > but the RS crashed again then I decided to remove it completely until > know > > more about the problem. Now looking into the restarting code I see that > the > > problem might be other. > > > > *(log.out) * > > > > > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: > 20748 > > > Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase > > > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} > 2>&1 > > > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or > > > directory > > > > > > *(log after restart)* > > > > > 2016-02-15 22:21:23,692 INFO [main] zookeeper.ZooKeeper: Client > > > > > > environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:java.io.tmpdir=/tmp > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:java.compiler= > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:os.name=Linux > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:os.arch=amd64 > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:os.version=3.10.0-327.3.1.el7.x86_64 > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:user.name=hbase > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:user.home=/home/hbase > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:user.dir=/home/hbase > > > 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating > > > client connection, > > > > > > connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 > > > sessionTimeout=3 > > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 > > > 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating > > > client connection, > > > > > > connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 > > > sessionTimeout=3 > > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 > > > 2016-02-15 22:21:23,754 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Opening socket connection to server zookeeper2.localdomain/ > > 10.5.1.105:2181. > > > Will not attempt to authenticate using SASL (unknown error) > > > 2016-02-15 22:21:23,755 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Opening socket connection to server zookeeper2.localdomain/ > > 10.5.1.105:2181. 
> > > Will not attempt to authenticate using SASL (unknown error) > > > 2016-02-15 22:21:23,767 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Socket connection established to zookeeper2.localdomain/ > 10.5.1.105:2181, > > > initiating session > > > 2016-02-15 22:21:23,767 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Socket connection established to zookeeper2.localdomain/ > 10.5.1.105:2181, > > > initiating session > > > 2016-02-15 22:21:23,777 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Session establishment complete on server zookeeper2.localdomain/ > > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = > > 3 > > > 2016-02-15 22:21:23,777 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Session establishment complete on server zookeeper2.localdomain/ > > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = > > 3 > > > > > > > I don't see any exception during the restart but for some reason the > > process got killed and I don't see any SIGTERM or OOM on linux journal. > > > > > Is that the last of the log? Just like that? > St.Ack > > > > Cheers > > Pedro > > > > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola > > wrote: > > > > > Hi St.Ack, > > > > > > Thank you for your help. I have attached ~5 min worth of data before > the > > > crash. > > > > > > I restarted the cluster to update some configurations after that I > moved > > > some regions around to balance the cluster and just to ensure the data > > >
Re: NPE on ProtobufLogWriter.sync()
On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandolawrote: > Just missed one of your questions. > > Were you trying to restart it and it wouldn't come up? It kept doing the > > NPE on each restart? > > Or it happened once and killed the regionserver? > > > I received an alarm for the RS and I went there I saw the cluster not > processing data and I tried to restart the RS. Everything went to normal > but the RS crashed again then I decided to remove it completely until know > more about the problem. Now looking into the restarting code I see that the > problem might be other. > > *(log.out) * > > > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748 > > Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase > > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1 > > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or > > directory > > > *(log after restart)* > > > 2016-02-15 22:21:23,692 INFO [main] zookeeper.ZooKeeper: Client > > > environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:java.io.tmpdir=/tmp > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:java.compiler= > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:os.name=Linux > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:os.arch=amd64 > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:os.version=3.10.0-327.3.1.el7.x86_64 > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:user.name=hbase > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:user.home=/home/hbase > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > environment:user.dir=/home/hbase > > 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating > > client connection, > > > connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 > > sessionTimeout=3 > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 > > 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating > > client connection, > > > connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 > > sessionTimeout=3 > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 > > 2016-02-15 22:21:23,754 INFO > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > Opening socket connection to server zookeeper2.localdomain/ > 10.5.1.105:2181. > > Will not attempt to authenticate using SASL (unknown error) > > 2016-02-15 22:21:23,755 INFO > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > Opening socket connection to server zookeeper2.localdomain/ > 10.5.1.105:2181. 
> > Will not attempt to authenticate using SASL (unknown error) > > 2016-02-15 22:21:23,767 INFO > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, > > initiating session > > 2016-02-15 22:21:23,767 INFO > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, > > initiating session > > 2016-02-15 22:21:23,777 INFO > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > Session establishment complete on server zookeeper2.localdomain/ > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = > 3 > > 2016-02-15 22:21:23,777 INFO > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > Session establishment complete on server zookeeper2.localdomain/ > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = > 3 > > > > I don't see any exception during the restart but for some reason the > process got killed and I don't see any SIGTERM or OOM on linux journal. > > Is that the last of the log? Just like that? St.Ack > Cheers > Pedro > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola > wrote: > > > Hi St.Ack, > > > > Thank you for your help. I have attached ~5 min worth of data before the > > crash. > > > > I restarted the cluster to update some configurations after that I moved > > some regions around to balance the cluster and just to ensure the data > > locality I ran a major compaction. After that I connected my App and 2 > > hours later this happened. So by the time of the crash I was not > > performing any operation over the cluster it was running normally. > > > > Thank you, > > Cheers > > Pedro > > > > On Tue, Feb 16, 2016 at 2:01 AM, Stack wrote: > > > >> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola
Re: NPE on ProtobufLogWriter.sync()
> Operation category WRITE is not supported in state standby This looks to be coming from a NN that is in Standby state (or safe mode?). Did you check whether underlying HDFS is healthy. Is this HA configuration and hbase-site.xml contains the correct NN configuration? Enis On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandolawrote: > Just lost another RegionServer... The cluster was very stable until > yesterday, 2 region servers in less than 24 hours something might be wrong > in my configuration. > > Any insights? > > Thank you > Pedro > > *(rs log)* > > > 2016-02-16 14:22:37,749 WARN [ResponseProcessor for block > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 > > java.io.IOException: Bad response ERROR for block > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode > > DatanodeInfoWithStorage[10.5.1.117:50010 > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > > at > > > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786) > > 2016-02-16 14:22:37,751 INFO [LeaseRenewer:hbase@nameservice-prod] > > retry.RetryInvocationHandler: Exception while invoking renewLease of > class > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/ > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after > > sleeping for 1261ms. > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > > Operation category WRITE is not supported in state standby > > at > > > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > > at > > > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872) > > at > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306) > > at > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457) > > at > > > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992) > > at > > > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652) > > at > > > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131) > > at org.apache.hadoop.ipc.Client.call(Client.java:1427) > > at org.apache.hadoop.ipc.Client.call(Client.java:1358) > > at > > > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > > at com.sun.proxy.$Proxy16.renewLease(Unknown Source) > > at > > > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590) > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) > > at 
> > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > > at > > > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > > at > > > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > > at com.sun.proxy.$Proxy17.renewLease(Unknown Source) > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) > > at > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > > at com.sun.proxy.$Proxy18.renewLease(Unknown Source) > > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892) > > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) > > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) > > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) > > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) > > at java.lang.Thread.run(Thread.java:745) > > 2016-02-16 14:22:37,753 WARN [DataStreamer for file > > > /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 > > block
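One way to answer Enis' configuration questions from the command line. A sketch: the config keys are the standard HDFS HA client settings, `nameservice-prod` comes from the lease-renewer log above, and the `/etc/hbase/conf` path assumes an HDP-style layout.

  # Nameservice and failover settings the HDFS client actually resolves
  hdfs getconf -confKey dfs.nameservices
  hdfs getconf -confKey dfs.ha.namenodes.nameservice-prod
  hdfs getconf -confKey dfs.client.failover.proxy.provider.nameservice-prod

  # hbase.rootdir should reference the nameservice, not a single NameNode host
  grep -A1 hbase.rootdir /etc/hbase/conf/hbase-site.xml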
Re: NPE on ProtobufLogWriter.sync()
Just lost another RegionServer... The cluster was very stable until yesterday, 2 region servers in less than 24 hours something might be wrong in my configuration. Any insights? Thank you Pedro *(rs log)* > 2016-02-16 14:22:37,749 WARN [ResponseProcessor for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 > java.io.IOException: Bad response ERROR for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode > DatanodeInfoWithStorage[10.5.1.117:50010 > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786) > 2016-02-16 14:22:37,751 INFO [LeaseRenewer:hbase@nameservice-prod] > retry.RetryInvocationHandler: Exception while invoking renewLease of class > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/ > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after > sleeping for 1261ms. > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category WRITE is not supported in state standby > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131) > at org.apache.hadoop.ipc.Client.call(Client.java:1427) > at org.apache.hadoop.ipc.Client.call(Client.java:1358) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy16.renewLease(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590) > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.renewLease(Unknown Source) > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > at com.sun.proxy.$Proxy18.renewLease(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892) > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) > at java.lang.Thread.run(Thread.java:745) > 2016-02-16 14:22:37,753 WARN [DataStreamer for file > /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > hdfs.DFSClient: Error Recovery for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline > DatanodeInfoWithStorage[10.5.2.11:50010,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK], > DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK], > DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]: > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010 > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > 2016-02-16 14:22:37,769 INFO
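The pipeline error above names 10.5.1.117 as the bad datanode, so it is worth checking how the NameNode sees that node and what its logs show around 14:22. A sketch; the datanode log path assumes an HDP-style layout.

  # How the NameNode currently sees that datanode
  hdfs dfsadmin -report | grep -A 8 '10.5.1.117'

  # On the 10.5.1.117 host: I/O or xceiver errors around the incident
  grep -iE 'exception|error' /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | grep '2016-02-16 14:2' | tail -n 50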
Re: NPE on ProtobufLogWriter.sync()
Just missed one of your questions.

> Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
> Or it happened once and killed the regionserver?

I received an alarm for the RS, and when I looked, the cluster was not processing data, so I tried to restart the RS. Everything went back to normal, but the RS crashed again, so I decided to remove it completely until I know more about the problem. Now, looking into the restart logs, I see that the problem might be something else.

*(log.out)*

> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748 Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or directory

*(log after restart)*

> 2016-02-15 22:21:23,692 INFO [main] zookeeper.ZooKeeper: Client environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:java.compiler=
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:os.version=3.10.0-327.3.1.el7.x86_64
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:user.name=hbase
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:user.home=/home/hbase
> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client environment:user.dir=/home/hbase
> 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 sessionTimeout=3 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 sessionTimeout=3 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,754 INFO [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,755 INFO [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,767 INFO [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, initiating session
> 2016-02-15 22:21:23,767 INFO [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, initiating session
> 2016-02-15 22:21:23,777 INFO [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Session establishment complete on server zookeeper2.localdomain/10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = 3
> 2016-02-15 22:21:23,777 INFO [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Session establishment complete on server zookeeper2.localdomain/10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = 3

I don't see any exception during the restart, but for some reason the process got killed, and I don't see any SIGTERM or OOM in the Linux journal.

Cheers
Pedro

On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola wrote:

> Hi St.Ack,
>
> Thank you for your help. I have attached ~5 min worth of data before the
> crash.
>
> I restarted the cluster to update some configurations after that I moved
> some regions around to balance the cluster and just to ensure the data
> locality I ran a major compaction. After that I connected my App and 2
> hours later this happened. So by the time of the crash I was not
> performing any operation over the cluster it was running normally.
>
> Thank you,
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 2:01 AM, Stack wrote:
>
>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola wrote:
>>
>> > Hi Guys,
>> >
>> > One of my region servers got into a state where it was unable to start and
>> > the cluster was not receiving traffic for some time:
>>
>> Were you trying to restart it and it wouldn't come up? It kept doing the
>> NPE on each restart?
>>
>> Or it happened once and killed the
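Since the .out file shows the JVM was killed rather than exiting on its own, the kernel log and the journal are where an OOM kill (or any external kill) would normally be recorded. A sketch; the commands assume systemd and kernel logging as on the el7 host shown in the ZooKeeper environment lines above.

  # The OOM killer logs lines like "Out of memory: Kill process <pid>"
  dmesg -T | grep -iE 'out of memory|oom|killed process'

  # Same check via the journal, narrowed to the window around the crash and restart
  journalctl -k --since '2016-02-15 21:30' --until '2016-02-15 22:30' | grep -iE 'oom|killed process'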
Re: NPE on ProtobufLogWriter.sync()
Hi St.Ack, Thank you for your help. I have attached ~5 min worth of data before the crash. I restarted the cluster to update some configurations after that I moved some regions around to balance the cluster and just to ensure the data locality I ran a major compaction. After that I connected my App and 2 hours later this happened. So by the time of the crash I was not performing any operation over the cluster it was running normally. Thank you, Cheers Pedro On Tue, Feb 16, 2016 at 2:01 AM, Stackwrote: > On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola > wrote: > > > Hi Guys, > > > > One of my region servers got into a state where it was unable to start > and > > the cluster was not receiving traffic for some time: > > > > Were you trying to restart it and it wouldn't come up? It kept doing the > NPE on each restart? > > Or it happened once and killed the regionserver? > > > > *(master log)* > > > > > 2016-02-15 22:04:33,186 ERROR > > > [PriorityRpcServer.handler=4,queue=0,port=16000] > > master.MasterRpcServices: > > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a > fatal > > > error: > > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE > in > > > log roller > > > Cause: > > > java.io.IOException: java.lang.NullPointerException > > > at > > > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > > at > > > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > > at java.lang.Thread.run(Thread.java:745) > > > Caused by: java.lang.NullPointerException > > > at > > > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > > ... 2 more 2016-02-15 22:05:45,678 WARN > > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] > > > master.CatalogJanitor: Failed scan of catalog table > > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > > > attempts=351, exceptions: > > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: > > > callTimeout=6, callDuration=68222: row '' on table 'hbase:meta' at > > > region=hbase:meta,,1.1588230740, > > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at > > > > > > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) > > > at > > > > You running with read replicas enabled? > > > > > > > *(regionserver log)* > > > Anything before this log message about rolling the WAL? > > > > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request > > close of wal > > java.io.IOException: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > at > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > ... 
2 more > > 2016-02-15 22:04:33,036 INFO [sync.2] wal.FSHLog: Slow sync cost: 234 > ms, > > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010 > ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK], > > DatanodeInfoWithStorage[10.5.2.95:50010 > ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK], > > DatanodeInfoWithStorage[10.5.2.57:50010 > > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]] > > 2016-02-15 22:04:33,043 FATAL > > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > regionserver.HRegionServer: ABORTING region server > > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller > > java.io.IOException: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > at > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > ... 2 more > > 2016-02-15 22:04:33,043 FATAL > > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: > > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger, > > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, > > org.apache.phoenix.coprocessor.SequenceRegionObserver, > > org.apache.phoenix.coprocessor.ScanRegionObserver, > > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, > > org.apache.phoenix.hbase.index.Indexer, > > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, > > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, > > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, > >
Re: NPE on ProtobufLogWriter.sync()
On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandolawrote: > Hi Guys, > > One of my region servers got into a state where it was unable to start and > the cluster was not receiving traffic for some time: Were you trying to restart it and it wouldn't come up? It kept doing the NPE on each restart? Or it happened once and killed the regionserver? *(master log)* > > > 2016-02-15 22:04:33,186 ERROR > > [PriorityRpcServer.handler=4,queue=0,port=16000] > master.MasterRpcServices: > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal > > error: > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in > > log roller > > Cause: > > java.io.IOException: java.lang.NullPointerException > > at > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > at > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: java.lang.NullPointerException > > at > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > ... 2 more 2016-02-15 22:05:45,678 WARN > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] > > master.CatalogJanitor: Failed scan of catalog table > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > > attempts=351, exceptions: > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: > > callTimeout=6, callDuration=68222: row '' on table 'hbase:meta' at > > region=hbase:meta,,1.1588230740, > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at > > > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) > > at > You running with read replicas enabled? *(regionserver log)* Anything before this log message about rolling the WAL? > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request > close of wal > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > at > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > ... 2 more > 2016-02-15 22:04:33,036 INFO [sync.2] wal.FSHLog: Slow sync cost: 234 ms, > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010 ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK], > DatanodeInfoWithStorage[10.5.2.95:50010 ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK], > DatanodeInfoWithStorage[10.5.2.57:50010 > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]] > 2016-02-15 22:04:33,043 FATAL > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > regionserver.HRegionServer: ABORTING region server > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > at > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > ... 
2 more > 2016-02-15 22:04:33,043 FATAL > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger, > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, > org.apache.phoenix.coprocessor.SequenceRegionObserver, > org.apache.phoenix.coprocessor.ScanRegionObserver, > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, > org.apache.phoenix.hbase.index.Indexer, > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint, > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > 2016-02-15 22:04:33,136 INFO > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > regionserver.HRegionServer: Dump of metrics as JSON on abort: { > "beans" : [ { > "name" : "java.lang:type=Memory", > "modelerType" : "sun.management.MemoryImpl", > "Verbose" : true, > "HeapMemoryUsage" : { > "committed" : 15032385536, > "init" : 15032385536, > "max" : 15032385536, > "used" : 3732843792 > }, > "ObjectPendingFinalizationCount" : 0, > "NonHeapMemoryUsage" : { > "committed" : 104660992, > "init" : 2555904, > "max" : -1, > "used" : 103018984 > }, >
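For St.Ack's read-replica question, the relevant settings can be checked directly. A sketch; the config path assumes an HDP-style layout, and whether `describe` shows REGION_REPLICATION depends on it having been set explicitly on the table.

  # Any region-replica related settings on the cluster
  grep -iE 'region.replica|meta.replica' -B1 -A1 /etc/hbase/conf/hbase-site.xml

  # REGION_REPLICATION > 1 on a table means read replicas are enabled for it
  echo "describe 'hbase:meta'" | hbase shell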
NPE on ProtobufLogWriter.sync()
Hi Guys, One of my region servers got into a state where it was unable to start and the cluster was not receiving traffic for some time: *(master log)* > 2016-02-15 22:04:33,186 ERROR > [PriorityRpcServer.handler=4,queue=0,port=16000] master.MasterRpcServices: > Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal > error: > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in > log roller > Cause: > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > at > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > ... 2 more 2016-02-15 22:05:45,678 WARN > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] > master.CatalogJanitor: Failed scan of catalog table > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > attempts=351, exceptions: > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: > callTimeout=6, callDuration=68222: row '' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195) > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59) > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) > at > org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320) > at > org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295) > at > org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160) > at > org.apache.hadoop.hbase.client.ClientScanner.(ClientScanner.java:155) > at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821) > at > org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193) > at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89) > at > org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:168) > at > org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:120) > at > org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:221) > at > org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:102) > at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.SocketTimeoutException: callTimeout=6, > 
callDuration=68222: row '' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159) > at > org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64) > ... 3 more > Caused by: > org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: > org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server > hbase-rs9.localdomain,16020,1455560991134 aborting > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2185) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) > at java.lang.Thread.run(Thread.java:745) at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at >
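When a regionserver aborts in the log roller like this, one useful follow-up is to look at the WAL directory it was writing and whether any WAL file was left open for write. A sketch; the paths are taken from the logs in this thread, and the directory may carry a -splitting suffix once the master has started recovering it.

  # WAL directory of the aborted server (server name from the master log above)
  hdfs dfs -ls /apps/hbase/data/WALs/hbase-rs9.localdomain,16020,1455560991134

  # WAL files still open for write, or corrupt/missing, after the abort
  hdfs fsck /apps/hbase/data/WALs -openforwrite -files | grep -iE 'openforwrite|corrupt|missing'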