Re: NPE on ProtobufLogWriter.sync()

2016-02-17 Thread Pedro Gandola
Hi St.Ack,

Yes, I do have an HA NN. I didn't fail it over, and it was not necessary to
clear the standby mode or run fsck. However, I'm a bit concerned about
those logs. The logs that I sent to Enis show that only the standby NN was
having issues; the active NN was normal, and I didn't find any errors around
that time.
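For reference, this is roughly how I'd double-check the HA state and filesystem
health from the shell (nn1/nn2 are placeholder HA service IDs; use whatever
dfs.ha.namenodes.<nameservice> lists in hdfs-site.xml):

    # which NameNodes back the nameservice, and which one is currently active
    hdfs getconf -namenodes
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # confirm the active NN is not in safe mode and HDFS reports no corrupt/missing blocks
    hdfs dfsadmin -safemode get
    hdfs fsck / | tail -n 20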

No crashes since then; all region servers are operating normally.

Thanks
Cheers

On Wed, Feb 17, 2016 at 6:43 AM, Stack  wrote:

> What @Enis said, your NN seems to be in standby mode. You have HA NN? You
> failed over? You need to clear standby state? Run fsck?
> St.Ack
>
> On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola 
> wrote:
>
> > Hi Enis,
> >
> > At the time that the problem happened, the active NN log does not have
> any
> > error or exceptions however the standby node has the following:
> >
> > > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid 4916546
> > > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 196 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.5.2.11:54684 Call#22619 Retry#1: org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not supported in state standby*
> > > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 161 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from 10.5.2.11:54684 Call#22620 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> >
> >
> > But the standby node log is full of messages like this:
> >
> > > 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> >
> >
> > Is this normal? or shall I run a fsck? I'm trusting in Ambari that says
> > that HDFS is totally OK :).
> >
> > Thanks
> > Cheers
> > Pedro
> >
> > On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar 
> > wrote:
> >
> > > > Operation category WRITE is not supported in state standby
> > > This looks to be coming from a NN that is in Standby state (or safe
> > mode?).
> > > Did you check whether underlying HDFS is healthy. Is this HA
> > configuration
> > > and hbase-site.xml contains the correct NN configuration?
> > >
> > > Enis
> > >
> > >
> > > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <
> pedro.gand...@gmail.com>
> > > wrote:
> > >
> > > > Just lost another RegionServer... The cluster was very stable until
> > > > yesterday, 2 region servers in less than 24 hours something might be
> > > wrong
> > > > in my configuration.
> > > >
> > > > Any insights?
> > > >
> > > > Thank you
> > > > Pedro
> > > >
> > > > *(rs log)*
> > > >
> > > > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > > java.io.IOException: Bad response ERROR for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod] retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after sleeping for 1261ms.
> > > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
> > > > > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > > at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > > at 

Re: NPE on ProtobufLogWriter.sync()

2016-02-17 Thread Stack
What @Enis said, your NN seems to be in standby mode. You have HA NN? You
failed over? You need to clear standby state? Run fsck?
St.Ack

On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola 
wrote:

> Hi Enis,
>
> At the time that the problem happened, the active NN log does not have any
> error or exceptions however the standby node has the following:
>
> > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid 4916546
> > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 196 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.5.2.11:54684 Call#22619 Retry#1: org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not supported in state standby*
> > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 161 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from 10.5.2.11:54684 Call#22620 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
>
>
> But the standby node log is full of messages like this:
>
> 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem
> > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > returned error: Operation category READ is not supported in state standby
>
>
> Is this normal? or shall I run a fsck? I'm trusting in Ambari that says
> that HDFS is totally OK :).
>
> Thanks
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar 
> wrote:
>
> > > Operation category WRITE is not supported in state standby
> > This looks to be coming from a NN that is in Standby state (or safe
> mode?).
> > Did you check whether underlying HDFS is healthy. Is this HA
> configuration
> > and hbase-site.xml contains the correct NN configuration?
> >
> > Enis
> >
> >
> > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola 
> > wrote:
> >
> > > Just lost another RegionServer... The cluster was very stable until
> > > yesterday, 2 region servers in less than 24 hours something might be
> > wrong
> > > in my configuration.
> > >
> > > Any insights?
> > >
> > > Thank you
> > > Pedro
> > >
> > > *(rs log)*
> > >
> > > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > java.io.IOException: Bad response ERROR for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod] retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after sleeping for 1261ms.
> > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
> > > > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > > at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at 

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Pedro Gandola
Hi Enis,

At the time the problem happened, the active NN log did not have any errors
or exceptions; however, the standby node has the following:

2016-02-16 14:21:51,039 INFO ha.EditLogTailer
> (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid
> 4916546
> 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> returned error: Operation category READ is not supported in state standby
> 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> returned error: Operation category READ is not supported in state standby
> 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC
> Server handler 196 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from
> 10.5.2.11:54684 Call#22619 Retry#1:
> org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not
> supported in state standby*
> 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC
> Server handler 161 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from
> 10.5.2.11:54684 Call#22620 Retry#0:
> org.apache.hadoop.ipc.StandbyException: Operation category READ is not
> supported in state standby


But the standby node log is full of messages like this:

2016-02-17 02:18:44,688 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> returned error: Operation category READ is not supported in state standby


Is this normal, or should I run fsck? I'm trusting Ambari, which says that
HDFS is totally OK :).

Thanks
Cheers
Pedro

On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar  wrote:

> > Operation category WRITE is not supported in state standby
> This looks to be coming from a NN that is in Standby state (or safe mode?).
> Did you check whether underlying HDFS is healthy. Is this HA configuration
> and hbase-site.xml contains the correct NN configuration?
>
> Enis
>
>
> On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola 
> wrote:
>
> > Just lost another RegionServer... The cluster was very stable until
> > yesterday, 2 region servers in less than 24 hours something might be
> wrong
> > in my configuration.
> >
> > Any insights?
> >
> > Thank you
> > Pedro
> >
> > *(rs log)*
> >
> > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > java.io.IOException: Bad response ERROR for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod] retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after sleeping for 1261ms.
> > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
> > > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > at java.security.AccessController.doPrivileged(Native Method)
> > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > at 

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Pedro Gandola
Hi St.Ack

I attached the log in my previous response.

Thanks
Cheers

On Tue, Feb 16, 2016 at 11:51 PM, Stack  wrote:

> On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola 
> wrote:
>
> > Just missed one of your questions.
> >
> > Were you trying to restart it and it wouldn't come up? It kept doing the
> > > NPE on each restart?
> > > Or it happened once and killed the regionserver?
> >
> >
> > I received an alarm for the RS and I went there I saw the cluster not
> > processing data and I tried to restart the RS. Everything went to normal
> > but the RS crashed again then I decided to remove it completely until
> know
> > more about the problem. Now looking into the restarting code I see that
> the
> > problem might be other.
> >
> > *(log.out) *
> >
> > > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214:
> 20748
> > > Killed  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> > > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT}
> 2>&1
> > > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> > > directory
> >
> >
> > *(log after restart)*
> >
> > > 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > >
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:java.io.tmpdir=/tmp
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:java.compiler=
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.name=Linux
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.arch=amd64
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.name=hbase
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.home=/home/hbase
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.dir=/home/hbase
> > > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > client connection,
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > sessionTimeout=3
> > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > client connection,
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > sessionTimeout=3
> > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > 2016-02-15 22:21:23,754 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper2.localdomain/
> > 10.5.1.105:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-15 22:21:23,755 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper2.localdomain/
> > 10.5.1.105:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-15 22:21:23,767 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181,
> > > initiating session
> > > 2016-02-15 22:21:23,767 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181,
> > > initiating session
> > > 2016-02-15 22:21:23,777 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Session establishment complete on server zookeeper2.localdomain/
> > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> > 3
> > > 2016-02-15 22:21:23,777 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Session establishment complete on server zookeeper2.localdomain/
> > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> > 3
> > >
> >
> > I don't see any exception during the restart but for some reason the
> > process got killed and I don't see any SIGTERM or OOM on linux journal.
> >
> >
> Is that the last of the log? Just like that?
> St.Ack
>
>
> > Cheers
> > Pedro
> >
> >
> > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola 
> > wrote:
> >
> > > Hi St.Ack,
> > >
> > > Thank you for your help. I have attached ~5 min worth of data before
> the
> > > crash.
> > >
> > > I restarted the cluster to update some configurations after that I
> moved
> > > some regions around to balance the cluster and just to ensure the data
> > > 

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Stack
On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola 
wrote:

> Just missed one of your questions.
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
> > NPE on each restart?
> > Or it happened once and killed the regionserver?
>
>
> I received an alarm for the RS and I went there I saw the cluster not
> processing data and I tried to restart the RS. Everything went to normal
> but the RS crashed again then I decided to remove it completely until know
> more about the problem. Now looking into the restarting code I see that the
> problem might be other.
>
> *(log.out) *
>
> > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
> > Killed  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> > directory
>
>
> *(log after restart)*
>
> > 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:java.io.tmpdir=/tmp
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:java.compiler=
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:os.name=Linux
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:os.arch=amd64
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:os.version=3.10.0-327.3.1.el7.x86_64
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:user.name=hbase
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:user.home=/home/hbase
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:user.dir=/home/hbase
> > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > client connection,
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > sessionTimeout=3
> > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > client connection,
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > sessionTimeout=3
> > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > 2016-02-15 22:21:23,754 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Opening socket connection to server zookeeper2.localdomain/
> 10.5.1.105:2181.
> > Will not attempt to authenticate using SASL (unknown error)
> > 2016-02-15 22:21:23,755 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Opening socket connection to server zookeeper2.localdomain/
> 10.5.1.105:2181.
> > Will not attempt to authenticate using SASL (unknown error)
> > 2016-02-15 22:21:23,767 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> > initiating session
> > 2016-02-15 22:21:23,767 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> > initiating session
> > 2016-02-15 22:21:23,777 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Session establishment complete on server zookeeper2.localdomain/
> > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> 3
> > 2016-02-15 22:21:23,777 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Session establishment complete on server zookeeper2.localdomain/
> > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> 3
> >
>
> I don't see any exception during the restart but for some reason the
> process got killed and I don't see any SIGTERM or OOM on linux journal.
>
>
Is that the last of the log? Just like that?
St.Ack


> Cheers
> Pedro
>
>
> On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola 
> wrote:
>
> > Hi St.Ack,
> >
> > Thank you for your help. I have attached ~5 min worth of data before the
> > crash.
> >
> > I restarted the cluster to update some configurations after that I moved
> > some regions around to balance the cluster and just to ensure the data
> > locality I ran a major compaction. After that I connected my App and 2
> > hours later  this happened. So by the time of the crash I was not
> > performing any operation over the cluster it was running normally.
> >
> > Thank you,
> > Cheers
> > Pedro
> >
> > On Tue, Feb 16, 2016 at 2:01 AM, Stack  wrote:
> >
> >> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola 

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Enis Söztutar
> Operation category WRITE is not supported in state standby
This looks to be coming from a NN that is in Standby state (or safe mode?).
Did you check whether the underlying HDFS is healthy? Is this an HA
configuration, and does hbase-site.xml contain the correct NN configuration?
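A quick way to sanity-check that (the conf path below is the usual HDP
location; adjust if yours differs):

    hdfs getconf -confKey dfs.nameservices    # should print the nameservice (e.g. nameservice-prod from the logs)
    hdfs getconf -namenodes                   # the NameNode hosts behind it
    grep -A1 hbase.rootdir /etc/hbase/conf/hbase-site.xml   # should point at the nameservice, not a single NN host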

Enis


On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola 
wrote:

> Just lost another RegionServer... The cluster was very stable until
> yesterday, 2 region servers in less than 24 hours something might be wrong
> in my configuration.
>
> Any insights?
>
> Thank you
> Pedro
>
> *(rs log)*
>
> > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > java.io.IOException: Bad response ERROR for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode
> > DatanodeInfoWithStorage[10.5.1.117:50010
> > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> > retry.RetryInvocationHandler: Exception while invoking renewLease of
> class
> > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after
> > sleeping for 1261ms.
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > Operation category WRITE is not supported in state standby
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > at
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > block 

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Pedro Gandola
Just lost another RegionServer... The cluster was very stable until
yesterday; losing two region servers in less than 24 hours suggests something
might be wrong in my configuration.

Any insights?

Thank you
Pedro

*(rs log)*

> 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> java.io.IOException: Bad response ERROR for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode
> DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> retry.RetryInvocationHandler: Exception while invoking renewLease of class
> ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after
> sleeping for 1261ms.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby
> at
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> at java.lang.Thread.run(Thread.java:745)
> 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: Error Recovery for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline
> DatanodeInfoWithStorage[10.5.2.11:50010,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> 2016-02-16 14:22:37,769 INFO  

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Pedro Gandola
Just missed one of your questions.

Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
> Or it happened once and killed the regionserver?


I received an alarm for the RS, and when I got there I saw the cluster was
not processing data, so I tried to restart the RS. Everything went back to
normal, but the RS crashed again, so I decided to remove it completely until
I know more about the problem. Now, looking into the restart code, I see
that the problem might be something else.

*(log.out) *

> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
> Killed  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> directory


*(log after restart)*

> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.io.tmpdir=/tmp
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.compiler=
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.name=Linux
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.arch=amd64
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.version=3.10.0-327.3.1.el7.x86_64
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.name=hbase
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.home=/home/hbase
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.dir=/home/hbase
> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> sessionTimeout=3
> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> sessionTimeout=3
> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,754 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,755 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,767 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> initiating session
> 2016-02-15 22:21:23,767 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> initiating session
> 2016-02-15 22:21:23,777 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Session establishment complete on server zookeeper2.localdomain/
> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = 3
> 2016-02-15 22:21:23,777 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Session establishment complete on server zookeeper2.localdomain/
> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = 3
>

I don't see any exceptions during the restart, but for some reason the
process got killed, and I don't see any SIGTERM or OOM in the Linux journal.
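For completeness, these are the sorts of places I've been grepping for
evidence of the kill (the timestamp below is just an example):

    dmesg -T | grep -iE 'out of memory|killed process'
    journalctl -k --since "2016-02-15 22:00" | grep -iE 'oom|killed'
    grep -iE 'oom|killed process' /var/log/messages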

Cheers
Pedro


On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola 
wrote:

> Hi St.Ack,
>
> Thank you for your help. I have attached ~5 min worth of data before the
> crash.
>
> I restarted the cluster to update some configurations after that I moved
> some regions around to balance the cluster and just to ensure the data
> locality I ran a major compaction. After that I connected my App and 2
> hours later  this happened. So by the time of the crash I was not
> performing any operation over the cluster it was running normally.
>
> Thank you,
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 2:01 AM, Stack  wrote:
>
>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola 
>> wrote:
>>
>> > Hi Guys,
>> >
>> > One of my region servers got into a state where it was unable to start
>> and
>> > the cluster was not receiving traffic for some time:
>>
>>
>>
>> Were you trying to restart it and it wouldn't come up? It kept doing the
>> NPE on each restart?
>>
>> Or it happened once and killed the 

Re: NPE on ProtobufLogWriter.sync()

2016-02-16 Thread Pedro Gandola
Hi St.Ack,

Thank you for your help. I have attached ~5 min worth of data before the
crash.

I restarted the cluster to update some configurations. After that, I moved
some regions around to balance the cluster and, just to ensure data
locality, I ran a major compaction. Then I connected my app, and 2 hours
later this happened. So at the time of the crash I was not performing any
operation on the cluster; it was running normally.
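Roughly the kind of hbase shell operations I mean (table, region and server
names below are placeholders):

    echo "balancer" | hbase shell
    echo "move 'ENCODED_REGION_NAME', 'SERVER_NAME,16020,STARTCODE'" | hbase shell
    echo "major_compact 'my_table'" | hbase shell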

Thank you,
Cheers
Pedro

On Tue, Feb 16, 2016 at 2:01 AM, Stack  wrote:

> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola 
> wrote:
>
> > Hi Guys,
> >
> > One of my region servers got into a state where it was unable to start
> and
> > the cluster was not receiving traffic for some time:
>
>
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
>
> Or it happened once and killed the regionserver?
>
>
>
> *(master log)*
> >
> > > 2016-02-15 22:04:33,186 ERROR
> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > master.MasterRpcServices:
> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
> fatal
> > > error:
> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE
> in
> > > log roller
> > > Cause:
> > > java.io.IOException: java.lang.NullPointerException
> > > at
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > at
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > at java.lang.Thread.run(Thread.java:745)
> > > Caused by: java.lang.NullPointerException
> > > at
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > > master.CatalogJanitor: Failed scan of catalog table
> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > > attempts=351, exceptions:
> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> > > callTimeout=6, callDuration=68222: row '' on table 'hbase:meta' at
> > > region=hbase:meta,,1.1588230740,
> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> > >
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > at
> >
>
> You running with read replicas enabled?
>
>
> 
>
>
>
> *(regionserver log)*
>
>
> Anything before this log message about rolling the WAL?
>
>
> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> > close of wal
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more
> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
> ms,
> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > DatanodeInfoWithStorage[10.5.2.95:50010
> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > DatanodeInfoWithStorage[10.5.2.57:50010
> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > 2016-02-15 22:04:33,043 FATAL
> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: ABORTING region server
> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more
> > 2016-02-15 22:04:33,043 FATAL
> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > org.apache.phoenix.hbase.index.Indexer,
> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > 

Re: NPE on ProtobufLogWriter.sync()

2016-02-15 Thread Stack
On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola 
wrote:

> Hi Guys,
>
> One of my region servers got into a state where it was unable to start and
> the cluster was not receiving traffic for some time:



Were you trying to restart it and it wouldn't come up? It kept doing the
NPE on each restart?

Or it happened once and killed the regionserver?



*(master log)*
>
> > 2016-02-15 22:04:33,186 ERROR
> > [PriorityRpcServer.handler=4,queue=0,port=16000]
> master.MasterRpcServices:
> > Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal
> > error:
> > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in
> > log roller
> > Cause:
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more 2016-02-15 22:05:45,678 WARN
> > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > master.CatalogJanitor: Failed scan of catalog table
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=351, exceptions:
> > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> > callTimeout=6, callDuration=68222: row '' on table 'hbase:meta' at
> > region=hbase:meta,,1.1588230740,
> > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > at
>

Are you running with read replicas enabled?






*(regionserver log)*


Anything before this log message about rolling the WAL?
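Something like this against the RS log might surface the preceding roll
(the log file name below is a placeholder):

    grep -n -B100 'Error syncing, request close of wal' hbase-hbase-regionserver-hbase-rs9.log \
      | grep -iE 'logroller|rolled|fshlog'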


> 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> close of wal
> java.io.IOException: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
>
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms,
> current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> DatanodeInfoWithStorage[10.5.2.95:50010
,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010
> ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> java.io.IOException: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
>
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> org.apache.phoenix.coprocessor.SequenceRegionObserver,
> org.apache.phoenix.coprocessor.ScanRegionObserver,
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> org.apache.phoenix.hbase.index.Indexer,
> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2016-02-15 22:04:33,136 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>   "beans" : [ {
> "name" : "java.lang:type=Memory",
> "modelerType" : "sun.management.MemoryImpl",
> "Verbose" : true,
> "HeapMemoryUsage" : {
>   "committed" : 15032385536,
>   "init" : 15032385536,
>   "max" : 15032385536,
>   "used" : 3732843792
> },
> "ObjectPendingFinalizationCount" : 0,
> "NonHeapMemoryUsage" : {
>   "committed" : 104660992,
>   "init" : 2555904,
>   "max" : -1,
>   "used" : 103018984
> },
> 

NPE on ProtobufLogWriter.sync()

2016-02-15 Thread Pedro Gandola
Hi Guys,

One of my region servers got into a state where it was unable to start and
the cluster was not receiving traffic for some time:

*(master log)*

> 2016-02-15 22:04:33,186 ERROR
> [PriorityRpcServer.handler=4,queue=0,port=16000] master.MasterRpcServices:
> Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal
> error:
> ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in
> log roller
> Cause:
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more 2016-02-15 22:05:45,678 WARN
> [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> master.CatalogJanitor: Failed scan of catalog table
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=351, exceptions:
> Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=6, callDuration=68222: row '' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
> at
> org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
> at
> org.apache.hadoop.hbase.client.ClientScanner.(ClientScanner.java:155)
> at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
> at
> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:168)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:120)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:221)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:102)
> at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.SocketTimeoutException: callTimeout=6,
> callDuration=68222: row '' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
> ... 3 more
> Caused by:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> hbase-rs9.localdomain,16020,1455560991134 aborting
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2185)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> at java.lang.Thread.run(Thread.java:745) at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
>