Just lost another RegionServer... The cluster was very stable until yesterday; losing 2 region servers in less than 24 hours suggests something might be wrong in my configuration.
Any insights? Thank you Pedro *(rs log)* > 2016-02-16 14:22:37,749 WARN [ResponseProcessor for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 > java.io.IOException: Bad response ERROR for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode > DatanodeInfoWithStorage[10.5.1.117:50010 > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786) > 2016-02-16 14:22:37,751 INFO [LeaseRenewer:hbase@nameservice-prod] > retry.RetryInvocationHandler: Exception while invoking renewLease of class > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/ > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after > sleeping for 1261ms. > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category WRITE is not supported in state standby > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131) > at org.apache.hadoop.ipc.Client.call(Client.java:1427) > at org.apache.hadoop.ipc.Client.call(Client.java:1358) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy16.renewLease(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590) > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.renewLease(Unknown Source) > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at 
org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > at com.sun.proxy.$Proxy18.renewLease(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892) > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) > at java.lang.Thread.run(Thread.java:745) > 2016-02-16 14:22:37,753 WARN [DataStreamer for file > /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > hdfs.DFSClient: Error Recovery for block > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline > DatanodeInfoWithStorage[10.5.2.11:50010,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK], > DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK], > DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]: > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010 > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK] > 2016-02-16 14:22:37,769 INFO [DataStreamer for file > /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > retry.RetryInvocationHandler: Exception while invoking > getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB over > hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over immediately. > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131) > at org.apache.hadoop.ipc.Client.call(Client.java:1427) > at org.apache.hadoop.ipc.Client.call(Client.java:1358) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source) > at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412) > 2016-02-16 14:22:37,781 WARN [DataStreamer for file > /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] > hdfs.DFSClient: DataStreamer Exception > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on > /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 > (inode 641732): File is not open for writing. Holder > DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files. 
> at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131) > at org.apache.hadoop.ipc.Client.call(Client.java:1427) > at org.apache.hadoop.ipc.Client.call(Client.java:1358) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412) > 2016-02-16 14:22:38,134 INFO > [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn: > Opening socket connection to server zookeeper3.localdomain/10.5.2.166:2181. 
> Will not attempt to authenticate using SASL (unknown error) > 2016-02-16 14:22:38,137 INFO > [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn: > Socket connection established to zookeeper3.localdomain/10.5.2.166:2181, > initiating session > 2016-02-16 14:22:38,144 INFO > [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn: > Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d has > expired, closing socket connection > 2016-02-16 14:22:38,144 FATAL [main-EventThread] > regionserver.HRegionServer: ABORTING region server > hbase-rs6.localdomain,16020,1455560986600: > regionserver:16020-0x25278d348e90c8d, > quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181, > baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received expired from > ZooKeeper, aborting > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <[email protected]> wrote: > Just missed one of your questions. > > Were you trying to restart it and it wouldn't come up? It kept doing the >> NPE on each restart? >> Or it happened once and killed the regionserver? > > > I received an alarm for the RS; when I went to check, the cluster was not > processing data, so I tried to restart the RS. Everything went back to > normal, but the RS crashed again, so I decided to remove it completely until > I know more about the problem. Now, looking into the restart code, I see > that the problem might be something else. 
> > *(log.out) * > >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748 >> Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase >> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1 >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or >> directory > > > *(log after restart)* > >> 2016-02-15 22:21:23,692 INFO [main] zookeeper.ZooKeeper: Client >> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:java.io.tmpdir=/tmp >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:java.compiler=<NA> >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:os.name=Linux >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:os.arch=amd64 >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:os.version=3.10.0-327.3.1.el7.x86_64 >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:user.name=hbase >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:user.home=/home/hbase >> 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client >> environment:user.dir=/home/hbase >> 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating >> client connection, >> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 >> sessionTimeout=30000 >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 >> 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating >> client connection, >> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 >> sessionTimeout=30000 >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 >> 2016-02-15 22:21:23,754 INFO >> [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: >> Opening socket connection to server zookeeper2.localdomain/ >> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown >> error) >> 2016-02-15 22:21:23,755 INFO >> [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: >> Opening socket connection to server zookeeper2.localdomain/ >> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown >> error) >> 2016-02-15 22:21:23,767 INFO >> [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: >> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, >> initiating session >> 2016-02-15 22:21:23,767 INFO >> [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: >> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, >> initiating session >> 2016-02-15 22:21:23,777 INFO >> [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: >> Session establishment complete on server zookeeper2.localdomain/ >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = >> 30000 >> 2016-02-15 22:21:23,777 INFO >> [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: >> Session establishment complete on server zookeeper2.localdomain/ >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = >> 30000 >> > > I don't see any exception during the restart but for some reason the > process got killed and I don't see any SIGTERM or OOM on linux journal. 
> > Cheers > Pedro > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <[email protected]> > wrote: > >> Hi St.Ack, >> >> Thank you for your help. I have attached ~5 min worth of data before the >> crash. >> >> I restarted the cluster to update some configurations; after that I moved >> some regions around to balance the cluster, and just to ensure data >> locality I ran a major compaction. After that I connected my App and 2 >> hours later this happened. So by the time of the crash I was not >> performing any operation on the cluster; it was running normally. >> >> Thank you, >> Cheers >> Pedro >> >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <[email protected]> wrote: >> >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <[email protected]> >>> wrote: >>> >>> > Hi Guys, >>> > >>> > One of my region servers got into a state where it was unable to start >>> and >>> > the cluster was not receiving traffic for some time: >>> >>> >>> >>> Were you trying to restart it and it wouldn't come up? It kept doing the >>> NPE on each restart? >>> >>> Or it happened once and killed the regionserver? >>> >>> >>> >>> *(master log)* >>> > >>> > > 2016-02-15 22:04:33,186 ERROR >>> > > [PriorityRpcServer.handler=4,queue=0,port=16000] >>> > master.MasterRpcServices: >>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a >>> fatal >>> > > error: >>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: >>> IOE in >>> > > log roller >>> > > Cause: >>> > > java.io.IOException: java.lang.NullPointerException >>> > > at >>> > > >>> > >>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) >>> > > at >>> > > >>> > >>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) >>> > > at java.lang.Thread.run(Thread.java:745) >>> > > Caused by: java.lang.NullPointerException >>> > > at >>> > > >>> > >>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) >>> > > ... 2 more 2016-02-15 22:05:45,678 WARN >>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] >>> > > master.CatalogJanitor: Failed scan of catalog table >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>> after >>> > > attempts=351, exceptions: >>> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: >>> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' >>> at >>> > > region=hbase:meta,,1.1588230740, >>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at >>> > > >>> > >>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) >>> > > at >>> > >>> >>> You running with read replicas enabled? >>> >>> >>> .... >>> >>> >>> >>> *(regionserver log)* >>> >>> >>> Anything before this log message about rolling the WAL? >>> >>> >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, >>> request >>> > close of wal >>> > java.io.IOException: java.lang.NullPointerException >>> > at >>> > >>> >>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) >>> > at >>> > >>> >>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) >>> > at java.lang.Thread.run(Thread.java:745) >>> > Caused by: java.lang.NullPointerException >>> > at >>> > >>> >>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) >>> > ... 
2 more >>> > 2016-02-15 22:04:33,036 INFO [sync.2] wal.FSHLog: Slow sync cost: 234 >>> ms, >>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010 >>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK], >>> > DatanodeInfoWithStorage[10.5.2.95:50010 >>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK], >>> > DatanodeInfoWithStorage[10.5.2.57:50010 >>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]] >>> > 2016-02-15 22:04:33,043 FATAL >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] >>> > regionserver.HRegionServer: ABORTING region server >>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller >>> > java.io.IOException: java.lang.NullPointerException >>> > at >>> > >>> >>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) >>> > at >>> > >>> >>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) >>> > at java.lang.Thread.run(Thread.java:745) >>> > Caused by: java.lang.NullPointerException >>> > at >>> > >>> >>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) >>> > ... 2 more >>> > 2016-02-15 22:04:33,043 FATAL >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] >>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors >>> are: >>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger, >>> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, >>> > org.apache.phoenix.coprocessor.SequenceRegionObserver, >>> > org.apache.phoenix.coprocessor.ScanRegionObserver, >>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, >>> > org.apache.phoenix.hbase.index.Indexer, >>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, >>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, >>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, >>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint, >>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] >>> > 2016-02-15 22:04:33,136 INFO >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: { >>> > "beans" : [ { >>> > "name" : "java.lang:type=Memory", >>> > "modelerType" : "sun.management.MemoryImpl", >>> > "Verbose" : true, >>> > "HeapMemoryUsage" : { >>> > "committed" : 15032385536, >>> > "init" : 15032385536, >>> > "max" : 15032385536, >>> > "used" : 3732843792 >>> > }, >>> > "ObjectPendingFinalizationCount" : 0, >>> > "NonHeapMemoryUsage" : { >>> > "committed" : 104660992, >>> > "init" : 2555904, >>> > "max" : -1, >>> > "used" : 103018984 >>> > }, >>> > "ObjectName" : "java.lang:type=Memory" >>> > } ], >>> > "beans" : [ { >>> > "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", >>> > "modelerType" : "RegionServer,sub=IPC", >>> > "tag.Context" : "regionserver", >>> > "tag.Hostname" : "hbase-rs9" >>> > } ], >>> > "beans" : [ { >>> > "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication", >>> > "modelerType" : "RegionServer,sub=Replication", >>> > "tag.Context" : "regionserver", >>> > "tag.Hostname" : "hbase-rs9" >>> > } ], >>> > "beans" : [ { >>> > "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", >>> > "modelerType" : "RegionServer,sub=Server", >>> > "tag.Context" : "regionserver", >>> > "tag.Hostname" : "hbase-rs9" >>> > } ] >>> > } >>> > 2016-02-15 22:04:33,194 INFO >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] >>> > regionserver.HRegionServer: 
STOPPED: IOE in log roller
>>>
>>> Looks like we were in the log roller, closing the WAL to put up a new one, but a sync was trying to run at the same time and the stream was closed out from under it... NPE... which caused the server abort.
>>>
>>> ...
>>>
>>> Any idea what happened? Today, I moved some regions around to balance the cluster and I ran a major compaction after that, and I added more threads to run small and large compactions; could this be related?
>>>
>>> Yeah, pass more of the regionserver log from before the NPE... to see what was going on at the time. I don't think the above changes are related.
>>>
>>> I see that in the current branch the class *ProtobufLogWriter:176* already contains a try..catch:
>>>
>>> > try {
>>> >   return this.output.getPos();
>>> > } catch (NullPointerException npe) {
>>> >   // Concurrent close...
>>> >   throw new IOException(npe);
>>> > }
>>>
>>> Yeah. Looks like the thought is that an NPE could happen here because the stream is being closed out from under us by another thread (the syncers are off on their own, just worried about syncing... this is just to get the NPE up out of the syncer thread and into the general WAL subsystem). It looks like this caused the regionserver abort. I would think it should just provoke a WAL roll, not an abort. Put up more RS log please.
>>>
>>> > But it would be nice to understand the root cause of this error and
>>> > whether there is any misconfiguration on my side.
>>>
>>> Not your config, I'd say (if your config brought it on, it's a bug).
>>>
>>> > *Version: *HBase 1.1.2
>>>
>>> St.Ack
>>>
>>> P.S. It looks like a gentleman had a similar issue at the end of last year:
>>> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3cca+nge+rofdtxbolmc0s4bcgwvtyecyjapzwxp5zju-gm4da...@mail.gmail.com%3E
>>> No response then.
>>>
>>> > Thank you
>>> > Cheers
>>> > Pedro
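
For context on the race Stack describes above, here is a minimal, self-contained sketch; it is not HBase's actual ProtobufLogWriter, and the class and field names are hypothetical. It only illustrates the pattern at issue: a syncer thread reads the stream position while another thread concurrently closes the writer, and the resulting NPE is wrapped in an IOException so the caller can fail the sync rather than let the NPE escape the syncer thread.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical sketch, not HBase code: a syncer thread calls sync() while
// another thread concurrently closes the writer and drops the stream.
class HypotheticalWalWriter {

  // Stand-in for the position-tracking output stream a real WAL writer wraps.
  static class PositionedStream extends ByteArrayOutputStream {
    long getPos() { return size(); }
  }

  private volatile PositionedStream output = new PositionedStream();

  long sync() throws IOException {
    try {
      this.output.flush();
      return this.output.getPos();   // NPE if close() nulled the field in between
    } catch (NullPointerException npe) {
      // A concurrent close pulled the stream out from under us; surface it as
      // an IOException so the caller can fail this sync and roll the WAL
      // instead of letting the NPE escape the syncer thread.
      throw new IOException(npe);
    }
  }

  void close() throws IOException {
    PositionedStream out = this.output;
    this.output = null;              // the other side of the race
    if (out != null) {
      out.close();
    }
  }
}

Whether that IOException should then abort the whole region server, as it did here, rather than just trigger a WAL roll is exactly the question Stack raises above.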
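Separately, the abort in the rs log at the top of the thread is a plain ZooKeeper session expiry (the restart log shows a negotiated session timeout of 30000 ms), which is commonly triggered when the process stalls for longer than the session timeout, for example during a long GC pause. The sketch below is hypothetical and is not HBase's ZooKeeperWatcher; it only illustrates why an Expired event is treated as fatal while a Disconnected event is not.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Hypothetical sketch, not HBase's ZooKeeperWatcher: Expired is fatal,
// Disconnected is not.
class ExpiryAwareWatcher implements Watcher {

  private final Runnable abortHook;   // injected by the caller, e.g. () -> System.exit(1)

  ExpiryAwareWatcher(Runnable abortHook) {
    this.abortHook = abortHook;
  }

  @Override
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case Disconnected:
        // Transient: the client keeps reconnecting, and the session survives
        // as long as it gets back within the negotiated timeout.
        break;
      case Expired:
        // The quorum has already expired the session and removed the server's
        // ephemeral znode, so the master will treat the server as dead and
        // reassign its regions; the safe reaction is to abort and restart.
        abortHook.run();
        break;
      default:
        break;
    }
  }
}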
