Hi St.Ack I attached the log in my previous response.
Thanks Cheers On Tue, Feb 16, 2016 at 11:51 PM, Stack <[email protected]> wrote: > On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola <[email protected]> > wrote: > > > Just missed one of your questions. > > > > Were you trying to restart it and it wouldn't come up? It kept doing the > > > NPE on each restart? > > > Or it happened once and killed the regionserver? > > > > > > I received an alarm for the RS and I went there I saw the cluster not > > processing data and I tried to restart the RS. Everything went to normal > > but the RS crashed again then I decided to remove it completely until > know > > more about the problem. Now looking into the restarting code I see that > the > > problem might be other. > > > > *(log.out) * > > > > > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: > 20748 > > > Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase > > > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} > 2>&1 > > > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or > > > directory > > > > > > *(log after restart)* > > > > > 2016-02-15 22:21:23,692 INFO [main] zookeeper.ZooKeeper: Client > > > > > > environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:java.io.tmpdir=/tmp > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:java.compiler=<NA> > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:os.name=Linux > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:os.arch=amd64 > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:os.version=3.10.0-327.3.1.el7.x86_64 > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:user.name=hbase > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:user.home=/home/hbase > > > 2016-02-15 22:21:23,693 INFO [main] zookeeper.ZooKeeper: Client > > > environment:user.dir=/home/hbase > > > 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating > > > client connection, > > > > > > connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 > > > sessionTimeout=30000 > > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 > > > 2016-02-15 22:21:23,694 INFO [main] zookeeper.ZooKeeper: Initiating > > > client connection, > > > > > > connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 > > > sessionTimeout=30000 > > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671 > > > 2016-02-15 22:21:23,754 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Opening socket connection to server zookeeper2.localdomain/ > > 10.5.1.105:2181. > > > Will not attempt to authenticate using SASL (unknown error) > > > 2016-02-15 22:21:23,755 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Opening socket connection to server zookeeper2.localdomain/ > > 10.5.1.105:2181. > > > Will not attempt to authenticate using SASL (unknown error) > > > 2016-02-15 22:21:23,767 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Socket connection established to zookeeper2.localdomain/ > 10.5.1.105:2181, > > > initiating session > > > 2016-02-15 22:21:23,767 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Socket connection established to zookeeper2.localdomain/ > 10.5.1.105:2181, > > > initiating session > > > 2016-02-15 22:21:23,777 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Session establishment complete on server zookeeper2.localdomain/ > > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = > > 30000 > > > 2016-02-15 22:21:23,777 INFO > > > [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: > > > Session establishment complete on server zookeeper2.localdomain/ > > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = > > 30000 > > > > > > > I don't see any exception during the restart but for some reason the > > process got killed and I don't see any SIGTERM or OOM on linux journal. > > > > > Is that the last of the log? Just like that? > St.Ack > > > > Cheers > > Pedro > > > > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <[email protected]> > > wrote: > > > > > Hi St.Ack, > > > > > > Thank you for your help. I have attached ~5 min worth of data before > the > > > crash. > > > > > > I restarted the cluster to update some configurations after that I > moved > > > some regions around to balance the cluster and just to ensure the data > > > locality I ran a major compaction. After that I connected my App and 2 > > > hours later this happened. So by the time of the crash I was not > > > performing any operation over the cluster it was running normally. > > > > > > Thank you, > > > Cheers > > > Pedro > > > > > > On Tue, Feb 16, 2016 at 2:01 AM, Stack <[email protected]> wrote: > > > > > >> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola < > [email protected] > > > > > >> wrote: > > >> > > >> > Hi Guys, > > >> > > > >> > One of my region servers got into a state where it was unable to > start > > >> and > > >> > the cluster was not receiving traffic for some time: > > >> > > >> > > >> > > >> Were you trying to restart it and it wouldn't come up? It kept doing > the > > >> NPE on each restart? > > >> > > >> Or it happened once and killed the regionserver? > > >> > > >> > > >> > > >> *(master log)* > > >> > > > >> > > 2016-02-15 22:04:33,186 ERROR > > >> > > [PriorityRpcServer.handler=4,queue=0,port=16000] > > >> > master.MasterRpcServices: > > >> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a > > >> fatal > > >> > > error: > > >> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: > > IOE > > >> in > > >> > > log roller > > >> > > Cause: > > >> > > java.io.IOException: java.lang.NullPointerException > > >> > > at > > >> > > > > >> > > > >> > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > >> > > at > > >> > > > > >> > > > >> > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > >> > > at java.lang.Thread.run(Thread.java:745) > > >> > > Caused by: java.lang.NullPointerException > > >> > > at > > >> > > > > >> > > > >> > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > >> > > ... 2 more 2016-02-15 22:05:45,678 WARN > > >> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] > > >> > > master.CatalogJanitor: Failed scan of catalog table > > >> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > after > > >> > > attempts=351, exceptions: > > >> > > Mon Feb 15 22:05:45 UTC 2016, null, > java.net.SocketTimeoutException: > > >> > > callTimeout=60000, callDuration=68222: row '' on table > 'hbase:meta' > > at > > >> > > region=hbase:meta,,1.1588230740, > > >> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at > > >> > > > > >> > > > >> > > > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) > > >> > > at > > >> > > > >> > > >> You running with read replicas enabled? > > >> > > >> > > >> .... > > >> > > >> > > >> > > >> *(regionserver log)* > > >> > > >> > > >> Anything before this log message about rolling the WAL? > > >> > > >> > > >> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, > > >> request > > >> > close of wal > > >> > java.io.IOException: java.lang.NullPointerException > > >> > at > > >> > > > >> > > >> > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > >> > at > > >> > > > >> > > >> > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > >> > at java.lang.Thread.run(Thread.java:745) > > >> > Caused by: java.lang.NullPointerException > > >> > at > > >> > > > >> > > >> > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > >> > ... 2 more > > >> > 2016-02-15 22:04:33,036 INFO [sync.2] wal.FSHLog: Slow sync cost: > 234 > > >> ms, > > >> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010 > > >> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK], > > >> > DatanodeInfoWithStorage[10.5.2.95:50010 > > >> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK], > > >> > DatanodeInfoWithStorage[10.5.2.57:50010 > > >> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]] > > >> > 2016-02-15 22:04:33,043 FATAL > > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > >> > regionserver.HRegionServer: ABORTING region server > > >> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller > > >> > java.io.IOException: java.lang.NullPointerException > > >> > at > > >> > > > >> > > >> > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > >> > at > > >> > > > >> > > >> > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > >> > at java.lang.Thread.run(Thread.java:745) > > >> > Caused by: java.lang.NullPointerException > > >> > at > > >> > > > >> > > >> > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > >> > ... 2 more > > >> > 2016-02-15 22:04:33,043 FATAL > > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > >> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors > > are: > > >> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger, > > >> > > > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, > > >> > org.apache.phoenix.coprocessor.SequenceRegionObserver, > > >> > org.apache.phoenix.coprocessor.ScanRegionObserver, > > >> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, > > >> > org.apache.phoenix.hbase.index.Indexer, > > >> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, > > >> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, > > >> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, > > >> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint, > > >> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > > >> > 2016-02-15 22:04:33,136 INFO > > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > >> > regionserver.HRegionServer: Dump of metrics as JSON on abort: { > > >> > "beans" : [ { > > >> > "name" : "java.lang:type=Memory", > > >> > "modelerType" : "sun.management.MemoryImpl", > > >> > "Verbose" : true, > > >> > "HeapMemoryUsage" : { > > >> > "committed" : 15032385536, > > >> > "init" : 15032385536, > > >> > "max" : 15032385536, > > >> > "used" : 3732843792 > > >> > }, > > >> > "ObjectPendingFinalizationCount" : 0, > > >> > "NonHeapMemoryUsage" : { > > >> > "committed" : 104660992, > > >> > "init" : 2555904, > > >> > "max" : -1, > > >> > "used" : 103018984 > > >> > }, > > >> > "ObjectName" : "java.lang:type=Memory" > > >> > } ], > > >> > "beans" : [ { > > >> > "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", > > >> > "modelerType" : "RegionServer,sub=IPC", > > >> > "tag.Context" : "regionserver", > > >> > "tag.Hostname" : "hbase-rs9" > > >> > } ], > > >> > "beans" : [ { > > >> > "name" : > "Hadoop:service=HBase,name=RegionServer,sub=Replication", > > >> > "modelerType" : "RegionServer,sub=Replication", > > >> > "tag.Context" : "regionserver", > > >> > "tag.Hostname" : "hbase-rs9" > > >> > } ], > > >> > "beans" : [ { > > >> > "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", > > >> > "modelerType" : "RegionServer,sub=Server", > > >> > "tag.Context" : "regionserver", > > >> > "tag.Hostname" : "hbase-rs9" > > >> > } ] > > >> > } > > >> > 2016-02-15 22:04:33,194 INFO > > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > >> > regionserver.HRegionServer: STOPPED: IOE in log roller > > >> > > >> Looks like we were in log roller, closing the WAL to put up a new one > > but > > >> the a sync was trying to run at same time but stream undo under it.... > > >> NPE... which caused server abort. > > >> > > >> ... > > >> > > >> > > >> Any idea what happened? Today, I moved some regions around to balance > > the > > >> cluster and I ran a major compaction after that and I added more > threads > > >> to > > >> run small and large compactions, could this be related? > > >> > > >> > > >> > > >> Yeah, pass more of the regionserver log from before the NPE... to see > > what > > >> was going on at the time. I don't think the above changes related. > > >> > > >> > > >> > > >> I see that in the current branch the class *ProtobufLogWriter:176* > > already > > >> contains try..catch > > >> > > >> try { > > >> > return this.output.getPos(); > > >> > } catch (NullPointerException npe) { > > >> > // Concurrent close... > > >> > throw new IOException(npe); > > >> > } > > >> > > >> > > >> Yeah. Looks like the thought is an NPE could happen here because the > > >> stream > > >> is being closed out from under us by another thread (the syncers are > off > > >> on > > >> their own just worried about syncing... This is just to get the NPE up > > out > > >> of the syncer thread and up into the general WAL subsystem.). It > looks > > >> like this caused the regionserver abort. I would think it should just > > >> provoke the WAL roll, not an abort. Put up more RS log please. > > >> > > >> > > >> > > >> > But I would be nice to understand the root cause of this error and > if > > >> there > > >> > is any misconfiguration from my side. > > >> > > > >> > > > >> Not your config I'd say (if your config brought it on, its a bug). > > >> > > >> > > >> > > >> > *Version: *HBase 1.1.2 > > >> > > > >> > > > >> > > >> St.Ack > > >> > > >> P.S. It looks like a gentleman had similar issue end of last year: > > >> > > >> > > > http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3cca+nge+rofdtxbolmc0s4bcgwvtyecyjapzwxp5zju-gm4da...@mail.gmail.com%3E > > >> No response then. > > >> > > >> > > >> > > >> > > >> > Thank you > > >> > Cheers > > >> > Pedro > > >> > > > >> > > > > > > > > >
