Hi St.Ack, Thank you for your help. I have attached ~5 min worth of data before the crash.
I restarted the cluster to update some configurations after that I moved some regions around to balance the cluster and just to ensure the data locality I ran a major compaction. After that I connected my App and 2 hours later this happened. So by the time of the crash I was not performing any operation over the cluster it was running normally. Thank you, Cheers Pedro On Tue, Feb 16, 2016 at 2:01 AM, Stack <[email protected]> wrote: > On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <[email protected]> > wrote: > > > Hi Guys, > > > > One of my region servers got into a state where it was unable to start > and > > the cluster was not receiving traffic for some time: > > > > Were you trying to restart it and it wouldn't come up? It kept doing the > NPE on each restart? > > Or it happened once and killed the regionserver? > > > > *(master log)* > > > > > 2016-02-15 22:04:33,186 ERROR > > > [PriorityRpcServer.handler=4,queue=0,port=16000] > > master.MasterRpcServices: > > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a > fatal > > > error: > > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE > in > > > log roller > > > Cause: > > > java.io.IOException: java.lang.NullPointerException > > > at > > > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > > at > > > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > > at java.lang.Thread.run(Thread.java:745) > > > Caused by: java.lang.NullPointerException > > > at > > > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > > ... 2 more 2016-02-15 22:05:45,678 WARN > > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] > > > master.CatalogJanitor: Failed scan of catalog table > > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > > > attempts=351, exceptions: > > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: > > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at > > > region=hbase:meta,,1.1588230740, > > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at > > > > > > org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) > > > at > > > > You running with read replicas enabled? > > > .... > > > > *(regionserver log)* > > > Anything before this log message about rolling the WAL? > > > > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request > > close of wal > > java.io.IOException: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > at > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > ... 2 more > > 2016-02-15 22:04:33,036 INFO [sync.2] wal.FSHLog: Slow sync cost: 234 > ms, > > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010 > ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK], > > DatanodeInfoWithStorage[10.5.2.95:50010 > ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK], > > DatanodeInfoWithStorage[10.5.2.57:50010 > > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]] > > 2016-02-15 22:04:33,043 FATAL > > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > regionserver.HRegionServer: ABORTING region server > > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller > > java.io.IOException: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176) > > at > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: java.lang.NullPointerException > > at > > > > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173) > > ... 2 more > > 2016-02-15 22:04:33,043 FATAL > > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: > > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger, > > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, > > org.apache.phoenix.coprocessor.SequenceRegionObserver, > > org.apache.phoenix.coprocessor.ScanRegionObserver, > > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, > > org.apache.phoenix.hbase.index.Indexer, > > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, > > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, > > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, > > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint, > > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > > 2016-02-15 22:04:33,136 INFO > > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > regionserver.HRegionServer: Dump of metrics as JSON on abort: { > > "beans" : [ { > > "name" : "java.lang:type=Memory", > > "modelerType" : "sun.management.MemoryImpl", > > "Verbose" : true, > > "HeapMemoryUsage" : { > > "committed" : 15032385536, > > "init" : 15032385536, > > "max" : 15032385536, > > "used" : 3732843792 > > }, > > "ObjectPendingFinalizationCount" : 0, > > "NonHeapMemoryUsage" : { > > "committed" : 104660992, > > "init" : 2555904, > > "max" : -1, > > "used" : 103018984 > > }, > > "ObjectName" : "java.lang:type=Memory" > > } ], > > "beans" : [ { > > "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", > > "modelerType" : "RegionServer,sub=IPC", > > "tag.Context" : "regionserver", > > "tag.Hostname" : "hbase-rs9" > > } ], > > "beans" : [ { > > "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication", > > "modelerType" : "RegionServer,sub=Replication", > > "tag.Context" : "regionserver", > > "tag.Hostname" : "hbase-rs9" > > } ], > > "beans" : [ { > > "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", > > "modelerType" : "RegionServer,sub=Server", > > "tag.Context" : "regionserver", > > "tag.Hostname" : "hbase-rs9" > > } ] > > } > > 2016-02-15 22:04:33,194 INFO > > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] > > regionserver.HRegionServer: STOPPED: IOE in log roller > > Looks like we were in log roller, closing the WAL to put up a new one but > the a sync was trying to run at same time but stream undo under it.... > NPE... which caused server abort. > > ... > > > Any idea what happened? Today, I moved some regions around to balance the > cluster and I ran a major compaction after that and I added more threads to > run small and large compactions, could this be related? > > > > Yeah, pass more of the regionserver log from before the NPE... to see what > was going on at the time. I don't think the above changes related. > > > > I see that in the current branch the class *ProtobufLogWriter:176* already > contains try..catch > > try { > > return this.output.getPos(); > > } catch (NullPointerException npe) { > > // Concurrent close... > > throw new IOException(npe); > > } > > > Yeah. Looks like the thought is an NPE could happen here because the stream > is being closed out from under us by another thread (the syncers are off on > their own just worried about syncing... This is just to get the NPE up out > of the syncer thread and up into the general WAL subsystem.). It looks > like this caused the regionserver abort. I would think it should just > provoke the WAL roll, not an abort. Put up more RS log please. > > > > > But I would be nice to understand the root cause of this error and if > there > > is any misconfiguration from my side. > > > > > Not your config I'd say (if your config brought it on, its a bug). > > > > > *Version: *HBase 1.1.2 > > > > > > St.Ack > > P.S. It looks like a gentleman had similar issue end of last year: > > http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3cca+nge+rofdtxbolmc0s4bcgwvtyecyjapzwxp5zju-gm4da...@mail.gmail.com%3E > No response then. > > > > > > Thank you > > Cheers > > Pedro > > >
