Hi St.Ack,

Thank you for your help. I have attached ~5 min worth of data before the
crash.

I restarted the cluster to update some configurations. After that I moved
some regions around to balance the cluster and, to ensure data locality,
I ran a major compaction. I then connected my app, and 2 hours later this
happened. So at the time of the crash I was not performing any operations
on the cluster; it was running normally.

Thank you,
Cheers
Pedro

On Tue, Feb 16, 2016 at 2:01 AM, Stack <[email protected]> wrote:

> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <[email protected]>
> wrote:
>
> > Hi Guys,
> >
> > One of my region servers got into a state where it was unable to start and
> > the cluster was not receiving traffic for some time:
>
>
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
>
> Or it happened once and killed the regionserver?
>
>
>
> *(master log)*
> >
> > > 2016-02-15 22:04:33,186 ERROR [PriorityRpcServer.handler=4,queue=0,port=16000] master.MasterRpcServices:
> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal error:
> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > > Cause:
> > > java.io.IOException: java.lang.NullPointerException
> > > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > at java.lang.Thread.run(Thread.java:745)
> > > Caused by: java.lang.NullPointerException
> > > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > ... 2 more
> > > 2016-02-15 22:05:45,678 WARN [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] master.CatalogJanitor: Failed scan of catalog table
> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions:
> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> > > at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > at
> >
>
> You running with read replicas enabled?
>
>
> ....
>
>
>
> *(regionserver log)*
>
>
> Anything before this log message about rolling the WAL?
>
>
> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request close of wal
> > java.io.IOException: java.lang.NullPointerException
> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more
> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms, current pipeline:
> > [DatanodeInfoWithStorage[10.5.2.169:50010,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > DatanodeInfoWithStorage[10.5.2.95:50010,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > 2016-02-15 22:04:33,043 FATAL [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > java.io.IOException: java.lang.NullPointerException
> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more
> > 2016-02-15 22:04:33,043 FATAL
> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > org.apache.phoenix.hbase.index.Indexer,
> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > 2016-02-15 22:04:33,136 INFO
> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >   "beans" : [ {
> >     "name" : "java.lang:type=Memory",
> >     "modelerType" : "sun.management.MemoryImpl",
> >     "Verbose" : true,
> >     "HeapMemoryUsage" : {
> >       "committed" : 15032385536,
> >       "init" : 15032385536,
> >       "max" : 15032385536,
> >       "used" : 3732843792
> >     },
> >     "ObjectPendingFinalizationCount" : 0,
> >     "NonHeapMemoryUsage" : {
> >       "committed" : 104660992,
> >       "init" : 2555904,
> >       "max" : -1,
> >       "used" : 103018984
> >     },
> >     "ObjectName" : "java.lang:type=Memory"
> >   } ],
> >   "beans" : [ {
> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> >     "modelerType" : "RegionServer,sub=IPC",
> >     "tag.Context" : "regionserver",
> >     "tag.Hostname" : "hbase-rs9"
> >   } ],
> >   "beans" : [ {
> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> >     "modelerType" : "RegionServer,sub=Replication",
> >     "tag.Context" : "regionserver",
> >     "tag.Hostname" : "hbase-rs9"
> >   } ],
> >   "beans" : [ {
> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> >     "modelerType" : "RegionServer,sub=Server",
> >     "tag.Context" : "regionserver",
> >     "tag.Hostname" : "hbase-rs9"
> >   } ]
> > }
> > 2016-02-15 22:04:33,194 INFO
> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: STOPPED: IOE in log roller
>
> Looks like we were in the log roller, closing the WAL to put up a new one, but
> a sync was trying to run at the same time and the stream was undone under it....
> NPE... which caused the server abort.
>
> ...
>
>
> > Any idea what happened? Today I moved some regions around to balance the
> > cluster, ran a major compaction after that, and added more threads to
> > run small and large compactions. Could this be related?
>
>
>
> Yeah, pass more of the regionserver log from before the NPE... to see what
> was going on at the time.  I don't think the above changes are related.
>
>
>
> > I see that in the current branch the class *ProtobufLogWriter:176* already
> > contains try..catch
> >
> > try {
> >   return this.output.getPos();
> > } catch (NullPointerException npe) {
> >   // Concurrent close...
> >   throw new IOException(npe);
> > }
>
>
> Yeah. Looks like the thought is an NPE could happen here because the stream
> is being closed out from under us by another thread (the syncers are off on
> their own just worried about syncing... This is just to get the NPE up out
> of the syncer thread and up into the general WAL subsystem.).  It looks
> like this caused the regionserver abort. I would think it should just
> provoke the WAL roll, not an abort. Put up more RS log please.
>
>
>
> > But it would be nice to understand the root cause of this error and whether
> > there is any misconfiguration on my side.
> >
> >
> Not your config I'd say (if your config brought it on, it's a bug).
>
>
>
> > *Version: *HBase 1.1.2
> >
> >
>
> St.Ack
>
> P.S. It looks like a gentleman had a similar issue at the end of last year:
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3cca+nge+rofdtxbolmc0s4bcgwvtyecyjapzwxp5zju-gm4da...@mail.gmail.com%3E
> No response then.
>
>
>
>
> > Thank you
> > Cheers
> > Pedro
> >
>
