Re: NPE on ProtobufLogWriter.sync()

Pedro Gandola Tue, 16 Feb 2016 01:34:03 -0800

Just missed one of your questions.

Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
> Or it happened once and killed the regionserver?



I received an alarm for the RS and I went there I saw the cluster not
processing data and I tried to restart the RS. Everything went to normal
but the RS crashed again then I decided to remove it completely until know
more about the problem. Now looking into the restarting code I see that the
problem might be other.

*(log.out) *

> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
> Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> directory


*(log after restart)*

> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.io.tmpdir=/tmp
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.compiler=<NA>
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.name=Linux
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.arch=amd64
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.version=3.10.0-327.3.1.el7.x86_64
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.name=hbase
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.home=/home/hbase
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.dir=/home/hbase
> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> sessionTimeout=30000
> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> sessionTimeout=30000
> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,754 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,755 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,767 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> initiating session
> 2016-02-15 22:21:23,767 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> initiating session
> 2016-02-15 22:21:23,777 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Session establishment complete on server zookeeper2.localdomain/
> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = 30000
> 2016-02-15 22:21:23,777 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Session establishment complete on server zookeeper2.localdomain/
> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = 30000
>

I don't see any exception during the restart but for some reason the
process got killed and I don't see any SIGTERM or OOM on linux journal.

Cheers
Pedro


On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <[email protected]>
wrote:

> Hi St.Ack,
>
> Thank you for your help. I have attached ~5 min worth of data before the
> crash.
>
> I restarted the cluster to update some configurations after that I moved
> some regions around to balance the cluster and just to ensure the data
> locality I ran a major compaction. After that I connected my App and 2
> hours later  this happened. So by the time of the crash I was not
> performing any operation over the cluster it was running normally.
>
> Thank you,
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 2:01 AM, Stack <[email protected]> wrote:
>
>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <[email protected]>
>> wrote:
>>
>> > Hi Guys,
>> >
>> > One of my region servers got into a state where it was unable to start
>> and
>> > the cluster was not receiving traffic for some time:
>>
>>
>>
>> Were you trying to restart it and it wouldn't come up? It kept doing the
>> NPE on each restart?
>>
>> Or it happened once and killed the regionserver?
>>
>>
>>
>> *(master log)*
>> >
>> > > 2016-02-15 22:04:33,186 ERROR
>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
>> > master.MasterRpcServices:
>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
>> fatal
>> > > error:
>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE
>> in
>> > > log roller
>> > > Cause:
>> > > java.io.IOException: java.lang.NullPointerException
>> > > at
>> > >
>> >
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>> > > at
>> > >
>> >
>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>> > > at java.lang.Thread.run(Thread.java:745)
>> > > Caused by: java.lang.NullPointerException
>> > > at
>> > >
>> >
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
>> > > master.CatalogJanitor: Failed scan of catalog table
>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
>> > > attempts=351, exceptions:
>> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
>> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
>> > > region=hbase:meta,,1.1588230740,
>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
>> > >
>> >
>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>> > > at
>> >
>>
>> You running with read replicas enabled?
>>
>>
>> ....
>>
>>
>>
>> *(regionserver log)*
>>
>>
>> Anything before this log message about rolling the WAL?
>>
>>
>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
>> request
>> > close of wal
>> > java.io.IOException: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>> > at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>> > ... 2 more
>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
>> ms,
>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
>> > DatanodeInfoWithStorage[10.5.2.95:50010
>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
>> > DatanodeInfoWithStorage[10.5.2.57:50010
>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
>> > 2016-02-15 22:04:33,043 FATAL
>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: ABORTING region server
>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
>> > java.io.IOException: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>> > at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>> > ... 2 more
>> > 2016-02-15 22:04:33,043 FATAL
>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
>> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
>> > org.apache.phoenix.hbase.index.Indexer,
>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
>> > 2016-02-15 22:04:33,136 INFO
>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>> >   "beans" : [ {
>> >     "name" : "java.lang:type=Memory",
>> >     "modelerType" : "sun.management.MemoryImpl",
>> >     "Verbose" : true,
>> >     "HeapMemoryUsage" : {
>> >       "committed" : 15032385536,
>> >       "init" : 15032385536,
>> >       "max" : 15032385536,
>> >       "used" : 3732843792
>> >     },
>> >     "ObjectPendingFinalizationCount" : 0,
>> >     "NonHeapMemoryUsage" : {
>> >       "committed" : 104660992,
>> >       "init" : 2555904,
>> >       "max" : -1,
>> >       "used" : 103018984
>> >     },
>> >     "ObjectName" : "java.lang:type=Memory"
>> >   } ],
>> >   "beans" : [ {
>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>> >     "modelerType" : "RegionServer,sub=IPC",
>> >     "tag.Context" : "regionserver",
>> >     "tag.Hostname" : "hbase-rs9"
>> >   } ],
>> >   "beans" : [ {
>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>> >     "modelerType" : "RegionServer,sub=Replication",
>> >     "tag.Context" : "regionserver",
>> >     "tag.Hostname" : "hbase-rs9"
>> >   } ],
>> >   "beans" : [ {
>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>> >     "modelerType" : "RegionServer,sub=Server",
>> >     "tag.Context" : "regionserver",
>> >     "tag.Hostname" : "hbase-rs9"
>> >   } ]
>> > }
>> > 2016-02-15 22:04:33,194 INFO
>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: STOPPED: IOE in log roller
>>
>> Looks like we were in log roller, closing the WAL to put up a new one but
>> the a sync was trying to run at same time but stream undo under it....
>> NPE... which caused server abort.
>>
>> ...
>>
>>
>> Any idea what happened? Today, I moved some regions around to balance the
>> cluster and I ran a major compaction after that and I added more threads
>> to
>> run small and large compactions, could this be related?
>>
>>
>>
>> Yeah, pass more of the regionserver log from before the NPE... to see what
>> was going on at the time.  I don't think the above changes related.
>>
>>
>>
>> I see that in the current branch the class *ProtobufLogWriter:176* already
>> contains try..catch
>>
>> try {
>> >   return this.output.getPos();
>> > } catch (NullPointerException npe) {
>> >   // Concurrent close...
>> >   throw new IOException(npe);
>> > }
>>
>>
>> Yeah. Looks like the thought is an NPE could happen here because the
>> stream
>> is being closed out from under us by another thread (the syncers are off
>> on
>> their own just worried about syncing... This is just to get the NPE up out
>> of the syncer thread and up into the general WAL subsystem.).  It looks
>> like this caused the regionserver abort. I would think it should just
>> provoke the WAL roll, not an abort. Put up more RS log please.
>>
>>
>>
>> > But I would be nice to understand the root cause of this error and if
>> there
>> > is any misconfiguration from my side.
>> >
>> >
>> Not your config I'd say (if your config brought it on, its a bug).
>>
>>
>>
>> > *Version: *HBase 1.1.2
>> >
>> >
>>
>> St.Ack
>>
>> P.S. It looks like a gentleman had similar issue end of last year:
>>
>> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3cca+nge+rofdtxbolmc0s4bcgwvtyecyjapzwxp5zju-gm4da...@mail.gmail.com%3E
>> No response then.
>>
>>
>>
>>
>> > Thank you
>> > Cheers
>> > Pedro
>> >
>>
>
>

Re: NPE on ProtobufLogWriter.sync()

Reply via email to