This looks similar to a failure scenario I’ve seen a couple of times. I believe 
that after multiple restarts you got lucky and HBase happened to bring the 
tables up in the correct order. 

What happens is some kind of semi-catastrophic failure where one or more region 
servers go down with unflushed edits that exist only in the WAL, and those 
edits belong to regions whose tables have secondary indexes. HBase wants to 
replay the WAL before bringing the region server online. During that replay, 
Phoenix needs to write to the index regions, but can’t reach them. It fails 
enough times, then gives up. 

The more region servers / tables / indexes affected, the more likely a full 
restart will get stuck in a classic deadlock. A good old-fashioned data center 
outage is a great way to end up here. You might make some progress and get 
stuck again, or restart number N might happen to initialize those index 
regions before the main table. 

The surefire way to recover a cluster in this condition is to strategically 
disable all the tables that are failing to come up. You can do this from the 
HBase shell as long as the master is running. If I remember right, it’s a pain 
because the disable command will hang; you might need to disable a table, kill 
the shell, disable the next table, and so on. Then restart. You’ll eventually 
have a cluster with all the region servers finally started and a bunch of 
disabled tables. For each index table you disabled, enable it and wait for it 
to become available (i.e., for its WAL edits to be replayed), then enable the 
associated main table and wait for it to come online. If HBase did its job 
without error, and your failure didn’t include losing 4 disks at once, order 
will be restored. Lather, rinse, repeat until everything is enabled and online. 
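
That dance can be scripted. The sketch below is only an illustration: the 
table names are placeholders borrowed from this thread, and wrapping each 
hbase shell call in a timeout (instead of killing the shell by hand when 
disable hangs) is my assumption, not something I’ve verified. With DRY_RUN=1 
(the default) it just prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only. Substitute the tables that are actually failing to come up on
# your cluster; KM* names below are placeholders from this thread.
DRY_RUN="${DRY_RUN:-1}"

run_hbase() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $1"
  else
    # disable/enable can hang while regions are stuck; a timeout is an
    # untested alternative to killing the shell by hand after each table.
    echo "$1" | timeout 120 hbase shell || true
  fi
}

# Step 1: disable every table (data and index) that is failing to come up,
# then restart HBase.
for t in KM KM_HISTORY KM_IDX1 KM_IDX2 KM_HISTORY_IDX1; do
  run_hbase "disable '$t'"
done

# Step 2 (after the restart): enable the index tables first so their WAL
# edits are replayed, then the associated data tables.
for t in KM_IDX1 KM_IDX2 KM_HISTORY_IDX1 KM KM_HISTORY; do
  run_hbase "enable '$t'"
  run_hbase "is_enabled '$t'"   # poll this until it reports true
done
```

The only point that really matters is the ordering in step 2: an index table 
must be fully online before the data table it serves is enabled.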

<TLDR> A big enough failure, sprinkled with a little bad luck and what seems 
to be a Phoenix flaw, == a deadlock trying to get HBase to start up. Fix it by 
forcing the order in which HBase brings regions online. Finally, never go full 
restart. </TLDR>

> On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com> wrote:
> 
> After the update, the Master web interface shows that every region server is 
> now on 1.4.7 and there are no RITs.
> 
> The cluster recovered only after we restarted all region servers 4 times...
> 
>> On 11 Sep 2018, at 04:08, Josh Elser <els...@apache.org> wrote:
>> 
>> Did you update the HBase jars on all RegionServers?
>> 
>> Make sure that you have all of the Regions assigned (no RITs). There could 
>> be a pretty simple explanation as to why the index can't be written to.
>> 
>>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
>>> Correct me if I'm wrong, but it looks like the following situation is 
>>> possible if region servers A and B both host index and primary table 
>>> regions.
>>> A and B are taking writes on a table with indexes
>>> A crashes
>>> B fails on an index update because A is down, so B starts aborting
>>> After restart, A tries to rebuild the index from the WAL, but B is still 
>>> aborting at this point, so A starts aborting too
>>> From this moment nothing happens (0 requests to region servers) and A and B 
>>> are unresponsive in the Master-status web interface
>>>> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com 
>>>> <mailto:0x62...@gmail.com>> wrote:
>>>> 
>>>> After the update we still can't recover the HBase cluster. Our region 
>>>> servers keep ABORTING over and over:
>>>> 
>>>> prod003:
>>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod003,60020,1536446665703: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod003,60020,1536446665703: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod003,60020,1536446665703: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod003,60020,1536446665703: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod003,60020,1536450772841: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod003,60020,1536450772841: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> 
>>>> prod004:
>>>> Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod004,60020,1536446387325: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod004,60020,1536446387325: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod004,60020,1536446387325: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod004,60020,1536446387325: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> 
>>>> prod005:
>>>> Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod005,60020,1536446400009: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod005,60020,1536446400009: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod005,60020,1536446400009: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058 FATAL 
>>>> [RpcServer.default.FPBQ.Fifo.handler=16,queue=6,port=60020] 
>>>> regionserver.HRegionServer: ABORTING region server 
>>>> prod005,60020,1536446400009: Could not update the index table, killing 
>>>> server region because couldn't write to an index table
>>>> 
>>>> And so on...
>>>> 
>>>> Trace is the same everywhere:
>>>> 
>>>> Sep 09 02:54:30 prod005 hbase[3772]: 
>>>> org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  
>>>> disableIndexOnFailure=true, Failed to write to multiple index tables: 
>>>> [KM_IDX1, KM_IDX2, KM_HISTORY_IDX1, KM_HISTORY_IDX2, KM_HISTORY_IDX3]
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:241)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1068)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>> 
>>>>> On 9 Sep 2018, at 01:44, Batyrshin Alexander <0x62...@gmail.com 
>>>>> <mailto:0x62...@gmail.com>> wrote:
>>>>> 
>>>>> Thank you.
>>>>> We're updating our cluster right now...
>>>>> 
>>>>> 
>>>>>> On 9 Sep 2018, at 01:39, Ted Yu <yuzhih...@gmail.com 
>>>>>> <mailto:yuzhih...@gmail.com>> wrote:
>>>>>> 
>>>>>> It seems you should deploy hbase with the following fix:
>>>>>> 
>>>>>> HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash
>>>>>> 
>>>>>> 1.4.7 was recently released.
>>>>>> 
>>>>>> FYI
>>>>>> 
>>>>>> On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander <0x62...@gmail.com 
>>>>>> <mailto:0x62...@gmail.com>> wrote:
>>>>>> 
>>>>>>    Hello,
>>>>>> 
>>>>>>   We got this exception from *prod006* server
>>>>>> 
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532
>>>>>>   FATAL [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING
>>>>>>   region server prod006,60020,1536235102833: Replay of
>>>>>>   WAL required. Forcing server shutdown
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:
>>>>>>   org.apache.hadoop.hbase.DroppedSnapshotException:
>>>>>>   region: 
>>>>>> KM,c\xEF\xBF\xBD\x16I7\xEF\xBF\xBD\x0A"A\xEF\xBF\xBDd\xEF\xBF\xBD\xEF\xBF\xBD\x19\x07t,1536178245576.60c121ba50e67f2429b9ca2ba2a11bad.
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2322)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2284)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2170)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2095)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:508)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:478)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   java.lang.Thread.run(Thread.java:748)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: Caused by:
>>>>>>   java.lang.NullPointerException
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   java.util.ArrayList.<init>(ArrayList.java:178)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:863)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1172)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1145)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   org.apache.hadoop.hbase.regionserver.HStore.access$900(HStore.java:122)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2505)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2600)
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         ... 9 more
>>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532
>>>>>>   FATAL [MemStoreFlusher.1] regionserver.HRegionServer:
>>>>>>   RegionServer abort: loaded coprocessors
>>>>>>   are: 
>>>>>> [org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
>>>>>>   org.apache.phoenix.coprocessor.SequenceRegionObserver,
>>>>>>   org.apache.phoenix.c
>>>>>> 
>>>>>>   After that we got ABORTING on almost every region server in the
>>>>>>   cluster, with different reasons:
>>>>>> 
>>>>>>   *prod003*
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]: 2018-09-09 01:12:11,799
>>>>>>   FATAL [PostOpenDeployTasks:88bfac1dfd807c4cd1e9c1f31b4f053f]
>>>>>>   regionserver.HRegionServer: ABORTING region
>>>>>>   server prod003,60020,1536444066291: Exception running
>>>>>>   postOpenDeployTasks; region=88bfac1dfd807c4cd1e9c1f31b4f053f
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:
>>>>>>   java.io.InterruptedIOException: #139, interrupted.
>>>>>>   currentNumberOfTask=8
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1853)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1823)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1899)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:250)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:213)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1484)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   org.apache.hadoop.hbase.client.HTable.put(HTable.java:1031)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1033)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1023)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.MetaTableAccessor.updateLocation(MetaTableAccessor.java:1433)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.MetaTableAccessor.updateRegionLocation(MetaTableAccessor.java:1400)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:2041)
>>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$PostOpenDeployTasksThread.run(OpenRegionHandler.java:329)
>>>>>> 
>>>>>>   *prod002*
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]: 2018-09-09 01:12:30,144
>>>>>>   FATAL
>>>>>>   [RpcServer.default.FPBQ.Fifo.handler=36,queue=6,port=60020]
>>>>>>   regionserver.HRegionServer: ABORTING region
>>>>>>   server prod002,60020,1536235138673: Could not update the index
>>>>>>   table, killing server region because couldn't write to an index
>>>>>>   table
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>>>>>>    disableIndexOnFailure=true, Failed to write to multiple index
>>>>>>   tables: [KM_IDX1, KM_IDX2, KM_HISTORY1, KM_HISTORY2,
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.access$000(UngroupedAggregateRegionObserver.java:164)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver$1.doMutation(UngroupedAggregateRegionObserver.java:246)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.index.PhoenixIndexFailurePolicy.doBatchWithRetries(PhoenixIndexFailurePolicy.java:455)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.handleIndexWriteException(UngroupedAggregateRegionObserver.java:929)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:243)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1077)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>>>>>>   
>>>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>> 
>>>>>> 
>>>>>>   And etc...
>>>>>> 
>>>>>>   The Master-status web interface shows that contact was lost with
>>>>>>   these aborted servers. 
>>>>> 
>>>> 
> 
