I've found that we still have not configured this:

hbase.region.server.rpc.scheduler.factory.class = org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory

Can this misconfiguration lead to our problems?
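
If it can, I assume the fix is to add something like the following to 
hbase-site.xml on every region server and restart them (the second property is 
the one the Phoenix installation docs pair with the scheduler factory; please 
correct me if that is wrong):

    <property>
      <name>hbase.region.server.rpc.scheduler.factory.class</name>
      <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
      <description>Schedules Phoenix index and metadata updates on separate handler queues</description>
    </property>
    <property>
      <name>hbase.rpc.controllerfactory.class</name>
      <value>org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory</value>
      <description>Tags index/metadata RPCs so the scheduler can prioritize them</description>
    </property>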

> On 15 Sep 2018, at 02:04, Sergey Soldatov <sergey.solda...@gmail.com> wrote:
> 
> That was a real problem quite a long time ago (a couple of years?). I can't say 
> for sure in which version it was fixed, but now indexes have priority over 
> regular tables and their regions are opened first. So by the time we replay 
> WALs for tables, all index regions are supposed to be online. If you see the 
> problem on recent versions, it usually means that the cluster is not healthy 
> and some of the index regions are stuck in the RIT (region-in-transition) state.
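> 
> A quick way to check is the Master web UI, or, if I recall correctly, from the 
> HBase shell (a sketch; the detailed status output includes the list of regions 
> in transition):
> 
>     hbase(main):001:0> status 'detailed'
>     # inspect the "regionsInTransition" section of the output; an index table
>     # region stuck there for a long time is the usual culprit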
> 
> Thanks,
> Sergey
> 
> On Thu, Sep 13, 2018 at 8:12 PM Jonathan Leech <jonat...@gmail.com> wrote:
> This seems similar to a failure scenario I’ve seen a couple times. I believe 
> after multiple restarts you got lucky and tables were brought up by Hbase in 
> the correct order. 
> 
> What happens is some kind of semi-catastrophic failure where 1 or more region 
> servers go down with edits that weren’t flushed, and are only in the WAL. 
> These edits belong to regions whose tables have secondary indexes. Hbase 
> wants to replay the WAL before bringing up the region server. Phoenix wants 
> to talk to the index region during this, but can’t. It fails enough times 
> then stops. 
> 
> The more region servers / tables / indexes affected, the more likely that a 
> full restart will get stuck in a classic deadlock. A good old-fashioned data 
> center outage is a great way to get started with this kind of problem. You 
> might make some progress and get stuck again, or restart number N might get 
> those index regions initialized before the main table. 
> 
> The surefire way to recover a cluster in this condition is to strategically 
> disable all the tables that are failing to come up. You can do this from the 
> Hbase shell as long as the master is running. If I remember right, it's a 
> pain since the disable command will hang. You might need to disable a table, 
> kill the shell, disable the next table, etc. Then restart. You'll eventually 
> have a cluster with all the region servers finally started, and a bunch of 
> disabled tables. If you disabled index tables, enable one and wait for it to 
> become available, i.e. for its WAL edits to be replayed; then enable the 
> associated main table and wait for it to come online. If Hbase did its job 
> without error, and your failure didn't include losing 4 disks at once, order 
> will be restored. Lather, rinse, repeat until everything is enabled and 
> online. 
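> 
> As a sketch of that sequence from the Hbase shell (the KM / KM_IDX1 names are 
> just ones that appear in this thread; substitute whatever tables are stuck):
> 
>     # while the master is running:
>     disable 'KM'        # may hang; kill the shell and move on to the next table
>     disable 'KM_IDX1'
>     # ...repeat for every table that is failing to open, then restart...
>     # once all region servers are finally up:
>     enable 'KM_IDX1'    # index table first; wait for its WAL edits to replay
>     enable 'KM'         # then the main table; wait for it to come online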
> 
> <TLDR> A big enough failure sprinkled with a little bit of bad luck and what 
> seems to be a Phoenix flaw == deadlock trying to get HBASE to start up. Fix 
> by forcing the order that Hbase brings regions online. Finally, never go full 
> restart. </TLDR>
> 
> > On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com> wrote:
> > 
> > After the update, the Master web interface shows that every region server 
> > is now on 1.4.7 and there are no RITs.
> > 
> > The cluster recovered only after we restarted all region servers 4 times...
> > 
> >> On 11 Sep 2018, at 04:08, Josh Elser <els...@apache.org> wrote:
> >> 
> >> Did you update the HBase jars on all RegionServers?
> >> 
> >> Make sure that you have all of the Regions assigned (no RITs). There could 
> >> be a pretty simple explanation as to why the index can't be written to.
> >> 
> >>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
> >>> Correct me if I'm wrong,
> >>> but it looks like if you have region servers A and B, each hosting index 
> >>> and primary table regions, then a situation like this is possible:
> >>> A and B are under writes on a table with indexes.
> >>> A crashes.
> >>> B fails on an index update because A is not operating, so B starts aborting.
> >>> A, after restart, tries to rebuild the index from the WAL, but B at this 
> >>> time is aborting, so A starts aborting too.
> >>> From this moment nothing happens (0 requests to region servers) and A and 
> >>> B are not responsive in the Master-status web interface.
> >>>> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com> wrote:
> >>>> 
> >>>> After the update we still can't recover the HBase cluster. Our region 
> >>>> servers are ABORTING over and over:
> >>>> 
> >>>> prod003:
> >>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod003,60020,1536446665703: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod003,60020,1536446665703: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod003,60020,1536446665703: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod003,60020,1536446665703: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod003,60020,1536450772841: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod003,60020,1536450772841: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> 
> >>>> prod004:
> >>>> Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod004,60020,1536446387325: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod004,60020,1536446387325: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod004,60020,1536446387325: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod004,60020,1536446387325: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> 
> >>>> prod005:
> >>>> Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod005,60020,1536446400009: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod005,60020,1536446400009: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod005,60020,1536446400009: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058 FATAL 
> >>>> [RpcServer.default.FPBQ.Fifo.handler=16,queue=6,port=60020] 
> >>>> regionserver.HRegionServer: ABORTING region server 
> >>>> prod005,60020,1536446400009: Could not update the index table, killing 
> >>>> server region because couldn't write to an index table
> >>>> 
> >>>> And so on...
> >>>> 
> >>>> The trace is the same everywhere:
> >>>> 
> >>>> Sep 09 02:54:30 prod005 hbase[3772]: 
> >>>> org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
> >>>>   disableIndexOnFailure=true, Failed to write to multiple index tables: 
> >>>> [KM_IDX1, KM_IDX2, KM_HISTORY_IDX1, KM_HISTORY_IDX2, KM_HISTORY_IDX3]
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:241)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1068)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at 
> >>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> >>>> 
> >>>>> On 9 Sep 2018, at 01:44, Batyrshin Alexander <0x62...@gmail.com> wrote:
> >>>>> 
> >>>>> Thank you.
> >>>>> We're updating our cluster right now...
> >>>>> 
> >>>>> 
> >>>>>> On 9 Sep 2018, at 01:39, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>> 
> >>>>>> It seems you should deploy hbase with the following fix:
> >>>>>> 
> >>>>>> HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash
> >>>>>> 
> >>>>>> 1.4.7 was recently released.
> >>>>>> 
> >>>>>> FYI
> >>>>>> 
> >>>>>> On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander <0x62...@gmail.com> wrote:
> >>>>>> 
> >>>>>>    Hello,
> >>>>>> 
> >>>>>>   We got this exception from the *prod006* server:
> >>>>>> 
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532
> >>>>>>   FATAL [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING
> >>>>>>   region server prod006,60020,1536235102833: Replay of
> >>>>>>   WAL required. Forcing server shutdown
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:
> >>>>>>   org.apache.hadoop.hbase.DroppedSnapshotException:
> >>>>>>   region: 
> >>>>>> KM,c\xEF\xBF\xBD\x16I7\xEF\xBF\xBD\x0A"A\xEF\xBF\xBDd\xEF\xBF\xBD\xEF\xBF\xBD\x19\x07t,1536178245576.60c121ba50e67f2429b9ca2ba2a11bad.
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2322)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2284)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2170)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2095)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:508)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:478)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   java.lang.Thread.run(Thread.java:748)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: Caused by:
> >>>>>>   java.lang.NullPointerException
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   java.util.ArrayList.<init>(ArrayList.java:178)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:863)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1172)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1145)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HStore.access$900(HStore.java:122)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2505)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2600)
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         ... 9 more
> >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532
> >>>>>>   FATAL [MemStoreFlusher.1] regionserver.HRegionServer:
> >>>>>>   RegionServer abort: loaded coprocessors
> >>>>>>   are: 
> >>>>>> [org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> >>>>>>   org.apache.phoenix.coprocessor.SequenceRegionObserver,
> >>>>>>   org.apache.phoenix.c
> >>>>>> 
> >>>>>>   After that we got ABORTING on almost every region server in the
> >>>>>>   cluster, with different reasons:
> >>>>>> 
> >>>>>>   *prod003*
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]: 2018-09-09 01:12:11,799
> >>>>>>   FATAL [PostOpenDeployTasks:88bfac1dfd807c4cd1e9c1f31b4f053f]
> >>>>>>   regionserver.HRegionServer: ABORTING region
> >>>>>>   server prod003,60020,1536444066291: Exception running
> >>>>>>   postOpenDeployTasks; region=88bfac1dfd807c4cd1e9c1f31b4f053f
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:
> >>>>>>   java.io.InterruptedIOException: #139, interrupted.
> >>>>>>   currentNumberOfTask=8
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1853)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1823)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1899)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:250)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:213)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1484)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   org.apache.hadoop.hbase.client.HTable.put(HTable.java:1031)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1033)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1023)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.MetaTableAccessor.updateLocation(MetaTableAccessor.java:1433)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.MetaTableAccessor.updateRegionLocation(MetaTableAccessor.java:1400)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:2041)
> >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$PostOpenDeployTasksThread.run(OpenRegionHandler.java:329)
> >>>>>> 
> >>>>>>   *prod002*
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]: 2018-09-09 01:12:30,144
> >>>>>>   FATAL
> >>>>>>   [RpcServer.default.FPBQ.Fifo.handler=36,queue=6,port=60020]
> >>>>>>   regionserver.HRegionServer: ABORTING region
> >>>>>>   server prod002,60020,1536235138673: Could not update the index
> >>>>>>   table, killing server region because couldn't write to an index
> >>>>>>   table
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
> >>>>>>    disableIndexOnFailure=true, Failed to write to multiple index
> >>>>>>   tables: [KM_IDX1, KM_IDX2, KM_HISTORY1, KM_HISTORY2,
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.access$000(UngroupedAggregateRegionObserver.java:164)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver$1.doMutation(UngroupedAggregateRegionObserver.java:246)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.index.PhoenixIndexFailurePolicy.doBatchWithRetries(PhoenixIndexFailurePolicy.java:455)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.handleIndexWriteException(UngroupedAggregateRegionObserver.java:929)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:243)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1077)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
> >>>>>>   
> >>>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> >>>>>> 
> >>>>>> 
> >>>>>>   And so on...
> >>>>>> 
> >>>>>>   The Master-status web interface shows that contact was lost with
> >>>>>>   these aborted servers.
> >>>>> 
> >>>> 
> > 
