Indexes in Phoenix should not, in theory, cause any cluster outage.  An index
write failure should just disable the index, not cause a crash.
In practice, there have been some bugs around race conditions, the most
dangerous of which accidentally triggers the KillServerOnFailurePolicy, which
can then cascade across the cluster.
That policy is there for legacy reasons; I believe at the time it was the
only way to keep indexes consistent - kill the RS and replay from the WAL.
There is now a partial rebuilder which detects when an index has been
disabled due to a write failure and asynchronously attempts to rebuild it.
Killing the RS is supposed to be a last-ditch effort, used only if the
index could not be disabled (because otherwise your index is out of sync
but still active, and your queries will return incorrect results).
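As an aside, you can check whether the rebuilder has disabled anything by
looking at the index state recorded in SYSTEM.CATALOG.  A rough sketch (the
single-character state codes vary a bit across versions, so treat it as
illustrative; MY_INDEX / MY_TABLE are placeholders):

    -- list indexes and their recorded state ('a' = ACTIVE; other codes
    -- mean building / disabled / inactive, depending on the version)
    SELECT TABLE_SCHEM, TABLE_NAME, INDEX_STATE
    FROM SYSTEM.CATALOG
    WHERE TABLE_TYPE = 'i' AND INDEX_STATE IS NOT NULL;

    -- if an index stays disabled, you can kick off a manual rebuild
    ALTER INDEX MY_INDEX ON MY_TABLE REBUILD;
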
PHOENIX-4977 has since made the policy configurable.  If you would rather, in
the worst case, have your index potentially get out of sync instead of having
RSs killed, you can set the policy to LeaveIndexActiveFailurePolicy.
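If you do switch, it is a server-side setting: something along these lines in
hbase-site.xml on every RegionServer, followed by a rolling restart.  The
property key below is my best recollection of what PHOENIX-4977 added, and the
package of the policy class is an assumption as well, so verify both against
the patch before relying on it:

    <!-- hbase-site.xml on the RegionServers; key name assumed, see PHOENIX-4977 -->
    <property>
      <name>org.apache.hadoop.hbase.index.failurepolicy</name>
      <value>org.apache.phoenix.hbase.index.write.LeaveIndexActiveFailurePolicy</value>
    </property>

Keep in mind that with that policy a failed index write leaves the index
active but potentially stale, so you would need to rebuild or verify the index
yourself - the partial rebuilder only kicks in for indexes that were actually
disabled.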

On Fri, Nov 2, 2018 at 5:14 PM Neelesh <neele...@gmail.com> wrote:

> By no means am I judging Phoenix based on this. This is simply a design
> trade-off (ScyllaDB goes the same route and builds global indexes). I
> appreciate all the effort that has gone into Phoenix, and it was indeed a
> lifesaver. But the technical point remains that single-node failures have
> the potential to cascade to the entire cluster. That's the nature of global
> indexes, not specific to Phoenix.
>
> I apologize if my response came off as dismissing Phoenix altogether.
> FWIW, I'm a big advocate of Phoenix at my org internally, albeit for the
> newer version.
>
>
> On Fri, Nov 2, 2018, 4:09 PM Josh Elser <els...@apache.org> wrote:
>
>> I would strongly disagree with the assertion that this is some
>> unavoidable problem. Yes, an inverted index is a data structure which,
>> by design, creates a hotspot (phrased another way, this is "data
>> locality").
>>
>> Lots of extremely smart individuals have spent a significant amount of
>> time and effort in stabilizing secondary indexes in the past 1-2 years,
>> not to mention others spending time on a local index implementation.
>> Judging Phoenix in its entirety based on an arbitrarily old version
>> of Phoenix is disingenuous.
>>
>> On 11/2/18 2:00 PM, Neelesh wrote:
>> > I think this is an unavoidable problem in some sense, if global indexes
>> > are used. Essentially, global indexes create a graph of dependent
>> > region servers due to index RPC calls from one RS to another. Any
>> > single failure is bound to affect the entire graph, which under
>> > reasonable load becomes the entire HBase cluster. We had to drop global
>> > indexes just to keep the cluster running for more than a few days.
>> >
>> > I think Cassandra has local secondary indexes precisely because of this
>> > issue. Last I checked, there were significant pending improvements
>> > required for Phoenix local indexes, especially around read paths (not
>> > utilizing primary key prefixes in secondary index reads where possible,
>> > for example).
>> >
>> >
>> > On Thu, Sep 13, 2018, 8:12 PM Jonathan Leech <jonat...@gmail.com> wrote:
>> >
>> >     This seems similar to a failure scenario I’ve seen a couple of
>> >     times. I believe after multiple restarts you got lucky and the
>> >     tables were brought up by HBase in the correct order.
>> >
>> >     What happens is some kind of semi-catastrophic failure where one or
>> >     more region servers go down with edits that weren’t flushed and exist
>> >     only in the WAL. These edits belong to regions whose tables have
>> >     secondary indexes. HBase wants to replay the WAL before bringing up
>> >     the region server. Phoenix wants to talk to the index region during
>> >     this, but can’t. It fails enough times, then stops.
>> >
>> >     The more region servers / tables / indexes affected, the more likely
>> >     that a full restart will get stuck in a classic deadlock. A good
>> >     old-fashioned data center outage is a great way to get started with
>> >     this kind of problem. You might make some progress and get stuck
>> >     again, or restart number N might get those index regions initialized
>> >     before the main table.
>> >
>> >     The surefire way to recover a cluster in this condition is to
>> >     strategically disable all the tables that are failing to come up.
>> >     You can do this from the HBase shell as long as the master is
>> >     running. If I remember right, it’s a pain since the disable command
>> >     will hang. You might need to disable a table, kill the shell,
>> >     disable the next table, etc. Then restart. You’ll eventually have a
>> >     cluster with all the region servers finally started, and a bunch of
>> >     disabled tables. If you disabled index tables, enable one, wait for
>> >     it to become available (its WAL edits will be replayed), then
>> >     enable the associated main table and wait for it to come online. If
>> >     HBase did its job without error, and your failure didn’t include
>> >     losing 4 disks at once, order will be restored. Lather, rinse,
>> >     repeat until everything is enabled and online.
>> >
>> >     <TLDR> A big enough failure sprinkled with a little bit of bad luck
>> >     and what seems to be a Phoenix flaw == deadlock trying to get HBase
>> >     to start up. Fix by forcing the order in which HBase brings regions
>> >     online. Finally, never go full restart. </TLDR>
>> >
>> >      > On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander
>> >     <0x62...@gmail.com> wrote:
>> >      >
>> >      > After the update, the Master web interface shows that every
>> >      > region server is now on 1.4.7 and there are no RITs.
>> >      >
>> >      > The cluster recovered only after we restarted all region servers
>> >      > 4 times...
>> >      >
>> >      >> On 11 Sep 2018, at 04:08, Josh Elser <els...@apache.org> wrote:
>> >      >>
>> >      >> Did you update the HBase jars on all RegionServers?
>> >      >>
>> >      >> Make sure that you have all of the Regions assigned (no RITs).
>> >     There could be a pretty simple explanation as to why the index can't
>> >     be written to.
>> >      >>
>> >      >>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
>> >      >>> Correct me if I'm wrong, but it looks like, if you have region
>> >      >>> servers A and B that each host index and primary table regions,
>> >      >>> the following situation is possible:
>> >      >>> A and B are taking writes on a table with indexes.
>> >      >>> A crashes.
>> >      >>> B fails on an index update because A is not operating, so B
>> >      >>> starts aborting.
>> >      >>> After restart, A tries to rebuild the index from the WAL, but B
>> >      >>> is aborting at this time, so A starts aborting too.
>> >      >>> From this moment nothing happens (0 requests to region servers),
>> >      >>> and A and B are unresponsive in the Master status web interface.
>> >      >>>> On 9 Sep 2018, at 04:38, Batyrshin Alexander
>> >     <0x62...@gmail.com> wrote:
>> >      >>>>
>> >      >>>> After the update we still can't recover the HBase cluster.
>> >      >>>> Our region servers keep ABORTING over and over:
>> >      >>>>
>> >      >>>> prod003:
>> >      >>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod003,60020,1536446665703: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod003,60020,1536446665703: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod003,60020,1536446665703: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod003,60020,1536446665703: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod003,60020,1536450772841: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod003,60020,1536450772841: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>>
>> >      >>>> prod004:
>> >      >>>> Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod004,60020,1536446387325: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod004,60020,1536446387325: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod004,60020,1536446387325: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod004,60020,1536446387325: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>>
>> >      >>>> prod005:
>> >      >>>> Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod005,60020,1536446400009: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod005,60020,1536446400009: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod005,60020,1536446400009: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058
>> >     FATAL [RpcServer.default.FPBQ.Fifo.handler=16,queue=6,port=60020]
>> >     regionserver.HRegionServer: ABORTING region server
>> >     prod005,60020,1536446400009: Could not update the index table,
>> >     killing server region because couldn't write to an index table
>> >      >>>>
>> >      >>>> And so on...
>> >      >>>>
>> >      >>>> The trace is the same everywhere:
>> >      >>>>
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:
>> >
>>  org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>> >     disableIndexOnFailure=true, Failed to write to multiple index
>> >     tables: [KM_IDX1, KM_IDX2, KM_HISTORY_IDX1, KM_HISTORY_IDX2,
>> >     KM_HISTORY_IDX3]
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >     org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:241)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1068)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >     org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >     org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> >      >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at
>> >
>>  org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> >      >>>>
>> >      >>>>> On 9 Sep 2018, at 01:44, Batyrshin Alexander
>> >     <0x62...@gmail.com> wrote:
>> >      >>>>>
>> >      >>>>> Thank you.
>> >      >>>>> We're updating our cluster right now...
>> >      >>>>>
>> >      >>>>>
>> >      >>>>>> On 9 Sep 2018, at 01:39, Ted Yu <yuzhih...@gmail.com> wrote:
>> >      >>>>>>
>> >      >>>>>> It seems you should deploy HBase with the following fix:
>> >      >>>>>>
>> >      >>>>>> HBASE-21069 NPE in StoreScanner.updateReaders causes RS to
>> crash
>> >      >>>>>>
>> >      >>>>>> 1.4.7 was recently released.
>> >      >>>>>>
>> >      >>>>>> FYI
>> >      >>>>>>
>> >      >>>>>> On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander
>> >     <0x62...@gmail.com> wrote:
>> >      >>>>>>
>> >      >>>>>>    Hello,
>> >      >>>>>>
>> >      >>>>>>   We got this exception from *prod006* server
>> >      >>>>>>
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09
>> 00:38:02,532
>> >      >>>>>>   FATAL [MemStoreFlusher.1] regionserver.HRegionServer:
>> ABORTING
>> >      >>>>>>   region server prod006,60020,1536235102833: Replay of
>> >      >>>>>>   WAL required. Forcing server shutdown
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:
>> >      >>>>>>   org.apache.hadoop.hbase.DroppedSnapshotException:
>> >      >>>>>>   region:
>> >
>>  
>> KM,c\xEF\xBF\xBD\x16I7\xEF\xBF\xBD\x0A"A\xEF\xBF\xBDd\xEF\xBF\xBD\xEF\xBF\xBD\x19\x07t,1536178245576.60c121ba50e67f2429b9ca2ba2a11bad.
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2322)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2284)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2170)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2095)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:508)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:478)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>   java.lang.Thread.run(Thread.java:748)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: Caused by:
>> >      >>>>>>   java.lang.NullPointerException
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>   java.util.ArrayList.<init>(ArrayList.java:178)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:863)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1172)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1145)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.regionserver.HStore.access$900(HStore.java:122)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2505)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2600)
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]:         ... 9 more
>> >      >>>>>>   Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09
>> 00:38:02,532
>> >      >>>>>>   FATAL [MemStoreFlusher.1] regionserver.HRegionServer:
>> >      >>>>>>   RegionServer abort: loaded coprocessors
>> >      >>>>>>   are:
>> >
>>  [org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
>> >      >>>>>>   org.apache.phoenix.coprocessor.SequenceRegionObserver,
>> >      >>>>>>   org.apache.phoenix.c
>> >      >>>>>>
>> >      >>>>>>   After that we got ABORTING on almost every Region Server
>> >      >>>>>>   in the cluster, with different reasons:
>> >      >>>>>>
>> >      >>>>>>   *prod003*
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]: 2018-09-09
>> 01:12:11,799
>> >      >>>>>>   FATAL
>> [PostOpenDeployTasks:88bfac1dfd807c4cd1e9c1f31b4f053f]
>> >      >>>>>>   regionserver.HRegionServer: ABORTING region
>> >      >>>>>>   server prod003,60020,1536444066291: Exception running
>> >      >>>>>>   postOpenDeployTasks;
>> region=88bfac1dfd807c4cd1e9c1f31b4f053f
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:
>> >      >>>>>> java.io.InterruptedIOException: #139,
>> >     interrupted.
>> >      >>>>>>   currentNumberOfTask=8
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1853)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1823)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1899)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:250)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:213)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1484)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>>  org.apache.hadoop.hbase.client.HTable.put(HTable.java:1031)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1033)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1023)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.MetaTableAccessor.updateLocation(MetaTableAccessor.java:1433)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.MetaTableAccessor.updateRegionLocation(MetaTableAccessor.java:1400)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:2041)
>> >      >>>>>>   Sep 09 01:12:11 prod003 hbase[11552]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$PostOpenDeployTasksThread.run(OpenRegionHandler.java:329)
>> >      >>>>>>
>> >      >>>>>>   *prod002*
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]: 2018-09-09
>> 01:12:30,144
>> >      >>>>>>   FATAL
>> >      >>>>>>
>>  [RpcServer.default.FPBQ.Fifo.handler=36,queue=6,port=60020]
>> >      >>>>>>   regionserver.HRegionServer: ABORTING region
>> >      >>>>>>   server prod002,60020,1536235138673: Could not update the
>> index
>> >      >>>>>>   table, killing server region because couldn't write to an
>> >     index
>> >      >>>>>>   table
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:
>> >      >>>>>>
>> >
>>  org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>> >      >>>>>>    disableIndexOnFailure=true, Failed to write to multiple
>> index
>> >      >>>>>>   tables: [KM_IDX1, KM_IDX2, KM_HISTORY1, KM_HISTORY2,
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >       org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.access$000(UngroupedAggregateRegionObserver.java:164)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver$1.doMutation(UngroupedAggregateRegionObserver.java:246)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.index.PhoenixIndexFailurePolicy.doBatchWithRetries(PhoenixIndexFailurePolicy.java:455)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.handleIndexWriteException(UngroupedAggregateRegionObserver.java:929)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:243)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1077)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >       org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >       org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> >      >>>>>>   Sep 09 01:12:30 prod002 hbase[29056]:         at
>> >      >>>>>>
>> >
>>  org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> >      >>>>>>
>> >      >>>>>>
>> >      >>>>>>   And so on...
>> >      >>>>>>
>> >      >>>>>>   The Master status web interface shows that contact was
>> >      >>>>>>   lost with these aborted servers.
>> >      >>>>>
>> >      >>>>
>> >      >
>> >
>>
>
