RE: HBASE WALs

Marc Hoppins Thu, 11 Mar 2021 04:14:21 -0800

Currently, the hbase UI reports that there is only ONE region on hbase25 - which is
probably our stuck region.  Does this help us in any way to fix this more
easily?


-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com> 
Sent: Wednesday, March 10, 2021 10:56 AM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs

>
> Sorry if I seem stupid but this is still all new to me.
>
Forgot to mention, there are no stupid questions here. Don't be shy and keep 'em
coming.

On Wed, 10 Mar 2021 at 09:48, Wellington Chevreuil
<wellington.chevre...@gmail.com> wrote:

>> However, how would that help anyway?  If we cannot fix this at this time
>> then any upgrade would have inconsistencies also, yes?
>>
> The upgrade on its own wouldn't fix existing inconsistencies, but you
> would now have support for additional tooling (hbase-operator-tools)
> to help you with this.
>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each 
>> server mentioned in that procedure?  Or does it just mean that the 
>> region was successfully unassigned from hbase25 but the data still 
>> resides on hbase25?  I see locality 0.
>>
> IIRC, those were all UnassignProcedures, so it means the unassignment 
> of the related region has completed and the region for that particular 
> procedure went offline.
>
>> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems?
>
> Masters work with their own in-memory cache of meta, so manually updating
> meta will just make the master's cache inconsistent with it. You would need
> to restart the masters to get the cache reloaded from meta. The main
> problem is that you still have the rogue procedures, which you can't
> get rid of without stopping the cluster. One alternative to a full
> cluster outage would be to identify all RSes running the rogue procs
> (you can find that from the active master logs), then stop only those and
> the master, clean out masterprocwals, then start them again.
>
>
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does 
>> it mean that the table is waiting to be disabled?  HBASE master 
>> declares that table is NOT enabled.
>>
> The table state may have been already updated to disabled, most of its 
> regions may already be offline, but the 73587 DisableTableProcedure 
> cannot be considered "done" until all its sub procedures are indeed completed.
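
To illustrate the parent/sub-procedure rule described above, here is a minimal
Python sketch. It is not HBase code: the Proc record and both helper functions
are hypothetical, and the sample rows are only a hand-made stand-in for
list_procedures output. PIDs 73587 and 73827 come from this thread; PID 73600
and all names and states in SAMPLE are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proc:
    """Hypothetical record for one row of list_procedures output."""
    pid: int
    parent: Optional[int]  # None for a root procedure
    name: str
    state: str


def unfinished_children(procs: List[Proc], parent_pid: int) -> List[Proc]:
    """Return the sub-procedures of parent_pid that have not reached SUCCESS."""
    return [p for p in procs if p.parent == parent_pid and p.state != "SUCCESS"]


def parent_can_finish(procs: List[Proc], parent_pid: int) -> bool:
    """A parent is only 'done' once none of its sub-procedures is unfinished."""
    return not unfinished_children(procs, parent_pid)


# Illustrative data only: 73587 and 73827 are from the thread, 73600 is invented.
SAMPLE = [
    Proc(73587, None, "DisableTableProcedure", "WAITING"),
    Proc(73600, 73587, "UnassignProcedure", "SUCCESS"),
    Proc(73827, 73587, "UnassignProcedure", "WAIT_TIMEOUT"),
]

if __name__ == "__main__":
    stuck = unfinished_children(SAMPLE, 73587)
    print("unfinished sub-procedures:", [p.pid for p in stuck])
    print("parent can finish:", parent_can_finish(SAMPLE, 73587))
```

With rows like these, the single stuck unassign is enough to keep the whole
DisableTableProcedure open, even though the other sub-procedures succeeded.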
>
>
> On Tue, 9 Mar 2021 at 13:40, Marc Hoppins
> <marc.hopp...@eset.sk> wrote:
>
>> Thanks for that.
>>
>> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1
>> and do not have a viable business case to pay the extortionate amount
>> of money required to upgrade, which would give these clusters access
>> to newer versions.
>>
>> However, how would that help anyway?  If we cannot fix this at this 
>> time then any upgrade would have inconsistencies also, yes?
>>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
>> mean that they were successfully and fully moved from hbase25 to each 
>> server mentioned in that procedure?  Or does it just mean that the 
>> region was successfully unassigned from hbase25 but the data still 
>> resides on hbase25?  I see locality 0.
>>
>> If we change the table state in meta to 'ENABLED', could this 
>> kickstart all these things or will it just lead to further problems?  
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does 
>> it mean that the table is waiting to be disabled?  HBASE master 
>> declares that table is NOT enabled.
>>
>> Sorry if I seem stupid but this is still all new to me.
>>
>> I appreciate the help.
>>
>> -----Original Message-----
>> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
>> Sent: Tuesday, March 9, 2021 1:20 PM
>> To: Hbase-User <user@hbase.apache.org>
>> Subject: Re: HBASE WALs
>>
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE
>> > procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
>> > the problem.
>> >
>> Per your list procedures output attached, it seems the procs' states
>> are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with
>> PID 73827, which is the UnassignProcedure for this region. The problem is
>> that there are already 5 APs for the same region, which may be
>> causing some deadlocks. If this cluster were on an hbck2-supported
>> version, you could get rid of this state using the bypass command on all
>> these proc ids, then manually get the table/region states consistent
>> again using the setRegionState/setTableState/assigns/unassigns methods.
>>
>> Without tooling, the only option I can think of is to stop the cluster,
>> clean out masterprocwals, restart the cluster, then use hbase shell to
>> enable/disable/assign regions. You may also need to manually update
>> table/region states in the meta table. Of course, you can automate these
>> manual steps into your own tooling, but it may be a better strategy in
>> the long term to upgrade to a more stable version that also benefits
>> from more tooling supported by the community.
>>
>>
>>
>>
>>
>> On Mon, 8 Mar 2021 at 07:50, Marc Hoppins
>> <marc.hopp...@eset.sk> wrote:
>>
>> > Hi, Wellington,
>> >
>> > I was on 'vacation' (no road trip or overseas anything) for a week.
>> >
>> > All fails are waiting on the same PID (73587), a DISABLE TABLE
>> > procedure.
>> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
>> > the problem.
>> >
>> > I am still mystified about the HBCK2-tools. I have attached a 
>> > previous thread that you commented on at the time.
>> >
>> > I did build the tool for our HBASE 2.1.0...or rather, I built it on
>> > Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then successfully ran it on
>> > Ubuntu 16.04 with a slightly different Java (Oracle Java 8, 1.8.0_181).  I
>> > used it to help fix a similar problem with an offline table and RITs.
>> > Both HBASE versions are the same.
>> >
>> > I attach a 'sheet' with the current procs/locks.
>> >
>> > -----Original Message-----
>> > From: Marc Hoppins <marc.hopp...@eset.sk>
>> > Sent: Wednesday, March 3, 2021 9:51 AM
>> > To: user@hbase.apache.org
>> > Cc: Martin Oravec <martin.ora...@eset.sk>
>> > Subject: RE: HBASE WALs
>> >
>> > Thanks, Wellington,
>> >
>> > I have already built an hbck1-tools for 2.1.0 using the method described
>> > in other topics. All the HBASE and JDK installs here are the same
>> > version, so if it worked fixing one cluster's HBASE then it should work
>> > for the other installs.
>> >
>> > Fiddling with masterprocWALs will require a complete shutdown of
>> > hbase operations to prevent incoming reads/writes on other tables
>> > and I am not sure how disruptive that will be other than "probably a lot".
>> >
>> > -----Original Message-----
>> > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
>> > Sent: Tuesday, March 2, 2021 10:57 AM
>> > To: Hbase-User <user@hbase.apache.org>
>> > Subject: Re: HBASE WALs
>> >
>> > Sorry, missed your previous email. I was hoping you were not on a 
>> > non-stable version, so that you would benefit from hbck2 tool support.
>> > Unfortunately, 2.1.0 is among the early releases that don't work 
>> > with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>> >
>> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
>> > > mostly unhappy with one region in particular, and is reporting on that.
>> > >
>> > Are the other regions for the table properly closed, and this is 
>> > the only one stuck? If you do a list_procedures, are you able to 
>> > identify an 'unassign' procedure still running for this table? Or 
>> > if you grep master logs for this region, do you see any messages 
>> > suggesting there's still ongoing attempts to bring the region 
>> > offline? If there's apparently no procedure/no ongoing attempts to 
>> > offline the region, you might try to manually update its state in 
>> > meta table, then flip masters (assuming you have master HA), so 
>> > that the new active loads an up to date state from meta table.
>> >
>> > Otherwise, if there's still a rogue procedure trying to offline the 
>> > region, unfortunately, due to the lack of hbck support, you would 
>> > most likely need a more disruptive intervention similar to what you 
>> > had described in your first email, but instead of normal wal 
>> > folder, master proc wals is what you really would need to clean out 
>> > here, as that is where procedures state is persisted, and you 
>> > wouldn't want the rogue procedure to be resumed.
>> >
>> > On Mon, 1 Mar 2021 at 10:22, Marc Hoppins
>> > <marc.hopp...@eset.sk> wrote:
>> >
>> > > If you know of anything that will help I would appreciate it.
>> > >
>> > > If you need any log output let me know.
>> > >
>> > > Thanks
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
>> > > Sent: Thursday, February 25, 2021 4:08 PM
>> > > To: Hbase-User <user@hbase.apache.org>
>> > > Subject: Re: HBASE WALs
>> > >
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL 
>> > > > or is one WAL associated with one region?
>> > > >
>> > > Edits for multiple regions can be present in a single WAL file.
>> > > That's why, upon an RS crash and WAL processing, there's a WAL split phase.
>> > >
>> > > > I am trying to find a way to clear a RIT for a disabled table. A
>> > > > similar
>> > > > problem (but on a test cluster) involved me clearing znode 
>> > > > info, deleting HDFS data for the table and deleting 
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > Which hbase version are you on?
>> > >
>> > > On Thu, 25 Feb 2021 at 11:51, Marc Hoppins
>> > > <marc.hopp...@eset.sk> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Do WAL files contain information for multiple regions per WAL 
>> > > > or is one WAL associated with one region?
>> > > >
>> > > > I am trying to find a way to clear a RIT for a disabled table. 
>> > > > A similar problem (but on a test cluster) involved me clearing 
>> > > > znode info, deleting HDFS data for the table and deleting 
>> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
>> > > >
>> > > > Table cannot be enabled.
>> > > >
>> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
>> > > > seems mostly unhappy with one region in particular, and is 
>> > > > reporting on that.
>> > > >
>> > > > There are many tables that are very active so I don't think it 
>> > > > is possible to stop the entire service without a lot of 
>> > > > forewarning to users.
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > >
>> >
>>
>
