RE: HBASE WALs

Marc Hoppins Tue, 16 Mar 2021 00:13:21 -0700

Overall, I am mystified as to how this could happen.  If Hadoop has a 
replication factor (I believe we use the default) of 3 and we have two 
datacenters with masters and workers in both, how can a network outage affect 
Hadoop operation? Surely it should have used available resources to continue 
operations...or have I misinterpreted entirely?


-----Original Message-----
From: Stack <[email protected]> 
Sent: Tuesday, March 16, 2021 7:16 AM
To: Hbase-User <[email protected]>
Subject: Re: HBASE WALs

EXTERNAL

On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <[email protected]> wrote:

> Hi, all,
>
> For our stuck region, this exists in meta.  Could we alter the state 
> to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
>
You could but IIRC, in that version of HBase, you may need to restart the 
Master after the change (changing hbase:meta does not update the Master's 
in-memory state). On restart, Master will read hbase:meta to discover Region 
state.

S


> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:regioninfo, timestamp=1613580024017,
>   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
>   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
>   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:seqnumDuringOpen, timestamp=1611787189839,
>   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:server, timestamp=1611787189839,
>   value=dr1-hbase18.jumbo.hq.eset.com:16020
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:serverstartcode, timestamp=1611787189839,
>   value=1611785264032
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:sn, timestamp=1613580024017,
>   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:state, timestamp=1613580024017, value=OPENING
>
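The cells in the meta row above can be read more easily once decoded. The following is a minimal Python sketch, not part of the original exchange. It assumes that the row key follows the `<table>,<start key>,<region id>.<encoded name>.` layout seen in the dump, that `info:seqnumDuringOpen` holds an 8-byte big-endian long, and that cell timestamps are milliseconds since the Unix epoch. The helper names are made up for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_region_name(name):
    """Split '<table>,<start key>,<region id>.<encoded name>.' into parts."""
    # The encoded name sits between the last two dots of the row key.
    body, encoded, _ = name.rsplit(".", 2)
    # The table name is the first comma-separated field and the region id
    # the last one; whatever lies between is treated as the start key.
    table, rest = body.split(",", 1)
    start_key, region_id = rest.rsplit(",", 1)
    return {
        "table": table,
        "start_key": start_key,
        "region_id": int(region_id),
        "encoded": encoded,
    }


def decode_long(raw):
    """Decode an 8-byte big-endian signed long (assumed cell encoding)."""
    return struct.unpack(">q", raw)[0]


def millis_to_utc(ms):
    """Convert a millisecond timestamp to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


row = "hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec."
print(parse_region_name(row))
print(decode_long(b"\x00\x00\x00\x00\x00\x00\x04\x8f"))  # info:seqnumDuringOpen
print(millis_to_utc(1611787189839).isoformat())  # info:server cell
print(millis_to_utc(1613580024017).isoformat())  # info:state cell
```

Comparing the two printed timestamps shows that the `info:server` cell and the `info:state`/`info:sn` cells were written about three weeks apart, which may help when matching the row against master logs.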
> -----Original Message-----
> From: Wellington Chevreuil <[email protected]>
> Sent: Wednesday, March 10, 2021 10:56 AM
> To: Hbase-User <[email protected]>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > Sorry if I seem stupid but this is still all new to me.
> >
> Forgot to mention, there's no stupid questions here. Don't be shy and 
> keep'em coming.
>
> On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <[email protected]> wrote:
>
> >> However, how would that help anyway?  If we cannot fix this at this 
> >> time then any upgrade would have inconsistencies also, yes?
> >>
> > The upgrade on its own wouldn't fix existing inconsistencies, but 
> > you would then have support for additional tooling 
> > (hbase-operator-tools) to help you with this.
> >
> >> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
> >> that they were successfully and fully moved from hbase25 to each 
> >> server mentioned in that procedure?  Or does it just mean that the 
> >> region was successfully unassigned from hbase25 but the data still 
> >> resides on hbase25?  I see locality 0.
> >>
> > IIRC, those were all UnassignProcedures, so it means the 
> > unassignment of the related region has completed and the region for 
> > that particular procedure went offline.
> >
> >> If we change the table state in meta to 'ENABLED', could this kickstart
> >> all these things or will it just lead to further problems?
> >
> > Each master works from its own in-memory cache of meta, so manually 
> > updating meta will just make that cache inconsistent with meta. You 
> > would need to restart the masters to reload the cache from meta. The 
> > main problem is that you still have the rogue procedures, which you 
> > can't get rid of without stopping the cluster. One alternative to a 
> > full cluster outage would be to identify all RSes running the rogue 
> > procs (you can find that from active master logs), then stop only 
> > those and master, clean masterprocwals, then start it again.
> >
> >
> >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> >> does it mean that the table is waiting to be disabled?  HBASE 
> >> master declares that table is NOT enabled.
> >>
> > The table state may have been already updated to disabled, most of 
> > its regions may already be offline, but the 73587 
> > DisableTableProcedure cannot be considered "done" until all its sub 
> > procedures are indeed completed.
> >
> >
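The point about parent and sub-procedures can be illustrated with a toy model. This is only a sketch in Python, not HBase's procedure framework: the `Proc` class and `blockers` function are invented for illustration. PIDs 73587 and 73827 come from the thread; the other two child PIDs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proc:
    pid: int
    kind: str
    state: str
    children: List["Proc"] = field(default_factory=list)


def blockers(parent):
    """Return the sub-procedures that still keep the parent from finishing."""
    stuck = []
    for child in parent.children:
        if child.state != "SUCCESS":
            stuck.append(child)
        # A child may itself have unfinished sub-procedures.
        stuck.extend(blockers(child))
    return stuck


disable = Proc(73587, "DisableTableProcedure", "WAITING", [
    Proc(73588, "UnassignProcedure", "SUCCESS"),       # hypothetical PID
    Proc(73589, "UnassignProcedure", "SUCCESS"),       # hypothetical PID
    Proc(73827, "UnassignProcedure", "WAIT_TIMEOUT"),  # PID from the thread
])

for proc in blockers(disable):
    print(f"pid={proc.pid} {proc.kind} is still {proc.state}")
```

In this model a single unfinished child is enough to keep the parent open, even when the table state and most regions already look disabled.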
> > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <[email protected]> wrote:
> >
> >> Thanks for that.
> >>
> >> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1 
> >> and do not have a viable business use to pay the extortionate 
> >> amount of money required to upgrade, which would give these 
> >> clusters access to newer versions.
> >>
> >> However, how would that help anyway?  If we cannot fix this at this 
> >> time then any upgrade would have inconsistencies also, yes?
> >>
> >> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> >> mean that they were successfully and fully moved from hbase25 to 
> >> each server mentioned in that procedure?  Or does it just mean that 
> >> the region was successfully unassigned from hbase25 but the data 
> >> still resides on hbase25?  I see locality 0.
> >>
> >> If we change the table state in meta to 'ENABLED', could this 
> >> kickstart all these things or will it just lead to further problems?
> >> I suppose it means I am asking, the 73587 DisableTableProcedure, 
> >> does it mean that the table is waiting to be disabled?  HBASE 
> >> master declares that table is NOT enabled.
> >>
> >> Sorry if I seem stupid but this is still all new to me.
> >>
> >> I appreciate the help.
> >>
> >> -----Original Message-----
> >> From: Wellington Chevreuil <[email protected]>
> >> Sent: Tuesday, March 9, 2021 1:20 PM
> >> To: Hbase-User <[email protected]>
> >> Subject: Re: HBASE WALs
> >>
> >> EXTERNAL
> >>
> >> >
> >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> >> > procedure.
> >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to 
> >> > be the problem.
> >> >
> >> Per your list procedures output attached, it seems the procs states 
> >> are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with 
> >> PID 73827, which is the UnassignProcedure for this region. Problem 
> >> is that there are already 5 APs for the same region, which may be 
> >> causing some deadlocks. If this cluster was on a hbck2 supported 
> >> version, you could get rid of this state using bypass command on 
> >> all these proc ids, then manually get the table/regions states 
> >> consistent again using setRegionState/setTableState/assigns/unassigns 
> >> methods.
> >>
> >> Without tooling, the only option I can think of is to stop cluster, 
> >> clean out masterprocwals, restart cluster, then use hbase shell to 
> >> enable/disable/assign regions. You may also need to manually update 
> >> table/region states in meta table. Of course, you can automate 
> >> these manual steps into your own tooling, but it may be a better 
> >> strategy in the long term to upgrade to a more stable version that 
> >> also benefits from more tooling supported by the community.
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <[email protected]> wrote:
> >>
> >> > Hi, Wellington,
> >> >
> >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> >> >
> >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> >> > procedure.
> >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to 
> >> > be the problem.
> >> >
> >> > I am still mystified about the HBCK2-tools. I have attached a 
> >> > previous thread that you commented on at the time.
> >> >
> >> > I did build the tools for our HBASE 2.1.0...or rather, I built them 
> >> > on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then successfully ran 
> >> > them on Ubuntu 16.04 with a slightly different Java (Oracle Java 8, 
> >> > 1.8.0_181).  I used them to help fix a similar problem with an 
> >> > offline table and RITs. Both HBASE versions are the same.
> >> >
> >> > I attach a 'sheet' with the current procs/locks.
> >> >
> >> > -----Original Message-----
> >> > From: Marc Hoppins <[email protected]>
> >> > Sent: Wednesday, March 3, 2021 9:51 AM
> >> > To: [email protected]
> >> > Cc: Martin Oravec <[email protected]>
> >> > Subject: RE: HBASE WALs
> >> >
> >> > EXTERNAL
> >> >
> >> > Thanks, Wellington,
> >> >
> >> > I have already built a hbck1-tools for 2.1.0 using the method 
> >> > described in other topics. All the HBASE and JDK here are the same 
> >> > version so if it worked fixing one cluster's HBASE then it should 
> >> > work for other installs.
> >> >
> >> > Fiddling with masterprocWALs will require a complete shutdown of 
> >> > hbase operations to prevent incoming reads/writes on other tables 
> >> > and I am not sure how disruptive that will be other than 
> >> > "probably a lot".
> >> >
> >> > -----Original Message-----
> >> > From: Wellington Chevreuil <[email protected]>
> >> > Sent: Tuesday, March 2, 2021 10:57 AM
> >> > To: Hbase-User <[email protected]>
> >> > Subject: Re: HBASE WALs
> >> >
> >> > EXTERNAL
> >> >
> >> > Sorry, missed your previous email. I was hoping you were not on a 
> >> > non-stable version, so that you would benefit from hbck2 tool support.
> >> > Unfortunately, 2.1.0 is among the early releases that don't work 
> >> > with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> >> >
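The version floor quoted above (at least 2.0.3 on the 2.0 line, 2.1.1 on the 2.1 line, or 2.2.0 and later) can be expressed as a small check. This is a minimal Python sketch based only on the numbers stated in this message; the function name is made up, and a vendor build with backported patches may behave differently from what its version string suggests.

```python
import re

# Minimum patch release per minor line, as quoted in the message above.
MIN_PATCH = {(2, 0): 3, (2, 1): 1}


def supports_hbck2(version):
    """Return True if the version meets the quoted hbck2 floor."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(g) for g in match.groups())
    floor = MIN_PATCH.get((major, minor))
    if floor is not None:
        return patch >= floor
    # Any release line from 2.2 onwards qualifies; older lines do not.
    return (major, minor) >= (2, 2)


for v in ("2.0.3", "2.1.0", "2.1.1", "2.2.0"):
    print(v, supports_hbck2(v))
```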
> >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
> >> > > seems mostly unhappy with one region in particular, and is 
> >> > > reporting on that.
> >> > >
> >> > Are the other regions for the table properly closed, and this is 
> >> > the only one stuck? If you do a list_procedures, are you able to 
> >> > identify an 'unassign' procedure still running for this table? Or 
> >> > if you grep master logs for this region, do you see any messages 
> >> > suggesting there's still ongoing attempts to bring the region 
> >> > offline? If there's apparently no procedure/no ongoing attempts 
> >> > to offline the region, you might try to manually update its state 
> >> > in meta table, then flip masters (assuming you have master HA), 
> >> > so that the new active loads an up to date state from meta table.
> >> >
> >> > Otherwise, if there's still a rogue procedure trying to offline 
> >> > the region, unfortunately, due to the lack of hbck support, you 
> >> > would most likely need a more disruptive intervention similar to 
> >> > what you had described in your first email, but instead of normal 
> >> > wal folder, master proc wals is what you really would need to 
> >> > clean out here, as that is where procedures state is persisted, 
> >> > and you wouldn't want the rogue procedure to be resumed.
> >> >
> >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <[email protected]> wrote:
> >> >
> >> > > If you know of anything that will help I would appreciate it.
> >> > >
> >> > > If you need any log output let me know.
> >> > >
> >> > > Thanks
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Wellington Chevreuil <[email protected]>
> >> > > Sent: Thursday, February 25, 2021 4:08 PM
> >> > > To: Hbase-User <[email protected]>
> >> > > Subject: Re: HBASE WALs
> >> > >
> >> > > EXTERNAL
> >> > >
> >> > > >
> >> > > > Do WAL files contain information for multiple regions per WAL 
> >> > > > or is one WAL associated with one region?
> >> > > >
> >> > > Edits for multiple regions would be present in a single WAL file.
> >> > > That's why, upon an RS crash and WAL processing, there's a WAL 
> >> > > split phase.
> >> > >
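The reason a split phase is needed can be shown with a toy model: one log interleaves edits from several regions, so recovery first has to regroup them per region. This Python sketch is only an illustration under that assumption and does not reflect HBase's actual WAL format or splitting code. Apart from the encoded region name taken from the thread, the region labels and edits are hypothetical.

```python
from collections import defaultdict


def split_wal(entries):
    """Group (region, sequence id, edit) entries by region, ordered by sequence id."""
    per_region = defaultdict(list)
    for region, seq_id, edit in entries:
        per_region[region].append((seq_id, edit))
    # Each region replays its own edits in sequence-id order.
    for edits in per_region.values():
        edits.sort(key=lambda item: item[0])
    return dict(per_region)


# A single WAL interleaving edits from two regions of one region server.
wal = [
    ("f25fe93e24b34cb2f7fffddee1d89eec", 1165, "put row-a"),
    ("region-b", 40, "put row-x"),      # hypothetical region
    ("f25fe93e24b34cb2f7fffddee1d89eec", 1166, "delete row-a"),
    ("region-b", 41, "put row-y"),      # hypothetical region
]

for region, edits in split_wal(wal).items():
    print(region, edits)
```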
> >> > > > I am trying to find a way to clear a RIT for a disabled table. 
> >> > > > A similar
> >> > > > problem (but on a test cluster) involved me clearing znode 
> >> > > > info, deleting HDFS data for the table and deleting 
> >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> >> > > >
> >> > > Which hbase version are you on?
> >> > >
> >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <[email protected]> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Do WAL files contain information for multiple regions per WAL 
> >> > > > or is one WAL associated with one region?
> >> > > >
> >> > > > I am trying to find a way to clear a RIT for a disabled table.
> >> > > > A similar problem (but on a test cluster) involved me 
> >> > > > clearing znode info, deleting HDFS data for the table and 
> >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE service.
> >> > > >
> >> > > > Table cannot be enabled.
> >> > > >
> >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> >> > > > system seems mostly unhappy with one region in particular, 
> >> > > > and is reporting on that.
> >> > > >
> >> > > > There are many tables that are very active so I don't think 
> >> > > > it is possible to stop the entire service without a lot of 
> >> > > > forewarning to users.
> >> > > >
> >> > > > Thanks in advance.
> >> > > >
> >> > >
> >> >
> >>
> >
>
