Re: HBASE WALs

Wellington Chevreuil Tue, 23 Mar 2021 10:16:40 -0700

>
> I am still not certain what will happen.  masterProcWALs contain info for
> all (running) tables, yes?
>
masterProcWALs only contain info for running procedures, not user table
data. User table data go on "normal" WALs, not "masterProcWALs".


> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of master
> WALs are now created. This means there is a bunch of pending operations,
> yes?  Is it going to make some other things inconsistent?

Table disabling involves the unassignment of all these tables' regions. Each
of these "unassign" operations comprises a set of sequential phases. These
internal operations are called "procedures". Information about the progress
of each operation as it moves through its phases is stored in the
masterProcWALs files. That's why triggering the "disable" command will
create some data under masterProcWALs. If all the disable commands finished
successfully, and all your procedures are finished (apart from that rogue
one, which has existed for a while already), you would be good to clean out
masterProcWALs.
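
As a rough illustration of that check (a sketch only, not an HBase API):
assuming you copy the pid, state and parent pid of each entry from your
list_procedures output into a list by hand, something like the following
would tell you whether anything other than the rogue procedure and its
children is still unfinished. The dict layout and the set of states treated
as finished are my assumptions.

```python
# Sketch only: the dict layout and FINISHED_STATES are assumptions for
# illustration, not the output format of any HBase tool.
FINISHED_STATES = {"SUCCESS", "ROLLEDBACK"}


def blocking_procedures(procs, rogue_pid):
    """Return pids that are unfinished and unrelated to the rogue procedure."""
    blocking = []
    for proc in procs:
        related_to_rogue = rogue_pid in (proc["pid"], proc.get("parent"))
        if proc["state"] not in FINISHED_STATES and not related_to_rogue:
            blocking.append(proc["pid"])
    return blocking


# Pids 73587 and 73827 come from this thread; the states and the third
# entry are made up for the example.
procs = [
    {"pid": 73587, "state": "WAITING", "parent": None},
    {"pid": 73827, "state": "WAITING_TIMEOUT", "parent": 73587},
    {"pid": 80001, "state": "SUCCESS", "parent": None},
]

# An empty list means nothing else is pending.
print(blocking_procedures(procs, rogue_pid=73587))
```

If the list comes back non-empty, I would wait for those procedures to
finish before touching masterProcWALs.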

> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same a locked table state due
> to pending disable and stuck region.
>
That's because of the rogue procedure. When you restarted master, it went
through masterProcWALs and resumed the rogue procedure from the unfinished
state it was in when you restarted hbase. If you had removed masterProcWALs
prior to restart, the rogue procedure would now be gone.
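
To make the restart behaviour concrete, here is a toy model (not HBase
code). The "store" list stands in for the files under masterProcWALs, and
the function and state names are made up for the illustration.

```python
# Toy model only: this is not HBase code. The "store" list stands in for the
# files under masterProcWALs; the state names are illustrative.
def restart_master(proc_store):
    """Return the pids a restarted master would resume from the store."""
    return [pid for pid, state in proc_store if state != "SUCCESS"]


store = [(73587, "WAITING"), (80001, "SUCCESS")]

# Restart with the store intact: the unfinished procedure is picked up again.
resumed_with_store = restart_master(store)

# Restart after the store was cleaned out: nothing is left to resume.
resumed_after_cleanup = restart_master([])

print(resumed_with_store, resumed_after_cleanup)
```

Changing the table state in meta does not touch the store, which is why the
rogue procedure survived your restart.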

> We may have the go-ahead to remove this table - I assume we cannot clone it
> while it is in a state of (DISABLED) flux but, once again, messing with
> master WALs has me on edge.

From what I understand, you already have the tables disabled, and no
unfinished procs apart from the rogue one, so just clean out masterProcWALs
and restart master.
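
If it helps, the clean-out can be written down as a small dry-run plan.
This is a sketch under assumptions: it only builds the commands as strings
and runs nothing, it assumes hbase.rootdir is /hbase, and it moves the
files to a backup directory of your choosing instead of deleting them (my
own precaution, so they can be put back). Please check the paths against
your own configuration first.

```python
# Sketch only: builds the commands as strings and does not run anything.
# HBASE_ROOT and the backup location are assumptions; check hbase.rootdir
# in your own configuration before using any of this.
HBASE_ROOT = "/hbase"


def cleanup_plan(backup_dir):
    """Return the HDFS commands to move master proc WALs aside, in order."""
    proc_wals = f"{HBASE_ROOT}/MasterProcWALs"
    return [
        f"hdfs dfs -ls {proc_wals}",
        f"hdfs dfs -mkdir -p {backup_dir}",
        f"hdfs dfs -mv '{proc_wals}/*' {backup_dir}",
    ]


# Run these only while the master is stopped, then start the master again.
for command in cleanup_plan("/tmp/MasterProcWALs.bak"):
    print(command)
```

The glob is quoted so that the local shell leaves it alone and the hdfs
client expands it.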

On Tue, 23 Mar 2021 at 11:13, Marc Hoppins <[email protected]>
wrote:

> I am still not certain what will happen.  masterProcWALs contain info for
> all (running) tables, yes?
>
> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of master
> WALs are now created. This means there is a bunch of pending operations,
> yes?  Is it going to make some other things inconsistent?
>
> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same a locked table state due
> to pending disable and stuck region.
>
> We may have the go-ahead to remove this table - I assume we cannot clone
> it while it is in a state of (DISABLED) flux but, once again, messing with
> master WALs has me on edge.
>
>
> -----Original Message-----
> From: Wellington Chevreuil <[email protected]>
> Sent: Tuesday, March 16, 2021 4:50 PM
> To: Hbase-User <[email protected]>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > To be clear, if the other tables are stopped, I assume all pending and
> > current operations will finish. How long will it take to write all
> > data - if indeed the data does get permanently written - so that we
> > can safely remove WALs?
> >
> If by "tables stopped" you mean your tables are disabled, then yeah, all
> related data would already have been flushed into hfiles and wouldn't be on
> your wals. But please be aware that what you really need here to get rid of
> the rogue proc is to remove master proc wals, not normal wals.
>
> On Tue, 16 Mar 2021 at 07:12, Marc Hoppins <[email protected]>
> wrote:
>
> > Overall, I am mystified as to how this could happen.  If Hadoop has a
> > replication factor (I believe we use the default) of 3 and we have two
> > datacenters with masters and workers in both, how can a network outage
> > affect Hadoop operation? Surely it should have used available
> > resources to continue operations...or have I misinterpreted entirely?
> >
> > -----Original Message-----
> > From: Stack <[email protected]>
> > Sent: Tuesday, March 16, 2021 7:16 AM
> > To: Hbase-User <[email protected]>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <[email protected]>
> > wrote:
> >
> > > Hi, all,
> > >
> > > For our stuck region, this exists in meta.  Could we alter the state
> > > to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > >
> > You could but IIRC, in that version of HBase, you may need to restart the
> > Master after the change (changing hbase:meta does not update the
> > Master's in-memory state). On restart, Master will read hbase:meta to
> > discover Region state.
> >
> > S
> >
> >
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
> > > f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > > 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > > STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:seqnumDuringOpen, timestamp=1611787189839,
> > > value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:server, timestamp=1611787189839, value=
> > > dr1-hbase18.jumbo.hq.eset.com:16020
> > >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:serverstartcode, timestamp=1611787189839,
> > > value=1611785264032
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:sn, timestamp=1613580024017, value=
> > > ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > > column=info:state, timestamp=1613580024017, value=OPENING
> > >
> > > -----Original Message-----
> > > From: Wellington Chevreuil <[email protected]>
> > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > To: Hbase-User <[email protected]>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > >
> > > > Sorry if I seem stupid but this is still all new to me.
> > > >
> > > Forgot to mention, there's no stupid questions here. Don't be shy
> > > and keep'em coming.
> > >
> > > On Wed, 10 Mar 2021 at 09:48, Wellington Chevreuil <
> > > [email protected]> wrote:
> > >
> > > >> However, how would that help anyway?  If we cannot fix this at
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > > The upgrade on its own wouldn't fix existing inconsistencies, but
> > > > you would now have support for additional tooling
> > > > (hbase-operators-tool) to help you with this.
> > > >
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > > >> mean that they were successfully and fully moved from hbase25 to
> > > >> each server mentioned in that procedure?  Or does it just mean
> > > >> that the region was successfully unassigned from hbase25 but the
> > > >> data still resides on hbase25?  I see locality 0.
> > > >>
> > > > IIRC, those were all UnassignProcedures, so it means the
> > > > unassignment of the related region has completed and the region
> > > > for that particular procedure went offline.
> > > >
> > > >> If we change the table state in meta to 'ENABLED', could this
> > > >> kickstart all these things or will it just lead to further problems?
> > > >
> > > > Each master works with its own in-memory cache of meta, so manually
> > > > updating meta will just make the master's cache inconsistent with
> > > > it. You would need to restart the masters to get the cache reloaded
> > > > from meta. The main problem is that you still have the rogue
> > > > procedures, which you can't get rid of without stopping the
> > > > cluster. One alternative to a full cluster outage would be to
> > > > identify all RSes running the rogue procs (you can find that from
> > > > active master logs), then stop only those and master, clean
> > > > masterprocwals, then start them again.
> > > >
> > > >
> > > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > > >> does it mean that the table is waiting to be disabled?  HBASE
> > > >> master declares that table is NOT enabled.
> > > >>
> > > > The table state may have been already updated to disabled, most of
> > > > its regions may already be offline, but the 73587
> > > > DisableTableProcedure cannot be considered "done" until all its
> > > > sub procedures are indeed completed.
> > > >
> > > >
> > > > On Tue, 9 Mar 2021 at 13:40, Marc Hoppins
> > > > <[email protected]>
> > > > wrote:
> > > >
> > > >> Thanks for that.
> > > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH)
> > > >> 6.3.1 and do not have a viable business use to pay the
> > > >> extortionate amount of money required to upgrade, which would
> > > >> give these clusters access to newer versions.
> > > >>
> > > >> However, how would that help anyway?  If we cannot fix this at
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > > >> mean that they were successfully and fully moved from hbase25 to
> > > >> each server mentioned in that procedure?  Or does it just mean
> > > >> that the region was successfully unassigned from hbase25 but the
> > > >> data still resides on hbase25?  I see locality 0.
> > > >>
> > > >> If we change the table state in meta to 'ENABLED', could this
> > > >> kickstart all these things or will it just lead to further problems?
> > > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > > >> does it mean that the table is waiting to be disabled?  HBASE
> > > >> master declares that table is NOT enabled.
> > > >>
> > > >> Sorry if I seem stupid but this is still all new to me.
> > > >>
> > > >> I appreciate the help.
> > > >>
> > > >> -----Original Message-----
> > > >> From: Wellington Chevreuil <[email protected]>
> > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > >> To: Hbase-User <[email protected]>
> > > >> Subject: Re: HBASE WALs
> > > >>
> > > >> EXTERNAL
> > > >>
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > > >> > procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems
> > > >> > to be the problem.
> > > >> >
> > > >> Per your list procedures output attached, it seems the procs
> > > >> states are all inconsistent. There's a WAIT_TIMEOUT subproc of
> > > >> 73587 with PID 73827, which is the UnassignProcedure for this
> > > >> region. Problem is that there are already 5 APs for the same
> > > >> region, which may be causing some deadlocks. If this cluster was
> > > >> on a hbck2 supported version, you could get rid of this state
> > > >> using bypass command on all these proc ids, then manually get the
> > > >> table/regions states consistent again using
> > > >> setRegionState/setTableState/assigns/unassigns methods.
> > > >>
> > > >> Without tooling, the only option I can think of is to stop
> > > >> cluster, clean out masterprocwals, restart cluster, then use
> > > >> hbase shell to enable/disable/assign regions. You may also need
> > > >> to manually update table/region states in meta table. Of course,
> > > >> you can automate these manual steps into your own tooling, but
> > > >> it may be a better strategy in the long term to upgrade to a more
> > > >> stable version that also benefits from more tooling supported by
> > > >> the community.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, 8 Mar 2021 at 07:50, Marc Hoppins
> > > >> <[email protected]>
> > > >> wrote:
> > > >>
> > > >> > Hi, Wellington,
> > > >> >
> > > >> > I was on 'vacation' (no road trip or overseas anything) for a
> > > >> > week.
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > > >> > procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems
> > > >> > to be the problem.
> > > >> >
> > > >> > I am still mystified about the HBCK2-tools. I have attached a
> > > >> > previous thread that you commented on at the time.
> > > >> >
> > > >> > I did build a tool for our HBASE 2.1.0...or rather, I built it
> > > >> > on Ubuntu 20.04 with openJDK8 (1.8.0_212), then successfully ran
> > > >> > it on Ubuntu 16.04 with a slightly different java (Oracle Java 8,
> > > >> > 1.8.0_181). I used it to help fix a similar problem with an
> > > >> > offline table and RITs. Both HBASE versions are the same.
> > > >> >
> > > >> > I attach a 'sheet' with the current procs/locks.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Marc Hoppins <[email protected]>
> > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > >> > To: [email protected]
> > > >> > Cc: Martin Oravec <[email protected]>
> > > >> > Subject: RE: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Thanks, Wellington,
> > > >> >
> > > >> > I have already built hbck1-tools for 2.1.0 using the method
> > > >> > described in other topics. All the HBASE and JDK here is the
> > > >> > same version so if it worked fixing one cluster HBASE then it
> > > >> > should work for other installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require complete shutdown of
> > > >> > hbase operations to prevent incoming reads/writes on other
> > > >> > tables and I am not sure how disruptive that will be other than
> > > >> > "probably a lot".
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Wellington Chevreuil <[email protected]>
> > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > >> > To: Hbase-User <[email protected]>
> > > >> > Subject: Re: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not on
> > > >> > a non-stable version, so that you would benefit from hbck2 tool
> > > >> > support.
> > > >> > Unfortunately, 2.1.0 is among the early releases that don't
> > > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> > > >> >
> > > >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > > >> > > seems mostly unhappy with one region in particular, and is
> > > >> > > reporting on that.
> > > >> > >
> > > >> > Are the other regions for the table properly closed, and this
> > > >> > is the only one stuck? If you do a list_procedures, are you
> > > >> > able to identify an 'unassign' procedure still running for this
> > > >> > table? Or if you grep master logs for this region, do you see
> > > >> > any messages suggesting there's still ongoing attempts to bring
> > > >> > the region offline? If there's apparently no procedure/no
> > > >> > ongoing attempts to offline the region, you might try to
> > > >> > manually update its state in meta table, then flip masters
> > > >> > (assuming you have master HA), so that the new active loads an
> > > >> > up to date state from meta table.
> > > >> >
> > > >> > Otherwise, if there's still a rogue procedure trying to offline
> > > >> > the region, unfortunately, due to the lack of hbck support, you
> > > >> > would most likely need a more disruptive intervention similar
> > > >> > to what you had described in your first email, but instead of
> > > >> > normal wal folder, master proc wals is what you really would
> > > >> > need to clean out here, as that is where procedures state is
> > > >> > persisted, and you wouldn't want the rogue procedure to be
> > > >> > resumed.
> > > >> >
> > > >> > On Mon, 1 Mar 2021 at 10:22, Marc Hoppins
> > > >> > <[email protected]>
> > > >> > wrote:
> > > >> >
> > > >> > > If you know of anything that will help I would appreciate it.
> > > >> > >
> > > >> > > If you need any log output let me know.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Wellington Chevreuil <[email protected]>
> > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > >> > > To: Hbase-User <[email protected]>
> > > >> > > Subject: Re: HBASE WALs
> > > >> > >
> > > >> > > EXTERNAL
> > > >> > >
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > Multiple regions edits would be present in a single wal file.
> > > >> > > That's why upon a RS crash and wal processing, there's a wal
> > > >> > > split phase.
> > > >> > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar
> > > >> > > > problem (but on a test cluster) involved me clearing znode
> > > >> > > > info, deleting HDFS data for the table and deleting
> > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > >> > > >
> > > >> > > Which hbase version are you on?
> > > >> > >
> > > >> > > On Thu, 25 Feb 2021 at 11:51, Marc Hoppins
> > > >> > > <[email protected]>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me
> > > >> > > > clearing znode info, deleting HDFS data for the table and
> > > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE
> > > >> > > > service.
> > > >> > > >
> > > >> > > > Table cannot be enabled.
> > > >> > > >
> > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > > >> > > > system seems mostly unhappy with one region in particular,
> > > >> > > > and is reporting on that.
> > > >> > > >
> > > >> > > > There are many tables that are very active so I don't think
> > > >> > > > it is possible to stop the entire service without a lot of
> > > >> > > > forewarning to users.
> > > >> > > >
> > > >> > > > Thanks in advance.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
