>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the
> problem.
>
Per the list_procedures output you attached, the procedure states all look
inconsistent. There's a WAITING_TIMEOUT subprocedure of 73587 with PID 73827,
which is the UnassignProcedure for this region. The problem is that there are
already 5 AssignProcedures (APs) for the same region, which may be causing
deadlocks. If this cluster were on an hbck2-supported version, you could clear
this state by running the bypass command on all these procedure IDs, then
manually make the table/region states consistent again using the
setRegionState/setTableState/assigns/unassigns commands.
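
To work out which procedure IDs are involved, it can help to tabulate the
list_procedures output by hand and walk the parent/child tree. The sketch
below is only an illustration in Python: the record layout and field names
are assumptions, not the real list_procedures format, and PIDs 90001/90002
are made up (only 73587, 73827 and the region name come from this thread).

```python
from collections import defaultdict, deque

REGION = "f25fe93e24b34cb2f7fffddee1d89eec"

# Hypothetical, hand-transcribed records. Field names are assumptions;
# PIDs 90001 and 90002 are invented for illustration.
PROCS = [
    {"pid": 73587, "parent": None, "type": "DisableTableProcedure", "region": None},
    {"pid": 73827, "parent": 73587, "type": "UnassignProcedure", "region": REGION},
    {"pid": 90001, "parent": None, "type": "AssignProcedure", "region": REGION},
    {"pid": 90002, "parent": None, "type": "AssignProcedure", "region": REGION},
]


def subtree(procs, root_pid):
    """Return root_pid followed by all of its descendants, breadth first."""
    children = defaultdict(list)
    for p in procs:
        children[p["parent"]].append(p["pid"])
    order, queue = [], deque([root_pid])
    while queue:
        pid = queue.popleft()
        order.append(pid)
        queue.extend(children[pid])
    return order


def contended_regions(procs):
    """Map each region to its assign/unassign PIDs when there is more than one."""
    by_region = defaultdict(list)
    for p in procs:
        if p["region"] and p["type"] in ("AssignProcedure", "UnassignProcedure"):
            by_region[p["region"]].append(p["pid"])
    return {r: pids for r, pids in by_region.items() if len(pids) > 1}
```

Calling subtree(PROCS, 73587) lists the stuck DISABLE procedure and its
children, and contended_regions(PROCS) shows every region with competing
assign/unassign procedures. Together they give the set of PIDs to look at.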

Without tooling, the only option I can think of is to stop the cluster, clean
out the MasterProcWALs directory, restart the cluster, then use the hbase shell
to enable/disable the table and assign its regions. You may also need to
manually update table/region states in the meta table. Of course, you can
automate these manual steps in your own tooling, but in the long term it may
be a better strategy to upgrade to a more stable version that also benefits
from more community-supported tooling.
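
For the manual meta cleanup, a rough way to work out which region rows
disagree with the table state is sketched below. It assumes the simplified
rule that regions of a DISABLED table should be CLOSED and regions of an
ENABLED table should be OPEN, ignores split/merge parents, and takes a
hand-built dict of region states as input. The second region name is a
placeholder. It only computes the list of fixes and touches nothing in the
cluster.

```python
# Assumed (simplified) mapping from table state to expected region state.
EXPECTED_REGION_STATE = {"DISABLED": "CLOSED", "ENABLED": "OPEN"}


def state_fixes(table_state, region_states):
    """Return {region: target_state} for regions that disagree with the table state."""
    expected = EXPECTED_REGION_STATE[table_state]
    return {
        region: expected
        for region, state in sorted(region_states.items())
        if state != expected
    }


# Hypothetical snapshot of region states read from meta by hand.
snapshot = {
    "f25fe93e24b34cb2f7fffddee1d89eec": "CLOSING",
    "placeholder-region-b": "CLOSED",
}

for region, target in state_fixes("DISABLED", snapshot).items():
    print(f"{region}: set state to {target}")
```

Only the regions returned by state_fixes would need a manual state update;
regions already in the expected state are left alone.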





On Mon, 8 Mar 2021 at 07:50, Marc Hoppins <marc.hopp...@eset.sk>
wrote:

> Hi, Wellington,
>
> I was on 'vacation' (no road trip or overseas anything) for a week.
>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the
> problem.
>
> I am still mystified about the HBCK2-tools. I have attached a previous
> thread that you commented on at the time.
>
> I did build a tool for our HBASE 2.1.0...or rather, I built it on Ubuntu
> 20.04 with openJDK8 (1.8.0_212), then successfully ran it on Ubuntu 16.04
> with a slightly different java (Oracle Java 8, 1.8.0_181).  I used it to
> help fix a similar problem with an offline table and RITs.  Both HBASE
> versions are the same.
>
> I attach a 'sheet' with the current procs/locks.
>
> -----Original Message-----
> From: Marc Hoppins <marc.hopp...@eset.sk>
> Sent: Wednesday, March 3, 2021 9:51 AM
> To: user@hbase.apache.org
> Cc: Martin Oravec <martin.ora...@eset.sk>
> Subject: RE: HBASE WALs
>
> EXTERNAL
>
> Thanks, Wellington,
>
> I have already built an hbck1-tools for 2.1.0 using the method described in
> other topics. All the HBASE and JDK installs here are the same version, so if
> it worked for fixing HBASE on one cluster then it should work for the others.
>
> Fiddling with masterprocWALs will require complete shutdown of hbase
> operations to prevent incoming reads/writes on other tables and I am not
> sure how disruptive that will be other than "probably a lot".
>
> -----Original Message-----
> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> Sent: Tuesday, March 2, 2021 10:57 AM
> To: Hbase-User <user@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> Sorry, missed your previous email. I was hoping you were not on a
> non-stable version, so that you would benefit from hbck2 tool support.
> Unfortunately, 2.1.0 is among the early releases that don't work with this
> tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>
> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
> > mostly unhappy with one region in particular, and is reporting on that.
> >
> Are the other regions for the table properly closed, and this is the only
> one stuck? If you do a list_procedures, are you able to identify an
> 'unassign' procedure still running for this table? Or if you grep master
> logs for this region, do you see any messages suggesting there are still
> ongoing attempts to bring the region offline? If there's apparently no
> procedure and no ongoing attempt to offline the region, you might try to
> manually update its state in the meta table, then flip masters (assuming you
> have master HA), so that the new active master loads an up-to-date state
> from the meta table.
>
> Otherwise, if there's still a rogue procedure trying to offline the
> region, unfortunately, due to the lack of hbck support, you would most
> likely need a more disruptive intervention similar to what you
> described in your first email. Instead of the normal WAL folder, though,
> MasterProcWALs is what you would really need to clean out here, as that is
> where procedure state is persisted, and you wouldn't want the rogue
> procedure to be resumed.
>
> On Mon, 1 Mar 2021 at 10:22, Marc Hoppins <marc.hopp...@eset.sk>
> wrote:
>
> > If you know of anything that will help I would appreciate it.
> >
> > If you need any log output let me know.
> >
> > Thanks
> >
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> > Sent: Thursday, February 25, 2021 4:08 PM
> > To: Hbase-User <user@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > Do WAL files contain information for multiple regions per WAL or is
> > > one WAL associated with one region?
> > >
> > Edits for multiple regions can be present in a single WAL file. That's
> > why, when an RS crashes and its WALs are processed, there's a WAL split
> > phase.
> >
> > > I am trying to find a way to clear a RIT for a disabled table. A
> > > similar problem (but on a test cluster) involved me clearing znode info,
> > > deleting HDFS data for the table and deleting WALs/MasterProcWAL
> > > files, finally restarting HBASE service.
> > >
> > Which hbase version are you on?
> >
> > On Thu, 25 Feb 2021 at 11:51, Marc Hoppins <marc.hopp...@eset.sk>
> > wrote:
> >
> > > Hi all,
> > >
> > > Do WAL files contain information for multiple regions per WAL or is
> > > one WAL associated with one region?
> > >
> > > I am trying to find a way to clear a RIT for a disabled table. A
> > > similar problem (but on a test cluster) involved me clearing znode
> > > info, deleting HDFS data for the table and deleting
> > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >
> > > Table cannot be enabled.
> > >
> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > > seems mostly unhappy with one region in particular, and is reporting
> > > > on that.
> > >
> > > There are many tables that are very active so I don't think it is
> > > possible to stop the entire service without a lot of forewarning to
> > > > users.
> > >
> > > Thanks in advance.
> > >
> >
>
