> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the
> problem.

Per the list_procedures output you attached, the procedure states all look inconsistent. There's a WAITING_TIMEOUT subprocedure of 73587 with PID 73827, which is the UnassignProcedure for this region. The problem is that there are already 5 AssignProcedures (APs) for the same region, which may be causing deadlocks. If this cluster were on an hbck2-supported version, you could get rid of this state by running the bypass command on all these proc ids, then manually make the table/region states consistent again using the setRegionState/setTableState/assigns/unassigns methods.

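As a rough illustration of the check described above (one parent procedure, its child, and several procedures piling up on the same region), here is a minimal Python sketch. It assumes the list_procedures output has been copied by hand into simple records. The field names (`pid`, `parent`, `type`, `region`) and the PIDs 90001 to 90005 are made up for illustration; only PIDs 73587 and 73827 and the region name come from this thread.

```python
from collections import defaultdict

# Region name taken from this thread.
REGION = "f25fe93e24b34cb2f7fffddee1d89eec"

# Hypothetical, hand-copied procedure records. The record layout and the
# 9000x PIDs are assumptions; they are not real list_procedures output.
procedures = [
    {"pid": 73587, "parent": None, "type": "DisableTableProcedure", "region": None},
    {"pid": 73827, "parent": 73587, "type": "UnassignProcedure", "region": REGION},
    {"pid": 90001, "parent": None, "type": "AssignProcedure", "region": REGION},
    {"pid": 90002, "parent": None, "type": "AssignProcedure", "region": REGION},
    {"pid": 90003, "parent": None, "type": "AssignProcedure", "region": REGION},
    {"pid": 90004, "parent": None, "type": "AssignProcedure", "region": REGION},
    {"pid": 90005, "parent": None, "type": "AssignProcedure", "region": REGION},
]


def children_of(procs, parent_pid):
    """Return the PIDs of procedures whose parent is parent_pid."""
    return [p["pid"] for p in procs if p["parent"] == parent_pid]


def conflicting_regions(procs):
    """Return {region: sorted PIDs} for regions targeted by more than one procedure."""
    by_region = defaultdict(list)
    for proc in procs:
        if proc["region"] is not None:
            by_region[proc["region"]].append(proc["pid"])
    return {r: sorted(pids) for r, pids in by_region.items() if len(pids) > 1}


if __name__ == "__main__":
    print("children of 73587:", children_of(procedures, 73587))
    for region, pids in conflicting_regions(procedures).items():
        print("region", region, "is targeted by procedures", pids)
```

Any region that shows up with more than one PID here is a candidate for the kind of conflict described above; the PIDs listed would be the ones to look at first.
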
Without tooling, the only option I can think of is to stop the cluster, clean out the MasterProcWALs, restart the cluster, then use the hbase shell to enable/disable/assign regions. You may also need to manually update table/region states in the meta table. Of course, you can automate these manual steps into your own tooling, but in the long term it may be a better strategy to upgrade to a more stable version that also benefits from more community-supported tooling.

On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <marc.hopp...@eset.sk> wrote:

> Hi, Wellington,
>
> I was on 'vacation' (no road trip or overseas anything) for a week.
>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be the
> problem.
>
> I am still mystified about the HBCK2-tools. I have attached a previous
> thread that you commented on at the time.
>
> I did build the tools for our HBASE 2.1.0...or rather, I built them on Ubuntu
> 20.04 with openJDK8 (1.8.0_212), then successfully ran them on Ubuntu 16.04
> with a slightly different java (Oracle Java 8, 1.8.0_181). I used them to
> help fix a similar problem with an offline table and RITs. Both HBASE
> versions are the same.
>
> I attach a 'sheet' with the current procs/locks.
>
> -----Original Message-----
> From: Marc Hoppins <marc.hopp...@eset.sk>
> Sent: Wednesday, March 3, 2021 9:51 AM
> To: user@hbase.apache.org
> Cc: Martin Oravec <martin.ora...@eset.sk>
> Subject: RE: HBASE WALs
>
> EXTERNAL
>
> Thanks, Wellington,
>
> I have already built the hbck2-tools for 2.1.0 using the method described in
> other topics. All the HBASE and JDK here are the same version, so if it
> worked for fixing HBASE on one cluster then it should work for other installs.
>
> Fiddling with MasterProcWALs will require a complete shutdown of hbase
> operations to prevent incoming reads/writes on other tables, and I am not
> sure how disruptive that will be other than "probably a lot".
>
> -----Original Message-----
> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> Sent: Tuesday, March 2, 2021 10:57 AM
> To: Hbase-User <user@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> Sorry, missed your previous email. I was hoping you were not on a
> non-stable version, so that you would benefit from hbck2 tool support.
> Unfortunately, 2.1.0 is among the early releases that don't work with this
> tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>
> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
> > mostly unhappy with one region in particular, and is reporting on that.
> >
> Are the other regions for the table properly closed, and this is the only
> one stuck? If you do a list_procedures, are you able to identify an
> 'unassign' procedure still running for this table? Or if you grep the master
> logs for this region, do you see any messages suggesting there are still
> ongoing attempts to bring the region offline? If there's apparently no
> procedure and no ongoing attempt to offline the region, you might try to
> manually update its state in the meta table, then flip masters (assuming you
> have master HA), so that the new active master loads an up-to-date state from
> the meta table.
>
> Otherwise, if there's still a rogue procedure trying to offline the
> region, unfortunately, due to the lack of hbck support, you would most
> likely need a more disruptive intervention similar to what you
> described in your first email. Instead of the normal WAL folder, though, the
> master proc WALs are what you would really need to clean out here, as that is
> where procedure state is persisted, and you wouldn't want the rogue procedure
> to be resumed.
>
> On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <marc.hopp...@eset.sk>
> wrote:
>
> > If you know of anything that will help I would appreciate it.
> >
> > If you need any log output let me know.
> >
> > Thanks
> >
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> > Sent: Thursday, February 25, 2021 4:08 PM
> > To: Hbase-User <user@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > > Do WAL files contain information for multiple regions per WAL or is
> > > one WAL associated with one region?
> > >
> > Edits for multiple regions would be present in a single WAL file. That's
> > why, upon an RS crash and WAL processing, there's a WAL split phase.
> >
> > > I am trying to find a way to clear a RIT for a disabled table. A similar
> > > problem (but on a test cluster) involved me clearing znode info,
> > > deleting HDFS data for the table and deleting WALs/MasterProcWAL
> > > files, finally restarting HBASE service.
> > >
> > Which hbase version are you on?
> >
> > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins
> > <marc.hopp...@eset.sk> wrote:
> >
> > > Hi all,
> > >
> > > Do WAL files contain information for multiple regions per WAL or is
> > > one WAL associated with one region?
> > >
> > > I am trying to find a way to clear a RIT for a disabled table. A
> > > similar problem (but on a test cluster) involved me clearing znode
> > > info, deleting HDFS data for the table and deleting
> > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >
> > > Table cannot be enabled.
> > >
> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > > seems mostly unhappy with one region in particular, and is reporting
> > > on that.
> > >
> > > There are many tables that are very active so I don't think it is
> > > possible to stop the entire service without a lot of forewarning to
> > > users.
> > >
> > > Thanks in advance.
> > >