> Sorry if I seem stupid but this is still all new to me.

Forgot to mention: there are no stupid questions here. Don't be shy, and
keep 'em coming.

On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil
<wellington.chevre...@gmail.com> wrote:

>> However, how would that help anyway? If we cannot fix this at this time
>> then any upgrade would have inconsistencies also, yes?
>
> The upgrade on its own wouldn't fix existing inconsistencies, but you
> would then have support for additional tooling (hbase-operator-tools) to
> help you with this.
>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each server
>> mentioned in that procedure? Or does it just mean that the region was
>> successfully unassigned from hbase25 but the data still resides on
>> hbase25? I see locality 0.
>
> IIRC, those were all UnassignProcedures, so it means the unassignment of
> the related region has completed and the region for that particular
> procedure went offline.
>
>> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems?
>
> The master works from its own in-memory cache of meta, so manually
> updating meta will just make the master's cache inconsistent with it. You
> would need to restart the masters to get the cache reloaded from meta.
> The main problem is that you still have the rogue procedures, which you
> can't get rid of without stopping the cluster. One alternative to a full
> cluster outage would be to identify all RSes running the rogue procs (you
> can find that from the active master logs), then stop only those and the
> master, clean masterprocwals, then start them again.
>
>> I suppose it means I am asking, the 73587 DisableTableProcedure, does it
>> mean that the table is waiting to be disabled? HBASE master declares that
>> table is NOT enabled.
>
> The table state may already have been updated to disabled, and most of
> its regions may already be offline, but the 73587 DisableTableProcedure
> cannot be considered "done" until all of its subprocedures have completed.
>
> On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <marc.hopp...@eset.sk> wrote:
>
>> Thanks for that.
>>
>> Alas, we are (currently) constrained to Cloudera (CDH) 6.3.1 and do
>> not have a viable business case to pay the extortionate amount of money
>> required to upgrade, which would give these clusters access to newer
>> versions.
>>
>> However, how would that help anyway? If we cannot fix this at this time
>> then any upgrade would have inconsistencies also, yes?
>>
>> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
>> that they were successfully and fully moved from hbase25 to each server
>> mentioned in that procedure? Or does it just mean that the region was
>> successfully unassigned from hbase25 but the data still resides on
>> hbase25? I see locality 0.
>>
>> If we change the table state in meta to 'ENABLED', could this kickstart
>> all these things or will it just lead to further problems? I suppose it
>> means I am asking, the 73587 DisableTableProcedure, does it mean that the
>> table is waiting to be disabled? HBASE master declares that table is NOT
>> enabled.
>>
>> Sorry if I seem stupid but this is still all new to me.
>>
>> I appreciate the help.
>>
>> -----Original Message-----
>> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
>> Sent: Tuesday, March 9, 2021 1:20 PM
>> To: Hbase-User <user@hbase.apache.org>
>> Subject: Re: HBASE WALs
>>
>>> All fails are waiting on the same PID (73587), a DISABLE TABLE
>>> procedure. The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
>>> seems to be the problem.
>>
>> Per your list procedures output attached, it seems the procs states are
>> all inconsistent.
>> There's a WAIT_TIMEOUT subproc of 73587 with PID 73827, which is the
>> UnassignProcedure for this region. The problem is that there are already
>> 5 APs for the same region, which may be causing some deadlocks. If this
>> cluster were on an hbck2-supported version, you could get rid of this
>> state using the bypass command on all these proc IDs, then manually get
>> the table/region states consistent again using the
>> setRegionState/setTableState/assigns/unassigns methods.
>>
>> Without tooling, the only option I can think of is to stop the cluster,
>> clean out masterprocwals, restart the cluster, then use hbase shell to
>> enable/disable/assign regions. You may also need to manually update
>> table/region states in the meta table. Of course, you can automate these
>> manual steps into your own tooling, but it may be a better strategy in
>> the long term to upgrade to a more stable version that also benefits
>> from more tooling supported by the community.
>>
>> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <marc.hopp...@eset.sk> wrote:
>>
>>> Hi, Wellington,
>>>
>>> I was on 'vacation' (no road trip or overseas anything) for a week.
>>>
>>> All fails are waiting on the same PID (73587), a DISABLE TABLE
>>> procedure. The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
>>> seems to be the problem.
>>>
>>> I am still mystified about the HBCK2-tools. I have attached a previous
>>> thread that you commented on at the time.
>>>
>>> I did build the tools for our HBASE 2.1.0...or rather, I built them on
>>> Ubuntu 20.04 with openJDK8 (1.8.0_212), then successfully ran them on
>>> Ubuntu 16.04 with a slightly different java (Oracle Java 8, 1.8.0_181).
>>> I used them to help fix a similar problem with an offline table and
>>> RITs. Both HBASE versions are the same.
>>>
>>> I attach a 'sheet' with the current procs/locks.
>>>
>>> -----Original Message-----
>>> From: Marc Hoppins <marc.hopp...@eset.sk>
>>> Sent: Wednesday, March 3, 2021 9:51 AM
>>> To: user@hbase.apache.org
>>> Cc: Martin Oravec <martin.ora...@eset.sk>
>>> Subject: RE: HBASE WALs
>>>
>>> Thanks, Wellington,
>>>
>>> I have already built a hbck1-tools for 2.1.0 using the method described
>>> in other topics. All the HBASE and JDK installs here are the same
>>> version, so if it worked for fixing HBASE on one cluster then it should
>>> work for the other installs.
>>>
>>> Fiddling with masterprocWALs will require a complete shutdown of hbase
>>> operations to prevent incoming reads/writes on other tables and I am
>>> not sure how disruptive that will be other than "probably a lot".
>>>
>>> -----Original Message-----
>>> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
>>> Sent: Tuesday, March 2, 2021 10:57 AM
>>> To: Hbase-User <user@hbase.apache.org>
>>> Subject: Re: HBASE WALs
>>>
>>> Sorry, missed your previous email. I was hoping you were not on a
>>> non-stable version, so that you would benefit from hbck2 tool support.
>>> Unfortunately, 2.1.0 is among the early releases that don't work with
>>> this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>>>
>>>> Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
>>>> mostly unhappy with one region in particular, and is reporting on
>>>> that.
>>>
>>> Are the other regions for the table properly closed, and this is the
>>> only one stuck? If you do a list_procedures, are you able to identify
>>> an 'unassign' procedure still running for this table? Or if you grep
>>> master logs for this region, do you see any messages suggesting
>>> there are still ongoing attempts to bring the region offline?
>>> If there's apparently no procedure and no ongoing attempt to offline
>>> the region, you might try to manually update its state in the meta
>>> table, then flip masters (assuming you have master HA), so that the new
>>> active master loads an up-to-date state from the meta table.
>>>
>>> Otherwise, if there's still a rogue procedure trying to offline the
>>> region, unfortunately, due to the lack of hbck support, you would most
>>> likely need a more disruptive intervention similar to what you
>>> described in your first email. Instead of the normal WAL folder,
>>> though, the master proc WALs are what you would really need to clean
>>> out here, as that is where procedure state is persisted, and you
>>> wouldn't want the rogue procedure to be resumed.
>>>
>>> On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <marc.hopp...@eset.sk> wrote:
>>>
>>>> If you know of anything that will help I would appreciate it.
>>>>
>>>> If you need any log output let me know.
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
>>>> Sent: Thursday, February 25, 2021 4:08 PM
>>>> To: Hbase-User <user@hbase.apache.org>
>>>> Subject: Re: HBASE WALs
>>>>
>>>>> Do WAL files contain information for multiple regions per WAL or
>>>>> is one WAL associated with one region?
>>>>
>>>> Edits for multiple regions would be present in a single WAL file.
>>>> That's why, upon an RS crash and WAL processing, there's a WAL split
>>>> phase.
>>>>
>>>>> I am trying to find a way to clear a RIT for a disabled table. A
>>>>> similar problem (but on a test cluster) involved me clearing znode
>>>>> info, deleting HDFS data for the table and deleting
>>>>> WALs/MasterProcWAL files, finally restarting HBASE service.
>>>>
>>>> Which hbase version are you on?
>>>>
>>>> On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <marc.hopp...@eset.sk>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Do WAL files contain information for multiple regions per WAL or
>>>>> is one WAL associated with one region?
>>>>>
>>>>> I am trying to find a way to clear a RIT for a disabled table. A
>>>>> similar problem (but on a test cluster) involved me clearing znode
>>>>> info, deleting HDFS data for the table and deleting
>>>>> WALs/MasterProcWAL files, finally restarting HBASE service.
>>>>>
>>>>> Table cannot be enabled.
>>>>>
>>>>> Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
>>>>> seems mostly unhappy with one region in particular, and is
>>>>> reporting on that.
>>>>>
>>>>> There are many tables that are very active so I don't think it is
>>>>> possible to stop the entire service without a lot of forewarning
>>>>> to users.
>>>>>
>>>>> Thanks in advance.
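
On the WAL question that opens the thread: the answer given is that one WAL
file carries edits for multiple regions, which is why a WAL split phase is
needed after an RS crash. The following is only a toy Python sketch of that
idea, not HBase's actual implementation. The entry shape (region, sequence
id, edit) and the name split_wal are made up for illustration.

```python
from collections import defaultdict


def split_wal(entries):
    """Group the entries of one (hypothetical) WAL by region.

    Each entry is a (region, seq_id, edit) tuple. This shape is an
    assumption for illustration, not the real HBase WAL format.
    The original order of edits is preserved within each region.
    """
    per_region = defaultdict(list)
    for region, seq_id, edit in entries:
        per_region[region].append((seq_id, edit))
    return dict(per_region)


# One WAL holding interleaved edits for two regions.
wal = [
    ("region-a", 1, "put row1"),
    ("region-b", 2, "put row9"),
    ("region-a", 3, "delete row1"),
]

recovered = split_wal(wal)
for region, edits in recovered.items():
    print(region, edits)
```

The point of the sketch is only that a single log has to be separated per
region before each region's edits can be replayed, so deleting or moving a
WAL file affects every region that wrote to it, not just the stuck one.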