Re: [DISCUSS] The priority of HBASE-29380
It could also affect other types of procedures, but I think the probability is very low.

You need to send two requests to the same table/peer at the same time, and the two procedures must both be polled out and executed by different PEWorkers. When acquiring the lock, one of the procedures loses the battle and is put into the waiting queue of the LockAndQueue instance for this table/peer. The winning procedure must then schedule a sub procedure of the same procedure type, and when the sub procedure finishes, it cleans up the LockAndQueue instance, so the procedure in the waiting queue can never be executed again.

Most peer related procedures hold the exclusive lock for the whole execution time, i.e., holdLock = true, and schedule a RefreshPeerProcedure, which is the same type, so they are more likely to hit this problem.

And yes, restarting the master fixes the problem: all of the above is in-memory state, so after a restart the 'lost' procedure is scheduled again.

Thanks.

Junegunn Choi wrote on Mon, Jun 9, 2025 at 21:40:
> Just my two cents.
>
> I'd consider two questions:
>
> * Does it only occur when attempting to remove multiple peer configurations
>   within a short time frame? If so, I think the likelihood of hitting it in
>   practice is quite low.
> * Is the system recoverable without service disruption? (e.g., by
>   restarting the master server)
>
> If the answer to both is yes, then I'd say we can lower the priority.
>
> That said, as I was writing this email, I just realized you already
> submitted a fix.
> It's a solid refactoring, and the fix aligns well with your observation.
> Great job as always.
>
> Thanks,
> Junegunn
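
To make the sequence described at the top of this message easier to follow, here is a toy Java model of it. This is only a sketch under stated assumptions: `ToyScheduler`, `ToyLockAndQueue`, `cleanupOnFinish` and the procedure names are hypothetical stand-ins invented for illustration, not the real HBase classes or the code touched by the fix. It only shows how dropping a lock entry that still has waiters loses the wake-up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy model only: none of these types are the real HBase ones.
final class LostWakeupSketch {

  /** Hypothetical stand-in for a per-table/peer lock plus its waiting queue. */
  static final class ToyLockAndQueue {
    String exclusiveOwner; // name of the procedure holding the lock, or null
    final Queue<String> waiting = new ArrayDeque<>();
  }

  /** Hypothetical scheduler that keeps all lock state in memory. */
  static final class ToyScheduler {
    final Map<String, ToyLockAndQueue> locks = new HashMap<>();
    final List<String> executed = new ArrayList<>();

    /** Returns true if proc got the lock; otherwise parks it in the waiting queue. */
    boolean tryLock(String key, String proc) {
      ToyLockAndQueue lq = locks.computeIfAbsent(key, k -> new ToyLockAndQueue());
      if (lq.exclusiveOwner == null) {
        lq.exclusiveOwner = proc;
        return true;
      }
      lq.waiting.add(proc);
      return false;
    }

    /** The problematic step: a finished sub procedure drops the entry, waiters included. */
    void cleanupOnFinish(String key) {
      locks.remove(key);
    }

    /** Releases the lock and wakes whatever is still queued for this key. */
    void unlock(String key) {
      ToyLockAndQueue lq = locks.get(key);
      if (lq == null) {
        return; // entry already gone, so there is nobody left to wake
      }
      lq.exclusiveOwner = null;
      String next = lq.waiting.poll();
      if (next != null) {
        executed.add(next);
      }
    }
  }

  /** Replays the race and returns the procedures that actually ran. */
  static List<String> replay(boolean subProcedureCleansUp) {
    ToyScheduler s = new ToyScheduler();
    String peer = "peer-1";
    if (s.tryLock(peer, "procA")) {
      s.executed.add("procA"); // procA wins the lock and runs
    }
    s.tryLock(peer, "procB"); // procB loses and is parked in the waiting queue
    // procA schedules a sub procedure of the same type, which finishes here.
    if (subProcedureCleansUp) {
      s.cleanupOnFinish(peer);
    }
    s.unlock(peer); // procA is done and releases the lock
    return s.executed;
  }

  public static void main(String[] args) {
    System.out.println("with cleanup:    " + replay(true));  // [procA]
    System.out.println("without cleanup: " + replay(false)); // [procA, procB]
  }
}
```

In this model `procB` is never executed once the entry has been removed, which mirrors the "hang forever" symptom. Because the map lives only in memory, rebuilding the scheduler from scratch (the analogue of restarting the master) would give `procB` a fresh chance at the lock.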
Re: [DISCUSS] The priority of HBASE-29380
Just my two cents.

I'd consider two questions:

* Does it only occur when attempting to remove multiple peer configurations within a short time frame? If so, I think the likelihood of hitting it in practice is quite low.
* Is the system recoverable without service disruption? (e.g., by restarting the master server)

If the answer to both is yes, then I'd say we can lower the priority.

That said, as I was writing this email, I just realized you already submitted a fix. It's a solid refactoring, and the fix aligns well with your observation. Great job as always.

Thanks,
Junegunn
Re: [DISCUSS] The priority of HBASE-29380
Some updates.

https://github.com/apache/hbase/pull/7077

I've been able to reproduce the problem with a simple UT.

张铎(Duo Zhang) wrote on Fri, Jun 6, 2025 at 23:47:
> While reviewing the flaky list of branch-3, I found a very critical
> problem which may cause a procedure to hang forever.
>
> I set it as a blocker for all upcoming releases, but to be honest I
> still need some time to see how to actually fix it, and may also need
> some proper tests.
>
> So if we want to cut 2.6.3 and 2.5.12 soon, I'm OK to reduce the
> priority and move 2.6.3 and 2.5.12 out from the fix versions, since
> the bug has been there for quite a while...
>
> Thoughts? Thanks.
[DISCUSS] The priority of HBASE-29380
While reviewing the flaky list of branch-3, I found a very critical problem which may cause a procedure to hang forever.

I set it as a blocker for all upcoming releases, but to be honest I still need some time to see how to actually fix it, and may also need some proper tests.

So if we want to cut 2.6.3 and 2.5.12 soon, I'm OK to reduce the priority and move 2.6.3 and 2.5.12 out from the fix versions, since the bug has been there for quite a while...

Thoughts? Thanks.
