On Tue, 18 Feb 2025 19:42:07 GMT, Chris Plummer <cjplum...@openjdk.org> wrote:
>> That's a good question. I hadn't noticed the extra check for step->pending, >> and it looks like that was also in place for the previous version that >> relied on calling ClearFramePop on each outstanding NotifyFramePop. I think >> it is just bit rot as the code initially evolved. If it was possibly for the >> flag to be cleared by some other thread (I actually think it might be, but >> need to look closer), this extra check is not of much help since it could be >> cleared after the check. Let me look into this a bit more. > > There are multiple threads that can end up calling clearStep(). However, all > callers of clearStep() grab the eventHandler_lock() first, with one somewhat > obscure exception: > > debugInit_reset > threadControl_reset > removeVThreads > clearThread > stepControl_clearRequest > clearStep > > This is done for a debugger reset after a connection has been dropped. > Probably not at much risk of a race in clearStep, but I think is best to add > eventHandler locking here also. After doing this there will be no race > conditions to worry about. > > I'm pretty sure this extra check for "pending" is just bit rot. I know I > wanted an if block around the new code to make it easy to disable so I could > easily benchmark and test with and without "the fix". I think at one point > the code looked a lot different (I think there was something outside of the > if), but that was long ago. I looked a bit more... There are multiple threads that can end up calling clearStep(), so we need to be careful about races. There are two main paths into clearStep. -The first is through stepControl_endStep. It always grabs the handler_lock and stepControl_lock. -The 2nd is through through stepControl_clearRequest, which grabs no locks itself, but there are some already grabbed when it is called. It should really grab the stepControl_lock, but it can't because that lock is suppose to be grabbed before threadControl_lock, which has already been grabbed. The proper order is handler_lock -> stepControl_lock -> threadControl_lock. So we need to fix locking for stepControl_clearRequest somehow. There are 3 paths into stepControl_clearRequest(). 1. The first goes through clearThread via removeThread via removeResumed. This path always holds the handler_lock and threadControl_lock. 2. The 2nd goes through clearThread via removeThread via threadControl_onEventHandlerExit. This also holds the handler_lock and threadControl_onEventHandlerExit. 3. The 3rd goes through clearThread via removeVThreads via threadControl_reset. This path always holds the handler_lock, but not threadControl_lock Holding the threadControl_lock here does not help us since stepControl_endStep does not grab that lock. Holding the eventHandler_lock does help since stepControl_endStep does grab it, but the 3rd path above does not grab the eventHandler_lock. I think the simplest fix here is to just make it do so. But that's using the eventHandler_lock to make stepControl code thread safe. Not exactly clean. So that means instead requiring the caller of stepControl_endStep to do all the needed locking. Basically add locking of stepControl_lock somewhere. Adding locking of stepControl_lock in clearThread is too late since we already hold the threadLock. It really needs to be pushed up the call chain of each of the above 3 paths to the point just before grabing the threadLock. For example for the 2nd call path above we have: if (ei == EI_THREAD_END) { eventHandler_lock(); /* for proper lock order - see removeThread() call below */ debugMonitorEnter(threadLock); So we need to add stepControl_lock() in between those too lines. We need to do similar for the 1st and 3rd paths also. I think this is best held off for another PR. I can file a CR for it now with these notes. It's not a new problem, although might be slightly more at risk of going awry since we are doing a lot more in clearStep() now, but so far non of my testing has turned up any issues, so I think a proper fix can wait. It also relates to the ranked locking support project, which I hope to start back up soon. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23182#discussion_r1960569052