On Tue, 18 Feb 2025 19:42:07 GMT, Chris Plummer <cjplum...@openjdk.org> wrote:

>> That's a good question. I hadn't noticed the extra check for step->pending, 
>> and it looks like that was also in place for the previous version that 
>> relied on calling ClearFramePop on each outstanding NotifyFramePop. I think 
>> it is just bit rot as the code initially evolved. If it was possibly for the 
>> flag to be cleared by some other thread (I actually think it might be, but 
>> need to look closer), this extra check is not of much help since it could be 
>> cleared after the check. Let me look into this a bit more.
>
> There are multiple threads that can end up calling clearStep(). However, all 
> callers of clearStep() grab the eventHandler_lock() first, with one somewhat 
> obscure exception:
> 
> debugInit_reset
> threadControl_reset
> removeVThreads
> clearThread
> stepControl_clearRequest
> clearStep
> 
> This is done for a debugger reset after a connection has been dropped. 
> Probably not at much risk of a race in clearStep, but I think is best to add 
> eventHandler locking here also. After doing this there will be no race 
> conditions to worry about.
> 
> I'm pretty sure this extra check for "pending" is just bit rot. I know I 
> wanted an if block around the new code to make it easy to disable so I could 
> easily benchmark and test with and without "the fix". I think at one point 
> the code looked a lot different (I think there was something outside of the 
> if), but that was long ago.

I looked a bit more...

There are multiple threads that can end up calling clearStep(), so we need to 
be careful about races. There are two main paths into clearStep.
-The first is through stepControl_endStep. It always grabs the handler_lock and 
stepControl_lock.
-The 2nd is through through stepControl_clearRequest, which grabs no locks 
itself, but there are some already grabbed when it is called. It should really 
grab the stepControl_lock, but it can't because that lock is suppose to be 
grabbed before threadControl_lock, which has already been grabbed.  The proper 
order is handler_lock -> stepControl_lock -> threadControl_lock. So we need to 
fix locking for stepControl_clearRequest somehow.

There are 3 paths into stepControl_clearRequest().

1. The first goes through clearThread via removeThread via removeResumed.  This 
path always holds the handler_lock and threadControl_lock.
2. The 2nd goes through clearThread via removeThread via 
threadControl_onEventHandlerExit. This also holds the handler_lock and 
threadControl_onEventHandlerExit.
3. The 3rd goes through clearThread via removeVThreads via threadControl_reset. 
This path always holds the handler_lock, but not threadControl_lock

Holding the threadControl_lock here does not help us since stepControl_endStep 
does not grab that lock. Holding the eventHandler_lock does help since 
stepControl_endStep does grab it, but the 3rd path above does not grab the 
eventHandler_lock. I think the simplest fix here is to just make it do so. But 
that's using the eventHandler_lock to make stepControl code thread safe. Not 
exactly clean. So that means instead requiring the caller of 
stepControl_endStep to do all the needed locking. Basically add locking of 
stepControl_lock somewhere. Adding locking of stepControl_lock in clearThread 
is too late since we already hold the threadLock. It really needs to be pushed 
up the call chain of each of the above 3 paths to the point just before grabing 
the threadLock. For example for the 2nd call path above we have:

    if (ei == EI_THREAD_END) {
        eventHandler_lock(); /* for proper lock order - see removeThread() call 
below */
        debugMonitorEnter(threadLock);
 
So we need to add stepControl_lock() in between those too lines. We need to do 
similar for the 1st and 3rd paths also.

I think this is best held off for another PR. I can file a CR for it now with 
these notes. It's not a new problem, although might be slightly more at risk of 
going awry since we are doing a lot more in clearStep() now, but so far non of 
my testing has turned up any issues, so I think a proper fix can wait. It also 
relates to the ranked locking support project, which I hope to start back up 
soon.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23182#discussion_r1960569052

Reply via email to