On 1/15/15 5:09 AM, Ivan Gerasimov wrote:
Hello everyone!
This is yet another iteration in the attempt to solve the 'wrong exit
code' bug on Windows [1].
The wrong exit code has been observed once again with one of the
regression tests.
The suspicion is that the critical section was destroyed before all
the child threads had terminated.
In that case, one of the threads had been unblocked and called
_endthreadex(), which resulted in a race.
To address this scenario, it is proposed to make the thread that is
about to exit the process raise a flag.
If the flag is raised, any threads wishing to exit should instead
suspend themselves.
BUGURL: https://bugs.openjdk.java.net/browse/JDK-8069048
WEBREV: http://cr.openjdk.java.net/~igerasim/8069048/0/webrev/
src/os/windows/vm/os_windows.cpp
line 3895: // don't let the current thread to proceed to _endthreadex()
Typo: 'let the current thread to proceed to'
-> 'let the current thread proceed to'
Just making sure that I understand the revised algorithm:
- before the EPT_PROCESS thread gets here, EPT_THREAD threads
  will work as before and call line 3909 _endthreadex()
- after the EPT_PROCESS thread gets here and sets the flag
  on line 3886: OrderAccess::release_store(&process_exiting, 1);
  - an EPT_THREAD thread may have made it past the flag check on line
    3802: } else if (OrderAccess::load_acquire(&process_exiting) == 0) {
    but it will be blocked on line 3803:
    EnterCriticalSection(&crit_sect);
  - an EPT_THREAD thread that sees the flag set on line 3802 will
    drop into the self-suspend block on lines 3892-3900
- after the EPT_PROCESS thread exits the critical section, then
  any EPT_THREAD threads that were blocked trying to acquire
  the critical section will now see the flag set on line 3805:
  if (what == EPT_THREAD &&
  OrderAccess::load_acquire(&process_exiting) == 0) {
  and drop into the self-suspend block on lines 3892-3900
Short version: any EPT_THREAD threads that arrive after the
EPT_PROCESS thread owns the critical section will never call
line 3909 _endthreadex() because they self-suspend.
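For anyone following along without the webrev open, here is a rough, portable
sketch of the check / lock / re-check pattern summarized above. It is not the
code from os_windows.cpp. It assumes std::atomic and std::mutex as stand-ins
for OrderAccess and the Win32 critical section, and the function
decide_exit_action() and the Action enum are hypothetical names made up for
illustration. It returns the decision instead of actually suspending or
exiting, and it leaves out the handle list and the WaitForMultipleObjects()
step entirely.

```cpp
#include <atomic>
#include <cassert>  // only needed for the usage check shown after this block
#include <mutex>

// Hypothetical names; a portable stand-in, not the HotSpot implementation.
enum ExitKind { EPT_THREAD, EPT_PROCESS };
enum Action { CALL_ENDTHREADEX, SELF_SUSPEND, CALL_EXIT };

static std::atomic<int> process_exiting{0};  // flag raised by the exiting thread
static std::mutex crit_sect;                 // stands in for the CRITICAL_SECTION

// Reports what the calling thread would do next, without doing it.
Action decide_exit_action(ExitKind what) {
  // First check: a thread that already sees the flag never takes the lock.
  if (what == EPT_THREAD &&
      process_exiting.load(std::memory_order_acquire) != 0) {
    return SELF_SUSPEND;
  }

  std::lock_guard<std::mutex> guard(crit_sect);

  if (what == EPT_THREAD) {
    // Second check: the flag may have been raised while this thread was
    // blocked waiting for the lock.
    if (process_exiting.load(std::memory_order_acquire) != 0) {
      return SELF_SUSPEND;
    }
    return CALL_ENDTHREADEX;
  }

  // EPT_PROCESS: raise the flag while still holding the lock, so every
  // later arrival takes one of the SELF_SUSPEND paths above.
  process_exiting.store(1, std::memory_order_release);
  return CALL_EXIT;
}
```

Called in sequence on one thread, an EPT_THREAD call made before any
EPT_PROCESS call yields CALL_ENDTHREADEX, the EPT_PROCESS call yields
CALL_EXIT, and any EPT_THREAD call after that yields SELF_SUSPEND. The second
check under the lock is what catches threads that passed the first check and
then blocked.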
OK, I concur that this new algorithm looks correct and will reduce
the number of threads racing through line 3909 _endthreadex() while
the EPT_PROCESS thread is trying to call exit().
One possible hole remains that we've discussed before. If an
EPT_THREAD thread calls _endthreadex() before the EPT_PROCESS
thread gets here, and if the EPT_THREAD thread stalls in
_endthreadex(), then it's still possible for that EPT_THREAD
thread to mess up the exit code if it unblocks after the
EPT_PROCESS thread has set the exit code. We've discussed this
before and there's nothing we can do about it other than try to
reduce the probability by reducing the number of EPT_THREAD
threads that are calling _endthreadex().
Thumbs up!
Side note: A new failure of this mechanism was filed recently:
JDK-8069068 VM warning: WaitForMultipleObjects timed out (0) ...
https://bugs.openjdk.java.net/browse/JDK-8069068
The bug was filed against JDK9-B45 so it has the most recent
fix (https://bugs.openjdk.java.net/browse/JDK-8066863). The
weird part is that WaitForMultipleObjects() timed out without
an error code being set. Don't know if that means anything in
particular in the Win* APIs...
This fix (8069048) may also reduce the probability of this
failure mode because we'll be queueing fewer threads on the
handle list...
Dan
[1] https://bugs.openjdk.java.net/browse/JDK-6573254
Sincerely yours,
Ivan