Hi Thomas,
Thanks for taking a look.
On 18/11/2019 9:58 pm, Thomas Stüfe wrote:
This is evil :)
There might be more cases like this, e.g.
frame_x86.cpp frame::is_interpreted_frame_valid():
if (locals > thread->stack_base() || locals < (address) fp()) return false;
Yes, that might be a case where >= should be used. I'll file another
bug to check uses of stack_base().
Also, I would have thought the little alloca() dance we do at the start
of thread_native_entry() would push the first real frame down the stack.
I know nothing of that code. :)
The fix looks good.
Thanks!
David
-----
Cheers, Thomas
On Mon, Nov 18, 2019 at 3:31 AM David Holmes <[email protected]> wrote:
Bug: https://bugs.openjdk.java.net/browse/JDK-8215355
webrev: http://cr.openjdk.java.net/~dholmes/8215355/webrev/
This was a very difficult bug to track down and I want to publicly
acknowledge and thank the jemalloc folk (users and developers) for
continuing to investigate this issue from their side. Without their
persistence this issue would have languished.
The thread stack_base() is the first address above the thread's stack.
However, the "in stack" checks performed by Thread::on_local_stack and
Thread::is_in_stack allowed the checked address to be equal to the
stack_base() - which is not correct. Here's how this manifests as
the bug:
- Let a JavaThread instance, T2, be allocated at the end of thread T1's
stack i.e. at T1->stack_base()
[This seems to be why this only reproduced with jemalloc.]
- Let T2 lock an inflated monitor
- Let T1 try to lock the same monitor
- T1 would consider the _owner field value (T2) as being in its stack
and so consider the monitor stack-locked by T1
- And so both T1 and T2 would have ownership of the monitor, allowing
the monitor state (and application state) to be corrupted. This results
in a range of hangs and crashes depending on the exact interleaving.
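
To make the off-by-one concrete, here is a minimal sketch of the two forms of the range check. It is not the HotSpot code: StackBounds, is_in_stack_inclusive and looks_stack_locked_by are hypothetical names used only for illustration, and the stack is assumed to occupy the half-open range [base - size, base).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a thread's stack bounds (not the HotSpot Thread class).
// Assumption: the stack occupies [base - size, base), so 'base' is the first
// address ABOVE the stack and is not itself a stack address.
struct StackBounds {
  std::uintptr_t base;  // first address above the stack (exclusive upper bound)
  std::uintptr_t size;  // stack size in bytes

  // Lowest address that belongs to the stack.
  std::uintptr_t end() const { return base - size; }

  // Inclusive form: wrongly accepts adr == base.
  bool is_in_stack_inclusive(std::uintptr_t adr) const {
    return adr >= end() && adr <= base;
  }

  // Exclusive form: half-open interval [end, base).
  bool is_in_stack(std::uintptr_t adr) const {
    return adr >= end() && adr < base;
  }
};

// Models the ownership test from the scenario above: does the monitor's owner
// value look like an address in T1's stack? A true result means T1 would treat
// the monitor as stack-locked by itself.
inline bool looks_stack_locked_by(const StackBounds& t1,
                                  std::uintptr_t owner,
                                  bool inclusive) {
  return inclusive ? t1.is_in_stack_inclusive(owner) : t1.is_in_stack(owner);
}
```

With a T2 object allocated exactly at T1's base, the inclusive check reports the owner as being in T1's stack, while the exclusive check rejects it and still accepts every address from end() up to base - 1.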
Interestingly Thread::is_in_usable_stack does not have this bug.
The bug can be tracked way back to JDK-6699669 as explained in the bug
report. That issue also showed that the same bug existed in the SA
implementations of these "on stack" checks.
Testing:
- The reproducer from the bug report, using jemalloc, ran over 5000
times without failing in any way.
- tiers 1-3 on all Oracle platforms
- serviceability/sa tests
Thanks,
David
-----