Re: RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Chris Plummer Wed, 24 Jun 2020 11:25:33 -0700

On 6/24/20 12:01 AM, Yasumasa Suenaga wrote:

On 2020/06/24 15:32, Chris Plummer wrote:
Hi Yasumasa,
I think LinuxAMD64CFrame is used for pstack, and what I've been looking at has been jstack, and in particular AMD64CurrentFrameGuess, which does use "last java frame". I'm not sure why LinuxAMD64CFrame does not look at "last java frame". Maybe it should.
I had both patterns (jstack, mixed stack) in mind for this change.
As you know, mixed jstack (jstack --mixed) attempts to find the top of the native stack via LinuxAMD64CFrame, and register values are needed for that (so it depends on the ptrace() call). So I guess mixed mode jstack (jhsdb jstack --mixed) would not show any stacks (it cannot find "last java frame").

Hi Yasumasa,

I should have been clearer about what I meant by jstack and pstack. For jstack I meant using StackTrace.java, which is what you get by default with "jhsdb jstack" and also the clhsdb jstack command. For pstack I meant PStack.java, which is what you get with "jhsdb jstack --mixed" or the clhsdb pstack command.

So this CR impacts both types of stack traces in that they will get null registers when the lower level API fails to get the register set. For StackTrace.java it will then defer to "last java frame" if available. For PStack.java it will not, and will always result in no stack trace. The code of interest is here:

       AMD64ThreadContext context = (AMD64ThreadContext)thread.getContext();
       Address pc  = context.getRegisterAsAddress(AMD64ThreadContext.RIP);
       if (pc == null) return null;
       return LinuxAMD64CFrame.getTopFrame(dbg, pc, context);

So the question is whether "last java frame" should be used if pc == null. If so, then getTopFrame() would also need to be modified to use "last java frame" when fetching RBP.
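
To make the question above concrete, here is a minimal sketch of the fallback being discussed, written against hypothetical stand-in types. FrameStart and TopFrameChooser are names invented for illustration and are not SA classes; the real change would live in PStack.java and LinuxAMD64CFrame.getTopFrame(). The sketch only assumes that register values arrive as null when the register set could not be read, and that a "last java frame" may or may not be recorded for the thread.

```java
import java.util.Optional;

// Hypothetical stand-in for a frame starting point; not an SA class.
final class FrameStart {
    final long pc; // program counter
    final long fp; // frame pointer (RBP on AMD64)

    FrameStart(long pc, long fp) {
        this.pc = pc;
        this.fp = fp;
    }
}

// Hypothetical helper illustrating "use registers, else last java frame".
final class TopFrameChooser {
    private TopFrameChooser() {}

    /**
     * regPc and regFp are null when the register set could not be read.
     * lastJavaFrame is null when the thread has no "last java frame".
     */
    static Optional<FrameStart> choose(Long regPc, Long regFp, FrameStart lastJavaFrame) {
        if (regPc != null && regFp != null) {
            // Normal case: start the walk from the live register values.
            return Optional.of(new FrameStart(regPc, regFp));
        }
        // Registers unavailable: fall back to the anchor if there is one,
        // otherwise report that no stack trace can be produced.
        return Optional.ofNullable(lastJavaFrame);
    }
}
```

An empty result corresponds to today's "return null" path, so a caller that already handles a missing top frame would not need to change.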


thanks,

Chris

Thanks,

Yasumasa
thanks,

Chris

On 6/23/20 11:04 PM, Yasumasa Suenaga wrote:
Hi Chris,

Thank you for the explanation.
Your change looks good (but "last java frame" would not be found on Linux AMD64 because RSP is NULL - cf. LinuxAMD64CFrame.java)
Thanks,

Yasumasa


On 2020/06/24 12:09, Chris Plummer wrote:
On 6/23/20 6:05 PM, Yasumasa Suenaga wrote:
Hi Chris,
Skillful troubleshooters who use jhsdb will be aware of these warnings, and they will turn to other appropriate methods.
However, I'm not sure it is worth continuing if SA cannot get register values.
For example, Linux AMD64 depends on RIP and RSP values to find the top frame. According to your change, the caller of getThreadIntegerRegisterSet() is responsible for dealing with the set of null registers. However, X86ThreadContext::data (it includes the raw register values) would still be zero when that happens.
This is what I intended to have happen. Just end up with a register set of all nulls. Then when the stack walking code gets a null, it will revert to "last java frame" if available, otherwise no stack dump is done.
So I think the register holder (e.g. X86ThreadContext) should be tri-state (have registers, failed to get registers, not yet attempted to get registers).
OTOH it might be over-engineering. What do you think?
Before implementing this I looked at what would be the easiest approach to get the desired effect of the stack walking code simply failing over to using "last java frame", and decided the null set of registers was easiest. Other approaches involved more changes and impacted more files.
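
For comparison, the tri-state register holder suggested above could look roughly like the following. This is only a sketch under assumptions: RegisterSnapshot and its methods are hypothetical names, not part of X86ThreadContext or any other SA class, and the patch under review deliberately does not take this route.

```java
// Hypothetical tri-state register holder; not an SA class.
final class RegisterSnapshot {
    enum Status { NOT_ATTEMPTED, AVAILABLE, FAILED }

    private final Status status;
    private final long[] values; // raw register values, meaningful only when AVAILABLE

    private RegisterSnapshot(Status status, long[] values) {
        this.status = status;
        this.values = values;
    }

    static RegisterSnapshot notAttempted() {
        return new RegisterSnapshot(Status.NOT_ATTEMPTED, new long[0]);
    }

    static RegisterSnapshot failed() {
        return new RegisterSnapshot(Status.FAILED, new long[0]);
    }

    static RegisterSnapshot of(long[] rawValues) {
        // Copy so later changes to the caller's array do not leak in.
        return new RegisterSnapshot(Status.AVAILABLE, rawValues.clone());
    }

    Status status() {
        return status;
    }

    /** Returns the register at index, or null unless the read succeeded. */
    Long get(int index) {
        if (status != Status.AVAILABLE) {
            return null;
        }
        return values[index];
    }
}
```

The null-register approach in the webrev collapses FAILED and NOT_ATTEMPTED into "null", which is why it touches fewer files. The explicit status would mainly help a caller that wants to tell a zero register apart from a register that was never read.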
thanks,

Chris
Thanks,

Yasumasa


On 2020/06/24 3:16, Chris Plummer wrote:
On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:
Hi Chris,

On 2020/06/20 15:20, Chris Plummer wrote:
Hi Yasumasa,
ptrace is not used for core files, so the EFAULT for a bad core file is not a possibility. However, get_lwp_regs() does redirect to core_get_lwp_regs() for core files. It can fail, but the only reason it ever does is if the LWP can't be found in the core (which is never supposed to happen). I would think if this happened due to the core being truncated, SA would be blowing up all over the place with exceptions, probably before we ever get to this code, but in any case what we do here wouldn't really make a difference.
You are right, sorry.
I'm not sure why you prefer an exception for errors other than ESRCH. Why should they be treated differently? getThreadIntegerRegisterSet0() is used for finding the current frame for stack tracing. With my changes any failure will result in deferring to "last java frame" if set, and otherwise just not produce a stack trace (and the WARNING will be present in the output). This seems preferable to completely abandoning any further thread stack tracking.
I'm not sure we can trust the call stack when ptrace() returns any error other than ESRCH, even if "last java frame" is available. For example, doesn't ptrace() return EFAULT or EIO when something is wrong (e.g. stack corruption)? If so, it may lead the troubleshooter to a wrong analysis. I think SA should at least abort dumping the call stack for that thread.
Hi Yasumasa,
In general stack walking makes a best effort and can be wrong, even when not getting errors like this. For any actively executing thread SA needs to determine where the stack starts, with register contents being the starting point (SP, FP, and PC). These registers could contain anything, and SA makes a best effort to determine a current frame from them. However, the verification steps it takes are not 100% guaranteed, and can lead to an incorrect assumption of the current frame, which in turn can result in an exception later on when walking the stack. See JDK-8247641.
Keep in mind that the WARNING message will always be there. This should be enough to put the troubleshooter on alert that the stack trace may not be accurate. I think it's better to make an attempt at a stack trace than to just abandon it and not attempt to do something that may be useful.
thanks,

Chris
Thanks,

Yasumasa
thanks,

Chris

On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
Hi Chris,
I checked the Linux kernel code at a glance, and ESRCH seems to be set to errno by default.
So I guess it is similar to a "generic" error code.

https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to the ptrace(2) man page, it might return an errno other than ESRCH. For example, if we analyze a broken core (e.g. the core was dumped with the disk full), we might get EFAULT. Thus I prefer to handle only ESRCH in your patch, and I also think SA should throw DebuggerException if any other error occurs.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html
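
As an illustration only, the ESRCH-only policy suggested above can be written as a small decision function. PtraceErrorPolicy and the Action names are hypothetical and do not appear in the webrev; the errno constants are the standard Linux values. In the actual patch the decision is made in native code (LinuxDebuggerLocal.cpp), and any failure is treated as a warning.

```java
// Hypothetical sketch of the "treat only ESRCH as benign" suggestion.
final class PtraceErrorPolicy {
    // Standard Linux errno values.
    static final int ESRCH = 3;   // no such process
    static final int EIO = 5;     // I/O error
    static final int EFAULT = 14; // bad address

    enum Action {
        WARN_AND_USE_NULL_REGISTERS, // print a warning, let stack walking fall back
        THROW_DEBUGGER_EXCEPTION     // abandon the stack trace for this thread
    }

    private PtraceErrorPolicy() {}

    /** ESRCH is tolerated; every other errno aborts with an exception. */
    static Action esrchOnly(int errno) {
        return errno == ESRCH
                ? Action.WARN_AND_USE_NULL_REGISTERS
                : Action.THROW_DEBUGGER_EXCEPTION;
    }
}
```

The trade-off is between a possibly misleading stack trace (warn on everything) and no stack trace at all for the affected thread (throw on anything but ESRCH).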


Thanks,

Yasumasa


On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated the webrev based on the new finding that a JavaThread cannot be on the ThreadList after its OS thread has been destroyed, since the JavaThread removes itself from the ThreadList and therefore must be running on its OS thread. The logic of the fix is unchanged from the first webrev, but I updated the comments to better reflect what is going on. I also updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html
thanks,

Chris

On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,

On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,

On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,

Please help review the following:

https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's a summary of changes in each file:
The problem sounds to me like a variation of the more general problem of not ensuring a thread is kept alive whilst acting upon it. I don't know how the SA finds these references to the threads it is going to stackwalk, but is it possible to fix this via appropriate uses of ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in the VM changes. For example, SA can't create a wrapper to a JavaThread, only to have the JavaThread be freed later on. It's just not possible.
Then how does it obtain a reference to a JavaThread for which the native OS thread id is invalid? Any thread found in _java_thread_list is either live or still to be started. In the latter case the JavaThread->osThread does not have its thread_id set yet.
My assumption was that the JavaThread is in the process of being destroyed, and it has freed its OS thread but is itself still in the thread list. I did notice that the OS thread id being used looked to be in the range of thread id #'s you would expect for the running app, so that to me indicated it was once valid, but is no more.
Keep in mind that although hotspot may have synchronization code that prevents you from pulling a JavaThread off the thread list when it is in the process of being destroyed (I'm guessing it does), SA has no such protections.
But you stated that once the SA has attached, the target VM can't change. If the SA gets its set of threads from one attach then tries to make queries about those threads in a separate attach, then obviously it could be providing garbage thread information. So you would need to re-validate the JavaThread in the target VM before trying to do anything with it.
That's not what is going on here. It's attaching and doing a stack trace, which involves getting the thread list and iterating through all threads without detaching.
Okay so I restate my original comment - all the JavaThreads must be alive or not yet started, so how are you encountering an invalid thread id? Any thread you find via the ThreadsList can't have destroyed its osThread. In any case the logic should be checking thread->osThread() for NULL, and then osThread()->get_state() to ensure it is >= INITIALIZED before using the thread_id().
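
The guard described in the last sentence above can be modelled as follows. This is a sketch in Java against a hypothetical model of the VM-side data, not HotSpot or SA code: OsThreadModel, its State values, and ThreadIdGuard are invented for illustration, and the State enum lists only enough states to show the ordering assumed by the ">= INITIALIZED" comparison.

```java
import java.util.OptionalLong;

// Hypothetical model of an OS thread record; not a HotSpot or SA class.
final class OsThreadModel {
    // Illustrative states; only the relative order of ALLOCATED and
    // INITIALIZED matters for this sketch.
    enum State { ALLOCATED, INITIALIZED, RUNNABLE }

    final State state;
    final long threadId;

    OsThreadModel(State state, long threadId) {
        this.state = state;
        this.threadId = threadId;
    }
}

final class ThreadIdGuard {
    private ThreadIdGuard() {}

    /** Returns the thread id only when it is safe to use, else empty. */
    static OptionalLong usableThreadId(OsThreadModel osThread) {
        if (osThread == null) {
            return OptionalLong.empty(); // no OS thread attached
        }
        if (osThread.state.compareTo(OsThreadModel.State.INITIALIZED) < 0) {
            return OptionalLong.empty(); // thread id not set yet
        }
        return OptionalLong.of(osThread.threadId);
    }
}
```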
Hi David,
I chatted with Dan about this, and he said since the JavaThread is responsible for removing itself from the ThreadList, it is impossible to have a JavaThread still on the ThreadList, but without an underlying OS Thread. So I'm a bit perplexed as to how I can find a JavaThread on the ThreadList, but that results in ESRCH when trying to access the thread with ptrace. My only conclusion is that this failure is somehow spurious, and maybe the issue is just that the thread is in some temporary state that prevents its access. If so, I still think the approach I'm taking is the correct one, but the comments should be updated.
ESRCH can have other meanings but I don't know enough about the broader context to know whether they are applicable in this case.
ESRCH  The specified process does not exist, or is not currently being traced by the caller, or is not stopped
              (for requests that require a stopped tracee).
I won't comment further on the fix/workaround as I don't know the code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned up, it prevented the thread from getting a stack trace due to the exception being thrown. What I hadn't realized is that after fixing it to not throw an exception, which resulted in the stack walking code getting all nulls for register values, I actually started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon prio=5 tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
    java.lang.Thread.State: RUNNABLE
    JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp = 0x00007f125f0f4770, fp = 0x00007f125f0f47c0
  - java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
  - java.io.FileInputStream.read() @bci=1, line=223 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run() @bci=108, line=216 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run() @bci=4 (Interpreted frame)
  - java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I had enabled, and it indicates that the stack walking code is using the "last java frame" setting, which it will do if current register values don't indicate a valid frame (as would be the case if sp was null). I had previously assumed that without an underlying valid LWP, there would be no stack trace. Given that there is one, there must be a valid LWP. Otherwise I don't see how the stack could have been walked. That's another indication that the ptrace failure is spurious in nature.
thanks,

Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to issue commands on addresses, if the address is no longer valid for the command you are executing, then you would get the appropriate error when there is an attempt to create a wrapper for it. I don't know of any command that operates directly on a JavaThread, but I think there are for InstanceKlass. So if you remembered the address of an InstanceKlass, and then reattached and tried a command that takes an InstanceKlass address, you would get an exception when SA tries to create the wrapper for the InstanceKlass if it were no longer a valid address for one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OSThreadID is invalid, print a warning.

src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null due to the OS ThreadID not being valid.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with "sun.jvm.hotspot.debugger.DebuggerException" appearing twice when printing the exception.
thanks,

Chris
