Re: RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Chris Plummer Tue, 23 Jun 2020 20:11:48 -0700

On 6/23/20 6:05 PM, Yasumasa Suenaga wrote:

Hi Chris,
Skillful troubleshooters who use jhsdb will be aware of this warning, and they will take other appropriate measures.
However, I'm not sure it is worth continuing if SA cannot get register values.
For example, Linux AMD64 depends on RIP and RSP values to find the top frame.
According to your change, the caller of getThreadIntegerRegisterSet() is responsible for dealing with the set of null registers. However, X86ThreadContext::data (which holds the raw register values) would still be zero when this happens.

This is what I intended to have happen. Just end up with a register set of all nulls. Then when the stack walking code gets a null, it will revert to "last java frame" if available, otherwise no stack dump is done.
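
As a rough illustration of the fallback described above, here is a minimal sketch. The class and method names (FrameGuessSketch, Frame, chooseTopFrame) are hypothetical and are not the SA classes; the real code also validates register contents, which this sketch skips.

```java
// Hypothetical sketch: these names are illustrative and are not the SA classes.
final class FrameGuessSketch {
    /** Minimal stand-in for a frame: a stack pointer and a frame pointer. */
    static final class Frame {
        final long sp;
        final long fp;

        Frame(long sp, long fp) {
            this.sp = sp;
            this.fp = fp;
        }

        @Override
        public String toString() {
            return "Frame[sp=0x" + Long.toHexString(sp) + ", fp=0x" + Long.toHexString(fp) + "]";
        }
    }

    /**
     * Picks the frame to start walking from.
     * sp and fp are null when the register read failed for the thread.
     * lastJavaFrame is null when the thread has no "last java frame" recorded.
     * A null result means there is nothing to walk from, so no stack is dumped.
     */
    static Frame chooseTopFrame(Long sp, Long fp, Frame lastJavaFrame) {
        if (sp != null && fp != null) {
            return new Frame(sp, fp);
        }
        return lastJavaFrame;
    }

    public static void main(String[] args) {
        Frame lastJava = new Frame(0x00007f125f0f4770L, 0x00007f125f0f47c0L);
        // Register read failed, but a last Java frame is available.
        System.out.println(chooseTopFrame(null, null, lastJava));
        // Register read failed and nothing to fall back on.
        System.out.println(chooseTopFrame(null, null, null));
    }
}
```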

So I think the register holder (e.g. X86ThreadContext) should be tri-state (have registers, failed to get registers, not yet attempted to get registers).
OTOH it might be over-engineering. What do you think?
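
For illustration only, a tri-state holder along these lines might look like the sketch below. RegisterHolderSketch and its members are hypothetical names, and this is not how X86ThreadContext is written.

```java
// Hypothetical sketch of the tri-state idea; not the actual X86ThreadContext.
final class RegisterHolderSketch {
    enum State { NOT_ATTEMPTED, AVAILABLE, FAILED }

    private State state = State.NOT_ATTEMPTED;
    private long[] data = new long[0];

    /** Records a successful register read. */
    void setRegisters(long[] values) {
        data = values.clone();
        state = State.AVAILABLE;
    }

    /** Records that the register read was attempted and failed. */
    void markFailed() {
        data = new long[0];
        state = State.FAILED;
    }

    State state() {
        return state;
    }

    /** Returns a raw register value, refusing to hand out zeros for a failed read. */
    long getRegister(int index) {
        if (state != State.AVAILABLE) {
            throw new IllegalStateException("registers unavailable: " + state);
        }
        return data[index];
    }

    public static void main(String[] args) {
        RegisterHolderSketch holder = new RegisterHolderSketch();
        System.out.println(holder.state());
        holder.markFailed();
        System.out.println(holder.state());
    }
}
```

The point of the third state is that a caller can tell "zero because the read failed" apart from "zero because nothing was read yet".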

Before implementing this I looked at what would be the easiest approach to get the desired effect of the stack walking code simply failing over to using "last java frame", and decided the null set of registers was easiest. Other approaches involved more changes and impacted more files.


thanks,

Chris

Thanks,

Yasumasa


On 2020/06/24 3:16, Chris Plummer wrote:
On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:
Hi Chris,

On 2020/06/20 15:20, Chris Plummer wrote:
Hi Yasumasa,
ptrace is not used for core files, so the EFAULT for a bad core file is not a possibility. However, get_lwp_regs() does redirect to core_get_lwp_regs() for core files. It can fail, but the only reason it ever does is if the LWP can't be found in the core (which is never supposed to happen). I would think if this happened due to the core being truncated, SA would be blowing up all over the place with exceptions, probably before we ever get to this code, but in any case what we do here wouldn't really make a difference.
You are right, sorry.
I'm not sure why you prefer an exception for errors other than ESRCH. Why should they be treated differently? getThreadIntegerRegisterSet0() is used for finding the current frame for stack tracing. With my changes any failure will result in deferring to "last java frame" if set, and otherwise just not producing a stack trace (and the WARNING will be present in the output). This seems preferable to completely abandoning any further thread stack tracing.
I'm not sure we can trust the call stack when ptrace() returns any error other than ESRCH, even if "last java frame" is available. For example, doesn't ptrace() return EFAULT or EIO when something is wrong (e.g. stack corruption)? If so, it may lead the troubleshooter to a wrong analysis.
I think it should at least abort dumping the call stack for that thread.
Hi Yasumasa,
In general stack walking makes a best effort and can be wrong, even when not getting errors like this. For any actively executing thread SA needs to determine where the stack starts, with register contents being the starting point (SP, FP, and PC). These registers could contain anything, and SA makes a best effort to determine a current frame from them. However, the verification steps it takes are not 100% guaranteed, and can lead to an incorrect assumption of the current frame, which in turn can result in an exception later on when walking the stack. See JDK-8247641.
Keep in mind that the WARNING message will always be there. This should be enough to put the troubleshooter on alert that the stack trace may not be accurate. I think it's better to make an attempt at a stack trace than to just abandon it and not attempt to do something that may be useful.
thanks,

Chris
Thanks,

Yasumasa
thanks,

Chris

On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
Hi Chris,
I checked the Linux kernel code at a glance, and ESRCH seems to be set to errno by default.
So I guess it is similar to a "generic" error code.

https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to the manpage of ptrace(2), it might return an errno other than ESRCH. For example, if we analyze a broken core (e.g. the core was dumped with the disk full), we might get EFAULT. Thus I prefer to handle only ESRCH in your patch, and I also think SA should throw DebuggerException if any other error occurs.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html
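
To make the proposed policy concrete, here is a minimal sketch of "warn on ESRCH, throw on anything else". It is written in Java for illustration only, since the real check lives in the native libsaproc code. The names are hypothetical, IllegalStateException stands in for DebuggerException, and the errno values are the Linux ones.

```java
// Hypothetical sketch of the "ESRCH only" policy; not the actual SA or libsaproc code.
final class PtraceErrorPolicySketch {
    static final int ESRCH = 3;   // Linux errno: no such process
    static final int EFAULT = 14; // Linux errno: bad address

    /**
     * Returns the registers on success, null (after a warning) for ESRCH,
     * and throws for any other errno. IllegalStateException is a stand-in
     * for DebuggerException here.
     */
    static long[] registersOrNull(int errno, long[] regs, int lwpId) {
        if (errno == 0) {
            return regs;
        }
        if (errno == ESRCH) {
            System.err.println("WARNING: get_lwp_regs failed for lwp (" + lwpId + ")");
            return null;
        }
        throw new IllegalStateException(
            "get_thread_regs failed for lwp " + lwpId + ", errno=" + errno);
    }

    public static void main(String[] args) {
        // ESRCH: warn and carry on without registers.
        System.out.println(registersOrNull(ESRCH, null, 8089) == null);
    }
}
```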


Thanks,

Yasumasa


On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated the webrev based on the new finding that a JavaThread cannot be on the ThreadList after its OS thread has been destroyed since the JavaThread removes itself from the ThreadList, and therefore must be running on its OS thread. The logic of the fix is unchanged from the first webrev, but I updated the comments to better reflect what is going on. I also updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html

thanks,

Chris

On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,

On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,

On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,

Please help review the following:

https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's a summary of changes in each file:
The problem sounds to me like a variation of the more general problem of not ensuring a thread is kept alive whilst acting upon it. I don't know how the SA finds these references to the threads it is going to stackwalk, but is it possible to fix this via appropriate uses of ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in the VM changes. For example, SA can't create a wrapper to a JavaThread, only to have the JavaThread be freed later on. It's just not possible.
Then how does it obtain a reference to a JavaThread for which the native OS thread id is invalid? Any thread found in _java_thread_list is either live or still to be started. In the latter case the JavaThread->osThread does not have its thread_id set yet.
My assumption was that the JavaThread is in the process of being destroyed, and it has freed its OS thread but is itself still in the thread list. I did notice that the OS thread id being used looked to be in the range of thread id #'s you would expect for the running app, so that to me indicated it was once valid, but is no more.
Keep in mind that although hotspot may have synchronization code that prevents you from pulling a JavaThread off the thread list when it is in the process of being destroyed (I'm guessing it does), SA has no such protections.
But you stated that once the SA has attached, the target VM can't change. If the SA gets its set of threads from one attach then tries to make queries about those threads in a separate attach, then obviously it could be providing garbage thread information. So you would need to re-validate the JavaThread in the target VM before trying to do anything with it.
That's not what is going on here. It's attaching and doing a stack trace, which involves getting the thread list and iterating through all threads without detaching.
Okay so I restate my original comment - all the JavaThreads must be alive or not yet started, so how are you encountering an invalid thread id? Any thread you find via the ThreadsList can't have destroyed its osThread. In any case the logic should be checking thread->osThread() for NULL, and then osThread()->get_state() to ensure it is >= INITIALIZED before using the thread_id().
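
The check suggested above could be sketched roughly as follows. This is a Java illustration with hypothetical stand-in types (OsThread and a reduced ThreadState enum); the real osThread lives in the hotspot C++ code and has more states than shown here.

```java
// Hypothetical sketch of the suggested validity check; the types are stand-ins.
final class OsThreadCheckSketch {
    /** Reduced, assumed ordering: a thread id is only meaningful from INITIALIZED onward. */
    enum ThreadState { ALLOCATED, INITIALIZED, RUNNABLE }

    static final class OsThread {
        final ThreadState state;
        final int threadId;

        OsThread(ThreadState state, int threadId) {
            this.state = state;
            this.threadId = threadId;
        }
    }

    /** True only if the osThread exists and has reached at least INITIALIZED. */
    static boolean hasUsableThreadId(OsThread osThread) {
        return osThread != null
            && osThread.state.compareTo(ThreadState.INITIALIZED) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(hasUsableThreadId(null));
        System.out.println(hasUsableThreadId(new OsThread(ThreadState.ALLOCATED, 0)));
        System.out.println(hasUsableThreadId(new OsThread(ThreadState.RUNNABLE, 8089)));
    }
}
```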
Hi David,
I chatted with Dan about this, and he said since the JavaThread is responsible for removing itself from the ThreadList, it is impossible to have a JavaThread still on the ThreadList, but without an underlying OS Thread. So I'm a bit perplexed as to how I can find a JavaThread on the ThreadList, but that results in ESRCH when trying to access the thread with ptrace. My only conclusion is that this failure is somehow spurious, and maybe the issue is just that the thread is in some temporary state that prevents its access. If so, I still think the approach I'm taking is the correct one, but the comments should be updated.
ESRCH can have other meanings but I don't know enough about the broader context to know whether they are applicable in this case.
ESRCH The specified process does not exist, or is not currently being traced by the caller, or is not stopped (for requests that require a stopped tracee).
I won't comment further on the fix/workaround as I don't know the code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned up, it prevented the thread from getting a stack trace due to the exception being thrown. What I hadn't realized is that after fixing it to not throw an exception, which resulted in the stack walking code getting all nulls for register values, I actually started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon prio=5 tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
    java.lang.Thread.State: RUNNABLE
    JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp = 0x00007f125f0f4770, fp = 0x00007f125f0f47c0
  - java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
  - java.io.FileInputStream.read() @bci=1, line=223 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run() @bci=108, line=216 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run() @bci=4 (Interpreted frame)
  - java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I had enabled, and it indicates that the stack walking code is using the "last java frame" setting, which it will do if current register values don't indicate a valid frame (as would be the case if sp was null). I had previously assumed that without an underlying valid LWP, there would be no stack trace. Given that there is one, there must be a valid LWP. Otherwise I don't see how the stack could have been walked. That's another indication that the ptrace failure is spurious in nature.
thanks,

Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to issue commands on addresses, if the address is no longer valid for the command you are executing, then you would get the appropriate error when there is an attempt to create a wrapper for it. I don't know of any command that operates directly on a JavaThread, but I think there are some for InstanceKlass. So if you remembered the address of an InstanceKlass, and then reattached and tried a command that takes an InstanceKlass address, you would get an exception when SA tries to create the wrapper for the InstanceKlass if it were no longer a valid address for one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OS ThreadID is invalid, print a warning.

src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null due to the OS ThreadID not being valid.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with "sun.jvm.hotspot.debugger.DebuggerException" appearing twice when printing the exception.

thanks,

Chris
