Re: RFR(M): 8247515: OSX pc_to_symbol() lookup does not work with core files

Kevin Walls Fri, 24 Jul 2020 02:18:18 -0700

Thanks Chris - all sounds good to me.  Thanks for all the MachO insights...



On 23/07/2020 22:27, Chris Plummer wrote:

Just a minor update of some new findings (no new code change). The DB hash table being used by default will overwrite an existing entry, not duplicate it, and this is indeed what was happening. The second entry added was the one with a 0 offset. When I enable the R_NOOVERWRITE flag, it stops the overwrite and that also fixes the problem, but that's only because the entry with offset 0 comes last. The fix I've done is better since it avoids the offset 0 entry altogether, so even if it came first it would not get used.
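
To make the ordering argument concrete, here is a minimal sketch of the three behaviours: overwrite on insert, no-overwrite on insert, and skipping offset 0 entries entirely. The toy_table type and the toy_put/toy_get/toy_put_filtered helpers are hypothetical stand-ins written for this illustration (a linear-search map, not the DB API and not the SA code), and the 0x1234 offset is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SYMS 16

/* Toy stand-in for the name -> offset table; NOT the real DB API. */
typedef struct {
    const char *name[MAX_SYMS];
    uintptr_t   offset[MAX_SYMS];
    size_t      count;
} toy_table;

/* Insert name -> offset. If the name already exists, either overwrite it
 * (the default behaviour described above) or keep the old value when
 * no_overwrite is set. Returns 1 if an existing entry was left untouched,
 * 0 otherwise. */
static int toy_put(toy_table *t, const char *name, uintptr_t offset,
                   int no_overwrite) {
    assert(t != NULL && name != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->name[i], name) == 0) {
            if (no_overwrite) {
                return 1;
            }
            t->offset[i] = offset;
            return 0;
        }
    }
    assert(t->count < MAX_SYMS);
    t->name[t->count] = name;
    t->offset[t->count] = offset;
    t->count++;
    return 0;
}

/* Look up a name; returns UINTPTR_MAX if it is not present. */
static uintptr_t toy_get(const toy_table *t, const char *name) {
    assert(t != NULL && name != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->name[i], name) == 0) {
            return t->offset[i];
        }
    }
    return UINTPTR_MAX;
}

/* The approach taken in the fix: never insert offset 0 entries, so the
 * result no longer depends on which duplicate is seen first. */
static void toy_put_filtered(toy_table *t, const char *name,
                             uintptr_t offset) {
    if (offset == 0) {
        return;
    }
    toy_put(t, name, offset, 0);
}
```

With overwrite semantics the last duplicate wins, with no-overwrite semantics the first one wins, and only the filtered insert gives the proper offset regardless of order.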
thanks,

Chris

On 7/22/20 10:25 PM, Chris Plummer wrote:
Hi Kevin,
Thanks for the review. Unfortunately there was yet another bug which I have now addressed. Although testing with mach5 went fine, when I tried with a local build I had issues. SA couldn't even get past an early part of the core file handling which involves doing some adjustments related to CDS. It needs to look up a couple of hotspot symbols by name to do this, and get their values (such as _SharedBaseAddress). Although the symbol -> address lookup seemed to work, the values retrieved from the address were garbage. After some debugging I noticed the 3 symbols being looked up all had the same address. Then I noticed this address was at offset 0 of the libjvm segment. After a lot more debugging I found the problem. These symbols were actually in the symbol table twice, once with a proper offset and once with an offset of 0. I'm not sure why the ones with an offset of 0 were there (other than they originated in the mach-o symbol table).
The reason this didn't always happen is because SA takes all the symbols it finds and adds them to a hash table for fast symbol -> address lookup. If a symbol is in there twice, it's a tossup which you'll get. It could change from build to build in fact. The trigger for my local build was probably how I ran configure, which likely is not the same as mach5, although I'm unsure if this just gave me the unlucky hashing, or if in fact it resulted in the entries with offset 0. In any case, the fix is to ignore entries with offset 0. Here's the updated webrev:
http://cr.openjdk.java.net/~cjplummer/8247515/webrev.03/index.html
All the changes since webrev.02 are in build_symtab() in symtab.c. Besides ignoring entries with offset 0 to fix the bug, I also did some cleanup. There used to be two loops to iterate over the symbols. There wasn't really a good reason for this, so now there is just one. Also, it no longer adds entries with a file offset 0, an offset into the string section of 0, or an empty string. By doing this the size of the libjvm symbol table went from about 240k entries to 90k. Since it was originally allocated at its full potential size, it's now reallocated to the smaller size after symbol table processing is over.
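
The single-pass filter and the shrink step described above can be sketched roughly as follows. This is not the code from the webrev: raw_sym, sym_entry and build_filtered are hypothetical, simplified names, and the real structures in symtab.c are different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified input record; the real Mach-O entry differs. */
typedef struct {
    uint32_t  strx;    /* index of the name in the string section */
    uintptr_t offset;  /* offset of the symbol within the library */
} raw_sym;

/* Hypothetical, simplified output record. */
typedef struct {
    const char *name;
    uintptr_t   offset;
} sym_entry;

/* Single pass over the raw symbols. Keeps only entries with a non-zero
 * offset, a non-zero string index, and a non-empty name, then shrinks the
 * array to the number of entries kept. Returns NULL if nothing was kept or
 * if allocation failed; *out_count receives the number of entries. */
static sym_entry *build_filtered(const raw_sym *raw, size_t nraw,
                                 const char *strtab, size_t *out_count) {
    assert(out_count != NULL);
    *out_count = 0;
    if (raw == NULL || strtab == NULL || nraw == 0) {
        return NULL;
    }

    /* Allocate for the worst case first (every raw entry kept). */
    sym_entry *syms = malloc(nraw * sizeof(*syms));
    if (syms == NULL) {
        return NULL;
    }

    size_t kept = 0;
    for (size_t i = 0; i < nraw; i++) {
        if (raw[i].offset == 0 || raw[i].strx == 0) {
            continue;  /* offset 0 or string index 0: skip */
        }
        const char *name = strtab + raw[i].strx;
        if (name[0] == '\0') {
            continue;  /* empty name: skip */
        }
        syms[kept].name = name;
        syms[kept].offset = raw[i].offset;
        kept++;
    }

    if (kept == 0) {
        free(syms);
        return NULL;
    }

    /* Shrink to the size actually used; keep the old block if this fails. */
    sym_entry *shrunk = realloc(syms, kept * sizeof(*syms));
    if (shrunk != NULL) {
        syms = shrunk;
    }
    *out_count = kept;
    return syms;
}
```

The caller owns the returned array and releases it with free().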
thanks,

Chris

On 7/22/20 2:45 AM, Kevin Walls wrote:
Thanks Chris, yes looks good, I like that we check the library bounds before calling nearest_symbol.
--
Kevin


On 21/07/2020 21:05, Chris Plummer wrote:
Hi Serguei and Kevin,

The webrev has been updated:

http://cr.openjdk.java.net/~cjplummer/8247515/webrev.02/index.html
https://bugs.openjdk.java.net/browse/JDK-8247515

Two issues were addressed:
(1) Code in symbol_for_pc() assumed the caller had first checked to make sure that the symbol is in a library, whereas some callers assume NULL will be returned if the symbol is not in a library. This is the case for pstack for example, which first blindly does a pc to symbol lookup, and only if that returns null does it then check if the pc is in the codecache or interpreter. The logic in symbol_for_pc() assumed that if the pc was greater than the start address of the last library in the list, then it must be in that library. So in stack traces the frames for compiled or interpreted pc's showed up as the last symbol in the last library, plus some very large offset. The fix is to now track the size of libraries so we can do a proper range check.
(2) There are issues with finding system libraries. See [1] JDK-8249779. Because of this I disabled support for trying to locate them. This was done in ps_core.c, and there are "JDK-8249779" comment references in the code where I did this. The end result of this is that PMap for core files won't show system libraries, and PStack for core files won't show symbols for addresses in system libraries. Note that currently support for PMap and PStack is disabled for OSX, but I will shortly send out a review to enable them for OSX core files as part of the fix for [2] JDK-8248882.
[1] https://bugs.openjdk.java.net/browse/JDK-8249779
[2] https://bugs.openjdk.java.net/browse/JDK-8248882
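
For item (1), a range check over a library list might look roughly like the sketch below. The lib struct, its field names, and find_lib_for_pc are assumptions made for illustration; they are not the definitions from libproc_impl.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified library record; field names are assumptions. */
typedef struct lib {
    const char *name;
    uintptr_t   base;   /* start address of the library */
    size_t      size;   /* tracked size, so the end address is known */
    struct lib *next;
} lib;

/* Return the library whose [base, base + size) range contains pc, or NULL
 * if pc is not inside any library (for example a code cache or interpreter
 * address). A caller can then fall back to its other checks on NULL. */
static const lib *find_lib_for_pc(const lib *libs, uintptr_t pc) {
    for (const lib *l = libs; l != NULL; l = l->next) {
        /* pc - base < size avoids overflow in base + size. */
        if (pc >= l->base && pc - l->base < l->size) {
            return l;
        }
    }
    return NULL;
}
```

With only a start address per library, any pc above the last library's base would be attributed to that library; with the size tracked, such a pc yields NULL instead.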

thanks,

Chris

On 7/14/20 5:46 PM, serguei.spit...@oracle.com wrote:
Okay, I'll wait for new webrev version to review.

Thanks,
Serguei


On 7/14/20 17:40, Chris Plummer wrote:
Hi Serguei,
Thanks for reviewing. This webrev is in limbo right now as I discovered some issues that Kevin and I have been discussing offline. One is that the code assumes the caller has first checked to make sure that the symbol is in a library, whereas the actual callers assume NULL will be returned if the symbol is not in a library. The end result is that we end up returning a symbol, even for addresses in the code cache or interpreter. So in stack traces these frames show up as the last symbol in the last library, plus some very large offset. I have a fix for that where now I track the size of each library. But there is another issue with the code that tries to discover all libraries in the core file. It misses a very large number of system libraries. We understand why, but are not sure of the solution. I might just change the code to only worry about user libraries (JDK libs and other JNI libs).
Some comments below also.

On 7/14/20 4:37 PM, serguei.spit...@oracle.com wrote:
Hi Chris,

I like the suggestion from Kevin below.
I'm not sure which suggestion since I updated the webrev based on his initial suggestion.
I have a couple of minor comments so far.

http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/src/jdk.hotspot.agent/macosx/native/libsaproc/libproc_impl.c.frames.html
313 if (!lib->next || lib->next->base >= addr) {
I wonder if the check above has to be:
313 if (!lib->next || lib->next->base > addr) {
Yes, > would be better, although this goes away with my fix that tracks the size of each library.
http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/src/jdk.hotspot.agent/macosx/native/libsaproc/symtab.c.frames.html
417 if (offset_from_sym >= 0) { // ignore symbols that comes after "offset"
Replace: comes => come
Ok.

thanks,

Chris
Thanks,
Serguei


On 7/8/20 03:23, Kevin Walls wrote:
Sure -- I was thinking of lowest_offset_from_sym starting at a high positive integer (that would now be PTRDIFF_MAX I think) to save a comparison with e.g. -1, so you can just check if the new offset is less than lowest_offset_from_sym.
With the ptrdiff_t change you made, this all looks good to me however you decide. 8-)
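
As a rough sketch of the PTRDIFF_MAX variant suggested here (not the code in the webrev, which may initialise to -1 instead), a nearest-symbol scan could look like this. The sym type and nearest_symbol_sketch are hypothetical names, and the sketch assumes all offsets fit in ptrdiff_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified symbol record. */
typedef struct {
    const char *name;
    uintptr_t   offset;  /* offset from the start of the library */
} sym;

/* Return the symbol at or before "offset" with the smallest distance to
 * it, or NULL if every symbol comes after "offset". */
static const sym *nearest_symbol_sketch(const sym *syms, size_t n,
                                        uintptr_t offset) {
    assert(syms != NULL || n == 0);
    const sym *best = NULL;
    /* Start impossibly high so the first valid candidate always wins and
     * no separate "not yet set" comparison is needed. */
    ptrdiff_t lowest_offset_from_sym = PTRDIFF_MAX;

    for (size_t i = 0; i < n; i++) {
        /* Signed difference: negative when the symbol is after "offset". */
        ptrdiff_t offset_from_sym =
            (ptrdiff_t)offset - (ptrdiff_t)syms[i].offset;
        if (offset_from_sym < 0) {
            continue;  /* ignore symbols that come after "offset" */
        }
        if (offset_from_sym < lowest_offset_from_sym) {
            lowest_offset_from_sym = offset_from_sym;
            best = &syms[i];
        }
    }
    return best;
}
```

One consequence of this choice is that a distance of exactly PTRDIFF_MAX can never match, which should not matter for realistic library sizes.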
On 07/07/2020 21:17, Chris Plummer wrote:
Hi Kevin,
Thanks for the review. Yes, that lack of initialization of lowest_offset_from_sym is a bug. I'm real surprised the compiler didn't catch it as it will be uninitialized garbage the first time it is referenced. Fortunately usually the eventual offset is very small if not 0, so probably this never prevented a proper match. I think there's also another bug:
 415       uintptr_t offset_from_sym = offset - sym->offset;
"offset" is the passed in offset, essentially the address of the symbol we are interested in, but given as an offset from the start of the DSO. "sym->offset" is also an offset from the start of the DSO. It could be located before or after "offset". This means the math could result in a negative number, which when converted to unsigned would be a very large positive number. This happens whenever you check a symbol that is actually located after the address you are looking up. The end result is harmless, because it just means there's no way we will match that symbol, which is what you want, but it would be good to clean this up.
I think what is best is to use ptrdiff_t and initialize lowest_offset_from_sym to -1. I've updated the webrev:
http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/index.html
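
The unsigned wraparound described above is easy to reproduce in isolation. The two helpers below are hypothetical and only illustrate the arithmetic; the signed version assumes both offsets fit in ptrdiff_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned difference, as in the uintptr_t line quoted above: wraps to a
 * very large positive value when sym_offset is greater than offset. */
static uintptr_t unsigned_distance(uintptr_t offset, uintptr_t sym_offset) {
    return offset - sym_offset;
}

/* Signed difference: negative when the symbol lies after the offset, so
 * "symbol is after the address" can be tested directly with < 0. */
static ptrdiff_t signed_distance(uintptr_t offset, uintptr_t sym_offset) {
    return (ptrdiff_t)offset - (ptrdiff_t)sym_offset;
}
```

For example, looking up offset 0x100 against a symbol at 0x180 gives UINTPTR_MAX - 0x7f with the unsigned version and -0x80 with the signed one.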
thanks,

Chris

On 7/7/20 4:09 AM, Kevin Walls wrote:
Hi Chris,

Yes I think this looks good.
Question: In nearest_symbol, do we need to initialize lowest_offset_from_sym to something impossibly high, as if it defaults to zero we never find a better/nearer result?
Thanks
Kevin


On 07/07/2020 06:10, Chris Plummer wrote:
Hello,

Please help review the following:
http://cr.openjdk.java.net/~cjplummer/8247515/webrev.00/index.html
https://bugs.openjdk.java.net/browse/JDK-8247515
The CR contains a description of the issues being addressed. There is also no test for this symbol lookup support yet. It will be there after I push [1] JDK-8247516 and [2] JDK-8247514, which are both blocked by the CR.
[1] https://bugs.openjdk.java.net/browse/JDK-8247516
[2] https://bugs.openjdk.java.net/browse/JDK-8247514

thanks,

Chris
