On 9/9/22 04:58, John Reiser wrote:
1. Describe the environment completely.

Also: Any kind of threading (pthreads, or shm_open, or mmap(,,,MAP_SHARED,,))
must be mentioned explicitly.  Multiple execution contexts which access
the same address space instance are a significant complicating factor.

If threading is involved, then try using "valgrind --tool=drd ..."
or --tool=helgrind, because those tools specifically target detecting
race conditions and other synchronization errors, much like --tool=memcheck
[the default tool when no --tool= is mentioned] targets errors involving
malloc() and free(), uninitialized variables, etc.


No threading is used. Postgres is multi-process, and uses shared memory for the shared cache (through shm_open etc.). FWIW, as I mentioned before, this works perfectly fine when the core is not generated by valgrind.
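
(Just to illustrate the mechanism for anyone following along - this is not Postgres code, only a rough sketch of how a multi-process shared segment is typically set up with shm_open + mmap(MAP_SHARED); the segment name is made up, and older glibc may need -lrt:)

    /* rough sketch: create/attach a cross-process shared memory segment */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *attach_shared(size_t size)
    {
        int fd = shm_open("/example_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, size) != 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping survives the close */
        return (p == MAP_FAILED) ? NULL : p;
    }

Every process that maps the same name sees the same memory, so it is multiple processes - not threads - touching one region.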

4. Walk before attempting to run.
Did you try a simple example?  Write a half-page program with 5 subroutines, each of which calls the next one, with the last one sending SIGABRT to the process.

Does the core file produced when running under valgrind give the correct traceback in gdb?

Specifically: apply valgrind to the small program which causes a deliberate SIGABRT, and get a core file.  Does gdb give the correct traceback for that core file?  If not, then you have an ideal test case for filing a bug report against valgrind, because even the simple core file is bad.  If gdb does give a correct traceback for the simple core file, then you have to keep looking for the source of the
problem in your larger program.
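
(For the record, a minimal sketch of such a test program - plain C, nothing Postgres-specific - might be:)

    /* five functions, each calling the next; the last raises SIGABRT */
    #include <signal.h>

    static void f5(void) { raise(SIGABRT); }
    static void f4(void) { f5(); }
    static void f3(void) { f4(); }
    static void f2(void) { f3(); }
    static void f1(void) { f2(); }

    int main(void)
    {
        f1();
        return 0;
    }

Compile it with -g, run it under valgrind so a core file is produced, and then check whether gdb shows f1 through f5 in the backtrace from that core.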


I'll try this once I have access to the machine early next week.


5. (Learn and) Use the built-in tools where possible.
Run the process interactively, invoking valgrind with "--vgdb-error=0",
and giving the debugger command "(gdb) continue" after establishing
connectivity between vgdb and the process.
See the valgrind manual, section 3.2.9 "vgdb command line options".
When the SIGABRT happens, then vgdb will allow you to use all the ordinary
gdb commands to get a backtrace, go up and down the stack, examine
variables and other memory, run
    (gdb) info proc
    (gdb) shell cat /proc/$PID/maps
to see exactly the layout of process memory, etc.
There are also special commands to access valgrind functionality
interactively, such as checking for memory leaks.


I already explained why I don't want / can't use the interactive gdb. I'm aware of the option, I've used it before, but in this case it's not very practical.

The gdb process does not *have* to be run interactively; it just takes more work
and patience to run it non-interactively.  Run "valgrind --vgdb-error=0 ..."
and notice the last part of the printed instructions:

     ==215935== and then give GDB the following command
     ==215935==   target remote | /path/to/libexec/valgrind/../../bin/vgdb --pid=215935
     ==215935== --pid is optional if only one valgrind process is running

So if there is only one valgrind process, then you do not need to know the pid. Thus you can run gdb with redirected stdin/stdout/stderr, or perhaps use the -x command-line option.  This allows a static, pre-scripted list of gdb commands; it may require a few iterations to get a good debug script.  (Try the commands using the trivial SIGABRT case!)  Also get the full gdb manual (more than 800 pages)
and look at the "thread apply all ..." and "frame apply all ..." commands.
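
(Roughly, and just as a shape to start from, such a pre-scripted run could be a gdb command file fed in with -x; the file name and the exact commands below are placeholders to adapt:)

    # debug.gdb -- placeholder command file, e.g. run as:
    #   gdb -batch -x debug.gdb /path/to/postgres > gdb.log 2>&1
    target remote | /path/to/libexec/valgrind/../../bin/vgdb
    # add --pid=<PID> to the vgdb invocation if several valgrind processes run
    continue
    # control returns here once the process stops (e.g. on the SIGABRT)
    thread apply all bt full
    info proc
    detach

With -batch, gdb exits after the last command, so the whole thing can be launched from the test harness alongside the server.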


Sure, but that's more of a workaround - it does not make the core file useful, it just provides an alternative way to get to the same result. Plus it requires additional tooling/scripting, and I'd prefer to keep the tooling as simple as possible.

Postgres is a multi-process system that runs a bunch of management processes, plus client processes (1:1 to connections). We don't know in which one an issue might happen, so we'd have to attach a script to each of them.

Furthermore, there's the question of performance - we run these tests on many machines (although only some of them run the tests under valgrind), and valgrind makes things fairly slow already - if this vgdb thing makes it even slower, that'd be an issue. But I haven't measured it, so maybe it's not as bad as I fear.

It may be possible to perform some interactive "reconnaissance" to suggest
good things for the script to try.  Using --vgdb-error=0, put a breakpoint
on a likely location for the error (or shortly before the error),
and look around.  In the logged traceback:

  TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 902, PID: 536049)
   (ExceptionalCondition+0x98)[0x8f5cec]
   (+0x57a574)[0x682574]
   (+0x579edc)[0x681edc]
   (ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
   (SnapBuildProcessNewCid+0x94)[0x68b6a4]

any of those named locations, or shortly before them, might be a good spot.
When execution stops at any one of the breakpoints, then look around
and see if you can find clues about "prev_first_lsn < cur_txn->first_lsn"
even though the error has not yet occurred.  Perhaps this will help
identify location(s) that might be closer to the actual error
when it does happen.  This might suggest commands for the non-interactive
gdb debugging script.
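
(A sketch of what that could look like, relying only on names from the traceback above - the variable names are taken straight from the assertion text and the file:line from the TRAP message, so they may need adjusting to the actual code:)

    # stop only when the assertion at reorderbuffer.c:902 is about to fail,
    # rather than on every pass where the condition holds
    break reorderbuffer.c:902 if !(prev_first_lsn < cur_txn->first_lsn)
    continue
    # if/when it stops here, look around before ExceptionalCondition fires:
    bt full
    print prev_first_lsn
    print *cur_txn

The same lines could of course go into the non-interactive script mentioned above.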


This does not work, I'm afraid. The issue is a (rare) race condition, and we run the assert thousands of times and it's fine 99.999% of the time. The breakpoint & interactive reconnaissance is unlikely to find anything 99% of the time, and it can easily make the race condition go away by changing the timing. That's kinda the interesting thing - this is not an issue valgrind is meant to discover, it's just that it seems to change the timing just enough to increase the probability.

regards
Tomas


