On 9/9/22 04:58, John Reiser wrote:
1. Describe the environment completely.

Also: Any kind of threading (pthreads, or shm_open, or mmap(,,,MAP_SHARED,,))
must be mentioned explicitly.  Multiple execution contexts which access
the same address space instance are a significant complicating factor.

If threading is involved, then try using "valgrind --tool=drd ..."
or --tool=helgrind, because those tools specifically target detecting
race conditions and other synchronization errors, much like --tool=memcheck
[the default tool when no --tool= is mentioned] targets errors involving
malloc() and free(), uninitialized variables, etc.


No threading is used. Postgres is multi-process, and uses shared memory for the shared cache (through shm_open etc.). FWIW, as I mentioned before, this works perfectly fine when the core is not generated by valgrind.
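
(Just to illustrate the mechanism for anyone following along - this is not Postgres code, only a rough sketch of how a multi-process shared segment is typically set up with shm_open + mmap(MAP_SHARED); the segment name is made up, and older glibc may need -lrt:)

    /* rough sketch: create/attach a cross-process shared memory segment */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *attach_shared(size_t size)
    {
        int fd = shm_open("/example_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, size) != 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping survives the close */
        return (p == MAP_FAILED) ? NULL : p;
    }

Every process that maps the same name sees the same memory, so it is multiple processes - not threads - touching one region.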

4. Walk before attempting to run.
Did you try a simple example?  Write a half-page program with 5 subroutines, each of which calls the next one, with the last one sending SIGABRT to the process.

Does the core file produced when running under valgrind give the correct traceback in gdb?

Specifically: apply valgrind to the small program which causes a deliberate SIGABRT, and get a core file.  Does gdb give the correct traceback for that core file?  If not, then you have an ideal test case for filing a bug report against valgrind, because even the simple core file is bad.  If gdb does give a correct traceback for the simple core file, then you have to keep looking for the source of the
problem in your larger program.
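
(For the record, a minimal sketch of such a test program - plain C, nothing Postgres-specific - might be:)

    /* five functions, each calling the next; the last raises SIGABRT */
    #include <signal.h>

    static void f5(void) { raise(SIGABRT); }
    static void f4(void) { f5(); }
    static void f3(void) { f4(); }
    static void f2(void) { f3(); }
    static void f1(void) { f2(); }

    int main(void)
    {
        f1();
        return 0;
    }

Compile it with -g, run it under valgrind so a core file is produced, and then check whether gdb shows f1 through f5 in the backtrace from that core.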


I'll try this once I have access to the machine early next week.


5. (Learn and) Use the built-in tools where possible.
Run the process interactively, invoking valgrind with "--vgdb-error=0",
and giving the debugger command "(gdb) continue" after establishing
connectivity between vgdb and the process.
See the valgrind manual, section 3.2.9 "vgdb command line options".
When the SIGABRT happens, then vgdb will allow you to use all the ordinary
gdb commands to get a backtrace, go up and down the stack, examine
variables and other memory, run
    (gdb) info proc
    (gdb) shell cat /proc/$PID/maps
to see exactly the layout of process memory, etc.
There are also special commands to access valgrind functionality
interactively, such as checking for memory leaks.


I already explained why I don't want / can't use the interactive gdb. I'm aware of the option, I've used it before, but in this case it's not very practical.

The gdb process does not *have* to be run interactively; it just takes more work
and patience to run it non-interactively.  Run "valgrind --vgdb-error=0 ..."
and notice the last part of the printed instructions:

     ==215935== and then give GDB the following command
     ==215935==   target remote | /path/to/libexec/valgrind/../../bin/vgdb --pid=215935
     ==215935== --pid is optional if only one valgrind process is running

So if there is only one valgrind process, then you do not need to know the pid. Thus you can run gdb with redirected stdin/stdout/stderr, or perhaps use the -x command-line option.  This allows a static, pre-scripted list of gdb commands; it may require a few iterations to get a good debug script.  (Try the commands using the trivial SIGABRT case!)  Also get the full gdb manual (more than 800 pages)
and look at the "thread apply all ..." and "frame apply all ..." commands.
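
(Roughly, and just as a shape to start from, such a pre-scripted run could be a gdb command file fed in with -x; the file name and the exact commands below are placeholders to adapt:)

    # debug.gdb -- placeholder command file, e.g. run as:
    #   gdb -batch -x debug.gdb /path/to/postgres > gdb.log 2>&1
    target remote | /path/to/libexec/valgrind/../../bin/vgdb
    # add --pid=<PID> to the vgdb invocation if several valgrind processes run
    continue
    # control returns here once the process stops (e.g. on the SIGABRT)
    thread apply all bt full
    info proc
    detach

With -batch, gdb exits after the last command, so the whole thing can be launched from the test harness alongside the server.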


Sure, but that's more of a workaround - it does not make the core file useful, it just provides an alternative way to get to the same result. Plus it requires additional tooling/scripting, and I'd prefer to keep the tooling as simple as possible.

Postgres is a multi-process system that runs a bunch of management processes, plus client processes (1:1 to connections). We don't know in which one an issue might happen, so we'd have to attach a script to each of them.

Furthermore, there's the question of performance - we run these tests on many machines (although only some of them run the tests under valgrind), and valgrind makes things fairly slow already - if this vgdb thing makes it even slower, that'd be an issue. But I haven't measured it, so maybe it's not as bad as I fear.

It may be possible to perform some interactive "reconnaissance" to suggest
good things for the script to try.  Using --vgdb-error=0, put a breakpoint
on a likely location for the error (or shortly before the error),
and look around.  In the logged traceback:

  TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 902, PID: 536049)
   (ExceptionalCondition+0x98)[0x8f5cec]
   (+0x57a574)[0x682574]
   (+0x579edc)[0x681edc]
   (ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
   (SnapBuildProcessNewCid+0x94)[0x68b6a4]

any of those named locations, or shortly before them, might be a good spot.
When execution stops at any one of the breakpoints, then look around
and see if you can find clues about "prev_first_lsn < cur_txn->first_lsn"
even though the error has not yet occurred.  Perhaps this will help
identify location(s) that might be closer to the actual error
when it does happen.  This might suggest commands for the non-interactive
gdb debugging script.
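
(A sketch of what that could look like, relying only on names from the traceback above - the variable names are taken straight from the assertion text and the file:line from the TRAP message, so they may need adjusting to the actual code:)

    # stop only when the assertion at reorderbuffer.c:902 is about to fail,
    # rather than on every pass where the condition holds
    break reorderbuffer.c:902 if !(prev_first_lsn < cur_txn->first_lsn)
    continue
    # if/when it stops here, look around before ExceptionalCondition fires:
    bt full
    print prev_first_lsn
    print *cur_txn

The same lines could of course go into the non-interactive script mentioned above.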


This does not work, I'm afraid. The issue is a (rare) race condition, and we run the assert thousands of times and it's fine 99.999% of the time. The breakpoint & interactive reconnaissance is unlikely to find anything 99% of the time, and it can easily make the race condition go away by changing the timing. That's kinda the interesting thing - this is not an issue valgrind is meant to discover, it's just that it seems to change the timing just enough to increase the probability.

regards
Tomas


