On 9/9/22 04:58, John Reiser wrote:
1. Describe the environment completely.
Also: Any kind of threading (pthreads, or shm_open, or
mmap(,,,MAP_SHARED,,))
must be mentioned explicitly. Multiple execution contexts which access
the same address space instance are a significant complicating factor.
If threading is involved, then try using "valgrind --tool=drd ..."
or --tool=helgrind, because those tools specifically target detecting
race conditions and other synchronization errors, much like --tool=memcheck
[the default tool when no --tool= is mentioned] targets errors involving
malloc() and free(), uninitialized variables, etc.
No threading is used. Postgres is multi-process, and uses shared memory
for the shared cache (through shm_open etc.). FWIW, as I mentioned
before, this works perfectly fine when the core is not generated by
valgrind.
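In case it matters, the sharing is the usual shm_open + mmap(MAP_SHARED) pattern across processes. A simplified sketch of that pattern (not an excerpt from Postgres; names made up, error handling omitted):

    #include <fcntl.h>      /* O_CREAT, O_RDWR */
    #include <sys/mman.h>   /* shm_open, mmap, munmap, shm_unlink */
    #include <unistd.h>     /* ftruncate */

    int main(void)
    {
        /* create a named segment and map it MAP_SHARED; any other process
           that shm_open()s and mmap()s the same name sees the same memory,
           without any pthreads being involved */
        int fd = shm_open("/demo_segment", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        p[0] = 1;            /* visible to every process mapping the segment */
        munmap(p, 4096);
        shm_unlink("/demo_segment");
        return 0;
    }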
4. Walk before attempting to run.
Did you try a simple example? Write a half-page program with 5 subroutines, each of which calls the next one, and the last one sends SIGABRT to the process. When that program is run under valgrind, does the resulting core file give the correct traceback in gdb?
Specifically: apply valgrind to the small program which causes a deliberate SIGABRT, and get a core file. Does gdb give the correct traceback for that core file? If not, then you have an ideal test case for filing a bug report against valgrind, because even the simple core file is bad. If gdb does give a correct traceback for the simple core file, then you have to keep looking for the source of the problem in your larger program.
I'll try this once I have access to the machine early next week.
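For reference, I assume you mean something like this trivial program (five functions each calling the next, the last one raising SIGABRT; compiled with -g -O0 so the frames don't get inlined away):

    #include <signal.h>

    /* each function calls the next; the last one raises SIGABRT so a core
       file is produced and gdb's traceback for it can be checked */
    static void f5(void) { raise(SIGABRT); }
    static void f4(void) { f5(); }
    static void f3(void) { f4(); }
    static void f2(void) { f3(); }
    static void f1(void) { f2(); }

    int main(void)
    {
        f1();
        return 0;
    }

I'd run it once natively and once under valgrind, and compare what gdb makes of the two core files.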
5. (Learn and) Use the built-in tools where possible.
Run the process interactively, invoking valgrind with "--vgdb-error=0",
and giving the debugger command "(gdb) continue" after establishing
connectivity between vgdb and the process.
See the valgrind manual, section 3.2.9 "vgdb command line options".
When the SIGABRT happens, vgdb will allow you to use all the ordinary gdb commands to get a backtrace, go up and down the stack, examine variables and other memory, and run
    (gdb) info proc
    (gdb) shell cat /proc/$PID/maps
to see exactly the layout of process memory, etc.
There are also special commands to access valgrind functionality
interactively, such as checking for memory leaks.
I already explained why I don't want / can't use the interactive gdb.
I'm aware of the option, I've used it before, but in this case it's
not very practical.
The gdb process does not *have* to be run interactively; it just takes more work and patience to run non-interactively. Run "valgrind --vgdb-error=0 ..." and notice the last part of the printed instructions:
and then give GDB the following command
==215935== target remote | /path/to/libexec/valgrind/../../bin/vgdb --pid=215935
==215935== --pid is optional if only one valgrind process is running
So if there is only one valgrind process, then you do not need to know
the pid.
Thus you can run gdb with redirected stdin/stdout/stderr, or perhaps use the -x command-line option. This allows a static, pre-scripted list of gdb commands; it may require a few iterations to get a good debug script. (Try the commands using the trivial SIGABRT case!) Also get the full gdb manual (more than 800 pages) and look at the "thread apply all ..." and "frame apply all ..." commands.
Sure, but that's more of a workaround - it does not make the core file useful, it just provides an alternative way to get to the same result. Plus it requires additional tooling/scripting, and I'd prefer keeping the tooling as simple as possible.
Postgres is a multi-process system that runs a bunch of management processes and client processes (1:1 to connections). We don't know in which one an issue might happen, so we'd have to attach a script to each of them.
Furthermore, there's the question of performance - we run these tests on many machines (although only some of them run the tests under valgrind), and valgrind makes things fairly slow already. If this vgdb thing makes it even slower, that'd be an issue. But I haven't measured it, so maybe it's not as bad as I fear.
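To be concrete about the tooling I mean, I suppose the non-interactive variant would be roughly this (untested; the script name and the <PID> placeholder are mine, the vgdb path is the one valgrind prints):

    # commands.gdb - fed to gdb non-interactively
    set pagination off
    target remote | /path/to/libexec/valgrind/../../bin/vgdb --pid=<PID>
    continue
    # once the SIGABRT stops the process, dump whatever we can
    thread apply all bt full
    info proc

invoked as something like "gdb -batch -x commands.gdb /path/to/postgres", once per backend we might care about.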
It may be possible to perform some interactive "reconnaissance" to suggest good things for the script to try. Using --vgdb-error=0, put a breakpoint at a likely location for the error (or shortly before the error), and look around. In the logged traceback:
TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
"reorderbuffer.c", Line: 902, PID: 536049)
(ExceptionalCondition+0x98)[0x8f5cec]
(+0x57a574)[0x682574]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
(SnapBuildProcessNewCid+0x94)[0x68b6a4]
any of those named locations, or shortly before them, might be a good spot. When execution stops at one of the breakpoints, look around and see if you can find clues about "prev_first_lsn < cur_txn->first_lsn" even though the error has not yet occurred. Perhaps this will help identify location(s) that might be closer to the actual error when it does happen. This might suggest commands for the non-interactive gdb debugging script.
This does not work, I'm afraid. The issue is a (rare) race condition - we run the assert thousands of times and it's fine 99.999% of the time. The breakpoint & interactive reconnaissance is unlikely to find anything 99% of the time, and it can easily make the race condition go away by changing the timing. That's kinda the interesting thing - this is not an issue valgrind is meant to discover; it just seems to change the timing enough to increase the probability.
regards
Tomas