Hi,
I'm having some issues analyzing cores generated by valgrind. I do
get the core file, but when I open it in gdb it shows entirely bogus
information - backtrace and so on.
This is an rpi4 machine running 64-bit Debian, with a local build of
valgrind 3.19.0 (built from sources, not from a package).
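In case the build matters: it was the stock procedure from the 3.19.0
tarball, roughly the following (the prefix here is just an example):

./configure --prefix=/usr/local
make
sudo make install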
This is how I run the program (the postgres binary):
valgrind --quiet --trace-children=yes --track-origins=yes \
--read-var-info=yes --num-callers=20 --leak-check=no \
--gen-suppressions=all --error-limit=no \
--log-file=/tmp/valgrind.543917.log postgres \
-D /home/debian/postgres/contrib/test_decoding/tmp_check_iso/data \
-F -c listen_addresses= -k /tmp/pg_regress-n7HodE
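One detail that may matter: AFAIK valgrind honors the usual core file
size limit when writing the core, so the shell has it unrestricted:

ulimit -c unlimited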
I get a ~200MB core file in /tmp, which I try loading like this:
gdb src/backend/postgres /tmp/valgrind.542299.log.core.542391
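In case someone suspects a binary/core mismatch: as far as I know the
build IDs of the modules in the core can be listed with eu-unstrip
(from elfutils) and compared against the binary, along these lines:

eu-unstrip -n -e src/backend/postgres --core=/tmp/valgrind.542299.log.core.542391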
but all I get is this:
Reading symbols from src/backend/postgres...
[New LWP 542391]
Cannot access memory at address 0xcc10cc00cbf0cc6
Cannot access memory at address 0xcc10cc00cbf0cbe
Core was generated by `'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00000000049d42ac in ?? ()
(gdb) bt
#0 0x00000000049d42ac in ?? ()
#1 0x0000000000400000 in dshash_dump (hash_table=0x0) at dshash.c:782
#2 0x0000000000400000 in dshash_dump (hash_table=0x49c0e44) at dshash.c:782
#3 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
So the stack might be corrupt, for some reason? The first part of the
backtrace looks entirely bogus too, though. The file size at least seems
plausible - with 128MB of shared buffers, ~200MB is roughly what I'd
expect.
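Happy to provide more raw detail from the core if that helps - e.g.
output of the usual gdb commands:

(gdb) info registers
(gdb) info proc mappings
(gdb) x/8gx $sp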
The core is triggered by an "assert" in the source, and we even write a
backtrace into the server log - and that one seems much more plausible:
TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
"reorderbuffer.c", Line: 902, PID: 536049)
(ExceptionalCondition+0x98)[0x8f5cec]
(+0x57a574)[0x682574]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
(SnapBuildProcessNewCid+0x94)[0x68b6a4]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62e2ac)[0x7362ac]
(PortalRun+0x27c)[0x735f08]
(+0x626be8)[0x72ebe8]
(PostgresMain+0x9a0)[0x733e9c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]
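(The anonymous +0x... frames are just offsets into the postgres binary;
assuming the offsets in parentheses are relative to the binary, they
should resolve with addr2line, e.g. for the second frame:

addr2line -f -e src/backend/postgres 0x57a574
)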
Clearly, this is not an issue valgrind is meant to detect (like an
invalid memory access) but an application issue. I've tried reproducing
it without valgrind, but it only ever happens under valgrind - my theory
is it's some sort of race condition, and valgrind changes the timing in
a way that makes it much more likely to hit. Hence the need to analyze
the core, to inspect the state more closely.
Any ideas what I might be doing wrong? Or how do I load the core file?
thanks
Tomas