Hi,

I'm having some issues analyzing core files generated under valgrind. I do get the core file, but when I try to open it in gdb, all I get is entirely bogus information (backtrace etc.).

This is an rpi4 machine with 64-bit Debian, running a local build of valgrind 3.19.0 (built from source, not a package).

This is how I run the program (the postgres binary):

  valgrind --quiet --trace-children=yes --track-origins=yes \
  --read-var-info=yes --num-callers=20 --leak-check=no \
  --gen-suppressions=all --error-limit=no \
  --log-file=/tmp/valgrind.543917.log postgres \
  -D /home/debian/postgres/contrib/test_decoding/tmp_check_iso/data \
  -F -c listen_addresses= -k /tmp/pg_regress-n7HodE

I get a ~200MB core file in /tmp, which I try loading like this:

  gdb src/backend/postgres /tmp/valgrind.542299.log.core.542391

but all I get is this:

  Reading symbols from src/backend/postgres...
  [New LWP 542391]
  Cannot access memory at address 0xcc10cc00cbf0cc6
  Cannot access memory at address 0xcc10cc00cbf0cbe
  Core was generated by `'.
  Program terminated with signal SIGABRT, Aborted.
  #0  0x00000000049d42ac in ?? ()
  (gdb) bt
  #0  0x00000000049d42ac in ?? ()
  #1  0x0000000000400000 in dshash_dump (hash_table=0x0) at dshash.c:782
  #2  0x0000000000400000 in dshash_dump (hash_table=0x49c0e44) at dshash.c:782
  #3  0x0000000000000000 in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)

So the stack might be corrupt, for some reason? The first frames look entirely bogus too, though. The file size at least seems plausible - with 128MB of shared buffers, ~200MB is about what I'd expect.
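
I suppose I could also dump the raw note segments from the core with readelf (from binutils), to see whether the PID and registers recorded there look any saner than what gdb prints - something like:

  # list the ELF notes (NT_PRSTATUS, NT_FILE, ...) stored in the core
  readelf -n /tmp/valgrind.542299.log.core.542391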

The core is triggered by an assert in the source, and we even log a backtrace into the server log - that one looks much more plausible:

  TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 902, PID: 536049)
  (ExceptionalCondition+0x98)[0x8f5cec]
  (+0x57a574)[0x682574]
  (+0x579edc)[0x681edc]
  (ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
  (SnapBuildProcessNewCid+0x94)[0x68b6a4]
  (heap2_decode+0x17c)[0x671584]
  (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
  (+0x570f88)[0x678f88]
  (pg_logical_slot_get_changes+0x1c)[0x6790fc]
  (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
  (+0x3be638)[0x4c6638]
  (+0x3a2c14)[0x4aac14]
  (ExecScan+0x8c)[0x4aaca8]
  (+0x3bea14)[0x4c6a14]
  (+0x39ea60)[0x4a6a60]
  (+0x392378)[0x49a378]
  (+0x39520c)[0x49d20c]
  (standard_ExecutorRun+0x214)[0x49aad8]
  (ExecutorRun+0x64)[0x49a8b8]
  (+0x62e2ac)[0x7362ac]
  (PortalRun+0x27c)[0x735f08]
  (+0x626be8)[0x72ebe8]
  (PostgresMain+0x9a0)[0x733e9c]
  (+0x547be8)[0x64fbe8]
  (+0x547540)[0x64f540]
  (+0x542d30)[0x64ad30]
  (PostmasterMain+0x1460)[0x64a574]
  (+0x418888)[0x520888]
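
As an aside, the anonymous (+0x...) frames in that backtrace can presumably be resolved against the binary with addr2line - assuming this is the same build that produced the log, and using the parenthesized offsets for a PIE binary (for a non-PIE binary the bracketed absolute addresses should be the right thing). Roughly:

  # map a couple of the anonymous frames back to function/file:line
  addr2line -f -e src/backend/postgres 0x57a574 0x579edc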

Clearly, this is not the kind of issue valgrind is meant to detect (invalid memory access etc.), but an application bug. I've tried reproducing it without valgrind, but it only ever happens under valgrind - my theory is it's some sort of race condition, and valgrind changes the timing in a way that makes it much more likely to hit. That's why I need to analyze the core and inspect the state more closely.
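
One thing I haven't tried yet is skipping the core entirely and attaching gdb through valgrind's gdbserver while the process is still alive - if I understand the manual correctly, roughly:

  # make valgrind's gdbserver wait for a gdb connection before running
  valgrind --vgdb=yes --vgdb-error=0 ... postgres ...

  # then, from another terminal
  gdb src/backend/postgres
  (gdb) target remote | vgdb --pid=<pid of the failing backend>

but ideally I'd still like to understand why the core itself doesn't load.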

Any ideas what I might be doing wrong, or how I should be loading the core file?


thanks
Tomas


