Re: rare crash - FailedAssertion snapbuild.c Line: 580
On 2018-08-30 16:44, Alvaro Herrera wrote: On 2018-Aug-30, Erik Rijkers wrote: ok, is this any use? Seems mostly good, but the Xids are not printed. Could you please do "bt full"? Also: frame 3 print *snap See the attached. # gdb --quiet -ex 'bt full' --batch /var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres /var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core &> gdb_bt_full.txt [New LWP 147484] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: walsender rijkers [local] idle in transaction '. Program terminated with signal SIGABRT, Aborted. #0 0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. #0 0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 resultvar = 0 pid = 147484 selftid = 147484 #1 0x7f0fd20e8448 in __GI_abort () at abort.c:89 save_stage = 2 act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 35419576, 139705945370615, 1, 0, 2, 139705925209384, 773155, 35419576, 36381032, 139705945396501, 0, 139705926081536, 0, 139705929013136, 139705929007200}}, sa_flags = -748812544, sa_restorer = 0x7f0fd20fd99a <_IO_vfprintf_internal+22490>} sigs = {__val = {32, 0 }} #2 0x008880bf in ExceptionalCondition (conditionName=conditionName@entry=0xa417f8 "!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", errorType=errorType@entry=0x8d365d "FailedAssertion", fileName=fileName@entry=0xa41223 "snapbuild.c", lineNumber=lineNumber@entry=580) at assert.c:54 No locals. #3 0x0072676e in SnapBuildInitialSnapshot () at snapbuild.c:580 safeXid = 773155 snap = 0x22b2168 xid = 0 newxip = newxcnt = 0 __func__ = "SnapBuildInitialSnapshot" #4 0x0072ed54 in CreateReplicationSlot (cmd=0x223bae0) at walsender.c:951 snap = ctx = 0x21c75b8 need_full_snapshot = true snapshot_name = 0x0 nulls = {false, false, false, false} xloc = "\001\231\034\002\000\000\000\000\030\347\031\002\002\000\000\000\000\000\000\000\002\000\000\000Q\000\000\000\000\000\000\000\370\034\032\002\000\000\000\000\030\347\031\002\000\000\000\000È»\034\002", '\000' slot_name = tstate = tupdesc = values = {0, 103, 8192, 6683594} reserve_wal = snapshot_action = dest = __func__ = "CreateReplicationSlot" #5 exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 "CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527 parse_rc = cmd_node = 0x223bae0 cmd_context = 0x223b410 old_context = 0x21a1be0 __func__ = "exec_replication_command" #6 0x0077e8ee in PostgresMain (argc=, argv=argv@entry=0x21cbbc8, dbname=, username=) at postgres.c:4155 query_string = 0x21a1cf8 "CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT" firstchar = input_message = {data = 0x21a1cf8 "CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT", len = 98, maxlen = 1024, cursor = 98} local_sigjmp_buf = {{__jmpbuf = {657920533, -5186006737813875216, 35251992, 35437512, 1535629124, 35251992, 5188120396153509360, -5186006335271315984}, __mask_was_saved = 1, __saved_mask = {__val = {0, 35437440, 10709497, 35251992, 35265504, 1024, 35437512, 35437512, 9122032, 0, 9094142, 14351168, 140721267739056, 35437512, 905, 35270984 send_ready_for_query = false disable_idle_in_transaction_timeout = false __func__ = "PostgresMain" #7 0x00704fde in BackendRun (port=0x21c4520) at postmaster.c:4361 ac = 1 secs = 588944346 usecs = 999759 i = 1 av = 0x21cbbc8 maxac = 2 __func__ = "BackendRun" #8 BackendStartup (port=0x21c4520) at postmaster.c:4033 bn = pid = __func__ = "BackendStartup" #9 ServerLoop () at postmaster.c:1706 port = 0x21c4520 rmask = {fds_bits = {32, 0 }} selres = now = readmask = {fds_bits = {56, 0 }} nSockets = 6 last_lockfile_recheck_time = last_touch_time = __func__ = "ServerLoop" #10 0x00705e0f in PostmasterMain (argc=argc@entry=12, argv=argv@entry=0x219c470) at postmaster.c:1379 opt = status = userDoption = listen_addr_saved = i = output_config_variable = __func__ = "PostmasterMain" #11 0x00478d80 in main (argc=12,
Re: rare crash - FailedAssertion snapbuild.c Line: 580
On 2018-Aug-30, Erik Rijkers wrote: > ok, is this any use? Seems mostly good, but the Xids are not printed. Could you please do "bt full"? Also: frame 3 print *snap Thanks, -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: rare crash - FailedAssertion snapbuild.c Line: 580
On 2018-08-29 21:15, Andres Freund wrote: Hi, On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote: To test postgres 11, I still regularly run series of short sessions of pgbench-over-logical-replication (basically the same thing that I used last year [1] - now in a perl incarnation). Most of the time the replication is stable and finishes correctly but sometimes (rarely) I get: TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", File: "snapbuild.c", Line: 580) This will probably be difficult to reproduce and to act upon but I wanted to report it anyway as in the course of the last few months I have seen it several times, on several machines. Always rarely, always postgres 11 (I did not try other versions). Thanks for testing! Could you possibly run the tests with core files enabled, so we at get a backtrace in case of trouble? Knowing what the values here are would be tremendously helpful... ok, is this any use? $ gdb --quiet /var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres /var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core Reading symbols from /var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done. [New LWP 147484] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: walsender rijkers [local] idle in transaction '. Program terminated with signal SIGABRT, Aborted. #0 0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f0fd20e8448 in __GI_abort () at abort.c:89 #2 0x008880bf in ExceptionalCondition (conditionName=conditionName@entry=0xa417f8 "!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", errorType=errorType@entry=0x8d365d "FailedAssertion", fileName=fileName@entry=0xa41223 "snapbuild.c", lineNumber=lineNumber@entry=580) at assert.c:54 #3 0x0072676e in SnapBuildInitialSnapshot () at snapbuild.c:580 #4 0x0072ed54 in CreateReplicationSlot (cmd=0x223bae0) at walsender.c:951 #5 exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 "CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527 #6 0x0077e8ee in PostgresMain (argc=, argv=argv@entry=0x21cbbc8, dbname=, username=out>) at postgres.c:4155 #7 0x00704fde in BackendRun (port=0x21c4520) at postmaster.c:4361 #8 BackendStartup (port=0x21c4520) at postmaster.c:4033 #9 ServerLoop () at postmaster.c:1706 #10 0x00705e0f in PostmasterMain (argc=argc@entry=12, argv=argv@entry=0x219c470) at postmaster.c:1379 #11 0x00478d80 in main (argc=12, argv=0x219c470) at main.c:228 (gdb)
Re: rare crash - FailedAssertion snapbuild.c Line: 580
Hi, On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote: > To test postgres 11, I still regularly run series of short sessions of > pgbench-over-logical-replication (basically the same thing that I used last > year [1] - now in a perl incarnation). Most of the time the replication is > stable and finishes correctly but sometimes (rarely) I get: > > TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, > snap->xmin))", File: "snapbuild.c", Line: 580) > > This will probably be difficult to reproduce and to act upon but I wanted to > report it anyway as in the course of the last few months I have seen it > several times, on several machines. Always rarely, always postgres 11 (I did > not try other versions). Thanks for testing! Could you possibly run the tests with core files enabled, so we at get a backtrace in case of trouble? Knowing what the values here are would be tremendously helpful... Greetings, Andres Freund
rare crash - FailedAssertion snapbuild.c Line: 580
Hello, To test postgres 11, I still regularly run series of short sessions of pgbench-over-logical-replication (basically the same thing that I used last year [1] - now in a perl incarnation). Most of the time the replication is stable and finishes correctly but sometimes (rarely) I get: TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", File: "snapbuild.c", Line: 580) This will probably be difficult to reproduce and to act upon but I wanted to report it anyway as in the course of the last few months I have seen it several times, on several machines. Always rarely, always postgres 11 (I did not try other versions). Erik Rijkers [1] https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl source/version: bf2d0462cd73 / REL_11_STABLE (compiled 20180828_1629) 1 primary, 4 replicas (on 1 host). Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres then: wait for log-repl sync. (I also added extra data type columns to the basic pgbench tables to test all the different data types) Below is a grep for the assert message, with 15 surrounding lines from the latest occurrence. $ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B 15 -Ei 'trap|assert' 65*/logfile.65* | less 6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG: logical decoding found consistent point at 0/89E038E0 6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL: There are no running transactions. 6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG: received replication command: CREATE_REPLICATION_SLOT "sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG: logical decoding found initial starting point at 0/8AF5D590 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL: Waiting for transactions (approximately 1) older than 886297 to end. 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG: logical decoding found consistent point at 0/8B018310 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL: There are no running transactions. 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG: received replication command: START_REPLICATION SLOT "sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1', publication_names '"pub1_6516"') 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG: starting logical decoding for slot "sub2_6517_6517_18834_sync_18793" 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL: Streaming transactions committing after 0/89E03918, reading WAL from 0/841EE960. 6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG: logical decoding found consistent point at 0/841EF3E0 6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL: Logical decoding will begin using saved snapshot. 6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG: received replication command: CREATE_REPLICATION_SLOT "sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG: logical decoding found consistent point at 0/8C6BA408 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL: There are no running transactions. 6516JTq_E8/logfile.6516:TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", File: "snapbuild.c", Line: 580) 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG: server process (PID 139541) was terminated by signal 6: Aborted 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL: Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG: terminating any other active server processes 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING: terminating connection because of crash of another server process 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT: In a moment you should be able to reconnect to the database and repeat your command. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT: slot "sub2_6517_6517", output plugin "pgoutput", in the change callback, associated LSN 0/8B11CF00 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING: terminating connection because of crash of another server process 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL: The postmaster has