Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-30 Thread Erik Rijkers

On 2018-08-30 16:44, Alvaro Herrera wrote:

On 2018-Aug-30, Erik Rijkers wrote:


ok, is this any use?


Seems mostly good, but the Xids are not printed.  Could you please do
"bt full"?  Also:

frame 3
print *snap


See the attached.




# gdb --quiet -ex 'bt full' --batch 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres 
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core  &>  
gdb_bt_full.txt

[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in transaction  
 '.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
resultvar = 0
pid = 147484
selftid = 147484
#1  0x7f0fd20e8448 in __GI_abort () at abort.c:89
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 35419576, 139705945370615, 1, 0, 2, 139705925209384, 
773155, 35419576, 36381032, 139705945396501, 0, 139705926081536, 0, 
139705929013136, 139705929007200}}, sa_flags = -748812544, sa_restorer = 
0x7f0fd20fd99a <_IO_vfprintf_internal+22490>}
sigs = {__val = {32, 0 }}
#2  0x008880bf in ExceptionalCondition 
(conditionName=conditionName@entry=0xa417f8 
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
errorType=errorType@entry=0x8d365d "FailedAssertion", 
fileName=fileName@entry=0xa41223 "snapbuild.c", 
lineNumber=lineNumber@entry=580) at assert.c:54
No locals.
#3  0x0072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
safeXid = 773155
snap = 0x22b2168
xid = 0
newxip = 
newxcnt = 0
__func__ = "SnapBuildInitialSnapshot"
#4  0x0072ed54 in CreateReplicationSlot (cmd=0x223bae0) at 
walsender.c:951
snap = 
ctx = 0x21c75b8
need_full_snapshot = true
snapshot_name = 0x0
nulls = {false, false, false, false}
xloc = 
"\001\231\034\002\000\000\000\000\030\347\031\002\002\000\000\000\000\000\000\000\002\000\000\000Q\000\000\000\000\000\000\000\370\034\032\002\000\000\000\000\030\347\031\002\000\000\000\000È»\034\002",
 '\000' 
slot_name = 
tstate = 
tupdesc = 
values = {0, 103, 8192, 6683594}
reserve_wal = 
snapshot_action = 
dest = 
__func__ = "CreateReplicationSlot"
#5  exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL 
pgoutput USE_SNAPSHOT") at walsender.c:1527
parse_rc = 
cmd_node = 0x223bae0
cmd_context = 0x223b410
old_context = 0x21a1be0
__func__ = "exec_replication_command"
#6  0x0077e8ee in PostgresMain (argc=, 
argv=argv@entry=0x21cbbc8, dbname=, username=) at 
postgres.c:4155
query_string = 0x21a1cf8 "CREATE_REPLICATION_SLOT 
\"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT"
firstchar = 
input_message = {data = 0x21a1cf8 "CREATE_REPLICATION_SLOT 
\"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT", 
len = 98, maxlen = 1024, cursor = 98}
local_sigjmp_buf = {{__jmpbuf = {657920533, -5186006737813875216, 
35251992, 35437512, 1535629124, 35251992, 5188120396153509360, 
-5186006335271315984}, __mask_was_saved = 1, __saved_mask = {__val = {0, 
35437440, 10709497, 35251992, 35265504, 1024, 35437512, 35437512, 9122032, 0, 
9094142, 14351168, 140721267739056, 35437512, 905, 35270984
send_ready_for_query = false
disable_idle_in_transaction_timeout = false
__func__ = "PostgresMain"
#7  0x00704fde in BackendRun (port=0x21c4520) at postmaster.c:4361
ac = 1
secs = 588944346
usecs = 999759
i = 1
av = 0x21cbbc8
maxac = 2
__func__ = "BackendRun"
#8  BackendStartup (port=0x21c4520) at postmaster.c:4033
bn = 
pid = 
__func__ = "BackendStartup"
#9  ServerLoop () at postmaster.c:1706
port = 0x21c4520
rmask = {fds_bits = {32, 0 }}
selres = 
now = 
readmask = {fds_bits = {56, 0 }}
nSockets = 6
last_lockfile_recheck_time = 
last_touch_time = 
__func__ = "ServerLoop"
#10 0x00705e0f in PostmasterMain (argc=argc@entry=12, 
argv=argv@entry=0x219c470) at postmaster.c:1379
opt = 
status = 
userDoption = 
listen_addr_saved = 
i = 
output_config_variable = 
__func__ = "PostmasterMain"
#11 0x00478d80 in main (argc=12, 

Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-30 Thread Alvaro Herrera
On 2018-Aug-30, Erik Rijkers wrote:

> ok, is this any use?

Seems mostly good, but the Xids are not printed.  Could you please do
"bt full"?  Also:

frame 3
print *snap

Thanks,

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-30 Thread Erik Rijkers

On 2018-08-29 21:15, Andres Freund wrote:

Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:

To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used 
last
year [1] - now in a perl incarnation).  Most of the time the 
replication is

stable and finishes correctly but sometimes (rarely) I get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I 
wanted to
report it anyway as in the course of the last few months I have seen 
it
several times, on several machines. Always rarely, always postgres 11 
(I did

not try other versions).


Thanks for testing! Could you possibly run the tests with core files
enabled, so we at get a backtrace in case of trouble?  Knowing what the
values here are would be tremendously helpful...


ok, is this any use?

$ gdb --quiet 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres  
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core
Reading symbols from 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done.

[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in 
transaction   '.

Program terminated with signal SIGABRT, Aborted.
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or 
directory.

(gdb) bt
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

#1  0x7f0fd20e8448 in __GI_abort () at abort.c:89
#2  0x008880bf in ExceptionalCondition 
(conditionName=conditionName@entry=0xa417f8 
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
errorType=errorType@entry=0x8d365d "FailedAssertion", 
fileName=fileName@entry=0xa41223 "snapbuild.c", 
lineNumber=lineNumber@entry=580) at assert.c:54

#3  0x0072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
#4  0x0072ed54 in CreateReplicationSlot (cmd=0x223bae0) at 
walsender.c:951
#5  exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY 
LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527
#6  0x0077e8ee in PostgresMain (argc=, 
argv=argv@entry=0x21cbbc8, dbname=, username=out>) at postgres.c:4155
#7  0x00704fde in BackendRun (port=0x21c4520) at 
postmaster.c:4361

#8  BackendStartup (port=0x21c4520) at postmaster.c:4033
#9  ServerLoop () at postmaster.c:1706
#10 0x00705e0f in PostmasterMain (argc=argc@entry=12, 
argv=argv@entry=0x219c470) at postmaster.c:1379

#11 0x00478d80 in main (argc=12, argv=0x219c470) at main.c:228
(gdb)






Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:
> To test postgres 11, I still regularly run series of short sessions of
> pgbench-over-logical-replication (basically the same thing that I used last
> year [1] - now in a perl incarnation).  Most of the time the replication is
> stable and finishes correctly but sometimes (rarely) I get:
> 
> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
> snap->xmin))", File: "snapbuild.c", Line: 580)
> 
> This will probably be difficult to reproduce and to act upon but I wanted to
> report it anyway as in the course of the last few months I have seen it
> several times, on several machines. Always rarely, always postgres 11 (I did
> not try other versions).

Thanks for testing! Could you possibly run the tests with core files
enabled, so we at get a backtrace in case of trouble?  Knowing what the
values here are would be tremendously helpful...

Greetings,

Andres Freund



rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-29 Thread Erik Rijkers

Hello,

To test postgres 11, I still regularly run series of short sessions of 
pgbench-over-logical-replication (basically the same thing that I used 
last year [1] - now in a perl incarnation).  Most of the time the 
replication is stable and finishes correctly but sometimes (rarely) I 
get:


TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, 
snap->xmin))", File: "snapbuild.c", Line: 580)


This will probably be difficult to reproduce and to act upon but I 
wanted to report it anyway as in the course of the last few months I 
have seen it several times, on several machines. Always rarely, always 
postgres 11 (I did not try other versions).



Erik Rijkers


[1] 
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl



source/version: bf2d0462cd73 / REL_11_STABLE  (compiled 20180828_1629)

1 primary, 4 replicas (on 1 host).
Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres
  then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres
  then: wait for log-repl sync.
(I also added extra data type columns to the basic pgbench tables to 
test all the different data types)


Below is a grep for the assert message, with 15 surrounding lines from 
the latest occurrence.


$ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B 
15 -Ei 'trap|assert' 65*/logfile.65* | less
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG:  
logical decoding found consistent point at 0/89E038E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG:  
received replication command: CREATE_REPLICATION_SLOT 
"sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG:  
logical decoding found initial starting point at 0/8AF5D590
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL:  
Waiting for transactions (approximately 1) older than 886297 to end.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG:  
logical decoding found consistent point at 0/8B018310
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG:  
received replication command: START_REPLICATION SLOT 
"sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1', 
publication_names '"pub1_6516"')
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG:  
starting logical decoding for slot "sub2_6517_6517_18834_sync_18793"
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL:  
Streaming transactions committing after 0/89E03918, reading WAL from 
0/841EE960.
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG:  
logical decoding found consistent point at 0/841EF3E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL:  
Logical decoding will begin using saved snapshot.
6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG:  
received replication command: CREATE_REPLICATION_SLOT 
"sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG:  
logical decoding found consistent point at 0/8C6BA408
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516:TRAP: 
FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
File: "snapbuild.c", Line: 580)
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:  
server process (PID 139541) was terminated by signal 6: Aborted
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL:  
Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE 
READ
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:  
terminating any other active server processes
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL:  
The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT:  In 
a moment you should be able to reconnect to the database and repeat your 
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT:  
slot "sub2_6517_6517", output plugin "pgoutput", in the change callback, 
associated LSN 0/8B11CF00
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL:  
The postmaster has