Re: [HACKERS] assertion failure 9.3.4
Alvaro Herrera wrote: Alvaro Herrera wrote: I'm thinking about the comparison of full infomask as you propose instead of just the bits that we actually care about. I think the only thing that could cause a spurious failure (causing an extra execution of the HeapTupleSatisfiesUpdate call and the stuff below) is somebody setting HEAP_XMIN_COMMITTED concurrently; but that seems infrequent enough that it should pretty harmless. However, should we worry about possible future infomask bit changes that could negatively affect this behavior? Here's a complete patch illustrating what I mean. This is slightly more expensive than straight infomask comparison in terms of machine instructions, but that seems okay to me. I have pushed a slightly tweaked version of this, after verifying that it solves the problem in Andrew's test environment. Josh, if you could verify that it solves the problem for you too, it'd be great. Thanks for the report and test case. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Alvaro Herrera wrote: I'm thinking about the comparison of full infomask as you propose instead of just the bits that we actually care about. I think the only thing that could cause a spurious failure (causing an extra execution of the HeapTupleSatisfiesUpdate call and the stuff below) is somebody setting HEAP_XMIN_COMMITTED concurrently; but that seems infrequent enough that it should pretty harmless. However, should we worry about possible future infomask bit changes that could negatively affect this behavior? Here's a complete patch illustrating what I mean. This is slightly more expensive than straight infomask comparison in terms of machine instructions, but that seems okay to me. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services *** a/src/backend/access/heap/heapam.c --- b/src/backend/access/heap/heapam.c *** *** 2586,2591 compute_infobits(uint16 infomask, uint16 infomask2) --- 2586,2612 } /* + * Given two versions of the same t_infomask for a tuple, compare them and + * return whether the relevant status for a tuple Xmax has changed. This is + * used after a buffer lock has been released and reacquired: we want to ensure + * that the tuple state continues to be the same it was when we previously + * examined it. + * + * Note the Xmax field itself must be compared separately. + */ + static inline bool + xmax_infomsk_changed(uint16 infomask1, uint16 infomask2) + { + const uint16 interesting = + HEAP_XMAX_IS_MULTI | HEAP_XMAX_LOCK_ONLY | HEAP_LOCK_MASK; + + if ((infomask1 interesting) != (infomask2 interesting)) + return true; + + return false; + } + + /* * heap_delete - delete a tuple * * NB: do not call this directly unless you are prepared to deal with *** *** 2722,2728 l1: * update this tuple before we get to this point. Check for xmax * change, and start over if so. */ ! if (!(tp.t_data-t_infomask HEAP_XMAX_IS_MULTI) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data), xwait)) goto l1; --- 2743,2749 * update this tuple before we get to this point. Check for xmax * change, and start over if so. */ ! if (xmax_infomsk_changed(tp.t_data-t_infomask, infomask) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data), xwait)) goto l1; *** *** 2748,2754 l1: * other xact could update this tuple before we get to this point. * Check for xmax change, and start over if so. */ ! if ((tp.t_data-t_infomask HEAP_XMAX_IS_MULTI) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data), xwait)) goto l1; --- 2769,2775 * other xact could update this tuple before we get to this point. * Check for xmax change, and start over if so. */ ! if (xmax_infomsk_changed(tp.t_data-t_infomask, infomask) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data), xwait)) goto l1; *** *** 3275,3281 l2: * update this tuple before we get to this point. Check for xmax * change, and start over if so. */ ! if (!(oldtup.t_data-t_infomask HEAP_XMAX_IS_MULTI) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) goto l2; --- 3296,3302 * update this tuple before we get to this point. Check for xmax * change, and start over if so. */ ! if (xmax_infomsk_changed(oldtup.t_data-t_infomask, infomask) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) goto l2; *** *** 3324,3335 l2: if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask) key_intact) { LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); - /* * recheck the locker; if someone else changed the tuple while * we weren't looking, start over. */ ! if ((oldtup.t_data-t_infomask HEAP_XMAX_IS_MULTI) || !TransactionIdEquals( HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) --- 3345,3355 if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask) key_intact) { LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* * recheck the locker; if someone else changed the tuple while * we weren't looking, start over. */ ! if (xmax_infomsk_changed(oldtup.t_data-t_infomask, infomask) || !TransactionIdEquals( HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) *** *** 3350,3356 l2: * some other xact could update this tuple before we get to * this point. Check for xmax change, and start over if so. */ ! if ((oldtup.t_data-t_infomask HEAP_XMAX_IS_MULTI) || !TransactionIdEquals( HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) --- 3370,3376 * some other xact
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-21 19:43:15 -0400, Andrew Dunstan wrote: On 04/21/2014 02:54 PM, Andres Freund wrote: Hi, I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. What's the assert that makes it happen faster? That might help a lot in constructing a self-contained test. Assertion and *preliminary*, *hacky* fix attached. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 40c09d8a85db137caf42c2dfe36adda63a381c75 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Tue, 22 Apr 2014 08:42:11 +0200 Subject: [PATCH 1/2] Error out when creating a multixact with more than one updater. --- src/backend/access/heap/heapam.c | 3 --- src/backend/access/transam/multixact.c | 15 +++ src/include/access/multixact.h | 3 +++ 3 files changed, 18 insertions(+), 3 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 2605f20..bd26b80 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -190,9 +190,6 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = /* Get the LockTupleMode for a given MultiXactStatus */ #define TUPLOCK_from_mxstatus(status) \ (MultiXactStatusLock[(status)]) -/* Get the is_update bit for a given MultiXactStatus */ -#define ISUPDATE_from_mxstatus(status) \ - ((status) MultiXactStatusForUpdate) /* * heap support routines diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index d4ad678..4ff1e58 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -713,6 +713,21 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members) return multi; } + { + int i; + TransactionId update_xact = InvalidTransactionId; + + for (i = 0; i nmembers; i++) + { + /* Ignore lockers */ + if (!ISUPDATE_from_mxstatus(members[i].status)) +continue; + if (update_xact != InvalidTransactionId) +elog(ERROR, new multixact has more than one updating member); + update_xact = members[i].xid; + } + } + /* * Assign the MXID and offsets range to use, and make sure there is space * in the OFFSETs and MEMBERs files. NB: this routine does diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h index 9729f27..95ffe0e 100644 --- a/src/include/access/multixact.h +++ b/src/include/access/multixact.h @@ -48,6 +48,9 @@ typedef enum #define MaxMultiXactStatus MultiXactStatusUpdate +/* Get the is_update bit for a given MultiXactStatus */ +#define ISUPDATE_from_mxstatus(status) \ + ((status) MultiXactStatusForUpdate) typedef struct MultiXactMember { -- 1.8.5.rc2.dirty From ebd4f834b7cdccb9a9fe336dd981811e26bfc271 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Tue, 22 Apr 2014 08:42:44 +0200 Subject: [PATCH 2/2] preliminary fix for corruption issue. --- src/backend/access/heap/heapam.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index bd26b80..9c075c4 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2722,7 +2722,7 @@ l1: * update this tuple before we get to this point. Check for xmax * change, and start over if so. */ - if (!(tp.t_data-t_infomask HEAP_XMAX_IS_MULTI) || + if (tp.t_data-t_infomask != infomask || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data), xwait)) goto l1; @@ -2748,7 +2748,7 @@ l1: * other xact could update this tuple before we get to this point. * Check for xmax change, and start over if so. */ - if ((tp.t_data-t_infomask HEAP_XMAX_IS_MULTI) || + if (tp.t_data-t_infomask != infomask || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data), xwait)) goto l1; @@ -3275,7 +3275,7 @@ l2: * update this tuple before we get to this point. Check for xmax * change, and start over if so. */ - if (!(oldtup.t_data-t_infomask HEAP_XMAX_IS_MULTI) || + if (oldtup.t_data-t_infomask != infomask || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) goto l2; @@ -3329,7 +3329,7 @@ l2: * recheck the locker; if someone else changed the tuple while * we weren't looking, start over. */ -if ((oldtup.t_data-t_infomask HEAP_XMAX_IS_MULTI) || +if (oldtup.t_data-t_infomask != infomask || !TransactionIdEquals( HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) @@ -3350,7 +3350,7 @@ l2: * some other xact could update this tuple before we get to
Re: [HACKERS] assertion failure 9.3.4
Andres Freund wrote: On 2014-04-21 19:43:15 -0400, Andrew Dunstan wrote: On 04/21/2014 02:54 PM, Andres Freund wrote: Hi, I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. What's the assert that makes it happen faster? That might help a lot in constructing a self-contained test. Assertion and *preliminary*, *hacky* fix attached. Thanks for the analysis and patches. I've been playing with this on my own a bit, and one thing that I just noticed is that at least for heap_update I cannot reproduce a problem when the xmax is originally a multixact, so AFAICT the number of places that need patched aren't as many. Some testing later, I think the issue only occurs if we determine that we don't need to wait for the xid/multi to complete, because otherwise the wait itself saves us. (It's easy to cause the problem by adding a breakpoint in heapam.c:3325, i.e. just before re-acquiring the buffer lock, and then having transaction A lock for key share, then transaction B update the tuple which stops at the breakpoint, then transaction A also update the tuple, and finally release transaction B). For now I offer a cleaned up version of your patch to add the assertion that multis don't contain multiple updates. I considered the idea of making this #ifdef USE_ASSERT_CHECKING, because it has to walk the complete array of members; and then have full elogs in MultiXactIdExpand and MultiXactIdCreate, which are lighter because they can check more easily. But on second thoughts I refrained from doing that, because surely the arrays are not as large anyway, are they. I think I should push this patch first, so that Andrew and Josh can try their respective test cases which should start throwing errors, then push the actual fixes. Does that sound okay? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services *** a/src/backend/access/heap/heapam.c --- b/src/backend/access/heap/heapam.c *** *** 190,198 static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = /* Get the LockTupleMode for a given MultiXactStatus */ #define TUPLOCK_from_mxstatus(status) \ (MultiXactStatusLock[(status)]) - /* Get the is_update bit for a given MultiXactStatus */ - #define ISUPDATE_from_mxstatus(status) \ - ((status) MultiXactStatusForUpdate) /* * heap support routines --- 190,195 *** a/src/backend/access/transam/multixact.c --- b/src/backend/access/transam/multixact.c *** *** 457,463 MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status) for (i = 0, j = 0; i nmembers; i++) { if (TransactionIdIsInProgress(members[i].xid) || ! ((members[i].status MultiXactStatusForUpdate) TransactionIdDidCommit(members[i].xid))) { newMembers[j].xid = members[i].xid; --- 457,463 for (i = 0, j = 0; i nmembers; i++) { if (TransactionIdIsInProgress(members[i].xid) || ! (ISUPDATE_from_mxstatus(members[i].status) TransactionIdDidCommit(members[i].xid))) { newMembers[j].xid = members[i].xid; *** *** 713,718 MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members) --- 713,734 return multi; } + /* Verify that there is a single update Xid among the given members. */ + { + int i; + bool has_update = false; + + for (i = 0; i nmembers; i++) + { + if (ISUPDATE_from_mxstatus(members[i].status)) + { + if (has_update) + elog(ERROR, new multixact has more than one updating member); + has_update = true; + } + } + } + /* * Assign the MXID and offsets range to use, and make sure there is space * in the OFFSETs and MEMBERs files. NB: this routine does *** a/src/include/access/multixact.h --- b/src/include/access/multixact.h *** *** 48,53 typedef enum --- 48,57 #define MaxMultiXactStatus MultiXactStatusUpdate + /* does a status value correspond to a tuple update? */ + #define ISUPDATE_from_mxstatus(status) \ + ((status) MultiXactStatusForUpdate) + typedef struct MultiXactMember { -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/22/2014 02:01 PM, Alvaro Herrera wrote: I think I should push this patch first, so that Andrew and Josh can try their respective test cases which should start throwing errors, then push the actual fixes. Does that sound okay? Note that I have a limited ability to actually test my failing test case -- I have to fire up the user's full application test suite to produce it, and that needs to be scheduled. However, Andrew is able to rerun his test case repeatedly. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/22/2014 05:20 PM, Josh Berkus wrote: On 04/22/2014 02:01 PM, Alvaro Herrera wrote: I think I should push this patch first, so that Andrew and Josh can try their respective test cases which should start throwing errors, then push the actual fixes. Does that sound okay? Note that I have a limited ability to actually test my failing test case -- I have to fire up the user's full application test suite to produce it, and that needs to be scheduled. However, Andrew is able to rerun his test case repeatedly. Alvaro has access (as does Andres) to my setup, and instructions on how to run the test. I am going to be offline most of the time from tonight until next Tuesday. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/22/2014 02:01 PM, Alvaro Herrera wrote: Some testing later, I think the issue only occurs if we determine that we don't need to wait for the xid/multi to complete, because otherwise the wait itself saves us. (It's easy to cause the problem by adding a breakpoint in heapam.c:3325, i.e. just before re-acquiring the buffer lock, and then having transaction A lock for key share, then transaction B update the tuple which stops at the breakpoint, then transaction A also update the tuple, and finally release transaction B). So, trying to make my synthetic test work: In order to encounter this issue, I'd need to have two concurrent processes update the child records of the same parent record? That is: A --- B1 \--- B2 ... and the issue should only happen if I update both B1 and B2 concurrently in separate sessions? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-22 17:36:42 -0400, Andrew Dunstan wrote: On 04/22/2014 05:20 PM, Josh Berkus wrote: On 04/22/2014 02:01 PM, Alvaro Herrera wrote: I think I should push this patch first, so that Andrew and Josh can try their respective test cases which should start throwing errors, then push the actual fixes. Does that sound okay? Note that I have a limited ability to actually test my failing test case -- I have to fire up the user's full application test suite to produce it, and that needs to be scheduled. However, Andrew is able to rerun his test case repeatedly. Alvaro has access (as does Andres) to my setup, and instructions on how to run the test. I am going to be offline most of the time from tonight until next Tuesday. My preliminary patch seems to have fixed the problem in your setup... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-22 14:40:46 -0700, Josh Berkus wrote: On 04/22/2014 02:01 PM, Alvaro Herrera wrote: Some testing later, I think the issue only occurs if we determine that we don't need to wait for the xid/multi to complete, because otherwise the wait itself saves us. (It's easy to cause the problem by adding a breakpoint in heapam.c:3325, i.e. just before re-acquiring the buffer lock, and then having transaction A lock for key share, then transaction B update the tuple which stops at the breakpoint, then transaction A also update the tuple, and finally release transaction B). So, trying to make my synthetic test work: In order to encounter this issue, I'd need to have two concurrent processes update the child records of the same parent record? That is: A --- B1 \--- B2 ... and the issue should only happen if I update both B1 and B2 concurrently in separate sessions? I don't think that'll trigger it. You need rows that are first key share locked and then updated by the locking transaction. Under concurrency. And the timewindow really is rather small.. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-22 18:01:40 -0300, Alvaro Herrera wrote: Thanks for the analysis and patches. I've been playing with this on my own a bit, and one thing that I just noticed is that at least for heap_update I cannot reproduce a problem when the xmax is originally a multixact, so AFAICT the number of places that need patched aren't as many. I am quite uncomfortable with that assumption. I don't immediately see a problem for some of the cases, but leaving them in a weaker state than 9.2 makes me uncomfortable. For now I offer a cleaned up version of your patch to add the assertion that multis don't contain multiple updates. I considered the idea of making this #ifdef USE_ASSERT_CHECKING, because it has to walk the complete array of members; and then have full elogs in MultiXactIdExpand and MultiXactIdCreate, which are lighter because they can check more easily. But on second thoughts I refrained from doing that, because surely the arrays are not as large anyway, are they. Yea, I think it's fine to do it where it's in the proposed patch. I think I should push this patch first, so that Andrew and Josh can try their respective test cases which should start throwing errors, then push the actual fixes. Does that sound okay? +1 Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
In order to encounter this issue, I'd need to have two concurrent processes update the child records of the same parent record? That is: A --- B1 \--- B2 ... and the issue should only happen if I update both B1 and B2 concurrently in separate sessions? I don't think that'll trigger it. You need rows that are first key share locked and then updated by the locking transaction. Under concurrency. And the timewindow really is rather small.. Well, currently I have a test which locks A and B1, then updates B1 (twice, actually), and then updates A. However, since there's a lock on A, there's no concurrent updating of B1 and B2. This is based on the behavior of the queue where I originally saw the problem, but it doesn't reproduce the bug. I'm thinking I need to just lock B1, update B1, then A, while allowing a concurrent session to update B2 and and A. No? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-22 14:49:00 -0700, Josh Berkus wrote: In order to encounter this issue, I'd need to have two concurrent processes update the child records of the same parent record? That is: A --- B1 \--- B2 ... and the issue should only happen if I update both B1 and B2 concurrently in separate sessions? I don't think that'll trigger it. You need rows that are first key share locked and then updated by the locking transaction. Under concurrency. And the timewindow really is rather small.. Well, currently I have a test which locks A and B1, then updates B1 (twice, actually), and then updates A. However, since there's a lock on A, there's no concurrent updating of B1 and B2. This is based on the behavior of the queue where I originally saw the problem, but it doesn't reproduce the bug. I'm thinking I need to just lock B1, update B1, then A, while allowing a concurrent session to update B2 and and A. No? I don't think this gets any easier to reproduce by introducing more than one relation. Have one session acquire key share locks and then update, and another one just doing updates. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Josh Berkus wrote: In order to encounter this issue, I'd need to have two concurrent processes update the child records of the same parent record? That is: A --- B1 \--- B2 ... and the issue should only happen if I update both B1 and B2 concurrently in separate sessions? I don't think that'll trigger it. You need rows that are first key share locked and then updated by the locking transaction. Under concurrency. And the timewindow really is rather small.. Well, currently I have a test which locks A and B1, then updates B1 (twice, actually), and then updates A. However, since there's a lock on A, there's no concurrent updating of B1 and B2. This is based on the behavior of the queue where I originally saw the problem, but it doesn't reproduce the bug. If you want to make it easier to reproduce, you need to insert some pg_usleep() calls in carefully selected spots. As Andres says, the window is small normally. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Andres Freund wrote: On 2014-04-22 18:01:40 -0300, Alvaro Herrera wrote: Thanks for the analysis and patches. I've been playing with this on my own a bit, and one thing that I just noticed is that at least for heap_update I cannot reproduce a problem when the xmax is originally a multixact, so AFAICT the number of places that need patched aren't as many. I am quite uncomfortable with that assumption. I don't immediately see a problem for some of the cases, but leaving them in a weaker state than 9.2 makes me uncomfortable. That's true too. I'm thinking about the comparison of full infomask as you propose instead of just the bits that we actually care about. I think the only thing that could cause a spurious failure (causing an extra execution of the HeapTupleSatisfiesUpdate call and the stuff below) is somebody setting HEAP_XMIN_COMMITTED concurrently; but that seems infrequent enough that it should pretty harmless. However, should we worry about possible future infomask bit changes that could negatively affect this behavior? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/22/2014 05:07 PM, Alvaro Herrera wrote: If you want to make it easier to reproduce, you need to insert some pg_usleep() calls in carefully selected spots. As Andres says, the window is small normally. Yeah, but the whole point of this is that having pg_stat-statements/auto_explain loaded should make the window much bigger. In the cases where we're hitting it, we're hitting it with a fair degree of predicability. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
All, More on this: 1) I've confirmed at the 2nd site that the issue doesn't happen if pg_stat_statements.so is not loaded. So this seems to be confirmation that either auto_explain, pg_stat_statements, or both need to be loaded (but not necessarily created as extensions) in order to have the issue. 2) I've been unable to reproduce the issue using a naive test case, but I'll keep trying. 3) The XID information is interesting. a) I've confirmed that this is a case of having multiple row versions rather than allowing a duplicate PK to be inserted. b) the end result of this corruption is XIDs which go backwards: xmin | xmax | ctid| step_id | workitem_id | status -+-+---+-+-+- 3362707 | 3362707 | (6177,31) | 1 | 446469 | pending 3362710 | 3362707 | (6177,32) | 1 | 446469 | working 5520953 | 5520953 | (5064,105) | 1 | 727946 | pending 5520954 | 5520953 | (5064,108) | 1 | 727946 | working What's additionally problematic is that the current snapshot minxid is in the 900 range, so it's not clear why any of the above rows are visible at all. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
* Josh Berkus (j...@agliodbs.com) wrote: 1) I've confirmed at the 2nd site that the issue doesn't happen if pg_stat_statements.so is not loaded. So this seems to be confirmation that either auto_explain, pg_stat_statements, or both need to be loaded (but not necessarily created as extensions) in order to have the issue. Interesting... 2) I've been unable to reproduce the issue using a naive test case, but I'll keep trying. Thanks. 3) The XID information is interesting. a) I've confirmed that this is a case of having multiple row versions rather than allowing a duplicate PK to be inserted. b) the end result of this corruption is XIDs which go backwards: xmin | xmax | ctid| step_id | workitem_id | status -+-+---+-+-+- 3362707 | 3362707 | (6177,31) | 1 | 446469 | pending 3362710 | 3362707 | (6177,32) | 1 | 446469 | working 5520953 | 5520953 | (5064,105) | 1 | 727946 | pending 5520954 | 5520953 | (5064,108) | 1 | 727946 | working What's additionally problematic is that the current snapshot minxid is in the 900 range, so it's not clear why any of the above rows are visible at all. Can you get the infomask bits..? What's does pg_controldata report wrt the MultiXid's? THanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] assertion failure 9.3.4
Can you get the infomask bits..? What's does pg_controldata report wrt the MultiXid's? Can't get the infomask bits. pg_controldata attached, with some redactions. Unfortunately, it appears that they've continued to do tests on this system, so the XID counter has advanced somewhat. pg_control version number:937 Catalog version number: {redacted} Database system identifier: {redacted} Database cluster state: in production pg_control last modified: Mon 21 Apr 2014 04:28:30 PM UTC Latest checkpoint location: 8/C8B05F98 Prior checkpoint location:8/C663A4B0 Latest checkpoint's REDO location:8/C6DC9A10 Latest checkpoint's REDO WAL file:0003000800C6 Latest checkpoint's TimeLineID: 3 Latest checkpoint's PrevTimeLineID: 3 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0/11214375 Latest checkpoint's NextOID: 32769 Latest checkpoint's NextMultiXactId: 28939 Latest checkpoint's NextMultiOffset: 58004 Latest checkpoint's oldestXID:674 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 11214375 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Time of latest checkpoint:Mon 21 Apr 2014 04:21:02 PM UTC Fake LSN counter for unlogged rels: 0/1 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location:0/0 Backup end location: 0/0 End-of-backup record required:no Current wal_level setting:hot_standby Current max_connections setting: 100 Current max_prepared_xacts setting: 0 Current max_locks_per_xact setting: 64 -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Josh Berkus j...@agliodbs.com writes: 1) I've confirmed at the 2nd site that the issue doesn't happen if pg_stat_statements.so is not loaded. So this seems to be confirmation that either auto_explain, pg_stat_statements, or both need to be loaded (but not necessarily created as extensions) in order to have the issue. I looked a little bit for possible interactions between those two modules, and almost immediately noticed that pg_stat_statements is not being polite enough to call any previous occupant of the post_parse_analyze_hook. That's not the bug we're looking for, since auto_explain does not use post_parse_analyze_hook, but it's a bug nonetheless. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Hi, I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. Since the symptom of the problem seem to be multixacts with more than one updating xid, I added a check to MultiXactIdCreateFromMembers() preventing that. That requires to move ISUPDATE_from_mxstatus() to a header, but I think we should definitely add such a assert. As it turns out the problem is in the else if (result == HeapTupleBeingUpdated wait) branch in (at least) heap_update(). When the problem is hit the to-be-updated tuple originally has HEAP_XMIN_COMMITTED | HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK set. So we release the buffer lock, acquire the tuple lock, and reacquire the buffer lock. But inbetween the locking backend has actually updated the tuple. The code tries to protect against that with: /* * recheck the locker; if someone else changed the tuple while * we weren't looking, start over. */ if ((oldtup.t_data-t_infomask HEAP_XMAX_IS_MULTI) || !TransactionIdEquals( HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) goto l2; can_continue = true; locker_remains = true; and similar. The problem is that in Andrew's case the infomask changes from 0x2192 to 0x2102 (i.e. it's a normal update afterwards), while xmax stays the same. Ooops. A bit later there's: result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated; So, from thereon we happily continue to update the tuple, thinking there's no previous updater. Which obviously causes problems. I've hacked^Wfixed this by changing the infomask test above into infomask != oldtup.t_data-t_infomask in a couple of places. That seems to be sufficient to survive the testcase a couple of times. I am too hungry right now to think about a proper fix for this and whether there's further problematic areas. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Andres Freund and...@2ndquadrant.com writes: I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. Hmm ... is this the same thing Josh is reporting? If so, why the apparent connection with use of pg_stat_statements? Maybe that just changes the timing to make the race condition more likely? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/21/2014 12:26 PM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. Hmm ... is this the same thing Josh is reporting? If so, why the apparent connection with use of pg_stat_statements? Maybe that just changes the timing to make the race condition more likely? Well, in Andrew's case it only happens if auto_explain is loaded. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/21/2014 03:26 PM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. Hmm ... is this the same thing Josh is reporting? If so, why the apparent connection with use of pg_stat_statements? Maybe that just changes the timing to make the race condition more likely? That's my theory. Josh's case is very similar indeed to mine. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-21 15:26:03 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. Hmm ... is this the same thing Josh is reporting? If so, why the apparent connection with use of pg_stat_statements? Maybe that just changes the timing to make the race condition more likely? I am not sure. I could only reproduce the problem with both stat_statements and auto_explain loaded, but I didn't try very long. It's pretty timing sensitive, so I could very well imagine that some additional overhead makes it easier to reproduce. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 2014-04-21 12:31:09 -0700, Josh Berkus wrote: On 04/21/2014 12:26 PM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. Hmm ... is this the same thing Josh is reporting? If so, why the apparent connection with use of pg_stat_statements? Maybe that just changes the timing to make the race condition more likely? Well, in Andrew's case it only happens if auto_explain is loaded. Josh, how long does it take you to reproduce the issue? And can you reproduce it outside of a production environment? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Josh, how long does it take you to reproduce the issue? A couple hours. And can you reproduce it outside of a production environment? Not yet, still working on that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/21/2014 02:54 PM, Andres Freund wrote: Hi, I spent the last two hours poking arounds in the environment Andrew provided and I was able to reproduce the issue, find a assert to reproduce it much faster and find a possible root cause. What's the assert that makes it happen faster? That might help a lot in constructing a self-contained test. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
All, I've taken a stab at creating a reproduceable test case based on the characterisitics of the production issues I'm seeing. But clearly there's an element I'm missing, because I'm not able to produce the bug with a pgbench-based test case. My current test has FKs, updating both FK'd tables, updating both indexed and non-indexed columns, and doing multiple updates in the same transaction, and lock-blocking. Files are attached in case someone has better ideas. queue_bench.sql is the setup file, and then you do: pgbench -n -T 600 -c 15 -j 5 -f queue_adder.bench -f queue_worker.bench -f queue_worker.bench -f queue_worker.bench -f queue_worker.bench {dbname} ... or whatever levels of c and j make sense on your hardware. FWIW, this particular test case might make a nice destruction test case for replication testing, too. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com CREATE TABLE queue ( queue_id serial not null primary key, ts timestamptz not null default now(), cat int not null, status text not null default 'new', current_step int not null default 1 ); CREATE TABLE queue_steps ( queue_id int not null, step int not null, status text not null default 'pending', done timestamptz, payload text ); INSERT INTO queue ( ts, cat, current_step ) SELECT now() - interval '15 seconds' + ( i * interval '1 second' ), (random()*3)::int, 1 FROM generate_series(1,12) as gs(i); INSERT INTO queue_steps ( queue_id, step, status, payload ) SELECT queue_id, gs.i, 'pending', md5(random()::text) FROM queue, generate_series(1,3) as gs(i); CREATE INDEX queue_ts ON queue(ts DESC); CREATE INDEX queue_step_pk ON queue_steps(queue_id, step); ALTER TABLE queue_steps ADD CONSTRAINT queue_fk FOREIGN KEY ( queue_id ) REFERENCES queue(queue_id); CREATE FUNCTION add_queue_item ( ncat int ) RETURNS INT LANGUAGE plpgsql AS $f$ DECLARE qid INT; BEGIN INSERT INTO queue ( ts, cat, status, current_step ) VALUES ( now(), ncat, 'new', 1 ) RETURNING queue_id INTO qid; INSERT INTO queue_steps ( queue_id, step, status, payload ) SELECT qid, gs.i, 'pending', md5(random()::text) FROM generate_series(1,3) as gs(i); RETURN qid; END; $f$; CREATE OR REPLACE FUNCTION queue_step ( ncat int ) RETURNS INT LANGUAGE plpgsql AS $f$ DECLARE qid INT; stnum INT; BEGIN SELECT queue_id INTO qid FROM queue WHERE status = 'working' AND cat = ncat ORDER BY ts LIMIT 1 FOR UPDATE; IF qid IS NULL THEN SELECT queue_id INTO qid FROM queue WHERE status = 'new' AND cat = ncat ORDER BY ts LIMIT 1 FOR UPDATE; END IF; SELECT step FROM queue_steps INTO stnum WHERE queue_id = qid AND status = 'pending' ORDER BY step LIMIT 1; IF stnum = 1 THEN UPDATE queue SET status = 'working' WHERE queue_id = qid; END IF; UPDATE queue_steps SET status = 'working' WHERE queue_id = qid AND step = stnum; PERFORM pg_sleep(0.002); UPDATE queue_steps SET status = 'done', done = now() WHERE queue_id = qid AND step = stnum; IF stnum = 3 THEN UPDATE queue SET status = 'done' WHERE queue_id = qid; ELSE UPDATE queue SET current_step = stnum + 1 WHERE queue_id = qid; END IF; RETURN qid; END; $f$; ANALYZE; \setrandom rcat 0 3 \setrandom wtime 1 3 BEGIN; SELECT add_queue_item(:rcat); END; \sleep :wtime ms \setrandom rcat 0 3 \setrandom wtime 1 3 BEGIN; SELECT queue_step(:rcat); \sleep :wtime ms END; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/17/2014 10:15 AM, Andrew Dunstan wrote: On 04/16/2014 10:28 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 04/16/2014 07:19 PM, Tom Lane wrote: Yeah, it would be real nice to see a self-contained test case for this. Well, that might be hard to put together, but I did try running without pg_stat_statements and auto_explain loaded and the error did not occur. Not sure where that gets us in terms of deciding on a culprit. Could we at least see the exact parameter settings for pg_stat_statements and auto_explain? (And any other GUCs with nondefault values?) Here are all the settings from the run that failed: listen_addresses = '*' port = 5432 fsync = on synchronous_commit = off checkpoint_segments = 128 checkpoint_completion_target = 0.9 shared_buffers = 512MB max_connections = 300 work_mem = 128MB maintenance_work_mem = 32MB effective_cache_size = 16GB effective_io_concurrency = 2 logging_collector = on log_destination = 'stderr' log_filename = 'postgresql-%a.log' log_rotation_size = 0 log_truncate_on_rotation = on log_line_prefix = '%t [%p] %l: ' log_connections = on log_disconnections = on log_statement = 'all' track_activity_query_size = 10240 shared_preload_libraries = 'auto_explain,pg_stat_statements' As you can see, auto_explain's log_min_duration hasn't been set, so it shouldn't be doing anything very much, I should think. There definitely seems to be something going on involving these two pre-loaded modules. With both auto_explain and pg_stat_statements preloaded I can reproduce the error fairly reliably. I have also reproduced it, but less reliably, with auto_explain alone loaded. I have not reproduced it with pg_stat_statements alone loaded, but Josh Berkus has reported that another client has experienced something very similar (duplicate PK errors on a non-assertion-enabled build) with pg_stat_statements alone loaded. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/18/2014 09:42 AM, Andrew Dunstan wrote: There definitely seems to be something going on involving these two pre-loaded modules. With both auto_explain and pg_stat_statements preloaded I can reproduce the error fairly reliably. I have also reproduced it, but less reliably, with auto_explain alone loaded. I have not reproduced it with pg_stat_statements alone loaded, but Josh Berkus has reported that another client has experienced something very similar (duplicate PK errors on a non-assertion-enabled build) with pg_stat_statements alone loaded. Correct. However, this seems to produce the issue less reliably; that is, it takes several hours of load testing for a duplicate PK to show up, so I suspect that with pg_stat_statements alone it's also timing issue. I'm still waiting on the results with pg_stat_statements.so NOT loaded to confirm that we're not actually getting two separate bugs with similar symptoms. The second site does not have any increase in query size. Their only settings for pg_s_s are: pg_stat_statements.max = 1 pg_stat_statements.track = all -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/16/2014 10:28 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 04/16/2014 07:19 PM, Tom Lane wrote: Yeah, it would be real nice to see a self-contained test case for this. Well, that might be hard to put together, but I did try running without pg_stat_statements and auto_explain loaded and the error did not occur. Not sure where that gets us in terms of deciding on a culprit. Could we at least see the exact parameter settings for pg_stat_statements and auto_explain? (And any other GUCs with nondefault values?) Here are all the settings from the run that failed: listen_addresses = '*' port = 5432 fsync = on synchronous_commit = off checkpoint_segments = 128 checkpoint_completion_target = 0.9 shared_buffers = 512MB max_connections = 300 work_mem = 128MB maintenance_work_mem = 32MB effective_cache_size = 16GB effective_io_concurrency = 2 logging_collector = on log_destination = 'stderr' log_filename = 'postgresql-%a.log' log_rotation_size = 0 log_truncate_on_rotation = on log_line_prefix = '%t [%p] %l: ' log_connections = on log_disconnections = on log_statement = 'all' track_activity_query_size = 10240 shared_preload_libraries = 'auto_explain,pg_stat_statements' As you can see, auto_explain's log_min_duration hasn't been set, so it shouldn't be doing anything very much, I should think. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
All, So have encountered a 2nd report of this issue, or of an issue which sounds very similar: - corruption in two queue tables - the tables are written in a high-concurrency, lock-contested environment - user uses SELECT FOR UPDATE with these tables. - pg_stat_statements .so is loaded, but the extension is not installed - four rows were added to the queue tables, while not being added to the PK index. This allowed duplicate PKs to be added. Currently the user is testing not loading the pg_stat_statements.so to see if the problem goes away. They have a destruction test environment, so we should be able to confirm/deny in a couple days. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On Thu, Apr 17, 2014 at 7:15 AM, Andrew Dunstan and...@dunslane.net wrote: track_activity_query_size = 10240 shared_preload_libraries = 'auto_explain,pg_stat_statements' As you can see, auto_explain's log_min_duration hasn't been set, so it shouldn't be doing anything very much, I should think. track_activity_query_size is 10240? Not 1024? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/17/2014 09:04 PM, Peter Geoghegan wrote: On Thu, Apr 17, 2014 at 7:15 AM, Andrew Dunstan and...@dunslane.net wrote: track_activity_query_size = 10240 shared_preload_libraries = 'auto_explain,pg_stat_statements' As you can see, auto_explain's log_min_duration hasn't been set, so it shouldn't be doing anything very much, I should think. track_activity_query_size is 10240? Not 1024? yes. I normally find 1024 is far too small. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
So, from top to bottom I see the following elements: * backend is executing a query * this query is getting captured by pg_stat_statements * the query is also getting captured by autoexplain, in chain from pg_stat_statements * autoexplain runs the query, which invokes a plpgsql function * this plpgsql function runs another query, and this one is captured by pg_stat_statements * and also by autoexplain * The autoexplain run of this inner query invokes a trigger * .. which is a FK trigger, and therefore runs ri_PerformCheck which runs another query * This FK check query gets captured by pg_stat_statements * and also by autoexplain, which runs it * this autoexplain run of the third query invokes SeqNext * this does a heap access, which uses HeapTupleSatisfiesMVCC * this one tries to read the update XID and gets an assertion failure. Oh my. Would it be possible for you to provide a self-contained setup that behaves like this? I think a bt full would be useful since it'd provide the queries at each of the three stages. I'm not quite clear on why the third query, the one in ri_PerformCheck, is invoking a sequence. AFAICS it's this: /* -- * The query string built is * SELECT 1 FROM ONLY pktable x WHERE pkatt1 = $1 [AND ...] * FOR KEY SHARE OF x * The type id's for the $ parameters are those of the * corresponding FK attributes. * -- */ so either I'm misreading the whole thing, or there is something very odd going on here. Are you aware of something unusual in the FK setups? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Alvaro Herrera alvhe...@2ndquadrant.com writes: I'm not quite clear on why the third query, the one in ri_PerformCheck, is invoking a sequence. It's not --- SeqNext is the next-tuple function for a sequential scan. Nothing to do with sequences. Now, it *is* worth wondering why the heck a query on the table's primary key is using a seqscan and not an indexscan. Maybe the planner thinks there are just a few rows in the table? But the stack trace seems unexceptional other than that. I'm wondering if the combination of autoexplain and pg_stat_statements is causing trouble. Yeah, it would be real nice to see a self-contained test case for this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/16/2014 07:19 PM, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: I'm not quite clear on why the third query, the one in ri_PerformCheck, is invoking a sequence. It's not --- SeqNext is the next-tuple function for a sequential scan. Nothing to do with sequences. Now, it *is* worth wondering why the heck a query on the table's primary key is using a seqscan and not an indexscan. Maybe the planner thinks there are just a few rows in the table? But the stack trace seems unexceptional other than that. I'm wondering if the combination of autoexplain and pg_stat_statements is causing trouble. Yeah, it would be real nice to see a self-contained test case for this. Well, that might be hard to put together, but I did try running without pg_stat_statements and auto_explain loaded and the error did not occur. Not sure where that gets us in terms of deciding on a culprit. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Andrew Dunstan and...@dunslane.net writes: On 04/16/2014 07:19 PM, Tom Lane wrote: Yeah, it would be real nice to see a self-contained test case for this. Well, that might be hard to put together, but I did try running without pg_stat_statements and auto_explain loaded and the error did not occur. Not sure where that gets us in terms of deciding on a culprit. Could we at least see the exact parameter settings for pg_stat_statements and auto_explain? (And any other GUCs with nondefault values?) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
Andrew Dunstan wrote: On 04/16/2014 07:19 PM, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: I'm not quite clear on why the third query, the one in ri_PerformCheck, is invoking a sequence. It's not --- SeqNext is the next-tuple function for a sequential scan. Nothing to do with sequences. Now, it *is* worth wondering why the heck a query on the table's primary key is using a seqscan and not an indexscan. Maybe the planner thinks there are just a few rows in the table? But the stack trace seems unexceptional other than that. I'm wondering if the combination of autoexplain and pg_stat_statements is causing trouble. Yeah, it would be real nice to see a self-contained test case for this. Well, that might be hard to put together, but I did try running without pg_stat_statements and auto_explain loaded and the error did not occur. Well, can you get us the queries that are being run, and the schemas involved in those queries? Not sure where that gets us in terms of deciding on a culprit. I suspect, in the blind, that autoexplain is executing a query that causes a second update XID to be added to a multixact (which already has another one), perhaps with the same snapshot as the original update. Why would this be happening in a FK-check query? No idea. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] assertion failure 9.3.4
With a client's code I have just managed to produce the following assertion failure on 9.3.4: 2014-04-15 01:02:46 GMT [19854] 76299: LOG: execute unnamed: select * from asp_ins_event_task_log( job_id:=1, event_id:=3164, task_name:='EventUtcComputeTask', task_status_code:='VALID' , task_start_utc:='04/15/2014 01:02:44.563', task_end_utc:='04/15/2014 01:02:44.563') TRAP: FailedAssertion(!(update_xact == ((TransactionId) 0)), File: /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c, Line: 5873) 2014-04-15 01:02:46 GMT [11959] 11: LOG: server process (PID 19854) was terminated by signal 6: Aborted 2014-04-15 01:02:46 GMT [11959] 12: DETAIL: Failed process was running: select * from asp_ins_event_task_log( job_id:=1, event_id:=3164, task_name:='EventUtcComputeTask', task_status_code :='VALID', task_start_utc:='04/15/2014 01:02:44.563', task_end_utc:='04/15/2014 01:02:44.563') 2014-04-15 01:02:46 GMT [11959] 13: LOG: terminating any other active server processes When running without assertions, the client reports experiencing tables with duplicate primary keys among other things. It's apparently quite reproducible. I'm digging into this, but it's a nasty bug and any useful thoughts would be appreciated. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/14/2014 09:28 PM, Andrew Dunstan wrote: With a client's code I have just managed to produce the following assertion failure on 9.3.4: 2014-04-15 01:02:46 GMT [19854] 76299: LOG: execute unnamed: select * from asp_ins_event_task_log( job_id:=1, event_id:=3164, task_name:='EventUtcComputeTask', task_status_code:='VALID' , task_start_utc:='04/15/2014 01:02:44.563', task_end_utc:='04/15/2014 01:02:44.563') TRAP: FailedAssertion(!(update_xact == ((TransactionId) 0)), File: /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c, Line: 5873) 2014-04-15 01:02:46 GMT [11959] 11: LOG: server process (PID 19854) was terminated by signal 6: Aborted 2014-04-15 01:02:46 GMT [11959] 12: DETAIL: Failed process was running: select * from asp_ins_event_task_log( job_id:=1, event_id:=3164, task_name:='EventUtcComputeTask', task_status_code :='VALID', task_start_utc:='04/15/2014 01:02:44.563', task_end_utc:='04/15/2014 01:02:44.563') 2014-04-15 01:02:46 GMT [11959] 13: LOG: terminating any other active server processes When running without assertions, the client reports experiencing tables with duplicate primary keys among other things. It's apparently quite reproducible. I'm digging into this, but it's a nasty bug and any useful thoughts would be appreciated. and here the stack trace: #0 0x00361ba36285 in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00361ba37b9b in __GI_abort () at abort.c:91 #2 0x0075c157 in ExceptionalCondition (conditionName=optimized out, errorType=optimized out, fileName=optimized out, lineNumber=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54 #3 0x0048c2af in MultiXactIdGetUpdateXid (xmax=optimized out, t_infomask=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873 #4 0x0078ad50 in HeapTupleSatisfiesMVCC (tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at /home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221 #5 0x0048aad2 in heapgetpage (scan=0x11522f0, page=6) at /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:397 #6 0x0048aee9 in heapgettup_pagemode (scan=0x11522f0, dir=optimized out, nkeys=0, key=0x0) at /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:939 #7 0x0048d646 in heap_getnext (scan=0x11522f0, direction=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:1459 #8 0x005becab in SeqNext (node=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/executor/nodeSeqscan.c:66 #9 0x005ac66e in ExecScanFetch (recheckMtd=0x5bec70 SeqRecheck, accessMtd=0x5bec80 SeqNext, node=0x1151488) at /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:82 #10 ExecScan (node=0x1151488, accessMtd=0x5bec80 SeqNext, recheckMtd=0x5bec70 SeqRecheck) at /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:167 #11 0x005a5338 in ExecProcNode (node=0x1151488) at /home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:400 #12 0x005b9bcf in ExecLockRows (node=0x1151278) at /home/andrew/pgl/pg_9_3/src/backend/executor/nodeLockRows.c:56 #13 0x005a51d0 in ExecProcNode (node=0x1151278) at /home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:496 #14 0x005a250a in ExecutePlan (dest=0xb952e0, direction=optimized out, numberTuples=1, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x1151278, estate=0x1151038) at /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:1473 #15 standard_ExecutorRun (queryDesc=0x114f230, direction=optimized out, count=1) at /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:307 #16 0x7feb5cf2f27d in explain_ExecutorRun (queryDesc=0x114f230, direction=ForwardScanDirection, count=1) at /home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:233 #17 0x7feb5cd2a235 in pgss_ExecutorRun (queryDesc=0x114f230, direction=ForwardScanDirection, count=1) at /home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:717 #18 0x005c8b97 in _SPI_pquery (tcount=1, fire_triggers=0 '\000', queryDesc=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2369 #19 _SPI_execute_plan (plan=0x10afac8, paramLI=0x114f028, snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=0 '\000', tcount=1) at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2157 #20 0x005c8fd1 in SPI_execute_snapshot (plan=0x10afac8, Values=0x7fff147d8500, Nulls=0x7fff147d8700 , snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=0 '\000', tcount=1) at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:488 #21 0x00722635 in ri_PerformCheck (riinfo=0xf98168, qkey=0x7fff147d8a20, qplan=0x10afac8,
Re: [HACKERS] assertion failure 9.3.4
Andrew Dunstan wrote: and here the stack trace: #0 0x00361ba36285 in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00361ba37b9b in __GI_abort () at abort.c:91 #2 0x0075c157 in ExceptionalCondition (conditionName=optimized out, errorType=optimized out, fileName=optimized out, lineNumber=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54 #3 0x0048c2af in MultiXactIdGetUpdateXid (xmax=optimized out, t_infomask=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873 #4 0x0078ad50 in HeapTupleSatisfiesMVCC (tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at /home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221 Clearly this is a bug related to multixacts and the related tqual.c changse. Will look. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] assertion failure 9.3.4
On 04/14/2014 10:02 PM, Alvaro Herrera wrote: Andrew Dunstan wrote: and here the stack trace: #0 0x00361ba36285 in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00361ba37b9b in __GI_abort () at abort.c:91 #2 0x0075c157 in ExceptionalCondition (conditionName=optimized out, errorType=optimized out, fileName=optimized out, lineNumber=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54 #3 0x0048c2af in MultiXactIdGetUpdateXid (xmax=optimized out, t_infomask=optimized out) at /home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873 #4 0x0078ad50 in HeapTupleSatisfiesMVCC (tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at /home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221 Clearly this is a bug related to multixacts and the related tqual.c changse. Will look. OK. If you need me to test patches I can do so. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers