Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-08 Thread Michael Paquier
On Thu, Nov 9, 2017 at 8:25 AM, Asim Praveen  wrote:
> Indeed, the assertion tripped during WAL replay on the standby.  This was
> caught by TAP tests under src/test/recovery.  The assertion is now fixed so
> that WAL replay is exempt from the check.  Please find the new patch
> attached.  The tests now pass with the fix.  I also manually verified that
> recovery works with "wal_consistency_checking=all".

I still have a bad feeling about that bit... Still, it does not change
the fact that patch 0001 in
https://www.postgresql.org/message-id/CANXE4TccH_VjdKaHc9=KyH0Y7WORqZN+=mH5f=mp0bw3gzx...@mail.gmail.com
needs a committer per the fact that it visibly fixes incorrect backend
code and API contract. So I am switching the CF entry to ready for
committer, but only for 0001.

The other things could always be taken care of later.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-08 Thread Asim Praveen
Hi Michael

On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier 
wrote:

>
> Did you really test WAL replay? This still ignores that PageGetLSN is
> as well taken in some code paths, like recovery, where actions on the
> page are guaranteed to be serialized, like during recovery, so this
> patch would cause the system to blow up. Note that pageinspect,
> amcheck and wal_consistency_checking also process on page copies. So
> the assertion failure of 0002 would trigger in those cases.
>

Indeed, the assertion tripped during WAL replay on the standby.  This was
caught by TAP tests under src/test/recovery.  The assertion is now fixed so
that WAL replay is exempt from the check.  Please find the new patch
attached.  The tests now pass with the fix.  I also manually verified that
recovery works with "wal_consistency_checking=all".

Asim


0002-PageGetLSN-assert-that-locks-are-properly-held.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-07 Thread Michael Paquier
On Wed, Nov 8, 2017 at 2:26 AM, Jacob Champion  wrote:
> On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier
>  wrote:
>> Did you really test WAL replay?
>
> Is there a way to test this other than installcheck-world? The only
> failure we've run into at the moment is in the snapshot-too-old tests.
> Maybe we're not configuring with all the tests enabled?

Not automatically. The simplest method I use in this case is
installcheck with a standby replaying changes behind.

>>> The assertion added caught at least one code path where TestForOldSnapshot
>>> calls PageGetLSN without holding any lock.  The snapshot_too_old test in
>>> "check-world" failed due to the assertion failure.  This needs to be fixed,
>>> see the open question in the opening mail on this thread.
>>
>> Good point. I am looping Kevin Grittner here for his input, as the
>> author and committer of old_snapshot_threshold. Things can be
>> addressed with a separate patch by roughly moving the check on
>> PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic()
>> instead.
>
> It still doesn't pass the tests, as not all code paths ensure that a
> content lock is held before calling TestForOldSnapshot.
> BufferGetLSNAtomic only adds the spinlock.

I would prefer waiting for more input from Kevin here. This may prove
to be a more invasive change. There are no objections into fixing the
existing callers in index AMs though.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-07 Thread Jacob Champion
On Tue, Nov 7, 2017 at 9:26 AM, Jacob Champion  wrote:
> On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier
>  wrote:
>> It seems to me that 0001 is good for a committer lookup, that will get
>> rid of all existing bugs. For 0002, what you are proposing is still
>> not a good idea for anything using page copies.
>
> I think there is still significant confusion here. To be clear: this
> patch is intended to add no changes for local page copies.

Maybe a better way to put this is: we see no failures with the
pageinspect regression tests, which exercise PageGetLSN() via the
page_header() function. Are there other tests we should be paying
attention to that might show a problem with our local-copy logic?

--Jacob


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-07 Thread Jacob Champion
Hi Michael,

On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier
 wrote:
> Did you really test WAL replay?

Is there a way to test this other than installcheck-world? The only
failure we've run into at the moment is in the snapshot-too-old tests.
Maybe we're not configuring with all the tests enabled?

>> The assertion added caught at least one code path where TestForOldSnapshot
>> calls PageGetLSN without holding any lock.  The snapshot_too_old test in
>> "check-world" failed due to the assertion failure.  This needs to be fixed,
>> see the open question in the opening mail on this thread.
>
> Good point. I am looping Kevin Grittner here for his input, as the
> author and committer of old_snapshot_threshold. Things can be
> addressed with a separate patch by roughly moving the check on
> PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic()
> instead.

It still doesn't pass the tests, as not all code paths ensure that a
content lock is held before calling TestForOldSnapshot.
BufferGetLSNAtomic only adds the spinlock.

> The commit fest has lost view of this entry, and so did I. So I have
> added a new entry:
> https://commitfest.postgresql.org/16/1371/

Thank you!

> BufferGetLSNAtomic() could really use LWLockHeldByMe(). Could you
> consider it with an extra patch on top of 0001?

The LWLockHeldByMe() calls are added to BufferGetLSNAtomic in patch
0002 (because it does all its work through PageGetLSN).

> It seems to me that 0001 is good for a committer lookup, that will get
> rid of all existing bugs. For 0002, what you are proposing is still
> not a good idea for anything using page copies.

I think there is still significant confusion here. To be clear: this
patch is intended to add no changes for local page copies. As I've
tried to say (three or four times now): to our understanding, local
page copies are not allocated in the shared BufferBlocks region and
are therefore not subject to the new assertions. Am I missing a corner
case, or completely misunderstanding your point? I never got a direct
response to my earlier questions on this.

--Jacob


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-06 Thread Michael Paquier
On Tue, Nov 7, 2017 at 7:27 AM, Asim Praveen  wrote:
> On Mon, Oct 2, 2017 at 6:48 PM, Michael Paquier 
> wrote:
>> Jacob, here are some ideas to make this thread move on. I would
>> suggest to produce a set of patches that do things incrementally:
>> 1) One patch that changes the calls of PageGetLSN to
>> BufferGetLSNAtomic which are now not appropriate. You have spotted
>> some of them in the btree and gist code, but not all based on my first
>> lookup. There is still one in gistFindCorrectParent and one in btree
>> _bt_split. The monitoring of the other calls (sequence.c and
>> vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble
>> that should be fixed I think.
>
> Thank you for your suggestions.  Please find the first patch attached as
> "0001-...".  We verified both, gistFindCorrectParent and _bt_split and all
> calls to PageGetLSN are made with exclusive lock on the buffer contents
> held.

Cool. Thanks for double-checking. XLogRecordAssemble() is fine after
more lookup at this code, XLogRegisterBuffer already doing some sanity
checks.

>> 2) A second patch that strengthens a bit checks around
>> BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you
>> are originally suggesting.
>> A comment could be as well added in bufpage.h for PageGetLSN to let
>> users know that it should be used carefully, in the vein of what is
>> mentioned in src/backend/access/transam/README.
>
> The second patch "0002-..." does the above.  We have a comment added to
> AssertPageIsLockedForLSN as suggested.

Did you really test WAL replay? This still ignores that PageGetLSN is
as well taken in some code paths, like recovery, where actions on the
page are guaranteed to be serialized, like during recovery, so this
patch would cause the system to blow up. Note that pageinspect,
amcheck and wal_consistency_checking also process on page copies. So
the assertion failure of 0002 would trigger in those cases.

> The assertion added caught at least one code path where TestForOldSnapshot
> calls PageGetLSN without holding any lock.  The snapshot_too_old test in
> "check-world" failed due to the assertion failure.  This needs to be fixed,
> see the open question in the opening mail on this thread.

Good point. I am looping Kevin Grittner here for his input, as the
author and committer of old_snapshot_threshold. Things can be
addressed with a separate patch by roughly moving the check on
PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic()
instead.

The commit fest has lost view of this entry, and so did I. So I have
added a new entry:
https://commitfest.postgresql.org/16/1371/

BufferGetLSNAtomic() could really use LWLockHeldByMe(). Could you
consider it with an extra patch on top of 0001?

It seems to me that 0001 is good for a committer lookup, that will get
rid of all existing bugs. For 0002, what you are proposing is still
not a good idea for anything using page copies. Here are some
suggestions:
- Implement a PageGetLSNFromCopy, dedicated at working correctly when
working on a page copy. Then switch callers of amcheck, pageinspect
and wal_consistency_checking to use that.
- Implement something like GetLSNFromLockedPage, and switch of
backend's PageGetLSN to that. Performance impact could be seen..
- Have a PageGetLSNSafe, which can be used safely for serialized actions.
It could be an idea to remove PageGetLSN to force a breakage of
extensions calling it, so as they would review any of its calls. Not a
fan of that though.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-11-06 Thread Asim Praveen
Hi Michael
On Mon, Oct 2, 2017 at 6:48 PM, Michael Paquier 
wrote:
>
> Jacob, here are some ideas to make this thread move on. I would
> suggest to produce a set of patches that do things incrementally:
> 1) One patch that changes the calls of PageGetLSN to
> BufferGetLSNAtomic which are now not appropriate. You have spotted
> some of them in the btree and gist code, but not all based on my first
> lookup. There is still one in gistFindCorrectParent and one in btree
> _bt_split. The monitoring of the other calls (sequence.c and
> vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble
> that should be fixed I think.

Thank you for your suggestions.  Please find the first patch attached as
"0001-...".  We verified both, gistFindCorrectParent and _bt_split and all
calls to PageGetLSN are made with exclusive lock on the buffer contents
held.

> 2) A second patch that strengthens a bit checks around
> BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you
> are originally suggesting.
> A comment could be as well added in bufpage.h for PageGetLSN to let
> users know that it should be used carefully, in the vein of what is
> mentioned in src/backend/access/transam/README.


The second patch "0002-..." does the above.  We have a comment added to
AssertPageIsLockedForLSN as suggested.

The assertion added caught at least one code path where TestForOldSnapshot
calls PageGetLSN without holding any lock.  The snapshot_too_old test in
"check-world" failed due to the assertion failure.  This needs to be fixed,
see the open question in the opening mail on this thread.

Asim and Jacob


0001-Change-incorrect-calls-to-PageGetLSN-to-BufferGetLSN.patch
Description: Binary data


0002-PageGetLSN-assert-that-locks-are-properly-held.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-06 Thread Michael Paquier
On Fri, Oct 6, 2017 at 7:34 PM, Alvaro Herrera  wrote:
> Just search for "Æsop" in the archives:
> https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR=fmxhsjsn_mri3rjygvbrmcrff...@mail.gmail.com

(laugh)
I didn't know this one.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-06 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera  
> wrote:

> > I'd argue about this in the same direction I argued about
> > BufferGetPage() needing an LSN check that's applied separately: if it's
> > too easy for a developer to do the wrong thing (i.e. fail to apply the
> > separate LSN check after reading the page) then the routine should be
> > patched to do the check itself, and another routine should be offered
> > for the cases that explicitly can do without the check.
> >
> > I was eventually outvoted in the BufferGetPage() saga, though, so I'm
> > not sure that that discussion sets precedent.
> 
> Oh... I don't recall this discussion. A quick lookup at the archives
> does not show me a specific thread either.

Just search for "Æsop" in the archives:
https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR=fmxhsjsn_mri3rjygvbrmcrff...@mail.gmail.com

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-03 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera  
> wrote:

> > I'd argue about this in the same direction I argued about
> > BufferGetPage() needing an LSN check that's applied separately: if it's
> > too easy for a developer to do the wrong thing (i.e. fail to apply the
> > separate LSN check after reading the page) then the routine should be
> > patched to do the check itself, and another routine should be offered
> > for the cases that explicitly can do without the check.
> >
> > I was eventually outvoted in the BufferGetPage() saga, though, so I'm
> > not sure that that discussion sets precedent.
> 
> Oh... I don't recall this discussion. A quick lookup at the archives
> does not show me a specific thread either.

I mean Kevin patch for snapshot-too-old:
https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR%3DFmxHsjSn_MRi3rJYgvbRMCRfFz%2BA%40mail.gmail.com

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Michael Paquier
On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera  wrote:
> Michael Paquier wrote:
>> On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas  wrote:
>> > On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
>> >  wrote:
>> >> So those bits could be considered for integration.
>> >
>> > Yes, and they also tend to suggest that the rest of the patch has value.
>>
>> Well, there are cases where you don't need any locking checks, and the
>> proposed patch ignores that. Take for example pageinspect which works
>> on a copy of a page, or just WAL replay which serializes everything,
>> and in both cases PageGetLSN can be used safely. So for compatibility
>> reasons I think that PageGetLSN should be kept alone to not surprise
>> anybody using it, or at least an equivalent should be introduced. It
>> would be interesting to make BufferGetLSNAtomic hold tighter checks,
>> like something that makes use of LWLockHeldByMe for example.

Jacob, here are some ideas to make this thread move on. I would
suggest to produce a set of patches that do things incrementally:
1) One patch that changes the calls of PageGetLSN to
BufferGetLSNAtomic which are now not appropriate. You have spotted
some of them in the btree and gist code, but not all based on my first
lookup. There is still one in gistFindCorrectParent and one in btree
_bt_split. The monitoring of the other calls (sequence.c and
vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble
that should be fixed I think.
2) A second patch that strengthens a bit checks around
BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you
are originally suggesting.
A comment could be as well added in bufpage.h for PageGetLSN to let
users know that it should be used carefully, in the vein of what is
mentioned in src/backend/access/transam/README.

> I'd argue about this in the same direction I argued about
> BufferGetPage() needing an LSN check that's applied separately: if it's
> too easy for a developer to do the wrong thing (i.e. fail to apply the
> separate LSN check after reading the page) then the routine should be
> patched to do the check itself, and another routine should be offered
> for the cases that explicitly can do without the check.
>
> I was eventually outvoted in the BufferGetPage() saga, though, so I'm
> not sure that that discussion sets precedent.

Oh... I don't recall this discussion. A quick lookup at the archives
does not show me a specific thread either.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Robert Haas
On Mon, Oct 2, 2017 at 11:05 AM, Michael Paquier
 wrote:
> Well, there are cases where you don't need any locking checks, and the
> proposed patch ignores that.

I understand that, but shouldn't we then look for a way to adjust the
patch so that it doesn't have that issue any longer, rather than just
kicking it to the curb?  I mean, just saying "patch suxxor, next"
doesn't seem like the right approach to something that has apparently
already found real problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Alvaro Herrera
Michael Paquier wrote:
> On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas  wrote:
> > On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
> >  wrote:
> >> So those bits could be considered for integration.
> >
> > Yes, and they also tend to suggest that the rest of the patch has value.
> 
> Well, there are cases where you don't need any locking checks, and the
> proposed patch ignores that. Take for example pageinspect which works
> on a copy of a page, or just WAL replay which serializes everything,
> and in both cases PageGetLSN can be used safely. So for compatibility
> reasons I think that PageGetLSN should be kept alone to not surprise
> anybody using it, or at least an equivalent should be introduced. It
> would be interesting to make BufferGetLSNAtomic hold tighter checks,
> like something that makes use of LWLockHeldByMe for example.

I'd argue about this in the same direction I argued about
BufferGetPage() needing an LSN check that's applied separately: if it's
too easy for a developer to do the wrong thing (i.e. fail to apply the
separate LSN check after reading the page) then the routine should be
patched to do the check itself, and another routine should be offered
for the cases that explicitly can do without the check.

I was eventually outvoted in the BufferGetPage() saga, though, so I'm
not sure that that discussion sets precedent.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Michael Paquier
On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas  wrote:
> On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
>  wrote:
>> So those bits could be considered for integration.
>
> Yes, and they also tend to suggest that the rest of the patch has value.

Well, there are cases where you don't need any locking checks, and the
proposed patch ignores that. Take for example pageinspect which works
on a copy of a page, or just WAL replay which serializes everything,
and in both cases PageGetLSN can be used safely. So for compatibility
reasons I think that PageGetLSN should be kept alone to not surprise
anybody using it, or at least an equivalent should be introduced. It
would be interesting to make BufferGetLSNAtomic hold tighter checks,
like something that makes use of LWLockHeldByMe for example.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Robert Haas
On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
 wrote:
> So those bits could be considered for integration.

Yes, and they also tend to suggest that the rest of the patch has value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Michael Paquier
On Mon, Oct 2, 2017 at 9:09 PM, Robert Haas  wrote:
> I think the first question we ought to be asking ourselves is whether
> any of the PageGetLSN -> BufferGetLSNAtomic changes the patch
> introduces are live bugs.  If they are, then we ought to fix those
> separately (and probably back-patch).  If they are not, then we need
> to think about how to adjust the patch so that it doesn't complain
> about things that are in fact OK.

If you look at each item one by one that the patch touches based on
the contract defined in transam/README...

--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -640,7 +640,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size
freespace, GISTSTATE *giststate)
}

stack->page = (Page) BufferGetPage(stack->buffer);
-   stack->lsn = PageGetLSN(stack->page);
+   stack->lsn = BufferGetLSNAtomic(stack->buffer);
There is an incorrect use of PageGetLSN here. A shared lock can be
taken on the page but there is no buffer header lock used when using
PageGetLSN.

@@ -890,7 +890,7 @@ gistFindPath(Relation r, BlockNumber child,
OffsetNumber *downlinkoffnum)
break;
}

-   top->lsn = PageGetLSN(page);
+   top->lsn = BufferGetLSNAtomic(buffer);
Here as well only a shared lock is taken on the page.

@@ -61,7 +61,7 @@ gistkillitems(IndexScanDesc scan)
 * read. killedItems could be not valid so LP_DEAD hints applying is not
 * safe.
 */
-   if (PageGetLSN(page) != so->curPageLSN)
+   if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
{
UnlockReleaseBuffer(buffer);
so->numKilled = 0;  /* reset counter */
@@ -384,7 +384,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem
*pageItem, double *myDistances,
 * safe to apply LP_DEAD hints to the page later. This allows us to drop
 * the pin for MVCC scans, which allows vacuum to avoid blocking.
 */
-   so->curPageLSN = PageGetLSN(page);
+   so->curPageLSN = BufferGetLSNAtomic(buffer);
Those two as well only use a shared lock.

@@ -249,7 +249,7 @@ gistbulkdelete(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats,

ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-   ptr->parentlsn = PageGetLSN(page);
+   ptr->parentlsn = BufferGetLSNAtomic(buffer);
ptr->next = stack->next;
stack->next = ptr;
Here also a shared lock is only taken, that's a VACUUM code path for
Gist indexes.

+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1224,7 +1224,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection
dir, OffsetNumber offnum)
 * safe to apply LP_DEAD hints to the page later.  This allows us to drop
 * the pin for MVCC scans, which allows vacuum to avoid blocking.
 */
-   so->currPos.lsn = PageGetLSN(page);
+   so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
Here the caller of _bt_readpage should have taken a lock, but the page
is not necessarily pinned. Still, _bt_getbuf makes sure that the pin
is done.

+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1772,7 +1772,7 @@ _bt_killitems(IndexScanDesc scan)
return;

page = BufferGetPage(buf);
-   if (PageGetLSN(page) == so->currPos.lsn)
+   if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
so->currPos.buf = buf;
Same here thanks to _bt_getbuf.

So those bits could be considered for integration.

Also, if I read the gist code correctly, there is one other case in
gistFindCorrectParent. And in btree, there is one occurence in
_bt_split. In XLogRecordAssemble, there could be more checks regarding
the locks taken on the buffer registered.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Robert Haas
On Mon, Sep 4, 2017 at 2:14 AM, Michael Paquier
 wrote:
A>> that would trip it. The latter part is still in progress, because I'm
> Well, PageGetLSN can be used in some hot code paths, xloginsert.c
> being one, so it does not seem wise to me to switch it to something
> more complicated than a macro, and also it is not bounded to any
> locking contracts now.

I don't see how turning it into a static inline function is worse.
We've been doing a fair amount of that lately and it generally makes
things better, not worse, often avoiding multiple evaluation hazards
at the same time it generates better machine code.

I find this patch sort of messy; in particular, the definition of
AssertPageIsLockedForLSN tries to reverse-engineer a buffer ID from a
Page, and that's sort of ugly.  But I think the concept of trying to
make sure that our code is adhering to necessary locking rules is a
pretty good one.

I think the first question we ought to be asking ourselves is whether
any of the PageGetLSN -> BufferGetLSNAtomic changes the patch
introduces are live bugs.  If they are, then we ought to fix those
separately (and probably back-patch).  If they are not, then we need
to think about how to adjust the patch so that it doesn't complain
about things that are in fact OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-02 Thread Daniel Gustafsson
> On 20 Sep 2017, at 00:29, Jacob Champion  wrote:
> 
> On Wed, Sep 6, 2017 at 8:37 AM, Jacob Champion  wrote:
>> On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier
>>  wrote:
>>> In short, it seems to me that this patch should be rejected in its
>>> current shape.
>> 
>> Is the half of the patch that switches PageGetLSN to
>> BufferGetLSNAtomic correct, at least?
> 
> Any further thoughts on this? If the BufferGetLSNAtomic fixes made
> here are not correct to begin with, then the rest of the patch is
> probably moot; I just want to double-check that that is the case.

Based on the discussions in this thread, I’m marking this patch Returned with
feedback. Please re-submit a new version in an upcoming commitfest when ready.

cheers ./daniel

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-19 Thread Jacob Champion
On Wed, Sep 6, 2017 at 8:37 AM, Jacob Champion  wrote:
> On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier
>  wrote:
>> In short, it seems to me that this patch should be rejected in its
>> current shape.
>
> Is the half of the patch that switches PageGetLSN to
> BufferGetLSNAtomic correct, at least?

Any further thoughts on this? If the BufferGetLSNAtomic fixes made
here are not correct to begin with, then the rest of the patch is
probably moot; I just want to double-check that that is the case.

--Jacob


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-06 Thread Jacob Champion
On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier
 wrote:
> On Wed, Sep 6, 2017 at 1:15 AM, Jacob Champion  wrote:
>> On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier
>>  wrote:
>>> +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
>>> +void
>>> +AssertPageIsLockedForLSN(Page page)
>>> +{
>>> From a design point of view, bufmgr.c is an API interface for buffer
>>> management on the backend-size, so just using FRONTEND there looks
>>> wrong to me (bufmgr.h is already bad enough IMO now).
>>
>> The comments suggested that bufmgr.h could be incidentally pulled into
>> frontend code through other headers. Or do you mean that I don't need
>> to check for FRONTEND in the C source file (since, I assume, it's only
>> ever being compiled for the backend)? I'm not sure what you mean by
>> "bufmgr.h is already bad enough".
>
> I mean that this should not become worse without a better reason. This
> patch makes it so.
>
>>> This bit in src/backend/access/transam/README may interest you:
>>> Note that we must only use PageSetLSN/PageGetLSN() when we know the action
>>> is serialised. Only Startup process may modify data blocks during recovery,
>>> so Startup process may execute PageGetLSN() without fear of serialisation
>>> problems. All other processes must only call PageSet/GetLSN when holding
>>> either an exclusive buffer lock or a shared lock plus buffer header lock,
>>> or be writing the data block directly rather than through shared buffers
>>> while holding AccessExclusiveLock on the relation.
>>>
>>> So I see no problem with the exiting caller.
>>
>> Is heap_xlog_visible only called during startup?
>
> Recovery is more exact. When replaying a XLOG_HEAP2_VISIBLE record.
>
>> If so, is there a
>> general rule for which locks are required during startup and which
>> aren't?
>
> You can surely say that a process is fine to read a variable without a
> lock if it is the only one updating it, other processes would need a
> lock to read a variable. In this case the startup process is the only
> one updating blocks, so that's at least one code path where the is no
> need to take a lock when looking at a page LSN with PageGetLSN.

Is there a way to programmatically determine whether or not we're in
such a situation? That could be added to the assertion.

The code here is pretty inconsistent on whether or not PageGetLSN is
called inside or outside a lock -- see XLogReadBufferForRedoExtended
for instance.

> Another example is pageinspect which works on a copy of a page and
> uses PageGetLSN, so no direct locks on the buffer page is needed.

And again -- this case should be correctly handled by the new
assertion. Copies of pages are not checked for locks; we can't recover
a buffer header from a page that isn't part of the BufferBlocks array.

> In short, it seems to me that this patch should be rejected in its
> current shape.

Is the half of the patch that switches PageGetLSN to
BufferGetLSNAtomic correct, at least?

> There is no need to put more constraints on a API which
> does not need more constraints and which is used widely by extensions.

The goal here is not to add more constraints, but to verify that the
existing constraints are followed. Perhaps this patch doesn't do that
well/correctly, but I want to make sure we're on the same page
regarding the end goal.

--Jacob


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-05 Thread Michael Paquier
On Wed, Sep 6, 2017 at 1:15 AM, Jacob Champion  wrote:
> On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier
>  wrote:
>> +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
>> +void
>> +AssertPageIsLockedForLSN(Page page)
>> +{
>> From a design point of view, bufmgr.c is an API interface for buffer
>> management on the backend-size, so just using FRONTEND there looks
>> wrong to me (bufmgr.h is already bad enough IMO now).
>
> The comments suggested that bufmgr.h could be incidentally pulled into
> frontend code through other headers. Or do you mean that I don't need
> to check for FRONTEND in the C source file (since, I assume, it's only
> ever being compiled for the backend)? I'm not sure what you mean by
> "bufmgr.h is already bad enough".

I mean that this should not become worse without a better reason. This
patch makes it so.

>> This bit in src/backend/access/transam/README may interest you:
>> Note that we must only use PageSetLSN/PageGetLSN() when we know the action
>> is serialised. Only Startup process may modify data blocks during recovery,
>> so Startup process may execute PageGetLSN() without fear of serialisation
>> problems. All other processes must only call PageSet/GetLSN when holding
>> either an exclusive buffer lock or a shared lock plus buffer header lock,
>> or be writing the data block directly rather than through shared buffers
>> while holding AccessExclusiveLock on the relation.
>>
>> So I see no problem with the exiting caller.
>
> Is heap_xlog_visible only called during startup?

Recovery is more exact. When replaying a XLOG_HEAP2_VISIBLE record.

> If so, is there a
> general rule for which locks are required during startup and which
> aren't?

You can surely say that a process is fine to read a variable without a
lock if it is the only one updating it, other processes would need a
lock to read a variable. In this case the startup process is the only
one updating blocks, so that's at least one code path where the is no
need to take a lock when looking at a page LSN with PageGetLSN.
Another example is pageinspect which works on a copy of a page and
uses PageGetLSN, so no direct locks on the buffer page is needed.

In short, it seems to me that this patch should be rejected in its
current shape. There is no need to put more constraints on a API which
does not need more constraints and which is used widely by extensions.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-05 Thread Jacob Champion
Hi Michael, thanks for the review!

On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier
 wrote:
>> While working on checksum support for GPDB, we noticed that several
>> callers of PageGetLSN() didn't follow the correct locking procedure.
>
> Well, PageGetLSN can be used in some hot code paths, xloginsert.c
> being one, so it does not seem wise to me to switch it to something
> more complicated than a macro,

If raw perf is an issue -- now that I've refactored the assertion into
its own function, I could probably push the implementation back into a
macro. Something like

#define PageGetLSN(page) (AssertPageIsLockedForLSN((Page) page),
PageXLogRecPtrGet(((PageHeader) page)->pd_lsn))

Callers would need to watch out for the double-evaluation, but that's
already documented with this group of macros.

> and also it is not bounded to any
> locking contracts now. For example, pageinspect works on a copy of the
> buffer, so that's one case where we just don't care. So we should it
> be tightened in Postgres?

The proposed patch doesn't tighten the contract in this case -- if the
buffer that's passed to PageGetLSN() isn't a shared buffer (i.e. if
it's not part of the BufferBlocks array), the assertion passes
unconditionally.

> That's the portion of the proposal I don't
> get. You could introduce a custom API for this purpose, isn't the
> origin of your problems with GPDB for checksums that data is
> replicated across many nodes so things need to be locked uniquely?

GPDB introduces additional problems, but I think this issue is
independent of anything GPDB-specific. One way to look at it: are the
changes in the proposed patch, from PageGetLSN to BufferGetLSNAtomic,
correct? If not, well, that's a good discussion to have. But if they
are correct, then why were they missed? and is there a way to help
future contributors out in the future? That's the goal of this patch;
I don't think it's something that a custom API solves.

>> - This change introduces a subtle source incompatibility: whereas
>> previously both Pages and PageHeaders could be passed to PageGetLSN(),
>> now callers must explicitly cast to Page if they don't already have
>> one.
>
> That would produce warnings for extensions, so people would likely
> catch that when compiling with a newer version, if they use -Werror...
>
>> - Is my use of FRONTEND here correct? The documentation made it seem
>> like some compilers could try to link against the
>> AssertPageIsLockedForLSN function, even if PageGetLSN was never
>> called.
>
> +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
> +void
> +AssertPageIsLockedForLSN(Page page)
> +{
> From a design point of view, bufmgr.c is an API interface for buffer
> management on the backend-size, so just using FRONTEND there looks
> wrong to me (bufmgr.h is already bad enough IMO now).

The comments suggested that bufmgr.h could be incidentally pulled into
frontend code through other headers. Or do you mean that I don't need
to check for FRONTEND in the C source file (since, I assume, it's only
ever being compiled for the backend)? I'm not sure what you mean by
"bufmgr.h is already bad enough".

>> - Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't
>> really know what the best way is to fix it. It explicitly unlocks the
>> buffer, with the comment that visibilitymap_set() needs to handle
>> locking itself, but then calls PageGetLSN() to determine whether
>> visibilitymap_set() needs to be called.
>
> This bit in src/backend/access/transam/README may interest you:
> Note that we must only use PageSetLSN/PageGetLSN() when we know the action
> is serialised. Only Startup process may modify data blocks during recovery,
> so Startup process may execute PageGetLSN() without fear of serialisation
> problems. All other processes must only call PageSet/GetLSN when holding
> either an exclusive buffer lock or a shared lock plus buffer header lock,
> or be writing the data block directly rather than through shared buffers
> while holding AccessExclusiveLock on the relation.
>
> So I see no problem with the exiting caller.

Is heap_xlog_visible only called during startup? If so, is there a
general rule for which locks are required during startup and which
aren't?

Currently there is no exception in the assertion made for startup
processes, so that's something that would need to be changed. Is there
a way to determine whether the current process considers itself to be
a startup process?

Thanks again!
--Jacob


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-04 Thread Michael Paquier
On Fri, Sep 1, 2017 at 3:53 AM, Jacob Champion  wrote:
> The patch is really two pieces: add the assertion, and fix the callers
> that would trip it. The latter part is still in progress, because I'm
> running into some places where I'm not sure what the correct way
> forward is.

I looked at your patch.

> While working on checksum support for GPDB, we noticed that several
> callers of PageGetLSN() didn't follow the correct locking procedure.

Well, PageGetLSN can be used in some hot code paths, xloginsert.c
being one, so it does not seem wise to me to switch it to something
more complicated than a macro, and also it is not bounded to any
locking contracts now. For example, pageinspect works on a copy of the
buffer, so that's one case where we just don't care. So we should it
be tightened in Postgres? That's the portion of the proposal I don't
get. You could introduce a custom API for this purpose, isn't the
origin of your problems with GPDB for checksums that data is
replicated across many nodes so things need to be locked uniquely?

> - This change introduces a subtle source incompatibility: whereas
> previously both Pages and PageHeaders could be passed to PageGetLSN(),
> now callers must explicitly cast to Page if they don't already have
> one.

That would produce warnings for extensions, so people would likely
catch that when compiling with a newer version, if they use -Werror...

> - Is my use of FRONTEND here correct? The documentation made it seem
> like some compilers could try to link against the
> AssertPageIsLockedForLSN function, even if PageGetLSN was never
> called.

+#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
+void
+AssertPageIsLockedForLSN(Page page)
+{
>From a design point of view, bufmgr.c is an API interface for buffer
management on the backend-size, so just using FRONTEND there looks
wrong to me (bufmgr.h is already bad enough IMO now).

> - Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't
> really know what the best way is to fix it. It explicitly unlocks the
> buffer, with the comment that visibilitymap_set() needs to handle
> locking itself, but then calls PageGetLSN() to determine whether
> visibilitymap_set() needs to be called.

This bit in src/backend/access/transam/README may interest you:
Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only Startup process may modify data blocks during recovery,
so Startup process may execute PageGetLSN() without fear of serialisation
problems. All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.

So I see no problem with the exiting caller.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-01 Thread Robert Haas
On Fri, Sep 1, 2017 at 12:24 PM, Jacob Champion  wrote:
> On Fri, Sep 1, 2017 at 9:12 AM, Robert Haas  wrote:
>> It's a good idea to add patches to commitfest.postgresql.org
>
> Hi Robert, I added it yesterday as
> https://commitfest.postgresql.org/14/1279/ . Let me know if I need to
> touch anything up there.

Nah, looks fine, I just somehow missed it when I looked the first time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-01 Thread Jacob Champion
On Fri, Sep 1, 2017 at 9:12 AM, Robert Haas  wrote:
> It's a good idea to add patches to commitfest.postgresql.org

Hi Robert, I added it yesterday as
https://commitfest.postgresql.org/14/1279/ . Let me know if I need to
touch anything up there.

Thanks!
--Jacob


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-09-01 Thread Robert Haas
On Thu, Aug 31, 2017 at 2:53 PM, Jacob Champion  wrote:
> While working on checksum support for GPDB, we noticed that several
> callers of PageGetLSN() didn't follow the correct locking procedure.
> To try to help ferret out present and future mistakes, we added an
> assertion to PageGetLSN() that checks whether those locks were being
> held correctly. This patch is a first-draft attempt to port that
> assertion back up to postgres master, based on work by Asim Praveen,
> Ashwin Agrawal, and myself.
>
> The patch is really two pieces: add the assertion, and fix the callers
> that would trip it. The latter part is still in progress, because I'm
> running into some places where I'm not sure what the correct way
> forward is.
>
> (I'm a newbie to this list and this code base, so please don't
> hesitate to correct me on anything, whether that's code- or
> culture-related!)

It's a good idea to add patches to commitfest.postgresql.org

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-08-31 Thread Jacob Champion
Hello all,

While working on checksum support for GPDB, we noticed that several
callers of PageGetLSN() didn't follow the correct locking procedure.
To try to help ferret out present and future mistakes, we added an
assertion to PageGetLSN() that checks whether those locks were being
held correctly. This patch is a first-draft attempt to port that
assertion back up to postgres master, based on work by Asim Praveen,
Ashwin Agrawal, and myself.

The patch is really two pieces: add the assertion, and fix the callers
that would trip it. The latter part is still in progress, because I'm
running into some places where I'm not sure what the correct way
forward is.

(I'm a newbie to this list and this code base, so please don't
hesitate to correct me on anything, whether that's code- or
culture-related!)

= Notes/Observations =

- This change introduces a subtle source incompatibility: whereas
previously both Pages and PageHeaders could be passed to PageGetLSN(),
now callers must explicitly cast to Page if they don't already have
one.

- I originally tried to put (and the GPDB patch succeeds in putting)
the entire assertion implementation into PageGetLSN in bufpage.h. That
seems unworkable here -- buf_internals.h is no longer publicized, and
the assertion needs those internals to perform its checks. So I moved
the assertion into its own helper function in bufmgr.c. If assertions
are disabled, that helper declaration gets turned into an empty macro,
so there shouldn't be a performance hit.

= Open Questions =

- Is my use of FRONTEND here correct? The documentation made it seem
like some compilers could try to link against the
AssertPageIsLockedForLSN function, even if PageGetLSN was never
called.

- Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't
really know what the best way is to fix it. It explicitly unlocks the
buffer, with the comment that visibilitymap_set() needs to handle
locking itself, but then calls PageGetLSN() to determine whether
visibilitymap_set() needs to be called.

- TestForOldSnapshot is also problematic. The function is called in
places where only a shared lock is held, so it needs to hold the
header spinlock at least some of the time, but we only pass the Page
as an argument. I could do the same "find buffer descriptor from page
pointer" logic that we utilize in the assertion implementation, but is
there a better way?

- Should I try to add this assertion to PageSetLSN() as well? We were
focused mostly on the GetLSN side of things, since that was the more
risky piece of our backport. PageSetLSN is called from many more
places, and the review there would be much larger.

= Known Issues =

- This approach doesn't seem to work when checksums are disabled. In
that case, BufferGetLSNAtomic isn't actually atomic, so it seems to
always trip the assertion. I'm not sure of the best way to proceed --
is it really correct not to follow the locking contract if checksums
are disabled?

What do you think? Is this something worth pursuing? Any and all
comments welcome.

Thanks!
--Jacob


PageGetLSN-assert-locks-held.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers