Re: Is this a problem in GenericXLogFinish()?

2024-04-10 Thread Michael Paquier
On Wed, Apr 10, 2024 at 03:28:22PM +0530, Amit Kapila wrote:
> I can understand this comment as I am aware of this code, but I am not
> sure it would be equally easy for people looking at this code for the
> first time. One may try to find the equivalent assertion in
> _hash_freeovflpage(). The alternative could be: "Ensure that the
> required flags are set when there are no tuples. See
> _hash_freeovflpage().". I am also fine if you prefer to go with your
> proposed comment.

Yes, I can see your point about why that's confusing.  Your suggestion
is much better, so after a second look I've used your version of the
comment and applied the patch on HEAD.

I am wondering if we have other problems like that with dirty buffers
at replay.  Perhaps I should put my nose more onto the replay paths
and extend these automated checks with wal_consistency_checking.
--
Michael


signature.asc
Description: PGP signature


Re: Is this a problem in GenericXLogFinish()?

2024-04-10 Thread Amit Kapila
On Wed, Apr 10, 2024 at 1:27 PM Michael Paquier  wrote:
>
> On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
> >> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> >> I'm slightly annoyed by the fact that there is no check on
> >> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> >> show the symmetry between the insert and replay paths.  Shouldn't
> >> there be at least an assert for that in the branch when there are no
> >> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
> >> just before updating the hash page's opaque area when
> >> is_prev_bucket_same_wrt.
> >
> > Indeed; I added an Assert() in the else part. Is it the same as what you expected?
>
> Yep, close enough.  Thanks to the insert path, we know that there will
> be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
> and that is the replay branch where the assertion is added.
>

It is fine to have an assertion for this path.

+ else
+ {
+ /*
+ * See _hash_freeovflpage() which has a similar assertion when
+ * there are no tuples.
+ */
+ Assert(xldata->is_prim_bucket_same_wrt ||
+xldata->is_prev_bucket_same_wrt);

I can understand this comment as I am aware of this code, but I am not
sure it would be equally easy for people looking at this code for the
first time. One may try to find the equivalent assertion in
_hash_freeovflpage(). The alternative could be: "Ensure that the
required flags are set when there are no tuples. See
_hash_freeovflpage().". I am also fine if you prefer to go with your
proposed comment.

Otherwise, the patch looks good to me.

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2024-04-10 Thread Michael Paquier
On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
>> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
>> I'm slightly annoyed by the fact that there is no check on
>> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
>> show the symmetry between the insert and replay paths.  Shouldn't
>> there be at least an assert for that in the branch when there are no
>> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
>> just before updating the hash page's opaque area when
>> is_prev_bucket_same_wrt.
> 
> Indeed; I added an Assert() in the else part. Is it the same as what you expected?

Yep, close enough.  Thanks to the insert path, we know that there will
be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
and that is the replay branch where the assertion is added.

>> I've been thinking about ways to make such cases detectable in an
>> automated fashion.  The best choice would be
>> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
>> copy of the block image stored in WAL before applying the page masking
>> which would mask the LSN.  A problem, unfortunately, is that we would
>> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
>> flag so we would be able to track if the block rebuilt from the record
>> has the *same* LSN as the copy used for the consistency check.  So
>> this edge consistency case would come at a cost, I am afraid, and the
>> 8 bits of bimg_info are precious :/
> 
> I could not conclude that the change has more benefit than cost, so I did
> nothing about it.

That does not prevent writing an implementation that can be used for some
test validation in the context of this thread.  Using this idea, I
have done the attached 0002.  This is not something to merge into the
tree, of course, just a toy to:
- Validate the fix for the problem we know.
- More broadly, check whether other areas covered by the main
regression test suite have similar problems.

And from what I can see, applying 0002 without 0001 causes the
following test to fail as the standby chokes on an inconsistent LSN
with a FATAL, because the LSN of the page image coming from the primary
and the LSN of the replayed page don't match.  Here is a command
to reproduce the problem:
cd src/test/recovery/ && \
  PG_TEST_EXTRA=wal_consistency_checking \
  PROVE_TESTS=t/027_stream_regress.pl make check

And then in the logs you'd see stuff like:
2024-04-10 16:52:21.844 JST [2437] FATAL:  inconsistent page LSN
replayed 0/A7E5CD18 primary 0/A7E56348
2024-04-10 16:52:21.844 JST [2437] CONTEXT:  WAL redo at 0/A7E59F68
for Hash/SQUEEZE_PAGE: prevblkno 28, nextblkno 4294967295, ntups 0,
is_primary T; blkref #1: rel 1663/16384/25434, blk 7 FPW; blkref #2:
rel 1663/16384/25434, blk 29 FPW; blkref #3: rel 1663/16384/25434, blk
28 FPW; blkref #5: rel 1663/16384/25434, blk 9 FPW; blkref #6: rel
1663/16384/25434, blk 0 FPW

I don't see other areas with a similar problem, at least to the extent
exercised by the core regression tests.  That's my way of saying that
your patch looks good to me and that I'm planning to apply it to fix
the issue.

This shows me a more interesting issue unrelated to this thread:
027_stream_regress.pl would be stuck if the standby finds an
inconsistent page under wal_consistency_checking.  This needs to be
fixed before I'm able to create a buildfarm animal with this setup.
I'll spawn a new thread about that tomorrow.
--
Michael
From 4c6597f2ce57439696aecdbda08b374857ab715d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 10 Apr 2024 16:47:29 +0900
Subject: [PATCH v3 1/2] Fix inconsistency with replay of hash squeeze record
 for clean buffers

Blah.
---
 src/backend/access/hash/hash_xlog.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..d7ba74579b 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_wbuf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, );
 
@@ -695,6 +696,17 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 
 ninserted++;
 			}
+
+			mod_wbuf = true;
+		}
+		else
+		{
+			/*
+			 * See _hash_freeovflpage() which has a similar assertion when
+			 * there are no tuples.
+			 */
+			Assert(xldata->is_prim_bucket_same_wrt ||
+   xldata->is_prev_bucket_same_wrt);
 		}
 
 		/*
@@ -711,10 +723,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_wbuf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_wbuf)
+		{
+			PageSetLSN(writepage, lsn);
+			MarkBufferDirty(writebuf);
+		}

RE: Is this a problem in GenericXLogFinish()?

2024-04-09 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> > Thanks for pointing this out. Yes, we fixed the behavior in commit aa5edbe37,
> > but we missed the redo case. I made a fix patch based on the suggestion [1].
> 
> +   boolmod_buf = false;
> 
> Perhaps you could name that mod_wbuf, to be consistent with the WAL
> insert path.

Right, fixed.

> I'm slightly annoyed by the fact that there is no check on
> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> show the symmetry between the insert and replay paths.  Shouldn't
> there be at least an assert for that in the branch when there are no
> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
> just before updating the hash page's opaque area when
> is_prev_bucket_same_wrt.

Indeed; I added an Assert() in the else part. Is it the same as what you expected?

> I've been thinking about ways to make such cases detectable in an
> automated fashion.  The best choice would be
> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
> copy of the block image stored in WAL before applying the page masking
> which would mask the LSN.  A problem, unfortunately, is that we would
> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
> flag so we would be able to track if the block rebuilt from the record
> has the *same* LSN as the copy used for the consistency check.  So
> this edge consistency case would come at a cost, I am afraid, and the
> 8 bits of bimg_info are precious :/

I could not conclude that the change has more benefit than cost, so I did
nothing about it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2_add_modbuf_flag.diff
Description: v2_add_modbuf_flag.diff


Re: Is this a problem in GenericXLogFinish()?

2024-04-09 Thread Michael Paquier
On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> Thanks for pointing this out. Yes, we fixed the behavior in commit aa5edbe37,
> but we missed the redo case. I made a fix patch based on the suggestion [1].

+   boolmod_buf = false;

Perhaps you could name that mod_wbuf, to be consistent with the WAL
insert path.

I'm slightly annoyed by the fact that there is no check on
is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
show the symmetry between the insert and replay paths.  Shouldn't
there be at least an assert for that in the branch when there are no
tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
just before updating the hash page's opaque area when
is_prev_bucket_same_wrt.

I've been thinking about ways to make such cases detectable in an
automated fashion.  The best choice would be
verifyBackupPageConsistency(), just after RestoreBlockImage() on the
copy of the block image stored in WAL before applying the page masking
which would mask the LSN.  A problem, unfortunately, is that we would
need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
flag so we would be able to track if the block rebuilt from the record
has the *same* LSN as the copy used for the consistency check.  So
this edge consistency case would come at a cost, I am afraid, and the
8 bits of bimg_info are precious :/
--
Michael




Re: Is this a problem in GenericXLogFinish()?

2024-04-05 Thread Michael Paquier
On Fri, Apr 05, 2024 at 01:26:29PM +0900, Michael Paquier wrote:
> On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
>> On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
>> > Thanks for the report and looking into it. Pushed!
>> 
>> Thanks Amit for the commit and Kuroda-san for the patch!
> 
> I have been pinged about this thread and I should have looked a bit
> closer, but wait a minute here.

I have forgotten to mention that Zubeyr Eryilmaz, a colleague, should
be credited here.
--
Michael




RE: Is this a problem in GenericXLogFinish()?

2024-04-05 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case.  Isn't that incorrect?  It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?

Thanks for pointing this out. Yes, we fixed the behavior in commit aa5edbe37,
but we missed the redo case. I made a fix patch based on the suggestion [1].

The part below contains my analysis and how I reproduced the issue.
I am posting it to clarify the issue for others.


=

## Pointed issue

Assume the case where
- xlrec.ntups is 0,
- xlrec.is_prim_bucket_same_wrt is true, and
- xlrec.is_prev_bucket_same_wrt is false.
This means that there are several overflow pages and the tail dead page is
being removed. In this case, the primary page does not have to be modified.

During normal operation, the removal is done in _hash_freeovflpage().
If the above condition is met, mod_wbuf is not set to true, so PageSetLSN() is
skipped.

During recovery, the squeeze and removal are done in
hash_xlog_squeeze_page(), where PageSetLSN() is called unconditionally.
Michael pointed out that PageSetLSN() should be avoided in this case as well.

## Analysis

IIUC this is the same issue as the one pointed out in [2]:
PageSetLSN() should be called only when the page is really modified.
In the case under discussion, wbuf is not modified (we are just removing the
tail entry), so there is no need to assign an LSN to it. However, we missed
updating the redo case accordingly.

## How to reproduce

I could reproduce the case below steps.
1. Added the debug log like [3]
2. Constructed a physical replication.
3. Ran hash_index.sql
4. Found the added debug log.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1%2BayneM-8mSRC0iWpDMnm39EwDoqgiOCSqrrMLcdnUnAA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZbyVVG_7eW3YD5-A%40paquier.xyz
[3]:
```
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -713,6 +713,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
writeopaque->hasho_nextblkno = xldata->nextblkno;
}

+if (xldata->ntups == 0 &&
+xldata->is_prim_bucket_same_wrt &&
+!xldata->is_prev_bucket_same_wrt)
+elog(LOG, "XXX no need to set PageSetLSN");
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



add_modbuf_flag.diff
Description: add_modbuf_flag.diff


Re: Is this a problem in GenericXLogFinish()?

2024-04-04 Thread Amit Kapila
On Fri, Apr 5, 2024 at 9:56 AM Michael Paquier  wrote:
>
> On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
> > On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> > > Thanks for the report and looking into it. Pushed!
> >
> > Thanks Amit for the commit and Kuroda-san for the patch!
>
> I have been pinged about this thread and I should have looked a bit
> closer, but wait a minute here.
>
> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case.  Isn't that incorrect?
>

Agreed. We will try to reproduce this.

>  It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?
>

I think we can have a flag like mod_buf and set it in both of the
conditions if (xldata->ntups > 0) and if
(xldata->is_prev_bucket_same_wrt). If the flag is set, then we can set
the LSN and mark the buffer dirty.

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2024-04-04 Thread Michael Paquier
On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
> On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> > Thanks for the report and looking into it. Pushed!
> 
> Thanks Amit for the commit and Kuroda-san for the patch!

I have been pinged about this thread and I should have looked a bit
closer, but wait a minute here.

There is still some divergence between the code path of
_hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
squeezing a page: we do not set the LSN of the write buffer if
(xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
!xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
but at replay we call PageSetLSN() on the write buffer and mark it
dirty in this case.  Isn't that incorrect?  It seems to me that we
should be able to make the conditional depending on the state of the
xl_hash_squeeze_page record, no?
--
Michael




Re: Is this a problem in GenericXLogFinish()?

2024-02-07 Thread Michael Paquier
On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> Thanks for the report and looking into it. Pushed!

Thanks Amit for the commit and Kuroda-san for the patch!
--
Michael




Re: Is this a problem in GenericXLogFinish()?

2024-02-07 Thread Amit Kapila
On Mon, Feb 5, 2024 at 1:33 PM Michael Paquier  wrote:
>
> On Mon, Feb 05, 2024 at 07:57:09AM +, Hayato Kuroda (Fujitsu) wrote:
> > You are right. Based on the previous discussions, PageSetLSN() must be
> > called after MarkBufferDirty(). REGBUF_NO_CHANGE has been introduced for
> > skipping these requirements. Definitively, no_change buffers must not be
> > PageSetLSN()'d. Other pages, e.g., metabuf, already follow this rule.
>
> At quick glance, this v2 seems kind of right to me: you are setting
> the page LSN only when the page is registered in the record and
> actually dirtied.
>

Thanks for the report and looking into it. Pushed!

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2024-02-05 Thread Michael Paquier
On Mon, Feb 05, 2024 at 07:57:09AM +, Hayato Kuroda (Fujitsu) wrote:
> You are right. Based on the previous discussions, PageSetLSN() must be called
> after MarkBufferDirty(). REGBUF_NO_CHANGE has been introduced for skipping
> these requirements. Definitively, no_change buffers must not be
> PageSetLSN()'d. Other pages, e.g., metabuf, already follow this rule.

At quick glance, this v2 seems kind of right to me: you are setting
the page LSN only when the page is registered in the record and
actually dirtied.
--
Michael




RE: Is this a problem in GenericXLogFinish()?

2024-02-04 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> 
> @@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
>   if (!xlrec.is_prev_bucket_same_wrt)
>   wbuf_flags |= REGBUF_NO_CHANGE;
>   XLogRegisterBuffer(1, wbuf, wbuf_flags);
> +
> + /* Track the registration status for later use */
> + wbuf_registered = true;
>   }
> 
>   XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
> @@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
> 
>   recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
> 
> - PageSetLSN(BufferGetPage(wbuf), recptr);
> + /* Set LSN to wbuf page buffer only when it is being registered */
> + if (wbuf_registered)
> + PageSetLSN(BufferGetPage(wbuf), recptr);
> 
> Why set LSN when the page is not modified (say when we use the flag
> REGBUF_NO_CHANGE)?  I think you need to use a flag mod_wbuf and set it
> in appropriate places during register buffer calls.

You are right. Based on the previous discussions, PageSetLSN() must be called
after MarkBufferDirty(). REGBUF_NO_CHANGE has been introduced for skipping
these requirements. Definitively, no_change buffers must not be PageSetLSN()'d.
Other pages, e.g., metabuf, already follow this rule.

I updated the patch based on the requirement.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2_avoid_registration.patch
Description: v2_avoid_registration.patch


Re: Is this a problem in GenericXLogFinish()?

2024-02-04 Thread Amit Kapila
On Mon, Feb 5, 2024 at 10:00 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> >
> > Amit, this has been applied as of 861f86beea1c, and I got pinged about
> > the fact this triggers inconsistencies because we always set the LSN
> > of the write buffer (wbuf in _hash_freeovflpage) but
> > XLogRegisterBuffer() would *not* be called when the two following
> > conditions happen:
> > - When xlrec.ntups <= 0.
> > - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
> >
> > And it seems to me that there is still a bug here: there should be no
> > point in setting the LSN on the write buffer if we don't register it
> > in WAL at all, no?
>
> Thanks for pointing that out; I agree with you. PSA the patch for
> diagnosing the issue.
>

@@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
  if (!xlrec.is_prev_bucket_same_wrt)
  wbuf_flags |= REGBUF_NO_CHANGE;
  XLogRegisterBuffer(1, wbuf, wbuf_flags);
+
+ /* Track the registration status for later use */
+ wbuf_registered = true;
  }

  XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
@@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,

  recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

- PageSetLSN(BufferGetPage(wbuf), recptr);
+ /* Set LSN to wbuf page buffer only when it is being registered */
+ if (wbuf_registered)
+ PageSetLSN(BufferGetPage(wbuf), recptr);

Why set LSN when the page is not modified (say when we use the flag
REGBUF_NO_CHANGE)?  I think you need to use a flag mod_wbuf and set it
in appropriate places during register buffer calls.

-- 
With Regards,
Amit Kapila.




RE: Is this a problem in GenericXLogFinish()?

2024-02-04 Thread Hayato Kuroda (Fujitsu)
Dear Michael, Amit,

> 
> Amit, this has been applied as of 861f86beea1c, and I got pinged about
> the fact this triggers inconsistencies because we always set the LSN
> of the write buffer (wbuf in _hash_freeovflpage) but
> XLogRegisterBuffer() would *not* be called when the two following
> conditions happen:
> - When xlrec.ntups <= 0.
> - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
> 
> And it seems to me that there is still a bug here: there should be no
> point in setting the LSN on the write buffer if we don't register it
> in WAL at all, no?

Thanks for pointing that out; I agree with you. PSA the patch for diagnosing
the issue.

This patch avoids the inconsistency due to the LSN setting and emits a debug
LOG when we hit such a case. I executed hash_index.sql and confirmed that the
log was output [1]. This means the current tests already have a workload that
meets the conditions below:

 - the overflow page has no tuples (xlrec.ntups is 0),
 - the to-be-written page - wbuf - is not the primary page
   (xlrec.is_prim_bucket_same_wrt is false), and
 - the to-be-written buffer is not next to the overflow page
   (xlrec.is_prev_bucket_same_wrt is false)

So, I think my patch (after removing the elog(...) part) can fix the issue.
Thoughts?

[1]:
```
LOG:  XXX: is_wbuf_registered: false
CONTEXT:  while vacuuming index "hash_cleanup_index" of relation 
"public.hash_cleanup_heap"
STATEMENT:  VACUUM hash_cleanup_heap;
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



avoid_registration.patch
Description: avoid_registration.patch


Re: Is this a problem in GenericXLogFinish()?

2024-02-02 Thread Amit Kapila
On Fri, Feb 2, 2024 at 12:40 PM Michael Paquier  wrote:
>
> Amit, this has been applied as of 861f86beea1c, and I got pinged about
> the fact this triggers inconsistencies because we always set the LSN
> of the write buffer (wbuf in _hash_freeovflpage) but
> XLogRegisterBuffer() would *not* be called when the two following
> conditions happen:
> - When xlrec.ntups <= 0.
> - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
>
> And it seems to me that there is still a bug here: there should be no
> point in setting the LSN on the write buffer if we don't register it
> in WAL at all, no?
>

Right, I see the problem. I'll look into it further.

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2024-02-01 Thread Michael Paquier
On Fri, Dec 01, 2023 at 03:27:33PM +0530, Amit Kapila wrote:
> Pushed!

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?
--
Michael




Re: Is this a problem in GenericXLogFinish()?

2023-12-01 Thread Amit Kapila
On Thu, Nov 30, 2023 at 4:30 PM Alexander Lakhin  wrote:
>
> 30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:
> > Again, good catch. Here is my analysis and fix patch.
> > I think it is sufficient to add an initialization for writebuf.
>
> I agree with the change.
>

Pushed!

-- 
With Regards,
Amit Kapila.




RE: Is this a problem in GenericXLogFinish()?

2023-11-30 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> I think it is acceptable, especially if it increases code
> coverage. Can you check that?

PSA the screenshot. I ran "PROVE_TESTS=t/027* make check" in
src/test/recovery and generated a coverage report.
Line 661 was not hit on HEAD, but the report shows that it is now executed.

[1]: 
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Is this a problem in GenericXLogFinish()?

2023-11-30 Thread Alexander Lakhin

Hello Kuroda-san,

30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.


I agree with the change. It aligns hash_xlog_squeeze_page() with
hash_xlog_move_page_contents() in regard to initialization of writebuf.
Interestingly enough, the discrepancy exists since these functions
appeared with c11453ce0, and I can't see what justified adding the
initialization inside hash_xlog_move_page_contents() only.
So I think that if this doesn't affect performance, having aligned coding
(writebuf initialized in both functions) is better.


I want to add a test case for it as well. I've tested on my env and found that
the proposed number of tuples seems sufficient.
(We must fill the primary bucket, so the initial 500 are needed. Also, we must
add many dead pages to trigger the squeeze operation. The final page seems OK
with a smaller value.)

I compared the execution time before/after patching, and they are not so
different (1077 ms -> 1125 ms).

What do you think?


I can confirm that the test case proposed demonstrates what is expected
and the duration increase is tolerable, as to me.

Best regards,
Alexander




Re: Is this a problem in GenericXLogFinish()?

2023-11-30 Thread Amit Kapila
On Thu, Nov 30, 2023 at 12:58 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> >
> > Good catch, thank you for reporting! I will investigate it further and
> > post my analysis.
> >
>
> Again, good catch. Here is my analysis and fix patch.
> I think it is sufficient to add an initialization for writebuf.
>
> In the reported case, neither is_prim_bucket_same_wrt nor
> is_prev_bucket_same_wrt is set in the WAL record, and ntups is also zero.
> This means that wbuf is not registered in the WAL record on the primary side
> (see [1]). Also, there is no reason to read (and lock) the other buffer page,
> because we do not modify it. Based on that, I think that just initializing
> writebuf as InvalidBuffer is sufficient.
>

Agreed.

>
> I want to add a test case for it as well. I've tested on my env and found
> that the proposed number of tuples seems sufficient.
> (We must fill the primary bucket, so the initial 500 are needed. Also, we
> must add many dead pages to trigger the squeeze operation. The final page
> seems OK with a smaller value.)
>
> I compared the execution time before/after patching, and they are not so
> different (1077 ms -> 1125 ms).
>
>

In my environment, it increased from 375ms to 385ms (median of five
runs). I think it is acceptable, especially if it increases code
coverage. Can you check that?

-- 
With Regards,
Amit Kapila.




RE: Is this a problem in GenericXLogFinish()?

2023-11-29 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> 
> Good catch, thank you for reporting! I will investigate it further and
> post my analysis.
>

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

In the reported case, neither is_prim_bucket_same_wrt nor
is_prev_bucket_same_wrt is set in the WAL record, and ntups is also zero.
This means that wbuf is not registered in the WAL record on the primary side
(see [1]). Also, there is no reason to read (and lock) the other buffer page,
because we do not modify it. Based on that, I think that just initializing
writebuf as InvalidBuffer is sufficient.


I want to add a test case for it as well. I've tested on my env and found that
the proposed number of tuples seems sufficient.
(We must fill the primary bucket, so the initial 500 are needed. Also, we must
add many dead pages to trigger the squeeze operation. The final page seems OK
with a smaller value.)

I compared the execution time before/after patching, and they are not so
different (1077 ms -> 1125 ms).

What do you think?

[1]:
```
	else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
	{
		uint8		wbuf_flags;

		/*
		 * A write buffer needs to be registered even if no tuples are
		 * added to it to ensure that we can acquire a cleanup lock on it
		 * if it is the same as primary bucket buffer or update the
		 * nextblkno if it is same as the previous bucket buffer.
		 */
		Assert(xlrec.ntups == 0);

		wbuf_flags = REGBUF_STANDARD;
		if (!xlrec.is_prev_bucket_same_wrt)
			wbuf_flags |= REGBUF_NO_CHANGE;
		XLogRegisterBuffer(1, wbuf, wbuf_flags);
	}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



initialize_writebuf.patch
Description: initialize_writebuf.patch


RE: Is this a problem in GenericXLogFinish()?

2023-11-29 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> 
> I've discovered that that patch introduced a code path leading to an
> uninitialized memory access.
> With the following addition to hash_index.sql:
>   -- Fill overflow pages by "dead" tuples.
>   BEGIN;
>   INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
>   ROLLBACK;
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
> 
> make check -C src/test/recovery/ PROVE_TESTS="t/027*"
> when executed under Valgrind, triggers:
> ==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised value(s)
> ==00:00:02:30.285 97744==at 0x227585: BufferIsValid (bufmgr.h:303)
> ==00:00:02:30.285 97744==by 0x227585: hash_xlog_squeeze_page (hash_xlog.c:781)
> ==00:00:02:30.285 97744==by 0x228133: hash_redo (hash_xlog.c:1083)
> ==00:00:02:30.285 97744==by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
> ==00:00:02:30.285 97744==by 0x2C5C52: PerformWalRecovery (xlogrecovery.c:1774)
> ==00:00:02:30.285 97744==by 0x2B63A1: StartupXLOG (xlog.c:5559)
> ==00:00:02:30.285 97744==by 0x558165: StartupProcessMain (startup.c:282)
> ==00:00:02:30.285 97744==by 0x54DFE8: AuxiliaryProcessMain (auxprocess.c:141)
> ==00:00:02:30.285 97744==by 0x5546B0: StartChildProcess (postmaster.c:5331)
> ==00:00:02:30.285 97744==by 0x557A53: PostmasterMain (postmaster.c:1458)
> ==00:00:02:30.285 97744==by 0x4720C2: main (main.c:198)
> ==00:00:02:30.285 97744==
> (in 027_stream_regress_standby_1.log)
> 
> That is, when line
> https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
> is reached, writebuf stays uninitialized.

Good catch, thank you for reporting! I will investigate it further and post
my analysis.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Is this a problem in GenericXLogFinish()?

2023-11-29 Thread Alexander Lakhin

Hello,

13.11.2023 20:21, Robert Haas wrote:

On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
 wrote:

Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:



I've discovered that that patch introduced a code path leading to an
uninitialized memory access.
With the following addition to hash_index.sql:
 -- Fill overflow pages by "dead" tuples.
 BEGIN;
 INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
 ROLLBACK;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;

make check -C src/test/recovery/ PROVE_TESTS="t/027*"
when executed under Valgrind, triggers:
==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised 
value(s)
==00:00:02:30.285 97744==    at 0x227585: BufferIsValid (bufmgr.h:303)
==00:00:02:30.285 97744==    by 0x227585: hash_xlog_squeeze_page 
(hash_xlog.c:781)
==00:00:02:30.285 97744==    by 0x228133: hash_redo (hash_xlog.c:1083)
==00:00:02:30.285 97744==    by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
==00:00:02:30.285 97744==    by 0x2C5C52: PerformWalRecovery 
(xlogrecovery.c:1774)
==00:00:02:30.285 97744==    by 0x2B63A1: StartupXLOG (xlog.c:5559)
==00:00:02:30.285 97744==    by 0x558165: StartupProcessMain (startup.c:282)
==00:00:02:30.285 97744==    by 0x54DFE8: AuxiliaryProcessMain 
(auxprocess.c:141)
==00:00:02:30.285 97744==    by 0x5546B0: StartChildProcess (postmaster.c:5331)
==00:00:02:30.285 97744==    by 0x557A53: PostmasterMain (postmaster.c:1458)
==00:00:02:30.285 97744==    by 0x4720C2: main (main.c:198)
==00:00:02:30.285 97744==
(in 027_stream_regress_standby_1.log)

That is, when line
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
is reached, writebuf stays uninitialized.

Best regards,
Alexander




Re: Is this a problem in GenericXLogFinish()?

2023-11-13 Thread Amit Kapila
On Mon, Nov 13, 2023 at 10:51 PM Robert Haas  wrote:
>
> On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
>  wrote:
> > Moved.
>
> I see that this patch was committed, but I'm not very convinced that
> the approach is correct. The comment says this:
>
> +   /*
> +* A write buffer needs to be registered even if no tuples are
> +* added to it to ensure that we can acquire a cleanup lock on it
> +* if it is the same as primary bucket buffer or update the
> +* nextblkno if it is same as the previous bucket buffer.
> +*/
>
> But surely if the buffer is the same as one of those others, then it's
> registered anyway,

I don't think it's registered for the others. For example, consider the
case when prevpage and the writepage are the same (aka
xlrec.is_prev_bucket_same_wrt is true): it won't be registered in
the other code path (see comment [1]).

>
> and if it isn't, then it doesn't need to be.
>

In the previous example, we need it to update the nextblkno during replay.

I am not sure if I understand the scenario you are worried about, so
if my response doesn't address your concern, can you please explain a
bit more about the scenario you have in mind?

[1] -
/*
 * If prevpage and the writepage (block in which we are moving tuples
 * from overflow) are same, then no need to separately register
 * prevpage.  During replay, we can directly update the nextblock in
 * writepage.
 */

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2023-11-13 Thread Robert Haas
On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
 wrote:
> Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:

+   /*
+* A write buffer needs to be registered even if no tuples are
+* added to it to ensure that we can acquire a cleanup lock on it
+* if it is the same as primary bucket buffer or update the
+* nextblkno if it is same as the previous bucket buffer.
+*/

But surely if the buffer is the same as one of those others, then it's
registered anyway, and if it isn't, then it doesn't need to be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Is this a problem in GenericXLogFinish()?

2023-11-12 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thanks for reviewing! PSA new version patch.

> > > Next we should add some test codes. I will continue considering but please
> post
> > > anything
> > > If you have idea.
> >
> > And I did, PSA the patch. This patch adds two parts in hash_index.sql.
> >
> > In the first part, the primary bucket page is filled by live tuples and some
> overflow
> > pages are by dead ones. This leads removal of overflow pages without moving
> tuples.
> > HEAD will crash if this were executed, which is already reported on hackers.
> >
> 
> +-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
> +CHECKPOINT;
> +VACUUM hash_cleanup_heap;
> 
> Why do we need CHECKPOINT in the above test?

The CHECKPOINT command is needed to write the hash pages to disk. IIUC, if the
command is omitted, none of the tuples are recorded to the hash index *file*,
so the squeeze would not occur.

You can verify this by 1) reverting the changes to hashovfl.c and then 2) removing
the CHECKPOINT command. The server crash would then not occur.

> > The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt ==
> false &&
> > is_prev_bucket_same_wrt == true), which is never done before.
> >
> 
> +-- Insert few tuples, the primary bucket page will not satisfy
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
> 
> What do you mean by the second part of the sentence (the primary
> bucket page will not satisfy)?

I meant to say that the primary bucket page can still hold more tuples. Maybe I
should say "will not be full". Reworded.

> Few other minor comments:
> ===
> 1. Can we use ROLLBACK instead of ABORT in the tests?

Changed. IIRC they have the same meaning, but it seems that most of the sql files
use "ROLLBACK".

> 2.
> - for (i = 0; i < nitups; i++)
> + for (int i = 0; i < nitups; i++)
> 
> I don't think there is a need to make this unrelated change.

Reverted. I thought that the variable i is used only when nitups > 0, so its
scope could be narrowed, but that change does not belong in this patch.

> 3.
> + /*
> + * A write buffer needs to be registered even if no tuples are added to
> + * it to ensure that we can acquire a cleanup lock on it if it is the
> + * same as primary bucket buffer or update the nextblkno if it is same
> + * as the previous bucket buffer.
> + */
> + else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
> + {
> + uint8 wbuf_flags;
> +
> + Assert(xlrec.ntups == 0);
> 
> Can we move this comment inside else if, just before Assert?

Moved.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



fix_hash_squeeze_wal_5.patch
Description: fix_hash_squeeze_wal_5.patch


Re: Is this a problem in GenericXLogFinish()?

2023-11-12 Thread Amit Kapila
On Fri, Nov 10, 2023 at 9:17 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > Next we should add some test codes. I will continue considering but please 
> > post
> > anything
> > If you have idea.
>
> And I did, PSA the patch. This patch adds two parts in hash_index.sql.
>
> In the first part, the primary bucket page is filled by live tuples and some 
> overflow
> pages are by dead ones. This leads removal of overflow pages without moving 
> tuples.
> HEAD will crash if this were executed, which is already reported on hackers.
>

+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;

Why do we need CHECKPOINT in the above test?

> The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt == false &&
> is_prev_bucket_same_wrt == true), which is never done before.
>

+-- Insert few tuples, the primary bucket page will not satisfy
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;

What do you mean by the second part of the sentence (the primary
bucket page will not satisfy)?

Few other minor comments:
===
1. Can we use ROLLBACK instead of ABORT in the tests?

2.
- for (i = 0; i < nitups; i++)
+ for (int i = 0; i < nitups; i++)

I don't think there is a need to make this unrelated change.

3.
+ /*
+ * A write buffer needs to be registered even if no tuples are added to
+ * it to ensure that we can acquire a cleanup lock on it if it is the
+ * same as primary bucket buffer or update the nextblkno if it is same
+ * as the previous bucket buffer.
+ */
+ else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+ {
+ uint8 wbuf_flags;
+
+ Assert(xlrec.ntups == 0);

Can we move this comment inside else if, just before Assert?

-- 
With Regards,
Amit Kapila.




RE: Is this a problem in GenericXLogFinish()?

2023-11-09 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> Next we should add some test codes. I will continue considering but please 
> post
> anything
> If you have idea.

And I did, PSA the patch. This patch adds two parts in hash_index.sql.

In the first part, the primary bucket page is filled with live tuples and some
overflow pages with dead ones. This leads to the removal of overflow pages
without moving tuples. HEAD crashes if this is executed, as already reported
on -hackers.

The second part tests a case (ntups == 0 && is_prim_bucket_same_wrt == false &&
is_prev_bucket_same_wrt == true) that has never been exercised before.



Also, I measured the execution time before and after applying the patch. Below
is a median of 9 measurements.

BEFORE -> AFTER
647 ms -> 710 ms

This means the execution time increased by roughly 10%, which should not be a
big impact.

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




fix_hash_squeeze_wal_4.patch
Description: fix_hash_squeeze_wal_4.patch


RE: Is this a problem in GenericXLogFinish()?

2023-11-05 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> Dear Amit, Michael,
> 
> Thanks for making the patch!
> 
> > Why register wbuf at all if there are no tuples to add and it is not
> > the same as bucketbuf? Also, I think this won't be correct if prevbuf
> > and wrtbuf are the same and also we have no tuples to add to wbuf. I
> > have attached a naive and crude way to achieve it. This needs more
> > work both in terms of trying to find a better way to change the code
> > or ensure this won't break any existing case. I have just run the
> > existing tests. Such a fix certainly required more testing.
> 
> I'm verifying the idea (currently it seems OK), but at least there is a 
> coding error -
> wbuf_flags should be uint8 here. PSA the fixed patch.

Here is a new patch, refactored a bit. It does the following:

* If-conditions in _hash_freeovflpage() were swapped.
* Based on above, an Assert(xlrec.ntups == 0) was added.
* A corresponding condition in hash_xlog_squeeze_page() was updated to follow the change as well
* Comments were adjusted

Next, we should add some test code. I will keep thinking about it, but please
post anything if you have an idea.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



fix_hash_squeeze_wal_3.patch
Description: fix_hash_squeeze_wal_3.patch


RE: Is this a problem in GenericXLogFinish()?

2023-11-05 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Michael,

Thanks for making the patch!

> Why register wbuf at all if there are no tuples to add and it is not
> the same as bucketbuf? Also, I think this won't be correct if prevbuf
> and wrtbuf are the same and also we have no tuples to add to wbuf. I
> have attached a naive and crude way to achieve it. This needs more
> work both in terms of trying to find a better way to change the code
> or ensure this won't break any existing case. I have just run the
> existing tests. Such a fix certainly required more testing.

I'm verifying the idea (currently it seems OK), but at least there is a coding 
error -
wbuf_flags should be uint8 here. PSA the fixed patch.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



fix_hash_squeeze_wal_2.patch
Description: fix_hash_squeeze_wal_2.patch


Re: Is this a problem in GenericXLogFinish()?

2023-11-02 Thread Amit Kapila
On Wed, Nov 1, 2023 at 12:54 PM Michael Paquier  wrote:
>
> On Mon, Oct 30, 2023 at 03:54:39PM +0530, Amit Kapila wrote:
> > On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier  wrote:
> >> On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
> >> > Yes, we need it to exclude any concurrent in-progress scans that could
> >> > return incorrect tuples during bucket squeeze operation.
> >>
> >> Thanks.  So I assume that we should just set REGBUF_NO_CHANGE when the
> >> primary and write buffers are the same and there are no tuples to
> >> move.  Say with something like the attached?
> >
> > What if the primary and write buffer are not the same but ntups is
> > zero? Won't we hit the assertion again in that case?
>
> The code tells that it should be able to handle such a case,
>

It should be possible when we encounter a page in the middle of the chain
that has all dead items and the write buffer points to some page after the
primary bucket page and before the read buffer page.

> so this
> would mean to set REGBUF_NO_CHANGE only when we have no tuples to
> move:
> -XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
> +/*
> + * A cleanup lock is still required on the write buffer even
> + * if there are no tuples to move, so it needs to be registered
> + * in this case.
> + */
> +wbuf_flags = REGBUF_STANDARD;
> +if (xlrec.ntups == 0)
> +wbuf_flags |= REGBUF_NO_CHANGE;
> +XLogRegisterBuffer(1, wbuf, wbuf_flags);
>

Why register wbuf at all if there are no tuples to add and it is not
the same as bucketbuf? Also, I think this won't be correct if prevbuf
and wrtbuf are the same and also we have no tuples to add to wbuf. I
have attached a naive and crude way to achieve it. This needs more
work both in terms of trying to find a better way to change the code
or ensure this won't break any existing case. I have just run the
existing tests. Such a fix certainly requires more testing.

-- 
With Regards,
Amit Kapila.


fix_hash_squeeze_wal_1.patch
Description: Binary data


Re: Is this a problem in GenericXLogFinish()?

2023-11-01 Thread Michael Paquier
On Mon, Oct 30, 2023 at 03:54:39PM +0530, Amit Kapila wrote:
> On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier  wrote:
>> On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
>> > Yes, we need it to exclude any concurrent in-progress scans that could
>> > return incorrect tuples during bucket squeeze operation.
>>
>> Thanks.  So I assume that we should just set REGBUF_NO_CHANGE when the
>> primary and write buffers are the same and there are no tuples to
>> move.  Say with something like the attached?
> 
> What if the primary and write buffer are not the same but ntups is
> zero? Won't we hit the assertion again in that case?

The code suggests that it should be able to handle such a case, so this
would mean to set REGBUF_NO_CHANGE only when we have no tuples to
move:
-XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+/*
+ * A cleanup lock is still required on the write buffer even
+ * if there are no tuples to move, so it needs to be registered
+ * in this case.
+ */
+wbuf_flags = REGBUF_STANDARD;
+if (xlrec.ntups == 0)
+wbuf_flags |= REGBUF_NO_CHANGE;
+XLogRegisterBuffer(1, wbuf, wbuf_flags);

Anyway, there is still a hole here: the regression tests have zero
coverage for the case where there are no tuples to move if
!is_prim_bucket_same_wrt.  There are only two queries that stress the
squeeze path with no tuples, both use is_prim_bucket_same_wrt:
INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
VACUUM hash_split_heap;

Perhaps this should have more coverage to make sure that all these
scenarios are covered at replay (likely with 027_stream_regress.pl)?
The case posted by Alexander at [1] falls under the same category
(is_prim_bucket_same_wrt with no tuples).

[1]: 
https://www.postgresql.org/message-id/f045c8f7-ee24-ead6-3679-c04a43d21...@gmail.com
--
Michael


signature.asc
Description: PGP signature


Re: Is this a problem in GenericXLogFinish()?

2023-10-31 Thread Robert Haas
On Mon, Oct 30, 2023 at 11:15 PM Amit Kapila  wrote:
> My understanding is the same. It is possible that my wording is not
> clear. Let me try to clarify again, Michael asked: "do we need the
> cleanup lock on the write buffer even if there are no tuples, and even
> if primary bucket and the write bucket are the same?" My reading of
> his question was do we need a cleanup lock even if the primary bucket
> and write bucket are the same which means the write bucket will be the
> first page in the chain and we need a cleanup lock on it. I think the
> second condition (no tuples) seems irrelevant here as whether that is
> true or false we need a cleanup lock.

Ah, OK, I see now. I missed the fact that, per what Michael wrote, you
were assuming the primary buffer and the write buffer were the same.
In that case I agree, but the whole formulation seems backwards. I
think the clearer way to say this is: we need a cleanup lock on the
primary bucket page no matter what; and we need to write tuples into
the write buffer if there are any tuples to write but not if there
aren't. If the two buffers are the same then the requirements are
added together.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-30 Thread Amit Kapila
On Mon, Oct 30, 2023 at 7:13 PM Robert Haas  wrote:
>
> On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila  wrote:
> > > Hmm.  So my question is: do we need the cleanup lock on the write
> > > buffer even if there are no tuples, and even if primary bucket and the
> > > write bucket are the same?
> >
> > Yes, we need it to exclude any concurrent in-progress scans that could
> > return incorrect tuples during bucket squeeze operation.
>
> Amit, thanks for weighing in, but I'm not convinced. I thought we only
> ever used a cleanup lock on the main bucket page to guard against
> concurrent scans.
>

My understanding is the same. It is possible that my wording is not
clear. Let me try to clarify again, Michael asked: "do we need the
cleanup lock on the write buffer even if there are no tuples, and even
if primary bucket and the write bucket are the same?" My reading of
his question was do we need a cleanup lock even if the primary bucket
and write bucket are the same which means the write bucket will be the
first page in the chain and we need a cleanup lock on it. I think the
second condition (no tuples) seems irrelevant here as whether that is
true or false we need a cleanup lock.

>
>  Here you seem to be saying that we need a cleanup
> lock on some page that may be an overflow page somewhere in the middle
> of the chain, and that doesn't seem right to me.
>

Sorry, I don't intend to say this.


--
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2023-10-30 Thread Robert Haas
On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila  wrote:
> > Hmm.  So my question is: do we need the cleanup lock on the write
> > buffer even if there are no tuples, and even if primary bucket and the
> > write bucket are the same?
>
> Yes, we need it to exclude any concurrent in-progress scans that could
> return incorrect tuples during bucket squeeze operation.

Amit, thanks for weighing in, but I'm not convinced. I thought we only
ever used a cleanup lock on the main bucket page to guard against
concurrent scans. Here you seem to be saying that we need a cleanup
lock on some page that may be an overflow page somewhere in the middle
of the chain, and that doesn't seem right to me.

So ... are you sure? If yes, can you provide any more detailed justification?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-30 Thread Amit Kapila
On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier  wrote:
>
> On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
> > Yes, we need it to exclude any concurrent in-progress scans that could
> > return incorrect tuples during bucket squeeze operation.
>
> Thanks.  So I assume that we should just set REGBUF_NO_CHANGE when the
> primary and write buffers are the same and there are no tuples to
> move.  Say with something like the attached?
>

What if the primary and write buffer are not the same but ntups is
zero? Won't we hit the assertion again in that case?

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2023-10-28 Thread Michael Paquier
On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
> Yes, we need it to exclude any concurrent in-progress scans that could
> return incorrect tuples during bucket squeeze operation.

Thanks.  So I assume that we should just set REGBUF_NO_CHANGE when the
primary and write buffers are the same and there are no tuples to
move.  Say with something like the attached?
--
Michael
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..9928068179 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -647,6 +647,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
 		int			i;
+		int			wbuf_flags;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -668,7 +669,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		/*
+		 * If the bucket buffer and the primary buffer are the same, a cleanup
+		 * lock is still required on the write buffer even if there are no
+		 * tuples
+		 * to move, so it needs to be registered in this case.
+		 */
+		wbuf_flags = REGBUF_STANDARD;
+		if (xlrec.is_prim_bucket_same_wrt && xlrec.ntups == 0)
+			wbuf_flags |= REGBUF_NO_CHANGE;
+		XLogRegisterBuffer(1, wbuf, wbuf_flags);
 		if (xlrec.ntups > 0)
 		{
 			XLogRegisterBufData(1, (char *) itup_offsets,


signature.asc
Description: PGP signature


Re: Is this a problem in GenericXLogFinish()?

2023-10-28 Thread Amit Kapila
On Fri, Oct 27, 2023 at 5:45 AM Michael Paquier  wrote:
>
> On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote:
> > Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
> > the same (your first scenario) but the second scenario you mention
> > (nextbuf  == wbuf) should be impossible.
>
> Okay..
>
> > It seems to me that maybe we shouldn't even be registering wbuf or
> > doing anything at all to it if there are no tuples that need moving.
> > That would also require some adjustment of the redo routine.
>
> Hmm.  So my question is: do we need the cleanup lock on the write
> buffer even if there are no tuples, and even if primary bucket and the
> write bucket are the same?
>

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2023-10-26 Thread Michael Paquier
On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote:
> Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
> the same (your first scenario) but the second scenario you mention
> (nextbuf  == wbuf) should be impossible.

Okay..

> It seems to me that maybe we shouldn't even be registering wbuf or
> doing anything at all to it if there are no tuples that need moving.
> That would also require some adjustment of the redo routine.

Hmm.  So my question is: do we need the cleanup lock on the write
buffer even if there are no tuples, and even if primary bucket and the
write bucket are the same?  I'd like to think that what you say is OK,
still I am not completely sure after reading the lock assumptions in
the hash README or 6d46f4783efe.  A simpler thing would be to mark
buffer 1 with REGBUF_NO_CHANGE when the primary and write buffers are
the same if we expect the lock to always be taken, I guess..

I've noticed that the replay paths for XLOG_HASH_MOVE_PAGE_CONTENTS
and XLOG_HASH_SQUEEZE_PAGE are similar with their page handlings (some
copy-pastes?).  A MOVE record should never have zero tuples, yet the
replay path assumes that this is possible, so it could be
simplified.
--
Michael


signature.asc
Description: PGP signature


Re: Is this a problem in GenericXLogFinish()?

2023-10-26 Thread Robert Haas
On Thu, Oct 26, 2023 at 3:31 AM Michael Paquier  wrote:
> Hmm.  Looking at hash_xlog_squeeze_page(), it looks like wbuf is
> expected to always be registered.  So, you're right here: it should be
> OK to be less aggressive with setting the page LSN on wbuf if ntups is
> 0, but there's more to it?  For example, it is possible that
> bucketbuf, prevbuf and wbuf refer to the same buffer, causing
> is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true.
> Particularly, if nextbuf and wbuf are the same buffers, we finish by
> registering twice the same buffer with two different IDs to perform
> the tuple insertion and the opaque area updates in two steps.  And
> that's..  Err..  not really necessary?  I mean, as far as I read this
> code you could just reuse the buffer registered at ID 1 and update its
> opaque area if is_prim_bucket_same_wrt is true?

Read the comments for _hash_squeezebucket. Particularly, note this part:

 *  Try to squeeze the tuples onto pages occurring earlier in the
 *  bucket chain in an attempt to free overflow pages. When we start
 *  the "squeezing", the page from which we start taking tuples (the
 *  "read" page) is the last bucket in the bucket chain and the page
 *  onto which we start squeezing tuples (the "write" page) is the
 *  first page in the bucket chain.  The read page works backward and
 *  the write page works forward; the procedure terminates when the
 *  read page and write page are the same page.

Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
the same (your first scenario) but the second scenario you mention
(nextbuf  == wbuf) should be impossible.

It seems to me that maybe we shouldn't even be registering wbuf or
doing anything at all to it if there are no tuples that need moving.
That would also require some adjustment of the redo routine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-26 Thread Michael Paquier
On Wed, Oct 25, 2023 at 10:06:23PM -0700, Jeff Davis wrote:
> Thank you for the report! I was able to reproduce and observe that the
> buffer is not marked dirty.
> 
> The call (hashovfl.c:671):
> 
>   XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD)
> 
> is followed unconditionally by:
> 
>   PageSetLSN(BufferGetPage(wbuf), recptr)
> 
> so if the Assert were not there, it would be setting the LSN on a page
> that's not marked dirty. Therefore I think this is a bug, or at least
> an interesting/unexpected case.
> 
> Perhaps the registration and the PageSetLSN don't need to happen when
> nitups==0?

Hmm.  Looking at hash_xlog_squeeze_page(), it looks like wbuf is
expected to always be registered.  So, you're right here: it should be
OK to be less aggressive with setting the page LSN on wbuf if ntups is
0, but there's more to it?  For example, it is possible that
bucketbuf, prevbuf and wbuf refer to the same buffer, causing
is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true. 
Particularly, if nextbuf and wbuf are the same buffers, we finish by
registering twice the same buffer with two different IDs to perform
the tuple insertion and the opaque area updates in two steps.  And
that's..  Err..  not really necessary?  I mean, as far as I read this
code you could just reuse the buffer registered at ID 1 and update its
opaque area if is_prim_bucket_same_wrt is true?
--
Michael


signature.asc
Description: PGP signature


Re: Is this a problem in GenericXLogFinish()?

2023-10-25 Thread Jeff Davis
On Thu, 2023-10-26 at 07:00 +0300, Alexander Lakhin wrote:

> It looks like the buffer is not dirty in the problematic call.

Thank you for the report! I was able to reproduce and observe that the
buffer is not marked dirty.

The call (hashovfl.c:671):

  XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD)

is followed unconditionally by:

  PageSetLSN(BufferGetPage(wbuf), recptr)

so if the Assert were not there, it would be setting the LSN on a page
that's not marked dirty. Therefore I think this is a bug, or at least
an interesting/unexpected case.

Perhaps the registration and the PageSetLSN don't need to happen when
nitups==0?

Regards,
Jeff Davis





Re: Is this a problem in GenericXLogFinish()?

2023-10-25 Thread Alexander Lakhin

Hello hackers,

24.10.2023 03:45, Jeff Davis wrote:

Committed.


I've encountered a case with a hash index, when an assert added by the
commit fails:

CREATE TABLE t (i INT);
CREATE INDEX hi ON t USING HASH (i);
INSERT INTO t SELECT 1 FROM generate_series(1, 1000);

BEGIN;
INSERT INTO t SELECT 1 FROM generate_series(1, 1000);
ROLLBACK;

CHECKPOINT;

VACUUM t;

Leads to:
Core was generated by `postgres: law regression [local] VACUUM  
 '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/211944' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) 
at ./nptl/pthread_kill.c:44
44  ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) 
at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139924569388864) at 
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139924569388864, signo=signo@entry=6) at 
./nptl/pthread_kill.c:89
#3  0x7f42b99ed476 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
#4  0x7f42b99d37f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x55f128e83f1b in ExceptionalCondition (conditionName=0x55f128f33520 "BufferIsExclusiveLocked(buffer) && 
BufferIsDirty(buffer)", fileName=0x55f128f333a8 "xloginsert.c", lineNumber=262) at assert.c:66

#6  0x55f1287edce3 in XLogRegisterBuffer (block_id=1 '\001', buffer=1808, 
flags=8 '\b') at xloginsert.c:262
#7  0x55f128742833 in _hash_freeovflpage (rel=0x7f42adb95c88, bucketbuf=1808, ovflbuf=1825, wbuf=1808, 
itups=0x7ffc3f18c010, itup_offsets=0x7ffc3f18bce0, tups_size=0x7ffc3f18ccd0, nitups=0, bstrategy=0x55f12a562820)

    at hashovfl.c:671
#8  0x55f128743567 in _hash_squeezebucket (rel=0x7f42adb95c88, bucket=6, bucket_blkno=7, bucket_buf=1808, 
bstrategy=0x55f12a562820) at hashovfl.c:1064
#9  0x55f12873ca2a in hashbucketcleanup (rel=0x7f42adb95c88, cur_bucket=6, bucket_buf=1808, bucket_blkno=7, 
bstrategy=0x55f12a562820, maxbucket=7, highmask=15, lowmask=7, tuples_removed=0x7ffc3f18fb28,
    num_index_tuples=0x7ffc3f18fb30, split_cleanup=false, callback=0x55f1289ba1de , 
callback_state=0x55f12a566778) at hash.c:921
#10 0x55f12873bfcf in hashbulkdelete (info=0x7ffc3f18fc30, stats=0x0, callback=0x55f1289ba1de , 
callback_state=0x55f12a566778) at hash.c:549
#11 0x55f128776fbb in index_bulk_delete (info=0x7ffc3f18fc30, istat=0x0, callback=0x55f1289ba1de , 
callback_state=0x55f12a566778) at indexam.c:709
#12 0x55f1289ba03d in vac_bulkdel_one_index (ivinfo=0x7ffc3f18fc30, istat=0x0, dead_items=0x55f12a566778) at 
vacuum.c:2480
#13 0x55f128771286 in lazy_vacuum_one_index (indrel=0x7f42adb95c88, istat=0x0, reltuples=-1, vacrel=0x55f12a4b9c30) 
at vacuumlazy.c:2768

#14 0x55f1287704a3 in lazy_vacuum_all_indexes (vacrel=0x55f12a4b9c30) at 
vacuumlazy.c:2338
#15 0x55f128770275 in lazy_vacuum (vacrel=0x55f12a4b9c30) at 
vacuumlazy.c:2256
...

It looks like the buffer is not dirty in the problematic call.

Best regards,
Alexander




Re: Is this a problem in GenericXLogFinish()?

2023-10-23 Thread Jeff Davis
On Mon, 2023-10-23 at 10:14 -0400, Robert Haas wrote:
> I think that would be a bug. I think it would be OK to just change
> the
> page LSN and nothing else, but that requires calling
> MarkBufferDirty()
> at some point. If you only call PageSetLSN() and never
> MarkBufferDirty(), that sounds messed up: either the former should be
> skipped, or the latter should be added. We shouldn't modify a shared
> buffer without marking it dirty.
> 

In that case, I think REGBUF_NO_CHANGE is the right name for the flag.

Committed.

Regards,
Jeff Davis





Re: Is this a problem in GenericXLogFinish()?

2023-10-23 Thread Robert Haas
On Fri, Oct 20, 2023 at 12:28 PM Jeff Davis  wrote:
> I still have a question though: if a buffer is exclusive-locked,
> unmodified and clean, and then the caller registers it and later does
> PageSetLSN (just as if it were dirty), is that a definite bug?
>
> There are a couple callsites where the control flow is complex enough
> that it's hard to be sure the buffer is always marked dirty before
> being registered (like in log_heap_visible(), as I mentioned upthread).
> But those callsites are all doing PageSetLSN, unlike the hash index
> case.

I think that would be a bug. I think it would be OK to just change the
page LSN and nothing else, but that requires calling MarkBufferDirty()
at some point. If you only call PageSetLSN() and never
MarkBufferDirty(), that sounds messed up: either the former should be
skipped, or the latter should be added. We shouldn't modify a shared
buffer without marking it dirty.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-20 Thread Jeff Davis
On Thu, 2023-10-19 at 16:12 -0400, Robert Haas wrote:
> For what it's worth, though, I think it would be better to
> just make these cases exceptions to your Assert

OK, I'll probably commit something like v4 then.

I still have a question though: if a buffer is exclusive-locked,
unmodified and clean, and then the caller registers it and later does
PageSetLSN (just as if it were dirty), is that a definite bug?

There are a couple callsites where the control flow is complex enough
that it's hard to be sure the buffer is always marked dirty before
being registered (like in log_heap_visible(), as I mentioned upthread).
But those callsites are all doing PageSetLSN, unlike the hash index
case.

Regards,
Jeff Davis





Re: Is this a problem in GenericXLogFinish()?

2023-10-19 Thread Robert Haas
On Tue, Oct 17, 2023 at 12:38 PM Jeff Davis  wrote:
> I meant: are those cleanup operations frequent enough that dirtying
> those buffers in that case would matter?

Honestly, I'm not sure. Probably not? I mean, hashbucketcleanup()
seems to only be called during vacuum or a bucket split, and I don't
think you can have super-frequent calls to _hash_freeovflpage()
either. For what it's worth, though, I think it would be better to
just make these cases exceptions to your Assert, as you did in the
patch, rather than changing them to dirty the buffer. There doesn't
seem to be enough upside to making the assert unconditional to justify
changing stuff that might have a real-world performance cost ... even
if we don't think it would amount to much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-17 Thread Jeff Davis
On Tue, 2023-10-17 at 08:41 -0400, Robert Haas wrote:
> Sorry, I'm not sure I understand the question. Are you asking whether
> dirtying buffers unnecessarily might be slower than not doing that?

I meant: are those cleanup operations frequent enough that dirtying
those buffers in that case would matter?

Regards,
Jeff Davis





Re: Is this a problem in GenericXLogFinish()?

2023-10-17 Thread Robert Haas
On Mon, Oct 16, 2023 at 7:31 PM Jeff Davis  wrote:
> Another option might be to just change the hash indexing code to follow
> the correct protocol, locking and calling MarkBufferDirty() in those 3
> call sites. Marking the buffer dirty is easy, but making sure that it's
> locked might require some refactoring. Robert, would following the
> right protocol here affect performance?

Sorry, I'm not sure I understand the question. Are you asking whether
dirtying buffers unnecessarily might be slower than not doing that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-16 Thread Jeff Davis
On Wed, 2023-10-11 at 14:53 +0300, Heikki Linnakangas wrote:
> 
> The comment suggests that you don't need to hold an exclusive lock when
> you call this, but there's an assertion that you do.

Reworded.

> Is it a new requirement that you must hold an exclusive lock on the 
> buffer when you call XLogRegisterBuffer()? I think that's reasonable.

Agreed.

> 
> I'd suggest a comment here to explain what's wrong if someone hits this
> assertion. Something like "The buffer must be marked as dirty before
> registering, unless REGBUF_CLEAN_OK is set to indicate that the buffer
> is not being modified".

Done, with different wording.

> 
> > If we commit this, ideally someone would look into the places where
> > REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps
> > refactor so that it would benefit from the Assert. But refactoring
> > those places is outside of the scope of this patch.

I don't think it makes sense to register a buffer that is perhaps clean
and perhaps not. After registering a buffer and writing WAL, you need
to PageSetLSN(), and that should only be done after MarkBufferDirty(),
right?

So if you need to condition PageSetLSN() on whether MarkBufferDirty()
has happened, it would be trivial to use the same condition for
XLogRegisterBuffer(). Therefore I changed the name back to
REGBUF_NO_CHANGE, as you suggested originally.

The hash indexing code knows it's not modifying the bucket buffer,
doesn't mark it dirty, and doesn't set the LSN. I presume that's safe.

However, there are quite a few places where
XLogRegisterBuffer()+WAL+PageSetLSN() all happen, but it's not
immediately obvious that all code paths to get there also
MarkBufferDirty(). For instance:

   lazy_scan_heap() -- conditionally marks heapbuf as dirty
   visibilitymap_set() -- conditionally calls log_heap_visible
   log_heap_visible()   
   XLogRegisterBuffer(1, heap_buffer, flags)

if those conditions don't match up exactly, it seems we could get into
a situation where we update the LSN of a clean page, which seems bad.

There are a few other places in the hash code and spgist code where I
have similar concerns. Not many, though, I looked at all of the call
sites (at least briefly) and most of them are fine.

> About that: you added the flag to a couple of XLogRegisterBuffer() calls
> in GIN, because they call MarkBufferDirty() only after
> XLogRegisterBuffer(). Those would be easy to refactor so I'd suggest
> doing that now.

Done.

> > It sounds like we have no intention to change the hash index code, so
> > are we OK if this flag just lasts forever? Do you still think it
> > offers a useful check?
> 
> Yeah, I think this is a useful assertion. It might catch some bugs in
> extensions too.

I was asking more about the REGBUF_NO_CHANGE flag. It feels very
specific to the hash indexes, and I'm not sure we want to encourage
extensions to use it.

Though maybe the locking protocol used by hash indexes is more
generally applicable, and other indexing strategies might want to use
it, too?

Another option might be to just change the hash indexing code to follow
the correct protocol, locking and calling MarkBufferDirty() in those 3
call sites. Marking the buffer dirty is easy, but making sure that it's
locked might require some refactoring. Robert, would following the
right protocol here affect performance?

Regards,
Jeff Davis

From f0c2d489dd8f2268803f4d2cc8642e29cb0d3576 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 28 Sep 2023 11:19:06 -0700
Subject: [PATCH v4] Assert that buffers are marked dirty before
 XLogRegisterBuffer().

Enforce the rule from transam/README in XLogRegisterBuffer(), and
update callers to follow the rule.

Hash indexes sometimes register clean pages as a part of the locking
protocol, so provide a REGBUF_NO_CHANGE flag to support that use.

Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a...@iki.fi
---
 src/backend/access/gin/ginbtree.c   | 14 +++---
 src/backend/access/gin/gindatapage.c|  6 +++
 src/backend/access/gin/ginentrypage.c   |  3 ++
 src/backend/access/gin/ginfast.c|  5 ++-
 src/backend/access/hash/hash.c  | 11 +++--
 src/backend/access/hash/hashovfl.c  | 21 ++---
 src/backend/access/transam/xloginsert.c | 16 ++-
 src/backend/storage/buffer/bufmgr.c | 59 +
 src/include/access/xloginsert.h |  1 +
 src/include/storage/bufmgr.h|  2 +
 10 files changed, 118 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 06e97a7fbb..fc694b40f1 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -387,24 +387,22 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		START_CRIT_SECTION();
 
 		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
-		{
 			XLogBeginInsert();
-			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
-			if 

Re: Is this a problem in GenericXLogFinish()?

2023-10-11 Thread Robert Haas
On Wed, Oct 11, 2023 at 7:53 AM Heikki Linnakangas  wrote:
> > + * Buffer must be pinned and exclusive-locked.  (If caller does not hold
> > + * exclusive lock, then the result may be stale before it's returned.)
> The comment suggests that you don't need to hold an exclusive lock when
> you call this, but there's an assertion that you do.

I don't think the comment suggests that. It would if you only read the
sentence in parentheses. But if you read both of them it seems clear
enough. I guess the parenthetical sentence could say "If the caller
did not hold an exclusive lock, then the result might become stale
even before it was returned," basically putting the whole thing in the
subjunctive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-11 Thread Heikki Linnakangas

On 10/10/2023 22:57, Jeff Davis wrote:

On Thu, 2023-09-28 at 12:05 -0700, Jeff Davis wrote:

Also, I ran into some problems with GIN that might require a bit more
refactoring in ginPlaceToPage(). Perhaps we could consider
REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
use it temporarily until we have a chance to analyze/refactor each of
the callers that need it.


For the Assert, I have attached a new patch for v17.


Thanks for working on this!


+/*
+ * BufferIsDirty
+ *
+ * Checks if buffer is already dirty.
+ *
+ * Buffer must be pinned and exclusive-locked.  (If caller does not hold
+ * exclusive lock, then the result may be stale before it's returned.)
+ */
+bool
+BufferIsDirty(Buffer buffer)
+{
+   BufferDesc *bufHdr;
+
+   if (BufferIsLocal(buffer))
+   {
+   int bufid = -buffer - 1;
+   bufHdr = GetLocalBufferDescriptor(bufid);
+   }
+   else
+   {
+   bufHdr = GetBufferDescriptor(buffer - 1);
+   }
+
+   Assert(BufferIsPinned(buffer));
+   Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+   LW_EXCLUSIVE));
+
+   return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
+}
The comment suggests that you don't need to hold an exclusive lock when 
you call this, but there's an assertion that you do.


Is it a new requirement that you must hold an exclusive lock on the 
buffer when you call XLogRegisterBuffer()? I think that's reasonable.


I wondered if that might be a problem when you WAL-log a page when you
set hint bits on it and checksums are enabled. Hint bits can be set
holding just a share lock, but that path uses XLogSaveBufferForHint(),
which calls XLogRegisterBlock() instead of XLogRegisterBuffer().


There is a comment above XLogSaveBufferForHint() that implies that 
XLogRegisterBuffer() requires you to hold an exclusive-lock but I don't 
see that spelled out explicitly in XLogRegisterBuffer() itself. Maybe 
add an assertion for that in XLogRegisterBuffer() to make it more explicit.



--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -246,6 +246,7 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 
flags)
 
 	/* NO_IMAGE doesn't make sense with FORCE_IMAGE */

Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
+   Assert((flags & REGBUF_CLEAN_OK) || BufferIsDirty(buffer));
Assert(begininsert_called);
 
 	if (block_id >= max_registered_block_id)


I'd suggest a comment here to explain what's wrong if someone hits this 
assertion. Something like "The buffer must be marked as dirty before 
registering, unless REGBUF_CLEAN_OK is set to indicate that the buffer 
is not being modified".



I changed the name of the flag to REGBUF_CLEAN_OK, because for some of
the callsites it was not clear to me whether the page is always
unchanged or sometimes unchanged. In other words, REGBUF_CLEAN_OK is a
way to bypass the Assert for callsites where we either know that we
actually want to register an unchanged page, or for callsites where we
don't know if the page is changed or not.


Ok. A flag to explicitly mark that the page is not modified would 
actually be nice for pg_rewind. It could ignore such page references. 
Not that it matters much in practice, since those records are so rare. 
And for that, we'd need to include the flag in the WAL record too. So I 
think this is fine.



If we commit this, ideally someone would look into the places where
REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps
refactor so that it would benefit from the Assert. But refactoring
those places is outside of the scope of this patch.


About that: you added the flag to a couple of XLogRegisterBuffer() calls 
in GIN, because they call MarkBufferDirty() only after 
XLogRegisterBuffer(). Those would be easy to refactor so I'd suggest 
doing that now.



It sounds like we have no intention to change the hash index code, so
are we OK if this flag just lasts forever? Do you still think it offers
a useful check?


Yeah, I think this is a useful assertion. It might catch some bugs in 
extensions too.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Is this a problem in GenericXLogFinish()?

2023-10-10 Thread Jeff Davis
I committed and backported 0001 (the GenericXLogFinish() fix but not
the Assert).

Strangely I didn't see the -committers email come through yet. If
anyone notices anything wrong, please let me know before the final v11
release.

On Thu, 2023-09-28 at 12:05 -0700, Jeff Davis wrote:
> Also, I ran into some problems with GIN that might require a bit more
> refactoring in ginPlaceToPage(). Perhaps we could consider
> REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
> use it temporarily until we have a chance to analyze/refactor each of
> the callers that need it.

For the Assert, I have attached a new patch for v17.

I changed the name of the flag to REGBUF_CLEAN_OK, because for some of
the callsites it was not clear to me whether the page is always
unchanged or sometimes unchanged. In other words, REGBUF_CLEAN_OK is a
way to bypass the Assert for callsites where we either know that we
actually want to register an unchanged page, or for callsites where we
don't know if the page is changed or not.

If we commit this, ideally someone would look into the places where
REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps
refactor so that it would benefit from the Assert. But refactoring
those places is outside of the scope of this patch.

It sounds like we have no intention to change the hash index code, so
are we OK if this flag just lasts forever? Do you still think it offers
a useful check?

Regards,
Jeff Davis

From 859c8f4284de1030ab070630fa06af3c171264d9 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 28 Sep 2023 11:19:06 -0700
Subject: [PATCH v3] Assert that buffers are marked dirty before
 XLogRegisterBuffer().

A page must be marked dirty before writing WAL, as documented in
transam/README.

Enforce this rule in XLogRegisterBuffer(), though it's not actually a
bug if (a) the page really is clean; or (b) the buffer is marked dirty
after XLogRegisterBuffer() and before XLogInsert().

Update log_newpage_range(), where it's trivial to change the order so
that we can check in XLogRegisterBuffer(). Other callers are more
complex, and updating them is outside the scope of this patch (and
perhaps not desirable), so introduce a flag REGBUF_CLEAN_OK to bypass
the Assert.

Note that this commit may have missed some callsites which can, at
least in some cases, register a clean buffer. If such a callsite is
found, it can be updated to use REGBUF_CLEAN_OK assuming it doesn't
represent an actual bug.

Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a...@iki.fi
---
 src/backend/access/gin/ginbtree.c   |  3 ++-
 src/backend/access/gin/ginfast.c|  2 +-
 src/backend/access/hash/hash.c  | 10 ++---
 src/backend/access/hash/hashovfl.c  |  9 +---
 src/backend/access/transam/xloginsert.c |  3 ++-
 src/backend/storage/buffer/bufmgr.c | 30 +
 src/include/access/xloginsert.h |  2 ++
 src/include/storage/bufmgr.h|  1 +
 8 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 06e97a7fbb..c98176af77 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -389,7 +389,8 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
 		{
 			XLogBeginInsert();
-			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
+			XLogRegisterBuffer(0, stack->buffer,
+			   REGBUF_STANDARD | REGBUF_CLEAN_OK);
 			if (BufferIsValid(childbuf))
 XLogRegisterBuffer(1, childbuf, REGBUF_STANDARD);
 		}
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index eb6c554831..e46e11df07 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -399,7 +399,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 		Assert((ptr - collectordata) <= collector->sumsize);
 		if (needWal)
 		{
-			XLogRegisterBuffer(1, buffer, REGBUF_STANDARD);
+			XLogRegisterBuffer(1, buffer, REGBUF_STANDARD | REGBUF_CLEAN_OK);
 			XLogRegisterBufData(1, collectordata, collector->sumsize);
 		}
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606..7a62100c89 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -824,11 +824,15 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
 
 /*
- * bucket buffer needs to be registered to ensure that we can
- * acquire a cleanup lock on it during replay.
+ * bucket buffer was not changed, but still needs to be
+ * registered to ensure that we can acquire a cleanup lock on
+ * it during replay.
  */
 if (!xlrec.is_primary_bucket_page)
-	XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+{
+	uint8 flags = REGBUF_STANDARD | 

Re: Is this a problem in GenericXLogFinish()?

2023-09-28 Thread Jeff Davis
On Wed, 2023-09-27 at 18:57 +0300, Heikki Linnakangas wrote:
> We could define a new REGBUF_NO_CHANGE flag to signal that the buffer is
> registered just for locking purposes, and not actually modified by the
> WAL record.

Out of curiosity I also added a trial assert (not intending to actually
add this):

  Assert(!(flags & REGBUF_NO_CHANGE) || !BufferIsExclusiveLocked());

I expected that to fail, but it didn't -- perhaps that buffer is only
locked during replay or something? I don't think we want that specific
Assert; I was just experimenting with some cross-checks we could do to
verify that REGBUF_NO_CHANGE is used properly.

Also, I ran into some problems with GIN that might require a bit more
refactoring in ginPlaceToPage(). Perhaps we could consider
REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
use it temporarily until we have a chance to analyze/refactor each of
the callers that need it.

Regards,
Jeff Davis





Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Heikki Linnakangas

On 27/09/2023 18:47, Robert Haas wrote:

On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis  wrote:

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modifying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?


I had forgotten some of the details of how this works until you
reminded me, but now that you've jarred my memory, I remember some of
it.

When Amit Kapila and I were working on the project to add WAL-logging
to hash indexes, we ran into some problems with concurrency control
for individual buckets within the hash index. Before that project,
this was handled using heavyweight locks, one per bucket. That got
changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
explained in that commit message. Basically, instead of taking
heavyweight locks, we started taking cleanup locks on the primary
bucket pages. I always thought that was a little awkward, but I didn't
quite see how to do better. I don't think that I gave much thought at
the time to the consequence you've uncovered here, namely that it
means we're sometimes locking one page (the primary bucket page)
because we want to do something to another bucket page (some page in
the linked list of pages that are part of that bucket) and for that to
be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(


That seems OK. Maybe there would be a better way to do that, but there's 
nothing wrong with that approach per se.


We could define a new REGBUF_NO_CHANGE flag to signal that the buffer is 
registered just for locking purposes, and not actually modified by the 
WAL record.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Robert Haas
On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis  wrote:
> So it looks like it's intentionally registering a clean buffer so that
> it can take a cleanup lock for reasons other than cleaning (or even
> modifying) the page. I would think that there's a better way of
> accomplishing that goal, so perhaps we can fix that?

I had forgotten some of the details of how this works until you
reminded me, but now that you've jarred my memory, I remember some of
it.

When Amit Kapila and I were working on the project to add WAL-logging
to hash indexes, we ran into some problems with concurrency control
for individual buckets within the hash index. Before that project,
this was handled using heavyweight locks, one per bucket. That got
changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
explained in that commit message. Basically, instead of taking
heavyweight locks, we started taking cleanup locks on the primary
bucket pages. I always thought that was a little awkward, but I didn't
quite see how to do better. I don't think that I gave much thought at
the time to the consequence you've uncovered here, namely that it
means we're sometimes locking one page (the primary bucket page)
because we want to do something to another bucket page (some page in
the linked list of pages that are part of that bucket) and for that to
be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Jeff Davis
On Wed, 2023-09-27 at 10:36 -0400, Robert Haas wrote:
> On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis  wrote:
> > That site is pretty trivial to fix, but there are also a couple places
> > in hash.c and hashovfl.c that are registering a clean page and it's not
> > clear to me exactly what's going on.
> 
> Huh, I wonder if that's just a bug. Do you know where exactly?

hashovfl.c:665 and hash.c:831

In both cases it's registering the bucket_buf, and has a comment like:

 /*
  * bucket buffer needs to be registered to ensure that we can
  * acquire a cleanup lock on it during replay.
  */

And various redo routines have comments like:

 /*
  * Ensure we have a cleanup lock on primary bucket page before we start
  * with the actual replay operation.  This is to ensure that neither a
  * scan can start nor a scan can be already-in-progress during the replay
  * of this operation.  If we allow scans during this operation, then they
  * can miss some records or show the same record multiple times.
  */

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modifying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?

Regards,
Jeff Davis





Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Robert Haas
On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis  wrote:
> That site is pretty trivial to fix, but there are also a couple places
> in hash.c and hashovfl.c that are registering a clean page and it's not
> clear to me exactly what's going on.

Huh, I wonder if that's just a bug. Do you know where exactly?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-09-26 Thread Jeff Davis
On Wed, 2023-09-27 at 00:14 +0300, Heikki Linnakangas wrote:
> Looks correct. You now loop through all the block IDs three times, 
> however. I wonder if that is measurably slower, but even if it's not,
> was there a reason you wanted to move the XLogRegisterBuffer() calls to
> a separate loop?

I did so to correspond more closely to what's outlined in the README
and in other places in the code, where marking the buffers dirty
happens before XLogBeginInsert(). It didn't occur to me that one extra
loop would matter, but I can combine them again.

It would be a bit more concise to do the XLogBeginInsert() first (like
before) and then register the buffers in the same loop that does the
writes and marks the buffers dirty. Updated patch attached.

> 
> Hmm, I'm sure there are exceptions but log_newpage_range() actually 
> seems to be doing the right thing; it calls MarkBufferDirty() before 
> XLogInsert(). It only calls it after XLogRegisterBuffer() though, and I
> concur that XLogRegisterBuffer() would be the logical place for the
> assertion. We could tighten this up, require that you call
> MarkBufferDirty() before XLogRegisterBuffer(), and fix all the callers.

That site is pretty trivial to fix, but there are also a couple places
in hash.c and hashovfl.c that are registering a clean page and it's not
clear to me exactly what's going on.

Regards,
Jeff Davis

From 9548bb49865db3a04bbdda4c63df611142e22964 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 26 Sep 2023 12:10:11 -0700
Subject: [PATCH v2] Fix GenericXLogFinish().

Mark the buffers dirty before writing WAL.

Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c0...@iki.fi
---
 src/backend/access/transam/generic_xlog.c | 53 ++-
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6c68191ca6..2d5ff4aa7c 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -347,6 +347,10 @@ GenericXLogFinish(GenericXLogState *state)
 
 		START_CRIT_SECTION();
 
+		/*
+		 * Compute deltas if necessary, write changes to buffers, mark
+		 * buffers dirty, and register changes.
+		 */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
PageData   *pageData = &state->pages[i];
@@ -359,41 +363,31 @@ GenericXLogFinish(GenericXLogState *state)
 			page = BufferGetPage(pageData->buffer);
 			pageHeader = (PageHeader) pageData->image;
 
+			/* delta not needed for a full page image */
+			if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE))
+computeDelta(pageData, page, (Page) pageData->image);
+
+			/*
+			 * Apply the image, being careful to zero the "hole" between
+			 * pd_lower and pd_upper in order to avoid divergence between
+			 * actual page state and what replay would produce.
+			 */
+			memcpy(page, pageData->image, pageHeader->pd_lower);
+			memset(page + pageHeader->pd_lower, 0,
+   pageHeader->pd_upper - pageHeader->pd_lower);
+			memcpy(page + pageHeader->pd_upper,
+   pageData->image + pageHeader->pd_upper,
+   BLCKSZ - pageHeader->pd_upper);
+
+			MarkBufferDirty(pageData->buffer);
+
 			if (pageData->flags & GENERIC_XLOG_FULL_IMAGE)
 			{
-/*
- * A full-page image does not require us to supply any xlog
- * data.  Just apply the image, being careful to zero the
- * "hole" between pd_lower and pd_upper in order to avoid
- * divergence between actual page state and what replay would
- * produce.
- */
-memcpy(page, pageData->image, pageHeader->pd_lower);
-memset(page + pageHeader->pd_lower, 0,
-	   pageHeader->pd_upper - pageHeader->pd_lower);
-memcpy(page + pageHeader->pd_upper,
-	   pageData->image + pageHeader->pd_upper,
-	   BLCKSZ - pageHeader->pd_upper);
-
 XLogRegisterBuffer(i, pageData->buffer,
    REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
 			}
 			else
 			{
-/*
- * In normal mode, calculate delta and write it as xlog data
- * associated with this page.
- */
-computeDelta(pageData, page, (Page) pageData->image);
-
-/* Apply the image, with zeroed "hole" as above */
-memcpy(page, pageData->image, pageHeader->pd_lower);
-memset(page + pageHeader->pd_lower, 0,
-	   pageHeader->pd_upper - pageHeader->pd_lower);
-memcpy(page + pageHeader->pd_upper,
-	   pageData->image + pageHeader->pd_upper,
-	   BLCKSZ - pageHeader->pd_upper);
-
 XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD);
 XLogRegisterBufData(i, pageData->delta, pageData->deltaLen);
 			}
@@ -402,7 +396,7 @@ GenericXLogFinish(GenericXLogState *state)
 		/* Insert xlog record */
 		lsn = XLogInsert(RM_GENERIC_ID, 0);
 
-		/* Set LSN and mark buffers dirty */
+		/* Set LSN */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
PageData   *pageData = &state->pages[i];
@@ -410,7 +404,6 @@ GenericXLogFinish(GenericXLogState *state)
 			if 

Re: Is this a problem in GenericXLogFinish()?

2023-09-26 Thread Heikki Linnakangas

On 26/09/2023 22:32, Jeff Davis wrote:

On Mon, 2023-09-25 at 13:04 +0300, Heikki Linnakangas wrote:

Yes, that's a problem.


Patch attached. I rearranged the code a bit to follow the expected
pattern of: write, mark dirty, WAL, set LSN. I think computing the
deltas could also be moved earlier, outside of the critical section,
but I'm not sure that would be useful.


Looks correct. You now loop through all the block IDs three times, 
however. I wonder if that is measurably slower, but even if it's not, 
was there a reason you wanted to move the XLogRegisterBuffer() calls to 
a separate loop?



Do you have a suggestion for any kind of test addition, or should we
just review carefully?


I wish we had an assertion for that. XLogInsert() could assert that
the page is already marked dirty, for example.


Unfortunately that's not always the case, e.g. log_newpage_range().


Hmm, I'm sure there are exceptions but log_newpage_range() actually 
seems to be doing the right thing; it calls MarkBufferDirty() before 
XLogInsert(). It only calls it after XLogRegisterBuffer() though, and I 
concur that XLogRegisterBuffer() would be the logical place for the 
assertion. We could tighten this up, require that you call 
MarkBufferDirty() before XLogRegisterBuffer(), and fix all the callers.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Is this a problem in GenericXLogFinish()?

2023-09-26 Thread Jeff Davis
On Mon, 2023-09-25 at 13:04 +0300, Heikki Linnakangas wrote:
> Yes, that's a problem.

Patch attached. I rearranged the code a bit to follow the expected
pattern of: write, mark dirty, WAL, set LSN. I think computing the
deltas could also be moved earlier, outside of the critical section,
but I'm not sure that would be useful.

Do you have a suggestion for any kind of test addition, or should we
just review carefully?

> I wish we had an assertion for that. XLogInsert() could assert that
> the 
> page is already marked dirty, for example.

Unfortunately that's not always the case, e.g. log_newpage_range().

Regards,
Jeff Davis

From c54749a7939721f0a838115abce09967216cbc0f Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 26 Sep 2023 12:10:11 -0700
Subject: [PATCH v1] Fix GenericXLogFinish().

Mark the buffers dirty before writing WAL.

Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c0...@iki.fi
---
 src/backend/access/transam/generic_xlog.c | 66 ---
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6c68191ca6..2b7e054ebe 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -343,10 +343,12 @@ GenericXLogFinish(GenericXLogState *state)
 	if (state->isLogged)
 	{
 		/* Logged relation: make xlog record in critical section. */
-		XLogBeginInsert();
-
 		START_CRIT_SECTION();
 
+		/*
+		 * Compute deltas if necessary, write changes to buffers, and mark
+		 * them dirty.
+		 */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
 			PageData   *pageData = &state->pages[i];
@@ -359,41 +361,42 @@ GenericXLogFinish(GenericXLogState *state)
 			page = BufferGetPage(pageData->buffer);
 			pageHeader = (PageHeader) pageData->image;
 
+			/* delta not needed for a full page image */
+			if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE))
+computeDelta(pageData, page, (Page) pageData->image);
+
+			/*
+			 * Apply the image, being careful to zero the "hole" between
+			 * pd_lower and pd_upper in order to avoid divergence between
+			 * actual page state and what replay would produce.
+			 */
+			memcpy(page, pageData->image, pageHeader->pd_lower);
+			memset(page + pageHeader->pd_lower, 0,
+   pageHeader->pd_upper - pageHeader->pd_lower);
+			memcpy(page + pageHeader->pd_upper,
+   pageData->image + pageHeader->pd_upper,
+   BLCKSZ - pageHeader->pd_upper);
+
+			MarkBufferDirty(pageData->buffer);
+		}
+
+		XLogBeginInsert();
+
+		/* Register buffers */
+		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
+		{
+			PageData   *pageData = &state->pages[i];
+
+			if (BufferIsInvalid(pageData->buffer))
+continue;
+
 			if (pageData->flags & GENERIC_XLOG_FULL_IMAGE)
 			{
-/*
- * A full-page image does not require us to supply any xlog
- * data.  Just apply the image, being careful to zero the
- * "hole" between pd_lower and pd_upper in order to avoid
- * divergence between actual page state and what replay would
- * produce.
- */
-memcpy(page, pageData->image, pageHeader->pd_lower);
-memset(page + pageHeader->pd_lower, 0,
-	   pageHeader->pd_upper - pageHeader->pd_lower);
-memcpy(page + pageHeader->pd_upper,
-	   pageData->image + pageHeader->pd_upper,
-	   BLCKSZ - pageHeader->pd_upper);
-
 XLogRegisterBuffer(i, pageData->buffer,
    REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
 			}
 			else
 			{
-/*
- * In normal mode, calculate delta and write it as xlog data
- * associated with this page.
- */
-computeDelta(pageData, page, (Page) pageData->image);
-
-/* Apply the image, with zeroed "hole" as above */
-memcpy(page, pageData->image, pageHeader->pd_lower);
-memset(page + pageHeader->pd_lower, 0,
-	   pageHeader->pd_upper - pageHeader->pd_lower);
-memcpy(page + pageHeader->pd_upper,
-	   pageData->image + pageHeader->pd_upper,
-	   BLCKSZ - pageHeader->pd_upper);
-
 XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD);
 XLogRegisterBufData(i, pageData->delta, pageData->deltaLen);
 			}
@@ -402,7 +405,7 @@ GenericXLogFinish(GenericXLogState *state)
 		/* Insert xlog record */
 		lsn = XLogInsert(RM_GENERIC_ID, 0);
 
-		/* Set LSN and mark buffers dirty */
+		/* Set LSN */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
 			PageData   *pageData = &state->pages[i];
@@ -410,7 +413,6 @@ GenericXLogFinish(GenericXLogState *state)
 			if (BufferIsInvalid(pageData->buffer))
 continue;
 			PageSetLSN(BufferGetPage(pageData->buffer), lsn);
-			MarkBufferDirty(pageData->buffer);
 		}
 		END_CRIT_SECTION();
 	}
-- 
2.34.1



Re: Is this a problem in GenericXLogFinish()?

2023-09-25 Thread Heikki Linnakangas

On 22/09/2023 23:52, Jeff Davis wrote:


> src/backend/transam/README says:
>
>    ...
>    4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This
>    must happen before the WAL record is inserted; see notes in
>    SyncOneBuffer().)
>    ...
>
> But GenericXLogFinish() does this:
>
>    ...
>    /* Insert xlog record */
>    lsn = XLogInsert(RM_GENERIC_ID, 0);
>
>    /* Set LSN and mark buffers dirty */
>    for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
>    {
>        PageData   *pageData = &state->pages[i];
>
>        if (BufferIsInvalid(pageData->buffer))
>            continue;
>        PageSetLSN(BufferGetPage(pageData->buffer), lsn);
>        MarkBufferDirty(pageData->buffer);
>    }
>    END_CRIT_SECTION();
>
> Am I missing something or is that a problem?


Yes, that's a problem.

I wish we had an assertion for that. XLogInsert() could assert that the 
page is already marked dirty, for example.


--
Heikki Linnakangas
Neon (https://neon.tech)





Is this a problem in GenericXLogFinish()?

2023-09-22 Thread Jeff Davis


src/backend/transam/README says:

  ...
  4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This 
  must happen before the WAL record is inserted; see notes in 
  SyncOneBuffer().)
  ...

But GenericXLogFinish() does this:

  ...
  /* Insert xlog record */
  lsn = XLogInsert(RM_GENERIC_ID, 0);

  /* Set LSN and mark buffers dirty */
  for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
  {
      PageData   *pageData = &state->pages[i];

      if (BufferIsInvalid(pageData->buffer))
          continue;
      PageSetLSN(BufferGetPage(pageData->buffer), lsn);
      MarkBufferDirty(pageData->buffer);
  }
  END_CRIT_SECTION();

Am I missing something or is that a problem?

Regards,
Jeff Davis