subject:"\[Proposal\] Fully WAL logged CREATE DATABASE \- No Checkpoints"

RE: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2023-01-11 Thread Yoshikazu Imai (Fujitsu)

Hi.

While investigating the codes in RelationCopyStorageUsingBuffer, I wonder that
there is any reason to use RBM_NORMAL for acquiring destination buffer.
I think we can use RBM_ZERO_AND_LOCK here instead of RBM_NORMAL.

When we use RBM_NORMAL, a destination block is read by smgrread and it's
content is verified with PageIsVerifiedExtended in ReadBuffer_common.
A page verification normally succeeds because destination blocks are
zero-filled by using smgrextend, but do we trust that it is surely zero-filled?

On Fri, Aug 5, 2022 at 00:32 AM Tom Lane  
wrote:
> Hmm ... makes sense.  We'd be using smgrextend to write just the last page
> of the file, relying on its API spec "Note that we assume writing a block
> beyond current EOF causes intervening file space to become filled with
> zeroes".  I don't know that we're using that assumption aggressively
> today, but as long as it doesn't confuse the kernel it'd probably be fine.

If there is a block which is not zero-filled, page verification would fail and
eventually CREATE DATABASE would fail.
(I also think that originally we don't need page verification for destination 
blocks
here because those blocks are just overwritten by source buffer.)

Also, from performance POV, I think it is good to use RBM_ZERO_AND_LOCK.
In RBM_NORMAL, destination blocks are read from disk by smgrread each time, but
in RBM_ZERO_AND_LOCK, we only set buffers zero-filled by MemSet and don't
access the disk which makes performance better.
If we expect the destination buffer is always zero-filled, we can use
RBM_ZERO_AND_LOCK.

Apart from above, there seems to be an old comment which is for the old codes
when acquiring destination buffer by using P_NEW.

"/* Use P_NEW to extend the destination relation. */"

--
Yoshikazu Imai

> -Original Message-
> From: Dilip Kumar 
> Sent: Friday, September 2, 2022 11:22 PM
> To: Justin Pryzby 
> Cc: Robert Haas ; Tom Lane ; 
> Andres Freund ;
> Ashutosh Sharma ; Maciek Sakrejda 
> ; Bruce Momjian
> ; Alvaro Herrera ; Andrew Dunstan 
> ; Heikki
> Linnakangas ; pgsql-hackers@lists.postgresql.org; Thomas 
> Munro 
> Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
> 
> On Fri, Sep 2, 2022 at 5:25 PM Justin Pryzby  wrote:
> >
> > On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> > > Also, if I understand correctly, this patch seems to assume that
> > > nobody is connected to the source database.  But what's actually
> > > enforced is just that nobody *else* is connected.  Is it any issue
> > > that the current DB can be used as a source?  Anyway, both of the
> > > above problems are reproducible using a different database.
> > >
> > > |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> > > |CREATE DATABASE
> >
> > On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> > > On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > > > The "invalidation" comment bothered me for awhile, but I think
> > > > it's
> > > > fine: we know that no other backend can connect to the source DB
> > > > because we have it locked,
> > >
> > > About that - is it any problem that the currently-connected db can
> > > be used as a template?  It's no issue for 2-phase commit, because
> > > "create database" cannot run in an txn.
> >
> > Would anybody want to comment on this ?
> > Is it okay that the *current* DB can be used as a template ?
> 
> I don't think there should be any problem with that.  The main problem could 
> have been that since we are reading the
> pg_class tuple block by block there could be an issue if someone concurrently 
> modifies the pg_class or there are some
> tuples that are inserted by the prepared transaction.  But in this case, the 
> same backend can not have an open prepared
> transaction while creating a database and that backend of course can not 
> perform any parallel operation as well.
> 
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-09-02 Thread Dilip Kumar

On Fri, Sep 2, 2022 at 5:25 PM Justin Pryzby  wrote:
>
> On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> > Also, if I understand correctly, this patch seems to assume that nobody is
> > connected to the source database.  But what's actually enforced is just that
> > nobody *else* is connected.  Is it any issue that the current DB can be 
> > used as
> > a source?  Anyway, both of the above problems are reproducible using a
> > different database.
> >
> > |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> > |CREATE DATABASE
>
> On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> > On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > > The "invalidation" comment bothered me for awhile, but I think it's
> > > fine: we know that no other backend can connect to the source DB
> > > because we have it locked,
> >
> > About that - is it any problem that the currently-connected db can be used 
> > as a
> > template?  It's no issue for 2-phase commit, because "create database" 
> > cannot
> > run in an txn.
>
> Would anybody want to comment on this ?
> Is it okay that the *current* DB can be used as a template ?

I don't think there should be any problem with that.  The main problem
could have been that since we are reading the pg_class tuple block by
block there could be an issue if someone concurrently modifies the
pg_class or there are some tuples that are inserted by the prepared
transaction.  But in this case, the same backend can not have an open
prepared transaction while creating a database and that backend of
course can not perform any parallel operation as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-09-02 Thread Justin Pryzby

On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> Also, if I understand correctly, this patch seems to assume that nobody is
> connected to the source database.  But what's actually enforced is just that
> nobody *else* is connected.  Is it any issue that the current DB can be used 
> as
> a source?  Anyway, both of the above problems are reproducible using a
> different database.
> 
> |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> |CREATE DATABASE

On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > The "invalidation" comment bothered me for awhile, but I think it's
> > fine: we know that no other backend can connect to the source DB
> > because we have it locked,
> 
> About that - is it any problem that the currently-connected db can be used as 
> a
> template?  It's no issue for 2-phase commit, because "create database" cannot
> run in an txn.

Would anybody want to comment on this ?
Is it okay that the *current* DB can be used as a template ?

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-18 Thread Robert Haas

On Wed, Aug 17, 2022 at 12:02 AM Dilip Kumar  wrote:
> > Although 0002 is formally a performance optimization, I'm inclined to
> > think that if we're going to commit it, it should also be back-patched
> > into v15, because letting the code diverge when we're not even out of
> > beta yet seems painful.
>
> Yeah that makes sense to me.

OK, done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-16 Thread Dilip Kumar

On Fri, Aug 12, 2022 at 6:33 PM Robert Haas  wrote:
>
> On Thu, Aug 11, 2022 at 2:15 PM Robert Haas  wrote:
> > As far as I know, this 0001 addresses all outstanding comments and
> > fixes the reported bug.
> >
> > Does anyone think otherwise?
>
> If they do, they're keeping quiet, so I committed this and
> back-patched it to v15.
>
> Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Yes we can do that, although here we are not using this buffer
directly as a "Page" so we do not have any real alignment issue but I
do not see any problem in using PGAlignedBlock so change that.

> Although 0002 is formally a performance optimization, I'm inclined to
> think that if we're going to commit it, it should also be back-patched
> into v15, because letting the code diverge when we're not even out of
> beta yet seems painful.

Yeah that makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 59fadefe04f8f2eeb6bc5e2e02efde56d5ace8aa Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 5 Aug 2022 11:25:23 +0530
Subject: [PATCH v4] Optimize copy storage from source to destination

Instead of extending block at a time directly bulkextend the destination
relation and then just perform the block level copy.
---
 src/backend/storage/buffer/bufmgr.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9c1bd50..7a1202c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3712,6 +3712,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bool		use_wal;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	PGAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
 
@@ -3730,6 +3731,14 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	if (nblocks == 0)
 		return;
 
+	/*
+	 * Bulk extend the destination relation of the same size as the source
+	 * relation before starting to copy block by block.
+	 */
+	memset(buf.data, 0, BLCKSZ);
+	smgrextend(smgropen(dstlocator, InvalidBackendId), forkNum, nblocks - 1,
+			   buf.data, true);
+
 	/* This is a bulk operation, so use buffer access strategies. */
 	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
@@ -3747,7 +3756,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, blkno,
 		   RBM_NORMAL, bstrategy_dst,
 		   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
-- 
1.8.3.1

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-12 Thread Robert Haas

On Thu, Aug 11, 2022 at 2:15 PM Robert Haas  wrote:
> As far as I know, this 0001 addresses all outstanding comments and
> fixes the reported bug.
>
> Does anyone think otherwise?

If they do, they're keeping quiet, so I committed this and
back-patched it to v15.

Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Although 0002 is formally a performance optimization, I'm inclined to
think that if we're going to commit it, it should also be back-patched
into v15, because letting the code diverge when we're not even out of
beta yet seems painful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-11 Thread Robert Haas

On Wed, Aug 10, 2022 at 1:01 AM Dilip Kumar  wrote:
> Done, along with that, I have also got the hunk of smgropen and
> smgrclose in ScanSourceDatabasePgClass() which I had in v1 patch[1].
> Because here we do not want to reuse the smgr of the pg_class again so
> instead of closing at the end with smgrcloserellocator() we can just
> keep the smgr reference and close immediately after getting the number
> of blocks.  Whereas in CreateAndCopyRelationData and
> RelationCopyStorageUsingBuffer() we are using the smgr of the source
> and dest relation multiple time so it make sense to not close it
> immediately and we can close while exiting the function with
> smgrcloserellocator().

As far as I know, this 0001 addresses all outstanding comments and
fixes the reported bug.

Does anyone think otherwise?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-09 Thread Dilip Kumar

On Sun, Aug 7, 2022 at 9:47 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:
> > On Sat, Aug 6, 2022 at 9:36 PM Tom Lane  wrote:
> > >
> > > Dilip Kumar  writes:
> > > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar  
> > > > wrote:
> > > >> Yeah maybe it is not necessary to close as these unowned smgr will
> > > >> automatically get closed on the transaction end.
> > >
> > > I do not think this is a great idea for the per-relation smgrs created
> > > during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> > > transaction end, but that doesn't mean that creating possibly tens of
> > > thousands of transient smgrs isn't going to cause performance issues.
>
> I was assuming that the files would get reopened at the end of the transaction
> anyway, but it looks like that's not the case, unless wal_level=minimal.
>
> Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
> register_delete = false, which is ok because createdb_failure_callback will
> clean things up. But that's another thing that's not great for a routine with
> a general name...
>
>
> > Okay, so for that we can simply call smgrcloserellocator(rlocator);
> > before exiting the RelationCopyStorageUsingBuffer() right?
>
> Yea, I think so.

Done, along with that, I have also got the hunk of smgropen and
smgrclose in ScanSourceDatabasePgClass() which I had in v1 patch[1].
Because here we do not want to reuse the smgr of the pg_class again so
instead of closing at the end with smgrcloserellocator() we can just
keep the smgr reference and close immediately after getting the number
of blocks.  Whereas in CreateAndCopyRelationData and
RelationCopyStorageUsingBuffer() we are using the smgr of the source
and dest relation multiple time so it make sense to not close it
immediately and we can close while exiting the function with
smgrcloserellocator().

[1]
+ smgr = smgropen(rlocator, InvalidBackendId);
+ nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+ smgrclose(smgr);



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v3-0002-Optimize-copy-storage-from-source-to-destination.patch
Description: Binary data


v3-0001-Avoid-setting-the-fake-relcache-entry-as-smgr-own.patch
Description: Binary data

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-06 Thread Andres Freund

Hi,

On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:
> On Sat, Aug 6, 2022 at 9:36 PM Tom Lane  wrote:
> >
> > Dilip Kumar  writes:
> > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar  wrote:
> > >> Yeah maybe it is not necessary to close as these unowned smgr will
> > >> automatically get closed on the transaction end.
> >
> > I do not think this is a great idea for the per-relation smgrs created
> > during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> > transaction end, but that doesn't mean that creating possibly tens of
> > thousands of transient smgrs isn't going to cause performance issues.

I was assuming that the files would get reopened at the end of the transaction
anyway, but it looks like that's not the case, unless wal_level=minimal.

Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
register_delete = false, which is ok because createdb_failure_callback will
clean things up. But that's another thing that's not great for a routine with
a general name...


> Okay, so for that we can simply call smgrcloserellocator(rlocator);
> before exiting the RelationCopyStorageUsingBuffer() right?

Yea, I think so.


> > I think RelationCopyStorageUsingBuffer needs to open and then close
> > the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
> > appropriate API for it to use, either; need to go down another level.
> 
> Not sure how going down another level would help, the whole point is
> that we don't want to keep the reference of the smgr for a long time
> especially in the loop which is interruptible.

Yea, I'm not following either.


Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-06 Thread Dilip Kumar

On Sat, Aug 6, 2022 at 9:36 PM Tom Lane  wrote:
>
> Dilip Kumar  writes:
> > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar  wrote:
> >> Yeah maybe it is not necessary to close as these unowned smgr will
> >> automatically get closed on the transaction end.
>
> I do not think this is a great idea for the per-relation smgrs created
> during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> transaction end, but that doesn't mean that creating possibly tens of
> thousands of transient smgrs isn't going to cause performance issues.

Okay, so for that we can simply call smgrcloserellocator(rlocator);
before exiting the RelationCopyStorageUsingBuffer() right?

> I think RelationCopyStorageUsingBuffer needs to open and then close
> the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
> appropriate API for it to use, either; need to go down another level.

Not sure how going down another level would help, the whole point is
that we don't want to keep the reference of the smgr for a long time
especially in the loop which is interruptible.  So everytime we need
smgr we can call smgropen and if it is already in the smgr cache then
we will get it from there.  So I think it makes sense that when we are
exiting the function that time we can just call smgrcloserellocator()
so that if it is opened it will be closed and otherwise it will do
nothing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-06 Thread Tom Lane

Dilip Kumar  writes:
> On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar  wrote:
>> Yeah maybe it is not necessary to close as these unowned smgr will
>> automatically get closed on the transaction end.

I do not think this is a great idea for the per-relation smgrs created
during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
transaction end, but that doesn't mean that creating possibly tens of
thousands of transient smgrs isn't going to cause performance issues.

I think RelationCopyStorageUsingBuffer needs to open and then close
the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
appropriate API for it to use, either; need to go down another level.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-06 Thread Tom Lane

Dilip Kumar  writes:
> PFA patches for different problems discussed in the thread

> 0001 - Fix the problem of skipping the empty block and buffer lock on
> source buffer
> 0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
> 0003 - Optimization to avoid extending block by block

I pushed 0001, because it seems fairly critical to get that in before
beta3.  The others can stand more leisurely discussion.

I note from
https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html
that the block-skipping path is actually taken in our tests (this won't be
visible there for very much longer of course).  So we actually *are*
making a corrupt copy, and we haven't noticed.  This is perhaps not too
surprising, because the only test case that I can find is in
020_createdb.pl:

$node->issues_sql_like(
[ 'createdb', '-T', 'foobar2', '-S', 'wal_log', 'foobar6' ],
qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE 
foobar2/,
'create database with WAL_LOG strategy');

which is, um, not exactly a robust test of whether anything happened
at all, let alone whether it was correct.  I'm not real sure that
this test would even notice if the CREATE reported failure.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-05 Thread Andres Freund

Hi,

On 2022-08-04 19:14:08 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
> >> In any case, DROP DATABASE is far from the only place with a problem.
> 
> > What other place has a database corrupting potential of this magnitude just
> > because interrupts are accepted?  We throw valid s_b contents away and then
> > accept interrupts before committing - with predictable results. We also 
> > accept
> > interrupts as part of deleting the db data dir (due to catalog access).
> 
> Those things would be better handled by moving the data-discarding
> steps to post-commit.  Maybe that argues for having an internal
> commit halfway through DROP DATABASE: remove pg_database row,
> commit, start new transaction, clean up.

That'd still require holding interrupts, I think. We shouldn't accept
interrupts until the on-disk data is actually deleted.

In theory I think we should have a pg_database column indicating whether the
database is valid or not. For database creation, insert the pg_database row
with valid=false, commit, then do the filesystem operation, then mark as
valid, commit. For database drop, mark as invalid, commit, remove filesystem
stuff, delete row, commit.  With dropdb allowed against an invalid database,
but obviously nothing else.  But clearly this isn't a short term /
backpatchable thing.

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-05 Thread Dilip Kumar

On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar  wrote:
>
> Yeah maybe it is not necessary to close as these unowned smgr will
> automatically get closed on the transaction end.  Actually the
> previous person of the patch had both these comments fixed.  The
> reason for explicitly closing it is that I have noticed that most of
> the places we explicitly close the smgr where we do smgropen e.g.
> index_copy_data(), heapam_relation_copy_data() OTOH some places we
> don't close it e.g. IssuePendingWritebacks().   So now I think that in
> our case better we do not close it because I do not like this specific
> code at the end to close the smgr.

PFA patches for different problems discussed in the thread

0001 - Fix the problem of skipping the empty block and buffer lock on
source buffer
0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
0003 - Optimization to avoid extending block by block

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 8a71e5dd10ff65d250815dc17f8f64212c2e57b0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 5 Aug 2022 11:25:23 +0530
Subject: [PATCH v2 3/3] Optimize copy storage from source to destination

Instead of extending block at a time directly bulkextend the destination
relation and then just perform the block level copy.
---
 src/backend/storage/buffer/bufmgr.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b488306..b7df980 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3710,6 +3710,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	Page		srcPage;
 	Page		dstPage;
 	bool		use_wal;
+	char		buffer[BLCKSZ];
 	BlockNumber nblocks;
 	BlockNumber blkno;
 	BufferAccessStrategy bstrategy_src;
@@ -3730,6 +3731,14 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	if (nblocks == 0)
 		return;
 
+	/*
+	 * Bulk extend the destination relation of the same size as the source
+	 * relation before starting to copy block by block.
+	 */
+	memset(buffer, 0, BLCKSZ);
+	smgrextend(smgropen(dstlocator, InvalidBackendId), forkNum, nblocks - 1,
+			   buffer, true);
+
 	/* This is a bulk operation, so use buffer access strategies. */
 	bstrategy_src = GetAccessStrategy(BAS_BULKREAD);
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
@@ -3748,7 +3757,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 		srcPage = BufferGetPage(srcBuf);
 
 		/* Use P_NEW to extend the destination relation. */
-		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, P_NEW,
+		dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, blkno,
 		   RBM_NORMAL, bstrategy_dst,
 		   permanent);
 		LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);
-- 
1.8.3.1

From 85d17fb7c91fdb6b40f09d00e9a5606ba2c90e57 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 5 Aug 2022 10:59:18 +0530
Subject: [PATCH v2 2/3] Avoid setting the fake relcache entry as smgr owner
 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During CREATE DATABASE, we are not connected to the source and the
destination DB so in order to operate on the storage we are using
FakeRelCacheEntry and by using that we are calling RelationGetSmgr().

So the problem is that this function will set the temporary
FakeRelCacheEntry as an owner of the smgr.  Now if there is any
error before we close the FakeRelCacheEntry then the memory of the
fake relcache entry will be released at the transaction abort but
the smgr will survive the transaction.  So now smgr is pointing
to some already release memory and it will have random behavior
when we try to access the smgr next time.

For fixing the issue instead of using the FakeRelCacheEntry, directly
call the smgropen() but do not keep the reference to the smgr.
So every time call smgropen() whenever we need it.  This is required to
ensure that we do not access the smgr pointer which might have already
been closed during interrupt processing.
---
 src/backend/commands/dbcommands.c   | 12 +--
 src/backend/storage/buffer/bufmgr.c | 43 +
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9f990a8..88d4fe1 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,7 +258,6 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
 	BufferAccessStrategy bstrategy;
 
@@ -276,16 +275,7 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar

On Fri, Aug 5, 2022 at 2:59 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> What's the motivation behind the explicit close?
>
>
> > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> > *srcpath)
> >   Pagepage;
> >   List   *rlocatorlist = NIL;
> >   LockRelId   relid;
> > - Relationrel;
> >   Snapshotsnapshot;
> > + SMgrRelationsmgr;
> >   BufferAccessStrategy bstrategy;
> >
> >   /* Get pg_class relfilenumber. */
> > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> > *srcpath)
> >   rlocator.dbOid = dbid;
> >   rlocator.relNumber = relfilenumber;
> >
> > - /*
> > -  * We can't use a real relcache entry for a relation in some other
> > -  * database, but since we're only going to access the fields related 
> > to
> > -  * physical storage, a fake one is good enough. If we didn't do this 
> > and
> > -  * used the smgr layer directly, we would have to worry about
> > -  * invalidations.
> > -  */
> > - rel = CreateFakeRelcacheEntry(rlocator);
> > - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> > - FreeFakeRelcacheEntry(rel);
> > + smgr = smgropen(rlocator, InvalidBackendId);
> > + nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> > + smgrclose(smgr);
>
> Why are you opening and then closing again? Part of the motivation for the
> question is that a local SMgrRelation variable may lead to it being used
> further, opening up interrupt processing issues.

Yeah okay, I think there is no reason to close, in the previous
version I had like below and I think that's a better idea.

nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM)

>
> > + rlocator.locator = src_rlocator;
> > + smgrcloserellocator(rlocator);
> > +
> > + rlocator.locator = dst_rlocator;
> > + smgrcloserellocator(rlocator);
>
> As mentioned above, it's not clear to me why this is now done...
>
> Otherwise looks good to me.

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end.  Actually the
previous person of the patch had both these comments fixed.  The
reason for explicitly closing it is that I have noticed that most of
the places we explicitly close the smgr where we do smgropen e.g.
index_copy_data(), heapam_relation_copy_data() OTOH some places we
don't close it e.g. IssuePendingWritebacks().   So now I think that in
our case better we do not close it because I do not like this specific
code at the end to close the smgr.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar

On Fri, Aug 5, 2022 at 5:36 AM Andres Freund  wrote:
>
> Hi,
>
> On August 4, 2022 4:20:16 PM PDT, Tom Lane  wrote:
> >Yeah, the assumption that P_NEW would automatically match the source block
> >was making me itchy too.  An explicit test-and-elog seems worth the
> >cycles.
>
> Is there a good reason to rely on P_NEW at all?

I think there were 2 arguments for which we used bufmgr instead of
smgrextend for the destination database

1) (Comment from Andres) The big benefit would be that the *target*
database does not have to be written out / shared buffer is
immediately populated. [1]
2) (Comment from Robert) We wanted to avoid writing new code which
bypasses the shared buffers.

[1]https://www.postgresql.org/message-id/20210905202800.ji4fnfs3xzhvo7l6%40alap3.anarazel.de

 Both from an efficiency and robustness POV it seems like it'd be
better to use smgrextend to bulk extend just after smgrcreate() and
then fill s_b u using RBM_ZERO (or whatever it is called).  That bulk
smgrextend would later be a good point to use fallocate so the FS can
immediately size the file correctly, without a lot of pointless writes
of zeroes.

Yeah okay, so you mean since we already know the nblocks in the source
file so we can directly do smgrextend in bulk before the copy loop and
then we can just copy block by block using bufmgr with proper blkno
instead of P_NEW.  Yeah I think this looks optimized to me and this
will take care of the above 2 points I mentioned that we will still
have the target database pages in the shared buffers and we are not
bypassing the shared buffers also.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar

On Fri, Aug 5, 2022 at 4:31 AM Tom Lane  wrote:
>
> I wrote:
> > While I'm at it, I would like to strenuously object to the current
> > framing of CreateAndCopyRelationData as a general-purpose copying
> > mechanism.
>
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?
>
> /* Read block from source relation. */
> srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
>RBM_NORMAL, bstrategy_src,
>permanent);
> srcPage = BufferGetPage(srcBuf);
> if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
> {
> ReleaseBuffer(srcBuf);
> continue;
> }
>
> /* Use P_NEW to extend the destination relation. */
> dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
>RBM_NORMAL, bstrategy_dst,
>permanent);
>
> You can't skip pages just because they are empty.  Well, maybe you could
> if you were doing something to ensure that you zero-fill the corresponding
> blocks on the destination side.  But this isn't doing that.  It's using
> P_NEW for dstBuf, which will have the effect of silently collapsing out
> such pages.  Maybe in isolation a heap could withstand that, but its
> indexes won't be happy (and I guess t_ctid chain links won't either).
>
> I think you should just lose the if() stanza.  There's no optimization to
> be had here that's worth any extra complication.
>
> (This seems worth fixing before beta3, as it looks like a rather
> nasty data corruption hazard.)

Yeah this is broken.

--
Dilip
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas

On Thu, Aug 4, 2022 at 7:11 PM Tom Lane  wrote:
> [pile^2]  Also, what is the rationale for locking the target buffer
> but not the source buffer?  That seems pretty hard to justify from
> here, even granting the assumption that we don't expect any other
> processes to be interested in these buffers (which I don't grant,
> because checkpointer).

Ooph. I agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas

On Thu, Aug 4, 2022 at 7:01 PM Tom Lane  wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?

Ouch. That's pretty bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas

On Thu, Aug 4, 2022 at 6:02 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I have reviewed this patch and I don't see a problem with it. However,
> > it would be nice if Andres or someone else who understands this area
> > well (Tom? Thomas?) would also review it, because I also reviewed
> > what's in the tree now and that turns out to be buggy, which leads me
> > to conclude that I don't understand this area as well as would be
> > desirable.
>
> FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
> here, because I do not think that mechanism is meant to be used
> outside of WAL replay.  However, this patch fails to remove it from
> CreateAndCopyRelationData, which seems likely to be just as much
> at risk.

It looks to me like it does?

> The "invalidation" comment bothered me for awhile, but I think it's
> fine: we know that no other backend can connect to the source DB
> because we have it locked, and we know that no other backend can
> connect to the destination DB because it doesn't exist yet according
> to the catalogs, so nothing could possibly occur to invalidate our
> idea of where the physical files are.  It would be nice to document
> these assumptions, though, rather than merely remove all the relevant
> commentary.

I don't think that's the point. We could always suffer a sinval reset
or a PROCSIGNAL_BARRIER_SMGRRELEASE. But since the code avoids ever
reusing the smgr, it should be OK. I think.

> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.  Because of the above assumptions, I think it's utterly
> unsafe to use anywhere except in CREATE DATABASE.  The header comment
> fails to warn about that at all, and placing it in bufmgr.c rather
> than static in dbcommands.c is just an invitation to future misuse.
> Perhaps I'm overly sensitive to that because I just finished cleaning
> up somebody's misuse of non-general-purpose code (1aa8dad41), but
> as this stands I think it's positively dangerous.

OK. No objection to you revising the comments however you feel is appropriate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Andres Freund  writes:
> Is there a good reason to rely on P_NEW at all? Both from an efficiency
> and robustness POV it seems like it'd be better to use smgrextend to
> bulk extend just after smgrcreate() and then fill s_b u using RBM_ZERO
> (or whatever it is called).  That bulk smgrextend would later be a good
> point to use fallocate so the FS can immediately size the file
> correctly, without a lot of pointless writes of zeroes.

Hmm ... makes sense.  We'd be using smgrextend to write just the last page
of the file, relying on its API spec "Note that we assume writing a block
beyond current EOF causes intervening file space to become filled with
zeroes".  I don't know that we're using that assumption aggressively
today, but as long as it doesn't confuse the kernel it'd probably be fine.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Andres Freund  writes:
> On August 4, 2022 4:11:13 PM PDT, Tom Lane  wrote:
>> [pile^2]  Also, what is the rationale for locking the target buffer
>> but not the source buffer?  That seems pretty hard to justify from
>> here, even granting the assumption that we don't expect any other
>> processes to be interested in these buffers (which I don't grant,
>> because checkpointer).

> I'm not arguing it's good or should stay that way, but it's probably okayish 
> that checkpointer / bgwriter have access, given that they will never modify 
> buffers. They just take a lock to prevent concurrent modifications, which 
> RelationCopyStorageUsingBuffer hopefully doesn't do. 

I'm not arguing that it's actively broken today --- but AFAIR,
every other access to a shared buffer takes a buffer lock.
It does not seem to me to be very future-proof for this code to
decide it's exempt from that rule, without so much as a comment
justifying it.  Furthermore, what's the gain?  We aren't expecting
contention here, I think.  If we were, then it probably *would* be
actively broken.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi, 

On August 4, 2022 4:20:16 PM PDT, Tom Lane  wrote:
>Yeah, the assumption that P_NEW would automatically match the source block
>was making me itchy too.  An explicit test-and-elog seems worth the
>cycles.

Is there a good reason to rely on P_NEW at all? Both from an efficiency and 
robustness POV it seems like it'd be better to use smgrextend to bulk extend 
just after smgrcreate() and then fill s_b u using RBM_ZERO (or whatever it is 
called).  That bulk smgrextend would later be a good point to use fallocate so 
the FS can immediately size the file correctly, without a lot of pointless 
writes of zeroes. 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi, 

On August 4, 2022 4:11:13 PM PDT, Tom Lane  wrote:
>I wrote:
>> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
>> not completely broken?
>
>[pile^2]  Also, what is the rationale for locking the target buffer
>but not the source buffer?  That seems pretty hard to justify from
>here, even granting the assumption that we don't expect any other
>processes to be interested in these buffers (which I don't grant,
>because checkpointer).

I'm not arguing it's good or should stay that way, but it's probably okayish 
that checkpointer / bgwriter have access, given that they will never modify 
buffers. They just take a lock to prevent concurrent modifications, which 
RelationCopyStorageUsingBuffer hopefully doesn't do. 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Andres Freund  writes:
> On 2022-08-04 19:01:06 -0400, Tom Lane wrote:
>> (This seems worth fixing before beta3, as it looks like a rather
>> nasty data corruption hazard.)

> Ugh, yes. And even with this fixed I think this should grow at least an
> assertion that the block numbers match, probably even an elog.

Yeah, the assumption that P_NEW would automatically match the source block
was making me itchy too.  An explicit test-and-elog seems worth the
cycles.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Andres Freund  writes:
> On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
>> In any case, DROP DATABASE is far from the only place with a problem.

> What other place has a database corrupting potential of this magnitude just
> because interrupts are accepted?  We throw valid s_b contents away and then
> accept interrupts before committing - with predictable results. We also accept
> interrupts as part of deleting the db data dir (due to catalog access).

Those things would be better handled by moving the data-discarding
steps to post-commit.  Maybe that argues for having an internal
commit halfway through DROP DATABASE: remove pg_database row,
commit, start new transaction, clean up.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi,

On 2022-08-04 19:01:06 -0400, Tom Lane wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?
> 
> /* Read block from source relation. */
> srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
>RBM_NORMAL, bstrategy_src,
>permanent);
> srcPage = BufferGetPage(srcBuf);
> if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
> {
> ReleaseBuffer(srcBuf);
> continue;
> }
> 
> /* Use P_NEW to extend the destination relation. */
> dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
>RBM_NORMAL, bstrategy_dst,
>permanent);
> 
> You can't skip pages just because they are empty.  Well, maybe you could
> if you were doing something to ensure that you zero-fill the corresponding
> blocks on the destination side.  But this isn't doing that.  It's using
> P_NEW for dstBuf, which will have the effect of silently collapsing out
> such pages.  Maybe in isolation a heap could withstand that, but its
> indexes won't be happy (and I guess t_ctid chain links won't either).
> 
> I think you should just lose the if() stanza.  There's no optimization to
> be had here that's worth any extra complication.
> 
> (This seems worth fixing before beta3, as it looks like a rather
> nasty data corruption hazard.)

Ugh, yes. And even with this fixed I think this should grow at least an
assertion that the block numbers match, probably even an elog.

Greetings,

Andres

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

I wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?

[pile^2]  Also, what is the rationale for locking the target buffer
but not the source buffer?  That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

I wrote:
> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

/* Read block from source relation. */
srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
   RBM_NORMAL, bstrategy_src,
   permanent);
srcPage = BufferGetPage(srcBuf);
if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
{
ReleaseBuffer(srcBuf);
continue;
}

/* Use P_NEW to extend the destination relation. */
dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
   RBM_NORMAL, bstrategy_dst,
   permanent);

You can't skip pages just because they are empty.  Well, maybe you could
if you were doing something to ensure that you zero-fill the corresponding
blocks on the destination side.  But this isn't doing that.  It's using
P_NEW for dstBuf, which will have the effect of silently collapsing out
such pages.  Maybe in isolation a heap could withstand that, but its
indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza.  There's no optimization to
be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi,

On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Yea. I think at the very least we need to start holding interrupts before 
> > the
> > DropDatabaseBuffers() and do so until commit. That's probably best done by
> > doing the transaction commit inside dropdb.
> 
> We've talked before about ignoring interrupts across commit, but
> I find the idea a bit scary.

I'm not actually suggesting to do so across commit, just until the
commit. Maintaining that seems easiest if dropdb() does the commit internally.

> In any case, DROP DATABASE is far from the only place with a problem.

What other place has a database corrupting potential of this magnitude just
because interrupts are accepted?  We throw valid s_b contents away and then
accept interrupts before committing - with predictable results. We also accept
interrupts as part of deleting the db data dir (due to catalog access).

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Justin Pryzby

On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> The "invalidation" comment bothered me for awhile, but I think it's
> fine: we know that no other backend can connect to the source DB
> because we have it locked,

About that - is it any problem that the currently-connected db can be used as a
template?  It's no issue for 2-phase commit, because "create database" cannot
run in an txn.

-- 
Justin

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Andres Freund  writes:
> Yea. I think at the very least we need to start holding interrupts before the
> DropDatabaseBuffers() and do so until commit. That's probably best done by
> doing the transaction commit inside dropdb.

We've talked before about ignoring interrupts across commit, but
I find the idea a bit scary.  In any case, DROP DATABASE is far
from the only place with a problem.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Robert Haas  writes:
> I have reviewed this patch and I don't see a problem with it. However,
> it would be nice if Andres or someone else who understands this area
> well (Tom? Thomas?) would also review it, because I also reviewed
> what's in the tree now and that turns out to be buggy, which leads me
> to conclude that I don't understand this area as well as would be
> desirable.

FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
here, because I do not think that mechanism is meant to be used
outside of WAL replay.  However, this patch fails to remove it from
CreateAndCopyRelationData, which seems likely to be just as much
at risk.

The "invalidation" comment bothered me for awhile, but I think it's
fine: we know that no other backend can connect to the source DB
because we have it locked, and we know that no other backend can
connect to the destination DB because it doesn't exist yet according
to the catalogs, so nothing could possibly occur to invalidate our
idea of where the physical files are.  It would be nice to document
these assumptions, though, rather than merely remove all the relevant
commentary.

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism.  Because of the above assumptions, I think it's utterly
unsafe to use anywhere except in CREATE DATABASE.  The header comment
fails to warn about that at all, and placing it in bufmgr.c rather
than static in dbcommands.c is just an invitation to future misuse.
Perhaps I'm overly sensitive to that because I just finished cleaning
up somebody's misuse of non-general-purpose code (1aa8dad41), but
as this stands I think it's positively dangerous.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi,

On 2022-08-04 16:38:35 +0530, Dilip Kumar wrote:
> So basically, from this we can say it is completely a problem with
> drop databases, I mean I can produce any behavior by interrupting drop
> database
> 1. If we created some tables/inserted data and the drop database got
> cancelled, it might have a database directory and those objects are
> lost.
> 2.  Or you can even drop the database directory and then get cancelled
> before deleting the pg_database entry then also you will end up with a
> corrupted database, doesn't matter whether you created it with WAL LOG
> or FILE COPY.

Yea. I think at the very least we need to start holding interrupts before the
DropDatabaseBuffers() and do so until commit. That's probably best done by
doing the transaction commit inside dropdb.

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi,

On 2022-08-04 16:07:01 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar  wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
> 
> I have reviewed this patch and I don't see a problem with it. However,
> it would be nice if Andres or someone else who understands this area
> well (Tom? Thomas?) would also review it, because I also reviewed
> what's in the tree now and that turns out to be buggy, which leads me
> to conclude that I don't understand this area as well as would be
> desirable.

I don't think this issue is something I'd have caught "originally"
either. It's probably more of a "fake relcache entry" issue (or undocumented
limitation) than a bug in the new code.

I did a quick review upthread - some minor quibbles aside, I think it looks
good.


> I'm inclined to hold off on committing this until next week, not only
> for that reason, but also because there's a wrap planned on Monday,
> and committing anything now seems like it might have too much of a
> risk of upsetting that plan.

Makes sense. Unlikely to be a blocker for anybody.

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund

Hi,

On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

What's the motivation behind the explicit close?


> @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> *srcpath)
>   Pagepage;
>   List   *rlocatorlist = NIL;
>   LockRelId   relid;
> - Relationrel;
>   Snapshotsnapshot;
> + SMgrRelationsmgr;
>   BufferAccessStrategy bstrategy;
>  
>   /* Get pg_class relfilenumber. */
> @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> *srcpath)
>   rlocator.dbOid = dbid;
>   rlocator.relNumber = relfilenumber;
>  
> - /*
> -  * We can't use a real relcache entry for a relation in some other
> -  * database, but since we're only going to access the fields related to
> -  * physical storage, a fake one is good enough. If we didn't do this and
> -  * used the smgr layer directly, we would have to worry about
> -  * invalidations.
> -  */
> - rel = CreateFakeRelcacheEntry(rlocator);
> - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> - FreeFakeRelcacheEntry(rel);
> + smgr = smgropen(rlocator, InvalidBackendId);
> + nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> + smgrclose(smgr);

Why are you opening and then closing again? Part of the motivation for the
question is that a local SMgrRelation variable may lead to it being used
further, opening up interrupt processing issues.


> + rlocator.locator = src_rlocator;
> + smgrcloserellocator(rlocator);
> +
> + rlocator.locator = dst_rlocator;
> + smgrcloserellocator(rlocator);

As mentioned above, it's not clear to me why this is now done...

Otherwise looks good to me.

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane

Justin Pryzby  writes:
> On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:
>> I'm inclined to hold off on committing this until next week, not only

> +1

+1 ... there are some other v15 open items that I don't think we'll
see fixed for beta3, either.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Justin Pryzby

On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:
> I'm inclined to hold off on committing this until next week, not only

+1

I don't see any reason to hurry to fix problems that occur when DROP DATABASE
is interrupted.

Sorry to beat up your patches so much and for such crappy test cases^C

-- 
Justin

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas

On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar  wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

I'm inclined to hold off on committing this until next week, not only
for that reason, but also because there's a wrap planned on Monday,
and committing anything now seems like it might have too much of a
risk of upsetting that plan.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar

On Thu, Aug 4, 2022 at 9:41 AM Dilip Kumar  wrote:
>
> On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby  wrote:
> >
> > On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > > Hm. This looks more like an issue of DROP DATABASE not being 
> > > interruptible. I
> > > suspect this isn't actually related to STRATEGY wal_log and could likely 
> > > be
> > > reproduced in older versions too.
> >
> > I couldn't reproduce it with file_copy, but my recipe isn't exactly 
> > reliable.
> > That may just mean that it's easier to hit now.
>
> I think this looks like a problem with drop db but IMHO you are seeing
> this behavior only when a database is created using WAL LOG because in
> this strategy we are using buffers to write the destination database
> pages and some of the dirty buffers and sync requests might still be
> pending.  And now when we try to drop the database it drops all the
> dirty buffers and all pending sync requests and then before it
> actually removes the directory it gets interrupted and now you see the
> database directory on disk which is partially corrupted.  See below
> sequence of drop database
>
>
> dropdb()
> {
> ...
> DropDatabaseBuffers(db_id);
> ...
> ForgetDatabaseSyncRequests(db_id);
> ...
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>  -- Inside this it can process the cancel query and get interrupted
> remove_dbtablespaces(db_id);
> ..
> }
>
> I reproduced the same error by inducing error just before
> WaitForProcSignalBarrier.
>
> postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
> CREATE DATABASE
> ERROR:  XX000: test error
> LOCATION:  dropdb, dbcommands.c:1684
> postgres[14968]=# \c a
> connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
> could not open critical system index 2662
> Previous connection kept
> postgres[14968]=#

So basically, from this we can say it is completely a problem with
drop databases, I mean I can produce any behavior by interrupting drop
database
1. If we created some tables/inserted data and the drop database got
cancelled, it might have a database directory and those objects are
lost.
2.  Or you can even drop the database directory and then get cancelled
before deleting the pg_database entry then also you will end up with a
corrupted database, doesn't matter whether you created it with WAL LOG
or FILE COPY.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Dilip Kumar

On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby  wrote:
>
> On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > Hm. This looks more like an issue of DROP DATABASE not being interruptible. 
> > I
> > suspect this isn't actually related to STRATEGY wal_log and could likely be
> > reproduced in older versions too.
>
> I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
> That may just mean that it's easier to hit now.

I think this looks like a problem with drop db but IMHO you are seeing
this behavior only when a database is created using WAL LOG because in
this strategy we are using buffers to write the destination database
pages and some of the dirty buffers and sync requests might still be
pending.  And now when we try to drop the database it drops all the
dirty buffers and all pending sync requests and then before it
actually removes the directory it gets interrupted and now you see the
database directory on disk which is partially corrupted.  See below
sequence of drop database

dropdb()
{
...
DropDatabaseBuffers(db_id);
...
ForgetDatabaseSyncRequests(db_id);
...
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
 -- Inside this it can process the cancel query and get interrupted
remove_dbtablespaces(db_id);
..
}

I reproduced the same error by inducing error just before
WaitForProcSignalBarrier.

postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
CREATE DATABASE
ERROR:  XX000: test error
LOCATION:  dropdb, dbcommands.c:1684
postgres[14968]=# \c a
connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
could not open critical system index 2662
Previous connection kept
postgres[14968]=#

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Justin Pryzby

On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
> suspect this isn't actually related to STRATEGY wal_log and could likely be
> reproduced in older versions too.

I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
That may just mean that it's easier to hit now.

-- 
Justin

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Andres Freund

Hi,

On 2022-08-03 12:01:18 -0500, Justin Pryzby wrote:
> Now, I've reproduced the problem under valgrind, but it doesn't show anything
> useful

Yea, that looks like an issue on a different level.

> 
> pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c 
> "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
> ERROR:  database "a" does not exist
> CREATE DATABASE
> ^CCancel request sent
> ERROR:  canceling statement due to user request
> ERROR:  database "a" already exists
> ^C

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
suspect this isn't actually related to STRATEGY wal_log and could likely be
reproduced in older versions too.

It's pretty obvious that dropdb() isn't safe against being interrupted. We
delete the data before we have committed the deletion of the pg_database
entry.

Seems like we should hold interrupts across the remove_dbtablespaces() until
*after* we've committed the transaction?

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Justin Pryzby

On Wed, Aug 03, 2022 at 11:02:00AM -0500, Justin Pryzby wrote:
> But I reproduced the first problem with a handful of tries interrupting the
> while loop:
> 
> 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC:  could not 
> open critical system index 2662
...
> So far, I haven't succeeded in eliciting anything useful from valgrind.

Now, I've reproduced the problem under valgrind, but it doesn't show anything
useful:

pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c 
"CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
ERROR:  database "a" does not exist
CREATE DATABASE
^CCancel request sent
ERROR:  canceling statement due to user request
ERROR:  database "a" already exists
^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ psql -h /tmp a -c ''
2022-08-03 11:57:39.178 CDT client backend[31321] [unknown] PANIC:  could not 
open critical system index 2662
psql: error: falló la conexión al servidor en el socket «/tmp/.s.PGSQL.5432»: 
PANIC:  could not open critical system index 2662


On the server process, nothing interesting but the backtrace (the error was
before this, while writing the relation file, but there's nothing suspicious).

2022-08-03 11:08:06.628 CDT client backend[2841] [unknown] PANIC:  could not 
open critical system index 2662
==2841==
==2841== Process terminating with default action of signal 6 (SIGABRT)
==2841==at 0x5FBBE97: raise (raise.c:51)
==2841==by 0x5FBD800: abort (abort.c:79)
==2841==by 0x2118DEF: errfinish (elog.c:675)
==2841==by 0x20F6002: load_critical_index (relcache.c:4328)
==2841==by 0x20F727A: RelationCacheInitializePhase3 (relcache.c:4103)
==2841==by 0x213DFA5: InitPostgres (postinit.c:1087)
==2841==by 0x1D20D72: PostgresMain (postgres.c:4081)
==2841==by 0x1B15CFD: BackendRun (postmaster.c:4490)
==2841==by 0x1B1D6E1: BackendStartup (postmaster.c:4218)
==2841==by 0x1B1DD59: ServerLoop (postmaster.c:1808)
==2841==by 0x1B1F86D: PostmasterMain (postmaster.c:1480)
==2841==by 0x17B7150: main (main.c:197)

Below, I reproduced it without valgrind (and without LANG):

pryzbyj@pryzbyj:~/src/postgres$ while :; do psql -qh /tmp template1 -c "DROP 
DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
2022-08-03 11:59:52.675 CDT checkpointer[1881] LOG:  checkpoint starting: 
immediate force wait
2022-08-03 11:59:52.862 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 
4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, 
sync=0.038 s, total=0.188 s; sync files=3, longest=0.019 s, average=0.013 s; 
distance=3 kB, estimate=3 kB; lsn=0/24862508, redo lsn=0/248624D0
2022-08-03 11:59:53.213 CDT checkpointer[1881] LOG:  checkpoint starting: 
immediate force wait
2022-08-03 11:59:53.409 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 
4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.030 s, 
sync=0.054 s, total=0.196 s; sync files=4, longest=0.029 s, average=0.014 s; 
distance=4042 kB, estimate=4042 kB; lsn=0/24C54D88, redo lsn=0/24C54D50
^CCancel request sent
2022-08-03 11:59:53.750 CDT checkpointer[1881] LOG:  checkpoint starting: 
immediate force wait
2022-08-03 11:59:53.930 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 
4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.029 s, 
sync=0.027 s, total=0.181 s; sync files=4, longest=0.022 s, average=0.007 s; 
distance=4042 kB, estimate=4042 kB; lsn=0/250476D0, redo lsn=0/25047698
2022-08-03 11:59:54.270 CDT checkpointer[1881] LOG:  checkpoint starting: 
immediate force wait
^C2022-08-03 11:59:54.294 CDT client backend[1903] psql ERROR:  canceling 
statement due to user request
2022-08-03 11:59:54.294 CDT client backend[1903] psql STATEMENT:  DROP DATABASE 
a
Cancel request sent
ERROR:  canceling statement due to user request
2022-08-03 11:59:54.296 CDT client backend[1903] psql ERROR:  database "a" 
already exists
2022-08-03 11:59:54.296 CDT client backend[1903] psql STATEMENT:  CREATE 
DATABASE a TEMPLATE postgres STRATEGY wal_log
ERROR:  database "a" already exists
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ 2022-08-03 11:59:54.427 CDT checkpointer[1881] 
LOG:  checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 
removed, 0 recycled; write=0.024 s, sync=0.036 s, total=0.158 s; sync files=4, 
longest=0.027 s, average=0.009 s; distance=4042 kB, estimate=4042 kB; 
lsn=0/2543A108, redo lsn=0/2543A0A8
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ psql -h /tmp a -c ''

   2022-08-03 11:59:56.617 CDT 
client backend[1914] [unknown] PANIC:  could not open critical system index 2662

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Dilip Kumar

On Wed, Aug 3, 2022 at 9:32 PM Justin Pryzby  wrote:
>
> On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> Thanks for providing a patch.
> This seems to fix the second problem with accessing freed memory.

Thanks for the confirmation.

> But I reproduced the first problem with a handful of tries interrupting the
> while loop:
>
> 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC:  could not 
> open critical system index 2662
>
> In the failure, when trying to connect to the new "a" DB, it does this:
>
> [pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11
> [pid 10700] read(11, 
> "\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337",
>  524) = 524
> [pid 10700] close(11)   = 0
> [pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 
> ENOENT (No such file or directory)
> [pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11
> [pid 10700] lseek(11, 0, SEEK_END)  = 106496
> [pid 10700] lseek(11, 0, SEEK_END)  = 106496
>
> And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class)
>
> So far, I haven't succeeded in eliciting anything useful from valgrind.

I tried multiple times but had no luck with reproducing this issue.
Do you have some logs to know that just before ^C what was the last
query executed and whether it got canceled or executed completely?
Because theoretically, if the create database failed anywhere in
between then it should at least clean the directory of that newly
created database but seems there are some corrupted data in that
directory so seems it is not symptoms of just the create database
failure but some combination of multiple things.  I will put more
thought into this tomorrow.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Justin Pryzby

On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

Thanks for providing a patch.
This seems to fix the second problem with accessing freed memory.

But I reproduced the first problem with a handful of tries interrupting the
while loop:

2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC:  could not 
open critical system index 2662

In the failure, when trying to connect to the new "a" DB, it does this:

[pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11
[pid 10700] read(11, 
"\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337",
 524) = 524
[pid 10700] close(11)   = 0
[pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 
ENOENT (No such file or directory)
[pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11
[pid 10700] lseek(11, 0, SEEK_END)  = 106496
[pid 10700] lseek(11, 0, SEEK_END)  = 106496

And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class)

So far, I haven't succeeded in eliciting anything useful from valgrind.

-- 
Justin

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Dilip Kumar

On Wed, 3 Aug 2022 at 9:22 PM, Robert Haas  wrote:

> On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar  wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> Hmm, but didn't we decide against doing it that way intentionally? The
> comment you're deleting says "If we didn't do this and used the smgr
> layer directly, we would have to worry about invalidations."


I think we only need to worry if we keep the smgr reference around and try
to reuse it.  But in this patch I am not keeping the reference to the smgr.

—
Dilip

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Robert Haas

On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar  wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

Hmm, but didn't we decide against doing it that way intentionally? The
comment you're deleting says "If we didn't do this and used the smgr
layer directly, we would have to worry about invalidations."

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Dilip Kumar

On Wed, Aug 3, 2022 at 1:41 PM Dilip Kumar  wrote:
>
> On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar  wrote:
> >
>
> > Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are
> > owned by a fake relcache and whose lifetime is just portal memory
> > context which will go away at the transaction end.  So as Andres
> > suggested options could be that a) we catch the error and do
> > FreeFakeRelcacheEntry.  b) directly use smgropen instead of
> > RelationGetSmgr because actually, we do not want the owner to be set
> > for these smgrs.
> >
> > I think option b) looks better to me, I will prepare a patch and test
> > whether the error goes away with that or not.
> >
>
> Here is the patch which directly uses smgropen instead of using fake
> relcache entry.  We don't preserve the smgr pointer and whenever
> required we again call the smgropen.
>
> With this patch it resolved the problem for me at least what I was
> able to reproduce.  I was able to reproduce the WARNING messages that
> Robert got as well as the valgrind error that Justin got and with this
> patch both are resolved.

Another version of the patch which closes the smgr at the end using
smgrcloserellocator() and I have also added a commit message.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From b151b54880dd17c94a25e8de908e30fe0d9a8542 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 3 Aug 2022 13:28:46 +0530
Subject: [PATCH v1] Avoid setting the fake relcache entry as smgr owner
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During CREATE DATABASE, we are not connected to the source and the
destination DB so in order to operate on the storage we are using
FakeRelCacheEntry and by using that we are calling RelationGetSmgr().

So the problem is that this function will set the temporary
FakeRelCacheEntry as an owner of the smgr.  Now if there is any
error before we close the FakeRelCacheEntry then the memory of the
fake relcache entry will be released at the transaction abort but
the smgr will survive the transaction.  So now smgr is pointing
to some already release memory and it will have random behavior
when we try to access the smgr next time.

For fixing the issue instead of using the FakeRelCacheEntry, directly
call the smgropen() but do not keep the reference to the smgr.
So every time call smgropen() whenever we need it.  This is required to
ensure that we do not access the smgr pointer which might have already
been closed during interrupt processing.
---
 src/backend/commands/dbcommands.c   | 15 +++
 src/backend/storage/buffer/bufmgr.c | 51 +
 2 files changed, 28 insertions(+), 38 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 7bc53f3..9342e8e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
+	SMgrRelation	smgr;
 	BufferAccessStrategy bstrategy;
 
 	/* Get pg_class relfilenumber. */
@@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	smgr = smgropen(rlocator, InvalidBackendId);
+	nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+	smgrclose(smgr);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b30138..8a7ccf5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -487,9 +487,9 @@ static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 	   ForkNumber forkNum,
 	   BlockNumber nForkBlock,
 	   BlockNumber firstDelBlock);
-static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
-		   ForkNumber forkNum,
-		   bool isunlogged);
+static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+		   RelFileLocator dstlocator,
+		   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
@@ -3701,8 +3701,9 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
  *

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Dilip Kumar

On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar  wrote:
>

> Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are
> owned by a fake relcache and whose lifetime is just portal memory
> context which will go away at the transaction end.  So as Andres
> suggested options could be that a) we catch the error and do
> FreeFakeRelcacheEntry.  b) directly use smgropen instead of
> RelationGetSmgr because actually, we do not want the owner to be set
> for these smgrs.
>
> I think option b) looks better to me, I will prepare a patch and test
> whether the error goes away with that or not.
>

Here is the patch which directly uses smgropen instead of using fake
relcache entry.  We don't preserve the smgr pointer and whenever
required we again call the smgropen.

With this patch it resolved the problem for me at least what I was
able to reproduce.  I was able to reproduce the WARNING messages that
Robert got as well as the valgrind error that Justin got and with this
patch both are resolved.

@Justin can you help in verifying the original issue?

Another alternative could be that continue using fake relcache entry
but instead of RelationGetSmgr() create some new function which
doesn't set the owner in the smgr.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From aa1f6ff66f1c4bc41ffbc066a5543e7030a4501f Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 3 Aug 2022 13:28:46 +0530
Subject: [PATCH] BugfixInWalLogCreateDB

---
 src/backend/commands/dbcommands.c   | 12 +--
 src/backend/storage/buffer/bufmgr.c | 43 +
 2 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 7bc53f3..0423831 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -258,7 +258,6 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	Page		page;
 	List	   *rlocatorlist = NIL;
 	LockRelId	relid;
-	Relation	rel;
 	Snapshot	snapshot;
 	BufferAccessStrategy bstrategy;
 
@@ -276,16 +275,7 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
 	rlocator.dbOid = dbid;
 	rlocator.relNumber = relfilenumber;
 
-	/*
-	 * We can't use a real relcache entry for a relation in some other
-	 * database, but since we're only going to access the fields related to
-	 * physical storage, a fake one is good enough. If we didn't do this and
-	 * used the smgr layer directly, we would have to worry about
-	 * invalidations.
-	 */
-	rel = CreateFakeRelcacheEntry(rlocator);
-	nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
-	FreeFakeRelcacheEntry(rel);
+	nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM);
 
 	/* Use a buffer access strategy since this is a bulk read operation. */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b30138..5f6e242 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -487,9 +487,9 @@ static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 	   ForkNumber forkNum,
 	   BlockNumber nForkBlock,
 	   BlockNumber firstDelBlock);
-static void RelationCopyStorageUsingBuffer(Relation src, Relation dst,
-		   ForkNumber forkNum,
-		   bool isunlogged);
+static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+		   RelFileLocator dstlocator,
+		   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
@@ -3701,8 +3701,9 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
  * 
  */
 static void
-RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
-			   bool permanent)
+RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
+			   RelFileLocator dstlocator,
+			   ForkNumber forkNum, bool permanent)
 {
 	Buffer		srcBuf;
 	Buffer		dstBuf;
@@ -3722,7 +3723,8 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 	use_wal = XLogIsNeeded() && (permanent || forkNum == INIT_FORKNUM);
 
 	/* Get number of blocks in the source relation. */
-	nblocks = smgrnblocks(RelationGetSmgr(src), forkNum);
+	nblocks = smgrnblocks(smgropen(srclocator, InvalidBackendId),
+		  forkNum);
 
 	/* Nothing to copy; just return. */
 	if (nblocks == 0)
@@ -3738,7 +3740,7 @@ RelationCopyStorageUsingBuffer(Relation src, Relation dst, ForkNumber forkNum,
 		CHECK_FOR_INTERRUPTS();
 
 		/* Read block from source relation. */
-		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
+		srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
 		   RBM_NORMAL, bstrategy_src,
 		   permanent);
 		srcPage =

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-03 Thread Dilip Kumar

On Wed, Aug 3, 2022 at 11:28 AM Dilip Kumar  wrote:
>
> On Wed, Aug 3, 2022 at 3:53 AM Andres Freund  wrote:
> >
> > On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> > > I got this interesting looking thing.
> > >
> > > ==11628== Invalid write of size 8
> > > ==11628==at 0x1D12B3A: smgrsetowner (smgr.c:213)
> > > ==11628==by 0x1C7C224: RelationGetSmgr (rel.h:572)
> > > ==11628==by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> > > ==11628==by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> > > ==11628==by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> > > ==11628==by 0x14AB009: createdb (dbcommands.c:1393)
> > > ==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > > ==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
> > > ==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > > ==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)
> > > ==11628==by 0x1D27A7C: PortalRun (pquery.c:791)
> > > ==11628==by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> > > ==11628==by 0x1D218BC: PostgresMain (postgres.c:4505)
> > > ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 
> > > free'd
> > > ==11628==at 0x4033A3F: free (in 
> > > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==11628==by 0x217D7C2: AllocSetReset (aset.c:608)
> > > ==11628==by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> > > ==11628==by 0x217DBD5: AllocSetDelete (aset.c:654)
> > > ==11628==by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> > > ==11628==by 0x21A109F: PortalDrop (portalmem.c:596)
> > > ==11628==by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> > > ==11628==by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> > > ==11628==by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> > > ==11628==by 0x1D2158C: PostgresMain (postgres.c:4232)
> > > ==11628==by 0x1B15DB5: BackendRun (postmaster.c:4490)
> > > ==11628==by 0x1B1D799: BackendStartup (postmaster.c:4218)
> > > ==11628==  Block was alloc'd at
> > > ==11628==at 0x40327F3: malloc (in 
> > > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==11628==by 0x217F0DC: AllocSetAlloc (aset.c:920)
> > > ==11628==by 0x219E4D2: palloc (mcxt.c:1082)
> > > ==11628==by 0x14A14BE: ScanSourceDatabasePgClassTuple 
> > > (dbcommands.c:444)
> > > ==11628==by 0x14A1CD8: ScanSourceDatabasePgClassPage 
> > > (dbcommands.c:384)
> > > ==11628==by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> > > ==11628==by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> > > ==11628==by 0x14AB009: createdb (dbcommands.c:1393)
> > > ==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > > ==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
> > > ==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > > ==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)
> >
> > Ick. That looks like somehow we end up with smgr entries still pointing to
> > fake relcache entries, created in a prior attempt at create database.
>
> The surprising thing is how the smgr entry survived the transaction
> abort, I mean AtEOXact_SMgr should have closed the smgr and should
> have removed from the
> smgr cache.
>

Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are
owned by a fake relcache and whose lifetime is just portal memory
context which will go away at the transaction end.  So as Andres
suggested options could be that a) we catch the error and do
FreeFakeRelcacheEntry.  b) directly use smgropen instead of
RelationGetSmgr because actually, we do not want the owner to be set
for these smgrs.

I think option b) looks better to me, I will prepare a patch and test
whether the error goes away with that or not.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-02 Thread Dilip Kumar

On Wed, Aug 3, 2022 at 3:53 AM Andres Freund  wrote:
>
> On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> > I got this interesting looking thing.
> >
> > ==11628== Invalid write of size 8
> > ==11628==at 0x1D12B3A: smgrsetowner (smgr.c:213)
> > ==11628==by 0x1C7C224: RelationGetSmgr (rel.h:572)
> > ==11628==by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> > ==11628==by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> > ==11628==by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> > ==11628==by 0x14AB009: createdb (dbcommands.c:1393)
> > ==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > ==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
> > ==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > ==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)
> > ==11628==by 0x1D27A7C: PortalRun (pquery.c:791)
> > ==11628==by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> > ==11628==by 0x1D218BC: PostgresMain (postgres.c:4505)
> > ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 
> > free'd
> > ==11628==at 0x4033A3F: free (in 
> > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==11628==by 0x217D7C2: AllocSetReset (aset.c:608)
> > ==11628==by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> > ==11628==by 0x217DBD5: AllocSetDelete (aset.c:654)
> > ==11628==by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> > ==11628==by 0x21A109F: PortalDrop (portalmem.c:596)
> > ==11628==by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> > ==11628==by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> > ==11628==by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> > ==11628==by 0x1D2158C: PostgresMain (postgres.c:4232)
> > ==11628==by 0x1B15DB5: BackendRun (postmaster.c:4490)
> > ==11628==by 0x1B1D799: BackendStartup (postmaster.c:4218)
> > ==11628==  Block was alloc'd at
> > ==11628==at 0x40327F3: malloc (in 
> > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==11628==by 0x217F0DC: AllocSetAlloc (aset.c:920)
> > ==11628==by 0x219E4D2: palloc (mcxt.c:1082)
> > ==11628==by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> > ==11628==by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> > ==11628==by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> > ==11628==by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> > ==11628==by 0x14AB009: createdb (dbcommands.c:1393)
> > ==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > ==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
> > ==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > ==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)
>
> Ick. That looks like somehow we end up with smgr entries still pointing to
> fake relcache entries, created in a prior attempt at create database.

The surprising thing is how the smgr entry survived the transaction
abort, I mean AtEOXact_SMgr should have closed the smgr and should
have removed from the
smgr cache.

> Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just
> smgrclearowner()) in case of error.
>
> Or perhaps we can instead prevent the fake relcache entry being set as the
> owner in the first place?
>
> Why do we even need fake relcache entries here?  Looks like all that they're
> used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the
> rnode to smgropen()? Given that we're doing that once for every buffer in the
> body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other
> less-frequent places can't be a problem.
> can't

I think in some of the previous versions of the patch we were using
smgropen() but changed it so that we do not reuse the smgr after it
gets removed during interrupt processing, see discussion here[1]

[1]
https://www.postgresql.org/message-id/CA%2BTgmoYKovODW2Y7rQmmRFaKu445p9uAahjpgfbY8eyeL07BXA%40mail.gmail.com

>From the Valgrind report, it is clear that we are getting the smgr
entry whose smgr->smgr_owner is pointing into the fake relcache entry.
So I am investigating further how it is possible for the smgr created
during a previous create database attempt to survive beyond abort
transaction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-02 Thread Andres Freund

On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> I got this interesting looking thing.
> 
> ==11628== Invalid write of size 8
> ==11628==at 0x1D12B3A: smgrsetowner (smgr.c:213)
> ==11628==by 0x1C7C224: RelationGetSmgr (rel.h:572)
> ==11628==by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> ==11628==by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> ==11628==by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> ==11628==by 0x14AB009: createdb (dbcommands.c:1393)
> ==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> ==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
> ==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> ==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)
> ==11628==by 0x1D27A7C: PortalRun (pquery.c:791)
> ==11628==by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> ==11628==by 0x1D218BC: PostgresMain (postgres.c:4505)
> ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 
> free'd
> ==11628==at 0x4033A3F: free (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==11628==by 0x217D7C2: AllocSetReset (aset.c:608)
> ==11628==by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> ==11628==by 0x217DBD5: AllocSetDelete (aset.c:654)
> ==11628==by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> ==11628==by 0x21A109F: PortalDrop (portalmem.c:596)
> ==11628==by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> ==11628==by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> ==11628==by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> ==11628==by 0x1D2158C: PostgresMain (postgres.c:4232)
> ==11628==by 0x1B15DB5: BackendRun (postmaster.c:4490)
> ==11628==by 0x1B1D799: BackendStartup (postmaster.c:4218)
> ==11628==  Block was alloc'd at
> ==11628==at 0x40327F3: malloc (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==11628==by 0x217F0DC: AllocSetAlloc (aset.c:920)
> ==11628==by 0x219E4D2: palloc (mcxt.c:1082)
> ==11628==by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> ==11628==by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> ==11628==by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> ==11628==by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> ==11628==by 0x14AB009: createdb (dbcommands.c:1393)
> ==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> ==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
> ==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> ==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)

Ick. That looks like somehow we end up with smgr entries still pointing to
fake relcache entries, created in a prior attempt at create database.

Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just
smgrclearowner()) in case of error.

Or perhaps we can instead prevent the fake relcache entry being set as the
owner in the first place?

Why do we even need fake relcache entries here?  Looks like all that they're
used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the
rnode to smgropen()? Given that we're doing that once for every buffer in the
body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other
less-frequent places can't be a problem.
can't 

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-02 Thread Justin Pryzby

On Tue, Aug 02, 2022 at 05:46:34PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > WARNING:  problem in alloc set PortalContext: req size > alloc size for 
> > chunk 0x7f99508911f0 in block 0x7f9950890800
> 
> This looks like nothing so much as the fallout from something scribbling
> past the end of an allocated palloc chunk, or perhaps writing on
> already-freed space.  Perhaps running the test case under valgrind
> would help to finger the culprit.

Yeah but my test case is so poor that it's a chore ...

(Sorry for that, but it took me 2 days to be able to reproduce the problem so I
sent it sooner rather than looking for a better way ... )

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628==at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628==by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628==by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628==by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628==by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628==by 0x14AB009: createdb (dbcommands.c:1393)
==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628==by 0x1D27A7C: PortalRun (pquery.c:791)
==11628==by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628==by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628==at 0x4033A3F: free (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628==by 0x217D7C2: AllocSetReset (aset.c:608)
==11628==by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628==by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628==by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628==by 0x21A109F: PortalDrop (portalmem.c:596)
==11628==by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628==by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628==by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628==by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628==by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628==by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628==  Block was alloc'd at
==11628==at 0x40327F3: malloc (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628==by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628==by 0x219E4D2: palloc (mcxt.c:1082)
==11628==by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628==by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628==by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628==by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628==by 0x14AB009: createdb (dbcommands.c:1393)
==11628==by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628==by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628==by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628==by 0x1D27089: PortalRunMulti (pquery.c:1315)

-- 
Justin

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-02 Thread Tom Lane

Robert Haas  writes:
> WARNING:  problem in alloc set PortalContext: req size > alloc size
> for chunk 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
> 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad single-chunk
> 0x7f9950891208 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: found inconsistent
> memory block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: req size > alloc size
> for chunk 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
> 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad single-chunk
> 0x7f9950891208 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: found inconsistent
> memory block 0x7f9950890800

This looks like nothing so much as the fallout from something scribbling
past the end of an allocated palloc chunk, or perhaps writing on
already-freed space.  Perhaps running the test case under valgrind
would help to finger the culprit.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-02 Thread Robert Haas

On Tue, Aug 2, 2022 at 1:50 PM Justin Pryzby  wrote:
> Unfortunately, that isn't very consistent, and you have have to run it a bunch
> of times...

I was eventually able to reproduce this in part by using the
interactive psql method you describe. It didn't crash, but it did spit
out a bunch of funny error messages:

postgres=# SET statement_timeout=0; DROP DATABASE a; SET
statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY
wal_log ; \c a \c postgres
SET
ERROR:  database "a" does not exist
SET
ERROR:  canceling statement due to statement timeout
WARNING:  problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:
database "a" does not exist
Previous connection kept
postgres=# select * from pg_database;
 oid |  datname  | datdba | encoding | datlocprovider | datistemplate
| datallowconn | datconnlimit | datfrozenxid | datminmxid |
dattablespace | datcollate  |  datctype   | daticulocale |
datcollversion |   datacl
-+---++--++---+--+--+--++---+-+-+--++
   5 | postgres  | 10 |6 | c  | f
| t|   -1 |  718 |  1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |  ||
   1 | template1 | 10 |6 | c  | t
| t|   -1 |  718 |  1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |  ||
{=c/rhaas,rhaas=CTc/rhaas}
   4 | template0 | 10 |6 | c  | t
| f|   -1 |  718 |  1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |  ||
{=c/rhaas,rhaas=CTc/rhaas}
(3 rows)

I then set backtrace_functions='AllocSetCheck' and reproduced it
again, which led to stack traces like this:

2022-08-02 16:50:32.490 EDT [98814] WARNING:  problem in alloc set
PortalContext: bad single-chunk 0x7f9950886608 in block 0x7f9950885c00
2022-08-02 16:50:32.490 EDT [98814] BACKTRACE:
2   postgres0x00010cd37ef5 AllocSetCheck + 549
3   postgres0x00010cd37730 AllocSetReset + 48
4   postgres0x00010cd3f6f1
MemoryContextResetOnly + 81
5   postgres0x00010cd378b9 AllocSetDelete + 73
6   postgres0x00010cd41e09 PortalDrop + 425
7   postgres0x00010cd427bb
AtCleanup_Portals + 203
8   postgres0x00010c86476d
CleanupTransaction + 29
9   postgres0x00010c865d4f
AbortCurrentTransaction + 63
10  postgres0x00010cba1395 PostgresMain + 885
11  postgres0x00010caf5472 PostmasterMain + 7586
12  postgres0x00010ca31e3d main + 2205
13  libdyld.dylib   0x7fff699afcc9 start + 1
14  ??? 0x0001 0x0 + 1

I recompiled with -O0 and hacked the code that emits the BACKTRACE:
bit to go into an infinite loop if it's hit, which enabled me to hook
up a debugger at the point of the failure. The debugger says:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00010e98a157
postgres`send_message_to_server_log(edata=0x00010ec0f658) at
elog.c:2916:4
frame #1: 0x00010e9866d6 postgres`EmitErrorReport at elog.c:1537:3
frame #2: 0x00010e986016 postgres`errfinish(filename="aset.c",
lineno=1470, funcname="AllocSetCheck") at elog.c:592:2
frame #3: 0x00010e9c8465
postgres`AllocSetCheck(context=0x7ff77c80d200) at aset.c:1469:5
frame #4: 0x00010e9c7c05
postgres`AllocSetDelete(context=0x7ff77c80d200) at aset.c:638:2
frame #5: 0x00010e9d368b
postgres`MemoryContextDelete(context=0x7ff77c80d200) at
mcxt.c:252:2
  * frame #6: 0x00010e9d705b
postgres`PortalDrop(portal=0x7ff77e028920, isTopCommit=false) at
portalmem.c:596:2
frame #7:

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-02 Thread Justin Pryzby

On Tue, Mar 29, 2022 at 11:55:05AM -0400, Robert Haas wrote:
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas  wrote:
> > smgrcreate() as we would for most WAL records or whether it should be
> > adopting the new system introduced by
> > 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> > over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=icfm0suyqbbqzea...@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
> 
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...

There's a crash

2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC:  could not 
open critical system index 2662

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7efe27999801 in __GI_abort () at abort.c:79
#2  0x5583891941dc in errfinish (filename=, 
filename@entry=0x558389420437 "relcache.c", lineno=lineno@entry=4328,
funcname=funcname@entry=0x558389421680 <__func__.33178> 
"load_critical_index") at elog.c:675
#3  0x5583891713ef in load_critical_index (indexoid=indexoid@entry=2662, 
heapoid=heapoid@entry=1259) at relcache.c:4328
#4  0x558389172667 in RelationCacheInitializePhase3 () at relcache.c:4103
#5  0x5583891b93a4 in InitPostgres 
(in_dbname=in_dbname@entry=0x55838a50d468 "a", dboid=dboid@entry=0, 
username=username@entry=0x55838a50d448 "pryzbyj", useroid=useroid@entry=0,
load_session_libraries=, 
override_allow_connections=override_allow_connections@entry=false, 
out_dbname=0x0) at postinit.c:1087
#6  0x558388daa7bb in PostgresMain (dbname=0x55838a50d468 "a", 
username=username@entry=0x55838a50d448 "pryzbyj") at postgres.c:4081
#7  0x558388b9f423 in BackendRun (port=port@entry=0x55838a505dd0) at 
postmaster.c:4490
#8  0x558388ba6e07 in BackendStartup (port=port@entry=0x55838a505dd0) at 
postmaster.c:4218
#9  0x558388ba747f in ServerLoop () at postmaster.c:1808
#10 0x558388ba8f93 in PostmasterMain (argc=7, argv=) at 
postmaster.c:1480
#11 0x558388840e1f in main (argc=7, argv=0x55838a4dc000) at main.c:197

while :; do psql -qh /tmp postgres -c "DROP DATABASE a" -c "CREATE DATABASE a 
TEMPLATE postgres STRATEGY wal_log"; done
# Run this for a few loops and then ^C or hold down ^C until it stops,
# and then connect to postgres and try to connect to 'a':
postgres=# \c a
2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC:  could not 
open critical system index 2662

Unfortunately, that isn't very consistent, and you have have to run it a bunch
of times...

I don't know if it's an issue of any significance that CREATE DATABASE / ^C
leaves behind a broken database, but it is an issue that the cluster crashes.

While struggling to reproduce that problem, I also hit this warning, which may
or may not be the same.  I added an abort() after WARNING in aset.c to get a
backtrace.

WARNING:  problem in alloc set PortalContext: bogus aset link in block 
0x55a63f2f9d60, chunk 0x55a63f2fb138

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51  ../sysdeps/unix/sysv/linux/raise.c: No existe el archivo o el 
directorio.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f81144f1801 in __GI_abort () at abort.c:79
#2  0x55a63c834c5d in AllocSetCheck (context=context@entry=0x55a63f26fea0) 
at aset.c:1491
#3  0x55a63c835b09 in AllocSetDelete (context=0x55a63f26fea0) at aset.c:638
#4  0x55a63c854322 in MemoryContextDelete (context=0x55a63f26fea0) at 
mcxt.c:252
#5  0x55a63c8591d5 in PortalDrop (portal=portal@entry=0x55a63f2bb7a0, 
isTopCommit=isTopCommit@entry=false) at portalmem.c:596
#6  0x55a63c3e4a7b in exec_simple_query 
(query_string=query_string@entry=0x55a63f24db90 "CREATE DATABASE a TEMPLATE 
postgres STRATEGY wal_log ;") at postgres.c:1253
#7  0x55a63c3e7fc1 in PostgresMain (dbname=, 
username=username@entry=0x55a63f279448 "pryzbyj") at postgres.c:4505
#8  0x55a63c1dc423 in BackendRun (port=port@entry=0x55a63f271dd0) at 
postmaster.c:4490
#9  0x55a63c1e3e07 in BackendStartup (port=port@entry=0x55a63f271dd0) at 
postmaster.c:4218
#10 0x55a63c1e447f in ServerLoop () at postmaster.c:1808
#11 0x55a63c1e5f93 in PostmasterMain (argc=7, argv=) at 
postmaster.c:1480
#12 0x55a63be7de1f in main (argc=7, argv=0x55a63f248000) at main.c:197

I reproduced that by running this a couple dozen times in an interactive psql.
It doesn't seem to affect STRATEGY=file_copy.

SET statement_timeout=0; DROP DATABASE a; SET statement_timeout='60ms'; CREATE 
DATABASE a TEMPLATE postgres STRATEGY wal_log ; \c a \c postgres

Also, if I understand correctly, this patch seems to assume that nobody is
connected to the source database.  But what's actually enforced is just that
nobody *else* is connected.  Is it any issue that the current DB can be used as
a source?  Anyway, both of the

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-04-04 Thread Dilip Kumar

On Sun, Apr 3, 2022 at 9:52 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > I committed v6 instead.
>
> Coverity complains that this patch added GetDatabasePath() calls without
> freeing its return value. Normally that'd be easy to dismiss, due to memory
> contexts, but there's no granular resets in CreateDatabaseUsingFileCopy(). And
> obviously there can be a lot of relations in one database - we shouldn't hold
> onto the same path over and over again.

> The case in recovery is worse, because there we don't have a memory context to
> reset afaics. Oddly enough, it sure looks like we have an existing version of
> this bug in the file-copy path?

Yeah, I see that the createdb() and dbase_redo() had this existing
problem and with this patch we have created a few more such
occurrences.
The attached patch fixes it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index df16533..ff81c48 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -218,6 +218,8 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid,
 	}
 
 	list_free_deep(rnodelist);
+	pfree(srcpath);
+	pfree(dstpath);
 }
 
 /*
@@ -628,6 +630,9 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
 			(void) XLogInsert(RM_DBASE_ID,
 			  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
 		}
+
+		pfree(srcpath);
+		pfree(dstpath);
 	}
 	table_endscan(scan);
 	table_close(rel, AccessShareLock);
@@ -3051,6 +3056,8 @@ dbase_redo(XLogReaderState *record)
 		 * We don't need to copy subdirectories
 		 */
 		copydir(src_path, dst_path, false);
+		pfree(src_path);
+		pfree(dst_path);
 	}
 	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
 	{
@@ -3063,6 +3070,7 @@ dbase_redo(XLogReaderState *record)
 		/* Create the database directory with the version file. */
 		CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id,
 true);
+		pfree(dbpath);
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-04-03 Thread Andres Freund

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

Coverity complains that this patch added GetDatabasePath() calls without
freeing its return value. Normally that'd be easy to dismiss, due to memory
contexts, but there's no granular resets in CreateDatabaseUsingFileCopy(). And
obviously there can be a lot of relations in one database - we shouldn't hold
onto the same path over and over again.

The case in recovery is worse, because there we don't have a memory context to
reset afaics. Oddly enough, it sure looks like we have an existing version of
this bug in the file-copy path?

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-04-01 Thread Dilip Kumar

On Thu, Mar 31, 2022 at 9:52 PM Andres Freund  wrote:

> > + "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> > +ok($result == 0, 'move database to tablespace 2');
>
> This just tests the command doesn't fail, but not whether it actually did
> something useful. Seem we should at least insert a row or two into the the
> table, and verify they can be accessed?

Now, added some tuples and verified them.


> >  # Check that only READ-only queries can run on standbys
> >  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
> >   3, 'read-only queries on standby 1');
>
> I'd probably add a function for creating database / table and then testing it,
> with a strategy parameter. That way we can afterwards add more tests verifying
> that everything worked.

I have created a function to create a database and table and verify
the content in it.  Another option is we can just keep the database
and table creation inside the function and the verification part
outside it so that if some future test case wants to create some extra
content and verify it then they can do so.   But with the current
tests in mind the way I got it in the attached patch has less
duplicate code so I preferred it this way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From a672a27a9e502331a7c4ca2a16f3f660c8ed3fcd Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 1 Apr 2022 13:15:57 +0530
Subject: [PATCH v2] Create database test coverage

Test create database strategy wal replay and alter database
set tablespace.
---
 src/test/modules/test_misc/t/002_tablespace.pl | 14 ++
 src/test/recovery/t/001_stream_rep.pl  | 25 +
 2 files changed, 39 insertions(+)

diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
index 04e5439..6ac6f79 100644
--- a/src/test/modules/test_misc/t/002_tablespace.pl
+++ b/src/test/modules/test_misc/t/002_tablespace.pl
@@ -83,7 +83,21 @@ $result = $node->psql('postgres',
 	"ALTER TABLE t SET tablespace regress_ts1");
 ok($result == 0, 'move table in-place->abs');
 
+# Test ALTER DATABASE SET TABLESPACE
+$result = $node->psql('postgres',
+	"CREATE DATABASE testdb TABLESPACE regress_ts1");
+ok($result == 0, 'create database in tablespace 1');
+$result = $node->psql('testdb',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+ok($result == 0, 'create table in testdb database');
+$result = $node->psql('postgres',
+	"ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');
+$result = $node->safe_psql('testdb', "SELECT count(*) FROM tab_int");
+is($result, qq(10), 'check contents after moving to a different tablespace');
+
 # Drop everything
+$result = $node->psql('postgres', "DROP DATABASE testdb");
 $result = $node->psql('postgres',
 	"DROP TABLE t");
 ok($result == 0, 'create table in tablespace 1');
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 583ee87..cf4041a 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -78,6 +78,31 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Create database with some contents for a given strategy and check on standby
+sub test_createdb_strategy_and_check
+{
+	my $node1   = shift;
+	my $node2   = shift;
+	my $dbname  = shift;
+	my $strategy= shift;
+
+	$node1->safe_psql('postgres',
+		"CREATE DATABASE $dbname STRATEGY = $strategy; ");
+	$node1->safe_psql($dbname,
+		"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+	my $lsn = $node1->lsn('write');
+	$node1->wait_for_catchup($node2, 'replay', $lsn);
+
+	my $result = $node2->safe_psql($dbname, "SELECT count(*) FROM tab_int");
+	is($result, qq(10), 'check streamed content on standby');
+}
+# Test replication of file_copy strategy
+test_createdb_strategy_and_check($node_primary, $node_standby_1, "testdb1",
+	"file_copy");
+# Test replication of wal_log strategy
+test_createdb_strategy_and_check($node_primary, $node_standby_1, "testdb2",
+	"wal_log");
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
1.8.3.1

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Robert Haas

On Thu, Mar 31, 2022 at 2:44 PM Andres Freund  wrote:
> On 2022-03-31 14:31:43 -0400, Robert Haas wrote:
> > On Thu, Mar 31, 2022 at 12:25 PM Andres Freund  wrote:
> > > > Andres, thoughts? Do you want me to polish and commit 0001?
> > >
> > > Yes please!
> >
> > Here is a polished version. Comments?
>
> LGTM.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Andres Freund

On 2022-03-31 14:31:43 -0400, Robert Haas wrote:
> On Thu, Mar 31, 2022 at 12:25 PM Andres Freund  wrote:
> > > Andres, thoughts? Do you want me to polish and commit 0001?
> >
> > Yes please!
> 
> Here is a polished version. Comments?

LGTM.

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Robert Haas

On Thu, Mar 31, 2022 at 12:25 PM Andres Freund  wrote:
> > Andres, thoughts? Do you want me to polish and commit 0001?
>
> Yes please!

Here is a polished version. Comments?

> FWIW, once the freeze is done I'm planning to set up scripting to see which
> parts of the code we whacked around don't have test coverage...

Sounds terrifying.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0001-initdb-When-running-CREATE-DATABASE-use-STRATEGY-WAL.patch
Description: Binary data

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Andres Freund

Hi,

On 2022-03-31 10:05:10 -0400, Robert Haas wrote:
> On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar  wrote:
> > 0001 is changing the strategy to file copy during initdb and 0002
> > patch adds the test cases for both these cases.
> 
> IMHO, 0001 looks fine, except for needing some adjustments to the wording.

Agreed.

> I'm less sure about 0002. It's testing the stuff Andres mentioned, but
> I'm not sure how good the tests are.

I came to a similar conclusion. It's still better than nothing, but it's just
a small bit of additional effort to do some basic testing that e.g. the move
actually worked...

> Andres, thoughts? Do you want me to polish and commit 0001?

Yes please!

FWIW, once the freeze is done I'm planning to set up scripting to see which
parts of the code we whacked around don't have test coverage...

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Andres Freund

Hi,

On 2022-03-31 13:22:24 +0530, Dilip Kumar wrote:
> 0001 is changing the strategy to file copy during initdb and 0002
> patch adds the test cases for both these cases.

Thanks!

> From 4a997e2a95074a520777cd2b369f9c728b360969 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar 
> Date: Thu, 31 Mar 2022 10:43:16 +0530
> Subject: [PATCH 1/2] Use file_copy strategy during initdb
> 
> Because skipping the checkpoint during initdb will not result
> in significant savings, so there is no point in using wal_log
> as that will simply increase the cluster size by generating
> extra wal.
> ---
>  src/bin/initdb/initdb.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> index 5e36943..1256082 100644
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
> @@ -1856,6 +1856,11 @@ make_template0(FILE *cmdfd)
>* it would fail. To avoid that, assign a fixed OID to template0 rather
>* than letting the server choose one.
>*
> +  * Using file_copy strategy is preferable over wal_log here because
> +  * skipping the checkpoint during initdb will not result in significant
> +  * savings, so there is no point in using wal_log as that will simply
> +  * increase the cluster size by generating extra wal.

It's not just the increase in size, it's also the increase in time due to WAL 
logging.


>* (Note that, while the user could have dropped and recreated these
>* objects in the old cluster, the problem scenario only exists if the 
> OID
>* that is in use in the old cluster is also used in the new cluster - 
> and
> @@ -1863,7 +1868,7 @@ make_template0(FILE *cmdfd)
>*/
>   static const char *const template0_setup[] = {
>   "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS 
> = false OID = "
> - CppAsString2(Template0ObjectId) ";\n\n",
> + CppAsString2(Template0ObjectId) " STRATEGY = file_copy;\n\n",

I'd perhaps break this into a separate line, but...


> From d0759bcfc4fed674e938e4a03159f5953ca9718d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar 
> Date: Thu, 31 Mar 2022 12:07:19 +0530
> Subject: [PATCH 2/2] Create database test coverage
> 
> Test create database strategy wal replay and alter database
> set tablespace.
> ---
>  src/test/modules/test_misc/t/002_tablespace.pl | 12 
>  src/test/recovery/t/001_stream_rep.pl  | 24 
>  2 files changed, 36 insertions(+)
> 
> diff --git a/src/test/modules/test_misc/t/002_tablespace.pl 
> b/src/test/modules/test_misc/t/002_tablespace.pl
> index 04e5439..f3bbddc 100644
> --- a/src/test/modules/test_misc/t/002_tablespace.pl
> +++ b/src/test/modules/test_misc/t/002_tablespace.pl
> @@ -83,7 +83,19 @@ $result = $node->psql('postgres',
>   "ALTER TABLE t SET tablespace regress_ts1");
>  ok($result == 0, 'move table in-place->abs');
>  
> +# Test ALTER DATABASE SET TABLESPACE
> +$result = $node->psql('postgres',
> + "CREATE DATABASE testdb TABLESPACE regress_ts1");
> +ok($result == 0, 'create database in tablespace 1');
> +$result = $node->psql('testdb',
> + "CREATE TABLE t ()");
> +ok($result == 0, 'create table in testdb database');
> +$result = $node->psql('postgres',
> + "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> +ok($result == 0, 'move database to tablespace 2');

This just tests the command doesn't fail, but not whether it actually did
something useful. Seem we should at least insert a row or two into the the
table, and verify they can be accessed?


> +# Create database with different strategies and check its presence in standby
> +$node_primary->safe_psql('postgres',
> + "CREATE DATABASE testdb1 STRATEGY = FILE_COPY; ");
> +$node_primary->safe_psql('testdb1',
> + "CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
> +$node_primary->safe_psql('postgres',
> + "CREATE DATABASE testdb2 STRATEGY = WAL_LOG; ");
> +$node_primary->safe_psql('testdb2',
> + "CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
> +
> +# Wait for standbys to catch up
> +$primary_lsn = $node_primary->lsn('write');
> +$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
> +
> +$result =
> +  $node_standby_1->safe_psql('testdb1', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(10), 'check streamed content on standby 1');
> +
> +$result =
> +  $node_standby_1->safe_psql('testdb2', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(10), 'check streamed content on standby 1');
> +
>  # Check that only READ-only queries can run on standbys
>  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
>   3, 'read-only queries on standby 1');

I'd probably add a function for creating database / table and then testing it,
with a strategy parameter. That way we can afterwards add more tests

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Robert Haas

On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar  wrote:
> 0001 is changing the strategy to file copy during initdb and 0002
> patch adds the test cases for both these cases.

IMHO, 0001 looks fine, except for needing some adjustments to the wording.

I'm less sure about 0002. It's testing the stuff Andres mentioned, but
I'm not sure how good the tests are.

Andres, thoughts? Do you want me to polish and commit 0001?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-31 Thread Dilip Kumar

On Thu, Mar 31, 2022 at 9:46 AM Dilip Kumar  wrote:
>
> On Thu, Mar 31, 2022 at 5:07 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > > I committed v6 instead.
> >
> > I was just discussing the WAL prefetching patch with Thomas. A question in
> > that discussion made me look at the coverage of REDO for CREATE DATABASE:
> > https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html
> >
> > Seems there's currently nothing hitting the REDO for
> > XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
> > keep coverage for that. How about adding a
> >   CREATE DATABASE ... STRATEGY file_copy
> > to 001_stream_rep.pl?
> >
> >
> > Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the 
> > same
> > time, this patch did affect that path in some minor ways. And, somewhat
> > shockingly, we don't have a single test for it.
>
> I will add tests for both of these cases and send the patch.


0001 is changing the strategy to file copy during initdb and 0002
patch adds the test cases for both these cases.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From d0759bcfc4fed674e938e4a03159f5953ca9718d Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Thu, 31 Mar 2022 12:07:19 +0530
Subject: [PATCH 2/2] Create database test coverage

Test create database strategy wal replay and alter database
set tablespace.
---
 src/test/modules/test_misc/t/002_tablespace.pl | 12 
 src/test/recovery/t/001_stream_rep.pl  | 24 
 2 files changed, 36 insertions(+)

diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
index 04e5439..f3bbddc 100644
--- a/src/test/modules/test_misc/t/002_tablespace.pl
+++ b/src/test/modules/test_misc/t/002_tablespace.pl
@@ -83,7 +83,19 @@ $result = $node->psql('postgres',
 	"ALTER TABLE t SET tablespace regress_ts1");
 ok($result == 0, 'move table in-place->abs');
 
+# Test ALTER DATABASE SET TABLESPACE
+$result = $node->psql('postgres',
+	"CREATE DATABASE testdb TABLESPACE regress_ts1");
+ok($result == 0, 'create database in tablespace 1');
+$result = $node->psql('testdb',
+	"CREATE TABLE t ()");
+ok($result == 0, 'create table in testdb database');
+$result = $node->psql('postgres',
+	"ALTER DATABASE testdb SET TABLESPACE regress_ts2");
+ok($result == 0, 'move database to tablespace 2');
+
 # Drop everything
+$result = $node->psql('postgres', "DROP DATABASE testdb");
 $result = $node->psql('postgres',
 	"DROP TABLE t");
 ok($result == 0, 'create table in tablespace 1');
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 583ee87..3f1dd59 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -78,6 +78,30 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Create database with different strategies and check its presence in standby
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE testdb1 STRATEGY = FILE_COPY; ");
+$node_primary->safe_psql('testdb1',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+$node_primary->safe_psql('postgres',
+	"CREATE DATABASE testdb2 STRATEGY = WAL_LOG; ");
+$node_primary->safe_psql('testdb2',
+	"CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
+
+# Wait for standbys to catch up
+$primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+$result =
+  $node_standby_1->safe_psql('testdb1', "SELECT count(*) FROM tab_int");
+print "standby 1: $result\n";
+is($result, qq(10), 'check streamed content on standby 1');
+
+$result =
+  $node_standby_1->safe_psql('testdb2', "SELECT count(*) FROM tab_int");
+print "standby 1: $result\n";
+is($result, qq(10), 'check streamed content on standby 1');
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
1.8.3.1

From 4a997e2a95074a520777cd2b369f9c728b360969 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Thu, 31 Mar 2022 10:43:16 +0530
Subject: [PATCH 1/2] Use file_copy strategy during initdb

Because skipping the checkpoint during initdb will not result
in significant savings, so there is no point in using wal_log
as that will simply increase the cluster size by generating
extra wal.
---
 src/bin/initdb/initdb.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e36943..1256082 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1856,6 +1856,11 @@ make_template0(FILE *cmdfd)
 	 * it would fail. To avoid that, assign a fixed OID to template0 rather
 	 * than letting the server choose

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-30 Thread Dilip Kumar

On Thu, Mar 31, 2022 at 5:07 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > I committed v6 instead.
>
> I was just discussing the WAL prefetching patch with Thomas. A question in
> that discussion made me look at the coverage of REDO for CREATE DATABASE:
> https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html
>
> Seems there's currently nothing hitting the REDO for
> XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
> keep coverage for that. How about adding a
>   CREATE DATABASE ... STRATEGY file_copy
> to 001_stream_rep.pl?
>
>
> Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
> time, this patch did affect that path in some minor ways. And, somewhat
> shockingly, we don't have a single test for it.

I will add tests for both of these cases and send the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-30 Thread Andres Freund

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

I was just discussing the WAL prefetching patch with Thomas. A question in
that discussion made me look at the coverage of REDO for CREATE DATABASE:
https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html

Seems there's currently nothing hitting the REDO for
XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
keep coverage for that. How about adding a
  CREATE DATABASE ... STRATEGY file_copy
to 001_stream_rep.pl?


Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
time, this patch did affect that path in some minor ways. And, somewhat
shockingly, we don't have a single test for it.

- Andres

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-30 Thread Andres Freund

Hi,

On 2022-03-30 09:28:58 +0530, Dilip Kumar wrote:
> On Wed, Mar 30, 2022 at 6:47 AM Andres Freund  wrote:
> >
> >
> > du -s /tmp/initdb/
> > WAL_LOG: 35112
> > FILE_COPY: 29288
> >
> > So it seems we should specify a strategy in initdb? It kind of makes sense -
> > we're not going to read anything from those database. And because of the
> > ringbuffer of 256kB, we'll not even reduce IO meaningfully.
> 
> I think this makes sense, so you mean with initdb we will always use
> file_copy or we want to give a command line option for initdb ?

Don't see a need for a commandline option / a situation where using WAL_LOG
would be preferrable for initdb.

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-30 Thread Dilip Kumar

On Tue, Mar 29, 2022 at 9:25 PM Robert Haas  wrote:
>
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas  wrote:
> > smgrcreate() as we would for most WAL records or whether it should be
> > adopting the new system introduced by
> > 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> > over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=icfm0suyqbbqzea...@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
>
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...
>

There was a duplicate error check for the invalid createdb strategy
option in the test case, although it would not create any issue but it
is duplicate so I have fixed it in the attached patch.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index b81f06a..18f6e31 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -109,7 +109,7 @@ $node->command_checks_all(
 	1,
 	[qr/^$/],
 	[
-		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy|^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
+		qr/^createdb: error: database creation failed: ERROR:  invalid create database strategy foo/s
 	],
 	'createdb with incorrect --strategy');

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Dilip Kumar

On Wed, Mar 30, 2022 at 6:47 AM Andres Freund  wrote:
>
>
> du -s /tmp/initdb/
> WAL_LOG: 35112
> FILE_COPY: 29288
>
> So it seems we should specify a strategy in initdb? It kind of makes sense -
> we're not going to read anything from those database. And because of the
> ringbuffer of 256kB, we'll not even reduce IO meaningfully.

I think this makes sense, so you mean with initdb we will always use
file_copy or we want to give a command line option for initdb ?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Andres Freund

Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

Just noticed that it makes initdb a bit slower / the cluster a bit bigger,
because now there's WAL traffic from creating the databases.  There's an
optimization (albeit insufficient) to reduce WAL traffic in bootstrap mode,
but not for single user mode when the CREATE DATABASEs happen.

In an optimized build, with wal-segsize 1 (the most extreme case) using
FILE_COPY vs WAL_LOG:

perf stat ~/build/postgres/dev-optimize/install/bin/initdb /tmp/initdb/ 
--wal-segsize=1
WAL_LOG:

487.58 msec task-clock#0.848 CPUs utilized
 2,874  context-switches  #5.894 K/sec
 0  cpu-migrations#0.000 /sec
10,209  page-faults   #   20.938 K/sec
 1,550,483,095  cycles#3.180 GHz
 2,537,618,094  instructions  #1.64  insn per cycle
   492,780,121  branches  #1.011 G/sec
 7,384,884  branch-misses #1.50% of all branches

   0.575213800 seconds time elapsed

   0.349812000 seconds user
   0.133225000 seconds sys

FILE_COPY:

476.54 msec task-clock#0.854 CPUs utilized
 3,005  context-switches  #6.306 K/sec
 0  cpu-migrations#0.000 /sec
10,050  page-faults   #   21.090 K/sec
 1,516,058,200  cycles#3.181 GHz
 2,504,126,907  instructions  #1.65  insn per cycle
   488,042,856  branches  #1.024 G/sec
 7,327,364  branch-misses #1.50% of all branches

   0.557934976 seconds time elapsed

   0.360473000 seconds user
   0.112109000 seconds sys


the numbers are similar if repeated.

du -s /tmp/initdb/
WAL_LOG: 35112
FILE_COPY: 29288

So it seems we should specify a strategy in initdb? It kind of makes sense -
we're not going to read anything from those database. And because of the
ringbuffer of 256kB, we'll not even reduce IO meaningfully.

- Andres

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Robert Haas

On Tue, Mar 29, 2022 at 2:37 PM Tom Lane  wrote:
> Robert Haas  writes:
> > On Tue, Mar 29, 2022 at 2:17 PM Tom Lane  wrote:
> >> it's some random algorithm that they probably feel at liberty
> >> to change.
>
> > I guess that characterization surprises me. The man page for
> > getopt_long() says this, and has for a long time at least on systems
> > I've used:
>
> Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT.
> What they don't spell out in any detail is what they do when you don't.
> We know that it involves rearranging the argv[] array behind the
> application's back, but not what the rules are for doing that.  In
> particular, they must have some undocumented and probably not very safe
> method for deciding which arguments are neither switches nor switch
> arguments.

I mean, I think of an option as something that starts with '-'. The
documentation contains a caveat that says: "The special argument ‘--’
forces in all cases the end of option scanning." So I think I would
expect it just looks for arguments starting with '-' that do not
follow an argument that is exactly "--".

https://github.com/gcc-mirror/gcc/blob/master/libiberty/getopt.c

   If an element of ARGV starts with '-', and is not exactly "-" or "--",
   then it is an option element.  The characters of this element
   (aside from the initial '-') are option characters.  If `getopt'
   is called repeatedly, it returns successively each of the option characters
   from each of the option elements.

OK - so I was off slightly. Either "-" or "--" terminates the options
list. Apart from that anything starting with "-" is an option.

I think you're overestimating the level of mystery that's present
here, as well as the likelihood that the rules could ever be changed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Tom Lane

Robert Haas  writes:
> On Tue, Mar 29, 2022 at 2:17 PM Tom Lane  wrote:
>> it's some random algorithm that they probably feel at liberty
>> to change.

> I guess that characterization surprises me. The man page for
> getopt_long() says this, and has for a long time at least on systems
> I've used:

Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT.
What they don't spell out in any detail is what they do when you don't.
We know that it involves rearranging the argv[] array behind the
application's back, but not what the rules are for doing that.  In
particular, they must have some undocumented and probably not very safe
method for deciding which arguments are neither switches nor switch
arguments.

(Actually, if I recall previous discussions properly, another stumbling
block to doing anything here is that we'd also have to change all our
documentation to explain it.  Fixing the command line synopses would
be a mess already, and explaining the rules would be worse.)

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Robert Haas

On Tue, Mar 29, 2022 at 2:17 PM Tom Lane  wrote:
> Robert Haas  writes:
> > On Tue, Mar 29, 2022 at 1:53 PM Tom Lane  wrote:
> >> That test script is expecting glibc-like laxness of switch
> >> parsing.  Put the switches before the non-switch arguments.
>
> > I just did that. :-)
>
> Yup, you pushed while I was typing.
>
> FWIW, I don't think it's "Windows" enforcing this, it's our own
> src/port/getopt[_long].c.  If there were a well-defined spec
> for what glibc does with such cases, it might be interesting to
> try to make our version bug-compatible with theirs.  But AFAIK
> it's some random algorithm that they probably feel at liberty
> to change.

I guess that characterization surprises me. The man page for
getopt_long() says this, and has for a long time at least on systems
I've used:

ENVIRONMENT
 POSIXLY_CORRECT  If set, option processing stops when the first non-
  option is found and a leading `-' or `+' in the
  optstring is ignored.

And also this:

BUGS
 The argv argument is not really const as its elements may be permuted
 (unless POSIXLY_CORRECT is set).

Doesn't that make it pretty clear what the GNU version is doing?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Tom Lane

Robert Haas  writes:
> On Tue, Mar 29, 2022 at 1:53 PM Tom Lane  wrote:
>> That test script is expecting glibc-like laxness of switch
>> parsing.  Put the switches before the non-switch arguments.

> I just did that. :-)

Yup, you pushed while I was typing.

FWIW, I don't think it's "Windows" enforcing this, it's our own
src/port/getopt[_long].c.  If there were a well-defined spec
for what glibc does with such cases, it might be interesting to
try to make our version bug-compatible with theirs.  But AFAIK
it's some random algorithm that they probably feel at liberty
to change.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Robert Haas

On Tue, Mar 29, 2022 at 1:53 PM Tom Lane  wrote:
> Andres Freund  writes:
> > Looks like there's some problem with commandline parsing?
>
> That test script is expecting glibc-like laxness of switch
> parsing.  Put the switches before the non-switch arguments.

I just did that. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Tom Lane

Andres Freund  writes:
> Looks like there's some problem with commandline parsing?

That test script is expecting glibc-like laxness of switch
parsing.  Put the switches before the non-switch arguments.

regards, tom lane

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Robert Haas

On Tue, Mar 29, 2022 at 1:35 PM Andres Freund  wrote:
> # Running: createdb -T foobar2 foobar6 -S wal_log
> createdb: error: too many command-line arguments (first is "wal_log")
> Try "createdb --help" for more information.
> not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0
>
> Looks like there's some problem with commandline parsing?

Apparently getopt_long() is fussier on Windows. I have committed a fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Andres Freund

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...

It fails in CI (for the mirror of the postgres repo on github):
https://cirrus-ci.com/task/6279465603956736?logs=test_bin#L121
tap test log: 
https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/regress_log_020_createdb
postmaster log: 
https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/020_createdb_main.log

recent versions failed similarly on cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3192
https://cirrus-ci.com/task/5217140407009280?logs=test_bin#L121

# Running: createdb -T foobar2 foobar6 -S wal_log
createdb: error: too many command-line arguments (first is "wal_log")
Try "createdb --help" for more information.
not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0

#   Failed test 'createdb -T foobar2 foobar6 -S wal_log exit code 0'
#   at t/020_createdb.pl line 117.
not ok 32 - create database with WAL_LOG strategy: SQL found in server log

#   Failed test 'create database with WAL_LOG strategy: SQL found in server log'
#   at t/020_createdb.pl line 117.
#   ''
# doesn't match '(?^:statement: CREATE DATABASE foobar6 STRATEGY wal_log 
TEMPLATE foobar2)'
# Running: createdb -T foobar2 foobar7 -S file_copy
createdb: error: too many command-line arguments (first is "file_copy")
Try "createdb --help" for more information.
not ok 33 - createdb -T foobar2 foobar7 -S file_copy exit code 0

#   Failed test 'createdb -T foobar2 foobar7 -S file_copy exit code 0'
#   at t/020_createdb.pl line 122.
not ok 34 - create database with FILE_COPY strategy: SQL found in server log

#   Failed test 'create database with FILE_COPY strategy: SQL found in server 
log'
#   at t/020_createdb.pl line 122.
#   ''
# doesn't match '(?^:statement: CREATE DATABASE foobar7 STRATEGY file_copy 
TEMPLATE foobar2)'

Looks like there's some problem with commandline parsing?

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-29 Thread Robert Haas

On Mon, Mar 28, 2022 at 3:08 PM Robert Haas  wrote:
> smgrcreate() as we would for most WAL records or whether it should be
> adopting the new system introduced by
> 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> over here:
>
> http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=icfm0suyqbbqzea...@mail.gmail.com
>
> But apart from that question your adaptations here look reasonable to me.

That commit having been reverted, I committed v6 instead. Let's see
what breaks...

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-28 Thread Dilip Kumar

On Tue, Mar 29, 2022 at 12:38 AM Robert Haas  wrote:
>
> On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar  wrote:
> > > I have put the similar logic for relmap_update WAL replay as well,
> >
> > There was some mistake in the last patch, basically, for relmap update
> > also I have checked the missing tablespace directory but I should have
> > checked the missing database directory so I have fixed that.
> >
> > > Now, is it possible to get the FPI without smgr_create wal in other
> > > cases?  If it is then that problem is orthogonal to this path, but
> > > anyway I could not find any such scenario.
> >
> > I have digged further into it, tried manually removing the directory
> > before XLOG_FPI, but I noticed that during FPI also
> > XLogReadBufferExtended() take cares of creating the missing files
> > using smgrcreate() and that intern take care of missing directory
> > creation so I don't think we have any problem here.
>
> I don't understand whether XLOG_RELMAP_UPDATE should be just doing
> smgrcreate()

XLOG_RELMAP_UPDATE is for the complete database so for which relnode
it will create smgr?  I think you probably meant
TablespaceCreateDbspace()?

 as we would for most WAL records or whether it should be
> adopting the new system introduced by
> 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> over here:

okay, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-28 Thread Robert Haas

On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar  wrote:
> > I have put the similar logic for relmap_update WAL replay as well,
>
> There was some mistake in the last patch, basically, for relmap update
> also I have checked the missing tablespace directory but I should have
> checked the missing database directory so I have fixed that.
>
> > Now, is it possible to get the FPI without smgr_create wal in other
> > cases?  If it is then that problem is orthogonal to this path, but
> > anyway I could not find any such scenario.
>
> I have digged further into it, tried manually removing the directory
> before XLOG_FPI, but I noticed that during FPI also
> XLogReadBufferExtended() take cares of creating the missing files
> using smgrcreate() and that intern take care of missing directory
> creation so I don't think we have any problem here.

I don't understand whether XLOG_RELMAP_UPDATE should be just doing
smgrcreate() as we would for most WAL records or whether it should be
adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=icfm0suyqbbqzea...@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-28 Thread Dilip Kumar

On Sat, Mar 26, 2022 at 5:55 PM Dilip Kumar  wrote:
>
> On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar  wrote:
> >
> > > On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
> > > to mirror some of the logic that was added to the replay code for the
> > > existing strategy, but I haven't figured out the details.
> > >
> >
> > Yeah, I think I got it, for XLOG_DBASE_CREATE_WAL_LOG now we will have
> > to handle the missing parent directory case, like Alvaro handled for
> > the XLOG_DBASE_CREATE(_FILE_COPY) case.
>
> I have updated the patch so now we skip the XLOG_DBASE_CREATE_WAL_LOG
> as well if the tablespace directory is missing.  But with our new
> wal_log method there will be other follow up wal logs like,
> XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.
>
> I have put the similar logic for relmap_update WAL replay as well,

There was some mistake in the last patch, basically, for relmap update
also I have checked the missing tablespace directory but I should have
checked the missing database directory so I have fixed that.

> Now, is it possible to get the FPI without smgr_create wal in other
> cases?  If it is then that problem is orthogonal to this path, but
> anyway I could not find any such scenario.

I have digged further into it, tried manually removing the directory
before XLOG_FPI, but I noticed that during FPI also
XLogReadBufferExtended() take cares of creating the missing files
using smgrcreate() and that intern take care of missing directory
creation so I don't think we have any problem here.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From c35ba040770184d3048d9529668e30f8a59f5b75 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Sat, 26 Mar 2022 10:35:44 +0530
Subject: [PATCH v8] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c |   2 +-
 doc/src/sgml/monitoring.sgml |   4 +
 doc/src/sgml/ref/create_database.sgml|  22 +
 doc/src/sgml/ref/createdb.sgml   |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c   |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c   |   2 +-
 src/backend/catalog/storage.c|  34 +-
 src/backend/commands/dbcommands.c| 795 ++-
 src/backend/commands/tablecmds.c |   2 +-
 src/backend/storage/buffer/bufmgr.c  | 172 ++-
 src/backend/storage/lmgr/lmgr.c  |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c   |   2 +-
 src/backend/utils/cache/relmapper.c  |  93 
 src/bin/pg_rewind/parsexlog.c|   9 +-
 src/bin/psql/tab-complete.c  |   4 +-
 src/bin/scripts/createdb.c   |  10 +-
 src/bin/scripts/t/020_createdb.pl|  20 +
 src/include/catalog/storage.h|   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h |   6 +-
 src/include/storage/lmgr.h   |   1 +
 src/include/utils/relmapper.h|   4 +-
 src/include/utils/wait_event.h   |   1 +
 src/tools/pgindent/typedefs.list |   5 +-
 28 files changed, 1136 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk,

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-26 Thread Dilip Kumar

On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar  wrote:
>
> > On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
> > to mirror some of the logic that was added to the replay code for the
> > existing strategy, but I haven't figured out the details.
> >
>
> Yeah, I think I got it, for XLOG_DBASE_CREATE_WAL_LOG now we will have
> to handle the missing parent directory case, like Alvaro handled for
> the XLOG_DBASE_CREATE(_FILE_COPY) case.

I have updated the patch so now we skip the XLOG_DBASE_CREATE_WAL_LOG
as well if the tablespace directory is missing.  But with our new
wal_log method there will be other follow up wal logs like,
XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.

I have put the similar logic for relmap_update WAL replay as well, but
we don't need this  for smgr_create or fpi.  Because the mdcreate() is
taking care of creating missing directory in TablespaceCreateDbspace()
and fpi only logged after we create the new smgr at least in case of
create database.

Now, is it possible to get the FPI without smgr_create wal in other
cases?  If it is then that problem is orthogonal to this path, but
anyway I could not find any such scenario.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From e0133aa89ca5da8309d07e4236ded4af513d3905 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Sat, 26 Mar 2022 10:35:44 +0530
Subject: [PATCH v7] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c |   2 +-
 doc/src/sgml/monitoring.sgml |   4 +
 doc/src/sgml/ref/create_database.sgml|  22 +
 doc/src/sgml/ref/createdb.sgml   |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c   |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c   |   2 +-
 src/backend/catalog/storage.c|  34 +-
 src/backend/commands/dbcommands.c| 795 ++-
 src/backend/commands/tablecmds.c |   2 +-
 src/backend/storage/buffer/bufmgr.c  | 172 ++-
 src/backend/storage/lmgr/lmgr.c  |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c   |   2 +-
 src/backend/utils/cache/relmapper.c  |  97 
 src/bin/pg_rewind/parsexlog.c|   9 +-
 src/bin/psql/tab-complete.c  |   4 +-
 src/bin/scripts/createdb.c   |  10 +-
 src/bin/scripts/t/020_createdb.pl|  20 +
 src/include/catalog/storage.h|   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h |   6 +-
 src/include/storage/lmgr.h   |   1 +
 src/include/utils/relmapper.h|   4 +-
 src/include/utils/wait_event.h   |   1 +
 src/tools/pgindent/typedefs.list |   5 +-
 28 files changed, 1140 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff --git a/doc/src/sgml/monitoring.sgml

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-25 Thread Dilip Kumar

On Fri, Mar 25, 2022 at 7:41 PM Robert Haas  wrote:
>
> On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar  wrote:
> > Thanks, I have gone through your changes in comments and docs and those 
> > LGTM.
>
> It looks like this patch will need to be updated for Alvaro's commit
> 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test
> 029_replay_tsp_drops.pl fails with this patch applied. The standby log
> shows:
>
> 2022-03-25 10:00:10.022 EDT [38209] LOG:  entering standby mode
> 2022-03-25 10:00:10.024 EDT [38209] LOG:  redo starts at 0/328
> 2022-03-25 10:00:10.062 EDT [38209] FATAL:  could not create directory
> "pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
> 2022-03-25 10:00:10.062 EDT [38209] CONTEXT:  WAL redo at 0/43EBD88
> for Database/CREATE_WAL_LOG: create dir 16385/16390
>
> On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
> to mirror some of the logic that was added to the replay code for the
> existing strategy, but I haven't figured out the details.
>

Yeah, I think I got it, for XLOG_DBASE_CREATE_WAL_LOG now we will have
to handle the missing parent directory case, like Alvaro handled for
the XLOG_DBASE_CREATE(_FILE_COPY) case.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-25 Thread Robert Haas

On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar  wrote:
> Thanks, I have gone through your changes in comments and docs and those LGTM.

It looks like this patch will need to be updated for Alvaro's commit
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test
029_replay_tsp_drops.pl fails with this patch applied. The standby log
shows:

2022-03-25 10:00:10.022 EDT [38209] LOG:  entering standby mode
2022-03-25 10:00:10.024 EDT [38209] LOG:  redo starts at 0/328
2022-03-25 10:00:10.062 EDT [38209] FATAL:  could not create directory
"pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
2022-03-25 10:00:10.062 EDT [38209] CONTEXT:  WAL redo at 0/43EBD88
for Database/CREATE_WAL_LOG: create dir 16385/16390

On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
to mirror some of the logic that was added to the replay code for the
existing strategy, but I haven't figured out the details.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-24 Thread Dilip Kumar

On Thu, Mar 24, 2022 at 9:29 PM Robert Haas  wrote:
>
> On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar  wrote:
> > In the latest version I have fixed this issue by using a non
> > conflicting name, because when it was compiled with-icu the foobar5
> > was already used and we were seeing failure.  Apart from this I have
> > fixed the duplicate cleanup problem by passing an extra parameter to
> > RelationCreateStorage, which decides whether to register for on-abort
> > delete or not and added the comments for the same.  IMHO this looks
> > the most cleaner way to do it, please check the patch and let me know
> > your thoughts.
>
> I think that might be an OK way to do it. I think if we were starting
> from scratch we'd probably want to come up with some better system,
> but that's true of a lot of things.

Right.

> I went over your version and changed some comments. I also added
> documentation for the new wait event. Here's a new version.
>

Thanks, I have gone through your changes in comments and docs and those LGTM.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-24 Thread Robert Haas

On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar  wrote:
> In the latest version I have fixed this issue by using a non
> conflicting name, because when it was compiled with-icu the foobar5
> was already used and we were seeing failure.  Apart from this I have
> fixed the duplicate cleanup problem by passing an extra parameter to
> RelationCreateStorage, which decides whether to register for on-abort
> delete or not and added the comments for the same.  IMHO this looks
> the most cleaner way to do it, please check the patch and let me know
> your thoughts.

I think that might be an OK way to do it. I think if we were starting
from scratch we'd probably want to come up with some better system,
but that's true of a lot of things.

I went over your version and changed some comments. I also added
documentation for the new wait event. Here's a new version.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v6-0001-Add-new-block-by-block-strategy-for-CREATE-DATABA.patch
Description: Binary data

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 10:50 PM Dilip Kumar  wrote:
>
> On Wed, Mar 23, 2022 at 10:37 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:
> > > I could not see any reason for it to fail, and I could not reproduce
> > > it either.  Is it possible to access the server log for this cfbot
> > > failure?
> >
> > Yes, near the top, below the cpu / memory graphs, there's a file
> > navigator. Should have all files ending with *.log or starting with
> > regress_log_*.
>
> Okay, I think I have found the reasoning for this failure, basically,
> if we see the below logs then the second statement is failing with
> foobar5 already exists and that is because some of the above test case
> is conditionally generating the same name.  So the fix is to use a
> different name.

In the latest version I have fixed this issue by using a non
conflicting name, because when it was compiled with-icu the foobar5
was already used and we were seeing failure.  Apart from this I have
fixed the duplicate cleanup problem by passing an extra parameter to
RelationCreateStorage, which decides whether to register for on-abort
delete or not and added the comments for the same.  IMHO this looks
the most cleaner way to do it, please check the patch and let me know
your thoughts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From d9821c93b8d5b4a5707943d23f7beae6826627f0 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v5] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c |   2 +-
 doc/src/sgml/ref/create_database.sgml|  22 +
 doc/src/sgml/ref/createdb.sgml   |  11 +
 src/backend/access/heap/heapam_handler.c |   6 +-
 src/backend/access/nbtree/nbtree.c   |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/catalog/heap.c   |   2 +-
 src/backend/catalog/storage.c|  34 +-
 src/backend/commands/dbcommands.c| 761 ++-
 src/backend/commands/tablecmds.c |   2 +-
 src/backend/storage/buffer/bufmgr.c  | 172 ++-
 src/backend/storage/lmgr/lmgr.c  |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relcache.c   |   2 +-
 src/backend/utils/cache/relmapper.c  |  64 +++
 src/bin/pg_rewind/parsexlog.c|   9 +-
 src/bin/psql/tab-complete.c  |   4 +-
 src/bin/scripts/createdb.c   |  10 +-
 src/bin/scripts/t/020_createdb.pl|  20 +
 src/include/catalog/storage.h|   4 +-
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h |   6 +-
 src/include/storage/lmgr.h   |   1 +
 src/include/utils/relmapper.h|   4 +-
 src/include/utils/wait_event.h   |   1 +
 src/tools/pgindent/typedefs.list |   5 +-
 27 files changed, 1069 insertions(+), 157 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage,

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 10:37 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:
> > I could not see any reason for it to fail, and I could not reproduce
> > it either.  Is it possible to access the server log for this cfbot
> > failure?
>
> Yes, near the top, below the cpu / memory graphs, there's a file
> navigator. Should have all files ending with *.log or starting with
> regress_log_*.

Okay, I think I have found the reasoning for this failure, basically,
if we see the below logs then the second statement is failing with
foobar5 already exists and that is because some of the above test case
is conditionally generating the same name.  So the fix is to use a
different name.

2022-03-23 13:53:54.554 UTC [32647][client backend]
[020_createdb.pl][3/12:0] LOG:  statement: CREATE DATABASE foobar5
TEMPLATE template0 LOCALE_PROVIDER icu ICU_LOCALE 'en';
..
2022-03-23 13:53:55.374 UTC [32717][client backend]
[020_createdb.pl][3/46:0] LOG:  statement: CREATE DATABASE foobar5
STRATEGY file_copy TEMPLATE foobar2;
2022-03-23 13:53:55.390 UTC [32717][client backend]
[020_createdb.pl][3/46:0] ERROR:  database "foobar5" already exists

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Andres Freund

Hi,

On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:
> I could not see any reason for it to fail, and I could not reproduce
> it either.  Is it possible to access the server log for this cfbot
> failure?

Yes, near the top, below the cpu / memory graphs, there's a file
navigator. Should have all files ending with *.log or starting with
regress_log_*.

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 9:25 PM Dilip Kumar  wrote:
>
> On Wed, Mar 23, 2022 at 9:13 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:
> > > I think directly using smgrcreate() is a better idea instead of first
> > > registering and then unregistering it.   I have made that change in
> > > the attached patch.  After this change now we can merge creating the
> > > MAIN_FORKNUM also in the loop below where we are creating other
> > > fork[1] with one extra condition but I think current code is in more
> > > sync with the other code where we are doing the similar things so I
> > > have not merged it in the loop.  Please let me know if you think
> > > otherwise.
> >
> > FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
> > https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
> > https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377
>
> Strange to see that these changes are making a failure in the
> file_copy strategy[1] because we made changes only related to the
> wal_log strategy.  However I will look into this.  Thanks.
> [1]
> Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0'

I could not see any reason for it to fail, and I could not reproduce
it either.  Is it possible to access the server log for this cfbot
failure?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 9:05 PM Dilip Kumar  wrote:
>
> On Wed, Mar 23, 2022 at 7:03 PM Robert Haas  wrote:
> >
> > On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar  wrote:
> > > I think directly using smgrcreate() is a better idea instead of first
> > > registering and then unregistering it.   I have made that change in
> > > the attached patch.  After this change now we can merge creating the
> > > MAIN_FORKNUM also in the loop below where we are creating other
> > > fork[1] with one extra condition but I think current code is in more
> > > sync with the other code where we are doing the similar things so I
> > > have not merged it in the loop.  Please let me know if you think
> > > otherwise.
> >
> > Generally I think our practice is that we do the main fork
> > unconditionally (because it should always be there) and the others
> > only if they exist. I suggest that you make this consistent with that,
> > but you could do it like if (forkNum != MAIN_FORKNUM &&
> > !smgrexists(...)) continue if that seems nicer.
>
> Maybe we can do that.
>
> > Do you think that this version handles pending syncs correctly? I
> > think perhaps that is overlooked.
>
> Yeah I missed that.  So options are either we go to the other approach
> and call RelationPreserveStorage() after
> RelationCreateStorage(),

Here is the patch with this approach, I am not sending both patches
with different approaches in the same mail otherwise cfbot might
generate conflict while applying the patch I think, so I will send it
in a seperate mail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 2091047bf391424528c685059268cc8b1212a2dc Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v4] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c |   2 +-
 doc/src/sgml/ref/create_database.sgml|  22 +
 doc/src/sgml/ref/createdb.sgml   |  11 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c   |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/commands/dbcommands.c| 761 ++-
 src/backend/storage/buffer/bufmgr.c  | 181 +++-
 src/backend/storage/lmgr/lmgr.c  |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relmapper.c  |  64 +++
 src/bin/pg_rewind/parsexlog.c|   9 +-
 src/bin/psql/tab-complete.c  |   4 +-
 src/bin/scripts/createdb.c   |  10 +-
 src/bin/scripts/t/020_createdb.pl|  20 +
 src/include/commands/dbcommands_xlog.h   |  25 +-
 src/include/storage/bufmgr.h |   6 +-
 src/include/storage/lmgr.h   |   1 +
 src/include/utils/relmapper.h|   4 +-
 src/include/utils/wait_event.h   |   1 +
 src/tools/pgindent/typedefs.list |   5 +-
 22 files changed, 1048 insertions(+), 139 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34..82378db 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * Write the page and log it.  It might seem that an immediate sync would
 	 * be sufficient to guarantee that the file exists on disk, but recovery
 	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
+	 * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
diff

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 9:13 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:
> > I think directly using smgrcreate() is a better idea instead of first
> > registering and then unregistering it.   I have made that change in
> > the attached patch.  After this change now we can merge creating the
> > MAIN_FORKNUM also in the loop below where we are creating other
> > fork[1] with one extra condition but I think current code is in more
> > sync with the other code where we are doing the similar things so I
> > have not merged it in the loop.  Please let me know if you think
> > otherwise.
>
> FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
> https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
> https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

Strange to see that these changes are making a failure in the
file_copy strategy[1] because we made changes only related to the
wal_log strategy.  However I will look into this.  Thanks.
[1]
Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0'


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Andres Freund

Hi,

On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:
> I think directly using smgrcreate() is a better idea instead of first
> registering and then unregistering it.   I have made that change in
> the attached patch.  After this change now we can merge creating the
> MAIN_FORKNUM also in the loop below where we are creating other
> fork[1] with one extra condition but I think current code is in more
> sync with the other code where we are doing the similar things so I
> have not merged it in the loop.  Please let me know if you think
> otherwise.

FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

Greetings,

Andres Freund

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 7:03 PM Robert Haas  wrote:
>
> On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar  wrote:
> > I think directly using smgrcreate() is a better idea instead of first
> > registering and then unregistering it.   I have made that change in
> > the attached patch.  After this change now we can merge creating the
> > MAIN_FORKNUM also in the loop below where we are creating other
> > fork[1] with one extra condition but I think current code is in more
> > sync with the other code where we are doing the similar things so I
> > have not merged it in the loop.  Please let me know if you think
> > otherwise.
>
> Generally I think our practice is that we do the main fork
> unconditionally (because it should always be there) and the others
> only if they exist. I suggest that you make this consistent with that,
> but you could do it like if (forkNum != MAIN_FORKNUM &&
> !smgrexists(...)) continue if that seems nicer.

Maybe we can do that.

> Do you think that this version handles pending syncs correctly? I
> think perhaps that is overlooked.

Yeah I missed that.  So options are either we go to the other approach
and call RelationPreserveStorage() after
RelationCreateStorage(), or we expose the AddPendingSync() function
from the storage layer and then conditionally use it.  I think if we
are planning to expose this api then we better rename it to
RelationAddPendingSync().  Honestly, I do not have any specific
preference here.  I can try both the approaches and send both if you
or anyone else do not have any preference here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Robert Haas

On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar  wrote:
> I think directly using smgrcreate() is a better idea instead of first
> registering and then unregistering it.   I have made that change in
> the attached patch.  After this change now we can merge creating the
> MAIN_FORKNUM also in the loop below where we are creating other
> fork[1] with one extra condition but I think current code is in more
> sync with the other code where we are doing the similar things so I
> have not merged it in the loop.  Please let me know if you think
> otherwise.

Generally I think our practice is that we do the main fork
unconditionally (because it should always be there) and the others
only if they exist. I suggest that you make this consistent with that,
but you could do it like if (forkNum != MAIN_FORKNUM &&
!smgrexists(...)) continue if that seems nicer.

Do you think that this version handles pending syncs correctly? I
think perhaps that is overlooked.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-23 Thread Dilip Kumar

On Wed, Mar 23, 2022 at 5:54 PM Robert Haas  wrote:
>
> On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar  wrote:
> > Okay this is an interesting point.  So one option is that in case of
> > failure while using the wal log strategy we do not remove the database
> > directory, because an abort transaction will take care of removing the
> > relation file.  But then in failure case we will leave the orphaned
> > database directory with version file and the relmap file.  Another
> > option is to do the redundant cleanup as we are doing now.  Any other
> > options?
>
> I think our overriding goal should be to get everything using one
> mechanism. It doesn't look straightforward to get everything to go
> through the PendingRelDelete mechanism, because as you say, it can't
> handle non-relation files or directories. However, what if we opt out
> of that mechanism? We could do that either by not using
> RelationCreateStorage() in the first place and directly calling
> smgrcreate(), or by using RelationPreserveStorage() afterwards to yank
> the file back out of the list.

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it.   I have made that change in
the attached patch.  After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop.  Please let me know if you think
otherwise.

[1]
+/*
+ * Create and copy all forks of the relation.  We are not using
+ * RelationCreateStorage() as it is registering the cleanup for the
+ * underlying relation storage on the transaction abort.  But during create
+ * database failure, we have a separate cleanup mechanism for the whole
+ * database directory. Therefore, we don't need to register cleanup for
+ * each individual relation storage.
+ */
+smgrcreate(RelationGetSmgr(dst_rel), MAIN_FORKNUM, false);
+if (permanent)
+log_smgrcreate(_rnode, MAIN_FORKNUM);
+
+/* copy main fork. */
+RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+/* copy those extra forks that exist */
+for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+ forkNum <= MAX_FORKNUM; forkNum++)
+{
+if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+{
+smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From fe6d9d5a2e1d0791749768a92a08dcb5dd4ca0ce Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 22 Mar 2022 11:22:26 -0400
Subject: [PATCH v3] Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.

Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me.
---
 contrib/bloom/blinsert.c |   2 +-
 doc/src/sgml/ref/create_database.sgml|  22 +
 doc/src/sgml/ref/createdb.sgml   |  11 +
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/nbtree/nbtree.c   |   2 +-
 src/backend/access/rmgrdesc/dbasedesc.c  |  20 +-
 src/backend/access/transam/xlogutils.c   |   6 +-
 src/backend/commands/dbcommands.c| 761 ++-
 src/backend/storage/buffer/bufmgr.c  | 171 ++-
 src/backend/storage/lmgr/lmgr.c  |  28 ++
 src/backend/utils/activity/wait_event.c  |   3 +
 src/backend/utils/cache/relmapper.c  |  64 +++
 src/bin/pg_rewind/parsexlog.c|   9 +-
 src/bin/psql/tab-complete.c  |   4 +-
 src/bin/scripts/createdb.c   |  10 +-
 src/bin/scripts/t/020_createdb.pl|  20 +

1 2 3 >

1 - 100 of 281 matches

Mail list logo