Re: pg_upgrade failing for 200+ million Large Objects

2024-04-01 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> The one design point that worries me a little is the non-configurability of
>> --transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
>> or something, but given how often I've had to fiddle with
>> max_locks_per_transaction, I'm wondering if we might regret hard-coding it.
> 
> Well, we could add a command-line switch to pg_upgrade, but I'm
> unconvinced that it'd be worth the trouble.  I think a very large
> fraction of users invoke pg_upgrade by means of packager-supplied
> scripts that are unlikely to provide a way to pass through such
> a switch.  I'm inclined to say let's leave it as-is until we get
> some actual field requests for a switch.

Okay.  I'll let you know if I see anything.  IIRC usually the pg_dump side
of pg_upgrade is more prone to lock exhaustion, so you may very well be
right that this is unnecessary.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-04-01 Thread Tom Lane
Nathan Bossart  writes:
> Sorry for taking so long to get back to this one.  Overall, I think the
> code is in decent shape.

Thanks for looking at it!

> The one design point that worries me a little is the non-configurability of
> --transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
> or something, but given how often I've had to fiddle with
> max_locks_per_transaction, I'm wondering if we might regret hard-coding it.

Well, we could add a command-line switch to pg_upgrade, but I'm
unconvinced that it'd be worth the trouble.  I think a very large
fraction of users invoke pg_upgrade by means of packager-supplied
scripts that are unlikely to provide a way to pass through such
a switch.  I'm inclined to say let's leave it as-is until we get
some actual field requests for a switch.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-04-01 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 10:08:26AM -0500, Nathan Bossart wrote:
> On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote:
>> Michael Banck  writes:
>>> What is the status of this? In the commitfest, this patch is marked as
>>> "Needs Review" with Nathan as reviewer - Nathan, were you going to take
>>> another look at this or was your mail from January 12th a full review?
>> 
>> In my mind the ball is in Nathan's court.  I feel it's about
>> committable, but he might not agree.
> 
> I'll prioritize another round of review on this one.  FWIW I don't remember
> having any major concerns on a previous version of the patch set I looked
> at.

Sorry for taking so long to get back to this one.  Overall, I think the
code is in decent shape.  Nothing stands out after a couple of passes.  The
small amount of runtime improvement cited upthread is indeed a bit
disappointing, but IIUC this at least sets the stage for additional
parallelism in the future, and the memory/disk usage improvements are
nothing to sneeze at, either.

The one design point that worries me a little is the non-configurability of
--transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
or something, but given how often I've had to fiddle with
max_locks_per_transaction, I'm wondering if we might regret hard-coding it.
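
For reference, the field knob in question is max_locks_per_transaction on the
destination cluster; a minimal sketch of checking and raising it (assuming
superuser access, and noting the change requires a restart):

    -- Shared lock table capacity is roughly
    --   max_locks_per_transaction * (max_connections + max_prepared_transactions)
    SHOW max_locks_per_transaction;   -- 64 by default
    SHOW max_connections;             -- 100 by default

    -- Raise the limit if a restore exhausts the lock table:
    ALTER SYSTEM SET max_locks_per_transaction = 256;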

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote:
> Michael Banck  writes:
>> What is the status of this? In the commitfest, this patch is marked as
>> "Needs Review" with Nathan as reviewer - Nathan, were you going to take
>> another look at this or was your mail from January 12th a full review?
> 
> In my mind the ball is in Nathan's court.  I feel it's about
> committable, but he might not agree.

I'll prioritize another round of review on this one.  FWIW I don't remember
having any major concerns on a previous version of the patch set I looked
at.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-27 Thread Tom Lane
Michael Banck  writes:
> What is the status of this? In the commitfest, this patch is marked as
> "Needs Review" with Nathan as reviewer - Nathan, were you going to take
> another look at this or was your mail from January 12th a full review?

In my mind the ball is in Nathan's court.  I feel it's about
committable, but he might not agree.

> Also, is there a chance this is going to be back-patched?

No chance of that I'm afraid.  The patch bumps the archive version
number, because it creates TOC entries that older pg_restore would
not know what to do with.  We can't put that kind of compatibility
break into stable branches.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-27 Thread Michael Banck
Hi,

On Wed, Mar 27, 2024 at 10:53:51AM +0100, Laurenz Albe wrote:
> On Wed, 2024-03-27 at 10:20 +0100, Michael Banck wrote:
> > Also, is there a chance this is going to be back-patched? I guess it
> > would be enough if the upgrade target is v17, so it is less of a concern,
> > but it would be nice if people with millions of large objects were not
> > stuck until they are ready to upgrade to v17.
> 
> It is quite an invasive patch, and it adds new features (pg_restore in
> bigger transaction batches), so I think this is not for backpatching,
> desirable as it may seem from the usability angle.

Right, I forgot about those changes, makes sense.


Michael




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-27 Thread Laurenz Albe
On Wed, 2024-03-27 at 10:20 +0100, Michael Banck wrote:
> Also, is there a chance this is going to be back-patched? I guess it
> would be enough if the upgrade target is v17, so it is less of a concern,
> but it would be nice if people with millions of large objects were not
> stuck until they are ready to upgrade to v17.

It is quite an invasive patch, and it adds new features (pg_restore in
bigger transaction batches), so I think this is not for backpatching,
desirable as it may seem from the usability angle.

Yours,
Laurenz Albe




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-27 Thread Michael Banck
Hi,

On Sat, Mar 16, 2024 at 06:46:15PM -0400, Tom Lane wrote:
> Laurenz Albe  writes:
> > On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote:
> >> This patch seems to have stalled out again.  In hopes of getting it
> >> over the finish line, I've done a bit more work to address the two
> >> loose ends I felt were probably essential to deal with:
> 
> > Applies and builds fine.
> > I didn't scrutinize the code, but I gave it a spin on a database with
> > 15 million (small) large objects.  I tried pg_upgrade --link with and
> > without the patch on a debug build with the default configuration.
> 
> Thanks for looking at it!
> 
> > Without the patch:
> > Runtime: 74.5 minutes
> 
> > With the patch:
> > Runtime: 70 minutes
> 
> Hm, I'd have hoped for a bit more runtime improvement.  

I also think that this is quite a large runtime for pg_upgrade, but the
more important savings should be the memory usage.

> But perhaps not --- most of the win we saw upthread was from
> parallelism, and I don't think you'd get any parallelism in a
> pg_upgrade with all the data in one database.  (Perhaps there is more
> to do there later, but I'm still not clear on how this should interact
> with the existing cross-DB parallelism; so I'm content to leave that
> question for another patch.)

What is the status of this? In the commitfest, this patch is marked as
"Needs Review" with Nathan as reviewer - Nathan, were you going to take
another look at this or was your mail from January 12th a full review?

My feeling is that this patch is "Ready for Committer" and it is Tom's
call to commit it during the next days or not.

I am +1 that this is an important feature/bug fix to have. Because we
have customers stuck on older versions due to their pathological
large-object usage, I did some benchmarks (just doing pg_dump, not
pg_upgrade) a while ago which were also very promising; however, I lost
the exact numbers/results. I am happy to do further tests if that is
required for this patch to go forward.

Also, is there a chance this is going to be back-patched? I guess it
would be enough if the upgrade target is v17, so it is less of a concern,
but it would be nice if people with millions of large objects were not
stuck until they are ready to upgrade to v17.


Michael




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-17 Thread Laurenz Albe
On Sat, 2024-03-16 at 18:46 -0400, Tom Lane wrote:
> > Without the patch:
> > Runtime: 74.5 minutes
> 
> > With the patch:
> > Runtime: 70 minutes
> 
> Hm, I'd have hoped for a bit more runtime improvement.

I did a second run with the patch, and that finished in 66 minutes,
so there is some jitter there.

I think the reduced memory footprint and the reduced transaction ID
consumption alone make this patch worthwhile.

Yours,
Laurenz Albe




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-16 Thread Tom Lane
Laurenz Albe  writes:
> On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote:
>> This patch seems to have stalled out again.  In hopes of getting it
>> over the finish line, I've done a bit more work to address the two
>> loose ends I felt were probably essential to deal with:

> Applies and builds fine.
> I didn't scrutinize the code, but I gave it a spin on a database with
> 15 million (small) large objects.  I tried pg_upgrade --link with and
> without the patch on a debug build with the default configuration.

Thanks for looking at it!

> Without the patch:
> Runtime: 74.5 minutes

> With the patch:
> Runtime: 70 minutes

Hm, I'd have hoped for a bit more runtime improvement.  But perhaps
not --- most of the win we saw upthread was from parallelism, and
I don't think you'd get any parallelism in a pg_upgrade with all
the data in one database.  (Perhaps there is more to do there later,
but I'm still not clear on how this should interact with the existing
cross-DB parallelism; so I'm content to leave that question for
another patch.)

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-16 Thread Laurenz Albe
On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote:
> This patch seems to have stalled out again.  In hopes of getting it
> over the finish line, I've done a bit more work to address the two
> loose ends I felt were probably essential to deal with:

Applies and builds fine.

I didn't scrutinize the code, but I gave it a spin on a database with
15 million (small) large objects.  I tried pg_upgrade --link with and
without the patch on a debug build with the default configuration.

Without the patch:

Runtime: 74.5 minutes
Memory usage: ~7GB
Disk usage: an extra 5GB dump file + log file during the dump

With the patch:

Runtime: 70 minutes
Memory usage: ~1GB
Disk usage: an extra 0.5GB during the dump

Memory usage stayed stable once it reached its peak, so no noticeable
memory leaks.

The reduced memory usage is great.  I was surprised by the difference
in disk usage: the lion's share is the dump file, and that got substantially
smaller.  But also the log file shrank considerably, because not every
individual large object gets logged.

I had a look at "perf top", and the profile looked pretty similar in
both cases.

The patch is a clear improvement.

Yours,
Laurenz Albe




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-15 Thread Tom Lane
This patch seems to have stalled out again.  In hopes of getting it
over the finish line, I've done a bit more work to address the two
loose ends I felt were probably essential to deal with:

* Duplicative blob ACLs are now merged into a single TOC entry
(per metadata group) with the GRANT/REVOKE commands stored only
once.  This is to address the possibly-common case where a database
has a ton of blobs that have identical-but-not-default ACLs.

I have not done anything about improving efficiency for blob comments
or security labels.  I think it's reasonable to assume that blobs with
comments are pets not cattle, and there won't be many of them.
I suppose it could be argued that seclabels might be used like ACLs
with a lot of duplication, but I doubt that there's anyone out there
at all putting seclabels on blobs in practice.  So I don't care to
expend effort on that.
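
To make the optimized case concrete, these are the kinds of per-blob
properties that each used to produce their own TOC entry; the OID and role
name below are placeholders, not values from this thread:

    -- Many blobs carrying the same non-default ACL used to mean many TOC entries:
    GRANT SELECT ON LARGE OBJECT 16385 TO app_reader;
    -- Comments (and security labels) on blobs still get individual entries:
    COMMENT ON LARGE OBJECT 16385 IS 'scanned invoice';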

* Parallel pg_upgrade cuts the --transaction-size given to concurrent
pg_restore jobs by the -j factor.  This is to ensure we keep the
shared locks table within bounds even in parallel mode.
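
A rough sketch of the arithmetic involved, using stock defaults and an
illustrative -j value rather than anything computed in the patch itself:

    -- Shared lock table capacity:
    --   max_locks_per_transaction * (max_connections + max_prepared_transactions)
    --   = 64 * (100 + 0) = 6400 lockable objects by default.
    SHOW max_locks_per_transaction;
    SHOW max_connections;
    -- With pg_upgrade -j 4, each concurrent pg_restore runs with
    -- --transaction-size 1000/4 = 250, so even at several locks per
    -- TOC entry the jobs together stay well inside 6400.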

Now we could go further than that and provide some direct user
control over these hard-wired settings, but I think that could
be left for later, getting some field experience before we design
an API.  In short, I think this patchset is more or less commitable.

0001-0004 are rebased up to HEAD, but differ only in line numbers
from the v10 patchset.  0005 handles ACL merging, and 0006 does
the other thing.

regards, tom lane

From 1df5cc3ffdee85db8f9815dc0839769192b57158 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Fri, 26 Jan 2024 11:10:00 -0500
Subject: [PATCH v11 1/6] Some small preliminaries for pg_dump changes.

Centralize management of the lo_buf used to hold data while restoring
blobs.  The code previously had each format handler create lo_buf,
which seems rather pointless given that the format handlers all make
it the same way.  Moreover, the format handlers never use lo_buf
directly, making this setup a failure from a separation-of-concerns
standpoint.  Let's move the responsibility into pg_backup_archiver.c,
which is the only module concerned with lo_buf.  The main reason to do
this now is that it allows a centralized fix for the soon-to-be-false
assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time
since we had any need to be able to restore to a pre-9.0 server.
---
 src/bin/pg_dump/pg_backup_archiver.c  |  9 +
 src/bin/pg_dump/pg_backup_custom.c|  7 ---
 src/bin/pg_dump/pg_backup_db.c| 27 +--
 src/bin/pg_dump/pg_backup_directory.c |  6 --
 src/bin/pg_dump/pg_backup_null.c  |  4 
 src/bin/pg_dump/pg_backup_tar.c   |  4 
 6 files changed, 14 insertions(+), 43 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..f5935b08bb 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -4749,6 +4755,9 @@ CloneArchive(ArchiveHandle *AH)
 	/* clone has its own error count, too */
 	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
 	/*
 	 * Connect our new clone object to the database, using the same connection
 	 * parameters used for the original connection.
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index b576b29924..7c6ac89dd4 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	ctx = (lclContext *) pg_malloc0(sizeof(lclContext));
 	AH->formatData = (void *) ctx;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the file
 	 */
@@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH)
 	 * share knowledge about where the data blocks are across threads.
 	 * _PrintTocData has to be careful about the order of operations on that
 	 * state, though.
-	 *
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.
 	 */
 }
 
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index f766b65059..b297ca049d 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -544,26 +544,9 @@ CommitTransaction(Archive *AHX)
 void
 DropLOIfExists(ArchiveHandle *AH, Oid oid)
 {
-	/*
-	 * If we are not restoring to a direct database connection, we 

Re: pg_upgrade failing for 200+ million Large Objects

2024-01-26 Thread Tom Lane
vignesh C  writes:
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 46a0cd4cefb4d9b462d8cc4df5e7ecdd190bea92 ===
> === applying patch ./v9-005-parallel_pg_restore.patch
> patching file src/bin/pg_upgrade/pg_upgrade.c
> Hunk #3 FAILED at 650.
> 1 out of 3 hunks FAILED -- saving rejects to file
> src/bin/pg_upgrade/pg_upgrade.c.rej

That's because v9-005 was posted by itself.  But I don't think
we should use it anyway.

Here's 0001-0004 again, updated to current HEAD (only line numbers
changed) and with Nathan's suggestion to define some macros for
the magic constants.

regards, tom lane

From c48b11547c6eb95ab217dddc047da5378042452c Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Fri, 26 Jan 2024 11:10:00 -0500
Subject: [PATCH v10 1/4] Some small preliminaries for pg_dump changes.

Centralize management of the lo_buf used to hold data while restoring
blobs.  The code previously had each format handler create lo_buf,
which seems rather pointless given that the format handlers all make
it the same way.  Moreover, the format handlers never use lo_buf
directly, making this setup a failure from a separation-of-concerns
standpoint.  Let's move the responsibility into pg_backup_archiver.c,
which is the only module concerned with lo_buf.  The main reason to do
this now is that it allows a centralized fix for the soon-to-be-false
assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time
since we had any need to be able to restore to a pre-9.0 server.
---
 src/bin/pg_dump/pg_backup_archiver.c  |  9 +
 src/bin/pg_dump/pg_backup_custom.c|  7 ---
 src/bin/pg_dump/pg_backup_db.c| 27 +--
 src/bin/pg_dump/pg_backup_directory.c |  6 --
 src/bin/pg_dump/pg_backup_null.c  |  4 
 src/bin/pg_dump/pg_backup_tar.c   |  4 
 6 files changed, 14 insertions(+), 43 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..26c2c684c8 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -4748,6 +4754,9 @@ CloneArchive(ArchiveHandle *AH)
 	/* clone has its own error count, too */
 	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
 	/*
 	 * Connect our new clone object to the database, using the same connection
 	 * parameters used for the original connection.
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index b576b29924..7c6ac89dd4 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	ctx = (lclContext *) pg_malloc0(sizeof(lclContext));
 	AH->formatData = (void *) ctx;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the file
 	 */
@@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH)
 	 * share knowledge about where the data blocks are across threads.
 	 * _PrintTocData has to be careful about the order of operations on that
 	 * state, though.
-	 *
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.
 	 */
 }
 
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index f766b65059..b297ca049d 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -544,26 +544,9 @@ CommitTransaction(Archive *AHX)
 void
 DropLOIfExists(ArchiveHandle *AH, Oid oid)
 {
-	/*
-	 * If we are not restoring to a direct database connection, we have to
-	 * guess about how to detect whether the LO exists.  Assume new-style.
-	 */
-	if (AH->connection == NULL ||
-		PQserverVersion(AH->connection) >= 9)
-	{
-		ahprintf(AH,
- "SELECT pg_catalog.lo_unlink(oid) "
- "FROM pg_catalog.pg_largeobject_metadata "
- "WHERE oid = '%u';\n",
- oid);
-	}
-	else
-	{
-		/* Restoring to pre-9.0 server, so do it the old way */
-		ahprintf(AH,
- "SELECT CASE WHEN EXISTS("
- "SELECT 1 FROM pg_catalog.pg_largeobject WHERE loid = '%u'"
- ") THEN pg_catalog.lo_unlink('%u') END;\n",
- oid, oid);
-	}
+	ahprintf(AH,
+			 "SELECT pg_catalog.lo_unlink(oid) "
+			 "FROM pg_catalog.pg_largeobject_metadata "
+			 "WHERE oid = '%u';\n",
+			 oid);
 }
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index 

Re: pg_upgrade failing for 200+ million Large Objects

2024-01-26 Thread vignesh C
On Tue, 2 Jan 2024 at 23:03, Kumar, Sachin  wrote:
>
> > On 11/12/2023, 01:43, "Tom Lane"  wrote:
>
> > I had initially supposed that in a parallel restore we could
> > have child workers also commit after every N TOC items, but was
> > soon disabused of that idea. After a worker processes a TOC
> > item, any dependent items (such as index builds) might get
> > dispatched to some other worker, which had better be able to
> > see the results of the first worker's step. So at least in
> > this implementation, we disable the multi-command-per-COMMIT
> > behavior during the parallel part of the restore. Maybe that
> > could be improved in future, but it seems like it'd add a
> > lot more complexity, and it wouldn't make life any better for
> > pg_upgrade (which doesn't use parallel pg_restore, and seems
> > unlikely to want to in future).
>
> I was not able to find the email thread that details why we are not using
> parallel pg_restore for pg_upgrade. IMHO most customers will have a single
> large database, and not using parallel restore will cause a slow pg_upgrade.
>
> I am attaching a patch which enables parallel pg_restore for the DATA and
> POST-DATA parts of the dump. It will push down the --jobs value to
> pg_restore and will restore databases sequentially.

CFBot shows that the patch does not apply anymore as in [1]:
=== Applying patches on top of PostgreSQL commit ID
46a0cd4cefb4d9b462d8cc4df5e7ecdd190bea92 ===
=== applying patch ./v9-005-parallel_pg_restore.patch
patching file src/bin/pg_upgrade/pg_upgrade.c
Hunk #3 FAILED at 650.
1 out of 3 hunks FAILED -- saving rejects to file
src/bin/pg_upgrade/pg_upgrade.c.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4713.log

Regards,
Vignesh




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Tom Lane
Nathan Bossart  writes:
> On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
>> +char   *cmdEnd = psprintf(" OWNER TO %s", 
>> fmtId(te->owner));
>> +
>> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", 
>> cmdEnd);

> This is just a nitpick, but is there any reason not to have
> IssueCommandPerBlob() accept a format string and the corresponding
> arguments?

The problem is how to combine the individual per-blob OID with the
command string given by the caller.  If we want the caller to also be
able to inject data values, I don't really see how to combine the OID
with the va_args list from the caller.  If we stick with the design
that the caller provides separate "front" and "back" strings, but ask
to be able to inject data values into those, then we need two va_args
lists which C doesn't support, or else an arbitrary decision about
which one gets the va_args.  (Admittedly, with only one caller that
needs a non-constant string, we could make such a decision; but by the
same token we'd gain little.)

It'd be notationally simpler if we could have the caller supply one
string that includes %u where the OID is supposed to go; but then
we have problems if an owner name includes %.  So on the whole I'm
not seeing anything much better than what I did.  Maybe I missed
an idea though.
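
To make the shape concrete: the caller quoted above supplies
"ALTER LARGE OBJECT " as the front string and the psprintf'd " OWNER TO ..."
as the back string, with each blob's OID dropped in between, producing one
command per blob roughly like this (the OID and owner are placeholders):

    ALTER LARGE OBJECT 16385 OWNER TO blob_owner;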

> Another nitpick: it might be worth moving these hard-wired constants to
> macros.  Without a control switch, that'd at least make it easy for anyone
> determined to change the value for their installation.

OK.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Nathan Bossart
On Fri, Jan 12, 2024 at 04:42:27PM -0600, Nathan Bossart wrote:
> On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
>> +char   *cmdEnd = psprintf(" OWNER TO %s", 
>> fmtId(te->owner));
>> +
>> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", 
>> cmdEnd);
> 
> This is just a nitpick, but is there any reason not to have
> IssueCommandPerBlob() accept a format string and the corresponding
> arguments?

Eh, I guess you'd have to find some other way of specifying where the OID
is supposed to go, which would probably be weird.  Please disregard this
one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Nathan Bossart
On Fri, Jan 05, 2024 at 03:02:34PM -0500, Tom Lane wrote:
> On further reflection, there is a very good reason why it's done like
> that.  Because pg_upgrade is doing schema-only dump and restore,
> there's next to no opportunity for parallelism within either pg_dump
> or pg_restore.  There's no data-loading steps, and there's no
> index-building either, so the time-consuming stuff that could be
> parallelized just isn't happening in pg_upgrade's usage.
> 
> Now it's true that my 0003 patch moves the needle a little bit:
> since it makes BLOB creation (as opposed to loading) parallelizable,
> there'd be some hope for parallel pg_restore doing something useful in
> a database with very many blobs.  But it makes no sense to remove the
> existing cross-database parallelism in pursuit of that; you'd make
> many more people unhappy than happy.

I assume the concern is that we'd end up multiplying the effective number
of workers if we parallelized both in-database and cross-database?  Would
it be sufficient to make those separately configurable with a note about
the multiplicative effects of setting both?  I think it'd be unfortunate if
pg_upgrade completely missed out on this improvement.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Nathan Bossart
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
> * I did not invent a switch to control the batching of blobs; it's
> just hard-wired at 1000 blobs per group here.  Probably we need some
> user knob for that, but I'm unsure if we want to expose a count or
> just a boolean for one vs more than one blob per batch.  The point of
> forcing one blob per batch would be to allow exact control during
> selective restore, and I'm not sure if there's any value in random
> other settings.  On the other hand, selective restore of blobs has
> been completely broken for the last dozen years and I can't recall any
> user complaints about that; so maybe nobody cares and we could just
> leave this as an internal choice.

I think the argument for making this configurable is that folks might have
fewer larger blobs, in which case it might make sense to lower the batch
size, or they might have many smaller blobs, in which case they might want
to increase it.  But I'm a bit skeptical that will make any sort of
tremendous difference in practice, and I'm not sure how a user would decide
on the right value besides trial-and-error.  In any case, at the moment I
think it'd be okay to keep this an internal setting and wait for feedback
from the field.

> * As the patch stands, we still build a separate TOC entry for each
> comment or seclabel or ACL attached to a blob.  If you have a lot of
> blobs with non-default properties then the TOC bloat problem comes
> back again.  We could do something about that, but it would take a bit
> of tedious refactoring, and the most obvious way to handle it probably
> re-introduces too-many-locks problems.  Is this a scenario that's
> worth spending a lot of time on?

I'll ask around about this.

> Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes.

This one looked good to me.

> Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata
>  and data entries.

I spent most of my review time reading this one.  Overall, it looks pretty
good to me, and I think you've called out most of the interesting design
choices.

> + char   *cmdEnd = psprintf(" OWNER TO %s", 
> fmtId(te->owner));
> +
> + IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", 
> cmdEnd);

This is just a nitpick, but is there any reason not to have
IssueCommandPerBlob() accept a format string and the corresponding
arguments?

> + while (n < 1000 && i + n < ntups)

Another nitpick: it might be worth moving these hard-wired constants to
macros.  Without a control switch, that'd at least make it easy for anyone
determined to change the value for their installation.

> Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA.

This one looks reasonable, too.

> In this patch I have just hard-wired pg_upgrade to use
> --transaction-size 1000.  Perhaps there would be value in adding
> another pg_upgrade option to allow user control of that, but I'm
> unsure that it's worth the trouble; I think few users would use it,
> and any who did would see not that much benefit.  However, we
> might need to adjust the logic to make the size be 1000 divided
> by the number of parallel restore jobs allowed.

I wonder if it'd be worth making this configurable for pg_upgrade as an
escape hatch in case the default setting is problematic.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-05 Thread Tom Lane
I wrote:
> "Kumar, Sachin"  writes:
>> I was not able to find the email thread that details why we are not using
>> parallel pg_restore for pg_upgrade.

> Well, it's pretty obvious isn't it?  The parallelism is being applied
> at the per-database level instead.

On further reflection, there is a very good reason why it's done like
that.  Because pg_upgrade is doing schema-only dump and restore,
there's next to no opportunity for parallelism within either pg_dump
or pg_restore.  There's no data-loading steps, and there's no
index-building either, so the time-consuming stuff that could be
parallelized just isn't happening in pg_upgrade's usage.

Now it's true that my 0003 patch moves the needle a little bit:
since it makes BLOB creation (as opposed to loading) parallelizable,
there'd be some hope for parallel pg_restore doing something useful in
a database with very many blobs.  But it makes no sense to remove the
existing cross-database parallelism in pursuit of that; you'd make
many more people unhappy than happy.

Conceivably something could be salvaged of your idea by having
pg_upgrade handle databases with many blobs differently from
those without, applying parallelism within pg_restore for the
first kind and then using cross-database parallelism for the
rest.  But that seems like a lot of complexity compared to the
possible win.

In any case I'd stay far away from using --section in pg_upgrade.
Too many moving parts there.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-02 Thread Tom Lane
"Kumar, Sachin"  writes:
>> On 11/12/2023, 01:43, "Tom Lane"  wrote:
>> ... Maybe that
>> could be improved in future, but it seems like it'd add a
>> lot more complexity, and it wouldn't make life any better for
>> pg_upgrade (which doesn't use parallel pg_restore, and seems
>> unlikely to want to in future).

> I was not able to find the email thread that details why we are not using
> parallel pg_restore for pg_upgrade.

Well, it's pretty obvious isn't it?  The parallelism is being applied
at the per-database level instead.

> IMHO most customers will have a single large database, and not using
> parallel restore will cause a slow pg_upgrade.

You've offered no justification for that opinion ...

> I am attaching a patch which enables parallel pg_restore for the DATA and
> POST-DATA parts of the dump. It will push down the --jobs value to
> pg_restore and will restore databases sequentially.

I don't think I trust this patch one bit.  It makes way too many
assumptions about how the --section options work, or even that they
will work at all in a binary-upgrade situation.  I've spent enough
time with that code to know that --section is pretty close to being
a fiction.  One point in particular is that this would change the
order of ACL restore relative to other steps, which almost certainly
will cause problems for somebody.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-02 Thread Kumar, Sachin
> On 11/12/2023, 01:43, "Tom Lane"  wrote:

> I had initially supposed that in a parallel restore we could
> have child workers also commit after every N TOC items, but was
> soon disabused of that idea. After a worker processes a TOC
> item, any dependent items (such as index builds) might get
> dispatched to some other worker, which had better be able to
> see the results of the first worker's step. So at least in
> this implementation, we disable the multi-command-per-COMMIT
> behavior during the parallel part of the restore. Maybe that
> could be improved in future, but it seems like it'd add a
> lot more complexity, and it wouldn't make life any better for
> pg_upgrade (which doesn't use parallel pg_restore, and seems
> unlikely to want to in future).

I was not able to find the email thread that details why we are not using
parallel pg_restore for pg_upgrade. IMHO most customers will have a single
large database, and not using parallel restore will cause a slow pg_upgrade.

I am attaching a patch which enables parallel pg_restore for the DATA and
POST-DATA parts of the dump. It will push down the --jobs value to pg_restore
and will restore databases sequentially.

Benchmarks

{5 million LOs 1 large DB}
Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  17.51s user 65.80s system 35% cpu 3:56.64 total


time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  17.51s user 65.85s system 34% cpu 3:58.39 total


HEAD
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  53.95s user 82.44s system 41% cpu 5:25.23 total

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  54.94s user 81.26s system 41% cpu 5:24.86 total



Fix with --jobs propagation to pg_restore {on top of v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  29.12s user 69.85s system 275% cpu 35.930 total


Although parallel restore does show a small regression in the ideal case for
pg_upgrade --jobs.


Multiple DBs {4 DBs each having 2 million LOs}

Fix with --jobs scheduling
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  31.80s user 109.52s system 120% cpu 1:57.35 total


Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  30.88s user 110.05s system 135% cpu 1:43.97 total


Regards
Sachin



v9-005-parallel_pg_restore.patch
Description: v9-005-parallel_pg_restore.patch


Re: pg_upgrade failing for 200+ million Large Objects

2023-12-28 Thread Robins Tharakan
On Thu, 28 Dec 2023 at 01:48, Tom Lane  wrote:

> Robins Tharakan  writes:
> > Applying all 4 patches, I also see good performance improvement.
> > With more Large Objects, although pg_dump improved significantly,
> > pg_restore is now comfortably an order of magnitude faster.
>
> Yeah.  The key thing here is that pg_dump can only parallelize
> the data transfer, while (with 0004) pg_restore can parallelize
> large object creation and owner-setting as well as data transfer.
> I don't see any simple way to improve that on the dump side,
> but I'm not sure we need to.  Zillions of empty objects is not
> really the use case to worry about.  I suspect that a more realistic
> case with moderate amounts of data in the blobs would make pg_dump
> look better.
>


Thanks for elaborating, and yes pg_dump times do reflect that
expectation.

The first test involved a fixed number (32k) of
Large Objects (LOs) with varying sizes - I chose that number
intentionally since this was being tested on a 32vCPU instance
and the patch employs 1k batches.


We again see that pg_restore is an order of magnitude faster.

 LO Size (bytes)   restore-HEAD   restore-patched   improvement (Nx)
               1         24.182               1.4                17x
              10         24.741               1.5                17x
             100         24.574               1.6                15x
           1,000         25.314               1.7                15x
          10,000         25.644               1.7                15x
         100,000         50.046               4.3                12x
       1,000,000        281.549              30.0                 9x


pg_dump also sees improvements. Really small LOs see a decent ~20%
improvement, which grows considerably as LOs get bigger (beyond ~10-100kb).


 LO Size (bytes)   dump-HEAD   dump-patched   improvement (%)
               1        12.9           10.7               18%
              10        12.9           10.4               19%
             100        12.8           10.3               20%
           1,000        13.0           10.3               21%
          10,000        14.2           10.3               27%
         100,000        32.8           11.5               65%
       1,000,000       211.8           23.6               89%


To test pg_restore scaling, 1 Million LOs (100kb each)
were created and pg_restore times tested for increasing
concurrency (on a 192vCPU instance). We see major speedup
upto -j64 and the best time was at -j96, after which
performance decreases slowly - see attached image.

Concurrency    pg_restore-patched
384  75.87
352  75.63
320  72.11
288  70.05
256  70.98
224  66.98
192  63.04
160  61.37
128  58.82
 96  58.55
 64  60.46
 32  77.29
 16 115.51
  8 203.48
  4 366.33



Test details:
- Command used to generate SQL - create 1k LOs of 1kb each
  - echo "SELECT lo_from_bytea(0, '\x`  printf 'ff%.0s' {1..1000}`') FROM
generate_series(1,1000);" > /tmp/tempdel
- Verify the LO size: select pg_column_size(lo_get(oid));
- Only GUC changed: max_connections=1000 (for the last test)
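
For readability, a pure-SQL sketch equivalent to that generation command
(1000 LOs of 1000 bytes of 0xff each), plus the size spot-check mentioned
above:

    SELECT lo_from_bytea(0, decode(repeat('ff', 1000), 'hex'))
    FROM generate_series(1, 1000);

    -- Verify the size of one of the created large objects:
    SELECT pg_column_size(lo_get(oid)) FROM pg_largeobject_metadata LIMIT 1;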

-
Robins Tharakan
Amazon Web Services


Re: pg_upgrade failing for 200+ million Large Objects

2023-12-27 Thread Tom Lane
Robins Tharakan  writes:
> Applying all 4 patches, I also see good performance improvement.
> With more Large Objects, although pg_dump improved significantly,
> pg_restore is now comfortably an order of magnitude faster.

Yeah.  The key thing here is that pg_dump can only parallelize
the data transfer, while (with 0004) pg_restore can parallelize
large object creation and owner-setting as well as data transfer.
I don't see any simple way to improve that on the dump side,
but I'm not sure we need to.  Zillions of empty objects is not
really the use case to worry about.  I suspect that a more realistic
case with moderate amounts of data in the blobs would make pg_dump
look better.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2023-12-20 Thread Tom Lane
Nathan Bossart  writes:
> Wow, thanks for putting together these patches.  I intend to help review,

Thanks!

> but I'm not sure I'll find much time to do so before the new year.

There's no urgency, surely.  If we can get these in during the
January CF, I'll be happy.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2023-12-20 Thread Nathan Bossart
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
> I have spent some more effort in this area and developed a patch
> series that I think addresses all of the performance issues that
> we've discussed in this thread, both for pg_upgrade and more
> general use of pg_dump/pg_restore.  Concretely, it absorbs
> the pg_restore --transaction-size switch that I proposed before
> to cut the number of transactions needed during restore, and
> rearranges the representation of BLOB-related TOC entries to
> reduce the client-side memory requirements, and fixes some
> ancient mistakes that prevent both selective restore of BLOBs
> and parallel restore of BLOBs.
> 
> As a demonstration, I made a database containing 100K empty blobs,
> and measured the time needed to dump/restore that using -Fd
> and -j 10.  HEAD doesn't get any useful parallelism on blobs,
> but with this patch series we do:
> 
>               dump    restore
> HEAD:         14sec   15sec
> after 0002:   7sec    10sec
> after 0003:   7sec    3sec

Wow, thanks for putting together these patches.  I intend to help review,
but I'm not sure I'll find much time to do so before the new year.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2023-12-20 Thread Tom Lane
I have spent some more effort in this area and developed a patch
series that I think addresses all of the performance issues that
we've discussed in this thread, both for pg_upgrade and more
general use of pg_dump/pg_restore.  Concretely, it absorbs
the pg_restore --transaction-size switch that I proposed before
to cut the number of transactions needed during restore, and
rearranges the representation of BLOB-related TOC entries to
reduce the client-side memory requirements, and fixes some
ancient mistakes that prevent both selective restore of BLOBs
and parallel restore of BLOBs.

As a demonstration, I made a database containing 100K empty blobs,
and measured the time needed to dump/restore that using -Fd
and -j 10.  HEAD doesn't get any useful parallelism on blobs,
but with this patch series we do:

              dump    restore
HEAD:         14sec   15sec
after 0002:   7sec    10sec
after 0003:   7sec    3sec
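
For anyone wanting to reproduce a comparable test database, a minimal sketch
(not the exact script used here) is simply:

    -- Create 100K empty large objects; lo_create(0) lets the server assign each OID.
    SELECT lo_create(0) FROM generate_series(1, 100000);

followed by the -Fd / -j 10 dump and restore described above.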

There are a few loose ends:

* I did not invent a switch to control the batching of blobs; it's
just hard-wired at 1000 blobs per group here.  Probably we need some
user knob for that, but I'm unsure if we want to expose a count or
just a boolean for one vs more than one blob per batch.  The point of
forcing one blob per batch would be to allow exact control during
selective restore, and I'm not sure if there's any value in random
other settings.  On the other hand, selective restore of blobs has
been completely broken for the last dozen years and I can't recall any
user complaints about that; so maybe nobody cares and we could just
leave this as an internal choice.

* Likewise, there's no user-accessible knob to control what
transaction size pg_upgrade uses.  Do we need one?  In any case, it's
likely that the default needs a bit more thought than I've given it.
I used 1000, but if pg_upgrade is launching parallel restore jobs we
likely need to divide that by the number of restore jobs.

* As the patch stands, we still build a separate TOC entry for each
comment or seclabel or ACL attached to a blob.  If you have a lot of
blobs with non-default properties then the TOC bloat problem comes
back again.  We could do something about that, but it would take a bit
of tedious refactoring, and the most obvious way to handle it probably
re-introduces too-many-locks problems.  Is this a scenario that's
worth spending a lot of time on?

More details appear in the commit messages below.  Patch 0004
is nearly the same as the v8 patch I posted before, although
it adds some logic to ensure that a large blob metadata batch
doesn't create too many locks.

Comments?

regards, tom lane

PS: I don't see any active CF entry for this thread, so
I'm going to go make one.

From eecef8f312967ff7cc0f47899c6db2c3e654371d Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Wed, 20 Dec 2023 13:52:28 -0500
Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes.

Centralize management of the lo_buf used to hold data while restoring
blobs.  The code previously had each format handler create lo_buf,
which seems rather pointless given that the format handlers all make
it the same way.  Moreover, the format handlers never use lo_buf
directly, making this setup a failure from a separation-of-concerns
standpoint.  Let's move the responsibility into pg_backup_archiver.c,
which is the only module concerned with lo_buf.  The main reason to do
this now is that it allows a centralized fix for the soon-to-be-false
assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time
since we had any need to be able to restore to a pre-9.0 server.
---
 src/bin/pg_dump/pg_backup_archiver.c  |  9 +
 src/bin/pg_dump/pg_backup_custom.c|  7 ---
 src/bin/pg_dump/pg_backup_db.c| 27 +--
 src/bin/pg_dump/pg_backup_directory.c |  6 --
 src/bin/pg_dump/pg_backup_null.c  |  4 
 src/bin/pg_dump/pg_backup_tar.c   |  4 
 6 files changed, 14 insertions(+), 43 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..26c2c684c8 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -4748,6 +4754,9 @@ CloneArchive(ArchiveHandle *AH)
 	/* clone has its own error count, too */
 	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
 	/*
 	 * Connect our new clone object to the database, using the same connection
 	 * parameters used for the original connection.
diff --git 

Re: pg_upgrade failing for 200+ million Large Objects

2023-12-10 Thread Tom Lane
I spent some time looking at the v7 patch.  I can't help feeling
that this is going off in the wrong direction, primarily for
these reasons:

* It focuses only on cutting the number of transactions needed
to restore a large number of blobs (large objects).  Certainly
that's a pain point, but it's not the only one of this sort.
If you have a lot of tables, restore will consume just as many
transactions as it would for a similar number of blobs --- probably
more, in fact, since we usually need more commands per table than
per blob.

* I'm not too thrilled with the (undocumented) rearrangements in
pg_dump.  I really don't like the idea of emitting a fundamentally
different TOC layout in binary-upgrade mode; that seems unmaintainably
bug-prone.  Plus, the XID-consumption problem is not really confined
to pg_upgrade.

What I think we actually ought to do is one of the alternatives
discussed upthread: teach pg_restore to be able to commit
every so often, without trying to provide the all-or-nothing
guarantees of --single-transaction mode.  This cuts its XID
consumption by whatever multiple "every so often" is, while
allowing us to limit the number of locks taken during any one
transaction.  It also seems a great deal safer than the idea
I floated of not taking locks at all during a binary upgrade;
plus, it has some usefulness with regular pg_restore that's not
under control of pg_upgrade.

So I had a go at coding that, and attached is the result.
It invents a --transaction-size option, and when that's active
it will COMMIT after every N TOC items.  (This seems simpler to
implement and less bug-prone than every-N-SQL-commands.)
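
Roughly, the emitted restore stream then takes this shape (an illustrative
sketch, not literal pg_restore output):

    BEGIN;
    -- ... commands for TOC items 1 .. N ...
    COMMIT;
    BEGIN;
    -- ... commands for TOC items N+1 .. 2N ...
    COMMIT;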

I had initially supposed that in a parallel restore we could
have child workers also commit after every N TOC items, but was
soon disabused of that idea.  After a worker processes a TOC
item, any dependent items (such as index builds) might get
dispatched to some other worker, which had better be able to
see the results of the first worker's step.  So at least in
this implementation, we disable the multi-command-per-COMMIT
behavior during the parallel part of the restore.  Maybe that
could be improved in future, but it seems like it'd add a
lot more complexity, and it wouldn't make life any better for
pg_upgrade (which doesn't use parallel pg_restore, and seems
unlikely to want to in future).

I've not spent a lot of effort on pg_upgrade changes here:
I just hard-wired it to select --transaction-size=1000.
Given the default lock table size of 64*100, that gives us
enough headroom for each TOC to take half a dozen locks.
We could go higher than that by making pg_upgrade force the
destination postmaster to create a larger-than-default lock
table, but I'm not sure if it's worth any trouble.  We've
already bought three orders of magnitude improvement as it
stands, which seems like enough ambition for today.  (Also,
having pg_upgrade override the user's settings in the
destination cluster might not be without downsides.)

Another thing I'm wondering about is why this is only a pg_restore
option not also a pg_dump/pg_dumpall option.  I did it like that
because --single-transaction is pg_restore only, but that seems more
like an oversight or laziness than a well-considered decision.
Maybe we should back-fill that omission; but it could be done later.

Thoughts?

regards, tom lane

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 1a23874da6..2e3ba80258 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -786,6 +786,30 @@ PostgreSQL documentation
   
  
 
+ 
+  --transaction-size=N
+  
+   
+Execute the restore as a series of transactions, each processing
+up to N database
+objects.  This option implies --exit-on-error.
+   
+   
+--transaction-size offers an intermediate choice
+between the default behavior (one transaction per SQL command)
+and -1/--single-transaction
+(one transaction for all restored objects).
+While --single-transaction has the least
+overhead, it may be impractical for large databases because the
+transaction will take a lock on each restored object, possibly
+exhausting the server's lock table space.
+Using --transaction-size with a size of a few
+thousand objects offers nearly the same performance benefits while
+capping the amount of lock table space needed.
+   
+  
+ 
+
  
   --use-set-session-authorization
   
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 9ef2f2017e..fbf5f1c515 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -149,7 +149,9 @@ typedef struct _restoreOptions
  * compression */
 	int			suppressDumpWarnings;	/* Suppress output of WARNING entries
 		 * to stderr */
-	bool		single_txn;
+
+	bool		single_txn;		/* 

Re: pg_upgrade failing for 200+ million Large Objects

2023-12-07 Thread Kumar, Sachin

> I have updated the patch to use a heuristic: during pg_upgrade we count
> large objects per database. During pg_restore execution, if a database's
> large-object count is greater than LARGE_OBJECTS_THRESOLD (1k), we will
> use --restore-blob-batch-size.


I think both SECTION_DATA and SECTION_POST_DATA can be parallelized by
pg_restore, so instead of basing the heuristic on the large-object count,
we can base it on the SECTION_DATA + SECTION_POST_DATA count.

Regards
Sachin



Re: pg_upgrade failing for 200+ million Large Objects

2023-12-04 Thread Kumar, Sachin


> "Tom Lane" mailto:t...@sss.pgh.pa.us>> wrote:

> FWIW, I agree with Jacob's concern about it being a bad idea to let
> users of pg_upgrade pass down arbitrary options to pg_dump/pg_restore.
> I think we'd regret going there, because it'd hugely expand the set
> of cases pg_upgrade has to deal with.

> Also, pg_upgrade is often invoked indirectly via scripts, so I do
> not especially buy the idea that we're going to get useful control
> input from some human somewhere. I think we'd be better off to
> assume that pg_upgrade is on its own to manage the process, so that
> if we need to switch strategies based on object count or whatever,
> we should put in a heuristic to choose the strategy automatically.
> It might not be perfect, but that will give better results for the
> pretty large fraction of users who are not going to mess with
> weird little switches.


I have updated the patch to use a heuristic: during pg_upgrade we count
large objects per database. During pg_restore execution, if a database's
large-object count is greater than LARGE_OBJECTS_THRESOLD (1k), we will
use --restore-blob-batch-size.
I also modified the pg_upgrade --jobs behavior if we have large objects
(> LARGE_OBJECTS_THRESOLD).

+  /*  Restore all the dbs where LARGE_OBJECTS_THRESOLD is not breached */
+  restore_dbs(stats, true);
+  /* reap all children */
+  while (reap_child(true) == true)
+ ;
+  /*  Restore rest of the dbs one by one with pg_restore --jobs = user_opts.jobs */
+  restore_dbs(stats, false);
   /* reap all children */
   while (reap_child(true) == true)
  ;
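
The per-database count driving this heuristic can be taken with a simple
catalog query; a sketch, assuming pg_largeobject_metadata is the source of
the count:

    -- Run once in each database of the old cluster:
    SELECT count(*) AS large_object_count FROM pg_largeobject_metadata;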

Regards
Sachin



pg_upgrade_improvements_v7.diff
Description: pg_upgrade_improvements_v7.diff


Re: pg_upgrade failing for 200+ million Large Objects

2023-11-13 Thread Kumar, Sachin
> On 09/11/2023, 18:41, "Tom Lane"  wrote:
> Um ... you didn't attach the patch?

Sorry , patch attached

Regards
Sachin




pg_upgrade_improvements_v6.diff
Description: pg_upgrade_improvements_v6.diff


Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Andres Freund
Hi, 

On November 9, 2023 10:41:01 AM PST, Tom Lane  wrote:
>Also, pg_upgrade is often invoked indirectly via scripts, so I do
>not especially buy the idea that we're going to get useful control
>input from some human somewhere.  I think we'd be better off to
>assume that pg_upgrade is on its own to manage the process, so that
>if we need to switch strategies based on object count or whatever,
>we should put in a heuristic to choose the strategy automatically.
>It might not be perfect, but that will give better results for the
>pretty large fraction of users who are not going to mess with
>weird little switches.

+1 - even leaving everything else aside, just about no user would know about 
the option. There are cases where we can't do better than giving the user 
control, but we are certainly adding options at a rate that doesn't seem 
sustainable. And here it doesn't seem that hard to do better. 

Andres 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Tom Lane
[ Jacob's email address updated ]

"Kumar, Sachin"  writes:
> Hi everyone, I want to continue this thread. I have rebased the patch to the
> latest master and fixed an issue when pg_restore prints to a file.

Um ... you didn't attach the patch?

FWIW, I agree with Jacob's concern about it being a bad idea to let
users of pg_upgrade pass down arbitrary options to pg_dump/pg_restore.
I think we'd regret going there, because it'd hugely expand the set
of cases pg_upgrade has to deal with.

Also, pg_upgrade is often invoked indirectly via scripts, so I do
not especially buy the idea that we're going to get useful control
input from some human somewhere.  I think we'd be better off to
assume that pg_upgrade is on its own to manage the process, so that
if we need to switch strategies based on object count or whatever,
we should put in a heuristic to choose the strategy automatically.
It might not be perfect, but that will give better results for the
pretty large fraction of users who are not going to mess with
weird little switches.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Kumar, Sachin




Hi everyone, I want to continue this thread. I have rebased the patch to the
latest master and fixed an issue when pg_restore prints to a file.

`
╰─$ pg_restore  dump_small.custom  --restore-blob-batch-size=2 --file=a
--
-- End BLOB restore batch
--
COMMIT;
`

> On 09/11/2023, 17:05, "Jacob Champion" wrote:
> To clarify, I agree that pg_dump should contain the core fix. What I'm
> questioning is the addition of --dump-options to make use of that fix
> from pg_upgrade, since it also lets the user do "exciting" new things
> like --exclude-schema and --include-foreign-data and so on. I don't
> think we should let them do that without a good reason.

The earlier idea was to not expose these options to users and instead use [1]:
   --restore-jobs=NUM             --jobs parameter passed to pg_restore
   --restore-blob-batch-size=NUM  number of blobs restored in one xact
But this was later expanded to use --dump-options and --restore-options [2].
With --restore-options a user can use --exclude-schema, so maybe we can go
back to [1].

[1] 
https://www.postgresql.org/message-id/a1e200e6-adde-2561-422b-a166ec084e3b%40wi3ck.info
[2] 
https://www.postgresql.org/message-id/8d8d3961-8e8b-3dbe-f911-6f418c5fb1d3%40wi3ck.info

Regards
Sachin
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2022-10-11 Thread Michael Paquier
On Thu, Sep 08, 2022 at 04:34:07PM -0700, Nathan Bossart wrote:
> On Thu, Sep 08, 2022 at 04:29:10PM -0700, Jacob Champion wrote:
>> To clarify, I agree that pg_dump should contain the core fix. What I'm
>> questioning is the addition of --dump-options to make use of that fix
>> from pg_upgrade, since it also lets the user do "exciting" new things
>> like --exclude-schema and --include-foreign-data and so on. I don't
>> think we should let them do that without a good reason.
> 
> Ah, yes, I think that is a fair point.

It has been more than four weeks since the last activity on this
thread and there has been what looks like some feedback to me, so
I have marked this as Returned with Feedback (RwF) for the time being.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade failing for 200+ million Large Objects

2022-09-08 Thread Nathan Bossart
On Thu, Sep 08, 2022 at 04:29:10PM -0700, Jacob Champion wrote:
> On Thu, Sep 8, 2022 at 4:18 PM Nathan Bossart wrote:
>> IIUC the main benefit of this approach is that it isn't dependent on
>> binary-upgrade mode, which seems to be a goal based on the discussion
>> upthread [0].
> 
> To clarify, I agree that pg_dump should contain the core fix. What I'm
> questioning is the addition of --dump-options to make use of that fix
> from pg_upgrade, since it also lets the user do "exciting" new things
> like --exclude-schema and --include-foreign-data and so on. I don't
> think we should let them do that without a good reason.

Ah, yes, I think that is a fair point.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2022-09-08 Thread Jacob Champion
On Thu, Sep 8, 2022 at 4:18 PM Nathan Bossart  wrote:
> IIUC the main benefit of this approach is that it isn't dependent on
> binary-upgrade mode, which seems to be a goal based on the discussion
> upthread [0].

To clarify, I agree that pg_dump should contain the core fix. What I'm
questioning is the addition of --dump-options to make use of that fix
from pg_upgrade, since it also lets the user do "exciting" new things
like --exclude-schema and --include-foreign-data and so on. I don't
think we should let them do that without a good reason.

Thanks,
--Jacob




Re: pg_upgrade failing for 200+ million Large Objects

2022-09-08 Thread Nathan Bossart
On Wed, Sep 07, 2022 at 02:42:05PM -0700, Jacob Champion wrote:
> Just to clarify, was Justin's statement upthread (that the XID problem
> is fixed) correct? And is this patch just trying to improve the
> remaining memory and lock usage problems?

I think "fixed" might not be totally accurate, but that is the gist.

> I took a quick look at the pg_upgrade diffs. I agree with Jan that the
> escaping problem is a pretty bad smell, but even putting that aside for
> a bit, is it safe to expose arbitrary options to pg_dump/restore during
> upgrade? It's super flexible, but I can imagine that some of those flags
> might really mess up the new cluster...
> 
> And yeah, if you choose to do that then you get to keep both pieces, I
> guess, but I like that pg_upgrade tries to be (IMO) fairly bulletproof.

IIUC the main benefit of this approach is that it isn't dependent on
binary-upgrade mode, which seems to be a goal based on the discussion
upthread [0].  I think it'd be easily possible to fix only pg_upgrade by
simply dumping and restoring pg_largeobject_metadata, as Andres suggested
in 2018 [1].  In fact, it seems like it ought to be possible to just copy
pg_largeobject_metadata's files as was done before 12a53c7.  AFAICT this
would only work for clusters upgrading from v12 and newer, and it'd break
if any of the underlying data types change their storage format.  This
seems unlikely for OIDs, but there is ongoing discussion about changing
aclitem.

I still think this is a problem worth fixing, but it's not yet clear how to
proceed.

[0] https://postgr.es/m/227228.1616259220%40sss.pgh.pa.us
[1] https://postgr.es/m/20181122001415.ef5bncxqin2y3esb%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2022-09-07 Thread Jacob Champion
On 8/24/22 17:32, Nathan Bossart wrote:
> I'd like to revive this thread, so I've created a commitfest entry [0] and
> attached a hastily rebased patch that compiles and passes the tests.  I am
> aiming to spend some more time on this in the near future.

Just to clarify, was Justin's statement upthread (that the XID problem
is fixed) correct? And is this patch just trying to improve the
remaining memory and lock usage problems?

I took a quick look at the pg_upgrade diffs. I agree with Jan that the
escaping problem is a pretty bad smell, but even putting that aside for
a bit, is it safe to expose arbitrary options to pg_dump/restore during
upgrade? It's super flexible, but I can imagine that some of those flags
might really mess up the new cluster...

And yeah, if you choose to do that then you get to keep both pieces, I
guess, but I like that pg_upgrade tries to be (IMO) fairly bulletproof.

--Jacob




Re: pg_upgrade failing for 200+ million Large Objects

2022-08-24 Thread Nathan Bossart
On Wed, Mar 24, 2021 at 12:05:27PM -0400, Jan Wieck wrote:
> On 3/24/21 12:04 PM, Jan Wieck wrote:
>> In any case I changed the options so that they behave the same way, the
>> existing -o and -O (for old/new postmaster options) work. I don't think
>> it would be wise to have option forwarding work differently between
>> options for postmaster and options for pg_dump/pg_restore.
> 
> Attaching the actual diff might help.

I'd like to revive this thread, so I've created a commitfest entry [0] and
attached a hastily rebased patch that compiles and passes the tests.  I am
aiming to spend some more time on this in the near future.

[0] https://commitfest.postgresql.org/39/3841/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c8a70d9bc1..faf1953e18 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -858,6 +858,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	 */
 	WaitForCommands(AH, pipefd);
 
+	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
 	/*
 	 * Disconnect from database and clean up.
 	 */
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index fcc5f6bd05..f16ecdecc0 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -220,6 +220,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -290,6 +292,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 233198afc0..7cfbed5e75 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,7 @@ typedef struct _parallelReadyList
 	bool		sorted;			/* are valid entries currently sorted? */
 } ParallelReadyList;
 
+static int		blobBatchCount = 0;
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			   const int compression, bool dosync, ArchiveMode mode,
@@ -266,6 +267,8 @@ CloseArchive(Archive *AHX)
 	int			res = 0;
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 
+	CommitBlobTransaction(AHX);
+
 	AH->ClosePtr(AH);
 
 	/* Close the output */
@@ -279,6 +282,23 @@ CloseArchive(Archive *AHX)
 		pg_fatal("could not close output file: %m");
 }
 
+/* Public */
+void
+CommitBlobTransaction(Archive *AHX)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+
+	if (blobBatchCount > 0)
+	{
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "-- End BLOB restore batch\n");
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "COMMIT;\n\n");
+
+		blobBatchCount = 0;
+	}
+}
+
 /* Public */
 void
 SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt)
@@ -3489,6 +3509,57 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 {
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/* We restore BLOBs in batches to reduce XID consumption */
+	if (strcmp(te->desc, "BLOB") == 0 && AH->public.blobBatchSize > 0)
+	{
+		if (blobBatchCount > 0)
+		{
+			/* We are inside a BLOB restore transaction */
+			if (blobBatchCount >= AH->public.blobBatchSize)
+			{
+				/*
+				 * We did reach the batch size with the previous BLOB.
+				 * Commit and start a new batch.
+				 */
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "-- BLOB batch size reached\n");
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "COMMIT;\n");
+				ahprintf(AH, "BEGIN;\n\n");
+
+				blobBatchCount = 1;
+			}
+			else
+			{
+				/* This one still fits into the current batch */
+				blobBatchCount++;
+			}
+		}
+		else
+		{
+			/* Not inside a transaction, start a new batch */
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- Start BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "BEGIN;\n\n");
+
+			blobBatchCount = 1;
+		}
+	}
+	else
+	{
+		/* Not a BLOB. If we have a BLOB batch open, close it. */
+		if (blobBatchCount > 0)
+		{
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- End BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "COMMIT;\n\n");
+
+			blobBatchCount = 0;
+		}
+	}
+
 	/* Select owner, schema, tablespace and default AM as necessary */
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d25709ad5f..17c0dd7f0c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -196,11 +196,20 @@ static 

Re: pg_upgrade failing for 200+ million Large Objects

2021-12-11 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 12:05:27PM -0400, Jan Wieck wrote:
> On 3/24/21 12:04 PM, Jan Wieck wrote:
> > In any case I changed the options so that they behave the same way, the
> > existing -o and -O (for old/new postmaster options) work. I don't think
> > it would be wise to have option forwarding work differently between
> > options for postmaster and options for pg_dump/pg_restore.
> 
> Attaching the actual diff might help.

I think the original issue with XIDs was fixed by 74cf7d46a.

Are you still planning to progress the patches addressing huge memory use of
pg_restore?

Note this other, old thread on -general, which I believe has variations on the
same patches.
https://www.postgresql.org/message-id/flat/7bf19bf2-e6b7-01a7-1d96-f0607c728...@wi3ck.info

There was discussion about using pg_restore --single.  Note that that was used
at some point in the past: see 12ee6ec71 and 861ad67bd.

The immediate problem is that --single conflicts with --create.
I cleaned up a patch I'd written to work around that.  It preserves DB settings
and passes pg_upgrade's test.  It's probably not portable as written, but if
need be could pass an empty file instead of /dev/null...

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 3628bd74a7..9c504aff79 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -364,6 +364,16 @@ create_new_objects(void)
		DbInfo	   *old_db = &old_cluster.dbarr.dbs[dbnum];
		const char *create_opts;
 
+		PQExpBufferData connstr,
+					escaped_connstr;
+
+		initPQExpBuffer(&connstr);
+		initPQExpBuffer(&escaped_connstr);
+		appendPQExpBufferStr(&connstr, "dbname=");
+		appendConnStrVal(&connstr, old_db->db_name);
+		appendShellString(&escaped_connstr, connstr.data);
+		termPQExpBuffer(&connstr);
+
/* Skip template1 in this pass */
if (strcmp(old_db->db_name, "template1") == 0)
continue;
@@ -378,18 +388,31 @@ create_new_objects(void)
 * propagate its database-level properties.
 */
if (strcmp(old_db->db_name, "postgres") == 0)
-   create_opts = "--clean --create";
+   create_opts = "--clean";
else
-   create_opts = "--create";
+   create_opts = "";
 
+   /* Create the DB but exclude all objects */
		parallel_exec_prog(log_file_name,
				   NULL,
				   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
				   "--create -L /dev/null "
				   "--dbname template1 \"%s\"",
				   new_cluster.bindir,
				   cluster_conn_opts(&new_cluster),
				   create_opts,
				   sql_file_name);
+
+		parallel_exec_prog(log_file_name,
+				   NULL,
+				   "\"%s/pg_restore\" %s %s --exit-on-error --verbose --single "
+				   "--dbname=%s \"%s\"",
+				   new_cluster.bindir,
+				   cluster_conn_opts(&new_cluster),
+				   create_opts,
+				   escaped_connstr.data,
+				   sql_file_name);
+
}
 
/* reap all children */





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-24 Thread Jan Wieck

On 3/24/21 12:04 PM, Jan Wieck wrote:

In any case I changed the options so that they behave the same way, the
existing -o and -O (for old/new postmaster options) work. I don't think
it would be wise to have option forwarding work differently between
options for postmaster and options for pg_dump/pg_restore.


Attaching the actual diff might help.

--
Jan Wieck
Principle Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..8331e8a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,7 @@ typedef struct _parallelReadyList
 	bool		sorted;			/* are valid entries currently sorted? */
 } ParallelReadyList;
 
+static int		blobBatchCount = 0;
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			   const int compression, bool dosync, ArchiveMode mode,
@@ -265,6 +266,8 @@ CloseArchive(Archive *AHX)
 	int			res = 0;
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 
+	CommitBlobTransaction(AHX);
+
 	AH->ClosePtr(AH);
 
 	/* Close the output */
@@ -279,6 +282,23 @@ CloseArchive(Archive *AHX)
 
 /* Public */
 void
+CommitBlobTransaction(Archive *AHX)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+
+	if (blobBatchCount > 0)
+	{
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "-- End BLOB restore batch\n");
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "COMMIT;\n\n");
+
+		blobBatchCount = 0;
+	}
+}
+
+/* Public */
+void
 SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt)
 {
 	/* Caller can omit dump options, in which case we synthesize them */
@@ -3531,6 +3551,57 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 {
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/* We restore BLOBs in batches to reduce XID consumption */
+	if (strcmp(te->desc, "BLOB") == 0 && AH->public.blobBatchSize > 0)
+	{
+		if (blobBatchCount > 0)
+		{
+			/* We are inside a BLOB restore transaction */
+			if (blobBatchCount >= AH->public.blobBatchSize)
+			{
+				/*
+				 * We did reach the batch size with the previous BLOB.
+				 * Commit and start a new batch.
+				 */
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "-- BLOB batch size reached\n");
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "COMMIT;\n");
+				ahprintf(AH, "BEGIN;\n\n");
+
+				blobBatchCount = 1;
+			}
+			else
+			{
+				/* This one still fits into the current batch */
+				blobBatchCount++;
+			}
+		}
+		else
+		{
+			/* Not inside a transaction, start a new batch */
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- Start BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "BEGIN;\n\n");
+
+			blobBatchCount = 1;
+		}
+	}
+	else
+	{
+		/* Not a BLOB. If we have a BLOB batch open, close it. */
+		if (blobBatchCount > 0)
+		{
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- End BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "COMMIT;\n\n");
+
+			blobBatchCount = 0;
+		}
+	}
+
 	/* Select owner, schema, tablespace and default AM as necessary */
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f8bec3f..f153f08 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -165,12 +165,20 @@ static void guessConstraintInheritance(TableInfo *tblinfo, int numTables);
 static void dumpComment(Archive *fout, const char *type, const char *name,
 		const char *namespace, const char *owner,
 		CatalogId catalogId, int subid, DumpId dumpId);
+static bool dumpCommentQuery(Archive *fout, PQExpBuffer 

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-24 Thread Jan Wieck

On 3/23/21 4:55 PM, Tom Lane wrote:

Jan Wieck  writes:
Have we even reached a consensus yet on that doing it the way, my patch 
is proposing, is the right way to go? Like that emitting BLOB TOC 
entries into SECTION_DATA when in binary upgrade mode is a good thing? 
Or that bunching all the SQL statements for creating the blob, changing 
the ACL and COMMENT and SECLABEL all in one multi-statement-query is.


Now you're asking for actual review effort, which is a little hard
to come by towards the tail end of the last CF of a cycle.  I'm
interested in this topic, but I can't justify spending much time
on it right now.


Understood.

In any case I changed the options so that they behave the same way, the 
existing -o and -O (for old/new postmaster options) work. I don't think 
it would be wise to have option forwarding work differently between 
options for postmaster and options for pg_dump/pg_restore.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Tom Lane
Jan Wieck  writes:
> Have we even reached a consensus yet on that doing it the way, my patch 
> is proposing, is the right way to go? Like that emitting BLOB TOC 
> entries into SECTION_DATA when in binary upgrade mode is a good thing? 
> Or that bunching all the SQL statements for creating the blob, changing 
> the ACL and COMMENT and SECLABEL all in one multi-statement-query is.

Now you're asking for actual review effort, which is a little hard
to come by towards the tail end of the last CF of a cycle.  I'm
interested in this topic, but I can't justify spending much time
on it right now.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 3:35 PM, Tom Lane wrote:

Jan Wieck  writes:
The problem here is that pg_upgrade itself is invoking a shell again. It 
is not assembling an array of arguments to pass into exec*(). I'd be a 
happy camper if it did the latter. But as things are we'd have to add 
full shell escaping for arbitrary strings.


Surely we need that (and have it already) anyway?


There are functions to shell escape a single string, like

appendShellString()

but that is hardly enough when a single optarg for --restore-option 
could look like any of


--jobs 8
--jobs=8
--jobs='8'
--jobs '8'
--jobs "8"
--jobs="8"
--dont-bother-about-jobs

When placed into a shell string, those things have very different 
effects on your args[].


I also want to say that we are overengineering this whole thing. Yes, 
there is the problem of shell quoting possibly going wrong as it passes 
from one shell to another. But for now this is all about passing a few 
numbers down from pg_upgrade to pg_restore (and eventually pg_dump).


Have we even reached a consensus yet on that doing it the way, my patch 
is proposing, is the right way to go? Like that emitting BLOB TOC 
entries into SECTION_DATA when in binary upgrade mode is a good thing? 
Or that bunching all the SQL statements for creating the blob, changing 
the ACL and COMMENT and SECLABEL all in one multi-statement-query is.


Maybe we should focus on those details before getting into all the 
parameter naming stuff.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Tom Lane
Jan Wieck  writes:
> The problem here is that pg_upgrade itself is invoking a shell again. It 
> is not assembling an array of arguments to pass into exec*(). I'd be a 
> happy camper if it did the latter. But as things are we'd have to add 
> full shell escaping for arbitrary strings.

Surely we need that (and have it already) anyway?

I think we've stayed away from exec* because we'd have to write an
emulation for Windows.  Maybe somebody will get fed up and produce
such code, but it's not likely to be the least-effort route to the
goal.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 2:59 PM, Tom Lane wrote:

Jan Wieck  writes:

On 3/23/21 2:35 PM, Tom Lane wrote:

If you're passing multiple options, that is
--pg-dump-options "--foo=x --bar=y"
it seems just horribly fragile.  Lose the double quotes and suddenly
--bar is a separate option to pg_upgrade itself, not part of the argument
for the previous option.  That's pretty easy to do when passing things
through shell scripts, too.


... which would all be really easy if pg_upgrade weren't assembling 
a shell script string to pass into parallel_exec_prog() by itself.


No, what I was worried about is shell script(s) that invoke pg_upgrade
and have to pass down some of these options through multiple levels of
option parsing.


The problem here is that pg_upgrade itself is invoking a shell again. It 
is not assembling an array of arguments to pass into exec*(). I'd be a 
happy camper if it did the latter. But as things are we'd have to add 
full shell escaping for arbitrary strings.




BTW, it doesn't seem like the "pg-" prefix has any value-add here,
so maybe "--dump-option" and "--restore-option" would be suitable
spellings.


Agreed.


Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Tom Lane
Jan Wieck  writes:
> On 3/23/21 2:35 PM, Tom Lane wrote:
>> If you're passing multiple options, that is
>> --pg-dump-options "--foo=x --bar=y"
>> it seems just horribly fragile.  Lose the double quotes and suddenly
>> --bar is a separate option to pg_upgrade itself, not part of the argument
>> for the previous option.  That's pretty easy to do when passing things
>> through shell scripts, too.

> ... which would all be really easy if pg_upgrade weren't assembling 
> a shell script string to pass into parallel_exec_prog() by itself.

No, what I was worried about is shell script(s) that invoke pg_upgrade
and have to pass down some of these options through multiple levels of
option parsing.

BTW, it doesn't seem like the "pg-" prefix has any value-add here,
so maybe "--dump-option" and "--restore-option" would be suitable
spellings.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 2:35 PM, Tom Lane wrote:

Jan Wieck  writes:

So the question remains, how do we name this?



 --pg-dump-options ""
 --pg-restore-options ""


If you're passing multiple options, that is

--pg-dump-options "--foo=x --bar=y"

it seems just horribly fragile.  Lose the double quotes and suddenly
--bar is a separate option to pg_upgrade itself, not part of the argument
for the previous option.  That's pretty easy to do when passing things
through shell scripts, too.  So it'd likely be safer to write

--pg-dump-option=--foo=x --pg-dump-option=--bar=y

which requires pg_upgrade to allow aggregating multiple options,
but you'd probably want it to act that way anyway.


... which would all be really easy if pg_upgrade weren't assembling 
a shell script string to pass into parallel_exec_prog() by itself.


But I will see what I can do ...


Regards, Jan


--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Tom Lane
Jan Wieck  writes:
> So the question remains, how do we name this?

>  --pg-dump-options ""
>  --pg-restore-options ""

If you're passing multiple options, that is

--pg-dump-options "--foo=x --bar=y"

it seems just horribly fragile.  Lose the double quotes and suddenly
--bar is a separate option to pg_upgrade itself, not part of the argument
for the previous option.  That's pretty easy to do when passing things
through shell scripts, too.  So it'd likely be safer to write

--pg-dump-option=--foo=x --pg-dump-option=--bar=y

which requires pg_upgrade to allow aggregating multiple options,
but you'd probably want it to act that way anyway.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Bruce Momjian
On Tue, Mar 23, 2021 at 02:23:03PM -0400, Jan Wieck wrote:
> On 3/23/21 2:06 PM, Bruce Momjian wrote:
> > We have the postmaster which can pass arbitrary arguments to postgres
> > processes using -o.
> 
> Right, and -o is already taken in pg_upgrade for sending options to the old
> postmaster.
> 
> What we are looking for are options for sending options to pg_dump and
> pg_restore, which are not postmasters or children of postmaster, but rather
> clients. There is no option to send options to clients of postmasters.
> 
> So the question remains, how do we name this?
> 
> --pg-dump-options ""
> --pg-restore-options ""
> 
> where "" could be something like "--whatever[=NUM] [...]" would be
> something unambiguous.

Sure.  I don't think the letter you use is a problem.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 2:06 PM, Bruce Momjian wrote:

We have the postmaster which can pass arbitrary arguments to postgres
processes using -o.


Right, and -o is already taken in pg_upgrade for sending options to the 
old postmaster.


What we are looking for are options for sending options to pg_dump and 
pg_restore, which are not postmasters or children of postmaster, but 
rather clients. There is no option to send options to clients of 
postmasters.


So the question remains, how do we name this?

--pg-dump-options ""
--pg-restore-options ""

where "" could be something like "--whatever[=NUM] [...]" would 
be something unambiguous.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Bruce Momjian
On Tue, Mar 23, 2021 at 01:25:15PM -0400, Jan Wieck wrote:
> On 3/23/21 10:56 AM, Bruce Momjian wrote:
> > Would it be better to allow pg_upgrade to pass arbitrary arguments to
> > pg_restore, instead of just these specific ones?
> > 
> 
> That would mean arbitrary parameters to pg_dump as well as pg_restore. But
> yes, that would probably be better in the long run.
> 
> Any suggestion as to how that would actually look like? Unfortunately
> pg_restore has -[dDoOr] already used, so it doesn't look like there will be
> any naturally intelligible short options for that.

We have the postmaster which can pass arbitrary arguments to postgres
processes using -o.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/23/21 10:56 AM, Bruce Momjian wrote:

On Tue, Mar 23, 2021 at 08:51:32AM -0400, Jan Wieck wrote:

On 3/22/21 7:18 PM, Jan Wieck wrote:
> On 3/22/21 5:36 PM, Zhihong Yu wrote:
> > Hi,
> > 
> > w.r.t. pg_upgrade_improvements.v2.diff.
> > 
> > +       blobBatchCount = 0;

> > +       blobInXact = false;
> > 
> > The count and bool flag are always reset in tandem. It seems

> > variable blobInXact is not needed.
> 
> You are right. I will fix that.


New patch v3 attached.


Would it be better to allow pg_upgrade to pass arbitrary arguments to
pg_restore, instead of just these specific ones?



That would mean arbitrary parameters to pg_dump as well as pg_restore. 
But yes, that would probably be better in the long run.


Any suggestion as to how that would actually look like? Unfortunately 
pg_restore has -[dDoOr] already used, so it doesn't look like there will 
be any naturally intelligible short options for that.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Bruce Momjian
On Tue, Mar 23, 2021 at 08:51:32AM -0400, Jan Wieck wrote:
> On 3/22/21 7:18 PM, Jan Wieck wrote:
> > On 3/22/21 5:36 PM, Zhihong Yu wrote:
> > > Hi,
> > > 
> > > w.r.t. pg_upgrade_improvements.v2.diff.
> > > 
> > > +       blobBatchCount = 0;
> > > +       blobInXact = false;
> > > 
> > > The count and bool flag are always reset in tandem. It seems
> > > variable blobInXact is not needed.
> > 
> > You are right. I will fix that.
> 
> New patch v3 attached.

Would it be better to allow pg_upgrade to pass arbitrary arguments to
pg_restore, instead of just these specific ones?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-23 Thread Jan Wieck

On 3/22/21 7:18 PM, Jan Wieck wrote:

On 3/22/21 5:36 PM, Zhihong Yu wrote:

Hi,

w.r.t. pg_upgrade_improvements.v2.diff.

+       blobBatchCount = 0;
+       blobInXact = false;

The count and bool flag are always reset in tandem. It seems 
variable blobInXact is not needed.


You are right. I will fix that.


New patch v3 attached.


Thanks, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..8331e8a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,7 @@ typedef struct _parallelReadyList
 	bool		sorted;			/* are valid entries currently sorted? */
 } ParallelReadyList;
 
+static int		blobBatchCount = 0;
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			   const int compression, bool dosync, ArchiveMode mode,
@@ -265,6 +266,8 @@ CloseArchive(Archive *AHX)
 	int			res = 0;
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 
+	CommitBlobTransaction(AHX);
+
 	AH->ClosePtr(AH);
 
 	/* Close the output */
@@ -279,6 +282,23 @@ CloseArchive(Archive *AHX)
 
 /* Public */
 void
+CommitBlobTransaction(Archive *AHX)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+
+	if (blobBatchCount > 0)
+	{
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "-- End BLOB restore batch\n");
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "COMMIT;\n\n");
+
+		blobBatchCount = 0;
+	}
+}
+
+/* Public */
+void
 SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt)
 {
 	/* Caller can omit dump options, in which case we synthesize them */
@@ -3531,6 +3551,57 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 {
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/* We restore BLOBs in batches to reduce XID consumption */
+	if (strcmp(te->desc, "BLOB") == 0 && AH->public.blobBatchSize > 0)
+	{
+		if (blobBatchCount > 0)
+		{
+			/* We are inside a BLOB restore transaction */
+			if (blobBatchCount >= AH->public.blobBatchSize)
+			{
+				/*
+				 * We did reach the batch size with the previous BLOB.
+				 * Commit and start a new batch.
+				 */
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "-- BLOB batch size reached\n");
+				ahprintf(AH, "--\n");
+				ahprintf(AH, "COMMIT;\n");
+				ahprintf(AH, "BEGIN;\n\n");
+
+				blobBatchCount = 1;
+			}
+			else
+			{
+				/* This one still fits into the current batch */
+				blobBatchCount++;
+			}
+		}
+		else
+		{
+			/* Not inside a transaction, start a new batch */
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- Start BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "BEGIN;\n\n");
+
+			blobBatchCount = 1;
+		}
+	}
+	else
+	{
+		/* Not a BLOB. If we have a BLOB batch open, close it. */
+		if (blobBatchCount > 0)
+		{
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- End BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "COMMIT;\n\n");
+
+			blobBatchCount = 0;
+		}
+	}
+
 	/* Select owner, schema, tablespace and default AM as necessary */
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f8bec3f..f153f08 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -165,12 +165,20 @@ static void guessConstraintInheritance(TableInfo *tblinfo, int numTables);
 static void dumpComment(Archive *fout, const char *type, const char *name,
 		const char *namespace, const char *owner,
 		CatalogId catalogId, int subid, DumpId dumpId);
+static bool dumpCommentQuery(Archive *fout, 

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-22 Thread Jan Wieck

On 3/22/21 5:36 PM, Zhihong Yu wrote:

Hi,

w.r.t. pg_upgrade_improvements.v2.diff.

+       blobBatchCount = 0;
+       blobInXact = false;

The count and bool flag are always reset in tandem. It seems 
variable blobInXact is not needed.


You are right. I will fix that.


Thanks, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-22 Thread Zhihong Yu
>
> Hi,
>
w.r.t. pg_upgrade_improvements.v2.diff.

+   blobBatchCount = 0;
+   blobInXact = false;

The count and bool flag are always reset in tandem. It seems
variable blobInXact is not needed.

Cheers


Re: pg_upgrade failing for 200+ million Large Objects

2021-03-21 Thread Andrew Dunstan


On 3/21/21 12:56 PM, Jan Wieck wrote:
> On 3/21/21 7:47 AM, Andrew Dunstan wrote:
>> One possible (probable?) source is the JDBC driver, which currently
>> treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
>> improving that some: 
>
> You mean the user is using OID columns pointing to large objects and
> the JDBC driver is mapping those for streaming operations?
>
> Yeah, that would explain a lot.
>
>
>


Probably in most cases the database is designed by Hibernate, and the
front-end programmers know nothing at all of OIDs or LOs; they just ask
for and get a Blob.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fix pg_upgrade to preserve datdba (was: Re: pg_upgrade failing for 200+ million Large Objects)

2021-03-21 Thread Tom Lane
Jan Wieck  writes:
> On 3/20/21 12:39 AM, Jan Wieck wrote:
>> On the way pg_upgrade also mangles the pg_database.datdba
>> (all databases are owned by postgres after an upgrade; will submit a
>> separate patch for that as I consider that a bug by itself).

> Patch attached.

Hmm, doesn't this lose all *other* database-level properties?

I think maybe what we have here is a bug in pg_restore, its
--create switch ought to be trying to update the database's
ownership.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-21 Thread Jan Wieck

On 3/21/21 7:47 AM, Andrew Dunstan wrote:

One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: 


You mean the user is using OID columns pointing to large objects and the 
JDBC driver is mapping those for streaming operations?


Yeah, that would explain a lot.


Thanks, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Fix pg_upgrade to preserve datdba (was: Re: pg_upgrade failing for 200+ million Large Objects)

2021-03-21 Thread Jan Wieck

On 3/20/21 12:39 AM, Jan Wieck wrote:

On the way pg_upgrade also mangles the pg_database.datdba
(all databases are owned by postgres after an upgrade; will submit a
separate patch for that as I consider that a bug by itself).


Patch attached.


Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 5d9a26c..38f7202 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -344,6 +344,7 @@ get_db_infos(ClusterInfo *cluster)
 	DbInfo	   *dbinfos;
 	int			i_datname,
 i_oid,
+i_datdba,
 i_encoding,
 i_datcollate,
 i_datctype,
@@ -351,9 +352,12 @@ get_db_infos(ClusterInfo *cluster)
 	char		query[QUERY_ALLOC];
 
 	snprintf(query, sizeof(query),
-			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "
+			 "SELECT d.oid, d.datname, u.rolname, d.encoding, "
+			 "d.datcollate, d.datctype, "
 			 "%s AS spclocation "
 			 "FROM pg_catalog.pg_database d "
+			 " JOIN pg_catalog.pg_authid u "
+			 " ON d.datdba = u.oid "
 			 " LEFT OUTER JOIN pg_catalog.pg_tablespace t "
 			 " ON d.dattablespace = t.oid "
 			 "WHERE d.datallowconn = true "
@@ -367,6 +371,7 @@ get_db_infos(ClusterInfo *cluster)
 
 	i_oid = PQfnumber(res, "oid");
 	i_datname = PQfnumber(res, "datname");
+	i_datdba = PQfnumber(res, "rolname");
 	i_encoding = PQfnumber(res, "encoding");
 	i_datcollate = PQfnumber(res, "datcollate");
 	i_datctype = PQfnumber(res, "datctype");
@@ -379,6 +384,7 @@ get_db_infos(ClusterInfo *cluster)
 	{
 		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
+		dbinfos[tupnum].db_owner = pg_strdup(PQgetvalue(res, tupnum, i_datdba));
 		dbinfos[tupnum].db_encoding = atoi(PQgetvalue(res, tupnum, i_encoding));
 		dbinfos[tupnum].db_collate = pg_strdup(PQgetvalue(res, tupnum, i_datcollate));
 		dbinfos[tupnum].db_ctype = pg_strdup(PQgetvalue(res, tupnum, i_datctype));
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index e23b8ca..8fd9a13 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -378,18 +378,36 @@ create_new_objects(void)
 		 * propagate its database-level properties.
 		 */
 		if (strcmp(old_db->db_name, "postgres") == 0)
-			create_opts = "--clean --create";
+		{
+			parallel_exec_prog(log_file_name,
+			   NULL,
+			   "\"%s/pg_restore\" %s --exit-on-error "
+			   "--verbose --clean --create "
+			   "--dbname template1 \"%s\"",
+			   new_cluster.bindir,
+			   cluster_conn_opts(&new_cluster),
+			   sql_file_name);
+		}
 		else
-			create_opts = "--create";
-
-		parallel_exec_prog(log_file_name,
-		   NULL,
-		   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
-		   "--dbname template1 \"%s\"",
-		   new_cluster.bindir,
-		   cluster_conn_opts(&new_cluster),
-		   create_opts,
-		   sql_file_name);
+		{
+			exec_prog(log_file_name, NULL, true, true,
+	  "\"%s/createdb\" -O \"%s\" %s \"%s\"",
+	  new_cluster.bindir,
+	  old_db->db_owner,
+	  cluster_conn_opts(&new_cluster),
+	  old_db->db_name);
+			parallel_exec_prog(log_file_name,
+			   NULL,
+			   "\"%s/pg_restore\" %s --exit-on-error "
+			   "--verbose "
+			   "--dbname \"%s\" \"%s\"",
+			   new_cluster.bindir,
+			   cluster_conn_opts(&new_cluster),
+			   old_db->db_name,
+			   sql_file_name);
+		}
+
+
 	}
 
 	/* reap all children */
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 919a784..a3cda97 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -177,6 +177,7 @@ typedef struct
 {
 	Oid			db_oid;			/* oid of the database */
 	char	   *db_name;		/* database name */
+	char	   *db_owner;		/* database owner */
 	char		db_tablespace[MAXPGPATH];	/* database default tablespace
 			 * path */
 	char	   *db_collate;


Re: pg_upgrade failing for 200+ million Large Objects

2021-03-21 Thread Andrew Dunstan


On 3/20/21 12:55 PM, Jan Wieck wrote:
> On 3/20/21 11:23 AM, Tom Lane wrote:
>> Jan Wieck  writes:
>>> All that aside, the entire approach doesn't scale.
>>
>> Yeah, agreed.  When we gave large objects individual ownership and ACL
>> info, it was argued that pg_dump could afford to treat each one as a
>> separate TOC entry because "you wouldn't have that many of them, if
>> they're large".  The limits of that approach were obvious even at the
>> time, and I think now we're starting to see people for whom it really
>> doesn't work.
>
> It actually looks more like some users have millions of "small
> objects". I am still wondering where that is coming from and why they
> are abusing LOs in that way, but that is more out of curiosity. Fact
> is that they are out there and that they cannot upgrade from their 9.5
> databases, which are now past EOL.
>

One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: 


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-20 Thread Jan Wieck

On 3/20/21 11:23 AM, Tom Lane wrote:

Jan Wieck  writes:

All that aside, the entire approach doesn't scale.


Yeah, agreed.  When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large".  The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.


It actually looks more like some users have millions of "small objects". 
I am still wondering where that is coming from and why they are abusing 
LOs in that way, but that is more out of curiosity. Fact is that they 
are out there and that they cannot upgrade from their 9.5 databases, 
which are now past EOL.




I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents.  That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result.  Then dump out
all the matching blobs under that heading.


What I am currently experimenting with is moving the BLOB TOC entries 
into the parallel data phase of pg_restore "when doing binary upgrade". 
It seems to scale nicely with the number of cores in the system. In 
addition to that I have options for pg_upgrade and pg_restore that cause 
the restore to batch them into transactions, like 10,000 objects at a 
time. There was a separate thread for that but I guess it is better to 
keep it all together here now.
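
To make the batching concrete, the generated restore script would look roughly
like this (a sketch only, with made-up OIDs and role names; the real per-blob
statements depend on each blob's owner, ACL, comment and security label):

    BEGIN;
    SELECT pg_catalog.lo_create('16401');
    ALTER LARGE OBJECT 16401 OWNER TO app_owner;
    GRANT SELECT ON LARGE OBJECT 16401 TO app_reader;
    SELECT pg_catalog.lo_create('16402');
    ALTER LARGE OBJECT 16402 OWNER TO app_owner;
    -- ... and so on until the batch size (say 10,000 blobs) is reached ...
    COMMIT;
    BEGIN;
    -- next batch
    COMMIT;

The point being that the restore no longer spends one transaction (and one
XID) per large object.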




A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.


I fully intend to make all this into new "options". I am afraid that 
there is no one-size-fits-all solution here.


Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks.  Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?


I'm not very fond of the idea of going lockless when at the same time 
trying to parallelize the restore phase. That can lead to really nasty 
race conditions. For now I'm aiming at batches in transactions.



Regards, Jan

--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-20 Thread Tom Lane
Bruce Momjian  writes:
> On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:
>> Of course, that just reduces the memory consumption on the client
>> side; it does nothing for the locks.  Can we get away with releasing the
>> lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

> Well, in pg_upgrade mode you can, since there are no other cluster
> users, but you might be asking for general pg_dump usage.

Yeah, this problem doesn't only affect pg_upgrade scenarios, so it'd
really be better to find a way that isn't dependent on binary-upgrade
mode.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-20 Thread Bruce Momjian
On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:
> I wonder if pg_dump could improve matters cheaply by aggregating the
> large objects by owner and ACL contents.  That is, do
> 
> select distinct lomowner, lomacl from pg_largeobject_metadata;
> 
> and make just *one* BLOB TOC entry for each result.  Then dump out
> all the matching blobs under that heading.
> 
> A possible objection is that it'd reduce the ability to restore blobs
> selectively, so maybe we'd need to make it optional.
> 
> Of course, that just reduces the memory consumption on the client
> side; it does nothing for the locks.  Can we get away with releasing the
> lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

Well, in pg_upgrade mode you can, since there are no other cluster
users, but you might be asking for general pg_dump usage.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-20 Thread Tom Lane
Jan Wieck  writes:
> On 3/8/21 11:58 AM, Tom Lane wrote:
>> So it seems like the path of least resistance is
>> (a) make pg_upgrade use --single-transaction when calling pg_restore
>> (b) document (better) how to get around too-many-locks failures.

> That would first require to fix how pg_upgrade is creating the 
> databases. It uses "pg_restore --create", which is mutually exclusive 
> with --single-transaction because we cannot create a database inside of 
> a transaction.

Ugh.

> All that aside, the entire approach doesn't scale.

Yeah, agreed.  When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large".  The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.

I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents.  That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result.  Then dump out
all the matching blobs under that heading.
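
(Purely as an illustration of that grouping, not a query from any patch: the
number of TOC entries per database would shrink to the number of distinct
owner/ACL combinations, which can be eyeballed with something like

    SELECT lomowner::regrole AS owner, lomacl::text AS acl, count(*) AS blobs
    FROM pg_catalog.pg_largeobject_metadata
    GROUP BY lomowner, lomacl::text;

casting the ACL to text just so the grouping works everywhere -- one BLOB TOC
entry per output row instead of one per large object.)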

A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.

Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks.  Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-20 Thread Andrew Dunstan


On 3/20/21 12:39 AM, Jan Wieck wrote:
> On 3/8/21 11:58 AM, Tom Lane wrote:
>> The answer up to now has been "raise max_locks_per_transaction enough
>> so you don't see the failure".  Having now consumed a little more
>> caffeine, I remember that that works in pg_upgrade scenarios too,
>> since the user can fiddle with the target cluster's postgresql.conf
>> before starting pg_upgrade.
>>
>> So it seems like the path of least resistance is
>>
>> (a) make pg_upgrade use --single-transaction when calling pg_restore
>>
>> (b) document (better) how to get around too-many-locks failures.
>
> That would first require to fix how pg_upgrade is creating the
> databases. It uses "pg_restore --create", which is mutually exclusive
> with --single-transaction because we cannot create a database inside
> of a transaction. On the way pg_upgrade also mangles the
> pg_database.datdba (all databases are owned by postgres after an
> upgrade; will submit a separate patch for that as I consider that a
> bug by itself).
>
> All that aside, the entire approach doesn't scale.
>
> In a hacked-up pg_upgrade that does "createdb" first before calling
> pg_restore with --single-transaction, I can upgrade 1M large objects with
>     max_locks_per_transaction = 5300
>     max_connections = 100
> which contradicts the docs. Need to find out where that math went off
> the rails because that config should only have room for 530,000 locks,
> not 1M. The same test fails with max_locks_per_transaction = 5200.
>
> But this would mean that one has to modify the postgresql.conf to
> something like 530,000 max_locks_per_transaction at 100
> max_connections in order to actually run a successful upgrade of 100M
> large objects. This config requires 26GB of memory just for locks. Add
> to that the memory pg_restore needs to load the entire TOC before even
> restoring a single object.
>
> Not going to work. But tests are still ongoing ...



I thought Tom's suggestion upthread:


> Would it be sane to have the backend not bother to
> take any locks in binary-upgrade mode?


was interesting. Could we do that on the restore side? After all, what
are we locking against in binary upgrade mode?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-19 Thread Jan Wieck

On 3/8/21 11:58 AM, Tom Lane wrote:

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure".  Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.


That would first require to fix how pg_upgrade is creating the 
databases. It uses "pg_restore --create", which is mutually exclusive 
with --single-transaction because we cannot create a database inside of 
a transaction. On the way pg_upgrade also mangles the pg_database.datdba 
(all databases are owned by postgres after an upgrade; will submit a 
separate patch for that as I consider that a bug by itself).


All that aside, the entire approach doesn't scale.

In a hacked-up pg_upgrade that does "createdb" first before calling 
pg_restore with --single-transaction, I can upgrade 1M large objects with

max_locks_per_transaction = 5300
max_connections = 100
which contradicts the docs. Need to find out where that math went off 
the rails because that config should only have room for 530,000 locks, 
not 1M. The same test fails with max_locks_per_transaction = 5200.


But this would mean that one has to modify the postgresql.conf to 
something like 530,000 max_locks_per_transaction at 100 max_connections 
in order to actually run a successful upgrade of 100M large objects. 
This config requires 26GB of memory just for locks. Add to that the 
memory pg_restore needs to load the entire TOC before even restoring a 
single object.


Not going to work. But tests are still ongoing ...


Regards, Jan


--
Jan Wieck
Principle Database Engineer
Amazon Web Services




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-09 Thread Justin Pryzby
On Wed, Mar 03, 2021 at 11:36:26AM +, Tharakan, Robins wrote:
> While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that the
> instance had ~200 million (in-use) Large Objects. I was able to reproduce
> this on a test instance which too fails with a similar error.

If pg_upgrade can't handle millions of objects/transactions/XIDs, that seems
like a legitimate complaint, since apparently the system is working okay
otherwise.

But it also seems like you're using it outside the range of its intended use
(See also [1]).  I'm guessing that not many people are going to spend time
running tests of pg_upgrade, each of which takes 25hr, not to mention some
multiple of 128GB RAM+swap.

Creating millions of large objects was too slow for me to test like this:
| time { echo 'begin;'; for a in `seq 1 9`; do echo '\lo_import /dev/null'; done; echo 'commit;'; } |psql -qh /tmp postgres&

This seems to be enough for what's needed:
| ALTER SYSTEM SET fsync=no; ALTER SYSTEM SET full_page_writes=no; SELECT pg_reload_conf();
| INSERT INTO pg_largeobject_metadata SELECT a, 0 FROM generate_series(10, 200111222)a;

Now, testing this, the pg_upgrade was killed after running 100min and using 60GB
RAM, so you might say that's a problem too.  I converted getBlobs() to use a
cursor, like dumpBlobs(), but it was still killed.  I think a test case and a
way to exercise this failure with a more reasonable amount of time and
resources might be a prerequisite for a patch to fix it.

pg_upgrade is meant for "immediate" upgrades, frequently allowing upgrade in
minutes, where pg_dump |pg_restore might take hours or days.  There's two
components to consider: the catalog/metadata part, and the data part.  If the
data is large (let's say more than 100GB), then pg_upgrade is expected to be an
improvement over the "dump and restore" process, which is usually infeasible
for large DBs measured in TB.

But the *catalog* part is large, and pg_upgrade still has to run pg_dump and
pg_restore.  The time to do this might dominate over the data part.  Our own
customers' DBs are 100s of GB to 10TB.  For large customers, pg_upgrade takes
45min.  In the past, we had tables with many column defaults, which caused the
dump+restore to be slow for a larger fraction of customers.

If it were me, in an EOL situation, I would look at either: 1) finding a way to
do dump+restore rather than pg_upgrade; and/or 2) separately pg_dump the large
objects, drop as many as you can, then pg_upgrade the DB, then restore the
large objects (and find a better way to store them in the future).
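
A minimal server-side sketch of option 2, swapping the pg_dump step for
lo_export()/lo_unlink() -- it assumes superuser access and a scratch
directory on the database server, and the /lo_backup path is made up:

    -- export each large object to a file, then remove it from the DB
    SELECT lo_export(oid, '/lo_backup/lo_' || oid)
      FROM pg_largeobject_metadata;
    SELECT lo_unlink(oid)
      FROM pg_largeobject_metadata;

After the upgrade, the files can be reloaded with lo_import() (or psql's
\lo_import).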

I was able to hack pg_upgrade to call pg_restore --single (with a separate
invocation to handle --create).  That passes tests...but I can't say much
beyond that.

Regarding your existing patch: "make check" only tests SQL features.
For development, you'll want to configure like:
|./configure --enable-debug --enable-cassert --enable-tap-tests
And then use "make check-world", and in particular:
time make check -C src/bin/pg_resetwal
time make check -C src/bin/pg_upgrade

I don't think pg_restore needs a user-facing option for XIDs.  I think it
should "just work", since a user might be as likely to shoot themselves in the
foot with a commandline option as they are to make an upgrade succeed that
would otherwise fail.  pg_upgrade has a --check mode, and if that passes, the
upgrade is intended to work, and not fail halfway through between the schema
dump and restore, with the expectation that the user know to rerun with some
commandline flags.  If you pursue the patch with setting a different XID
threshold, maybe you could count the number of objects to be created, or
transactions to be used, and use that as the argument to resetxlog ?  I'm not
sure, but pg_restore -l might be a good place to start looking.

I think a goal for this patch should be to allow an increased number of
objects to be handled by pg_upgrade.  Large objects may be a special case, and
increasing the number of other objects to be restored to the 100s of millions
might be unimportant.

-- 
Justin

[1] https://www.postgresql.org/message-id/502641.1606334432%40sss.pgh.pa.us
| Does pg_dump really have sane performance for that situation, or
| are we soon going to be fielding requests to make it not be O(N^2)
| in the number of listed tables?
From cfc7400bb021659d49170e8b17d067c8e1b9fa33 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 9 Mar 2021 14:06:17 -0600
Subject: [PATCH] pg_dump: use a cursor in getBlobs..

..to mitigate huge memory use in the case of millions of large objects
---
 src/bin/pg_dump/pg_dump.c | 96 +--
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index aa02ada079..3fd7f48605 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -,7 +,7 @@ getBlobs(Archive *fout)
 	BlobInfo   *binfo;
 	DumpableObject *bdata;
 	PGresult   *res;
-	int			ntups;

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Magnus Hagander
On Mon, Mar 8, 2021 at 5:58 PM Tom Lane  wrote:
>
> Magnus Hagander  writes:
> > On Mon, Mar 8, 2021 at 5:33 PM Tom Lane  wrote:
> >> It does seem that --single-transaction is a better idea than fiddling with
> >> the transaction wraparound parameters, since the latter is just going to
> >> put off the onset of trouble.  However, we'd have to do something about
> >> the lock consumption.  Would it be sane to have the backend not bother to
> >> take any locks in binary-upgrade mode?
>
> > I believe the problem occurs when writing them rather than when
> > reading them, and I don't think we have a binary upgrade mode there.
>
> You're confusing pg_dump's --binary-upgrade switch (indeed applied on
> the dumping side) with the backend's -b switch (IsBinaryUpgrade,
> applied on the restoring side).

Ah. Yes, I am.


> > We could invent one of course. Another option might be to exclusively
> > lock pg_largeobject, and just say that if you do that, we don't have
> > to lock the individual objects (ever)?
>
> What was in the back of my mind is that we've sometimes seen complaints
> about too many locks needed to dump or restore a database with $MANY
> tables; so the large-object case seems like just a special case.

It is -- but I guess it's more likely to have 100M large objects than
to have 100M tables (and the cutoff point comes a lot earlier than
100M). But the fundamental one is the same.


> The answer up to now has been "raise max_locks_per_transaction enough
> so you don't see the failure".  Having now consumed a little more
> caffeine, I remember that that works in pg_upgrade scenarios too,
> since the user can fiddle with the target cluster's postgresql.conf
> before starting pg_upgrade.
>
> So it seems like the path of least resistance is
>
> (a) make pg_upgrade use --single-transaction when calling pg_restore
>
> (b) document (better) how to get around too-many-locks failures.

Agreed. Certainly seems like a better path forward than arbitrarily
pushing the limit on number of transactions which just postpones the
problem.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Tom Lane
Magnus Hagander  writes:
> On Mon, Mar 8, 2021 at 5:33 PM Tom Lane  wrote:
>> It does seem that --single-transaction is a better idea than fiddling with
>> the transaction wraparound parameters, since the latter is just going to
>> put off the onset of trouble.  However, we'd have to do something about
>> the lock consumption.  Would it be sane to have the backend not bother to
>> take any locks in binary-upgrade mode?

> I believe the problem occurs when writing them rather than when
> reading them, and I don't think we have a binary upgrade mode there.

You're confusing pg_dump's --binary-upgrade switch (indeed applied on
the dumping side) with the backend's -b switch (IsBinaryUpgrade,
applied on the restoring side).

> We could invent one of course. Another option might be to exclusively
> lock pg_largeobject, and just say that if you do that, we don't have
> to lock the individual objects (ever)?

What was in the back of my mind is that we've sometimes seen complaints
about too many locks needed to dump or restore a database with $MANY
tables; so the large-object case seems like just a special case.

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure".  Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.
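
For the archives, that tweak is just a one-liner on the target cluster --
either edit postgresql.conf directly, or start the new cluster once and use
ALTER SYSTEM (the value is only illustrative and needs sizing against the
number of objects):

    ALTER SYSTEM SET max_locks_per_transaction = 4096;
    -- stop the new cluster again afterwards so the setting is in place
    -- when pg_upgrade runs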

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Magnus Hagander
On Mon, Mar 8, 2021 at 5:33 PM Tom Lane  wrote:
>
> Robins Tharakan  writes:
> > On Mon, 8 Mar 2021 at 23:34, Magnus Hagander  wrote:
> >> Without looking, I would guess it's the schema reload using
> >> pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
> >> issue in pg_dump/pg_restore. And if that is the case -- perhaps just
> >> running all of those in a single transaction would be a better choice?
> >> One could argue it's still not a proper fix, because we'd still have a
> >> huge memory usage etc, but it would then only burn 1 xid instead of
> >> 500M...
>
> > (I hope I am not missing something but) When I tried to force pg_restore to
> > use a single transaction (by hacking pg_upgrade's pg_restore call to use
> > --single-transaction), it too failed owing to being unable to lock so many
> > objects in a single transaction.
>
> It does seem that --single-transaction is a better idea than fiddling with
> the transaction wraparound parameters, since the latter is just going to
> put off the onset of trouble.  However, we'd have to do something about
> the lock consumption.  Would it be sane to have the backend not bother to
> take any locks in binary-upgrade mode?

I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.

We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?
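
(Illustratively, that would amount to the restore doing something like

    BEGIN;
    LOCK TABLE pg_catalog.pg_largeobject IN ACCESS EXCLUSIVE MODE;
    -- ... create/alter all large objects without per-object locks ...
    COMMIT;

instead of taking one lock per object.)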

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Tom Lane
Robins Tharakan  writes:
> On Mon, 8 Mar 2021 at 23:34, Magnus Hagander  wrote:
>> Without looking, I would guess it's the schema reload using
>> pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
>> issue in pg_dump/pg_restore. And if that is the case -- perhaps just
>> running all of those in a single transaction would be a better choice?
>> One could argue it's still not a proper fix, because we'd still have a
>> huge memory usage etc, but it would then only burn 1 xid instead of
>> 500M...

> (I hope I am not missing something but) When I tried to force pg_restore to
> use a single transaction (by hacking pg_upgrade's pg_restore call to use
> --single-transaction), it too failed owing to being unable to lock so many
> objects in a single transaction.

It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble.  However, we'd have to do something about
the lock consumption.  Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Robins Tharakan
Hi Magnus,

On Mon, 8 Mar 2021 at 23:34, Magnus Hagander  wrote:

> AFAICT at a quick check, pg_dump in binary upgrade mode emits one
> lo_create() and one ALTER ... OWNER TO for each large object - so with
> 500M large objects that would be a billion statements, and thus a
> billion xids. And without checking, I'm fairly sure it doesn't load in
> a single transaction...
>

Your assumptions are pretty much correct.

The issue isn't with pg_upgrade itself. During pg_restore, each Large
Object (and, separately, each ALTER LARGE OBJECT OWNER TO) consumes an
XID. For background, that's the reason the v9.5 production instance I was
reviewing was unable to process more than 73 Million large objects, since
each object required a CREATE + ALTER. (To clarify, 73 million = (2^31 - 2
billion magic constant - 1 Million wraparound protection) / 2)
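
Spelled out, that head-room calculation is simply

    SELECT (2147483648 - 2000000000 - 1000000) / 2;   -- 73241824

i.e. roughly 73 million CREATE + ALTER pairs before wraparound protection
kicks in.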


> Without looking, I would guess it's the schema reload using
> pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
> issue in pg_dump/pg_restore. And if that is the case -- perhaps just
> running all of those in a single transaction would be a better choice?
> One could argue it's still not a proper fix, because we'd still have a
> huge memory usage etc, but it would then only burn 1 xid instead of
> 500M...
>
(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.


> This still seems to just fix the symptoms and not the actual problem.
>

I agree that the patch doesn't address the root cause, but it did get the
upgrade to complete on a test setup. Do you think that (instead of all
objects) batching multiple Large Objects in a single transaction (and
allowing the caller to size that batch via the command line) would be a
good / acceptable idea here?
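
A rough sketch of what that batching would mean on the restore side (the
OIDs and the owner name are placeholders, and N would come from the
proposed command-line option):

    BEGIN;
    SELECT pg_catalog.lo_create('16384');
    ALTER LARGE OBJECT 16384 OWNER TO app_owner;
    SELECT pg_catalog.lo_create('16385');
    ALTER LARGE OBJECT 16385 OWNER TO app_owner;
    -- ... repeat up to N objects per batch ...
    COMMIT;

so a batch of N objects burns one XID instead of 2N.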

> Please take a look at your email configuration -- all your emails are
> lacking both References and In-reply-to headers.
>

Thanks for highlighting the cause here. Hopefully switching mail clients
would help.
-
Robins Tharakan


Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Magnus Hagander
On Mon, Mar 8, 2021 at 12:02 PM Tharakan, Robins  wrote:
>
> Thanks Peter.
>
> The original email [1] had some more context that somehow didn't get
> associated with this recent email. Apologies for any confusion.

Please take a look at your email configuration -- all your emails are
lacking both References and In-reply-to headers, so every email starts
a new thread, both for each reader and in the archives. It seems quite
broken. It makes it very hard to follow.


> In short, pg_resetxlog (and pg_resetwal) employs a magic constant [2] (for
> both v9.6 and master) which seems to have been selected to force an
> aggressive autovacuum as soon as the upgrade completes. Although that works
> as planned, it narrows the window of Transaction IDs available for the
> upgrade (before which XID wraparound protection kicks in and aborts the
> upgrade) to 146 Million.
>
> Reducing this magic constant allows a larger XID window, which is what the
> patch is trying to do. With the patch, I was able to upgrade a cluster with
> 500m Large Objects successfully (which otherwise reliably fails). In the
> original email [1] I had also listed a few other possible workarounds, but
> was unsure which would be a good direction to start working on; thus this
> patch to make a start.

This still seems to just fix the symptoms and not the actual problem.

What part of the pg_upgrade process is it that actually burns through
that many transactions?

Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...

AFAICT at a quick check, pg_dump in binary upgrade mode emits one
lo_create() and one ALTER ... OWNER TO for each large object - so with
500M large objects that would be a billion statements, and thus a
billion xids. And without checking, I'm fairly sure it doesn't load in
a single transaction...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Tharakan, Robins
Thanks Peter.

The original email [1] had some more context that somehow didn't get
associated with this recent email. Apologies for any confusion.

In short, pg_resetxlog (and pg_resetwal) employs a magic constant [2] (for
both v9.6 and master) which seems to have been selected to force an
aggressive autovacuum as soon as the upgrade completes. Although that works
as planned, it narrows the window of Transaction IDs available for the
upgrade (before which XID wraparound protection kicks in and aborts the
upgrade) to 146 Million.

Reducing this magic constant allows a larger XID window, which is what the
patch is trying to do. With the patch, I was able to upgrade a cluster with
500m Large Objects successfully (which otherwise reliably fails). In the
original email [1] I had also listed a few other possible workarounds, but
was unsure which would be a good direction to start working on; thus this
patch to make a start.

Reference:
1) https://www.postgresql.org/message-id/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com
2) https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L444

-
robins | tharar@ | syd12


> -Original Message-
> From: Peter Eisentraut 
> Sent: Monday, 8 March 2021 9:25 PM
> To: Tharakan, Robins ; pgsql-hack...@postgresql.org
> Subject: [EXTERNAL] [UNVERIFIED SENDER] Re: pg_upgrade failing for 200+
> million Large Objects
> 
> 
> On 07.03.21 09:43, Tharakan, Robins wrote:
> > Attached is a proof-of-concept patch that allows Postgres to perform
> > pg_upgrade if the instance has Millions of objects.
> >
> > It would be great if someone could take a look and see if this patch
> > is in the right direction. There are some pending tasks (such as
> > documentation / pg_resetxlog vs pg_resetwal related changes) but for
> > now, the patch helps remove a stalemate where if a Postgres instance
> > has a large number (accurately speaking 146+ Million) of Large
> > Objects, pg_upgrade fails. This is easily reproducible and besides
> > deleting Large Objects before upgrade, there is no other (apparent) way
> for pg_upgrade to complete.
> >
> > The patch (attached):
> > - Applies cleanly on REL9_6_STABLE -
> > c7a4fc3dd001646d5938687ad59ab84545d5d043
> > - 'make check' passes
> > - Allows the user to provide a constant via pg_upgrade command-line,
> > that overrides the 2 billion constant in pg_resetxlog [1] thereby
> > increasing the (window of) Transaction IDs available for pg_upgrade to
> complete.
> 
> Could you explain what your analysis of the problem is and why this patch
> (might) fix it?
> 
> Right now, all I see here is, pass a big number via a command-line option
> and hope it works.




RE: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Tharakan, Robins
Thanks Daniel for the input / next-steps.

I see that 'master' too has this same magic constant [1] and so I expect it
to have similar restrictions, although I haven't tested this yet.

I do agree that the need, then, is to re-submit a patch that works against
'master'. I am travelling the next few days, but unless discussions go off
on a tangent I expect to follow up with an updated patch by the end of this
week and create a commitfest entry while at it.

Reference:
1) https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L444

-
Robins Tharakan

> -Original Message-
> From: Daniel Gustafsson 
> Sent: Monday, 8 March 2021 9:42 AM
> To: Tharakan, Robins 
> Cc: pgsql-hack...@postgresql.org
> Subject: RE: [EXTERNAL] pg_upgrade failing for 200+ million Large Objects
> 
> 
> > On 7 Mar 2021, at 09:43, Tharakan, Robins  wrote:
> 
> > The patch (attached):
> > - Applies cleanly on REL9_6_STABLE -
> > c7a4fc3dd001646d5938687ad59ab84545d5d043
> 
> Did you target 9.6 because that's where you want to upgrade to, or is
> this not a problem on HEAD?  If it's still a problem on HEAD you should
> probably submit the patch against there.  You probably also want to add
> it to the next commit fest to make sure it's not forgotten about:
> https://commitfest.postgresql.org/33/
> 
> > I am not married to the patch (especially the argument name) but
> > ideally I'd prefer a way to get this upgrade going without a patch.
> > For now, I am unable to find any other way to upgrade a v9.5 Postgres
> > database in this scenario, facing End-of-Life.
> 
> It's obviously not my call to make in any shape or form, but this doesn't
> really seem to fall under what is generally backported into a stable
> release?
> 
> --
> Daniel Gustafsson   https://vmware.com/





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Peter Eisentraut

On 07.03.21 09:43, Tharakan, Robins wrote:

> Attached is a proof-of-concept patch that allows Postgres to perform
> pg_upgrade if the instance has Millions of objects.
> 
> It would be great if someone could take a look and see if this patch is in
> the right direction. There are some pending tasks (such as documentation /
> pg_resetxlog vs pg_resetwal related changes) but for now, the patch helps
> remove a stalemate where if a Postgres instance has a large number
> (accurately speaking 146+ Million) of Large Objects, pg_upgrade fails. This
> is easily reproducible and besides deleting Large Objects before upgrade,
> there is no other (apparent) way for pg_upgrade to complete.
> 
> The patch (attached):
> - Applies cleanly on REL9_6_STABLE -
> c7a4fc3dd001646d5938687ad59ab84545d5d043
> - 'make check' passes
> - Allows the user to provide a constant via pg_upgrade command-line, that
> overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
> (window of) Transaction IDs available for pg_upgrade to complete.


Could you explain what your analysis of the problem is and why this 
patch (might) fix it?


Right now, all I see here is, pass a big number via a command-line 
option and hope it works.





Re: pg_upgrade failing for 200+ million Large Objects

2021-03-07 Thread Daniel Gustafsson
> On 7 Mar 2021, at 09:43, Tharakan, Robins  wrote:

> The patch (attached):
> - Applies cleanly on REL9_6_STABLE -
> c7a4fc3dd001646d5938687ad59ab84545d5d043

Did you target 9.6 because that's where you want to upgrade to, or is this not
a problem on HEAD?  If it's still a problem on HEAD you should probably submit
the patch against there.  You probably also want to add it to the next commit
fest to make sure it's not forgotten about: 
https://commitfest.postgresql.org/33/

> I am not married to the patch (especially the argument name) but ideally I'd
> prefer a way to get this upgrade going without a patch. For now, I am unable
> to find any other way to upgrade a v9.5 Postgres database in this scenario,
> facing End-of-Life.

It's obviously not my call to make in any shape or form, but this doesn't
really seem to fall under what is generally backported into a stable release?

--
Daniel Gustafsson   https://vmware.com/