Re: optimize lookups in snapshot [sub]xip arrays

2022-07-23 Thread Zhang Mingli
Hi, all


> 
>   if (!snapshot->suboverflowed)
>   {
>   /* we have full data, so search subxip */
> - int32   j;
> -
> - for (j = 0; j < snapshot->subxcnt; j++)
> - {
> - if (TransactionIdEquals(xid, snapshot->subxip[j]))
> - return true;
> - }
> + if (XidInXip(xid, snapshot->subxip, snapshot->subxcnt,
> +  &snapshot->subxiph))
> + return true;
>  
>   /* not there, fall through to search xip[] */
>   }


If snapshot->suboverflowed is false, then subxcnt must be less than
PGPROC_MAX_CACHED_SUBXIDS, which is currently 64.

And we won't use the hash if xcnt is less than XIP_HASH_MIN_ELEMENTS, which
is currently 128 in the patch under discussion.

So the subxip hash table will never be used, right?
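To make the arithmetic concrete, here is a minimal standalone sketch of the
threshold logic in question; the two constants come from this thread, while
the function itself is only an illustration, not the patch's actual
XidInXip():

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define PGPROC_MAX_CACHED_SUBXIDS 64	/* per-backend subxid cache size */
#define XIP_HASH_MIN_ELEMENTS	  128	/* hash threshold under discussion */

static bool
xid_in_array(TransactionId xid, const TransactionId *xip, int xcnt)
{
	if (xcnt >= XIP_HASH_MIN_ELEMENTS)
	{
		/*
		 * The patch would lazily build and probe a hash table here.  If
		 * subxcnt really is bounded by PGPROC_MAX_CACHED_SUBXIDS (64),
		 * this branch is unreachable for subxip, which is exactly the
		 * question above.  This sketch simply falls through.
		 */
	}
	/* small array: plain linear search, no hash table is built */
	for (int i = 0; i < xcnt; i++)
		if (xip[i] == xid)
			return true;
	return false;
}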

Regards,

Zhang Mingli


> On Jul 14, 2022, at 01:09, Nathan Bossart  wrote:
> 
> Hi hackers,
> 
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60-second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
> 
> writers  HEAD  patch  diff
>
> 16       659   664    +1%
> 32       645   663    +3%
> 64       659   692    +5%
> 128      641   716    +12%
> 256      619   610    -1%
> 512      530   702    +32%
> 768      469   582    +24%
> 1000     367   577    +57%
> 
> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
> 
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
> 
> Thoughts?
> 
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
> 
> -- 
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
> 





Re: redacting password in SQL statement in server log

2022-07-23 Thread Zhihong Yu
On Sat, Jul 23, 2022 at 5:27 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > Currently, in situations such as duplicate role creation, the server log
> > would show something such as the following:
>
> > 2022-07-22 13:48:18.251 UTC [330] STATEMENT:  CREATE ROLE test WITH LOGIN
> > PASSWORD 'foobar';
>
> > The password itself should be redacted before logging the statement.
>
> This has been proposed multiple times, and rejected multiple times,
> primarily because it offers only false security: you'll never cover
> all the cases.  (The proposed patch manages to create a bunch of
> false positives to go along with its false negatives, too.)
>
> The only safe answer is to be sure to keep the server log contents
> secure.  Please see prior discussions in the archives.
>
> regards, tom lane
>

Pardon my laziness.

I will pay more attention.


Re: redacting password in SQL statement in server log

2022-07-23 Thread Tom Lane
Zhihong Yu  writes:
> Currently, in situations such as duplicate role creation, the server log
> would show something such as the following:

> 2022-07-22 13:48:18.251 UTC [330] STATEMENT:  CREATE ROLE test WITH LOGIN
> PASSWORD 'foobar';

> The password itself should be redacted before logging the statement.

This has been proposed multiple times, and rejected multiple times,
primarily because it offers only false security: you'll never cover
all the cases.  (The proposed patch manages to create a bunch of
false positives to go along with its false negatives, too.)

The only safe answer is to be sure to keep the server log contents
secure.  Please see prior discussions in the archives.

regards, tom lane




Re: Cleaning up historical portability baggage

2022-07-23 Thread Tom Lane
Thomas Munro  writes:
> Here are some more, a couple of which I posted before but I've now
> gone a bit further with them in terms of removing configure checks
> etc:

After looking through these briefly, I'm pretty concerned about
whether this won't break our Cygwin build in significant ways.
For example, lorikeet reports "HAVE_SETSID 1", a condition that
you want to replace with !WIN32.  The question here is whether
or not WIN32 is defined in a Cygwin build.  I see some places
in our code that believe it is not, but others that believe that
it is --- and the former ones are mostly like
#if defined(__CYGWIN__) || defined(WIN32)
which means they wouldn't actually fail if they are wrong about that.

More generally, I'm not exactly convinced that changes like
this are a readability improvement:

-#ifdef HAVE_SETSID
+#ifndef WIN32

I'd rather not have the code cluttered with a sea of
indistinguishable "#ifndef WIN32" tests when some of them could be
more specific and more mnemonic.  So I think we'd be better off
leaving that as-is.  I don't mind nuking the configure-time test
and hard-wiring "#define HAVE_SETSID 1" somewhere, but changing
the code's #if tests doesn't seem to bring any advantage.
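
For concreteness, the hard-wiring could look about like this; the guard and
the header it would land in are assumptions, not a concrete proposal:

/* in some central, port-specific header: */
#ifndef WIN32
#define HAVE_SETSID 1
#endif

/* call sites keep the mnemonic test unchanged: */
#ifdef HAVE_SETSID
	setsid();
#endif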

Specific to 0001, I don't especially like what you did to
src/port/dlopen.c.  The original intent (and reality until
not so long ago) was that that would be a container for
various dlopen replacements.  Well, okay, maybe there will
never be any more besides Windows, but I think then we should
either rename the file to (say) win32dlopen.c or move it to
src/backend/port/win32.  Likewise for link.c in 0007 and
pread.c et al in 0011.  (But 0010 is fine, because the
replacement code is already handled that way.)

OTOH, 0012 seems to immediately change pread.c et al back
to NOT being Windows-only, though it's hard to tell for
sure because the configure support seems all wrong.
I'm quite confused by those two patches ... are they really
correct?

regards, tom lane




redacting password in SQL statement in server log

2022-07-23 Thread Zhihong Yu
Hi,
Currently, in situations such as duplicate role creation, the server log
would show something such as the following:

2022-07-22 13:48:18.251 UTC [330] STATEMENT:  CREATE ROLE test WITH LOGIN
PASSWORD 'foobar';

The password itself should be redacted before logging the statement.

Here is sample output with the patch applied:

2022-07-23 23:28:20.359 UTC [16850] ERROR:  role "test" already exists
2022-07-23 23:28:20.359 UTC [16850] STATEMENT:  CREATE ROLE test WITH LOGIN
PASSWORD

Please take a look at the short patch.
I know variables should be declared at the start of the func - I can do
that once the approach is confirmed.

Cheers


redact-password-in-log.patch
Description: Binary data


Re: Cleaning up historical portability baggage

2022-07-23 Thread Tom Lane
Thomas Munro  writes:
> Some of these depend on SUSv2 options (not just "base"), but we
> already do that (fsync, ...) and they're all features that are by now
> ubiquitous, which means the fallback code is untested and the probes
> are pointless.

Reading this, it occurred to me that it'd be interesting to scrape
all of the latest configure results from the buildfarm, and see which
tests actually produce more than one answer among the set of tested
platforms.  Those that don't could be targets for further simplification,
or else an indicator that we'd better go find some more animals.

Before I go off and do that, though, I wonder if you already did.

regards, tom lane




PSA for folks forwarding personal email domains to Gmail

2022-07-23 Thread Tom Lane
If you happen to have noticed that you aren't getting any email
directly from me, or other people who set an SPF policy for their
domain, the reason might be this:

: host gmail-smtp-in.l.google.com[74.125.140.26] said:
550-5.7.26 The MAIL FROM domain [sss.pgh.pa.us] has an SPF record with a hard
550-5.7.26 fail policy (-all) but it fails to pass SPF checks with the ip:
550-5.7.26 [redacted]. To best protect our users from spam and phishing,
550-5.7.26 the message has been blocked. Please visit
550-5.7.26 https://support.google.com/mail/answer/81126#authentication for more
550-5.7.26 information.

I've been seeing these bounces from a number of PG people for a couple
of months now.  The messages didn't use to be quite this explicit, but
it seems absolutely clear now that the recipient's private email domain is
trying to forward their email to a Gmail account, and it ain't working
because the mail's envelope sender is still me.  Gmail looks at my SPF
record, notes that the mail is not coming from my IP address, and
bounces it.  Unfortunately it bounces it to me, who can't do anything
about the misconfiguration.

If you want to do this kind of forwarding, please fix your mail
processing recipe so that the outgoing envelope sender is yourself,
not the incoming sender.

For extra credit, you could lobby Gmail to think a bit harder about
who they send bounces to.  From my perspective this behavior is not
much better than a spam amplifier.

regards, tom lane




Re: Handle infinite recursion in logical replication setup

2022-07-23 Thread vignesh C
On Fri, Jul 22, 2022 at 10:17 AM Amit Kapila  wrote:
>
> On Fri, Jul 22, 2022 at 1:39 AM Jonathan S. Katz  wrote:
> >
> > Thanks for the work on this feature -- this is definitely very helpful
> > towards supporting more types of use cases with logical replication!
> >
> > I've read through the proposed documentation and did some light testing
> > of the patch. I have two general comments about the docs as they
> > currently read:
> >
> > 1. I'm concerned that by calling this "Bidirectional replication" in the
> > docs we are overstating the current capabilities. I think this is
> > accentuated in the opening paragraph:
> >
> > ==snip==
> >   Bidirectional replication is useful for creating a multi-master database
> >   environment for replicating read/write operations performed by any of the
> >   member nodes.
> > ==snip==
> >
> > For one, we're not replicating reads, we're replicating writes. Amongst
> > the writes, at this point we're only replicating DML. A reader could
> > think that deploying this provides a full bidirectional solution.
> >
> > (Even if we're aspirationally calling this section "Bidirectional
> > replication", that does make it sound like we're limited to two nodes,
> > when we can support more than two).
> >
>
> Right, I think the system can support N-Way replication.
>
> > Perhaps "Logical replication between writers" or "Logical replication
> > between primaries" or "Replicating changes between primaries", or
> > something better.
> >
>
> Among the above "Replicating changes between primaries" sounds good to
> me or simply "Replication between primaries". As this is a sub-section
> on the Logical Replication page, I feel it is okay to not use Logical
> in the title.

I have changed it to "Replication between primaries".

> > 2. There is no mention of conflicts in the documentation, e.g.
> > referencing the "Conflicts" section of the documentation. It's very easy
> > to create a conflicting transaction that causes a subscriber to be
> > unable to continue to apply transactions:
> >
> >-- DB 1
> >CREATE TABLE abc (id int);
> >CREATE PUBLICATION node1 FOR ALL TABLES ;
> >
> >-- DB2
> >CREATE TABLE abc (id int);
> >CREATE PUBLICATION node2 FOR ALL TABLES ;
> >CREATE SUBSCRIPTION node2_node1
> >  CONNECTION 'dbname=logi port=5433'
> >  PUBLICATION node1
> >  WITH (copy_data = off, origin = none);
> >
> >-- DB1
> >CREATE SUBSCRIPTION node1_node2
> >  CONNECTION 'dbname=logi port=5434'
> >  PUBLICATION node2
> >  WITH (copy_data = off, origin = none);
> >INSERT INTO abc VALUES (1);
> >
> >-- DB2
> >INSERT INTO abc VALUES (2);
> >
> >-- DB1
> >ALTER TABLE abc ADD PRIMARY KEY (id);
> >INSERT INTO abc VALUES (3);
> >
> >-- DB2
> >INSERT INTO abc VALUES (3);
> >
> >-- DB1 cannot apply the transactions
> >
> > At a minimum, I think we should reference the documentation we have in
> > the logical replication section on conflicts. We may also want to advise
> > that a user is responsible for designing their schemas in a way to
> > minimize the risk of conflicts.
> >
>
> This sounds reasonable to me.
>
> One more point about docs, it appears to be added as the last
> sub-section on the Logical Replication page. Is there a reason for
> doing so? I feel this should be third sub-section after describing
> Publication and Subscription.

I had initially kept it at the end since it demonstrates the various
steps to create replication between the primaries. Since it gives
an introduction to "Replication between primaries" and then states the
steps, it looks OK to move it as suggested. I have modified this in the
v38 patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm01x0sLz8YzfCSjxcMFxM4NDQxcFzZa%2B4eesUmD40DdTg%40mail.gmail.com

Regards,
Vignesh




Re: Handle infinite recursion in logical replication setup

2022-07-23 Thread vignesh C
On Fri, Jul 22, 2022 at 1:39 AM Jonathan S. Katz  wrote:
>
> Hi,
>
> On 7/21/22 6:34 AM, vignesh C wrote:
> > On Thu, Jul 21, 2022 at 2:06 PM Amit Kapila  wrote:
> >>
> >> On Wed, Jul 20, 2022 at 2:33 PM vignesh C  wrote:
> >>>
> >>> Modified. Apart from this I have run pgperltidy on the perl file and
> >>> renamed 032_origin.pl to 030_origin.pl as currently there is
> >>> 029_on_error.pl, 031_column_list.pl and there is no 030_*.pl file.
> >>> Thanks for the comment, the attached patch has the changes for the same.
> >>>
> >>
> >> Pushed. Kindly rebase the remaining patches.
> >
> > Thanks for pushing the patch.
> > The attached v37 version contains the rebased patch for the remaining 
> > patches.
>
> Thanks for the work on this feature -- this is definitely very helpful
> towards supporting more types of use cases with logical replication!
>
> I've read through the proposed documentation and did some light testing
> of the patch. I have two general comments about the docs as they
> currently read:
>
> 1. I'm concerned that by calling this "Bidirectional replication" in the
> docs we are overstating the current capabilities. I think this is
> accentuated in the opening paragraph:
>
> ==snip==
>   Bidirectional replication is useful for creating a multi-master database
>   environment for replicating read/write operations performed by any of the
>   member nodes.
> ==snip==
>
> For one, we're not replicating reads, we're replicating writes. Amongst
> the writes, at this point we're only replicating DML. A reader could
> think that deploying this provides a full bidirectional solution.

I have changed read/write operations to write operations. I have also
added a note saying "The logical replication restrictions apply to
replication between primaries as well." to clarify that non-DML
operations and other restrictions apply in this case too.

> (Even if we're aspirationally calling this section "Bidirectional
> replication", that does make it sound like we're limited to two nodes,
> when we can support more than two).
>
> Perhaps "Logical replication between writers" or "Logical replication
> between primaries" or "Replicating changes between primaries", or
> something better.

I have changed it to "Replication between primaries".

> 2. There is no mention of conflicts in the documentation, e.g.
> referencing the "Conflicts" section of the documentation. It's very easy
> to create a conflicting transaction that causes a subscriber to be
> unable to continue to apply transactions:
>
>-- DB 1
>CREATE TABLE abc (id int);
>CREATE PUBLICATION node1 FOR ALL TABLES ;
>
>-- DB2
>CREATE TABLE abc (id int);
>CREATE PUBLICATION node2 FOR ALL TABLES ;
>CREATE SUBSCRIPTION node2_node1
>  CONNECTION 'dbname=logi port=5433'
>  PUBLICATION node1
>  WITH (copy_data = off, origin = none);
>
>-- DB1
>CREATE SUBSCRIPTION node1_node2
>  CONNECTION 'dbname=logi port=5434'
>  PUBLICATION node2
>  WITH (copy_data = off, origin = none);
>INSERT INTO abc VALUES (1);
>
>-- DB2
>INSERT INTO abc VALUES (2);
>
>-- DB1
>ALTER TABLE abc ADD PRIMARY KEY (id);
>INSERT INTO abc VALUES (3);
>
>-- DB2
>INSERT INTO abc VALUES (3);
>
>-- DB1 cannot apply the transactions
>
> At a minimum, I think we should reference the documentation we have in
> the logical replication section on conflicts. We may also want to advise
> that a user is responsible for designing their schemas in a way to
> minimize the risk of conflicts.

Added a note about this that refers to the conflicts section.

Thanks for the comments, the attached v38 patch has the changes for the same.

Regards,
Vignesh
From 9207ff99cab50d4456862ba6e9f613d37234f15d Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Thu, 21 Jul 2022 14:56:57 +0530
Subject: [PATCH v38 1/2] Check and throw an error if publication tables were
 also subscribing from other publishers and support force value for copy_data
 parameter.

This patch does a couple of things:
1) Checks and throws an error if 'copy_data = on' and 'origin =
none' but the publication tables were also replicated from other publishers.
2) Adds 'force' value for copy_data parameter.

---
The steps below help to demonstrate how the new exception is useful:

The initial copy phase has no way to know the origin of the row data,
so if 'copy_data = on' in step 4 below, then an error will be
thrown to prevent any potentially non-local data from being copied:

e.g.
CREATE SUBSCRIPTION sub_node2_node1 CONNECTION ''
PUBLICATION pub_node1 WITH (copy_data = on, origin = none);
ERROR:  CREATE/ALTER SUBSCRIPTION with origin = none and copy_data = on is
not allowed when the publisher might have replicated data.

---
The following steps help to demonstrate how the 'copy_data = force'
change will

Re: Unprivileged user can induce crash by using an SUSET param in PGOPTIONS

2022-07-23 Thread Tom Lane
Nathan Bossart  writes:
> On Fri, Jul 22, 2022 at 06:44:04PM -0400, Tom Lane wrote:
>> Another idea is to add a "bool interactive" parameter to InitPostgres,
>> thereby shoving the issue out to the call sites.  Still wouldn't
>> expose the am_walsender angle, but conceivably it'd be more
>> future-proof anyway?

> I hesitated to suggest this exactly because of the WAL sender problem, but
> it does seem slightly more future-proof, so +1 for this approach.

So about like this then.  (I spent some effort on cleaning up the
disjointed-to-nonexistent presentation of InitPostgres' parameters.)

regards, tom lane

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 088556ab54..58752368e7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -354,7 +354,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	if (pg_link_canary_is_frontend())
 		elog(ERROR, "backend is incorrectly linked to frontend functions");
 
-	InitPostgres(NULL, InvalidOid, NULL, InvalidOid, NULL, false);
+	InitPostgres(NULL, InvalidOid, NULL, InvalidOid, false, false, NULL);
 
 	/* Initialize stuff for bootstrap-file processing */
 	for (i = 0; i < MAXATTR; i++)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2e146aac93..70a9176c54 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -475,7 +475,7 @@ AutoVacLauncherMain(int argc, char *argv[])
 	/* Early initialization */
 	BaseInit();
 
-	InitPostgres(NULL, InvalidOid, NULL, InvalidOid, NULL, false);
+	InitPostgres(NULL, InvalidOid, NULL, InvalidOid, false, false, NULL);
 
 	SetProcessingMode(NormalProcessing);
 
@@ -1694,12 +1694,13 @@ AutoVacWorkerMain(int argc, char *argv[])
 		pgstat_report_autovac(dbid);
 
 		/*
-		 * Connect to the selected database
+		 * Connect to the selected database, specifying no particular user
 		 *
 		 * Note: if we have selected a just-deleted database (due to using
 		 * stale stats info), we'll fail and exit here.
 		 */
-		InitPostgres(NULL, dbid, NULL, InvalidOid, dbname, false);
+		InitPostgres(NULL, dbid, NULL, InvalidOid, false, false,
+	 dbname);
 		SetProcessingMode(NormalProcessing);
 		set_ps_display(dbname);
 		ereport(DEBUG1,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1c25457526..e541b16bdb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5654,7 +5654,11 @@ BackgroundWorkerInitializeConnection(const char *dbname, const char *username, u
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  errmsg("database connection requirement not indicated during registration")));
 
-	InitPostgres(dbname, InvalidOid, username, InvalidOid, NULL, (flags & BGWORKER_BYPASS_ALLOWCONN) != 0);
+	InitPostgres(dbname, InvalidOid,	/* database to connect to */
+ username, InvalidOid,	/* role to connect as */
+ false,			/* never honor session_preload_libraries */
+ (flags & BGWORKER_BYPASS_ALLOWCONN) != 0,	/* ignore datallowconn? */
+ NULL);			/* no out_dbname */
 
 	/* it had better not gotten out of "init" mode yet */
 	if (!IsInitProcessingMode())
@@ -5677,7 +5681,11 @@ BackgroundWorkerInitializeConnectionByOid(Oid dboid, Oid useroid, uint32 flags)
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  errmsg("database connection requirement not indicated during registration")));
 
-	InitPostgres(NULL, dboid, NULL, useroid, NULL, (flags & BGWORKER_BYPASS_ALLOWCONN) != 0);
+	InitPostgres(NULL, dboid,	/* database to connect to */
+ NULL, useroid, /* role to connect as */
+ false,			/* never honor session_preload_libraries */
+ (flags & BGWORKER_BYPASS_ALLOWCONN) != 0,	/* ignore datallowconn? */
+ NULL);			/* no out_dbname */
 
 	/* it had better not gotten out of "init" mode yet */
 	if (!IsInitProcessingMode())
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8ba1c170f0..3772329759 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4076,7 +4076,11 @@ PostgresMain(const char *dbname, const char *username)
 	 * it inside InitPostgres() instead.  In particular, anything that
 	 * involves database access should be there, not here.
 	 */
-	InitPostgres(dbname, InvalidOid, username, InvalidOid, NULL, false);
+	InitPostgres(dbname, InvalidOid,	/* database to connect to */
+ username, InvalidOid,	/* role to connect as */
+ !am_walsender, /* honor session_preload_libraries? */
+ false,			/* don't ignore datallowconn */
+ NULL);			/* no out_dbname */
 
 	/*
 	 * If the PostmasterContext is still around, recycle the space; we don't
@@ -4112,12 +4116,6 @@ PostgresMain(const char *dbname, const char *username)
 	if (am_walsender)
 		InitWalSender();
 
-	/*
-	 * process any libraries that should be preloaded at backend start (this
-	 * likewise can't be done until GUC settings are complete)
-	 *

Re: Support logical replication of DDLs

2022-07-23 Thread Zheng Li
On Sat, Jul 23, 2022 at 11:33 AM Joe Conway  wrote:
>
> On 7/22/22 17:18, Zheng Li wrote:
> > Here is a patch that supports replication of global object commands;
> > these include ROLE statements, database statements and tablespace
> > statements.
> > The patch should be applied on top of the v13 DDL replication patch set that
> > ZJ Hou sent in the previous email.
> >
> > Global objects commands are different from other DDL commands in
> > that:
> > 1. Global objects commands are allowed to be executed in any database
> > 2. Global objects are not schema qualified
> > 3. Global objects commands are not captured by event triggers
> >
> > This patch supports global objects commands replication by WAL
> > logging the command using the same function for DDL logging -
> > LogLogicalDDLMessage, towards the end of standard_ProcessUtility.
> > Because global objects are not schema qualified, we can skip the deparser
> > invocation and directly log the original command string for replay on
> > the subscriber.
>
> I have not looked at the patch but +1 for the general concept. Seems
> like you might want to start a separate thread, perhaps after the
> currently running commitfest is over.

Thanks for the suggestion. I'll start a new thread on the replication of
global objects commands. I think it's different enough to get its own
attention.
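
For concreteness, the logging described above might look roughly like the
hedged sketch below. The function name LogLogicalDDLMessage comes from the
patch set referenced in this thread, but its exact signature and the set of
statement tags used here are assumptions, not the actual patch:

/* near the end of standard_ProcessUtility() */
switch (nodeTag(parsetree))
{
	case T_CreateRoleStmt:
	case T_AlterRoleStmt:
	case T_DropRoleStmt:
	case T_CreatedbStmt:
	case T_CreateTableSpaceStmt:
		/*
		 * Global-object commands are not schema qualified, so the
		 * original command string can be replayed verbatim on the
		 * subscriber, skipping the deparser.
		 */
		LogLogicalDDLMessage("ddl", InvalidOid, queryString,
							 strlen(queryString));
		break;
	default:
		break;
}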

>
> > A key problem is global objects can get inconsistent between the
> > publisher and the subscriber if a command changes the global object
> > in a database (on the source side) which doesn't configure logical 
> > replication.
> > I think we can work on the following directions in order to avoid such
> > inconsistency:
> >
> > 1. Introduce a publication option for global objects command replication
> > and document that logical replication of global objects commands should
> > preferably be configured on all databases. Otherwise inconsistency can
> > happen if a command changes a global object in a database which doesn't
> > configure logical replication.
> >
> > 2. Introduce database cluster level logical replication to avoid
> > such inconsistency; this is especially handy when there is a large
> > number of databases to configure for logical replication.
>
> I would strongly favor #2, although I admittedly have no idea what
> complexities it adds.

I will also start a new thread once we have more concrete plans on this.

Regards,
Zheng




Re: Support logical replication of DDLs

2022-07-23 Thread Joe Conway

On 7/22/22 17:18, Zheng Li wrote:

Here is a patch that supports replication of global object commands;
these include ROLE statements, database statements and tablespace statements.
The patch should be applied on top of the v13 DDL replication patch set that
ZJ Hou sent in the previous email.

Global objects commands are different from other DDL commands in
that:
1. Global objects commands are allowed to be executed in any database
2. Global objects are not schema qualified
3. Global objects commands are not captured by event triggers

This patch supports global objects commands replication by WAL
logging the command using the same function for DDL logging -
LogLogicalDDLMessage, towards the end of standard_ProcessUtility.
Because global objects are not schema qualified, we can skip the deparser
invocation and directly log the original command string for replay on
the subscriber.


I have not looked at the patch but +1 for the general concept. Seems 
like you might want to start a separate thread, perhaps after the 
currently running commitfest is over.



A key problem is global objects can get inconsistent between the
publisher and the subscriber if a command changes the global object
in a database (on the source side) which doesn't configure logical replication.
I think we can work on the following directions in order to avoid such
inconsistency:

1. Introduce a publication option for global objects command replication
and document that logical replication of global objects commands should
preferably be configured on all databases. Otherwise inconsistency can happen
if a command changes a global object in a database which doesn't configure
logical replication.

2. Introduce database cluster level logical replication to avoid
such inconsistency; this is especially handy when there is a large
number of databases to configure for logical replication.


I would strongly favor #2, although I admittedly have no idea what 
complexities it adds.


--
Joe Conway
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-23 Thread Martin Kalcher

On 22.07.22 at 11:31, Martin Kalcher wrote:


I came to the same conclusions and went with Option 1 (see patch),
mainly because most code in utils/adt is organized by type, and this way
it is clear that this is a thin wrapper around pg_prng.




Small patch update. I realized the new functions should live in
array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers
and added some comments to the code.

From 138777531961c31250074d1c0d87417e31d63656 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share their prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml  |  35 +
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/array_userfuncs.c | 182 
 src/backend/utils/adt/float.c   |  37 +
 src/backend/utils/adt/user_prng.c   |  87 +++
 src/include/catalog/pg_proc.dat |   6 +
 src/include/utils/user_prng.h   |  19 +++
 src/test/regress/expected/arrays.out|  58 
 src/test/regress/sql/arrays.sql |  14 ++
 9 files changed, 406 insertions(+), 33 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..87df8a9c81 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array.
+The order of the elements in resulting array is unspecified.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..e3ba14a17c 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,184 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with max n randomly chosen items from given array in
+ * random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbyval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+rnitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &relms, &rnuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	rnitem = Min(n, nitem);		/* target number of items */
+	nelm /= nitem;/* number of elements per item */
+
+	elms = relms;
+	nuls = rnuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with a randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < rnitem; i++)
+	{
+		k = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0; j < nelm; j++)
+		{
+			elm = *elms;
+			nul = *nuls;
+			*elms = elms[k];
+			*nuls = nu

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-23 Thread Amit Kapila
On Fri, Jul 22, 2022 at 11:48 AM Masahiko Sawada  wrote:
>
> On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila  wrote:
> >
> > On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > This is required if we don't want to introduce a new set of functions
> > as you proposed above. I am not sure which one is better w.r.t back
> > patching effort later but it seems to me using flag stuff would make
> > future back patches easier if we make any changes in
> > SnapBuildCommitTxn.
>
> Understood.
>
> I've implemented this idea as well for discussion. Both patches have
> the common change to remember the initial running transactions and to
> purge them when decoding xl_running_xacts records. The difference is
> how to mark the transactions as needing to be added to the snapshot.
>
> In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
> when the transaction is in the initial running xact list and its
> commit record has XINFO_HAS_INVAL flag, we mark both the top
> transaction and its all subtransactions as containing catalog changes
> (which also means to create ReorderBufferTXN entries for them). These
> transactions are added to the snapshot in SnapBuildCommitTxn() since
> ReorderBufferXidHasCatalogChanges () for them returns true.
>
> In poc_mark_top_txn_has_inval.patch, when the transaction is in the
> initial running xacts list and its commit record has XINFO_HAS_INVALS
> flag, we set a new flag, say RBTXN_COMMIT_HAS_INVALS, only to the top
> transaction.
>

It seems that the patch is missing the part that checks whether the xid
is in the initial running xacts list?

> In SnapBuildCommitTxn(), we add all subtransactions to
> the snapshot without checking ReorderBufferXidHasCatalogChanges() for
> subtransactions if its top transaction has the RBTXN_COMMIT_HAS_INVALS
> flag.
>
> A difference between the two ideas is the scope of changes: the former
> changes only snapbuild.c but the latter changes both snapbuild.c and
> reorderbuffer.c. Moreover, while the former uses the existing flag,
> the latter adds a new flag to the reorder buffer for dealing with only
> this case. I think the former idea is simpler in terms of that. But,
> an advantage of the latter idea is that the latter idea can save to
> create ReorderBufferTXN entries for subtransactions.
>
> Overall I prefer the former for now but I'd like to hear what others think.
>

I agree that the latter idea can have better performance in extremely
special scenarios, but introducing a new flag for this sounds a bit
ugly to me. So, I would also prefer to go with the former idea;
however, I would also like to hear what Horiguchi-San and others have
to say.

Few comments on v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during:
1.
+void
+SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId xid,
+   int subxcnt, TransactionId *subxacts,
+   XLogRecPtr lsn)
+{

I think it is better to name this function as
SnapBuildXIDSetCatalogChanges as we use this to mark a particular
transaction as having catalog changes.

2. Changed/added a few comments in the attached.

-- 
With Regards,
Amit Kapila.


v7-0001-diff-amit.patch
Description: Binary data


Re: Refactoring the regression tests for more independence

2022-07-23 Thread Aleksander Alekseev
Hi Tom,

> FWIW, I tried to replicate this locally on my own RPi3B+, using
> current Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-1066-raspi aarch64).
> No luck: it all works fine for me.  We have at least one Raspbian
> buildfarm animal too, and it's not been unhappy either.  I suspect
> there is something odd about your environment settings.

Thanks for sharing this.

I repeated the experiment in a clean environment (Raspbian installed
from scratch on a brand new SD-card) and can confirm that the problem
is gone.

Sorry for the disturbance.

-- 
Best regards,
Aleksander Alekseev




Re: Collect ObjectAddress for ATTACH DETACH PARTITION to use in event trigger

2022-07-23 Thread Michael Paquier
On Sat, Jul 23, 2022 at 05:44:28PM +0900, Michael Paquier wrote:
> Changing get_altertable_subcmdtypes() to return a set of rows made of
> (subcommand, object description) is what I actually meant upthread as
> it feels natural given a CollectedCommand in input, and as
> pg_event_trigger_ddl_commands() only gives access to a set of
> CollectedCommands.  This is also a test module so 
> there is no issue in changing the existing function definitions.
> 
> But your point would be to have a new function that takes in input a
> CollectedATSubcmd, returning back the object address or its
> description?  How would you make sure that a subcommand maps to a
> correct object address?

FWIW, I was thinking about something among the lines of 0002 on top of
Hou's patch.
--
Michael
From df118fedd1204ac4596de59ad2ef87cb1f734a35 Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Wed, 13 Jul 2022 17:39:13 +0800
Subject: [PATCH v2 1/2] Collect ObjectAddress for ATTACH DETACH PARTITION

---
 src/backend/commands/tablecmds.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7fbee0c1f7..9cf4671da0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5193,11 +5193,11 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 	  cur_pass, context);
 			Assert(cmd != NULL);
 			if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-ATExecAttachPartition(wqueue, rel, (PartitionCmd *) cmd->def,
-	  context);
+address = ATExecAttachPartition(wqueue, rel, (PartitionCmd *) cmd->def,
+context);
 			else
-ATExecAttachPartitionIdx(wqueue, rel,
-		 ((PartitionCmd *) cmd->def)->name);
+address = ATExecAttachPartitionIdx(wqueue, rel,
+   ((PartitionCmd *) cmd->def)->name);
 			break;
 		case AT_DetachPartition:
 			cmd = ATParseTransformCmd(wqueue, tab, rel, cmd, false, lockmode,
@@ -5205,12 +5205,12 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			Assert(cmd != NULL);
 			/* ATPrepCmd ensures it must be a table */
 			Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-			ATExecDetachPartition(wqueue, tab, rel,
-  ((PartitionCmd *) cmd->def)->name,
-  ((PartitionCmd *) cmd->def)->concurrent);
+			address = ATExecDetachPartition(wqueue, tab, rel,
+			((PartitionCmd *) cmd->def)->name,
+			((PartitionCmd *) cmd->def)->concurrent);
 			break;
 		case AT_DetachPartitionFinalize:
-			ATExecDetachPartitionFinalize(rel, ((PartitionCmd *) cmd->def)->name);
+			address = ATExecDetachPartitionFinalize(rel, ((PartitionCmd *) cmd->def)->name);
 			break;
 		default:/* oops */
 			elog(ERROR, "unrecognized alter table type: %d",
-- 
2.36.1

From a013fb2948986302490f09c5e44a3b510030f293 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Sat, 23 Jul 2022 19:53:07 +0900
Subject: [PATCH v2 2/2] Extend test_ddl_deparse for ALTER TABLE ..
 ATTACH/DETACH PARTITION

---
 .../test_ddl_deparse/expected/alter_table.out | 19 ++--
 .../expected/create_table.out |  8 ++--
 .../test_ddl_deparse/expected/create_view.out |  2 +-
 .../expected/test_ddl_deparse.out |  4 +-
 .../test_ddl_deparse/sql/alter_table.sql  |  5 ++
 .../test_ddl_deparse/sql/test_ddl_deparse.sql |  4 +-
 .../test_ddl_deparse--1.0.sql |  6 ++-
 .../test_ddl_deparse/test_ddl_deparse.c   | 46 +++
 8 files changed, 68 insertions(+), 26 deletions(-)

diff --git a/src/test/modules/test_ddl_deparse/expected/alter_table.out b/src/test/modules/test_ddl_deparse/expected/alter_table.out
index 141060fbdc..ec22d688b1 100644
--- a/src/test/modules/test_ddl_deparse/expected/alter_table.out
+++ b/src/test/modules/test_ddl_deparse/expected/alter_table.out
@@ -9,21 +9,30 @@ NOTICE:  DDL test: type simple, tag CREATE TABLE
 ALTER TABLE parent ADD COLUMN b serial;
 NOTICE:  DDL test: type simple, tag CREATE SEQUENCE
 NOTICE:  DDL test: type alter table, tag ALTER TABLE
-NOTICE:    subcommand: ADD COLUMN (and recurse)
+NOTICE:    subcommand: type ADD COLUMN (and recurse) desc column b of table parent
 NOTICE:  DDL test: type simple, tag ALTER SEQUENCE
 ALTER TABLE parent RENAME COLUMN b TO c;
 NOTICE:  DDL test: type simple, tag ALTER TABLE
 ALTER TABLE parent ADD CONSTRAINT a_pos CHECK (a > 0);
 NOTICE:  DDL test: type alter table, tag ALTER TABLE
-NOTICE:    subcommand: ADD CONSTRAINT (and recurse)
+NOTICE:    subcommand: type ADD CONSTRAINT (and recurse) desc constraint a_pos on table parent
 CREATE TABLE part (
 	a int
 ) PARTITION BY RANGE (a);
 NOTICE:  DDL test: type simple, tag CREATE TABLE
 CREATE TABLE part1 PARTITION OF part FOR VALUES FROM (1) to (100);
 NOTICE:  DDL test: type simple, tag CREATE TABLE
+CREATE TABLE part2 (a int);
+NOTICE:  DDL test: type simple, tag CREATE TABLE
+ALTER TABLE part ATTACH PARTITION part2 FOR VALUES FROM (101) to (200);
+NOTICE:  DDL test: type alter tab

Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages

2022-07-23 Thread Bharath Rupireddy
On Thu, Jul 21, 2022 at 9:50 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 20 Jul 2022 17:25:33 +0530, Bharath Rupireddy 
>  wrote in
> > On Wed, Jul 20, 2022 at 12:55 PM Kyotaro Horiguchi
> >  wrote:
> > PSA v7 patch set.
>
> Thanks. Looks perfect, but (sorry..) in the final checking, I found
> "log archive" in the doc.  If you agree to it please merge the
> attached (or refined one) and I'd call it a day.

Merged. PSA v8 patch set.

Regards,
Bharath Rupireddy.


v8-0001-Use-WAL-segment-instead-of-log-segment.patch
Description: Binary data


v8-0002-Consistently-use-WAL-file-s-in-docs.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2022-07-23 Thread Andrey Borodin


> On 26 Jun 2022, at 00:10, Andrey Borodin  wrote:
> 
> I will split the patch in 3 steps:
> 1. extract generic functions to amcheck.c
> 2. add gist functions
> 3. add gin functions
> 
> I'll fix other notes too in the next version.


Done. PFA attached patchset.

Thanks!

Best regards, Andrey Borodin.


v12-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v12-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


v12-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


Re: Avoiding smgrimmedsync() during nbtree index builds

2022-07-23 Thread Heikki Linnakangas

On 05/03/2022 00:03, Melanie Plageman wrote:

On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby  wrote:


Rebased to appease cfbot.

I ran these patches under a branch which shows code coverage in cirrus.  It
looks good to my eyes.
https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html


thanks for doing that and for the rebase! since I made updates, the
attached version 6 is also rebased.


I'm surprised by all the changes in nbtsort.c to choose between using 
the buffer manager or the new API. I would've expected the new API to 
abstract that away. Otherwise we need to copy that logic to all the 
index AMs.


I'd suggest an API along the lines of:

/*
 * Start building a relation in bulk.
 *
 * If the relation is going to be small, we will use the buffer manager,
 * but if it's going to be large, this will skip the buffer manager and
 * write the pages directly to disk.
 */
bulk_init(SmgrRelation smgr, BlockNumber estimated_size)

/*
 * Extend relation by one page
 */
bulk_extend(SmgrRelation, BlockNumber, Page)

/*
 * Finish building the relation
 *
 * This will fsync() the data to disk, if required.
 */
bulk_finish()


Behind this interface, you can encapsulate the logic to choose whether 
to use the buffer manager or not. And I think bulk_extend() could do the 
WAL-logging too.
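
A hedged usage sketch of this first variant, as it might appear in an index
build; fill_page() and nblocks are hypothetical stand-ins for the caller's
page-building code and final relation size:

bulk_init(smgr, estimated_size);
for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
	PGAlignedBlock page;

	fill_page(page.data, blkno);			/* caller builds the page */
	bulk_extend(smgr, blkno, (Page) page.data);	/* may WAL-log here */
}
bulk_finish();					/* fsync() the data if required */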


Or you could make the interface look more like the buffer manager:

/* as above */
bulk_init(SmgrRelation smgr, BlockNumber estimated_size)
bulk_finish()

/*
 * Get a buffer for writing out a page.
 *
 * The contents of the buffer are uninitialized. The caller
 * must fill it in before releasing it.
 */
BulkBuffer
bulk_getbuf(SmgrRelation smgr, BlockNumber blkno)

/*
 * Release buffer. It will be WAL-logged and written out to disk.
 * Not necessarily immediately, but at bulk_finish() at latest.
 *
 * NOTE: There is no way to read back the page after you release
 * it, until you finish the build with bulk_finish().
 */
void
bulk_releasebuf(SmgrRelation smgr, BulkBuffer buf)


With this interface, you could also batch multiple pages and WAL-log 
them all in one WAL record with log_newpage_range(), for example.
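
A matching usage sketch of this buffer-manager-like variant; again,
fill_index_page() is a hypothetical stand-in for the AM's page-filling code:

bulk_init(smgr, estimated_size);
for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
	BulkBuffer buf = bulk_getbuf(smgr, blkno);

	fill_index_page(buf, blkno);	/* caller fills the buffer in */
	bulk_releasebuf(smgr, buf);	/* WAL-log + write, possibly deferred */
}
bulk_finish();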


- Heikki




A proposal for shared memory based backup infrastructure

2022-07-23 Thread Bharath Rupireddy
Hi,

Right now, the session that starts the backup with pg_backup_start()
has to end it with pg_backup_stop() which returns the backup_label and
tablespace_map contents (commit 39969e2a1). If the backups were to be
taken using custom disk snapshot tools on production servers,
following are the high-level steps involved:
1) open a session
2) run pg_backup_start() using the same session opened in (1)
3) run custom disk snapshot tools, which will often copy the entire data
directory over the network
4) run pg_backup_stop() using the same session opened in (1)

Typically, step (3) takes a good amount of time in production
environments with terabytes or petabytes of data, and keeping the
session alive from step (1) to (4) has overhead and wastes resources.
And the session can get closed for various reasons - idle-in-session
timeout, tcp/ip keepalive timeout, network problems etc. All of these
can render the backup useless.

What if the backup started by a session could also be closed by another
session? This seems achievable if we can place the backup_label,
tablespace_map and other required session/backend-level contents in
shared memory, keyed by the backup_label name. It's a long way to go.
The idea may be naive at this stage and there might be something
important that doesn't let us do the proposed solution. I would like to
hear more thoughts from the hackers.
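
For illustration only, here is a hedged sketch (not from any patch) of what
such a shared-memory registry of in-progress backups, keyed by backup label
name, might look like; all names and sizes below are assumptions:

#include <stdbool.h>

#define MAX_CONCURRENT_BACKUPS	 8	/* illustrative limit */
#define BACKUP_LABEL_NAME_LEN	 128	/* illustrative */
#define BACKUP_FILE_CONTENTS_LEN 8192	/* illustrative */

/* one slot per in-progress backup, living in shared memory */
typedef struct BackupSlot
{
	bool	in_use;
	char	label_name[BACKUP_LABEL_NAME_LEN];	/* lookup key */
	/* state pg_backup_stop() needs, so any session can finish it: */
	char	backup_label[BACKUP_FILE_CONTENTS_LEN];
	char	tablespace_map[BACKUP_FILE_CONTENTS_LEN];
} BackupSlot;

typedef struct BackupSlotArray
{
	BackupSlot	slots[MAX_CONCURRENT_BACKUPS];
} BackupSlotArray;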

Thanks to Sameer, Satya (cc-ed) for the offlist discussion.

Regards,
Bharath Rupireddy.




Re: MultiXact\SLRU buffers configuration

2022-07-23 Thread Thomas Munro
On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin  wrote:
> Thomas, do you still have any doubts? Or is it certain that SLRU will be
> replaced by a better subsystem in 16?

Hi Andrey,

Sorry for my lack of replies on this and the other SLRU thread -- I'm
thinking and experimenting.  More soon.




Re: Collect ObjectAddress for ATTACH DETACH PARTITION to use in event trigger

2022-07-23 Thread Michael Paquier
On Fri, Jul 22, 2022 at 02:26:02PM +0530, Amit Kapila wrote:
> Yeah, that would be a good idea but I think instead of changing
> get_altertable_subcmdtypes(), can we have a new function say
> get_altertable_subcmdinfo() that returns additional information from the
> address. The other alternative could be that instead of returning the
> address as a string, we can return some fields as a set of records
> (one row for each subcommand) as we do in
> pg_event_trigger_ddl_commands().

Changing get_altertable_subcmdtypes() to return a set of rows made of
(subcommand, object description) is what I actually meant upthread as
it feels natural given a CollectedCommand in input, and as
pg_event_trigger_ddl_commands() only gives access to a set of
CollectedCommands.  This is also a test module so 
there is no issue in changing the existing function definitions.

But your point would be to have a new function that takes in input a
CollectedATSubcmd, returning back the object address or its
description?  How would you make sure that a subcommand maps to a
correct object address?
--
Michael


signature.asc
Description: PGP signature


Re: MultiXact\SLRU buffers configuration

2022-07-23 Thread Andrey Borodin



> On 21 Jul 2022, at 18:00, Yura Sokolov  wrote:
> 
> In this case simple buffer increase does help. But "buckets"
> increase performance gain.
Yura, thank you for your benchmarks!
We already knew that the patch can save the day on pathological workloads;
now we have proof of this.
There is also evidence that users can blindly increase the size of the SLRU
if they want (with the second patch). So there's no need for involved
explanations on how to tune the buffer size.

Thomas, do you still have any doubts? Or is it certain that SLRU will be
replaced by a better subsystem in 16?


Best regards, Andrey Borodin.



Re: Pluggable toaster

2022-07-23 Thread Nikita Malakhov
Hi hackers!

Matthias, thank you very much for your feedback!
Sorry, I forgot to attach the files.
Attaching them here, but they are for the commit tagged "15beta2". I am
currently rebasing this branch onto the current master and will provide a
rebased version, with some corrections according to your feedback, in a
day or two.

>Indexes cannot index toasted values, so why would the toaster oid be
>interesting for index storage properties?

Here Teodor might correct me. Toast tables are indexed, and the Heap TOAST
mechanics accesses toasted tuples by index, isn't that the case?

>Moving toasters seems quite expensive when compared to just index
>checks. When you have many toasters, but only a few hot ones, this
>currently will move the "cold" toasters around a lot. Why not use a
>stack instead (or alternatively, a 'zipper' (or similar) data
>structure), where the hottest toaster is on top, so that we avoid
>larger memmove calls?

This is a reasonable remark; we'll consider it for the next iteration.
Our reasoning is that we think there won't be a lot of custom Toasters -
most likely fewer than a dozen, for the most complex/heavy datatypes - so
we haven't considered these expenses.

>To the best of my knowledge varlena-pointers are unaligned; and this
>is affirmed by the comment immediately under varattrib_1b_e. Assuming
>alignment to 16 bits should/would be incorrect in some of the cases.
>This is handled for normal varatt_external values by memcpy-ing the
>value into local (aligned) fields before using them, but that doesn't
>seem to be the case for varatt_custom?

Assuming alignment seemed reasonable to us because structures are
aligned in memory anyway, so using 1- and 2-byte fields along with
4-byte ones may create a lot of padding. Am I wrong? Also, correct
alignment somewhat simplifies access to structure fields.
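
To make that concrete, here is a standalone illustration of the
memcpy-to-an-aligned-local pattern Matthias refers to (core does this for
varatt_external via VARATT_EXTERNAL_GET_POINTER in src/include/postgres.h);
the struct below is illustrative, not the real varatt layout:

#include <stdint.h>
#include <string.h>

typedef struct
{
	int32_t	rawsize;
	int32_t	extsize;
} aligned_fields;

static aligned_fields
read_unaligned(const char *payload)
{
	aligned_fields local;

	/*
	 * A varlena payload can start at any address, so copy the bytes into
	 * an aligned local variable before touching the int32 fields.
	 */
	memcpy(&local, payload, sizeof(local));
	return local;
}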

>0003: (re-implement default toaster using toaster interface)
>I see significant changes to the dummy toaster (created in 0002),
>could those be moved to patch 0002 in the next iteration?
Will do.

>detoast.c and detoast.h are effectively empty after this patch (only
>imports and commented-out code remain), please fully remove them
>instead - that saves on patch diff size.
Will do.

About the location of the toast_ functions: these functions are part of
the Heap TOAST mechanics, and they were scattered among other Heap
internals sources. I've tried to gather them and put them into a more
coherent order, but this work is not fully finished yet and will take
some time. Working on it.

I'll check if table_relation_fetch_toast_slice can be removed; thanks for
the remark.

toast_helper - done, will be provided in the rebased version.

toast_internals - this one is an internal part of TOAST implemented in
the Heap AM, but I'll try to straighten it out as much as I can.

naming conventions in some sources - done, will be provided in the rebased
patch set.

>Shouldn't the creation of toast tables be delegated to the toaster?

Yes, you're right, and actually, it is. I'll check that and correct it
in the rebased version.

>Although this is code being moved around, the comment is demonstrably
>false: A cancelled REINDEX CONCURRENTLY with a subsequent REINDEX can
>leave a toast relation with 2 valid indexes.

This code is quite old; we've not changed it, but thanks for the remark -
I'll check it more carefully.

Small fixes are already merged into the larger patches in the attached
files. Also, I appreciate your feedback on the documentation - if you have
an opportunity, please check the README provided in 0003. I've taken your
comments on the documentation into account and will include corrections
according to them in the rebased patch.

As Aleksander recommended, I've shortened the patch set and left only the
most important part - the API and the re-implemented default Toast. All
the bells and whistles are of less importance and can be sent later, once
the API itself has been straightened out and committed.

Thank you very much!

On Fri, Jul 22, 2022 at 4:17 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Wed, 20 Jul 2022 at 11:16, Nikita Malakhov  wrote:
> >
> > Hi hackers!
>
> Hi,
>
> Please don't top-post here.  See
> https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics.
>
> > We really need your feedback on the last patchset update!
>
> This is feedback on the latest version that was shared on the mailing
> list here [0]. Your mail from today didn't seem to have an attachment,
> and I haven't checked the git repository for changes.
>
> 0001: Create Table Storage:
> LGTM
>
> ---
>
> 0002: Toaster API interface
>
> > tablecmds.c
>
> > SetIndexStorageProperties(Relation rel, Relation attrelation,
> >   AttrNumber attnum,
> >   bool setstorage, char newstorage,
> >   bool setcompression, char newcompression,
> > +  bool settoaster, Oid toasterOid,
> >   LOCKMODE lockmode)
>
> Indexes cannot index toasted values,