Fix incorrect variable type for origin_id

2022-09-19 Thread Masahiko Sawada
Hi,

I realized that there are some places where we use XLogRecPtr for
variables that hold a replication origin id. The attached patch fixes
them to use RepOriginId instead.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


fix_to_use_RepOriginId.patch


Re: predefined role(s) for VACUUM and ANALYZE

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 08:51:47PM -0700, Nathan Bossart wrote:
> Are there any concerns with simply expanding AclMode to 64 bits, as done in
> v5 [0]?
> 
> [0] https://postgr.es/m/20220908055035.GA2100193%40nathanxps13

I have gone through the thread, and I'd agree that more granularity
when it comes to assigning ACLs to relations, rather than just an
on/off switch for the objects of a given type, would be nice.
I've been looking at the whole use of AclMode and AclItem in the code,
and I don't quite see why a larger size could have a noticeable
impact.  There are a few things that could handle a large number of
AclItems, though, say for array operations like aclupdate().  These
could be easily checked with some micro-benchmarking or some SQL
queries that emulate a large number of items in aclitem[] arrays.
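For instance, a rough sketch of that kind of SQL-level emulation (all
object names below are made up for illustration, not taken from the
patch) could be:

  -- grow pg_class.relacl for one relation, then time one more GRANT,
  -- which has to go through aclupdate() on the long aclitem[] array
  CREATE TABLE acl_bench (a int);
  DO $$
  BEGIN
    FOR i IN 1..500 LOOP
      EXECUTE format('CREATE ROLE acl_bench_role_%s', i);
      EXECUTE format('GRANT SELECT ON acl_bench TO acl_bench_role_%s', i);
    END LOOP;
  END $$;
  \timing on
  GRANT INSERT ON acl_bench TO acl_bench_role_1;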

Any impact for the column sizes of the catalogs holding ACL
information?  Just asking while browsing the patch set.

Some comments in utils/acl.h need a refresh as the number of lower and
upper bits looked at from ai_privs changes.
--
Michael




Re: Reducing the chunk header sizes on all memory context types

2022-09-19 Thread Tom Lane
David Rowley  writes:
> On Tue, 20 Sept 2022 at 13:23, Tom Lane  wrote:
>> ... but I'm completely not satisfied with the current
>> situation in HEAD.

> Maybe you've forgotten that MemoryContextContains() is broken in the
> back branches or just don't think it is broken?

"Broken" is a strong claim.  There's reason to think it could fail
in the back branches, but little evidence that it actually has failed
in the field.  So yeah, we have work to do --- which is the exact
opposite of your apparent stand that we can walk away from the
problem.

regards, tom lane




Re: [PoC] Federated Authn/z with OAUTHBEARER

2022-09-19 Thread mahendrakar s
Hi  Hackers,

We are trying to implement AAD (Azure AD) support in PostgreSQL, and it
can be achieved with support for the OAuth method. To support AAD on
top of OAuth in a generic fashion (i.e. for all other OAuth providers),
we are proposing this patch. It basically exposes two new hooks (one
for error reporting and one for OAuth provider-specific token
validation) and passes the OAuth bearer token to the backend. It also
adds support for the client credentials flow of OAuth, in addition to
the device code flow which Jacob has proposed.

The changes for each component are summarized below.

1. Provider-specific extension:
Each OAuth provider implements its own token validator as an
extension. The extension registers an OAuth provider hook which is
matched to a line in the HBA file.

2. Add support to pass on the OAuth bearer token. In this flow,
obtaining the bearer token is left to the 3rd party application or user.

./psql -U <user> -d 'dbname=postgres
oauth_client_id=<client_id> oauth_bearer_token=<token>'

3. HBA: An additional param ‘provider’ is added for the oauth method.
Defining "oauth" as method + passing provider, issuer endpoint
and expected audience

* * * * oauth   provider=<provider_name>
issuer=<issuer_endpoint> scope=<expected_audience>

4. Engine Backend:
Support for the generic OAUTHBEARER type, requesting the client to
provide a token and passing the token to the provider-specific extension.

5. Engine Frontend: Two-tiered approach.
   a)  libpq transparently passes on the token received
from 3rd party client as is to the backend.
   b)  libpq is optionally compiled for clients which
explicitly need libpq to orchestrate OAuth communication with the
issuer (this depends heavily on the 3rd party library iddawc, as Jacob
already pointed out; the library seems to support all the OAuth
flows).

Please let us know your thoughts as the proposed method supports
different OAuth flows with the use of provider specific hooks. We
think that the proposal would be useful for various OAuth providers.

Thanks,
Mahendrakar.


On Tue, 20 Sept 2022 at 10:18, Jacob Champion  wrote:
>
> On Tue, 2021-06-22 at 23:22 +, Jacob Champion wrote:
> > On Fri, 2021-06-18 at 11:31 +0300, Heikki Linnakangas wrote:
> > >
> > > A few small things caught my eye in the backend oauth_exchange function:
> > >
> > > > +   /* Handle the client's initial message. */
> > > > +   p = strdup(input);
> > >
> > > this strdup() should be pstrdup().
> >
> > Thanks, I'll fix that in the next re-roll.
> >
> > > In the same function, there are a bunch of reports like this:
> > >
> > > >ereport(ERROR,
> > > > +  (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > > +   errmsg("malformed OAUTHBEARER message"),
> > > > +   errdetail("Comma expected, but found character \"%s\".",
> > > > + sanitize_char(*p;
> > >
> > > I don't think the double quotes are needed here, because sanitize_char
> > > will return quotes if it's a single character. So it would end up
> > > looking like this: ... found character "'x'".
> >
> > I'll fix this too. Thanks!
>
> v2, attached, incorporates Heikki's suggested fixes and also rebases on
> top of latest HEAD, which had the SASL refactoring changes committed
> last month.
>
> The biggest change from the last patchset is 0001, an attempt at
> enabling jsonapi in the frontend without the use of palloc(), based on
> suggestions by Michael and Tom from last commitfest. I've also made
> some improvements to the pytest suite. No major changes to the OAuth
> implementation yet.
>
> --Jacob
diff --git a/src/backend/libpq/auth-oauth.c b/src/backend/libpq/auth-oauth.c
index c47211132c..86f820482b 100644
--- a/src/backend/libpq/auth-oauth.c
+++ b/src/backend/libpq/auth-oauth.c
@@ -24,7 +24,9 @@
 #include "libpq/hba.h"
 #include "libpq/oauth.h"
 #include "libpq/sasl.h"
+#include "miscadmin.h"
 #include "storage/fd.h"
+#include "utils/memutils.h"
 
 /* GUC */
 char *oauth_validator_command;
@@ -34,6 +36,13 @@ static void *oauth_init(Port *port, const char *selected_mech, const char *shado
 static int   oauth_exchange(void *opaq, const char *input, int inputlen,
 			char **output, int *outputlen, char **logdetail);
 
+/*
+ * OAuth Authentication
+ *
+ */
+static List *oauth_providers = NIL;
+static OAuthProvider* oauth_provider = NULL;
+
 /* Mechanism declaration */
 const pg_be_sasl_mech pg_be_oauth_mech = {
 	oauth_get_mechanisms,
@@ -63,15 +72,90 @@ static char *sanitize_char(char c);
 static char *parse_kvpairs_for_auth(char **input);
 static void generate_error_response(struct oauth_ctx *ctx, char **output, int *outputlen);
 static bool validate(Port *port, const char *auth, char **logdetail);
-static bool run_validator_command(Port *port, const char *token);
+static const 

Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 9:48 PM Tom Lane  wrote:
> As Munro adduces nearby, it'd be a stretch to conclude that the current
> format was designed with any Postgres-related goals in mind at all.
> I think he's right that it's a variant of some Lisp-y dump format that's
> probably far hoarier than even Berkeley Postgres.

That sounds very much like the 1980s graduate student equivalent of
JSON to my ears.

JSON is generally manipulated as native Javascript/python/whatever
lists, maps, and strings. It's an interchange format that tries not to
be obtrusive in the same way as things like XML always are, at the
cost of making things kinda dicey for things like numeric precision
(unless you can account for everything). Isn't that...basically the
same concept as the lisp-y dump format, at a high level?

> I think the principal mistake in what we have now is that the storage
> format is identical to the "developer friendly" text format (plus or
> minus some whitespace).  First we need to separate those.  We could
> have more than one equivalent text format perhaps, and I don't have
> any strong objection to basing the text format (or one of them) on
> JSON.

Agreed.

-- 
Peter Geoghegan




Re: remove more archiving overhead

2022-09-19 Thread Noah Misch
On Mon, Sep 19, 2022 at 12:49:58PM -0400, Robert Haas wrote:
> On Mon, Sep 19, 2022 at 10:39 AM Noah Misch  wrote:
> > I wanted it to stop saying anything like the second paragraph, hence commit
> > d263ced.  Implementing a proper archiving setup is not especially difficult,
> > and inviting the operator to work around a wrong implementation invites
> > damaging mistakes under time pressure.
> 
> I agree wholeheartedly with the second sentence. I do think that we
> need to be a little careful not to be too prescriptive. Telling people
> what they have to do when they don't really have to do it often leads
> to non-compliance. It can be more productive to spell out the precise
> consequences of non-compliance, so I think Peter's proposal has some
> merit.

Good point.  We could say it from a perspective like, "if you don't do this,
your setup will appear to work most of the time, but your operator will be
stuck with dangerous manual fixes at times."

> On the other hand, I don't see any real problem with the
> current language, either.
> 
> Unfortunately, no matter what we do here, it is not likely that we'll
> soon be able to eliminate the phenomenon where people use a buggy
> home-grown archive_command, both because we've been encouraging that
> in our documentation for a very long time, and also because the people
> who are most likely to do something half-baked here are probably the
> least likely to actually read the updated documentation.

True.  If I wanted to eliminate such setups, I would make the server
double-archive one WAL file each time the archive setup changes.




Re: Reducing the chunk header sizes on all memory context types

2022-09-19 Thread David Rowley
On Tue, 20 Sept 2022 at 13:23, Tom Lane  wrote:
>
> David Rowley  writes:
> > Aside from that, I don't have any ideas on how to get rid of the
> > possible additional datumCopy() from non-Var arguments to these window
> > functions.  Should we just suffer it? It's quite likely that most
> > arguments to these functions are plain Vars anyway.
>
> No, we shouldn't.  I'm pretty sure that we have various window
> functions that are deliberately designed to take advantage of the
> no-copy behavior, and that they have taken a significant speed
> hit from your having disabled that optimization.  I don't say
> that this is enough to justify reverting the chunk header changes
> altogether ... but I'm completely not satisfied with the current
> situation in HEAD.

Maybe you've forgotten that MemoryContextContains() is broken in the
back branches or just don't think it is broken?

David




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Sep 19, 2022 at 8:39 PM Tom Lane  wrote:
>> Our existing format is certainly not great on those metrics, but
>> I do not see how "let's use JSON!" is a route to improvement.

> The existing format was designed with developer convenience as a goal,
> though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

> If it didn't have to be easy (or even practical) for developers to
> directly work with the output format, then presumably the format used
> internally could be replaced with something lower level and faster. So
> it seems like the two goals (developer ergonomics and faster
> interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace).  First we need to separate those.  We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

regards, tom lane




Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-19 Thread Michael Paquier
On Tue, Sep 20, 2022 at 12:09:33AM -0400, Tom Lane wrote:
> You have to assume that somebody (a) has a role or DB name starting
> with slash, (b) has an explicit reference to that name in their
> pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't
> notice that things are misbehaving until after some hacker manages
> to break into their installation on the strength of the misbehaving
> entry.  OK, I'll grant that the probability of (c) is depressingly
> close to unity; but each of the other steps seems quite low probability.
> All four of them happening in one installation is something I doubt
> will happen.

It is the kind of thing that could blow up as a CVE and some bad PR
for the project, so I cannot get excited about enforcing this new rule
in an authentication file (aka before a role is authenticated) while
we are talking about only 3~4 code paths (?) that would need an extra
check to make sure that no instances have such object names.

> On the contrary side, if we make this work differently from the
> pg_ident.conf precedent, or install weird rules to try to prevent
> accidental misinterpretations, that could also lead to security
> problems because things don't work as someone would expect.  I see
> no a-priori reason to believe that this risk is negligible compared
> to the other one.

I also very much like the idea of making things consistent across all
the auth configuration files for all the fields where this can be
applied.
--
Michael




Re: A question about wording in messages

2022-09-19 Thread Amit Kapila
On Fri, Sep 16, 2022 at 12:29 PM Amit Kapila  wrote:
>
> > > >
> > > > +1, I saw that today and thought it was outside our usual style.
> > > > The whole thing is awfully verbose for a GUC description, too.
> > > > Maybe
> > > >
> > > > "Maximum distance to read ahead in WAL to prefetch data blocks."
> > >
> > > +1
> > >
> > > For "we", I must have been distracted by code comment style.  For the
> > > extra useless verbiage, it's common for GUC description to begin "This
> > > control/affects/blah" like that, but I agree it's useless noise.
> > >
> > > For the other cases, Amit's suggestion of 'server' seems sensible to me.
> >
> > Thanks for the opinion. I'm fine with that, too.
> >
>
> So, the change related to wal_decode_buffer_size needs to be
> backpatched to 15 whereas other message changes will be HEAD only, am
> I correct?
>

I would like to pursue the above unless there is more feedback on this.

-- 
With Regards,
Amit Kapila.




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Thomas Munro
On Tue, Sep 20, 2022 at 4:03 PM Thomas Munro  wrote:
> On Tue, Sep 20, 2022 at 3:58 PM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> > > default reader/printer behaviour in small ways, including being case
> > > sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> > > structs for reasons that are lost AFAIK (there's a dark age between
> > > the commit history of the old Berkeley repo and our current repo, and
> > > it looks like plan nodes were still printed as #(NAME ...) at
> > > Berkeley).
> >
> > Wow, where did you find a commit history for Berkeley's code?
> > There's evidence in the tarballs I have that they were using
> > RCS, but I never heard that the repo was made public.
>
> One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
> complete RCS history, but Stas Kelvich imported it to github as Peter
> G has just reported faster than I could.

To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
be seen here, though it's being lexed as "PLAN_SYM", so perhaps the
author of that C already didn't know that was a general syntax for
Lisp structs.  (Example: at a Lisp prompt, if you write (defstruct foo
x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
Lisp code, which sadly (for me) was ripped out before even that repo
IIUC, must have used defstruct-based plans.)

https://github.com/kelvich/postgres_pre95/blob/master/src/backend/lib/lispread.c#L132

It may still be within the bounds of what a real Lisp could be
convinced to read though, given a reader macro to handle {} and maybe
some other little tweaks here and there.




Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-19 Thread Tom Lane
Michael Paquier  writes:
>> On Thu, Sep 8, 2022 at 5:46 PM Tom Lane  wrote:
>>> Meh ... that concern seems overblown to me.  I guess it's possible
>>> that somebody has an HBA entry that looks like that, but it doesn't
>>> seem very plausible.  Note that we made this exact same change in
>>> pg_ident.conf years ago, and AFAIR we got zero complaints.

> This concern does not sound overblown to me.

You have to assume that somebody (a) has a role or DB name starting
with slash, (b) has an explicit reference to that name in their
pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't
notice that things are misbehaving until after some hacker manages
to break into their installation on the strength of the misbehaving
entry.  OK, I'll grant that the probability of (c) is depressingly
close to unity; but each of the other steps seems quite low probability.
All four of them happening in one installation is something I doubt
will happen.

On the contrary side, if we make this work differently from the
pg_ident.conf precedent, or install weird rules to try to prevent
accidental misinterpretations, that could also lead to security
problems because things don't work as someone would expect.  I see
no a-priori reason to believe that this risk is negligible compared
to the other one.

regards, tom lane




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Thomas Munro
On Tue, Sep 20, 2022 at 3:58 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> > default reader/printer behaviour in small ways, including being case
> > sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> > structs for reasons that are lost AFAIK (there's a dark age between
> > the commit history of the old Berkeley repo and our current repo, and
> > it looks like plan nodes were still printed as #(NAME ...) at
> > Berkeley).
>
> Wow, where did you find a commit history for Berkeley's code?
> There's evidence in the tarballs I have that they were using
> RCS, but I never heard that the repo was made public.

One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
complete RCS history, but Stas Kelvich imported it to github as Peter
G has just reported faster than I could.




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 8:58 PM Tom Lane  wrote:
> Wow, where did you find a commit history for Berkeley's code?
> There's evidence in the tarballs I have that they were using
> RCS, but I never heard that the repo was made public.

It's on Github:

https://github.com/kelvich/postgres_pre95
-- 
Peter Geoghegan




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 8:39 PM Tom Lane  wrote:
> There are certainly use-cases for something like that, but let's
> be clear about it: that's a niche case of interest to developers
> and pretty much nobody else.  For ordinary users, what matters about
> the node tree storage format is compactness and speed of loading.

Of course. But is there any reason to think that there has to be even
a tiny cost imposed on users?

> Our existing format is certainly not great on those metrics, but
> I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections. This
is certainly not a new consideration.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

-- 
Peter Geoghegan




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Thomas Munro  writes:
> FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> default reader/printer behaviour in small ways, including being case
> sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> structs for reasons that are lost AFAIK (there's a dark age between
> the commit history of the old Berkeley repo and our current repo, and
> it looks like plan nodes were still printed as #(NAME ...) at
> Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

regards, tom lane




Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-19 Thread Michael Paquier
On Fri, Sep 09, 2022 at 03:05:18PM -0700, Jacob Champion wrote:
> On Thu, Sep 8, 2022 at 5:46 PM Tom Lane  wrote:
>> Jacob Champion  writes:
>> > I think you're going to have to address backwards compatibility
>> > concerns. Today, I can create a role named "/a", and I can put that
>> > into the HBA without quoting it. I'd be unamused if, after an upgrade,
>> > my rule suddenly matched any role name containing an 'a'.
>>
>> Meh ... that concern seems overblown to me.  I guess it's possible
>> that somebody has an HBA entry that looks like that, but it doesn't
>> seem very plausible.  Note that we made this exact same change in
>> pg_ident.conf years ago, and AFAIR we got zero complaints.
> 
> What percentage of users actually use pg_ident maps? My assumption
> would be that a change to pg_hba would affect many more people, but
> then I don't have any proof that there are users with role names that
> look like that to begin with. I won't pound the table with it.

This concern does not sound overblown to me.  A change in pg_hba.conf
impacts everybody.  I was just looking at this patch, and the logic
with usernames and databases is changed so that we would *always* treat
*any* entries beginning with a slash as a regular expression, skipping
them if they don't match, with an error fed to the logs and
pg_hba_file_rules.  This could lead to silent security issues, as
stricter HBA policies need to be located first in pg_hba.conf and
these could suddenly fail to load.  It would be much safer to me if we
had in place some restrictions to prevent such problems from happening,
meaning some extra checks in the DDL code paths for such object names
and perhaps even something in pg_upgrade with a scan of pg_database
and pg_authid.

On the bright side, I have been looking at some of the RFCs covering
the set of characters allowed in DNS names and slashes are not
authorized in hostnames, making this change rather safe AFAIK.
--
Michael




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-19 Thread Nathan Bossart
On Thu, Sep 08, 2022 at 09:41:20AM -0400, Robert Haas wrote:
> Now on the other hand, I also do think we need more privilege bits.
> You're not alone in making the case that this is a problem which needs
> to be solved, and the set of other people who are also making that
> argument includes me. At the same time, there is certainly a double
> standard here. When Andrew and Tom committed
> d11e84ea466b4e3855d7bd5142fb68f51c273567 and
> a0ffa885e478f5eeacc4e250e35ce25a4740c487 respectively, we used up 2 of
> the remaining 4 bits, bits which other people would have liked to have
> used up years ago and they were told "no you can't." I don't believe I
> would have been willing to commit those patches without doing
> something to solve this problem, because I would have been worried
> about getting yelled at by Tom. But now here we are with only 2 bits
> left instead of 4, and we're telling the next patch author - who is
> not Tom - that he's on the hook to solve the problem.
> 
> Well, we do need to solve the problem. But we're not necessarily being
> fair about how the work involved gets distributed. It's a heck of a
> lot easier for a committer to get something committed to address this
> issue than a non-committer, and it's a heck of a lot easier for a
> committer to ignore the fact that the problem hasn't been solved and
> press ahead anyway, and yet somehow we're trying to dump a problem
> that's a decade in the making on Nathan. I'm not exactly sure what to
> propose as an alternative, but that doesn't seem quite fair.

Are there any concerns with simply expanding AclMode to 64 bits, as done in
v5 [0]?

[0] https://postgr.es/m/20220908055035.GA2100193%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Amit Kapila
On Tue, Sep 20, 2022 at 7:59 AM Tom Lane  wrote:
>
> Michel Pelletier  writes:
> > I would like to propose a discussion that in a future major release
> > Postgres switch from this custom format to JSON.
>
> There are certainly reasons to think about changing the node tree
> storage format; but if we change it, I'd like to see it go to something
> more compact not more verbose.  JSON doesn't fit our needs all that
> closely, so some things like bitmapsets would become a lot longer;
> and even where the semantics are pretty-much-the-same, JSON's
> insistence on details like quoting field names will add bytes.
> Perhaps making the physical storage be JSONB not JSON would help that
> pain point.  It's still far from ideal though.
>
> Maybe a compromise could be found whereby we provide a conversion
> function that converts whatever the catalog storage format is to
> some JSON equivalent.  That would address the needs of external
> code that doesn't want to write a custom parser, while not tying
> us directly to JSON.
>

I think the DDL deparsing stuff that is being discussed as a base for
DDL logical replication provides something like what you are saying
[1][2].

[1] - 
https://www.postgresql.org/message-id/CAFPTHDaqqGxqncAP42Z%3Dw9GVXDR92HN-57O%3D2Zy6tmayV2_eZw%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: Perform streaming logical transactions by background workers and parallel apply

2022-09-19 Thread shiy.f...@fujitsu.com
On Mon, Sept 19, 2022 11:26 AM Wang, Wei/王 威  wrote:
> 
> 
> Improved as suggested.
> 

Thanks for updating the patch. Here are some comments on the 0001 patch.

1.
+   case TRANS_LEADER_SERIALIZE:
 
-   oldctx = MemoryContextSwitchTo(ApplyContext);
+   /*
+* Notify handle methods we're processing a remote in-progress
+* transaction.
+*/
+   in_streamed_transaction = true;
 
-   MyLogicalRepWorker->stream_fileset = palloc(sizeof(FileSet));
-   FileSetInit(MyLogicalRepWorker->stream_fileset);
+   /*
+* Since no parallel apply worker is used for the first stream
+* start, serialize all the changes of the transaction.
+*
+* Start a transaction on stream start, this transaction will be


It seems that the following comment can be removed after using switch case.
+* Since no parallel apply worker is used for the first stream
+* start, serialize all the changes of the transaction.

2.
+   switch (apply_action)
+   {
+   case TRANS_LEADER_SERIALIZE:
+   if (!in_streamed_transaction)
+   ereport(ERROR,
+   (errcode(ERRCODE_PROTOCOL_VIOLATION),
+errmsg_internal("STREAM STOP message without STREAM START")));

In apply_handle_stream_stop(), I think we can move this check to the beginning of
this function, to be consistent with other functions.

3. I think some of the changes in the 0005 patch can be merged into the 0001
patch; the 0005 patch can then contain only the changes for the new column
'apply_leader_pid'.

4.
+ * ParallelApplyWorkersList. After successfully, launching a new worker it's
+ * information is added to the ParallelApplyWorkersList. Once the worker

Should `it's` be `its` ?

Regards
Shi yu


Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Peter Geoghegan  writes:
> Writing declarative @> containment queries against (say) a JSON
> variant of node tree format seems like it could be a huge quality of
> life improvement.

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else.  For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.
Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

regards, tom lane




Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Thomas Munro
On Tue, Sep 20, 2022 at 12:16 PM Michel Pelletier
 wrote:
> This non-standard format

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).  At some point it was probably exchanging data between the
Lisp and C parts of POSTGRES, and you could maybe sorta claim it's
based on an ANSI standard (Lisp!), but not with a straight face :-)




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-19 Thread Nathan Bossart
On Mon, Sep 19, 2022 at 03:16:42PM -0700, Nathan Bossart wrote:
> It seems like you want the opposite of pg_walfile_name_offset().  Perhaps
> we could add a function like pg_walfile_offset_lsn() that accepts a WAL
> file name and byte offset and returns the LSN.

Like so...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From f8ed45d9fa59690d8b3375d658c1baedb8510195 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 19 Sep 2022 20:24:10 -0700
Subject: [PATCH v1 1/1] introduce pg_walfile_offset_lsn()

---
 doc/src/sgml/func.sgml   | 14 +++
 src/backend/access/transam/xlogfuncs.c   | 39 
 src/include/catalog/pg_proc.dat  |  5 +++
 src/test/regress/expected/misc_functions.out | 19 ++
 src/test/regress/sql/misc_functions.sql  |  9 +
 5 files changed, 86 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 67eb380632..318ac22769 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25891,6 +25891,20 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560

   
 
+  
+   
+
+ pg_walfile_offset_lsn
+
+pg_walfile_offset_lsn ( file_name text, file_offset integer )
+pg_lsn
+   
+   
+Converts a WAL file name and byte offset within that file to a
+write-ahead log location.
+   
+  
+
   

 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 27aeb6e281..75f58cb118 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -313,6 +313,45 @@ pg_last_wal_replay_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Compute an LSN given a WAL file name and decimal byte offset.
+ */
+Datum
+pg_walfile_offset_lsn(PG_FUNCTION_ARGS)
+{
+	char	   *filename = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	int			offset = PG_GETARG_INT32(1);
+	TimeLineID	tli;
+	XLogSegNo	segno;
+	XLogRecPtr	result;
+	uint32		log;
+	uint32		seg;
+
+	if (!IsXLogFileName(filename))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid WAL file name \"%s\"", filename)));
+
+	sscanf(filename, "%08X%08X%08X", &tli, &log, &seg);
+	if (seg >= XLogSegmentsPerXLogId(wal_segment_size) ||
+		(log == 0 && seg == 0) ||
+		tli == 0)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid WAL file name \"%s\"", filename)));
+
+	if (offset < 0 || offset >= wal_segment_size)
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"offset\" must not be negative or greater than or "
+		"equal to WAL segment size")));
+
+	XLogFromFileName(filename, &tli, &segno, wal_segment_size);
+	XLogSegNoOffsetToRecPtr(segno, offset, wal_segment_size, result);
+
+	PG_RETURN_LSN(result);
+}
+
 /*
  * Compute an xlog file name and decimal byte offset given a WAL location,
  * such as is returned by pg_backup_stop() or pg_switch_wal().
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..224fe590ac 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6328,6 +6328,11 @@
   proargtypes => 'pg_lsn', proallargtypes => '{pg_lsn,text,int4}',
   proargmodes => '{i,o,o}', proargnames => '{lsn,file_name,file_offset}',
   prosrc => 'pg_walfile_name_offset' },
+{ oid => '8205',
+  descr => 'wal location, given a wal filename and byte offset',
+  proname => 'pg_walfile_offset_lsn', prorettype => 'pg_lsn',
+  proargtypes => 'text int4', proargnames => '{file_name,file_offset}',
+  prosrc => 'pg_walfile_offset_lsn' },
 { oid => '2851', descr => 'wal filename, given a wal location',
   proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn',
   prosrc => 'pg_walfile_name' },
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..b2d6f23ac1 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,22 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+-- pg_walfile_offset_lsn
+SELECT pg_walfile_offset_lsn('invalid', 15);
+ERROR:  invalid WAL file name "invalid"
+SELECT pg_walfile_offset_lsn('0001', 15);
+ERROR:  invalid WAL file name "0001"
+SELECT pg_walfile_offset_lsn('0001', 15);
+ERROR:  invalid WAL file name "0001"
+SELECT pg_walfile_offset_lsn('0001', 15);
+ERROR:  invalid WAL file name "0001"
+SELECT pg_walfile_offset_lsn('00010001', -1);
+ERROR:  "offset" must not be negative or greater than or equal to WAL segment size
+SELECT pg_walfile_offset_lsn('00010001', 20);
+ERROR:  "offset" must not be 
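
For reference, a quick usage sketch of the function as posted, with an
illustrative file name and offset (and assuming the default 16MB WAL
segment size):

  SELECT pg_walfile_offset_lsn('000000010000000000000001', 1024);
   pg_walfile_offset_lsn
  -----------------------
   0/1000400
  (1 row)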

Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 7:29 PM Tom Lane  wrote:
> Maybe a compromise could be found whereby we provide a conversion
> function that converts whatever the catalog storage format is to
> some JSON equivalent.  That would address the needs of external
> code that doesn't want to write a custom parser, while not tying
> us directly to JSON.

That seems like a perfectly good solution, as long as it can be done
in a way that doesn't leave consumers of the JSON output at any kind
of disadvantage.

I find the current node tree format ludicrously verbose, and generally
hard to work with. But it's not the format itself, really -- that's
not the problem. The underlying data structures are typically very
information dense. So having an output format that's a known quantity
sounds very valuable to me.

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement.  It will make the output format even more verbose,
but that might not matter in the same way as it does right now.
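As a purely illustrative sketch (pg_node_tree_to_jsonb() below is a
made-up name, and the field names are speculative), such a query might
look something like:

  SELECT c.relname, r.rulename
  FROM pg_rewrite r JOIN pg_class c ON c.oid = r.ev_class
  WHERE pg_node_tree_to_jsonb(r.ev_action) @> '[{"commandType": "select"}]';

with jsonpath operators probably being the better tool for matching
nodes nested deeper in the tree.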

-- 
Peter Geoghegan




Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 08:51:37PM -0400, James Coleman wrote:
> Yes, fixed.

The CF bot is failing compilation on Windows:
http://commitfest.cputube.org/james-coleman.html
https://api.cirrus-ci.com/v1/task/5376566577332224/logs/build.log

There is something going on with noreturn() after applying it to
elog.h:
[01:11:00.468] c:\cirrus\src\include\utils\elog.h(410,45): error C2085:
'ThrowErrorData': not in formal parameter list (compiling source file
src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj]
[01:11:00.468] c:\cirrus\src\include\mb\pg_wchar.h(701,80): error
C2085: 'pgwin32_message_to_UTF16': not in formal parameter list
(compiling source file src/common/encnames.c)
[c:\cirrus\libpgcommon.vcxproj] 
[01:11:00.468] c:\cirrus\src\include\utils\elog.h(411,54): error
C2085: 'pg_re_throw': not in formal parameter list (compiling source
file src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj] 

align() seems to look fine, at least.  I'd be tempted to apply the
align part first, as that's independently useful for itemptr.h.
--
Michael




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Jonathan S. Katz

On 9/19/22 4:52 PM, Jonathan S. Katz wrote:

On 9/19/22 11:16 AM, Alvaro Herrera wrote:



This seems a pretty arbitrary restriction.  It feels like you're adding
this restriction precisely so that you don't have to write the code to
reject the ALTER .. SET SCHEMA if an incompatible configuration is
detected.  But we already have such checks in several cases, so I don't
see why this one does not seem a good idea.

The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
aspects.  Others have already commented about the syntax, which is
unlike what GRANT uses; I'm also surprised that we've gotten away with
it being superuser-only.  Why are we building more superuser-only
features in this day and age?  I think not even FOR ALL TABLES should
require superuser.


FYI, I've added this to the PG15 open items as there are some open 
questions to resolve in this thread.


(Replying personally, not RMT).

I wanted to enumerate the concerns raised in this thread in the context 
of the open item to understand what needs to be addressed, and also give 
an opinion. I did read up on the original thread to better understand 
context around decisions.


I believe the concerns are these 3 things:

1. Allowing calls that have "ALL TABLES IN SCHEMA" that include calls to 
specific tables in schema
2. The syntax of the "ALL TABLES IN SCHEMA" and comparing it to similar 
behaviors in PostgreSQL

3. Adding on an additional "superuser-only" feature

For #1 (allowing calls that have schema/table overlap...), there appears 
to be both a patch that allows this (reversing[8]), and a suggestion for 
dealing with a corner-case that is reasonable, i.e. disallowing adding 
schemas to a publication when specifying column-lists. Do we think we 
can have consensus on this prior to the RC1 freeze?


For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on 
the original thread[1][3][4][5][7]. I thought Tom's proposal on the 
syntax[3] was reasonable as it "future proofs" for when we allow other 
schema-scoped objects to be published and give control over which ones 
can be published.


The bigger issue seems to be around the behavior in regards to the 
syntax. The current behavior is that when one specifies "ALL TABLES IN 
SCHEMA", any future tables created in that schema are added to the 
publication. While folks tried to find parallels to GRANT[6], I think 
this actually resembles how we handle partitions that are 
published[9][10], i.e.:


"When a partitioned table is added to a publication, all of its existing 
and future partitions are implicitly considered to be part of the 
publication."[10]


Additionally, this is the behavior that is already present in "FOR ALL 
TABLES":


"Marks the publication as one that replicates changes for all tables in 
the database, including tables created in the future."[10]


I don't think we should change this behavior that's already in logical 
replication. While I understand the reasons why "GRANT ... ALL TABLES IN 
SCHEMA" has a different behavior (i.e. it's not applied to future 
objects) and do not advocate to change it, I have personally been 
affected where I thought a permission would be applied to all future 
objects, only to discover otherwise. I believe it's more intuitive to 
think that "ALL" applies to "everything, always."
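As a concrete illustration of that behavior (schema and table names 
here are just examples), with the syntax currently in the tree:

  CREATE PUBLICATION sales_pub FOR ALL TABLES IN SCHEMA sales;
  CREATE TABLE sales.orders_2023 (id int PRIMARY KEY);  -- implicitly published by sales_pub

whereas a GRANT ... ON ALL TABLES IN SCHEMA sales only affects the 
tables that exist at the time the GRANT is run.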


For #3 (more superuser-only), in general I do agree that we shouldn't be 
adding more of these. However, we have in this release, and not just to 
this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I 
think it's easier for us to "relax" privileges (e.g. w/predefined roles) 
than to make something "superuser-only" in the future, so I don't see 
this being a blocker for v15. The feature will continue to work for 
users even if we remove "superuser-only" in the future.


To summarize:

1. I do think we should fix the issue that Peter originally brought up 
in this thread before v15. That is an open item.
2. I don't see why we need to change the syntax/behavior, I think that 
will make this feature much harder to use.
3. I don't think we need to change the superuser requirement right now, 
but we should do that for a future release.


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/CAFiTN-u_m0cq7Rm5Bcu9EW4gSHG94WaLuxLfibwE-o7%2BLea2GQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/C4D04B90-AC4D-42A7-B93C-4799CE96%40enterprisedb.com

[3] https://www.postgresql.org/message-id/155565.1628954580%40sss.pgh.pa.us
[4] 
https://www.postgresql.org/message-id/CAHut%2BPvNwzp-EdtsDNazwrNrV4ziqCovNdLywzOJKSy52LvRjw%40mail.gmail.com
[5] 
https://www.postgresql.org/message-id/CAHut%2BPt6Czj0KsE0ip6nMsPf4FatHgNDni-wSu2KkYNYF9mDAw%40mail.gmail.com
[6] 
https://www.postgresql.org/message-id/CAA4eK1Lwtea0St1MV5nfSg9FrFeU04YKpHvhQ0i4W-tOBw%3D9Qw%40mail.gmail.com
[7] 
https://www.postgresql.org/message-id/202109241325.eag5g6mpvoup@alvherre.pgsql
[8] 

Re: Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Tom Lane
Michel Pelletier  writes:
> I would like to propose a discussion that in a future major release
> Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose.  JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point.  It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent.  That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.

regards, tom lane




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread Peter Smith
I had quick look at the latest v11-0001 patch differences from v10.

Here are some initial comments:

==

1. Commit message

It looks like some small mistake happened. You wrote [1] that my
previous review comments about the commit message were fixed, but it
seems the v11 commit message is unchanged since v10.

==

2. src/backend/replication/logical/relation.c -
GenerateDummySelectPlannerInfoForRelation

+/*
+ * This is not a generic function, helper function for
+ * GetCheapestReplicaIdentityFullPath. The function creates
+ * a dummy PlannerInfo for the given relationId as if the
+ * relation is queried with SELECT command.
+ */
+static PlannerInfo *
+GenerateDummySelectPlannerInfoForRelation(Oid relationId)

"generic function, helper function" -> "generic function. It is a
helper function"


--
[1] 
https://www.postgresql.org/message-id/CACawEhXnTcXBOTofptkgSBOyD81Pohd7MSfFaW0SKo-0oKrCJg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [RFC] building postgres with meson - v13

2022-09-19 Thread Andres Freund
Hi,

On 2022-09-19 05:25:59 -0400, Peter Eisentraut wrote:
> IMO, the following commits are ready to be pushed now:

Slowly working through them.


To have some initial "translation" for other developers I've started a wiki
page with a translation table. Still very WIP:
https://wiki.postgresql.org/wiki/Meson

For now, a bit of polishing aside, I'm just planning to add a minimal
explanation of what's happening, and a reference to this thread.


> b7d7fe009731 Remove DLLTOOL, DLLWRAP from configure / Makefile.global.in
> 979f26889544 Don't hardcode tmp_check/ as test directory for tap tests
> 9fc657fbb7e2 Split TESTDIR into TESTLOGDIR and TESTDATADIR
> 6de8f1de0ffa meson: prereq: Extend gendef.pl in preparation for meson
> 5a9731dcc2e6 meson: prereq: port: Include c.h instead of postgres.h in 
> *p{read,write}*.c

Done


> 7054861f0fef meson: prereq: Add src/tools/gen_export.pl

This one I'm planning to merge with the "main" commit, given there's no other 
user.

Greetings,

Andres Freund




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Amit Kapila
On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera  wrote:
>
> > diff --git a/doc/src/sgml/logical-replication.sgml 
> > b/doc/src/sgml/logical-replication.sgml
> > index 1ae3287..0ab768d 100644
> > --- a/doc/src/sgml/logical-replication.sgml
> > +++ b/doc/src/sgml/logical-replication.sgml
> > @@ -1120,6 +1120,11 @@ test_sub=# SELECT * FROM child ORDER BY a;
> >
> >
> >
> > +   Specifying a column list when the publication also publishes
> > +   FOR ALL TABLES IN SCHEMA is not supported.
> > +  
> > +
> > +  
> > For partitioned tables, the publication parameter
> > publish_via_partition_root determines which column 
> > list
> > is used. If publish_via_partition_root is
> >
> > diff --git a/doc/src/sgml/ref/create_publication.sgml 
> > b/doc/src/sgml/ref/create_publication.sgml
> > index 0a68c4b..0ced7da 100644
> > --- a/doc/src/sgml/ref/create_publication.sgml
> > +++ b/doc/src/sgml/ref/create_publication.sgml
> > @@ -103,17 +103,17 @@ CREATE PUBLICATION  > class="parameter">name
> >   
> >
> >   
> > +  Specifying a column list when the publication also publishes
> > +  FOR ALL TABLES IN SCHEMA is not supported.
> > + 
> >
> > @@ -733,6 +694,24 @@ CheckPubRelationColumnList(List *tables, const char 
> > *queryString,
> >   continue;
> >
> >   /*
> > +  * Disallow specifying column list if any schema is in the
> > +  * publication.
> > +  *
> > +  * XXX We could instead just forbid the case when the 
> > publication
> > +  * tries to publish the table with a column list and a schema 
> > for that
> > +  * table. However, if we do that then we need a restriction 
> > during
> > +  * ALTER TABLE ... SET SCHEMA to prevent such a case which 
> > doesn't
> > +  * seem to be a good idea.
> > +  */
> > + if (publish_schema)
> > + ereport(ERROR,
> > + 
> > errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("cannot use publication column 
> > list for relation \"%s.%s\"",
> > +
> > get_namespace_name(RelationGetNamespace(pri->relation)),
> > +
> > RelationGetRelationName(pri->relation)),
> > + errdetail("Column list cannot be 
> > specified if any schema is part of the publication or specified in the 
> > list."));
> > +
>
> This seems a pretty arbitrary restriction.  It feels like you're adding
> this restriction precisely so that you don't have to write the code to
> reject the ALTER .. SET SCHEMA if an incompatible configuration is
> detected.  But we already have such checks in several cases, so I don't
> see why this one does not seem a good idea.
>

I agree that we have such checks at other places as well and one
somewhat similar is in ATPrepChangePersistence().

ATPrepChangePersistence()
{
...
...
/*
 * Check that the table is not part of any publication when changing to
 * UNLOGGED, as UNLOGGED tables can't be published.
 */

However, another angle to look at it is that we try to avoid adding
restrictions in other DDL commands for defined publications. I am not
sure but it appears to me Peter E. is not in favor of restrictions in
other DDLs. I think we don't have a strict rule in this regard, so we
are trying to see what makes the most sense based on feedback and do
it accordingly.

> The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
> aspects.  Others have already commented about the syntax, which is
> unlike what GRANT uses; I'm also surprised that we've gotten away with
> it being superuser-only.  Why are we building more superuser-only
> features in this day and age?  I think not even FOR ALL TABLES should
> require superuser.
>

The intention was to be in sync with FOR ALL TABLES.

-- 
With Regards,
Amit Kapila.




RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread kuroda.hay...@fujitsu.com
Dear Önder,

Thanks for updating the patch! I will check it later.
Currently I just reply to your comments.

> Hmm, I couldn't realize this comment earlier. So you suggest "slow" here 
> refers to the additional function call "GetRelationIdentityOrPK"? If so, yes 
> I'll update that.

Yes I meant to say that, because functions will be called like:

GetRelationIdentityOrPK() -> RelationGetPrimaryKeyIndex() -> 
RelationGetIndexList() -> ..

and according to comments last one seems to do the heavy lifting.


> Makes sense, simplified the function. Though, it is always hard to pick good 
> names for these kinds of helper functions. I picked 
> GenerateDummySelectPlannerInfoForRelation(), does that sound good to you as 
> well?

I could not find any better naming than yours. 

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:26:34PM +0530, Bharath Rupireddy wrote:
> On Mon, Sep 19, 2022 at 2:38 PM Fujii Masao  
> wrote:
> Fixed. I believed that the regression tests cover pg_backup_start()
> and pg_backup_stop(), and relied on make check-world, surprisingly
> there's no test that covers these functions. Is it a good idea to add
> tests for these functions in misc_functions.sql or backup.sql or
> somewhere so that they get run as part of make check? Thoughts?

The main regression test suite should not include direct calls to
pg_backup_start() or pg_backup_stop() as these depend on wal_level,
and we spend a certain amount of resources in keeping the tests as
portable as possible across such configurations, wal_level = minimal
being one of them.  One portable example is in 001_stream_rep.pl.

> That's a good idea. I'm marking a flag if the label name overflows (>
> MAXPGPATH), later using it in do_pg_backup_start() to error out. We
> could've thrown error directly, but that changes the behaviour - right
> now, first "
> wal_level must be set to \"replica\" or \"logical\" at server start."
> gets emitted and then label name overflow error - I don't want to do
> that.

-   if (strlen(backupidstr) > MAXPGPATH)
+   if (state->name_overflowed == true)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("backup label too long (max %d bytes)",
It does not strike me as a huge issue to force a truncation of such
backup label names.  1024 is large enough for basically all users,
in my opinion.  Or you could just truncate in the internal logic, but
still complain at MAXPGPATH - 1 as the last byte would be for the zero
termination.  In short, there is no need to complicate things with
name_overflowed.

>> In v7 patch, since pg_backup_stop() calls build_backup_content(),
>> backup_label and history_file seem not to be carried from do_pg_backup_stop()
>> to pg_backup_stop(). This makes me still think that it's better not to 
>> include
>> them in BackupState...
> 
> I'm a bit worried about the backup state being spread across if we
> separate out backup_label and history_file from BackupState and keep
> tablespace_map contents there. As I said upthread, we are not
> allocating memory for them at the beginning, we allocate only when
> needed. IMO, this code is readable and more extensible.

+   StringInfo  backup_label;   /* backup_label file contents */
+   StringInfo  tablespace_map; /* tablespace_map file contents */
+   StringInfo  history_file;   /* history file contents */
IMV, repeating a point I already made once upthread, BackupState
should hold none of these.  Let's just generate the contents of these
files in the contexts where they are needed, making BackupState
something to rely on to build them in the code paths where they are
necessary.  This is going to make the reasoning around the memory
contexts where each one of them is stored much easier and reduce the
chances of bugs in the long term.

> I've also taken help of the error callback mechanism to clean up the
> allocated memory in case of a failure. For do_pg_abort_backup() cases,
> I think we don't need to clean the memory because that gets called on
> proc exit (before_shmem_exit()).

Memory could still bloat while the process running the SQL functions
is running depending on the error code path, anyway.
--
Michael




Re: pg_create_logical_replication_slot argument incongruency

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 07:02:16PM +0200, Florin Irion wrote:
> This was introduced in commit 19890a06.
> 
> IMHO we should use the documented argument name `two_phase` and change the
> function to accept it.
> 
> What do you think?

19890a0 is included in REL_14_STABLE, and changing an argument name is
not acceptable in a stable branch as it would imply a catversion
bump.  Let's change the docs so that we document the parameter as
"twophase", instead.
--
Michael




Re: pg_upgrade test failure

2022-09-19 Thread Justin Pryzby
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> Hi,
> 
> After my last rebase of the meson tree I encountered the following test
> failure:
> 
> https://cirrus-ci.com/task/5532444261613568
> 
> [20:23:04.171] - 8< 
> -
> [20:23:04.171] stderr:
> [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
> pg_upgrade --check success'
> [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> 249.
> [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after pg_upgrade 
> success'
> [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> 261.
> [20:23:04.171] # Looks like you failed 2 tests of 13.
> 
> regress_log:
> https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
> 
> The pg_upgrade output contains these potentially relevant warnings:
> 
> ...
> *Clusters are compatible*
> pg_upgrade: warning: could not remove file or directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
>  Directory not empty
> pg_upgrade: warning: could not remove file or directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
>  Directory not empty
> ...

It looks like it failed to remove a *.log file under windows, which
caused rmtree to fail.

src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK   
"pg_upgrade_dump_%u.log"
src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE 
"pg_upgrade_server.log"
src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE
"pg_upgrade_utility.log"
src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE   
"pg_upgrade_internal.log"

ee5353abb only changed .txt files used for errors so can't be the cause,
but the original commit 38bfae3 might be related.

I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
didn't warn about the file itself, but then failed one moment later in
rmdir.

-- 
Justin




Re: pg_upgrade test failure

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote:
> I don't really see what'd race with what here? pg_upgrade has precise control
> over what's happening here, no?

A code path could have forgotten an fclose() for example, but this code
is rather old and close-proof as far as I know.  Most of the log files
are used with redirections for external calls, though I don't see
how these could still be held after pg_upgrade finishes :/
Could the use of meson somehow influence things when running tests on Windows?

> I've only seen it once so far, but there haven't been many CI runs of the
> meson branch since rebasing ontop of the last changes to pg_upgrade.

Hmm, okay.  Is that a specific branch in one of your public repos?
--
Michael




Re: Reducing the chunk header sizes on all memory context types

2022-09-19 Thread Tom Lane
David Rowley  writes:
> Aside from that, I don't have any ideas on how to get rid of the
> possible additional datumCopy() from non-Var arguments to these window
> functions.  Should we just suffer it? It's quite likely that most
> arguments to these functions are plain Vars anyway.

No, we shouldn't.  I'm pretty sure that we have various window
functions that are deliberately designed to take advantage of the
no-copy behavior, and that they have taken a significant speed
hit from your having disabled that optimization.  I don't say
that this is enough to justify reverting the chunk header changes
altogether ... but I'm completely not satisfied with the current
situation in HEAD.

regards, tom lane




Re: pg_upgrade test failure

2022-09-19 Thread Andres Freund
Hi,

On 2022-09-20 10:08:41 +0900, Michael Paquier wrote:
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > I don't know if actually related to the commit below, but there've been a
> > lot of runs of the pg_upgrade tests in the meson branch, and this is the 
> > first
> > failure of this kind. Unfortunately the error seems to be transient -
> > rerunning the tests succeeded.
> 
> This smells to me like a race condition in pg_upgrade (or even pg_ctl
> for SERVER_LOG_FILE) where the code still has handles on some of the
> files in the log/ subdirectory, causing its removal to not be able to
> finish.

I don't really see what'd race with what here? pg_upgrade has precise control
over what's happening here, no?


> If this proves to be rather easy to reproduce, giving
> a list of the files still present in this path would give a hint easy
> to follow.  Does this reproduce with a good frequency?

I've only seen it once so far, but there haven't been many CI runs of the
meson branch since rebasing ontop of the last changes to pg_upgrade.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-09-19 Thread David Rowley
On Tue, 13 Sept 2022 at 20:27, David Rowley  wrote:
> I see that one of the drawbacks from not using MemoryContextContains()
> is that window functions such as lead(), lag(), first_value(),
> last_value() and nth_value() may now do the datumCopy() when it's not
> needed. For example, with a window function call such as
> lead(byref_col ), the expression evaluation code in
> WinGetFuncArgInPartition() will just return the address in the
> tuplestore tuple for "byref_col".  The datumCopy() is needed for that.
> However, if the function call was lead(byref_col || 'something') then
> we'd have ended up with a new allocation in CurrentMemoryContext to
> concatenate the two values.  We'll now do a datumCopy() where we
> previously wouldn't have. I don't really see any way around that
> without doing some highly invasive surgery to the expression
> evaluation code.

It feels like a terrible idea, but  I wondered if we could look at the
WindowFunc->args and make a decision if we should do the datumCopy()
based on the type of the argument. Vars would need to be copied as
they will point into the tuple's memory, but an OpExpr likely would
not need to be copied.

Aside from that, I don't have any ideas on how to get rid of the
possible additional datumCopy() from non-Var arguments to these window
functions.  Should we just suffer it? It's quite likely that most
arguments to these functions are plain Vars anyway.
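
To illustrate the two shapes (hypothetical table and column names, only
to show where the copy is and isn't avoidable):

-- plain Var argument: the returned datum points into the tuplestore
-- tuple, so the datumCopy() is genuinely needed
SELECT lead(byref_col) OVER (ORDER BY id) FROM tab;

-- non-Var argument: the expression result is already freshly allocated
-- in CurrentMemoryContext, so the datumCopy() is the new extra cost
SELECT lead(byref_col || 'something') OVER (ORDER BY id) FROM tab;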

David




Re: pg_upgrade test failure

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> I don't know if actually related to the commit below, but there've been a
> lot of runs of the pg_upgrade tests in the meson branch, and this is the first
> failure of this kind. Unfortunately the error seems to be transient -
> rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl
for SERVER_LOG_FILE) where the code still has handles on some of the
files in the log/ subdirectory, causing its removal to not be able to
finish.  If this proves to be rather easy to reproduce, giving
a list of the files still present in this path would give a hint easy
to follow.  Does this reproduce with a good frequency?
--
Michael




Re: Support tls-exporter as channel binding for TLSv1.3

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 09:27:41AM -0700, Jacob Champion wrote:
> While looking into this I noticed that I left the following code in place:
> 
>> #ifdef HAVE_BE_TLS_GET_CERTIFICATE_HASH
>> if (strcmp(selected_mech, SCRAM_SHA_256_PLUS_NAME) == 0 && 
>> port->ssl_in_use)
> 
> In other words, we're still deciding whether to advertise -PLUS based
> only on whether we support tls-server-end-point.

X509_get_signature_nid() has been introduced in 1.0.2.
SSL_export_keying_material() is older than that, being present since
1.0.1.  Considering the fact that we want to always have
tls-server-end-point as default, it seems to me that we should always
publish SCRAM-SHA-256-PLUS and support channel binding when building
with OpenSSL >= 1.0.2.  The same can be said about the areas where we
have HAVE_BE_TLS_GET_[PEER_]CERTIFICATE_HASH.

There could be a point in supporting tls-exporter as default in 1.0.1,
or just allow it if the caller gives it in the connection string, but
as that's the next version we are going to drop support for (sooner
than later would be better IMO), I don't really want to add more
maintenance burden in this area as 1.0.1 is not that used anyway as
far as I recall.

> Maybe all the necessary features landed in OpenSSL in the same
> version, but I haven't double-checked that, and in any case I think
> I need to make this code more correct in the next version of this
> patch.

I was planning to continue working on this patch based on your latest
review.  Anyway, as that's originally your code, I am fine to let you
take the lead here.  Just let me know which way you prefer, and I'll
stick to it :)
--
Michael




Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread James Coleman
On Mon, Sep 19, 2022 at 8:21 PM Michael Paquier  wrote:
>
> On Mon, Sep 19, 2022 at 06:21:58PM -0400, James Coleman wrote:
> > It turns out that MSVC supports both noreturn [2] [3] and alignment
> > [4] [5] attributes, so this patch adds support for those. MSVC also
> > supports a form of packing, but the implementation as far as I can tell
> > requires wrapping the entire struct (with a push/pop declaration set)
> > [6], which doesn't seem to match the style of macros we're using for
> > packing in other compilers, so I opted not to implement that
> > attribute.
>
> Interesting.  Thanks for the investigation.
>
> +/*
> + * MSVC supports aligned and noreturn
> + * Packing is also possible but only by wrapping the entire struct definition
> + * which doesn't fit into our current macro declarations.
> + */
> +#elif defined(_MSC_VER)
> +#define pg_attribute_aligned(a) __declspec(align(a))
> +#define pg_attribute_noreturn() __declspec(noreturn)
>  #else
> Nit: I think that the comment should be in the elif block for Visual.

I was following the style of the comment outside the "if", but I'm not
attached to that style, so changed in this version.

> pg_attribute_aligned() could be used in generic-msvc.h's
> pg_atomic_uint64 as it uses now align.

Added.
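
For reference, roughly what that change might look like (sketch only; the
actual struct in generic-msvc.h may differ in detail):

/* src/include/port/atomics/generic-msvc.h, sketched with the new macro */
typedef struct pg_attribute_aligned(8) pg_atomic_uint64
{
    volatile uint64 value;
} pg_atomic_uint64;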

> Shouldn't HAVE_PG_ATTRIBUTE_NORETURN be set for the MSVC case as well?

Yes, fixed.

James Coleman


v2-0001-Support-pg_attribute_aligned-and-noreturn-in-MSVC.patch
Description: Binary data


Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:21:58PM -0400, James Coleman wrote:
> It turns out that MSVC supports both noreturn [2] [3] and alignment
> [4] [5] attributes, so this patch adds support for those. MSVC also
> supports a form of packing, but the implementation as I can tell
> requires wrapping the entire struct (with a push/pop declaration set)
> [6], which doesn't seem to match the style of macros we're using for
> packing in other compilers, so I opted not to implement that
> attribute.

Interesting.  Thanks for the investigation.

+/*
+ * MSVC supports aligned and noreturn
+ * Packing is also possible but only by wrapping the entire struct definition
+ * which doesn't fit into our current macro declarations.
+ */
+#elif defined(_MSC_VER)
+#define pg_attribute_aligned(a) __declspec(align(a))
+#define pg_attribute_noreturn() __declspec(noreturn)
 #else
Nit: I think that the comment should be in the elif block for Visual.

pg_attribute_aligned() could be used in generic-msvc.h's
pg_atomic_uint64 as it uses now align.

Shouldn't HAVE_PG_ATTRIBUTE_NORETURN be set for the MSVC case as well?
--
Michael




Proposal to use JSON for Postgres Parser format

2022-09-19 Thread Michel Pelletier
Hello hackers,

As noted in the source:

https://github.com/postgres/postgres/blob/master/src/include/nodes/pg_list.h#L6-L11

 * Once upon a time, parts of Postgres were written in Lisp and used real
 * cons-cell lists for major data structures.  When that code was rewritten
 * in C, we initially had a faithful emulation of cons-cell lists, which
 * unsurprisingly was a performance bottleneck.  A couple of major rewrites
 * later, these data structures are actually simple expansible arrays;
 * but the "List" name and a lot of the notation survives.

The Postgres parser format as described in the wiki page:

https://wiki.postgresql.org/wiki/Query_Parsing

looks almost, but not quite, entirely like JSON:

SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
   (
  {SELECT
  :distinctClause <>
  :intoClause <>
  :targetList (
 {RESTARGET
 :name <>
 :indirection <>
 :val
{COLUMNREF
:fields (
   {A_STAR
   }
)
:location 7
}
 :location 7
 }
  )
  :fromClause (
 {RANGEVAR
 :schemaname <>
 :relname foo
 :inhOpt 2
 :relpersistence p
 :alias <>
 :location 14
 }
  )
  ... and so on
   )

This non-standard format is useful for visual inspection and perhaps
simple parsing.  Parsers that do exist for it are generally specific
to some languages.  If there were a standard way to parse queries,
tools like code generators and analysis tools can work with a variety
of libraries that already handle JSON quite well.  Future potential
would include exposing this data to command_ddl_start event triggers.
Providing a JSON Schema would also aid tools that want to validate or
transform the json with rule based systems.

I would like to propose a discussion that, in a future major release,
Postgres switch from this custom format to JSON.  The current format in
question is generated from macros and functions found in
`src/backend/nodes/readfuncs.c` and `src/backend/nodes/outfuncs.c`, and
converting them to emit valid JSON would be relatively
straightforward.
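
For example, the targetList fragment above might come out roughly as
follows (purely illustrative -- the exact key names, nesting and null
handling would still need to be worked out):

{
  "SELECT": {
    "targetList": [
      { "RESTARGET": {
          "name": null,
          "indirection": null,
          "val": { "COLUMNREF": { "fields": [ { "A_STAR": {} } ], "location": 7 } },
          "location": 7 } }
    ]
  }
}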

One downside would be that this would not be a forward-compatible
binary change across releases.  Since it is unlikely that very much
code is reliant on this custom format, this would not be a huge problem
for most.

Thoughts?

-Michel


Re: Silencing the remaining clang 15 warnings

2022-09-19 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Sep 20, 2022 at 7:20 AM Tom Lane  wrote:
>> * xlog.c's AdvanceXLInsertBuffer has a local variable "npages"
>> that is only read in the "#ifdef WAL_DEBUG" stanza at the
>> bottom.  Here I've done the rather ugly and brute-force thing
>> of wrapping all the variable's references in "#ifdef WAL_DEBUG".
>> (I tried marking it PG_USED_FOR_ASSERTS_ONLY, but oddly that
>> did not silence the warning.)  I kind of wonder how useful this
>> function's WAL_DEBUG output is --- maybe just dropping that
>> altogether would be better?

> No opinion on the value of the message, but maybe
> pg_attribute_unused() would be better?

I realized that the reason PG_USED_FOR_ASSERTS_ONLY didn't help
is that it expands to empty in an assert-enabled build, which is
what I was testing.  So yeah, using pg_attribute_unused() directly
would probably work better.
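
For the npages case that would look something like this (assuming clang
honors the unused attribute in that spot the same way gcc does):

	int			npages pg_attribute_unused() = 0;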

(Also, I tried an assert-disabled build and found one additional new
warning; same deal where clang doesn't believe "foo++;" is reason to
consider foo to be used.  That one, PG_USED_FOR_ASSERTS_ONLY can fix.)

regards, tom lane




Re: Silencing the remaining clang 15 warnings

2022-09-19 Thread Thomas Munro
On Tue, Sep 20, 2022 at 7:20 AM Tom Lane  wrote:
> * With %pure-parser, Bison makes the "yynerrs" variable local
> instead of static, and then if you don't use it clang notices
> that it's set but never read.  There doesn't seem to be a way
> to persuade Bison not to emit the variable at all, so here I've
> just added "(void) yynerrs;" to the topmost production of each
> affected grammar.  If anyone has a nicer idea, let's hear it.

+1.  FTR they know about this:

https://www.mail-archive.com/bison-patches@gnu.org/msg07836.html
https://github.com/akimd/bison/commit/a166d5450e3f47587b98f6005f9f5627dbe21a5b

... but that YY_ATTRIBUTE_UNUSED hasn't landed in my systems'
/usr[/local]/share/bison/skeletons/yacc.c yet and it seems harmless
and also legitimate to reference yynerrs from an action.

> * xlog.c's AdvanceXLInsertBuffer has a local variable "npages"
> that is only read in the "#ifdef WAL_DEBUG" stanza at the
> bottom.  Here I've done the rather ugly and brute-force thing
> of wrapping all the variable's references in "#ifdef WAL_DEBUG".
> (I tried marking it PG_USED_FOR_ASSERTS_ONLY, but oddly that
> did not silence the warning.)  I kind of wonder how useful this
> function's WAL_DEBUG output is --- maybe just dropping that
> altogether would be better?

No opinion on the value of the message, but maybe
pg_attribute_unused() would be better?

> * array_typanalyze.c's compute_array_stats counts the number
> of null arrays in the column, but then does nothing with the
> result.  AFAICS this is redundant with what std_compute_stats
> will do, so I just removed the variable.

+1




Support pg_attribute_aligned and noreturn in MSVC

2022-09-19 Thread James Coleman
Over in the "Add last commit LSN to pg_last_committed_xact()" [1]
thread this patch had been added as a precursor, but Michael Paquier
suggested it be broken out separately, so I'm doing that here.

It turns out that MSVC supports both noreturn [2] [3] and alignment
[4] [5] attributes, so this patch adds support for those. MSVC also
supports a form of packing, but the implementation as far as I can tell
requires wrapping the entire struct (with a push/pop declaration set)
[6], which doesn't seem to match the style of macros we're using for
packing in other compilers, so I opted not to implement that
attribute.

James Coleman

1: https://www.postgresql.org/message-id/Yk6UgCGlZKuxRr4n%40paquier.xyz
2: 2008+ 
https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2008/k6ktzx3s(v=vs.90)
3. 2015+ https://learn.microsoft.com/en-us/cpp/c-language/noreturn?view=msvc-140
4. 2008+ 
https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2008/dabb5z75(v=vs.90)
5. 2015+ https://learn.microsoft.com/en-us/cpp/cpp/align-cpp?view=msvc-170
6. https://learn.microsoft.com/en-us/cpp/preprocessor/pack?view=msvc-170


v1-0001-Support-pg_attribute_aligned-and-noretur-in-MSVC.patch
Description: Binary data


Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-19 Thread Nathan Bossart
On Mon, Sep 19, 2022 at 07:26:57PM +0530, Bharath Rupireddy wrote:
> We have a bunch of messages [1] that have an offset, but not LSN in
> the error message. Firstly, is there an easiest way to figure out LSN
> from offset reported in the error messages? If not, is adding LSN to
> these messages along with offset a good idea? Of course, we can't just
> convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but
> something meaningful like reporting the LSN of the page that we are
> reading-in or writing-out etc.

It seems like you want the opposite of pg_walfile_name_offset().  Perhaps
we could add a function like pg_walfile_offset_lsn() that accepts a WAL
file name and byte offset and returns the LSN.
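
Usage could then look something like this (entirely hypothetical, since
the function does not exist yet; the name, signature, and result are only
illustrative):

-- map a WAL file name plus a byte offset back to an LSN
SELECT pg_walfile_offset_lsn('000000010000000000000004', 1024);
-- would return 0/4000400 with the default 16MB segment size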

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




pg_upgrade test failure

2022-09-19 Thread Andres Freund
Hi,

After my last rebase of the meson tree I encountered the following test
failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] - 8< 
-
[20:23:04.171] stderr:
[20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
pg_upgrade --check success'
[20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after pg_upgrade 
success'
[20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
 Directory not empty
pg_upgrade: warning: could not remove file or directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
 Directory not empty
...


I don't know if actually related to the commit below, but there've been a
lot of runs of the pg_upgrade tests in the meson branch, and this is the first
failure of this kind. Unfortunately the error seems to be transient -
rerunning the tests succeeded.

On 2022-09-13 01:39:59 +, Michael Paquier wrote:
> Move any remaining files generated by pg_upgrade into an internal subdir
>
> This change concerns a couple of .txt files (for internal state checks)
> that were still written in the path where the binary is executed, and
> not in the subdirectory located in the target cluster.  Like the other
> .txt files doing already so (like loadable_libraries.txt), these are
> saved in the base output directory.  Note that on failure, the logs
> report the full path to the .txt file generated, so these are easy to
> find.
>
> Oversight in 38bfae3.
>
> Author: Daniel Gustafsson
> Reviewed-by: Michael Paquier, Justin Prysby
> Discussion: https://postgr.es/m/181a6da8-3b7f-4b71-82d5-363ff0146...@yesql.se
> Backpatch-through: 15


Greetings,

Andres Freund




Re: POC: GROUP BY optimization

2022-09-19 Thread David Rowley
On Wed, 13 Jul 2022 at 15:37, David Rowley  wrote:
> I'm just in this general area of the code again today and wondered
> about the header comment for the preprocess_groupclause() function.
>
> It says:
>
>  * In principle it might be interesting to consider other orderings of the
>  * GROUP BY elements, which could match the sort ordering of other
>  * possible plans (eg an indexscan) and thereby reduce cost.  We don't
>  * bother with that, though.  Hashed grouping will frequently win anyway.
>
> I'd say this commit makes that paragraph mostly obsolete.  It's only
> true now in the sense that we don't try orders that suit some index
> that would provide pre-sorted results for a GroupAggregate path.  The
> comment leads me to believe that we don't do anything at all to find a
> better order, and that's not true now.

I've just pushed a fix for this.

David




Re: Kerberos delegation support in libpq and postgres_fdw

2022-09-19 Thread Jacob Champion
On 9/19/22 10:05, Stephen Frost wrote:
> This is coming across as if it's a surprise of some kind when it
> certainly isn't..  If the delegated credentials are being used to
> authenticate and establish the connection from that backend to another
> system then, yes, naturally that means that the keys provided are coming
> from the client and the client knows them.

I think it may be surprising to end users that credential delegation
lets them trivially break transport encryption. Like I said before, it
was a surprise to me, because the cryptosystems I'm familiar with
prevent that.

If it wasn't surprising to you, I could really have used a heads up back
when I asked you about it.

> The idea of arranging to
> have an admin's credentials used to authenticate to another system where
> the backend is actually controlled by a non-admin user is, in fact, the
> issue in what is being outlined above as that's clearly a situation
> where the user's connection is being elevated to an admin level.

Yes, controlled elevation is the goal in the scenario I'm describing.

> That's
> also something that we try to avoid having happen because it's not
> really a good idea, which is why we require a password today for the
> connection to be established (postgres_fdw/connection.c:
> 
> Non-superuser cannot connect if the server does not request a password.

A password is being used in this scenario. The password is the secret
being stolen.

The rest of your email describes a scenario different from what I'm
attacking here. Here's my sample HBA line for the backend again:

hostgssenc all all ... password

I'm using password authentication with a Kerberos-encrypted channel.
It's similar to protecting password authentication with TLS and a client
cert:

hostssl all all ... password clientcert=verify-*

> Consider that, in general, the user could also simply directly connect
> to the other system themselves

No, because they don't have the password. They don't have USAGE on the
foreign table, so they can't see the password in the USER MAPPING.

With the new default introduced in this patch, they can now steal the
password by delegating their credentials and cracking the transport
encryption. This bypasses the protections that are documented for the
pg_user_mappings view.

> Consider SSH instead of PG.  What you're pointing out, accurately, is
> that if an admin were to install their keys into a user's .ssh directory
> unencrypted and then the user logged into the system, they'd then be
> able to SSH to another system with the admin's credentials and then
> they'd need the admin's credentials to decrypt the traffic, but that if,
> instead, the user brings their own credentials then they could
> potentially decrypt the connection between the systems.  Is that really
> the issue here?

No, it's not the issue here. This is more like setting up a restricted
shell that provides limited access to a resource on another machine
(analogous to the foreign table). The user SSHs into this restricted
shell, and then invokes an admin-blessed command whose implementation
uses some credentials (which they cannot read, analogous to the USER
MAPPING) over an encrypted channel to access the backend resource. In
this situation an admin would want to ensure that the encrypted tunnel
couldn't be weakened by the client, so that they can't learn how to
bypass the blessed command and connect to the backend directly.

Unlike SSH, we've never supported credential delegation, and now we're
introducing it. So my claim is, it's possible for someone who was
previously in a secure situation to be broken by the new default.

>> So the client can decrypt backend communications that make use of its
>> delegated key material. (This also means that gssencmode is a lot
>> weaker than I expected.)
> 
> The backend wouldn't be able to establish the connection in the first
> place without those delegated credentials.

That's not true in the situation I'm describing; hopefully my comments
above help clarify.

> The server in the middle should *not* be using its own Kerberos
> credentials to establish the connection to the other system- that's
> elevating the credentials used for that connection and is something that
> should be prevented for non-superusers already (see above).

It's not prevented, because a password is being used. In my tests I'm
connecting as an unprivileged user.

You're claiming that the middlebox shouldn't be doing this. If this new
default behavior were the historical behavior, then I would have agreed.
But the cat's already out of the bag on that, right? It's safe today.
And if it's not safe today for some other reason, please share why, and
maybe I can work on a patch to try to prevent people from doing it.

Thanks,
--Jacob




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Jonathan S. Katz

On 9/19/22 11:16 AM, Alvaro Herrera wrote:



diff --git a/doc/src/sgml/logical-replication.sgml 
b/doc/src/sgml/logical-replication.sgml
index 1ae3287..0ab768d 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1120,6 +1120,11 @@ test_sub=# SELECT * FROM child ORDER BY a;

  


+   Specifying a column list when the publication also publishes
+   FOR ALL TABLES IN SCHEMA is not supported.
+  
+
+  
 For partitioned tables, the publication parameter
 publish_via_partition_root determines which column list
 is used. If publish_via_partition_root is

diff --git a/doc/src/sgml/ref/create_publication.sgml 
b/doc/src/sgml/ref/create_publication.sgml
index 0a68c4b..0ced7da 100644
--- a/doc/src/sgml/ref/create_publication.sgml
+++ b/doc/src/sgml/ref/create_publication.sgml
@@ -103,17 +103,17 @@ CREATE PUBLICATION name
   
  
   

+  Specifying a column list when the publication also publishes
+  FOR ALL TABLES IN SCHEMA is not supported.
+ 

@@ -733,6 +694,24 @@ CheckPubRelationColumnList(List *tables, const char 
*queryString,
continue;
  
  		/*

+* Disallow specifying column list if any schema is in the
+* publication.
+*
+* XXX We could instead just forbid the case when the 
publication
+* tries to publish the table with a column list and a schema 
for that
+* table. However, if we do that then we need a restriction 
during
+* ALTER TABLE ... SET SCHEMA to prevent such a case which 
doesn't
+* seem to be a good idea.
+*/
+   if (publish_schema)
+   ereport(ERROR,
+   
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+   errmsg("cannot use publication column list for 
relation \"%s.%s\"",
+  
get_namespace_name(RelationGetNamespace(pri->relation)),
+  
RelationGetRelationName(pri->relation)),
+   errdetail("Column list cannot be specified 
if any schema is part of the publication or specified in the list."));
+


This seems a pretty arbitrary restriction.  It feels like you're adding
this restriction precisely so that you don't have to write the code to
reject the ALTER .. SET SCHEMA if an incompatible configuration is
detected.  But we already have such checks in several cases, so I don't
see why this one does not seem a good idea.

The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
aspects.  Others have already commented about the syntax, which is
unlike what GRANT uses; I'm also surprised that we've gotten away with
it being superuser-only.  Why are we building more superuser-only
features in this day and age?  I think not even FOR ALL TABLES should
require superuser.


FYI, I've added this to the PG15 open items as there are some open 
questions to resolve in this thread.


Jonathan





Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-09-19 Thread Dmitry Koval

Thanks for comments and advice!
I thought about this problem and discussed it with colleagues.
Unfortunately, I don't know of a good general solution.

19.09.2022 22:56, Robert Haas wrote:

If you know that a certain partition is not changing, and you would
like to split it, you can create two or more new standalone tables and
populate them from the original partition using INSERT .. SELECT. Then
you can BEGIN a transaction, DETACH the existing partitions, and
ATTACH the replacement ones. By doing this, you take an ACCESS
EXCLUSIVE lock on the partitioned table only for a brief period. The
same kind of idea can be used to merge partitions.


But for a specific situation like this (a certain partition is not changing)
we could add a CONCURRENTLY modifier.

Our DDL query could look like:

ALTER TABLE...SPLIT PARTITION [CONCURRENTLY];

With the CONCURRENTLY modifier we can lock the partitioned table in
ShareUpdateExclusiveLock mode and the partition being split in
AccessExclusiveLock mode. So we don't lock the partitioned table in
AccessExclusiveLock mode and can modify other partitions during the SPLIT
operation (except the partition being split).
If somebody tries to modify the partition being split, they will receive a
"relation does not exist" error at the end of the operation (because that
partition will be dropped).



--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Fix typos in code comments

2022-09-19 Thread David Rowley
On Mon, 19 Sept 2022 at 23:10, Justin Pryzby  wrote:
> Find below some others.

Thanks. Pushed.

David




Re: Consider parallel for lateral subqueries with limit

2022-09-19 Thread Robert Haas
On Mon, Sep 19, 2022 at 3:58 PM James Coleman  wrote:
> But in the case where there's correlation via LATERAL we already don't
> guarantee unique executions for a given set of params into the lateral
> subquery execution, right? For example, suppose we have:
>
>   select *
>   from foo
>   left join lateral (
> select n
> from bar
> where bar.a = foo.a
> limit 1
>   ) on true
>
> and suppose that foo.a (in the table scan) returns these values in
> order: 1, 2, 1. In that case we'll execute the lateral subquery 3
> separate times rather than attempting to order the results of foo.a
> such that we can re-execute the subquery only when the param changes
> to a new unique value (and we definitely don't cache the results to
> guarantee unique subquery executions).

I think this is true, but I don't really understand why we should
focus on LATERAL here. What we really need, and I feel like we've
talked about this before, is a way to reason about where parameters
are set and used. Your sample query gets a plan like this:

 Nested Loop Left Join  (cost=0.00..1700245.00 rows=1 width=8)
   ->  Seq Scan on foo  (cost=0.00..145.00 rows=1 width=4)
   ->  Limit  (cost=0.00..170.00 rows=1 width=4)
 ->  Seq Scan on bar  (cost=0.00..170.00 rows=1 width=4)
   Filter: (foo.a = a)

If this were to occur inside a larger plan tree someplace, it would be
OK to insert a Gather node above the Nested Loop node without doing
anything further, because then the parameter that stores foo.a would
be both set and used in the worker. If you wanted to insert a Gather
at any other place in this plan, things get more complicated. But just
because you have LATERAL doesn't mean that you have this problem,
because if you delete the "limit 1" then the subqueries get flattened
together and the parameter disappears, and if you delete the lateral
reference (i.e. WHERE foo.a = bar.a) then there's still a subquery but
it no longer refers to an outer parameter. And on the flip side just
because you don't have LATERAL doesn't mean that you don't have this
problem. e.g. the query could instead be:

select *, (select n from bar where bar.a = foo.a limit 1) from foo;

...which I think is pretty much equivalent to your formulation and has
the same problem as far as parallel query as your formulation but does
not involve the LATERAL keyword.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Consider parallel for lateral subqueries with limit

2022-09-19 Thread James Coleman
On Tue, Mar 1, 2022 at 5:35 PM Tom Lane  wrote:
>
> But more generally, I don't think you've addressed the fundamental
> concern, which is that a query involving Limit is potentially
> nondeterministic (if it lacks a fully-deterministic ORDER BY),
> so that different workers could get different answers from it if
> they're using a plan type that permits that to happen.  (See the
> original discussion that led to 75f9c4ca5, at [1].)  I do not see
> how a lateral dependency removes that hazard.
> ...
> > In this case we are already conceivably getting different
> > results for each execution of the subquery "Limit(Y)" even if we're
> > not running those executions across multiple workers.
>
> That seems to be about the same argument Andres made initially
> in the old thread, but we soon shot that down as not being the
> level of guarantee we want to provide.  There's nothing in the
> SQL standard that says that
> select * from events where account in
> (select account from events
>  where data->>'page' = 'success.html' limit 3);
> (the original problem query) shall execute the sub-query
> only once, but people expect it to act that way.

I understand that particular case being a bit of a gotcha (i.e., what
we would naturally expect exceeds what the spec says).

But in the case where there's correlation via LATERAL we already don't
guarantee unique executions for a given set of params into the lateral
subquery execution, right? For example, suppose we have:

  select *
  from foo
  left join lateral (
select n
from bar
where bar.a = foo.a
limit 1
  ) on true

and suppose that foo.a (in the table scan) returns these values in
order: 1, 2, 1. In that case we'll execute the lateral subquery 3
separate times rather than attempting to order the results of foo.a
such that we can re-execute the subquery only when the param changes
to a new unique value (and we definitely don't cache the results to
guarantee unique subquery executions).

So, assuming that we can guarantee we're talking about the proper
lateral reference (not something unrelated as you pointed out earlier)
then I don't believe we're actually changing even implicit guarantees
about how the query executes. I think that probably true both in
theory (I claim that once someone declares a lateral join they're
definitionally expecting a subquery execution per outer row) and in
practice (e.g., your hypothesis about synchronized seq scans would
apply here also to subsequent executions of the subquery).

Is there still something I'm missing about the concern you have?

> If you want to improve this area, my feeling is that it'd be
> better to look into what was speculated about in the old
> thread: LIMIT doesn't create nondeterminism if the query has
> an ORDER BY that imposes a unique row ordering, ie
> order-by-primary-key.  We didn't have planner infrastructure
> that would allow checking that cheaply in 2018, but maybe
> there is some now?

Assuming what I argued earlier holds true for lateral [at minimum when
correlated] subquery execution, then I believe your suggestion here is
orthogonal and would expand the use cases even more. For example, if
we were able to guarantee a unique result set (including order), then
we could allow parallelizing subqueries even if they're not lateral
and correlated.

James Coleman




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-09-19 Thread Robert Haas
On Tue, May 31, 2022 at 5:33 AM Dmitry Koval  wrote:
> There are not many commands in PostgreSQL for working with partitioned
> tables. This is an obstacle to their widespread use.
> Adding SPLIT PARTITION/MERGE PARTITIONS operations can make easier to
> use partitioned tables in PostgreSQL.
> (This is especially important when migrating projects from ORACLE DBMS.)
>
> SPLIT PARTITION/MERGE PARTITIONS commands are supported for range
> partitioning (BY RANGE) and for list partitioning (BY LIST).
> For hash partitioning (BY HASH) these operations are not supported.

This may be a good idea, but I would like to point out one
disadvantage of this approach.

If you know that a certain partition is not changing, and you would
like to split it, you can create two or more new standalone tables and
populate them from the original partition using INSERT .. SELECT. Then
you can BEGIN a transaction, DETACH the existing partitions, and
ATTACH the replacement ones. By doing this, you take an ACCESS
EXCLUSIVE lock on the partitioned table only for a brief period. The
same kind of idea can be used to merge partitions.
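
Sketched out, the recipe looks roughly like this (all table, partition,
column, and bound names here are made up for illustration):

-- build and populate the replacements outside any long transaction
CREATE TABLE orders_2021_h1 (LIKE orders INCLUDING ALL);
CREATE TABLE orders_2021_h2 (LIKE orders INCLUDING ALL);
INSERT INTO orders_2021_h1
    SELECT * FROM orders_2021 WHERE ordered_at <  '2021-07-01';
INSERT INTO orders_2021_h2
    SELECT * FROM orders_2021 WHERE ordered_at >= '2021-07-01';

-- the ACCESS EXCLUSIVE lock on "orders" is only held for this window
BEGIN;
ALTER TABLE orders DETACH PARTITION orders_2021;
ALTER TABLE orders ATTACH PARTITION orders_2021_h1
    FOR VALUES FROM ('2021-01-01') TO ('2021-07-01');
ALTER TABLE orders ATTACH PARTITION orders_2021_h2
    FOR VALUES FROM ('2021-07-01') TO ('2022-01-01');
COMMIT;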

It seems hard to do something comparable with built-in DDL for SPLIT
PARTITION and MERGE PARTITION. You could start by taking e.g. SHARE
lock on the existing partition(s) and then wait until the end to take
ACCESS EXCLUSIVE lock on the partitions, but we typically avoid such
coding patterns, because the lock upgrade might deadlock and then a
lot of work would be wasted. So most likely with the approach you
propose here you will end up acquiring ACCESS EXCLUSIVE lock at the
beginning of the operation and then shuffle a lot of data around while
still holding it, which is pretty painful.

Because of this problem, I find it hard to believe that these commands
would get much use, except perhaps on small tables or in
non-production environments, unless people just didn't know about the
alternatives. That's not to say that something like this has no value.
As a convenience feature, it's fine. It's just hard for me to see it
as any more than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: has_privs_of_role vs. is_member_of_role, redux

2022-09-19 Thread Robert Haas
On Thu, Sep 8, 2022 at 1:06 PM  wrote:
> A different line of thought (compared to the "USAGE" privilege I
> discussed earlier), would be:
> To transfer ownership of an object, you need two sets of privileges:
> - You need to have the privilege to initiate a request to transfer
> ownership.
> - You need to have the privilege to accept a request to transfer ownership.
>
> Let's imagine there'd be such a request created temporarily, then when I
> start the process of changing ownership, I would have to change to the
> other role and then accept that request.
>
> In theory, I could also inherit that privilege, but that's not how the
> system works today. By using is_member_of_role, the decision was already
> made that this should not depend on inheritance. What is left, is the
> ability to do it via SET ROLE only.

I do not accept the argument that we've already made the decision that
this should not depend on inheritance. It's pretty clear that we
haven't thought carefully enough about which checks should depend only
on membership, and which ones should depend on inheritance. The patch
I committed just now to fix ALTER DEFAULT PRIVILEGES is one clear
example of where we've gotten that wrong. We also changed the way
predefined roles worked with inheritance not too long ago, so that
they started using has_privs_of_role() rather than
is_member_of_role(). Our past thinking on this topic has been fuzzy
enough that we can't really conclude that because something uses
is_member_of_role() now that's what it should continue to do in the
future. We are working to get from a messy situation where the rules
aren't consistent or understandable to one where they are, and that
may mean changing some things.

> So it should not be has_privs_of_role() nor has_privs_of_role() ||
> member_can_set_role(), as you suggested above, but rather just
> member_can_set_role() only. Of course, only in the context of the SET
> ROLE patch.

Now, having said that, this choice of behavior might have some
advantages. It would mean that you could GRANT pg_read_all_settings TO
someone WITH INHERIT TRUE, SET FALSE and that user would be able to
read all settings but would not be able to create objects owned by
pg_read_all_settings. It would also be upward-compatible with the
existing behavior, which is nice.

Well, maybe. Suppose that role A has been granted pg_read_all_settings
WITH INHERIT TRUE, SET TRUE and role B has been granted
pg_read_all_settings WITH INHERIT TRUE, SET FALSE. A can create a
table owned by pg_read_all_settings. If A does that, then B can now
create a trigger on that table and usurp the privileges of
pg_read_all_settings, after which B can now create any number of
objects owned by pg_read_all_settings. If A does not do that, though,
I think that with the proposed rule, B would have no way to create
objects owned by A. This is a bit unsatisfying. It seems like B should
either have the right to usurp pg_read_all_settings's privileges or
not, rather than maybe having that right depending on what some other
user chooses to do.
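
Spelled out as a sketch (assuming the proposed per-grant SET option
syntax, with hypothetical role and table names):

GRANT pg_read_all_settings TO a WITH INHERIT TRUE, SET TRUE;
GRANT pg_read_all_settings TO b WITH INHERIT TRUE, SET FALSE;

-- A can do this, B cannot:
SET ROLE pg_read_all_settings;
CREATE TABLE blessed ();          -- owned by pg_read_all_settings
RESET ROLE;

-- but once "blessed" exists, B inherits the owner's rights on it and can
-- attach a trigger, effectively usurping pg_read_all_settings anyway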

But maybe it's OK. It's hard to come up with perfect solutions here.
One could take the view that the issue here is that
pg_read_all_settings shouldn't have the right to create objects in the
first place, and that this INHERIT vs. SET ROLE distinction is just a
distraction. However, that would require accepting the idea that it's
possible for a role to lack privileges granted to PUBLIC, which also
sounds pretty unsatisfying. On the whole, I'm inclined to think it's
reasonable to suppose that if you want to grant a role to someone
without letting them create objects owned by that role, it should be a
role that doesn't own any existing objects either. Essentially, that's
legislating that predefined roles should be minimally privileged: they
should hold the ability to do whatever it is that they are there to do
(like read all settings) but not have any other privileges (like the
ability to do stuff to objects they own).

But maybe there's a better answer. Ideas/opinions welcome.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Silencing the remaining clang 15 warnings

2022-09-19 Thread Tom Lane
While working on the -Wdeprecated-non-prototype fixups discussed
nearby, I saw that clang 15.0 produces a few other new warnings
(which are also visible in the buildfarm).  Pursuant to our
usual policy that we should suppress warnings on compilers likely
to be used for development, here's a patch to silence them.

There are three groups of these:

* With %pure-parser, Bison makes the "yynerrs" variable local
instead of static, and then if you don't use it clang notices
that it's set but never read.  There doesn't seem to be a way
to persuade Bison not to emit the variable at all, so here I've
just added "(void) yynerrs;" to the topmost production of each
affected grammar.  If anyone has a nicer idea, let's hear it.

* xlog.c's AdvanceXLInsertBuffer has a local variable "npages"
that is only read in the "#ifdef WAL_DEBUG" stanza at the
bottom.  Here I've done the rather ugly and brute-force thing
of wrapping all the variable's references in "#ifdef WAL_DEBUG".
(I tried marking it PG_USED_FOR_ASSERTS_ONLY, but oddly that
did not silence the warning.)  I kind of wonder how useful this
function's WAL_DEBUG output is --- maybe just dropping that
altogether would be better?

* array_typanalyze.c's compute_array_stats counts the number
of null arrays in the column, but then does nothing with the
result.  AFAICS this is redundant with what std_compute_stats
will do, so I just removed the variable.

Any thoughts?

regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 81d339d57d..2651bcaa76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1785,7 +1785,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
+#ifdef WAL_DEBUG
 	int			npages = 0;
+#endif
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -1928,7 +1930,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 
 		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
+#ifdef WAL_DEBUG
 		npages++;
+#endif
 	}
 	LWLockRelease(WALBufMappingLock);
 
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 82f03fc9c9..c6ec694ceb 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -864,6 +864,7 @@ parse_toplevel:
 			stmtmulti
 			{
 pg_yyget_extra(yyscanner)->parsetree = $1;
+(void) yynerrs;		/* suppress compiler warning */
 			}
 			| MODE_TYPE_NAME Typename
 			{
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 1964cedd93..2360c680ac 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -218,7 +218,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 {
 	ArrayAnalyzeExtraData *extra_data;
 	int			num_mcelem;
-	int			null_cnt = 0;
 	int			null_elem_cnt = 0;
 	int			analyzed_rows = 0;
 
@@ -320,8 +319,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		value = fetchfunc(stats, array_no, &isnull);
 		if (isnull)
 		{
-			/* array is null, just count that */
-			null_cnt++;
+			/* ignore arrays that are null overall */
 			continue;
 		}
 
diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y
index 7e28853a57..2a56629cc3 100644
--- a/src/backend/utils/adt/jsonpath_gram.y
+++ b/src/backend/utils/adt/jsonpath_gram.y
@@ -113,6 +113,7 @@ result:
 		*result = palloc(sizeof(JsonPathParseResult));
 		(*result)->expr = $2;
 		(*result)->lax = $1;
+		(void) yynerrs;
 	}
 	| /* EMPTY */	{ *result = NULL; }
 	;
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index ade2ecdaab..f6387d4a3c 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -80,7 +80,10 @@ static PgBenchExpr *make_case(yyscan_t yyscanner, PgBenchExprList *when_then_lis
 
 %%
 
-result: expr{ expr_parse_result = $1; }
+result: expr{
+expr_parse_result = $1;
+(void) yynerrs; /* suppress compiler warning */
+			}
 
 elist:		{ $$ = NULL; }
 	| expr	{ $$ = make_elist($1, NULL); }


Re: has_privs_of_role vs. is_member_of_role, redux

2022-09-19 Thread Robert Haas
On Fri, Aug 26, 2022 at 10:11 AM Robert Haas  wrote:
> Here's a small patch. Despite the small size of the patch, there are a
> couple of debatable points here:

Nobody's commented on this patch specifically, but it seemed like we
had consensus that ALTER DEFAULT PRIVILEGES was doing The Wrong Thing,
so I've pushed the patch I posted previously for that issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Robert Haas
On Sun, Sep 18, 2022 at 4:58 PM Tom Lane  wrote:
> BTW, I was distressed to discover that someone decided they could
> use ExecShutdownNode as a planstate_tree_walker() walker even though
> its argument list is not even the right length.  I'm a bit flabbergasted
> that we seem to have gotten away with that so far, because I'd have
> thought for sure that it'd break some platform's convention for which
> argument gets passed where.  I think we need to fix that, independently
> of what we do about the larger scope of these problems.  To avoid an
> API break, I propose making ExecShutdownNode just be a one-liner that
> calls an internal ExecShutdownNode_walker() function.  (I've not done
> it that way in the attached, though.)

I think this was brain fade on my part ... or possibly on Amit
Kapila's part, but I believe it was probably me. I agree that it's
impressive that it actually seemed to work that way.
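
For reference, the split Tom describes would look roughly like this (a
sketch only, not necessarily the committed shape):

static bool ExecShutdownNode_walker(PlanState *node, void *context);

bool
ExecShutdownNode(PlanState *node)
{
	/* keep the existing one-argument API as a thin wrapper */
	return ExecShutdownNode_walker(node, NULL);
}

static bool
ExecShutdownNode_walker(PlanState *node, void *context)
{
	if (node == NULL)
		return false;

	/* ... existing per-node shutdown work ... */

	return planstate_tree_walker(node, ExecShutdownNode_walker, context);
}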

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Tom Lane
Here's a second-generation patch that fixes the warnings by inserting
casts into a layer of macro wrappers.  I had supposed that this would
cause us to lose all detection of wrongly-chosen walker functions,
so I was very pleased to see this when applying it to yesterday's HEAD:

execProcnode.c:792:2: warning: cast from 'bool (*)(PlanState *)' (aka 'bool 
(*)(struct PlanState *)') to 'planstate_tree_walker_callback' (aka 'bool 
(*)(struct PlanState *, void *)') converts to incompatible function type 
[-Wcast-function-type]
planstate_tree_walker(node, ExecShutdownNode, NULL);
^~~
../../../src/include/nodes/nodeFuncs.h:180:33: note: expanded from macro 
'planstate_tree_walker'
planstate_tree_walker_impl(ps, (planstate_tree_walker_callback) (w), c)
   ^~~~

So we've successfully suppressed the pedantic -Wdeprecated-non-prototype
warnings, and we have activated the actually-useful -Wcast-function-type
warnings, which seem to do exactly what we want in this context:

'-Wcast-function-type'
 Warn when a function pointer is cast to an incompatible function
 pointer.  In a cast involving function types with a variable
 argument list only the types of initial arguments that are provided
 are considered.  Any parameter of pointer-type matches any other
  ^^^
 pointer-type.  Any benign differences in integral types are
 
 ignored, like 'int' vs.  'long' on ILP32 targets.  Likewise type
 qualifiers are ignored.  The function type 'void (*) (void)' is
 special and matches everything, which can be used to suppress this
 warning.  In a cast involving pointer to member types this warning
 warns whenever the type cast is changing the pointer to member
 type.  This warning is enabled by '-Wextra'.

(That verbiage is from the gcc manual; clang seems to act the same
except that -Wcast-function-type is selected by -Wall, or perhaps is
even on by default.)

So I'm pretty pleased with this formulation: no caller changes are
needed, and it does exactly what we want warning-wise.

regards, tom lane

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 3bac350bf5..724d076674 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -27,10 +27,12 @@
 static bool expression_returns_set_walker(Node *node, void *context);
 static int	leftmostLoc(int loc1, int loc2);
 static bool fix_opfuncids_walker(Node *node, void *context);
-static bool planstate_walk_subplans(List *plans, bool (*walker) (),
+static bool planstate_walk_subplans(List *plans,
+	planstate_tree_walker_callback walker,
 	void *context);
 static bool planstate_walk_members(PlanState **planstates, int nplans,
-   bool (*walker) (), void *context);
+   planstate_tree_walker_callback walker,
+   void *context);
 
 
 /*
@@ -1909,9 +1911,9 @@ check_functions_in_node(Node *node, check_function_callback checker,
  */
 
 bool
-expression_tree_walker(Node *node,
-	   bool (*walker) (),
-	   void *context)
+expression_tree_walker_impl(Node *node,
+			tree_walker_callback walker,
+			void *context)
 {
 	ListCell   *temp;
 
@@ -1923,6 +1925,10 @@ expression_tree_walker(Node *node,
 	 * when we expect a List we just recurse directly to self without
 	 * bothering to call the walker.
 	 */
+#define WALK(n) walker((Node *) (n), context)
+
+#define LIST_WALK(l) expression_tree_walker_impl((Node *) (l), walker, context)
+
 	if (node == NULL)
 		return false;
 
@@ -1946,25 +1952,21 @@ expression_tree_walker(Node *node,
 			/* primitive node types with no expression subnodes */
 			break;
 		case T_WithCheckOption:
-			return walker(((WithCheckOption *) node)->qual, context);
+			return WALK(((WithCheckOption *) node)->qual);
 		case T_Aggref:
 			{
 Aggref	   *expr = (Aggref *) node;
 
-/* recurse directly on List */
-if (expression_tree_walker((Node *) expr->aggdirectargs,
-		   walker, context))
+/* recurse directly on Lists */
+if (LIST_WALK(expr->aggdirectargs))
 	return true;
-if (expression_tree_walker((Node *) expr->args,
-		   walker, context))
+if (LIST_WALK(expr->args))
 	return true;
-if (expression_tree_walker((Node *) expr->aggorder,
-		   walker, context))
+if (LIST_WALK(expr->aggorder))
 	return true;
-if (expression_tree_walker((Node *) expr->aggdistinct,
-		   walker, context))
+if (LIST_WALK(expr->aggdistinct))
 	return true;
-if (walker((Node *) expr->aggfilter, context))
+if (WALK(expr->aggfilter))
 	return true;
 			}
 			break;
@@ -1972,8 +1974,7 @@ expression_tree_walker(Node *node,
 			{
 GroupingFunc *grouping = (GroupingFunc *) node;
 
-if 

Re: walmethods.c/h are doing some strange things

2022-09-19 Thread Robert Haas
On Thu, Sep 15, 2022 at 9:39 PM Kyotaro Horiguchi
 wrote:
> At Fri, 2 Sep 2022 11:52:38 -0400, Robert Haas  wrote 
> in
> > that type that can ever exist, and the pointer to that object is
> > stored in a global variable managed by walmethods.c. So whereas in
> > other cases we give you the object and then a way to get the
> > corresponding set of callbacks, here we only give you the callbacks,
> > and we therefore have to impose the artificial restriction that there
> > can only ever be one object.
>
> Makes sense to me.

OK, I have committed the patches. Before doing that, I fixed a couple
of bugs in the first one, and did a little bit of rebasing of the
second one.

> I agree to the view. That part seems to be a part of
> open_for_write()'s body functions. But, I'm not sure how we untangle
> them at a glance, too.  In the first place, I'm not sure why we need
> to do that despite the file going to be overwritten from the
> beginning, though..

I suspect that we're going to have to untangle things bit by bit.

One place where I think we might be able to improve things is with the
handling of compression suffixes (e.g. .gz, .lz4) and temp suffixes
(e.g. .partial, .tmp). At present, responsibility for adding these
suffixes to pathnames is spread across receivelog.c and walmethods.c
in a way that, to me, looks pretty random. It's not exactly clear to
me what the best design is here right now, but I think either (A)
receivelog.c should take full responsibility for computing the exact
filename and walmethods.c should just blindly write the data into the
exact filename it's given, or else (B) receivelog.c should take no
responsibility for pathname construction and the fact that there is
pathname munging happening should be hidden inside walmethods.c. Right
now, walmethods.c is doing pathname munging, but the munging is
visible from receivelog.c, so the responsibility is spread across the
two files rather than being the sole responsibility of either.
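
As a rough illustration of option (B), walmethods.c alone would own the
name munging, along the lines of the following (function and parameter
names are invented here, not the existing walmethods API):

/* sketch: callers pass the bare segment name, nothing else */
static char *
dir_build_wal_path(const char *basedir, const char *walname,
				   const char *compression_suffix, bool is_partial)
{
	return psprintf("%s/%s%s%s",
					basedir,
					walname,
					compression_suffix ? compression_suffix : "",
					is_partial ? ".partial" : "");
}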

What's also a little bit aggravating about this is that it doesn't
feel like we're accomplishing all that much code reuse here.
walmethods.c and receivelog.c are shared only between pg_basebackup
and pg_receivewal, but the tar method is used only by pg_basebackup.
The directory mode is shared, but actually the two programs need a
bunch of different things. pg_receivewal needs a bunch of logic to
deal with the possibility that it is receiving data into an existing
directory and that this directory might at any time start to be used
to feed a standby, or used for PITR. That's not an issue for
pg_basebackup: if it fails, the whole directory will be removed, so
there is no need to worry about fsyncs or padding or overwriting
existing files. On the other hand, pg_basebackup needs a bunch of
logic to create a .done file for each WAL segment which is not
required in the case of pg_receivewal. It feels like we've got as much
conditional logic as we do common logic...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Kerberos delegation support in libpq and postgres_fdw

2022-09-19 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> On Thu, Jul 7, 2022 at 4:24 PM Jacob Champion  wrote:
> > So my question is this: does substituting my credentials for the admin's
> > credentials let me weaken or break the transport encryption on the
> > backend connection, and grab the password that I'm not supposed to have
> > access to as a front-end client?
> 
> With some further research: yes, it does.
> 
> If a DBA is using a GSS encrypted tunnel to communicate to a foreign
> server, accepting delegation by default means that clients will be
> able to break that backend encryption at will, because the keys in use
> will be under their control.

This is coming across as if it's a surprise of some kind when it
certainly isn't..  If the delegated credentials are being used to
authenticate and establish the connection from that backend to another
system then, yes, naturally that means that the keys provided are coming
from the client and the client knows them.  The idea of arranging to
have an admin's credentials used to authenticate to another system where
the backend is actually controlled by a non-admin user is, in fact, the
issue in what is being outlined above as that's clearly a situation
where the user's connection is being elevated to an admin level.  That's
also something that we try to avoid having happen because it's not
really a good idea, which is why we require a password today for the
connection to be established (postgres_fdw/connection.c:

Non-superuser cannot connect if the server does not request a password.

).

Consider that, in general, the user could also simply directly connect
to the other system themselves instead of having a PG backend make that
connection for them- the point in doing it from PG would be to avoid
having to pass all the data back through the client's system.

Consider SSH instead of PG.  What you're pointing out, accurately, is
that if an admin were to install their keys into a user's .ssh directory
unencrypted and then the user logged into the system, they'd then be
able to SSH to another system with the admin's credentials and then
they'd need the admin's credentials to decrypt the traffic, but that if,
instead, the user brings their own credentials then they could
potentially decrypt the connection between the systems.  Is that really
the issue here?  Doesn't seem like that's where the concern should be in
this scenario.

> > Maybe there's some ephemeral exchange going on that makes it too hard to
> > attack in practice, or some other mitigations.
> 
> There is no forward secrecy, ephemeral exchange, etc. to mitigate this [1]:
> 
>The Kerberos protocol in its basic form does not provide perfect
>forward secrecy for communications.  If traffic has been recorded by
>an eavesdropper, then messages encrypted using the KRB_PRIV message,
>or messages encrypted using application-specific encryption under
>keys exchanged using Kerberos can be decrypted if the user's,
>application server's, or KDC's key is subsequently discovered.
> 
> So the client can decrypt backend communications that make use of its
> delegated key material. (This also means that gssencmode is a lot
> weaker than I expected.)

The backend wouldn't be able to establish the connection in the first
place without those delegated credentials.

> > I'm trying to avoid writing Wireshark dissector
> > code, but maybe that'd be useful either way...
> 
> I did end up filling out the existing PGSQL dissector so that it could
> decrypt GSSAPI exchanges (with the use of a keytab, that is). If you'd
> like to give it a try, the patch, based on Wireshark 3.7.1, is
> attached. Note the GPLv2 license. It isn't correct code yet, because I
> didn't understand how packet reassembly worked in Wireshark when I
> started writing the code, so really large GSSAPI messages that are
> split across multiple TCP packets will confuse the dissector. But it's
> enough to prove the concept.
> 
> To see this in action, set up an FDW connection that uses gssencmode
> (so the server in the middle will need its own Kerberos credentials).

The server in the middle should *not* be using its own Kerberos
credentials to establish the connection to the other system- that's
elevating the credentials used for that connection and is something that
should be prevented for non-superusers already (see above).  We do allow
that when a superuser is involved because they are considered to
essentially have OS-level privileges and therefore could see those
credentials anyway, but that's not the case for non-superusers.

Thanks,

Stephen


signature.asc
Description: PGP signature


pg_create_logical_replication_slot argument incongruency

2022-09-19 Thread Florin Irion
Hello,

The function `pg_create_logical_replication_slot()` is documented to have
a `two_phase` argument (note the underscore), but the function instead
requires `twophase`.

```
\df pg_catalog.pg_create_logical_replication_slot
List of functions
-[ RECORD 1 ]---+-

Schema  | pg_catalog
Name| pg_create_logical_replication_slot
Result data type| record
Argument data types | slot_name name, plugin name, temporary boolean
DEFAULT false, twophase boolean DEFAULT false, OUT slot_name name, OUT lsn
pg_lsn
Type| func
```

This was introduced in commit 19890a06.

IMHO we should use the documented argument name `two_phase` and change the
function to accept it.

What do you think?

Please, check the attached patch.


Cheers,
Florin
--
*www.enterprisedb.com *


two_phase_slot_v1.patch
Description: Binary data


Re: remove more archiving overhead

2022-09-19 Thread Robert Haas
On Mon, Sep 19, 2022 at 10:39 AM Noah Misch  wrote:
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced.  Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.

I agree wholeheartedly with the second sentence. I do think that we
need to be a little careful not to be too prescriptive. Telling people
what they have to do when they don't really have to do it often leads
to non-compliance. It can be more productive to spell out the precise
consequences of non-compliance, so I think Peter's proposal has some
merit. On the other hand, I don't see any real problem with the
current language, either.

Unfortunately, no matter what we do here, it is not likely that we'll
soon be able to eliminate the phenomenon where people use a buggy
home-grown archive_command, both because we've been encouraging that
in our documentation for a very long time, and also because the people
who are most likely to do something half-baked here are probably the
least likely to actually read the updated documentation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: remove more archiving overhead

2022-09-19 Thread Robert Haas
On Mon, Sep 19, 2022 at 6:08 AM Peter Eisentraut
 wrote:
> I suspect what we are really trying to say here is
>
> ===
> Archiving setups (using either archive_command or archive_library)
> should be prepared for the rare case that an identical archive file is
> being archived a second time.  In such a case, they should compare that
> the source and the target file are identical and proceed without error
> if so.
>
> In some cases, it is difficult or impossible to configure
> archive_command or archive_library to do this.  In such cases, the
> archiving command or library should error like in the case for any
> pre-existing target file, and operators need to be prepared to resolve
> such cases manually.
> ===
>
> Is that correct?

I think it is. I think commit d263ced225bffe2c340175125b0270d1869138fe
made the documentation in this area substantially clearer than what we
had before, and I think that we could adjust that text to apply to
both archive libraries and archive commands relatively easily, so I'm
not really sure that we need to add text like what you propose here.

However, I also don't think it would be particularly bad if we did. An
advantage of that language is that it makes it clear what the
consequences are if you fail to follow the recommendation about
handling identical files. That may be helpful to users in
understanding the behavior of the system, and might even make it more
likely that they will include proper handling of that case in their
archive_command or archive_library.

"impossible" does seem like a bit of a strong word, though. I'd
recommend leaving that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Aleksander Alekseev
Hi Himanshu,

> I have changed this in the attached patch.

If it's not too much trouble could you please base your changes on v4
that I submitted? I put some effort into writing a proper commit
message, editing the comments, etc. The easiest way of doing this is
using `git am` and `git format-patch`.

-- 
Best regards,
Aleksander Alekseev




Re: Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Zhang Mingli

Regards,
Zhang Mingli
On Sep 20, 2022, 00:27 +0800, Zhang Mingli , wrote:
>
> SaveBytes = Sum results of accumulate_list_size: 32(4+4+8+8), as we have 5 
> aggs in sql
Correction: SaveBytes = Sum results of accumulate_list_size: 24(4+4+8+8),


Re: Support tls-exporter as channel binding for TLSv1.3

2022-09-19 Thread Jacob Champion
On Wed, Sep 7, 2022 at 10:03 AM Jacob Champion  wrote:
> Yeah, that should be fine. Requiring newer OpenSSLs for stronger
> crypto will probably be uncontroversial.

While looking into this I noticed that I left the following code in place:

> #ifdef HAVE_BE_TLS_GET_CERTIFICATE_HASH
> if (strcmp(selected_mech, SCRAM_SHA_256_PLUS_NAME) == 0 && 
> port->ssl_in_use)

In other words, we're still deciding whether to advertise -PLUS based
only on whether we support tls-server-end-point. Maybe all the
necessary features landed in OpenSSL in the same version, but I
haven't double-checked that, and in any case I think I need to make
this code more correct in the next version of this patch.

--Jacob




Tab complete for CREATE SUBSCRIPTION ... CONECTION does not work

2022-09-19 Thread Japin Li


Hi hacker,

As $subject detailed, the tab-complete cannot work such as:

   CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=6543' \t

It seems that the get_previous_words() could not parse the single quote.

OTOH, it works for CREATE SUBSCRIPTION sub CONNECTION xx \t, should we fix it?
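
For reference, the rule in src/bin/psql/tab-complete.c that should fire is
roughly of this shape (paraphrased, not an exact quote), and my guess is that
get_previous_words() splits the quoted conninfo on whitespace so the pattern
never matches:

```c
else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION", MatchAny))
	COMPLETE_WITH("PUBLICATION");

/*
 * With CONNECTION 'dbname=postgres port=6543', the previous words
 * presumably come back as ... "CONNECTION", "'dbname=postgres",
 * "port=6543'", so the five-word pattern above no longer matches.
 */
```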

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Zhang Mingli

Regards,
Zhang Mingli
On Sep 19, 2022, 23:14 +0800, Tom Lane , wrote:
> Very little of the planner bothers with freeing small allocations
> like that.
I think so too; as said, I am not sure if it is worth it.
> Can you demonstrate a case where this would actually
> make a meaningful difference?
Offhand, an example may help a little:

create table t1(id int);
explain select max(id), min(id), sum(id), count(id), avg(id) from t1;

 Modify the code to test:

@@ -139,6 +139,7 @@ preprocess_aggref(Aggref *aggref, PlannerInfo *root)
 int16 transtypeLen;
 Oid inputTypes[FUNC_MAX_ARGS];
 int numArguments;
+ static size_t accumulate_list_size = 0;

 Assert(aggref->agglevelsup == 0);

@@ -265,7 +266,7 @@ preprocess_aggref(Aggref *aggref, PlannerInfo *root)
 aggserialfn, aggdeserialfn,
 initValue, initValueIsNull,
 same_input_transnos);
- list_free(same_input_transnos);
+ accumulate_list_size += sizeof(int) * list_length(same_input_transnos);

Gdb and print accumulate_list_size for each iteration:

SaveBytes = Sum results of accumulate_list_size: 32(4+4+8+8), as we have 5 aggs 
in sql.

If there were N sets of such aggregates (more columns like id, with the above
aggregates), the saved bytes would be N*SaveBytes.

It seems we don't have many aggregate functions that could share the same
transition function, so is it worth it?






Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Himanshu Upadhyaya
On Mon, Sep 19, 2022 at 8:27 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi Himanshu,
>
> > Done, updated in the v3 patch.
>
> Thanks for the updated patch.
>
> Here is v4 with fixed compiler warnings and some minor tweaks from me.
>
> I didn't put too much thought into the algorithm but I already see
> something strange. At verify_heapam.c:553 you declared curr_xmax and
> next_xmin. However the variables are not used/initialized until you
> do:
>
> ```
> if (lp_valid[nextoffnum] && lp_valid[ctx.offnum] &&
> TransactionIdIsValid(curr_xmax) &&
> TransactionIdEquals(curr_xmax, next_xmin)) {
> /* ... */
> ```
>
> In v4 I elected to initialize both curr_xmax and next_xmin with
> InvalidTransactionId for safety and in order to silence the compiler
> but still there is no way this condition can succeed.
>
> Please make sure there is no logic missing.
>
>
Hi Aleksander,

Thanks for sharing the feedback.
It was my mistake, sorry about that: I was trying to merge two if conditions
and forgot to move the initialization of xmin and xmax. I now think it is
better to use a nested if, with an inner if condition that tests whether xmax
and xmin match. This way we retrieve and populate xmin and xmax only when the
inner if condition actually needs them.
I have changed this in the attached patch.
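
To spell out the shape I have in mind, here is a rough sketch reusing the
variable names from the patch (not the exact patch text):

```c
next_lp = PageGetItemId(ctx.page, nextoffnum);
if (lp_valid[ctx.offnum] && lp_valid[nextoffnum])
{
	HeapTupleHeader curr_htup;
	HeapTupleHeader next_htup;
	TransactionId curr_xmax;
	TransactionId next_xmin;

	curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
	next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
	curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
	next_xmin = HeapTupleHeaderGetXmin(next_htup);

	/* xmin/xmax are fetched only when the outer condition holds */
	if (TransactionIdIsValid(curr_xmax) &&
		TransactionIdEquals(curr_xmax, next_xmin))
		predecessor[nextoffnum] = ctx.offnum;
}
```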

-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
From 1a51c544c5c9a14441831f5299b9d6fe275fbf53 Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya 
Date: Mon, 19 Sep 2022 21:44:34 +0530
Subject: [PATCH v4] HOT chain validation in verify_heapam

---
 contrib/amcheck/verify_heapam.c | 211 
 1 file changed, 211 insertions(+)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index c875f3e5a2..2b9244ca69 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
+		OffsetNumber predecessor[MaxOffsetNumber] = {0};
+		OffsetNumber successor[MaxOffsetNumber] = {0};
+		bool		lp_valid[MaxOffsetNumber] = {false};
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
 		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
 			 ctx.offnum = OffsetNumberNext(ctx.offnum))
 		{
+			OffsetNumber nextoffnum;
+
 			ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
 
 			/* Skip over unused/dead line pointers */
@@ -469,6 +474,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	report_corruption(,
 	  psprintf("line pointer redirection to unused item at offset %u",
 			   (unsigned) rdoffnum));
+
+/*
+ * make entry in successor array, redirected tuple will be
+ * validated at the time when we loop over successor array
+ */
+successor[ctx.offnum] = rdoffnum;
 continue;
 			}
 
@@ -504,9 +515,202 @@ verify_heapam(PG_FUNCTION_ARGS)
 			/* It should be safe to examine the tuple's header, at least */
 			ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
 			ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+			lp_valid[ctx.offnum] = true;
 
 			/* Ok, ready to check this next tuple */
 			check_tuple();
+
+			/*
+			 * Now add data to successor array, it will later be used to
+			 * generate the predecessor array. We need to access the tuple's
+			 * header to populate the predecessor array but tuple is not
+			 * necessarily sanity checked yet so delaying construction of
+			 * predecessor array until all tuples are sanity checked.
+			 */
+			nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
+			if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno &&
+nextoffnum != ctx.offnum)
+			{
+successor[ctx.offnum] = nextoffnum;
+			}
+		}
+
+		/*
+		 * Loop over offset and populate predecessor array from all entries
+		 * that are present in successor array.
+		 */
+		ctx.attnum = -1;
+		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
+			 ctx.offnum = OffsetNumberNext(ctx.offnum))
+		{
+			ItemId		curr_lp;
+			ItemId		next_lp;
+			HeapTupleHeader curr_htup;
+			HeapTupleHeader next_htup;
+			TransactionId curr_xmax;
+			TransactionId next_xmin;
+			OffsetNumber nextoffnum = successor[ctx.offnum];
+
+			curr_lp = PageGetItemId(ctx.page, ctx.offnum);
+			if (nextoffnum == 0)
+			{
+/*
+ * Either last updated tuple in chain or corruption raised for
+ * this tuple.
+ */
+continue;
+			}
+			if (ItemIdIsRedirected(curr_lp) && lp_valid[nextoffnum])
+			{
+next_lp = PageGetItemId(ctx.page, nextoffnum);
+next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
+if (!HeapTupleHeaderIsHeapOnly(next_htup))
+{
+	report_corruption(,
+	  psprintf("redirected tuple at line pointer offset %u is not heap only tuple",
+			   (unsigned) nextoffnum));
+}
+if 

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-09-19 Thread Jaime Casanova
On Fri, Sep 16, 2022 at 02:37:17PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 15 Sep 2022 11:15:12 -0500, Jaime Casanova 
>  wrote in 
> > It fails at ./src/backend/utils/activity/pgstat_shmem.c:530 inside
> 
> Thanks for the info.  I reproduced by make check.. stupid..
> 
> It's the thinko about the base address of reset_off.
> 
> So the attached doesn't crash..
> 

Hi,

Just confirming there have been no crash since this last patch.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread Önder Kalacı
Hi Peter,

Thanks again for the review, see my comments below:


>
> ==
>
> 1. Commit message
>
> It is often not feasible to use `REPLICA IDENTITY FULL` on the publication
> because it leads to full table scan per tuple change on the subscription.
> This makes `REPLICA IDENTITY FULL` impracticable -- probably other than
> some small number of use cases.
>
> ~
>
> The "often not feasible" part seems repeated by the "impracticable" part.
>


> SUGGESTION
> Using `REPLICA IDENTITY FULL` on the publication leads to a full table
> scan per tuple change on the subscription. This makes `REPLICA
> IDENTITY FULL` impracticable -- probably other than some small number
> of use cases.
>
> ~~~
>

Sure, this is easier to follow, updated.


>
> 2.
>
> The Majority of the logic on the subscriber side already exists in
> the code.
>
> "Majority" -> "majority"
>
>
fixed


> ~~~
>
> 3.
>
> The ones familiar
> with this part of the code could realize that the sequential scan
> code on the subscriber already implements the `tuples_equal()`
> function.
>
> SUGGESTION
> Anyone familiar with this part of the code might recognize that...
>
> ~~~
>

Yes, this is better, applied


>
> 4.
>
> In short, the changes on the subscriber is mostly
> combining parts of (unique) index scan and sequential scan codes.
>
> "is mostly" -> "are mostly"
>
> ~~~
>
>
applied


> 5.
>
> From the performance point of view, there are few things to note.
>
> "are few" -> "are a few"
>
>
applied


> ==
>
> 6. src/backend/executor/execReplication.c - build_replindex_scan_key
>
> +static int
>  build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
>   TupleTableSlot *searchslot)
>  {
> - int attoff;
> + int index_attoff;
> + int scankey_attoff = 0;
>
> Should it be called 'skey_attoff' for consistency with the param 'skey'?
>
>
That looks better, updated


> ~~~
>
> 7.
>
>   Oid operator;
>   Oid opfamily;
>   RegProcedure regop;
> - int pkattno = attoff + 1;
> - int mainattno = indkey->values[attoff];
> - Oid optype = get_opclass_input_type(opclass->values[attoff]);
> + int table_attno = indkey->values[index_attoff];
> + Oid optype = get_opclass_input_type(opclass->values[index_attoff]);
>
> Maybe the 'optype' should be adjacent to the other Oid opXXX
> declarations just to keep them all together?
>

I do not have any preference on this. Although I do not see such a strong
pattern in the code, I have no objection to doing so; changed.

~~~
>
> 8.
>
> + if (!AttributeNumberIsValid(table_attno))
> + {
> + IndexInfo *indexInfo PG_USED_FOR_ASSERTS_ONLY;
> +
> + /*
> + * There are two cases to consider. First, if the index is a primary or
> + * unique key, we cannot have any indexes with expressions. So, at this
> + * point we are sure that the index we are dealing with is not these.
> + */
> + Assert(RelationGetReplicaIndex(rel) != RelationGetRelid(idxrel) &&
> +RelationGetPrimaryKeyIndex(rel) != RelationGetRelid(idxrel));
> +
> + /*
> + * At this point, we are also sure that the index is not consisting
> + * of only expressions.
> + */
> +#ifdef USE_ASSERT_CHECKING
> + indexInfo = BuildIndexInfo(idxrel);
> + Assert(!IndexOnlyOnExpression(indexInfo));
> +#endif
>
> I was a bit confused by the comment. IIUC the code has already called
> the FilterOutNotSuitablePathsForReplIdentFull some point prior so all
> the unwanted indexes are already filtered out. Therefore these
> assertions are just for no reason, other than sanity checking that
> fact, right? If my understand is correct perhaps a simpler single
> comment is possible:
>

Yes, these are for sanity checks.


>
> SUGGESTION (or something like this)
> This attribute is an expression, however
> FilterOutNotSuitablePathsForReplIdentFull was called earlier during
> [...] and the indexes comprising only expressions have already been
> eliminated. We sanity check this now. Furthermore, because primary key
> and unique key indexes can't include expressions we also sanity check
> the index is neither of those kinds.
>
> ~~~
>

I agree that we can improve comments here. I incorporated your suggestion
as well.


>
> 9.
> - return hasnulls;
> + /* We should always use at least one attribute for the index scan */
> + Assert (scankey_attoff > 0);
>
> SUGGESTION
> There should always be at least one attribute for the index scan.
>

applied


>
> ~~~
>
> 10. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex
>
> ScanKeyData skey[INDEX_MAX_KEYS];
> IndexScanDesc scan;
> SnapshotData snap;
> TransactionId xwait;
> Relation idxrel;
> bool found;
> TypeCacheEntry **eq = NULL; /* only used when the index is not unique */
> bool indisunique;
> int scankey_attoff;
>
> 10a.
> Should 'scankey_attoff' be called 'skey_attoff' for consistency with
> the 'skey' array?
>

Yes, it makes sense as you suggested on build_replindex_scan_key

>
> ~
>
> 10b.
> Also, it might be tidier to declare the 'skey_attoff' adjacent to the
> 'skey'.
>

moved

>
> ==
>
> 11. 

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-19 Thread Önder Kalacı
Hi Hayato Kuroda,

Thanks for the review, please see my reply below:


> ===
> For execRelation.c
>
> 01. RelationFindReplTupleByIndex()
>
> ```
> /* Start an index scan. */
> InitDirtySnapshot(snap);
> -   scan = index_beginscan(rel, idxrel, ,
> -
> IndexRelationGetNumberOfKeyAttributes(idxrel),
> -  0);
>
> /* Build scan key. */
> -   build_replindex_scan_key(skey, rel, idxrel, searchslot);
> +   scankey_attoff = build_replindex_scan_key(skey, rel, idxrel,
> searchslot);
>
> +   scan = index_beginscan(rel, idxrel, , scankey_attoff, 0);
> ```
>
> I think "/* Start an index scan. */" should be just above
> index_beginscan().
>

moved there


>
> ===
> For worker.c
>
> 02. sable_indexoid_internal()
>
> ```
> + * Note that if the corresponding relmapentry has InvalidOid
> usableIndexOid,
> + * the function returns InvalidOid.
> + */
> +static Oid
> +usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo
> *relinfo)
> ```
>
> "InvalidOid usableIndexOid" should be "invalid usableIndexOid,"
>

makes sense, updated


>
> 03. check_relation_updatable()
>
> ```
>  * We are in error mode so it's fine this is somewhat slow. It's
> better to
>  * give user correct error.
>  */
> -   if (OidIsValid(GetRelationIdentityOrPK(rel->localrel)))
> +   if (OidIsValid(rel->usableIndexOid))
> {
> ```
>
> Shouldn't we change the above comment to? The check is no longer slow.
>

Hmm, I had not noticed this comment earlier. So you are suggesting that
"slow" here refers to the additional function call GetRelationIdentityOrPK()?
If so, yes, I'll update it.


>
> ===
> For relation.c
>
> 04. GetCheapestReplicaIdentityFullPath()
>
> ```
> +static Path *
> +GetCheapestReplicaIdentityFullPath(Relation localrel)
> +{
> +   PlannerInfo *root;
> +   Query  *query;
> +   PlannerGlobal *glob;
> +   RangeTblEntry *rte;
> +   RelOptInfo *rel;
> +   int attno;
> +   RangeTblRef *rt;
> +   List *joinList;
> +   Path *seqScanPath;
> ```
>
> I think the part that constructs dummy-planner state can be move to
> another function
> because that part is not meaningful for this.
> Especially line 824-846 can.
>
>
Makes sense, I simplified the function. It is always hard to pick
good names for these kinds of helper functions, though. I
picked GenerateDummySelectPlannerInfoForRelation(); does that sound good to
you as well?


>
> ===
> For 032_subscribe_use_index.pl
>
> 05. general
>
> ```
> +# insert some initial data within the range 0-1000
> +$node_publisher->safe_psql('postgres',
> +   "INSERT INTO test_replica_id_full SELECT i%20 FROM
> generate_series(0,1000)i;"
> +);
> ```
>
> It seems that the range of the initial data is [0, 19].
> The same possible mistake appears in many places.
>

Ah, several copy & paste errors. Fixed (hopefully) all.


>
> 06. general
>
> ```
> +# updates 1000 rows
> +$node_publisher->safe_psql('postgres',
> +   "UPDATE test_replica_id_full SET x = x + 1 WHERE x = 15;");
> ```
>
> Only 50 tuples are modified here.
> The same possible mistake appears in many places.
>

Alright, yes there were several wrong comments in the tests. I went over
the tests once more to fix those and improve comments.


>
> 07. general
>
> ```
> +# we check if the index is used or not
> +$node_subscriber->poll_query_until(
> +   'postgres', q{select (idx_scan = 200) from pg_stat_all_indexes
> where indexrelname = 'test_replica_id_full_idx';}
> +) or die "Timed out while waiting for check subscriber tap_sub_rep_full_3
> updates 200 rows via index";
> ```
> The query will be executed until the index scan count is reached, but that
> is not what the comment says.
> How about changing it to "we wait until the index used on the
> subscriber-side." or something?
> The same comment appears in many places.
>

Makes sense, updated


>
> 08. test related with ANALYZE
>
> ```
> +# Testcase start: SUBSCRIPTION CAN UPDATE THE INDEX IT USES AFTER ANALYZE
> - PARTITIONED TABLE
> +# 
> ```
>
> "Testcase start:" should be "Testcase end:" here.
>

thanks, fixed


>
> 09. general
>
> In some tests results are confirmed but in other test they are not.
> I think you can make sure results are expected in any case if there are no
> particular reasons.
>
>
Alright, yes I also don't see a reason not to do that. Added to all cases.


I'll attach the patch with the next email as I also want to incorporate the
other comments. Hope this is not going to be confusing.

Thanks,
Onder


Re: Pruning never visible changes

2022-09-19 Thread Matthias van de Meent
On Mon, 19 Sept 2022 at 01:16, Greg Stark  wrote:
>
> On Fri, 16 Sept 2022 at 10:27, Tom Lane  wrote:
> >
> > Simon Riggs  writes:
> > > A user asked me whether we prune never visible changes, such as
> > > BEGIN;
> > > INSERT...
> > > UPDATE.. (same row)
> > > COMMIT;
> >
> > Didn't we just have this discussion in another thread?
>
> Well.  not "just" :)

This recent thread [0] mentioned the same, and I mentioned it in [1]
too last year.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/2031521.1663076724%40sss.pgh.pa.us#01542683a64a312e5c21541fecd50e63
Subject: Re: Tuples inserted and deleted by the same transaction
Date: 2022-09-13 14:13:44

[1] 
https://www.postgresql.org/message-id/CAEze2Whjnhg96Wt2-DxtBydhmMDmVm2WfWOX3aGcB2C2Hbry0Q%40mail.gmail.com
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-14 09:53:47
(in a thread about a PS comment)




Re: why can't a table be part of the same publication as its schema

2022-09-19 Thread Alvaro Herrera


> diff --git a/doc/src/sgml/logical-replication.sgml 
> b/doc/src/sgml/logical-replication.sgml
> index 1ae3287..0ab768d 100644
> --- a/doc/src/sgml/logical-replication.sgml
> +++ b/doc/src/sgml/logical-replication.sgml
> @@ -1120,6 +1120,11 @@ test_sub=# SELECT * FROM child ORDER BY a;
>
>  
>
> +   Specifying a column list when the publication also publishes
> +   FOR ALL TABLES IN SCHEMA is not supported.
> +  
> +
> +  
> For partitioned tables, the publication parameter
> publish_via_partition_root determines which column list
> is used. If publish_via_partition_root is
> 
> diff --git a/doc/src/sgml/ref/create_publication.sgml 
> b/doc/src/sgml/ref/create_publication.sgml
> index 0a68c4b..0ced7da 100644
> --- a/doc/src/sgml/ref/create_publication.sgml
> +++ b/doc/src/sgml/ref/create_publication.sgml
> @@ -103,17 +103,17 @@ CREATE PUBLICATION  class="parameter">name
>   
>  
>   
> +  Specifying a column list when the publication also publishes
> +  FOR ALL TABLES IN SCHEMA is not supported.
> + 
> 
> @@ -733,6 +694,24 @@ CheckPubRelationColumnList(List *tables, const char 
> *queryString,
>   continue;
>  
>   /*
> +  * Disallow specifying column list if any schema is in the
> +  * publication.
> +  *
> +  * XXX We could instead just forbid the case when the 
> publication
> +  * tries to publish the table with a column list and a schema 
> for that
> +  * table. However, if we do that then we need a restriction 
> during
> +  * ALTER TABLE ... SET SCHEMA to prevent such a case which 
> doesn't
> +  * seem to be a good idea.
> +  */
> + if (publish_schema)
> + ereport(ERROR,
> + 
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("cannot use publication column 
> list for relation \"%s.%s\"",
> +
> get_namespace_name(RelationGetNamespace(pri->relation)),
> +
> RelationGetRelationName(pri->relation)),
> + errdetail("Column list cannot be 
> specified if any schema is part of the publication or specified in the 
> list."));
> +

This seems a pretty arbitrary restriction.  It feels like you're adding
this restriction precisely so that you don't have to write the code to
reject the ALTER .. SET SCHEMA if an incompatible configuration is
detected.  But we already have such checks in several cases, so I don't
see why this one would not be a good idea as well.

The whole FOR ALL TABLES IN SCHEMA thing seems pretty weird in several
aspects.  Others have already commented about the syntax, which is
unlike what GRANT uses; I'm also surprised that we've gotten away with
it being superuser-only.  Why are we building more superuser-only
features in this day and age?  I think not even FOR ALL TABLES should
require superuser.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)




Re: Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Tom Lane
Zhang Mingli  writes:
> In preprocess_aggref(), list same_input_transnos is used to track compatible 
> transnos.
> Free it if we don’t need it anymore.

Very little of the planner bothers with freeing small allocations
like that.  Can you demonstrate a case where this would actually
make a meaningful difference?

regards, tom lane




Re: HOT chain validation in verify_heapam()

2022-09-19 Thread Aleksander Alekseev
Hi Himanshu,

> Done, updated in the v3 patch.

Thanks for the updated patch.

Here is v4 with fixed compiler warnings and some minor tweaks from me.

I didn't put too much thought into the algorithm but I already see
something strange. At verify_heapam.c:553 you declared curr_xmax and
next_xmin. However the variables are not used/initialized until you
do:

```
if (lp_valid[nextoffnum] && lp_valid[ctx.offnum] &&
TransactionIdIsValid(curr_xmax) &&
TransactionIdEquals(curr_xmax, next_xmin)) {
/* ... */
```

In v4 I elected to initialize both curr_xmax and next_xmin with
InvalidTransactionId for safety and in order to silence the compiler
but still there is no way this condition can succeed.

Please make sure there is no logic missing.

-- 
Best regards,
Aleksander Alekseev


v4-0001-Implement-HOT-chain-validation-in-verify_heapam.patch
Description: Binary data


Re: remove more archiving overhead

2022-09-19 Thread David Steele




On 9/19/22 07:39, Noah Misch wrote:

On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:

On 18.09.22 09:13, Noah Misch wrote:

This documentation change only covers archive_library.  How are users of
archive_command supposed to handle this?


I believe users of archive_command need to do something similar to what is
described here.  However, it might be more reasonable to expect
archive_command users to simply return false when there is a pre-existing
file, as the deleted text notes.  IIRC that is why I added that sentence
originally.


What makes the answer for archive_command diverge from the answer for
archive_library?


I suspect what we are really trying to say here is

===
Archiving setups (using either archive_command or archive_library) should be
prepared for the rare case that an identical archive file is being archived
a second time.  In such a case, they should compare that the source and the
target file are identical and proceed without error if so.

In some cases, it is difficult or impossible to configure archive_command or
archive_library to do this.  In such cases, the archiving command or library
should error like in the case for any pre-existing target file, and
operators need to be prepared to resolve such cases manually.
===

Is that correct?


I wanted it to stop saying anything like the second paragraph, hence commit
d263ced.  Implementing a proper archiving setup is not especially difficult,
and inviting the operator to work around a wrong implementation invites
damaging mistakes under time pressure.


I would also not want to state that duplicate WAL is rare. In practice 
it is pretty common when things are going wrong. Also, implying it is 
rare might lead a user to decide they don't need to worry about it.


Regards,
-David




Re:Ubuntu 16.04: Xenial: Why was it removed from the apt repo?

2022-09-19 Thread Sergei Kornilov
Hi
Ubuntu 16.04 is EOL from April 2021, over a year ago.

https://wiki.postgresql.org/wiki/Apt/FAQ#Where_are_older_versions_of_the_packages.3F

regards, Sergei




Re: Ubuntu 16.04: Xenial: Why was it removed from the apt repo?

2022-09-19 Thread Magnus Hagander
Hello!

Because it has been removed and moved to the archives, as per the warning
from early July.

See
https://www.postgresql.org/message-id/flat/YsV8fmomNNC%2BGpIR%40msg.credativ.de

//Magnus


On Mon, Sep 19, 2022 at 4:46 PM Larry Rosenman  wrote:

>
> All of a sudden I'm getting repo not found for
> Ubuntu 16.04 LTS on the APT repo.  Why?
>
> --
> Larry Rosenman http://www.lerctr.org/~ler
> Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
> US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106
>
>


binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-19 Thread Ashutosh Sharma
Hi All,

Currently, we have the pg_current_wal_insert_lsn and pg_walfile_name SQL
functions which give us information about the next WAL insert
location and the WAL file that the next WAL insert location belongs
to. Can we have binary versions of these SQL functions? They would be
like any other binaries we have, e.g. pg_waldump, to which we can
pass the location of the pg_wal directory. The binary would scan
through the directory to return the next WAL insert location and the
WAL file that the next WAL insert pointer belongs to.

The binary versions of these SQL functions could be used when the server
is offline. This can help us know the overall amount of WAL that needs
to be replayed when the server is in recovery. In the control file we
do have the redo pointer; knowing the end pointer would definitely be
helpful.
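
Just to illustrate the idea, a crude sketch (hypothetical names, hard-coded
16MB segment size) that only reports an upper bound could look like this; the
real tool would use xlogreader to find the exact insert position inside the
newest segment:

```c
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)	/* assumed default */

int
main(int argc, char **argv)
{
	DIR		   *dir;
	struct dirent *de;
	uint64_t	max_segno = 0;
	unsigned int tli, log, seg;

	if (argc < 2 || (dir = opendir(argv[1])) == NULL)	/* path to pg_wal */
		return 1;
	while ((de = readdir(dir)) != NULL)
	{
		/* WAL segment file names are exactly 24 hex characters */
		if (strlen(de->d_name) == 24 &&
			sscanf(de->d_name, "%08X%08X%08X", &tli, &log, &seg) == 3)
		{
			uint64_t	segno = (uint64_t) log * (UINT64_C(0x100000000) / WAL_SEG_SIZE) + seg;

			if (segno > max_segno)
				max_segno = segno;
		}
	}
	closedir(dir);
	/* upper bound: the LSN at the end of the newest segment */
	printf("WAL ends at or before %X/%X\n",
		   (unsigned int) (((max_segno + 1) * WAL_SEG_SIZE) >> 32),
		   (unsigned int) ((max_segno + 1) * WAL_SEG_SIZE));
	return 0;
}
```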

If you are OK with this then I will prepare a patch for it and share it.
Please let me know your thoughts/comments. Thank you!

--
With Regards,
Ashutosh Sharma.




Ubuntu 16.04: Xenial: Why was it removed from the apt repo?

2022-09-19 Thread Larry Rosenman



All of a sudden I'm getting repo not found for
Ubuntu 16.04 LTS on the APT repo.  Why?

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




Re: remove more archiving overhead

2022-09-19 Thread Noah Misch
On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
> On 18.09.22 09:13, Noah Misch wrote:
> >>>This documentation change only covers archive_library.  How are users of
> >>>archive_command supposed to handle this?
> >>
> >>I believe users of archive_command need to do something similar to what is
> >>described here.  However, it might be more reasonable to expect
> >>archive_command users to simply return false when there is a pre-existing
> >>file, as the deleted text notes.  IIRC that is why I added that sentence
> >>originally.
> >
> >What makes the answer for archive_command diverge from the answer for
> >archive_library?
> 
> I suspect what we are really trying to say here is
> 
> ===
> Archiving setups (using either archive_command or archive_library) should be
> prepared for the rare case that an identical archive file is being archived
> a second time.  In such a case, they should compare that the source and the
> target file are identical and proceed without error if so.
> 
> In some cases, it is difficult or impossible to configure archive_command or
> archive_library to do this.  In such cases, the archiving command or library
> should error like in the case for any pre-existing target file, and
> operators need to be prepared to resolve such cases manually.
> ===
> 
> Is that correct?

I wanted it to stop saying anything like the second paragraph, hence commit
d263ced.  Implementing a proper archiving setup is not especially difficult,
and inviting the operator to work around a wrong implementation invites
damaging mistakes under time pressure.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Sharma
On Mon, Sep 19, 2022 at 5:24 PM Ashutosh Bapat
 wrote:
>
> On Mon, Sep 19, 2022 at 1:43 PM Ashutosh Sharma  wrote:
> >
> > On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> > > >  wrote:
> > > > >
> > > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma 
> > > > >  wrote:
> > > Can you please point to the documentation.
> > >
> >
> > AFAIU there is just one documentation. Here is the link for it:
> >
> > https://www.postgresql.org/docs/current/view-pg-replication-slots.html
>
> Thanks. Description of confirmed_flush_lsn is "The address (LSN) up to
> which the logical slot's consumer has confirmed receiving data. Data
> older than this is not available anymore. NULL for physical slots."
> The second sentence is misleading. AFAIU, it really should be "Data
> corresponding to the transactions committed before this LSN is not
> available anymore". WAL before restart_lsn is likely to be removed but
> WAL with LSN higher than restart_lsn is preserved. This correction
> makes more sense because of the third sentence.
>

Thanks for the clarification. Attached is the patch with the changes.
Please have a look.

--
With Regards,
Ashutosh Sharma.
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f573a4..0d7285d 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2392,8 +2392,9 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   
The address (LSN) up to which the logical
-   slot's consumer has confirmed receiving data. Data older than this is
-   not available anymore. NULL for physical slots.
+   slot's consumer has confirmed receiving data. Data corresponding to the
+   transactions committed before this LSN is not
+   available anymore. NULL for physical slots.
   
  
 


Re: Switching XLog source from archive to streaming when primary available

2022-09-19 Thread Bharath Rupireddy
On Fri, Sep 16, 2022 at 4:58 PM Bharath Rupireddy
 wrote:
>
> On Fri, Sep 16, 2022 at 12:06 PM Kyotaro Horiguchi
>  wrote:
> >
> > In other words, it seems to me that the macro name doesn't manifest
> > the condition correctly.
> >
> > I don't think we particularly want to do that unconditionally.
> > I just wanted to get rid of the macro from the usage site.  Even if
> > the same condition is used elsewhere, I think it is better to write out the
> > condition directly there.
>
> I wanted to avoid a bit of duplicate code there. How about naming that
> macro IsXLOGSourceSwitchToStreamEnabled() or
> SwitchFromArchiveToStreamEnabled() or just SwitchFromArchiveToStream()
> or any other better name?

SwitchFromArchiveToStreamEnabled() seemed better at this point. I'm
attaching the v7 patch with that change. Please review it further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b94327d9af60a44f252251be97ee1efabc964f97 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 19 Sep 2022 14:04:49 +
Subject: [PATCH v7] Allow standby to switch WAL source from archive to
 streaming replication

A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.

This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.

Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  45 +++
 src/backend/access/transam/xlogrecovery.c | 124 +++--
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/t/034_wal_source_switch.pl  | 126 ++
 6 files changed, 301 insertions(+), 11 deletions(-)
 create mode 100644 src/test/recovery/t/034_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 700914684d..892442a053 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,51 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from WAL archive to streaming replication (get WAL from
+primary). If the standby fails to switch to stream mode, it falls back
+to archive mode.
+If this value is specified without units, it is taken as milliseconds.
+The default is five minutes (5min).
+With a lower setting of this parameter, the standby makes frequent
+WAL source switch attempts when the primary is lost for quite longer.
+To avoid this, set a reasonable value.
+A setting of 0 disables the feature. When disabled,
+the standby typically switches to stream mode, only when receive from
+WAL archive finishes (no more WAL left there) or fails for any reason.
+This parameter can only be set in
+the postgresql.conf file or on the server
+command line.
+   
+   
+Reading WAL from archive may not always be efficient and cheaper
+because network latencies, disk IO cost might differ on the archive as
+compared to primary and often the archive may sit far from standby
+impacting recovery performance on the standby. Hence reading WAL
+from the primary, by setting this parameter, as opposed to the archive
+enables the standby to catch up with the primary sooner thus reducing
+replication lag and avoiding WAL files accumulation on the primary.
+  
+   
+Note that the standby may not always attempt to switch source from
+WAL archive to streaming replication at exact
+streaming_replication_retry_interval intervals.
+For example, if the parameter is set to 1min and
+fetching from WAL archive takes 5min, then the
+source 

Re: cataloguing NOT NULL constraints

2022-09-19 Thread Matthias van de Meent
On Mon, 19 Sept 2022 at 15:32, Robert Haas  wrote:
>
> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera  
> wrote:
> > If you say CREATE TABLE (a int NOT NULL), you'll get a CHECK constraint
> > printed by psql: (this is a bit more noisy that previously and it
> > changes a lot of regression tests output).
> >
> > 55489 16devel 1776237=# create table tab (a int not null);
> > CREATE TABLE
> > 55489 16devel 1776237=# \d tab
> > Tabla «public.tab»
> >  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> > ─┼─┼──┼──┼─
> >  a   │ integer │  │ not null │
> > Restricciones CHECK:
> > "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.

I'm not sure on the 'good' part of this alternative, but we could go
with a single row-based IS NOT NULL to reduce such clutter, utilizing
the `ROW() IS NOT NULL` requirement of a row only matching IS NOT NULL
when all attributes are also IS NOT NULL:

Check constraints:
"tab_notnull_check" CHECK (ROW(a, b, c, d, e) IS NOT NULL)

instead of:

Check constraints:
"tab_a_not_null" CHECK (a IS NOT NULL)
"tab_b_not_null" CHECK (b IS NOT NULL)
"tab_c_not_null" CHECK (c IS NOT NULL)
"tab_d_not_null" CHECK (d IS NOT NULL)
"tab_e_not_null" CHECK (e IS NOT NULL)

But the performance of repeated row-casting would probably not be as
good as our current NULL checks if we'd use the current row
infrastructure, and constraint failure reports wouldn't be as helpful
as the current attribute NOT NULL failures.

Kind regards,

Matthias van de Meent




Re: cataloguing NOT NULL constraints

2022-09-19 Thread Tom Lane
Isaac Morland  writes:
> I thought I saw some discussion about the SQL standard saying that there is
> a difference between putting NOT NULL in a column definition, and CHECK
> (column_name NOT NULL). So if we're going to take this seriously, I think
> that means there needs to be a field in pg_constraint which identifies
> whether a constraint is a "real" one created explicitly as a constraint, or
> if it is just one created because a field is marked NOT NULL.

If we're going to go that way, I think that we should take the further
step of making not-null constraints be their own contype rather than
an artificially generated CHECK.  The bloat in pg_constraint from CHECK
expressions made this way seems like an additional reason not to like
doing it like that.

regards, tom lane




Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-19 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Sep 19, 2022 at 10:16 AM Thomas Munro  wrote:
>> Huh... wouldn't systems that pass arguments right-to-left on the stack
>> receive NULL for node?  That'd include the SysV i386 convention used
>> on Linux, *BSD etc.  But that can't be right or we'd know about it...

> I take that back after looking up some long forgotten details; it
> happily ignores extra arguments.

Yeah; the fact that no one has complained in several years seems to
indicate that there's not a problem on supported platforms.  Still,
unlike the quibbles over whether char and struct pointers are the
same, it seems clear that this is the sort of inconsistency that
C2x wants to forbid, presumably in the name of making the world
safe for more-efficient function calling code.  So I think we'd
better go fix ExecShutdownNode before somebody breaks it.

Whichever way we jump on the tree-walker API changes, those won't
be back-patchable.  I think the best we can do for the back branches
is add a configure test to use -Wno-deprecated-non-prototype
if available.  But the ExecShutdownNode change could be back-patched,
and I'm leaning to doing so even though that breakage is just
hypothetical today.

regards, tom lane




Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-19 Thread Bharath Rupireddy
Hi,

I was recently asked about converting an offset reported in WAL read
error messages[1] to an LSN with which pg_waldump can be used to
verify the records or WAL file around that LSN (basically one can
filter out the output based on LSN). AFAICS, there's no function that
takes offset as an input and produces an LSN and I ended up figuring
out LSN manually. And, for some of my work too, I was hitting errors
in XLogReaderValidatePageHeader() and adding recptr to those error
messages helped me debug issues faster.

We have a bunch of messages [1] that have an offset, but no LSN, in
the error message. First, is there an easy way to figure out the LSN
from the offset reported in the error messages? If not, is adding the LSN to
these messages along with offset a good idea? Of course, we can't just
convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but
something meaningful like reporting the LSN of the page that we are
reading-in or writing-out etc.
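
For what it's worth, one way to do that conversion manually is roughly the
following (a backend-side sketch built on the helpers in
access/xlog_internal.h; the exact signatures may differ a bit across
branches):

```c
#include "postgres.h"
#include "access/xlog_internal.h"

static XLogRecPtr
lsn_from_walfile_offset(const char *fname, uint32 offset, int wal_segsz_bytes)
{
	TimeLineID	tli;
	XLogSegNo	segno;
	XLogRecPtr	result;

	XLogFromFileName(fname, &tli, &segno, wal_segsz_bytes);
	XLogSegNoOffsetToRecPtr(segno, offset, wal_segsz_bytes, result);
	return result;
}

/* e.g. lsn_from_walfile_offset("000000010000000000000001", 4096, wal_segment_size) */
```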

Thoughts?

[1]
errmsg("could not read from WAL segment %s, offset %u: %m",
errmsg("could not read from WAL segment %s, offset %u: %m",
errmsg("could not write to log file %s "
   "at offset %u, length %zu: %m",
errmsg("unexpected timeline ID %u in WAL segment %s, offset %u",
errmsg("could not read from WAL segment %s, offset %u: read %d of %zu",
pg_log_error("received write-ahead log record for offset %u with no file open",
"invalid magic number %04X in WAL segment %s, offset %u",
"invalid info bits %04X in WAL segment %s, offset %u",
"invalid info bits %04X in WAL segment %s, offset %u",
"unexpected pageaddr %X/%X in WAL segment %s, offset %u",
"out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u",

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: cataloguing NOT NULL constraints

2022-09-19 Thread Isaac Morland
On Mon, 19 Sept 2022 at 09:32, Robert Haas  wrote:

> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera 
> wrote:
> > If you say CREATE TABLE (a int NOT NULL), you'll get a CHECK constraint
> > printed by psql: (this is a bit more noisy that previously and it
> > changes a lot of regression tests output).
> >
> > 55489 16devel 1776237=# create table tab (a int not null);
> > CREATE TABLE
> > 55489 16devel 1776237=# \d tab
> > Tabla «public.tab»
> >  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> > ─┼─┼──┼──┼─
> >  a   │ integer │  │ not null │
> > Restricciones CHECK:
> > "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.
>

I thought I saw some discussion about the SQL standard saying that there is
a difference between putting NOT NULL in a column definition, and CHECK
(column_name NOT NULL). So if we're going to take this seriously, I think
that means there needs to be a field in pg_constraint which identifies
whether a constraint is a "real" one created explicitly as a constraint, or
if it is just one created because a field is marked NOT NULL.

If this is correct, the answer is easy: don't show constraints that are
there only because of a NOT NULL in the \d or \d+ listings. I certainly
don't want to see that clutter and I'm having trouble seeing why anybody
else would want to see it either; the information is already there in the
"Nullable" column of the field listing.

The error message for a duplicate constraint name when creating a
constraint needs however to be very clear that the conflict is with a NOT
NULL constraint and which one, since I'm proposing leaving those ones off
the visible listing, and it would be very bad for somebody to get
"duplicate name" and then be unable to see the conflicting entry.


Re: cataloguing NOT NULL constraints

2022-09-19 Thread Robert Haas
On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera  wrote:
> If you say CREATE TABLE (a int NOT NULL), you'll get a CHECK constraint
> printed by psql: (this is a bit more noisy that previously and it
> changes a lot of regression tests output).
>
> 55489 16devel 1776237=# create table tab (a int not null);
> CREATE TABLE
> 55489 16devel 1776237=# \d tab
> Tabla «public.tab»
>  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> ─┼─┼──┼──┼─
>  a   │ integer │  │ not null │
> Restricciones CHECK:
> "tab_a_not_null" CHECK (a IS NOT NULL)

In a table with many columns, most of which are NOT NULL, this is
going to produce a ton of clutter. I don't like that.

I'm not sure what a good alternative would be, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


pgstat: stats added in ReadPageInternal() aren't getting reported via pg_stat_wal

2022-09-19 Thread Bharath Rupireddy
Hi,

I added xlogreader cache stats (hits/misses) to pg_stat_wal in
ReadPageInternal() for some of my work and ran some tests with logical
replication subscribers. I had expected that the new stats generated
by walsenders serving the subscribers would be accumulated and shown
via pg_stat_wal view in another session, but that didn't happen. Upon
looking around, I thought adding
pgstat_flush_wal()/pgstat_report_wal() around proc_exit() in
walsender.c might work, but that didn't help either. Can anyone please
let me know why the stats that I added aren't shown via pg_stat_wal
view despite the walsender doing pgstat_initialize() ->
pgstat_init_wal()? Am I missing something?

I'm attaching here with the patch that I was using.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 6bee81b6284d1619d6828e521d1d210ae5505034 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 19 Sep 2022 13:11:31 +
Subject: [PATCH v1] Add xlogreader cache stats

---
 src/backend/access/transam/xlogreader.c | 9 +
 src/backend/catalog/system_views.sql| 4 +++-
 src/backend/utils/activity/pgstat_wal.c | 2 ++
 src/backend/utils/adt/pgstatfuncs.c | 9 +++--
 src/include/catalog/pg_proc.dat | 6 +++---
 src/include/pgstat.h| 2 ++
 src/test/regress/expected/rules.out | 6 --
 7 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 050d2f424e..9a5b3d7893 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1000,7 +1000,16 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	/* check whether we have all the requested data already */
 	if (targetSegNo == state->seg.ws_segno &&
 		targetPageOff == state->segoff && reqLen <= state->readLen)
+	{
+#ifndef FRONTEND
+		PendingWalStats.xlogreader_cache_hits++;
+#endif
 		return state->readLen;
+	}
+
+#ifndef FRONTEND
+	PendingWalStats.xlogreader_cache_misses++;
+#endif
 
 	/*
 	 * Invalidate contents of internal buffer before read attempt.  Just set
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f7ec79e0..0baab55e6b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1124,7 +1124,9 @@ CREATE VIEW pg_stat_wal AS
 w.wal_sync,
 w.wal_write_time,
 w.wal_sync_time,
-w.stats_reset
+w.stats_reset,
+w.xlogreader_cache_hits,
+w.xlogreader_cache_misses
 FROM pg_stat_get_wal() w;
 
 CREATE VIEW pg_stat_progress_analyze AS
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..ba97d71471 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -105,6 +105,8 @@ pgstat_flush_wal(bool nowait)
 	WALSTAT_ACC(wal_sync);
 	WALSTAT_ACC(wal_write_time);
 	WALSTAT_ACC(wal_sync_time);
+	WALSTAT_ACC(xlogreader_cache_hits);
+	WALSTAT_ACC(xlogreader_cache_misses);
 #undef WALSTAT_ACC
 
 	LWLockRelease(_shmem->lock);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index be15b4b2e5..c65c7c08a8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1732,7 +1732,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	11
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1759,7 +1759,10 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	   FLOAT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 	   TIMESTAMPTZOID, -1, 0);
-
+	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "xlogreader_cache_hits",
+	   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "xlogreader_cache_misses",
+	   INT8OID, -1, 0);
 	BlessTupleDesc(tupdesc);
 
 	/* Get statistics about WAL activity */
@@ -1785,6 +1788,8 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
 
 	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[9] = Int64GetDatum(wal_stats->xlogreader_cache_hits);
+	values[10] = Int64GetDatum(wal_stats->xlogreader_cache_misses);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..f728406061 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5649,9 +5649,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 

Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-19 Thread Bharath Rupireddy
On Mon, Sep 19, 2022 at 2:38 PM Fujii Masao  wrote:
>
> Thanks for updating the patch!
>
> =# SELECT * FROM pg_backup_start('test', true);
> =# SELECT * FROM pg_backup_stop();
> LOG:  server process (PID 15651) was terminated by signal 11: Segmentation 
> fault: 11
> DETAIL:  Failed process was running: SELECT * FROM pg_backup_stop();
>
> With v7 patch, pg_backup_stop() caused the segmentation fault.

Fixed. I had assumed that the regression tests covered pg_backup_start()
and pg_backup_stop() and relied on make check-world; surprisingly, there
is no test that covers these functions. Would it be a good idea to add
tests for them in misc_functions.sql or backup.sql or somewhere else, so
that they get run as part of make check? Thoughts?
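Something along these lines, perhaps (just a rough sketch; the file
choice, label, and expected output are assumptions and would need
adjusting):

```sql
-- Hypothetical addition to src/test/regress/sql/misc_functions.sql:
-- exercise pg_backup_start()/pg_backup_stop() plus one error path.
SELECT pg_backup_start('regress_backup_label', true) IS NOT NULL AS started;
SELECT lsn IS NOT NULL AS stopped FROM pg_backup_stop();
SELECT pg_backup_stop();  -- expected to fail: backup is not in progress
```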

> =# SELECT * FROM pg_backup_start(repeat('test', 1024));
> ERROR:  backup label too long (max 1024 bytes)
> STATEMENT:  SELECT * FROM pg_backup_start(repeat('test', 1024));
>
> =# SELECT * FROM pg_backup_start(repeat('testtest', 1024));
> LOG:  server process (PID 15844) was terminated by signal 11: Segmentation 
> fault: 11
> DETAIL:  Failed process was running: SELECT * FROM 
> pg_backup_start(repeat('testtest', 1024));
>
> When I specified longer backup label in the second run of pg_backup_start()
> after the first run failed, it caused the segmentation fault.
>
>
> +   state = (BackupState *) palloc0(sizeof(BackupState));
> +   state->name = pstrdup(name);
>
> pg_backup_start() calls allocate_backup_state() and allocates the memory for
> the input backup label before checking its length in do_pg_backup_start().
> This can cause the memory for backup label to be allocated too much
> unnecessary. I think that the maximum length of BackupState.name should
> be MAXPGPATH (i.e., maximum allowed length for backup label).

That's a good idea. I'm now setting a flag if the label name overflows
(> MAXPGPATH) and using it later in do_pg_backup_start() to error out. We
could have thrown the error directly, but that would change the behaviour:
right now the "wal_level must be set to \"replica\" or \"logical\" at
server start." error is emitted first, and the label-name overflow error
only after that. I don't want to change that order.

> > Yeah, but they have to be carried from do_pg_backup_stop() to
> > pg_backup_stop() or callers and also instead of keeping tablespace_map
> > in BackupState and others elsewhere don't seem to be a good idea to
> > me. IMO, BackupState is a good place to contain all the information
> > that's carried across various functions.
>
> In v7 patch, since pg_backup_stop() calls build_backup_content(),
> backup_label and history_file seem not to be carried from do_pg_backup_stop()
> to pg_backup_stop(). This makes me still think that it's better not to include
> them in BackupState...

I'm a bit worried about the backup state getting spread around if we
separate backup_label and history_file out of BackupState while keeping
the tablespace_map contents there. As I said upthread, we are not
allocating memory for them at the beginning; we allocate it only when
needed. IMO, this code is readable and more extensible.

I've also made use of the error callback mechanism to clean up the
allocated memory in case of a failure. For the do_pg_abort_backup()
cases, I think we don't need to clean up the memory because that gets
called on proc exit (before_shmem_exit()).

Please review the v8 patch further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v8-0001-Refactor-backup-related-code.patch
Description: Binary data


Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Bapat
On Mon, Sep 19, 2022 at 1:43 PM Ashutosh Sharma  wrote:
>
> On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
>  wrote:
> >
> > On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  
> > wrote:
> > >
> > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> > >  wrote:
> > > >
> > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma  
> > > > wrote:
> > Can you please point to the documentation.
> >
>
> AFAIU there is just one documentation. Here is the link for it:
>
> https://www.postgresql.org/docs/current/view-pg-replication-slots.html

Thanks. Description of confirmed_flush_lsn is "The address (LSN) up to
which the logical slot's consumer has confirmed receiving data. Data
older than this is not available anymore. NULL for physical slots."
The second sentence is misleading. AFAIU, it really should be "Data
corresponding to the transactions committed before this LSN is not
available anymore". WAL before restart_lsn is likely to be removed but
WAL with LSN higher than restart_lsn is preserved. This correction
makes more sense because of the third sentence.
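
(As a quick illustration, the columns being discussed can be compared
side by side; this is just an example query, not part of the proposed
documentation change.)

```sql
-- confirmed_flush_lsn is NULL for physical slots; restart_lsn marks how far
-- back WAL is still retained for the slot.
SELECT slot_name, slot_type, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```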

>
> And.. lastly sorry for the delayed response. I was not well and
> couldn't access email for quite some days. The poor dengue had almost
> killed me :(

Dengue had almost killed me also. Take care.

-- 
Best Wishes,
Ashutosh Bapat




Re: cataloguing NOT NULL constraints

2022-09-19 Thread Alvaro Herrera
On 2022-Sep-14, Peter Eisentraut wrote:

> Reading through the SQL standard again, I think this patch goes a bit too
> far in folding NOT NULL and CHECK constraints together.  The spec says that
> you need to remember whether a column was defined as NOT NULL, and that the
> commands DROP NOT NULL and SET NOT NULL only affect constraints defined in
> that way.  In this implementation, a constraint defined as NOT NULL is
> converted to a CHECK (x IS NOT NULL) constraint and the original definition
> is forgotten.

Hmm, I don't read it the same way.  Reading SQL:2016, they talk about a
nullability characteristic (/known not nullable/ or /possibly
nullable/):

: 4.13 Columns, fields, and attributes
: [...]
: Every column has a nullability characteristic that indicates whether the
: value from that column can be the null value. A nullability characteristic
: is either known not nullable or possibly nullable.
: Let C be a column of a base table T. C is known not nullable if and only
: if at least one of the following is true:
: — There exists at least one constraint NNC that is enforced and not
: deferrable and that simply contains a  that is a
:  that is a readily-known-not-null condition for C.
: [other possibilities]

then in the same section they explain that this is derived from a
table constraint:

: A column C is described by a column descriptor. A column descriptor
: includes:
: [...]
: — If C is a column of a base table, then an indication of whether it is
: defined as NOT NULL and, if so, the constraint name of the associated
: table constraint definition.

   [aside: note that elsewhere (), they define
   "readily-known-not-null" in Syntax Rule 3), of 6.39 :

   : 3) Let X denote either a column C or the  VALUE. Given a
   :  BVE and X, the notion “BVE is a
   : readily-known-not-null condition for X” is defined as follows.
   : Case:
   :  a) If BVE is a  of the form “RVE IS NOT NULL”, where RVE is a
   :  that is a  that
   : simply contains a , , or
   :  that is a  that
   : references C, then BVE is a readily-known-not-null condition for C.
   :  b) If BVE is the  “VALUE IS NOT NULL”, then BVE is a
   : readily-known-not-null condition for VALUE.
   :  c) Otherwise, BVE is not a readily-known-not-null condition for X.
   edisa]

Later,  says literally that specifying NOT NULL in a
column is equivalent to the CHECK (.. IS NOT NULL) table constraint:

: 11.4 
: 
: Syntax Rules,
: 17) If a  is specified, then let CND be
: the  if one is specified and let CND be the
: zero-length character character string otherwise; let CA be the
:  if specified and let CA be the zero-length
: character string otherwise. The  is
: equivalent to a  as follows.
: 
: Case:
: 
: a) If a  is specified that contains the
:  NOT NULL, then it is equivalent to the following
: :
:CND CHECK ( C IS NOT NULL ) CA

In my reading of it, it doesn't follow that you have to remember whether
the table constraint was created by saying explicitly by doing CHECK (c
IS NOT NULL) or as a plain NOT NULL column constraint.  The idea of
being able to do DROP NOT NULL when only a constraint defined as CHECK
(c IS NOT NULL) exists seems to follow from there; and also that you can
use DROP CONSTRAINT to remove one added via plain NOT NULL; and that
both these operations change the nullability characteristic of the
column.  This is made more explicit by the fact that they do state that
the nullability characteristic can *not* be "destroyed" for other types
of constraints, in 11.26 , Syntax Rule
11)

: 11) Destruction of TC shall not cause the nullability characteristic of
: any of the following columns of T to change from known not nullable to
: possibly nullable:
: 
: a) A column that is a constituent of the primary key of T, if any.
: b) The system-time period start column, if any.
: c) The system-time period end column, if any.
: d) The identity column, if any.

then General Rule 7) explains that this does indeed happen for columns
declared to have some sort of NOT NULL constraint, without saying
exactly how was that constraint defined:

: 7) If TC causes some column COL to be known not nullable and no other
: constraint causes COL to be known not nullable, then the nullability
: characteristic of the column descriptor of COL is changed to possibly
: nullable.

> Besides that, I think that users are not going to like that pg_dump rewrites
> their NOT NULL constraints into CHECK table constraints.

This is a good point, but we could get around it by decreeing that
pg_dump dumps the NOT NULL in the old way if the name is not changed
from whatever would be generated normally.  This would require some
games to remove the CHECK one; and it would also mean that partitions
would not use the same constraint as the parent, but rather it'd have to
generate a new constraint name that uses its own table name, rather than
the parent's.
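
To illustrate the pg_dump point with a made-up example (the constraint
name below is hypothetical, only to show the shape of the two outputs):
for a table created as CREATE TABLE t (a int NOT NULL), the patch as
posted would dump something like the first form, while the workaround
above would keep emitting the second whenever the constraint still has
its generated name.

```sql
-- Dump output under the patch as posted (the constraint name is hypothetical):
CREATE TABLE t (a integer, CONSTRAINT t_a_not_null CHECK (a IS NOT NULL));
-- Dump output with the workaround, when the generated name has not been changed:
CREATE TABLE t (a integer NOT NULL);
```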

(This makes me wonder what should happen if you rename a table: should
we go around and rename 

Re: Fix typos in code comments

2022-09-19 Thread Justin Pryzby
On Mon, Sep 19, 2022 at 02:44:12AM +, houzj.f...@fujitsu.com wrote:
> While working on some other patches, I found serval typos(duplicate words and
> incorrect function name reference) in the code comments. Here is a small patch
> to fix them.

Thanks.

On Mon, Sep 19, 2022 at 11:05:24AM +0800, Zhang Mingli wrote:
> Good catch. There is a similar typo in doc, runtime.sgml.
> ```using TLS protocols enabled by by setting the parameter```

That one should be backpatched to v15.

Find below some others.

diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 901dd435efd..160296e1daf 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -2155,7 +2155,7 @@ InitPartitionPruneContext(PartitionPruneContext *context,
  * Current values of the indexes present in PartitionPruneState count all the
  * subplans that would be present before initial pruning was done.  If initial
  * pruning got rid of some of the subplans, any subsequent pruning passes will
- * will be looking at a different set of target subplans to choose from than
+ * be looking at a different set of target subplans to choose from than
  * those in the pre-initial-pruning set, so the maps in PartitionPruneState
  * containing those indexes must be updated to reflect the new indexes of
  * subplans in the post-initial-pruning set.
diff --git a/src/backend/utils/activity/pgstat.c 
b/src/backend/utils/activity/pgstat.c
index 6224c498c21..5b0f26e3b07 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -556,7 +556,7 @@ pgstat_initialize(void)
  * suggested idle timeout is returned. Currently this is always
  * PGSTAT_IDLE_INTERVAL (1ms). Callers can use the returned time to set up
  * a timeout after which to call pgstat_report_stat(true), but are not
- * required to to do so.
+ * required to do so.
  *
  * Note that this is called only when not within a transaction, so it is fair
  * to use transaction stop time as an approximation of current time.
diff --git a/src/backend/utils/activity/pgstat_replslot.c 
b/src/backend/utils/activity/pgstat_replslot.c
index b77c05ab5fa..9a59012a855 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -8,7 +8,7 @@
  * storage implementation and the details about individual types of
  * statistics.
  *
- * Replication slot stats work a bit different than other other
+ * Replication slot stats work a bit different than other
  * variable-numbered stats. Slots do not have oids (so they can be created on
  * physical replicas). Use the slot index as object id while running. However,
  * the slot index can change when restarting. That is addressed by using the
diff --git a/src/test/ssl/t/SSL/Server.pm b/src/test/ssl/t/SSL/Server.pm
index 62f54dcbf16..0a9e5da01e4 100644
--- a/src/test/ssl/t/SSL/Server.pm
+++ b/src/test/ssl/t/SSL/Server.pm
@@ -257,7 +257,7 @@ The certificate file to use. Implementation is SSL backend 
specific.
 
 =item keyfile => B
 
-The private key to to use. Implementation is SSL backend specific.
+The private key file to use. Implementation is SSL backend specific.
 
 =item crlfile => B
 




Re: RFC: Logging plan of the running query

2022-09-19 Thread a.rybakina

Hi,

I'm sorry if this message duplicates my previous one, but the previous
message was sent incorrectly. I sent it from the email address
lena.riback...@yandex.ru.


I liked this idea, and after reviewing the code I noticed a few things
and would like to ask you some questions.


First, I suggest some edits to the commit message. I think the result is
more concise while staying just as clear. I wrote it below since I can't
think of any other way to propose it.


```
Currently, we have to wait for the query execution to finish before we
can check its plan. This is not so convenient when investigating
long-running queries on production environments where we cannot use
debuggers.

To improve this situation, this patch adds the pg_log_query_plan()
function, which requests logging of the plan of the specified backend
process.

By default, only superusers are allowed to request logging of the plan;
allowing any user to issue such requests could produce lots of log
messages and could lead to denial of service.

At the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at
LOG_SERVER_ONLY level, so the plan appears in the server log only and is
not sent to the client.
```
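
To illustrate the intended use (a sketch only; I am assuming the function
takes the target backend's PID, as the description above implies):

```sql
-- As a superuser, ask every other active backend to log the plan of its current
-- query; the plans show up in the server log, not in this session.
SELECT pg_log_query_plan(pid)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();
```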

Second, I have a question about removing USE_ASSERT_CHECKING in lock.h.
That value is supposed to be checked elsewhere in the code by matching
values. I am worried that, in the places where GetLockMethodLocalHash()
is involved, errors could be skipped because they are no longer exercised
under the assert option, and we would not be able to get a core file in
segfault cases. I might be misunderstanding something; if so, can you
please explain it to me?


Third, I don't understand why save_ActiveQueryDesc is declared in
pquery.h. It seems to be used only once, in the ExecutorRun function, so
its declaration there looks superfluous. I changed this in the attached
patch.


Fourth, it seems to me there are not enough explanatory comments in the
code. I added some in the attached patch.


Lastly, I am unsure about the signal handling, since I have not worked
with it before. Could another signal prevent this one from causing the
query plan to be logged? I noticed that this signal is checked last in
the procsignal_sigusr1_handler function.


Regards,

--
Alena Rybakina
Postgres Professional

19.09.2022, 11:01, "torikoshia" :

On 2022-05-16 17:02, torikoshia wrote:

 On 2022-03-09 19:04, torikoshia wrote:

 On 2022-02-08 01:13, Fujii Masao wrote:

 AbortSubTransaction() should reset ActiveQueryDesc to
 save_ActiveQueryDesc that ExecutorRun() set, instead
of NULL?
 Otherwise ActiveQueryDesc of top-level statement will
be unavailable
 after subtransaction is aborted in the nested statements.


 I once agreed above suggestion and made v20 patch making
 save_ActiveQueryDesc a global variable, but it caused
segfault when
 calling pg_log_query_plan() after FreeQueryDesc().

 OTOH, doing some kind of reset of ActiveQueryDesc seems
necessary
 since it also caused segfault when running
pg_log_query_plan() during
 installcheck.

 There may be a better way, but resetting ActiveQueryDesc
to NULL seems
 safe and simple.
 Of course it makes pg_log_query_plan() useless after a
subtransaction
 is aborted.
 However, if it does not often happen that people want to
know the
 running query's plan whose subtransaction is aborted,
resetting
 ActiveQueryDesc to NULL would be acceptable.

 Attached is a patch that sets ActiveQueryDesc to NULL when a
 subtransaction is aborted.

 How do you think?

Attached new patch to fix patch apply failures again.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a82ac87457e..65b692b0ddf 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -306,6 +306,12 @@ ExecutorRun(QueryDesc *queryDesc,
 {
 	QueryDesc *save_ActiveQueryDesc;
 
+	/*
+	 * Save the current value of ActiveQueryDesc before calling the
+	 * ExecutorRun_hook function, because it may be reset by
+	 * AbortSubTransaction.
+	 */
+
 	save_ActiveQueryDesc = ActiveQueryDesc;
 	ActiveQueryDesc = queryDesc;
 
@@ -314,6 +320,7 @@ ExecutorRun(QueryDesc *queryDesc,
 	else
 		standard_ExecutorRun(queryDesc, direction, count, execute_once);
 
+	/* Restore the saved value of ActiveQueryDesc */
 	ActiveQueryDesc = save_ActiveQueryDesc;
 }
 
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index fc9f9f8e3f0..8e7ce3c976f 100644
--- a/src/include/commands/explain.h
+++ 

Free list same_input_transnos in preprocess_aggref

2022-09-19 Thread Zhang Mingli
Hi,


In preprocess_aggref(), list same_input_transnos is used to track compatible 
transnos.

Free it if we don’t need it anymore.

```

/*
 * 2. See if this aggregate can share transition state with another
 * aggregate that we've initialized already.
 */
 transno = find_compatible_trans(root, aggref, shareable,
 aggtransfn, aggtranstype,
 transtypeLen, transtypeByVal,
 aggcombinefn,
 aggserialfn, aggdeserialfn,
 initValue, initValueIsNull,
 same_input_transnos);
 list_free(same_input_transnos);

```

Not sure whether this is worth it, as the list will be freed sooner or
later anyway when the current memory context ends.

But note that in find_compatible_agg(), the list is freed once a
compatible Agg is found.

This patch helps a little when there are lots of incompatible aggregates,
because we try to find compatible transnos again and again, and each
iteration otherwise leaves an unused list behind in memory.
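
For instance (an illustrative query only; the table and column names are
made up), a query with several different aggregates over the same column
builds non-empty same_input_transnos lists that never end up matching, and
without the patch each of them is simply left behind:

```sql
-- Four different aggregates over the same input column: same_input_transnos is
-- non-empty for the later ones, but no transition state can actually be shared,
-- so without the patch each call leaves its list behind until the context is reset.
SELECT sum(x), avg(x), min(x), max(x) FROM some_table;
```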

Regards,
Zhang Mingli


vn-0001-free-list-same_input_transnos-in-preprocess_aggref.patch
Description: Binary data


  1   2   >