Re: New standby_slot_names GUC in PG 17

2024-06-25 Thread Masahiko Sawada
On Tue, Jun 25, 2024 at 1:54 PM Amit Kapila  wrote:
>
> On Tue, Jun 25, 2024 at 8:20 AM Masahiko Sawada  wrote:
> >
> > On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > I agree the current name seems too generic and the suggested ' 
> > > synchronized_standby_slots '
> > > is better than the current one.
> > >
> > > Some other ideas could be:
> > >
> > > synchronize_slots_on_standbys: it indicates that the standbys that enabled
> > > slot sync should be listed in this GUC.
> > >
> > > logical_replication_wait_slots: it means the logical replication(logical
> > > Walsender process) will wait for these slots to advance the confirm flush
> > > lsn before proceeding.
> >
> > I feel that the name that has some connection to "logical replication"
> > also sounds good. Let me add some ideas:
> >
> > - logical_replication_synchronous_standby_slots (might be too long)
> > - logical_replication_synchronous_slots
> >
>
> I see your point about keeping logical_replication in the name but
> that could also lead one to think that this list can contain logical
> slots.

Right.

>  OTOH, there is some value in keeping '_standby_' in the name as
> that is more closely associated with physical standby's and this list
> contains physical slots corresponding to physical standby's.

Agreed.

> So, my
> preference is in order as follows: synchronized_standby_slots,
> wait_for_standby_slots, logical_replication_wait_slots,
> logical_replication_synchronous_slots, and
> logical_replication_synchronous_standby_slots.

I also prefer synchronized_standby_slots.

From a different angle, just for discussion, is it worth considering
the term 'failover', since the purpose of this feature is to ensure that a
standby is ready for failover in terms of logical replication? For
example, failover_standby_slot_names?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Doc Rework: Section 9.16.13 SQL/JSON Query Functions

2024-06-25 Thread David G. Johnston
Hey!

Lots of SQL/JSON threads going about.  This one is less about technical
correctness and more about usability of the documentation.  Though in
writing this I am finding some things that aren't quite clear.  I'm going
to come back with those on a follow-on post once I get a chance to make my
second pass on this.  But for the moment just opening it up to a content
and structure review.

Please focus on the text changes.  It passes "check-docs" but I still need
to work on layout and stuff in html (markup, some more links).

Thanks!

David J.

p.s. v1 exists here (it was basically just the idea of using variable names in
the function signature and minimizing direct syntax in the table):

https://www.postgresql.org/message-id/CAKFQuwbYBvUZasGj_ZnfXhC2kk4AT%3DepwGkNd2%3DRMMVXkfTNMQ%40mail.gmail.com


v2-0001-doc-json-query_9-16-3.patch
Description: Binary data


Re: Add pg_get_acl() function get the ACL for a database object

2024-06-25 Thread Michael Paquier
On Tue, Jun 25, 2024 at 08:06:41AM +0200, Joel Jacobson wrote:
> Not sure if I see how to implement it for pg_get_acl() though.
> 
> I've had a look at how pg_describe_object() works for this case:
> 
> SELECT pg_describe_object(0,'t'::regclass::oid,0);
> ERROR:  unsupported object class: 0
> 
> I suppose this is the error message we want in pg_get_acl() when
> the class ID is invalid?

Ah, and here I thought that this was also returning NULL.  My previous
work in this area only focused on the object OIDs, not their classes.
At the end, I'm OK to keep your patch as it is, checking only for the
case of pinned dependencies in pg_depend as we do for
pg_describe_object().

It's still a bit confusing, but we've been living with that for years
now without anybody complaining except me, so perhaps it's fine in the
end to keep it as it is, since this is still useful.  If we change that,
applying the same rules across the board would make the most sense.
--
Michael




Re: New standby_slot_names GUC in PG 17

2024-06-25 Thread Bertrand Drouvot
Hi,

On Tue, Jun 25, 2024 at 10:24:41AM +0530, Amit Kapila wrote:
> On Tue, Jun 25, 2024 at 8:20 AM Masahiko Sawada  wrote:
> >
> > On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > I agree the current name seems too generic and the suggested ' 
> > > synchronized_standby_slots '
> > > is better than the current one.
> > >
> > > Some other ideas could be:
> > >
> > > synchronize_slots_on_standbys: it indicates that the standbys that enabled
> > > slot sync should be listed in this GUC.
> > >
> > > logical_replication_wait_slots: it means the logical replication(logical
> > > Walsender process) will wait for these slots to advance the confirm flush
> > > lsn before proceeding.
> >
> > I feel that the name that has some connection to "logical replication"
> > also sounds good. Let me add some ideas:
> >
> > - logical_replication_synchronous_standby_slots (might be too long)
> > - logical_replication_synchronous_slots
> >
> 
> I see your point about keeping logical_replication in the name but
> that could also lead one to think that this list can contain logical
> slots.

Agree, and we may add the same functionality for physical replication slots
in the future too (it has been discussed in the thread [1]). So I don't think
"logical" should be part of the name.

> OTOH, there is some value in keeping '_standby_' in the name as
> that is more closely associated with physical standby's and this list
> contains physical slots corresponding to physical standby's. So, my
> preference is in order as follows: synchronized_standby_slots,
> wait_for_standby_slots, logical_replication_wait_slots,
> logical_replication_synchronous_slots, and
> logical_replication_synchronous_standby_slots.
> 

I like the idea of having "synchronize[d]" in the name as it makes one think of
the feature it is linked to [2]. The slots mentioned in this parameter are
linked to the "primary_slot_name" parameter on the standby, so what about:

synchronized_primary_slot_names

It makes clear that it is somehow linked to "primary_slot_name" and that we want
them to be in sync.
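
As a rough illustration of that linkage (the slot name below is purely
illustrative, and the current GUC name is used since the rename is still under
discussion):

-- on the primary: the GUC this thread is about renaming
ALTER SYSTEM SET standby_slot_names = 'physical_slot_a';

-- on the physical standby: the existing parameter it is linked to
ALTER SYSTEM SET primary_slot_name = 'physical_slot_a';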

So I'd vote for (in that order);

synchronized_primary_slot_names, synchronized_standby_slots

[1]: 
https://www.postgresql.org/message-id/bb437218-73bc-34c3-b8fb-8c1be4ddaec9%40enterprisedb.com
[2]: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=93db6cbda037f1be9544932bd9a785dabf3ff712

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Partial aggregates pushdown

2024-06-25 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Jelte, hackers.

Thank you for the explanations.

Actually, I have other tasks regarding the "PARTIAL_AGGREGATE" keyword,
to respond to Requirement 1 and Requirement 2 in the following mail:
https://www.postgresql.org/message-id/TYAPR01MB3088755F2281D41F5EEF06D495F92%40TYAPR01MB3088.jpnprd01.prod.outlook.com

After those tasks, I plan to compare your proposal with mine seriously, with
additional POC patches if necessary.

I think your proposal might be a more fundamental solution. However, to be
honest, I don't yet fully understand the benefits and impacts of stopping the
use of internal types in favor of native types, especially regarding the
handling of memory contexts in the existing deserialization functions and
the amount of code to be modified or added.
The following are my answers with the knowledge I have right now.

> From: Jelte Fennema-Nio 
> Sent: Tuesday, June 25, 2024 5:49 AM
> > However, I still have the following two questions.
> >
> > 1. Not necessary components of new native types Refer to pg_type.dat,
> > typinput and typoutput are required.
> > I think that in your proposal they are not necessary, so waste. I
> > think that it is not acceptable.
> > How can I resolve the problem?
> 
> I think requiring typinput/typoutput is a benefit personally, because that 
> makes it possible to do PARTIAL_AGGREGATE
> pushdown to a different PG major version. Also it makes it easier to debug 
> the partial aggregate values when using
> psql/pgregress. So yes, it requires implementing both binary (send/receive) 
> and text (input/output) serialization, but it also
> has some benefits. And in theory you might be able to skip implementing the 
> binary serialization, and rely purely on the text
> serialization to send partial aggregates between servers.
I see. It seems that adding new native types might make it easier to transmit
the state values between a local and a remote server that run different major
versions.
However, in my opinion, we should be careful about supporting the case in which
local and remote have different major versions,
because the transtype of an aggregate function may change in a future major
version due to
something related to the implementation.
Actually, something like that happened recently, see
https://github.com/postgres/postgres/commit/519fc1bd9e9d7b408903e44f55f83f6db30742b7
I think the transtype of an aggregate function is considerably more changeable
than its return type.
Consequently, so far, I want to support only the cases in which local and remote
have the same major version.
If we try to remove that limitation, it seems to need more additional code.

Also, I'm afraid that adding typinput/typoutput burdens the developers.
They would also have to create a new native type in addition to creating their
new aggregate functions.
I wonder if this concern might outweigh the benefits for debugging.
And, if send/receive is skipped, they have to rely only on the text
representation for the transmission of the state value. I think that is limiting.

> > 2. Many new native types
> > I think that, basically, each aggregate function does need a new native 
> > type.
> > For example,
> > avg(int8), avg(numeric), and var_pop(int4) has the same transtype, 
> > PolyNumAggState.
> > You said that it is enough to add only one native type like
> > state_poly_num_agg for supporting them, right?
> 
> Yes, correct. That's what I had in mind.
> 
> > But the combine functions of them might have quite different
> > expectation on the data items of PolyNumAggState like the range of
> > N(means count) and the true/false of calcSumX2 (the flag of
> > calculating sum of squares).
> > The final functions of them have the similar expectation.
> > So, I think that, responded to your proposal, each of them need a
> > native data type like state_int8_avg, state_numeric_avg, for safety.
> >
> > And, we do need a native type for an aggregate function whose
> > transtype is not internal and not pseudo.
> > For avg(int4), the current transtype is _int8.
> > However, I do need a validation check on the number of the array And
> > the positiveness of count(the first element of the array).
> > Responded to your proposal,
> > I do need a new native type like state_int4_avg.
> 
> To help avoid confusion let me try to restate what I think you mean
> here: You're worried about someone passing in a bogus native type into the 
> final/combine functions and then getting
> crashes and/or wrong results. With internal type people cannot do this 
> because they cannot manually call the
> combinefunc/finalfunc because the argument type is internal. To solve this 
> problem your suggestion is to make the type
> specific to the specific aggregate such that send/receive or input/output can 
> validate the input as reasonable. But this
> would then mean that we have many native types (and also many 
> deserialize/serialize functions).
Yes, right.
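
As a concrete reference for the avg(int4) case mentioned above (the thread notes
that its transtype is currently _int8, i.e. an int8 array whose first element is
the count), the transition type can be checked directly in the catalogs; this
query is only an illustration of that point:

SELECT aggtranstype::regtype
FROM pg_aggregate
WHERE aggfnoid = 'pg_catalog.avg(integer)'::regprocedure;
-- bigint[]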

> Assuming that's indeed what you meant, that's an interesting thought, I 
> didn't consider that much indeed. My thinking was
> that we 

Re: Pgoutput not capturing the generated columns

2024-06-25 Thread Peter Smith
Here are some review comments for the patch v10-0002.

==
Commit Message

1.
Note that we don't copy columns when the subscriber-side column is also
generated. Those will be filled as normal with the subscriber-side computed or
default data.

~

Now this patch also introduces some errors etc., so I think the patch
comment should be written differently to explicitly spell out the
behaviour of every combination, something like the below:

Summary

when (include_generated_column = true)

* publisher not-generated column => subscriber not-generated column:
This is just normal logical replication (not changed by this patch).

* publisher not-generated column => subscriber generated column: This
will give ERROR.

* publisher generated column => subscriber not-generated column: The
publisher generated column value is copied.

* publisher generated column => subscriber generated column: The
publisher generated column value is not copied. The subscriber
generated column will be filled with the subscriber-side computed or
default data.

when (include_generated_columns = false)

* publisher not-generated column => subscriber not-generated column:
This is just normal logical replication (not changed by this patch).

* publisher not-generated column => subscriber generated column: This
will give ERROR.

* publisher generated column => subscriber not-generated column: This
will replicate nothing. The publisher generated column is not replicated.
The subscriber column will be filled with the subscriber-side default
data.

* publisher generated column => subscriber generated column: This
will replicate nothing. The publisher generated column is not replicated.
The subscriber generated column will be filled with the
subscriber-side computed or default data.

==
src/backend/replication/logical/relation.c

2.
logicalrep_rel_open:

I tested some of the "missing column" logic, and got the following results:

Scenario A:
PUB
test_pub=# create table t2(a int, b int);
test_pub=# create publication pub2 for table t2;
SUB
test_sub=# create table t2(a int, b int generated always as (a*2) stored);
test_sub=# create subscription sub2 connection 'dbname=test_pub'
publication pub2 with (include_generated_columns = false);
Result:
ERROR:  logical replication target relation "public.t2" is missing
replicated column: "b"

~

Scenario B:
PUB/SUB identical to above, but subscription sub2 created "with
(include_generated_columns = true);"
Result:
ERROR:  logical replication target relation "public.t2" has a
generated column "b" but corresponding column on source relation is
not a generated column

~~~

2a. Question

Why should we get 2 different error messages for what is essentially
the same problem according to whether the 'include_generated_columns'
is false or true? Isn't the 2nd error message the more correct and
useful one for scenarios like this involving generated columns?

Thoughts?

~

2b. Missing tests?

I also noticed there seems no TAP test for the current "missing
replicated column" message. IMO there should be a new test introduced
for this because the loop involved too much bms logic to go
untested...

==
src/backend/replication/logical/tablesync.c

make_copy_attnamelist:
NITPICK - minor comment tweak
NITPICK - add some spaces after "if" code

3.
Should you pfree the gencollist at the bottom of this function when
you no longer need it, for tidiness?

~~~

4.
 static void
-fetch_remote_table_info(char *nspname, char *relname,
+fetch_remote_table_info(char *nspname, char *relname, bool **remotegenlist,
  LogicalRepRelation *lrel, List **qual)
 {
  WalRcvExecResult *res;
  StringInfoData cmd;
  TupleTableSlot *slot;
  Oid tableRow[] = {OIDOID, CHAROID, CHAROID};
- Oid attrRow[] = {INT2OID, TEXTOID, OIDOID, BOOLOID};
+ Oid attrRow[] = {INT2OID, TEXTOID, OIDOID, BOOLOID, BOOLOID};
  Oid qualRow[] = {TEXTOID};
  bool isnull;
+ bool*remotegenlist_res;

IMO the names 'remotegenlist' and 'remotegenlist_res' should be
swapped the other way around, because it is the function parameter
that is the "result", whereas the 'remotegenlist_res' is just the
local working var for it.

~~~

5. fetch_remote_table_info

Now that walrcv_server_version(LogRepWorkerWalRcvConn) is used in multiple
places, I think it would be better to assign this to a 'server_version'
variable to be used everywhere, instead of having multiple function
calls.

~~~

6.
  "SELECT a.attnum,"
  "   a.attname,"
  "   a.atttypid,"
- "   a.attnum = ANY(i.indkey)"
+ "   a.attnum = ANY(i.indkey),"
+ " a.attgenerated != ''"
  "  FROM pg_catalog.pg_attribute a"
  "  LEFT JOIN pg_catalog.pg_index i"
  "   ON (i.indexrelid = pg_get_replica_identity_index(%u))"
  " WHERE a.attnum > 0::pg_catalog.int2"
- "   AND NOT a.attisdropped %s"
+ "   AND NOT a.attisdropped", lrel->remoteid);
+
+ if ((walrcv_server_version(LogRepWorkerWalRcvConn) >= 12 &&
+ walrcv_server_version(LogRepWorkerWalRcvConn) <= 16) ||
+ !MySubscription->includegencols)
+ appendStringInfo(, " 

Re: speed up a logical replica setup

2024-06-25 Thread Amit Kapila
On Tue, Jun 25, 2024 at 3:38 AM Noah Misch  wrote:
>
> On Mon, Jun 24, 2024 at 05:20:21PM +0530, Amit Kapila wrote:
> > On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
> > >
> > > > +static void
> > > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > > > +{
> > >
> > > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > > > +   ipubname_esc);
> > >
> > > This tool's documentation says it "guarantees that no transaction will be
> > > lost."  I tried to determine whether achieving that will require something
> > > like the fix from
> > > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> > > (Not exactly the fix from that thread, since that thread has not 
> > > discussed the
> > > FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> > > hand, pg_createsubscriber benefits from creating a logical slot after 
> > > creating
> > > the publication.  That snapbuild.c process will wait for running XIDs.  
> > > On the
> > > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and 
> > > builds
> > > its relcache entry before assigning an XID, so perhaps the snapbuild.c 
> > > process
>
> Correction: it doesn't matter how the original INSERT/UPDATE/DELETE builds its
> relcache entry, just how pgoutput of the change builds the relcache entry from
> the historic snapshot.
>
> > > isn't enough to prevent that thread's race condition.  What do you think?
> >
> > I am not able to imagine how the race condition discussed in the
> > thread you quoted can impact this patch. The problem discussed is
> > mainly the interaction when we are processing the changes in logical
> > decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
> > problem happens because we use the old cache state.
>
> Right.  Taking the example from
> http://postgr.es/m/20231119021830.d6t6aaxtrkpn7...@awork3.anarazel.de, LSNs
> between what that mail calls 4) and 5) are not safely usable as start points.
> pg_createsubscriber evades that thread's problem if the consistent_lsn it
> passes to pg_replication_origin_advance() can't be in a bad-start-point LSN
> span.  I cautiously bet the snapbuild.c process achieves that:
>
> > I am missing your
> > point about the race condition mentioned in the thread you quoted with
> > snapbuild.c. Can you please elaborate a bit more?
>
> When pg_createsubscriber calls pg_create_logical_replication_slot(), the key
> part starts at:
>
> /*
>  * If caller needs us to determine the decoding start point, do so 
> now.
>  * This might take a while.
>  */
> if (find_startpoint)
> DecodingContextFindStartpoint(ctx);
>
> Two factors protect pg_createsubscriber.  First, (a) CREATE PUBLICATION
> committed before pg_create_logical_replication_slot() started.  Second, (b)
> DecodingContextFindStartpoint() waits for running XIDs to complete, via the
> process described at the snapbuild.c "starting up in several stages" diagram.
> Hence, the consistent_lsn is not in a bad-start-point LSN span.  It's fine
> even if the original INSERT populated all caches before CREATE PUBLICATION
> started and managed to assign an XID only after consistent_lsn.  From the
> pgoutput perspective, that's indistinguishable from the transaction starting
> at its first WAL record, after consistent_lsn.  The linked "long-standing data
> loss bug in initial sync of logical replication" thread doesn't have (a),
> hence its bug.  How close is that to accurate?
>

Yeah, this theory sounds right to me. The key point is that no DML
(processing of WAL corresponding to DML) before CREATE PUBLICATION ...
command would have reached pgoutput level because we would have waited
for it during snapbuild.c. Can we conclude that the race condition
discussed in the other thread won't impact this patch?

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-25 Thread Joel Jacobson
On Tue, Jun 25, 2024, at 07:11, Michael Paquier wrote:
> On Tue, Jun 25, 2024 at 12:20:20AM +0200, Joel Jacobson wrote:
>> Thanks, much better. New version attached.
>
> +  The PostgreSQL documentation, and code, 
> refers
> +  to the specifications within the ACL as "privileges".  This has the 
> same
> +  meaning as "permissions" on the linked page.  Generally if we say 
>
> Hmm?  A privilege is a property that is part of an ACL, which is
> itself a set made of object types, roles and privileges.  This entire
> paragraph is unnecessary IMO, let's keep it simple with only a
> reference link to the wiki page.
>
> v1 is fine without the "privileges list" part mentioned by Nathan in
> the first reply.

v2 is exactly that, but renamed and attached again, so we have an entry
recording that this was the last version.

/Joel

v5-0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


Re: Add pg_get_acl() function get the ACL for a database object

2024-06-25 Thread Joel Jacobson
On Tue, Jun 25, 2024, at 03:57, Michael Paquier wrote:
> On Tue, Jun 25, 2024 at 01:21:14AM +0200, Joel Jacobson wrote:
>> Good idea, I've started a separate thread for this:
>> 
>> https://postgr.es/m/9253b872-dbb1-42a6-a79e-b1e96effc857%40app.fastmail.com
>> 
>> This patch now assumes ACL will be supported.
>
> Thanks for doing that!  That helps in making reviews easier to follow
> for all, attracting the correct audience when necessary.
>
>>> +SELECT
>>> +(pg_identify_object(s.classid,s.objid,s.objsubid)).*,
>>> +pg_catalog.pg_get_acl(s.classid,s.objid)
>>> +FROM pg_catalog.pg_shdepend AS s
>>> +JOIN pg_catalog.pg_database AS d ON d.datname = current_database() AND 
>>> d.oid = s.dbid
>>> +JOIN pg_catalog.pg_authid AS a ON a.oid = s.refobjid AND s.refclassid 
>>> = 'pg_authid'::regclass
>>> +WHERE s.deptype = 'a';
>>>
>>> Could be a bit prettier.  That's a good addition.
>> 
>> How could we make it prettier?
>
> Perhaps split the two JOIN conditions into two lines each, with a bit
> more indentation to make it render better?  Usually I handle that on a
> case-by-case basis while preparing a patch for commit.  I'm OK to edit
> that myself with some final touches, FWIW.  Depending on the input
> this shows, I'd also look at some LATERAL business, that can be
> cleaner in some cases for the docs.

Thanks, some indentation certainly helped.
Not sure where LATERAL would help, so leaving that part to you.

>> SELECT pg_get_acl(0, 'atest2'::regclass::oid);
>> ERROR:  unrecognized class ID: 0
>> 
>> I believe we want an error here, since: an invalid class ID,
>> like 0, or any other invalid OID, should raise an error,
>> since classes can't be dropped, so we should never
>> expect an invalid OID for a class ID.
>> Please correct me if this reasoning is incorrect.
>
> This is an internal error, so it should never be visible to the end
> user via SQL because it is an unexpected state.  See for example
> 2a10fdc4307a, which is similar to what you are doing here.

Thanks for pointing me to that commit, good to learn about missing_ok.

Not sure if I see how to implement it for pg_get_acl() though.

I've had a look at how pg_describe_object() works for this case:

SELECT pg_describe_object(0,'t'::regclass::oid,0);
ERROR:  unsupported object class: 0

I suppose this is the error message we want in pg_get_acl() when
the class ID is invalid?

If no, the rest of this email can be skipped.

If yes, then I suppose we should try to see if there is any existing code
in objectaddress.c that we could reuse, that can throw this error message
for us, for an invalid class OID.

There are three places in objectaddress.c currently capable of
throwing a "unsupported object class" error message:

char *
getObjectDescription(const ObjectAddress *object, bool missing_ok)

char *
getObjectTypeDescription(const ObjectAddress *object, bool missing_ok)

char *
getObjectIdentityParts(const ObjectAddress *object,
   List **objname, List **objargs,
   bool missing_ok)

All three of them contain a `switch (object->classId)` statement,
where the default branch contains the code that throws the error:

default:
elog(ERROR, "unsupported object class: %u", 
object->classId);

It would be nice to avoid having to copy the long switch statement, with no-ops
for each branch except the default branch, just to throw an error in case of an
invalid class OID. But I don't see how we can use any of these three functions
in pg_get_acl(), since they do more things than just checking whether the class
OID is valid.

So I'm not sure what to do here.

Maybe we want a separate new bool helper function to check if a class OID is 
valid or not?

That helper function would not be useful for the existing three cases where 
this error is thrown
in objectaddress.c, since they need actual switch branches for each class ID, 
whereas in pg_get_acl()
we just need to check if it's valid or not.

I haven't checked, but maybe there are other places in the sources where we 
just want to check
if a class OID is valid or not, that could benefit from such a helper function.

Or perhaps one exists already?

/Joel




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Michael Paquier
On Tue, Jun 25, 2024 at 12:20:20AM +0200, Joel Jacobson wrote:
> Thanks, much better. New version attached.

+  The PostgreSQL documentation, and code, refers
+  to the specifications within the ACL as "privileges".  This has the same
+  meaning as "permissions" on the linked page.  Generally if we say 

Hmm?  A privilege is a property that is part of an ACL, which is
itself a set made of object types, roles and privileges.  This entire
paragraph is unnecessary IMO, let's keep it simple with only a
reference link to the wiki page.

v1 is fine without the "privileges list" part mentioned by Nathan in
the first reply.
--
Michael




Re: long-standing data loss bug in initial sync of logical replication

2024-06-24 Thread Amit Kapila
On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
 wrote:
>
> On 6/24/24 12:54, Amit Kapila wrote:
> > ...
> >>
>  I'm not sure there are any cases where using SRE instead of AE would 
>  cause
>  problems for logical decoding, but it seems very hard to prove. I'd be 
>  very
>  surprised if just using SRE would not lead to corrupted cache contents 
>  in some
>  situations. The cases where a lower lock level is ok are ones where we 
>  just
>  don't care that the cache is coherent in that moment.
> >>
> >>> Are you saying it might break cases that are not corrupted now? How
> >>> could obtaining a stronger lock have such effect?
> >>
> >> No, I mean that I don't know if using SRE instead of AE would have negative
> >> consequences for logical decoding. I.e. whether, from a logical decoding 
> >> POV,
> >> it'd suffice to increase the lock level to just SRE instead of AE.
> >>
> >> Since I don't see how it'd be correct otherwise, it's kind of a moot 
> >> question.
> >>
> >
> > We lost track of this thread and the bug is still open. IIUC, the
> > conclusion is to use SRE in OpenTableList() to fix the reported issue.
> > Andres, Tomas, please let me know if my understanding is wrong,
> > otherwise, let's proceed and fix this issue.
> >
>
> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
> don't think we 'lost track' of it, but it's true we haven't done much
> progress recently.
>

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

-- 
With Regards,
Amit Kapila.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-06-24 Thread Alexander Lakhin

24.06.2024 01:59, Jelte Fennema-Nio wrote:

On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin  wrote:

@@ -2775,6 +2775,7 @@
   SET LOCAL statement_timeout = '10ms';
   select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- 
this takes very long
   ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
   COMMIT;

As you describe it, this problem occurs when the cancel request is
processed by the foreign server, before the query is actually
received. And postgres then (rightly) ignores the cancel request. I'm
not sure if the existing test is easily changeable to fix this. The
only thing that I can imagine works in practice is increasing the
statement_timeout, e.g. to 100ms.
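
For concreteness, the adjustment suggested above would amount to something like
the following in the postgres_fdw test (the exact value is only illustrative):

SET LOCAL statement_timeout = '100ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long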


I'd just like to add that that one original query results in several "remote"
queries (see the attached excerpt from postmaster.log with verbose logging
enabled).

Best regards,
Alexander

2024-06-24 07:25:33.539 UTC [47957:426] pg_regress/postgres_fdw LOG:  
statement: SET LOCAL statement_timeout = '10ms';
2024-06-24 07:25:33.540 UTC [47957:427] pg_regress/postgres_fdw LOG:  
statement: select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN 
ft5;
2024-06-24 07:25:33.541 UTC [47957:428] pg_regress/postgres_fdw DEBUG:  
starting remote transaction on connection 0x519f1480
2024-06-24 07:25:33.542 UTC [48390:203] pg_regress LOG:  statement: START 
TRANSACTION ISOLATION LEVEL REPEATABLE READ
2024-06-24 07:25:33.542 UTC [48390:204] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM "S 1"."T 1"
2024-06-24 07:25:33.543 UTC [48390:205] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE))
2024-06-24 07:25:33.544 UTC [48390:206] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM ("S 1"."T 1" r2 INNER JOIN "S 1"."T 3" r4 ON (TRUE))
2024-06-24 07:25:33.545 UTC [48390:207] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM ("S 1"."T 1" r2 INNER JOIN "S 1"."T 4" r6 ON (TRUE))
2024-06-24 07:25:33.546 UTC [48390:208] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM (("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) INNER 
JOIN "S 1"."T 3" r4 ON (TRUE))
2024-06-24 07:25:33.548 UTC [48390:209] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM (("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) INNER 
JOIN "S 1"."T 4" r6 ON (TRUE))
2024-06-24 07:25:33.549 UTC [48390:210] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM (("S 1"."T 1" r2 INNER JOIN "S 1"."T 3" r4 ON (TRUE)) INNER 
JOIN "S 1"."T 4" r6 ON (TRUE))
2024-06-24 07:25:33.550 UTC [48390:211] pg_regress LOG:  statement: EXPLAIN 
SELECT NULL FROM ((("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) INNER 
JOIN "S 1"."T 3" r4 ON (TRUE)) INNER JOIN "S 1"."T 4" r6 ON (TRUE))
2024-06-24 07:25:33.551 UTC [48390:212] pg_regress LOG:  statement: EXPLAIN 
SELECT count(*) FROM ((("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) 
INNER JOIN "S 1"."T 3" r4 ON (TRUE)) INNER JOIN "S 1"."T 4" r6 ON (TRUE))
2024-06-24 07:25:33.553 UTC [48390:213] pg_regress DEBUG:  parse : 
DECLARE c1 CURSOR FOR
SELECT count(*) FROM ((("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON 
(TRUE)) INNER JOIN "S 1"."T 3" r4 ON (TRUE)) INNER JOIN "S 1"."T 4" r6 ON 
(TRUE))
2024-06-24 07:25:33.553 UTC [48390:214] pg_regress DEBUG:  bind  to 

2024-06-24 07:25:33.553 UTC [48390:215] pg_regress LOG:  execute : 
DECLARE c1 CURSOR FOR
SELECT count(*) FROM ((("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON 
(TRUE)) INNER JOIN "S 1"."T 3" r4 ON (TRUE)) INNER JOIN "S 1"."T 4" r6 ON 
(TRUE))
2024-06-24 07:25:33.555 UTC [47957:429] pg_regress/postgres_fdw ERROR:  
canceling statement due to statement timeout
2024-06-24 07:25:33.555 UTC [47957:430] pg_regress/postgres_fdw DEBUG:  closing 
remote transaction on connection 0x519f1480
2024-06-24 07:25:33.561 UTC [48465:1] [unknown] LOG:  connection received: 
host=[local]
2024-06-24 07:25:33.561 UTC [48465:2] [unknown] DEBUG:  processing cancel 
request: sending SIGINT to process 48390
2024-06-24 07:25:33.561 UTC [48465:3] [unknown] DEBUG:  shmem_exit(0): 0 
before_shmem_exit callbacks to make
2024-06-24 07:25:33.561 UTC [48465:4] [unknown] DEBUG:  shmem_exit(0): 0 
on_shmem_exit callbacks to make
2024-06-24 07:25:33.561 UTC [48465:5] [unknown] DEBUG:  proc_exit(0): 1 
callbacks to make
2024-06-24 07:25:33.561 UTC [48465:6] [unknown] DEBUG:  exit(0)
2024-06-24 07:25:33.561 UTC [48465:7] [unknown] DEBUG:  shmem_exit(-1): 0 
before_shmem_exit callbacks to make
2024-06-24 07:25:33.561 UTC [48465:8] [unknown] DEBUG:  shmem_exit(-1): 0 
on_shmem_exit callbacks to make
2024-06-24 07:25:33.561 UTC [48465:9] [unknown] DEBUG:  proc_exit(-1): 0 
callbacks to make
2024-06-24 07:25:33.635 UTC [48390:216] pg_regress LOG:  statement: FETCH 100 
FROM c1
...
2024-06-24 07:26:09.041 UTC [47957:431] pg_regress/postgres_fdw WARNING:  could 
not get result of cancel request due to timeout
2024-06-24 07:26:09.041 UTC [47957:432] 

Re: New standby_slot_names GUC in PG 17

2024-06-24 Thread Amit Kapila
On Tue, Jun 25, 2024 at 8:20 AM Masahiko Sawada  wrote:
>
> On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > I agree the current name seems too generic and the suggested ' 
> > synchronized_standby_slots '
> > is better than the current one.
> >
> > Some other ideas could be:
> >
> > synchronize_slots_on_standbys: it indicates that the standbys that enabled
> > slot sync should be listed in this GUC.
> >
> > logical_replication_wait_slots: it means the logical replication(logical
> > Walsender process) will wait for these slots to advance the confirm flush
> > lsn before proceeding.
>
> I feel that the name that has some connection to "logical replication"
> also sounds good. Let me add some ideas:
>
> - logical_replication_synchronous_standby_slots (might be too long)
> - logical_replication_synchronous_slots
>

I see your point about keeping logical_replication in the name but
that could also lead one to think that this list can contain logical
slots. OTOH, there is some value in keeping '_standby_' in the name as
that is more closely associated with physical standbys, and this list
contains physical slots corresponding to physical standbys. So, my
preference is in order as follows: synchronized_standby_slots,
wait_for_standby_slots, logical_replication_wait_slots,
logical_replication_synchronous_slots, and
logical_replication_synchronous_standby_slots.

-- 
With Regards,
Amit Kapila.




Re: sql/json miscellaneous issue

2024-06-24 Thread jian he
On Tue, Jun 25, 2024 at 11:23 AM Amit Langote  wrote:
>
> On Tue, Jun 25, 2024 at 12:18 PM jian he  wrote:
> > On Mon, Jun 24, 2024 at 7:46 PM Amit Langote  
> > wrote:
> > > On Mon, Jun 24, 2024 at 7:04 PM jian he  
> > > wrote:
> > > >
> > > > hi.
> > > > the following two queries should return the same result?
> > > >
> > > > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> > > > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
> > >
> > > I get this with HEAD:
> > >
> > > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> > >  json_query
> > > 
> > >  null
> > > (1 row)
> > >
> > > Time: 734.587 ms
> > > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
> > >  json_value
> > > 
> > >
> > > (1 row)
> > >
> > > Much like:
> > >
> > > SELECT JSON_QUERY('{"key": null}', '$.key');
> > >  json_query
> > > 
> > >  null
> > > (1 row)
> > >
> > > Time: 2.975 ms
> > > SELECT JSON_VALUE('{"key": null}', '$.key');
> > >  json_value
> > > 
> > >
> > > (1 row)
> > >
> > > Which makes sense to me, because JSON_QUERY() is supposed to return a
> > > JSON null in both cases and JSON_VALUE() is supposed to return a SQL
> > > NULL for a JSON null.
> > >
> > > --
> > > Thanks, Amit Langote
> >
> > hi amit, sorry to bother you again.
>
> No worries.
>
> > My thoughts  for the above cases are:
> > * json_value, json_query main description is the same:
> > {{Returns the result of applying the SQL/JSON path_expression to the
> > context_item using the PASSING values.}}
> > same context_item, same path_expression, for the above cases, the
> > result should be the same?
> >
> > * in json_value, description
> > {{The extracted value must be a single SQL/JSON scalar item; an error
> > is thrown if that's not the case. If you expect that extracted value
> > might be an object or an array, use the json_query function instead.}}
> > query: `SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);`
> > the returned jsonb 'null' (applying the path expression) is a single
> > SQL/JSON scalar item.
> > json_value return jsonb null should be fine
> >
> >
> > However, other database implementations return SQL null,
> > so I guess returning SQL null is fine)
> > (based on the doc explanation, returning jsonb null more make sense, imho)
>
> If the difference in behavior is not clear from the docs, I guess that
> means that we need to improve the docs.  Would you like to give a shot
> at writing the patch?
>

Other databases do mention how json_value deals with JSON null, e.g.
[0] MySQL description:
When the data at the specified path consists of or resolves to a JSON
null literal, the function returns SQL NULL.
[1] Oracle description:
SQL/JSON function json_value applied to JSON value null returns SQL
NULL, not the SQL string 'null'. This means, in particular, that you
cannot use json_value to distinguish the JSON value null from the
absence of a value; SQL NULL indicates both cases.


Imitating the above, I came up with the following, changing:
"The extracted value must be a single SQL/JSON scalar item; an error
is thrown if that's not the case. ..."
to
"The extracted value must be a single SQL/JSON scalar item; an error
is thrown if that's not the case.
If the extracted value is a JSON null, an SQL NULL value will be returned.
This means that you cannot use json_value to distinguish the JSON
value null from the case where evaluating path_expression yields no value
at all; SQL NULL indicates both cases. To distinguish these two cases, use
json_query instead.
"
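
To make the difference described by this wording concrete, here is a small
illustration (the first and third results match the outputs shown earlier in
this thread; the "no value" cases follow from the default ON EMPTY behavior):

SELECT JSON_VALUE('{"a": null}', '$.a');  -- SQL NULL (JSON null)
SELECT JSON_VALUE('{"a": null}', '$.b');  -- SQL NULL (no value at all)
SELECT JSON_QUERY('{"a": null}', '$.a');  -- the JSON value null
SELECT JSON_QUERY('{"a": null}', '$.b');  -- SQL NULL (no value at all)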


I also changed from
ON EMPTY is not specified is to return a null value.
ON ERROR is not specified is to return a null value.
to
The default when ON EMPTY is not specified is to return an SQL NULL value.
The default when ON ERROR is not specified is to return an SQL NULL value.

[0] 
https://dev.mysql.com/doc/refman/8.4/en/json-search-functions.html#function_json-value
[1]https://docs.oracle.com/en/database/oracle/oracle-database/19/adjsn/function-JSON_VALUE.html#GUID-622170D8-7BAD-4F5F-86BF-C328451FC3BE
From dbc50acce12efcd25e7c55c51609e125e1545439 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 25 Jun 2024 12:35:40 +0800
Subject: [PATCH v1 1/1] document how json_value dealing with jsonb 'null'

---
 doc/src/sgml/func.sgml | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 26092696..f6ec96ae 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18772,7 +18772,7 @@ ERROR:  jsonpath array subscript is out of bounds
 The ON EMPTY clause specifies the behavior if
 evaluating path_expression yields no value
 at all. The default when ON EMPTY is not specified
-is to return a null value.
+is to return an SQL NULL value.


 The ON ERROR clause specifies the
@@ -18781,7 +18781,7 @@ ERROR:  jsonpath array subscript is out of bounds
 coerce the result 

Re: [PATCH] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Stan Hu
Thanks, Tom and Michael. I've submitted a bug report via Apple's
Feedback Assistant. It's filed under FB14047412.

If anyone happens to know the right person at Apple to look at this,
please direct them there.

On Mon, Jun 24, 2024 at 7:15 PM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > Ugh.  Which means that you are testing macOS Sequoia still in beta
> > phase?  Thanks for the report.
>
> > Perhaps we should wait for the actual release before seeing if this is
> > still an issue and see if this is still a problem?  Tom is a heavy
> > macOS user, I'm still under 14 myself for some time.
>
> Yeah, I'm not in a huge hurry to act on this.  The problem may
> go away by the time SDK 15 gets out of beta --- in fact, I think
> it'd be a good idea to file a bug with Apple complaining that this
> pointless-looking change breaks third-party code.  If it doesn't
> go away, we're going to have to back-patch all supported branches
> (and, really, even out-of-support ones back to 9.2); which puts a
> large premium on getting the patch right.  So we have both time to
> think about it and good reason to be careful.
>
> (I've not yet read any of Stan's proposed patches.)
>
> regards, tom lane




Re: Patch bug: Fix jsonpath .* on Arrays

2024-06-24 Thread Степан Неретин



  
>Tuesday, June 25, 2024, 11:17 +07:00 from David E. Wheeler:
>
>On Jun 7, 2024, at 10:23, David E. Wheeler <da...@justatheory.com> wrote:
>
>> Rebased and moved the new tests to the end of the file.
>Bah, sorry, that was the previous patch. Here’s v3.
>
>D

Hi! Looks good to me, but I have several comments.
Your patch improves tests, but why did you change formatting in 
jsonpath_exec.c? What's the motivation?
 
[1] select jsonb_path_query('[1,2,3,{"b": [3,4,5]}]', 'strict $.*');
I propose adding a similar test with explicitly specified lax mode: select
jsonb_path_query('[1,2,3,{"b": [3,4,5]}]', 'lax $.*'); to show that lax mode
is the default.
 
Odd empty result for the test: select jsonb '[1,2,3,{"b": [3,4,5]}]' @? 'strict 
$.*';
I expected an error like in test [1]. This behavior is not obvious to me.
 
Everything else is cool. Thanks to the patch and the discussion above, I began 
to understand better how wildcards in JSON work.
Best regards, Stepan Neretin.
 

Re: sql/json miscellaneous issue

2024-06-24 Thread Amit Langote
On Tue, Jun 25, 2024 at 12:18 PM jian he  wrote:
> On Mon, Jun 24, 2024 at 7:46 PM Amit Langote  wrote:
> > On Mon, Jun 24, 2024 at 7:04 PM jian he  wrote:
> > >
> > > hi.
> > > the following two queries should return the same result?
> > >
> > > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> > > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
> >
> > I get this with HEAD:
> >
> > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> >  json_query
> > 
> >  null
> > (1 row)
> >
> > Time: 734.587 ms
> > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
> >  json_value
> > 
> >
> > (1 row)
> >
> > Much like:
> >
> > SELECT JSON_QUERY('{"key": null}', '$.key');
> >  json_query
> > 
> >  null
> > (1 row)
> >
> > Time: 2.975 ms
> > SELECT JSON_VALUE('{"key": null}', '$.key');
> >  json_value
> > 
> >
> > (1 row)
> >
> > Which makes sense to me, because JSON_QUERY() is supposed to return a
> > JSON null in both cases and JSON_VALUE() is supposed to return a SQL
> > NULL for a JSON null.
> >
> > --
> > Thanks, Amit Langote
>
> hi amit, sorry to bother you again.

No worries.

> My thoughts  for the above cases are:
> * json_value, json_query main description is the same:
> {{Returns the result of applying the SQL/JSON path_expression to the
> context_item using the PASSING values.}}
> same context_item, same path_expression, for the above cases, the
> result should be the same?
>
> * in json_value, description
> {{The extracted value must be a single SQL/JSON scalar item; an error
> is thrown if that's not the case. If you expect that extracted value
> might be an object or an array, use the json_query function instead.}}
> query: `SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);`
> the returned jsonb 'null' (applying the path expression) is a single
> SQL/JSON scalar item.
> json_value return jsonb null should be fine
>
>
> However, other database implementations return SQL null,
> so I guess returning SQL null is fine)
> (based on the doc explanation, returning jsonb null more make sense, imho)

If the difference in behavior is not clear from the docs, I guess that
means that we need to improve the docs.  Would you like to give a shot
at writing the patch?

-- 
Thanks, Amit Langote




Re: sql/json miscellaneous issue

2024-06-24 Thread jian he
On Mon, Jun 24, 2024 at 7:46 PM Amit Langote  wrote:
>
> Hi,
>
> On Mon, Jun 24, 2024 at 7:04 PM jian he  wrote:
> >
> > hi.
> > the following two queries should return the same result?
> >
> > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
>
> I get this with HEAD:
>
> SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
>  json_query
> 
>  null
> (1 row)
>
> Time: 734.587 ms
> SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
>  json_value
> 
>
> (1 row)
>
> Much like:
>
> SELECT JSON_QUERY('{"key": null}', '$.key');
>  json_query
> 
>  null
> (1 row)
>
> Time: 2.975 ms
> SELECT JSON_VALUE('{"key": null}', '$.key');
>  json_value
> 
>
> (1 row)
>
> Which makes sense to me, because JSON_QUERY() is supposed to return a
> JSON null in both cases and JSON_VALUE() is supposed to return a SQL
> NULL for a JSON null.
>
> --
> Thanks, Amit Langote

Hi Amit, sorry to bother you again.

My thoughts for the above cases are:
* json_value, json_query main description is the same:
{{Returns the result of applying the SQL/JSON path_expression to the
context_item using the PASSING values.}}
Same context_item, same path_expression: for the above cases, shouldn't the
result be the same?

* in json_value, the description says:
{{The extracted value must be a single SQL/JSON scalar item; an error
is thrown if that's not the case. If you expect that extracted value
might be an object or an array, use the json_query function instead.}}
For the query `SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);`
the returned jsonb 'null' (after applying the path expression) is a single
SQL/JSON scalar item, so json_value returning jsonb null should be fine.


However, other database implementations return SQL NULL,
so I guess returning SQL NULL is fine.
(Based on the doc explanation, returning jsonb null would make more sense, imho.)




Re: Support "Right Semi Join" plan shapes

2024-06-24 Thread wenhui qiu
Hi Japin Li,
Thank you for your review; this way the notes are more accurate and
complete. Thanks also to the author for updating the patch. I also tested
the new patch, and it looks good to me.


Regards

Japin Li wrote on Tue, Jun 25, 2024 at 08:51:

> On Mon, 24 Jun 2024 at 17:59, Richard Guo  wrote:
> > Thank you for reviewing.
> >
> > On Mon, Jun 24, 2024 at 1:27 PM Li Japin  wrote:
> >> +   /*
> >> +* For now we do not support RIGHT_SEMI join in mergejoin or
> nestloop
> >> +* join.
> >> +*/
> >> +   if (jointype == JOIN_RIGHT_SEMI)
> >> +   return;
> >> +
> >>
> >> How about adding some reasons here?
> >
> > I've included a brief explanation in select_mergejoin_clauses.
> >
>
> Thank you for updating the patch.
>
> >> + * this is a right-semi join, or this is a right/right-anti/full join
> and
> >> + * there are nonmergejoinable join clauses.  The executor's mergejoin
> >>
> >> Maybe we can put the right-semi join together with the
> right/right-anti/full
> >> join.  Is there any other significance by putting it separately?
> >
> > I don't think so.  The logic is different: for right-semi join we will
> > always set *mergejoin_allowed to false, while for right/right-anti/full
> > join it is set to false only if there are nonmergejoinable join clauses.
> >
>
> Got it.  Thanks for the explanation.
>
> >> Maybe the following comments also should be updated. Right?
> >
> > Correct.  And there are a few more places where we need to mention
> > JOIN_RIGHT_SEMI, such as in reduce_outer_joins_pass2 and in the comment
> > for SpecialJoinInfo.
> >
> >
> > I noticed that this patch changes the plan of a query in join.sql from
> > a semi join to right semi join, compromising the original purpose of
> > this query, which was to test the fix for neqjoinsel's behavior for
> > semijoins (see commit 7ca25b7d).
> >
> > --
> > -- semijoin selectivity for <>
> > --
> > explain (costs off)
> > select * from int4_tbl i4, tenk1 a
> > where exists(select * from tenk1 b
> >  where a.twothousand = b.twothousand and a.fivethous <>
> b.fivethous)
> >   and i4.f1 = a.tenthous;
> >
> > So I've changed this test case a bit so that it is still testing what it
> > is supposed to test.
> >
> > In passing, I've also updated the commit message to clarify that this
> > patch does not address the support of "Right Semi Join" for merge joins.
> >
>
> Tested and looks good to me!
>
> --
> Regrads,
> Japin Li
>


Re: New standby_slot_names GUC in PG 17

2024-06-24 Thread Masahiko Sawada
On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Saturday, June 22, 2024 5:47 PM Amit Kapila  
> wrote:
> >
> > On Sat, Jun 22, 2024 at 1:49 AM Nathan Bossart
> >  wrote:
> > >
> > > On Fri, Jun 21, 2024 at 03:50:00PM -0400, Tom Lane wrote:
> > > > Allow specification of physical standbys that must be
> > > > synchronized before they are visible to subscribers (Hou Zhijie,
> > > > Shveta Malik)
> > > >
> > > > it seems like the name ought to have some connection to
> > > > synchronization.  Perhaps something like "synchronized_standby_slots"?
> > >
> > > IMHO that might be a bit too close to synchronous_standby_names.
> > >
> >
> > Right, but better than the current one. The other possibility could be
> > wait_for_standby_slots.
>
> I agree the current name seems too generic and the suggested ' 
> synchronized_standby_slots '
> is better than the current one.
>
> Some other ideas could be:
>
> synchronize_slots_on_standbys: it indicates that the standbys that enabled
> slot sync should be listed in this GUC.
>
> logical_replication_wait_slots: it means the logical replication(logical
> Walsender process) will wait for these slots to advance the confirm flush
> lsn before proceeding.

I feel that a name that has some connection to "logical replication"
also sounds good. Let me add some ideas:

- logical_replication_synchronous_standby_slots (might be too long)
- logical_replication_synchronous_slots

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Injection point locking

2024-06-24 Thread Michael Paquier
On Mon, Jun 24, 2024 at 11:03:09AM -0400, Tom Lane wrote:
> Given your point that the existing locking is just a fig leaf
> anyway, maybe we could simply not have any?  Then it's on the
> test author to be sure that the injection point won't be
> getting invoked when it's about to be removed.

That would work for me: put the responsibility on the test author and
rip out the LWLock.  I was wondering when somebody would come up
with a case where they'd want the postmaster itself to run an injection
point, but no concrete case has really come up, so that was my original
thinking.

Looking at all the points currently in the tree, nothing cares about
the concurrent locking when attaching or detaching a point, so perhaps
it is a good thing based on these experiences to just let this LWLock
go.  This should not impact the availability of the tests, either.

> Or just rip
> out the removal capability, and say that once installed an
> injection point is there until postmaster shutdown (or till
> shared memory reinitialization, anyway).

But not that.  Being able to remove points on the fly can be important
in some cases, for example where you'd still want to issue an ERROR
(partial write path is one case) in a SQL test, then remove it in a
follow-up SQL query to not trigger the same ERROR.
--
Michael




Re: Injection point locking

2024-06-24 Thread Noah Misch
On Mon, Jun 24, 2024 at 11:03:09AM -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > ... I can't do that, because InjectionPointRun() requires a PGPROC 
> > entry, because it uses an LWLock. That also makes it impossible to use 
> > injection points in the postmaster. Any chance we could allow injection 
> > points to be triggered without a PGPROC entry? Could we use a simple 
> > spinlock instead?

That sounds fine to me.  If calling hash_search() with a spinlock feels too
awful, a list to linear-search could work.

> > With a fast path for the case that no injection points 
> > are attached or something?
> 
> Even taking a spinlock in the postmaster is contrary to project
> policy.  Maybe we could look the other way for debug-only code,
> but it seems like a pretty horrible precedent.

If you're actually using an injection point in the postmaster, that would be
the least of concerns.  It is something of a concern for running an injection
point build while not attaching any injection point.  One solution could be a
GUC to control whether the postmaster participates in injection points.
Another could be to make the data structure readable with atomics only.

> Given your point that the existing locking is just a fig leaf
> anyway, maybe we could simply not have any?  Then it's on the
> test author to be sure that the injection point won't be
> getting invoked when it's about to be removed.

That's tricky with injection_points_set_local() plus an injection point at a
frequently-reached location.  It's one thing to control when the
injection_points_set_local() process reaches the injection point.  It's too
hard to control all the other processes that just reach the injection point
and conclude it's not for their PID.

> Or just rip
> out the removal capability, and say that once installed an
> injection point is there until postmaster shutdown (or till
> shared memory reinitialization, anyway).

That could work.  Tests do need a way to soft-disable, but it's okay with me
if nothing can reclaim the resources.




RE: New standby_slot_names GUC in PG 17

2024-06-24 Thread Zhijie Hou (Fujitsu)
On Saturday, June 22, 2024 5:47 PM Amit Kapila  wrote:
> 
> On Sat, Jun 22, 2024 at 1:49 AM Nathan Bossart
>  wrote:
> >
> > On Fri, Jun 21, 2024 at 03:50:00PM -0400, Tom Lane wrote:
> > > Allow specification of physical standbys that must be
> > > synchronized before they are visible to subscribers (Hou Zhijie,
> > > Shveta Malik)
> > >
> > > it seems like the name ought to have some connection to
> > > synchronization.  Perhaps something like "synchronized_standby_slots"?
> >
> > IMHO that might be a bit too close to synchronous_standby_names.
> >
> 
> Right, but better than the current one. The other possibility could be
> wait_for_standby_slots.

I agree the current name seems too generic and the suggested
'synchronized_standby_slots'
is better than the current one.

Some other ideas could be:

synchronize_slots_on_standbys: it indicates that the standbys that enabled
slot sync should be listed in this GUC.

logical_replication_wait_slots: it means that logical replication (the logical
walsender process) will wait for these slots to advance the confirmed flush
LSN before proceeding.

Best Regards,
Hou zj


Re: [PATCH] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Tom Lane
Michael Paquier  writes:
> Ugh.  Which means that you are testing macOS Sequoia still in beta
> phase?  Thanks for the report.

> Perhaps we should wait for the actual release before seeing if this is
> still an issue and see if this is still a problem?  Tom is a heavy
> macOS user, I'm still under 14 myself for some time.

Yeah, I'm not in a huge hurry to act on this.  The problem may
go away by the time SDK 15 gets out of beta --- in fact, I think
it'd be a good idea to file a bug with Apple complaining that this
pointless-looking change breaks third-party code.  If it doesn't
go away, we're going to have to back-patch all supported branches
(and, really, even out-of-support ones back to 9.2); which puts a
large premium on getting the patch right.  So we have both time to
think about it and good reason to be careful.

(I've not yet read any of Stan's proposed patches.)

regards, tom lane




Re: Injection point locking

2024-06-24 Thread Michael Paquier
On Mon, Jun 24, 2024 at 01:29:38PM +0300, Heikki Linnakangas wrote:
> InjectionPointRun() acquires InjectionPointLock, looks up the hash entry,
> and releases the lock:
> 
> > LWLockAcquire(InjectionPointLock, LW_SHARED);
> > entry_by_name = (InjectionPointEntry *)
> > hash_search(InjectionPointHash, name,
> > HASH_FIND, );
> > LWLockRelease(InjectionPointLock);
> 
> Later, it reads fields from the entry it looked up:
> 
> > /* not found in local cache, so load and register */
> > snprintf(path, MAXPGPATH, "%s/%s%s", pkglib_path,
> >  entry_by_name->library, DLSUFFIX);
> 
> Isn't that a straightforward race condition, if the injection point is
> detached in between?

This is a feature, not a bug :)

Jokes apart, this is a behavior that Noah was looking for: it makes it
possible to detach a point to emulate what a debugger would do with a
breakpoint, for some of his tests with concurrent DDL bugs, so not
taking a lock while running a point is important.  It's true, though,
that we could always delay the LWLock release until the local cache is
loaded, but would it really matter?
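
Delaying the release would roughly look like this (just a sketch to
illustrate the idea, not the actual injection_points code; the local
copies and the found flag are assumptions):

    InjectionPointEntry *entry_by_name;
    char        library_copy[MAXPGPATH];
    char        function_copy[MAXPGPATH];
    char        path[MAXPGPATH];
    bool        found;

    LWLockAcquire(InjectionPointLock, LW_SHARED);
    entry_by_name = (InjectionPointEntry *)
        hash_search(InjectionPointHash, name, HASH_FIND, &found);
    if (found)
    {
        /* copy what we need while the entry cannot be detached under us */
        strlcpy(library_copy, entry_by_name->library, sizeof(library_copy));
        strlcpy(function_copy, entry_by_name->function, sizeof(function_copy));
    }
    LWLockRelease(InjectionPointLock);

    if (!found)
        return;
    snprintf(path, MAXPGPATH, "%s/%s%s", pkglib_path, library_copy, DLSUFFIX);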
--
Michael




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Noah Misch
On Mon, Jun 24, 2024 at 09:49:53PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 9:30 PM Noah Misch  wrote:
> > On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > > Right now, in master, we do use a single horizon when determining what
> > > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > > freezing and full page visibility determinations. Using a different
> > > horizon for pruning by vacuum than freezing is what is causing the
> > > error on master.
> >
> > Agreed, and I think using different sources for pruning and freezing is a
> > recipe for future bugs.  Fundamentally, both are about answering "is
> > snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> > That's not to say this thread shall unify the two, but I suspect that's the
> > right long-term direction.
> 
> What does it really mean to unify the two, though?
> 
> If the OldestXmin field was located in struct GlobalVisState (next to
> definitely_needed and maybe_needed), but everything worked in
> essentially the same way as it will with Melanie's patch in place,
> would that count as unifying the two? Why or why not?

To me, no, unification would mean removing the data redundancy.  Relocating
the redundant field and/or code that updates the redundant field certainly
could reduce the risk of bugs, so I'm not opposing everything short of
removing the data redundancy.  I'm just agreeing with the "prefer" from
https://postgr.es/m/CA+TgmoYzS_bkt_MrNxr5QrXDKfedmh4tStn8UBTTBXqv=3j...@mail.gmail.com




Re: [PATCH] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Michael Paquier
On Mon, Jun 24, 2024 at 02:58:47PM -0700, Stan Hu wrote:
> Prior to macOS SDK 15, there were include guards in
> $SDK_ROOT/usr/include/xlocale/_regex.h:
> 
>   #ifndef _REGEX_H_
>   #include <_regex.h>
>   #endif // _REGEX_H_
>   #include <_xlocale.h>

Ugh.  Which means that you are testing macOS Sequoia still in beta
phase?  Thanks for the report.

Perhaps we should wait for the actual release before seeing if this is
still an issue and see if this is still a problem?  Tom is a heavy
macOS user, I'm still under 14 myself for some time.
--
Michael




Re: Add pg_get_acl() function get the ACL for a database object

2024-06-24 Thread Michael Paquier
On Tue, Jun 25, 2024 at 01:21:14AM +0200, Joel Jacobson wrote:
> Good idea, I've started a separate thread for this:
> 
> https://postgr.es/m/9253b872-dbb1-42a6-a79e-b1e96effc857%40app.fastmail.com
> 
> This patch now assumes ACL will be supported.

Thanks for doing that!  That helps in making reviews easier to follow
for all, attracting the correct audience when necessary.

>> +SELECT
>> +(pg_identify_object(s.classid,s.objid,s.objsubid)).*,
>> +pg_catalog.pg_get_acl(s.classid,s.objid)
>> +FROM pg_catalog.pg_shdepend AS s
>> +JOIN pg_catalog.pg_database AS d ON d.datname = current_database() AND 
>> d.oid = s.dbid
>> +JOIN pg_catalog.pg_authid AS a ON a.oid = s.refobjid AND s.refclassid 
>> = 'pg_authid'::regclass
>> +WHERE s.deptype = 'a';
>>
>> Could be a bit prettier.  That's a good addition.
> 
> How could we make it prettier?

Perhaps split the two JOIN conditions into two lines each, with a bit
more indentation to make it render better?  Usually I handle that on a
case-by-case basis while preparing a patch for commit.  I'm OK to edit
that myself with some final touches, FWIW.  Depending on the input
this shows, I'd also look at some LATERAL business, which can be
cleaner in some cases for the docs.
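
For instance, one possible layout (the same query from upthread, just
re-indented; shown only as a sketch):

    SELECT
        (pg_identify_object(s.classid, s.objid, s.objsubid)).*,
        pg_catalog.pg_get_acl(s.classid, s.objid)
    FROM pg_catalog.pg_shdepend AS s
        JOIN pg_catalog.pg_database AS d
            ON d.datname = current_database()
            AND d.oid = s.dbid
        JOIN pg_catalog.pg_authid AS a
            ON a.oid = s.refobjid
            AND s.refclassid = 'pg_authid'::regclass
    WHERE s.deptype = 'a';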

>> How about adding a bit more coverage?  I'd suggest the following
>> additions:
> 
> Thanks, good idea. I've added the tests,
> but need some help reasoning if the output is expected:

Total coverage sounds good here.

>> - class ID as 0 in input.
> 
> SELECT pg_get_acl(0, 'atest2'::regclass::oid);
> ERROR:  unrecognized class ID: 0
> 
> I believe we want an error here, since: an invalid class ID,
> like 0, or any other invalid OID, should raise an error,
> since classes can't be dropped, so we should never
> expect an invalid OID for a class ID.
> Please correct me if this reasoning is incorrect.

This is an internal error, so it should never be visible to the end
user via SQL because it is an unexpected state.  See for example
2a10fdc4307a, which is similar to what you are doing here.

>> - object ID as 0 in input.
> SELECT pg_get_acl('pg_class'::regclass, 0);
> 
> This returns null, which I believe it should,
> since the OID for a database object could
> be invalid due to having been dropped concurrently.

That's right.  It would be sad for monitoring queries doing large
scans of pg_depend or pg_shdepend to fail in obstructive ways because
of concurrent object drops; we'd lose information about all the other
objects just because at least one object was gone by the time
pg_get_acl() was called for the OID retrieved previously.

>> - Both class and object ID as 0 in input.
> 
> This returns null, but I'm not sure I think this is correct?
> Since if the class ID is zero, i.e. incorrect, that is unexpected,
> and wouldn't we want to throw an error in that case,
> just like if only the class ID is invalid?

NULL is the correct answer for all that, IMO.

>> FWIW, I'm still a bit meh with the addition of the functions
>> overloading the arguments with reg inputs.  I'm OK with that when we
>> know that the input would be of a given object type, like
>> pg_partition_ancestors or pg_partition_tree, but for a case as generic
>> as this one this is less appealing to me.
> 
> I've looked at other occurrences of "reg" in func.sgml,
> and I now agree with you we should skip the overloads,
> since adding them would seem unconventional.

Okay.  If another committer is interested in that, I'd be OK if there
is a consensus on this point.  The fact that I'm not convinced does
not mean that it would show enough value for somebody else.
--
Michael




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 9:30 PM Noah Misch  wrote:
> On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > Right now, in master, we do use a single horizon when determining what
> > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > freezing and full page visibility determinations. Using a different
> > horizon for pruning by vacuum than freezing is what is causing the
> > error on master.
>
> Agreed, and I think using different sources for pruning and freezing is a
> recipe for future bugs.  Fundamentally, both are about answering "is
> snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> That's not to say this thread shall unify the two, but I suspect that's the
> right long-term direction.

What does it really mean to unify the two, though?

If the OldestXmin field was located in struct GlobalVisState (next to
definitely_needed and maybe_needed), but everything worked in
essentially the same way as it will with Melanie's patch in place,
would that count as unifying the two? Why or why not?

-- 
Peter Geoghegan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Noah Misch
On Mon, Jun 24, 2024 at 04:51:24PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> > I'm not sure I understand. The most important thing here is fixing the
> > bug. But if we have a choice of how to fix the bug, I'd prefer to do
> > it by having the pruning code test one horizon that is always correct,
> > rather than (as I think the patch does) having it test against two
> > horizons because as a way of covering possible discrepancies between
> > those values.
> 
> Your characterizing of OldestXmin + vistest as two horizons seems
> pretty arbitrary to me. I know what you mean, of course, but it seems
> like a distinction without a difference.

"Two horizons" matches how I model it.  If the two were _always_ indicating
the same notion of visibility, we wouldn't have this thread.

On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed, and I think using different sources for pruning and freezing is a
recipe for future bugs.  Fundamentally, both are about answering "is
snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
That's not to say this thread shall unify the two, but I suspect that's the
right long-term direction.




Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

2024-06-24 Thread Michael Paquier
On Fri, Jun 21, 2024 at 11:31:21AM +0200, Michail Nikolaev wrote:
> Yes, I also have tried that approach, but it doesn't work, unfortunately.
> You may fail test increasing number of connections:
> 
> '--no-vacuum --client=10 -j 2 --transactions=1000',
> 
> The source of the issue is not the swap of the indexes (and not related to
> REINDEX CONCURRENTLY only), but the fact that indexes are fetched once
> during planning (to find the arbiter), but then later reread with a new
> catalog snapshot for the the actual execution.

When I first saw this report, my main worry was that I had somehow
managed to break the state of the indexes, leading to data corruption
because of an incorrect step in the concurrent operations.  However,
as far as I can see this is not the case, thanks to two properties we
rely on for concurrent index operations, which hold in the executor
and the planner.  Simply put:
- The planner ignores indexes with !indisvalid.
- The executor ignores indexes with !indislive.

The critical point is that we wait in DROP INDEX CONC and REINDEX CONC
for any transactions still using an index that's waiting to be marked
as !indislive, because such indexes *must* not be used in the
executor.
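
(For readers following along: both flags can be inspected directly with a
query like the following, where my_table is just a placeholder table name.)

    SELECT indexrelid::regclass AS index_name, indisvalid, indislive
    FROM pg_index
    WHERE indrelid = 'my_table'::regclass;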

> So, other possible fixes I see:
> * fallback to replanning in case we see something changed during the
> execution
> * select arbiter indexes during actual execution

These two properties make ON CONFLICT react the way it should
depending on the state of the indexes selected by the planner based on
the query clauses, with changes reflecting when executing, with two
patterns involved:
- An index may be created by a concurrent operation after the planner
has selected the arbiter indexes (the index may be defined but not yet
valid, or created only afterwards); the query execution then needs to
handle the extra index that has become available at execution time,
with a failure on a ccnew index.
- An index may be selected at the planning phase, then a different
index could be used by a constraint once both indexes are swapped,
with a failure on a ccold index.

As far as I can see, it depends on what kind of query semantics and
what amount of transparency you are looking for here in your
application.  An error in the query itself can also be seen as useful,
so that your application is aware of what happens as an effect of the
concurrent index build (REINDEX or CIC/DIC), and it is not really
clear to me why silently falling back to a re-selection of the arbiter
indexes would always be better.  Replanning could actually be
dangerous if a workload is heavily and concurrently REINDEX'd, as we
could fall into a trap where a query can never decide which index to
use.  I'm not saying that it cannot be improved, but it's not
completely clear to me which query semantics are best for all users,
because the behavior of HEAD and your suggestions each have merits and
demerits.  Anything we could come up with would be an improvement
anyway, AFAIU.

>> That's a HEAD-only thing IMO,
>> though.
>
> Do you mean that it needs to be moved to a separate patch?

It should, but I'm wondering if that's necessary for two reasons.

First, a check on indisvalid would be incorrect, because indexes
marked as !indisvalid && indislive mean that there is a concurrent
operation happening, and that this concurrent operation is waiting for
all transactions working with a lock on this index to finish before
flipping the live flag and making this index invalid for decisions
taken in the executor, like HOT updates, etc.

A check on indislive may be an idea; still, I'm slightly biased
regarding its additional value, because any indexes opened for a
relation are fetched from the relcache with RelationGetIndexList(),
explaining why indislive indexes cannot be fetched, and we rely on
that in the executor for the indexes opened by a relation.
--
Michael




Re: Track the amount of time waiting due to cost_delay

2024-06-24 Thread Imseih (AWS), Sami
>> 2. the leader being interrupted while waiting is also already happening on 
>> master
>> due to the pgstat_progress_parallel_incr_param() calls in
>> parallel_vacuum_process_one_index() (that have been added in
>> 46ebdfe164). It has been the case "only" 36 times during my test case.

46ebdfe164 will interrupt the leader's sleep every time a parallel worker
reports progress, and we currently don't handle interrupts by restarting
the sleep with the remaining time. nanosleep does provide the ability to
restart with the remaining time [1], but I don't think it's worth the
effort to ensure more accurate vacuum delays for the leader process.
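
For reference, the restart loop described in the man page would look
roughly like this (illustrative only, not proposed code):

    #include <errno.h>
    #include <time.h>

    static void
    sleep_full_duration(struct timespec req)
    {
        struct timespec rem;

        /* EINTR leaves the unslept time in 'rem'; retry with it */
        while (nanosleep(&req, &rem) == -1 && errno == EINTR)
            req = rem;
    }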


> 1. Having a time based only approach to throttle 

I do agree with a time based approach overall.


> 1.1) the more parallel workers is used, the less the impact of the leader on
> the vacuum index phase duration/workload is (because the repartition is done
> on more processes).

Did you mean " because the vacuum is done on more processes"? 

When a leader is operating on a large index(s) during the entirety
of the vacuum operation, wouldn't more parallel workers end up
interrupting the leader more often? This is why I think reporting even more
often than 1 second (more below) will be better.

> 3. A 1 second reporting "throttling" looks a reasonable threshold as:

> 3.1 the idea is to have a significant impact when the leader could have been
> interrupted say hundred/thousand times per second.

> 3.2 it does not make that much sense for any tools to sample 
> pg_stat_progress_vacuum
> multiple times per second (so a one second reporting granularity seems ok).

I feel 1 second may still be too frequent.
What about 10 seconds (or 30 seconds)?
I think this metric in particular will be mainly useful for vacuum runs
that are running for minutes or more, making reporting every 10 or 30
seconds still useful.

It just occurred to me also that pgstat_progress_parallel_incr_param
should have a code comment noting that it will interrupt a leader process
and cause activity such as a sleep to end early.
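
One possible wording, purely illustrative (the prototype is shown only
for context, assuming its current signature):

    /*
     * Notify the leader process that a parallel worker has made progress.
     * Note that this interrupts the leader, so any interruptible wait it
     * is in (for example a cost-delay sleep) may return before its full
     * timeout has elapsed.
     */
    void pgstat_progress_parallel_incr_param(int index, int64 incr);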



Regards,

Sami Imseih
Amazon Web Services (AWS)


[1] https://man7.org/linux/man-pages/man2/nanosleep.2.html




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-24 Thread Kyotaro Horiguchi
At Fri, 21 Jun 2024 11:48:22 +0530, Amit Kapila  wrote 
in 
> On Wed, Jun 19, 2024 at 10:44 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Horiguchi-san,
> >
> > Thanks for sharing the patch! I agree this approach (ensure WAL records are 
> > flushed)
> > Is more proper than others.
> >
> > I have an unclear point. According to the comment atop GetInsertRecPtr(), 
> > it just
> > returns the approximated value - the position of the last full WAL page [1].
> > If there is a continuation WAL record which across a page, will it return 
> > the
> > halfway point of the WAL record (end of the first WAL page)? If so, the 
> > proposed
> > fix seems not sufficient. We have to point out the exact the end of the 
> > record.
> >
> 
> You have a point but if this theory is correct why we are not able to
> reproduce the issue after this patch? Also, how to get the WAL
> location up to which we need to flush? Is XLogCtlData->logInsertResult
> the one we are looking for?

It is not exposed, but of course logInsertResult is a more
straightforward source for the LSN.

The reason why the patch works well is the following bit of code.

xlog.c:958, in XLInsertRecord()
>   /*
>* Update shared LogwrtRqst.Write, if we crossed page boundary.
>*/
>   if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
>   {
>   SpinLockAcquire(&XLogCtl->info_lck);
>   /* advance global request to include new block(s) */
>   if (XLogCtl->LogwrtRqst.Write < EndPos)
>   XLogCtl->LogwrtRqst.Write = EndPos;
>   SpinLockRelease(&XLogCtl->info_lck);
>   RefreshXLogWriteResult(LogwrtResult);
>   }

The code, which has existed for a long time, ensures that
GetInsertRecPtr() returns the accurate end of a record when it spans
page boundaries. This would need to be written in the new comment
if we use GetInsertRecPtr().
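
Expressed as a one-line sketch of the idea (not the actual patch), the fix
amounts to flushing everything inserted so far before shutting down:

    XLogFlush(GetInsertRecPtr());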

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Support "Right Semi Join" plan shapes

2024-06-24 Thread Japin Li
On Mon, 24 Jun 2024 at 17:59, Richard Guo  wrote:
> Thank you for reviewing.
>
> On Mon, Jun 24, 2024 at 1:27 PM Li Japin  wrote:
>> +   /*
>> +* For now we do not support RIGHT_SEMI join in mergejoin or nestloop
>> +* join.
>> +*/
>> +   if (jointype == JOIN_RIGHT_SEMI)
>> +   return;
>> +
>>
>> How about adding some reasons here?
>
> I've included a brief explanation in select_mergejoin_clauses.
>

Thank you for updating the patch.

>> + * this is a right-semi join, or this is a right/right-anti/full join and
>> + * there are nonmergejoinable join clauses.  The executor's mergejoin
>>
>> Maybe we can put the right-semi join together with the right/right-anti/full
>> join.  Is there any other significance by putting it separately?
>
> I don't think so.  The logic is different: for right-semi join we will
> always set *mergejoin_allowed to false, while for right/right-anti/full
> join it is set to false only if there are nonmergejoinable join clauses.
>

Got it.  Thanks for the explanation.

>> Maybe the following comments also should be updated. Right?
>
> Correct.  And there are a few more places where we need to mention
> JOIN_RIGHT_SEMI, such as in reduce_outer_joins_pass2 and in the comment
> for SpecialJoinInfo.
>
>
> I noticed that this patch changes the plan of a query in join.sql from
> a semi join to right semi join, compromising the original purpose of
> this query, which was to test the fix for neqjoinsel's behavior for
> semijoins (see commit 7ca25b7d).
>
> --
> -- semijoin selectivity for <>
> --
> explain (costs off)
> select * from int4_tbl i4, tenk1 a
> where exists(select * from tenk1 b
>  where a.twothousand = b.twothousand and a.fivethous <> 
> b.fivethous)
>   and i4.f1 = a.tenthous;
>
> So I've changed this test case a bit so that it is still testing what it
> is supposed to test.
>
> In passing, I've also updated the commit message to clarify that this
> patch does not address the support of "Right Semi Join" for merge joins.
>

Tested and looks good to me!

-- 
Regrads,
Japin Li




Re: Add pg_get_acl() function get the ACL for a database object

2024-06-24 Thread Joel Jacobson
On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote:
> Rather unrelated to this patch, still this patch makes the situation
> more complicated in the docs, but wouldn't it be better to add ACL as
> a term in acronyms.sql, and reuse it here?  It would be a doc-only
> patch that applies on top of the rest (could be on a new thread of its
> own), with some  markups added where needed.

Good idea, I've started a separate thread for this:

https://postgr.es/m/9253b872-dbb1-42a6-a79e-b1e96effc857%40app.fastmail.com

This patch now assumes ACL will be supported.

> +SELECT
> +(pg_identify_object(s.classid,s.objid,s.objsubid)).*,
> +pg_catalog.pg_get_acl(s.classid,s.objid)
> +FROM pg_catalog.pg_shdepend AS s
> +JOIN pg_catalog.pg_database AS d ON d.datname = current_database() AND 
> d.oid = s.dbid
> +JOIN pg_catalog.pg_authid AS a ON a.oid = s.refobjid AND s.refclassid 
> = 'pg_authid'::regclass
> +WHERE s.deptype = 'a';
>
> Could be a bit prettier.  That's a good addition.

How could we make it prettier?

> + catalogId = (classId == LargeObjectRelationId) ? 
> LargeObjectMetadataRelationId : classId;
>
> Indeed, and we need to live with this tweak as per the reason in
> inv_api.c related to clients, so that's fine.  Still a comment is
> adapted for this particular case?

Thanks, fixed.

> How about adding a bit more coverage?  I'd suggest the following
> additions:

Thanks, good idea. I've added the tests,
but need some help reasoning if the output is expected:

> - class ID as 0 in input.

SELECT pg_get_acl(0, 'atest2'::regclass::oid);
ERROR:  unrecognized class ID: 0

I believe we want an error here, since: an invalid class ID,
like 0, or any other invalid OID, should raise an error,
since classes can't be dropped, so we should never
expect an invalid OID for a class ID.
Please correct me if this reasoning is incorrect.

> - object ID as 0 in input.
SELECT pg_get_acl('pg_class'::regclass, 0);

This returns null, which I believe it should,
since the OID for a database object could
be invalid due to having been dropped concurrently.

> - Both class and object ID as 0 in input.

This returns null, but I'm not sure I think this is correct?
Since if the class ID is zero, i.e. incorrect, that is unexpected,
and wouldn't we want to throw an error in that case,
just like if only the class ID is invalid?

> +SELECT pg_get_acl('pg_class'::regclass, 'atest2'::regclass::oid);
> +   
>  pg_get_acl 
>
> +--
> + 
> {regress_priv_user1=arwdDxtm/regress_priv_user1,regress_priv_user2=r/regress_priv_user1,regress_priv_user3=w/regress_priv_user1,regress_priv_user4=a/regress_priv_user1,regress_priv_user5=D/regress_priv_user1}
> +(1 row)
>
> This is hard to parse.  I would add an unnest() and order the entries
> so as modifications are easier to catch, with a more predictible
> result.

Thanks, much better, fixed.
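
One way to write the reworked test with that suggestion (a sketch, not
necessarily the exact form in the attached patch; atest2 is the
regression-test table quoted above):

    SELECT acl
    FROM unnest(pg_get_acl('pg_class'::regclass, 'atest2'::regclass::oid)) AS a(acl)
    ORDER BY acl::text;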

> FWIW, I'm still a bit meh with the addition of the functions
> overloading the arguments with reg inputs.  I'm OK with that when we
> know that the input would be of a given object type, like
> pg_partition_ancestors or pg_partition_tree, but for a case as generic
> as this one this is less appealing to me.

I've looked at other occurrences of "reg" in func.sgml,
and I now agree with you we should skip the overloads,
since adding them would seem unconventional.

/Joel

v7-0001-Add-pg_get_acl.patch
Description: Binary data


Re: Apparent bug in WAL summarizer process (hung state)

2024-06-24 Thread Michael Paquier
On Mon, Jun 24, 2024 at 02:56:00PM -0300, Israel Barth Rubio wrote:
> I've been playing a bit with the incremental backup feature which might
> come as
> part of the 17 release, and I think I hit a possible bug in the WAL
> summarizer
> process.

Thanks for testing new features and for this report!

> FWIW, once I restart Postgres the WAL summarizer process gets back to normal
> functioning. It seems to me there is some race condition between when a WAL
> file is removed and when `summarize_wal` is re-enabled, causing the process
> to keep looking for a WAL file that is in the past.

I am adding an open item to track this issue, to make sure that this
is looked at.
--
Michael




Re: improve predefined roles documentation

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 2:53 PM Nathan Bossart 
wrote:

> On Mon, Jun 24, 2024 at 02:44:33PM -0400, Robert Haas wrote:
>
> > I don't know what to do about pg_database_owner. I almost wonder if
> > that should be moved out of the table and documented as a special
> > case. Or maybe some more wordsmithing would add clarity. Or maybe it's
> > fine as-is.
>
> I've left it alone for now.  I thought about adding something like
> "pg_database_owner does not provide any special capabilities or access
> out-of-the-box" to the beginning of the entry, but I don't have time at the
> moment to properly wordsmith the rest.  If anyone else wants to give it a
> try before I get to it (probably tomorrow), please be my guest.
>

This feels like a case where why is more important than what, so here's my
first draft suggestion.

pg_database_owner owns the initially created public schema and has an
implicit membership list of one - the role owning the connected-to database.
It exists to encourage and facilitate best practices regarding database
administration, the primary rule being to avoid using the superuser to own
or do things.  The bootstrap superuser thus should connect to the postgres
database and create a login role with the createdb attribute, and then use
that role to create and administer additional databases.  In that context,
this feature allows the creator of the new database to log into it and
immediately begin working in the public schema.

As a result, beginning in version 15, PostgreSQL no longer initially grants
the create privilege on the public schema to the public pseudo-role.

For technical reasons, pg_database_owner may not participate in explicitly
granted role memberships.  This is an easily mitigated limitation since the
role that owns the database may be a group and any inheriting members of
that group will be considered owners as well.
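
To make the intended workflow concrete, a hypothetical example (role and
database names are invented):

    -- as the bootstrap superuser, in the postgres database:
    CREATE ROLE app_admin LOGIN CREATEDB;

    -- as app_admin:
    CREATE DATABASE app_db;

    -- app_admin can now connect to app_db; as its owner it is an implicit
    -- member of pg_database_owner there, so it can use the public schema
    -- immediately:
    CREATE TABLE public.t (id int);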

David J.


Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Joel Jacobson
On Mon, Jun 24, 2024, at 23:15, David G. Johnston wrote:
> I really dislike "For avoidance of doubt and clarity" - and in terms of 
> being equivalent the following seems like a more accurate description 
> of reality.
>
> The PostgreSQL documentation, and code, refers to the specifications 
> within the ACL as "privileges".  This has the same meaning as 
> "permissions" on the linked page.  Generally if we say "permissions" we 
> are referring to something that is not covered by the ACL.  In routine 
> communication the two words are often used interchangeably.

Thanks, much better. New version attached.

/Joel

v4-0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


[PATCH v3] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Stan Hu
Prior to macOS SDK 15, there were include guards in
$SDK_ROOT/usr/include/xlocale/_regex.h:

  #ifndef _REGEX_H_
  #include <_regex.h>
  #endif // _REGEX_H_
  #include <_xlocale.h>

In macOS SDK 15, these include guards are gone:

  #include <_regex.h>
  #include <_xlocale.h>

Because _REGEX_H_ was defined locally in PostgreSQL's version of
src/include/regex/regex.h, these include guards prevented duplicate
definitions from $SDK_ROOT/usr/include/_regex.h (not to be confused
with $SDK_ROOT/usr/include/xlocale/_regex.h).

As a result, attempting to compile PostgreSQL with macOS SDK 15 fails
with "previous definition is here" errors for regoff_t, regex_t, and
regmatch_t structures.

To fix this build issue, define __REGEX_H_ to prevent macOS from
including the header that contains redefinitions of the local regex
structures.

Discussion: 
https://www.postgresql.org/message-id/CAMBWrQ%3DF9SSPfsFtCv%3DJT51WGK2VcgLA%2BiiJJOmjN0zbbufOEA%40mail.gmail.com
---
 src/include/regex/regex.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..f7aa7cf3a3 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -32,6 +32,20 @@
  * src/include/regex/regex.h
  */
 
+#if defined(__darwin__)
+/*
+ * macOS SDK 15.0 removed the _REGEX_H_ include guards in
+ * $SDK_ROOT/usr/include/xlocale/_regex.h, so now
+ * $SDK_ROOT/usr/include/_regex.h is always included. That file defines
+ * the same types as below. To guard against type redefinition errors,
+ * define __REGEX_H_.
+ */
+#ifndef __REGEX_H_
+#define __REGEX_H_
+
+#endif /* __REGEX_H_ */
+#endif
+
 /*
  * Add your own defines, if needed, here.
  */
-- 
2.45.0





Re: speed up a logical replica setup

2024-06-24 Thread Noah Misch
On Mon, Jun 24, 2024 at 05:20:21PM +0530, Amit Kapila wrote:
> On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
> >
> > > +static void
> > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > > +{
> >
> > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > > +   ipubname_esc);
> >
> > This tool's documentation says it "guarantees that no transaction will be
> > lost."  I tried to determine whether achieving that will require something
> > like the fix from
> > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> > (Not exactly the fix from that thread, since that thread has not discussed 
> > the
> > FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> > hand, pg_createsubscriber benefits from creating a logical slot after 
> > creating
> > the publication.  That snapbuild.c process will wait for running XIDs.  On 
> > the
> > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and builds
> > its relcache entry before assigning an XID, so perhaps the snapbuild.c 
> > process

Correction: it doesn't matter how the original INSERT/UPDATE/DELETE builds its
relcache entry, just how pgoutput of the change builds the relcache entry from
the historic snapshot.

> > isn't enough to prevent that thread's race condition.  What do you think?
> 
> I am not able to imagine how the race condition discussed in the
> thread you quoted can impact this patch. The problem discussed is
> mainly the interaction when we are processing the changes in logical
> decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
> problem happens because we use the old cache state.

Right.  Taking the example from
http://postgr.es/m/20231119021830.d6t6aaxtrkpn7...@awork3.anarazel.de, LSNs
between what that mail calls 4) and 5) are not safely usable as start points.
pg_createsubscriber evades that thread's problem if the consistent_lsn it
passes to pg_replication_origin_advance() can't be in a bad-start-point LSN
span.  I cautiously bet the snapbuild.c process achieves that:

> I am missing your
> point about the race condition mentioned in the thread you quoted with
> snapbuild.c. Can you please elaborate a bit more?

When pg_createsubscriber calls pg_create_logical_replication_slot(), the key
part starts at:

/*
 * If caller needs us to determine the decoding start point, do so now.
 * This might take a while.
 */
if (find_startpoint)
DecodingContextFindStartpoint(ctx);

Two factors protect pg_createsubscriber.  First, (a) CREATE PUBLICATION
committed before pg_create_logical_replication_slot() started.  Second, (b)
DecodingContextFindStartpoint() waits for running XIDs to complete, via the
process described at the snapbuild.c "starting up in several stages" diagram.
Hence, the consistent_lsn is not in a bad-start-point LSN span.  It's fine
even if the original INSERT populated all caches before CREATE PUBLICATION
started and managed to assign an XID only after consistent_lsn.  From the
pgoutput perspective, that's indistinguishable from the transaction starting
at its first WAL record, after consistent_lsn.  The linked "long-standing data
loss bug in initial sync of logical replication" thread doesn't have (a),
hence its bug.  How close is that to accurate?

Thanks,
nm




[PATCH] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Stan Hu
Prior to macOS SDK 15, there were include guards in
$SDK_ROOT/usr/include/xlocale/_regex.h:

  #ifndef _REGEX_H_
  #include <_regex.h>
  #endif // _REGEX_H_
  #include <_xlocale.h>

In macOS SDK 15, these include guards are gone:

  #include <_regex.h>
  #include <_xlocale.h>

Because _REGEX_H_ was defined locally in PostgreSQL's version of
src/include/regex/regex.h, these include guards prevented duplicate
definitions from $SDK_ROOT/usr/include/_regex.h (not to be confused
with $SDK_ROOT/usr/include/xlocale/_regex.h).

As a result, attempting to compile PostgreSQL with macOS SDK 15 fails
with "previous definition is here" errors for regoff_t, regex_t, and
regmatch_t structures.

To fix this build issue, define __REGEX_H_ to prevent macOS from
including the header that contains redefinitions of the local regex
structures.

Discussion: 
https://www.postgresql.org/message-id/CAMBWrQ%3DF9SSPfsFtCv%3DJT51WGK2VcgLA%2BiiJJOmjN0zbbufOEA%40mail.gmail.com
---
 src/include/regex/regex.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..045ac626cc 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -32,6 +32,20 @@
  * src/include/regex/regex.h
  */
 
+#if defined(__darwin__)
+/*
+ * mmacOS SDK 15.0 removed the _REGEX_H_ include guards in
+ * $SDK_ROOT/usr/include/xlocale/_regex.h, so now
+ * $SDK_ROOT/usr/include/_regex.h is always included. That file defines
+ * the same types as below. To guard against type redefinition errors,
+ * define __REGEX_H_.
+ */
+#ifndef __REGEX_H_
+#define __REGEX_H_
+
+#endif /* __REGEX_H_ */
+#endif
+
 /*
  * Add your own defines, if needed, here.
  */
-- 
2.45.0





Re: improve predefined roles documentation

2024-06-24 Thread Nathan Bossart
On Mon, Jun 24, 2024 at 02:44:33PM -0400, Robert Haas wrote:
> I think the first two cases could be made more like each other by
> changing the varlistentires that are just about one setting to use the
> second format instead of the first, e.g. pg_checkpoint allows
> executing the CHECKPOINT command.

Done.

> I don't know what to do about pg_database_owner. I almost wonder if
> that should be moved out of the table and documented as a special
> case. Or maybe some more wordsmithing would add clarity. Or maybe it's
> fine as-is.

I've left it alone for now.  I thought about adding something like
"pg_database_owner does not provide any special capabilities or access
out-of-the-box" to the beginning of the entry, but I don't have time at the
moment to properly wordsmith the rest.  If anyone else wants to give it a
try before I get to it (probably tomorrow), please be my guest.  TBH I
think the existing content is pretty good, so I'm not opposed to leaving it
alone, even if the style is different than the other entries.

-- 
nathan
>From d76a09646bd9458450a4f5c051c36050fefbbf29 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 18 Jun 2024 11:38:40 -0500
Subject: [PATCH v3 1/1] revamp predefined roles documentation

---
 doc/src/sgml/config.sgml |   2 +-
 doc/src/sgml/monitoring.sgml |   4 +-
 doc/src/sgml/ref/checkpoint.sgml |   2 +-
 doc/src/sgml/ref/reindex.sgml|   2 +-
 doc/src/sgml/user-manag.sgml | 339 ---
 5 files changed, 185 insertions(+), 164 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0c7a9082c5..03e37209e6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -731,7 +731,7 @@ include_dir 'conf.d'

 Determines the number of connection slots that are
 reserved for connections by roles with privileges of the
-pg_use_reserved_connections
+
 role.  Whenever the number of free connection slots is greater than
  but less than or
 equal to the sum of superuser_reserved_connections
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b2ad9b446f..f30c1e53fa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -286,8 +286,8 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34 
  0:00 postgres: ser
other sessions, many columns will be null.  Note, however, that the
existence of a session and its general properties such as its sessions user
and database are visible to all users.  Superusers and roles with 
privileges of
-   built-in role pg_read_all_stats (see also ) can see all the information about all 
sessions.
+   built-in role pg_read_all_stats
+   can see all the information about all sessions.
   
 
   
diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index 28a1d717b8..db011a47d0 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -53,7 +53,7 @@ CHECKPOINT
 
   
Only superusers or users with the privileges of
-   the pg_checkpoint
+   the 
role can call CHECKPOINT.
   
  
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 2942dccf1e..dcf70d14bc 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -305,7 +305,7 @@ REINDEX [ ( option [, ...] ) ] { DA
partitioned table, such commands skip the privilege checks when processing
the individual partitions.  Reindexing a schema or database requires being 
the
owner of that schema or database or having privileges of the
-   pg_maintain
+   
role.  Note specifically that it's thus
possible for non-superusers to rebuild indexes of tables owned by
other users.  However, as a special exception,
diff --git a/doc/src/sgml/user-manag.sgml b/doc/src/sgml/user-manag.sgml
index 07a16247d7..8b55216631 100644
--- a/doc/src/sgml/user-manag.sgml
+++ b/doc/src/sgml/user-manag.sgml
@@ -590,101 +590,72 @@ DROP ROLE doomed_role;
and information.  Administrators (including roles that have the
CREATEROLE privilege) can GRANT these
roles to users and/or other roles in their environment, providing those
-   users with access to the specified capabilities and information.
+   users with access to the specified capabilities and information.  For
+   example:
+
+
+GRANT pg_signal_backend TO admin_user;
+
   
 
+  
+   
+Care should be taken when granting these roles to ensure they are only used
+where needed and with the understanding that these roles grant access to
+privileged information.
+   
+  
+
   
-   The predefined roles are described in .
+   The predefined roles are described below.
Note that the specific permissions for each of the roles may change in
the future as additional capabilities are added.  Administrators
should monitor the release notes for changes.
-  
 
-   
-Predefined Roles
-
- 
- 
- 
-  
-  

[PATCH] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Stan Hu
Prior to macOS SDK 15, there were include guards in
$SDK_ROOT/usr/include/xlocale/_regex.h:

  #ifndef _REGEX_H_
  #include <_regex.h>
  #endif // _REGEX_H_
  #include <_xlocale.h>

In macOS SDK 15.5, these include guards are gone:

  #include <_regex.h>
  #include <_xlocale.h>

Because _REGEX_H_ was defined locally in PostgreSQL's version of
src/include/regex/regex.h, these include guards prevented duplicate
definitions from $SDK_ROOT/usr/include/_regex.h (not to be confused
with $SDK_ROOT/usr/include/xlocale/_regex.h).

To fix this build issue, define __REGEX_H_ to prevent macOS from
including the header that contains redefinitions of the local regex
structures.

Discussion: 
https://www.postgresql.org/message-id/CAMBWrQ%3DF9SSPfsFtCv%3DJT51WGK2VcgLA%2BiiJJOmjN0zbbufOEA%40mail.gmail.com
---
 src/include/regex/regex.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..045ac626cc 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -32,6 +32,20 @@
  * src/include/regex/regex.h
  */
 
+#if defined(__darwin__)
+/*
+ * mmacOS SDK 15.0 removed the _REGEX_H_ include guards in
+ * $SDK_ROOT/usr/include/xlocale/_regex.h, so now
+ * $SDK_ROOT/usr/include/_regex.h is always included. That file defines
+ * the same types as below. To guard against type redefinition errors,
+ * define __REGEX_H_.
+ */
+#ifndef __REGEX_H_
+#define __REGEX_H_
+
+#endif /* __REGEX_H_ */
+#endif
+
 /*
  * Add your own defines, if needed, here.
  */
-- 
2.45.0





Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 22:42, David E. Wheeler  wrote:
> >> BINDIR
> >> DOCDIR
> >> HTMLDIR
> >> PKGINCLUDEDIR
> >> LOCALEDIR
> >> MANDIR
> >>
> >> I can imagine an extension wanting or needing to use any and all of these.
> >
> > Are these really all relevant to backend code?
>
> Oh I think so. Especially BINDIR; lots of extensions ship with binary 
> applications. And most ship with docs, too (PGXS puts items listed in DOCS 
> into DOCDIR). Some might also produce man pages (for their binaries), HTML 
> docs, and other stuff. Maybe an FTE extension would include locale files?
>
> I find it pretty easy to imagine use cases for all of them. So much so that I 
> wrote an extension binary distribution RFC[1] and its POC[2] around them.

Definitely agreed on BINDIR needing to be supported.

And while lots of extensions ship with docs, I expect this feature to
mostly be used in production environments to make deploying extensions
easier. And I'm not sure that many people care about deploying docs to
production (honestly lots of people would probably want to strip
them).

Still, for the sake of completeness it might make sense to support
this whole list in extension_destdir. (assuming it's easy to do)




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread David E. Wheeler
On Jun 24, 2024, at 17:17, Jelte Fennema-Nio  wrote:

> If you want to only change $libdir during CREATE EXTENSION (or ALTER
> EXTENSION UPDATE), then why not just change it there. And really you'd
> only want to change it when creating an extension from which the
> control file is coming from extension_destdir.

IIUC, the postmaster needs to load an extension on first use in every session 
unless it’s in shared_preload_libraries.

> However, I can also see a case for really always changing $libdir.
> Because some extensions in shared_preload_libraries, might want to
> trigger loading other libraries that they ship with dynamically. And
> these libraries are probably also in extension_destdir.

Right, it can be more than just the DSOs for the extension itself.

Best,

David





Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 18:11, Nathan Bossart  wrote:
> At first glance, the general idea seems reasonable to me.  I'm wondering
> whether there is a requirement for this directory to be prepended or if it
> could be appended to the end.  That way, the existing ones would take
> priority, which might be desirable from a security standpoint.

Citus does ship with some override library for pgoutput to make
logical replication/CDC work correctly with sharded tables. Right now
using this override library requires changing dynamic_library_path. It
would be nice if that wasn't necessary. But this is obviously a small
thing. And I definitely agree that there's a security angle to this as
well, but honestly that seems rather small too. If an attacker can put
shared libraries into the extension_destdir, I'm pretty sure you've
lost already, no matter if extension_destdir is prepended or appended
to the existing $libdir.




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Jelte Fennema-Nio
On Thu, 11 Apr 2024 at 19:52, David E. Wheeler  wrote:
> I realize this probably isn’t going to happen for 17, given the freeze, but I 
> would very much welcome feedback and pointers to address concerns about 
> providing a second directory for extensions and DSOs. Quite a few people have 
> talked about the need for this in the Extension Mini Summits[1], so I’m sure 
> I could get some collaborators to make improvements or look at a different 
> approach.

Overall +1 for the idea. We're running into this same limitation (only
a single place to put extension files) at Microsoft at the moment.

+and to the '$libdir' directive when loading modules
+that back functions.

I feel like this is a bit strange. Either its impact is too wide, or
it's not wide enough depending on your intent.

If you want to only change $libdir during CREATE EXTENSION (or ALTER
EXTENSION UPDATE), then why not just change it there. And really you'd
only want to change it when creating an extension from which the
control file is coming from extension_destdir.

However, I can also see a case for really always changing $libdir.
Because some extensions in shared_preload_libraries, might want to
trigger loading other libraries that they ship with dynamically. And
these libraries are probably also in extension_destdir.
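
As an illustration, assuming the GUC keeps the extension_destdir name from
the proposal (paths invented):

    # postgresql.conf
    extension_destdir = '/var/lib/postgresql/extra-extensions'

    # today, using an override library such as the pgoutput replacement
    # mentioned above requires editing the search path directly:
    dynamic_library_path = '/var/lib/postgresql/extra-extensions/lib:$libdir'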




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 1:49 PM Joel Jacobson  wrote:

> How about?
>
> + 
> +  The linked page uses "permissions" while we consistently use the
> synonym
> +  "privileges", to describe the contents of the list. For avoidance of
> +  doubt and clarity, these two terms are equivalent in the
> +  PostgreSQL documentation.
> + 
>
> /Joel


I really dislike "For avoidance of doubt and clarity" - and in terms of
being equivalent the following seems like a more accurate description of
reality.

The PostgreSQL documentation, and code, refers to the specifications within
the ACL as "privileges".  This has the same meaning as "permissions" on the
linked page.  Generally if we say "permissions" we are referring to
something that is not covered by the ACL.  In routine communication the two
words are often used interchangeably.

David J.


Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> > I thought the idea was that the GlobalVisTest stuff would force a
> > recalculation now and then, but maybe it doesn't actually do that?
>
> It definitely can do that. Just not in a way that meaningfully
> increases the number of heap tuples that we can recognize as DEAD and
> remove. At least not currently.
>
> > Suppose process A begins a transaction, acquires an XID, and then goes
> > idle. Process B now begins a giant vacuum. At some point in the middle
> > of the vacuum, A ends the transaction. Are you saying that B's
> > GlobalVisTest never really notices that this has happened?
>
> That's my understanding, yes. That is, vistest is approximately the
> same thing as OldestXmin anyway. At least for now.

Exactly. Something has to cause this backend to update its view of the
horizon. At the end of index vacuuming,
GetOldestNonRemovableTransactionId() will explicitly call
ComputeXidHorizons(), which will update our backend's GlobalVisStates.
Otherwise, if our backend's RecentXmin is updated by taking a new
snapshot, then we may update our GlobalVisStates. See
GlobalVisTestShouldUpdate() for the conditions under which we would
update our GlobalVisStates during the normal visibility checks
happening during pruning.

Vacuum used to open indexes after calculating horizons before starting
its first pass. This led to a recomputation of the horizon. But, in
master, there aren't many obvious places where such a thing would be
happening.

- Melanie




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Nikolay Shaplov
Hi!

Let me join the review process.

I am no expert in execution plans, so I would not be much help in doing
even better optimization. But I can read the code as a person who is not
familiar with this area, and help make it clear even to a person like me.

So, I am reading v25-0001-Transform-OR-clauses-to-ANY-expression.patch that 
have been posted some time ago, and especially transform_or_to_any function.

> @@ -38,7 +45,6 @@
>  int  from_collapse_limit;
>  int  join_collapse_limit;
>
> - 
>  /*
>   * deconstruct_jointree requires multiple passes over the join tree,  
because we 
>   * need to finish computing JoinDomains before we start  distributing quals.

I do not think that removing the empty line should be part of the patch.

> + /*
> +  * If the const node's (right side of operator 
expression) type
> +  * don't have “true” array type, then we cannnot 
do the
> +  * transformation. We simply concatenate the 
expression node.
> +  */
I guess using unicode double quotes is not the best idea here...
   
Now to the first part of  `transform_or_to_any`:

> + List   *entries = NIL;
I guess the idea of entries should be explained from the start. What kind of
entries are accumulated there... I see they are added all around the
code, but their purpose is not quite clear when you read it.

In the first part of the `transform_or_to_any` function, you constantly
repeat two lines, like a mantra:

> + entries = lappend(entries, rinfo);
> + continue;

"If something is wrong -- do that mantra"

From my perspective, if you have to repeat the same code again and again,
then most probably there is an issue with the architecture of the code, and
you should try to rewrite it so that the repeated part appears only once.

In that case I would try to move most of the first loop of
`transform_or_to_any` into a separate function (let's say its name is
prepare_single_or) that does all the checks on whether this OR is good for
us: it returns NULL if the clause does not suit our purposes (and in this
case we do "entries = lappend(entries, rinfo); continue" in the main code,
but only once), or it returns a pointer to some useful data if this OR
clause is good for our purposes.

This, I guess, will make that part clearer and easier to read, without
repeating the same "lappend mantra" again and again.
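
Something along these lines (only a sketch of the shape; the helper names
and the OrClauseGroup type are illustrative, not taken from the patch):

    static OrClauseGroup *
    prepare_single_or(RestrictInfo *rinfo)
    {
        /* all the suitability checks currently inlined in the loop go here */
        if (!or_clause_is_suitable(rinfo))
            return NULL;        /* caller keeps the clause as-is */

        return build_or_clause_group(rinfo);
    }

    /* ... and in the caller the "mantra" then appears exactly once: */
    foreach(lc, orlist)
    {
        RestrictInfo   *rinfo = lfirst_node(RestrictInfo, lc);
        OrClauseGroup  *group = prepare_single_or(rinfo);

        if (group == NULL)
        {
            entries = lappend(entries, rinfo);
            continue;
        }
        /* ... otherwise accumulate 'group' for the ANY transformation ... */
    }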

I will continue digging into the code tomorrow.

P.S. Sorry for sending a partly finished email. I pressed Ctrl+Enter
accidentally... with no way to take it back :-(((



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons because as a way of covering possible discrepancies between
> those values.

Your characterizing of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

-- 
Peter Geoghegan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
>  wrote:
> > One thing I don't understand is why it is okay to freeze the xmax of a
> > dead tuple just because it is from an aborted update.
>
> We don't do that with XID-based xmaxs. Though perhaps we should, since
> we'll already prune-away the successor tuple, and so might as well go
> one tiny step further and clear the xmax for the original tuple via
> freezing/setting it InvalidTransactionId. Instead we just leave the
> original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call
heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

else if (TransactionIdIsNormal(xid))
{
/* Raw xmax is normal XID */
freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
}

And then later we

if (freeze_xmax)
frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

HeapTupleHeaderSetXmax(tuple, frz->xmax);

Which sets the xmax to InvalidTransactionId. Or am I missing something?

> > The only case in which we freeze dead tuples
> > with a non-multi xmax is if the xmax is from before OldestXmin and is
> > also not committed (so from an aborted update).
>
> Perhaps I misunderstand, but: we simply don't freeze DEAD (not
> RECENTLY_DEAD) tuples in the first place, because we don't have to
> (pruning removes them instead). It doesn't matter if they're DEAD due
> to being from aborted transactions or DEAD due to being
> deleted/updated by a transaction that committed (committed and <
> OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples.
HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax
of a tuple that has been deleted or updated by a transaction that
committed with InvalidTransactionId. And it seems like the code does
that? Why even call heap_prepare_freeze_tuple() on
HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact
freezing?

> The freezing related code paths in heapam.c don't particularly care
> whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
> Just as long as it's not fully DEAD (then it should have been pruned).

But it just seems like we shouldn't freeze RECENTLY_DEAD either.

- Melanie




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Joel Jacobson
On Mon, Jun 24, 2024, at 21:51, David G. Johnston wrote:
> On Mon, Jun 24, 2024 at 12:46 PM Joel Jacobson  wrote:
>> On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote:
>> 
>> > The page we link to uses "permissions" while we consistently use 
>> > "privileges" to describe the contents of the list.  This seems like an 
>> > obvious synonym, but as the point of these is to formally define 
>> > things, pointing this equivalence is worth considering.
>> 
>> I like this idea. How could this be implemented in the docs? Maybe a 
>> ... for ACL in acronyms.sgml?
>> 
>
> Add a second  under the one holding the link?

How about?

+ 
+  The linked page uses "permissions" while we consistently use the synonym
+  "privileges", to describe the contents of the list. For avoidance of
+  doubt and clarity, these two terms are equivalent in the
+  PostgreSQL documentation.
+ 

/Joel

v3-0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


Re: Partial aggregates pushdown

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 15:03, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I see. I maybe got your proposal.
> Refer to your proposal, for avg(int8),
> I create a new native type like state_int8_avg
> with the new typsend/typreceive functions
> and use them to transmit the state value, right?

Yes, that's roughly what I had in mind indeed.

> That might seem to be a more fundamental solution
> because I can avoid adding export/import functions of my proposal,
> which are the new components of aggregate function.
> I have never considered the solution.
> I appreciate your proposal.

Thank you :)

> However, I still have the following two questions.
>
> 1. Not necessary components of new native types
> Refer to pg_type.dat, typinput and typoutput are required.
> I think that in your proposal they are not necessary,
> so waste. I think that it is not acceptable.
> How can I resolve the problem?

I think requiring typinput/typoutput is a benefit personally, because
that makes it possible to do PARTIAL_AGGREGATE pushdown to a different
PG major version. Also it makes it easier to debug the partial
aggregate values when using psql/pgregress. So yes, it requires
implementing both binary (send/receive) and text (input/output)
serialization, but it also has some benefits. And in theory you might
be able to skip implementing the binary serialization, and rely purely
on the text serialization to send partial aggregates between servers.

> 2. Many new native types
> I think that, basically, each aggregate function does need a new native type.
> For example,
> avg(int8), avg(numeric), and var_pop(int4) has the same transtype, 
> PolyNumAggState.
> You said that it is enough to add only one native type like state_poly_num_agg
> for supporting them, right?

Yes, correct. That's what I had in mind.

> But the combine functions of them might have quite different expectation
> on the data items of PolyNumAggState like
> the range of N(means count) and the true/false of calcSumX2
> (the flag of calculating sum of squares).
> The final functions of them have the similar expectation.
> So, I think that, responded to your proposal,
> each of them need a native data type
> like state_int8_avg, state_numeric_avg, for safety.
>
> And, we do need a native type for an aggregate function
> whose transtype is not internal and not pseudo.
> For avg(int4), the current transtype is _int8.
> However, I do need a validation check on the number of the array
> And the positiveness of count(the first element of the array).
> Responded to your proposal,
> I do need a new native type like state_int4_avg.

To help avoid confusion let me try to restate what I think you mean
here: You're worried about someone passing in a bogus native type into
the final/combine functions and then getting crashes and/or wrong
results. With internal type people cannot do this because they cannot
manually call the combinefunc/finalfunc because the argument type is
internal. To solve this problem your suggestion is to make the type
specific to the specific aggregate such that send/receive or
input/output can validate the input as reasonable. But this would then
mean that we have many native types (and also many
deserialize/serialize functions).

Assuming that's indeed what you meant, that's an interesting thought,
I didn't consider that much indeed. My thinking was that we only need
to implement send/receive & input/output functions for these types,
and even though their meaning is very different we can serialize them
in the same way.

As you say though, something like that is already true for avg(int4)
today. The way avg(int4) handles this issue is by doing some input
validation for every call to its trans/final/combinefunc (see
int4_avg_accum, int4_avg_combine, and int8_avg). It checks the length
of the array there, but it doesn't check the positiveness of the
count. I think that makes sense. IMHO these functions only need to
protect against crashes (e.g. null pointer dereferences). But I don't
think there is a good reason for them to protect the user against
passing in weird data. These functions aren't really meant to be
called manually in the first place anyway, so if the user does that
and they pass in weird data then I'm fine with them getting a weird
result back, even errors are fine (only crashes are not).

So as long as our input/output & send/receive functions for
state_poly_num_agg handle all the inconsistencies that could cause
crashes later on (which I think is pretty simple to do for
PolyNumAggState), then I don't think we need state_int8_avg,
state_numeric_avg, etc.
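
To make the shape of that concrete, here is a rough sketch of what I mean,
written as extension-style SQL purely for illustration (the _in/_out/_recv/_send
function names are placeholders; in the actual patch this would presumably be a
built-in entry in pg_type.dat instead):

    CREATE TYPE state_poly_num_agg;

    CREATE FUNCTION state_poly_num_agg_in(cstring) RETURNS state_poly_num_agg
        AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;
    CREATE FUNCTION state_poly_num_agg_out(state_poly_num_agg) RETURNS cstring
        AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;
    CREATE FUNCTION state_poly_num_agg_recv(internal) RETURNS state_poly_num_agg
        AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;
    CREATE FUNCTION state_poly_num_agg_send(state_poly_num_agg) RETURNS bytea
        AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

    CREATE TYPE state_poly_num_agg (
        INPUT = state_poly_num_agg_in,
        OUTPUT = state_poly_num_agg_out,
        RECEIVE = state_poly_num_agg_recv,
        SEND = state_poly_num_agg_send,
        INTERNALLENGTH = VARIABLE
    );

That is, one shared transition-state type with both text and binary
serialization, which avg(int8), avg(numeric), var_pop(int4) and friends could
all use as their stype.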

> > Not a full review but some initial notes:
> Thank you. I don't have time today, so I'll answer after tomorrow.

Sure, no rush.




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
 wrote:
> One thing I don't understand is why it is okay to freeze the xmax of a
> dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxs. Though perhaps we should, since
we'll already prune-away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not
with the specific goal of cleaning up after aborts in mind (we can
also remove lockers that are no longer running, regardless of where
they are relative to OldestXmin, stuff like that). The actual goal
with that is to enforce MultiXactCutoff, independently of whether or
not their member XIDs are < FreezeLimit yet.

> The only case in which we freeze dead tuples
> with a non-multi xmax is if the xmax is from before OldestXmin and is
> also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

The freezing related code paths in heapam.c don't particularly care
whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread David E. Wheeler
On Jun 24, 2024, at 4:28 PM, Robert Haas  wrote:

> As long as the GUC is superuser-only, I'm not sure what else there is
> to do here. The only question is whether there's some reason to
> disallow this even from the superuser, but I'm not quite seeing such a
> reason.

I can switch it back from requiring a restart to allowing a superuser to set it.

>> I sketched them quickly, so agree they can be better. Reading the code, I 
>> now see that it appears to be the former case. I’d like to advocate for the 
>> latter.
> 
> Sounds good.

Yeah, though then I have a harder time deciding how it should work. pg_config’s 
paths are absolute. With your first example, we just use them exactly as they 
are, but prefix them with the destination directory. So if it’s set to 
`/my/temp/root/`, then files go into

/my/temp/root/$(pg_config --sharedir)
/my/temp/root/$(pg_config --pkglibdir)
/my/temp/root/$(pg_config --bindir)
# etc.

Which is exactly how RPM and Apt packages are built, but seems like an odd 
configuration for general use.

>> BINDIR
>> DOCDIR
>> HTMLDIR
>> PKGINCLUDEDIR
>> LOCALEDIR
>> MANDIR
>> 
>> I can imagine an extension wanting or needing to use any and all of these.
> 
> Are these really all relevant to backend code?

Oh I think so. Especially BINDIR; lots of extensions ship with binary 
applications. And most ship with docs, too (PGXS puts items listed in DOCS into 
DOCDIR). Some might also produce man pages (for their binaries), HTML docs, and 
other stuff. Maybe an FTE extension would include locale files?

I find it pretty easy to imagine use cases for all of them. So much so that I 
wrote an extension binary distribution RFC[1] and its POC[2] around them.

Best,

David

[1]: https://github.com/orgs/pgxn/discussions/2
[2]: https://justatheory.com/2024/06/trunk-poc/





Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
 wrote:
> Are you more concerned about having a single horizon for pruning or
> about having a horizon that does not move backwards after being
> established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed.

> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change. But, in writing this repro, it is actually quite
> hard to get GlobalVisState to update. Our backend's RecentXmin needs
> to have changed. And there aren't very many places where we take a new
> snapshot after starting to vacuum a relation. One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Nikolay Shaplov
Hi!

Let me join the review process.

I am no expert in execution plans, so I would not be much help in doing 
even better optimization. But I can read the code as a person who is not 
familiar with this area, and help make it clear even to a person like me.

So, I am reading v25-0001-Transform-OR-clauses-to-ANY-expression.patch that 
have been posted some time ago, and especially transform_or_to_any function.

> @@ -38,7 +45,6 @@
>  int  from_collapse_limit;
>  int  join_collapse_limit;
>
> - 
>  /*
>   * deconstruct_jointree requires multiple passes over the join tree,  
> because we 
>   * need to finish computing JoinDomains before we start  distributing quals.

Do not think that removing empty line should be part of the patch

> + /*
> +  * If the const node's (right side of operator 
> expression) type
> +  * don't have “true” array type, then we cannnot do 
> the
> +  * transformation. We simply concatenate the expression 
> node.
> +  */
Guess using unicode double quotes is not the best idea here... 
   
Now to the first part of  `transform_or_to_any`


signature.asc
Description: This is a digitally signed message part.


Re: Direct SSL connection and ALPN loose ends

2024-06-24 Thread Heikki Linnakangas

On 21/06/2024 02:32, Jacob Champion wrote:

> On Thu, Jun 20, 2024 at 4:13 PM Heikki Linnakangas  wrote:
>>> By "negotiation" I mean the server's response to the startup packet.
>>> I.e. "supported"/"not supported"/"error".
>>
>> Ok, I'm still a little confused, probably a terminology issue. The
>> server doesn't respond with "supported" or "not supported" to the
>> startup packet, that happens earlier. I think you mean the SSLRequest /
>> GSSRequest packet, which is sent *before* the startup packet?
>
> Yes, sorry. (I'm used to referring to those as startup packets too, ha.)

Yeah I'm not sure what the right term would be.

>> Hmm, right, GSS encryption was introduced in v12, and older versions
>> respond with an error to a GSSRequest.
>>
>> We probably could make the same assumption for GSS as we did for TLS in
>> a49fbaaf, i.e. that an error means that something's wrong with the
>> server, rather than that it's just very old and doesn't support GSS. But
>> the case for that is a lot weaker than with TLS. There are still
>> pre-v12 servers out there in the wild.
>
> Right. Since we default to gssencmode=prefer, if you have Kerberos
> creds in your environment, I think this could potentially break
> existing software that connects to v11 servers once you upgrade libpq.


When you connect to a V11 server and attempt to perform GSSAPI 
authentication, it will respond with a V3 error that says: "unsupported 
frontend protocol 1234.5680: server supports 2.0 to 3.0". That was a 
surprise to me until I tested it just now. I thought that it would 
respond with a protocol V2 error, but it is not so. The backend sets 
FrontendProtocol to 1234.5680 before sending the error, and because it 
is >= 3, the error is sent with protocol version 3.


Given that, I think it is a good thing to fail the connection completely 
on receiving a V2 error.


Attached is a patch to fix the other issue, with falling back from SSL 
to plaintext. And some tests and comment fixes I spotted while at it.


0001: A small comment fix
0002: This is the main patch that fixes the SSL fallback issue

0003: This adds fault injection tests to exercise these early error 
codepaths. It is not ready to be merged, as it contains a hack to skip 
locking. See thread at 
https://www.postgresql.org/message-id/e1ffb822-054e-4006-ac06-50532767f75b%40iki.fi.


0004: More tests, for what happens if the server sends an error after 
responding "yes" to the SSLRequest or GSSRequest, but before performing 
the SSL/GSS handshake.


Attached is also a little stand-alone perl program that listens on a 
socket, and when you connect to it, it immediately sends a V2 or V3 
error, depending on the argument. That's useful for testing. It could be 
used as an alternative strategy to the injection points I used in the 
0003-0004 patches, but for now I just used it for manual testing.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 4b988b1c74066fe8b5f98acb4ecd20150a47bc33 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 24 Jun 2024 20:41:41 +0300
Subject: [PATCH 1/4] Fix outdated comment after removal of direct SSL fallback

The option to fall back from direct SSL to negotiated SSL or a
plaintext connection was removed in commit fb5718f35f.
---
 src/interfaces/libpq/fe-connect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 071b1b34aa1..e003279fb6c 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3564,8 +3564,8 @@ keep_going:		/* We will come back to here until there is
 if (pollres == PGRES_POLLING_FAILED)
 {
 	/*
-	 * Failed direct ssl connection, possibly try a new
-	 * connection with postgres negotiation
+	 * SSL handshake failed.  We will retry with a plaintext
+	 * connection, if permitted by sslmode.
 	 */
 	CONNECTION_FAILED();
 }
-- 
2.39.2

From 7df581e902b76665de4c751ddbc1309143e6c0b8 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 24 Jun 2024 18:14:52 +0300
Subject: [PATCH 2/4] Fix fallback behavior when server sends an ERROR early at
 startup

With sslmode=prefer, the desired behavior is to completely fail the
connection attempt, *not* fall back to a plaintext connection, if the
server responds to the SSLRequest with an error ('E') response instead
of rejecting SSL with an 'N' response. This was broken in commit
05fd30c0e7.

Discussion: https://www.postgresql.org/message-id/CAOYmi%2Bnwvu21mJ4DYKUa98HdfM_KZJi7B1MhyXtnsyOO-PB6Ww%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e003279fb6c..360d9a45476 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3521,6 +3521,12 @@ keep_going:		/* We will come back to here until there is
 		 * byte here.
 		 */
 		conn->status = 

Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 3:37 PM David E. Wheeler  wrote:
> I guess the question then is what security controls are appropriate for this 
> feature, which after all tells the postmaster what directories to read files 
> from. It feels a little outside the scope of a regular user to even be aware 
> of the file system undergirding the service. But perhaps there’s a 
> non-superuser role for whom it is appropriate?

As long as the GUC is superuser-only, I'm not sure what else there is
to do here. The only question is whether there's some reason to
disallow this even from the superuser, but I'm not quite seeing such a
reason.

> > On the patch itself, I find the documentation for this to be fairly
> > hard to understand. I think it could benefit from an example. I'm
> > confused about whether this is intended to let me search for
> > extensions in /my/temp/root/usr/lib/postgresql/... by setting
> > extension_directory=/my/temp/dir, or whether it's intended me to
> > search both /usr/lib/postgresql as I normally would and also
> > /some/other/place.
>
> I sketched them quickly, so agree they can be better. Reading the code, I now 
> see that it appears to be the former case. I’d like to advocate for the 
> latter.

Sounds good.

> > If the latter, I wonder why we don't handle shared
> > libraries by setting dynamic_library_path and then just have an
> > analogue of that for control files.
>
> The challenge is that it applies not just to shared object libraries and 
> control files, but also extension SQL files and any other SHAREDIR files an 
> extension might include. But also, I think it should support all the 
> pg_config installation targets that extensions might use, including:
>
> BINDIR
> DOCDIR
> HTMLDIR
> PKGINCLUDEDIR
> LOCALEDIR
> MANDIR
>
> I can imagine an extension wanting or needing to use any and all of these.

Are these really all relevant to backend code?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Proposal: Document ABI Compatibility

2024-06-24 Thread David E. Wheeler
On Jun 19, 2024, at 05:41, Peter Eisentraut  wrote:

> This is probably a bit confusing.  This might as well mean client application 
> code against libpq.  Better something like "server plugin code that uses the 
> PostgreSQL server APIs".

That works.

> But now we're talking about API.  That might be subject of another document 
> or another section in this one, but it seems confusing to mix this with the 
> ABI discussion.

Hrm. They’re super closely-related in my mind, as an extension developer. I 
need to know both! I guess I’m thinking of this policy as what I can expect may 
be changed (and how to adapt to it) and what won’t.

That said, I’m fine to remove the API stuff if there’s consensus objecting to 
it, to be defined in a separate policy (perhaps on the same doc page).

>> PostgreSQL avoids unnecessary API changes in major releases, but usually 
>> ships a few necessary API changes, including deprecation, renaming, and 
>> argument variation.
> 
> Obviously, as a practical matter, there won't be random pointless changes.  
> But I wouldn't go as far as writing anything down about how these APIs are 
> developed.

Fair enough, was trying to give some idea of the sorts of changes. Don’t have 
to include them.

>> In such cases the incompatible changes will be listed in the Release Notes.
> 
> I don't think anyone is signing up to do that.

It needn’t be comprehensive. Just mention that an ABI or API changed in the 
release note item. Unless they almost *all* make such changes.

>> Minor Releases
>> --

> I think one major problem besides actively avoiding or managing such 
> minor-version ABI breaks is some automated detection.  Otherwise, this just 
> means "we try, but who knows”.

I think you *do* try, and the fact that there are so few issues means you 
succeed at that. I’m not advocating for an ABI guarantee here, just a 
description of the policy committees already follow.

Here’s an update based on all the feedback, framing things more from the 
perspective of “do I need to recompile this or change my code”. Many thanks!

``` md
ABI Policy
==

Changes to the the PostgreSQL server APIs may require recompilation of server 
plugin code that uses them. This policy describes the core team's approach to 
such changes, and what server API users can expect.

Major Releases
--

Applications that use server APIs must be compiled for each major release 
supported by the application. The inclusion of `PG_MODULE_MAGIC` ensures that 
code compiled for one major version will rejected by other major versions. 
Developers needing to support multiple versions of PostgreSQL with incompatible 
APIs should use the `PG_VERSION_NUM` constant to adjust code as appropriate. 
For example:

``` c
#if PG_VERSION_NUM >= 160000
#include "varatt.h"
#endif
```

The core team avoids unnecessary breakage, but users of the server APIs should 
expect and be prepared to make adjustments and recompile for every major 
release.

Minor Releases
--

PostgreSQL makes an effort to avoid server API and ABI breaks in minor 
releases. In general, an application compiled against any minor release will 
work with any other minor release, past or future. In the absence of automated 
detection of such changes, this is not a guarantee, but history such breaking 
changes have been extremely rare.

When a change *is* required, PostgreSQL will choose the least invasive change 
possible, for example by squeezing a new field into padding space or appending 
it to the end of a struct. This sort of change should not impact dependent 
applications unless they use `sizeof(the struct)` on a struct with an appended 
field, or create their own instances of such structs --- patterns best avoided.

In rare cases, however, even such non-invasive changes may be impractical or 
impossible. In such an event, the change will be documented in the Release 
Notes, and details on the issue will also be posted to [TBD; mail list? Blog 
post? News item?].

To minimize issues and catch changes early, the project strongly recommends 
that developers adopt continuous integration testing at least for the latest 
minor release of all major versions of Postgres they support.
```


Best,

David





Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 12:46 PM Joel Jacobson  wrote:

> On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote:
>
> > The page we link to uses "permissions" while we consistently use
> > "privileges" to describe the contents of the list.  This seems like an
> > obvious synonym, but as the point of these is to formally define
> > things, pointing this equivalence is worth considering.
>
> I like this idea. How could this be implemented in the docs? Maybe a
> ... for ACL in acronyms.sgml?
>
>
Add a second <para> under the one holding the link?

David J.


Re: Proposal: Document ABI Compatibility

2024-06-24 Thread David E . Wheeler
On Jun 24, 2024, at 14:51, Robert Haas  wrote:

> I suppose that it's true that we try to avoid gratuitous breakage, but
> I feel like it would be weird to document that.

I see how that can seem weird to a committer deeply familiar with the 
development process and how things happen. But people outside the -hackers 
bubble have very little idea. It’s fair to say it needn’t be a long statement 
for major versions: a single sentence such as “we try to avoid gratuitous 
breakage” is a perfectly reasonable framing. But I’d say, in the interest of 
completeness, it would be useful to document the policy for major release *as 
well as* minor releases.

Best,

David





Re: Proposal: Document ABI Compatibility

2024-06-24 Thread David E. Wheeler
On Jun 19, 2024, at 05:42, Peter Eisentraut  wrote:

>>> https://postgr.es/m/CAH2-Wzm-W6hSn71sUkz0Rem=qdeu7tnfmc7_jg2djrlfef_...@mail.gmail.com
>>> 
>>> Theoretically anybody can do this themselves. In practice they don't.
>>> So something as simple as providing automated reports about ABI
>>> changes might well move the needle here.
>> What would be required to make such a thing? Maybe it’d make a good Summer 
>> of Code project.
> 
> The above thread contains a lengthy discussion about what one could do.

I somehow missed that GSoC 2024 is already going with contributors. Making a 
mental note to add an item for 2025.

D





Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Joel Jacobson
On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote:
> On Mon, Jun 24, 2024 at 8:44 AM Nathan Bossart  
> wrote:
>> I think we could omit "i.e. privileges list."
>> 
>
> Agreed.  Between the docs and code we say "privileges list" once and 
> that refers to the dumputils description of the arguments to grant.  As 
> the acronym page now defines the term using fundamentals, introducing 
> another term not used elsewhere seems undesirable.

New version attached.

> Observations:
> We are referencing a disambiguation page.  We never actually spell out 
> ACL anywhere so we might as well just reference what Wikipedia believes 
> is the expected spelling.
>
> The page we link to uses "permissions" while we consistently use 
> "privileges" to describe the contents of the list.  This seems like an 
> obvious synonym, but as the point of these is to formally define 
> things, pointing this equivalence is worth considering.

I like this idea. How could this be implemented in the docs? Maybe a 
... for ACL in acronyms.sgml?

/Joel

v2-0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


Unusually long checkpoint time on version 16, and 17beta1 running in a docker container

2024-06-24 Thread Dave Cramer
Greetings,

While testing pgjdbc I noticed the following

pgdb-1  | Will execute command on database postgres:
pgdb-1  | SELECT pg_drop_replication_slot(slot_name) FROM
pg_replication_slots WHERE slot_name = 'replica_one';
pgdb-1  | DROP USER IF EXISTS replica_one;
pgdb-1  | CREATE USER replica_one WITH REPLICATION PASSWORD 'test';
pgdb-1  | SELECT * FROM
pg_create_physical_replication_slot('replica_one');
pgdb-1  |
pgdb-1  | NOTICE:  role "replica_one" does not exist, skipping
pgdb-1  |  pg_drop_replication_slot
pgdb-1  | --
pgdb-1  | (0 rows)
pgdb-1  |
pgdb-1  | DROP ROLE
pgdb-1  | CREATE ROLE
pgdb-1  |   slot_name  | lsn
pgdb-1  | -+-
pgdb-1  |  replica_one |
pgdb-1  | (1 row)
pgdb-1  |
pgdb-1  | waiting for checkpoint
pgdb-1  | 2024-06-24 19:07:18.569 UTC [66] LOG:  checkpoint starting: force
wait
pgdb-1  | 2024-06-24 19:11:48.008 UTC [66] LOG:  checkpoint complete: wrote
6431 buffers (39.3%); 0 WAL file(s) added, 0 removed, 3 recycled;
write=269.438 s, sync=0.001 s, total=269.439 s; sync files=0, longest=0.000
s, average=0.000 s; distance=44140 kB, estimate=44140 kB; lsn=0/4B8,
redo lsn=0/428


Note that it takes about four and a half minutes to do the checkpoint. This seems
ridiculously long?

If I add a checkpoint before doing anything there is no delay

 Will execute command on database postgres:
pgdb-1  | checkpoint;
pgdb-1  | SELECT pg_drop_replication_slot(slot_name) FROM
pg_replication_slots WHERE slot_name = 'replica_one';
pgdb-1  | DROP USER IF EXISTS replica_one;
pgdb-1  | CREATE USER replica_one WITH REPLICATION PASSWORD 'test';
pgdb-1  | SELECT * FROM
pg_create_physical_replication_slot('replica_one');
pgdb-1  |
pgdb-1  | 2024-06-24 19:19:57.498 UTC [66] LOG:  checkpoint starting:
immediate force wait
pgdb-1  | 2024-06-24 19:19:57.558 UTC [66] LOG:  checkpoint complete: wrote
6431 buffers (39.3%); 0 WAL file(s) added, 0 removed, 2 recycled;
write=0.060 s, sync=0.001 s, total=0.061 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=29947 kB, estimate=29947 kB; lsn=0/3223BA0, redo
lsn=0/3223B48
===> pgdb-1  | CHECKPOINT
pgdb-1  |  pg_drop_replication_slot
pgdb-1  | --
pgdb-1  | (0 rows)
pgdb-1  |
pgdb-1  | DROP ROLE
pgdb-1  | NOTICE:  role "replica_one" does not exist, skipping
pgdb-1  | CREATE ROLE
pgdb-1  |   slot_name  | lsn
pgdb-1  | -+-
pgdb-1  |  replica_one |
pgdb-1  | (1 row)
pgdb-1  |
pgdb-1  | waiting for checkpoint
pgdb-1  | 2024-06-24 19:19:57.614 UTC [66] LOG:  checkpoint starting: force
wait
pgdb-1  | 2024-06-24 19:19:57.915 UTC [66] LOG:  checkpoint complete: wrote
4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.301
s, sync=0.001 s, total=0.302 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=14193 kB, estimate=28372 kB; lsn=0/480, redo
lsn=0/428

This starts in version 16; versions up to and including 15 do not impose
the wait.


Dave Cramer


Re: RFC: Additional Directory for Extensions

2024-06-24 Thread David E. Wheeler
On Jun 24, 2024, at 1:53 PM, Robert Haas  wrote:

> Is "tighten up what the superuser can do" on our list of objectives?
> Personally, I think we should be focusing mostly, and maybe entirely,
> on letting non-superusers do more things, with appropriate security
> controls. The superuser can ultimately do anything, since they can
> cause shell commands to be run and load arbitrary code into the
> backend and write code in untrusted procedural languages and mutilate
> the system catalogs and lots of other terrible things.

I guess the question then is what security controls are appropriate for this 
feature, which after all tells the postmaster what directories to read files 
from. It feels a little outside the scope of a regular user to even be aware of 
the file system undergirding the service. But perhaps there’s a non-superuser 
role for whom it is appropriate?

> Now, I think there are environments where people have used things like
> containers to try to lock down the superuser, and while I'm not sure
> that can ever be particularly water-tight, if it were the case that
> this patch would make it a whole lot easier for a superuser to bypass
> the kinds of controls that people are imposing today, that might be an
> argument against this patch. But ... off-hand, I'm not seeing such an
> exposure.

Yeah I’m not even sure I follow. Containers are immutable, other than mutable 
mounted volumes --- which is one use case this patch is attempting to enable.

> On the patch itself, I find the documentation for this to be fairly
> hard to understand. I think it could benefit from an example. I'm
> confused about whether this is intended to let me search for
> extensions in /my/temp/root/usr/lib/postgresql/... by setting
> extension_directory=/my/temp/dir, or whether it's intended me to
> search both /usr/lib/postgresql as I normally would and also
> /some/other/place.

I sketched them quickly, so agree they can be better. Reading the code, I now 
see that it appears to be the former case. I’d like to advocate for the latter. 

> If the latter, I wonder why we don't handle shared
> libraries by setting dynamic_library_path and then just have an
> analogue of that for control files.

The challenge is that it applies not just to shared object libraries and 
control files, but also extension SQL files and any other SHAREDIR files an 
extension might include. But also, I think it should support all the pg_config 
installation targets that extensions might use, including:

BINDIR
DOCDIR
HTMLDIR
PKGINCLUDEDIR
LOCALEDIR
MANDIR

I can imagine an extension wanting or needing to use any and all of these.
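
For reference, the existing knob Robert mentions works like this (the paths
here are made up):

    ALTER SYSTEM SET dynamic_library_path = '/some/other/place:$libdir';
    SELECT pg_reload_conf();

But there is no equivalent today for control files, SQL scripts, or the other
directories listed above, which is why I think the feature has to cover more
than shared libraries.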

Best,

David







Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update). Freezing dead tuples
replaces their xmax with InvalidTransactionId -- which would make them
look alive. So, it makes sense we don't do this for dead tuples in the
common case. But why is it 1) okay and 2) desirable to freeze xmaxes
of tuples from aborted updates? Won't it make them look alive again?

- Melanie




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
 wrote:
> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

> One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

-- 
Peter Geoghegan




Re: PostgreSQL does not compile on macOS SDK 15.0

2024-06-24 Thread Stan Hu
It appears in macOS SDK 14.5, there were include guards in
$SDK_ROOT/usr/include/xlocale/_regex.h:

#ifndef _XLOCALE__REGEX_H_
#define _XLOCALE__REGEX_H_

#ifndef _REGEX_H_
#include <_regex.h>
#endif // _REGEX_H_
#include <_xlocale.h>

In macOS SDK 15.0, these include guards are gone:

#ifndef _XLOCALE__REGEX_H_
#define _XLOCALE__REGEX_H_

#include <_regex.h>
#include <__xlocale.h>

Since _REGEX_H_ was defined locally in PostgreSQL's version of
src/include/regex/regex.h, these include guards prevented duplicate
definitions from /usr/include/_regex.h (not to be confused with
/usr/include/xlocale/_regex.h).

If I hack the PostgreSQL src/include/regex/regex.h to include the double
underscore include guard of __REGEX_H_, the build succeeds:

```
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..734172167a 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -1,3 +1,6 @@
+#ifndef __REGEX_H_
+#define __REGEX_H_ /* never again */
+
 #ifndef _REGEX_H_
 #define _REGEX_H_ /* never again */
 /*
@@ -187,3 +190,5 @@ extern bool RE_compile_and_execute(text *text_re, char
*dat, int dat_len,
int nmatch, regmatch_t *pmatch);

 #endif /* _REGEX_H_ */
+
+#endif /* __REGEX_H_ */
```

Any better ideas here?


Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas  wrote:
>
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

> I agree with you that letting the pruning horizon move forward during
> vacuum is desirable. I'm just wondering if having the vacuum code need
> to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie




Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions

2024-06-24 Thread Jeff Davis
On Wed, 2024-06-19 at 08:53 +0530, Ashutosh Sharma wrote:
> For standalone functions, users can easily adjust the search_path
> settings as needed. However, managing this becomes challenging for
> functions created via extensions. Extensions are relocatable, making
> it difficult to determine and apply search_path settings in advance
> within the extension_name--*.sql file when defining functions or
> procedures.

A special search_path variable "$extension_schema" would be a better
solution to this problem. We need something like that anyway, in case
the extension is relocated, so that we don't have to dig through the
catalog and update all the settings.
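
For comparison, what an extension script can already do today is pin a
per-function search_path explicitly, e.g. (sketch only, the function body is a
placeholder):

    CREATE FUNCTION my_ext_fn() RETURNS int
    LANGUAGE sql SECURITY DEFINER
    SET search_path = pg_catalog, pg_temp
    AS 'SELECT 1';

The difficulty described above is that a relocatable extension cannot name its
own schema in such a setting ahead of time, which is the gap something like
$extension_schema would close.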

>  Introducing a new control flag option allows users to set
> implicit search_path settings for functions created by the extension,
> if needed. The main aim here is to enhance security for functions
> created by extensions by setting search_path at the function level.
> This ensures precise control over how objects are accessed within
> each
> function, making behavior more predictable and minimizing security
> risks,

That leaves me wondering whether it's being addressed at the right
level.

For instance, did you consider just having a GUC to control the default
search_path for a function? That may be too magical, but if so, why
would an extension-only setting be better?

> especially for SECURITY DEFINER functions associated with
> extensions created by superusers.

I'm not sure that it's specific to SECURITY DEFINER functions.

Regards,
Jeff Davis





Re: strange context message in spi.c?

2024-06-24 Thread Daniel Gustafsson
> On 24 Jun 2024, at 11:14, Stepan Neretin  wrote:
> 
> Hi! Looks good to me! 

Thanks for review.  I have this on my TODO for when the tree branches, it
doesn't seem like anything worth squeezing in before then.

--
Daniel Gustafsson





Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 2:28 PM Robert Haas  wrote:
> On Mon, Jun 24, 2024 at 1:47 PM Peter Geoghegan  wrote:
> > I agree, with the proviso that "avoid gratuitous failures" should
> > include cases where a query that got the optimization suddenly fails
> > to get the optimization, due only to some very innocuous looking
> > change. Such as a change from using a constant 1_000_000_000 to a
> > constant 5_000_000_000 in the query text. That is a POLA violation.
>
> Nope, I don't agree with that at all. If you imagine that we can
> either have the optimization apply to one of those cases on the other,
> or on the other hand we can have some cases that outright fail, I
> think it's entirely clear that the former is better.

I'm just saying that not having the optimization apply to a query very
similar to one where it does apply is a POLA violation. That's another
kind of failure, for all practical purposes. Weird performance cliffs
like that are bad. It's very easy to imagine code that generates a
query text, that at some point randomly and mysteriously gets a
sequential scan. Or a much less efficient index scan.

> I was assuming this patch shouldn't be changing the way indexes work
> at all, just making use of the facilities that we have today. More
> could be done, but that might make it harder to get anything
> committed.

I was just pointing out that there is currently no good way to make
nbtree efficiently execute a qual "WHERE a = 5 OR a IS NULL", which is
almost entirely (though not quite entirely) due to a lack of any way
of expressing that idea through SQL, in a way that'll get pushed down
to the index scan node. You can write "WHERE a = any('{5,NULL}')", of
course, but that doesn't treat NULL as just another array element to
match against via an IS NULL qual (due to NULL semantics).
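
A quick illustration of that, with a throwaway table just to show the
semantics:

    CREATE TABLE t (a int);
    INSERT INTO t VALUES (5), (NULL);

    SELECT * FROM t WHERE a = ANY('{5,NULL}'::int[]);   -- only the a = 5 row
    SELECT * FROM t WHERE a = 5 OR a IS NULL;           -- both rows

The first query can never return the NULL row, because a = NULL evaluates to
NULL rather than true.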

Yeah, this is nonessential. But it's quite a nice optimization, and
seems entirely doable within the framework of the patch. It would be a
natural follow-up.

All that I'd need on the nbtree side is to get an input scan key that
directly embodies "WHERE mycol = 5 OR mycol IS NULL". That would
probably just be a scan key with sk_flags "SK_SEARCHARRAY |
SK_SEARCHNULL", that was otherwise identical to the current
SK_SEARCHARRAY scan keys.

Adopting the nbtree array index scan code to work with this would be
straightforward. SK_SEARCHNULL scan keys basically already work like
regular equality scan keys at execution time, so all that this
optimization requires on the nbtree side is teaching
_bt_advance_array_keys to treat NULL as a distinct array condition
(evaluated as IS NULL, not as = NULL).

> It's even possible, in my mind at least, that the patch is already
> doing exactly the right things here. Even if it isn't, the problem
> doesn't seem to be fundamental, because if this example can work (and
> it does) then what the patch is trying to do should be workable, too.
> We just have to make sure we're plugging all the pieces properly
> together, and that we have comments adequately explain what is
> happening and test cases that verify it. My feeling is that the patch
> doesn't meet that standard today, but I think that just means it needs
> some more work. I'm not arguing we have to throw the whole thing out,
> or invent a lot of new infrastructure, or anything like that.

I feel like my point about the potential for POLA violations is pretty
much just common sense. I'm not particular about the exact mechanism
by which we avoid it; only that we avoid it.


--
Peter Geoghegan




Re: Proposal: Document ABI Compatibility

2024-06-24 Thread Robert Haas
On Mon, Jun 17, 2024 at 6:38 PM David E. Wheeler  wrote:
> Is it? ISTM that there is the intention not to break things that don’t need 
> to be broken, though that doesn’t rule out interface improvements.

I suppose that it's true that we try to avoid gratuitous breakage, but
I feel like it would be weird to document that.

Sometimes I go to a store and I see a sign that says "shoplifters will
be prosecuted." But I have yet to see a store with a sign that says
"people who appear to be doing absolutely nothing wrong will not be
prosecuted." If I did see such a sign, I would frankly be a little
concerned.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: improve predefined roles documentation

2024-06-24 Thread Robert Haas
On Fri, Jun 21, 2024 at 11:40 AM Nathan Bossart
 wrote:
> Done.

If you look at how the varlistentries begin, there are three separate patterns:

* Some document a single role and start with "Allow doing blah blah blah".

* Some document a couple of roles so there are several paragraphs,
each beginning with "name_of_role

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 1:47 PM Peter Geoghegan  wrote:
> I agree, with the proviso that "avoid gratuitous failures" should
> include cases where a query that got the optimization suddenly fails
> to get the optimization, due only to some very innocuous looking
> change. Such as a change from using a constant 1_000_000_000 to a
> constant 5_000_000_000 in the query text. That is a POLA violation.

Nope, I don't agree with that at all. If you imagine that we can
either have the optimization apply to one of those cases on the other,
or on the other hand we can have some cases that outright fail, I
think it's entirely clear that the former is better.

> Maybe it doesn't. My point was only that the B-Tree code doesn't
> necessarily need to use just one rhs type for the same column input
> opclass. The definition of SAOP works (or could work) in basically the
> same way, provided the "OR condition" were provably disjunct. We could
> for example mix different operators for the same nbtree scan key (with
> some work in nbtutils.c), just as we could support "where mycol =5 OR
> mycol IS NULL" with much effort.
>
> BTW, did you know MySQL has long supported the latter? It has a <=>
> operator, which is basically a non-standard spelling of IS NOT
> DISTINCT FROM. Importantly, it is indexable, whereas right now
> Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're
> interested in working on this problem within the scope of this patch,
> or some follow-up patch, I can take care of the nbtree side of things.

I was assuming this patch shouldn't be changing the way indexes work
at all, just making use of the facilities that we have today. More
could be done, but that might make it harder to get anything
committed.

Before we get too deep into arguing about hypotheticals, I don't think
there's any problem here that we can't solve with the infrastructure
we already have. For instance, consider this:

robert.haas=# explain select * from foo where a in (1, 1000000000000);
QUERY PLAN
---
 Seq Scan on foo1 foo  (cost=0.00..25.88 rows=13 width=36)
   Filter: (a = ANY ('{1,1000000000000}'::bigint[]))
(2 rows)

I don't know exactly what's happening here, but it seems very similar
to what we need to have happen for this patch to work. pg_typeof(1) is
integer, and pg_typeof(1000000000000) is bigint, and we're able to
figure out that it's OK to put both of those in an array of a single
type and without having any type conversion failures. If you replace
1000000000000 with 2, then the array ends up being of type
integer[] rather than type bigint[], so clearly the system is able to
reason its way through these kinds of scenarios already.
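
You can see the same type resolution directly with a quick illustrative check:

    SELECT pg_typeof(ARRAY[1, 1000000000000]);   -- bigint[]
    SELECT pg_typeof(ARRAY[1, 2]);               -- integer[]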

It's even possible, in my mind at least, that the patch is already
doing exactly the right things here. Even if it isn't, the problem
doesn't seem to be fundamental, because if this example can work (and
it does) then what the patch is trying to do should be workable, too.
We just have to make sure we're plugging all the pieces properly
together, and that we have comments adequately explain what is
happening and test cases that verify it. My feeling is that the patch
doesn't meet that standard today, but I think that just means it needs
some more work. I'm not arguing we have to throw the whole thing out,
or invent a lot of new infrastructure, or anything like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




PostgreSQL does not compile on macOS SDK 15.0

2024-06-24 Thread Stan Hu
Xcode 16 beta was released on 2024-06-10 and ships with macOS SDK 15.0
[1]. It appears PostgreSQL does not compile due to typedef redefiniton
errors in the regex library:

/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang
-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new -Wendif-labels
-Wmissing-format-attribute -Wcast-function-type -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-unused-command-line-argument -Wno-compound-token-split-by-macro
-Wno-cast-function-type-strict -O2 -I../../../src/include  -isysroot
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk
   -c -o regcomp.o regcomp.c
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:107:24:
error: typedef redefinition with different types ('__darwin_off_t'
(aka 'long long') vs 'long')
  107 | typedef __darwin_off_t regoff_t;
  |^
../../../src/include/regex/regex.h:48:14: note: previous definition is here
   48 | typedef long regoff_t;
  |  ^
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:114:3:
error: typedef redefinition with different types ('struct regex_t' vs
'struct regex_t')
  114 | } regex_t;
  |   ^
../../../src/include/regex/regex.h:82:3: note: previous definition is here
   82 | } regex_t;
  |   ^
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:119:3:
error: typedef redefinition with different types ('struct regmatch_t'
vs 'struct regmatch_t')
  119 | } regmatch_t;
  |   ^
../../../src/include/regex/regex.h:89:3: note: previous definition is here
   89 | } regmatch_t;
  |   ^
3 errors generated.
make[3]: *** [regcomp.o] Error 1
make[2]: *** [regex-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2

I've reproduced this issue by:

1. Download the XCode 16 beta 2 ZIP file:
https://developer.apple.com/services-account/download?path=/Developer_Tools/Xcode_16_beta/Xcode_16_beta.xip
2. Extract this to `/tmp`.
3. Then I ran:

export PATH=/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin:$PATH
export SDKROOT=/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk
export XCODE_DIR=/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain
export CC="$XCODE_DIR/usr/bin/clang"
export CXX="$XCODE_DIR/usr/bin/clang++"

./configure CC="$CC" CXX="$CXX"
make

The compilation goes through if I comment out the "#include
<xlocale/_regex.h>" from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h.
However, even on macOS SDK 14.5 I see that include statement. I'm
still trying to figure out what changed here.

[1] - https://developer.apple.com/macos/




Re: Apparent bug in WAL summarizer process (hung state)

2024-06-24 Thread Israel Barth Rubio
I'm attaching the files which I missed in the original email.

>
19:34:17.437626 epoll_wait(5, [], 1, 8161) = 0 <8.171542>
19:34:25.610176 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.000334>
19:34:25.611012 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.000323>
19:34:25.611657 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.000406>
19:34:25.612327 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:25.612"..., 
132) = 132 <0.000203>
19:34:25.612873 epoll_wait(5, [], 1, 1) = 0 <10.010950>
19:34:35.623942 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.12>
19:34:35.624120 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.18>
19:34:35.624191 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.07>
19:34:35.624237 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:35.624"..., 
132) = 132 <0.52>
19:34:35.624341 epoll_wait(5, [], 1, 1) = 0 <10.010941>
19:34:45.635408 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.16>
19:34:45.635602 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.62>
19:34:45.635765 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.09>
19:34:45.635830 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:45.635"..., 
132) = 132 <0.000495>
19:34:45.636390 epoll_wait(5, [], 1, 1) = 0 <10.008785>
19:34:55.645305 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.16>
19:34:55.645520 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.32>
19:34:55.645645 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.27>
19:34:55.646214 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:55.645"..., 
132) = 132 <0.000136>
19:34:55.646445 epoll_wait(5, [], 1, 1) = 0 <10.011731>
19:35:05.658366 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.18>
19:35:05.658591 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.28>
19:35:05.658689 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.10>
19:35:05.658750 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:35:05.658"..., 
132) = 132 <0.000521>
19:35:05.659335 epoll_wait(5,  
#0  0x7fb51c50e1da in epoll_wait (epfd=5, events=0x1409fd8, maxevents=1, 
timeout=timeout@entry=1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00922bad in WaitEventSetWaitBlock (nevents=1, 
occurred_events=0x7ffe5492aba0, cur_timeout=<optimized out>, set=0x1409f70) at 
storage/ipc/latch.c:1570
#2  WaitEventSetWait (set=0x1409f70, timeout=1, 
occurred_events=0x7ffe5492aba0, nevents=1, wait_event_info=<optimized out>) at 
storage/ipc/latch.c:1516
#3  0x00920e65 in WaitLatch (latch=<optimized out>, 
wakeEvents=<optimized out>, timeout=<optimized out>, wait_event_info=150994953) 
at storage/ipc/latch.c:538
#4  0x008aa7e4 in WalSummarizerMain (startup_data=<optimized out>, 
startup_data_len=<optimized out>) at postmaster/walsummarizer.c:317
#5  0x008a46a0 in postmaster_child_launch (client_sock=0x0, 
startup_data_len=0, startup_data=0x0, child_type=B_WAL_SUMMARIZER) at 
postmaster/launch_backend.c:265
#6  postmaster_child_launch (client_sock=0x0, startup_data_len=0, 
startup_data=0x0, child_type=B_WAL_SUMMARIZER) at 
postmaster/launch_backend.c:226
#7  StartChildProcess (type=type@entry=B_WAL_SUMMARIZER) at 
postmaster/postmaster.c:3928
#8  0x008a4dcf in MaybeStartWalSummarizer () at 
postmaster/postmaster.c:4075
#9  MaybeStartWalSummarizer () at postmaster/postmaster.c:4070
#10 0x008a7f8f in ServerLoop () at postmaster/postmaster.c:1746
#11 0x0089f5e0 in PostmasterMain (argc=3, argv=0x14097b0) at 
postmaster/postmaster.c:1372
#12 0x005381aa in main (argc=3, argv=0x14097b0) at main/main.c:197
No symbol "full" in current context.
#0  0x7fb51c50e1da in epoll_wait (epfd=5, events=0x1409fd8, maxevents=1, 
timeout=timeout@entry=1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
sc_ret = -4
sc_ret = 
#1  0x00922bad in WaitEventSetWaitBlock (nevents=1, 
occurred_events=0x7ffe5492aba0, cur_timeout=, set=0x1409f70) at 
storage/ipc/latch.c:1570
returned_events = 0
rc = 
cur_event = 
cur_epoll_event = 
returned_events = 
rc = 
cur_event = 
cur_epoll_event = 
__func__ = { }
__errno_location = 
#2  WaitEventSetWait (set=0x1409f70, timeout=1, 
occurred_events=0x7ffe5492aba0, nevents=1, wait_event_info=) at 
storage/ipc/latch.c:1516
rc = 
returned_events = 0
start_time = {ticks = 49310570173263}
cur_time = {ticks = }
cur_timeout = 
#3  

Apparent bug in WAL summarizer process (hung state)

2024-06-24 Thread Israel Barth Rubio
Hello,

Hope you are doing well.

I've been playing a bit with the incremental backup feature which might come
as part of the 17 release, and I think I hit a possible bug in the WAL
summarizer process.

The issue that I face refers to the summarizer process getting into a hung
state. When the issue is triggered, it gets stuck in an infinite loop trying to
process a WAL file that no longer exists.  It apparently comes up only when I
perform changes to the `summarize_wal` GUC and reload Postgres, while there is
some load in Postgres which makes it recycle WAL files.

I'm running Postgres 17 in a Rockylinux 9 VM. In order to have fewer WAL files
available in `pg_wal` and make it easier to reproduce the issue, I'm using a low
value for `max_wal_size` ('100MB'). You can find below the steps that I took to
reproduce this problem, assuming this small `max_wal_size`, and `summarize_wal`
initially enabled:

```bash
# Assume we initially have max_wal_size = '100MB' and summarize_wal = on

# Create a table of ~ 100MB
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_1

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest

# Disable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO off"
psql -c "SELECT pg_reload_conf()"

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Re-enable sumarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO on"
psql -c "SELECT pg_reload_conf()"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_2

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
```

I'm able to reproduce the issue most of the time when running these steps
manually. It's harder to reproduce if I attempt to run those commands as a
bash script.
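
For anyone trying this, the summarizer's progress can also be watched from SQL
while the issue reproduces; a minimal sketch, assuming PG 17's monitoring
functions pg_get_wal_summarizer_state() and pg_available_wal_summaries():

```sql
-- When the bug hits, pending_lsn stops advancing even though new WAL keeps
-- being generated, and no new summary files show up.
SELECT * FROM pg_get_wal_summarizer_state();
SELECT * FROM pg_available_wal_summaries() ORDER BY end_lsn DESC LIMIT 5;
```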

This is the sample output of a run of those commands:

```console

(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_1
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331785/331785 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
111263/331720 kB (33%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO off"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
 t
(1 row)
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO on"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
 t
(1 row)
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_2
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331734/331734 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
WARNING:  still waiting for WAL summarization through 2/C128 after 10 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 20 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 30 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 40 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 50 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 60 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and
```

Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 1:46 PM Peter Geoghegan  wrote:
> BTW, did you know MySQL has long supported the latter? It has a <=>
> operator, which is basically a non-standard spelling of IS NOT
> DISTINCT FROM. Importantly, it is indexable, whereas right now
> Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're
> interested in working on this problem within the scope of this patch,
> or some follow-up patch, I can take care of the nbtree side of things.

To be clear, I meant that we could easily support "where mycol = 5 OR
mycol IS NULL" and have nbtree handle that efficiently, by making it a
SAOP internally. Separately, we could also make IS NOT DISTINCT FROM
indexable, though that probably wouldn't need any work in nbtree.

-- 
Peter Geoghegan




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Robert Haas
On Wed, Apr 3, 2024 at 3:13 AM Alvaro Herrera  wrote:
> I support the idea of there being a second location from where to load
> shared libraries ... but I don't like the idea of making it
> runtime-configurable.  If we want to continue to tighten up what
> superuser can do, then one of the things that has to go away is the
> ability to load shared libraries from arbitrary locations
> (dynamic_library_path).  I think we should instead look at making those
> locations hardcoded at compile time.  The packager can then decide where
> those things go, and the superuser no longer has the ability to load
> arbitrary code from arbitrary locations.

Is "tighten up what the superuser can do" on our list of objectives?
Personally, I think we should be focusing mostly, and maybe entirely,
on letting non-superusers do more things, with appropriate security
controls. The superuser can ultimately do anything, since they can
cause shell commands to be run and load arbitrary code into the
backend and write code in untrusted procedural languages and mutilate
the system catalogs and lots of other terrible things.

Now, I think there are environments where people have used things like
containers to try to lock down the superuser, and while I'm not sure
that can ever be particularly water-tight, if it were the case that
this patch would make it a whole lot easier for a superuser to bypass
the kinds of controls that people are imposing today, that might be an
argument against this patch. But ... off-hand, I'm not seeing such an
exposure.

On the patch itself, I find the documentation for this to be fairly
hard to understand. I think it could benefit from an example. I'm
confused about whether this is intended to let me search for
extensions in /my/temp/root/usr/lib/postgresql/... by setting
extension_directory=/my/temp/dir, or whether it's intended me to
search both /usr/lib/postgresql as I normally would and also
/some/other/place. If the latter, I wonder why we don't handle shared
libraries by setting dynamic_library_path and then just have an
analogue of that for control files.
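
For reference, a minimal sketch of how the existing shared-library mechanism
is driven today (directory names made up; the control-file analogue is the
thing being wished for, not something that exists):

```sql
-- dynamic_library_path is a superuser-settable GUC listing the directories
-- searched for loadable modules; $libdir stands for the compiled-in path.
SHOW dynamic_library_path;                                    -- typically '$libdir'
ALTER SYSTEM SET dynamic_library_path = '$libdir:/opt/site-extensions/lib';
SELECT pg_reload_conf();
```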

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 1:29 PM Robert Haas  wrote:
> I am not against handling this kind of case if we can do it, but it's
> more important that the patch doesn't cause gratuitous failures than
> that it handles more cases.

I agree, with the proviso that "avoid gratuitous failures" should
include cases where a query that got the optimization suddenly fails
to get the optimization, due only to some very innocuous looking
change. Such as a change from using a constant 1_000_000_000 to a
constant 5_000_000_000 in the query text. That is a POLA violation.

> > Maybe it would be practical to do something with the B-Tree operator
> > class for each of the types involved in the optimization. You could
> > probably find a way for a SAOP to work against a
> > "heterogeneously-typed array" while still getting B-Tree index scans
> > -- provided the types all came from the same operator family. I'm
> > assuming that the index has an index column whose input opclass was a
> > member of that same family. That would necessitate changing the
> > general definition of SAOP, and adding new code to nbtree that worked
> > with that. But that seems doable.
>
> I agree that something based on operator families might be viable. Why
> would that require changing the definition of an SAOP?

Maybe it doesn't. My point was only that the B-Tree code doesn't
necessarily need to use just one rhs type for the same column input
opclass. The definition of SAOP works (or could work) in basically the
same way, provided the "OR condition" were provably disjunct. We could
for example mix different operators for the same nbtree scan key (with
some work in nbtutils.c), just as we could support "where mycol = 5 OR
mycol IS NULL" with much effort.

BTW, did you know MySQL has long supported the latter? It has a <=>
operator, which is basically a non-standard spelling of IS NOT
DISTINCT FROM. Importantly, it is indexable, whereas right now
Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're
interested in working on this problem within the scope of this patch,
or some follow-up patch, I can take care of the nbtree side of things.
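
For readers following along, the construct in question is just the SQL
standard's NULL-safe equality; a quick sketch of the semantics:

```sql
-- Plain = yields NULL when either side is NULL; IS NOT DISTINCT FROM
-- treats two NULLs as equal.
SELECT NULL::int = NULL;                     -- NULL, not false
SELECT NULL::int IS NOT DISTINCT FROM NULL;  -- true
SELECT 5 IS NOT DISTINCT FROM NULL;          -- false
SELECT 5 IS NOT DISTINCT FROM 5;             -- true
```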

> > Admittedly I'm glossing over a lot of important details here. Does it
> > just work for the default opclass for the type, or can we expect it to
> > work with a non-default opclass when that's the salient opclass (the
> > one used by our index)? I don't know what you'd do about stuff like
> > that.
>
> It seems to me that it just depends on the opclasses in the query. If
> the user says
>
> WHERE column op1 const1 AND column op2 const2
>
> ...then if op1 and op2 are in the same operator family and if we can
> convert one of const1 and const2 to the type of the other without risk
> of failure, then we can rewrite this as an SAOP with whichever of the
> two operators pertains to the target type, e.g.
>
> column1 op1 ANY[const1,converted_const2]
>
> I don't think the default opclass matters here, or the index properties 
> either.

Okay, good.

The docs do say "Another requirement for a multiple-data-type family
is that any implicit or binary-coercion casts that are defined between
data types included in the operator family must not change the
associated sort ordering" [1]. There must be precedent for this sort
of thing. Probably for merge joins.

[1] https://www.postgresql.org/docs/devel/btree.html#BTREE-BEHAVIOR
-- 
Peter Geoghegan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas  wrote:
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

-- 
Peter Geoghegan




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 12:09 PM Peter Geoghegan  wrote:
> But what about cases like this:
>
> SELECT * FROM mytable WHERE columna = 1_000_000_000 or columna =
> 5_000_000_000; -- columna is int4
>
> This is using two types, of course. 1_000_000_000 is int4, while
> 5_000_000_000 is bigint. If the transformation suddenly failed to work
> when a constant above INT_MAX was used for the first time, then I'd
> say that that's pretty surprising. That's what happens currently if
> you write the same query as "WHERE columna =
> any('{1_000_000_000,5_000_000_000}')", due to the way the coercion
> works. That seems less surprising to me, because the user is required
> to construct their own array, and users expect arrays to always have
> one element type.

I am not against handling this kind of case if we can do it, but it's
more important that the patch doesn't cause gratuitous failures than
that it handles more cases.

> Maybe it would be practical to do something with the B-Tree operator
> class for each of the types involved in the optimization. You could
> probably find a way for a SAOP to work against a
> "heterogeneously-typed array" while still getting B-Tree index scans
> -- provided the types all came from the same operator family. I'm
> assuming that the index has an index column whose input opclass was a
> member of that same family. That would necessitate changing the
> general definition of SAOP, and adding new code to nbtree that worked
> with that. But that seems doable.

I agree that something based on operator families might be viable. Why
would that require changing the definition of an SAOP?

> Admittedly I'm glossing over a lot of important details here. Does it
> just work for the default opclass for the type, or can we expect it to
> work with a non-default opclass when that's the salient opclass (the
> one used by our index)? I don't know what you'd do about stuff like
> that.

It seems to me that it just depends on the opclasses in the query. If
the user says

WHERE column op1 const1 AND column op2 const2

...then if op1 and op2 are in the same operator family and if we can
convert one of const1 and const2 to the type of the other without risk
of failure, then we can rewrite this as an SAOP with whichever of the
two operators pertains to the target type, e.g.

column1 op1 ANY[const1,converted_const2]

I don't think the default opclass matters here, or the index properties either.
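
To make that concrete with Peter's constants from upthread, a sketch of the
intended outcome (not of anything the planner does today):

```sql
-- Original query; columna is int4, 5000000000 is int8.
SELECT * FROM mytable WHERE columna = 1000000000 OR columna = 5000000000;
-- Conceptual result of the rewrite: one SAOP, with the int4 constant
-- widened to int8 because that conversion cannot fail.
SELECT * FROM mytable WHERE columna = ANY (ARRAY[1000000000, 5000000000]::int8[]);
```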

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> The problem here is that OldestXmin is supposed to be more
> conservative than vistest, which it almost always is, except in this
> one edge case. I don't think that plugging that hole changes the basic
> fact that there is one source of truth about what *needs* to be
> pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 11:44 AM Robert Haas  wrote:
> I don't have a great feeling about this fix. It's not that I think
> it's wrong. It's just that the underlying problem here is that we have
> heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
> struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
> of deciding what gets pruned, but that doesn't actually work, because
> as I pointed out in
> http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com
> it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
> fix is to consider both variables, which again may be totally correct,
> but wouldn't it be a lot better if we didn't have two variables
> fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

-- 
Peter Geoghegan




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 01:52:26PM -0400, David E. Wheeler wrote:
> I realize this probably isn't going to happen for 17, given the freeze,
> but I would very much welcome feedback and pointers to address concerns
> about providing a second directory for extensions and DSOs. Quite a few
> people have talked about the need for this in the Extension Mini
> Summits[1], so I'm sure I could get some collaborators to make
> improvements or look at a different approach.

At first glance, the general idea seems reasonable to me.  I'm wondering
whether there is a requirement for this directory to be prepended or if it
could be appended to the end.  That way, the existing ones would take
priority, which might be desirable from a security standpoint.

-- 
nathan




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 11:28 AM Robert Haas  wrote:
> > It needs to transform all similar constants to one type, because some 
> > constants of "OR" expressions can belong others, like the numeric and int 
> > types. Due to the fact that array structure demands that all types must be 
> > belonged to one type, so for this reason we applied this procedure.
>
> The alternative that should be considered is not combining things if
> the types don't match. If we're going to combine such things, we need
> to be absolutely certain that type conversion cannot fail.

But what about cases like this:

SELECT * FROM mytable WHERE columna = 1_000_000_000 or columna =
5_000_000_000; -- columna is int4

This is using two types, of course. 1_000_000_000 is int4, while
5_000_000_000 is bigint. If the transformation suddenly failed to work
when a constant above INT_MAX was used for the first time, then I'd
say that that's pretty surprising. That's what happens currently if
you write the same query as "WHERE columna =
any('{1_000_000_000,5_000_000_000}')", due to the way the coercion
works. That seems less surprising to me, because the user is required
to construct their own array, and users expect arrays to always have
one element type.

It would probably be okay to make the optimization not combine
things/not apply when the user gratuitously mixes different syntaxes.
For example, if a numeric constant was used, rather than an integer
constant.

Maybe it would be practical to do something with the B-Tree operator
class for each of the types involved in the optimization. You could
probably find a way for a SAOP to work against a
"heterogeneously-typed array" while still getting B-Tree index scans
-- provided the types all came from the same operator family. I'm
assuming that the index has an index column whose input opclass was a
member of that same family. That would necessitate changing the
general definition of SAOP, and adding new code to nbtree that worked
with that. But that seems doable.

I was already thinking about doing something like this, to support
index scans for "IS NOT DISTINCT FROM", or on constructs like "columna
= 5 OR columna IS NULL". That is more or less a SAOP with two values,
except that one of the values is the value NULL. I've already
implemented "nbtree SAOPs where one of the elements is a NULL" for
skip scan, which could be generalized to support these other cases.

Admittedly I'm glossing over a lot of important details here. Does it
just work for the default opclass for the type, or can we expect it to
work with a non-default opclass when that's the salient opclass (the
one used by our index)? I don't know what you'd do about stuff like
that.

-- 
Peter Geoghegan




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 8:44 AM Nathan Bossart 
wrote:

> On Mon, Jun 24, 2024 at 02:32:27PM +0200, Joel Jacobson wrote:
> > This patch is based on a suggestion from a separate thread [1]:
> >
> > On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote:
> >> Rather unrelated to this patch, still this patch makes the situation
> >> more complicated in the docs, but wouldn't it be better to add ACL as
> >> a term in acronyms.sql, and reuse it here?  It would be a doc-only
> >> patch that applies on top of the rest (could be on a new thread of its
> >> own), with some  markups added where needed.
>
> Sounds reasonable to me.
>

+1


> +  https://en.wikipedia.org/wiki/Access_Control_List;>Access
> Control List, i.e. privileges list
>
> I think we could omit "i.e. privileges list."
>
>
Agreed.  Between the docs and code we say "privileges list" once and that
refers to the dumputils description of the arguments to grant.
acronym page now defines the term using fundamentals, introducing another
term not used elsewhere seems undesirable.

Observations:
We are referencing a disambiguation page.  We never actually spell out ACL
anywhere so we might as well just reference what Wikipedia believes is the
expected spelling.

The page we link to uses "permissions" while we consistently use
"privileges" to describe the contents of the list.  This seems like an
obvious synonym, but as the point of these is to formally define things,
pointing out this equivalence is worth considering.
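
For anyone following along, the thing both terms describe is the aclitem[]
stored in the catalogs; a quick sketch (object names made up):

```sql
CREATE TABLE acl_demo (i int);
GRANT SELECT ON acl_demo TO PUBLIC;
-- relacl is the ACL, i.e. the "privileges list", as an aclitem[] value.
SELECT relacl FROM pg_class WHERE relname = 'acl_demo';
```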

David J.


Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Matthias van de Meent
On Mon, 24 Jun 2024 at 14:42, Jelte Fennema-Nio  wrote:
>
> On Mon, 24 Jun 2024 at 13:02, Matthias van de Meent
>  wrote:
> > It does not really behave similar: index scan keys (such as the
> > id3=101 scankey) don't require visibility checks in the btree code,
> > while the Filter condition _does_ require a visibility check, and
> > delegates the check to the table AM if the scan isn't Index-Only, or
> > if the VM didn't show all-visible during the check.
>
> Any chance you could point me in the right direction for the
> code/docs/comment about this? I'd like to learn a bit more about why
> that is the case, because I didn't realize visibility checks worked
> differently for index scan keys and Filter keys.

This can be derived by combining how Filter works (it only filters the
returned live tuples) and how Index-Only scans work (return the index
tuple, unless !ALL_VISIBLE, in which case the heap tuple is
projected). There have been several threads more or less recently that
also touch this topic and closely related topics, e.g. [0][1].
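
As a rough illustration of that distinction (plan shape depends on statistics
and settings; the table and column names are made up to match the example
upthread):

```sql
CREATE TABLE t (id1 int, id2 int, id3 int);
CREATE INDEX t_incl ON t (id1) INCLUDE (id3);
-- EXPLAIN SELECT id1, id3 FROM t WHERE id1 = 1 AND id3 = 101;
--   Index Only Scan using t_incl on t
--     Index Cond: (id1 = 1)
--     Filter: (id3 = 101)
-- The INCLUDE-d id3 = 101 clause is reported as "Filter" and is only checked
-- once the tuple's visibility is known; had id3 been a key column, the
-- clause would have appeared inside "Index Cond" instead.
```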

> > Furthermore, the index could use the scankey to improve the number of
> > keys to scan using "skip scans"; by realising during a forward scan
> > that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you
> > can skip forward to (1, 3, _), rather than having to go through tuples
> > (1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for
> > INCLUDE-d columns, because their datatypes and structure are opaque to
> > the index AM; the AM cannot assume anything about or do anything with
> > those values.
>
> Does Postgres actually support this currently? I thought skip scans
> were not available (yet).

Peter Geoghegan has been working on it as a project after PG17's
IN()-list improvements were committed, and I hear he has the basics
working but the further details need fleshing out.

> > As you can see, there's a huge difference in performance. Putting both
> > non-bound and "normal" filter clauses in the same Filter clause will
> > make it more difficult to explain performance issues based on only the
> > explain output.
>
> Fair enough, that's of course the main point of this patch in the
> first place: being able to better interpret the explain plan when you
> don't have access to the schema. Still I think Filter is the correct
> keyword for both, so how about we make it less confusing by making the
> current "Filter" more specific by calling it something like "Non-key
> Filter" or "INCLUDE Filter" and then call the other something like
> "Index Filter" or "Secondary Bound Filter".

I'm not sure how debuggable explain plans are without access to the
schema, especially when VERBOSE isn't configured, so I would be
hesitant to accept that as an argument here.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU%3D%40yamlcoder.me
[1] 
https://www.postgresql.org/message-id/flat/cf85f46f-b02f-05b2-5248-5000b894ebab%40enterprisedb.com




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Nathan Bossart
On Mon, Jun 24, 2024 at 02:32:27PM +0200, Joel Jacobson wrote:
> This patch is based on a suggestion from a separate thread [1]:
> 
> On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote:
>> Rather unrelated to this patch, still this patch makes the situation
>> more complicated in the docs, but wouldn't it be better to add ACL as
>> a term in acronyms.sql, and reuse it here?  It would be a doc-only
>> patch that applies on top of the rest (could be on a new thread of its
>> own), with some  markups added where needed.

Sounds reasonable to me.

+  https://en.wikipedia.org/wiki/Access_Control_List;>Access 
Control List, i.e. privileges list

I think we could omit "i.e. privileges list."

-- 
nathan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE

2024-06-24 Thread Tomas Vondra
On 6/24/24 17:14, Nathan Bossart wrote:
> On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote:
>> The important observation is that this only happens if a database is
>> created while the backup is running, and that it only happens with the
>> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
>> default since PG15).
> 
> My first thought is that this sounds related to the large comment in
> CreateDatabaseUsingFileCopy():
> 
>   /*
>* We force a checkpoint before committing.  This effectively means that
>* committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to 
> be
>* replayed (at least not in ordinary crash recovery; we still have to
>* make the XLOG entry for the benefit of PITR operations). This avoids
>* two nasty scenarios:
>*
>* #1: When PITR is off, we don't XLOG the contents of newly created
>* indexes; therefore the drop-and-recreate-whole-directory behavior of
>* DBASE_CREATE replay would lose such indexes.
>*
>* #2: Since we have to recopy the source database during DBASE_CREATE
>* replay, we run the risk of copying changes in it that were committed
>* after the original CREATE DATABASE command but before the system 
> crash
>* that led to the replay.  This is at least unexpected and at worst 
> could
>* lead to inconsistencies, eg duplicate table names.
>*
>* (Both of these were real bugs in releases 8.0 through 8.0.3.)
>*
>* In PITR replay, the first of these isn't an issue, and the second is
>* only a risk if the CREATE DATABASE and subsequent template database
>* change both occur while a base backup is being taken. There doesn't
>* seem to be much we can do about that except document it as a
>* limitation.
>*
>* See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
>* strategy that avoids these problems.
>*/
> 

Perhaps. The mentioned risks certainly seem like they might be related to
the issues I'm observing.

>> I don't recall any reports of similar issues from pre-15 releases, where
>> FILE_COPY was the only available option - I'm not sure why is that.
>> Either it didn't have this issue back then, or maybe people happen to
>> not create databases concurrently with a backup very often. It's a race
>> condition / timing issue, essentially.
> 
> If it requires concurrent activity on the template database, I wouldn't be
> surprised at all that this is rare.
> 

Right. Although, "concurrent" here means a somewhat different thing.
AFAIK there can't be any changes concurrent with the CREATE DATABASE
directly, because we make sure there are no connections:

createdb: error: database creation failed: ERROR:  source database
"test" is being accessed by other users
DETAIL:  There is 1 other session using the database.

But per the comment, it'd be a problem if there is activity after the
database gets copied, but before the backup completes (which is where
the replay will happen).
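
(For anyone reproducing this: the strategy is selectable per command, so both
code paths are easy to exercise against the same template; a sketch using the
"test" template from above.)

```sql
CREATE DATABASE db_wallog   TEMPLATE test STRATEGY = wal_log;    -- default since PG 15
CREATE DATABASE db_filecopy TEMPLATE test STRATEGY = file_copy;  -- the case showing the issue
```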

>> I see there have been a couple threads proposing various improvements to
>> FILE_COPY, that might make it more efficient/faster, namely using the
>> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
>> something that's (maybe) faster but not quite correct does not seem like
>> a winning strategy to me ...
>>
>> Alternatively, if we don't have clear desire to fix it, maybe the right
>> solution would be get rid of it?
> 
> It would be unfortunate if we couldn't use this for pg_upgrade, especially
> if it is unaffected by these problems.
> 

Yeah. I wouldn't mind using FILE_COPY in contexts where we know it's
safe, like pg_upgrade. I just don't want to let users unknowingly
step on this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Robert Haas
On Fri, Jun 21, 2024 at 6:52 PM Alena Rybakina
 wrote:
> It's hard to tell, but I think it might be one of the good places to apply 
> transformation. Let me describe a brief conclusion on the two approaches.

This explanation is somewhat difficult for me to follow. For example:

> In the first approach, we definitely did not process the extra "OR" 
> expressions in the first approach, since they were packaged as an Array. It 
> could lead to the fact that less planning time would be spent on the 
> optimizer.

I don't know what the "first approach" refers to, or what processing
the extra "OR" expressions means, or what it would mean to package OR
expressions as an array. If you made them into an SAOP then you'd have
an array *instead of* OR expressions, not OR expressions "packaged as
an array" but even then, they'd still be processed somewhere, unless
the patch was just wrong.

I think you should try writing this summary again and see if you can
make it a lot clearer and more precise.

I'm suspicious that we should actually be postponing the
transformation even further. If, for example, the transformation is
advantageous for index scans and disadvantageous for bitmap scans, or
the other way around, then this approach can't help much: it either
does the transformation and all scan types are affected, or it doesn't
do it and no scan types are affected. But if you decided for each scan
whether to transform the quals, then you could handle that. Against
that, there might be increased planning cost. But, perhaps that could
be avoided somehow.

> What exactly is the strategy around OR-clauses with type differences?
> If I'm reading the code correctly, the first loop requires an exact
> opno match, which presumably implies that the constant-type elements
> are of the same type. But then why does the second loop need to use
> coerce_to_common_type?
>
> It needs to transform all similar constants to one type, because some 
> constants of "OR" expressions can belong others, like the numeric and int 
> types. Due to the fact that array structure demands that all types must be 
> belonged to one type, so for this reason we applied this procedure.

The alternative that should be considered is not combining things if
the types don't match. If we're going to combine such things, we need
to be absolutely certain that type conversion cannot fail.
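
A concrete illustration of the failure risk (narrowing a constant can error
out, widening cannot):

```sql
SELECT 5000000000::int4;   -- ERROR:  integer out of range
SELECT 1000000000::int8;   -- fine: an int4-range value widens losslessly
```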

> I do not think this is acceptable. We should find a way to get the
> right operator into the ScalarArrayOpExpr without translating the OID
> back into a name and then back into an OID.
>
> I don’t really understand the reason why it’s better not to do this. Can you 
> explain please?

One reason is that it is extra work to convert things to a name and
then back to an OID. It's got to be slower than using the OID you
already have.

The other reason is that it's error-prone. If somehow the second
lookup doesn't produce the same OID as the first lookup, bad things
will happen, possibly including security vulnerabilities. I see you've
taken steps to avoid that, like nailing down the schema, and that's
good, but it's not a good enough reason to do it like this. If we
don't have a function that can construct the node we need with the OID
rather than the name as an argument, we should invent one, not do this
sort of thing.

> Also, why is the array built with eval_const_expressions instead of
> something like makeArrayResult? There should be no need for general
> expression evaluation here if we are just dealing with constants.
>
> I'm not ready to answer this question right now, I need time to study the use 
> of the makeArrayResult function in the optimizer.

OK. An important consideration here is that eval_const_expressions()
is prone to *fail* because it can call user-defined functions. We
really don't want this optimization to cause planner failure (or
queries to error out at any other stage, either). We also don't want
to end up with any security problems, which is another possible danger
when we call a function that can execute arbitrary code. It's better
to keep it simple and only do things that we know are simple and safe,
like assembling a bunch of datums that we already have into an array.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Tomas Vondra




On 6/24/24 16:53, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>>
>> On 21/06/2024 03:02, Peter Geoghegan wrote:
>>> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
>>>  wrote:
>>>
 The repro forces a round of index vacuuming after the standby
 reconnects and before pruning a dead tuple whose xmax is older than
 OldestXmin.

 At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
 calls GetOldestNonRemovableTransactionId(), thereby updating the
 backend's GlobalVisState and moving maybe_needed backwards.
>>>
>>> Right. I saw details exactly consistent with this when I used GDB
>>> against a production instance.
>>>
>>> I'm glad that you were able to come up with a repro that involves
>>> exactly the same basic elements, including index page deletion.
>>
>> Would it be possible to make it robust so that we could always run it
>> with "make check"? This seems like an important corner case to
>> regression test.
> 
> I'd have to look into how to ensure I can stabilize some of the parts
> that seem prone to flaking. I can probably stabilize the vacuum bit
> with a query of pg_stat_activity making sure it is waiting to acquire
> the cleanup lock.
> 
> I don't, however, see a good way around the large amount of data
> required to trigger more than one round of index vacuuming. I could
> generate the data more efficiently than I am doing here
> (generate_series() in the from clause). Perhaps with a copy? I know it
> is too slow now to go in an ongoing test, but I don't have an
> intuition around how fast it would have to be to be acceptable. Is
> there a set of additional tests that are slower that we don't always
> run? I didn't follow how the wraparound test ended up, but that seems
> like one that would have been slow.
> 

I think it depends on the impact on the 'make check' duration.
If it could be added to one of the existing test groups, then it depends
on how long the slowest test in that group is. If the new test needs to
be in a separate group, it probably needs to be very fast.

But I was wondering how much time are we talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff does the test need to do, but this
would be pretty reasonable, if we could add it to an existing group.
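
Spelled out, the sequence timed above is roughly this sketch (mirroring the
row counts given; the 1MB floor for maintenance_work_mem is what forces two
index-vacuum passes at this table size):

```sql
SET maintenance_work_mem = '1MB';
CREATE TABLE t (a int);
INSERT INTO t SELECT generate_series(1, 300000);
CREATE INDEX ON t (a);
DELETE FROM t WHERE a <= 270000;   -- ~90% of the rows
VACUUM t;
```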


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE

2024-06-24 Thread Nathan Bossart
On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote:
> The important observation is that this only happens if a database is
> created while the backup is running, and that it only happens with the
> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
> default since PG15).

My first thought is that this sounds related to the large comment in
CreateDatabaseUsingFileCopy():

/*
 * We force a checkpoint before committing.  This effectively means that
 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to 
be
 * replayed (at least not in ordinary crash recovery; we still have to
 * make the XLOG entry for the benefit of PITR operations). This avoids
 * two nasty scenarios:
 *
 * #1: When PITR is off, we don't XLOG the contents of newly created
 * indexes; therefore the drop-and-recreate-whole-directory behavior of
 * DBASE_CREATE replay would lose such indexes.
 *
 * #2: Since we have to recopy the source database during DBASE_CREATE
 * replay, we run the risk of copying changes in it that were committed
 * after the original CREATE DATABASE command but before the system 
crash
 * that led to the replay.  This is at least unexpected and at worst 
could
 * lead to inconsistencies, eg duplicate table names.
 *
 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
 *
 * In PITR replay, the first of these isn't an issue, and the second is
 * only a risk if the CREATE DATABASE and subsequent template database
 * change both occur while a base backup is being taken. There doesn't
 * seem to be much we can do about that except document it as a
 * limitation.
 *
 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
 * strategy that avoids these problems.
 */

> I don't recall any reports of similar issues from pre-15 releases, where
> FILE_COPY was the only available option - I'm not sure why is that.
> Either it didn't have this issue back then, or maybe people happen to
> not create databases concurrently with a backup very often. It's a race
> condition / timing issue, essentially.

If it requires concurrent activity on the template database, I wouldn't be
surprised at all that this is rare.

> I see there have been a couple threads proposing various improvements to
> FILE_COPY, that might make it more efficient/faster, namely using the
> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
> something that's (maybe) faster but not quite correct does not seem like
> a winning strategy to me ...
> 
> Alternatively, if we don't have clear desire to fix it, maybe the right
> solution would be get rid of it?

It would be unfortunate if we couldn't use this for pg_upgrade, especially
if it is unaffected by these problems.

-- 
nathan




Re: scalability bottlenecks with (many) partitions (and more)

2024-06-24 Thread Robert Haas
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
 wrote:
> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
> LWLock table has 16 partitions by default - it's quite possible that on
> machine with many cores and/or many partitions, we can easily hit this.
> So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.

> What I ended up doing is having a hash table of 16-element arrays. There
> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
> have now. Each OID is mapped to exactly one of these parts as if in a
> hash table, and in each of those 16-element parts we do exactly the same
> thing we do now (linear search, removal, etc.). This works great, the
> locality is great, etc. The one disadvantage is this makes PGPROC
> larger, but I did a lot of benchmarks and I haven't seen any regression
> that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)
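
As a toy illustration of what that mapping does (constants as in the quoted
macro; purely illustrative, from SQL rather than the C code):

```sql
-- Each relation OID lands in exactly one of the 64 groups, and only that
-- group's 16 slots are searched linearly.
SELECT oid::regclass AS rel,
       (oid::bigint * 7883 + 4481) % 64 AS fp_group
FROM pg_class
ORDER BY oid
LIMIT 5;
```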

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Injection point locking

2024-06-24 Thread Tom Lane
Heikki Linnakangas  writes:
> ... I can't do that, because InjectionPointRun() requires a PGPROC 
> entry, because it uses an LWLock. That also makes it impossible to use 
> injection points in the postmaster. Any chance we could allow injection 
> points to be triggered without a PGPROC entry? Could we use a simple 
> spinlock instead? With a fast path for the case that no injection points 
> are attached or something?

Even taking a spinlock in the postmaster is contrary to project
policy.  Maybe we could look the other way for debug-only code,
but it seems like a pretty horrible precedent.

Given your point that the existing locking is just a fig leaf
anyway, maybe we could simply not have any?  Then it's on the
test author to be sure that the injection point won't be
getting invoked when it's about to be removed.  Or just rip
out the removal capability, and say that once installed an
injection point is there until postmaster shutdown (or till
shared memory reinitialization, anyway).

regards, tom lane




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>
> On 21/06/2024 03:02, Peter Geoghegan wrote:
> > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> >  wrote:
> >
> >> The repro forces a round of index vacuuming after the standby
> >> reconnects and before pruning a dead tuple whose xmax is older than
> >> OldestXmin.
> >>
> >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> >> calls GetOldestNonRemovableTransactionId(), thereby updating the
> >> backend's GlobalVisState and moving maybe_needed backwards.
> >
> > Right. I saw details exactly consistent with this when I used GDB
> > against a production instance.
> >
> > I'm glad that you were able to come up with a repro that involves
> > exactly the same basic elements, including index page deletion.
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.
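
For instance, a poll along these lines (a sketch; waiting for a buffer
cleanup lock is reported as a BufferPin wait event):

```sql
SELECT count(*) > 0 AS vacuum_is_blocked
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND query ILIKE 'VACUUM%'
  AND wait_event_type = 'BufferPin';
```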

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie



