Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro
On Fri, Jun 10, 2022 at 12:48 PM Tobias Bussmann  wrote:
> Perhaps I can shed some light on this matter:

Hi Tobias,

Oh, thanks for your answers.  Definitely a few bits of interesting
archeology I was not aware of.

> Apple's libc collations have always been a bit special in that regard, even 
> for the non-UTF8 ones. Rooted in ancient FreeBSD, they "try to keep collating 
> table backward compatible with ASCII", thus upper and lower case characters 
> are separated (there are exceptions like 'cs_CZ.ISO8859-2').

Wow.  I see that I can sort the English dictionary the way most people
expect by pretending it's Czech.  What a mess!

> With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't 
> see a difference between "C" and "en_US.UTF-8" but with "( echo '5£'; echo 
> '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( 
> echo '5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q 
> <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort 
> /usr/share/dict/words)"

I see, so it does *something*, just not what anybody wants.
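
(For anyone wanting to reproduce this from inside PostgreSQL rather than
with sort(1), a quick check along these lines should show the same kind of
difference, assuming a UTF8 database and that initdb imported the
"en_US.UTF-8" libc collation on the Mac in question -- I haven't verified
the exact output:

  SELECT '5£' < '£5' COLLATE "C"           AS c_order,
         '5£' < '£5' COLLATE "en_US.UTF-8" AS en_us_order;
)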




Re: Allow foreign keys to reference a superset of unique columns

2022-06-09 Thread David Rowley
On Fri, 10 Jun 2022 at 16:14, Peter Eisentraut
 wrote:
>
> On 10.06.22 05:47, David Rowley wrote:
> >> I think this should be referring to constraint name, not an index name.
> > Can you explain why you think that?
>
> If you wanted to specify this feature in the SQL standard (I'm not
> proposing that, but it seems plausible), then you need to deal in terms
> of constraints, not indexes.  Maybe referring to an index directly could
> be a backup option if desired, but I don't see why that would be
> necessary, since you can easily create a real constraint on top of an index.

That's a good point, but if we invented syntax for specifying a
constraint name, would that not increase the likelihood that we'd end
up with something that conflicts with some future extension to the SQL
standard?

We already have USING INDEX as an extension to ADD CONSTRAINT.
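
For reference, that existing extension (not the new FK syntax being
proposed here) lets you promote an already-built unique index into a
constraint; with illustrative names:

  CREATE UNIQUE INDEX parent_a_idx ON parent (a);
  ALTER TABLE parent ADD CONSTRAINT parent_a_key UNIQUE USING INDEX parent_a_idx;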

David




Re: Handle infinite recursion in logical replication setup

2022-06-09 Thread Peter Smith
Below are some review comments for the patch v18-0003

1. Commit message

This patch does a couple of things:
change 1) Checks and throws an error if the publication tables were also
subscribing data in the publisher from other publishers when copy_data
parameter is specified as 'ON' and origin parameter is specified as
'local'.
change 2) Adds force value for copy_data parameter.

SUGGESTION
This patch does a couple of things:
change 1) Checks and throws an error if 'copy_data = on' and 'origin =
local' but the publication tables were also subscribing data in the
publisher from other publishers.
change 2) Adds 'force' value for copy_data parameter.

~~~

2. Commit message - about the example

All my following review comments for the commit message are assuming
that the example steps are as they are written in the patch, but
actually I felt that the example might be more complex than it needed
to be, e.g.:
- You don’t really need node2 to have data
- Therefore you don’t need all the added TRUNCATE complications

E.g. I think you only need node1 (with data) and node2 (no data).

Then node1 subscribes to node2 with (origin=local, copy_data=off).
Then node2 subscribes to node1 with (origin=local, copy_data=on).
- Demonstrates that the exception happens because node1 already had a subscription
- Demonstrates the need for copy_data=force to override that exception

So please consider using a simpler example for this commit message
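
To make that concrete, a rough sketch of the simpler example (connection
strings elided, object names just for illustration, using the
origin/copy_data/force options from this patch) might look like:

node1=# CREATE PUBLICATION pub_node1 FOR TABLE t1;
node2=# CREATE PUBLICATION pub_node2 FOR TABLE t1;
node1=# CREATE SUBSCRIPTION sub_node1_node2 CONNECTION '' PUBLICATION pub_node2
node1-#   WITH (origin = local, copy_data = off);
node2=# CREATE SUBSCRIPTION sub_node2_node1 CONNECTION '' PUBLICATION pub_node1
node2-#   WITH (origin = local, copy_data = on);    -- expected to error out
node2=# CREATE SUBSCRIPTION sub_node2_node1 CONNECTION '' PUBLICATION pub_node1
node2-#   WITH (origin = local, copy_data = force); -- overrides the exception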

~~~

3. Commit message

The below help us understand how the first change will be useful:

If copy_data parameter was used with 'on' in step 5, then an error
will be thrown
to alert the user to prevent inconsistent data being populated:
CREATE SUBSCRIPTION sub_node2_node1 CONNECTION ''
PUBLICATION pub_node1 WITH (copy_data = on, origin = local);
ERROR:  CREATE/ALTER SUBSCRIPTION with origin and copy_data as true is not
allowed when the publisher might have replicated data

SUGGESTION
The steps below help to demonstrate how the new exception is useful:

The initial copy phase has no way to know the origin of the row data,
so if 'copy_data = on' in step 5 below, then an error will be
thrown to prevent any potentially non-local data from being copied:

e.g
CREATE SUBSCRIPTION ...

~~~

4. Commit message

The following will help us understand how the second change will be useful:
Let's take a simple case where user is trying to setup bidirectional logical
replication between node1 and node2 where the two nodes have some pre-existing
data like below:
node1: Table t1 (c1 int) has data 11, 12, 13, 14
node2: Table t1 (c1 int) has data 21, 22, 23, 24

SUGGESTION
The following steps help to demonstrate how the 'copy_data = force'
change will be useful:

Let's take a scenario where the user wants to set up bidirectional
logical replication between node1 and node2 where the same table on
each node has pre-existing data. e.g.
node1: Table t1 (c1 int) has data 11, 12, 13, 14
node2: Table t1 (c1 int) has data 21, 22, 23, 24

~~~

5. Commit message

step 4:
node2=# CREATE SUBSCRIPTION sub_node2_node1 Connection ''
node2-# PUBLICATION pub_node1;
CREATE SUBSCRIPTION

"Connection" => "CONNECTION"

~~~

6. Commit message

In case, table t1 has a unique key, it will lead to a unique key
violation and replication won't proceed.

SUGGESTION
If table t1 has a unique key, it will cause a unique key
violation and replication won't proceed.

~~~

7. Commit message

step 3: Create a subscription in node1 to subscribe to node2. Use
copy_data specified as on so that the existing table data is copied during
initial sync:

SUGGESTION
step 3: Create a subscription in node1 to subscribe to node2. Use
'copy_data = on' so that the existing table data is copied during
initial sync:

~~~

8. Commit message

step 4: Adjust the publication publish settings so that truncate is not
published to the subscribers and truncate the table data in node2:

SUGGESTION (only added a comma)
step 4: Adjust the publication publish settings so that truncate is
not published to the subscribers, and truncate the table data in
node2:

~~~

9. Commit message

step 5: Create a subscription in node2 to subscribe to node1. Use copy_data
specified as force when creating a subscription to node1 so that the existing
table data is copied during initial sync:

SUGGESTION
step 5: Create a subscription in node2 to subscribe to node1. Use
'copy_data = force' when creating a subscription to node1 so that the
existing table data is copied during initial sync:

==

10. doc/src/sgml/ref/create_subscription.sgml

@@ -383,6 +398,15 @@ CREATE SUBSCRIPTION subscription_name

+  
+   If subscription is created with origin = local and
+   copy_data = on, it will check if the publisher tables are
+   being subscribed to any other publisher and throw an error to prevent
+   inconsistent data in the subscription. The user can continue with the copy
+   operation without throwing any error in this case by specifying
+   copy_data = force.
+  

SUGGESTION (minor rewording)

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro
On Fri, Jun 10, 2022 at 1:48 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro  wrote:
> > Well I can report that the system from ec483147 was hellishly
> > complicated, and not universally loved.  Which isn't to say that there
> > isn't a simple and loveable way to do it, waiting to be discovered,
> > and I do think we could fix most of the problems with that work.
>
> I admit that I don't have much idea of how difficult it would be to
> make it all work. I'm definitely not claiming that it's easy.

Hrrm... perhaps my memory of ec483147 is confusing me.  I think I'm
starting to come around to your idea a bit more now.  Let me sketch
out some more details here and see where this goes.

I *was* thinking that you'd have to find all references to collations
through static analysis, as we did in that version tracking project.
But perhaps for this you only need to record one ICU library version
for the whole index at build time, without any analysis at all, and it
would be used for any and all ICU collations that are reached while
evaluating anything to do with that index (index navigation, but also
eg WHERE clause for partial index, etc).  That would change to the
"current" value when you REINDEX.

Perhaps that could be modeled with a pg_depend row pointing to a
pg_icu_library row, which you'd probably need anyway, to prevent a
registered ICU library that is needed for a live index from being
dropped.  (That's assuming that the pg_icu_library catalogue concept
has legs...  well if we're going with dlopen(), we'll need *somewhere*
to store the shared object paths.  Perhaps it's not a given that we
really want paths in a table... I guess it might prevent certain
cross-OS streaming rep scenarios, but mostly that'd be solvable with
symlinks...)

One problem is that to drop an old pg_icu_library row, you'd have to
go and REINDEX everything, even indexes that don't really use
collations!  If you want to prove that an index doesn't use
collations, you're back in ec483147 territory.  Perhaps we don't care
about that and we're happy to let useless dependencies on
pg_icu_library rows accumulate, or to require useless work to be able
to drop them.

I'm not sure how we'd know what the "current" library version is.  The
highest numbered one currently in that pg_icu_library catalogue I
sketched?  So if I do whatever new DDL we invent to tell the system
about a new ICU library, and it's got a higher number than any others,
new indexes start using it but old ones keep using whatever they're
using.  Maybe with some way for users to override it, so users who
really want to use an older one when creating a new index can say so.

I suppose it would be the same for constraints.  For those,
considering that they need to be rechecked, the only way to change ICU
version would be to drop the constraint and recreate it.  Same goes
for range partitioned tables, right?  It'd keep using the old ICU
library until you drop the p table and create a new one, at which
point you're using the new current ICU library and it'll recheck all
your partitions against the constraints when you add them.  (Those
constraints are much simpler constants, so for those we could prove no
use of ICU without the general ec483147 beast.)

I think these things would have to survive pg_upgrade, but would be
lost on dump/restore.

There's still the pathkey problem to solve, and maybe some more
problems like that hiding somewhere.

I'm not sold on any particular plan, but working through some examples
helped me see your idea better...  I may try to code that up in a
minimal way so we can kick the tyres...




Re: Multi-Master Logical Replication

2022-06-09 Thread Amit Kapila
On Thu, Jun 9, 2022 at 6:04 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 28, 2022 at 5:20 AM Peter Smith  wrote:
> >
> > MULTI-MASTER LOGICAL REPLICATION
> >
> > 1.0 BACKGROUND
> >
> > Let’s assume that a user wishes to set up a multi-master environment
> > so that a set of PostgreSQL instances (nodes) use logical replication
> > to share tables with every other node in the set.
> >
> > We define this as a multi-master logical replication (MMLR) node-set.
> >
> > 
> >
> > 1.1 ADVANTAGES OF MMLR
> >
> > - Increases write scalability (e.g., all nodes can write arbitrary data).
> > - Allows load balancing
> > - Allows rolling updates of nodes (e.g., logical replication works
> > between different major versions of PostgreSQL).
> > - Improves the availability of the system (e.g., no single point of failure)
> > - Improves performance (e.g., lower latencies for geographically local 
> > nodes)
>
> Thanks for working on this proposal. I have a few high-level thoughts,
> please bear with me if I repeat any of them:
>
> 1. Are you proposing to use logical replication subscribers to be in
> sync quorum? In other words, in an N-masters node, M (M >= N)-node
> configuration, will each master be part of the sync quorum in the
> other master?
>

What exactly do you mean by sync quorum here? If you mean to say that
each master node will be allowed to wait till the commit happens on
all other nodes similar to how our current synchronous_commit and
synchronous_standby_names work, then yes, it could be achieved. I
think the patch currently doesn't support this, but it could be
extended to do so. Basically, one could be allowed to set up
async and sync nodes in combination, depending on the use case.
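
As a sketch of how that can already be done with the existing machinery
(nothing specific to this patch; a subscription's walsender uses the
subscription name as its application_name, and the names here are just
illustrative), on node1 one could run:

  ALTER SYSTEM SET synchronous_standby_names = 'sub_node2_node1';
  SELECT pg_reload_conf();

so that, with synchronous_commit enabled, commits on node1 wait for the
subscription on node2 while any other subscriptions remain asynchronous.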

> 2. Is there any mention of reducing the latencies that logical
> replication will have generally (initial table sync and
> after-caught-up decoding and replication latencies)?
>

No, this won't change under the hood replication mechanism.

> 3. What if "some" postgres provider assures an SLA of very few seconds
> for failovers in typical HA set up with primary and multiple sync and
> async standbys? In this context, where does the multi-master
> architecture sit in the broad range of postgres use-cases?
>

I think this is one of the primary use cases of the n-way logical
replication solution, wherein there shouldn't be any noticeable wait
time when one or more of the nodes goes down. All nodes can accept
writes, so the app just needs to connect to another node. I feel some
analysis is required to find out and state exactly how users can
achieve this, but it seems doable. The other use cases are discussed
in this thread and are summarized in emails [1][2].

> 4. Can the design proposed here be implemented as an extension instead
> of a core postgres solution?
>

Yes, I think it could be. This proposal introduces some system
tables, so we would need to analyze what to do about that.  BTW, do
you see any advantages to doing so?

> 5. Why should one use logical replication for multi master
> replication? If logical replication is used, isn't it going to be
> something like logically decode and replicate every WAL record from
> one master to all other masters? Instead, can't it be achieved via
> streaming/physical replication?
>

The failover/downtime will be much lower in a solution based on
logical replication because all nodes are master nodes, and users will
be allowed to write on other nodes instead of waiting for the physical
standby to become writeable. It will also allow more localized
database access for geographically distributed databases; see the
email for further details on this [3]. Also, the benefiting scenarios
are the same as the usual quoted benefits of logical replication, e.g.
version independence, getting selective/required data, etc.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BZP9c6q1BQWSQC__w09WQ-qGt22dTmajDmTxR_CAUyJQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/TYAPR01MB58660FCFEC7633E15106C94BF5A29%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3] - 
https://www.postgresql.org/message-id/CAA4eK1%2BDRHCNLongM0stsVBY01S-s%3DEa_yjBFnv_Uz3m3Hky-w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Allow foreign keys to reference a superset of unique columns

2022-06-09 Thread Peter Eisentraut

On 10.06.22 05:47, David Rowley wrote:

I think this should be referring to constraint name, not an index name.

Can you explain why you think that?


If you wanted to specify this feature in the SQL standard (I'm not 
proposing that, but it seems plausible), then you need to deal in terms 
of constraints, not indexes.  Maybe referring to an index directly could 
be a backup option if desired, but I don't see why that would be 
necessary, since you can easily create a real constraint on top of an index.





Re: Allow foreign keys to reference a superset of unique columns

2022-06-09 Thread David Rowley
On Fri, 10 Jun 2022 at 15:08, Peter Eisentraut
 wrote:
>
> On 07.06.22 20:59, Kaiting Chen wrote:
> > 2. The FOREIGN KEY constraint syntax gains a [ USING INDEX index_name ] 
> > clause
> > optionally following the referenced column list.
> >
> > The index specified by this clause is used to support the foreign key
> > constraint, and it must be a non-deferrable unique or primary key index 
> > on
> > the referenced table compatible with the referenced columns.
> >
> > Here, compatible means that the columns of the index are a subset (the 
> > same,
> > or a strict subset) of the referenced columns.
>
> I think this should be referring to constraint name, not an index name.

Can you explain why you think that?

My thoughts are that it should be an index name. I'm basing that on
the fact that transformFkeyCheckAttrs() looks for valid unique indexes
rather than constraints.  The referenced table does not need any
primary key or unique constraints to be referenced by a foreign key.
It just needs a unique index matching the referencing columns.
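
For example (illustrative names), this is already accepted today, even
though the referenced table has no unique or primary key constraint, only
a unique index:

  CREATE TABLE parent (a int, b int);
  CREATE UNIQUE INDEX parent_a_b_idx ON parent (a, b);
  CREATE TABLE child (a int, b int,
                      FOREIGN KEY (a, b) REFERENCES parent (a, b));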

It would seem very strange to me if we required a unique or primary
key constraint to exist only when this new syntax is being used. Maybe
I'm missing something though?

David




Re: Allow foreign keys to reference a superset of unique columns

2022-06-09 Thread Peter Eisentraut

On 07.06.22 20:59, Kaiting Chen wrote:

2. The FOREIGN KEY constraint syntax gains a [ USING INDEX index_name ] clause
optionally following the referenced column list.

The index specified by this clause is used to support the foreign key
constraint, and it must be a non-deferrable unique or primary key index on
the referenced table compatible with the referenced columns.

Here, compatible means that the columns of the index are a subset (the same,
or a strict subset) of the referenced columns.


I think this should be referring to constraint name, not an index name.




Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro  wrote:
> Well I can report that the system from ec483147 was hellishly
> complicated, and not universally loved.  Which isn't to say that there
> isn't a simple and loveable way to do it, waiting to be discovered,
> and I do think we could fix most of the problems with that work.

I admit that I don't have much idea of how difficult it would be to
make it all work. I'm definitely not claiming that it's easy.

> I understand that that's not ideal from an
> end-user perspective, but maybe it's more realistically and robustly
> and simply implementable.  Hmm.

That may be a decisive reason to go with your proposal. I really don't know.

-- 
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-09 Thread Tobias Bussmann
On 08.06.2022 at 16:16, Tom Lane wrote:
> The proposed patch would result in a warning about every collation-
> sensitive index during every macOS major version upgrade, ie about
> once a year for most people.  
> We need something that has at least *some* connection to actual changes.

In Postgres.app we introduced default collation versioning and warnings about 
possible mismatches from outside the actual server. When the user runs initdb 
with the GUI wrapper, the OS version and a checksum of the LC_COLLATE file of 
the default collation in use are stored as metadata. This allows us to display 
a reindex warning on startup if the hash changes, or if we have hardcoded a 
known incompatible OS change.

Having collversion support on macOS within postgres would leverage the existing 
infrastructure for version change warnings and enable support for multiple 
collations. But I agree, we need something more specific than the major OS 
version here. Lacking any collation version information from the provider, a 
checksum on the binary LC_COLLATE file is the best I can come up with. 

Best regards,
Tobias



Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro
On Fri, Jun 10, 2022 at 1:06 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro  wrote:
> > That sounds nice, but introduces subtle problems for the planner.  For
> > example, pathkeys that look compatible might not be, when
> > merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
> > could teach it about that, whereas with my distinct OID concept they
> > would already be considered non-matching automatically.
>
> Right -- my proposal is likely to be more difficult to implement.
> Seems like it might be worth going to the trouble of teaching the
> planner about this difference, though.

Well I can report that the system from ec483147 was hellishly
complicated, and not universally loved.  Which isn't to say that there
isn't a simple and loveable way to do it, waiting to be discovered,
and I do think we could fix most of the problems with that work.  It's
just that I was rather thinking of this new line of attack as being a
way to avoid the complications of identifying dependencies on moving
things through complicated analysis of object graphs and AST, by
instead attaching those slippery external things to the floor with a
nail gun.  That is, treating ICU 63 and ICU 67's collations as
completely unrelated.  I understand that that's not ideal from an
end-user perspective, but maybe it's more realistically and robustly
and simply implementable.  Hmm.




Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro  wrote:
> That sounds nice, but introduces subtle problems for the planner.  For
> example, pathkeys that look compatible might not be, when
> merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
> could teach it about that, whereas with my distinct OID concept they
> would already be considered non-matching automatically.

Right -- my proposal is likely to be more difficult to implement.
Seems like it might be worth going to the trouble of teaching the
planner about this difference, though.

That exact issue seems like the true underlying problem to me: we have
two sets of behaviors for a given collation, that are equivalent for
some purposes (the user thinks of them as totally interchangeable),
but not for other purposes (we can't expect old indexes to continue to
work with a new physical collation for their logical collation). So
directly tackling that seems natural to me.

-- 
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro
On Fri, Jun 10, 2022 at 12:32 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro  wrote:
> > You seem to have some
> > other idea in mind where the system only knows about one
> > "en-US-x-icu", but somehow, somewhere else (where?), keeps track of
> > which indexes were built with ICU 63 and which with ICU 67, which I
> > don't yet grok.  Or did I misunderstand?
>
> That's what I meant, yes -- you got it right.

OK, I see now.

I think if you design a system to record the library that each index
(and constraint, ...) was built with, it'd surely finish up being at
least conceptually something like the system Julien and I built and
then reverted in ec483147.  Except that it'd be a stronger form of
that, because instead of just squawking when the version is not the
latest/current version, it'd keep working but route collations to the
older library for indexes that haven't been rebuilt yet.

That sounds nice, but introduces subtle problems for the planner.  For
example, pathkeys that look compatible might not be, when
merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
could teach it about that, whereas with my distinct OID concept they
would already be considered non-matching automatically.




Re: Collation version tracking for macOS

2022-06-09 Thread Tobias Bussmann
Thanks for picking this up!

> How can I see evidence of this?  I'm comparing Debian, FreeBSD and
> macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
> /usr/share/dict/words" I get upper and lower case mixed together on
> the other OSes, but on the Mac the upper case comes first, which is my
> usual smoke test for "am I looking at binary sort order?"

Perhaps I can shed some light on this matter:

Apple's libc collations have always been a bit special in that regard, even 
for the non-UTF8 ones. Rooted in ancient FreeBSD, they "try to keep collating 
table backward compatible with ASCII", thus upper and lower case characters are 
separated (there are exceptions like 'cs_CZ.ISO8859-2'). The latest public 
sources I can find are in adv_cmds-119 [1], which belongs to OS X 10.5 [2]; 
these correspond to the ones used in FreeBSD until v10 [3], whereby the 
timestamps rather point to an origin around FreeBSD 5. Further, only very few 
locales are actually present on macOS (36, none of them supporting Unicode), 
and these have not changed for a very long time (I verified that from OS X 
10.6.8 to macOS 12.4 [4]; the exception is a 'de_DE-A.ISO8859-1' present only in 
macOS 10.15).

What they do instead is symlink [5] missing collations to similar ones, even 
across encodings, often resulting in la_LN.US-ASCII ('la_LN' seems to stand for 
a Latin meta language) being used, which is exactly byte order [6]. These 
symlinks did not change [7] from OS X 10.6.8 until macOS 10.15.7. But in macOS 
11 many of these symlinks changed their target: the popular 'en_US.UTF-8' went 
from 'la_LN.US-ASCII' to 'la_LN.ISO8859-1', and 'de_DE.UTF-8' from 
'la_LN.US-ASCII' to 'de_DE.ISO8859-1'. In effect, about half of the UTF-8 
collations changed from no collation to partial/broken collation support. macOS 
12 again shows no changes; tests for macOS 13 are outstanding.

# tl;dr

With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't 
see a difference between "C" and "en_US.UTF-8" but with "( echo '5£'; echo '£5' 
) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( echo 
'5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C 
sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort 
/usr/share/dict/words)"

The upside is that we don't have to cope with the new characters added in every 
version of Unicode (although I have not examined LC_CTYPE yet).

best regards
Tobias

[1]: 
https://github.com/apple-oss-distributions/adv_cmds/tree/adv_cmds-119/usr-share-locale.tproj/colldef
[2]: https://opensource.apple.com/releases/
[3]: https://github.com/freebsd/freebsd-src/tree/stable/10/share/colldef
[4]: find /usr/share/locale/*/LC_COLLATE -type f -exec md5 {} \;
[5]: 
https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/BSDmakefile
[6]: 
https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/la_LN.US-ASCII.src
[7]: find /usr/share/locale/*/LC_COLLATE -type l -exec stat -f "%N%SY" {} \; 



Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro  wrote:
> However, since you mentioned that a simple REINDEX would get you from
> one library version to another, I think we're making some completely
> different assumptions somewhere along the line, and I don't get your
> idea yet.  It sounds like you don't want two different collation OIDs
> in that case?

Not completely sure about the REINDEX behavior, but it's at least an
example of the kind of thing that could be enabled. I'm proposing that
pg_collation-wise collations have the most abstract possible
definitions -- "logical collations", which are decoupled from
"physical collations" that actually describe a particular ICU collator
associated with a particular ICU version (all the information that
keeps how the on-disk structure is organized for a given relfilenode
straight). In other words, the definition of a collation is the user's
own definition. To the user, it's pretty close to (maybe even exactly)
a BCP47 string, now and forever.

You can make arguments against the REINDEX behavior. And maybe those
arguments will turn out to be good arguments. Assuming that they are,
then the solution may just be to have a special option that will make
the REINDEX use the most recent library.

The important point is to make the abstraction as high level as
possible from the point of view of users.

> You seem to have some
> other idea in mind where the system only knows about one
> "en-US-x-icu", but somehow, somewhere else (where?), keeps track of
> which indexes were built with ICU 63 and which with ICU 67, which I
> don't yet grok.  Or did I misunderstand?

That's what I meant, yes -- you got it right.

Another way to put it would be to go as far as we can in the direction
of decoupling the concerns that we have as database people from the
concerns of natural language experts. Let's not step on their toes,
and let's avoid having our toes trampled on.

-- 
Peter Geoghegan




doc: array_length produces null instead of 0

2022-06-09 Thread David G. Johnston
Hi,

Per discussion here:

https://www.postgresql.org/message-id/163636931138.8076.5140809232053731248%40wrigleys.postgresql.org

We can now easily document the array_length behavior of returning null
instead of zero for an empty array/dimension.

I added an example to the json_array_length function to demonstrate that it
does return 0 as one would expect, but contrary to the SQL array behavior.

I did not bother to add examples to the other half dozen or so "_length"
functions that all produce 0 as expected.  Just the surprising case and the
adjacent one.
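
For reference, the behavior in question is easy to see in psql:

  SELECT array_length('{}'::int[], 1);   -- returns NULL, not 0
  SELECT json_array_length('[]'::json);  -- returns 0

(cardinality('{}'::int[]) does return 0, for those who want that behavior.)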

David J.


0001-doc-array_length-produces-null-instead-of-0.patch
Description: Binary data


Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro
On Fri, Jun 10, 2022 at 10:29 AM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim  wrote:
> > For example, an alternate syntax might be:
> >
> > create collation icu63."en-US-x-icu" (provider = icu, locale = 
> > 'en-US@colVersion=63');
>
> Why would a user want to specify an ICU version in DDL? Wouldn't that
> break in the event of a dump and reload of the database, for example?
> It also strikes me as being inconsistent with the general philosophy
> for ICU and the broader BCP47 IETF standard, which is "interpret the
> locale string to the best of our ability, never throw an error".
>
> Your proposed syntax already "works" today! You just need to create a
> schema called icu63 -- then the command executes successfully (for
> certain values of successfully).

Jim was proposing the @colVersion=63 part, but the schema part came
from my example upthread.  That was from a real transcript, and I
included it because the way I've been thinking of this so far has
distinct collation OIDs for the "same" collation from different ICU
libraries, and yet I want them to have the same collname.  That is, I
don't want (say) "en-US-x-icu63" and "en-US-x-icu71"... I thought it'd
be nice to keep using "en-US-x-icu" as we do today, so if there are
two of them they'd *have* to be in different schemas.  That has the
nice property that you can use the search_path to avoid mentioning it.
But I'm not at all wedded to that idea, or any other ideas in this
thread, just trying stuff out...

However, since you mentioned that a simple REINDEX would get you from
one library version to another, I think we're making some completely
different assumptions somewhere along the line, and I don't get your
idea yet.  It sounds like you don't want two different collation OIDs
in that case?

The (vastly too) simplistic way I was thinking of it, if you have a
column with an ICU 63 collation, to switch to ICU 67 you first do some
DDL to add ICU 67 to your system and import 67's collations (creating
new collation OIDs), and then eg ALTER TABLE foo ALTER COLUMN bar TYPE
text COLLATE icu67."en-US-x-icu", which will rebuild your indexes.
That's a big job, and doesn't address how you switch the database
default collation.  None of that is very satisfying, much more thought
needed, but it falls out of the decision to have distinct
icu63."en-US-x-icu" and icu67."en-US-x-icu".  You seem to have some
other idea in mind where the system only knows about one
"en-US-x-icu", but somehow, somewhere else (where?), keeps track of
which indexes were built with ICU 63 and which with ICU 67, which I
don't yet grok.  Or did I misunderstand?




Re: better page-level checksums

2022-06-09 Thread Matthias van de Meent
On Thu, 9 Jun 2022 at 23:13, Robert Haas  wrote:
>
> On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost  wrote:
> > Alternatively, we could use
> > that technique to just provide a better per-page checksum than what we
> > have today.  Maybe we could figure out how to leverage that to move to
> > 64bit transaction IDs with some page-level epoch.
>
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
> quick Google search and found that brtfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity. Second, we're probably going to have to
> dole out per-page space in multiples of MAXALIGN, and that's usually
> 8.

Why so? We already dole out per-page space in 4-byte increments
through pd_linp, and I see no reason why we can't reserve some line
pointers for per-page metadata if we decide that we need extra
per-page ~overhead~ metadata.

>  I think for this purpose we should limit ourselves to algorithms
> whose output size is, at minimum, 64 bits, and ideally, a multiple of
> 64 bits. I'm sure there are plenty of options other than the ones that
> btrfs uses; I mentioned them only as a way of jump-starting the
> discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
> wide 16 BYTE checksums. That's a lot of space to consume with a
> checksum, but your chances of a collision are very small indeed.

Isn't the goal of a checksum to find - and where possible, correct -
bit flips and other broken pages? I would suggest not to use
cryptographic hash functions for that, as those are rarely
error-correcting.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't. There seem to be about 50 compile-time constants in
> the source code whose values are computed based on the block size and
> amount of special space in use by some particular AM (yikes!).

Isn't that expected for most of those places? With the current
bufpage.h description of Page, it seems obvious that all bytes on a
page except those in the "hole" and those in the page header are under
full control of the AM. Of course AMs will pre-calculate limits and
offsets during compilation, that saves recalculation cycles and/or
cache lines with constants to keep in L1.

>  For
> example, for the heap, there's stuff like MaxHeapTuplesPerPage and
> MaxHeapTupleSize. If in the future we have some pages that are just
> like the ones we have today, and other clusters where we've allowed
> space for a checksum, then those constants become run-time variable.
> And since they're used in some very low-level functions that are
> called a lot, like PageGetHeapFreeSpace(), that seems potentially
> problematic. The problem is multiplied if you also think about trying
> to store an epoch on each heap page, as per Stephen's proposal above,
> because now every page used by any AM whatsoever might or might not
> have a checksum, and every heap page might also have or not have an
> epoch XID. I think it's going to be somewhat tricky to figure out a
> scheme here that avoids making things slow.

Can't we add some extra fork that stores this extra per-page
information and keeps it in a double-buffered format, so that the new
metadata is written to disk before the actual page is written, while
the old metadata remains available for recovery purposes? This allows
us to maintain the current format with its low per-page overhead, and
only have extra overhead (up to 2x writes for each page, but the
writes for these metadata pages need not be BLCKSZ in size) for those
that opt in to the more computationally expensive features of larger
checksums, nonces, and/or other non-AM per-page ~overhead~ metadata.

> Originally I was thinking
> that things like MaxHeapTuplesPerPage ought to become macros or static
> inline functions, but now I have what I think is a better idea: make
> them into global variables and initialize them with the appropriate
> values for the cluster right after we read the control file. This
> doesn't solve the problem if some pages are different than others,
> though, and even for the case where every page in the cluster has the
> same amount of reserved space, reading a global variable is not going
> to be as efficient as referring to a constant compiled right into the
> code. I'm hopeful that all of this is solvable somehow, but it's
> hairy, for sure.
>
> Another thing I realized is that we would probably end up with the
> pd_checksum unused when this other feature is activated. If someone
> comes up 

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro  wrote:
> Suppose you pg_upgrade to something that is linked against 71.
> Perhaps you'd need to tell it how to dlopen 67 before you can open any
> collations with that library, but once you've done that your
> collation-dependent partition constraints etc should all hold.  I
> dunno, lots of problems to figure out here, including quite broad ones
> about various migration problems.  I haven't understood what Peter G
> is suggesting about how upgrades might work, so I'll go and try to do
> that...

I'm mostly just arguing for the idea that we should treat ICU versions
as essentially interchangeable in terms of their high-level
capabilities around collations and languages/scripts/whatever provided
for by the underlying CLDR version -- tools like pg_dump shouldn't
need to care about ICU versions per se. *ICU itself* should be
versioned, rather than having multiple independent ICU collation
providers. This should work as well as anything like this can ever be
expected to work -- because internationalization is just hard.

These remarks need to be interpreted in the context of how
internationalization is *supposed* to work under standards like BCP47
(again, this is a broad RFC about internationalization, not really an
ICU thing). Natural languages are inherently squishy, messy things.
The "default ICU collations" that initdb puts in pg_collation are not
really special to ICU -- we generate them through a quasi-arbitrary
process that iterates through top-level locales, which results in a
list that is a bit like what you get with libc collations. If you
pg_upgrade, you might have leftover "default ICU collations" that
wouldn't have been the default on a new initdb. It's inherently pretty
chaotic (because humans aren't as predictable as computers), which is
why BCP47 itself is so forgiving -- it literally has to be. Plus there
really isn't much downside to being so lax; as Jeremy pretty much said
already, the important thing is generally to have roughly the right
idea -- which this fuzzy approach mostly manages to do.

Let's not fight that. Let's leave the natural language stuff to the
experts, by versioning a single collation provider (like ICU), and
generalizing the definition of a collation along the same lines --
something that can be implemented using any available version of ICU
(with a preference for the latest on REINDEX, perhaps). It might turn
out that an older version does a slightly better job than a newer
version (regressions cannot be ruled out), but ultimately that's not
our problem. It can't be -- we're not the unicode consortium.

It's theoretically up to the user to make sure they're happy with any
behavioral changes under this scheme, perhaps by testing. They won't
actually test very often, of course, but that shouldn't matter in
practice. This is already what we advise for users that use advanced
tailorings of custom ICU collations, such as a custom collation for
"natural sorting", often used for things like alphanumeric invoice
numbers. That might break if you downgrade ICU version, and maybe even
if you upgrade ICU version.

--
Peter Geoghegan




doc: Bring mention of unique index forced transaction wait behavior outside of the internal section

2022-06-09 Thread David G. Johnston
Hi.

The fact that one transaction will wait on another if they are trying to
claim the same unique value is presently relegated to a subchapter of the
documentation where the typical reader will not even understand (rightly
so) the main chapter's title.  This has prompted a number of questions
being sent to the mailing list over the years about a topic we do cover in
some detail in the documentation, most recently here:

https://www.postgresql.org/message-id/cajqy8uosnct0m0xbd7gkwgs02c0sozn1det-q94jjpv1lrc...@mail.gmail.com
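
(The behavior in question, for anyone who hasn't run into it -- assuming a
table t with a unique constraint on id:

  -- session 1
  BEGIN;
  INSERT INTO t (id) VALUES (1);  -- claims id = 1, transaction left open

  -- session 2
  INSERT INTO t (id) VALUES (1);  -- blocks until session 1 commits, then
                                  -- fails; succeeds if session 1 rolls back
)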

Attached is a proposal for incorporating some high-level detail within the
MVCC Chapter, where readers are already looking to learn about how
transactions interact with each other and are "isolated" (or not, in this
case) from each other.

I'm neither married to the verbiage nor location but it seems better than
nothing and a modest improvement for not much effort.  It's basically a
glorified cross-reference.  I didn't dislike directing the reader to the
internals section enough to try and establish a better location for
the main content.  It just needs better navigation to it from places the
reader is expected to peruse.

David J.


0001-doc-cross-reference-to-unique-indexes-internals-from.patch
Description: Binary data


Re: [PoC] Let libpq reject unexpected authentication requests

2022-06-09 Thread Jacob Champion
On Wed, Jun 8, 2022 at 9:58 PM Michael Paquier  wrote:
> > - require_auth=scram-sha-256,cert means that either a SCRAM handshake
> > must be completed, or the server must request a client certificate. It
> > has a potential pitfall in that it allows a partial SCRAM handshake,
> > as long as a certificate is requested and sent.
>
> Er, this one could be a problem protocol-wise for SASL, because that
> would mean that the authentication gets completed but that the
> exchange has begun and is not finished?

I think it's already a problem, if you're not using channel_binding.
The cert behavior here makes it even less intuitive.

> > AND and NOT, discussed upthread, are not yet implemented. I tied
> > myself up in knots trying to make AND generic, so I think I"m going to
> > tackle NOT first, instead. The problem with AND is that it only makes
> > sense when one (and only one) of the options is a form of transport
> > authentication. (E.g. password+md5 never makes sense.) And although
> > cert+ and gss+ could be useful, the latter case
> > is already handled by gssencmode=require, and the gssencmode option is
> > more powerful since you can disable it (or set it to don't-care).
>
> I am on the edge regarding NOT as well, and I am unsure of the actual
> benefits we could get from it as long as one can provide a white list
> of auth methods.  If we don't see an immediate benefit in that, I'd
> rather choose a minimal, still useful, design.  As a whole, I would
> vote with adding only support for OR and a comma-separated list like
> your proposal.

Personally I think the ability to provide a default of `!password` is
very compelling. Any allowlist we hardcode won't be future-proof (see
also my response to Laurenz upthread [1]).

> > I'm not generally happy with how the "cert" option is working. With
> > the other methods, if you don't include a method in the list, then the
> > connection fails if the server tries to negotiate it. But if you don't
> > include the cert method in the list, we don't forbid the server from
> > asking for a cert, because the server always asks for a client
> > certificate via TLS whether it needs one or not. Behaving in the
> > intuitive way here would effectively break all use of TLS.
>
> Agreed.  Looking at what you are doing with allowed_auth_method_cert,
> this makes the code harder to follow, which is risky for any
> security-related feature, and that's different than the other methods
> where we have the AUTH_REQ_* codes.  This leads to extra complications
> in the shape of ssl_cert_requested and ssl_cert_sent, as well.  This
> strongly looks like what we do for channel binding as it has
> requirements separated from the actual auth methods but has dependency
> with them, so a different parameter, as suggested, would make sense.
> If we are not sure about this part, we could discard it in the first
> instance of the patch.

I'm pretty motivated to provide the ability to say "I want cert auth
only, nothing else." Using a separate parameter would mean we'd need
something like `require_auth=none`, but I think that makes a certain
amount of sense.

> I am not convinced that we have any need for the AND grammar within
> one parameter, as that's basically the same thing as defining multiple
> connection parameters, isn't it?  This is somewhat a bit similar to
> the interactions of channel binding with this new parameter and what
> you have implemented.  For example, channel_binding=require
> require_auth=md5 would imply both and should fail, even if it makes
> little sense because MD5 has no idea of channel binding.  One
> interesting case comes down to stuff like channel_binding=require
> require_auth="md5,scram-sha-256", where I think that we should still
> fail even if the server asks for MD5 and enforce an equivalent of an
> AND grammar through the parameters.  This reasoning limits the
> dependencies between each parameter and treats the areas where these
> are checked independently, which is what check_expected_areq() does
> for channel binding.  So that's more robust at the end.

Agreed.

> Speaking of which, I would add tests to check some combinations of
> channel_binding and require_auth.

Sounds good.

> +   appendPQExpBuffer(&conn->errorMessage,
> +                     libpq_gettext("auth method \"%s\" required, but %s\n"),
> +                     conn->require_auth, reason);
> This one is going to make translation impossible.  One way to tackle
> this issue is to use "auth method \"%s\" required: %s".

Yeah, I think I have a TODO about that somewhere. I'm
confused about your suggested fix though; can you elaborate?

> +   {"require_auth", NULL, NULL, NULL,
> +   "Require-Auth", "", 14, /* sizeof("scram-sha-256") == 14 */
> +   offsetof(struct pg_conn, require_auth)},
> We could have an environment variable for that.

I think that'd be a good idea. It'd be nice to have the option of
forcing a particular auth type across a process tree.

> 

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro
On Fri, Jun 10, 2022 at 9:20 AM Finnerty, Jim  wrote:
> Specifying the library name before the language-country code with a new 
> separator  (":") as you suggested below has some benefits.

One of the reasons for putting some representation of desired library
into the colliculocale column (rather than, say, adding a new column
to pg_collation) is that I think we'd also want to be able to put that
into daticulocale (for the database default collation, when using
ICU).  But really I just did that because it was easy... perhaps, both
pg_collation and pg_database could gain a new column, and that would
be a little more pleasing from a schema design point of view (1NF
atomicity, and it's a sort of foreign key, or at least it would be if
there were another catalog to list library versions...)?

> Did you consider making the collation version just another collation 
> attribute, such as colStrength, colCaseLevel, etc.?
> For example, an alternate syntax might be:
>
> create collation icu63."en-US-x-icu" (provider = icu, locale = 
> 'en-US@colVersion=63');

Hmm, I hadn't considered that.  (I wouldn't call it "col" version BTW,
it's a library version, and we don't want to overload our terminology
for collation version.  We'd still be on the lookout for collversion
changes coming from a single library's minor version changing, for
example an apt-get upgrade can replace the .63 files, which on most
systems are symlinks to .63.1, .63.2 etc. ☠️)

> Was the concern that ICU might redefine a new collation property with the 
> same name in a different and incompatible way (we might work with the ICU 
> developers to agree on what it should be), or that a version is just not the 
> same kind of collation property as the other collation properties?

Well my first impression is that we don't really own that namespace,
and since we're using this to decide which library to route calls to,
it seems nicer to put it at a "higher level" than those properties.
So I'd prefer something like "63:en-US", or 63 in a new column.

> (in the example above, I'm assuming that for provider = icu, we could 
> translate '63' into  'libicui18n.so.63' automatically.)

Yeah.  My patch that jams a library name in there was just the fastest
way I could think of to get something off the ground to test whether I
could route calls to different libraries (yes!), though at one moment
I thought it wasn't terrible.  But aside from any aesthetic complaints
about that way of doing it, it turns out not to be enough: we need to
dlopen() two different libraries, because we also need some ctype-ish
functions from this guy:

$ nm -D -C /usr/lib/x86_64-linux-gnu/libicuuc.so.63.1 | grep u_strToUpper
000d22c0 T u_strToUpper_63

I guess we probably want to just put "63" somewhere in pg_collation,
as you say.  But then, teaching PostgreSQL how to expand that to a
name that is platform/packaging dependent seems bad.  The variations
would probably be minor; on a Mac it's .dylib, on AIX it may be .a,
and the .63 convention may not be universal, I dunno, and some systems
might need absolute paths (depending on ld.so.conf etc.), but that's
all stuff that I think an administrator should care about, not us.

Perhaps there could be a new catalog table just for that.  So far I
have imagined there would still be one special ICU library linked at
build time, which doesn't need to be dlopen'd, and works automatically
without administrators having to declare it.  So a system that has one
linked-in library version 67, and then has two extras that have been
added by an administrator running some new DDL commands might have:

postgres=# select * from pg_icu_library order by version;
 version |libicuuc|libicui18n
-++--
  58 | libicuuc.so.58 | libicui18n.so.58
  63 | libicuuc.so.63 | libicui18n.so.63
  67 ||
(3 rows)

Suppose you pg_upgrade to something that is linked against 71.
Perhaps you'd need to tell it how to dlopen 67 before you can open any
collations with that library, but once you've done that your
collation-dependent partition constraints etc should all hold.  I
dunno, lots of problems to figure out here, including quite broad ones
about various migration problems.  I haven't understood what Peter G
is suggesting about how upgrades might work, so I'll go and try to do
that...




Re: Logging query parmeters in auto_explain

2022-06-09 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Tue, Jun 07, 2022 at 12:18:52PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> Point, fixed in the attached v2. I've also added a test that truncation
>> and disabling works.
>
> The tests are structured so as all the queries are run first, then the
> full set of logs is slurped and scanned.  With things like tests for
> truncation, it becomes easier to have a given test overlap with
> another one in terms of pattern matching, so we could silently lose
> coverage.  Wouldn't it be better to reorganize things so we save the
> current size of the log file (as of -s $node->logfile), run one query,
> and then slurp the log file based on the position saved previously?
> The GUC updates had better be localized in each sub-section of the
> tests, as well. 

Done (and more tests added), v3 attached.
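
For anyone wanting to try it out, usage follows the corresponding core GUC
(log_parameter_max_length); something like this in a superuser session
should get the parameters into the logged plan (a sketch, not taken from
the patch's tests):

  LOAD 'auto_explain';
  SET auto_explain.log_min_duration = 0;
  SET auto_explain.log_parameter_max_length = -1;  -- log parameters in full
  PREPARE q(int) AS SELECT $1;
  EXECUTE q(42);  -- the logged "Query Parameters:" line should show $1 = '42'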

- ilmari

>From de3627d972f5259c9290109a6d7b08fbfa1faea2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 31 May 2022 21:12:12 +0100
Subject: [PATCH v3] Log query parameters in auto_explain

Add an auto_explain.log_parameter_max_length config setting, similar
to the corresponding core setting, that controls the inclusion of
query parameters in the logged explain output.

Also adjust the tests to only look at the relevant parts of the log
for each query, and to reset the GUCs after each test.
---
 contrib/auto_explain/auto_explain.c|  15 +++
 contrib/auto_explain/t/001_auto_explain.pl | 133 ++---
 doc/src/sgml/auto-explain.sgml |  19 +++
 src/backend/commands/explain.c |  22 
 src/include/commands/explain.h |   1 +
 5 files changed, 176 insertions(+), 14 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c9a0d947c8..1ba7536879 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -19,12 +19,14 @@
 #include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/params.h"
 #include "utils/guc.h"
 
 PG_MODULE_MAGIC;
 
 /* GUC variables */
 static int	auto_explain_log_min_duration = -1; /* msec or -1 */
+static int	auto_explain_log_parameter_max_length = -1; /* bytes or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
@@ -105,6 +107,18 @@ _PG_init(void)
 			NULL,
 			NULL);
 
+	DefineCustomIntVariable("auto_explain.log_parameter_max_length",
+			"Sets the maximum length of query parameters to log.",
+			"Zero logs no query parameters, -1 logs them in full.",
+			&auto_explain_log_parameter_max_length,
+			-1,
+			-1, INT_MAX,
+			PGC_SUSET,
+			GUC_UNIT_BYTE,
+			NULL,
+			NULL,
+			NULL);
+
 	DefineCustomBoolVariable("auto_explain.log_analyze",
 			 "Use EXPLAIN ANALYZE for plan logging.",
 			 NULL,
@@ -389,6 +403,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 
 			ExplainBeginOutput(es);
 			ExplainQueryText(es, queryDesc);
+			ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length);
 			ExplainPrintPlan(es, queryDesc);
 			if (es->analyze && auto_explain_log_triggers)
 				ExplainPrintTriggers(es, queryDesc);
diff --git a/contrib/auto_explain/t/001_auto_explain.pl b/contrib/auto_explain/t/001_auto_explain.pl
index 82e4d9d15c..4fa1a8457b 100644
--- a/contrib/auto_explain/t/001_auto_explain.pl
+++ b/contrib/auto_explain/t/001_auto_explain.pl
@@ -16,42 +16,147 @@
 $node->append_conf('postgresql.conf', "auto_explain.log_analyze = on");
 $node->start;
 
+sub query_log
+{
+	my ($params, $sql) = @_;
+
+	if (keys %$params)
+	{
+		for my $key (keys %$params)
+		{
+			$node->append_conf('postgresql.conf', "$key = $params->{$key}\n");
+		}
+		$node->reload;
+	}
+
+	my $log= $node->logfile();
+	my $offset = -s $log;
+
+	$node->safe_psql("postgres", $sql);
+
+	my $log_contents = slurp_file($log, $offset);
+
+	if (keys %$params)
+	{
+		for my $key (keys %$params)
+		{
+			$node->adjust_conf('postgresql.conf', $key, undef);
+		}
+		$node->reload;
+	}
+
+	return $log_contents;
+}
+
 # run a couple of queries
-$node->safe_psql("postgres", "SELECT * FROM pg_class;");
-$node->safe_psql("postgres",
-	"SELECT * FROM pg_proc WHERE proname = 'int4pl';");
+my $log_contents = query_log({}, "SELECT * FROM pg_class;");
 
-# emit some json too
-$node->append_conf('postgresql.conf', "auto_explain.log_format = json");
-$node->reload;
-$node->safe_psql("postgres", "SELECT * FROM pg_proc;");
-$node->safe_psql("postgres",
-	"SELECT * FROM pg_class WHERE relname = 'pg_class';");
+like(
+	$log_contents,
+	qr/Query Text: SELECT \* FROM pg_class;/,
+	"query text logged, text mode");
 
-$node->stop('fast');
-
-my $log = $node->logfile();
-
-my $log_contents = slurp_file($log);
+unlike(
+	$log_contents,
+	qr/Query Parameters:/,
+	"no query parameters logged when none, text mode");
 
 like(
 	$log_contents,
 	qr/Seq 

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim  wrote:
> Specifying the library name before the language-country code with a new 
> separator  (":") as you suggested below has some benefits. Did you consider 
> making the collation version just another collation attribute, such as 
> colStrength, colCaseLevel, etc.?
> For example, an alternate syntax might be:
>
> create collation icu63."en-US-x-icu" (provider = icu, locale = 
> 'en-US@colVersion=63');

Why would a user want to specify an ICU version in DDL? Wouldn't that
break in the event of a dump and reload of the database, for example?
It also strikes me as being inconsistent with the general philosophy
for ICU and the broader BCP47 IETF standard, which is "interpret the
locale string to the best of our ability, never throw an error".

Your proposed syntax already "works" today! You just need to create a
schema called icu63 -- then the command executes successfully (for
certain values of successfully).
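
For example (a sketch only; an ICU-enabled build is assumed, and nothing here
actually pins the collation to ICU 63, since colVersion is just an
unrecognized keyword in the locale string):

CREATE SCHEMA icu63;
CREATE COLLATION icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');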

I'm not arguing against the need for something like this. I'm just
pointing out that there are good reasons to imagine that it would
largely be an implementation detail, perhaps only used to
unambiguously identify which specific ICU version and locale string
relate to which on-disk relfilenode structure currently.

-- 
Peter Geoghegan




Re: pgcon unconference / impact of block size on performance

2022-06-09 Thread Tomas Vondra




On 6/9/22 13:23, Jakub Wartak wrote:
>>> The really
>> puzzling thing is why is the filesystem so much slower for smaller
>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
>> Why would a filesystem have such effect?
>
> Ha! I don't care at this point as 1 or 2kB seems too small to handle
> many real world scenarios ;)
>>> [..]
 Independently of that, it seems like an interesting behavior and it
 might tell us something about how to optimize for larger pages.
>>>
>>> OK, curiosity won:
>>>
>>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
>>> (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
>>> maxdepth=128] With randwrite on ext4 directio using 1kb the avgqu-sz is just
>> 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
>> something is being locked thus preventing queue to build up With randwrite on
>> ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) 
>> and
>> I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I
>> just need to split the lock contention by 4.
>>>
>>> The 1kB (slow) profile top function is aio_write() ->  -> 
>>> iov_iter_get_pages()
>> -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords
>> inside {related to memory manager, padding to full page size, inode locking}
>> also one can find some articles / commits related to it [1] which didn't leave
>> a good feeling, to be honest, as the fio is using just 1 file (even while I'm on 
>> kernel
>> 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS,
>> contention solved whatever it was :) So I would assume PostgreSQL (with it's
>> splitting data files by default on 1GB boundaries and multiprocess 
>> architecture)
>> should be relatively safe from such ext4 inode(?)/mm(?) contentions even with
>> smallest 1kb block sizes on Direct I/O some day.
>>>
>>
>> Interesting. So what parameter values would you suggest?
> 
> At least have 4x filename= entries and numjobs=4
> 
>> FWIW some of the tests I did were on xfs, so I wonder if that might be 
>> hitting
>> similar/other bottlenecks.
> 
> Apparently XFS also shows same contention on single file for 1..2kb 
> randwrite, see [ZZZ]. 
> 

I don't have any results yet, but after thinking about this a bit I find
this really strange. Why would there be any contention with a single fio
job? Doesn't contention imply multiple processes competing for the same
resource/lock etc.?

Isn't this simply due to the iodepth increase? IIUC with multiple fio
jobs, each will use a separate iodepth value. So with numjobs=4, we'll
really use iodepth*4, which can make a big difference.


>>>
>>> Explanation: it's the CPU scheduler migrations mixing the performance result
>> during the runs of fio  (as you have in your framework). Various VCPUs seem 
>> to
>> be having varying max IOPS characteristics (sic!) and CPU scheduler seems to 
>> be
>> unaware of it. At least on 1kB and 4kB blocksize this happens also notice 
>> that
>> some VCPUs [ marker] don't reach 100% CPU reaching almost twice the
>> result; while cores 0, 3 do reach 100% and lack CPU power to perform more.
>> The only thing that I don't get is that it doesn't make sense from extended 
>> lscpu
>> output (but maybe it's AWS XEN mixing real CPU mappings, who knows).
>>
>> Uh, that's strange. I haven't seen anything like that, but I'm running on 
>> physical
>> HW and not AWS, so it's either that or maybe I just didn't do the same test.
> 
> I couldn't believe it until I've checked via taskset  BTW: I don't 
> have real HW with NVMe , but we might be with worth checking if
> placing (taskset -c ...) fio on hyperthreading VCPU is not causing
> (there's /sys/devices/system/cpu/cpu0/topology/thread_siblings and
> maybe lscpu(1) output). On AWS I have feeling that lscpu might simply
> lie and I cannot identify which VCPU is HT and which isn't.

Did you see the same issue with io_uring?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: better page-level checksums

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 2:33 PM Peter Geoghegan  wrote:
> My preference is for an approach that builds on that, or at least
> doesn't significantly complicate it. So a cryptographic hash or nonce
> can go in the special area proper (structs like BTPageOpaqueData don't
> need any changes), but at a page offset before the special area proper
> -- not after.

Minor correction: I meant "before structs like BTPageOpaqueData,
earlier in the page and in the special area proper".

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 2:13 PM Robert Haas  wrote:
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
quick Google search and found that btrfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity.

Why not? The only problems that it won't solve are all related to
crypto. Which is perfectly fine, but it seems like there is a
terminology issue here. ISTM that you're really talking about adding a
cryptographic hash function, not a checksum. These are rather
different things.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't.

I believe that the page special area was designed to be
variable-sized, and even anticipates dynamic resizing of the special
area. At least in index AMs, where it's not that hard to make extra
space in the special area by shifting the tuples back, and then fixing
line pointers to point to the new offsets. So you have a dynamic
variable-sized array that's a little like a second line pointer array
(though probably not added to all that often).

My preference is for an approach that builds on that, or at least
doesn't significantly complicate it. So a cryptographic hash or nonce
can go in the special area proper (structs like BTPageOpaqueData don't
need any changes), but at a page offset before the special area proper
-- not after.

What disadvantages does that approach have, if any, from your point of view?

-- 
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-09 Thread Finnerty, Jim
Specifying the library name before the language-country code with a new 
separator  (":") as you suggested below has some benefits. Did you consider 
making the collation version just another collation attribute, such as 
colStrength, colCaseLevel, etc.?  
For example, an alternate syntax might be:  

create collation icu63."en-US-x-icu" (provider = icu, locale = 
'en-US@colVersion=63');

Was the concern that ICU might redefine a new collation property with the same 
name in a different and incompatible way (we might work with the ICU developers 
to agree on what it should be), or that a version is just not the same kind of 
collation property as the other collation properties?

(in the example above, I'm assuming that for provider = icu, we could translate 
'63' into  'libicui18n.so.63' automatically.)


On 6/8/22, 6:22 AM, "Thomas Munro"  wrote:


postgres=# create collation icu63."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.63:en-US');
CREATE COLLATION




better page-level checksums

2022-06-09 Thread Robert Haas
On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost  wrote:
> Alternatively, we could use
> that technique to just provide a better per-page checksum than what we
> have today.  Maybe we could figure out how to leverage that to move to
> 64bit transaction IDs with some page-level epoch.

I'm interested in assessing the feasibility of a "better page-level
checksums" feature. I have a few questions, and a few observations.
One of my questions is what algorithm(s) we'd want to support. I did a
quick Google search and found that btrfs supports CRC-32C, XXHASH,
SHA256, and BLAKE2B. I don't know that we want to support that many
options (but maybe we do) and I don't think CRC-32C makes any sense
here, for two reasons. First, we've already got a 16-bit checksum, and
a 32-bit checksum doesn't seem like it's gaining enough to be worth
the implementation complexity. Second, we're probably going to have to
dole out per-page space in multiples of MAXALIGN, and that's usually
8. I think for this purpose we should limit ourselves to algorithms
whose output size is, at minimum, 64 bits, and ideally, a multiple of
64 bits. I'm sure there are plenty of options other than the ones that
btrfs uses; I mentioned them only as a way of jump-starting the
discussion. Note that SHA-256 and BLAKE2B emit enormously wide
checksums (32 bytes for SHA-256, up to 64 for BLAKE2B). That's a lot
of space to consume with a
checksum, but your chances of a collision are very small indeed.

Even if we only offer one new kind of checksum, making space for a
wider checksum makes the page format variable in a way that it
currently isn't. There seem to be about 50 compile-time constants in
the source code whose values are computed based on the block size and
amount of special space in use by some particular AM (yikes!). For
example, for the heap, there's stuff like MaxHeapTuplesPerPage and
MaxHeapTupleSize. If in the future we have some pages that are just
like the ones we have today, and other clusters where we've allowed
space for a checksum, then those constants become run-time variable.
And since they're used in some very low-level functions that are
called a lot, like PageGetHeapFreeSpace(), that seems potentially
problematic. The problem is multiplied if you also think about trying
to store an epoch on each heap page, as per Stephen's proposal above,
because now every page used by any AM whatsoever might or might not
have a checksum, and every heap page might also have or not have an
epoch XID. I think it's going to be somewhat tricky to figure out a
scheme here that avoids making things slow. Originally I was thinking
that things like MaxHeapTuplesPerPage ought to become macros or static
inline functions, but now I have what I think is a better idea: make
them into global variables and initialize them with the appropriate
values for the cluster right after we read the control file. This
doesn't solve the problem if some pages are different than others,
though, and even for the case where every page in the cluster has the
same amount of reserved space, reading a global variable is not going
to be as efficient as referring to a constant compiled right into the
code. I'm hopeful that all of this is solvable somehow, but it's
hairy, for sure.

Another thing I realized is that we would probably end up with the
pd_checksum unused when this other feature is activated. If someone
comes up with a clever idea for how to allocate extra space without
needing things to be a multiple of MAXIMUM_ALIGNOF, they could
potentially shrink the space they need elsewhere by 2 bytes and then
use both that space and pd_checksum, but otherwise pd_checksum is
likely to be dead when an enhanced checksum feature is in use. Since
it's also dead when checksums are turned off, that's probably OK. I
suppose another possibility is to allow both to be turned on and off
independently, i.e. let someone have both a Fletcher-16 checksum in
pd_checksum, and also a wider checksum in this other chunk of space,
but I'm not sure whether that's really a useful thing to be able to
do. (Opinions?)
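
As a point of reference for the space accounting above, the existing 16-bit
checksum and the page's special-space boundaries can be inspected with
pageinspect (a sketch, not part of the proposal; the table name is made up
and data checksums are assumed to be enabled):

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT checksum, lower, upper, special, pagesize
FROM page_header(get_raw_page('some_table', 0));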

I'm also a little fuzzy on what the command-line interface for
selecting this functionality would look like. The existing option to
initdb is just --data-checksums, which doesn't leave any way to say
what kind of checksums you want.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Tightening behaviour for non-immutable behaviour in immutable functions

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 12:39 PM Greg Stark  wrote:
> Generally I think the idea is that the user *is* responsible for
> writing immutable functions carefully to hide any non-deterministic
> behaviour from the code they're calling. But that does raise the
> question of why to focus on search_path.
>
> I guess I'm just saying my goal isn't to *prove* the code is correct.
> The user is still responsible for asserting it's correct. I just want
> to detect cases where I can prove (or at least show it's likely that)
> it's *not* correct.

Right. It's virtually impossible to prove that, for many reasons, so
the final responsibility must lie with the user-defined code.

Presumably there is still significant value in detecting cases where
some user-defined code provably does the wrong thing. Especially by
targeting mistakes that experience has shown are relatively common.
That's what the search_path case seems like to me.

If somebody else wants to write another patch that adds on that,
great. If not, then having this much still seems useful.

-- 
Peter Geoghegan




Re: Tightening behaviour for non-immutable behaviour in immutable functions

2022-06-09 Thread Greg Stark
On Wed, 8 Jun 2022 at 19:39, Mark Dilger  wrote:
>
>
> I like the general idea, but I'm confused why you are limiting the analysis 
> to search path resolution.  The following is clearly wrong, but not for that 
> reason:
>
> create function public.identity () returns double precision as $$
>   select random()::integer;

Well I did originally think it would be necessary to consider
cases like this. (or even just cases where you call a user function
that is not immutable because of search_path dependencies).

But there are two problems:

Firstly, that would be a lot harder to implement. We don't actually do
any object lookups in plpgsql when defining plpgsql functions. So this
would be a much bigger change.

But secondly, there are a lot of cases where non-immutable functions
*are* immutable if they're used carefully. to_char() is obviously the
common example, but it's perfectly safe if you set the timezone or
other locale settings or if your format string doesn't actually depend
on any settings.

Similarly, a user function that is non-immutable only due to a
dependency on search_path *would* be safe to call from within an
immutable function if that function does set search_path. The
search_path would be inherited, alleviating the problem.
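
As a hedged sketch of that "used carefully" pattern (the function and format
string are made up), pinning search_path and avoiding setting-dependent
output looks something like this:

CREATE FUNCTION public.iso_date(ts timestamptz) RETURNS text
IMMUTABLE
SET search_path = pg_catalog
LANGUAGE sql
AS $$ SELECT to_char(ts AT TIME ZONE 'UTC', 'YYYY-MM-DD') $$;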

Even something like random() could be safely used in an immutable
function as long as it doesn't actually change the output -- say if it
just logs diagnostic messages as a result?

Generally I think the idea is that the user *is* responsible for
writing immutable functions carefully to hide any non-deterministic
behaviour from the code they're calling. But that does raise the
question of why to focus on search_path.

I guess I'm just saying my goal isn't to *prove* the code is correct.
The user is still responsible for asserting it's correct. I just want
to detect cases where I can prove (or at least show it's likely that)
it's *not* correct.

-- 
greg




Re: Checking for missing heap/index files

2022-06-09 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> We currently can check for missing heap/index files by comparing
> pg_class with the database directory files.  However, I am not clear if
> this is safe during concurrent DDL.  I assume we create the file before
> the update to pg_class is visible, but do we always delete the file
> after the update to pg_class is visible?  I assume any external checking
> tool would need to lock the relation to prevent concurrent DDL.

It'd sure be nice if an external tool (such as one trying to back up the
database..) could get this full list *without* having to run around and
lock everything.  This is because of some fun discoveries that have been
made around readdir() not always being entirely honest.  Take a look at:

https://github.com/pgbackrest/pgbackrest/issues/1754

and

https://gitlab.alpinelinux.org/alpine/aports/-/issues/10960

TL;DR: if you're removing files from a directory that you've got an
active readdir() running through, you might not actually get all of the
*existing* files.  Given that PG is happy to remove files from PGDATA
while a backup is running, in theory this could lead to a backup utility
like pgbackrest or pg_basebackup not actually backing up all the files.

Now, pgbackrest runs the readdir() very quickly to build a manifest of
all of the files to backup, minimizing the window for this to possibly
happen, but pg_basebackup keeps a readdir() open during the entire
backup, making this more possible.

> Also, how would it check if the number of extents is correct?  Seems we
> would need this value to be in pg_class, and have the same update
> protections outlined above.  Seems that would require heavier locking.

Would be nice to have but also would be expensive to maintain..

Thanks,

Stephen




Re: Checking for missing heap/index files

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 11:46 AM Bruce Momjian  wrote:
> I don't have a need for it --- I was just wondering why we have
> something that checks the relation contents, but not the file existence?

We do this for B-tree indexes within amcheck. They must always have
storage, if only to store the index metapage. (Actually unlogged
indexes that run on a standby don't, but that's accounted for
directly.)
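
Roughly, for a single index (name hypothetical), that check looks like:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('some_btree_index'::regclass);
-- an index whose underlying file is missing fails here with an I/O-level
-- error rather than a dedicated missing-file report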

-- 
Peter Geoghegan




Re: BTMaxItemSize seems to be subtly incorrect

2022-06-09 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Jun 8, 2022 at 5:55 PM Peter Geoghegan  wrote:
> > > That's a problem, because if in that scenario you allow three 2704
> > > byte items that don't need a heap TID and later you find you need to
> > > add a heap TID to one of those items, the result will be bigger than
> > > 2704 bytes, and then you can't fit three of them into a page.
> >
> > Seems you must be right. I'm guessing that the field "cabbage" was
> > originally a nonce value, as part of a draft patch you're working on?
> 
> I wasn't originally setting out to modify BTPageOpaqueData at all,
> just borrow some special space. See the "storing an explicit nonce"
> discussion and patch set. But when this regression failure turned up I
> said to myself, hmm, I think this is an unrelated bug.

I had seen something along these lines when also playing with trying to
use special space.  I hadn't had a chance to run down exactly where it
was coming from, so thanks for working on this.

> > I think that we should fix this on HEAD, on general principle. There
> > is no reason to believe that this is a live bug, so a backpatch seems
> > unnecessary.
> 
> Yeah, I agree with not back-patching the fix, unless it turns out that
> there is some platform where the same issue occurs without any
> cabbage. I assume that if it happened on any common system someone
> would have complained about it by now, so probably it doesn't. I
> suppose we could try to enumerate plausibly different values of the
> quantities involved and see if any of the combinations look like they
> lead to a bad result. I'm not really sure how many things could
> plausibly be different, though, apart from MAXIMUM_ALIGNOF.

Agreed that it doesn't seem like we'd need to backpatch this.

Thanks,

Stephen




Re: Checking for missing heap/index files

2022-06-09 Thread Bruce Momjian
On Thu, Jun 9, 2022 at 09:46:51AM -0700, Mark Dilger wrote:
>
>
> > On Jun 8, 2022, at 5:45 AM, Bruce Momjian  wrote:
> >
> > Is this something anyone has even needed or had requested?
>
> I might have put this in amcheck's verify_heapam() had there been an
> interface for it.  I vaguely recall wanting something like this, yes.
>
> As it stands, verify_heapam() may trigger mdread()'s "could not open
> file" or "could not read block" error, in the course of verifying
> the table.  There isn't an option in amcheck to just verify that
> the underlying files exist.  If struct f_smgr had a function for
> validating that all segment files exist, I may have added an option to
> amcheck (and the pg_amcheck frontend tool) to quickly look for missing
> files.

Well, how do we know what files should exist?  We know the first segment
should probably exist, but what about +1GB segments?

> Looking at smgr/md.c, it seems mdnblocks() is close to what we want,
> but it skips already opened segments "to avoid redundant seeks".
> Perhaps we'd want to add a function to f_smgr, say "smgr_allexist",
> to check for all segment files?  I'm not sure how heavy-handed the
> corresponding mdallexist() function should be.  Should it close
> all open segments, then reopen and check the size of all of them
> by calling mdnblocks()?  That seems safer than merely asking the
> filesystem if the file exists without verifying that it can be opened.

Yes.

> If we made these changes, and added corresponding quick check options
> to amcheck and pg_amcheck, would that meet your current needs?  The
> downside to using amcheck for this sort of thing is that we did not
> (and likely will not) back port it.  I have had several occasions to
> want this functionality recently, but the customers were on pre-v14
> servers, so these tools were not available anyway.

I don't have a need for it --- I was just wondering why we have
something that checks the relation contents, but not the file existence?

It seems like pg_amcheck would be a nice place to add this checking
ability.
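
As a rough sketch of the pg_class-versus-directory comparison being discussed
(default tablespace only; it ignores non-main forks, extra >1GB segments, and
other sessions' temp files, needs a role allowed to call pg_ls_dir, and, as
noted above, concurrent DDL can produce false positives without locking):

SELECT c.oid::regclass AS rel, pg_relation_filenode(c.oid) AS filenode
FROM pg_class c
WHERE c.relkind IN ('r', 'i', 't', 'm', 'S')
  AND c.reltablespace = 0
  AND pg_relation_filenode(c.oid) IS NOT NULL
  AND NOT EXISTS (
      SELECT 1
      FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                                 WHERE datname = current_database())) AS d(fname)
      WHERE d.fname = pg_relation_filenode(c.oid)::text);
-- any row returned is a relation whose main-fork file was not seen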

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
 wrote:
> MySQL did the right thing here by doing what every other RDBMS did, and just 
> making a simple “good-enough” collation hardcoded in the DB, same across all 
> platforms, that never changes.

That's not true. Both SQL Server and DB2 have some notion of
collations that are versioned.

Oracle may not, but then Oracle also handles collations by indexing
strxfrm() blobs, with all of the obvious downsides that that entails
(far larger indexes, issues with index-only scans). That seems like an
excellent example of what not to do.

-- 
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
 wrote:
> I’m probably just going to end up rehashing the old threads I haven’t read 
> yet…
>
> One challenge with this approach is you have things like sort-merge joins 
> that require the same collation across multiple objects. So I think you’d 
> need to keep all the old indexes around until you have new indexes available 
> for all objects in a database, and somehow the planner would need to be smart 
> enough to dynamically figure out old vs new versions on a query-by-query 
> basis.

I don't think that it would be fundamentally difficult to have the
planner deal with collations at the level required to avoid incorrect
query plans.

I'm not suggesting that this is an easy project, or that the end
result would be totally free of caveats, such as the issue with merge
joins. I am only suggesting that something like this seems doable.
There aren't that many distinct high level approaches that could
possibly decouple upgrading Postgres/the OS from reindexing. This is
one.

> And my opinion is that the problems caused by depending on OS libraries for 
> collation need to be addressed on a shorter timeline than what’s realistic 
> for inventing a new way for a relational database to offer transparent or 
> online upgrades of linguistic collation versions.

But what does that really mean? You can use ICU collations as the
default for the entire cluster now. Where do we still fall short? Do
you mean that there is still a question of actively encouraging using
ICU collations?

I don't understand what you're arguing for. Literally everybody agrees
that the current status quo is not good. That much seems settled to
me.

> Also I still think folks are overcomplicating this by focusing on linguistic 
> collation as the solution.

I don't think that's true; I think that everybody understands that
being on the latest linguistic collation is only very rarely a
compelling feature. The whole way that BCP47 tags are so forgiving is
entirely consistent with that view of things.

But what difference does it make? As long as you accept that any
collation *might* need to be updated, or the default ICU version might
change on OS upgrade, then you have to have some strategy for dealing
with the transition. Not being on a very old obsolete version of ICU
will eventually become a "compelling feature" in its own right.

I believe that EDB adopted ICU many years ago, and stuck with one
vendored version for quite a few years. And eventually being on a very
old version of ICU became a real problem.

-- 
Peter Geoghegan




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2022-06-09 Thread Soumyadeep Chakraborty
On Wed, May 18, 2022 at 6:26 PM Zhihong Yu  wrote:

> +   accessMethodId = ((Form_pg_class) GETSTRUCT(tup))->relam;
>
> -   /* look up the access method, verify it is for a table */
> -   if (accessMethod != NULL)
> -   accessMethodId = get_table_am_oid(accessMethod, false);
> +   if (!HeapTupleIsValid(tup))
> +   elog(ERROR, "cache lookup failed for relation %u", relid);
>
> The validity check of tup should be done before fetching the value of
relam field.

Thanks. Fixed and rebased.

Regards,
Soumyadeep (VMware)
From a9fc146192430aa45224fb1e64bd95e0bb64fc00 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Thu, 9 Jun 2022 10:57:35 -0700
Subject: [PATCH v3 1/2] Allow ATSETAM on partition roots

Setting the access method on partition roots was disallowed. This commit
relaxes that restriction.

Summary of changes over original patch:

* Rebased
* Minor comment updates
* Add more test coverage

Authors: Justin Pryzby , Soumyadeep Chakraborty

---
 src/backend/catalog/heap.c  |   3 +-
 src/backend/commands/tablecmds.c| 103 +++-
 src/backend/utils/cache/relcache.c  |   4 +
 src/test/regress/expected/create_am.out |  50 +++-
 src/test/regress/sql/create_am.sql  |  23 +++---
 5 files changed, 129 insertions(+), 54 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 1803194db94..4561f3b8c98 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1449,7 +1449,8 @@ heap_create_with_catalog(const char *relname,
 		 * No need to add an explicit dependency for the toast table, as the
 		 * main table depends on it.
 		 */
-		if (RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE)
+		if ((RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE) ||
+			relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			ObjectAddressSet(referenced, AccessMethodRelationId, accessmtd);
 			add_exact_object_address(&referenced, addrs);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2de0ebacec3..9a5eabcac49 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -570,6 +570,7 @@ static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName,
 	 LOCKMODE lockmode);
 static void ATExecDropCluster(Relation rel, LOCKMODE lockmode);
 static void ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname);
+static void ATExecSetAccessMethodNoStorage(Relation rel, Oid newAccessMethod);
 static bool ATPrepChangePersistence(Relation rel, bool toLogged);
 static void ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel,
 const char *tablespacename, LOCKMODE lockmode);
@@ -676,7 +677,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	Oid			ofTypeId;
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
-	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
 
 	/*
@@ -938,20 +938,32 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	 * a type of relation that needs one, use the default.
 	 */
 	if (stmt->accessMethod != NULL)
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	else if (stmt->partbound && relkind == RELKIND_RELATION)
 	{
-		accessMethod = stmt->accessMethod;
+		HeapTuple   tup;
+		Oidrelid;
 
-		if (partitioned)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("specifying a table access method is not supported on a partitioned table")));
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind))
-		accessMethod = default_table_access_method;
+		/*
+		 * For partitioned tables, when no access method is specified, we
+		 * default to the parent table's AM.
+		 */
+		Assert(list_length(inheritOids) == 1);
+		/* XXX: should implement get_rel_relam? */
+		relid = linitial_oid(inheritOids);
+		tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+		if (!HeapTupleIsValid(tup))
+			elog(ERROR, "cache lookup failed for relation %u", relid);
 
-	/* look up the access method, verify it is for a table */
-	if (accessMethod != NULL)
-		accessMethodId = get_table_am_oid(accessMethod, false);
+		accessMethodId = ((Form_pg_class) GETSTRUCT(tup))->relam;
+
+		ReleaseSysCache(tup);
+
+		if (!OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+		accessMethodId = get_table_am_oid(default_table_access_method, false);
 
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
@@ -4693,13 +4705,6 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			break;
 		case AT_SetAccessMethod:	/* SET ACCESS METHOD */
 			ATSimplePermissions(cmd->subtype, rel, ATT_TABLE | ATT_MATVIEW);
-
-			/* partitioned tables don't have an access method */
-			if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-ereport(ERROR,
-		
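
In SQL terms, the behavior the patch aims for is roughly the following (names
hypothetical; assumes the patch is applied):

CREATE TABLE parts (id int) PARTITION BY RANGE (id);
ALTER TABLE parts SET ACCESS METHOD heap;   -- previously rejected on partitioned tables
CREATE TABLE parts_1 PARTITION OF parts FOR VALUES FROM (1) TO (1000);
-- with no explicit access method, the new partition now defaults to the
-- parent's AM rather than default_table_access_method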

Re: BTMaxItemSize seems to be subtly incorrect

2022-06-09 Thread Robert Haas
On Thu, Jun 9, 2022 at 1:39 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 6:40 AM Robert Haas  wrote:
> > Are you going to code up a patch?
>
> I can, but feel free to fix it yourself if you prefer. Your analysis
> seems sound.

I think I'd feel more comfortable if you wrote the patch, if that's possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Collation version tracking for macOS

2022-06-09 Thread Jeremy Schneider


> On Jun 8, 2022, at 22:40, Peter Geoghegan  wrote:
> 
> On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
>  wrote:
>> Even if PG supports two versions of ICU, how does someone actually go about 
>> removing every dependency on the old version and replacing it with the new?
> 
> They simply REINDEX, without changing anything. The details are still
> fuzzy, but at least that's what I was thinking of.
> 
>> Can it be done without downtime? Can it be done without modifying a running 
>> application?
> 
> Clearly the only way that we can ever transition to a new "physical
> collation" is by reindexing using a newer ICU version. And clearly
> there is going to be a need to fully deprecate any legacy version of
> ICU on a long enough timeline. There is just no getting around that.


I’m probably just going to end up rehashing the old threads I haven’t read yet…

One challenge with this approach is you have things like sort-merge joins that 
require the same collation across multiple objects. So I think you’d need to 
keep all the old indexes around until you have new indexes available for all 
objects in a database, and somehow the planner would need to be smart enough to 
dynamically figure out old vs new versions on a query-by-query basis. May need 
an atomic database-wide cutover; running a DB with internally mixed collation 
versions doesn’t seem like a small challenge. It would require enough disk 
space for two copies of all indexes, and queries would change which indexes 
they use in a way that wouldn’t be immediately obvious to users or app dev. 
Suddenly switching to or from a differently-bloated index could result in 
confusing and sudden performance changes.

Also there would still need to be a plan to address all the other non-index 
objects where collation is used, as has been mentioned before.

And given the current architecture, that final “alter database update default 
collation” command still seems awful risky, bug-prone and difficult to get 
correct. At least it seems that way to me.

At a minimum, this is a very big project and it seems to me like it may be wise 
to get more end-to-end fleshing out of the plans before committing incremental 
pieces in core (which could end up being misguided if the plan doesn’t work as 
well as assumed). Definitely doesn’t seem to me like anything that will happen 
in a year or two.

And my opinion is that the problems caused by depending on OS libraries for 
collation need to be addressed on a shorter timeline than what’s realistic for 
inventing a new way for a relational database to offer transparent or online 
upgrades of linguistic collation versions.

Also I still think folks are overcomplicating this by focusing on linguistic 
collation as the solution. Like 1% of users actually need or care about having 
the latest technically correct local-language-based sorting, at a database 
level. MySQL did the right thing here by doing what every other RDBMS did, and 
just making a simple “good-enough” collation hardcoded in the DB, same across 
all platforms, that never changes.

The 1% of users who need true linguistic collation can probably deal with the 
trade-off of dump-and-load upgrades for their ICU indexes and databases for a 
few more years.

-Jeremy


Sent from my TI-83





Re: BTMaxItemSize seems to be subtly incorrect

2022-06-09 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 6:40 AM Robert Haas  wrote:
> Are you going to code up a patch?

I can, but feel free to fix it yourself if you prefer. Your analysis
seems sound.

-- 
Peter Geoghegan




Re: Checking for missing heap/index files

2022-06-09 Thread Mark Dilger



> On Jun 8, 2022, at 5:45 AM, Bruce Momjian  wrote:
> 
> Is this something anyone has even needed or had requested?

I might have put this in amcheck's verify_heapam() had there been an interface 
for it.  I vaguely recall wanting something like this, yes.

As it stands, verify_heapam() may trigger mdread()'s "could not open file" or 
"could not read block" error, in the course of verifying the table.  There 
isn't an option in amcheck to just verify that the underlying files exist.  If 
struct f_smgr had a function for validating that all segment files exist, I may 
have added an option to amcheck (and the pg_amcheck frontend tool) to quickly 
look for missing files.

Looking at smgr/md.c, it seems mdnblocks() is close to what we want, but it 
skips already opened segments "to avoid redundant seeks".  Perhaps we'd want to 
add a function to f_smgr, say "smgr_allexist", to check for all segment files?  
I'm not sure how heavy-handed the corresponding mdallexist() function should 
be.  Should it close all open segments, then reopen and check the size of all 
of them by calling mdnblocks()?  That seems safer than merely asking the 
filesystem if the file exists without verifying that it can be opened.

If we made these changes, and added corresponding quick check options to 
amcheck and pg_amcheck, would that meet your current needs?  The downside to 
using amcheck for this sort of thing is that we did not (and likely will not) 
back port it.  I have had several occasions to want this functionality 
recently, but the customers were on pre-v14 servers, so these tools were not 
available anyway.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-09 Thread Laurenz Albe
On Thu, 2022-06-09 at 21:55 +0900, Etsuro Fujita wrote:
> On Thu, Jun 9, 2022 at 9:49 AM Laurenz Albe  wrote:
> > On Wed, 2022-06-08 at 19:06 +0900, Etsuro Fujita wrote:
> > > On Wed, Jun 8, 2022 at 2:51 PM Kyotaro Horiguchi 
> > >  wrote:
> > > > At Wed, 08 Jun 2022 07:05:09 +0200, Laurenz Albe 
> > > >  wrote in
> > > > > diff --git a/doc/src/sgml/postgres-fdw.sgml 
> > > > > b/doc/src/sgml/postgres-fdw.sgml
> > > > > index b43d0aecba..b4b7e36d28 100644
> > > > > --- a/doc/src/sgml/postgres-fdw.sgml
> > > > > +++ b/doc/src/sgml/postgres-fdw.sgml
> > > > > @@ -274,6 +274,14 @@ OPTIONS (ADD password_required 'false');
> > > > >     but only for that table.
> > > > >     The default is false.
> > > > >    
> > > > > +
> > > > > +  
> > > > > +   Note that EXPLAIN will be run on the 
> > > > > remote server
> > > > > +   at query planning time, before 
> > > > > permissions on the
> > > > > +   foreign table are checked.  This is not a security problem, 
> > > > > since the
> > > > > +   subsequent error from the permission check will prevent the 
> > > > > user from
> > > > > +   seeing any of the resulting data.
> > > > > +  
> > > > >   
> > > > >  
> > > > 
> > > > Looks fine.  I'd like to add something like "If needed, depriving
> > > > unprivileged users of relevant user mappings will prevent such remote
> > > > executions that happen at planning-time."
> > > 
> > > I agree on that point; if the EXPLAIN done on the remote side is
> > > really a problem, I think the user should revoke privileges from the
> > > remote user specified in the user mapping, to prevent it.  I’d rather
> > > recommend granting to the remote user privileges consistent with those
> > > granted to the local user.
> > 
> > I don't think that is better.  Even if the local and remote privileges are
> > consistent, you will get an error from the *remote* table access when trying
> > to use a foreign table on which you don't have permissions.
> > The above paragraph describes why.
> > Note that the original complaint against oracle_fdw that led to this thread
> > was just such a case.
> 
> I thought you were worried about security, so I thought that that
> would be a good practice because that would reduce such risks, but I
> got the point.  However, I'm not 100% sure we really need to document
> something about this, because 1) this doesn't cause any actual
> problems, as you described, and 2) this is a pretty-exceptional case
> IMO.

I am not sure if it is worth adding to the documentation.  I would never have 
thought
of the problem if Phil hadn't brought it up.  On the other hand, I was surprised
to learn that permissions aren't checked until the executor kicks in.
It makes sense, but some documentation might help others in that situation.
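
For concreteness, a minimal sketch of the scenario (server, table, and role
names are made up; use_remote_estimate is needed so that planning issues the
remote EXPLAIN):

ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'true');
CREATE USER MAPPING FOR alice SERVER loopback OPTIONS (user 'remote_user');
SET ROLE alice;
SELECT * FROM ft;   -- alice has no local privilege on ft
-- this fails locally with a permission-denied error, but the remote EXPLAIN
-- used for cost estimation has already been sent during planning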

I'll gladly leave the decision to your judgement as a committer.

Yours,
Laurenz Albe




doc: Move enum storage commentary to top of section

2022-06-09 Thread David G. Johnston
Per suggestion over on -docs:

https://www.postgresql.org/message-id/bl0pr06mb4978f6c0b69f3f03aebed0fbb3...@bl0pr06mb4978.namprd06.prod.outlook.com

David J.


0001-doc-Move-enum-storage-size-to-top-of-section.patch
Description: Binary data


doc: Make selectivity example match wording

2022-06-09 Thread David G. Johnston
Hi,

Reposting this to its own thread.

https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com

doc: make unique non-null join selectivity example match the prose

The description of the computation for the unique, non-null,
join selectivity describes a division by the maximum of two values,
while the example shows a multiplication by their reciprocal.  While
equivalent, the max phrasing is easier to understand, which seems
more important here than precisely adhering to the formula used
in the code (for which either variant is still an approximation).

While both num_distinct and num_rows are equal for a unique column,
both the concept and the formula use the row count (10,000), and the
field num_distinct has already been set to mean the specific value
present in the pg_stats table (i.e., -1), so use num_rows here.

David J.


0001-doc-Make-selectivity-example-match-text.patch
Description: Binary data


doc: Fix description of how the default user name is chosen

2022-06-09 Thread David G. Johnston
Hi.

Reposting this on its own thread.

https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com

The default database name is just the user name, not the
operating-system user name.

In passing, the authentication error examples use the phrase
"database user name" in a couple of locations.  The word
database in both cases is both unusual and unnecessary for
understanding.  The reference to user name means the one in/for the
database unless otherwise specified.

Furthermore, it seems better to tell the reader the likely
reason why the displayed database name happens to be a user name.

This change is probably optional but I think it makes sense:
-The indicated database user name was not found.
+The indicated user name was not found.

The other changes simply try to avoid the issue altogether.

David J.


0001-doc-Clarify-wording-around-default-database-name-and.patch
Description: Binary data


doc: Clarify Routines and Extension Membership

2022-06-09 Thread David G. Johnston
Hi.

Reposting this on its own thread.

https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com

Per discussion on -general the documentation for the
ALTER ROUTINE ... DEPENDS ON EXTENSION and DROP EXTENSION doesn't
clearly indicate that these dependent routines are treated in a
similar manner to the extension's owned objects when it comes to
using RESTRICT mode drop: namely their presence doesn't force
the drop command to abort.  Clear that up.
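
The behavior being clarified, as a sketch (cube is just a convenient contrib
extension; the function is made up):

CREATE EXTENSION cube;
CREATE FUNCTION f() RETURNS int LANGUAGE sql AS $$ SELECT 1 $$;
ALTER ROUTINE f() DEPENDS ON EXTENSION cube;
DROP EXTENSION cube RESTRICT;
-- succeeds: the dependent routine does not make RESTRICT fail; it is dropped
-- along with the extension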

David J.


0001-doc-Clarify-the-behavior-of-routines-and-explict-ext.patch
Description: Binary data


Re: doc: Clarify Savepoint Behavior

2022-06-09 Thread David G. Johnston
On Thu, Jun 9, 2022 at 8:36 AM David G. Johnston 
wrote:

> Hi,
>
> Reposting this on its own thread.
>
>
> https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com
>
> Presently, the open item seems to be whether my novelty regarding the
> reworked example is too much.
>
>
Commentary:

Per documentation comment the savepoint command lacks an example
where the savepoint name is reused.  The suggested example didn't
conform to the others on the page, nor did the suggested location
in compatibility seem desirable, but the omission rang true. Add
another example to the examples section demonstrating this case.
Additionally, document under the description for savepoint_name
that we allow for the name to be repeated - and note what that
means in terms of release and rollback. It seems desirable to
place this comment in description rather than notes for savepoint.
For the other two commands the behavior in the presence of
duplicate savepoint names best fits as notes.  In fact release
already had one.  This commit copies the same verbiage over to
rollback.
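
The reuse case, sketched in the same style as the existing examples (table t1
is assumed to exist):

BEGIN;
INSERT INTO t1 VALUES (1);
SAVEPOINT sp;
INSERT INTO t1 VALUES (2);
SAVEPOINT sp;                   -- reusing the name establishes a newer savepoint
INSERT INTO t1 VALUES (3);
ROLLBACK TO SAVEPOINT sp;       -- undoes only the third insert
RELEASE SAVEPOINT sp;           -- releases the newer savepoint; the older sp is reachable again
COMMIT;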

David J.


doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-06-09 Thread David G. Johnston
Hi,

Reposting this on its own thread.

https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com

Since one cannot place excluded in a FROM clause (subquery) within
the ON CONFLICT clause, referring to it as a table, with plural rows
no less, leads the reader to infer more about what the behavior here
is than is correct. We already say to use the table's name for the
existing row, so just match that pattern by using the name excluded
for the proposed row.

The alias description doesn't have the same issue regarding the
use of the word table and rows, as the use there is more conceptual,
but the wording about "otherwise taken as" is wrong: rather two
labels of excluded end up in scope and you get an ambiguous name error.

The error messages still consider excluded to be a table reference
and this patch does not try to change that.  That implementation
detail need not force the user-facing documentation for the feature
to use the term table when it doesn't really apply.
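
For reference, the usage the wording is about (table and columns are made up):

INSERT INTO tab (id, val) VALUES (1, 'new value')
ON CONFLICT (id) DO UPDATE SET val = excluded.val
WHERE tab.val IS DISTINCT FROM excluded.val;
-- "excluded" is simply the name through which the proposed row is referenced;
-- it is not a table that could appear in a FROM clause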

David J.


0001-doc-Clarify-that-excluded-is-really-just-a-special-n.patch
Description: Binary data


doc: Clarify Savepoint Behavior

2022-06-09 Thread David G. Johnston
Hi,

Reposting this on its own thread.

https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com

Presently, the open item seems to be whether my novelty regarding the
reworked example is too much.

David J.


0001-doc-Clarify-Savepoint-behavior-around-repeated-name.patch
Description: Binary data


Re: broken regress tests on fedora 36

2022-06-09 Thread Tom Lane
Peter Eisentraut  writes:
> Shouldn't we reset the locale setting (LC_NUMERIC?) to a known value? 
> We clearly already do that for other categories, or it wouldn't say "Time:".

pg_regress.c and Utils.pm force LC_MESSAGES to C, explaining

 * Set translation-related settings to English; otherwise psql will
 * produce translated messages and produce diffs.

While that seems clearly necessary, I'm inclined to think that we
should not mess with the user's LC_XXX environment more than we
absolutely must.  pg_regress only resets the rest of that if you
say --no-locale, an option the TAP infrastructure lacks.

In short, I think the committed fix is better than this proposal.

regards, tom lane




[no subject]

2022-06-09 Thread Tom Lane
Peter Eisentraut  writes:
> Initially, that chapter did not document any system views.

Maybe we could make the system views a separate chapter?

regards, tom lane




Re: broken regress tests on fedora 36

2022-06-09 Thread Peter Eisentraut

On 07.06.22 14:56, Michael Paquier wrote:

On Tue, Jun 07, 2022 at 10:52:45AM +0200, Pavel Stehule wrote:

#   Failed test '\timing with query error: timing output appears'
#   at t/001_basic.pl line 95.
#   'Time: 0,293 ms'
# doesn't match '(?^m:^Time: \d+\.\d\d\d ms)'
# Looks like you failed 2 tests of 58.


Fun.  The difference is in the separator: dot vs comma.  This should
fail with French the same way.  Perhaps it would fail differently in
other languages?  There is no need to be that precise with the regex
IMO, so I would just cut the regex with the number, checking only the
unit at the end.


Shouldn't we reset the locale setting (LC_NUMERIC?) to a known value? 
We clearly already do that for other categories, or it wouldn't say "Time:".





Re: BTMaxItemSize seems to be subtly incorrect

2022-06-09 Thread Robert Haas
On Wed, Jun 8, 2022 at 7:44 PM Peter Geoghegan  wrote:
> FWIW I don't see much difference between borrowing special space and
> adding something to BTPageOpaqueData. If anything I'd prefer the
> latter.

I think this discussion will take us too far afield from the topic of
this thread, so I'll just say here that wouldn't solve the problem I
was trying to tackle.

> Here's why: BTMaxItemSizeNoHeapTid() is actually what BTMaxItemSize()
> looked like prior to Postgres 12. So the limit on internal pages never
> changed, even in Postgres 12. There was no separate leaf page limit
> prior to 12. Only the rules on the leaf level ever really changed.

Yeah, I noticed that, too.

Are you going to code up a patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Replica Identity check of partition table on subscriber

2022-06-09 Thread Amit Langote
Hi Amit,

On Thu, Jun 9, 2022 at 8:02 PM Amit Kapila  wrote:
> On Wed, Jun 8, 2022 at 2:17 PM shiy.f...@fujitsu.com
>  wrote:
> > I saw a problem in logical replication, when the target table on subscriber 
> > is a
> > partitioned table, it only checks whether the Replica Identity of 
> > partitioned
> > table is consistent with the publisher, and doesn't check Replica Identity 
> > of
> > the partition.
> ...
> >
> > It caused an assertion failure on subscriber:
> > TRAP: FailedAssertion("OidIsValid(idxoid) || (remoterel->replident == 
> > REPLICA_IDENTITY_FULL)", File: "worker.c", Line: 2088, PID: 1616523)
> >
> > The backtrace is attached.
> >
> > We got the assertion failure because idxoid is invalid, as table child has 
> > no
> > Replica Identity or Primary Key. We have a check in 
> > check_relation_updatable(),
> > but what it checked is table tbl (the parent table) and it passed the check.
> >
>
> I can reproduce the problem. This seems to be the problem since commit
> f1ac27bf (Add logical replication support to replicate into
> partitioned tables), so adding Amit L. and Peter E.

Thanks, I can see the problem.

I have looked at other mentioned problems with the code too and agree
they look like bugs.

Both patches look to be on the right track to fix the issues, but will
look more closely to see if I've anything to add.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re:

2022-06-09 Thread Peter Eisentraut

On 09.06.22 01:29, Peter Smith wrote:

OTOH it looks like this table of contents page has been this way
forever (20+ years?). It is hard to believe nobody else suggested
modifying it in all that time, so perhaps there is some reason for it
being like it is?


Initially, that chapter did not document any system views.




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-06-09 Thread Robert Haas
On Wed, Jun 8, 2022 at 7:53 PM Jacob Champion  wrote:
> But I don't have any better ideas for how to achieve both. I'm fine
> with your suggestion of ClientConnectionInfo, if that sounds good to
> others; the doc comment can clarify why it differs from Port? Or add
> one of the Shared-/Gang-/Group- prefixes to it, maybe?

I don't like the prefixes, so I'd prefer explaining it in the struct comment.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-09 Thread Etsuro Fujita
On Thu, Jun 9, 2022 at 9:49 AM Laurenz Albe  wrote:
> On Wed, 2022-06-08 at 19:06 +0900, Etsuro Fujita wrote:
> > On Wed, Jun 8, 2022 at 2:51 PM Kyotaro Horiguchi  
> > wrote:
> > > At Wed, 08 Jun 2022 07:05:09 +0200, Laurenz Albe 
> > >  wrote in
> > > > diff --git a/doc/src/sgml/postgres-fdw.sgml 
> > > > b/doc/src/sgml/postgres-fdw.sgml
> > > > index b43d0aecba..b4b7e36d28 100644
> > > > --- a/doc/src/sgml/postgres-fdw.sgml
> > > > +++ b/doc/src/sgml/postgres-fdw.sgml
> > > > @@ -274,6 +274,14 @@ OPTIONS (ADD password_required 'false');
> > > > but only for that table.
> > > > The default is false.
> > > >
> > > > +
> > > > +  
> > > > +   Note that EXPLAIN will be run on the remote 
> > > > server
> > > > +   at query planning time, before permissions 
> > > > on the
> > > > +   foreign table are checked.  This is not a security problem, 
> > > > since the
> > > > +   subsequent error from the permission check will prevent the 
> > > > user from
> > > > +   seeing any of the resulting data.
> > > > +  
> > > >   
> > > >  
> > >
> > > Looks fine.  I'd like to add something like "If needed, depriving
> > > unprivileged users of relevant user mappings will prevent such remote
> > > executions that happen at planning-time."
> >
> > I agree on that point; if the EXPLAIN done on the remote side is
> > really a problem, I think the user should revoke privileges from the
> > remote user specified in the user mapping, to prevent it.  I’d rather
> > recommend granting to the remote user privileges consistent with those
> > granted to the local user.
>
> I don't think that is better.  Even if the local and remote privileges are
> consistent, you will get an error from the *remote* table access when trying
> to use a foreign table on which you don't have permissions.
> The above paragraph describes why.
> Note that the original complaint against oracle_fdw that led to this thread
> was just such a case.

I thought you were worried about security, so I thought that that
would be a good practice because that would reduce such risks, but I
got the point.  However, I'm not 100% sure we really need to document
something about this, because 1) this doesn't cause any actual
problems, as you described, and 2) this is a pretty-exceptional case
IMO.

Best regards,
Etsuro Fujita




Re: Multi-Master Logical Replication

2022-06-09 Thread Bharath Rupireddy
On Thu, Apr 28, 2022 at 5:20 AM Peter Smith  wrote:
>
> MULTI-MASTER LOGICAL REPLICATION
>
> 1.0 BACKGROUND
>
> Let’s assume that a user wishes to set up a multi-master environment
> so that a set of PostgreSQL instances (nodes) use logical replication
> to share tables with every other node in the set.
>
> We define this as a multi-master logical replication (MMLR) node-set.
>
> 
>
> 1.1 ADVANTAGES OF MMLR
>
> - Increases write scalability (e.g., all nodes can write arbitrary data).
> - Allows load balancing
> - Allows rolling updates of nodes (e.g., logical replication works
> between different major versions of PostgreSQL).
> - Improves the availability of the system (e.g., no single point of failure)
> - Improves performance (e.g., lower latencies for geographically local nodes)

Thanks for working on this proposal. I have a few high-level thoughts,
please bear with me if I repeat any of them:

1. Are you proposing that the logical replication subscribers be in a
sync quorum? In other words, in a configuration with N masters among
M nodes (M >= N), will each master be part of the sync quorum on
every other master?
2. Is there any mention of reducing the latencies that logical
replication will have generally (initial table sync and
after-caught-up decoding and replication latencies)?
3. What if "some" postgres provider assures an SLA of very few seconds
for failovers in typical HA set up with primary and multiple sync and
async standbys? In this context, where does the multi-master
architecture sit in the broad range of postgres use-cases?
4. Can the design proposed here be implemented as an extension instead
of a core postgres solution?
5. Why should one use logical replication for multi master
replication? If logical replication is used, isn't it going to be
something like logically decode and replicate every WAL record from
one master to all other masters? Instead, can't it be achieved via
streaming/physical replication?

Regards,
Bharath Rupireddy.




RE: pgcon unconference / impact of block size on performance

2022-06-09 Thread Jakub Wartak
> >  The really
>  puzzling thing is why is the filesystem so much slower for smaller
>  pages. I mean, why would writing 1K be 1/3 of writing 4K?
>  Why would a filesystem have such effect?
> >>>
> >>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
> >>> many real world scenarios ;)
> > [..]
> >> Independently of that, it seems like an interesting behavior and it
> >> might tell us something about how to optimize for larger pages.
> >
> > OK, curiosity won:
> >
> > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
> > (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
> > maxdepth=128] With randwrite on ext4 directio using 1kb the avgqu-sz is just
> 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
> something is being locked thus preventing queue to build up With randwrite on
> ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) and
> I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I
> just need to split the lock contention by 4.
> >
> > The 1kB (slow) profile top function is aio_write() ->  -> 
> > iov_iter_get_pages()
> -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords
> inside {related to memory manager, padding to full page size, inode locking}
> also one can find some articles / commits related to it [1] which didn't made 
> a
> good feeling to be honest as the fio is using just 1 file (even while I'm on 
> kernel
> 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS,
> contention solved whatever it was :) So I would assume PostgreSQL (with it's
> splitting data files by default on 1GB boundaries and multiprocess 
> architecture)
> should be relatively safe from such ext4 inode(?)/mm(?) contentions even with
> smallest 1kb block sizes on Direct I/O some day.
> >
> 
> Interesting. So what parameter values would you suggest?

At least have 4x filename= entries and numjobs=4
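
Roughly something like the job file below (a minimal sketch only -- the
paths, size and runtime are just illustrative), i.e. four job sections,
each writing to its own file, which gives four concurrent jobs
(equivalent to numjobs=4 spread over separate files) so the per-file
contention gets split:

    [global]
    ioengine=libaio
    direct=1
    rw=randwrite
    bs=1k
    iodepth=128
    size=1g
    runtime=30
    time_based
    group_reporting

    [f1]
    filename=/mnt/nvme/fio1.dat

    [f2]
    filename=/mnt/nvme/fio2.dat

    [f3]
    filename=/mnt/nvme/fio3.dat

    [f4]
    filename=/mnt/nvme/fio4.dat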

> FWIW some of the tests I did were on xfs, so I wonder if that might be hitting
> similar/other bottlenecks.

Apparently XFS also shows the same contention on a single file for 1..2kB randwrite; see [ZZZ].

[root@x ~]# mount|grep /mnt/nvme
/dev/nvme0n1 on /mnt/nvme type xfs 
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

# using 1 fio job and 1 file
[root@x ~]# grep -r -e 'read :' -e 'write:' libaio
libaio/nvme/randread/128/1k/1.txt:  read : io=5779.1MB, bw=196573KB/s, iops=196573, runt= 30109msec
libaio/nvme/randread/128/2k/1.txt:  read : io=10335MB, bw=352758KB/s, iops=176379, runt= 30001msec
libaio/nvme/randread/128/4k/1.txt:  read : io=0MB, bw=758408KB/s, iops=189601, runt= 30001msec
libaio/nvme/randread/128/8k/1.txt:  read : io=28914MB, bw=986896KB/s, iops=123361, runt= 30001msec
libaio/nvme/randwrite/128/1k/1.txt:  write: io=694856KB, bw=23161KB/s, iops=23161, runt= 30001msec [ZZZ]
libaio/nvme/randwrite/128/2k/1.txt:  write: io=1370.7MB, bw=46782KB/s, iops=23390, runt= 30001msec [ZZZ]
libaio/nvme/randwrite/128/4k/1.txt:  write: io=8261.3MB, bw=281272KB/s, iops=70318, runt= 30076msec [OK]
libaio/nvme/randwrite/128/8k/1.txt:  write: io=11598MB, bw=394320KB/s, iops=49289, runt= 30118msec

# but it's all ok using 4 fio jobs and 4 files
[root@x ~]# grep -r -e 'read :' -e 'write:' libaio
libaio/nvme/randread/128/1k/1.txt:  read : io=6174.6MB, bw=210750KB/s, iops=210750, runt= 30001msec
libaio/nvme/randread/128/2k/1.txt:  read : io=12152MB, bw=413275KB/s, iops=206637, runt= 30110msec
libaio/nvme/randread/128/4k/1.txt:  read : io=24382MB, bw=832116KB/s, iops=208028, runt= 30005msec
libaio/nvme/randread/128/8k/1.txt:  read : io=29281MB, bw=985831KB/s, iops=123228, runt= 30415msec
libaio/nvme/randwrite/128/1k/1.txt:  write: io=1692.2MB, bw=57748KB/s, iops=57748, runt= 30003msec
libaio/nvme/randwrite/128/2k/1.txt:  write: io=3601.9MB, bw=122940KB/s, iops=61469, runt= 30001msec
libaio/nvme/randwrite/128/4k/1.txt:  write: io=8470.8MB, bw=285857KB/s, iops=71464, runt= 30344msec
libaio/nvme/randwrite/128/8k/1.txt:  write: io=11449MB, bw=390603KB/s, iops=48825, runt= 30014msec
 

> >>> Both scenarios (raw and fs) have had direct=1 set. I just cannot
> >>> understand
> >> how having direct I/O enabled (which disables caching) achieves
> >> better read IOPS on ext4 than on raw device... isn't it contradiction?
> >>>
> >>
> >> Thanks for the clarification. Not sure what might be causing this.
> >> Did you use the same parameters (e.g. iodepth) in both cases?
> >
> > Explanation: it's the CPU scheduler migrations mixing the performance result
> during the runs of fio  (as you have in your framework). Various VCPUs seem to
> be having varying max IOPS characteristics (sic!) and CPU scheduler seems to 
> be
> unaware of it. At least on 1kB and 4kB blocksize this happens also notice that
> some VCPUs [ marker] don't reach 100% CPU reaching almost twice the
> result; while cores 0, 3 do reach 100% and lack CPU 

Re: Replica Identity check of partition table on subscriber

2022-06-09 Thread Amit Kapila
On Wed, Jun 8, 2022 at 2:17 PM shiy.f...@fujitsu.com
 wrote:
>
> I saw a problem in logical replication, when the target table on subscriber 
> is a
> partitioned table, it only checks whether the Replica Identity of partitioned
> table is consistent with the publisher, and doesn't check Replica Identity of
> the partition.
>
...
>
> It caused an assertion failure on subscriber:
> TRAP: FailedAssertion("OidIsValid(idxoid) || (remoterel->replident == 
> REPLICA_IDENTITY_FULL)", File: "worker.c", Line: 2088, PID: 1616523)
>
> The backtrace is attached.
>
> We got the assertion failure because idxoid is invalid, as table child has no
> Replica Identity or Primary Key. We have a check in 
> check_relation_updatable(),
> but what it checked is table tbl (the parent table) and it passed the check.
>

I can reproduce the problem. It seems to have existed since commit
f1ac27bf (Add logical replication support to replicate into
partitioned tables), so adding Amit L. and Peter E.

> I think one approach to fix it is to check the target partition in this case,
> instead of the partitioned table.
>

This approach sounds reasonable to me. One minor point:
+/*
+ * Check that replica identity matches.
+ *
+ * We allow for stricter replica identity (fewer columns) on subscriber as
+ * that will not stop us from finding unique tuple. IE, if publisher has
+ * identity (id,timestamp) and subscriber just (id) this will not be a
+ * problem, but in the opposite scenario it will.
+ *
+ * Don't throw any error here just mark the relation entry as not updatable,
+ * as replica identity is only for updates and deletes but inserts can be
+ * replicated even without it.
+ */
+static void
+logicalrep_check_updatable(LogicalRepRelMapEntry *entry)

Can we name this function logicalrep_rel_mark_updatable, since that is
what it does? If so, change the comments as well.

> When trying to fix it, I saw some other problems about updating partition map
> cache.
>
> a) In logicalrep_partmap_invalidate_cb(), the type of the entry in
> LogicalRepPartMap should be LogicalRepPartMapEntry, instead of
> LogicalRepRelMapEntry.
>
> b) In logicalrep_partition_open(), it didn't check if the entry is valid.
>
> c) When the publisher send new relation mapping, only relation map cache will 
> be
> updated, and partition map cache wouldn't. I think it also should be updated
> because it has remote relation information, too.
>

Is there any test case that can demonstrate the problems from your observations above?

-- 
With Regards,
Amit Kapila.




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-06-09 Thread Etsuro Fujita
On Wed, Jun 8, 2022 at 7:18 PM Etsuro Fujita  wrote:
> On Fri, Jun 3, 2022 at 1:03 AM Zhihong Yu  wrote:
> > Suggestion on formatting the comment:
> >
> > + * node (or that for any plan node in the subplan tree), 2)
> > + * set_plan_references() modifies the tlist for every plan node in the
> >
> > It would be more readable if `2)` is put at the beginning of the second 
> > line above.
> >
> > + * preserves the length and order of the tlist, and 3) 
> > set_plan_references()
> > + * might delete the topmost plan node like an Append or MergeAppend from 
> > the
> >
> > Similarly you can move `3) set_plan_references()` to the beginning of the 
> > next line.
>
> Seems like a good idea, so I updated the patch as you suggest.  I did
> some indentation as well, which I think improves readability a bit
> further.  Attached is an updated version.  If no objections, I’ll
> commit the patch.

Done.

Best regards,
Etsuro Fujita




Re: Improve TAP tests of pg_upgrade for cross-version tests

2022-06-09 Thread Michael Paquier
On Wed, May 25, 2022 at 10:35:57AM +0900, Michael Paquier wrote:
> On Tue, May 24, 2022 at 03:03:28PM +0900, Michael Paquier wrote:
>> (Adding Andrew Dunstan in CC.)
>> 
>> I have been toying with $subject, trying to improve the ways to test
>> pg_upgrade across different major versions as perl makes that easier.
>> The buildfarm does three things to allow such tests to work (see
>> TestUpgradeXversion.pm):

Rebased to cope with the recent changes in this area.
--
Michael
From 61d69469887533bbdde75d8aad1d2d39907ac831 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 9 Jun 2022 16:42:47 +0900
Subject: [PATCH v2] Add more filtering capabilities in the dumps.

This adds support for some filtering capabilities, for some tests done
down to v11.  This allows the tests to pass with v14, while v11~13 still
generate a few diffs.
---
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 78 --
 1 file changed, 61 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 3f11540e18..20906181b2 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -30,6 +30,39 @@ sub generate_db
 		"created database with ASCII characters from $from_char to $to_char");
 }
 
+# Filter the contents of a dump before its use in a content comparison.
+# This returns the path to the filtered dump.
+sub filter_dump
+{
+	my ($node, $tempdir, $dump_file_name) = @_;
+	my $dump_file = "$tempdir/$dump_file_name";
+	my $dump_contents = slurp_file($dump_file);
+
+	# Remove the comments.
+	$dump_contents =~ s/^\-\-.*//mgx;
+	# Remove empty lines.
+	$dump_contents =~ s/^\n//mgx;
+
+	# Locale setup has changed for collations in 15~.
+	$dump_contents =~
+	  s/(^CREATE\sCOLLATION\s.*?)\slocale\s=\s'und'/$1 locale = ''/mgx
+	  if ($node->pg_version < 15);
+
+	# Dumps taken from <= 11 use EXECUTE PROCEDURE.  Replace it
+	# with EXECUTE FUNCTION.
+	$dump_contents =~
+	  s/(^CREATE\sTRIGGER\s.*?)\sEXECUTE\sPROCEDURE/$1 EXECUTE FUNCTION/mgx
+	  if ($node->pg_version < 12);
+
+	my $dump_file_filtered = "$tempdir/${dump_file_name}_filter";
+	open(my $dh, '>', $dump_file_filtered)
+	  || die "opening $dump_file_filtered";
+	print $dh $dump_contents;
+	close($dh);
+
+	return $dump_file_filtered;
+}
+
 # The test of pg_upgrade requires two clusters, an old one and a new one
 # that gets upgraded.  Before running the upgrade, a logical dump of the
 # old cluster is taken, and a second logical dump of the new one is taken
@@ -153,13 +186,13 @@ my $oldbindir = $oldnode->config_data('--bindir');
 
 # Take a dump before performing the upgrade as a base comparison. Note
 # that we need to use pg_dumpall from the new node here.
-$newnode->command_ok(
-	[
-		'pg_dumpall', '--no-sync',
-		'-d', $oldnode->connstr('postgres'),
-		'-f', "$tempdir/dump1.sql"
-	],
-	'dump before running pg_upgrade');
+my @dump_command = (
+	'pg_dumpall', '--no-sync', '-d', $oldnode->connstr('postgres'),
+	'-f', "$tempdir/dump1.sql");
+# --extra-float-digits is needed when upgrading from a version older than 11.
+push(@dump_command, '--extra-float-digits', '0')
+  if ($oldnode->pg_version < 12);
+$newnode->command_ok(\@dump_command, 'dump before running pg_upgrade');
 
 # After dumping, update references to the old source tree's regress.so
 # to point to the new tree.
@@ -282,24 +315,35 @@ if (-d $log_path)
 }
 
 # Second dump from the upgraded instance.
-$newnode->command_ok(
-	[
-		'pg_dumpall', '--no-sync',
-		'-d', $newnode->connstr('postgres'),
-		'-f', "$tempdir/dump2.sql"
-	],
-	'dump after running pg_upgrade');
+@dump_command = (
+	'pg_dumpall', '--no-sync', '-d', $newnode->connstr('postgres'),
+	'-f', "$tempdir/dump2.sql");
+# --extra-float-digits is needed when upgrading from a version older than 11.
+push(@dump_command, '--extra-float-digits', '0')
+  if ($oldnode->pg_version < 12);
+$newnode->command_ok(\@dump_command, 'dump after running pg_upgrade');
+
+# Filter the contents of the dumps from the old version of any contents.
+my $dump1_filtered = "$tempdir/dump1.sql";
+my $dump2_filtered = "$tempdir/dump2.sql";
+
+# No need to apply filters on the dumps if working on the same version.
+if ($oldnode->pg_version != $newnode->pg_version)
+{
+	$dump1_filtered = filter_dump($oldnode, $tempdir, "dump1.sql");
+	$dump2_filtered = filter_dump($newnode, $tempdir, "dump2.sql");
+}
 
 # Compare the two dumps, there should be no differences.
-my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
+my $compare_res = compare($dump1_filtered, $dump2_filtered);
 is($compare_res, 0, 'old and new dumps match after pg_upgrade');
 
 # Provide more context if the dumps do not match.
 if ($compare_res != 0)
 {
 	my ($stdout, $stderr) =
-	  run_command([ 'diff', "$tempdir/dump1.sql", "$tempdir/dump2.sql" ]);
-	print "=== diff of $tempdir/dump1.sql and $tempdir/dump2.sql\n";
+	  run_command([ 'diff', 

Re: A proposal to provide a timeout option for CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot

2022-06-09 Thread Kyotaro Horiguchi
At Thu, 9 Jun 2022 10:25:06 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> Currently CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot waits
> unboundedly if there are any in-progress write transactions [1]. The wait
> is for a reason actually i.e. for building an initial snapshot, but waiting
> unboundedly isn't good for usability of the command/function and when
> stuck, the callers will not have any information as to why.
> 
> How about we provide a timeout for the command/function instead of letting
> them wait unboundedly? The behavior will be something like this - if the
> logical replication slot isn't created within this timeout, the
> command/function will fail.
> 
> We could've asked callers to set statement_timeout before calling
> CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot but that impacts
> the queries running in all other sessions and it may not be always possible
> to set this parameter just for the session that runs command
> CREATE_REPLICATION_SLOT.
>
> Thoughts?

How can other sessions get affected by setting statement_timeout in a
session?  And "SET LOCAL" narrows the effect down to a single
transaction. I think that is sufficient. On the other hand,
CREATE_REPLICATION_SLOT doesn't honor statement_timeout, but it does
honor lock_timeout. (It's a bit strange, but I would hardly go so far
as to say we should "fix" it..) If a program issues
CREATE_REPLICATION_SLOT, it's hard to believe that the same program
cannot also issue a SET command (for lock_timeout).
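
For instance, something like this (the slot name and output plugin here
are only for illustration) bounds just the slot-creating statement and
nothing else in the session:

    BEGIN;
    SET LOCAL statement_timeout = '60s';
    SELECT pg_create_logical_replication_slot('myslot', 'test_decoding');
    COMMIT;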

When CREATE_REPLICATION_SLOT is called from a CREATE SUBSCRIPTION
command, the latter command itself honors statement_timeout and
disconnects the peer walsender. Thus, client_connection_check_interval
set on the publisher side kills the walsender shortly after the
disconnection.

In short, I don't see much point in the timeout of the function/command.

As a general discussion on giving functions/commands a timeout via a
parameter, the only precedent I can come up with for now is
pg_terminate_backend(), but its timeout parameter not only determines
the timeout duration but also specifies whether the function waits for
the process termination. That functionality cannot be achieved with
statement_timeout.  In that sense it is a bit different from
pg_create_logical_replication_slot().
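
For reference, a rough example (the PID is of course made up): with a
non-zero timeout the function also waits up to that many milliseconds
for the backend to actually exit, and returns false if it has not:

    SELECT pg_terminate_backend(12345, 5000);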

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments

2022-06-09 Thread Amit Kapila
On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy
 wrote:
>
> Currently postgres doesn't allow dropping a replication slot that's active 
> [1]. This can make certain operations more time-consuming or stuck in 
> production environments. These operations are - disable async/sync standbys 
> and disable logical replication that require the postgres running on standby 
> or the subscriber to go down. If stopping postgres server takes time, the VM 
> or container will have to be killed forcefully which can take a considerable 
> amount of time as there are many layers in between.
>

Why do you want to drop the slot when the server is going down? Is it
some temporary replication slot? Otherwise, how will you resume
replication after restarting the server?

--
With Regards,
Amit Kapila.




Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan
On Wed, Jun 8, 2022 at 10:39 PM Peter Geoghegan  wrote:
> They simply REINDEX, without changing anything. The details are still
> fuzzy, but at least that's what I was thinking of.

As I said before, BCP47 format tags are incredibly forgiving by
design. So it should be reasonable to assume that anything that has
worked in an earlier version of ICU will continue to work in a way
that's at least as useful in a future version. See:

https://www.postgresql.org/message-id/CAH2-Wz=ZrA5Yf55pKtdJb2pYCVN=2dh__vgr9arqqohmqwg...@mail.gmail.com

That's not strictly guaranteed, because sometimes countries cease to
exist, and their ISO country codes eventually go away too. But that
still tends to fail gracefully. It's mostly only relevant for things
that are part of a locale, which is a broader concept than just
collation. An application that did this and relied on ICU for
localization might then find that the currency sign changed, but I'm
not aware of any impact on locales. You can ask for total nonsense
and mostly get reasonable behaviors, like Japanese as spoken in
Iceland. Even some totally made up (or misspelled) country is
accepted without complaint.
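
To make that concrete (just an illustrative sketch; the collation name
is made up, and the warnings you get can vary across ICU and Postgres
versions), ICU will happily accept Japanese-as-spoken-in-Iceland:

    CREATE COLLATION ja_is (provider = icu, locale = 'ja-IS');
    SELECT 'foo' < 'bar' COLLATE ja_is;

and resolve it to something reasonable rather than rejecting it outright.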

--
Peter Geoghegan