Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Kyotaro Horiguchi
At Wed, 08 Jun 2022 07:05:09 +0200, Laurenz Albe  
wrote in 
> I take Tom's comment above as saying that the current behavior is fine.
> So yes, perhaps some documentation would be in order:
> 
> diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
> index b43d0aecba..b4b7e36d28 100644
> --- a/doc/src/sgml/postgres-fdw.sgml
> +++ b/doc/src/sgml/postgres-fdw.sgml
> @@ -274,6 +274,14 @@ OPTIONS (ADD password_required 'false');
> but only for that table.
> The default is false.
>
> +
> +  
> +   Note that EXPLAIN will be run on the remote server
> +   at query planning time, before permissions on the
> +   foreign table are checked.  This is not a security problem, since the
> +   subsequent error from the permission check will prevent the user from
> +   seeing any of the resulting data.
> +  
>   
>  

Looks fine.  I'd like to add something like "If needed, depriving
unprivileged users of the relevant user mappings will prevent such
remote execution at planning time."
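
A minimal sketch of that suggestion, with hypothetical role and server
names: postgres_fdw can only open a remote connection for a role that has a
user mapping (its own or one for PUBLIC), so removing the mapping also
removes the planning-time remote access.

  DROP USER MAPPING IF EXISTS FOR unprivileged_user SERVER remote_server;
  -- alternatively, never create a mapping for roles that should not be
  -- able to trigger remote access in the first place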

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: bogus: logical replication rows/cols combinations

2022-06-07 Thread Peter Smith
On Wed, Jun 8, 2022 at 1:25 PM Justin Pryzby  wrote:
>
> On Mon, Jun 06, 2022 at 03:42:31PM +1000, Peter Smith wrote:
> > I noticed the patch "0001-language-fixes-on-HEAD-from-Justin.patch" says:
> >
> > @@ -11673,7 +11673,7 @@
> >prosrc => 'pg_show_replication_origin_status' },
> >
> >  # publications
> > -{ oid => '6119', descr => 'get information of tables in a publication',
> > +{ oid => '6119', descr => 'get information about tables in a publication',
> >
> > ~~~
> >
> > But, this grammar website [1] says:
> ...
> > From which I guess
> >
> > 1. 'get information of tables in a publication' ~= 'get information
> > belonging to tables in a publication'
>
> But the information doesn't "belong to" the tables.
>
> The information is "regarding" the tables (or "associated with" or "concerned
> with" or "respecting" or "on the subject of" the tables).
>
> I think my change is correct based on the grammar definition, as well as its
> intuitive "feel".
>

Actually, I have no problem with this being worded either way. My
point was mostly to question if it was really worth changing it at
this time - e.g. I think there is a reluctance to change anything to
do with the catalogs during beta (even when a catversion bump may not
be required).

I agree that "about" seems better if the text said, "get information
about tables". But it does not say that - it says "get information
about tables in a publication" which I felt made a subtle difference.

e.g.1 "... on the subject of / concerned with tables."
- sounds like attributes about each table (col names, row filter etc)

versus

e.g.2 "... on the subject of / concerned with tables in a publication."
- sounds less like information PER table, and more like information
about the table membership of the publication.

~~

Any ambiguities can be eliminated if this text was just fixed to be
consistent with the wording of catalogs.sgml:
e.g. "publications and information about their associated tables"

But then this comes full circle back to my question of whether beta is a
good time to be making such a change.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Laurenz Albe
On Wed, 2022-06-08 at 13:06 +0900, Kyotaro Horiguchi wrote:
> At Wed, 08 Jun 2022 12:09:27 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Wed, 08 Jun 2022 04:38:02 +0200, Laurenz Albe  
> > wrote in 
> > > If anything, it should be done in the FDW, because it is only necessary 
> > > if the
> > > FDW calls the remote site during planning.
> > > 
> > > The question is: is this a bug in postgres_fdw that should be fixed?
> > 
> > It depends on what we think about allowing remote access attempts
> > through unprivileged foreign tables in any form.  It won't be a
> > problem if the system is configured appropriately, but too-frequent
> > estimate accesses via unprivileged foreign tables might be regarded as
> > an attack attempt.
> 
> In other words, I don't think it's a bug, and there is no need to fix it.
> If one wants to prevent such estimate accesses via unprivileged foreign
> tables, it is enough to prevent non-privileged users from having a
> user mapping.  This might be worth documenting?

I take Tom's comment above as saying that the current behavior is fine.
So yes, perhaps some documentation would be in order:

diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index b43d0aecba..b4b7e36d28 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -274,6 +274,14 @@ OPTIONS (ADD password_required 'false');
but only for that table.
The default is false.
   
+
+  
+   Note that EXPLAIN will be run on the remote server
+   at query planning time, before permissions on the
+   foreign table are checked.  This is not a security problem, since the
+   subsequent error from the permission check will prevent the user from
+   seeing any of the resulting data.
+  
  
 

Yours,
Laurenz Albe




Re: tablesync copy ignores publication actions

2022-06-07 Thread Amit Kapila
On Tue, Jun 7, 2022 at 7:08 PM Euler Taveira  wrote:
>
> On Tue, Jun 7, 2022, at 1:10 AM, Peter Smith wrote:
>
> The logical replication tablesync ignores the publication 'publish'
> operations during the initial data copy.
>
> This is current/known PG behaviour (e.g. as recently mentioned [1])
> but it was not documented anywhere.
>
> initial data synchronization != replication. publish parameter is a 
> replication
> property; it is not an initial data synchronization property. Maybe we should
> make it clear like you are suggesting.
>

+1 to document it. We respect some other properties of the publication,
like the publish_via_partition_root parameter, column lists, and row
filters. So it is better to explain the 'publish' parameter, which we
ignore during the initial sync.
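
For instance, a rough sketch of the point (hypothetical names): even a
publication restricted to publish = 'insert' has all existing rows copied
during the initial table synchronization; the setting only filters the
changes replicated afterwards.

  CREATE PUBLICATION pub_ins_only FOR TABLE t1 WITH (publish = 'insert');
  -- on the subscriber (connection string elided):
  CREATE SUBSCRIPTION sub1 CONNECTION '...' PUBLICATION pub_ins_only;
  -- the initial copy of t1 includes every existing row, ignoring
  -- publish = 'insert'; after that, only INSERTs are replicated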

> This patch just documents the existing behaviour and gives some examples.
>
> Why did you add this information to that specific paragraph? IMO it belongs to
> a separate paragraph; I would add it as the first paragraph in that 
> subsection.
>
> I suggest the following paragraph:
>
> 
> The initial data synchronization does not take into account the
> publish parameter to copy the existing data.
> 
>
> There is no point to link the Initial Snapshot subsection. That subsection is
> explaining the initial copy steps and you want to inform about the effect of a
> publication parameter on the initial copy. Although both are talking about the
> same topic (initial copy), that link to Initial Snapshot subsection won't add
> additional information about the publish parameter.
>

Here, we are explaining the behavior of row filters during initial
sync so adding a link to the Initial Snapshot section makes sense to
me.

> You could expand the
> suggested sentence to make it clear what publish parameter is or even add a
> link to the CREATE PUBLICATION synopsis (that explains about publish
> parameter).
>

+1. I suggest that we add some text about the behavior of initial sync
in the CREATE PUBLICATION doc (along with the 'publish' parameter);
otherwise, we can explain it where we talk about publications [1].

> You add an empty paragraph. Remove it.
>
> I'm not sure it deserves an example. It is an easy-to-understand concept and a
> good description is better than ~ 80 new lines.
>

I don't think it is very clear that "initial data synchronization !=
replication", as you mention, nor do our docs do a good job of
explaining it; otherwise the confusion wouldn't have arisen in the
email Peter linked. Personally, I think such things are better
explained by example, and in that regard the example shared by
Peter does half the job because it doesn't explain the replication
part. I don't think "Initial Snapshot" is the right place for these
examples, considering we want to show replication based on the
publish actions. We can extend it to show one example with row filters
as well. How about showing these examples in the Subscription section
[2]?
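
As a rough sketch of the contrast such an example could draw (hypothetical
table and filter): the row filter is honored by the initial copy, while the
publish setting is not.

  CREATE PUBLICATION pub_big FOR TABLE t1 WHERE (id > 1000)
    WITH (publish = 'insert');
  -- initial sync copies only the existing rows with id > 1000, yet it
  -- still copies them even though only INSERTs are published afterwards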

[1]: https://www.postgresql.org/docs/devel/logical-replication-publication.html
[2]: https://www.postgresql.org/docs/devel/logical-replication-subscription.html


-- 
With Regards,
Amit Kapila.




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Kyotaro Horiguchi
At Tue, 07 Jun 2022 23:04:52 -0400, Tom Lane  wrote in 
> Laurenz Albe  writes:
> > On Wed, 2022-06-08 at 11:12 +0900, Kyotaro Horiguchi wrote:
> > RangeTblEntry *rte = root->simple_rte_array[i];
> > aclcheck_error(ACLCHECK_NO_PRIV,
> >    get_relkind_objtype(rte->relkind),
> >    get_rel_name(rte->relid));
> 
> I think it's completely inappropriate for FDWs to be taking it on
> themselves to inject privilege checks.  The system design is that
> that is checked at executor start; not before, not after.

Ah, yes.  It's not good to check it at multiple stages; the only place
should be executor start.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Kyotaro Horiguchi
At Wed, 08 Jun 2022 12:09:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Wed, 08 Jun 2022 04:38:02 +0200, Laurenz Albe  
> wrote in 
> > If anything, it should be done in the FDW, because it is only necessary if 
> > the
> > FDW calls the remote site during planning.
> > 
> > The question is: is this a bug in postgres_fdw that should be fixed?
> 
> It depends on what we think about allowing remote access attempts
> through unprivileged foreign tables in any form.  It won't be a
> problem if the system is configured appropriately, but too-frequent
> estimate accesses via unprivileged foreign tables might be regarded as
> an attack attempt.

In other words, I don't think it's a bug, and there is no need to fix it.
If one wants to prevent such estimate accesses via unprivileged foreign
tables, it is enough to prevent non-privileged users from having a
user mapping.  This might be worth documenting?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: bogus: logical replication rows/cols combinations

2022-06-07 Thread Justin Pryzby
On Mon, Jun 06, 2022 at 03:42:31PM +1000, Peter Smith wrote:
> I noticed the patch "0001-language-fixes-on-HEAD-from-Justin.patch" says:
> 
> @@ -11673,7 +11673,7 @@
>prosrc => 'pg_show_replication_origin_status' },
> 
>  # publications
> -{ oid => '6119', descr => 'get information of tables in a publication',
> +{ oid => '6119', descr => 'get information about tables in a publication',
> 
> ~~~
> 
> But, this grammar website [1] says:
...
> From which I guess
> 
> 1. 'get information of tables in a publication' ~= 'get information
> belonging to tables in a publication'

But the information doesn't "belong to" the tables.

The information is "regarding" the tables (or "associated with" or "concerned
with" or "respecting" or "on the subject of" the tables).

I think my change is correct based on the grammar definition, as well as its
intuitive "feel".

-- 
Justin




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Kyotaro Horiguchi
At Wed, 08 Jun 2022 04:38:02 +0200, Laurenz Albe  
wrote in 
> If anything, it should be done in the FDW, because it is only necessary if the
> FDW calls the remote site during planning.
> 
> The question is: is this a bug in postgres_fdw that should be fixed?

It depends on what we think about allowing remote access attempts
through unprivileged foreign tables in any form.  It won't be a
problem if the system is configured appropriately, but too-frequent
estimate accesses via unprivileged foreign tables might be regarded as
an attack attempt.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Tom Lane
Laurenz Albe  writes:
> On Wed, 2022-06-08 at 11:12 +0900, Kyotaro Horiguchi wrote:
> RangeTblEntry *rte = root->simple_rte_array[i];
> aclcheck_error(ACLCHECK_NO_PRIV,
>    get_relkind_objtype(rte->relkind),
>    get_rel_name(rte->relid));

I think it's completely inappropriate for FDWs to be taking it on
themselves to inject privilege checks.  The system design is that
that is checked at executor start; not before, not after.

regards, tom lane




Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Tom Lane
"Jonathan S. Katz"  writes:
> I don't know how frequently issues around "max_stack_depth" being too 
> small are reported -- I'd be curious to know that -- but I don't have 
> any strong arguments against allowing the behavior you describe based on 
> our current docs.

I can't recall any recent gripes on our own lists, but the issue was
top-of-mind for me after discovering that NetBSD defaults "ulimit -s"
to 2MB on at least some platforms.  That would leave us setting
max_stack_depth to something less than that, probably about 1.5MB.

regards, tom lane




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Laurenz Albe
On Wed, 2022-06-08 at 11:12 +0900, Kyotaro Horiguchi wrote:
> At Tue, 07 Jun 2022 11:24:55 -0300, "Euler Taveira"  wrote 
> in 
> > On Tue, Jun 7, 2022, at 12:03 AM, Laurenz Albe wrote:
> > > On Sat, 2022-06-04 at 21:18 +, Phil Florent wrote:
> > > > I opened an issue with an attached code on oracle_fdw git page : 
> > > > https://github.com/laurenz/oracle_fdw/issues/534 
> > > > Basically I expected to obtain a "no privilege" error from PostgreSQL 
> > > > when I have no read privilege
> > > > on the postgres foreign table but I obtained an Oracle error instead.
> > > > Laurenz investigated and closed the issue but he suggested perhaps I 
> > > > should post that on
> > > > the hackers list since it also occurs with postgres-fdw on some 
> > > > occasion(I have investigated some more,
> > > and postgres_fdw does the same thing when you turn on
> > > use_remote_estimate.). Hence I do...
> > > 
> > > To add more details: permissions are checked at query execution time, but
> > > if "use_remote_estimate"
> > > is used, the planner already accesses the remote table, even if the user 
> > > has no permissions
> > > on the foreign table.
> > > 
> > > I feel that that is no bug, but I'd be curious to know if others disagree.
> > You should expect an error (like in the example) -- probably not at that 
> > point.
> > It is behaving accordingly. However, that error is exposing an 
> > implementation
> > detail (FDW has to access the remote table at that phase). I don't think 
> > that
> > changing the current design (permission check after planning) for FDWs to
> > provide a good UX is worth it. IMO it is up to the FDW author to hide such
> > cases if it doesn't cost much to do it.
> 
> It is a few lines of code.
> 
> > i = -1;
> > while ((i = bms_next_member(rel->relids, i)) >= 0)
> > {
> > RangeTblEntry *rte = root->simple_rte_array[i];
> > aclcheck_error(ACLCHECK_NO_PRIV,
> >    get_relkind_objtype(rte->relkind),
> >    get_rel_name(rte->relid));
> > }
> 
> It can be done in the GetForeignRelSize callback by individual FDWs, but it
> can also be done in set_foreign_size() in core.

If anything, it should be done in the FDW, because it is only necessary if the
FDW calls the remote site during planning.

The question is: is this a bug in postgres_fdw that should be fixed?
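
For concreteness, a rough sketch of the scenario under discussion
(hypothetical object names; assumes the role has a user mapping but no
privilege on the foreign table):

  ALTER SERVER remote_server OPTIONS (ADD use_remote_estimate 'true');
  REVOKE SELECT ON remote_table FROM app_user;
  SET ROLE app_user;
  EXPLAIN SELECT * FROM remote_table;
  -- the planner runs EXPLAIN on the remote server before the local
  -- permission check, so a remote-side error can surface first; actual
  -- execution would still fail with "permission denied"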

Yours,
Laurenz Albe




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Kyotaro Horiguchi
At Tue, 07 Jun 2022 11:24:55 -0300, "Euler Taveira"  wrote 
in 
> 
> 
> On Tue, Jun 7, 2022, at 12:03 AM, Laurenz Albe wrote:
> > On Sat, 2022-06-04 at 21:18 +, Phil Florent wrote:
> > > I opened an issue with an attached code on oracle_fdw git page : 
> > > https://github.com/laurenz/oracle_fdw/issues/534 
> > > Basically I expected to obtain a "no privilege" error from PostgreSQL 
> > > when I have no read privilege
> > > on the postgres foreign table but I obtained an Oracle error instead.
> > > Laurenz investigated and closed the issue but he suggested perhaps I 
> > > should post that on
> > > the hackers list since it also occurs with postgres-fdw on some 
> > > occasion(I have investigated some more,
> > and postgres_fdw does the same thing when you turn on
> > use_remote_estimate.). Hence I do...
> > 
> > To add more details: permissions are checked at query execution time, but if
> > "use_remote_estimate"
> > is used, the planner already accesses the remote table, even if the user 
> > has no permissions
> > on the foreign table.
> > 
> > I feel that that is no bug, but I'd be curious to know if others disagree.
> You should expect an error (like in the example) -- probably not at that 
> point.
> It is behaving accordingly. However, that error is exposing an implementation
> detail (FDW has to access the remote table at that phase). I don't think that
> changing the current design (permission check after planning) for FDWs to
> provide a good UX is worth it. IMO it is up to the FDW author to hide such
> cases if it doesn't cost much to do it.

It is a few lines of code.

>   i = -1;
>   while ((i = bms_next_member(rel->relids, i)) >= 0)
>   {
>   RangeTblEntry *rte = root->simple_rte_array[i];
>   aclcheck_error(ACLCHECK_NO_PRIV,
>  get_relkind_objtype(rte->relkind),
>  get_rel_name(rte->relid));
>   }

It can be done in the GetForeignRelSize callback by individual FDWs, but it
can also be done in set_foreign_size() in core.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch

2022-06-07 Thread Michael Paquier
On Mon, Jun 06, 2022 at 10:11:48PM -0500, Justin Pryzby wrote:
> tather => rather
> is charge => in charge

Thanks for the extra read.  Fixed.  There was an extra one in the
comments: s/thier/their/.

> I think it's better with a dot (HHMMSS.ms) rather than underscore (HHMMSS_ms).
>
> The ISO timestamp can include milliseconds (or apparently fractional parts of
> the "lowest-order" unit), so the "appended by" part doesn't need to be
> explained here.
> 
> +   snprintf(timebuf, sizeof(timebuf), "%s_%03d",
> +timebuf, (int) (time.tv_usec / 1000));
> 
> Is it really allowed to sprintf a buffer onto itself ?
> I can't find any existing cases doing that.

Yes, there is no need to do that, so I have just appended the ms
digits to the end of the string.

And applied, to take care of this open item.
--
Michael




Re: An inverted index using roaring bitmaps

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 5:00 PM Chinmay Kanchi  wrote:
> I personally don't think this is a great replacement for a BTree index - for 
> one thing, it isn't really possible to use this approach beyond equality 
> comparisons (for scalars) or "contains"-type operations for arrays (or 
> tsvectors, jsonb, etc).

Why not? A bitmap is just a way of representing TIDs that is often
very space efficient once compression is applied. In principle a
bitmap index can do anything that a B-Tree index can do, at least for
SELECTs.

Bitmap indexes are considered totally distinct from B-Tree indexes in
some DB systems because the concurrency control characteristics (the
use of heavyweight locks to protect the logical contents of the
database) are very different. I think that this is because the index
structure itself is so dense that the only practical approach that's
compatible with 2PL-style concurrency control (or MVCC variants based
on 2PL) is to lock a large number of TIDs at the same time. This
can lead to deadlocks with even light concurrent modifications --
which would never happen with an equivalent B-Tree index. But the data
structure is nevertheless more similar than different.

I probably wouldn't want to have a technique like roaring bitmap
compression of TIDs get applied by default within Postgres B-Trees,
but the reasons for that are pretty subtle. I might still advocate
*optional* TID list compression in Postgres B-Trees, which might even
be something we'd end up calling a bitmap index, that would only be
recommended for use in data warehousing scenarios. Extreme TID list
compression isn't free -- it really isn't desirable when there are
many concurrent modifications to relatively few index pages, as is
common in OLTP applications. That's one important reason why bitmap
indexes are generally only used in data warehousing environments,
where the downside doesn't really matter, but the upside pays for
itself (usually a fact table will have several bitmap indexes that are
usually combined, not just one).

> We ultimately abandoned that project because of the difficulty of keeping the 
> bitmaps in sync with changing data, which would no longer be an issue, if 
> this was built as an index.

I think that it would in fact still be an issue if this was built as
an index. There is a reason why the concurrency characteristics of
bitmap indexes make them unsuitable for OLTP apps, which seems
related. That wouldn't mean that it wouldn't still be worth it, but it
would definitely be a real downside with some workloads. B-Tree
deduplication is designed to have very little overhead with mixed
reads and writes, so it's a performance all-rounder that can still be
beaten by specialized techniques that come with their own downsides.

-- 
Peter Geoghegan




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-06-07 Thread Robert Haas
On Tue, Jun 7, 2022 at 6:54 PM Jacob Champion  wrote:
> "This struct contains connection fields that are explicitly safe for
> workers to access" _is_ a useful semantic, in my opinion. And it seems
> like it'd make it easier to determine what needs to be included in the
> struct; I'm not sure I follow why it would result in less consistency.
>
> But to your suggestion, if we just called the new struct
> "ClientConnectionInfo", would it be a useful step towards your
> proposed three-bucket state? I guess I'm having trouble understanding
> why a struct that is defined as "this stuff *doesn't* get serialized"
> is materially different from having one that's the opposite.

Well, it isn't, and if my proposal boils down to that, which perhaps
it does, then my proposal isn't that great, honestly. Let me try again
to explain, though, and maybe it will seem less arbitrary with a
second explanation -- or maybe it won't.

If we say "this struct contains authentication-related information
that we got from the client and which functions may want to look at
later," then I feel like the chances are good that when someone adds a
new thing to the system in the future, they will know whether or not
that new thing falls into that category. If the definition of a
struct is "everything that should be serialized," then I feel like the
chances are less good that everyone will know whether a new thing they
are adding falls into that category or not. Perhaps that is
ill-founded, but I don't think "should be serialized" is necessarily
something that everybody is going to have the same view on, or even
know what it means.

Also, I don't think we want to end up with a situation where we have a
struct that contains wildly unrelated things that all need to be
serialized. If the definition of the struct is "stuff we should
serialize and send to the worker," well then maybe the transaction
snapshot ought to go in there! Well, no. I mean, we already have a
separate place for that, but suppose somehow we didn't. It doesn't
belong here, because yes, the things in this struct get serialized, but
it's not just any old thing that needs serializing; it's more specific
than that.

I guess what this boils down to is that I really want this thing to
have a meaningful name by means of which a future developer can make a
guess as to whether some field they're adding ought to go in there. I
theorize that SharedPort is not too great because (a) Port is already
a bad name and (b) how am I supposed to know whether my stuff ought to
be shared or not? I like something like ClientConnectionInfo better
because it seems to describe what the stuff in the struct *is* rather
than what we *do* with it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Jonathan S. Katz

On 6/7/22 7:58 PM, Tom Lane wrote:

> I wrote:
>> The attached draft patch makes the following changes:

> Here's a v2 that polishes the loose ends:

Thanks! I reviewed and did some basic testing locally. I did not see any
of the generated defaults.

>> (I didn't do anything about in_hot_standby, which is set through
>> a hack rather than via set_config_option; not sure whether we want
>> to do anything there, or what it should be if we do.)

The comment diff showed that it went from "hack" to "hack" :)

> I concluded that directly assigning to in_hot_standby was a fairly
> horrid idea and we should just change it with SetConfigOption.
> With this coding, as long as in_hot_standby is TRUE it will show
> as having a non-default setting in \dconfig.  I had to remove the
> assertion I'd added about PGC_INTERNAL variables only receiving
> "default" values, but this just shows that was too inflexible anyway.

I tested this and the server correctly rendered "in_hot_standby" in
\dconfig. I also tested setting "hot_standby" to "on" while the server
was not in recovery, and \dconfig correctly did not render "in_hot_standby".

>> * The rlimit-derived value of max_stack_depth is likewise relabeled
>> as PGC_S_DYNAMIC_DEFAULT, resolving the complaint Jonathan had upthread.
>> But now that we have a way to hide this, I'm having second thoughts
>> about whether we should.  If you are on a platform that's forcing an
>> unreasonably small stack size, it'd be good if \dconfig told you so.
>> Could it be sane to label that value as PGC_S_DYNAMIC_DEFAULT only when
>> it's the limit value (2MB) and PGC_S_ENV_VAR when it's smaller?

> I concluded that was just fine and did it.


Reading the docs, I think this is OK to do. We already say that "2MB" is 
a very conservative setting. And even if the value can be computed to be 
larger, we don't allow the server to set it higher than "2MB".


I don't know how frequently issues around "max_stack_depth" being too 
small are reported -- I'd be curious to know that -- but I don't have 
any strong arguments against allowing the behavior you describe based on 
our current docs.


Jonathan




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 4:29 PM Thomas Munro  wrote:
> The difference is that Debian has libllvm-{11,12,13,14}-dev packages,
> but it does *not* have multiple -dev packages for libicu, just a
> single libicu-dev which can be used to compile and link against their
> chosen current library version.  They do have multiple packages for
> the actual .so and allow them to be installed concurrently.
> Therefore, you could install N .sos and dlopen() them, but you *can't*
> write a program that compiles and links against N versions at the same
> time using their packages (despite IBM's work to make that possible,
> perhaps for use in their own databases).

I know that glibc has various facilities for versioning dynamic
libraries, which includes ways to control symbol visibility. It's
possible that IBM's work on ICU versioning didn't just build on a
generic facility like that because that approach wasn't sufficiently
portable, particularly with platforms like AIX. It's also possible
that we won't have any of these same requirements, and can feasibly
link against multiple ICU versions some other way, and ultimately
achieve the same result -- multiple versions of ICU that can be used
by Postgres at the same time, with long term stable collations across
major OS and Postgres versions.

I now understand that you agree with me on this basic and important
point. Must have been a miscommunication.

> > Packaging standards certainly matter, but they're not immutable laws
> > of the universe. It seems reasonable to suppose that the people that
> > define these standards would be willing to hear us out -- this is
> > hardly a trifling matter, or something that only affects a small
> > minority of *their* users.
>
> OK, yeah, I'm thinking within the confines of things we can do easily
> right now on existing systems as they are currently packaging software
> only by changing our code, not "tell Debian to change their packaging
> so we can compile and link against N versions".

There are lots of specifics here, and I'm certainly not an expert on
packaging. IMV our approach doesn't necessarily need to use the same
original canonical package, though. It just needs to provide a
reasonably smooth experience for users that actually need to keep
their old collations working on upgrade. Either way, the process needs
to be something where all parties understand the concerns of each
other.

Of course Debian doesn't support linking against multiple versions of
ICU right now; why would they? Is there any reason to think that even
one person ever asked about it? Our interest in doing that will
probably be totally unique from their point of view. Can we just ask
somebody about it that has a deep understanding of these things?

> Supposing Debian
> maintainers (and all the others) agreed, there'd still be something else
> in favour of dlopen():  wouldn't it be nice if the users were not
> limited by the versions that the packager of PostgreSQL decided to
> link against?  What if someone has a good reason to want to use ICU
> versions that are older than Debian currently ships, that are easily
> available in add-on repos?

I don't consider the ability to support many versions of ICU for the
sake of ICU features to be much of an advantage. I mostly just care
about the simple, common case where a user upgrades and doesn't want
to REINDEX immediately. You may well be right about dlopen(); I just
don't know right now.

> I think we're not understanding each other here: I was talking about
> the technical choice of whether we'd model the multiple library
> versions in our catalogues as different "collprovider" values, or
> somehow encode them into the "collcollate" string, or something else.

ISTM that there are two mostly-distinct questions here:

1. How do we link to multiple versions of ICU at the same time, in a
way that is going to work smoothly on mainstream platforms?

2. What semantics around collations do we want for Postgres once we
gain the ability to use multiple versions of ICU at the same time? For
example, do we want to generalize the definition of a collation, so
that it's associated with one particular ICU version and collation for
the purposes of on-disk compatibility, but isn't necessarily tied to
the same ICU version in other contexts, such as on a dump and restore?

-- 
Peter Geoghegan




Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Jonathan S. Katz

On 6/7/22 1:02 PM, Tom Lane wrote:


> In any case, I expect that we'd apply this patch only to HEAD, which
> means that when using psql's \dconfig against a pre-v15 server,
> you'd still see these settings that we're trying to hide.
> That doesn't bother me too much, but maybe some would find it
> confusing.


Well, "\dconfig" is a v15 feature, and though it's in the client, the 
best compatibility for it will be with v15. I think it's OK to have the 
behavior different in v15 vs. older versions.


Jonathan




Re: An inverted index using roaring bitmaps

2022-06-07 Thread Chinmay Kanchi
I personally don't think this is a great replacement for a BTree index -
for one thing, it isn't really possible to use this approach beyond
equality comparisons (for scalars) or "contains"-type operations for arrays
(or tsvectors, jsonb, etc). I see this more as "competing" with GIN, though
I think GIN solves a different use-case. The primary thought here is that
we could build lightning fast inverted indexes for the cases where these
really help.

I played with using roaring bitmaps in production to build rollup tables,
for instance - where a single bitmap per key could satisfy count() queries
and count(*) ... GROUP BY with multiple WHERE conditions way faster than
even an index-only scan could, and without the overhead of multi-column
indexes. In our particular case, there were about 2 dozen columns with
around 30-40 million rows, and we were able to run these queries in
single-digit milliseconds. We ultimately abandoned that project because of
the difficulty of keeping the bitmaps in sync with changing data, which
would no longer be an issue, if this was built as an index.
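
As a rough sketch of the query shape described above (hypothetical schema
and column names): low-cardinality predicates combined with GROUP BY counts,
which is the kind of query the per-key bitmaps could answer directly.

  SELECT country, count(*)
  FROM events
  WHERE status = 'active' AND channel = 'web'
  GROUP BY country;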

I think your point about data warehouse-style bitmap indexes hits the nail
on the head here. This would be pretty much just that, a very efficient way
to accelerate such queries.

Cheers,
Chinmay


On Tue, Jun 7, 2022 at 11:53 AM Peter Geoghegan  wrote:

> On Mon, Jun 6, 2022 at 10:42 PM Chinmay Kanchi  wrote:
> > The simulated index in this case is outrageously fast, up to ~150x on
> the GROUP BY.
>
> Couldn't you make a similar argument in favor of adding a B-Tree index
> on "country"? This probably won't be effective in practice, but the
> reasons for this have little to do with how a B-Tree index represents
> TIDs. A GIN index can compress TIDs much more effectively, but the
> same issues apply there.
>
> The main reason why it won't work with a B-Tree is that indexes in
> Postgres are not transactionally consistent structures, in general.
> Whereas your cities_rb table is transactionally consistent (or perhaps
> just simulates a transactionally consistent index). Maybe it could
> work in cases where an index-only scan could be used, which is roughly
> comparable to having a transactionally consistent index. But that
> depends on having the visibility map set most or all heap pages
> all-visible.
>
> GIN indexes don't support index-only scans, and I don't see that
> changing. So it's possible that just adding TID compression to B-Tree
> indexes would significantly speedup this kind of query, just by making
> low cardinality indexes much smaller. Though that's a hard project,
> for many subtle reasons. This really amounts to building a bitmap
> index, of the kind that are typically used for data warehousing, which
> is something that has been discussed plenty of times on this list. GIN
> indexes were really built for things like full-text search, not for
> data warehousing.
>
> B-Tree deduplication makes B-Tree indexes a lot smaller, but it tends
> to "saturate" at about 3.5x smaller (relative to the same index with
> deduplication disabled) once there are about 10 or so distinct keys
> per row (the exception is indexes that happen to have huge keys, which
> aren't very interesting). There are many B-Tree indexes (with typical
> sized keys) that are similar in size to an "equivalent" GIN index --
> the ability to compress TIDs isn't very valuable when you don't have
> that many TIDs per key anyway. It's different when you have many TIDs
> per key, of course. GIN indexes alone don't "saturate" at the same
> point -- there is often a big size difference between low cardinality
> and ultra low cardinality data. There are bound to be cases where not
> having that level of space efficiency matters, especially with B-Tree
> index-only scans that scan a significant fraction of the entire index,
> or even the entire index.
>
> --
> Peter Geoghegan
>


Re: broken regress tests on fedora 36

2022-06-07 Thread Michael Paquier
On Tue, Jun 07, 2022 at 10:54:07AM -0400, Andrew Dunstan wrote:
> On 2022-06-07 Tu 08:56, Michael Paquier wrote:
>> On Tue, Jun 07, 2022 at 10:52:45AM +0200, Pavel Stehule wrote:
>>> #   Failed test '\timing with query error: timing output appears'
>>> #   at t/001_basic.pl line 95.
>>> #   'Time: 0,293 ms'
>>> # doesn't match '(?^m:^Time: \d+\.\d\d\d ms)'
>>> # Looks like you failed 2 tests of 58.
>> Fun.  The difference is in the separator: dot vs comma.  This should
>> fail with French the same way.  Perhaps it would fail differently in
>> other languages?  There is no need to be that precise with the regex
>> IMO, so I would just cut the number out of the regex, checking only the
>> unit at the end.
> 
> or just replace '\.' with '[.,]'

I was wondering about other separators actually:
https://en.wikipedia.org/wiki/Decimal_separator#Usage_worldwide

These two should be enough, though.  So changing only that sounds fine
by me.
--
Michael




Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Tom Lane
I wrote:
> The attached draft patch makes the following changes:

Here's a v2 that polishes the loose ends:

> (I didn't do anything about in_hot_standby, which is set through
> a hack rather than via set_config_option; not sure whether we want
> to do anything there, or what it should be if we do.)

I concluded that directly assigning to in_hot_standby was a fairly
horrid idea and we should just change it with SetConfigOption.
With this coding, as long as in_hot_standby is TRUE it will show
as having a non-default setting in \dconfig.  I had to remove the
assertion I'd added about PGC_INTERNAL variables only receiving
"default" values, but this just shows that was too inflexible anyway.

> * The rlimit-derived value of max_stack_depth is likewise relabeled
> as PGC_S_DYNAMIC_DEFAULT, resolving the complaint Jonathan had upthread.
> But now that we have a way to hide this, I'm having second thoughts
> about whether we should.  If you are on a platform that's forcing an
> unreasonably small stack size, it'd be good if \dconfig told you so.
> Could it be sane to label that value as PGC_S_DYNAMIC_DEFAULT only when
> it's the limit value (2MB) and PGC_S_ENV_VAR when it's smaller?

I concluded that was just fine and did it.

regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71136b11a2..8764084e21 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4167,7 +4167,7 @@ ReadControlFile(void)
 
 	snprintf(wal_segsz_str, sizeof(wal_segsz_str), "%d", wal_segment_size);
 	SetConfigOption("wal_segment_size", wal_segsz_str, PGC_INTERNAL,
-	PGC_S_OVERRIDE);
+	PGC_S_DYNAMIC_DEFAULT);
 
 	/* check and update variables dependent on wal_segment_size */
 	if (ConvertToXSegs(min_wal_size_mb, wal_segment_size) < 2)
@@ -4186,7 +4186,7 @@ ReadControlFile(void)
 
 	/* Make the initdb settings visible as GUC variables, too */
 	SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no",
-	PGC_INTERNAL, PGC_S_OVERRIDE);
+	PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 }
 
 /*
@@ -4343,13 +4343,22 @@ XLOGShmemSize(void)
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
+	 *
+	 * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+	 * However, if the DBA explicitly set wal_buffers = -1 in the config file,
+	 * then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must force
+	 * the matter with PGC_S_OVERRIDE.
 	 */
 	if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
 		snprintf(buf, sizeof(buf), "%d", XLOGChooseNumBuffers());
-		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER,
+		PGC_S_DYNAMIC_DEFAULT);
+		if (XLOGbuffers == -1)	/* failed to apply it? */
+			SetConfigOption("wal_buffers", buf, PGC_POSTMASTER,
+			PGC_S_OVERRIDE);
 	}
 	Assert(XLOGbuffers > 0);
 
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 9fa8fdd4cf..9a610d41ad 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -262,7 +262,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("-X requires a power of two value between 1 MB and 1 GB")));
 	SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL,
-	PGC_S_OVERRIDE);
+	PGC_S_DYNAMIC_DEFAULT);
 }
 break;
 			case 'c':
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3b73e26956..dde4bc25b1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -921,7 +921,7 @@ PostmasterMain(int argc, char *argv[])
 		 * useful to show, even if one would only expect at least PANIC.  LOG
 		 * entries are hidden.
 		 */
-		SetConfigOption("log_min_messages", "FATAL", PGC_INTERNAL,
+		SetConfigOption("log_min_messages", "FATAL", PGC_SUSET,
 		PGC_S_OVERRIDE);
 	}
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 26372d95b3..1a6f527051 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -334,7 +334,8 @@ InitializeShmemGUCs(void)
 	size_b = CalculateShmemSize(NULL);
 	size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
 	sprintf(buf, "%zu", size_mb);
-	SetConfigOption("shared_memory_size", buf, PGC_INTERNAL, PGC_S_OVERRIDE);
+	SetConfigOption("shared_memory_size", buf,
+	PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 
 	/*
 	 * Calculate the number of huge pages required.
@@ -346,6 +347,7 @@ InitializeShmemGUCs(void)
 
 		hp_required = add_size(size_b / hp_size, 1);
 		sprintf(buf, "%zu", hp_required);
-		SetConfigOption("shared_memory_size_in_huge_pages", buf, PGC_INTERNAL, PGC_S_OVERRIDE);
+		

Re: Collation version tracking for macOS

2022-06-07 Thread Thomas Munro
 On Wed, Jun 8, 2022 at 10:59 AM Peter Geoghegan  wrote:
> On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro  wrote:
> > Yeah, it's possible to link against multiple versions in theory and
> > that might be a way to do it if we were shipping our own N copies of
> > ICU like DB2 does, but that's hard in practice for shared libraries on
> > common distros (and vendoring or static linking of such libraries was
> > said to be against many distros' rules, since it would be a nightmare
> > if everyone did that, though I don't have a citation for that).
>
> I'm not saying that it's going to be easy, but I can't see why it
> should be impossible. I use Debian unstable for most of my work. It
> supports multiple versions of LLVM/clang, not just one (though there
> is a virtual package with a default version, I believe). What's the
> difference, really?

The difference is that Debian has libllvm-{11,12,13,14}-dev packages,
but it does *not* have multiple -dev packages for libicu, just a
single libicu-dev which can be used to compile and link against their
chosen current library version.  They do have multiple packages for
the actual .so and allow them to be installed concurrently.
Therefore, you could install N .sos and dlopen() them, but you *can't*
write a program that compiles and links against N versions at the same
time using their packages (despite IBM's work to make that possible,
perhaps for use in their own databases).

> Packaging standards certainly matter, but they're not immutable laws
> of the universe. It seems reasonable to suppose that the people that
> define these standards would be willing to hear us out -- this is
> hardly a trifling matter, or something that only affects a small
> minority of *their* users.

OK, yeah, I'm thinking within the confines of things we can do easily
right now on existing systems as they are currently packaging software
only by changing our code, not "tell Debian to change their packaging
so we can compile and link against N versions".   Supposing Debian
maintainers (and all the others) agreed, there'd still be something else
in favour of dlopen():  wouldn't it be nice if the users were not
limited by the versions that the packager of PostgreSQL decided to
link against?  What if someone has a good reason to want to use ICU
versions that are older than Debian currently ships, that are easily
available in add-on repos?

> > Yeah, I've flip-flopped a couple of times on the question of whether
> > ICU63 and ICU67 should be different collation providers, or
> > individual collations should somehow specify the library they want to
> > use (admittedly what I showed above with a raw library name is pretty
> > ugly and some indirection scheme might be nice).  It would be good to
> > drill into the pros and cons of those two choices.
>
> I think that there are pretty good technical reasons why each ICU
> version is tied to a particular version of CLDR. Implementing CLDR
> correctly and efficiently is a rather difficult process, even if we
> ignore figuring out what natural language rules make sense. And so
> linking to multiple different ICU versions doesn't really seem like
> overkill to me. Or if it is then I can easily think of far better
> examples of software bloat. Defining "stable behavior for collations"
> as "uses exactly the same software artifact over time" is defensive
> (compared to always linking to one ICU version that does it all), but
> we have plenty that we need to defend against here.

I think we're not understanding each other here: I was talking about
the technical choice of whether we'd model the multiple library
versions in our catalogues as different "collprovider" values, or
somehow encode them into the "collcollate" string, or something else.
I'm with you, I'm already sold on the multi-library concept (and have
been in several previous cycles of this recurring discussion), which
is why I'm trying to move to discussing nuts and bolts and packaging
and linking realities that apparently stopped any prototype from
appearing last time around.




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro  wrote:
> Yeah, it's possible to link against multiple versions in theory and
> that might be a way to do it if we were shipping our own N copies of
> ICU like DB2 does, but that's hard in practice for shared libraries on
> common distros (and vendoring or static linking of such libraries was
> said to be against many distros' rules, since it would be a nightmare
> if everyone did that, though I don't have a citation for that).

I'm not saying that it's going to be easy, but I can't see why it
should be impossible. I use Debian unstable for most of my work. It
supports multiple versions of LLVM/clang, not just one (though there
is a virtual package with a default version, I believe). What's the
difference, really?

Packaging standards certainly matter, but they're not immutable laws
of the universe. It seems reasonable to suppose that the people that
define these standards would be willing to hear us out -- this is
hardly a trifling matter, or something that only affects a small
minority of *their* users.

We don't need to support a huge number of versions on each OS -- just
enough to make it feasible for everybody to avoid the need to ever
reindex every index on a collatable type (maybe ICU versions that were
the default for the last several major versions of the OS are
available through special packages). We don't necessarily have to have
a hard dependency on every supported version from the point of view of
the package manager. And all of this would ultimately be the
responsibility of each individual packager; they'd need to figure out
how to make it work within the context of the platform that they're
targeting. We'd facilitate that important work, but would defer to
them on the final details. There could be a hands-off approach to the
whole thing, so it wouldn't be a total departure from what we do
today.

> Yeah, I've flip-flopped a couple of times on the question of whether
> ICU63 and ICU67 should be different collation providers, or
> individual collations should somehow specify the library they want to
> use (admittedly what I showed above with a raw library name is pretty
> ugly and some indirection scheme might be nice).  It would be good to
> drill into the pros and cons of those two choices.

I think that there are pretty good technical reasons why each ICU
version is tied to a particular version of CLDR. Implementing CLDR
correctly and efficiently is a rather difficult process, even if we
ignore figuring out what natural language rules make sense. And so
linking to multiple different ICU versions doesn't really seem like
overkill to me. Or if it is then I can easily think of far better
examples of software bloat. Defining "stable behavior for collations"
as "uses exactly the same software artifact over time" is defensive
(compared to always linking to one ICU version that does it all), but
we have plenty that we need to defend against here.

-- 
Peter Geoghegan




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-06-07 Thread Jacob Champion
On Mon, Jun 6, 2022 at 11:44 AM Robert Haas  wrote:
> I think I'd feel more comfortable here if we were defining what went
> into which struct on some semantic basis rather than being like, OK,
> so all the stuff we want to serialize goes into struct #1, and the
> stuff we don't want to serialize goes into struct #2. I suppose if
> it's just based on whether or not we want to serialize it, then the
> placement of future additions will just be based on how people happen
> to feel about the thing they're adding right at that moment, and there
> won't be any consistency.

"This struct contains connection fields that are explicitly safe for
workers to access" _is_ a useful semantic, in my opinion. And it seems
like it'd make it easier to determine what needs to be included in the
struct; I'm not sure I follow why it would result in less consistency.

But to your suggestion, if we just called the new struct
"ClientConnectionInfo", would it be a useful step towards your
proposed three-bucket state? I guess I'm having trouble understanding
why a struct that is defined as "this stuff *doesn't* get serialized"
is materially different from having one that's the opposite.

Thanks,
--Jacob




Re: Collation version tracking for macOS

2022-06-07 Thread Thomas Munro
On Wed, Jun 8, 2022 at 8:16 AM Peter Geoghegan  wrote:
> On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro  wrote:
> > Earlier I mentioned distinct "providers" but I take that back, that's
> > too complicated.  Reprising an old idea that comes up each time we
> > talk about this, this time with some more straw-man detail: what about
> > teaching our ICU support to understand "libicu18n.so.71:en" to mean
> > that it should dlopen() that library and use its functions?  Or some
> > cleverer, shorter notation.  Then it's the user's problem to make sure
> > the right libraries are installed, and it'll fail if they're not.  For
> > example, on Debian bookworm right now you can install libicu63,
> > libicu67, libicu71, though only the "current" -dev package, but which
> > I'm sure we can cope with.  You're at the mercy of the distro or
> > add-on package repos to keep a lot of versions around, but that seems
> > OK.
>
> Right. Postgres could link to multiple versions of ICU at the same
> time. Right now it doesn't, and right now the ICU C symbol names that
> we use are actually versioned (this isn't immediately apparent because
> the C preprocessor makes it appear that ICU symbol names are generic).

Yeah, it's possible to link against multiple versions in theory and
that might be a way to do it if we were shipping our own N copies of
ICU like DB2 does, but that's hard in practice for shared libraries on
common distros (and vendoring or static linking of such libraries was
said to be against many distros' rules, since it would be a nightmare
if everyone did that, though I don't have a citation for that).  I
suspect it's better to use dlopen() to load them, because (1) I
believe that the major distros only have -dev/-devel packages for the
"current" version, even though they let you install the packages
containing the .so files for multiple versions at the same time so
that binaries linked against older versions keep working and (2) I
think it'd be cool if users were free to find more ICU versions in
add-on package repos and be able to use them to get a version that the
packager of PostgreSQL didn't anticipate.

> We could perhaps invent a new indirection that knows about
> multiple ICU versions, each of which is an independent collation
> provider, or maybe a related collation provider that gets used by
> default on REINDEX. ICU is designed for this kind of thing. That
> approach more or less puts packagers on the hook for managing
> collation stability. But now long term collation stability is at least
> feasible -- we at least have a coherent strategy. In the worst case
> the community .deb and .rpm repos might continue to support an older
> ICU version, or lobby for its continued support by the distro (while
> actively discouraging its use in new databases). This isn't the same
> thing as forking ICU. It's a compromise between that extreme, and
> the current situation.

Yeah, I've flip-flopped a couple of times on the question of whether
ICU63 and ICU67 should be different collation providers, or
individual collations should somehow specify the library they want to
use (admittedly what I showed above with a raw library name is pretty
ugly and some indirection scheme might be nice).  It would be good to
drill into the pros and cons of those two choices.  As for getting
sane defaults, I don't know if this is a good idea, but it's an idea:
perhaps schemas and search paths could be used, so you avoid having to
include ugly version strings in the collation identifiers, and the
search path effectively controls the default when you don't want to be
explicit (= most users)?




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 2:13 PM Jeremy Schneider
 wrote:
> For my for my part, gut feeling is that MacOS major releases will be
> similar to any other OS major release, which may contain updates to
> collation algorithms and locales. ISTM like the same thing PG is looking
> for on other OS's to trigger the warning. But it might be good to get an
> official reference on MacOS, if someone knows where to find one?  (I don't.)

I just don't think that we should be relying on a huge entity like
Apple or even glibc for this -- they don't share our priorities, and
there is no reason for this to change. The advantage of ICU versioning
is that it is just one library, that can coexist with others,
including other versions of ICU.

Imagine a world in which we support multiple ICU versions (for Debian
packages, say), some of which are getting quite old. Maybe we can
lobby for the platform to continue to support that old version of the
library -- there ought to be options. Lobbying Debian to stick with an
older version of glibc is another matter entirely. That has precisely
zero chance of ever succeeding, for reasons that are quite
understandable.

Half the problem here is to detect breaking changes, but the other
half is to not break anything in the first place. Or to give the user
plenty of opportunity to transition incrementally, without needing to
reindex everything at the same time. Obviously the only way that's
possible is by supporting multiple versions of ICU at the same time,
in the same database. This requires indirection that distinguishes
between "physical and logical" collation versions, where the same
nominal collation can have different implementations across multiple
ICU versions.

The rules for standards like BCP47 (the system that defines the name
of an ICU/CLDR locale) are deliberately very tolerant of what they
accept in order to ensure forwards and backwards compatibility in
environments where there isn't just one ICU/CLDR version [1] (most
environments in the world of distributed or web applications). So you
can expect the BCP47 name of a collation to more or less work on any
ICU version, perhaps with some loss of functionality (this is
unavoidable when you downgrade ICU to a version that doesn't have
whatever CLDR customization you might have relied on). It's very
intentionally a "best effort" approach, because throwing a "locale not
found" error message usually isn't helpful from the point of view of
the end user. Note that this is a broader standard than ICU or CLDR or
even Unicode.

[1] https://www.ietf.org/rfc/rfc6067.txt
-- 
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-07 Thread Bruce Momjian
On Tue, Jun  7, 2022 at 03:43:32PM -0400, Tom Lane wrote:
> Thomas Munro  writes:
> > On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor  wrote:
> >> Is this more involved than creating a list of all valid Unicode characters 
> >> (~144 thousand), sorting them, then running crc32 over the sorted order to 
> >> create the "version" for the library/collation pair? Far from free but few 
> >> databases use more than a couple different collations.
> 
> > Collation rules have multiple levels and all kinds of quirks, so that
> > won't work.
> 
> Yeah, and it's exactly at the level of quirks that things are likely
> to change.  Nobody's going to suddenly start sorting B before A.
> They might, say, change their minds about where the digram "cz"
> sorts relative to single letters, in languages where special rules
> for that are a thing.
> 
> The idea of fingerprinting a collation's behavior is interesting,
> but I've got doubts about whether we can make a sufficiently thorough
> fingerprint.

Rather than trying to figure out if the collations changed, have we ever
considered checking if index additions and lookups don't match the OS
collation and reporting these errors somehow?
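
Somewhat related: amcheck can already detect that kind of breakage after the
fact, by verifying that existing index entries still sort correctly under the
current collation rules.  A sketch (the index name is made up; heapallindexed
makes the check slower but more thorough):

  CREATE EXTENSION IF NOT EXISTS amcheck;
  SELECT bt_index_check('orders_customer_name_idx', heapallindexed => true);

That's reactive rather than the lookup-time detection suggested above, but it
is what exists today.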

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread Tom Lane
I wrote:
> That ... is pretty quirky already.  How did it prefer a scan with cost
> 19.32 over one with cost 9.39?  Seems like we've got a bug here somewhere.
> The change in estimated rowcount is rather broken, too.

Ah, false alarm.  I can reproduce your results if I stick an ANALYZE
between the first and second EXPLAIN.  So probably your change in
estimated rowcount and hence cost can be explained by an auto-analyze
coming along at just the right time.

Also, if I fill the geom and location columns with non-null data,
the planner stops preferring those indexes.

So now I'm guessing that the OP's data *was* mostly null, and the
planner preferred the gist indexes because they were smallest,
and then tripped over the nonreturnable-column bug.

regards, tom lane




Re: [PoC] Let libpq reject unexpected authentication requests

2022-06-07 Thread Jacob Champion
v2 rebases over latest, removes the alternate spellings of "password",
and implements OR operations with a comma-separated list. For example:

- require_auth=cert means that the server must ask for, and the client
must provide, a client certificate.
- require_auth=password,md5 means that the server must ask for a
plaintext password or an MD5 hash.
- require_auth=scram-sha-256,gss means that one of SCRAM, Kerberos
authentication, or GSS transport encryption must be successfully
negotiated.
- require_auth=scram-sha-256,cert means that either a SCRAM handshake
must be completed, or the server must request a client certificate. It
has a potential pitfall in that it allows a partial SCRAM handshake,
as long as a certificate is requested and sent.

AND and NOT, discussed upthread, are not yet implemented. I tied
myself up in knots trying to make AND generic, so I think I'm going to
tackle NOT first, instead. The problem with AND is that it only makes
sense when one (and only one) of the options is a form of transport
authentication. (E.g. password+md5 never makes sense.) And although
cert+ and gss+ could be useful, the latter case
is already handled by gssencmode=require, and the gssencmode option is
more powerful since you can disable it (or set it to don't-care).

I'm not generally happy with how the "cert" option is working. With
the other methods, if you don't include a method in the list, then the
connection fails if the server tries to negotiate it. But if you don't
include the cert method in the list, we don't forbid the server from
asking for a cert, because the server always asks for a client
certificate via TLS whether it needs one or not. Behaving in the
intuitive way here would effectively break all use of TLS.

So I think Tom's recommendation that the cert method be handled by an
orthogonal option was a good one, and if that works then maybe we
don't need an AND syntax at all. Presumably I can just add an option
that parallels gssencmode, and then the current don't-care semantics
can be explicitly controlled. Skipping AND also means that I don't
have to create a syntax that can handle AND and NOT at the same time,
which I was dreading.

--Jacob
commit 403d27e07922babcdf6fd5e8ad8524d92811294c
Author: Jacob Champion 
Date:   Mon Jun 6 12:32:03 2022 -0700

squash! libpq: let client reject unexpected auth methods

- rebase over new sslkey() test function
- remove alternate spellings of "password"
- allow comma-separated methods (OR). At least one of the authentication
  methods in the list must complete for the connection to succeed.

diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index a883959756..04f9bf4831 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -870,6 +870,13 @@ auth_description(AuthRequest areq)
return libpq_gettext("an unknown authentication type");
 }
 
+/*
+ * Convenience macro for checking the allowed_auth_methods bitmask. Caller must
+ * ensure that type is not greater than 31 (high bit of the bitmask).
+ */
+#define auth_allowed(conn, type) \
+   (((conn)->allowed_auth_methods & (1 << (type))) != 0)
+
 /*
  * Verify that the authentication request is expected, given the connection
  * parameters. This is especially important when the client wishes to
@@ -881,7 +888,11 @@ check_expected_areq(AuthRequest areq, PGconn *conn)
 	bool		result = true;
 	char	   *reason = NULL;
 
-	/* If the user required a specific auth method, reject all others. */
+	/*
+	 * If the user required a specific auth method, or specified an allowed set,
+	 * then reject all others here, and make sure the server actually completes
+	 * an authentication exchange.
+	 */
 	if (conn->require_auth)
 	{
 		switch (areq)
@@ -890,21 +901,43 @@ check_expected_areq(AuthRequest areq, PGconn *conn)
 				/*
 				 * Check to make sure we've actually finished our exchange.
 				 */
-				if (strcmp(conn->require_auth, "cert") == 0)
+				if (conn->client_finished_auth)
+					break;
+
+				/*
+				 * No explicit authentication request was made by the server --
+				 * or perhaps it was made and not completed, in the case of
+				 * SCRAM -- but there are two special cases to check:
+				 *
+				 * 1) If the user allowed "cert", then as long as we sent a
+				 *    client certificate to the server in response to its
+				 *    TLS CertificateRequest, this check is satisfied.
+				 *
+				 * 2) If the user 

Re: Collation version tracking for macOS

2022-06-07 Thread Jeremy Schneider
On 6/7/22 1:51 PM, Peter Geoghegan wrote:
> On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider
>  wrote:
>> This idea does seem to persist. It's not as frequent as timezones, but
>> collation rules reflect local dialects and customs, and there are
>> changes quite regularly for a variety of reasons. A brief perusal of
>> CLDR changelogs and CLDR jiras can give some insight here:
> 
>> Another misunderstanding that seems to persist is that this only relates
>> to exotic locales or that it's only the 2.28 version.
> 
> I'm not defending the status quo, and I think that I'm better informed
> than most about the problems in this area. My point was that it hardly
> matters that we don't necessarily see outright corruption. This was
> based in part on a misunderstanding of Tom's point, though.


I think I was guilty of the same misunderstanding - apologies Tom!
Thanks Peter for calling that out explicitly.

For my part, my gut feeling is that MacOS major releases will be
similar to any other OS major release, which may contain updates to
collation algorithms and locales. ISTM like the same thing PG is looking
for on other OS's to trigger the warning. But it might be good to get an
official reference on MacOS, if someone knows where to find one?  (I don't.)

-Jeremy


-- 
http://about.me/jeremy_schneider




Re: Collation version tracking for macOS

2022-06-07 Thread Robert Haas
On Tue, Jun 7, 2022 at 4:24 PM Jeremy Schneider
 wrote:
> I haven't yet found a Red Hat minor release that changed
> glibc collation.

I feel like this is a thing that happens regularly enough that it's
known to be a gotcha by many of my colleagues here at EDB.

Perhaps that's all pure fiction, but I doubt it. People don't go
around making up stories about things being broken so they can say bad
things about Red Hat. They got told by customers that things are
broken and then go try to figure out how that happened.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread Tom Lane
David Rowley  writes:
> On Wed, 8 Jun 2022 at 07:55, Tom Lane  wrote:
>> I wonder if there is some quirk in gist cost estimation that makes it
>> improperly claim to be cheaper than btree scans.

> I installed PostGIS 3.1.1 and mocked this up with the attached.

> Looking at the plans, I see:

> # explain select count(*) from logistic_site;
>QUERY PLAN
> -
>  Aggregate  (cost=20.18..20.19 rows=1 width=8)
>->  Bitmap Heap Scan on logistic_site  (cost=5.92..19.32 rows=340 width=0)
>  ->  Bitmap Index Scan on logistic_site_location_54ae0166_id
> (cost=0.00..5.84 rows=340 width=0)
> (3 rows)

> # drop index logistic_site_location_54ae0166_id;
> # explain select count(*) from logistic_site;
>  QUERY PLAN
> -
>  Aggregate  (cost=9.92..9.93 rows=1 width=8)
>->  Bitmap Heap Scan on logistic_site  (cost=5.26..9.39 rows=213 width=0)
>  ->  Bitmap Index Scan on logistic_site_geom_105a08da_id
> (cost=0.00..5.20 rows=213 width=0)
> (3 rows)

That ... is pretty quirky already.  How did it prefer a scan with cost
19.32 over one with cost 9.39?  Seems like we've got a bug here somewhere.
The change in estimated rowcount is rather broken, too.

> So it does appear that the location index is being chosen, at least
> with the data that I inserted. Those gist indexes are costing quite a
> bit cheaper than the cheapest btree index.

It looks like the data you inserted for the geometry columns was uniformly
NULL, which perhaps would result in a very small gist index.  So maybe
for this test data the choice isn't so odd.  Seems unlikely that that'd
be true of the OP's production data, though.

regards, tom lane




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider
 wrote:
> This idea does seem to persist. It's not as frequent as timezones, but
> collation rules reflect local dialects and customs, and there are
> changes quite regularly for a variety of reasons. A brief perusal of
> CLDR changelogs and CLDR jiras can give some insight here:

> Another misunderstanding that seems to persist is that this only relates
> to exotic locales or that it's only the 2.28 version.

I'm not defending the status quo, and I think that I'm better informed
than most about the problems in this area. My point was that it hardly
matters that we don't necessarily see outright corruption. This was
based in part on a misunderstanding of Tom's point, though.

> While the quality of glibc collations aren't great when compared with
> CLDR, I think the glibc maintainers have done versioning exactly right:
> they are clear about which patches are allowed to contain collation
> updates, and the OS distributions are able to ensure stability on major
> OS release. I haven't yet found a Red Hat minor release that changed
> glibc collation.

That might be true, but my impression from interacting with Carlos
O'Donnell is that they pretty much don't take the concern about
stability all that seriously. Which I think is reasonable, given his
position!

The fact that we are this sensitive to glibc collation versioning might
be a wholly unique situation (unlike with ICU, which was built with
that in mind). It might be that every other user of glibc collations
sees this as fairly inconsequential, because they don't have to deal
with persistent state that directly relies on the rules in various
ways that are critically important. Even if glibc theoretically does a
perfect job of versioning, I still think that their priorities are
very much unlike our priorities, and that that should be a relevant
consideration for us.

--
Peter Geoghegan




Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread David Rowley
On Wed, 8 Jun 2022 at 08:31, David Rowley  wrote:
> So it does appear that the location index is being chosen, at least
> with the data that I inserted. Those gist indexes are costing quite a
> bit cheaper than the cheapest btree index.

This seems just to be because the gist indexes are smaller, which is
likely due to me having inserted NULL values into them.

postgres=# select pg_relation_size('logistic_site_key_key');
 pg_relation_size
--
16384
(1 row)


postgres=# select pg_relation_size('logistic_site_location_54ae0166_id');
 pg_relation_size
--
 8192
(1 row)

David




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 1:16 PM Tom Lane  wrote:
> This is not the concern that I have.  I agree that if we tell a user
> that collation X changed behavior and he'd better reindex his indexes
> that use collation X, but none of them actually contain any cases that
> changed behavior, that's not a "false positive" --- that's "it's cheaper
> to reindex than to try to identify whether there's a problem".  What
> I mean by "false positive" is telling every macOS user that they'd better
> reindex everything every year, when in point of fact Apple changes those
> collations almost never.

That does seem like a meaningful distinction. I'm sorry if I
misrepresented your position on this.

We're talking about macOS here, which is hardly a paragon of lean
software. I think that it's worth revisiting the assumption that the C
standard library collations are the most useful set of collations, and
that we shouldn't presume to know better than the operating system.
Couldn't individual packagers establish their own system for managing
collations across multiple ICU versions, as I outlined up-thread?

I think that it's okay (maybe unavoidable) that we keep "libc
collations are authoritative" as a generic assumption when Postgres is
built from source. We can still have defacto standards that apply on
all mainstream platforms when users install standard packages for
production databases -- I don't see why we can't do both. Maybe the
best place to solve this problem is at the level of each individual
package ecosystem.

There can be some outsourcing to package managers this way, without
relying on the underlying OS, or libc collations, or ICU in general.
This scheme wouldn't technically be under our direct control, but
would still be something that we could influence. We could have a back
and forth conversation about what's not working in the field.

-- 
Peter Geoghegan




Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread David Rowley
On Wed, 8 Jun 2022 at 07:55, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY
> >  wrote:
> >> Here is the detail of the table (I have anonymized it on SO, this is its 
> >> real name):
> >> "logistic_site_location_54ae0166_id" gist (location)
> > I imagine this is due to the planner choosing an index-only scan on
> > the above index. A similar problem was reported in [1].
>
> The other gist index could also be the problem.  It seems odd though
> that the planner would favor either index for this purpose over the btree
> indexes on scalar columns, which you'd think would be a lot smaller.
> I wonder if there is some quirk in gist cost estimation that makes it
> improperly claim to be cheaper than btree scans.

I installed PostGIS 3.1.1 and mocked this up with the attached.

Looking at the plans, I see:

# explain select count(*) from logistic_site;
   QUERY PLAN
-
 Aggregate  (cost=20.18..20.19 rows=1 width=8)
   ->  Bitmap Heap Scan on logistic_site  (cost=5.92..19.32 rows=340 width=0)
 ->  Bitmap Index Scan on logistic_site_location_54ae0166_id
(cost=0.00..5.84 rows=340 width=0)
(3 rows)

# drop index logistic_site_location_54ae0166_id;
# explain select count(*) from logistic_site;
 QUERY PLAN
-
 Aggregate  (cost=9.92..9.93 rows=1 width=8)
   ->  Bitmap Heap Scan on logistic_site  (cost=5.26..9.39 rows=213 width=0)
 ->  Bitmap Index Scan on logistic_site_geom_105a08da_id
(cost=0.00..5.20 rows=213 width=0)
(3 rows)

# drop index logistic_site_geom_105a08da_id;
# explain select count(*) from logistic_site;
  QUERY PLAN
--
 Aggregate  (cost=13.93..13.94 rows=1 width=8)
   ->  Bitmap Heap Scan on logistic_site  (cost=9.26..13.39 rows=213 width=0)
 ->  Bitmap Index Scan on logistic_site_key_2e791173_like
(cost=0.00..9.21 rows=213 width=0)
(3 rows)

So it does appear that the location index is being chosen, at least
with the data that I inserted. Those gist indexes are costing quite a
bit cheaper than the cheapest btree index.

David


logistic_site.sql
Description: Binary data


Re: Collation version tracking for macOS

2022-06-07 Thread Jeremy Schneider
On 6/7/22 12:53 PM, Peter Geoghegan wrote:
> 
> Collations by their very nature are unlikely to change all that much.
> Obviously they can and do change, but the details are presumably
> pretty insignificant to a native speaker. 


This idea does seem to persist. It's not as frequent as timezones, but
collation rules reflect local dialects and customs, and there are
changes quite regularly for a variety of reasons. A brief perusal of
CLDR changelogs and CLDR jiras can give some insight here:

https://github.com/unicode-org/cldr

https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22collation%22%20ORDER%20BY%20created%20DESC

The difference between the unicode consortium and the GNU C Library is
that unicode is maintained by people who are specifically interested in
working with language and internationalization challenges. I've spoken
to a glibc maintainer who directly told me that they dislike working
with the collation code, and try to avoid it. It's not even ISO 14651
anymore with so many custom glibc-specific changes layered on top. I
looked at the first few commits in the glibc source that were
responsible for the big 2.28 changes - there was a series of quite a
few commits and some were so large they wouldn't even load in the github
API.

Here's one such commit:

https://github.com/bminor/glibc/commit/9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4

It's reasonable to expect that Red Hat and Debian will keep things
stable on one particular major, and to expect that every new major OS
version will update to the latest collation algorithms and locale data
for glibc.

Another misunderstanding that seems to persist is that this only relates
to exotic locales or that it's only the 2.28 version.

My github repo is out-of-date (I know of more cases that I still need to
publish) but the old data already demonstrates changes to the root/DUCET
collation rules (evident in en_US without any tailoring) for glibc
versions 2.13, 2.21 and 2.26

https://github.com/ardentperf/glibc-unicode-sorting/

If a PostgreSQL user is unlucky enough to have one of those Unicode
characters stored in a table, they can get broken indexes even if they
only use the default US English locale, and without touching glibc 2.28
- and all you need is an index on a field where end users can type any
string input.


> It's pretty clear that glibc as a project doesn't take the issue very
> seriously, because they see it as a problem of the GUI sorting a table
> in a way that seems slightly suboptimal to scholars of a natural
> language. 

I disagree that glibc maintainers are doing anything wrong.

While the quality of glibc collations aren't great when compared with
CLDR, I think the glibc maintainers have done versioning exactly right:
they are clear about which patches are allowed to contain collation
updates, and the OS distributions are able to ensure stability on major
OS release. I haven't yet found a Red Hat minor release that changed
glibc collation.

-Jeremy


-- 
http://about.me/jeremy_schneider




Re: Collation version tracking for macOS

2022-06-07 Thread Thomas Munro
On Wed, Jun 8, 2022 at 7:43 AM Tom Lane  wrote:
> The idea of fingerprinting a collation's behavior is interesting,
> but I've got doubts about whether we can make a sufficiently thorough
> fingerprint.

On one of the many threads about this I recall posting a thought
experiment patch that added system_collation_version_command or some
such, so you could train your computer to compute a hash for
/usr/share/locale/XXX/LC_COLLATE (or whatever it's called on your
system), but it all seemed a bit gross to me on various levels.  Most
people don't know or care about collations so they won't set it up, so
to make it useful it'd have to have useful defaults, and it seems like
a bad idea to teach PostgreSQL where all these systems keep their
collation rules.




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro  wrote:
> Earlier I mentioned distinct "providers" but I take that back, that's
> too complicated.  Reprising an old idea that comes up each time we
> talk about this, this time with some more straw-man detail: what about
> teaching our ICU support to understand "libicu18n.so.71:en" to mean
> that it should dlopen() that library and use its functions?  Or some
> cleverer, shorter notation.  Then it's the user's problem to make sure
> the right libraries are installed, and it'll fail if they're not.  For
> example, on Debian bookworm right now you can install libicu63,
> libicu67, libicu71, though only the "current" -dev package, but which
> I'm sure we can cope with.  You're at the mercy of the distro or
> add-on package repos to keep a lot of versions around, but that seems
> OK.

Right. Postgres could link to multiple versions of ICU at the same
time. Right now it doesn't, and right now the ICU C symbol names that
we use are actually versioned (this isn't immediately apparent because
the C preprocessor makes it appear that ICU symbol names are generic).

We could perhaps invent a new indirection that knows about
multiple ICU versions, each of which is an independent collation
provider, or maybe a related collation provider that gets used by
default on REINDEX. ICU is designed for this kind of thing. That
approach more or less puts packagers on the hook for managing
collation stability. But now long term collation stability is at least
feasible -- we at least have a coherent strategy. In the worst case
the community .deb and .rpm repos might continue to support an older
ICU version, or lobby for its continued support by the distro (while
actively discouraging its use in new databases). This isn't the same
thing as forking ICU. It's a compromise between that extreme, and
the current situation.

--
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-07 Thread Tom Lane
Peter Geoghegan  writes:
> I agree that "false positive" is not a valid way of describing a
> breaking change in a Postgres collation that happens to not affect one
> index in particular, due to the current phase of the moon. It's
> probably very likely that most individual indexes that we warn about
> will be so-called false positives.

This is not the concern that I have.  I agree that if we tell a user
that collation X changed behavior and he'd better reindex his indexes
that use collation X, but none of them actually contain any cases that
changed behavior, that's not a "false positive" --- that's "it's cheaper
to reindex than to try to identify whether there's a problem".  What
I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never.  We will soon lose those users' attention ---
see fable about boy crying wolf --- and then when Apple actually does
change something, we've got a problem.  So if we give collation-change
warnings, they'd better have some measurable connection to reality.

regards, tom lane




Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread Tom Lane
David Rowley  writes:
> On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY
>  wrote:
>> Here is the detail of the table (I have anonymized it on SO, this is its 
>> real name):
>> "logistic_site_location_54ae0166_id" gist (location)
> I imagine this is due to the planner choosing an index-only scan on
> the above index. A similar problem was reported in [1].

The other gist index could also be the problem.  It seems odd though
that the planner would favor either index for this purpose over the btree
indexes on scalar columns, which you'd think would be a lot smaller.
I wonder if there is some quirk in gist cost estimation that makes it
improperly claim to be cheaper than btree scans.

regards, tom lane




Re: Collation version tracking for macOS

2022-06-07 Thread Peter Geoghegan
On Tue, Jun 7, 2022 at 12:37 PM Robert Haas  wrote:
> It's true that we don't have any false positives right now, but we
> also have no true positives. Even a stopped clock is right twice a
> day, but not in a useful way. People want to be notified when a
> problem might exist, even if sometimes it doesn't actually.

Collations by their very nature are unlikely to change all that much.
Obviously they can and do change, but the details are presumably
pretty insignificant to a native speaker. Stands to reason that the
issue (which is fundamentally a problem for natural language experts)
would have been resolved far sooner if there really was a significant
controversy about something that tends to come up often.

It's pretty clear that glibc as a project doesn't take the issue very
seriously, because they see it as a problem of the GUI sorting a table
in a way that seems slightly suboptimal to scholars of a natural
language. Clearly that isn't actually a big deal. But the latent
possibility of wrong answers to queries is a very big deal. Both are
true. It's just a matter of priorities in each case.

I agree that "false positive" is not a valid way of describing a
breaking change in a Postgres collation that happens to not affect one
index in particular, due to the current phase of the moon. It's
probably very likely that most individual indexes that we warn about
will be so-called false positives. I bet that there are many
near-misses in Postgres that we never get to hear about already. That's rather
beside the point. The index must be assumed to be corrupt.

-- 
Peter Geoghegan




Re: Collation version tracking for macOS

2022-06-07 Thread Tom Lane
Robert Haas  writes:
> In fact, I'd go so far as to argue that you're basically sticking your
> head in the sand here. You wrote:

No, I quite agree that we have a problem.  What I don't agree is that
issuing a lot of false-positive warnings is a solution.  That will
just condition people to ignore the warnings, and then when their
platform really does change behavior, they're still screwed.  If we
could *accurately* report collation behavioral changes, I'd be all
for that.

Rod's idea upthread is certainly way too simplistic, but could we
build a set of test cases that do detect known changes in collation
behaviors?  We'd be shooting at a moving target; but even if we're
late in noticing that platform X changed the behavior of collation Y,
we could help users who run into the problem afterwards.
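
For what it's worth, the version information we already record can at least be
compared against what the provider currently reports, along these lines (a
sketch; collversion is only populated for collations whose provider exposes a
version at all):

  SELECT collname, collversion AS recorded,
         pg_collation_actual_version(oid) AS current
  FROM pg_collation
  WHERE collversion IS NOT NULL
    AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);

A behavioral fingerprint would go further than that, by catching changes on
platforms whose provider doesn't report any version.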

regards, tom lane




Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread David Rowley
On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY
 wrote:
> Here is the detail of the table (I have anonymized it on SO, this is its real 
> name):

> "logistic_site_location_54ae0166_id" gist (location)

I imagine this is due to the planner choosing an index-only scan on
the above index. A similar problem was reported in [1].

A fix was committed in [2], which appears in 13.7.

You could turn off enable_indexonlyscan until 13.7 is available to you.

David

[1] 
https://www.postgresql.org/message-id/CAHUie24ddN+pDNw7fkhNrjrwAX=fxxfgzzehhruofv_n_ft...@mail.gmail.com
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0778b24ced8a873b432641001d046d1dde602466




Re: Collation version tracking for macOS

2022-06-07 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor  wrote:
>> Is this more involved than creating a list of all valid Unicode characters 
>> (~144 thousand), sorting them, then running crc32 over the sorted order to 
>> create the "version" for the library/collation pair? Far from free but few 
>> databases use more than a couple different collations.

> Collation rules have multiple levels and all kinds of quirks, so that
> won't work.

Yeah, and it's exactly at the level of quirks that things are likely
to change.  Nobody's going to suddenly start sorting B before A.
They might, say, change their minds about where the digram "cz"
sorts relative to single letters, in languages where special rules
for that are a thing.

The idea of fingerprinting a collation's behavior is interesting,
but I've got doubts about whether we can make a sufficiently thorough
fingerprint.

regards, tom lane




Re: Collation version tracking for macOS

2022-06-07 Thread Robert Haas
On Fri, Jun 3, 2022 at 4:58 PM Tom Lane  wrote:
> I think the real problem here is that the underlying software mostly
> doesn't take this issue seriously.  Unfortunately, that leads one to
> the conclusion that we need to maintain our own collation code and
> data (e.g., our own fork of ICU), and that isn't happening.  Unlike
> say Oracle, we do not have the manpower; nor do we want to bloat our
> code base that much.

You don't, but that opinion isn't universally held, or at least not
with the same vigor that you hold it. See e.g.
https://www.postgresql.org/message-id/a4019c5e570d4dbb5e3f816c080fb57c76ab604a.camel%40cybertec.at
and subsequent discussion, for example.

In fact, I'd go so far as to argue that you're basically sticking your
head in the sand here. You wrote:

"Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update)."

But you wrote that to Peter, who was essentially complaining that we
hadn't done anything, and linked to another source, which was also
complaining about the problem, and then Jeremy Schneider replied to
your email and complained some more.

Complaining about "false positives" doesn't really make sense to me.
It's true that we don't have any false positives right now, but we
also have no true positives. Even a stopped clock is right twice a
day, but not in a useful way. People want to be notified when a
problem might exist, even if sometimes it doesn't actually. The
alternative is having no idea at all that things might be broken,
which is not better.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Collation version tracking for macOS

2022-06-07 Thread Thomas Munro
On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor  wrote:
> Is this more involved than creating a list of all valid Unicode characters 
> (~144 thousand), sorting them, then running crc32 over the sorted order to 
> create the "version" for the library/collation pair? Far from free but few 
> databases use more than a couple different collations.

Collation rules have multiple levels and all kinds of quirks, so that
won't work.




Re: pgcon unconference / impact of block size on performance

2022-06-07 Thread Robert Haas
On Tue, Jun 7, 2022 at 1:47 PM Tomas Vondra
 wrote:
> Possibly, but why would that be the case? Maybe there are places that do
> stuff with memory and have different optimizations based on length? I'd
> bet the 4k page is way more optimized than the other cases.

I don't really know. It was just a thought. It feels like the fact
that the page sizes are different could be hurting us somehow, but I
don't really know what the mechanism would be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Invalid memory access in pg_stat_get_subscription

2022-06-07 Thread Tom Lane
Kuntal Ghosh  writes:
> While exploring some code in logical replication worker
> implementation, I noticed that we're accessing an invalid memory while
> traversing LogicalRepCtx->workers[i].
> For the above structure, we're allocating
> max_logical_replication_workers times LogicalRepWorker amount of
> memory in ApplyLauncherShmemSize. But, in the for loop, we're
> accessing the max_logical_replication_workers + 1 location which is
> resulting in random crashes.

I concur that that's a bug, but eyeing the code, it seems like an
actual crash would be improbable.  Have you seen one?  Can you
reproduce it?

> Please find the patch that fixes the issue. I'm not sure whether we
> should add a regression test for the same.

How would you make a stable regression test for that?

regards, tom lane




Allow foreign keys to reference a superset of unique columns

2022-06-07 Thread Kaiting Chen
Hi everyone:

I'd like to propose a change to PostgreSQL to allow the creation of a foreign
key constraint referencing a superset of uniquely constrained columns.

As it currently stands:

  CREATE TABLE foo (a integer PRIMARY KEY, b integer);
  CREATE TABLE bar (x integer, y integer,
FOREIGN KEY (x, y) REFERENCES foo(a, b));
  > ERROR:  there is no unique constraint matching given keys for referenced
table "foo"

Despite the fact that in "foo", the combination of columns (a, b) is guaranteed
to be unique by virtue of being a superset of the primary key (a).

This capability has been requested before, at least once in 2004 [1] and again
in 2021 [2].

To illustrate when it'd be useful to define such a foreign key constraint,
consider a database that will store graphs (consisting of nodes and edges) where
graphs are discrete and intergraph edges are prohibited:

  CREATE TABLE graph (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
  );

  CREATE TABLE node (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
graph_id INTEGER NOT NULL REFERENCES graph(id)
  );

  CREATE TABLE edge (
graph_id INTEGER,
source_id INTEGER,
FOREIGN KEY (source_id, graph_id) REFERENCES node(id, graph_id),

target_id INTEGER,
FOREIGN KEY (target_id, graph_id) REFERENCES node(id, graph_id),

PRIMARY KEY (graph_id, source_id, target_id)
  );

This schema is unsupported by PostgreSQL absent this constraint:

  ALTER TABLE node ADD UNIQUE (id, graph_id);

However, this constraint, as well as its underlying unique index, is superfluous
as node(id) itself is unique. Its addition serves no semantic purpose but incurs
the cost of additional on-disk storage and update time. Note the prohibition of
intergraph edges isn't enforceable on "edge" without the composite foreign keys
(or triggers).

An alternative approach is to redefine node's PRIMARY KEY as (id, graph_id).
However, this would force every table referring to "node" to duplicate both
columns into their schema, even when a singular "node_id" would suffice. This is
undesirable if there are many tables referring to "node" that have no such
intergraph restrictions and few that do.

While it can be argued that this schema contains some degree of denormalization,
it isn't uncommon and a recent patch was merged to support exactly this kind of
design [3].

In that case, the SET NULL and SET DEFAULT referential actions gained support
for an explicit column list to accommodate this type of design.

A problem evinced by Tom Lane in the 2004 discussion on this was that, were this
permitted, the index supporting a foreign key constraint could be ambiguous:

  > I think one reason for this is that otherwise it's not clear which
  > unique constraint the FK constraint depends on.  Consider
  >
  > create table a (f1 int unique, f2 int unique);
  >
  > create table b (f1 int, f2 int,
  > foreign key (f1,f2) references a(f1,f2));
  >
  > How would you decide which constraint to make the FK depend on?
  > It'd be purely arbitrary.

I propose that this problem is addressed, with an extension to the SQL standard,
as follows in the definition of a foreign key constraint:

1. Change the restriction on the referenced columns in a foreign key constraint
   to:

   The referenced columns must be a *superset* (the same, or a strict superset)
   of the columns of a non-deferrable unique or primary key index on the
   referenced table.

2. The FOREIGN KEY constraint syntax gains a [ USING INDEX index_name ] clause
   optionally following the referenced column list.

   The index specified by this clause is used to support the foreign key
   constraint, and it must be a non-deferrable unique or primary key index on
   the referenced table compatible with the referenced columns.

   Here, compatible means that the columns of the index are a subset (the same,
   or a strict subset) of the referenced columns.

3. If the referenced columns are the same as the columns of such a unique (or
   primary key) index on the referenced table, then the behavior in PostgreSQL
   doesn't change.

4. If the referenced columns are a strict superset of the columns of such an
   index on the referenced table, then:
   1. If the primary key of the referenced table is a strict subset of the
  referenced columns, then its index is used to support the foreign key if
  no USING INDEX clause is present.
   2. Otherwise, the USING INDEX clause is required.

I believe that this scheme is unambiguous and should stably round trip a dump
and restore. In my previous example, the foreign key constraints could then be
defined as:

FOREIGN KEY (source_id, graph_id) REFERENCES node(id, graph_id),
FOREIGN KEY (target_id, graph_id) REFERENCES node(id, graph_id),

Or alternatively:

FOREIGN KEY (source_id, graph_id) REFERENCES node(id, graph_id)
  USING INDEX node_pkey,
FOREIGN KEY (target_id, graph_id) REFERENCES node(id, 

Re: An inverted index using roaring bitmaps

2022-06-07 Thread Peter Geoghegan
On Mon, Jun 6, 2022 at 10:42 PM Chinmay Kanchi  wrote:
> The simulated index in this case is outrageously fast, up to ~150x on the 
> GROUP BY.

Couldn't you make a similar argument in favor of adding a B-Tree index
on "country"? This probably won't be effective in practice, but the
reasons for this have little to do with how a B-Tree index represents
TIDs. A GIN index can compress TIDs much more effectively, but the
same issues apply there.

The main reason why it won't work with a B-Tree is that indexes in
Postgres are not transactionally consistent structures, in general.
Whereas your cities_rb table is transactionally consistent (or perhaps
just simulates a transactionally consistent index). Maybe it could
work in cases where an index-only scan could be used, which is roughly
comparable to having a transactionally consistent index. But that
depends on having the visibility map set most or all heap pages
all-visible.

GIN indexes don't support index-only scans, and I don't see that
changing. So it's possible that just adding TID compression to B-Tree
indexes would significantly speedup this kind of query, just by making
low cardinality indexes much smaller. Though that's a hard project,
for many subtle reasons. This really amounts to building a bitmap
index, of the kind that are typically used for data warehousing, which
is something that has been discussed plenty of times on this list. GIN
indexes were really built for things like full-text search, not for
data warehousing.

B-Tree deduplication makes B-Tree indexes a lot smaller, but it tends
to "saturate" at about 3.5x smaller (relative to the same index with
deduplication disabled) once there are about 10 or so rows per
distinct key (the exception is indexes that happen to have huge keys, which
aren't very interesting). There are many B-Tree indexes (with typical
sized keys) that are similar in size to an "equivalent" GIN index --
the ability to compress TIDs isn't very valuable when you don't have
that many TIDs per key anyway. It's different when you have many TIDs
per key, of course. GIN indexes alone don't "saturate" at the same
point -- there is often a big size difference between low cardinality
and ultra low cardinality data. There are bound to be cases where not
having that level of space efficiency matters, especially with B-Tree
index-only scans that scan a significant fraction of the entire index,
or even the entire index.
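
A quick way to see that saturation effect on your own data is to build the same
low-cardinality index with and without deduplication and compare sizes.  A
sketch (table and column names are made up; deduplicate_items requires
Postgres 13+):

  CREATE INDEX people_country_dedup ON people (country);
  CREATE INDEX people_country_nodedup ON people (country)
    WITH (deduplicate_items = off);
  SELECT pg_size_pretty(pg_relation_size('people_country_dedup')) AS dedup,
         pg_size_pretty(pg_relation_size('people_country_nodedup')) AS no_dedup;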

-- 
Peter Geoghegan




Re: pgcon unconference / impact of block size on performance

2022-06-07 Thread Tomas Vondra
On 6/7/22 18:26, Robert Haas wrote:
> On Sat, Jun 4, 2022 at 7:23 PM Tomas Vondra
>  wrote:
>> This opened a long discussion about possible explanations - I claimed
>> one of the main factors is the adoption of flash storage, due to pretty
>> fundamental differences between HDD and SSD systems. But the discussion
>> concluded with an agreement to continue investigating this, so here's an
>> attempt to support the claim with some measurements/data.
> 
> Interesting. I wonder if the fact that x86 machines have a 4kB page
> size matters here. It seems hard to be sure because it's not something
> you can really change. But there are a few of your graphs where 4kB
> spikes up above any higher or lower value, and maybe that's why?
>

Possibly, but why would that be the case? Maybe there are places that do
stuff with memory and have different optimizations based on length? I'd
bet the 4k page is way more optimized than the other cases.

But honestly, I think the SSD page size matters much more, and the main
bump between 4k and 8k comes from having to deal with just a single
page. Imagine you write 8k postgres page - the filesystem splits that
into two 4k pages, and then eventually writes them to storage. It may
happen the writeback flushes them separately, possibly even to different
places on the device. Which might be more expensive to read later, etc.

I'm just speculating, of course. Maybe the storage is smarter and can
figure some of this internally, or maybe the locality will remain high.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Invalid memory access in pg_stat_get_subscription

2022-06-07 Thread Kuntal Ghosh
Hello hackers,

While exploring some code in logical replication worker
implementation, I noticed that we're accessing an invalid memory while
traversing LogicalRepCtx->workers[i].
For the above structure, we're allocating
max_logical_replication_workers times LogicalRepWorker amount of
memory in ApplyLauncherShmemSize. But, in the for loop, we're
accessing the max_logical_replication_workers + 1 location which is
resulting in random crashes.

Please find the patch that fixes the issue. I'm not sure whether we
should add a regression test for the same.

-- 
Thanks & Regards,
Kuntal Ghosh


0001-Fix-terminating-condition-while-fetching-the-replica.patch
Description: Binary data


Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Tom Lane
Christoph Berg  writes:
> in_hot_standby sounds very useful to have in that list.

I thought about this some more and concluded that we're blaming the
messenger.  There's nothing wrong with \dconfig's filtering rule;
rather, it's the fault of the backend for mislabeling a lot of
run-time-computed values as source=PGC_S_OVERRIDE when they really
are dynamic defaults.  Now in fairness, I think a lot of that code
predates the invention of PGC_S_DYNAMIC_DEFAULT, so there was not
a better way when it was written ... but there is now.
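
For anyone following along, the source labels in question are visible in
pg_settings, e.g. (a sketch):

  SELECT name, setting, source
  FROM pg_settings
  WHERE name IN ('wal_segment_size', 'data_checksums', 'shared_memory_size',
                 'max_stack_depth', 'wal_buffers');

With the relabeling below, these run-time-computed values report source =
'default' (the display name of PGC_S_DYNAMIC_DEFAULT) rather than 'override',
which is what lets \dconfig skip them.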

The attached draft patch makes the following changes:

* All run-time calculations of PGC_INTERNAL variables are relabeled
as PGC_S_DYNAMIC_DEFAULT.  There is not any higher source value
that we'd allow to set such a variable, so this is sufficient.
(I didn't do anything about in_hot_standby, which is set through
a hack rather than via set_config_option; not sure whether we want
to do anything there, or what it should be if we do.)  The net
effect of this compared to upthread examples is to hide
shared_memory_size and shared_memory_size_in_huge_pages from \dconfig.

* The auto-tune value of wal_buffers is relabeled as PGC_S_DYNAMIC_DEFAULT
if possible (but due to pre-existing hacks, we might have to force it).
This will hide that one too as long as you didn't manually set it.

* The rlimit-derived value of max_stack_depth is likewise relabeled
as PGC_S_DYNAMIC_DEFAULT, resolving the complaint Jonathan had upthread.
But now that we have a way to hide this, I'm having second thoughts
about whether we should.  If you are on a platform that's forcing an
unreasonably small stack size, it'd be good if \dconfig told you so.
Could it be sane to label that value as PGC_S_DYNAMIC_DEFAULT only when
it's the limit value (2MB) and PGC_S_ENV_VAR when it's smaller?

In any case, I expect that we'd apply this patch only to HEAD, which
means that when using psql's \dconfig against a pre-v15 server,
you'd still see these settings that we're trying to hide.
That doesn't bother me too much, but maybe some would find it
confusing.

Thoughts?

regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71136b11a2..2461317ee4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4167,7 +4167,7 @@ ReadControlFile(void)
 
 	snprintf(wal_segsz_str, sizeof(wal_segsz_str), "%d", wal_segment_size);
 	SetConfigOption("wal_segment_size", wal_segsz_str, PGC_INTERNAL,
-	PGC_S_OVERRIDE);
+	PGC_S_DYNAMIC_DEFAULT);
 
 	/* check and update variables dependent on wal_segment_size */
 	if (ConvertToXSegs(min_wal_size_mb, wal_segment_size) < 2)
@@ -4186,7 +4186,7 @@ ReadControlFile(void)
 
 	/* Make the initdb settings visible as GUC variables, too */
 	SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no",
-	PGC_INTERNAL, PGC_S_OVERRIDE);
+	PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
 }
 
 /*
@@ -4343,13 +4343,22 @@ XLOGShmemSize(void)
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
+	 *
+	 * We prefer to report this value as PGC_S_DYNAMIC_DEFAULT.  However, if
+	 * the DBA explicitly set wal_buffers = -1 in the config file, then
+	 * PGC_S_DYNAMIC_DEFAULT will fail to override that and we must force the
+	 * matter with PGC_S_OVERRIDE.
 	 */
 	if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
 		snprintf(buf, sizeof(buf), "%d", XLOGChooseNumBuffers());
-		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER,
+		PGC_S_DYNAMIC_DEFAULT);
+		if (XLOGbuffers == -1)	/* failed to apply it? */
+			SetConfigOption("wal_buffers", buf, PGC_POSTMASTER,
+			PGC_S_OVERRIDE);
 	}
 	Assert(XLOGbuffers > 0);
 
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 9fa8fdd4cf..9a610d41ad 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -262,7 +262,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("-X requires a power of two value between 1 MB and 1 GB")));
 	SetConfigOption("wal_segment_size", optarg, PGC_INTERNAL,
-	PGC_S_OVERRIDE);
+	PGC_S_DYNAMIC_DEFAULT);
 }
 break;
 			case 'c':
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3b73e26956..dde4bc25b1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -921,7 +921,7 @@ PostmasterMain(int argc, char *argv[])
 		 * useful to show, even if one would only expect at least PANIC.  LOG
 		 * entries are hidden.
 		 */
-		SetConfigOption("log_min_messages", "FATAL", PGC_INTERNAL,
+		SetConfigOption("log_min_messages", "FATAL", PGC_SUSET,
 		PGC_S_OVERRIDE);
 	

Re: pgcon unconference / impact of block size on performance

2022-06-07 Thread Robert Haas
On Sat, Jun 4, 2022 at 7:23 PM Tomas Vondra
 wrote:
> This opened a long discussion about possible explanations - I claimed
> one of the main factors is the adoption of flash storage, due to pretty
> fundamental differences between HDD and SSD systems. But the discussion
> concluded with an agreement to continue investigating this, so here's an
> attempt to support the claim with some measurements/data.

Interesting. I wonder if the fact that x86 machines have a 4kB page
size matters here. It seems hard to be sure because it's not something
you can really change. But there are a few of your graphs where 4kB
spikes up above any higher or lower value, and maybe that's why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Collation version tracking for macOS

2022-06-07 Thread Rod Taylor
On Mon, Jun 6, 2022 at 8:25 PM Tom Lane  wrote:

> Jim Nasby  writes:
> >> I think the real problem here is that the underlying software mostly
> >> doesn't take this issue seriously.
>
> > The first step to a solution is admitting that the problem exists.
> > Ignoring broken backups, segfaults and data corruption as a "rant"
> > implies that we simply throw in the towel and tell users to suck it up
> > or switch engines. There are other ways to address this short of the
> > community doing all the work itself. One simple example would be to
> > refuse to start if the collation provider has changed since initdb
> > (which we'd need to allow users to override).
>
> You're conveniently skipping over the hard part, which is to tell
> whether the collation provider has changed behavior (which we'd better
> do with pretty darn high accuracy, if we're going to refuse to start
> on the basis of thinking it has).  Unfortunately, giving a reliable
> indication of collation behavioral changes is *exactly* the thing
> that the providers aren't taking seriously.
>

Is this more involved than creating a list of all valid Unicode characters
(~144 thousand), sorting them, then running crc32 over the sorted order to
create the "version" for the library/collation pair? Far from free but few
databases use more than a couple different collations.
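
A minimal SQL sketch of that kind of fingerprint (assuming a UTF8 database and
a libc "en_US" collation; note this only captures the relative order of single
code points, not multi-character or multi-level rules):

  WITH chars AS (
    SELECT chr(cp) AS c
    FROM generate_series(1, x'10FFFF'::int) AS cp
    WHERE cp NOT BETWEEN x'D800'::int AND x'DFFF'::int  -- skip surrogates
  )
  SELECT md5(string_agg(c, '' ORDER BY c COLLATE "en_US")) AS fingerprint
  FROM chars;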

-- 
Rod Taylor


Re: Add header support to text format and matching feature

2022-06-07 Thread Julien Rouhaud
Hi,

On Wed, Mar 30, 2022 at 09:11:09AM +0200, Peter Eisentraut wrote:
>
> Committed, after some further refinements as discussed.

While working on nearby code, I found some problems with this feature.

First, probably nitpicking: HEADER MATCH is allowed for COPY TO, is that
expected?  The documentation isn't really explicit about it, but since there's
nothing to match when exporting data it's a bit surprising.  I'm not opposed to
having HEADER MATCH mean HEADER ON for COPY TO, as as-is one can easily reuse
the command history, but maybe it should be clearly documented?

Then, apparently HEADER MATCH doesn't let you do sanity checks against a custom
column list.  This one looks like a clear oversight, as something like that
should be entirely valid IMHO:

CREATE TABLE tbl(col1 int, col2 int);
COPY tbl (col2, col1) TO '/path/to/file' WITH (HEADER MATCH);
COPY tbl (col2, col1) FROM '/path/to/file' WITH (HEADER MATCH);

but right now it errors out with:

ERROR:  column name mismatch in header line field 1: got "col1", expected "col2"

Note that the error message is bogus if you specify attributes in a
different order from the relation, as the code is mixing access to the tuple
desc and access to the raw fields with the same offset.

This also means that it will actually fail to detect a mismatch in the provided
column list and let you import data in the wrong position as long as the
datatypes are compatible and the column header in the file are in the correct
order.  For instance:

CREATE TABLE abc (a text, b text, c text);
INSERT INTO abc SELECT 'a', 'b', 'c';
COPY abc TO '/path/to/file' WITH (HEADER MATCH);

You can then import the data with any of those:
COPY abc(c, b, a) FROM '/path/to/file' WITH (HEADER MATCH);
COPY abc(c, a, b) FROM '/path/to/file' WITH (HEADER MATCH);
[...]
SELECT * FROM abc;

Even worse, if you try to do a COPY ... FROM ... WITH (HEADER ON) on a table
that has some dropped attribute(s), the current code will access random memory
as there's no exact attnum / raw field mapping anymore.

I can work on a fix if needed (with some additional regression test to cover
those cases), but I'm still not sure whether a user-provided column list is
supposed to be accepted for HEADER MATCH.  In the meantime I will
add an open item.




Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Christoph Berg
Re: Jonathan S. Katz
> On 6/7/22 10:26 AM, Robert Haas wrote:
> I think some of these could be interesting if they deviate from the default
> (e.g. "in_hot_standby") as it will give the user context on the current
> state of the system.
> 
> However, something like that is still fairly easy to determine (e.g.
> `pg_catalog.pg_is_in_recovery()`). And looking through the settings marked
> "internal" showing the non-defaults may not provide much additional context
> to a user.

in_hot_standby sounds very useful to have in that list.

Christoph




Re: effective_io_concurrency and NVMe devices

2022-06-07 Thread Tomas Vondra
On 6/7/22 15:29, Jakub Wartak wrote:
> Hi Tomas,
> 
>>> I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
>>> NVMe SSD. I ran a few tests to see how different values of
>>> effective_io_concurrency would affect performance. I tried to come up
>>> with a query that did little enough CPU processing to ensure that I/O
>>> was the clear bottleneck.
>>>
>>> The test was with a 128GB table on a machine with 64GB of RAM.  I
>>> padded the tuples out so there were 4 per page so that the aggregation
>>> didn't have much work to do.
>>>
>>> The query I ran was: explain (analyze, buffers, timing off) select
>>> count(p) from r where a = 1;
>  
>> The other idea I had while looking at batching a while back, is that we 
>> should
>> batch the prefetches. The current logic interleaves prefetches with other 
>> work -
>> prefetch one page, process one page, ... But once reading a page gets
>> sufficiently fast, this means the queues never get deep enough for
>> optimizations. So maybe we should think about batching the prefetches, in 
>> some
>> way. Unfortunately posix_fadvise does not allow batching of requests, but we
>> can at least stop interleaving the requests.
> 
> .. for now it doesn't, but IORING_OP_FADVISE is on the long-term horizon.  
> 

Interesting! Will take time to get into real systems, though.

>> The attached patch is a trivial version that waits until we're at least
>> 32 pages behind the target, and then prefetches all of them. Maybe give it a 
>> try?
>> (This pretty much disables prefetching for e_i_c below 32, but for an
>> experimental patch that's enough.)
> 
> I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults 
> s_b=128MB, dbsize=8kb but with forced disabled parallel query (for easier 
> inspection with strace just to be sure//so please don't compare times). 
> 
> run:
> a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k IOPS 
> (~122k IOPS practical max here for libaio)
> b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled 
> prefetching, fadvise() not occurring
> c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms, 81432ms 
> (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even peaked for a 
> second on ~122k)
> d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms 99939ms 
> (mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks to 90..100k 
> IOPS)
> 
> ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is worth 
> throwing that patch onto more advanced / complete performance test farm too ? 
> (although it's only for bitmap heap scans)
> 
> run a: looked interleaved as you said:
> fadvise64(160, 1064157184, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(160, "@\0\0\0\200\303/_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 
> \230\300\17@\220\300\17"..., 8192, 1064009728) = 8192
> fadvise64(160, 1064173568, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(160, "@\0\0\0\0\0040_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 
> \230\300\17@\220\300\17"..., 8192, 1064026112) = 8192
> [..]
> 
> BTW: interesting note, for run b, the avgrq-sz from extended iostat jumps is 
> flipping between 16(*512=8kB) to ~256(*512=~128kB!) as if kernel was doing 
> some own prefetching heuristics on and off in cycles, while when calling 
> e_i_c/fadvise() is in action then it seems to be always 8kB requests. So with 
> disabled fadivse() one IMHO might have problems deterministically 
> benchmarking short queries as kernel voodoo might be happening (?)
> 

Yes, the kernel certainly does its own read-ahead, which works pretty well
for sequential patterns. What does

   blockdev --getra /dev/...

say?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: broken regress tests on fedora 36

2022-06-07 Thread Andrew Dunstan


On 2022-06-07 Tu 08:56, Michael Paquier wrote:
> On Tue, Jun 07, 2022 at 10:52:45AM +0200, Pavel Stehule wrote:
>> #   Failed test '\timing with query error: timing output appears'
>> #   at t/001_basic.pl line 95.
>> #   'Time: 0,293 ms'
>> # doesn't match '(?^m:^Time: \d+\.\d\d\d ms)'
>> # Looks like you failed 2 tests of 58.
> Fun.  The difference is in the separator: dot vs comma.  This should
> fail with French the same way.  Perhaps it would fail differently in
> other languages?  There is no need to be that precise with the regex
> IMO, so I would just cut the regex with the number, checking only the
> unit at the end.


or just replace '\.' with '[.,]'


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Jonathan S. Katz

On 6/7/22 10:26 AM, Robert Haas wrote:

On Mon, Jun 6, 2022 at 5:02 PM Tom Lane  wrote:

I think a reasonable case can be made for excluding "internal" GUCs
on the grounds that (a) they cannot be set, and therefore (b) whatever
value they have might as well be considered the default.


I agree.


I think some of these could be interesting if they deviate from the 
default (e.g. "in_hot_standby") as it will give the user context on the 
current state of the system.


However, something like that is still fairly easy to determine (e.g. 
`pg_catalog.pg_is_in_recovery()`). And looking through the settings 
marked "internal" showing the non-defaults may not provide much 
additional context to a user.


+1 for excluding them.

Jonathan
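
For context, the set of settings under discussion can be approximated with a
catalog query along these lines (just a sketch; it only assumes the standard
pg_settings view and filters out GUCs that cannot be set plus GUCs still at
their default value):

  -- list non-internal settings that deviate from their default value
  SELECT name, setting, source
  FROM pg_settings
  WHERE context <> 'internal'
    AND source <> 'default'
  ORDER BY name;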




Re: JSON_TABLE output collations

2022-06-07 Thread Andrew Dunstan


On 2022-06-07 Tu 09:19, Peter Eisentraut wrote:
>
> The present implementation of JSON_TABLE sets the collation of the
> output columns to the default collation if the specified data type is
> collatable.  Why don't we use the collation of the type directly? 
> This would make domains with attached collations work correctly.
>
> See attached patch for how to change this.  I hacked up a regression
> test case to demonstrate this.



Looks reasonable.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2022-06-07 Thread vignesh ravichandran
I think this is a good improvement and also like the option (on pg_rewind) to 
potentially send checkpoints to the source.




Personal anecdote. I was using stolon and frequently failing over. For some time
the rewind kept failing, reporting that it wasn't required. Only later did I learn
that it was the checkpoint on the source which was missing.

References https://github.com/sorintlab/stolon/issues/601
And the fix https://github.com/sorintlab/stolon/pull/644





 On Sat, 04 Jun 2022 05:59:12 -0700 James Coleman  
wrote 



A few weeks back I sent a bug report [1] directly to the -bugs mailing 
list, and I haven't seen any activity on it (maybe this is because I 
emailed directly instead of using the form?), but I got some time to 
take a look and concluded that a first-level fix is pretty simple. 
 
A quick background refresher: after promoting a standby, rewinding the 
former primary requires that a checkpoint have been completed on the 
new primary after promotion. This is correctly documented. However, 
pg_rewind incorrectly reports to the user that a rewind isn't 
necessary because the source and target are on the same timeline. 
 
Specifically, this happens when the control file on the newly promoted 
server looks like: 
 
 Latest checkpoint's TimeLineID:   4 
 Latest checkpoint's PrevTimeLineID:   4 
 ... 
 Min recovery ending loc's timeline:   5 
 
Attached is a patch that detects this condition and reports it as an 
error to the user. 
 
In the spirit of the new-ish "ensure shutdown" functionality I could 
imagine extending this to automatically issue a checkpoint when this 
situation is detected. I haven't started to code that up, however, 
wanting to first get buy-in on that. 
 
Thanks, 
James Coleman 
 
1: 
https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com

Re: How about a psql backslash command to show GUCs?

2022-06-07 Thread Robert Haas
On Mon, Jun 6, 2022 at 5:02 PM Tom Lane  wrote:
> I think a reasonable case can be made for excluding "internal" GUCs
> on the grounds that (a) they cannot be set, and therefore (b) whatever
> value they have might as well be considered the default.

I agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Error from the foreign RDBMS on a foreign table I have no privilege on

2022-06-07 Thread Euler Taveira


On Tue, Jun 7, 2022, at 12:03 AM, Laurenz Albe wrote:
> On Sat, 2022-06-04 at 21:18 +, Phil Florent wrote:
> > I opened an issue with an attached code on oracle_fdw git page : 
> > https://github.com/laurenz/oracle_fdw/issues/534 
> > Basically I expected to obtain a "no privilege" error from PostgreSQL when 
> > I have no read privilege
> > on the postgres foreign table but I obtained an Oracle error instead.
> > Laurenz investigated and closed the issue but he suggested perhaps I should 
> > post that on
> > the hackers list since it also occurs with postgres-fdw on some occasion (I 
> > have investigated some more,
> > and postgres_fdw does the same thing when you turn on use_remote_estimate.). 
> > Hence I do...
> 
> To add more detais: permissions are checked at query execution time, but if 
> "use_remote_estimate"
> is used, the planner already accesses the remote table, even if the user has 
> no permissions
> on the foreign table.
> 
> I feel that that is no bug, but I'd be curious to know if others disagree.
You should expect an error (like in the example) -- probably not at that point.
It is behaving accordingly. However, that error is exposing an implementation
detail (FDW has to access the remote table at that phase). I don't think that
changing the current design (permission check after planning) for FDWs to
provide a good UX is worth it. IMO it is up to the FDW author to hide such
cases if it doesn't cost much to do it.
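
For reference, a minimal way to reproduce the behaviour described above could
look like the following (a sketch only; the server, table and role names are
made up, and a reachable remote PostgreSQL instance is assumed):

  CREATE EXTENSION postgres_fdw;
  CREATE SERVER remote_srv FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'remote-host', dbname 'remotedb', use_remote_estimate 'true');
  CREATE ROLE alice LOGIN;
  CREATE USER MAPPING FOR alice SERVER remote_srv
      OPTIONS (user 'remote_user', password 'secret');
  CREATE FOREIGN TABLE ft (id int) SERVER remote_srv OPTIONS (table_name 'tbl');

  -- alice has a user mapping but no privilege on ft
  SET ROLE alice;
  SELECT * FROM ft;
  -- With use_remote_estimate, the planner contacts the remote server during
  -- planning, so a remote-side error can surface before the local
  -- "permission denied" check that happens at executor startup.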


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: replacing role-level NOINHERIT with a grant-level option

2022-06-07 Thread Robert Haas
[ trimming various comments that broadly make sense to me and which
don't seem to require further comment in this moment ]

On Mon, Jun 6, 2022 at 7:21 PM Stephen Frost  wrote:
> > To revoke a grant entirely, you just say REVOKE foo FROM bar, as now.
> > To change an option for an existing grant, you can re-execute the
> > grant statement with a different WITH clause. Any options that are
> > explicitly mentioned will be changed to have the associated values;
> > unmentioned options will retain their existing values. If you want to
> > change the value of a Boolean option to false, you have a second
> > option, which is to write "REVOKE option_name OPTION FOR foo FROM
> > bar," which means exactly the same thing as "GRANT foo TO bar WITH
> > option_name FALSE".
>
> I'm a bit concerned about this because, iiuc, it would mean:
>
> GRANT foo TO bar WITH FRUIT KIWI, SARDINES;
> GRANT foo TO bar WITH FRUIT STRAWBERRY;
>
> would mean that the GRANT of FRUIT would then *only* have STRAWBERRY,
> right?

I think that you are misunderstanding what kind of option I intended
FRUIT to be. Here, I was imagining FRUIT as a property that every
grant has. Any given grant is either strawberry, or it's kiwi, or it's
banana. It cannot be more than one of those things, nor can it be none
of those things. It follows that if you execute GRANT without
specifying a FRUIT, there's some default - hopefully banana, but
that's a matter of taste. Later, you can change the fruit associated
with a grant, but you cannot remove it, because there's no such thing
as a fruitless grant. Imagine that the catalog representation is a
"char" that is either 's', 'k', or 'b'.

Now you could certainly question whether it's a good idea for us to
have an option that works like this. I don't really know. For a while
I thought that it might make sense to propose something like ACCESS {
EXPLICIT | IMPLICIT | NONE }, where ACCESS IMPLICIT would mean that
the grantee implicitly has the permissions of the role, ACCESS
EXPLICIT means that they do not implicitly have those permissions but
can access them via SET ROLE, and ACCESS NONE means that they can't
even do that. The default would I suppose be ACCESS IMPLICIT but you
could change it to one of the other two. However, I then thought that
it made more sense to keep it as two separate Booleans because
actually all four combinations are sensible: you could want to have a
setup where you're allowed to implicitly access the permissions of the
role but you CANNOT SET ROLE to it. For instance, this might make
sense for a predefined role, so that you don't end up with tables
owned by pg_monitor or whatever.

Anyway, if the hypothetical FRUIT property works as I describe here -
there's always a single value - then the second GRANT leaves the
SARDINES property set, but changes the FRUIT property from strawberry
to kiwi. Since the property is single-valued, you cannot add a second
fruit, nor can you remove the fruit altogether, because those just
aren't sensible ideas with an option of this kind. As a longer example,
it's like the FORMAT property of EXPLAIN: it always has to be TEXT or
XML or JSON. You can choose not to explicitly specify the option, but
then you get a default. Your EXPLAIN output always has to have some
format.
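
To make the single-valued idea above concrete, the hypothetical syntax being
discussed would behave roughly like this (nothing here exists today; FRUIT is
a made-up option used purely for illustration):

  GRANT foo TO bar WITH FRUIT KIWI;        -- the grant now has FRUIT = kiwi
  GRANT foo TO bar WITH FRUIT STRAWBERRY;  -- same grant, FRUIT changed to strawberry
  -- There is no way to end up with "no fruit", just as EXPLAIN output always
  -- has some FORMAT (TEXT, XML, JSON or YAML), whether or not you specify it.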

> In your proposal, does:
>
> GRANT foo TO bar WITH FRUIT STRAWBERRY;
>
> mean that 'foo' is grant'd to 'bar' too?  Seems to be closest to current
> usage and the spec's ideas on these things.  I'm thinking that could be
> dealt with by having a MEMBERSHIP option (which would be a separate
> column in pg_auth_members and default would be 'true') but otherwise
> using exactly what you have here, eg:

Currently, GRANT always creates an entry in pg_auth_members, or
modifies an existing one, or does nothing because the one that's there
is the same as the one it would have created. I think we should stick
with that idea.

That's why I proposed the name SET, not MEMBERSHIP. You would still
get a catalog entry in pg_auth_members, so you are still a member in
some loose sense even if your grant has INHERIT FALSE and SET FALSE,
but in such a case the fact that you are a member isn't really doing
anything for you in terms of getting you access to privileges because
you're neither allowed to exercise them implicitly nor SET ROLE to the
role.

I find that idea - that GRANT always grants membership but membership
by itself doesn't really do anything for you unless you've got some
other options enabled somewhere - more appealing than the design you
seem to have in mind, which seems to me that membership is the same
thing as the ability to SET ROLE and thus, if the ability to SET ROLE
has not been granted, you have a grant that didn't confer membership
in any sense. I'm not saying we couldn't make that work, but I think
it's awkward to make that work. Among other problems, what happens
with the actual catalog representation? You could for example still
create a role in pg_auth_members and then have a 

Re: pgcon unconference / impact of block size on performance

2022-06-07 Thread Tomas Vondra



On 6/7/22 15:48, Jakub Wartak wrote:
> Hi,
> 
>> The really
>> puzzling thing is why is the filesystem so much slower for smaller pages. I 
>> mean,
>> why would writing 1K be 1/3 of writing 4K?
>> Why would a filesystem have such effect?
> 
> Ha! I don't care at this point as 1 or 2kB seems too small to handle many 
> real world scenarios ;)
> 

I think that's not quite true - a lot of OLTP works with fairly narrow
rows, and if they use more data, it's probably in TOAST, so again split
into smaller rows. It's true smaller pages would cut some of the limits
(columns, index tuple, ...) of course, and that might be an issue.

Independently of that, it seems like an interesting behavior and it
might tell us something about how to optimize for larger pages.

>>> b) Another thing that you could also include in testing is that I've 
>>> spotted a
>> couple of times single-threaded fio might could be limiting factor 
>> (numjobs=1 by
>> default), so I've tried with numjobs=2,group_reporting=1 and got this below
>> ouput on ext4 defaults even while dropping caches (echo 3) each loop 
>> iteration -
>> - something that I cannot explain (ext4 direct I/O caching effect? how's that
>> even possible? reproduced several times even with numjobs=1) - the point 
>> being
>> 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some
>> caching effect because for randwrite it does not happen. I've triple-checked 
>> with
>> iostat -x... it cannot be any internal device cache as with direct I/O that 
>> doesn't
>> happen:
>>>
>>> [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
>>> nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s,
>>> iops=206643, runt= 60001msec [b]
>>> nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s,
>>> iops=160604, runt= 60001msec [b]
>>> nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s,
>>> iops=157802, runt= 60001msec [b]
>>> nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s,
>>> iops=122115, runt= 60128msec
>>> nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s,
>>> iops=17683, runt= 60001msec [a, as before]
>>> nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s,
>>> iops=17263, runt= 60001msec [a, as before]
>>> nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s,
>>> iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
>>> nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s,
>>> iops=48979, runt= 60099msec
>>>
>>
>> No idea what might be causing this. BTW so you're not using direct-io to 
>> access
>> the raw device? Or am I just misreading this?
> 
> Both scenarios (raw and fs) have had direct=1 set. I just cannot understand 
> how having direct I/O enabled (which disables caching) achieves better read 
> IOPS on ext4 than on raw device... isn't it contradiction?
> 

Thanks for the clarification. Not sure what might be causing this. Did
you use the same parameters (e.g. iodepth) in both cases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: pgcon unconference / impact of block size on performance

2022-06-07 Thread Jakub Wartak
Hi,

> The really
> puzzling thing is why is the filesystem so much slower for smaller pages. I 
> mean,
> why would writing 1K be 1/3 of writing 4K?
> Why would a filesystem have such effect?

Ha! I don't care at this point as 1 or 2kB seems too small to handle many real 
world scenarios ;)

> > b) Another thing that you could also include in testing is that I've 
> > spotted a
> couple of times single-threaded fio might could be limiting factor (numjobs=1 
> by
> default), so I've tried with numjobs=2,group_reporting=1 and got this below
> ouput on ext4 defaults even while dropping caches (echo 3) each loop 
> iteration -
> - something that I cannot explain (ext4 direct I/O caching effect? how's that
> even possible? reproduced several times even with numjobs=1) - the point being
> 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some
> caching effect because for randwrite it does not happen. I've triple-checked 
> with
> iostat -x... it cannot be any internal device cache as with direct I/O that 
> doesn't
> happen:
> >
> > [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
> > nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s,
> > iops=206643, runt= 60001msec [b]
> > nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s,
> > iops=160604, runt= 60001msec [b]
> > nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s,
> > iops=157802, runt= 60001msec [b]
> > nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s,
> > iops=122115, runt= 60128msec
> > nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s,
> > iops=17683, runt= 60001msec [a, as before]
> > nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s,
> > iops=17263, runt= 60001msec [a, as before]
> > nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s,
> > iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
> > nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s,
> > iops=48979, runt= 60099msec
> >
> 
> No idea what might be causing this. BTW so you're not using direct-io to 
> access
> the raw device? Or am I just misreading this?

Both scenarios (raw and fs) have had direct=1 set. I just cannot understand how 
having direct I/O enabled (which disables caching) achieves better read IOPS on 
ext4 than on raw device... isn't it contradiction?

-J.





Re: tablesync copy ignores publication actions

2022-06-07 Thread Euler Taveira
On Tue, Jun 7, 2022, at 1:10 AM, Peter Smith wrote:
> The logical replication tablesync ignores the publication 'publish'
> operations during the initial data copy.
> 
> This is current/known PG behaviour (e.g. as recently mentioned [1])
> but it was not documented anywhere.
initial data synchronization != replication. publish parameter is a replication
property; it is not an initial data synchronization property. Maybe we should
make it clear like you are suggesting.

> This patch just documents the existing behaviour and gives some examples.
Why did you add this information to that specific paragraph? IMO it belongs to
a separate paragraph; I would add it as the first paragraph in that subsection.

I suggest the following paragraph:


The initial data synchronization does not take into account the 
publish parameter to copy the existing data.


There is no point to link the Initial Snapshot subsection. That subsection is
explaining the initial copy steps and you want to inform about the effect of a
publication parameter on the initial copy. Although both are talking about the
same topic (initial copy), that link to Initial Snapshot subsection won't add
additional information about the publish parameter. You could expand the
suggested sentence to make it clear what publish parameter is or even add a
link to the CREATE PUBLICATION synopsis (that explains about publish
parameter).

You add an empty paragraph. Remove it.

I'm not sure it deserves an example. It is an easy-to-understand concept and a
good description is better than ~ 80 new lines.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: pgcon unconference / impact of block size on performance

2022-06-07 Thread Tomas Vondra
On 6/7/22 11:46, Jakub Wartak wrote:
> Hi Tomas,
> 
>> Well, there's plenty of charts in the github repositories, including the 
>> charts I
>> think you're asking for:
> 
> Thanks.
> 
>> I also wonder how is this related to filesystem page size - in all the 
>> benchmarks I
>> did I used the default (4k), but maybe it'd behave if the filesystem page 
>> matched
>> the data page.
> 
> That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:
> 
> [root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
> nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, 
> runt= 60001msec [b]
> nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, 
> runt= 60001msec [b] 
> nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, 
> runt= 60001msec [b]
> nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, 
> runt= 60001msec
> nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, 
> runt= 60366msec [a]
> nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, 
> runt= 60002msec [a]
> nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, 
> runt= 60248msec
> nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, 
> runt= 60047msec
> 
> So, I've found out two interesting things while playing with raw vs ext4:
> a) I've got 70k IOPS always randwrite even on 1k,2k,4k without ext4 (so as 
> expected, this was ext4 4kb default fs page size impact as you was thinking 
> about when fio 1k was hitting ext4 4kB block)

Right. Interesting, so for randread we get a consistent +30% speedup on
raw devices with all page sizes, while on randwrite it's about 1.0x for
4K. The really puzzling thing is why is the filesystem so much slower
for smaller pages. I mean, why would writing 1K be 1/3 of writing 4K?
Why would a filesystem have such effect?

> b) Another thing that you could also include in testing is that I've spotted 
> a couple of times single-threaded fio might could be limiting factor 
> (numjobs=1 by default), so I've tried with numjobs=2,group_reporting=1 and 
> got this below ouput on ext4 defaults even while dropping caches (echo 3) 
> each loop iteration -- something that I cannot explain (ext4 direct I/O 
> caching effect? how's that even possible? reproduced several times even with 
> numjobs=1) - the point being 206643 1kb IOPS @ ext4 direct-io > 131783 1kB 
> IOPS @ raw, smells like some caching effect because for randwrite it does not 
> happen. I've triple-checked with iostat -x... it cannot be any internal 
> device cache as with direct I/O that doesn't happen:
> 
> [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
> nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, 
> runt= 60001msec [b]
> nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, 
> runt= 60001msec [b]
> nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, 
> runt= 60001msec [b]
> nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, 
> runt= 60128msec
> nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, 
> runt= 60001msec [a, as before]
> nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, 
> runt= 60001msec [a, as before]
> nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, 
> runt= 60311msec [reproduced benefit, as per earlier email]
> nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, 
> runt= 60099msec
> 

No idea what might be causing this. BTW so you're not using direct-io to
access the raw device? Or am I just misreading this?

>>> One way or another it would be very nice to be able to select the
>>> tradeoff using initdb(1) without the need to recompile, which then
>>> begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
>>> DB page size, ...).> Do you envision any plans for this we still in a
>>> need to gather more info exactly why this happens? (perf reports?)
>>>
>>
>> Not sure I follow. Plans for what? Something that calibrates cost parameters?
>> That might be useful, but that's a rather separate issue from what's 
>> discussed
>> here - page size, which needs to happen before initdb (at least with how 
>> things
>> work currently).
> [..]
> 
> Sorry, I was too far teched and assumed you guys were talking very long term. 
> 

Np, I think that'd be an useful tool, but it seems more like a
completely separate discussion.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: effective_io_concurrency and NVMe devices

2022-06-07 Thread Jakub Wartak
Hi Tomas,

> > I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
> > NVMe SSD. I ran a few tests to see how different values of
> > effective_io_concurrency would affect performance. I tried to come up
> > with a query that did little enough CPU processing to ensure that I/O
> > was the clear bottleneck.
> >
> > The test was with a 128GB table on a machine with 64GB of RAM.  I
> > padded the tuples out so there were 4 per page so that the aggregation
> > didn't have much work to do.
> >
> > The query I ran was: explain (analyze, buffers, timing off) select
> > count(p) from r where a = 1;
 
> The other idea I had while looking at batching a while back, is that we should
> batch the prefetches. The current logic interleaves prefetches with other 
> work -
> prefetch one page, process one page, ... But once reading a page gets
> sufficiently fast, this means the queues never get deep enough for
> optimizations. So maybe we should think about batching the prefetches, in some
> way. Unfortunately posix_fadvise does not allow batching of requests, but we
> can at least stop interleaving the requests.

.. for now it doesn't, but IORING_OP_FADVISE is on the long-term horizon.  

> The attached patch is a trivial version that waits until we're at least
> 32 pages behind the target, and then prefetches all of them. Maybe give it a 
> try?
> (This pretty much disables prefetching for e_i_c below 32, but for an
> experimental patch that's enough.)

I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults 
s_b=128MB, dbsize=8kb but with forced disabled parallel query (for easier 
inspection with strace just to be sure//so please don't compare times). 

run:
a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k IOPS 
(~122k IOPS practical max here for libaio)
b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled 
prefetching, fadvise() not occurring
c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms, 81432ms 
(mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even peaked for a second 
on ~122k)
d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms 99939ms 
(mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks to 90..100k IOPS)

~16% benefit sounds good (help me understand: L1i cache?). Maybe it is worth 
throwing that patch onto more advanced / complete performance test farm too ? 
(although it's only for bitmap heap scans)

run a: looked interleaved as you said:
fadvise64(160, 1064157184, 8192, POSIX_FADV_WILLNEED) = 0
pread64(160, "@\0\0\0\200\303/_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 
\230\300\17@\220\300\17"..., 8192, 1064009728) = 8192
fadvise64(160, 1064173568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(160, "@\0\0\0\0\0040_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 
\230\300\17@\220\300\17"..., 8192, 1064026112) = 8192
[..]

BTW: interesting note, for run b, the avgrq-sz from extended iostat jumps is 
flipping between 16(*512=8kB) to ~256(*512=~128kB!) as if kernel was doing some 
own prefetching heuristics on and off in cycles, while when calling 
e_i_c/fadvise() is in action then it seems to be always 8kB requests. So with 
disabled fadivse() one IMHO might have problems deterministically benchmarking 
short queries as kernel voodoo might be happening (?)

-J.




JSON_TABLE output collations

2022-06-07 Thread Peter Eisentraut


The present implementation of JSON_TABLE sets the collation of the 
output columns to the default collation if the specified data type is 
collatable.  Why don't we use the collation of the type directly?  This 
would make domains with attached collations work correctly.


See attached patch for how to change this.  I hacked up a regression 
test case to demonstrate this.

From b9f6de8bff992bf193d5710d175f6421851c869f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Jun 2022 15:11:13 +0200
Subject: [PATCH] Fix collation of JSON_TABLE output columns

The output columns of JSON_TABLE should have the collations of their
data type.  The existing implementation sets the default collation if
the type is collatable.

XXX regression test changes only for demonstration
---
 src/backend/parser/parse_jsontable.c   | 5 +
 src/test/regress/sql/jsonb_sqljson.sql | 8 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/parser/parse_jsontable.c 
b/src/backend/parser/parse_jsontable.c
index dbd3e66205..ae559d9cae 100644
--- a/src/backend/parser/parse_jsontable.c
+++ b/src/backend/parser/parse_jsontable.c
@@ -514,10 +514,7 @@ appendJsonTableColumns(JsonTableContext *cxt, List 
*columns)
 
tf->coltypes = lappend_oid(tf->coltypes, typid);
tf->coltypmods = lappend_int(tf->coltypmods, typmod);
-   tf->colcollations = lappend_oid(tf->colcollations,
-   
type_is_collatable(typid)
-   
? DEFAULT_COLLATION_OID
-   
: InvalidOid);
+   tf->colcollations = lappend_oid(tf->colcollations, 
get_typcollation(typid));
tf->colvalexprs = lappend(tf->colvalexprs, colexpr);
}
 }
diff --git a/src/test/regress/sql/jsonb_sqljson.sql 
b/src/test/regress/sql/jsonb_sqljson.sql
index fff2537480..8127012f2b 100644
--- a/src/test/regress/sql/jsonb_sqljson.sql
+++ b/src/test/regress/sql/jsonb_sqljson.sql
@@ -338,7 +338,9 @@ CREATE INDEX ON test_jsonb_mutability (JSON_QUERY(js, '$[1, 
$.a ? (@.datetime("H
COLUMNS (item int PATH '$', foo int)) bar;
 
 -- JSON_TABLE: basic functionality
-CREATE DOMAIN jsonb_test_domain AS text CHECK (value <> 'foo');
+--CREATE COLLATION foo (provider = icu, locale = '');
+CREATE COLLATION foo (provider = icu, locale = '@colAlternate=shifted');
+CREATE DOMAIN jsonb_test_domain AS text CHECK (value <> 'foo') COLLATE foo;
 
 SELECT *
 FROM
@@ -383,7 +385,9 @@ CREATE DOMAIN jsonb_test_domain AS text CHECK (value <> 
'foo');
jba jsonb[] PATH '$'
)
) jt
-   ON true;
+   ON true
+ORDER BY "domain"
+;
 
 -- JSON_TABLE: Test backward parsing
 
-- 
2.36.1



Re: docs: mention "pg_read_all_stats" in "track_activities" description

2022-06-07 Thread Ian Lawrence Barwick
Hi

Apologies for the delayed response, was caught up in a minor life diversion
over the past couple of weeks.

2022年5月21日(土) 12:29 Michael Paquier :
>
> On Fri, May 20, 2022 at 04:08:37PM -0700, Nathan Bossart wrote:
> > LGTM
>
> Indeed, it is a good idea to add this information.  Will apply and
> backpatch accordingly.

Thanks!

2022年5月30日(月) 11:34 Michael Paquier :
>
> On Sat, May 28, 2022 at 06:10:31AM -0700, Nathan Bossart wrote:
> > Sorry, I missed this one earlier.  I'm okay with something along those
> > lines.  I'm still trying to think of ways to make the last part a little
> > clearer, but I don't have any ideas beyond what we've discussed upthread.
>
> Okay.  I have used the wording of upthread then.  Thanks!

A little late to the party, but as an alternative suggestion for the last
part:

  "... and users who either own the session being reported on, or who have
  privileges of the role to which the session belongs,"

so the whole sentence would read:

  Note that even when enabled, this information is only visible to superusers,
  roles with privileges of the pg_read_all_stats role, and users who either own
  the session being reported on or who have privileges of the role to which the
  session belongs, so it should not represent a security risk.

or with some parentheses to break it up a little:

  Note that even when enabled, this information is only visible to superusers,
  roles with privileges of the pg_read_all_stats role, and users who either own
  the session being reported on (or who have privileges of the role to which the
  session belongs), so it should not represent a security risk.

I'm not sure if it really improves on the latest committed change, so just a
suggestion.


Regards

Ian Barwick




Re: broken regress tests on fedora 36

2022-06-07 Thread Michael Paquier
On Tue, Jun 07, 2022 at 10:52:45AM +0200, Pavel Stehule wrote:
> #   Failed test '\timing with query error: timing output appears'
> #   at t/001_basic.pl line 95.
> #   'Time: 0,293 ms'
> # doesn't match '(?^m:^Time: \d+\.\d\d\d ms)'
> # Looks like you failed 2 tests of 58.

Fun.  The difference is in the separator: dot vs comma.  This should
fail with French the same way.  Perhaps it would fail differently in
other languages?  There is no need to be that precise with the regex
IMO, so I would just cut the regex with the number, checking only the
unit at the end.
--
Michael




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-06-07 Thread Amit Kapila
On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada  wrote:
>
> On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada  
> wrote:
> >
>
> poc_add_regression_tests.patch adds regression tests for this bug. The
> regression tests are required for both HEAD and back-patching but I've
> separated this patch for testing the above two patches easily.
>

Few comments on the test case patch:
===
1.
+# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
+# only its COMMIT record, because it starts from the RUNNING_XACT
record emitted
+# during the first checkpoint execution.  This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding
of the INSERT
+# record must read the pg_class with the correct historic snapshot.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate"
"s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert"
"s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"

Will this test always work? What if we get an additional running_xact
record between steps "s0_commit" and "s0_begin" that is logged via
bgwriter? You can mimic that by adding an additional checkpoint
between those two steps. If we do that, the test will pass even
without the patch because I think the last decoding will start
decoding from this new running_xact record.

2.
+step "s1_get_changes" { SELECT data FROM
pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids', '0'); }

It is better to skip empty transactions by using 'skip-empty-xacts' to
avoid any transaction from a background process like autovacuum. We
have previously seen some buildfarm failures due to that.

3. Did you intentionally omit the .out from the test case patch?

4.
This transaction must be marked as
+# catalog-changes while decoding the COMMIT record and the decoding
of the INSERT
+# record must read the pg_class with the correct historic snapshot.

/marked as catalog-changes/marked as containing catalog changes

-- 
With Regards,
Amit Kapila.




Re: Logging query parmeters in auto_explain

2022-06-07 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Tue, May 31, 2022 at 09:33:20PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> Here's a patch that adds a corresponding
>> auto_explain.log_parameter_max_length config setting, which controls the
>> "Query Parameters" node in the logged plan.  Just like in core, the
>> default is -1, which logs the parameters in full, and 0 disables
>> parameter logging, while any other value truncates each parameter to
>> that many bytes.
>
> With a behavior similar to the in-core log_parameter_max_length, this
> looks rather sane to me.  This is consistent with the assumptions of
> errdetail_params().

That was the intention, yes.

> +$node->append_conf('postgresql.conf', "auto_explain.log_parameter_max_length 
> = -1");
> Nit.  You don't need this change in the TAP test, as this is the
> default value to log everything.

Point, fixed in the attached v2. I've also added a test that truncation
and disabling works.

- ilmari
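
For reference, with the patch applied, usage could look roughly like this (a
sketch; auto_explain.log_parameter_max_length is the GUC added by the patch,
everything else is standard):

  LOAD 'auto_explain';
  SET auto_explain.log_min_duration = 0;           -- log every statement's plan
  SET auto_explain.log_parameter_max_length = 10;  -- truncate each parameter to 10 bytes
  PREPARE get_type(name) AS SELECT * FROM pg_type WHERE typname = $1;
  EXECUTE get_type('float8');
  -- The server log then contains a "Query Parameters:" line with the
  -- (possibly truncated) parameter values next to the logged plan.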

>From 4ba1f1353413b2348228d98d7f102620d6c2e060 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 31 May 2022 21:12:12 +0100
Subject: [PATCH v2] Log query parameters in auto_explain

Add an auto_explain.log_parameter_max_length config setting, similar
to the corresponding core setting, that controls the inclusion of
query parameters in the logged explain output.
---
 contrib/auto_explain/auto_explain.c| 15 +++
 contrib/auto_explain/t/001_auto_explain.pl | 46 +-
 doc/src/sgml/auto-explain.sgml | 19 +
 src/backend/commands/explain.c | 22 +++
 src/include/commands/explain.h |  1 +
 5 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c9a0d947c8..1ba7536879 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -19,12 +19,14 @@
 #include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/params.h"
 #include "utils/guc.h"
 
 PG_MODULE_MAGIC;
 
 /* GUC variables */
 static int	auto_explain_log_min_duration = -1; /* msec or -1 */
+static int	auto_explain_log_parameter_max_length = -1; /* bytes or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
@@ -105,6 +107,18 @@ _PG_init(void)
 			NULL,
 			NULL);
 
+	DefineCustomIntVariable("auto_explain.log_parameter_max_length",
+			"Sets the maximum length of query parameters to log.",
+			"Zero logs no query parameters, -1 logs them in full.",
+			_explain_log_parameter_max_length,
+			-1,
+			-1, INT_MAX,
+			PGC_SUSET,
+			GUC_UNIT_BYTE,
+			NULL,
+			NULL,
+			NULL);
+
 	DefineCustomBoolVariable("auto_explain.log_analyze",
 			 "Use EXPLAIN ANALYZE for plan logging.",
 			 NULL,
@@ -389,6 +403,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 
 			ExplainBeginOutput(es);
 			ExplainQueryText(es, queryDesc);
+			ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length);
 			ExplainPrintPlan(es, queryDesc);
 			if (es->analyze && auto_explain_log_triggers)
 ExplainPrintTriggers(es, queryDesc);
diff --git a/contrib/auto_explain/t/001_auto_explain.pl b/contrib/auto_explain/t/001_auto_explain.pl
index 82e4d9d15c..518ba16826 100644
--- a/contrib/auto_explain/t/001_auto_explain.pl
+++ b/contrib/auto_explain/t/001_auto_explain.pl
@@ -19,14 +19,26 @@
 # run a couple of queries
 $node->safe_psql("postgres", "SELECT * FROM pg_class;");
 $node->safe_psql("postgres",
-	"SELECT * FROM pg_proc WHERE proname = 'int4pl';");
+	q{PREPARE get_proc(name) AS SELECT * FROM pg_proc WHERE proname = $1; EXECUTE get_proc('int4pl');});
 
 # emit some json too
 $node->append_conf('postgresql.conf', "auto_explain.log_format = json");
 $node->reload;
 $node->safe_psql("postgres", "SELECT * FROM pg_proc;");
 $node->safe_psql("postgres",
-	"SELECT * FROM pg_class WHERE relname = 'pg_class';");
+	q{PREPARE get_class(name) AS SELECT * FROM pg_class WHERE relname = $1; EXECUTE get_class('pg_class');});
+
+# check that logged parameters can be truncated
+$node->append_conf('postgresql.conf', "auto_explain.log_parameter_max_length = 3");
+$node->reload;
+$node->safe_psql("postgres",
+	q{PREPARE get_type(name) AS SELECT * FROM pg_type WHERE typname = $1; EXECUTE get_type('float8');});
+
+# check that parameter logging can be disabled
+$node->append_conf('postgresql.conf', "auto_explain.log_parameter_max_length = 0");
+$node->reload;
+$node->safe_psql("postgres",
+	q{PREPARE get_type(name) AS SELECT * FROM pg_type WHERE typname = $1; EXECUTE get_type('int8');});
 
 $node->stop('fast');
 
@@ -34,6 +46,16 @@
 
 my $log_contents = slurp_file($log);
 
+like(
+	$log_contents,
+	qr/Query Text: SELECT \* FROM pg_class;/,
+	"query text logged, text mode");
+
+like(
+	$log_contents,
+	qr/Query Parameters: \$1 = 

Re: Proposal: adding a better description in psql command about large objects

2022-06-07 Thread Thibaud W.

On 6/3/22 19:29, Nathan Bossart wrote:

On Fri, Jun 03, 2022 at 12:56:20PM -0400, Tom Lane wrote:

Nathan Bossart  writes:

Another option could be to move it after the "Input/Output" section so that
it's closer to some other commands that involve files.  I can't say I have
a strong opinion about whether/where to move it, though.

Yeah, I thought of that choice too, but it ends up placing the
Large Objects section higher up the list than seems warranted on
frequency-of-use grounds.

Fair point.


After looking at the output I concluded that we'd be better off to
stick with the normal indentation amount, and break the lo_import
entry into two lines to make that work.  One reason for this is
that some translators might've already settled on a different
indentation amount in order to cope with translated parameter names,
and deviating from the normal here will just complicate their lives.
So that leaves me proposing v5.

I see.  As you noted earlier, moving the entries higher makes the
inconsistent indentation less appealing, too.  So this LGTM.


(I also fixed the out-of-date line count in helpVariables.)

Yeah, it looks like 7844c99 missed this.

Thanks, output is more readable this way.

Best regards.
--
Thibaud W.




RE: pgcon unconference / impact of block size on performance

2022-06-07 Thread Jakub Wartak
[..]
>I doubt we could ever
> make the default smaller than it is today as it would mean nobody would be able to
> insert rows larger than 4 kilobytes into a table anymore. 

Add error "values larger than 1/3 of a buffer page cannot be indexed" to that 
list...

-J.




RE: pgcon unconference / impact of block size on performance

2022-06-07 Thread Jakub Wartak
Hi Tomas,

> Well, there's plenty of charts in the github repositories, including the 
> charts I
> think you're asking for:

Thanks.

> I also wonder how is this related to filesystem page size - in all the 
> benchmarks I
> did I used the default (4k), but maybe it'd behave if the filesystem page 
> matched
> the data page.

That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:

[root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, 
runt= 60001msec [b]
nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, 
runt= 60001msec [b] 
nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, 
runt= 60001msec [b]
nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, 
runt= 60001msec
nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, 
runt= 60366msec [a]
nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, 
runt= 60002msec [a]
nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, 
runt= 60248msec
nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, 
runt= 60047msec

So, I've found out two interesting things while playing with raw vs ext4:
a) I've got 70k IOPS always randwrite even on 1k,2k,4k without ext4 (so as 
expected, this was ext4 4kb default fs page size impact as you was thinking 
about when fio 1k was hitting ext4 4kB block)
b) Another thing that you could also include in testing is that I've spotted a 
couple of times single-threaded fio might could be limiting factor (numjobs=1 
by default), so I've tried with numjobs=2,group_reporting=1 and got this below 
ouput on ext4 defaults even while dropping caches (echo 3) each loop iteration 
-- something that I cannot explain (ext4 direct I/O caching effect? how's that 
even possible? reproduced several times even with numjobs=1) - the point being 
206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some 
caching effect because for randwrite it does not happen. I've triple-checked 
with iostat -x... it cannot be any internal device cache as with direct I/O 
that doesn't happen:

[root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, 
runt= 60001msec [b]
nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, 
runt= 60001msec [b]
nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, 
runt= 60001msec [b]
nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, 
runt= 60128msec
nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, 
runt= 60001msec [a, as before]
nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, 
runt= 60001msec [a, as before]
nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, 
runt= 60311msec [reproduced benefit, as per earlier email]
nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, 
runt= 60099msec

> > One way or another it would be very nice to be able to select the
> > tradeoff using initdb(1) without the need to recompile, which then
> > begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
> > DB page size, ...).> Do you envision any plans for this we still in a
> > need to gather more info exactly why this happens? (perf reports?)
> >
> 
> Not sure I follow. Plans for what? Something that calibrates cost parameters?
> That might be useful, but that's a rather separate issue from what's discussed
> here - page size, which needs to happen before initdb (at least with how 
> things
> work currently).
[..]

Sorry, I was too far teched and assumed you guys were talking very long term. 

-J.



Re: Add TAP test for auth_delay extension

2022-06-07 Thread Dong Wook Lee
Hi Hackers,
I just wrote a test for `auth_delay` extension.
It's a test which confirms whether there is a delay for a second when
you enter the wrong password.
I sent an email using mutt, but I had a problem, so I have sent it again.

---
Regards,
Dong Wook Lee.




Add TAP test for auth_delay extension

2022-06-07 Thread sh95119
diff --git a/contrib/auth_delay/Makefile b/contrib/auth_delay/Makefile
index 4b86ec37f0..b65097789a 100644
--- a/contrib/auth_delay/Makefile
+++ b/contrib/auth_delay/Makefile
@@ -3,6 +3,8 @@
 MODULES = auth_delay
 PGFILEDESC = "auth_delay - delay authentication failure reports"
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/auth_delay/t/001_auth_delay.pl b/contrib/auth_delay/t/001_auth_delay.pl
new file mode 100644
index 00..644026e4f2
--- /dev/null
+++ b/contrib/auth_delay/t/001_auth_delay.pl
@@ -0,0 +1,87 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(gettimeofday tv_interval);
+
+# Delete pg_hba.conf from the given node, add a new entry to it
+# and then execute a reload to refresh it.
+sub reset_pg_hba
+{
+my $node   = shift;
+my $hba_method = shift;
+
+unlink($node->data_dir . '/pg_hba.conf');
+# just for testing purposes, use a continuation line
+$node->append_conf('pg_hba.conf', "local all all $hba_method");
+$node->reload;
+return;
+}
+
+# Test access for a single role, with password
+sub test_login
+{
+local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+my $node  = shift;
+my $role  = shift;
+my $password  = shift;
+my $expected_res  = shift;
+my $status_string = 'failed';
+
+$status_string = 'success' if ($expected_res eq 0);
+
+my $connstr = "user=$role";
+my $testname =
+"authentication $status_string for role $role with password $password";
+
+$ENV{"PGPASSWORD"} = $password;
+
+if ($expected_res eq 0)
+{
+$node->connect_ok($connstr, $testname);
+}
+else
+{
+# No checks of the error message, only the status code.
+$node->connect_fails($connstr, $testname);
+}
+}
+
+note "setting up PostgreSQL instance";
+
+my $delay_milliseconds = 500;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf',
+qq{shared_preload_libraries = 'auth_delay'
+   auth_delay.milliseconds  = '$delay_milliseconds'});
+$node->start;
+
+$node->safe_psql(
+'postgres',
+"CREATE ROLE user_role LOGIN PASSWORD 'pass';");
+reset_pg_hba($node, 'password');
+
+note "running tests";
+
+# check enter wrong password
+my $t0 = [gettimeofday];
+test_login($node, 'user_role', "wrongpass", 2);
+my $elapsed = tv_interval($t0, [gettimeofday]);
+ok($elapsed >= $delay_milliseconds / 1000, "auth_delay $elapsed seconds");
+
+# check enter correct password
+$t0 = [gettimeofday];
+test_login($node, 'user_role', "pass", 0);
+$elapsed = tv_interval($t0, [gettimeofday]);
+ok($elapsed < $delay_milliseconds / 1000, "auth_delay $elapsed seconds");
+
+done_testing();
+




broken regress tests on fedora 36

2022-06-07 Thread Pavel Stehule
Hi

pgbench tests fails, probably due using  czech locale

All tests successful.
Files=2, Tests=633,  7 wallclock secs ( 0.14 usr  0.02 sys +  1.91 cusr
 1.05 csys =  3.12 CPU)
Result: PASS
make[2]: Opouští se adresář
„/home/pavel/src/postgresql.master/src/bin/pgbench“
make -C psql check
make[2]: Vstupuje se do adresáře
„/home/pavel/src/postgresql.master/src/bin/psql“
echo "+++ tap check in src/bin/psql +++" && rm -rf
'/home/pavel/src/postgresql.master/src/bin/psql'/tmp_check &&
/usr/bin/mkdir -p
'/home/pavel/src/postgresql.master/src/bin/psql'/tmp_check && cd . &&
TESTDIR='/home/pavel/src/postgresql.master/src/bin/psql'
PATH="/home/pavel/src/postgresql.master/tmp_install/usr/local/pgsql/master/bin:/home/pavel/src/postgresql.master/src/bin/psql:$PATH"
LD_LIBRARY_PATH="/home/pavel/src/postgresql.master/tmp_install/usr/local/pgsql/master/lib"
 PGPORT='65432'
PG_REGRESS='/home/pavel/src/postgresql.master/src/bin/psql/../../../src/test/regress/pg_regress'
/usr/bin/prove -I ../../../src/test/perl/ -I .  t/*.pl
+++ tap check in src/bin/psql +++
t/001_basic.pl ... 15/?
#   Failed test '\timing with successful query: matches'
#   at t/001_basic.pl line 83.
#   '1
# Time: 0,717 ms'
# doesn't match '(?^m:^1$
# ^Time: \d+\.\d\d\d ms)'

#   Failed test '\timing with query error: timing output appears'
#   at t/001_basic.pl line 95.
#   'Time: 0,293 ms'
# doesn't match '(?^m:^Time: \d+\.\d\d\d ms)'
# Looks like you failed 2 tests of 58.
t/001_basic.pl ... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/58 subtests
t/010_tab_completion.pl .. ok
t/020_cancel.pl .. ok

Test Summary Report
---
t/001_basic.pl (Wstat: 512 (exited 2) Tests: 58 Failed: 2)
  Failed tests:  28, 30
  Non-zero exit status: 2
Files=3, Tests=146,  6 wallclock secs ( 0.07 usr  0.01 sys +  3.15 cusr
 1.14 csys =  4.37 CPU)
Result: FAIL
make[2]: *** [Makefile:87: check] Chyba 1
make[2]: Opouští se adresář „/home/pavel/src/postgresql.master/src/bin/psql“
make[1]: *** [Makefile:43: check-psql-recurse] Chyba 2
make[1]: Opouští se adresář „/home/pavel/src/postgresql.master/src/bin“
make: *** [GNUmakefile:71: check-world-src/bin-recurse] Chyba 2

Regards

Pavel


Re: Inconvenience of pg_read_binary_file()

2022-06-07 Thread Kyotaro Horiguchi
At Tue, 7 Jun 2022 16:33:53 +0900, Michael Paquier  wrote 
in 
> On Tue, Jun 07, 2022 at 04:05:20PM +0900, Kyotaro Horiguchi wrote:
> > If I want to read a file that I'm not sure of the existence but I want
> > to read the whole file if exists, I would call
> > pg_read_binary_file('path', 0, -1, true) but unfortunately this
> > doesn't work.
> 
> Yeah, the "normal" cases that I have seen in the past just used an
> extra call to pg_stat_file() to retrieve the size of the file before
> reading it, but arguably it does not help if the file gets extended
> between the stat() call and the read() call (we don't need to care
> about this case with pg_rewind that has been the reason why the
> missing_ok argument was introduced first, for one, as file extensions
> don't matter as we'd replay from the LSN point where the rewind
> began, adding the new blocks at replay).

Sure.

> There is also an argument for supporting negative values rather than
> just -1.  For example -2 could mean to read all the file except the
> last byte.  Or you could have an extra flavor, as of
> pg_read_file(text, bool) to read the whole by default.  Supporting
> just -1 as special value for the amount of data to read would be
> confusing IMO.  So I would tend to choose for a set of arguments based
> on (text, bool).

I'm not sure about negative lengths smaller than -1, since I don't
find an appropriate offset that represents (last byte + 1).

pg_read_file(text, bool) makes sense to me, but it doesn't seem to be
able to share a C function with the other variations.
pg_read_binary_file() needs to accept some out-of-range value for
offset or length to signal that offset and length are not specified.

In the attached patch, pg_read(_binary)_file_all() is modified so that
it has a different body from pg_read(_binary)_file().

(The function comments need to be edited and docs are needed.)
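
With the two-argument variants added here, usage would look roughly like this
(a sketch; the file names are arbitrary examples):

  -- today, reading a whole file that may be missing needs an explicit length,
  -- and a negative length is rejected:
  SELECT pg_read_binary_file('maybe_missing.conf', 0, -1, true);  -- fails
  -- with the proposed forms, the whole file is read if it exists and NULL is
  -- returned if it is missing:
  SELECT pg_read_file('maybe_missing.conf', true);
  SELECT pg_read_binary_file('maybe_missing.conf', true);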

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 2bf5219256..62e16b014e 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -349,7 +349,6 @@ pg_read_binary_file(PG_FUNCTION_ARGS)
 		PG_RETURN_NULL();
 }
 
-
 /*
  * Wrapper functions for the 1 and 3 argument variants of pg_read_file_v2()
  * and pg_read_binary_file().
@@ -367,7 +366,21 @@ pg_read_file_off_len(PG_FUNCTION_ARGS)
 Datum
 pg_read_file_all(PG_FUNCTION_ARGS)
 {
-	return pg_read_file_v2(fcinfo);
+	text	   *filename_t = PG_GETARG_TEXT_PP(0);
+	bool		missing_ok = false;
+	char	   *filename;
+	text	   *result;
+
+	if (PG_NARGS() >= 2)
+		missing_ok = PG_GETARG_BOOL(1);
+
+	filename = convert_and_check_filename(filename_t);
+
+	result = read_text_file(filename, 0, -1, missing_ok);
+	if (result)
+		PG_RETURN_TEXT_P(result);
+	else
+		PG_RETURN_NULL();
 }
 
 Datum
@@ -379,7 +392,21 @@ pg_read_binary_file_off_len(PG_FUNCTION_ARGS)
 Datum
 pg_read_binary_file_all(PG_FUNCTION_ARGS)
 {
-	return pg_read_binary_file(fcinfo);
+	text	   *filename_t = PG_GETARG_TEXT_PP(0);
+	bool		missing_ok = false;
+	char	   *filename;
+	bytea	   *result;
+
+	if (PG_NARGS() >= 2)
+		missing_ok = PG_GETARG_BOOL(1);
+
+	filename = convert_and_check_filename(filename_t);
+
+	result = read_binary_file(filename, 0, -1, missing_ok);
+	if (result)
+		PG_RETURN_BYTEA_P(result);
+	else
+		PG_RETURN_NULL();
 }
 
 /*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87aa571a33..e180286749 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6423,6 +6423,9 @@
 { oid => '3826', descr => 'read text from a file',
   proname => 'pg_read_file', provolatile => 'v', prorettype => 'text',
   proargtypes => 'text', prosrc => 'pg_read_file_all' },
+{ oid => '8530', descr => 'read text from a file',
+  proname => 'pg_read_file', provolatile => 'v', prorettype => 'text',
+  proargtypes => 'text bool', prosrc => 'pg_read_file_all' },
 { oid => '3827', descr => 'read bytea from a file',
   proname => 'pg_read_binary_file', provolatile => 'v', prorettype => 'bytea',
   proargtypes => 'text int8 int8', prosrc => 'pg_read_binary_file_off_len' },
@@ -6432,6 +6435,9 @@
 { oid => '3828', descr => 'read bytea from a file',
   proname => 'pg_read_binary_file', provolatile => 'v', prorettype => 'bytea',
   proargtypes => 'text', prosrc => 'pg_read_binary_file_all' },
+{ oid => '8529', descr => 'read bytea from a file',
+  proname => 'pg_read_binary_file', provolatile => 'v', prorettype => 'bytea',
+  proargtypes => 'text bool', prosrc => 'pg_read_binary_file_all' },
 { oid => '2625', descr => 'list all files in a directory',
   proname => 'pg_ls_dir', prorows => '1000', proretset => 't',
   provolatile => 'v', prorettype => 'text', proargtypes => 'text',


RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

2022-06-07 Thread Jean Landercy - BEEODIVERSITY
Dear David,

Thank you for taking the time to look into this issue.

Here are the details of the table (I anonymized it on SO; this is its real 
name):

\d logistic_site
                                    Table "public.logistic_site"
   Column    |           Type           | Collation | Nullable |                  Default
-------------+--------------------------+-----------+----------+-------------------------------------------
 id          | bigint                   |           | not null | nextval('logistic_site_id_seq'::regclass)
 key         | character varying(32)    |           | not null |
 name        | character varying(128)   |           |          |
 created     | timestamp with time zone |           | not null |
 updated     | timestamp with time zone |           | not null |
 archived    | timestamp with time zone |           |          |
 geom        | geometry(Polygon,4326)   |           |          |
 location    | geometry(Point,4326)     |           |          |
 notes       | text                     |           |          |
 country_id  | bigint                   |           |          |
 customer_id | bigint                   |           |          |
Indexes:
    "logistic_site_pkey" PRIMARY KEY, btree (id)
    "logistic_site_country_id_9a696481" btree (country_id)
    "logistic_site_customer_id_a2c8a74a" btree (customer_id)
    "logistic_site_geom_105a08da_id" gist (geom)
    "logistic_site_key_2e791173_like" btree (key varchar_pattern_ops)
    "logistic_site_key_key" UNIQUE CONSTRAINT, btree (key)
    "logistic_site_location_54ae0166_id" gist (location)
Foreign-key constraints:
    "logistic_site_country_id_9a696481_fk_logistic_country_id" FOREIGN KEY (country_id) REFERENCES logistic_country(id) DEFERRABLE INITIALLY DEFERRED
    "logistic_site_customer_id_a2c8a74a_fk_logistic_customer_id" FOREIGN KEY (customer_id) REFERENCES logistic_customer(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "logistic_hive" CONSTRAINT "logistic_hive_site_id_50c29dd8_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "logistic_packorder" CONSTRAINT "logistic_packorder_site_id_16e1a41a_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "logistic_projectsite" CONSTRAINT "logistic_projectsite_site_id_522bf74b_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "scientific_identification" CONSTRAINT "scientific_identification_site_id_d9e79149_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "scientific_inventory" CONSTRAINT "scientific_inventory_site_id_72521353_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "scientific_result" CONSTRAINT "scientific_result_site_id_af6c815d_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "scientific_selection" CONSTRAINT "scientific_selection_site_id_88d69cab_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED

And the output of the related query:

SELECT
    attname, atttypid::regtype, attnum, atthasdef, atthasmissing,
    attgenerated, attisdropped
FROM
    pg_attribute
WHERE
    attrelid = 'logistic_site'::regclass
ORDER BY
    attnum;

 attname  |         atttypid         | attnum | atthasdef | atthasmissing | attgenerated | attisdropped
----------+--------------------------+--------+-----------+---------------+--------------+--------------
 tableoid | oid                      |     -6 | f         | f             |              | f
 cmax     | cid                      |     -5 | f         | f             |              | f
 xmax     | xid                      |     -4 | f         | f             |              | f
 cmin     | cid                      |     -3 | f         | f             |              | f
 xmin     | xid                      |     -2 | f         | f             |              | f
 ctid     | tid                      |     -1 | f         | f             |              | f
 id       | bigint                   |      1 | t         | f             |              | f
 key      | character varying        |      2 | f         | f             |              | f
 name     | character varying        |      3 | f         | f             |              | f
 created  | timestamp with time zone |      4 | f         | f             |              | f
 updated  | timestamp with time zone |      5 | f         | f             |              | f
 archived | timestamp with time zone |      6 | f         | f             |              | f
 geom     | geometry                 |

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2022-06-07 Thread Kyotaro Horiguchi
At Tue, 7 Jun 2022 16:16:09 +0900, Michael Paquier  wrote 
in 
> On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman  wrote 
> > in 
> >> To confirm I'm following you correctly, you're envisioning a situation 
> >> like:
> >> 
> >> - Primary A
> >> - Replica B replicating from primary
> >> - Replica C replicating from replica B
> >> 
> >> then on failover from A to B you end up with:
> >> 
> >> - Primary B
> >> - Replica C replicating from primary
> >> - [needs rewind] A
> >> 
> >> and you try to rewind A from C as the source?
> > 
> > Yes. I think it is a legit use case.  That being said, like other
> > points, it might be acceptable.
> 
> This configuration is a case supported by pg_rewind, meaning that your
> patch to check after minRecoveryPointTLI would be confusing when using
> a standby as a source because the checkpoint needs to apply on its
> primary to allow the TLI of the standby to go up.  If you want to

Yeah, that's what I meant.

> provide the user with more context, a more meaningful way may be to rely
> on an extra check for ControlFileData.state, I guess, as a promoted 
> cluster is marked as DB_IN_PRODUCTION before minRecoveryPoint is
> cleared by the first post-promotion checkpoint, with
> DB_IN_ARCHIVE_RECOVERY for a cascading standby.

Right. However, IIUC, the checkpoint LSN/TLI is not updated at that
time. The point of the minRecoveryPoint check is to confirm that we
can read the timeline ID of the promoted source cluster from
checkPointCopy.ThisTimeLineID, but we cannot do that yet at the time
the cluster state moves to DB_IN_PRODUCTION.  A standby, on the other
hand, has been in DB_IN_ARCHIVE_RECOVERY since before the upstream
promoted, so that state doesn't signal the reliability of
checkPointCopy.ThisTimeLineID either.
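
For anyone following along, the fields in question can be inspected from SQL;
an illustrative check (not part of any patch) on the promoted source, before
the first post-promotion checkpoint completes:

SELECT timeline_id, checkpoint_lsn FROM pg_control_checkpoint();   -- still the pre-promotion values
SELECT min_recovery_end_lsn, min_recovery_end_timeline FROM pg_control_recovery();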

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Logging query parmeters in auto_explain

2022-06-07 Thread Michael Paquier
On Tue, May 31, 2022 at 09:33:20PM +0100, Dagfinn Ilmari Mannsåker wrote:
> Here's a patch that adds a corresponding
> auto_explain.log_parameter_max_length config setting, which controls the
> "Query Parameters" node in the logged plan.  Just like in core, the
> default is -1, which logs the parameters in full, and 0 disables
> parameter logging, while any other value truncates each parameter to
> that many bytes.

With a behavior similar to the in-core log_parameter_max_length, this
looks rather sane to me.  This is consistent with the assumptions of
errdetail_params().

+$node->append_conf('postgresql.conf', "auto_explain.log_parameter_max_length = -1");
Nit: you don't need this change in the TAP test, as this is the
default value, which already logs everything.
--
Michael




Re: Inconvenience of pg_read_binary_file()

2022-06-07 Thread Michael Paquier
On Tue, Jun 07, 2022 at 04:05:20PM +0900, Kyotaro Horiguchi wrote:
> If I want to read a file whose existence I'm not sure of, but want to
> read the whole file if it exists, I would call
> pg_read_binary_file('path', 0, -1, true); unfortunately, this
> doesn't work.

Yeah, the "normal" cases that I have seen in the past just used an
extra call to pg_stat_file() to retrieve the size of the file before
reading it, but arguably it does not help if the file gets extended
between the stat() call and the read() call (we don't need to care
about this case with pg_rewind that has been the reason why the
missing_ok argument was introduced first, for one, as file extensions
don't matter as we'd replay from the LSN point where the rewind
began, adding the new blocks at replay).

There is also an argument for supporting negative values rather than
just -1.  For example, -2 could mean reading the whole file except the
last byte.  Or you could have an extra flavor, such as
pg_read_file(text, bool), to read the whole file by default.
Supporting just -1 as a special value for the amount of data to read
would be confusing IMO, so I would tend to go for a set of arguments
based on (text, bool).
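
For illustration, the two-step workaround mentioned above looks roughly like
this (the file name is only an example, and the file can still grow between
the two calls):

SELECT size FROM pg_stat_file('postmaster.pid', true);
SELECT pg_read_binary_file('postmaster.pid', 0, 120, true);  -- 120 = the size returned above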
--
Michael




Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2022-06-07 Thread Michael Paquier
On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman  wrote in 
>> To confirm I'm following you correctly, you're envisioning a situation like:
>> 
>> - Primary A
>> - Replica B replicating from primary
>> - Replica C replicating from replica B
>> 
>> then on failover from A to B you end up with:
>> 
>> - Primary B
>> - Replica C replicating from primary
>> - [needs rewind] A
>> 
>> and you try to rewind A from C as the source?
> 
> Yes. I think it is a legit use case.  That being said, like other
> points, it might be acceptable.

This configuration is a case supported by pg_rewind, meaning that your
patch to check after minRecoveryPointTLI would be confusing when using
a standby as a source because the checkpoint needs to apply on its
primary to allow the TLI of the standby to go up.  If you want to
provide the user with more context, a more meaningful way may be to rely
on an extra check for ControlFileData.state, I guess, as a promoted 
cluster is marked as DB_IN_PRODUCTION before minRecoveryPoint is
cleared by the first post-promotion checkpoint, with
DB_IN_ARCHIVE_RECOVERY for a cascading standby.
--
Michael




Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2022-06-07 Thread Kyotaro Horiguchi
At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> One possible way to detect promotion reliably is to look into timeline
> history files. It is written immediately at promotion even on
> standbys.

The attached seems to work. It uses timeline history files to identify
the source timeline.  With this change pg_rewind no longer needs to
wait for the end-of-recovery checkpoint to finish.

(It lacks the documentation part and tests.  But I'm not sure how we
can test this behavior.)
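
For illustration, the probing this approach performs against the source can
be pictured in plain SQL (the timeline number is only an example, and these
are not the exact queries the patch issues):

SELECT pg_stat_file('pg_wal/00000002.history', true);   -- NULL when the file is absent
SELECT pg_read_file('pg_wal/00000002.history');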

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 6cb288f099..2a407da1e4 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -309,9 +309,11 @@ sync_target_dir(void)
  * buffer is actually *filesize + 1. That's handy when reading a text file.
  * This function can be used to read binary files as well, you can just
  * ignore the zero-terminator in that case.
+ *
+ * If noerror is true, returns NULL when the file is not found.
  */
 char *
-slurpFile(const char *datadir, const char *path, size_t *filesize)
+slurpFile(const char *datadir, const char *path, size_t *filesize, bool noerror)
 {
int fd;
char   *buffer;
@@ -323,8 +325,13 @@ slurpFile(const char *datadir, const char *path, size_t *filesize)
snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
 
if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) == -1)
+   {
+   if (noerror && errno == ENOENT)
+   return NULL;
+
pg_fatal("could not open file \"%s\" for reading: %m",
 fullpath);
+   }
 
	if (fstat(fd, &statbuf) < 0)
pg_fatal("could not open file \"%s\" for reading: %m",
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 54a853bd42..92e19042cb 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -21,7 +21,8 @@ extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
 extern void sync_target_dir(void);
 
-extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
+extern char *slurpFile(const char *datadir, const char *path, size_t *filesize,
+   bool noerror);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
 extern void traverse_datadir(const char *datadir, process_file_callback_t callback);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..92067d4f2c 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -68,7 +68,7 @@ static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 									off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
- size_t *filesize);
+ size_t *filesize, bool noerror);
 static XLogRecPtr libpq_get_current_wal_insert_lsn(rewind_source *source);
 static void libpq_destroy(rewind_source *source);
 
@@ -620,9 +620,12 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 
 /*
  * Fetch a single file as a malloc'd buffer.
+ *
+ * If noerror is true, returns NULL if pg_read_binary_file() failed.
  */
 static char *
-libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+bool noerror)
 {
PGconn *conn = ((libpq_source *) source)->conn;
PGresult   *res;
@@ -631,6 +634,34 @@ libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
const char *paramValues[1];
 
paramValues[0] = path;
+
+   /*
+* check the existence of the file. We don't do this separately from
+* pg_read_binary_file so that server doesn't emit an error
+*/
+   if (noerror)
+   {
+   res = PQexecParams(conn, "SELECT pg_stat_file($1, true)",
+			   1, NULL, paramValues, NULL, NULL, 1);
+
+   if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   {
+   pg_fatal("could not stat remote file \"%s\": %s",
+path, PQresultErrorMessage(res));
+   }
+
+   /* sanity check the result set */
+   if (PQntuples(res) != 1)
+			pg_fatal("unexpected result set while stating remote file \"%s\"",
+path);
+
+   /* Return NULL if the file was not found */
+   if (PQgetisnull(res, 0, 0))
+   

Inconvenience of pg_read_binary_file()

2022-06-07 Thread Kyotaro Horiguchi
If I want to read a file whose existence I'm not sure of, but want to
read the whole file if it exists, I would call
pg_read_binary_file('path', 0, -1, true); unfortunately, this
doesn't work.
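
For illustration, the call in question (the file name is only an example):

SELECT pg_read_binary_file('postmaster.pid', 0, -1, true);
-- currently rejected with "requested length cannot be negative"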

Does it make sense to change the function so as to accept the
parameter specification above? Or the arguments could be ('path',
null, null, true) but (0,-1) is simpler considering the
characteristics of the function.

(We could also rearrange the parameter order as "filename,
missing_ok, offset, length" but that is simply confusing.)

If it does, pg_read_file() deserves the same modification, and I'll
post a version containing the documentation part.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 2bf5219256..219203be1c 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -331,10 +331,10 @@ pg_read_binary_file(PG_FUNCTION_ARGS)
 		seek_offset = PG_GETARG_INT64(1);
 		bytes_to_read = PG_GETARG_INT64(2);
 
-		if (bytes_to_read < 0)
+		if (bytes_to_read < -1)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("requested length cannot be negative")));
+	 errmsg("invalid requested length")));
 	}
 	if (PG_NARGS() >= 4)
 		missing_ok = PG_GETARG_BOOL(3);