Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Amit Kapila
On Thu, Sep 14, 2023 at 10:37 AM Dilip Kumar  wrote:
>
> On Thu, Sep 14, 2023 at 10:00 AM Amit Kapila  wrote:
> >
> > On Thu, Sep 14, 2023 at 9:21 AM Dilip Kumar  wrote:
>
> > > > ---
> > > >
> > > > 3) Introduce a new pg_upgrade option(e.g. skip_slot_check), and suggest 
> > > > if user
> > > >already did the upgrade check for stopped server, they can use this 
> > > > option
> > > >when trying to upgrade later.
> > > >
> > > > Pros: Can save some efforts for user to advance each slot's lsn.
> > > >
> > > > Cons: I didn't see similar options in pg_upgrade, might need some 
> > > > agreement.
> > >
> > > Yeah right, in fact during the --check command we can give that
> > > suggestion as well.
> > >
> >
> > Hmm, we can't mandate users to skip checking slots because that is the
> > whole point of --check slots.
>
> I don't mean to mandate skipping in the --check command.  But once the
> check command has already checked the slots, we can issue a suggestion
> to the user that the slots are already checked, so that during the
> actual upgrade the slot check can be skipped.  So a user who has
> already run the check command and is now following up with an upgrade
> can skip slot checking, if we can provide such an option.
>

Oh, okay, we can document this and request the user to follow the steps
you suggest, but I guess it will be more work for the user and is also
less intuitive.

> > > I feel option 2 looks best to me unless there is some design issue to
> > > that, as of now I do not see any issue with that though.  Let's see
> > > what others think.
> > >
> >
> > By the way, did you consider the previous approach this patch was
> > using? Basically, instead of getting the last checkpoint location from
> > the control file, we will read the WAL file starting from the
> > confirmed_flush location of a slot and if we find any WAL other than
> > expected WALs like shutdown checkpoint, running_xacts, etc. then we
> > will error out.
>
> So basically, while scanning from confirmed_flush we must ensure that
> the first record we find is a SHUTDOWN CHECKPOINT record at that same
> LSN, and after that we should not get any WAL other than the expected
> ones you mentioned, like shutdown checkpoint and running_xacts.  That
> way we ensure both that the confirmed_flush LSN is at the shutdown
> checkpoint and that there is no real activity in the system after it.
>

Right.

>  To me, this seems like the best available option so far.
>

Yeah, let's see if someone else has a different opinion or has a better idea.

-- 
With Regards,
Amit Kapila.




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Dilip Kumar
On Thu, Sep 14, 2023 at 10:00 AM Amit Kapila  wrote:
>
> On Thu, Sep 14, 2023 at 9:21 AM Dilip Kumar  wrote:

> > > Cons: Although we have some examples for using functions
> > > (binary_upgrade_set_next_pg_enum_oid ...) to set some variables during 
> > > upgrade
> > > , but not sure if it's a standard behavior to change the slot's lsn during
> > > upgrade.
> >
> > I feel this seems like a good option.
> >
>
> In this idea, if the user decides not to proceed after the upgrade
> --check, then we would have incremented the confirmed_flush location
> of all slots without the subscriber's acknowledgment.

Yeah, that's a problem.


> > > ---
> > >
> > > 3) Introduce a new pg_upgrade option(e.g. skip_slot_check), and suggest 
> > > if user
> > >already did the upgrade check for stopped server, they can use this 
> > > option
> > >when trying to upgrade later.
> > >
> > > Pros: Can save some efforts for user to advance each slot's lsn.
> > >
> > > Cons: I didn't see similar options in pg_upgrade, might need some 
> > > agreement.
> >
> > Yeah right, in fact during the --check command we can give that
> > suggestion as well.
> >
>
> Hmm, we can't mandate users to skip checking slots because that is the
> whole point of --check slots.

I don't mean to mandate skipping in the --check command.  But once the
check command has already checked the slots, we can issue a suggestion
to the user that the slots are already checked, so that during the
actual upgrade the slot check can be skipped.  So a user who has
already run the check command and is now following up with an upgrade
can skip slot checking, if we can provide such an option.

> > I feel option 2 looks best to me unless there is some design issue to
> > that, as of now I do not see any issue with that though.  Let's see
> > what others think.
> >
>
> By the way, did you consider the previous approach this patch was
> using? Basically, instead of getting the last checkpoint location from
> the control file, we will read the WAL file starting from the
> confirmed_flush location of a slot and if we find any WAL other than
> expected WALs like shutdown checkpoint, running_xacts, etc. then we
> will error out.

So basically, while scanning from confirmed_flush we must ensure that
the first record we find is a SHUTDOWN CHECKPOINT record at that same
LSN, and after that we should not get any WAL other than the expected
ones you mentioned, like shutdown checkpoint and running_xacts.  That
way we ensure both that the confirmed_flush LSN is at the shutdown
checkpoint and that there is no real activity in the system after it.
To me, this seems like the best available option so far.
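
To make the shape of that check concrete, here is a rough sketch of the
kind of loop pg_upgrade could run (purely illustrative pseudo-C: the
slot struct, read_next_record() and the is_*() record-type helpers are
hypothetical placeholders, not code from any posted patch):

/*
 * Illustrative sketch only.  Scan WAL from the slot's confirmed_flush LSN
 * and fail unless the first record is the shutdown checkpoint at exactly
 * that LSN and every later record is one of the few expected types that
 * can legitimately appear after it (e.g. running_xacts).
 */
XLogRecPtr	cur_lsn = slot->confirmed_flush;
WalRecord  *rec = read_next_record(&cur_lsn);	/* hypothetical helper */

if (rec == NULL || !is_shutdown_checkpoint(rec) ||
	rec->lsn != slot->confirmed_flush)
	pg_fatal("slot \"%s\" has not consumed all WAL before shutdown",
			 slot->name);

while ((rec = read_next_record(&cur_lsn)) != NULL)
{
	/* anything else means real activity happened after the checkpoint */
	if (!is_shutdown_checkpoint(rec) && !is_running_xacts(rec))
		pg_fatal("unexpected WAL record after confirmed_flush of slot \"%s\"",
				 slot->name);
}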

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: JSON Path and GIN Questions

2023-09-13 Thread Tom Lane
Erik Rijkers  writes:
> On 9/13/23 at 22:01, David E. Wheeler wrote:
>> On Sep 13, 2023, at 01:11, Erik Rijkers  wrote:
>>> "All use of json*() functions preclude index usage."

>> Where did that come from? Why wouldn’t JSON* functions use indexes? I see 
>> that the docs only mention operators; why would the corresponding functions 
>> behave the same?

> Sorry, perhaps my reply was a bit off-topic.
> But you mentioned perhaps touching the docs and
> the not-use-of-index is just so unexpected.

Unexpected to who?  I think the docs make it pretty plain that only
operators on indexed columns are considered as index qualifications.
Admittedly, 11.2 Index Types [1] makes the point only by not
discussing any other case, but when you get to 11.10 Operator Classes
and Operator Families [2] and discover that the entire index definition
mechanism is based around operators not functions, you should be able
to reach that conclusion.  The point is made even more directly in
38.16 Interfacing Extensions to Indexes [3], though I'll concede
that that's not material I'd expect the average PG user to read.
As far as json in particular is concerned, 8.14.4 jsonb Indexing [4]
is pretty clear about what is or is not supported.

regards, tom lane

[1] https://www.postgresql.org/docs/current/indexes-types.html
[2] https://www.postgresql.org/docs/current/indexes-opclass.html
[3] https://www.postgresql.org/docs/current/xindex.html
[4] https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING




Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Michael Paquier
On Wed, Sep 13, 2023 at 07:09:32PM -0700, Andres Freund wrote:
> On 2023-09-13 19:07:24 -0700, Andres Freund wrote:
>> Huh. I don't think this is a good idea - and certainly not in the back
>> branches. The prior message made more sense, imo. The fact that the snapshot
>> identifier is a file is an implementation detail; that no snapshot with that
>> identifier has been exported is a user-level detail, hence that being
>> mentioned in the error message.
>> 
>> I can see an argument for treating ENOENT different than other errors though,
>> and using the standard file opening error message for anything other than
>> ENOENT.
> 
> Oh, and given that this actually changes the error code for an invalid
> snapshot, I think this needs to be reverted. It's not that unlikely that
> there's code out there that depends on getting ERRCODE_INVALID_PARAMETER_VALUE
> when the snapshot doesn't exist.

Ahem.  This seems to be the only code path that reports a failure from
AllocateFile() without showing %m at all, and the error is misleading
in basically all cases, since errno holds the extra information telling
somebody that something is going wrong.  So I don't quite see how it is
useful to say "invalid snapshot identifier" on an EACCES or even ENOENT
when opening this file, with zero information about what's happening on
top of that.  Even on ENOENT, one can be confused by the same error
message generated a few lines above: if AllocateFile() fails, the
snapshot identifier is correctly shaped, but its file is missing.  If
ENOENT were treated as a particular case with the old message, we'd
still not know whether it refers to the first failure or the second
one.

That said, I'm OK with reverting to the previous behavior on the
back-branches if you feel strongly about that.
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Amit Kapila
On Thu, Sep 14, 2023 at 9:21 AM Dilip Kumar  wrote:
>
> On Thu, Sep 14, 2023 at 8:40 AM Zhijie Hou (Fujitsu)
>  wrote:
>
> >
> > Here are some more ideas about the issue for reference.
> >
> > 1) Extending the controlfile.
> >
> > We can add a new field (e.g. non_upgrade_checkPoint) to record the last
> > checkpoint ptr that happened in non-upgrade mode. The new field won't be
> > updated due to
> > "pg_upgrade --check", so pg_upgrade can use this LSN to compare with the 
> > slot's
> > confirmed_flush_lsn.
> >
> > Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
> > --check" in advance.
> >
> > Cons: Not sure if this is enough reason to introduce a new field in the
> > controlfile.
>
> Yeah, this could be an option but I am not sure either that adding a
> new option for this purpose is the best way.
>

I also think so. It seems this could work but adding upgrade-specific
information to other data structures doesn't sound like a clean
solution.

> > ---
> >
> > 2) Advance the slot's confirmed_flush_lsn in pg_upgrade if the check passes.
> >
> > Introducing an upgrade support SQL function
> > (binary_upgrade_advance_logical_slot_lsn()) to set a
> > flag(catch_confirmed_lsn_up) on server side. On server side, when trying to
> > flush the slot in shutdown checkpoint(CheckPointReplicationSlots), we update
> > the slot's confirmed_flush_lsn to the lsn of the current checkpoint if
> > catch_confirmed_lsn_up is set.
> >
> > Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
> > --check" in advance.
> >
> > Cons: Although we have some examples for using functions
> > (binary_upgrade_set_next_pg_enum_oid ...) to set some variables during 
> > upgrade
> > , but not sure if it's a standard behavior to change the slot's lsn during
> > upgrade.
>
> I feel this seems like a good option.
>

In this idea, if the user decides not to proceed after the upgrade
--check, then we would have incremented the confirmed_flush location
of all slots without the subscriber's acknowledgment. It may not be
the usual scenario, but in theory it may violate our basic principle
for incrementing the confirmed_flush location. Another thing to
consider is that we would have to do this for all logical slots under
the assumption that all of them are already caught up, as pg_upgrade
would have ensured that. So, ideally, the server should have some
knowledge that the slots are already caught up to the latest location,
which again doesn't seem like a clean idea.

> > ---
> >
> > 3) Introduce a new pg_upgrade option(e.g. skip_slot_check), and suggest if 
> > user
> >already did the upgrade check for stopped server, they can use this 
> > option
> >when trying to upgrade later.
> >
> > Pros: Can save some efforts for user to advance each slot's lsn.
> >
> > Cons: I didn't see similar options in pg_upgrade, might need some agreement.
>
> Yeah right, in fact during the --check command we can give that
> suggestion as well.
>

Hmm, we can't mandate users to skip checking slots because that is the
whole point of --check slots.

> I feel option 2 looks best to me unless there is some design issue to
> that, as of now I do not see any issue with that though.  Let's see
> what others think.
>

By the way, did you consider the previous approach this patch was
using? Basically, instead of getting the last checkpoint location from
the control file, we will read the WAL file starting from the
confirmed_flush location of a slot and if we find any WAL other than
expected WALs like shutdown checkpoint, running_xacts, etc. then we
will error out.

-- 
With Regards,
Amit Kapila.




Re: CHECK Constraint Deferrable

2023-09-13 Thread vignesh C
On Thu, 7 Sept 2023 at 17:26, Himanshu Upadhyaya
 wrote:
>
> Attached is v2 of the patch, rebased against the latest HEAD.

A few issues:
1) Create domain fails but alter domain is successful; I feel we
should support create domain too:
postgres=# create domain d1 as int check(value<>0) deferrable;
ERROR:  specifying constraint deferrability not supported for domains
postgres=# create domain d1 as int check(value<>0);
CREATE DOMAIN
postgres=# alter domain d1 add constraint con_2 check(value<>1) deferrable;
ALTER DOMAIN

2) I was not sure if the error message change was intentional:
2a)
In Head:
CREATE FOREIGN TABLE t9(a int CHECK(a<>0) DEFERRABLE) SERVER s1;
ERROR:  misplaced DEFERRABLE clause
LINE 1: CREATE FOREIGN TABLE t9(a int CHECK(a<>0) DEFERRABLE) SERVER...
  ^
postgres=# CREATE FOREIGN TABLE t9(a int CHECK(a<>0) DEFERRABLE) SERVER s1;
ERROR:  "t9" is a foreign table
DETAIL:  Foreign tables cannot have constraint triggers.

2b)
In Head:
postgres=# CREATE FOREIGN TABLE t2(a int CHECK(a<>0)) SERVER s1;
CREATE FOREIGN TABLE
postgres=# ALTER FOREIGN TABLE t2 ADD CONSTRAINT t2_chk_1 CHECK(a<>1)
DEFERRABLE;
ERROR:  CHECK constraints cannot be marked DEFERRABLE

With patch:
postgres=# ALTER FOREIGN TABLE t8 ADD CONSTRAINT t8_chk_1 CHECK(a<>1)
DEFERRABLE;
ERROR:  "t8" is a foreign table
DETAIL:  Foreign tables cannot have constraint triggers.

3) The insert check is not always deferred to commit:
Here the insert check is correctly deferred to commit:
postgres=# CREATE TABLE tbl (i int ) partition by range (i);
CREATE TABLE tbl_1 PARTITION OF tbl FOR VALUES FROM (0) TO (10);
CREATE TABLE tbl_2 PARTITION OF tbl FOR VALUES FROM (20) TO (30);
CREATE TABLE
CREATE TABLE
CREATE TABLE
postgres=# ALTER TABLE tbl ADD CONSTRAINT tbl_chk_1 CHECK(i<>1) DEFERRABLE;
ALTER TABLE
postgres=# begin;
BEGIN
postgres=*# SET CONSTRAINTS tbl_chk_1 DEFERRED;
SET CONSTRAINTS
postgres=*# INSERT INTO tbl values (1);
INSERT 0 1
postgres=*# commit;
ERROR:  new row for relation "tbl_1" violates check constraint "tbl_chk_1"
DETAIL:  Failing row contains (1).

But the check here is not deferred to commit:
postgres=# CREATE TABLE tbl (i int check(i<>0) DEFERRABLE) partition
by range (i);
CREATE TABLE tbl_1 PARTITION OF tbl FOR VALUES FROM (0) TO (10);
CREATE TABLE tbl_2 PARTITION OF tbl FOR VALUES FROM (20) TO (30);
CREATE TABLE
CREATE TABLE
CREATE TABLE
postgres=# ALTER TABLE tbl ADD CONSTRAINT tbl_chk_1 CHECK(i<>1) DEFERRABLE;
ALTER TABLE
postgres=# begin;
BEGIN
postgres=*#  SET CONSTRAINTS tbl_chk_1 DEFERRED;
SET CONSTRAINTS
postgres=*# INSERT INTO tbl values (1);
ERROR:  new row for relation "tbl_1" violates check constraint "tbl_chk_1"
DETAIL:  Failing row contains (1).

4) There is a new warning popping up now:
CREATE TABLE tbl_new_3 (i int check(i<>0)) partition by range (i);
CREATE FOREIGN TABLE ftbl_new_3 PARTITION OF tbl_new_3 FOR VALUES FROM
(40) TO (50) server s1;
postgres=# ALTER TABLE tbl_new_3 ADD CONSTRAINT tbl_new_3_chk
CHECK(i<>1) DEFERRABLE;
WARNING:  unexpected pg_constraint record found for relation "tbl_new_3"
ERROR:  "ftbl_new_3" is a foreign table
DETAIL:  Foreign tables cannot have constraint triggers.

Regards,
Vignesh




Re: JSON Path and GIN Questions

2023-09-13 Thread Erik Rijkers

On 9/13/23 at 22:01, David E. Wheeler wrote:

On Sep 13, 2023, at 01:11, Erik Rijkers  wrote:


"All use of json*() functions preclude index usage."

That sentence is missing from the documentation.


Where did that come from? Why wouldn’t JSON* functions use indexes? I see that 
the docs only mention operators; why would the corresponding functions behave 
the same?

D


Sorry, perhaps my reply was a bit off-topic.
But you mentioned perhaps touching the docs and
the not-use-of-index is just so unexpected.
Compare these two statements:

select count(id) from movies where
movie @? '$ ? (@.year == 2023)'
Time: 1.259 ms
  (index used)

select count(id) from movies where
jsonb_path_match(movie, '$.year == 2023');
Time: 17.260 ms
  (no index used - unexpectedly slower)

With these two indexes available:
  using gin (movie);
  using gin (movie jsonb_path_ops);

(REL_15_STABLE; but it's the same in HEAD and
the not-yet-committed SQL/JSON patches.)

Erik Rijkers




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Dilip Kumar
On Thu, Sep 14, 2023 at 8:40 AM Zhijie Hou (Fujitsu)
 wrote:

>
> Here are some more ideas about the issue for reference.
>
> 1) Extending the controlfile.
>
> We can add a new field (e.g. non_upgrade_checkPoint) to record the last
> checkpoint ptr that happened in non-upgrade mode. The new field won't be
> updated due to
> "pg_upgrade --check", so pg_upgrade can use this LSN to compare with the 
> slot's
> confirmed_flush_lsn.
>
> Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
> --check" in advance.
>
> Cons: Not sure if this is enough reason to introduce a new field in the
> controlfile.

Yeah, this could be an option but I am not sure either that adding a
new option for this purpose is the best way.

> ---
>
> 2) Advance the slot's confirmed_flush_lsn in pg_upgrade if the check passes.
>
> Introducing an upgrade support SQL function
> (binary_upgrade_advance_logical_slot_lsn()) to set a
> flag(catch_confirmed_lsn_up) on server side. On server side, when trying to
> flush the slot in shutdown checkpoint(CheckPointReplicationSlots), we update
> the slot's confirmed_flush_lsn to the lsn of the current checkpoint if
> catch_confirmed_lsn_up is set.
>
> Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
> --check" in advance.
>
> Cons: Although we have some examples for using functions
> (binary_upgrade_set_next_pg_enum_oid ...) to set some variables during upgrade
> , but not sure if it's a standard behavior to change the slot's lsn during
> upgrade.

I feel this seems like a good option.

> ---
>
> 3) Introduce a new pg_upgrade option(e.g. skip_slot_check), and suggest if 
> user
>already did the upgrade check for stopped server, they can use this option
>when trying to upgrade later.
>
> Pros: Can save some efforts for user to advance each slot's lsn.
>
> Cons: I didn't see similar options in pg_upgrade, might need some agreement.

Yeah, right; in fact, during the --check command we can give that
suggestion as well.

I feel option 2 looks best to me unless there is some design issue
with it; as of now I do not see any issue with it, though.  Let's see
what others think.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Inefficiency in parallel pg_restore with many tables

2023-09-13 Thread Nathan Bossart
On Wed, Sep 13, 2023 at 08:01:39PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> Upon closer inspection, I found a rather nasty problem.  The qsort
>> comparator expects a TocEntry **, but the binaryheap comparator expects a
>> TocEntry *, and we simply pass the arguments through to the qsort
>> comparator.  In v9, I added the requisite ampersands.
> 
> Ooops :-(
> 
>> I'm surprised this
>> worked at all.
> 
> Probably it was not sorting things appropriately.  Might be worth adding
> some test scaffolding to check that bigger tasks are chosen before
> smaller ones.

Further testing revealed that the binaryheap comparator function was
actually generating a min-heap since the qsort comparator sorts by
decreasing dataLength.  This is fixed in v10.  And I am 0 for 2 today...
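
For anyone following along, here is a small self-contained sketch of the
two comparator pitfalls discussed above (names are illustrative, not the
actual pg_restore code).  A qsort-style comparator receives pointers to
the array elements (TocEntry **), while the binaryheap comparator
receives the stored nodes themselves (TocEntry *); and since the
binaryheap keeps the node its comparator ranks highest on top, a
comparator that orders by decreasing dataLength has to be inverted to
get "largest task first" behavior:

#include <stddef.h>

typedef struct TocEntry { size_t dataLength; } TocEntry;	/* simplified stand-in */

/* qsort-style comparator: arguments point to array elements, i.e. TocEntry ** */
static int
size_compare_qsort(const void *p1, const void *p2)
{
	const TocEntry *te1 = *(const TocEntry *const *) p1;
	const TocEntry *te2 = *(const TocEntry *const *) p2;

	if (te1->dataLength > te2->dataLength)
		return -1;				/* bigger entries sort first (descending) */
	if (te1->dataLength < te2->dataLength)
		return 1;
	return 0;
}

/* binaryheap-style comparator: arguments are the stored nodes, i.e. TocEntry * */
static int
size_compare_heap(void *a, void *b, void *arg)
{
	TocEntry   *te1 = (TocEntry *) a;
	TocEntry   *te2 = (TocEntry *) b;

	(void) arg;

	/*
	 * Take the addresses (the "requisite ampersands") so the qsort comparator
	 * sees TocEntry **, and negate the result so the heap surfaces the largest
	 * entry rather than the smallest.
	 */
	return -size_compare_qsort(&te1, &te2);
}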

Now that this appears to be functioning as expected, I see that the larger
entries are typically picked up earlier, but we do sometimes pick entries
quite a bit further down the list, as anticipated.  The case I was testing
(10k tables with the number of rows equal to the table number) was much
faster with this patch (just over a minute) than without it (over 16
minutes).

Sincerest apologies for the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 59498e233d725d5c427e58f700402908fce2a10f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v10 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile |  1 -
 src/backend/lib/meson.build  |  1 -
 src/common/Makefile  |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +---
 src/common/meson.build   |  1 +
 src/include/lib/binaryheap.h | 26 +++
 src/tools/pgindent/typedefs.list |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 90%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..39a8243a6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include 
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,16 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= 

Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Dilip Kumar
On Wed, Sep 13, 2023 at 7:22 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> Thank you for reviewing! Before making a patch I can reply to the
> important point.
>
> > 1. One thing to note is that if user checks whether the old cluster is
> > upgradable with --check option and then try to upgrade, that will also
> > fail. Because during the --check run there would be at least one
> > additional shutdown checkpoint WAL and then in the next run the slots
> > position won't match. Note, I am saying this in context of using
> > --check option with not-running old cluster. Won't that be surprising
> > to users? One possibility is that we document such a behaviour and
> > other is that we go back to WAL reading design where we can ignore
> > known WAL records like shutdown checkpoint, XLOG_RUNNING_XACTS, etc.
>
> Good catch, we have never considered the case that --check is executed for
> a stopped cluster. You are right, the old cluster is turned on/off during the
> check and it generates SHUTDOWN_CHECKPOINT. This leads to confirmed_flush
> being behind the latest checkpoint lsn.

Good catch.

> Here are other approaches we came up with:
>
> 1. adds WARNING message when the --check is executed and slots are checked.
>We can say like:
>
> ```
> ...
> Checking for valid logical replication slots
> WARNING: this check generated WALs
> Next pg_upgrade would fail.
> Please ensure again that all WALs are replicated.
> ...

IMHO --check is a very common command users execute before the actual
upgrade, so issuing such a WARNING might not be good, because then what
option does the user have?  Do they need to restart the cluster again in
order to stream the new WAL and then shut it down again?  I don't think
that is really an acceptable idea.  Maybe, as discussed in the past, we
can provide an option to skip the slot checking, and during the --check
command we can give a WARNING suggesting that it is better to use
--skip-slot-checking for the main upgrade since the slots have already
been checked.  This could still be okay for the user.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Zhijie Hou (Fujitsu)
On Wednesday, September 13, 2023 9:52 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Amit,
> 
> Thank you for reviewing! Before making a patch I can reply to the
> important point.
> 
> > 1. One thing to note is that if user checks whether the old cluster is
> > upgradable with --check option and then try to upgrade, that will also
> > fail. Because during the --check run there would be at least one
> > additional shutdown checkpoint WAL and then in the next run the slots
> > position won't match. Note, I am saying this in context of using
> > --check option with not-running old cluster. Won't that be surprising
> > to users? One possibility is that we document such a behaviour and
> > other is that we go back to WAL reading design where we can ignore
> > known WAL records like shutdown checkpoint, XLOG_RUNNING_XACTS, etc.
> 
> Good catch, we have never considered the case that --check is executed for
> a stopped cluster. You are right, the old cluster is turned on/off during the
> check and it generates SHUTDOWN_CHECKPOINT. This leads to confirmed_flush
> being behind the latest checkpoint lsn.
> 
> Here are other approaches we came up with:
> 
> 1. adds WARNING message when the --check is executed and slots are
> checked.
>We can say like:
> 
> ```
> ...
> Checking for valid logical replication slots
> WARNING: this check generated WALs
> Next pg_upgrade would fail.
> Please ensure again that all WALs are replicated.
> ...
> ```
> 
> 
> 2. adds a hint message in the FATAL error when the confirmed_flush is not
>    the same as the latest checkpoint:
> 
> ```
> ...
> Checking for valid logical replication slots  fatal
> 
> Your installation contains invalid logical replication slots.
> These slots can't be copied, so this cluster cannot be upgraded.
> Consider removing such slots or consuming the pending WAL if any,
> and then restart the upgrade.
> If you did pg_upgrade --check before this run, it may be a cause.
> Please start clusters and confirm again that all changes are
> replicated.
> A list of invalid logical replication slots is in the file:
> ```
> 
> 3. requests users to do pg_upgrade --check on a backup database, if the old
>    cluster has logical slots. Basically they save the whole cluster before
>    doing pg_upgrade, so it may be acceptable. This requires no code changes.
> 

Here are some more ideas about the issue for reference.

1) Extending the controlfile.

We can add a new field (e.g. non_upgrade_checkPoint) to record the last
checkpoint ptr that happened in non-upgrade mode. The new field won't be
updated due to "pg_upgrade --check", so pg_upgrade can use this LSN to compare
with the slot's confirmed_flush_lsn.

Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
--check" in advance.

Cons: Not sure if this is enough reason to introduce a new field in the
controlfile.

---

2) Advance the slot's confirmed_flush_lsn in pg_upgrade if the check passes.

Introduce an upgrade support SQL function
(binary_upgrade_advance_logical_slot_lsn()) to set a
flag (catch_confirmed_lsn_up) on the server side. On the server side, when
trying to flush the slot in the shutdown checkpoint
(CheckPointReplicationSlots), we update the slot's confirmed_flush_lsn to the
lsn of the current checkpoint if catch_confirmed_lsn_up is set.

Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
--check" in advance.

Cons: Although we have some examples of using functions
(binary_upgrade_set_next_pg_enum_oid ...) to set some variables during
upgrade, it is not clear whether changing the slot's lsn during upgrade is
standard behavior.
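
To make idea (2) a bit more concrete, here is a rough sketch of how such a
flag might be wired up. The function and flag names come from the description
above; the code itself is hypothetical and not from an actual patch:

/* Hypothetical sketch only -- not from a posted patch. */
static bool catch_confirmed_lsn_up = false;

/* Upgrade-support function, callable only in binary-upgrade mode. */
Datum
binary_upgrade_advance_logical_slot_lsn(PG_FUNCTION_ARGS)
{
	CHECK_IS_BINARY_UPGRADE;
	catch_confirmed_lsn_up = true;
	PG_RETURN_VOID();
}

/*
 * In CheckPointReplicationSlots(), while flushing each logical slot during
 * the shutdown checkpoint, bump confirmed_flush to the checkpoint's LSN
 * (checkpoint_lsn is assumed to be available at that point).
 */
if (catch_confirmed_lsn_up && SlotIsLogical(slot))
{
	SpinLockAcquire(&slot->mutex);
	slot->data.confirmed_flush = checkpoint_lsn;
	SpinLockRelease(&slot->mutex);
}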

---

3) Introduce a new pg_upgrade option (e.g. skip_slot_check), and suggest that
   if the user already did the upgrade check for the stopped server, they can
   use this option when trying to upgrade later.

Pros: Can save the user some effort in advancing each slot's lsn.

Cons: I didn't see similar options in pg_upgrade; this might need some
agreement.


Best Regards,
Hou zj


Re: BufferUsage counters' values have changed

2023-09-13 Thread Andres Freund
Hi,

On 2023-09-13 11:59:39 -0700, Andres Freund wrote:
> Running the attached patch through CI, planning to push after that succeeds,
> unless somebody has a comment?

And pushed.

Thanks Karina for the report and fix!

Greetings,

Andres Freund




Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Andres Freund
Hi,

On 2023-09-13 19:07:24 -0700, Andres Freund wrote:
> On 2023-09-14 10:33:33 +0900, Michael Paquier wrote:
> > On Wed, Sep 13, 2023 at 01:19:38PM +0200, Daniel Gustafsson wrote:
> > > +1. This errmsg is already present so it eases the translation burden as 
> > > well.
> > 
> > I was thinking about doing only that on HEAD, but there is an argument
> > that one could get confusing errors when dealing with snapshot imports
> > on back-branches as well, and it applies down to 11 without conflicts.
> > So, applied and backpatched.
> 
> Huh. I don't think this is a good idea - and certainly not in the back
> branches. The prior message made more sense, imo. The fact that the snapshot
> identifier is a file is an implementation detail; that no snapshot with that
> identifier has been exported is a user-level detail, hence that being
> mentioned in the error message.
> 
> I can see an argument for treating ENOENT different than other errors though,
> and using the standard file opening error message for anything other than
> ENOENT.

Oh, and given that this actually changes the error code for an invalid
snapshot, I think this needs to be reverted. It's not that unlikely that
there's code out there that depends on getting ERRCODE_INVALID_PARAMETER_VALUE
when the snapshot doesn't exist.

- Andres




Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Andres Freund
Hi,

On 2023-09-14 10:33:33 +0900, Michael Paquier wrote:
> On Wed, Sep 13, 2023 at 01:19:38PM +0200, Daniel Gustafsson wrote:
> > +1. This errmsg is already present so it eases the translation burden as 
> > well.
> 
> I was thinking about doing only that on HEAD, but there is an argument
> that one could get confusing errors when dealing with snapshot imports
> on back-branches as well, and it applies down to 11 without conflicts.
> So, applied and backpatched.

Huh. I don't think this is a good idea - and certainly not in the back
branches. The prior message made more sense, imo. The fact that the snapshot
identifier is a file is an implementation detail; that no snapshot with that
identifier has been exported is a user-level detail, hence that being
mentioned in the error message.

I can see an argument for treating ENOENT different than other errors though,
and using the standard file opening error message for anything other than
ENOENT.
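
Concretely, that split could look roughly like the sketch below (illustrative
only, not a proposed patch; idstr and path are assumed to be the identifier
string and the snapshot file path already computed by the import code):

	f = AllocateFile(path, PG_BINARY_R);
	if (f == NULL)
	{
		if (errno == ENOENT)
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("invalid snapshot identifier: \"%s\"", idstr)));
		else
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not open file \"%s\": %m", path)));
	}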

Greetings,

Andres




Re: persist logical slots to disk during shutdown checkpoint

2023-09-13 Thread Michael Paquier
On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> The patch is updated as per recent discussion.

WFM.  Thanks for the updated version.
--
Michael


signature.asc
Description: PGP signature


Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Michael Paquier
On Wed, Sep 13, 2023 at 01:19:38PM +0200, Daniel Gustafsson wrote:
> +1. This errmsg is already present so it eases the translation burden as well.

I was thinking about doing only that on HEAD, but there is an argument
that one could get confusing errors when dealing with snapshot imports
on back-branches as well, and it applies down to 11 without conflicts.
So, applied and backpatched.
--
Michael


signature.asc
Description: PGP signature


Re: Avoid a possible null pointer (src/backend/utils/adt/pg_locale.c)

2023-09-13 Thread Michael Paquier
On Wed, Sep 13, 2023 at 08:14:11AM -0700, Jeff Davis wrote:
> Looks good to me, thank you.

Applied, then.  Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Redundant Unique plan node for table with a unique index

2023-09-13 Thread Andy Fan
>
>
> I don't think this is a good way to do this.  The method you're using
> only supports this optimisation when querying a table directly.  If
> there were subqueries, joins, etc then it wouldn't work as there are
> no unique indexes.  You should probably have a look at [1] to see
> further details of an alternative method without the said limitations.
>
> David
>
> [1]
> https://postgr.es/m/flat/CAKU4AWqZvSyxroHkbpiHSCEAY2C41dG7VWs%3Dc188KKznSK_2Zg%40mail.gmail.com
>
>
The nullable tracking blocker has probably been removed by varnullingrels,
so I will start working on the UniqueKey stuff very soon.  Thank you,
David, for remembering this feature!

-- 
Best Regards
Andy Fan


Re: Inefficiency in parallel pg_restore with many tables

2023-09-13 Thread Tom Lane
Nathan Bossart  writes:
> Upon closer inspection, I found a rather nasty problem.  The qsort
> comparator expects a TocEntry **, but the binaryheap comparator expects a
> TocEntry *, and we simply pass the arguments through to the qsort
> comparator.  In v9, I added the requisite ampersands.

Ooops :-(

> I'm surprised this
> worked at all.

Probably it was not sorting things appropriately.  Might be worth adding
some test scaffolding to check that bigger tasks are chosen before
smaller ones.

regards, tom lane




Re: Surely this code in setrefs.c is wrong?

2023-09-13 Thread David Rowley
On Sun, 10 Sept 2023 at 21:07, David Rowley  wrote:
>
> On Sun, 10 Sept 2023 at 11:22, Tom Lane  wrote:
> > if (!OidIsValid(saop->hashfuncid))
> > record_plan_function_dependency(root, saop->hashfuncid);
> >
> > if (!OidIsValid(saop->negfuncid))
> > record_plan_function_dependency(root, saop->negfuncid);
> >
> > Surely those if-conditions are exactly backward, and we should be
> > recording nonzero hashfuncid and negfuncid entries, not zero ones.
>

> I'll push fixes once the 16.0 release is out of the way.

Fixed in ee3a551e9.

David




Re: Redundant Unique plan node for table with a unique index

2023-09-13 Thread David Rowley
On Thu, 14 Sept 2023 at 02:28, Damir Belyalov  wrote:
> create table a (n int);
> insert into a (n) select x from generate_series(1, 14) as g(x);
> create unique index on a (n);
> explain select distinct n from a;
>  QUERY PLAN
> 
>  Unique  (cost=0.42..6478.42 rows=14 width=4)
>->  Index Only Scan using a_n_idx on a  (cost=0.42..6128.42 rows=14 
> width=4)
> (2 rows)
>
>
> We can see that Unique node is redundant for this case. So I implemented a 
> simple patch that removes Unique node from the plan.

I don't think this is a good way to do this.  The method you're using
only supports this optimisation when querying a table directly.  If
there were subqueries, joins, etc then it wouldn't work as there are
no unique indexes.  You should probably have a look at [1] to see
further details of an alternative method without the said limitations.

David

[1] 
https://postgr.es/m/flat/CAKU4AWqZvSyxroHkbpiHSCEAY2C41dG7VWs%3Dc188KKznSK_2Zg%40mail.gmail.com




Re: Jumble the CALL command in pg_stat_statements

2023-09-13 Thread Imseih (AWS), Sami
> Still this grouping is much better than having thousands of entries
> with different values. I am not sure if we should bother improving
> that more than what you suggest, especially as FuncExpr->args can
> itself include Const nodes as far as I recall.

I agree.

> As far as the SET command mentioned in [1] is concerned, it is a bit more 
> complex
> as it requires us to deal with A_Constants which is not very straightforward. 
> We can surely
> deal with SET currently by applying custom query jumbling logic to 
> VariableSetStmt,
> but this can be dealt with in a separate discussion.

> As VariableSetStmt is the top-most node structure for SET/RESET
> commands, using a custom implementation may be wise in this case,

I do have a patch for this with test cases, 0001-v1-Jumble-the-SET-command.patch.
If you feel this needs a separate discussion, I can start one.

In the patch, the custom _jumbleVariableSetStmt jumbles
the kind, name, is_local and number of arguments (in the case of a list)
and tracks the locations for normalization.
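
For readers who don't open the attachment, here is a rough sketch of what such
a custom routine could look like, following the pattern of the existing custom
jumble functions in queryjumblefuncs.c.  This is illustrative only and may
differ from the attached patch; it assumes the file-local JUMBLE_* helper
macros defined in that file:

static void
_jumbleVariableSetStmt(JumbleState *jstate, Node *node)
{
	VariableSetStmt *expr = (VariableSetStmt *) node;
	int			nargs = list_length(expr->args);
	ListCell   *lc;

	JUMBLE_FIELD(kind);
	JUMBLE_STRING(name);
	JUMBLE_FIELD(is_local);

	/* jumble only the number of arguments ... */
	AppendJumble(jstate, (const unsigned char *) &nargs, sizeof(nargs));

	/* ... and record each constant's location so the values are normalized */
	foreach(lc, expr->args)
	{
		Node	   *arg = (Node *) lfirst(lc);

		if (IsA(arg, A_Const))
			RecordConstLocation(jstate, ((A_Const *) arg)->location);
	}
}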

> This choice is a bit surprising. How does it influence the jumbling?
> For example, if I add a query_jumble_ignore to it, the regression
> tests of pg_stat_statements still pass. This is going to require more
> test coverage to prove that this addition is useful.

CALL with OUT or INOUT args is a bit strange, because
as the doc [1] mentions "Arguments must be supplied for all procedure 
parameters 
that lack defaults, including OUT parameters. However, arguments 
matching OUT parameters are not evaluated, so it's customary
to just write NULL for them."

So for pgss, passing a NULL or some other value into OUT/INOUT args should
be normalized like IN args.

0001-v2-Jumble-the-CALL-command-in-pg_stat_statements.patch adds
these test cases.


Regards,

Sami Imseih
Amazon Web Services (AWS)

[1] https://www.postgresql.org/docs/current/sql-call.html



0001-v1-Jumble-the-SET-command.patch
Description: 0001-v1-Jumble-the-SET-command.patch


0001-v2-Jumble-the-CALL-command-in-pg_stat_statements.patch
Description: 0001-v2-Jumble-the-CALL-command-in-pg_stat_statements.patch


Re: Inefficiency in parallel pg_restore with many tables

2023-09-13 Thread Nathan Bossart
On Wed, Sep 13, 2023 at 11:34:50AM -0700, Nathan Bossart wrote:
> On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
>> Other than those nitpicks, I like v6.  I'll mark this RfC.
> 
> Great.  I've posted a v8 with your comments addressed in order to get one
> more round of cfbot coverage.  Assuming those tests pass and there is no
> additional feedback, I'll plan on committing this in the next few days.

Upon closer inspection, I found a rather nasty problem.  The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator.  In v9, I added the requisite ampersands.  I'm surprised this
worked at all.  I'm planning to run some additional tests to make sure this
patch set works as expected.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From df3d862907b29687503a8f362e0a3cc3dcbc3cc4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v9 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile |  1 -
 src/backend/lib/meson.build  |  1 -
 src/common/Makefile  |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +---
 src/common/meson.build   |  1 +
 src/include/lib/binaryheap.h | 26 +++
 src/tools/pgindent/typedefs.list |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 90%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..39a8243a6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include 
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,16 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +151,16 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, 

Re: Support prepared statement invalidation when result types change

2023-09-13 Thread Euler Taveira
On Tue, Sep 12, 2023, at 10:17 AM, Jelte Fennema wrote:
> When running the Postgres JDBC tests with this patchset I found a dumb
> mistake in my last patch where I didn't initialize the contents of
> orig_params correctly. This new patchset fixes that.
> 
0001:

Don't you want to execute this code path only for the EXECUTE command?
PORTAL_UTIL_SELECT includes other commands such as CALL, FETCH, SHOW, and
EXPLAIN. If so, check if commandTag is CMDTAG_EXECUTE.

Regarding tupDesc, you don't need to call FreeTupleDesc; instead you can
modify PortalStart as:

case PORTAL_UTIL_SELECT:

/*
 * We don't set snapshot here, because PortalRunUtility will
 * take care of it if needed.
 */
{
PlannedStmt *pstmt = PortalGetPrimaryStmt(portal);

Assert(pstmt->commandType == CMD_UTILITY);
/*
 * tupDesc will be filled by FillPortalStore later because
 * it might change due to replanning when ExecuteQuery calls
 * GetCachedPlan.
 */
if (portal->commandTag != CMDTAG_EXECUTE)
portal->tupDesc = 
UtilityTupleDescriptor(pstmt->utilityStmt);
}

Regarding the commit message, ...if the the result... should be fixed. The
sentence "it's actually not needed..." could be "It doesn't need to be an error
as long as it sends the RowDescription...". The sentence "This patch starts to
allow a prepared ..." could be "This patch allows a prepared ...".

0002:

You should remove this comment because it refers to the option you are
removing.

-  plan->cursor_options,
-  false);  /* not fixed result */
+  plan->cursor_options);   /* not fixed result */

You should also remove the sentence that refers to fixed_result in
CompleteCachedPlan.

* cursor_options: options bitmask to pass to planner
* fixed_result: true to disallow future changes in query's result tupdesc
*/
void
CompleteCachedPlan(CachedPlanSource *plansource,
   List *querytree_list,
   MemoryContext querytree_context,

0003:

You should initialize the new parameters (orig_param_types and orig_num_params)
in CreateCachedPlan. One suggestion is to move the following code to
CompleteCachedPlan because plansource->param_types are assigned there.

@@ -108,6 +108,10 @@ PrepareQuery(ParseState *pstate, PrepareStmt *stmt,

argtypes[i++] = toid;
}
+
+   plansource->orig_num_params = nargs;
+   plansource->orig_param_types = MemoryContextAlloc(plansource->context, 
nargs * sizeof(Oid));
+   memcpy(plansource->orig_param_types, argtypes, nargs * sizeof(Oid));
}

This comment is confusing. Since the new function
(GetCachedPlanFromRevalidated) contains almost all code from GetCachedPlan, its
content is the same as the *previous* GetCachedPlan function. You could expand
this comment a bit to make it clear that it contains the logic to decide
between a generic and a custom plan. I don't like the function name but have
only a not-so-good suggestion: GetCachedPlanAfterRevalidate. I also don't like
revalidationResult as a variable name. Why don't you keep qlist? Or use a name
near query-tree list (query_list? qtlist?). s/te caller/the caller/

+ * GetCachedPlanFromRevalidated: is the same as get GetCachedPlan, but requires
+ * te caller to first revalidate the query. This is needed for callers that
+ * need to use the revalidated plan to generate boundParams.
+ */
+CachedPlan *
+GetCachedPlanFromRevalidated(CachedPlanSource *plansource,
+ParamListInfo boundParams,
+ResourceOwner owner,
+QueryEnvironment *queryEnv,
+List *revalidationResult)


Are these names accurate? The original ones are the current ones; the new ones
are "runtime" data. It would be good to explain why a new array is required.

Oid*param_types;/* array of parameter type OIDs, or NULL */
int num_params; /* length of param_types array */
+   Oid*orig_param_types;   /* array of original parameter type OIDs,
+* or NULL */
+   int orig_num_params;/* length of orig_param_types array */

You should expand the commit message a bit: explain this feature and describe
the behavior change.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Extract numeric filed in JSONB more effectively

2023-09-13 Thread Chapman Flack

On 2023-09-04 10:35, Andy Fan wrote:

  v13 attached.  Changes include:

1. fix the bug Jian reported.
2. reduce more code duplication without DirectFunctionCall.
3. add the overlooked jsonb_path_query and jsonb_path_query_first as
candidates


Apologies for the delay. I like the way this is shaping up.

My first comment will be one that may be too large for this patch
(and too large to rest on my opinion alone); that's why I'm making
it first.

It seems at first a minor point: to me it feels like a wart to have
to pass jsonb_finish_numeric (and *only* jsonb_finish_numeric) a type
oid reflecting the target type of a cast that's going to be applied
*after jsonb_finish_numeric has done its work*, and only for the
purpose of generating a message if the jsonb type *isn't numeric*,
but saying "cannot cast to" (that later target type) instead.

I understand this is done to exactly match the existing behavior,
so what makes this a larger issue is I'm not convinced the existing
behavior is good. Therefore I'm not convinced that bending over
backward to preserve it is good.

What's not good: the places a message from cannotCastJsonbValue
are produced, there has been no attempt yet to cast anything.
The message purely tells you about whether you have the kind
of jsonb node you think you have (and array, bool, null, numeric,
object, string are the only kinds of those). If you're wrong
about what kind of jsonb node it is, you get this message.
If you're right about the kind of node, you don't get this
message, regardless of whether its value can be cast to the
later target type. For example, '32768'::jsonb::int2 produces
ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE "smallint out of range"
but that message comes from the actual int2 cast.

IMV, what the "cannot cast jsonb foo to type %s" message really
means is "jsonb foo where jsonb bar is required" and that's what
it should say, and that message depends on nothing about any
future plans for what will be done to the jsonb bar, so it can
be produced without needing any extra information to be passed.

I'm also not convinced ERRCODE_INVALID_PARAMETER_VALUE is a
good errcode for that message (whatever the wording). I do not
see much precedent elsewhere in the code for using
INVALID_PARAMETER_VALUE to signal this kind of "data value
isn't what you think it is" condition. Mostly it is used
when checking the kinds of parameters passed to a function to
indicate what it should do.

There seem to be several more likely choices for an errcode
there in the 2203x range.

But I understand that issue is not with this patch so much
as with preexisting behavior, and because it's preexisting,
there can be sound arguments against changing it.

But if that preexisting message could be changed, it would
eliminate the need for an unpleasing wart here.

Other notes are more minor:

+   else
+   /* not the desired pattern. */
+   PG_RETURN_POINTER(fexpr);
...
+
+   if (!OidIsValid(new_func_id))
+   PG_RETURN_POINTER(fexpr);
...
+   default:
+   PG_RETURN_POINTER(fexpr);

If I am reading supportnodes.h right, returning NULL is
sufficient to say no transformation is needed.

+   FuncExpr*fexpr = palloc0(sizeof(FuncExpr));
...
+   memcpy(fexpr, req->fcall, sizeof(FuncExpr));

Is the shallow copy necessary? If so, a comment explaining why
might be apropos. Because the copy is shallow, if there is any
concern about the lifespan of req->fcall, would there not be a
concern about its children?

Is there a reason not to transform the _tz flavors of
jsonb_path_query and jsonb_path_query_first?

-   JsonbValue *v;
-   JsonbValue  vbuf;
+   JsonbValue  *v;
...
+   int i;
JsonbValue *jbvp = NULL;
-   int i;

Sometimes it's worth looking over a patch for changes like these,
and reverting such whitespace or position changes, if they aren't
things you want a reviewer to be squinting at. :)

Regards,
-Chap




Re: BufferUsage counters' values have changed

2023-09-13 Thread Andres Freund
Hi,

On 2023-09-11 09:23:59 +0300, Karina Litskevich wrote:
> On Mon, Sep 11, 2023 at 9:08 AM Karina Litskevich <
> litskevichkar...@gmail.com> wrote:
> 
> > I found a bug that causes one of the differences. Wrong counter is
> > incremented
> > in ExtendBufferedRelLocal(). The attached patch fixes it and should be
> > applied
> > to REL_16_STABLE and master.
> >
> 
>  I've forgotten to attach the patch. Here it is.

> From 999a3d533a9b74c8568cc8a3d715c287de45dd2c Mon Sep 17 00:00:00 2001
> From: Karina Litskevich 
> Date: Thu, 7 Sep 2023 17:44:40 +0300
> Subject: [PATCH v1] Fix local_blks_written counter incrementation
> 
> ---
>  src/backend/storage/buffer/localbuf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/backend/storage/buffer/localbuf.c 
> b/src/backend/storage/buffer/localbuf.c
> index 1735ec7141..567b8d15ef 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -431,7 +431,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
>  
>   *extended_by = extend_by;
>  
> - pgBufferUsage.temp_blks_written += extend_by;
> + pgBufferUsage.local_blks_written += extend_by;
>  
>   return first_block;
>  }
> -- 
> 2.34.1
> 

Ugh, you're right.

The naming of local vs temp here is pretty unfortunate imo. I wonder if we
ought to at least add a comment to BufferUsage clarifying the situation? Just
reading the comments therein one would be hard pressed to figure out which of
the variables temp table activity should be added to.

I don't think we currently can write a test for this in the core tests, as the
relevant data isn't visible anywhere, iirc. Thus I added a test to
pg_stat_statements. Afaict it should be stable?

Running the attached patch through CI, planning to push after that succeeds,
unless somebody has a comment?

Greetings,

Andres Freund
>From ae7b170e9032af0fd257a39953fe0b6ef7e9637c Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 13 Sep 2023 11:46:45 -0700
Subject: [PATCH v2] Fix tracking of temp table relation extensions as writes

Karina figured out that I (Andres) confused BufferUsage.temp_blks_written with
BufferUsage.local_blks_written in fcdda1e4b5.

Tests in core PG can't easily test this, as BufferUsage is just used for
EXPLAIN (ANALYZE, BUFFERS) and pg_stat_statements. Thus this commit adds tests
for this to pg_stat_statements.

Reported-by: Karina Litskevich 
Author: Karina Litskevich 
Author: Andres Freund 
Discussion: https://postgr.es/m/cacit8ibxxa6+0amgikbefhm8b84xdqvo6d0qfd1pq1s8zps...@mail.gmail.com
Backpatch: 16-, where fcdda1e4b5 was merged
---
 src/backend/storage/buffer/localbuf.c   |  2 +-
 contrib/pg_stat_statements/expected/dml.out | 27 +
 contrib/pg_stat_statements/sql/dml.sql  | 19 +++
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1735ec71419..567b8d15ef0 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -431,7 +431,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 
 	*extended_by = extend_by;
 
-	pgBufferUsage.temp_blks_written += extend_by;
+	pgBufferUsage.local_blks_written += extend_by;
 
 	return first_block;
 }
diff --git a/contrib/pg_stat_statements/expected/dml.out b/contrib/pg_stat_statements/expected/dml.out
index 7b9c8f979ee..ede47a71acc 100644
--- a/contrib/pg_stat_statements/expected/dml.out
+++ b/contrib/pg_stat_statements/expected/dml.out
@@ -139,6 +139,33 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
  1 |1 | SELECT pg_stat_statements_reset()
 (10 rows)
 
+-- check that [temp] table relation extensions are tracked as writes
+CREATE TABLE pgss_extend_tab (a int, b text);
+CREATE TEMP TABLE pgss_extend_temp_tab (a int, b text);
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+INSERT INTO pgss_extend_tab (a, b) SELECT generate_series(1, 1000), 'something';
+INSERT INTO pgss_extend_temp_tab (a, b) SELECT generate_series(1, 1000), 'something';
+WITH sizes AS (
+  SELECT
+pg_relation_size('pgss_extend_tab') / current_setting('block_size')::int8 AS rel_size,
+pg_relation_size('pgss_extend_temp_tab') / current_setting('block_size')::int8 AS temp_rel_size
+)
+SELECT
+SUM(local_blks_written) >= (SELECT temp_rel_size FROM sizes) AS temp_written_ok,
+SUM(local_blks_dirtied) >= (SELECT temp_rel_size FROM sizes) AS temp_dirtied_ok,
+SUM(shared_blks_written) >= (SELECT rel_size FROM sizes) AS written_ok,
+SUM(shared_blks_dirtied) >= (SELECT rel_size FROM sizes) AS dirtied_ok
+FROM pg_stat_statements;
+ temp_written_ok | temp_dirtied_ok | written_ok | dirtied_ok 
+-+-++
+ t   | t   | t  | t
+(1 row)
+
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 

Re: JSON Path and GIN Questions

2023-09-13 Thread David E. Wheeler
On Sep 13, 2023, at 01:11, Erik Rijkers  wrote:

> "All use of json*() functions preclude index usage."
> 
> That sentence is missing from the documentation.

Where did that come from? Why wouldn’t JSON* functions use indexes? I see that 
the docs only mention operators; why would the corresponding functions behave 
the same?

D





Re: [PATCH] Add native windows on arm64 support

2023-09-13 Thread Peter Eisentraut

On 31.08.23 06:44, Tom Lane wrote:

I agree.  I'm really uncomfortable with claiming support for
Windows-on-ARM if we don't have a buildfarm member testing it.
For other platforms that have a track record of multiple
hardware support, it might not be a stretch ... but Windows was
so resolutely Intel-only for so long that "it works on ARM" is
a proposition that I won't trust without hard evidence.  There
are too many bits of that system that might not have gotten the
word yet, or at least not gotten sufficient testing.

My vote for this is we don't commit without a buildfarm member.


I think we can have a multi-tiered approach, where we can commit support 
but consider it experimental until we have buildfarm coverage.





Re: [PATCH] Add native windows on arm64 support

2023-09-13 Thread Andres Freund
Hi,

On 2023-08-31 14:33:56 +0900, Michael Paquier wrote:
> On Thu, Aug 31, 2023 at 01:06:53AM -0400, Tom Lane wrote:
> > That's a good point.  The hardware resources to support a buildfarm
> > animal are really pretty trivial these days.  What is important is
> > an animal owner who is willing to help chase down problems when they
> > arise.
> 
> Unfortunately I don't see myself running that at home.  It is too hot
> and humid in Summer so the machine would not last.  Something in the
> cloud would be OK for me, but AWS has no option for a Windows host
> with ARM yet.

FWIW, we could run ARM windows as part of CI. It'd be a bit of scripting work,
as google doesn't yet have readymade VM images with windows for ARM. Right now
the CI VM images for windows are based on top of the google images. Scripting
generation of ARM images would be some one-off work, but after that...

Greetings,

Andres Freund




Re: RFC: pg_stat_logmsg

2023-09-13 Thread Joe Conway

On 7/9/23 14:13, Joe Conway wrote:

On 7/7/23 01:38, Gurjeet Singh wrote:

In this case, the error message stored in pg_stat_logmsg is just '%s'.
The filename and line number columns might help identify the actual
error but it requires users to read the source code and cannot know
the actual error message.


I believe that the number of such error messages, the ones with very
little descriptive content, is very low in Postgres code. These kinds
of messages are not prevalent, and must be used when code hits obscure
conditions, like seen in your example above. These are the kinds of
errors that Joe is referring to as "should not happen". In these
cases, even if the error message was descriptive, the user will very
likely have to dive deep into code to find out the real cause.


Agreed. As an example, after running `make installcheck`

8<-
select sum(count) from pg_stat_logmsg;
   sum
--
   6005
(1 row)

select message,sum(count)
from pg_stat_logmsg
where length(message) < 30
and elevel in ('ERROR','FATAL','PANIC')
and message like '%\%s%' escape '\'
group by message
order by length(message);
  message| sum
---+-
   %s| 107
   "%s" is a view|   9
   "%s" is a table   |   4
   %s is a procedure |   1
   invalid size: "%s"|  13
   %s at or near "%s"| 131
   %s at end of input|   3
...
8<-

I only see three message formats there that are generic enough to be of
concern (the first and last two shown -- FWIW I did not see any more of
them as the fmt string gets longer)

So out of 6005 log events, 241 fit this concern.

Perhaps given the small number of message format strings affected, it
would make sense to special case those, but I am not sure it is worth
the effort, at least for version 1.


Attached is an update, this time as a patch against 17devel. It is not 
quite complete, but getting close IMHO.


Changes:

1. It is now integrated into contrib as an "Additional Supplied Extension"

2. Documentation added

3. I had a verbal conversation with Andres and he reminded me that the 
original idea for this was to collect data across fleets of pg hosts so 
we could look for how often "should never happen" errors actually 
happen. As such, we need to think in terms of aggregating the info from 
many hosts, potentially with many localized languages for the messages. 
So I converted the "message" column back to an untranslated string, and 
added a "translated_message" column which is localized.


An alternative approach might be to provide a separate function that 
accepts the message string and returns the translation. Thoughts on that?
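
For illustration, a minimal sketch of what such a SQL-callable helper could
look like (the function name and signature are made up here, and this glosses
over the gettext text-domain details):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;			/* already present once in the real module */

PG_FUNCTION_INFO_V1(pg_stat_logmsg_translate);

/*
 * Hypothetical helper: given an untranslated message format string, return
 * the translation for the server's current locale.  In practice the lookup
 * would have to consult the backend's own message catalog (text domain).
 */
Datum
pg_stat_logmsg_translate(PG_FUNCTION_ARGS)
{
	char	   *fmt = text_to_cstring(PG_GETARG_TEXT_PP(0));

	PG_RETURN_TEXT_P(cstring_to_text(_(fmt)));
}

A caller could then do
  SELECT message, pg_stat_logmsg_translate(message) FROM pg_stat_logmsg;
instead of storing a second, localized column.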


4. In the same vein, I added a pgversion column since the filename and 
line number for the same log message could vary across major or even 
minor releases.


Not done:
-
1. The main thing lacking at this point is a regression test.

2. No special casing for message == "%s". I am still not convinced it is 
worthwhile to do so.


Comments gratefully welcomed.

Thanks,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/contrib/Makefile b/contrib/Makefile
index da4e231..9144cb7 100644
*** a/contrib/Makefile
--- b/contrib/Makefile
*** SUBDIRS = \
*** 34,39 
--- 34,40 
  		pg_buffercache	\
  		pg_freespacemap \
  		pg_prewarm	\
+ 		pg_stat_logmsg \
  		pg_stat_statements \
  		pg_surgery	\
  		pg_trgm		\
diff --git a/contrib/meson.build b/contrib/meson.build
index 84d4e18..4e8ac39 100644
*** a/contrib/meson.build
--- b/contrib/meson.build
*** subdir('pgcrypto')
*** 44,49 
--- 44,50 
  subdir('pg_freespacemap')
  subdir('pg_prewarm')
  subdir('pgrowlocks')
+ subdir('pg_stat_logmsg')
  subdir('pg_stat_statements')
  subdir('pgstattuple')
  subdir('pg_surgery')
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index ab7e38b..4dfc5dd 100644
*** a/doc/src/sgml/contrib.sgml
--- b/doc/src/sgml/contrib.sgml
*** CREATE EXTENSION extension_
*** 157,162 
--- 157,163 
   
   
   
+  
   
   
   
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 4c63a7e..acede66 100644
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***
*** 145,150 
--- 145,151 
  
  
  
+ 
  
  
  
diff --git a/contrib/pg_stat_logmsg/.gitignore b/contrib/pg_stat_logmsg/.gitignore
index ...5dcb3ff .
*** a/contrib/pg_stat_logmsg/.gitignore
--- b/contrib/pg_stat_logmsg/.gitignore
***
*** 0 
--- 1,4 
+ # Generated subdirectories
+ /log/
+ /results/
+ /tmp_check/
diff --git a/contrib/pg_stat_logmsg/LICENSE b/contrib/pg_stat_logmsg/LICENSE
index ...998f814 .
*** a/contrib/pg_stat_logmsg/LICENSE
--- b/contrib/pg_stat_logmsg/LICENSE
***
*** 0 
--- 1,4 
+ This code is 

Re: Inefficiency in parallel pg_restore with many tables

2023-09-13 Thread Nathan Bossart
On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
> Hmm ... I do not like v7 very much at all.  It requires rather ugly
> changes to all of the existing callers, and what are we actually
> buying?  If anything, it makes things slower for pass-by-value items
> like integers.  I'd stick with the Datum convention in the backend.
> 
> Instead, I took a closer look through the v6 patch set.
> I think that's in pretty good shape and nearly committable,
> but I have a few thoughts:

Thanks for reviewing.  I'm fine with proceeding with the v6 approach.  Even
though the alternative approach makes the API consistent for the frontend
and backend, I'm also not a huge fan of the pointer gymnastics required in
the comparators.  Granted, we still have to do some intptr_t conversions in
pg_dump_sort.c with the v6 approach, but that seems to be an exception.
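
(For anyone following along, the conversions in question boil down to
round-tripping an integer id through the frontend heap's void *-based node
type, roughly like this; a simplified sketch against the patch's frontend API,
not the actual pg_dump_sort.c code:)

#include <stdint.h>
#include "lib/binaryheap.h"

/* store a small integer id by smuggling it through the void * node type */
static void
push_id(binaryheap *heap, int id)
{
	binaryheap_add(heap, (void *) (intptr_t) id);
}

static int
pop_id(binaryheap *heap)
{
	/* the intermediate intptr_t cast avoids pointer-size warnings */
	return (int) (intptr_t) binaryheap_remove_first(heap);
}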

> * I'm not sure about defining bh_node_type as a macro:
> 
> +#ifdef FRONTEND
> +#define bh_node_type void *
> +#else
> +#define bh_node_type Datum
> +#endif
> 
> rather than an actual typedef:
> 
> +#ifdef FRONTEND
> +typedef void *bh_node_type;
> +#else
> +typedef Datum bh_node_type;
> +#endif
> 
> My concern here is that bh_node_type is effectively acting as a
> typedef, so that pgindent might misbehave if it's not declared as a
> typedef.  On the other hand, there doesn't seem to be any indentation
> problem in the patchset as it stands, and we don't expect any code
> outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
> (If you do choose to make it a typedef, remember to add it to
> typedefs.list.)

I think a typedef makes more sense here.

> * As a matter of style, I'd recommend adding braces in places
> like this:
> 
>   if (heap->bh_size >= heap->bh_space)
> + {
> +#ifdef FRONTEND
> + pg_fatal("out of binary heap slots");
> +#else
>   elog(ERROR, "out of binary heap slots");
> +#endif
> + }
>   heap->bh_nodes[heap->bh_size] = d;
> 
> It's not wrong as you have it, but I think it's more readable
> and less easy to accidentally break with the extra braces.

Fair point.

> * In 0002, isn't the comment for binaryheap_remove_node wrong?
> 
> + * Removes the nth node from the heap.  The caller must ensure that there are
> + * at least (n - 1) nodes in the heap.  O(log n) worst case.
> 
> Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
> for clarity.

Yeah, that's a mistake.

> * I would say that this bit in 0004:
> 
> - j = removeHeapElement(pendingHeap, heapLength--);
> + j = (intptr_t) binaryheap_remove_first(pendingHeap);
> 
> needs an explicit cast to int:
> 
> + j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
> 
> otherwise some compilers might complain about the result possibly
> not fitting in "j".

Sure.  IMO it's a tad more readable, too.

> Other than those nitpicks, I like v6.  I'll mark this RfC.

Great.  I've posted a v8 with your comments addressed in order to get one
more round of cfbot coverage.  Assuming those tests pass and there is no
additional feedback, I'll plan on committing this in the next few days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7dd09c72a44e5ba293c6e9b2d2afa608ba070f03 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v8 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile |  1 -
 src/backend/lib/meson.build  |  1 -
 src/common/Makefile  |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +---
 src/common/meson.build   |  1 +
 src/include/lib/binaryheap.h | 26 +++
 src/tools/pgindent/typedefs.list |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build 

Re: BufferUsage counters' values have changed

2023-09-13 Thread Andres Freund
Hi,

On 2023-09-13 16:04:00 +0300, Nazir Bilal Yavuz wrote:
> On Mon, 11 Sept 2023 at 14:28, Karina Litskevich
>  wrote:
> >
> > Hi hackers,
> >
> > I noticed that BufferUsage counters' values are strangely different for the
> > same queries on REL_15_STABLE and REL_16_STABLE. For example, I run
> >
> > CREATE EXTENSION pg_stat_statements;
> > CREATE TEMP TABLE test(b int);
> > INSERT INTO test(b) SELECT generate_series(1,1000);
> > SELECT query, local_blks_hit, local_blks_read, local_blks_written,
> >local_blks_dirtied, temp_blks_written FROM pg_stat_statements;
> >
> > and output on REL_15_STABLE contains
> >
> > query  | INSERT INTO test(b) SELECT generate_series($1,$2)
> > local_blks_hit | 1005
> > local_blks_read| 2
> > local_blks_written | 5
> > local_blks_dirtied | 5
> > temp_blks_written  | 0
> >
> > while output on REL_16_STABLE contains
> >
> > query  | INSERT INTO test(b) SELECT generate_series($1,$2)
> > local_blks_hit | 1006
> > local_blks_read| 0
> > local_blks_written | 0
> > local_blks_dirtied | 5
> > temp_blks_written  | 8
> >
> >
> > I found a bug that causes one of the differences. Wrong counter is 
> > incremented
> > in ExtendBufferedRelLocal(). The attached patch fixes it and should be 
> > applied
> > to REL_16_STABLE and master. With the patch applied output contains
>
> Nice finding! I agree, it should be changed.
>
> > query  | INSERT INTO test(b) SELECT generate_series($1,$2)
> > local_blks_hit | 1006
> > local_blks_read| 0
> > local_blks_written | 8
> > local_blks_dirtied | 5
> > temp_blks_written  | 0
> >
> >
> > I still wonder why local_blks_written is greater than it was on 
> > REL_15_STABLE,
> > and why local_blks_read became zero. These changes are caused by fcdda1e4b5.
> > This code is new to me, and I'm still trying to understand whether it's a 
> > bug
> > in computing the counters or just changes in how many blocks are 
> > read/written
> > during the query execution. If anyone can help me, I would be grateful.
>
> I spent some time on it:
>
> local_blks_read became zero because:
> 1_ One more cache hit. It was supposed to be local_blks_read but it is
> local_blks_hit now. This is an assumption, I didn't check this deeply.
> 2_ Before fcdda1e4b5, there was one local_blks_read coming from
> buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
> RBM_ZERO_ON_ERROR, NULL) in freespace.c -> ReadBuffer_common() ->
> pgBufferUsage.local_blks_read++.
> But buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
> RBM_ZERO_ON_ERROR, NULL) is moved into the else case, so it didn't
> called and local_blks_read isn't incremented.

That imo is a legitimate difference / improvement. The read we previously did
here was unnecessary.


> local_blks_written is greater because of the combination of fcdda1e4b5
> and 00d1e02be2.
> In PG_15:
> RelationGetBufferForTuple() -> ReadBufferBI(P_NEW, RBM_ZERO_AND_LOCK)
> -> ReadBufferExtended() -> ReadBuffer_common() ->
> pgBufferUsage.local_blks_written++; (called 5 times) [0]
> In PG_16:
> 1_ 5 of the local_blks_written is coming from:
> RelationGetBufferForTuple() -> RelationAddBlocks() ->
> ExtendBufferedRelBy() -> ExtendBufferedRelCommon() ->
> ExtendBufferedRelLocal() -> pgBufferUsage.local_blks_written +=
> extend_by; (extend_by is 1, this is called 5 times) [1]
> 2_ 3 of the local_blks_written is coming from:
> RelationGetBufferForTuple() -> RecordAndGetPageWithFreeSpace() ->
> fsm_set_and_search() -> fsm_readbuf() -> fsm_extend() ->
> ExtendBufferedRelTo() -> ExtendBufferedRelCommon() ->
> ExtendBufferedRelLocal() -> pgBufferUsage.local_blks_written +=
> extend_by; (extend_by is 3, this is called 1 time) [2]
>
> I think [0] is the same path as [1] but [2] is new. 'fsm extends'
> wasn't counted in local_blks_written in PG_15. Calling
> ExtendBufferedRelTo() from fsm_extend() caused 'fsm extends' to be
> counted in local_blks_written. I am not sure which one is correct.

I think it's correct to count the fsm writes here. The pg_stat_statements
columns aren't specific to the main relation fork or such... If anything it was
a bug to not count them before.

Thanks for looking into this.

Greetings,

Andres Freund




Re: [PATCH] Add native windows on arm64 support

2023-09-13 Thread Peter Eisentraut

On 02.05.23 09:51, Niyas Sait wrote:

diff --git a/doc/src/sgml/install-windows.sgml 
b/doc/src/sgml/install-windows.sgml
index cbc70a039c..2ecd5fcf38 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -352,7 +352,8 @@ $ENV{MSBFLAGS}="/m";
Special Considerations for 64-Bit Windows
  


-   PostgreSQL will only build for the x64 architecture on 64-bit Windows.
+   PostgreSQL will only build for the x64 and ARM64 architectures on 64-bit
+   Windows.



Are there other possible architectures besides x64 and ARM64?  Itanium? 
Or should we just delete this sentence?



diff --git a/meson.build b/meson.build
index 096044628c..6cca212fae 100644
--- a/meson.build
+++ b/meson.build
@@ -2045,8 +2045,11 @@ int main(void)
  elif host_cpu == 'arm' or host_cpu == 'aarch64'
  
prog = '''

+#ifdef _MSC_VER
+#include 
+#else
  #include 
-
+#endif


Maybe keep the whitespace here.


-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
without -march=armv8-a+crc',
+  if cc.get_id() == 'msvc'
+cdata.set('USE_ARMV8_CRC32C', false)
+cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+have_optimized_crc = true
+  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
without -march=armv8-a+crc',


Add a comment here.


diff --git a/src/port/pg_crc32c_armv8.c b/src/port/pg_crc32c_armv8.c
index d8fae510cf..3d7eb748ff 100644
--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
   */
  #include "c.h"
  
+#ifndef _MSC_VER

  #include 
+#endif


Also add a comment here.

  
  #include "port/pg_crc32c.h"
  
diff --git a/src/tools/msvc/gendef.pl b/src/tools/msvc/gendef.pl

index e7cbefcbc3..934dc17b96 100644
--- a/src/tools/msvc/gendef.pl
+++ b/src/tools/msvc/gendef.pl
@@ -120,9 +120,9 @@ sub writedef
{
my $isdata = $def->{$f} eq 'data';
  
-		# Strip the leading underscore for win32, but not x64
+		# Strip the leading underscore for win32, but not x64 and aarch64


Is x64 the opposite of win32?  Does this make sense?  Should we reverse 
the logic here and single out the one variant where the stripping is 
necessary?






Re: pg_resetwal tests, logging, and docs update

2023-09-13 Thread Peter Eisentraut

On 13.09.23 16:36, Aleksander Alekseev wrote:

```
+// FIXME: why 2?
  if (set_oldest_commit_ts_xid < 2 &&
```

Should we rewrite this to < FrozenTransactionId ?


That's what I suspect, but we should confirm that.



```
+$mult = 32 * $blcksz * 4; # FIXME
```

Unless I'm missing something this $mult value is not used. Is it
really needed here?


The FIXME is that I think a multiplier *should* be applied somehow.  See 
also the FIXME in the documentation for the -c option.







Re: Add 'worker_type' to pg_stat_subscription

2023-09-13 Thread Nathan Bossart
On Wed, Sep 13, 2023 at 09:59:04AM -0700, Nathan Bossart wrote:
> On Wed, Sep 13, 2023 at 05:06:28PM +0300, Maxim Orlov wrote:
>> So, should we mark this thread as RfC?
> 
> I've done so.  Barring additional feedback, I intend to commit this in the
> next few days.

Note to self: this needs a catversion bump.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Make --help output fit within 80 columns per line

2023-09-13 Thread Greg Sabino Mullane
On Tue, Jul 4, 2023 at 9:47 PM torikoshia 
wrote:

> Since it seems preferable to have consistent line break policy and some
> people use 80-column terminal, wouldn't it be better to make all commands
> in 80
> columns per line?
>

All this seems an awful lot of work to support this mythical 80-column
terminal user. It's 2023, perhaps it's time to widen the default assumption
past 80 characters?

Cheers,
Greg


Re: Eliminate redundant tuple visibility check in vacuum

2023-09-13 Thread Andres Freund
Hi,

On 2023-09-07 18:23:22 -0400, Melanie Plageman wrote:
> From e986940e546171d1f1d06f62a101d695a8481e7a Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Wed, 6 Sep 2023 14:57:20 -0400
> Subject: [PATCH v4 2/3] Move heap_page_prune output parameters into struct
> 
> Add PruneResult, a structure containing the output parameters and result
> of heap_page_prune(). Reorganizing the results of heap_page_prune() into
> a struct simplifies the function signature and provides a location for
> future commits to store additional output parameters.
> 
> Discussion: 
> https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
> ---
>  src/backend/access/heap/pruneheap.c  | 33 +---
>  src/backend/access/heap/vacuumlazy.c | 17 +-
>  src/include/access/heapam.h  | 13 +--
>  src/tools/pgindent/typedefs.list |  1 +
>  4 files changed, 33 insertions(+), 31 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c 
> b/src/backend/access/heap/pruneheap.c
> index 392b54f093..5ac286e152 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -155,15 +155,13 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>*/
>   if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
>   {
> - int ndeleted,
> - nnewlpdead;
> + PruneResult presult;
>  
> - ndeleted = heap_page_prune(relation, buffer, vistest,
> - &nnewlpdead, NULL);
> + heap_page_prune(relation, buffer, vistest, &presult, NULL);
>  
>   /*
>* Report the number of tuples reclaimed to pgstats.  
> This is
> -  * ndeleted minus the number of newly-LP_DEAD-set items.
> +  * presult.ndeleted minus the number of 
> newly-LP_DEAD-set items.
>*
>* We derive the number of dead tuples like this to 
> avoid totally
>* forgetting about items that were set to LP_DEAD, 
> since they
> @@ -175,9 +173,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>* tracks ndeleted, since it will set the same LP_DEAD 
> items to
>* LP_UNUSED separately.
>*/
> - if (ndeleted > nnewlpdead)
> + if (presult.ndeleted > presult.nnewlpdead)
>   pgstat_update_heap_dead_tuples(relation,
> - ndeleted - nnewlpdead);
> + presult.ndeleted - presult.nnewlpdead);
>   }
>  
>   /* And release buffer lock */
> @@ -204,24 +202,22 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>   * (see heap_prune_satisfies_vacuum and
>   * HeapTupleSatisfiesVacuum).
>   *
> - * Sets *nnewlpdead for caller, indicating the number of items that were
> - * newly set LP_DEAD during prune operation.
> + * presult contains output parameters needed by callers such as the number of
> + * tuples removed and the number of line pointers newly marked LP_DEAD.
> + * heap_page_prune() is responsible for initializing it.
>   *
>   * off_loc is the current offset into the line pointer array while pruning.
>   * This is used by vacuum to populate the error context message. On-access
>   * pruning has no such callback mechanism for populating the error context, 
> so
>   * it passes NULL. When provided by the caller, off_loc is set exclusively by
>   * heap_page_prune().
> - *
> - * Returns the number of tuples deleted from the page during this call.
>   */
> -int
> +void
>  heap_page_prune(Relation relation, Buffer buffer,
>   GlobalVisState *vistest,
> - int *nnewlpdead,
> + PruneResult *presult,
>   OffsetNumber *off_loc)
>  {
> - int ndeleted = 0;
>   Pagepage = BufferGetPage(buffer);
>   BlockNumber blockno = BufferGetBlockNumber(buffer);
>   OffsetNumber offnum,
> @@ -247,6 +243,9 @@ heap_page_prune(Relation relation, Buffer buffer,
>   prstate.nredirected = prstate.ndead = prstate.nunused = 0;
>   memset(prstate.marked, 0, sizeof(prstate.marked));
>  
> + presult->ndeleted = 0;
> + presult->nnewlpdead = 0;
> +
>   maxoff = PageGetMaxOffsetNumber(page);
>   tup.t_tableOid = RelationGetRelid(prstate.rel);
>  
> @@ -321,7 +320,7 @@ heap_page_prune(Relation relation, Buffer buffer,
>   continue;
>  
>   /* Process this item or chain 

Re: Possibility to disable `ALTER SYSTEM`

2023-09-13 Thread Greg Sabino Mullane
Seems to be some resistance to getting this in core, so why not just use an
extension? I was able to create a quick POC to do just that. Hook into PG
and look for AlterSystemStmt, throw a "Sorry, ALTER SYSTEM is not currently
allowed" error. Put into shared_preload_libraries and you're done. As a
bonus, works on all supported versions, so no need to wait for Postgres 17
- or Postgres 18/19 given the feature drift this thread is experiencing :)
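
For anyone who wants to do the same, the skeleton really is small. A minimal
sketch (the hook's argument list varies a little across the supported major
versions; this one matches recent releases, and all names here are made up
rather than taken from Greg's POC):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"

PG_MODULE_MAGIC;

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static void
noaltersystem_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
							 bool readOnlyTree, ProcessUtilityContext context,
							 ParamListInfo params, QueryEnvironment *queryEnv,
							 DestReceiver *dest, QueryCompletion *qc)
{
	/* refuse ALTER SYSTEM before it reaches standard processing */
	if (IsA(pstmt->utilityStmt, AlterSystemStmt))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("Sorry, ALTER SYSTEM is not currently allowed")));

	if (prev_ProcessUtility)
		prev_ProcessUtility(pstmt, queryString, readOnlyTree, context,
							params, queryEnv, dest, qc);
	else
		standard_ProcessUtility(pstmt, queryString, readOnlyTree, context,
								params, queryEnv, dest, qc);
}

void
_PG_init(void)
{
	prev_ProcessUtility = ProcessUtility_hook;
	ProcessUtility_hook = noaltersystem_ProcessUtility;
}

Build it as a shared module, list it in shared_preload_libraries, and ALTER
SYSTEM is rejected while every other utility statement is passed through to
standard_ProcessUtility().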

Cheers,
Greg


Re: Add 'worker_type' to pg_stat_subscription

2023-09-13 Thread Nathan Bossart
On Wed, Sep 13, 2023 at 05:06:28PM +0300, Maxim Orlov wrote:
> I did a look at the patch, like the idea. The overall code is in a good
> condition, implements the described feature.

Thanks for reviewing.

> Side note: this is not a problem of this particular patch, but in
> pg_stat_get_subscription and many other places, there
> is a switch with worker types. Can we use a default section there to have
> an explicit error instead of the compiler
> warnings if somehow we decide to add another one worker type?

-1.  We want such compiler warnings to remind us to adjust the code
accordingly.  If we just rely on an ERROR in the default section, we might
miss it if there isn't a relevant test.
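
To make the trade-off concrete: with an enum-typed switch and no default case,
-Wswitch flags every switch that needs updating the moment a new member is
added. A self-contained toy (the type and value names below are hypothetical,
not the patch's):

#include <stdio.h>

typedef enum
{
	WORKERTYPE_APPLY,
	WORKERTYPE_PARALLEL_APPLY,
	WORKERTYPE_TABLESYNC
	/* adding a member here makes the switch below warn under -Wswitch */
} ToyWorkerType;

static const char *
worker_type_name(ToyWorkerType type)
{
	switch (type)
	{
		case WORKERTYPE_APPLY:
			return "apply";
		case WORKERTYPE_PARALLEL_APPLY:
			return "parallel apply";
		case WORKERTYPE_TABLESYNC:
			return "table synchronization";
			/* deliberately no default:, so the compiler tracks completeness */
	}
	return "unknown";			/* unreachable; keeps -Wreturn-type quiet */
}

int
main(void)
{
	printf("%s\n", worker_type_name(WORKERTYPE_APPLY));
	return 0;
}

With a default: case the function would still compile silently after a new
worker type was added, which is exactly the reminder that would be lost.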

> So, should we mark this thread as RfC?

I've done so.  Barring additional feedback, I intend to commit this in the
next few days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Avoid a possible null pointer (src/backend/utils/adt/pg_locale.c)

2023-09-13 Thread Jeff Davis
On Wed, 2023-09-13 at 11:48 +0900, Michael Paquier wrote:
> Hmm.  I see your point, one could be confused that the function name
> is the provider with this wording.  How about that instead:
>  ERROR:  unsupported collprovider for %s: %c
> 
> I've hidden that in a macro that uses __func__ as Jeff has suggested.
> 
> What do you think?

Looks good to me, thank you.
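
(For readers without the patch in front of them, the macro being discussed is
roughly of this shape; the macro name below is invented, only the message text
comes from the quoted mail:)

#define UNSUPPORTED_COLLPROVIDER(provider) \
	elog(ERROR, "unsupported collprovider for %s: %c", __func__, (provider))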

Regards,
Jeff Davis





Re: pg_resetwal tests, logging, and docs update

2023-09-13 Thread Aleksander Alekseev
Hi,

> I noticed that pg_resetwal has poor test coverage.  There are some TAP
> tests, but they all run with -n, so they don't actually test the full
> functionality.  (There is a non-dry-run call of pg_resetwal in the
> recovery test suite, but that is incidental.)
>
> So I added a bunch of more tests to test all the different options and
> scenarios.  That also led to adding more documentation about more
> details how some of the options behave, and some tweaks in the program
> output.
>
> It's attached as one big patch, but it could be split into smaller
> pieces, depending what gets agreement.

All in all the patch looks OK but I have a couple of nitpicks.

```
+   working on a data directory in an unclean shutdown state or with a corrupt
+   control file.
```

```
+   After running this command on a data directory with corrupted WAL or a
+   corrupt control file,
```

I'm not a native English speaker but shouldn't it be "corruptED control file"?

```
+  Force pg_resetwal to proceed even in situations where
+  it could be dangerous,
```

"where" is probably fine but wouldn't "WHEN it could be dangerous" be better?

```
+// FIXME: why 2?
 if (set_oldest_commit_ts_xid < 2 &&
```

Should we rewrite this to < FrozenTransactionId ?

```
+$mult = 32 * $blcksz * 4; # FIXME
```

Unless I'm missing something this $mult value is not used. Is it
really needed here?

-- 
Best regards,
Aleksander Alekseev




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-09-13 Thread Peter Eisentraut

On 31.08.23 23:34, Daniel Gustafsson wrote:

On 12 Jul 2023, at 01:36, Nathan Bossart  wrote:

On Wed, Jul 12, 2023 at 12:43:14AM +0200, Daniel Gustafsson wrote:

I did have coffee before now, but only found time to actually address this now
so here is a v7 with just that change and a fresh rebase.


Thanks.  I think the patch is in decent shape.


Due to ENOTENOUGHTIME it bitrotted a bit, so here is a v8 rebase which I really
hope to close in this CF.


The alignment of this output looks a bit funny:

...
Checking for prepared transactionsok
Checking for contrib/isn with bigint-passing mismatch ok
Checking for data type usage  checking all 
databases
ok
Checking for presence of required libraries   ok
Checking database user is the install userok
...


Also, you should put gettext_noop() calls into the .status = "Checking ..."
assignments and arrange to call gettext() where they are used, to maintain
the translatability.
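
For reference, the pattern being asked for is roughly the following (struct and
field names here are made up, not the patch's actual code): mark the literal
with gettext_noop() where it is stored, and translate it with _() at the point
where it is printed.

#include "postgres_fe.h"
#include "pg_upgrade.h"

/* hypothetical check descriptor; the patch's real struct looks different */
typedef struct
{
	const char *status;			/* untranslated; gettext_noop() only marks it */
} UpgradeCheck;

static const UpgradeCheck data_checks[] = {
	{.status = gettext_noop("Checking for data type usage")},
};

static void
run_check(const UpgradeCheck *check)
{
	/* translate at the point of use */
	prep_status("%s", _(check->status));

	/* ... perform the actual check here ... */
}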





Re: Add 'worker_type' to pg_stat_subscription

2023-09-13 Thread Maxim Orlov
Hi!

I took a look at the patch and I like the idea. The overall code is in good
condition and implements the described feature.

Side note: this is not a problem of this particular patch, but in
pg_stat_get_subscription and many other places there is a switch over the
worker types. Can we use a default section there to get an explicit error,
instead of relying on compiler warnings, if we somehow decide to add another
worker type?

So, should we mark this thread as RfC?

-- 
Best regards,
Maxim Orlov.


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for reviewing! Before making a patch, I can reply to the important point.

> 1. One thing to note is that if user checks whether the old cluster is
> upgradable with --check option and then try to upgrade, that will also
> fail. Because during the --check run there would at least one
> additional shutdown checkpoint WAL and then in the next run the slots
> position won't match. Note, I am saying this in context of using
> --check option with not-running old cluster. Won't that be surprising
> to users? One possibility is that we document such a behaviour and
> other is that we go back to WAL reading design where we can ignore
> known WAL records like shutdown checkpoint, XLOG_RUNNING_XACTS, etc.

Good catch; we had never considered the case where --check is executed for a
stopped cluster. You are right: the old cluster is turned on/off during the
check, and that generates a SHUTDOWN_CHECKPOINT. This leaves confirmed_flush
behind the latest checkpoint LSN.

Here are other approaches we came up with:

1. adds WARNING message when the --check is executed and slots are checked.
   We can say like: 

```
...
Checking for valid logical replication slots  
WARNING: this check generated WALs
Next pg_upgrade would fail.
Please ensure again that all WALs are replicated.
...
```


2. adds hint message in the FATAL error when the confirmed_flush is not same as
   the latest checkpoint:

```
...
Checking for valid logical replication slots  fatal

Your installation contains invalid logical replication slots.
These slots can't be copied, so this cluster cannot be upgraded.
Consider removing such slots or consuming the pending WAL if any,
and then restart the upgrade.
If you did pg_upgrade --check before this run, it may be a cause.
Please start clusters and confirm again that all changes are
replicated.
A list of invalid logical replication slots is in the file:
```

3. requests users to run pg_upgrade --check on a backup database, if the old cluster
   has logical slots. Basically they save the whole cluster before doing pg_upgrade,
   so it may be acceptable. This does not require any code modification.

How do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [ psql - review request ] review request for \d+ tablename, \d+ indexname indenting

2023-09-13 Thread Peter Eisentraut

On 12.09.23 09:27, 쿼리트릭스 wrote:

Thank you for letting me know more about the test method.
As you said, we applied the patch using git diff and created a test case 
on the src/test/regress/sql.


Because of the change of the psql output, a lot of existing test cases 
are now failing.  You should run "make check" and fix up the failures. 
Also, your new test file "subpartition_indentation" isn't actually run 
because it was not added to src/test/regress/parallel_schedule.  I 
suspect you probably don't want to add a new test file for this but 
instead see if the existing tests cover this case.






Re: Redundant Unique plan node for table with a unique index

2023-09-13 Thread Daniel Gustafsson
> On 13 Sep 2023, at 15:22, Damir Belyalov  wrote:

> There is a table with a unique index on it and we have a query that searching 
> DISTINCT values on this table on columns of unique index.

> We can see that Unique node is redundant for this case. So I implemented a 
> simple patch that removes Unique node from the plan.

Is this query pattern common enough to warrant spending time on in the planner
(are there perhaps ORMs that generate such)?  Have you measured the overhead of
this?

--
Daniel Gustafsson





Redundant Unique plan node for table with a unique index

2023-09-13 Thread Damir Belyalov
Hello!

There is a table with a unique index on it, and we have a query searching for
DISTINCT values on this table over the columns of the unique index. Example:


create table a (n int);
insert into a (n) select x from generate_series(1, 14) as g(x);
create unique index on a (n);
explain select distinct n from a;
 QUERY PLAN


 Unique  (cost=0.42..6478.42 rows=14 width=4)
   ->  Index Only Scan using a_n_idx on a  (cost=0.42..6128.42 rows=14
width=4)
(2 rows)


We can see that the Unique node is redundant in this case. So I implemented a
simple patch that removes the Unique node from the plan.
After the patch:


explain select distinct n from a;
   QUERY PLAN
-
 Seq Scan on a  (cost=0.00..2020.00 rows=14 width=4)
(1 row)


The patch is rather simple and doesn't consider queries with joins. The
criterion for when the Unique node should be removed is the case where the set
of Vars in the DISTINCT clause contains the unique index columns from the same
table. Another example:
CREATE TABLE a (n int, m int);
CREATE UNIQUE INDEX ON a (n);
SELECT DISTINCT (n,m) FROM a;
The Unique node should be deleted because n is contained in (n,m).


The patch doesn't consider these cases:
1. DISTINCT ON [EXPR]
   Because this case can need grouping.
2. Subqueries.
   Because this case can need grouping:
   CREATE TABLE a (n int);
   CREATE UNIQUE INDEX ON a (n);
   SELECT DISTINCT g FROM (SELECT * FROM a) AS g;
3. Joins, because they would complicate the code.
   Example:
   SELECT DISTINCT a.n1 FROM a JOIN b ON a.n1 = b.n1;
   where a.n1 and b.n1 should have unique indexes and the join qual should be
   on these index columns,
   or
   a has a unique index on n1 and b is "unique for a" on the join qual.


I am wondering if there are opportunities for further development of this
patch, in particular for the JOIN cases. For several levels of JOINs we would
need to understand which set of columns is unique for every joinrel in the
query. In general terms I identified two cases in which a joinrel "preserves"
a unique index from a table: when the tables are joined on unique index
columns, and when one table has a unique index and it is "unique for" (has one
common tuple with) the other table.


Regards,
Damir Belyalov
Postgres Professional
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 44efb1f4ebc..8f9912f48ad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1266,6 +1266,47 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 	return (Expr *) preprocess_expression(root, (Node *) expr, EXPRKIND_PHV);
 }
 
+static bool
+is_distinct_redundant(Query *parse, RelOptInfo *current_rel, PathTarget *sort_input_target)
+{
+	List	 *distinct_vars = NIL;
+	ListCell *lc1, *lc2;
+
+	/* Doesn't process DISTINCT ON because it can need grouping */
+	if (parse->hasDistinctOn)
+		return false;
+
+	/* Fetch Vars from DISTINCT clause */
+	foreach(lc1, sort_input_target->exprs)
+	{
+		Var *distinct_expr = (Var *) lfirst(lc1);
+
+		if (IsA(distinct_expr, Var))
+			distinct_vars = list_append_unique_int(distinct_vars, distinct_expr->varattno);
+		else
+			/* Doesn't process this case because it can need grouping */
+			return false;
+	}
+
+	foreach(lc2, current_rel->indexlist)
+	{
+		IndexOptInfo *indexinfo = (IndexOptInfo *) lfirst(lc2);
+		List		 *unique_indexkeys = NIL;
+		int			  i;
+
+		if (indexinfo->unique)
+		{
+			for (i = 0; i < indexinfo->ncolumns; i++)
+unique_indexkeys = list_append_unique_int(unique_indexkeys, indexinfo->indexkeys[i]);
+
+			if (list_difference_int(unique_indexkeys, distinct_vars) == NIL)
+return true;
+		}
+	}
+
+	return false;
+}
+
 /*
  * grouping_planner
  *	  Perform planning steps related to grouping, aggregation, etc.
@@ -1694,8 +1735,9 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 		 */
 		if (parse->distinctClause)
 		{
-			current_rel = create_distinct_paths(root,
-current_rel);
+			if (!is_distinct_redundant(parse, current_rel, sort_input_target))
+current_rel = create_distinct_paths(root,
+	current_rel);
 		}
 	}			/* end of if (setOperations) */
 
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9b8638f286a..49223c5be10 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5645,18 +5645,15 @@ select d.* from d left join (select * from b group by b.id, b.c_id) s
 explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
-  QUERY PLAN  
---
+ QUERY PLAN 
+
  Merge Right Join
Merge Cond: (b.id = d.a)
-   ->  Unique
- ->  Sort
-   

Re: logical decoding and replication of sequences, take 2

2023-09-13 Thread Ashutosh Bapat
On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila  wrote:
>
> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila  wrote:
> >
> > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > >  wrote:
> > > >
> > > > >
> > > > > But whether or not that's the case, downstream should not request (and
> > > > > hence receive) any changes that have been already applied (and
> > > > > committed) downstream as a principle. I think a way to achieve this is
> > > > > to update the replorigin_session_origin_lsn so that a sequence change
> > > > > applied once is not requested (and hence sent) again.
> > > > >
> > > >
> > > > I guess we could update the origin, per attached 0004. We don't have
> > > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > > don't need that.
> > > >
> > > > The attached patch merges the earlier improvements, except for the part
> > > > that experimented with adding a "fake" transaction (which turned out to
> > > > have a number of difficult issues).
> > >
> > > 0004 looks good to me.
> >
> >
> > + {
> >   CommitTransactionCommand();
> > +
> > + /*
> > + * Update origin state so we don't try applying this sequence
> > + * change in case of crash.
> > + *
> > + * XXX We don't have replorigin_session_origin_timestamp, but we
> > + * can just leave that set to 0.
> > + */
> > + replorigin_session_origin_lsn = seq.lsn;
> >
> > IIUC, your proposal is to update the replorigin_session_origin_lsn, so
> > that after restart, it doesn't use some prior origin LSN to start with
> > which can in turn lead the sequence to go backward. If so, it should
> > be updated before calling CommitTransactionCommand() as we are doing
> > in apply_handle_commit_internal(). If that is not the intention then
> > it is not clear to me how updating replorigin_session_origin_lsn after
> > commit is helpful.
> >
>
> typedef struct ReplicationState
> {
> ...
> /*
>  * Location of the latest commit from the remote side.
>  */
> XLogRecPtrremote_lsn;
>
> This is the variable that will be updated with the value of
> replorigin_session_origin_lsn. This means we will now track some
> arbitrary LSN location of the remote side in this variable. The above
> comment makes me wonder if there is anything we are missing or if it
> is just a matter of updating this comment because before the patch we
> always adhere to what is written in the comment.

I don't think we are missing anything. This value is used to track the
remote LSN up to which all the commits from upstream have been applied
locally. Since a non-transactional sequence change is like a single
WAL-record transaction, its LSN acts as the LSN of the mini-commit.
So it should be fine to update remote_lsn with the sequence WAL record's
end LSN. That's what the patches do. I don't see any hazard. But you
are right, we need to update the comments, here and also at other places
like replorigin_session_advance(), which uses remote_commit as the name
of the argument that gets assigned to ReplicationState::remote_lsn.

-- 
Best Wishes,
Ashutosh Bapat




Re: BufferUsage counters' values have changed

2023-09-13 Thread Nazir Bilal Yavuz
Hi,

On Mon, 11 Sept 2023 at 14:28, Karina Litskevich
 wrote:
>
> Hi hackers,
>
> I noticed that BufferUsage counters' values are strangely different for the
> same queries on REL_15_STABLE and REL_16_STABLE. For example, I run
>
> CREATE EXTENSION pg_stat_statements;
> CREATE TEMP TABLE test(b int);
> INSERT INTO test(b) SELECT generate_series(1,1000);
> SELECT query, local_blks_hit, local_blks_read, local_blks_written,
>local_blks_dirtied, temp_blks_written FROM pg_stat_statements;
>
> and output on REL_15_STABLE contains
>
> query  | INSERT INTO test(b) SELECT generate_series($1,$2)
> local_blks_hit | 1005
> local_blks_read| 2
> local_blks_written | 5
> local_blks_dirtied | 5
> temp_blks_written  | 0
>
> while output on REL_16_STABLE contains
>
> query  | INSERT INTO test(b) SELECT generate_series($1,$2)
> local_blks_hit | 1006
> local_blks_read| 0
> local_blks_written | 0
> local_blks_dirtied | 5
> temp_blks_written  | 8
>
>
> I found a bug that causes one of the differences. Wrong counter is incremented
> in ExtendBufferedRelLocal(). The attached patch fixes it and should be applied
> to REL_16_STABLE and master. With the patch applied output contains

Nice finding! I agree, it should be changed.

> query  | INSERT INTO test(b) SELECT generate_series($1,$2)
> local_blks_hit | 1006
> local_blks_read| 0
> local_blks_written | 8
> local_blks_dirtied | 5
> temp_blks_written  | 0
>
>
> I still wonder why local_blks_written is greater than it was on REL_15_STABLE,
> and why local_blks_read became zero. These changes are caused by fcdda1e4b5.
> This code is new to me, and I'm still trying to understand whether it's a bug
> in computing the counters or just changes in how many blocks are read/written
> during the query execution. If anyone can help me, I would be grateful.

I spent some time on it:

local_blks_read became zero because:
1_ One more cache hit. It was supposed to be local_blks_read but it is
local_blks_hit now. This is an assumption, I didn't check this deeply.
2_ Before fcdda1e4b5, there was one local_blks_read coming from
buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
RBM_ZERO_ON_ERROR, NULL) in freespace.c -> ReadBuffer_common() ->
pgBufferUsage.local_blks_read++.
But buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
RBM_ZERO_ON_ERROR, NULL) is moved into the else case, so it isn't
called and local_blks_read isn't incremented.

local_blks_written is greater because of the combination of fcdda1e4b5
and 00d1e02be2.
In PG_15:
RelationGetBufferForTuple() -> ReadBufferBI(P_NEW, RBM_ZERO_AND_LOCK)
-> ReadBufferExtended() -> ReadBuffer_common() ->
pgBufferUsage.local_blks_written++; (called 5 times) [0]
In PG_16:
1_ 5 of the local_blks_written is coming from:
RelationGetBufferForTuple() -> RelationAddBlocks() ->
ExtendBufferedRelBy() -> ExtendBufferedRelCommon() ->
ExtendBufferedRelLocal() -> pgBufferUsage.local_blks_written +=
extend_by; (extend_by is 1, this is called 5 times) [1]
2_ 3 of the local_blks_written is coming from:
RelationGetBufferForTuple() -> RecordAndGetPageWithFreeSpace() ->
fsm_set_and_search() -> fsm_readbuf() -> fsm_extend() ->
ExtendBufferedRelTo() -> ExtendBufferedRelCommon() ->
ExtendBufferedRelLocal() -> pgBufferUsage.local_blks_written +=
extend_by; (extend_by is 3, this is called 1 time) [2]

I think [0] is the same path as [1] but [2] is new. 'fsm extends'
wasn't counted in local_blks_written in PG_15. Calling
ExtendBufferedRelTo() from fsm_extend() caused 'fsm extends' to be
counted in local_blks_written. I am not sure which one is correct.

I hope these help.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Make --help output fit within 80 columns per line

2023-09-13 Thread torikoshia

On 2023-09-12 15:27, Peter Eisentraut wrote:

I like this work a lot.  It's good to give developers easy to verify
guidance about formatting the --help messages.

However, I think instead of just adding a bunch of line breaks to
satisfy the test, we should really try to make the lines shorter by
rewording.  Otherwise, chances are we are making the usability worse
for many users, because it's more likely that part of the help will
scroll off the screen.  For example, in many cases, we could replace
"do not" by "don't", or we could decrease the indentation of the
second column by a few spaces, or just reword altogether.

Also, it would be very useful if the TAP test function could print out
the violating lines if a test fails.  (Similar to how is() and like()
print the failing values.)  Maybe start with that, and then it's
easier to play with different wording variants to try to make it fit
better.


Thanks for the review!
I'll try to fix the patch according to your comments.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Row pattern recognition

2023-09-13 Thread Tatsuo Ishii
> On 9/13/23 07:14, Tatsuo Ishii wrote:
>>> 
>> I was looking for this but I only found ISO/IEC 19075-5:2021.
>> https://www.iso.org/standard/78936.html
>> Maybe 19075-5:2021 is the latest one?
> 
> Yes, probably.  Sorry.

No problem. Thanks for confirmation.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: make add_paths_to_append_rel aware of startup cost

2023-09-13 Thread Andy Fan
> - We shouldn't do the optimization if there are still more tables to join,
>   the reason is similar to has_multiple_baserels(root) in
>   set_subquery_pathlist.
>

After some internal discussion, we have two different choices here. Let's
call the current choice method-a and the new choice method-b. Method-b
just ignores the "no more tables to join" limitation and builds the append
path with both the cheapest startup cost and the cheapest total cost; this
is much like the way we join a plain relation with another relation. The
uneasy part is that it uses the cheapest startup cost rather than the
cheapest fractional cost.

Method-a is much the same as what set_subquery_pathlist is doing, which has
the "no more tables to join" limitation and no "cheapest startup cost" part.

Ideally we could apply both strategies if we don't mind the effort: if
there are no more tables to join, use method-a; otherwise use method-b.
With this thinking, we could even apply the same strategy to plain
relations as well.

However, I am not sure whether the "cheapest startup cost" part is a real
problem. If it is not, we can apply method-b directly and modify
set_subquery_pathlist to do the same for consistency.


-- 
Best Regards
Andy Fan


Re: Quoting filename in using facing log messages

2023-09-13 Thread Daniel Gustafsson
> On 13 Sep 2023, at 13:55, Peter Eisentraut  wrote:
> 
> On 13.09.23 13:48, Daniel Gustafsson wrote:
>> Looking at zqfugous5du4d...@paquier.xyz I noticed that we had a a few 
>> instances
>> of filenames in userfacing log messages (ie not elog or DEBUGx etc) not being
>> quoted, where the vast majority are quoted like \"%s\".  Any reason not to
>> quote them as per the attached to be consistent across all log messages?
> 
> Since WAL file names have a predictable format, there is less pressure to 
> quote them to avoid ambiguities.  But in general we should try to be 
> consistent

Correct, this is all for consistency.

> so your patch makes sense to me.

Thanks!

It might be worth concatenating the errmsg() while there since we typically
don't linebreak errmsg strings anymore for greppability:

-errmsg("could not write to log file %s "
-   "at offset %u, length %zu: %m",
+errmsg("could not write to log file \"%s\" at offset %u, length %zu: 
%m",

I don't have strong feelings wrt that, just have a vague memory of "concatenate
when touching" as an informal guideline.

--
Daniel Gustafsson





Re: Quoting filename in using facing log messages

2023-09-13 Thread Peter Eisentraut

On 13.09.23 13:48, Daniel Gustafsson wrote:

Looking at zqfugous5du4d...@paquier.xyz I noticed that we had a few
of filenames in userfacing log messages (ie not elog or DEBUGx etc) not being
quoted, where the vast majority are quoted like \"%s\".  Any reason not to
quote them as per the attached to be consistent across all log messages?


Since WAL file names have a predictable format, there is less pressure 
to quote them to avoid ambiguities.  But in general we should try to be 
consistent, so your patch makes sense to me.






Re: Synchronizing slots from primary to standby

2023-09-13 Thread Amit Kapila
On Wed, Sep 13, 2023 at 4:54 PM shveta malik  wrote:
>
> PFA  v17. It has below changes:
>

@@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  }
  else
  {
+ /*
+ * Before we send out the last set of changes to logical decoding
+ * output plugin, wait for specified streaming replication standby
+ * servers (if any) to confirm receipt of WAL upto commit_lsn.
+ */
+ WaitForStandbyLSN(commit_lsn);

It seems the first patch has a wait logic for every commit. I think it
is better to integrate this wait with WalSndWaitForWal() as suggested
by Andres in his email[1].

[1] - 
https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de

-- 
With Regards,
Amit Kapila.




Quoting filename in using facing log messages

2023-09-13 Thread Daniel Gustafsson
Looking at zqfugous5du4d...@paquier.xyz I noticed that we had a few instances
of filenames in userfacing log messages (ie not elog or DEBUGx etc) not being
quoted, where the vast majority are quoted like \"%s\".  Any reason not to
quote them as per the attached to be consistent across all log messages?

--
Daniel Gustafsson



quote_file_errmsg.diff
Description: Binary data


Re: logical decoding and replication of sequences, take 2

2023-09-13 Thread Ashutosh Bapat
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
 wrote:
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me. But I need to review the impact of not setting
> replorigin_session_origin_timestamp.

I think it will be good to set replorigin_session_origin_timestamp = 0
explicitly so as not to pick up a garbage value. The timestamp is
written to the commit record. Beyond that I don't see any use of it.
It is further passed downstream if there is a cascaded logical
replication setup, but I don't see it being used. So it should be fine
to leave it 0. I don't think we can use logically replicated sequences
in a multi-master environment where the timestamp may be used to
resolve conflicts. Such a setup would require distributed sequence
management, which cannot be achieved by logical replication alone.

In short, I didn't find any hazard in leaving the
replorigin_session_origin_timestamp as 0.
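
Putting that together with the earlier point about ordering, the shape of the
change under discussion would be roughly this (a sketch of the hunk's shape
only, not the actual patch):

	/*
	 * Update origin state before committing, so that after a restart we do
	 * not request (and apply) this sequence change again.  There is no
	 * meaningful timestamp for a non-transactional change, so clear it
	 * explicitly instead of leaving whatever happened to be set.
	 */
	replorigin_session_origin_lsn = seq.lsn;
	replorigin_session_origin_timestamp = 0;

	CommitTransactionCommand();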

-- 
Best Wishes,
Ashutosh Bapat




Re: Row pattern recognition

2023-09-13 Thread Vik Fearing

On 9/13/23 07:14, Tatsuo Ishii wrote:




I was looking for this but I only found ISO/IEC 19075-5:2021.
https://www.iso.org/standard/78936.html

Maybe 19075-5:2021 is the latest one?


Yes, probably.  Sorry.
--
Vik Fearing





Re: Synchronizing slots from primary to standby

2023-09-13 Thread shveta malik
On Wed, Sep 6, 2023 at 2:18 PM shveta malik  wrote:
>
> On Fri, Sep 1, 2023 at 1:59 PM Peter Smith  wrote:
> >
> > Hi Shveta. Here are some comments for patch v14-0002
> >
> > The patch is large, so my code review is a WIP... more later next week...
> >
>
> Thanks Peter for the feedback. I have tried to address most of these
> in v15. Please find my response inline for the ones which I have not
> addressed.
>
> >
> > 26.
> > +typedef struct SlotSyncWorker
> > +{
> > + /* Time at which this worker was launched. */
> > + TimestampTz launch_time;
> > +
> > + /* Indicates if this slot is used or free. */
> > + bool in_use;
> > +
> > + /* The slot in worker pool to which it is attached */
> > + int slot;
> > +
> > + /* Increased every time the slot is taken by new worker. */
> > + uint16 generation;
> > +
> > + /* Pointer to proc array. NULL if not running. */
> > + PGPROC*proc;
> > +
> > + /* User to use for connection (will be same as owner of subscription). */
> > + Oid userid;
> > +
> > + /* Database id to connect to. */
> > + Oid dbid;
> > +
> > + /* Count of Database ids it manages */
> > + uint32 dbcount;
> > +
> > + /* DSA for dbids */
> > + dsa_area *dbids_dsa;
> > +
> > + /* dsa_pointer for database ids it manages */
> > + dsa_pointer dbids_dp;
> > +
> > + /* Mutex to access dbids in dsa */
> > + slock_t mutex;
> > +
> > + /* Info about slot being monitored for worker's naptime purpose */
> > + SlotSyncWorkerWatchSlot monitor;
> > +} SlotSyncWorker;
> >
> > There seems an awful lot about this struct which is common with
> > 'LogicalRepWorker' struct.
> >
> > It seems a shame not to make use of the commonality instead of all the
> > cut/paste here.
> >
> > E.g. Can it be rearranged so all these common fields are shared:
> > - launch_time
> > - in_use
> > - slot
> > - generation
> > - proc
> > - userid
> > - dbid
> >
>
> Sure, I had this in mind along with previous comments where it was
> suggested to merge similar functions like
> WaitForReplicationWorkerAttach, WaitForSlotSyncWorkerAttach etc. That
> merging could only be possible if we try to merge the common part of
> these structures. This is WIP, will be addressed in the next version.
>

This has been addressed in version-17 now.

thanks
Shveta




Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Daniel Gustafsson
> On 13 Sep 2023, at 08:18, Michael Paquier  wrote:

> f = AllocateFile(path, PG_BINARY_R);
> if (!f)
> ereport(ERROR,
> -(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> - errmsg("invalid snapshot identifier: \"%s\"", idstr)));
> +(errcode_for_file_access(),
> + errmsg("could not open file \"%s\" for reading: %m",
> +path)));
> 
> Agreed that this just looks like a copy-pasto.  The path provides
> enough context about what's being read, so using this generic error
> message is fine.  Will apply if there are no objections.

+1. This errmsg is already present so it eases the translation burden as well.

--
Daniel Gustafsson





RE: Synchronizing slots from primary to standby

2023-09-13 Thread Hayato Kuroda (Fujitsu)
Dear Shveta,

Sorry for the late response.

> Thanks Kuroda-san for the feedback.
> >
> > 01. General
> >
> > I think the documentation can be added, not only GUCs. How about adding
> examples
> > for combinations of physical and logical replications?  You can say that 
> > both of
> > physical primary can be publisher and slots on primary/standby are
> synchronized.
> >
> 
> I did not fully understand this. Can you please state a clear example.
> We are only synchronizing logical replication slots in this draft and
> that too on physical standby from primary. So the last statement is
> not completely true.

I expected to add a new subsection to "Log-Shipping Standby Servers". I think we
can add information like the following:

* a logical replication publisher can also be replicated
* for that, a physical replication slot must be defined on the primary
* then we can set up standby_slot_names (on the primary) and synchronize_slot_names
  (on both servers)
* slots are synchronized automatically

> > 02. General
> >
> > standby_slot_names ensures that the physical standby is always ahead of the
> > subscriber, but I think it may be not sufficient. There is a possibility
> > that the primary server does not have any physical slots. So it expects a
> > slot to be present. In this case the physical standby may be behind the
> > subscriber and the system may be confused when the failover occurs.
> 
> Currently there is a check in slot-sync worker which mandates that
> there is a physical slot present between primary and standby for this
> feature to proceed. So that confusion state will not arise.
> + /* WalRcvData is not set or primary_slot_name is not set yet */
> + if (!WalRcv || WalRcv->slotname[0] == '\0')
> + return naptime;

Right, but I wanted to know why it is needed. One motivation seemed to be to know the
WAL location of the physical standby, but I thought that struct WalSnd.apply could
also be used. Is it bad to assume that the physical walsender always exists?

> >Can't we specify the name of standby via application_name or something?
> 
> So do you mean that in absence of a physical slot (if we plan to
> support that), we let primary know about standby(slots-synchronization
> client) through application_name?

Yes, it is what I considered.

> >
> > 03. General
> >
> > In this architecture, the syncslot worker is launched per db and they
> > independently connects to primary, right?
> 
> Not completely true. Each slotsync worker is responsible for managing
> N dbs. Here 'N' =  'Number of distinct dbs for slots in
> synchronize_slot_names'/ 'number of max_slotsync_workers configured'
> for cases where dbcount exceeds workers configured.
> And if dbcount < max_slotsync_workers, then we launch only that many
> workers equal to dbcount and each worker manages a single db. Each
> worker independently connects to primary. Currently it makes a
> connection multiple times, I am optimizing it to make connection only
> once and then after each SIGHUP assuming 'primary_conninfo' may
> change. This change will be in the next version.
> 
> 
> >I'm not sure it is efficient, but I
> > come up with another architecture - only a worker (syncslot 
> > receiver)connects
> > to the primary and other workers (syncslot worker) receives infos from it 
> > and
> > updates. This can reduce the number of connections so that it may slightly
> > improve the latency of network. How do you think?
> >
> 
> I feel it may help in reducing network latency, but not sure if it
> could be more efficient in keeping the lsns in sync. I feel it may
> introduce lag due to the fact that only one worker is getting all the
> info from primary and the actual synchronizing workers are waiting on
> that worker. This lag may be more when the number of slots are huge.
> We have run some performance tests on the design implemented
> currently, please have a look at emails around [1] and [2].

Thank you for teaching! Yeah, I agreed that another point might be a bottleneck.
It could be recalled in future, but currently we do not have to consider...

> > 04. ReplSlotSyncMain
> >
> > Does the worker have to connect to the specific database?
> >
> >
> > ```
> > /* Connect to our database. */
> >
> BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid,
> >
> MySlotSyncWorker->userid,
> >
> 0);
> > ```
> 
> Since we are using libpq public interface 'walrcv_exec=libpqrcv_exec'
> to connect to primary, this needs database connection. It errors out
> in the absence of 'MyDatabaseId'. Do you think db-connection can have
> some downsides?
> 

I considered that we should not grant privileges to access data more than
necessary.
It might be better if we can avoid connecting to a specific database, but I'm
not sure whether we should add a new walreceiver API to handle it. FYI, I
checked the physical walreceiver for reference, but it is not a background
worker, so it was not applicable here.


And followings are further comments.

1. 
I considered the combination with the feature and 

Re: Detoasting optionally to make Explain-Analyze less misleading

2023-09-13 Thread Jehan-Guillaume de Rorthais
Hi Stepan & all,

On Tue, 12 Sep 2023 17:16:00 +0200
stepan rutz  wrote:

...
> Attached a new patch. Hoping for feedback,

Nice addition to EXPLAIN!

On the feature front, what about adding the actual detoasting/serializing time
in the explain output?

That could be:

  =>  explain (analyze,serialize,costs off,timing off) 
  select * from test_detoast;
QUERY PLAN
  ─
   Seq Scan on public.test_detoast (actual rows=Nv loops=N)
   Planning Time: N ms
   Execution Time: N ms
   Serialize Time: N ms





Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Bharath Rupireddy
On Wed, Sep 13, 2023 at 3:32 PM Yogesh Sharma
 wrote:
>
> On 9/13/23 02:10, Bharath Rupireddy wrote:
> > Hi,
> >
> > When a snapshot file reading fails in ImportSnapshot(), it errors out
> > with "invalid snapshot identifier". This message better suits for
> > snapshot identifier parsing errors which is being done just before the
> > file reading. The attached patch adds a generic file reading error
> > message with path to help distinguish if the issue is with snapshot
> > identifier parsing or file reading.
> >
> I suggest error message to include "snapshot" keyword in message, like this:
>
> errmsg("could not open snapshot file \"%s\" for reading: %m",
>
> and also tweak other messages accordingly.

-1. The path includes the pg_snapshots there which is enough to give
the clue, so no need to say "could not open snapshot file". AFAICS,
this is the typical messaging followed across postgres code for
AllocateFile failures.

[1]
/* Define pathname of exported-snapshot files */
#define SNAPSHOT_EXPORT_DIR "pg_snapshots"

/* OK, read the file */
snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", idstr);

f = AllocateFile(path, PG_BINARY_R);
if (!f)
ereport(ERROR,
(errcode_for_file_access(),
 errmsg("could not open file \"%s\" for reading: %m",
path)));

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: persist logical slots to disk during shutdown checkpoint

2023-09-13 Thread Amit Kapila
On Wed, Sep 13, 2023 at 12:45 PM Michael Paquier  wrote:
>
> On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:
> > Consider if we move this call to bgwriter (aka flushing slots is no
> > longer part of a checkpoint), Would that be okay? Previously, I think
> > it was okay but not now. I see an argument to keep that as it is as
> > well because we have already mentioned the special shutdown checkpoint
> > case. By the way, I have changed this because Ashutosh felt it is no
> > longer correct to keep the first sentence as it is. See his email[1]
> > (Relying on the first sentence, ...).
>
> Hmmm..  Okay..
>

The patch is updated as per recent discussion.

-- 
With Regards,
Amit Kapila.


v12-0001-Flush-logical-slots-to-disk-during-a-shutdown-ch.patch
Description: Binary data


Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Yogesh Sharma

On 9/13/23 02:10, Bharath Rupireddy wrote:

Hi,

When a snapshot file reading fails in ImportSnapshot(), it errors out
with "invalid snapshot identifier". This message better suits for
snapshot identifier parsing errors which is being done just before the
file reading. The attached patch adds a generic file reading error
message with path to help distinguish if the issue is with snapshot
identifier parsing or file reading.


I suggest error message to include "snapshot" keyword in message, like this:

errmsg("could not open snapshot file \"%s\" for reading: %m",

and also tweak other messages accordingly.


--
Kind Regards,
Yogesh Sharma
PostgreSQL, Linux, and Networking Expert
Open Source Enthusiast and Advocate





Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-13 Thread Amit Kapila
On Tue, Sep 12, 2023 at 5:20 PM Hayato Kuroda (Fujitsu)
 wrote:
>

Few comments:
=
1. One thing to note is that if the user checks whether the old cluster is
upgradable with the --check option and then tries to upgrade, that will also
fail, because during the --check run there would be at least one
additional shutdown checkpoint WAL record, and then in the next run the slots'
positions won't match. Note, I am saying this in the context of using the
--check option with a not-running old cluster. Won't that be surprising
to users? One possibility is that we document such a behaviour and the
other is that we go back to the WAL reading design where we can ignore
known WAL records like shutdown checkpoint, XLOG_RUNNING_XACTS, etc.

2.
+ /*
+ * Store the names of output plugins as well. There is a possibility
+ * that duplicated plugins are set, but the consumer function
+ * check_loadable_libraries() will avoid checking the same library, so
+ * we do not have to consider their uniqueness here.
+ */
+ for (slotno = 0; slotno < slot_arr->nslots; slotno++)
+ {
+ os_info.libraries[totaltups].name = pg_strdup(slot_arr->slots[slotno].plugin);

Here, we should ignore invalid slots.

3.
+ if (!live_check && !slot->caught_up)
+ {
+ if (script == NULL &&
+ (script = fopen_priv(output_path, "w")) == NULL)
+ pg_fatal("could not open file \"%s\": %s",
+ output_path, strerror(errno));
+
+ fprintf(script,
+ "The slot \"%s\" has not consumed the WAL yet\n",
+ slot->slotname);

Is it possible to print the LSN locations of slot and last checkpoint?
I think that will aid in debugging the problems if any and could be
helpful to users as well.
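To make (3) concrete, something like the following could work (just a sketch;
the confirmed_flush field name and the variable holding the old cluster's last
checkpoint location are assumptions, not the actual patch code):

fprintf(script,
		"The slot \"%s\" has not consumed the WAL yet "
		"(confirmed_flush %X/%X, last checkpoint %X/%X)\n",
		slot->slotname,
		LSN_FORMAT_ARGS(slot->confirmed_flush),
		LSN_FORMAT_ARGS(last_checkpoint_lsn));

That way a user looking at the report can immediately see how far the slot is
behind the shutdown checkpoint.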

-- 
With Regards,
Amit Kapila.




Re: [dynahash] do not refill the hashkey after hash_search

2023-09-13 Thread John Naylor
On Wed, Sep 13, 2023 at 3:46 PM Junwang Zhao  wrote:
>
> On Wed, Sep 13, 2023 at 4:22 PM John Naylor
>  wrote:

> > - memset(part_entry, 0, sizeof(LogicalRepPartMapEntry));
> > - part_entry->partoid = partOid;
> > + Assert(part_entry->partoid == partOid);
> > + memset(entry, 0, sizeof(LogicalRepRelMapEntry));
> >
> > This is making an assumption that the non-key part of
LogicalRepPartMapEntry will never get new members. Without knowing much
about this code, it seems like a risk in the abstract.
>
> What do you mean by 'the non-key part of LogicalRepPartMapEntry will
> never get new members'?

I mean, if this struct:

> typedef struct LogicalRepPartMapEntry
> {
> Oid partoid; /* LogicalRepPartMap's key */
> LogicalRepRelMapEntry relmapentry;
> } LogicalRepPartMapEntry;

...gets a new member, it will not get memset when memsetting "relmapentry".

> > Taking a quick look, I didn't happen to see any existing asserts of
this sort, so the patch doesn't seem to be making things more "normal". I
did see a few instances of /* hash_search already filled in the key */, so
if we do anything at all here, we might prefer that.
>
> There are some code using assert for this sort, for example in
> *ReorderBufferToastAppendChunk*:

> and in *rebuild_database_list*, tom commented that the key has already
> been filled, which I think
> he was trying to tell people no need to assign the key again.

Okay, we have examples of each.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: POC: GUC option for skipping shared buffers in core dumps

2023-09-13 Thread Lepikhov Andrei



On Wed, Feb 12, 2020, at 7:55 AM, tsunakawa.ta...@fujitsu.com wrote:
> From: Craig Ringer 
>> Currently my options are "dump all shmem including shared_buffers" or
>> "dump no shmem". But I usually want "dump all shmem except
>> shared_buffers". It's tolerable to just dump s_b on a test system with
>> a small s_b, but if enabling coredumps to track down some
>> hard-to-repro crash on a production system I really don't want 20GB
>> coredumps...
>
> We have a simple implementation that allows to exclude shared memory.  
> That's been working for years.
>
> [postgresql.conf]
> core_directory = 'location of core dumps'
> core_contents = '{none | minimum | full}'
> # none = doesn't dump core, minimum excludes shared memory, and full dumps all
>
> I can provide it.  But it simply changes the current directory and 
> detaches shared memory when postgres receives signals that dump core.
>
> I made this GUC because Windows also had to be dealt with.

If it's still possible, share your patch here. I don't know what about the 
core, but during development, especially the bug-fixing process, it is really 
dull to wait for the core generation process every time, even if you debug a 
planner issue and are not interested in shared memory blocks ...

-- 
Regards,
Andrei Lepikhov




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-09-13 Thread Maxim Orlov
Hi!

I pretty much like the idea of the patch. It looks like an oversight in the SQL
standard to me.
Anyway, the patch applies with no conflicts and implements the described
functionality.

On Fri, 25 Aug 2023 at 03:06, Vik Fearing  wrote:

>
> I don't like this part of the patch at all.  Not only is the
> documentation only half baked, but the entire concept of the two
> commands is different.  Especially since I believe the command should
> also create a generated column from a non-generated one.


But I have to agree with Vik Fearing: we can make this patch better, shouldn't
we?
I totally understand your intention to keep the code flow simple and reuse
existing code as much
as possible. But in terms of semantics, these two commands are quite
different from each other.
And in terms of reading the code, this makes it even harder to
understand what is going on here.
So, in my view, consider splitting these commands.

Hope that helps. Again, I'm +1 for this patch.

-- 
Best regards,
Maxim Orlov.


Re: Cleaning up array_in()

2023-09-13 Thread jian he
On Wed, Sep 13, 2023 at 2:00 PM Alexander Lakhin  wrote:
>
>
> Now I see only a few wrinkles.
> 1) A minor asymmetry in providing details appeared:
> select E'{"a"a}'::text[];
> ERROR:  malformed array literal: "{"a"a}"
> LINE 1: select E'{"a"a}'::text[];
> ^
> DETAIL:  Unexpected array element.
>
> select E'{a"a"}'::text[];
> ERROR:  malformed array literal: "{a"a"}"
> LINE 1: select E'{a"a"}'::text[];
> ^
> (no DETAIL)
>
> Old behavior:
> select E'{a"a"}'::text[];
> ERROR:  malformed array literal: "{a"a"}"
> LINE 1: select E'{a"a"}'::text[];
> ^
> DETAIL:  Unexpected array element.
>
> select E'{"a"a}'::text[];
> ERROR:  malformed array literal: "{"a"a}"
> LINE 1: select E'{"a"a}'::text[];
> ^
> DETAIL:  Unexpected array element.

fixed and added these two queries to the test.

> 2) CPPFLAGS="-DARRAYDEBUG" ./configure ... breaks "make check", maybe change 
> elog(NOTICE) to elog(DEBUG1)?
> 2a) a message logged there lacks some delimiter before "lBound info":
> NOTICE:  array_in- ndim 1 ( 3 -1 -1 -1 -1 -1lBound info 1 1 1 1 1 1) for 
> {red,green,blue}
> what about changing the format to "ndim 1 ( 3 -1 -1 -1 -1 -1; lBound info: 1 
> 1 1 1 1 1)"?

fixed. Use elog(DEBUG1) now.

> 3) It seems that new comments need polishing, in particular:
>   /* initialize dim, lBound. useful for ReadArrayDimensions ReadArrayStr */
> ->?
>   /* Initialize dim, lBound for ReadArrayDimensions, ReadArrayStr */
>
> Otherwise, we determine the dimensions from the in curly-braces
> ->?
> Otherwise, we determine the dimensions from the curly braces.
>
> Best regards,
> Alexander

comments updated. Please check the attached.
From 60b8c657d66139d727a729e62b8b95895e7929be Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Tue, 4 Jul 2023 14:07:31 -0400
Subject: [PATCH v8 3/7] Re-indent ArrayCount().

This cleans up after the removal of the "while (!itemdone)" loop.
---
 src/backend/utils/adt/arrayfuncs.c | 214 ++---
 1 file changed, 106 insertions(+), 108 deletions(-)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 59bb0614..d1b5d48b 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -477,21 +477,21 @@ ArrayCount(const char *str, int *dim, char typdelim, Node *escontext)
 	{
 		bool		new_element = false;
 
-			switch (*ptr)
-			{
-case '\0':
-	/* Signal a premature end of the string */
-	ereturn(escontext, -1,
-			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-			 errmsg("malformed array literal: \"%s\"", str),
-			 errdetail("Unexpected end of input.")));
-case '\\':
+		switch (*ptr)
+		{
+			case '\0':
+/* Signal a premature end of the string */
+ereturn(escontext, -1,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("malformed array literal: \"%s\"", str),
+		 errdetail("Unexpected end of input.")));
+			case '\\':
 
-	/*
-	 * An escape must be after a level start, after an element
-	 * start, or after an element delimiter. In any case we
-	 * now must be past an element start.
-	 */
+/*
+ * An escape must be after a level start, after an element
+ * start, or after an element delimiter. In any case we now
+ * must be past an element start.
+ */
 switch (parse_state)
 {
 	case ARRAY_LEVEL_STARTED:
@@ -511,22 +511,22 @@ ArrayCount(const char *str, int *dim, char typdelim, Node *escontext)
  errdetail("Unexpected \"%c\" character.",
 		   '\\')));
 }
-	/* skip the escaped character */
-	if (*(ptr + 1))
-		ptr++;
-	else
-		ereturn(escontext, -1,
-(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
- errmsg("malformed array literal: \"%s\"", str),
- errdetail("Unexpected end of input.")));
-	break;
-case '"':
+/* skip the escaped character */
+if (*(ptr + 1))
+	ptr++;
+else
+	ereturn(escontext, -1,
+			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+			 errmsg("malformed array literal: \"%s\"", str),
+			 errdetail("Unexpected end of input.")));
+break;
+			case '"':
 
-	/*
-	 * A quote must be after a level start, after a quoted
-	 * element start, or after an element delimiter. In any
-	 * case we now must be past an element start.
-	 */
+/*
+ * A quote must be after a level start, after a quoted element
+ * start, or after an element delimiter. In any case we now
+ * must be past an element start.
+ */
 switch (parse_state)
 {
 	case ARRAY_LEVEL_STARTED:
@@ -549,17 +549,16 @@ ArrayCount(const char *str, int *dim, char typdelim, Node *escontext)
  errmsg("malformed array literal: \"%s\"", str),
  errdetail("Unexpected array element.")));
 }
-	break;
-case '{':
-	if (!in_quotes)
-	{
-		/*
-		 * A left brace can occur if no nesting has occurred
-		 * yet, 

Re: [dynahash] do not refill the hashkey after hash_search

2023-09-13 Thread Junwang Zhao
On Wed, Sep 13, 2023 at 4:22 PM John Naylor
 wrote:
>
>
> On Wed, Sep 13, 2023 at 1:47 PM Junwang Zhao  wrote:
> >
> > Hi hackers,
> >
> > It's not necessary to fill the key field for most cases, since
> > hash_search has already done that for you. For developer that
> > using memset to zero the entry structure after enter it, fill the
> > key field is a must, but IMHO that is not good coding style, we
> > really should not touch the key field after insert it into the
> > dynahash.
>
> - memset(part_entry, 0, sizeof(LogicalRepPartMapEntry));
> - part_entry->partoid = partOid;
> + Assert(part_entry->partoid == partOid);
> + memset(entry, 0, sizeof(LogicalRepRelMapEntry));
>
> This is making an assumption that the non-key part of LogicalRepPartMapEntry 
> will never get new members. Without knowing much about this code, it seems 
> like a risk in the abstract.

What do you mean by 'the non-key part of LogicalRepPartMapEntry will
never get new members'?

typedef struct LogicalRepPartMapEntry
{
Oid partoid; /* LogicalRepPartMap's key */
LogicalRepRelMapEntry relmapentry;
} LogicalRepPartMapEntry;

partoid has already been filled by hash_search with HASH_ENTER action,
so I think the
above code should have the same effects.

>
> > This patch fixed some most abnormal ones, instead of refilling the
> > key field of primitive types, adding some assert might be a better
> > choice.
>
> Taking a quick look, I didn't happen to see any existing asserts of this 
> sort, so the patch doesn't seem to be making things more "normal". I did see 
> a few instances of /* hash_search already filled in the key */, so if we do 
> anything at all here, we might prefer that.

There are some code using assert for this sort, for example in
*ReorderBufferToastAppendChunk*:

```
ent = (ReorderBufferToastEnt *)
hash_search(txn->toast_hash, &chunk_id, HASH_ENTER, &found);

if (!found)
{
Assert(ent->chunk_id == chunk_id);   <--- this
line, by Robert Haas
ent->num_chunks = 0;
ent->last_chunk_seq = 0;
```

and in *rebuild_database_list*, tom commented that the key has already
been filled, which I think
he was trying to tell people no need to assign the key again.

```
/* we assume it isn't found because the hash was just created */
db = hash_search(dbhash, , HASH_ENTER, NULL);

/* hash_search already filled in the key */ <--- this
line, by Tom Lane
db->adl_score = score++;
/* next_worker is filled in later */
```

>
> - hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL);
> + (void) hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL);
>
> I prefer explicit (void) for new code, but others may disagree. I don't think 
> we have a preferred style for this, so changing current usage will just cause 
> unnecessary code churn.
>

What I am concerned about is that if we change the key after
hash_search with the HASH_ENTER action, there
is a chance that, if we assign a wrong value, it will be impossible to
match that entry again.
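For clarity, here is roughly the pattern the patch relies on for the struct
quoted above (a sketch only; the hash table and variable names are taken from
the snippets in this thread): dynahash fills the key on HASH_ENTER, the caller
asserts it and zeroes only the payload.

	LogicalRepPartMapEntry *part_entry;
	bool		found;

	part_entry = (LogicalRepPartMapEntry *)
		hash_search(LogicalRepPartMap, &partOid, HASH_ENTER, &found);

	if (!found)
	{
		/* hash_search already filled in the key */
		Assert(part_entry->partoid == partOid);

		/* zero only the non-key payload, leaving the key untouched */
		memset(&part_entry->relmapentry, 0, sizeof(LogicalRepRelMapEntry));
	}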

> --
> John Naylor
> EDB: http://www.enterprisedb.com



-- 
Regards
Junwang Zhao




Re: [dynahash] do not refill the hashkey after hash_search

2023-09-13 Thread John Naylor
On Wed, Sep 13, 2023 at 1:47 PM Junwang Zhao  wrote:
>
> Hi hackers,
>
> It's not necessary to fill the key field for most cases, since
> hash_search has already done that for you. For developer that
> using memset to zero the entry structure after enter it, fill the
> key field is a must, but IMHO that is not good coding style, we
> really should not touch the key field after insert it into the
> dynahash.

- memset(part_entry, 0, sizeof(LogicalRepPartMapEntry));
- part_entry->partoid = partOid;
+ Assert(part_entry->partoid == partOid);
+ memset(entry, 0, sizeof(LogicalRepRelMapEntry));

This is making an assumption that the non-key part of
LogicalRepPartMapEntry will never get new members. Without knowing much
about this code, it seems like a risk in the abstract.

> This patch fixed some most abnormal ones, instead of refilling the
> key field of primitive types, adding some assert might be a better
> choice.

Taking a quick look, I didn't happen to see any existing asserts of this
sort, so the patch doesn't seem to be making things more "normal". I did
see a few instances of /* hash_search already filled in the key */, so if
we do anything at all here, we might prefer that.

- hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL);
+ (void) hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL);

I prefer explicit (void) for new code, but others may disagree. I don't
think we have a preferred style for this, so changing current usage will
just cause unnecessary code churn.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: PSQL error: total cell count of XXX exceeded

2023-09-13 Thread Hongxu Ma
After double-checking, it looks like `int64` from src/include/c.h is the proper type for it.
Uploaded the v4 patch to fix it.
Thanks.
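For what it's worth, a standalone illustration of the overflow being discussed
(hypothetical values, not the psql code itself): the multiplication must be
widened before it is performed, not merely assigned to a 64-bit variable
afterwards.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int		ncols = 100000;
	int		nrows = 50000;

	/* product overflows 32-bit int arithmetic (formally undefined behavior) */
	int64_t	wrong = ncols * nrows;
	/* widening one operand first keeps the whole computation in 64 bits */
	int64_t	right = (int64_t) ncols * nrows;

	printf("wrong=%lld right=%lld\n", (long long) wrong, (long long) right);
	return 0;
}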


From: Hongxu Ma 
Sent: Wednesday, September 13, 2023 10:22
To: Tom Lane ; Michael Paquier 
Cc: PostgreSQL Hackers 
Subject: Re: PSQL error: total cell count of XXX exceeded

Thanks for pointing that out, I did miss some other "ncols * nrows" places.
Uploaded the v3 patch to fix them.

As for Windows, I didn't test it before, but I think it should also have the
issue (and it is more likely to happen since `cellsadded` is also a long type).
My fix idea is simple: define a common long64 type for it.
I referred to MSDN: only `LONGLONG` and `LONG64` are 64 bits. And I assume
Postgres should already have a similar type, but I only found `typedef long int
int64` in src/include/c.h, which looks like it's not a proper choice.
@Michael Paquier, could you help give some
advice here (which type should be used? or should we define a new one?). Thank
you very much.


From: Tom Lane 
Sent: Tuesday, September 12, 2023 12:19
To: Michael Paquier 
Cc: Hongxu Ma ; Jelte Fennema ; David 
G. Johnston ; PostgreSQL Hackers 

Subject: Re: PSQL error: total cell count of XXX exceeded

Michael Paquier  writes:
> On Tue, Sep 12, 2023 at 02:39:55AM +, Hongxu Ma wrote:
> +   long total_cells;

> long is 4 bytes on Windows, and 8 bytes basically elsewhere.  So you
> would still have the same problem on Windows, no?

More to the point: what about the multiplication in printTableInit?
The cat's been out of the bag for quite some time before we get to
printTableAddCell.

I'm more than a bit skeptical about trying to do something about this,
simply because this range of query result sizes is far past what is
practical.  The OP clearly hasn't tested his patch on actually
overflowing query results, and I don't care to either.

regards, tom lane


v4-0001-Using-int64-type-for-the-number-of-total-cells.patch
Description:  v4-0001-Using-int64-type-for-the-number-of-total-cells.patch


Re: pg_upgrade and logical replication

2023-09-13 Thread Michael Paquier
On Tue, Sep 12, 2023 at 01:22:50PM +, Zhijie Hou (Fujitsu) wrote:
> +/*
> + * binary_upgrade_sub_replication_origin_advance
> + *
> + * Update the remote_lsn for the subscriber's replication origin.
> + */
> +Datum
> +binary_upgrade_sub_replication_origin_advance(PG_FUNCTION_ARGS)
> +{
> 
> Is there any usage apart from pg_upgrade for this function, if not, I think
> we'd better move this function to pg_upgrade_support.c. If yes, I think maybe
> better to rename it to a general one.

I was equally surprised by the choice of the patch regarding the
location of these functions, so I agree with your point that these
functions should be in pg_upgrade_support.c.  All the sub-routines
these two functions rely on are defined in some headers already, so
there seems to be nothing new required for pg_upgrade_support.c.
--
Michael


signature.asc
Description: PGP signature


Re: persist logical slots to disk during shutdown checkpoint

2023-09-13 Thread Michael Paquier
On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:
> Consider if we move this call to bgwriter (aka flushing slots is no
> longer part of a checkpoint), Would that be okay? Previously, I think
> it was okay but not now. I see an argument to keep that as it is as
> well because we have already mentioned the special shutdown checkpoint
> case. By the way, I have changed this because Ashutosh felt it is no
> longer correct to keep the first sentence as it is. See his email[1]
> (Relying on the first sentence, ...).

Hmmm..  Okay..

> As other places don't have an assert, I didn't add one here but we can
> add one here.

I'd be OK with an assertion here at the end, though I'd still choose a
stricter run-time check if I were to apply that myself.
--
Michael


signature.asc
Description: PGP signature


Re: pg_rewind WAL segments deletion pitfall

2023-09-13 Thread Alexander Kukushkin
Hi,

Please find attached v6.
Changes compared to v5:
1. use "perl -e 'exit(1)'" instead of "false" as archive_command, so it
also works on Windows
2. fixed the test name

Regards,
--
Alexander Kukushkin
From 3e1e6c9d968e9b829357b6eb0a7dfa366b550668 Mon Sep 17 00:00:00 2001
From: Alexander Kukushkin 
Date: Wed, 13 Sep 2023 09:17:15 +0200
Subject: [PATCH v6] Be more picky with WAL segment deletion in pg_rewind

Make pg_rewind to be a bit wiser in terms of creating filemap:
preserve on the target all WAL segments that contain records between the
last common checkpoint and the point of divergence.

Co-authored-by: Polina Bungina 
---
 src/bin/pg_rewind/filemap.c   | 62 +-
 src/bin/pg_rewind/filemap.h   |  4 ++
 src/bin/pg_rewind/parsexlog.c | 22 +++
 src/bin/pg_rewind/pg_rewind.c |  3 +
 src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 63 +++
 5 files changed, 152 insertions(+), 2 deletions(-)
 create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 58280d9abc..d44d716d82 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -64,6 +64,27 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 static int	final_filemap_cmp(const void *a, const void *b);
 static bool check_file_excluded(const char *path, bool is_source);
 
+typedef struct skipwal_t {
+	const char *path;
+	uint32		status;
+} skipwal_t;
+
+#define SH_PREFIX		keepwalhash
+#define SH_ELEMENT_TYPE	skipwal_t
+#define SH_KEY_TYPE		const char *
+#define	SH_KEY			path
+#define SH_HASH_KEY(tb, key)	hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define	SH_SCOPE		static inline
+#define SH_RAW_ALLOCATOR	pg_malloc0
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static keepwalhash_hash *keepwalhash = NULL;
+
+static bool keepwalhash_entry_exists(const char *path);
+
 /*
  * Definition of one element part of an exclusion list, used to exclude
  * contents when rewinding.  "name" is the name of the file or path to
@@ -207,6 +228,35 @@ lookup_filehash_entry(const char *path)
 	return filehash_lookup(filehash, path);
 }
 
+/* Initialize a hash table to store WAL file names that must be kept */
+void
+keepwalhash_init(void)
+{
+	keepwalhash = keepwalhash_create(FILEHASH_INITIAL_SIZE, NULL);
+}
+
+/* Prevent a given file deletion during rewind */
+void
+insert_keepwalhash_entry(const char *path)
+{
+	skipwal_t   *entry;
+	bool		found;
+
+	/* Should only be called with keepwalhash initialized */
+	Assert(keepwalhash);
+
+	entry = keepwalhash_insert(keepwalhash, path, &found);
+
+	if (!found)
+		entry->path = pg_strdup(path);
+}
+
+static bool
+keepwalhash_entry_exists(const char *path)
+{
+	return keepwalhash_lookup(keepwalhash, path) != NULL;
+}
+
 /*
  * Callback for processing source file list.
  *
@@ -682,7 +732,16 @@ decide_file_action(file_entry_t *entry)
 	}
 	else if (entry->target_exists && !entry->source_exists)
 	{
-		/* File exists in target, but not source. Remove it. */
+		/* File exists in target, but not source. */
+
+		if (keepwalhash_entry_exists(path))
+		{
+			/* This is a WAL file that should be kept. */
+			pg_log_debug("Not removing %s because it is required for recovery", path);
+			return FILE_ACTION_NONE;
+		}
+
+		/* Otherwise remove an unexpected file. */
 		return FILE_ACTION_REMOVE;
 	}
 	else if (!entry->target_exists && !entry->source_exists)
@@ -818,7 +877,6 @@ decide_file_actions(void)
 	return filemap;
 }
 
-
 /*
  * Helper function for filemap hash table.
  */
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 48f240dff1..9a0d4bbc54 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -109,5 +109,9 @@ extern void process_target_wal_block_change(ForkNumber forknum,
 extern filemap_t *decide_file_actions(void);
 extern void calculate_totals(filemap_t *filemap);
 extern void print_filemap(filemap_t *filemap);
+extern void preserve_file(char *filepath);
+
+extern void keepwalhash_init(void);
+extern void insert_keepwalhash_entry(const char *path);
 
 #endif			/* FILEMAP_H */
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..4c7f9bef14 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -176,6 +176,10 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	char	   *errormsg;
 	XLogPageReadPrivate private;
 
+	/* Track WAL segments opened while searching a checkpoint */
+	XLogSegNo	segno = 0;
+	TimeLineID	tli = 0;
+
 	/*
 	 * The given fork pointer points to the end of the last common record,
 	 * which is not necessarily the beginning of the next record, if the
@@ -217,6 +221,24 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		 LSN_FORMAT_ARGS(searchptr));
 		}
 
+		/* We are trying to 

Re: Tab completion for ATTACH PARTITION

2023-09-13 Thread Alvaro Herrera
On 2023-Sep-13, bt23nguyent wrote:

> Hi,
> 
> Currently, psql's tab completion feature does not properly support
> ATTACH PARTITION. When the TAB key is typed after "ALTER TABLE <table>
> ATTACH PARTITION ", all possible table names should be displayed; however,
> foreign table names are not displayed. So I created a patch that addresses
> this issue by ensuring that psql displays not only normal table names but
> also foreign table names in this case.

Sounds reasonable, but I think if we're going to have a specific query
for this case, we should make it a lot more precise.  For example, any
relation that's already a partition cannot be attached; as can't any
relation that is involved in legacy inheritance as either parent or
child.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"This is a foot just waiting to be shot"(Andrew Dunstan)




Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-09-13 Thread Kyotaro Horiguchi
At Fri, 8 Sep 2023 08:02:57 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> > > Ditching cmd.exe seems like a big hassle. So, on the flip side, I
> > > tried to identify the postmaster PID using the shell's PID, and it
> > > seem to work. The APIs used are avaiable from XP/2003 onwards.
> 
> According to 495ed0ef2, Windows 10 seems the minimal requirement for using
> the postgres. So the approach seems OK.
> 
> Followings are my comment, but I can say only cosmetic ones because I do not 
> have
> windows machine which can run postgres.

Thank you for the comment!

> 1.
> Forward declaration seems missing. In the pg_ctl.c, the static function seems 
> to
> be declared even if there is only one caller (c.f., GetPrivilegesToDelete).

Agreed. 

> 2.
> I think the argument should be pid_t.

Yeah, I didn't linger on that detail earlier. But revisiting it, I
concur it is best suited since it is a local function in
pg_ctl.c. I've now positioned it at the end of a WIN32 section
defining other win32-specific functions. Hence, a forward declaration
became necessary:p

> 3.
> I'm not sure the return type of the function should be pid_t or not. According
> to the document, DWORD corrresponds to the pid_t. In win32_port.h, the pid_t 
> is
> defiend as int (_MSC_VER seems to be defined when the VisualStduio is used). 
> It
> is harmless, but I perfer to match the interface between caller/callee. IIUC 
> we
> can add just a cast.

For the reason previously stated, I've adjusted the type for both the
parameter and the return value to pid_t. start_postmaster() already
assumed that pid_t is wider than DWORD.

I noticed that PID 0 is valid on Windows. However, it is consistently
the PID for the system idle process, so it can't be associated with
cmd.exe or postgres. I've added a comment noting that premise. Also I
did away with an unused variable.  For the CreateToolhelp32Snapshot
function, I changed the second parameter to 0 from shell_pid, since it
is not used when using TH32CS_SNAPPROCESS.  I changed the comparison
operator for pid_t from > to !=, ensuring correct behavior even with
negative values.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 807d7023a9..68ab83779c 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -132,6 +132,7 @@ static void adjust_data_dir(void);
 
 #ifdef WIN32
 #include 
+#include 
 static bool pgwin32_IsInstalled(SC_HANDLE);
 static char *pgwin32_CommandLine(bool);
 static void pgwin32_doRegister(void);
@@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *);
 static void pgwin32_doRunAsService(void);
 static int	CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service);
 static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken);
+static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid);
 #endif
 
 static pid_t get_pgpid(bool is_status_request);
@@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 			/* File is complete enough for us, parse it */
 			pid_t		pmpid;
 			time_t		pmstart;
-
+#ifndef WIN32
+			pid_t		wait_pid = pm_pid;
+#else
+			pid_t		wait_pid = pgwin32_find_postmaster_pid(pm_pid);
+#endif
 			/*
 			 * Make sanity checks.  If it's for the wrong PID, or the recorded
 			 * start time is before pg_ctl started, then either we are looking
@@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 			 */
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
-			if (pmstart >= start_time - 2 &&
-#ifndef WIN32
-pmpid == pm_pid
-#else
-			/* Windows can only reject standalone-backend PIDs */
-pmpid > 0
-#endif
-)
+
+			if (pmstart >= start_time - 2 && pmpid == wait_pid)
 			{
 /*
  * OK, seems to be a valid pidfile from our child.  Check the
@@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken)
 
 	return tokenPrivs;
 }
+
+/*
+ * Find the PID of the launched postmaster.
+ *
+ * On Windows, the cmd.exe doesn't support the exec command. As a result, we
+ * don't directly get the postmaster's PID. This function identifies the PID of
+ * the postmaster started by the child cmd.exe.
+ *
+ * Returns the postmaster's PID. If the shell is alive but the postmaster is
+ * missing, returns 0. Otherwise terminates this command with an error.
+ *
+ * This function uses PID 0 as an invalid value, assuming the system idle
+ * process occupies it and it won't be a PID for a shell or postmaster.
+ */
+pid_t
+pgwin32_find_postmaster_pid(pid_t shell_pid)
+{
+	HANDLE hSnapshot;
+	DWORD flags;
+	PROCESSENTRY32 ppe;
+	pid_t pm_pid = 0;			/* abusing 0 as an invalid value */
+	bool shell_exists = false;
+	bool multiple_children = false;
+	DWORD last_error;
+
+	/* create a process snapshot */
+	hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
+	if (hSnapshot == INVALID_HANDLE_VALUE)
+	{
+		write_stderr(_("%s: 

[dynahash] do not refill the hashkey after hash_search

2023-09-13 Thread Junwang Zhao
Hi hackers,

It's not necessary to fill the key field in most cases, since
hash_search has already done that for you. For developers that
use memset to zero the entry structure after entering it, filling the
key field is a must, but IMHO that is not good coding style; we
really should not touch the key field after inserting it into the
dynahash.

This patch fixes some of the most abnormal ones; instead of refilling the
key field of primitive types, adding some asserts might be a better
choice.


-- 
Regards
Junwang Zhao


0001-do-not-refill-the-hashkey-after-hash_search.patch
Description: Binary data


Re: pg_upgrade and logical replication

2023-09-13 Thread Michael Paquier
On Mon, Sep 11, 2023 at 05:19:27PM +0530, vignesh C wrote:
> Added a function binary_upgrade_sub_replication_origin_advance which
> will: a) check if the subscription exists, b) get the replication name
> for subscription and c) advance the replication origin.
> 
> These are handled as part of v7 posted at [1].
> [1] - 
> https://www.postgresql.org/message-id/CALDaNm1ZrbHaWpJwwNhDTJocRKWd3rEkgJazuDdZ9Z-WdvonFg%40mail.gmail.com

Thanks.  I can see that some of the others have already provided
comments about this version.  I have some comments on top of that.
--
Michael


signature.asc
Description: PGP signature


Re: persist logical slots to disk during shutdown checkpoint

2023-09-13 Thread Amit Kapila
On Wed, Sep 13, 2023 at 10:57 AM Michael Paquier  wrote:
>
> On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:
> >>
> >> - Not sure that there is any point to mention the other code paths in
> >> the tree where ReplicationSlotSave() can be called, and a slot can be
> >> saved in other processes than just WAL senders (like slot
> >> manipulations in normal backends, for one).  This was the last
> >> sentence in v10.
> >
> > The point was the earlier sentence is no longer true and keeping it as
> > it is could be wrong or at least misleading. For example, earlier it
> > is okay to say, "This needn't actually be part of a checkpoint, ..."
> > but now that is no longer true as we want to invoke this at the time
> > of shutdown checkpoint for correctness. If we want to be precise, we
>
> How so?
>

Consider if we move this call to bgwriter (aka flushing slots is no
longer part of a checkpoint), Would that be okay? Previously, I think
it was okay but not now. I see an argument to keep that as it is as
well because we have already mentioned the special shutdown checkpoint
case. By the way, I have changed this because Ashutosh felt it is no
longer correct to keep the first sentence as it is. See his email[1]
(Relying on the first sentence, ...). It happened previously as well
that different reviewers working in this area have different views on
this sort of thing. I am trying my best to address the review
comments, especially from experienced hackers but personally, I feel
this is a minor nitpick and isn't worth too much argument, either way,
should be okay.

>
> >> +   if (s->data.invalidated == RS_INVAL_NONE &&
> >> +   s->data.confirmed_flush != s->last_saved_confirmed_flush)
> >>
> >> Actually this is incorrect, no?  Shouldn't we make sure that the
> >> confirmed_flush is strictly higher than the last saved LSN?
> >
> > I can't see why it is incorrect. Do you see how (in what scenario) it
> > could go wrong? As per my understanding, confirmed_flush LSN will
> > always be greater than equal to last_saved_confirmed_flush but we
> > don't want to ensure that point here because we just want if the
> > latest value is not the same then we should mark the slot dirty and
> > flush it as that will be location we have ensured to update before
> > walsender shutdown. I think it is better to add an assert if you are
> > worried about any such case and we had thought of adding it as well
> > but then didn't do it because we don't have matching asserts to ensure
> > that we never assign prior LSN value to consfirmed_flush LSN.
>
> Because that's just safer in the long run, and I don't see why we
> cannot just do that?  Imagine, for instance, that a bug doing an
> incorrect manipulation of a logical slot's data does an incorrect
> computation of this field, and that we finish with in-memory data
> that's older than what was previously saved.  The code may cause a
> flush at an incorrect, past, position.  That's just an assumption from
> my side, of course.
>

If you are worried about such bugs, it would be better to have an
Assert as suggested previously rather than a greater-than check, because
we will at least catch such bugs; otherwise they can go unnoticed or in
the worst case lead to unknown consequences. I am saying this
because if there are such bugs (or they get introduced later) then the slot
can be flushed with a prior confirmed_flush location even from other
code paths. Just for reference, we don't have any check ensuring that the
confirmed_flush LSN cannot move backward in function
LogicalConfirmReceivedLocation(); see that function and also the other place
where we update it:
else
{
SpinLockAcquire(&MyReplicationSlot->mutex);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}

As other places don't have an assert, I didn't add one here but we can
add one here.
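To make that concrete, a minimal sketch of the assertion being discussed
(field names taken from the snippet quoted earlier in this thread, not the
actual patch):

	if (s->data.invalidated == RS_INVAL_NONE &&
		s->data.confirmed_flush != s->last_saved_confirmed_flush)
	{
		/* if the values differ, confirmed_flush must have moved forwards */
		Assert(s->data.confirmed_flush > s->last_saved_confirmed_flush);

		/* ... mark the slot dirty so the shutdown checkpoint flushes it ... */
	}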

> > I don't want to argue on such a point because it is a little bit of a
> > matter of personal choice but I find this comment unclear. It seems to
> > read that confirmed_flush LSN is some LSN position which is where we
> > flushed the slot's data and that is not true. I found the last comment
> > in the patch sent by me: "This value tracks the last confirmed_flush
> > LSN flushed which is used during a shutdown checkpoint to decide if
> > logical's slot data should be forcibly flushed or not." which I feel
> > we agreed upon is better.
>
> Okay, fine by me here.
>

Thanks, will change once we agree on the remaining points.

-- 
With Regards,
Amit Kapila.




Re: Have better wording for snapshot file reading failure

2023-09-13 Thread Michael Paquier
On Wed, Sep 13, 2023 at 11:40:25AM +0530, Bharath Rupireddy wrote:
> When a snapshot file reading fails in ImportSnapshot(), it errors out
> with "invalid snapshot identifier". This message better suits for
> snapshot identifier parsing errors which is being done just before the
> file reading. The attached patch adds a generic file reading error
> message with path to help distinguish if the issue is with snapshot
> identifier parsing or file reading.

 f = AllocateFile(path, PG_BINARY_R);
 if (!f)
 ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("invalid snapshot identifier: \"%s\"", idstr)));
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\" for reading: %m",
+path)));

Agreed that this just looks like a copy-pasto.  The path provides
enough context about what's being read, so using this generic error
message is fine.  Will apply if there are no objections. 
--
Michael


signature.asc
Description: PGP signature


Have better wording for snapshot file reading failure

2023-09-13 Thread Bharath Rupireddy
Hi,

When a snapshot file reading fails in ImportSnapshot(), it errors out
with "invalid snapshot identifier". This message better suits for
snapshot identifier parsing errors which is being done just before the
file reading. The attached patch adds a generic file reading error
message with path to help distinguish if the issue is with snapshot
identifier parsing or file reading.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Have-better-wording-for-snapshot-file-reading-fai.patch
Description: Binary data


Re: Is the member name of hashctl inappropriate?

2023-09-13 Thread ywgrit
You are right, I came to this erroneous conclusion based on the following
mishandled experiment: My test program is shown below.

typedef struct ColumnIdentifier
{
   Oid relid;
   AttrNumber resno;
} ColumnIdentifier;

typedef struct ColumnType
{
   ColumnIdentifier colId;
   Oid  vartype;
} ColumnType;


int
ColumnIdentifier_compare(const void *key1, const void *key2, Size keysize)
{
const ColumnIdentifier *colId_key1 = (const ColumnIdentifier *) key1;
const ColumnIdentifier *colId_key2 = (const ColumnIdentifier *) key2;

return colId_key1->relid == colId_key2->relid && colId_key1->resno ==
colId_key2->resno ? 0 : 1;
}


void *
ColumnIdentifier_copy(void *dest, const void *src, Size keysize)
{
ColumnIdentifier *colId_dest = (ColumnIdentifier *) dest;
ColumnIdentifier *colId_src = (ColumnIdentifier *) src;

colId_dest->relid = colId_src->relid;
colId_dest->resno = colId_src->resno;

   return NULL;/* not used */
}


   HASHCTL   hashctl;
   hashctl.hash = tag_hash;
   hashctl.match = ColumnIdentifier_compare;
   hashctl.keycopy = ColumnIdentifier_copy;
   hashctl.keysize = sizeof(ColumnIdentifier);
   hashctl.entrysize = sizeof(ColumnType);
   HTAB *htab = hash_create("type of column",
 512 /* nelem */,
  &hashctl,
 HASH_ELEM | HASH_FUNCTION |
 HASH_COMPARE | HASH_KEYCOPY);
ColumnType *entry = NULL;

   ColumnIdentifier *colId = (ColumnIdentifier *)
MemoryContextAllocZero(CurrentMemoryContext,
sizeof(ColumnIdentifier));
   ColumnType *coltype = (ColumnType *)
MemoryContextAllocZero(CurrentMemoryContext, sizeof(ColumnType));

   coltype->colId.relid = colId->relid = 16384;
   coltype->colId.resno = colId->resno = 1;
   coltype->vartype = INT4OID;

   hash_search(htab, coltype, HASH_ENTER, NULL);
   entry = hash_search(htab, colId, HASH_FIND, NULL);

   Assert(entry->colId.relid == colId->relid);
   Assert(entry->colId.resno == colId->resno);
   Assert(entry->vartype == INT4OID); // entry->vartype == 0

As shown above, entry->vartype is not assigned when keycopy copies only the
key. I modified ColumnIdentifier_copy as shown below so that keycopy copies
the entire entry.

void *
ColumnIdentifier_copy(void *dest, const void *src, Size keysize)
{
	const ColumnType *coltype_src = (const ColumnType *) src;
	ColumnType *coltype_dest = (ColumnType *) dest;

	coltype_dest->colId.relid = coltype_src->colId.relid;
	coltype_dest->colId.resno = coltype_src->colId.resno;
	coltype_dest->vartype = coltype_src->vartype;

	return NULL;	/* not used */
}

The result is that entry->vartype is now the same as var->vartype, which
leads me to believe that keycopy "should" copy the entire entry. Before
sending the initial email, I looked at the implementation of
"hash_search_with_hash_value" and found the line
"hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize)", which made me
wonder how the data field is copied into the HTAB.

But at the time I ignored a note above: "Caller is expected to fill the
data field on return". Now I know that the data field needs to be filled
manually, so it was my misuse. Thanks for the correction!
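For anyone skimming the thread, the working pattern is simply (a sketch
reusing the types from the test program above):

	bool		found;
	ColumnType *entry;

	entry = hash_search(htab, colId, HASH_ENTER, &found);
	if (!found)
	{
		/* dynahash copied only the key; fill the data field ourselves */
		entry->vartype = INT4OID;
	}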

Thanks

Tom Lane  wrote on Wed, Sep 13, 2023 at 09:36:

> ywgrit  writes:
> > According to the description, the keycopy function only copies the key,
> but
> > in reality it copies the entire entry, i.e., the key and the value,
>
> On what grounds do you claim that?  dynahash.c only ever passes "keysize"
> as the size parameter.
>
> regards, tom lane
>


Re: Cleaning up array_in()

2023-09-13 Thread Alexander Lakhin

12.09.2023 11:45, jian he wrote:

On Mon, Sep 11, 2023 at 8:00 PM Alexander Lakhin  wrote:

I can confirm that all those anomalies are fixed now.
But new version brings a warning when compiled with gcc:
arrayfuncs.c:659:9: warning: variable 'prev_tok' is uninitialized when used 
here [-Wuninitialized]
  if (prev_tok == ATOK_DELIM || nest_level == 0)
  ^~~~
arrayfuncs.c:628:3: note: variable 'prev_tok' is declared here
  ArrayToken  prev_tok;
  ^
1 warning generated.

Also it looks like an updated comment needs fixing/improving:
   /* No array dimensions, so first literal character should be oepn 
curl-braces */
(should be an opening brace?)


fixed these 2 issues.
--query
SELECT ('{ ' || string_agg(chr((ascii('B') + round(random() * 25)) ::
integer),', ') || ' }')::text[]
FROM generate_series(1,1e6) \watch i=0.1 c=1

After applying the patch, the above query runs slightly faster.


Thank you, Jian He!

Now I see only a few wrinkles.
1) A minor asymmetry in providing details appeared:
select E'{"a"a}'::text[];
ERROR:  malformed array literal: "{"a"a}"
LINE 1: select E'{"a"a}'::text[];
   ^
DETAIL:  Unexpected array element.

select E'{a"a"}'::text[];
ERROR:  malformed array literal: "{a"a"}"
LINE 1: select E'{a"a"}'::text[];
   ^
(no DETAIL)

Old behavior:
select E'{a"a"}'::text[];
ERROR:  malformed array literal: "{a"a"}"
LINE 1: select E'{a"a"}'::text[];
   ^
DETAIL:  Unexpected array element.

select E'{"a"a}'::text[];
ERROR:  malformed array literal: "{"a"a}"
LINE 1: select E'{"a"a}'::text[];
   ^
DETAIL:  Unexpected array element.

2) CPPFLAGS="-DARRAYDEBUG" ./configure ... breaks "make check", maybe change 
elog(NOTICE) to elog(DEBUG1)?
2a) a message logged there lacks some delimiter before "lBound info":
NOTICE:  array_in- ndim 1 ( 3 -1 -1 -1 -1 -1lBound info 1 1 1 1 1 1) for 
{red,green,blue}
what about changing the format to "ndim 1 ( 3 -1 -1 -1 -1 -1; lBound info: 1 1 1 1 1 
1)"?

3) It seems that new comments need polishing, in particular:
 /* initialize dim, lBound. useful for ReadArrayDimensions ReadArrayStr */
->?
 /* Initialize dim, lBound for ReadArrayDimensions, ReadArrayStr */

Otherwise, we determine the dimensions from the in curly-braces
->?
Otherwise, we determine the dimensions from the curly braces.

Best regards,
Alexander