Re: Remove unnecessary 'always:' from CompilerWarnings task

2023-11-08 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review.

On Wed, 8 Nov 2023 at 10:31, Peter Eisentraut  wrote:
>
> On 05.09.23 12:25, Nazir Bilal Yavuz wrote:
> > There are multiple 'always:' keywords under the CompilerWarnings task.
> > Instead of that, we can use one 'always:' and move the instructions
> > under this. So, I removed unnecessary ones and rearranged indents
> > according to that change.
>
> I'm not sure this change is beneficial.  The way the code is currently
> arranged, it's a bit easier to move or change individual blocks, and
> it's also easier to read the file, because the "always:" is next to each
> "script" and doesn't scroll off the screen.

That makes sense. I am planning to withdraw this soon if there are no
other objections.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: A recent message added to pg_upgade

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 01:12:54PM +0530, Amit Kapila wrote:
> I think it is in an email[1].

Noted.

> I can take care of this unless we see some opposition to this idea.

Thanks!
--
Michael




Re: A recent message added to pg_upgade

2023-11-08 Thread Amit Kapila
On Thu, Nov 9, 2023 at 12:38 PM Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 05:04:28PM +1100, Peter Smith wrote:
> > Having a GUC hook for the "max_slot_wal_keep_size" seemed OK to me. If
> > the user overrides a GUC value (admittedly, maybe there is no reason
> > why they would want to) then at least the hook will give an error,
> > rather than us silently overwriting the user's value with -1.
> >
> > So, patch v4 LGTM, except it is better to include a test case.
>
> Where's this v4?
>

I think it is in an email[1]. I can take care of this unless we see
some opposition to this idea.

[1] - 
https://www.postgresql.org/message-id/20231102.115834.1012152975995247837.horikyota.ntt%40gmail.com

-- 
With Regards,
Amit Kapila.




Re: Fix output of zero privileges in psql

2023-11-08 Thread Laurenz Albe
On Thu, 2023-11-09 at 03:40 +0100, Erik Wienhold wrote:
> On 2023-11-08 13:23 +0100, Laurenz Albe wrote:
> > I wonder how to proceed with this patch.  The main disagreement is
> > whether default privileges should be displayed as NULL (less invasive,
> > but more confusing for beginners) or "(default)" (more invasive,
> > but nicer for beginners).
> 
> Are there any reports from beginners being confused about default
> privileges being NULL or being displayed as a blank string in psql?
> This is usually resolved with a pointer to the docs if it comes up in
> discussions or the user makes the mental leap and checks the docs
> himself.  Both patches add some details to the docs to explain psql's
> output.

Right.

> > David is for "(default)", Tom and me are for NULL, and I guess Erik
> > would also prefer "(default)", since that was how his original
> > patch did it, IIRC.  I think I could live with both solutions.
> > 
> > Kind of a stalemate.  Who wants to tip the scales?
> 
> Yes I had a slight preference for my patch but I'd go with yours (\pset
> null) now.  I followed the discussion after my last mail but had nothing
> more to add that wasn't already said.  Tom then wrote that NULL is the
> catalog's representation for the default privileges and obscuring that
> fact in psql is not doing any service to the users.  This convinced me
> because users may have to deal with aclitem[] being NULL anyway at some
> point if they need to check privileges in more detail.  So it
> absolutely makes sense that psql is transparent about that.

Thanks for the feedback.  I'll set the patch to "ready for committer" then.
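
For anyone trying this out: with the patch applied and \pset null set, a
table that has never had any GRANT or REVOKE (so its pg_class.relacl is
still NULL) can be told apart at a glance.  A minimal sketch (the table
name is arbitrary):

  CREATE TABLE mytab (id integer);  -- no explicit grants: relacl is NULL
  \pset null '(null)'
  \dp mytab

The "Access privileges" column then shows (null) instead of an empty
cell, so default privileges are distinguishable from an empty aclitem[].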

Yours,
Laurenz Albe




Re: make pg_ctl more friendly

2023-11-08 Thread Crisp Lee
Hi,

I thought the PITR shutdown was DB_SHUTDOWN. I made a mistake. The v2
attachment looks good.

On Thu, Nov 9, 2023 at 3:19 PM Junwang Zhao  wrote:

> On Thu, Nov 9, 2023 at 3:08 PM Junwang Zhao  wrote:
> >
> > > On Thu, Nov 9, 2023 at 9:57 AM Crisp Lee  wrote:
> > >
> > > Hi,
> > >
> > > I know it. But my question is not that. I did a PITR operation with
> > > recovery_target_name and recovery_target_action('shutdown'). The PITR
> > > process was very short and the PITR was done before pg_ctl check. The
> > > postmaster shut down due to recovery_target_action, and there was no
> > > crash. But pg_ctl told me about startup failure.  I think the startup
> > > had succeeded and the result was not an exception. pg_ctl should tell
> > > users about detailed messages.
> > >
> > > On Thu, Nov 9, 2023 at 9:32 AM Andres Freund  wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 2023-11-09 09:29:32 +0800, Crisp Lee wrote:
> > >> > How to judge from 'DB_SHUTDOWNED' that PITR ends normally?
> > >> > 'DB_SHUTDOWNED' is just a state, it could not give more meaning,
> > >> > so I reuse the recovery.done.
> > >>
> > >> DB_SHUTDOWNED cannot be encountered while recovery is ongoing. If
> > >> there was a hard crash, you'd see DB_IN_ARCHIVE_RECOVERY or such, if
> > >> the command was shut down orderly before PITR has finished, you'd see
> > >> DB_SHUTDOWNED_IN_RECOVERY.
> > >>
> > >> - Andres
> >
> > After a PITR shutdown, the db state should be *shut down in recovery*,
> > try the patch attached.
> >
>
> previous patch has some format issues, V2 attached.
>
> >
> > --
> > Regards
> > Junwang Zhao
>
>
>
> --
> Regards
> Junwang Zhao
>


Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 01:50:34PM +0900, torikoshia wrote:
> PGSTAT_KIND_SLRU cannot be reset by pg_stat_reset_shared(), so I feel
> uncomfortable deleting it altogether.
> It might be better after pg_stat_reset_shared() has been modified to take
> 'slru' as an argument, though.

Not sure how to feel about that, TBH, but I would not include SLRUs
here if we already have a separate function.

> I thought it would be better to reset statistics even when null is specified
> so that users are not confused with the behavior of pg_stat_reset_slru().
> Attached patch added pg_stat_reset_shared() in system_functions.sql mainly
> for this reason.

I'm OK with doing what your patch does, aka do the work if the value
is NULL or if there's no argument given.
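
So, after your patch, all of these behave consistently (a quick sketch):

  SELECT pg_stat_reset_shared();         -- no argument: reset everything
  SELECT pg_stat_reset_shared(NULL);     -- explicit NULL: same thing
  SELECT pg_stat_reset_shared('wal');    -- targeted reset, unchanged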

-Resets some cluster-wide statistics counters to zero, depending on the
+Resets cluster-wide statistics counters to zero, depending on the 

This does not need to change, aka SLRUs are for example still global
and not included here.

+If the argument is NULL or not specified, all counters shown in all
+of these views are reset.

Missing a  markup around NULL.  I know, we're not consistent
about that, either, but if we are tweaking the area let's be right at
least.  Perhaps "all the counters from the views listed above are
reset"?
--
Michael




Re: make pg_ctl more friendly

2023-11-08 Thread Junwang Zhao
On Thu, Nov 9, 2023 at 3:08 PM Junwang Zhao  wrote:
>
> On Thu, Nov 9, 2023 at 9:57 AM Crisp Lee  wrote:
> >
> > Hi,
> >
> > I know it. But my question is not that. I did a PITR operation with
> > recovery_target_name and recovery_target_action('shutdown'). The PITR
> > process was very short and the PITR was done before pg_ctl check. The
> > postmaster shut down due to recovery_target_action, and there was no crash.
> > But pg_ctl told me about startup failure.  I think the startup had
> > succeeded and the result was not an exception. pg_ctl should tell users
> > about detailed messages.
> >
> > On Thu, Nov 9, 2023 at 9:32 AM Andres Freund  wrote:
> >>
> >> Hi,
> >>
> >> On 2023-11-09 09:29:32 +0800, Crisp Lee wrote:
> >> > How to judge from 'DB_SHUTDOWNED' that PITR ends normally? 
> >> > 'DB_SHUTDOWNED'
> >> > is just a state, it could not give more meaning, so I reuse the
> >> > recovery.done.
> >>
> >> DB_SHUTDOWNED cannot be encountered while recovery is ongoing. If there
> >> was a hard crash, you'd see DB_IN_ARCHIVE_RECOVERY or such, if the
> >> command was shut down orderly before PITR has finished, you'd see
> >> DB_SHUTDOWNED_IN_RECOVERY.
> >>
> >> - Andres
>
> After a PITR shutdown, the db state should be *shut down in recovery*, try the
> patch attached.
>

previous patch has some format issues, V2 attached.

>
> --
> Regards
> Junwang Zhao



-- 
Regards
Junwang Zhao


v2-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2023-11-08 Thread Ashutosh Bapat
On Thu, Nov 9, 2023 at 12:03 PM torikoshia  wrote:
> >>
> >> 1. When a backend is running nested queries, we will see the plan of
> >> the innermost query. That query may not be the actual culprit if the
> >> user query is running slowly. E.g a query being run as part of inner
> >> side nested loop when the nested loop itself is the bottleneck. I
> >> think it will be useful to print plans of the whole query stack.
>
> This was discussed in previous threads[1] and we thought it'd be useful
> but since it needed some extra work, we stopped widening the scope.
>
> >
> > I think we can start with what auto_explain is doing. Always print the
> > plan of the outermost query; the query found in pg_stat_activity. In a
> > later version we might find a way to print plans of all the queries in
> > the stack and do so in a readable manner.
>
> Agreed, there are cases where printing the plan of the outermost query
> is more useful.
>

I am fine printing the plan of the outermost query. This will help
many cases. Printing plans of the whole query stack can be added as an
add on later.

> >
> > This makes tracking activeQueryDesc a bit tricky. My guess is that the
> > outermost query's descriptor will be available through ActivePortal
> > most of the time. But there are cases when ExecutorRun is called by
> > passing a queryDesc directly. So may be there are some cases where
> > that's not true.
>
> Yeah, actually the original version of the patch got the plan from
> ActivePortal, but it failed to log the plan when the query was
> something like this[2]:
>
>   DO $$
>   BEGIN
>   PERFORM pg_sleep(100);
>   END$$;

References [1] and [2] are not listed in your email.

Is that because there was no ActivePortal created or the ActivePortal
pointed to DO block instead of PERFORM pg_sleep?

>
> > 2. When a query is running in a parallel worker, do we want to print that
> > query? It may or may not be interesting given the situation. If the
> > overall plan itself is faulty, output of the parallel worker query is
> > not useful. If the plan is fine but a given worker's leg is running
> > slowly it may be interesting.
>
> I think it can be useful.
> I'm wondering if we can add this after the first patch for this feature
> is committed.

With the current patches, it will print the query from a parallel
backend. If that's not desirable we should prohibit that case at
least.
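
For instance, nothing stops one from pointing the function at a worker
today (the PID below is made up):

  SELECT pid FROM pg_stat_activity WHERE backend_type = 'parallel worker';
  SELECT pg_log_query_plan(349331);  -- hypothetical parallel worker PID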

-- 
Best Wishes,
Ashutosh Bapat




Re: A recent message added to pg_upgade

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 05:04:28PM +1100, Peter Smith wrote:
> Having a GUC hook for the "max_slot_wal_keep_size" seemed OK to me. If
> the user overrides a GUC value (admittedly, maybe there is no reason
> why they would want to) then at least the hook will give an error,
> rather than us silently overwriting the user's value with -1.
> 
> So, patch v4 LGTM, except it is better to include a test case.

Where's this v4?  I may be missing something, but it does not seem to
be attached to this thread...
--
Michael




Re: make pg_ctl more friendly

2023-11-08 Thread Junwang Zhao
On Thu, Nov 9, 2023 at 9:57 AM Crisp Lee  wrote:
>
> Hi,
>
> I know it. But my question is not that. I did a PITR operation with
> recovery_target_name and recovery_target_action('shutdown'). The PITR process
> was very short and the PITR was done before pg_ctl check. The postmaster
> shut down due to recovery_target_action, and there was no crash. But pg_ctl
> told me about startup failure.  I think the startup had succeeded and the
> result was not an exception. pg_ctl should tell users about detailed messages.
>
> On Thu, Nov 9, 2023 at 9:32 AM Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2023-11-09 09:29:32 +0800, Crisp Lee wrote:
>> > How to judge from 'DB_SHUTDOWNED' that PITR ends normally? 'DB_SHUTDOWNED'
>> > is just a state, it could not give more meaning, so I reuse the
>> > recovery.done.
>>
>> DB_SHUTDOWNED cannot be encountered while recovery is ongoing. If there was a
>> hard crash, you'd see DB_IN_ARCHIVE_RECOVERY or such, if the command was shut
>> down orderly before PITR has finished, you'd see DB_SHUTDOWNED_IN_RECOVERY.
>>
>> - Andres

After a PITR shutdown, the db state should be *shut down in recovery*, try the
patch attached.
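
For anyone wanting to reproduce the scenario Crisp Lee describes, a
rough sketch (the restore-point name and archive path are made up):

  -- on the primary, after taking the base backup:
  SELECT pg_create_restore_point('before_load');

  -- then, in postgresql.conf of the restored data directory:
  --   restore_command        = 'cp /mnt/archive/%f %p'
  --   recovery_target_name   = 'before_load'
  --   recovery_target_action = 'shutdown'
  -- plus an empty recovery.signal file.

If the target is reached before pg_ctl finishes polling, the postmaster
shuts down cleanly and pg_ctl reports a startup failure even though
nothing went wrong.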


-- 
Regards
Junwang Zhao


0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch
Description: Binary data


Re: pg_upgrade and logical replication

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 01:14:05PM +1100, Peter Smith wrote:
> Looks like v12 accidentally forgot to update this to the modified
> function name 'binary_upgrade_add_sub_rel_state'

This v12 is overall cleaner than its predecessors.  Nice to see.

+my $result = $publisher->safe_psql('postgres', "SELECT count(*) FROM t1");
+is($result, qq(1), "check initial t1 table data on publisher");
+$result = $publisher->safe_psql('postgres', "SELECT count(*) FROM t2");
+is($result, qq(1), "check initial t2 table data on publisher");
+$result = $old_sub->safe_psql('postgres', "SELECT count(*) FROM t1");
+is($result, qq(1), "check initial t1 table data on the old subscriber");
+$result = $old_sub->safe_psql('postgres', "SELECT count(*) FROM t2");

I'd argue that t1 and t2 should have less generic names.  t1 is used
to check that the upgrade process works, while t2 is added to the
publication after upgrading the subscriber.  Say something like
tab_upgraded or tab_not_upgraded?

+my $synced_query =
+  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN 
('r');";

Perhaps it would be safer to use a query that checks the number of
relations in 'r' state?  This query would return true if
pg_subscription_rel has no tuples.
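
Something like this, say, with the expected number of synced relations
stated explicitly (the count here is hypothetical):

  SELECT count(*) = 2 FROM pg_subscription_rel WHERE srsubstate = 'r';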

+# Table will be in 'd' (data is being copied) state as table sync will fail
+# because of primary key constraint error.
+my $started_query =
+  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd';";

Relying on a pkey error to enforce an incorrect state is a good trick.
Nice.

+command_fails(
+[
+'pg_upgrade', '--no-sync','-d', $old_sub->data_dir,
+'-D', $new_sub->data_dir, '-b', $bindir,
+'-B', $bindir,'-s', $new_sub->host,
+'-p', $old_sub->port, '-P', $new_sub->port,
+$mode,'--check',
+],
+'run of pg_upgrade --check for old instance with relation in \'d\' datasync(invalid) state');
+rmtree($new_sub->data_dir . "/pg_upgrade_output.d");

Okay by me to not stop the cluster for the --check to shave a few
cycles.  It's a bit sad that we don't cross-check the contents of
subscription_state.txt before removing pg_upgrade_output.d.  Finding
the file is easy even if the subdir where it is included is not a
constant name.  Then it is possible to apply a regexp with the
contents consumed by a slurp_file().

+my $remote_lsn = $old_sub->safe_psql('postgres',
+"SELECT remote_lsn FROM pg_replication_origin_status");
Perhaps you've not noticed, but this would be 0/0 most of the time.
However, the intention is to check for a valid LSN to make sure that
the origin is set, no?

I am wondering whether this should use a bit more data than just one
tuple, say at least two transactions, one of them with a multi-value
INSERT?
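
For example, something like this (reusing the tab_upgraded name
suggested above):

  INSERT INTO tab_upgraded VALUES (1);
  -- a second transaction, with a multi-value INSERT:
  INSERT INTO tab_upgraded VALUES (2), (3), (4);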

+# --
+# Check that pg_upgrade is successful when all tables are in ready state.
+# --
This comment is a bit inconsistent with the states that are accepted,
but why not, at least that's predictable.

+* relation to pg_subscripion_rel table. This will be used only in 

Typo: s/pg_subscripion_rel/pg_subscription_rel/.

This needs some word-smithing to explain the reasons why a state is
not needed:

+/*
+ * The subscription relation should be in either i (initialize),
+ * r (ready) or s (synchronized) state as either the replication slot
+ * is not created or the replication slot is already dropped and the
+ * required WAL files will be present in the publisher. The other
+ * states are not ok as the worker has dependency on the replication
+ * slot/origin in these case:

A slot not created yet refers to the 'i' state, while 'r' and 's'
refer to a slot created previously but already dropped, right?
Shouldn't this comment tell that rather than mixing the assumptions?

+ * a) SUBREL_STATE_DATASYNC: In this case, the table sync worker will
+ * try to drop the replication slot but as the replication slots will
+ * be created with old subscription id in the publisher and the
+ * upgraded subscriber will not be able to clean the slots in this
+ * case.

Proposal: A relation upgraded while in this state would retain a
replication slot, which could not be dropped by the sync worker
spawned after the upgrade because the subscription ID tracked by the
publisher does not match anymore.

Note: actually, this would be OK if we are able to keep the OIDs of
the subscribers consistent across upgrades?  I'm OK with not doing
anything about that in this patch, to keep it simpler.  Just asking in passing.

+ * b) SUBREL_STATE_FINISHEDCOPY: In this case, the tablesync worker will
+ * expect the origin to be already existing as the origin is created
+ * with an old subscription id, tablesync worker will not be able to
+ * find the origin in this case.


Re: A recent message added to pg_upgade

2023-11-08 Thread Kyotaro Horiguchi
At Thu, 9 Nov 2023 12:00:59 +0530, Amit Kapila  wrote in
> I have also proposed that as one of the alternatives but didn't get
> many votes. And, I think if the user is passing a special value of
> max_slot_wal_keep_size during the upgrade, it has to be a special
> case, and rejecting it upfront doesn't seem unreasonable to me.

Oops. Sorry, and understood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: RFC: Logging plan of the running query

2023-11-08 Thread torikoshia

On 2023-11-06 15:32, Ashutosh Bapat wrote:

Thanks for the suggestion and example.


On Fri, Nov 3, 2023 at 7:31 PM Ashutosh Bapat
 wrote:


I have following questions related to the functionality. (Please point
me to the relevant discussion if this has been already discussed.)

1. When a backend is running nested queries, we will see the plan of
the innermost query. That query may not be the actual culprit if the
user query is running slowly. E.g a query being run as part of inner
side nested loop when the nested loop itself is the bottleneck. I
think it will be useful to print plans of the whole query stack.


This was discussed in previous threads[1] and we thought it'd be useful
but since it needed some extra work, we stopped widening the scope.



To further explain this point, consider following scenario

-- A completely useless function which executes two SQL statements
which take small amount individually
#create function multstmt() returns int
language sql as $$
select count(*) from pg_class where reltype = 12345;
select count(*) from pg_type limit 10;
$$;

-- force a suboptimal plan
#set enable_hashjoin to false;
#set enable_mergejoin to false;

-- A completely useless but long running query
#select c.oid, t.oid from pg_class c, pg_type t where multstmt(c.oid)
= multstmt(t.oid);

This takes a few minutes on my laptop.

In a separate session, query pg_stat_activity. We will see the original query:
#select pid, query, backend_type from pg_stat_activity where pid = 349330;

  pid   |                                           query                                          |  backend_type
--------+------------------------------------------------------------------------------------------+----------------
 349330 | select c.oid, t.oid from pg_class c, pg_type t where multstmt(c.oid) = multstmt(t.oid); | client backend
(1 row)

Run the plan output function a few times
#select pg_log_query_plan(349330);

You will observe different plans based on which innermost query is
running:
LOG:  query plan running on backend with PID 349330 is:
Query Text:
select count(*) from pg_class where reltype = typeid;
select count(*) from pg_type where oid = typeid;

Query Parameters: $1 = '600'
Aggregate  (cost=18.18..18.19 rows=1 width=8)
  Output: count(*)
  ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..18.18 rows=2 width=0)

Output: oid, relname, relnamespace, reltype,
reloftype, relowner, relam, relfilenode, reltablespace, relpages,
reltuples, relallvisible, reltoastrelid, relhasindex, relisshared,
relpersistence, relkind, relnatts, relchecks, relhasrules,
relhastriggers, relhassubclass, relrowsecurity, relforcerowsecurity,
relispopulated, relreplident, relispartition, relrewrite,
relfrozenxid, relminmxid, relacl, reloptions, relpartbound
Filter: (pg_class.reltype = $1)
Settings: enable_hashjoin = 'off', enable_mergejoin = 'off'
2023-11-06 11:52:25.610 IST [349330] LOG:  query plan running on
backend with PID 349330 is:
Query Text:
select count(*) from pg_class where reltype = typeid;
select count(*) from pg_type where oid = typeid;

Query Parameters: $1 = '2203'
Aggregate  (cost=4.30..4.31 rows=1 width=4)
  Output: count(*)
  ->  Index Only Scan using pg_type_oid_index on
pg_catalog.pg_type  (cost=0.28..4.29 rows=1 width=0)
Output: oid
Index Cond: (pg_type.oid = $1)
Settings: enable_hashjoin = 'off', enable_mergejoin = 'off'

Individual pieces are confusing here since the query run by the
backend is not the one being EXPLAINed. A user may not know that the
queries being EXPLAINed arise from the function call multstmt(). So
they won't be able to stitch the pieces together unless they see plan
of the outermost query with loops and costs. What might help is if we
explain each query in the hierarchy.

I think we can start with what auto_explain is doing. Always print the
plan of the outermost query; the query found in pg_stat_activity. In a
later version we might find a way to print plans of all the queries in
the stack and do so in a readable manner.


Agreed, there are cases where printing the plan of the outermost query
is more useful.




This makes tracking activeQueryDesc a bit tricky. My guess is that the
outermost query's descriptor will be available through ActivePortal
most of the time. But there are cases when ExecutorRun is called by
passing a queryDesc directly. So may be there are some cases where
that's not true.


Yeah, actually the original version of the patch got the plan from
ActivePortal, but it failed to log the plan when the query was
something like this[2]:


 DO $$
 BEGIN
 PERFORM pg_sleep(100);
 END$$;


2. When a query is running in a parallel worker, do we want to print that
query? It may or may not be interesting given the situation. If the
overall plan itself is faulty, output of the parallel worker 

Re: A recent message added to pg_upgade

2023-11-08 Thread Amit Kapila
On Thu, Nov 9, 2023 at 11:40 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 9 Nov 2023 09:53:07 +0530, Amit Kapila  wrote in
> > Michael, Horiguchi-San, and others, do you have any thoughts on what
> > is the best way to proceed?
>
> As I previously mentioned, I believe that if rejection is to be the
> course of action, it would be best to proceed with it sooner rather
> than later. On the other hand, I am concerned about the need for users
> to perform extra steps depending on the source cluster
> configuration. Therefore, another possible approach could be to
> simply ignore the given settings in the assignment hook rather than
> rejecting by the check hook, and forcibly apply -1.
>
> What do you think about this third approach?
>

I have also proposed that as one of the alternatives but didn't get
many votes. And, I think if the user is passing a special value of
max_slot_wal_keep_size during the upgrade, it has to be a special
case, and rejecting it upfront doesn't seem unreasonable to me.

-- 
With Regards,
Amit Kapila.




Re: pg_rewind WAL segments deletion pitfall

2023-11-08 Thread torikoshia

On 2023-11-06 23:58, Alexander Kukushkin wrote:

Hi Torikoshia,

On Thu, 2 Nov 2023 at 04:24, torikoshia 
wrote:


+extern void preserve_file(char *filepath);


Is this necessary?
This function was defined in older version patch, but no longer
seems to
exist.

+# We use "perl -e 'exit(1)'" as a alternative to "false", because
the
last one
'a' should be 'an'?


Thanks for the feedback

Please find the new version attached.

Thanks for the update!

I've set the CF entry to "Ready for Committer".

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: pg_upgrade and logical replication

2023-11-08 Thread Amit Kapila
On Wed, Nov 8, 2023 at 10:52 PM vignesh C  wrote:
>
> Upgrading 2 node circular logical replication cluster:
> 1) Let's say we have a circular logical replication setup Node1->Node2
> & Node2->Node1. Here Node2 is subscribing to Node1 and Node1 is
> subscribing to Node2.
> 2) Stop the server in Node1.
> 3) Disable the subscriptions in Node2.
> 4) Upgrade the node Node1 to Node1_new.
> 5) Start the node Node1_new.
> 6) Enable the subscriptions in Node1_new.
> 7) Wait till all the incremental changes are synchronized.
> 8) Alter the subscription connections in Node2 to point from Node1 to 
> Node1_new.
> 9) Create any tables that were created in Node2 between step-2 and now
> and Refresh the publications.
>
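
In SQL terms, steps 8 and 9 amount to roughly the following (the
subscription names and connection string here are invented):

  -- step 8, on Node2:
  ALTER SUBSCRIPTION sub_node2_node1
    CONNECTION 'host=node1_new port=5432 dbname=postgres';
  -- step 9, on Node1_new, after creating the missing tables:
  ALTER SUBSCRIPTION sub_node1_node2 REFRESH PUBLICATION;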

I haven't reviewed all the steps yet but here steps 7 and 9 seem to
require some validation. How can incremental changes be synchronized
till all the new tables are created and synced before step 7?

-- 
With Regards,
Amit Kapila.




Re: A recent message added to pg_upgade

2023-11-08 Thread Kyotaro Horiguchi
At Thu, 9 Nov 2023 09:53:07 +0530, Amit Kapila  wrote in
> Michael, Horiguchi-San, and others, do you have any thoughts on what
> is the best way to proceed?

As I previously mentioned, I believe that if rejection is to be the
course of action, it would be best to proceed with it sooner rather
than later. On the other hand, I am concerned about the need for users
to perform extra steps depending on the source cluster
configuration. Therefore, another possible approach could be to
simply ignore the given settings in the assignment hook rather than
rejecting by the check hook, and forcibly apply -1.

What do you think about this third approach?

I haven't checked this with pg_upgrade, but a standalone postmaster
would emit the following messages.

> postgres -b -c max_slot_wal_keep_size=-1
> LOG:  "max_slot_wal_keep_size" is forced to -1 during binary upgrade mode.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..319d4f5a81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2063,6 +2063,31 @@ check_wal_segment_size(int *newval, void **extra, GucSource source)
return true;
 }
 
+/*
+ * GUC check_hook for max_slot_wal_keep_size
+ *
+ * If WALs needed by logical replication slots are deleted, these slots become
+ * inoperable. During a binary upgrade, pg_upgrade sets this variable to -1 via
+ * the command line in an attempt to prevent such deletions, but users have
+ * ways to override it. To ensure the successful completion of the upgrade,
+ * it's essential to keep this variable unaltered.  See
+ * InvalidatePossiblyObsoleteSlot() and start_postmaster() in pg_upgrade for
+ * more details.
+ */
+bool
+check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
+{
+   if (IsBinaryUpgrade && *newval != -1)
+   {
+   ereport(LOG,
+			errmsg("\"%s\" is forced to -1 during binary upgrade mode.",
+  "max_slot_wal_keep_size"));
+   *newval = -1;
+   }
+
+   return true;
+}
+
 /*
  * At a checkpoint, how many WAL segments to recycle as preallocated future
  * XLOG segments? Returns the highest segment that should be preallocated.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 99823df3c7..5c3d2b1082 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1424,18 +1424,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 	SpinLockRelease(&s->mutex);
 
/*
-* The logical replication slots shouldn't be invalidated as
-* max_slot_wal_keep_size GUC is set to -1 during the upgrade.
-*
-* The following is just a sanity check.
+	 * check_max_slot_wal_keep_size() ensures max_slot_wal_keep_size is set
+	 * to -1, so, slot invalidation for logical slots shouldn't happen
+	 * during an upgrade. At present, only logical slots really require
+	 * this.
 */
-   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
-   {
-		ereport(ERROR,
-			errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-			errmsg("replication slots must not be invalidated during the upgrade"),
-			errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
-   }
+   Assert (!(*invalidated && SlotIsLogical(s) && IsBinaryUpgrade));
 
if (active_pid != 0)
{
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..818ffdb2ae 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2845,7 +2845,7 @@ struct config_int ConfigureNamesInt[] =
},
 	&max_slot_wal_keep_size_mb,
-1, -1, MAX_KILOBYTES,
-   NULL, NULL, NULL
+   check_max_slot_wal_keep_size, NULL, NULL
},
 
{
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2a191830a8..3d74483f44 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -84,6 +84,8 @@ extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 extern void assign_maintenance_io_concurrency(int newval, void *extra);
 extern bool check_max_connections(int *newval, void **extra, GucSource source);
 extern bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
+										 GucSource source);

Re: Relids instead of Bitmapset * in plannode.h

2023-11-08 Thread Tom Lane
Ashutosh Bapat  writes:
> On Tue, Nov 7, 2023 at 8:54 PM Tom Lane  wrote:
>> Yes, I'm pretty sure that's exactly the reason, and I'm strongly
>> against the initially-proposed patch.  The include footprint of
>> pathnodes.h would be greatly expanded, for no real benefit.

> As I mentioned in [1] the Bitmapset implementation is not space
> efficient to be used as Relids when there are thousands of partitions.

TBH, I'd be very strongly against "optimizing" that case by adopting a
data structure that is less efficient for typical rangetable sizes.
I continue to think that anybody who is using that many partitions
is Doing It Wrong and has no excuse for thinking it'll be free.
Moreover, the size of their relid sets is pretty unlikely to be
their worst pain point.

In any case, that is a poor argument for weakening the separation
between planner and executor.  When and if somebody comes up with
a credible replacement for bitmapsets here, we can consider what
we want to do in terms of header-file organization --- but I do
not think including pathnodes.h into executor files will be it.

regards, tom lane




Re: A recent message added to pg_upgade

2023-11-08 Thread Peter Smith
On Thu, Nov 9, 2023 at 3:55 PM Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 09:53:07AM +0530, Amit Kapila wrote:
> > On Tue, Nov 7, 2023 at 4:16 PM Amit Kapila  wrote:
> >> But then we don't need the hardcoded value of
> >> max_logical_replication_workers as zero by pg_upgrade. I think doing
> >> IsBinaryUpgrade for slots won't be neat, so we anyway need to keep
> >> using the special value of max_slot_wal_keep_size GUC. Though the
> >> handling for both won't be the same but I guess given the situation,
> >> that seems like a reasonable thing to do. If we follow that then we
> >> can have this special GUC hook only for max_slot_wal_keep_size GUC.
> >
> > Michael, Horiguchi-San, and others, do you have any thoughts on what
> > is the best way to proceed?
>
> No problem for me to use a GUC hook for the WAL retention GUCs if you
> feel strongly about it at the end, but I'd rather use an approach
> based on IsBinaryUpgrade for the logical worker launcher to be
> consistent with autovacuum (where there's also an argument to refactor
> it to use a bgworker registration, centralizing the checks on
> IsBinaryUpgrade for all bgworkers, but that would be material for a
> different thread, if there's interest in doing that).
>
> The two situations we are trying to prevent (slot invalidation and
> bgworker launch) can be triggered under different contexts, so they
> don't have to use the same mechanisms to prevent what should not
> happen during an upgrade.
> --

Having a GUC hook for the "max_slot_wal_keep_size" seemed OK to me. If
the user overrides a GUC value (admittedly, maybe there is no reason
why they would want to) then at least the hook will give an error,
rather than us silently overwriting the user's value with -1.

So, patch v4 LGTM, except it is better to include a test case.

~

Meanwhile, if preventing the apply worker launch is considered better
to be implemented differently in ApplyLauncherRegister, then so be it.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Relids instead of Bitmapset * in plannode.h

2023-11-08 Thread Ashutosh Bapat
On Tue, Nov 7, 2023 at 8:54 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2023-Oct-31, Ashutosh Bapat wrote:
> >> For some reason plannode.h has declared variable to hold RTIs as
> >> Bitmapset * instead of Relids like other places. Here's patch to fix
> >> it. This is superficial change as Relids is typedefed to Bitmapset *.
> >> Build succeeds for me and also make check passes.
>
> > I think the reason for having done it this way, was mostly to avoid
> > including pathnodes.h in plannodes.h.
>
> Yes, I'm pretty sure that's exactly the reason, and I'm strongly
> against the initially-proposed patch.  The include footprint of
> pathnodes.h would be greatly expanded, for no real benefit.
> Among other things, that fuzzes the distinction between planner
> modules and non-planner modules.

As I mentioned in [1] the Bitmapset implementation is not space
efficient to be used as Relids when there are thousands of partitions.
I was assessing all usages of Bitmapset to find if there are other
places where this is an issue. That's when I noticed this. At some
point in the future (possibly quite near) when queries will involve
thousands of relations (partitions or otherwise), we will need to
implement Relids in a more space-efficient way. Having all Relids usages
of Bitmapset labelled as Relids will help us then. If we don't want to
add pathnodes.h to plannodes.h, there will be more work to identify
Relids usage. That shouldn't be more than a couple of days' work, so it's OK.

Other possibilities are
1. Define Relids in bitmapset.h itself and use Relids everywhere
Bitmapset * is really Relids. Wherever Relids is used, bitmapset.h must
have been included one way or another. That's a bigger churn.

2. Replace RTIs with Relids in the comments and add the following
comment somewhere near the #include section. "The Relids members in
various structures in this file have been declared as Bitmapset * to
avoid including pathnodes.h in this file. This include has greatly
expanded footprint for no real benefit.".

3. Do nothing right now. If and when we implement Relids as a separate
data structure, it will get its own module. We will be able to place it
somewhere properly.

I have no additional comments on other patches.

[1] 
https://www.postgresql.org/message-id/CAExHW5s4EqY43oB%3Dne6B2%3D-xLgrs9ZGeTr1NXwkGFt2j-OmaQQ%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat




Re: GUC names in messages

2023-11-08 Thread Kyotaro Horiguchi
At Thu, 9 Nov 2023 12:55:44 +1100, Peter Smith  wrote in 
> The most common pattern there is "You might need to increase %s.".
..
> Here is a patch to modify those other similar variations so they share
> that common wording.
> 
> PSA.

I'm uncertain whether the phrases "Consider doing something" and "You
might need to do something" are precisely interchangeable. However,
(to me, at least) it appears that either phrase could be applied to all
messages that the patch touches.

In short, I'm fine with the patch.


By the way, I was left scratching my head after seeing the following
message.

>ereport(PANIC,
>(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
> -   errmsg("could not find free replication state, increase max_replication_slots")));

Being told to increase max_replication_slots in a PANIC
message feels somewhat off to me. Just looking at the message, it
seems unconvincing to increase "slots" because there is a lack of
"state". So, I poked around in the code and found the following
comment:

> ReplicationOriginShmemSize(void)
> {
> ...
> /*
>  * XXX: max_replication_slots is arguably the wrong thing to use, as here
>  * we keep the replay state of *remote* transactions. But for now it seems
>  * sufficient to reuse it, rather than introduce a separate GUC.
>  */

I haven't read the related code, but if my understanding based on this
comment is correct, wouldn't it mean that a lack of storage space for
the state at the location outputting the message indicates a bug in
the program, not a user configuration error? In other words, isn't
this message something that at least shouldn't be a user-facing
message, and might it be more appropriate to use an Assert instead?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A recent message added to pg_upgade

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 09:53:07AM +0530, Amit Kapila wrote:
> On Tue, Nov 7, 2023 at 4:16 PM Amit Kapila  wrote:
>> But then we don't need the hardcoded value of
>> max_logical_replication_workers as zero by pg_upgrade. I think doing
>> IsBinaryUpgrade for slots won't be neat, so we anyway need to keep
>> using the special value of max_slot_wal_keep_size GUC. Though the
>> handling for both won't be the same but I guess given the situation,
>> that seems like a reasonable thing to do. If we follow that then we
>> can have this special GUC hook only for max_slot_wal_keep_size GUC.
> 
> Michael, Horiguchi-San, and others, do you have any thoughts on what
> is the best way to proceed?

No problem for me to use a GUC hook for the WAL retention GUCs if you
feel strongly about it at the end, but I'd rather use an approach
based on IsBinaryUpgrade for the logical worker launcher to be
consistent with autovacuum (where there's also an argument to refactor
it to use a bgworker registration, centralizing the checks on
IsBinaryUpgrade for all bgworkers, but that would be material for a
different thread, if there's interest in doing that).

The two situations we are trying to prevent (slot invalidation and
bgworker launch) can be triggered under different contexts, so they
don't have to use the same mechanisms to prevent what should not
happen during an upgrade.
--
Michael




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread torikoshia

On Thu, Nov 09, 2023 at 10:10:39AM +0900, torikoshia wrote:

I'll attach the patch.

Attached.

On Mon, Nov 6, 2023 at 5:30 PM Bharath Rupireddy

3. I think the new reset all stats function must also consider
resetting all SLRU stats, no?
/* stats for fixed-numbered objects */
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,


PGSTAT_KIND_SLRU cannot be reset by pg_stat_reset_shared(), so I feel
uncomfortable deleting it altogether.
It might be better after pg_stat_reset_shared() has been modified to
take 'slru' as an argument, though.



On Wed, Nov 8, 2023 at 1:13 PM Andres Freund  wrote:
It's not like oids are a precious resource. It's a more confusing API to
have to specify a NULL as an argument than not having to do so. If we
really want to avoid a separate oid, a more sensible path would be to
add a default argument to pg_stat_reset_slru() (by doing a CREATE OR
REPLACE in system_functions.sql).


Currently proisstrict is true and pg_stat_reset_shared() returns null
without doing any work.
I thought it would be better to reset statistics even when NULL is
specified, so that users are not confused by the behavior of
pg_stat_reset_slru().
The attached patch adds pg_stat_reset_shared() in system_functions.sql
mainly for this reason.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From 931bdac6e70b681e3416bbf464cbc6f42f971c6c Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 9 Nov 2023 13:41:51 +0900
Subject: [PATCH v1] Add ability to reset all statistics to
 pg_stat_reset_shared()

Currently pg_stat_reset_shared() requires its argument to specify
statistics to be reset, and there is no way to reset all statistics
at the same time.
This patch makes it possible when no argument or NULL is specified.
---
 doc/src/sgml/monitoring.sgml |  4 +++-
 src/backend/catalog/system_functions.sql |  7 +++
 src/backend/utils/adt/pgstatfuncs.c  | 19 +--
 src/include/catalog/pg_proc.dat  |  5 +++--
 src/test/regress/expected/stats.out  | 14 +++---
 src/test/regress/sql/stats.sql   | 14 +++---
 6 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e068f7e247..1cddc7d238 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4716,7 +4716,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 void


-Resets some cluster-wide statistics counters to zero, depending on the
+Resets cluster-wide statistics counters to zero, depending on the
 argument.  The argument can be bgwriter to reset
 all the counters shown in
 the pg_stat_bgwriter view,
@@ -4730,6 +4730,8 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 pg_stat_wal view or
 recovery_prefetch to reset all the counters shown
 in the pg_stat_recovery_prefetch view.
+If the argument is NULL or not specified, all counters shown in all
+of these views are reset.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 35d738d576..8079f1cd7f 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -621,6 +621,13 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE FUNCTION
+  pg_stat_reset_shared(target text DEFAULT NULL)
+RETURNS void
+LANGUAGE INTERNAL
+CALLED ON NULL INPUT VOLATILE PARALLEL SAFE
+AS 'pg_stat_reset_shared';
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1fb8b31863..e3d78b9665 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1677,7 +1677,7 @@ pg_stat_reset(PG_FUNCTION_ARGS)
 }
 
 /*
- * Reset some shared cluster-wide counters
+ * Reset shared cluster-wide counters
  *
  * When adding a new reset target, ideally the name should match that in
  * pgstat_kind_infos, if relevant.
@@ -1685,7 +1685,22 @@ pg_stat_reset(PG_FUNCTION_ARGS)
 Datum
 pg_stat_reset_shared(PG_FUNCTION_ARGS)
 {
-	char	   *target = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	char	   *target = NULL;
+
+	if (PG_ARGISNULL(0))
+	{
+		/* Reset all the statistics which can be specified by the argument */
+		pgstat_reset_of_kind(PGSTAT_KIND_ARCHIVER);
+		pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
+		pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
+		pgstat_reset_of_kind(PGSTAT_KIND_IO);
+		pgstat_reset_of_kind(PGSTAT_KIND_WAL);
+		

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread zhihuifan1213


Tom Lane  writes:

> Daniel Gustafsson  writes:
>>> On 8 Nov 2023, at 19:18, Tom Lane  wrote:
>>> I think an actually usable feature of this sort would involve
>>> copying all the failed lines to some alternate output medium,
>>> perhaps a second table with a TEXT column to receive the original
>>> data line.  (Or maybe an array of text that could receive the
>>> broken-down field values?)  Maybe we could dump the message info,
>>> line number, field name etc into additional columns.
>
>> I agree that the errors should be easily visible to the user in some way.  
>> The
>> feature is for sure interesting, especially in data warehouse type jobs where
>> dirty data is often ingested.
>
> I agree it's interesting, but we need to get it right the first time.
>
> Here is a very straw-man-level sketch of what I think might work.
> The option to COPY FROM looks something like
>
>   ERRORS TO other_table_name (item [, item [, ...]])
>
> where the "items" are keywords identifying the information item
> we will insert into each successive column of the target table.
> This design allows the user to decide which items are of use
> to them.  I envision items like

While I'm pretty happy with the overall design, especially the 'ERRORS TO
other_table_name' part, I'm a bit confused about why we need to write the
code for (item [, item [, ...]]): not only does it require more coding,
it also requires the user to make more decisions. Would anything be wrong
with making all of them the default?

> LINENO    bigint   COPY line number, counting from 1
> LINE      text     raw text of line (after encoding conversion)
> FIELDS    text[]   separated, de-escaped string fields (the data
>                    that was or would be fed to input functions)
> FIELD     text     name of troublesome field, if field-specific
> MESSAGE   text     error message text
> DETAIL    text     error message detail, if any
> SQLSTATE  text     error SQLSTATE code
>
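
To make that concrete, usage of the straw-man syntax might look like
this (nothing here is implemented; all names are invented):

  CREATE TABLE copy_errors (lineno bigint, line text,
                            message text, sqlstate text);

  COPY mytab FROM '/tmp/load.csv' WITH (FORMAT csv)
    ERRORS TO copy_errors (LINENO, LINE, MESSAGE, SQLSTATE);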


-- 
Best Regards
Andy Fan





Re: A recent message added to pg_upgade

2023-11-08 Thread Amit Kapila
On Tue, Nov 7, 2023 at 4:16 PM Amit Kapila  wrote:
>
> On Tue, Nov 7, 2023 at 8:12 AM Michael Paquier  wrote:
> >
> > On Tue, Nov 07, 2023 at 07:59:46AM +0530, Amit Kapila wrote:
> > > Do you mean to say that if 'IsBinaryUpgrade' is true then let's not
> > > allow to launch launcher or apply worker? If so, I guess this won't be
> > > any better than prohibiting at an early stage or explicitly overriding
> > > those with internal values and documenting it, at least that way we
> > > can be consistent for both variables (max_logical_replication_workers
> > > and max_slot_wal_keep_size).
> >
> > Yes, I mean to paint an extra IsBinaryUpgrade before registering the
> > apply worker launcher.  That would be consistent with what we do for
> > autovacuum in the postmaster.
> >
>
> But then we don't need the hardcoded value of
> max_logical_replication_workers as zero by pg_upgrade. I think doing
> IsBinaryUpgrade for slots won't be neat, so we anyway need to keep
> using the special value of max_slot_wal_keep_size GUC. Though the
> handling for both won't be the same but I guess given the situation,
> that seems like a reasonable thing to do. If we follow that then we
> can have this special GUC hook only for max_slot_wal_keep_size GUC.
>

Michael, Horiguchi-San, and others, do you have any thoughts on what
is the best way to proceed?

-- 
With Regards,
Amit Kapila.




Re: Parallel query behaving different with custom GUCs

2023-11-08 Thread Michael Paquier
On Mon, Oct 30, 2023 at 09:21:56AM -0400, Robert Haas wrote:
> I'm also alert to my own possible bias. Perhaps since I designed this
> mechanism, I'm more prone to viewing its deficiencies as minor than a
> neutral observer would be. So if anyone is sitting there reading this
> and thinking "wow, I can't believe Robert doesn't think it's important
> to fix this," feel free to write back and say so.

Fun.  Agreed that this is a bug, and that the consequences are nil
for most users.  And it took 7 years to find it.

If I may ask, is there an impact on functions that include SET
clauses with custom parameters that are parallel safe?  That said,
if the fix is simple, I see no reason not to do something about it,
even if that's HEAD-only.
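
By a function with a SET clause and a custom parameter, I mean
something like the following (the names are invented):

  CREATE FUNCTION f(x integer) RETURNS integer
    LANGUAGE sql PARALLEL SAFE
    SET myext.threshold = '42'
  AS 'SELECT x + 1';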
--
Michael




Re: Doubled test for SET statements in pg_stat_statements tests

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 10:33:23AM +0300, Sergei Kornilov wrote:
> is checked twice in contrib/pg_stat_statements/sql/utility.sql on
> lines 278-286 and 333-341. Is this on any purpose? I think the
> second set of tests is not needed and can be removed, as in the
> attached patch.

Thanks, applied.  This looks like a copy-paste mistake coming from
de2aca288569, even if it has added more scenarios for patterns around
SET.
--
Michael




Re: Synchronizing slots from primary to standby

2023-11-08 Thread Amit Kapila
On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila  wrote:
>
> On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
>  wrote:
> >
> > > Unrelated to above, if there is a user slot on standby with the same
> > > name which the slot-sync worker is trying to create, then shall it
> > > emit a warning and skip the sync of that slot or shall it throw an
> > > error?
> > >
> >
> > I'd vote for emit a warning and move on to the next slot if any.
> >
>
> But then it could take time for users to know the actual problem and
> they probably notice it after failover. OTOH, if we throw an error
> then probably they will come to know earlier because the slot sync
> mechanism would be stopped. Do you have reasons to prefer giving a
> WARNING and skipping creating such slots? I expect this WARNING to
> keep getting repeated in LOGs because the consecutive sync tries will
> again generate a WARNING.
>

Apart from the above, I would like to discuss the slot sync work
distribution strategy of this patch. The current implementation as
explained in the commit message [1] works well if the slots belong to
multiple databases. It is clear from the data in emails [2][3][4] that
having more workers really helps if the slots belong to multiple
databases. But I think if all the slots belong to one or very few
databases then such a strategy won't be as good. Now, on one hand, we
get very good numbers for a particular workload with the strategy used
in the patch but OTOH it may not be adaptable to various different
kinds of workloads. So, I have a question whether we should try to
optimize this strategy for various kinds of workloads or for the first
version let's use a single-slot sync-worker and then we can enhance
the functionality in later patches either in PG17 itself or in PG18 or
later versions. One thing to note is that a lot of the complexity of
the patch is attributed to the multi-worker strategy which may still
not be efficient, so there is an argument to go with a simpler
single-slot sync-worker strategy and then enhance it in future
versions as we learn more about various workloads. It will also help
to develop this feature incrementally instead of doing all the things
in one go and taking a much longer time than it should.

Thoughts?

[1] - "The replication launcher on the physical standby queries
primary to get the list of dbids for failover logical slots. Once it
gets the dbids, if dbids < max_slotsync_workers, it starts only that
many workers, and if dbids > max_slotsync_workers, it starts
max_slotsync_workers and divides the work equally among them. Each
worker is then responsible to keep on syncing the logical slots
belonging to the DBs assigned to it.

Each slot-sync worker will have its own dbids list. Since the upper
limit of this dbid-count is not known, it needs to be handled using
dsa. We initially allocated memory to hold 100 dbids for each worker.
If this limit is exhausted, we reallocate this memory with size
incremented again by 100."

[2] - 
https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com
[3] - 
https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com
[4] - 
https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 12:04:19PM +0900, Michael Paquier wrote:
> Sure, sorry for the confusion.  By "we'd do nothing", I mean precisely
> "to take no specific action related to archive recovery and recovery
> parameters at the end of recovery", meaning that a combination of
> backup_label with no signal file would be the same as crash recovery,
> replaying WAL up to the end of what can be found in pg_wal/, and only
> that.

To be slightly more precise: I also mean to fail recovery if it is
not possible to replay up to the end-of-backup LSN marked in the label
file because we are missing some stuff in pg_wal/, which is something
that the code is currently able to handle.
--
Michael


signature.asc
Description: PGP signature


Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 01:16:58PM -0500, Robert Haas wrote:
> On Tue, Oct 31, 2023 at 7:39 PM Michael Paquier  wrote:
>>>> As you're telling me, and I've considered that as an option as well,
>>>> perhaps we should just consider the presence of a backup_label file
>>>> with no .signal files as a synonym of crash recovery?  In the recovery
>>>> path, currently the essence of the problem is when we do
>>>> InArchiveRecovery=true, but ArchiveRecoveryRequested=false, meaning
>>>> that it should do archive recovery but we don't want it, and that does
>>>> not really make sense.  The rest of the code sort of implies that this
>>>> is not a supported combination.  So basically, my suggestion here, is
>>>> to just replay WAL up to the end of what's in your local pg_wal/ and
>>>> hope for the best, without TLI jumps, except that we'd do nothing.
>>>
>>> This sentence seems to be incomplete.
>>
>> I've re-read it, and it looks OK to me.
> 
> Well, the sentence ends with "except that we'd do nothing" and I don't
> know what that means. It would make sense to me if it said "except
> that we'd do nothing about " or "except that we'd do nothing
> instead of " but as you've written it basically seems to
> boil down to "my suggestion is to replay WAL except do nothing" which
> makes no sense. If you replay WAL, you're not doing nothing.

Sure, sorry for the confusion.  By "we'd do nothing", I mean precisely
"to take no specific action related to archive recovery and recovery
parameters at the end of recovery", meaning that a combination of
backup_label with no signal file would be the same as crash recovery,
replaying WAL up to the end of what can be found in pg_wal/, and only
that.

>>> But I was not saying we should treat the case where we have a
>>> backup_label file like crash recovery. The real question here is why
>>> we don't treat it fully like archive recovery.
>>
>> Timeline jump at the end of recovery?  Archive recovery forces a TLI
>> jump by default at the end of redo if there's a signal file, and some
>> users may not want a TLI jump by default?
> 
> Uggh. I don't know what to think about that. I bet some people do want
> that, but that makes it pretty easy to end up with multiple copies of
> the same cluster running on the same TLI, too, which is not a thing
> that you really want to have happen.

Andres has mentioned upthread that this is something he's been using
to quickly be able to clone a cluster.  I would not recommend doing
that, personally, but if that's useful in some cases, well, why not.

> At the end of the day, I'm coming around to the view that the biggest
> problem here is the documentation. Nobody can really know what's
> supposed to work right now because the documentation doesn't say which
> things you are and are not allowed to do and what results you should
> expect in each case. If it did, it would be easier to discuss possible
> behavior changes. Right now, it's hard to change any code at all,
> because there's no list of supported scenarios, so you can't tell
> whether a potential change affects a scenario that somebody thinks
> should work, or only cases that nobody can possibly care about. It's
> sort of possible to reason your way through that, to an extent, but
> it's pretty hard. The fact that I didn't know that starting from a
> backup with neither recovery.signal nor standby.signal was a thing
> that anybody did or cared about is good evidence of that.

That's one problem, but not all of it, because the code makes extra
assumptions around that.

> I also feel like the terminology here sometimes obscures more than it
> illuminates. For instance, it seems like ArchiveRecoveryRequested
> really means "are any signal files present?" while InArchiveRecovery
> means "are we fetching WAL from outside pg_wal rather than using
> what's in pg_wal?". But these are not obvious from the names, and
> sometimes we have additional variables with overlapping meanings, like
> readSource, which indicates whether we're reading from pg_wal, the
> archive, or the walreceiver, and yet is probably not redundant with
> InArchiveRecovery. In any event, I think that we need to start with
> the question of what behavior(s) we want to expose to users, and then
> back into the question of what internal variables and states need to
> exist in order to support that behavior. We cannot start by deciding
> what variables we'd like to get rid of and then trying to justify the
> resulting behavior changes on the grounds that they simplify the code.
> Users aren't going to like that, hackers aren't going to like that,
> and the resulting behavior probably won't be anything great.

Note as well that InArchiveRecovery is set when there's a
backup_label, but that the code would check for the existence of a
restore_command only if a signal file exists.  That's strange, but if
people have been relying on this behavior, so be it.

At this stage, it looks pretty clear to me that there's no consensus
on what to do, and 

Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Peter Geoghegan
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost  wrote:
> In conversations with folks (my memory specifically is a discussion with
> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> there was a pretty strong push that a page should be able to 'stand
> alone' and not depend on something else (eg: pg_control, or whatever) to
> provide info needed to be able to interpret the page.  For my part, I don't
> have a particularly strong feeling on that, but that's what led to this
> design.

The term that I have used in the past is "self-contained". Meaning
capable of being decoded more or less as-is, without any metadata, by
tools like pg_filedump.

Any design in this area should try to make things as easy to debug as
possible, for the obvious reason: encrypted data that somehow becomes
corrupt is bound to be a nightmare to debug. (Besides, we already
support tools like pg_filedump, so this isn't a new principle.)

-- 
Peter Geoghegan




Re: Synchronizing slots from primary to standby

2023-11-08 Thread Amit Kapila
On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
 wrote:
>
> > Unrelated to above, if there is a user slot on standby with the same
> > name which the slot-sync worker is trying to create, then shall it
> > emit a warning and skip the sync of that slot or shall it throw an
> > error?
> >
>
> I'd vote for emit a warning and move on to the next slot if any.
>

But then it could take time for users to learn of the actual problem,
and they would probably notice it only after failover. OTOH, if we
throw an error, then they will probably come to know of it earlier
because the slot sync mechanism would be stopped. Do you have reasons
to prefer giving a WARNING and skipping the creation of such slots? I
expect this WARNING to keep getting repeated in the logs because each
consecutive sync attempt will generate it again.

-- 
With Regards,
Amit Kapila.




Re: Fix output of zero privileges in psql

2023-11-08 Thread Erik Wienhold
On 2023-11-08 13:23 +0100, Laurenz Albe wrote:
> I wonder how to proceed with this patch.  The main disagreement is
> whether default privileges should be displayed as NULL (less invasive,
> but more confusing for beginners) or "(default)" (more invasive,
> but nicer for beginners).

Are there any reports from beginners being confused about default
privileges being NULL or being displayed as a blank string in psql?
This is usually resolved with a pointer to the docs if it comes up in
discussions or the user makes the mental leap and checks the docs
himself.  Both patches add some details to the docs to explain psql's
output.

> David is for "(default)", Tom and me are for NULL, and I guess Erik
> would also prefer "(default)", since that was how his original
> patch did it, IIRC.  I think I could live with both solutions.
>
> Kind of a stalemate.  Who wants to tip the scales?

Yes I had a slight preference for my patch but I'd go with yours (\pset
null) now.  I followed the discussion after my last mail but had nothing
more to add that wasn't already said.  Tom then wrote that NULL is the
catalog's representation for the default privileges and obscuring that
fact in psql is not doing any service to the users.  This convinced me
because users may have to deal with aclitem[] being NULL anyway at some
point if they need to check privileges in more detail.  So it makes
perfect sense for psql to be transparent about that.
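
(As a concrete illustration of the \pset null route, assuming the patch
under discussion makes psql's privilege columns honor it:)

    CREATE TABLE mytab (id int);  -- no GRANT/REVOKE yet, so relacl is NULL
    \pset null '(default)'
    \dp mytab
    -- the Access privileges column now shows "(default)" instead of blank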

-- 
Erik




Re: Eager page freeze criteria clarification

2023-11-08 Thread Melanie Plageman
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund  wrote:
>
> A rough sketch of a freezing heuristic:
>
> - We concluded that to intelligently control opportunistic freezing we need
>   statistics about the number of freezes and unfreezes
>
>   - We should track page freezes / unfreezes in shared memory stats on a
> per-relation basis
>
>   - To use such statistics to control heuristics, we need to turn them into
> rates. For that we need to keep snapshots of absolute values at certain
> times (when vacuuming), allowing us to compute a rate.
>
>   - If we snapshot some stats, we need to limit the amount of data that
> this occupies
>
> - evict based on wall clock time (we don't care about unfreezing pages
>   frozen a month ago)
>
> - "thin out" data when exceeding limited amount of stats per relation
>   using random sampling or such
>
> - need a smarter approach than just keeping N last vacuums, as there are
>   situations where a table is (auto-) vacuumed at a high frequency
>
>
>   - only looking at recent-ish table stats is fine, because we
>  - a) don't want to look at too old data, as we need to deal with changing
>workloads
>
>  - b) if there aren't recent vacuums, falsely freezing is of bounded cost
>
>   - shared memory stats being lost on crash-restart/failover might be a 
> problem
>
> - we certainly don't want to immediate store these stats in a table, due
>   to the xid consumption that'd imply
>
>
> - Attributing "unfreezes" to specific vacuums would be powerful:
>
>   - "Number of pages frozen during vacuum" and "Number of pages unfrozen that
> were frozen during the same vacuum" provides numerator / denominator for
> an "error rate"
>
>   - We can perform this attribution by comparing the page LSN with recorded
> start/end LSNs of recent vacuums
>
>   - If the freezing error rate of recent vacuums is low, freeze more
> aggressively. This is important to deal with insert mostly workloads.
>
>   - If old data is "unfrozen", that's fine, we can ignore such unfreezes when
> controlling "freezing aggressiveness"
>
> - Ignoring unfreezing of old pages is important to e.g. deal with
>   workloads that delete old data
>
>   - This approach could provide "goals" for opportunistic freezing in a
> somewhat understandable way. E.g. aiming to rarely unfreeze data that has
> been frozen within 1h/1d/...

I have taken a stab at implementing the freeze stats tracking
infrastructure we would need to drive a heuristic based on error rate.

Attached is a series of patches which adds a ring buffer of vacuum
freeze statistics to each table's stats (PgStat_StatTabEntry).

It also introduces a guc: target_page_freeze_duration. This is the time
in seconds which we would like pages to stay frozen for. The aim is to
adjust opportunistic freeze behavior such that pages stay frozen for at
least target_page_freeze_duration seconds.

When a table is vacuumed the next entry is initialized in the ring
buffer (PgStat_StatTabEntry->frz_buckets[frz_current]). If all of the
buckets are in use, the two oldest buckets both ending either before or
after now - target_page_freeze_duration are combined.

When vacuum freezes a page, we increment the freeze counter in the
current PgStat_Frz entry in the ring buffer and update the counters used
to calculate the max, min, and average page age.

When a frozen page is modified, we increment "unfreezes" in the bucket
spanning the page freeze LSN. We also update the counters used to
calculate the min, max, and average frozen duration. Because we have
both the LSNs and time elapsed during the vacuum which froze the page,
we can derive the approximate time at which the page was frozen (using
linear interpolation) and use that to speculate how long it remained
frozen. If we are unfreezing it sooner than target_page_freeze_duration,
this is counted as an "early unfreeze". Using early unfreezes and page
freezes, we can calculate an error rate.

The guc, opp_freeze_algo, remains to allow us to test different freeze
heuristics during development. I included a dummy algorithm (algo 2) to
demonstrate what we could do with the data. If the error rate is above a
hard-coded threshold and the page is older than the average page age, it
is frozen. This is missing an "aggressiveness" knob.
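
(To experiment with the attached patches, the knobs would be set along
these lines; the GUC names are from the patch set, the values are
illustrative, and I am assuming both are user-settable:)

    SET target_page_freeze_duration = 3600;  -- aim to keep pages frozen >= 1 hour
    SET opp_freeze_algo = 2;                 -- the dummy error-rate-driven algorithm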

There are two workloads I think we can focus on. The first is a variant
of our former workload I. The second workload, J, is the pgbench
built-in simple-update workload.

I2. Work queue
    pgbench scale 100
    WL 1: 32 clients inserting a single row, updating an indexed
    column in another row twice, then deleting the last updated row
    WL 2: 2 clients, running the TPC-B like built-in workload,
    rate-limited to 10,000 TPS

J. simple-update
    pgbench scale 450
    WL 1: 16 clients running the built-in simple-update workload

The simple-update workload works well because it only updates
pgbench_accounts and inserts into 

Re: pg_upgrade and logical replication

2023-11-08 Thread Peter Smith
Thanks for addressing my previous review comments.

I re-checked the latest patch v12-0001 and found the following:

==
Commit message

1.
The new SQL binary_upgrade_create_sub_rel_state function has the following
syntax:
SELECT binary_upgrade_create_sub_rel_state(subname text, relid oid,
state char [,sublsn pg_lsn])

~

Looks like v12 accidentally forgot to update this to the modified
function name 'binary_upgrade_add_sub_rel_state'

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost
Greetings,

On Wed, Nov 8, 2023 at 20:55 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:
>
>> * Andres Freund (and...@anarazel.de) wrote:
>> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
>> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
>> > > From: David Christensen 
>> > > Date: Tue, 9 May 2023 16:56:15 -0500
>> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>> > >
>> > > This space is reserved for extended data on the Page structure which
>> will be ultimately used for
>> > > encrypted data, extended checksums, and potentially other things.
>> This data appears at the end of
>> > > the Page, after any `pd_special` area, and will be calculated at
>> runtime based on specific
>> > > ControlFile features.
>> > >
>> > > No effort is made to ensure this is backwards-compatible with
>> existing clusters for `pg_upgrade`, as
>> > > we will require logical replication to move data into a cluster with
>> > > different settings here.
>> >
>> > The first part of the last paragraph makes it sound like pg_upgrade
>> won't be
>> > supported across this commit, rather than just between different
>> settings...
>>
>
> Yeah, that's vague, but you picked up on what I meant.
>
>
>> > I think as a whole this is not an insane idea. A few comments:
>>
>> Thanks for all the feedback!
>>
>> > - Why is it worth sacrificing space on every page to indicate which
>> features
>> >   were enabled?  I think there'd need to be some convincing reasons for
>> >   introducing such overhead.
>>
>> In conversations with folks (my memory specifically is a discussion with
>> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
>> there was a pretty strong push that a page should be able to 'stand
>> alone' and not depend on something else (eg: pg_control, or whatever) to
>> provide info needed to be able to interpret the page.  For my part, I don't
>> have a particularly strong feeling on that, but that's what led to this
>> design.
>>
>
> Unsurprisingly, I agree that it's useful to keep these features on the
> page itself; from a forensic standpoint this seems much easier to interpret
> what is happening here, as well it would allow you to have different
> features on a given page or type of page depending on need.  The initial
> patch utilizes pg_control to store the cluster page features, but there's
> no reason it couldn't be dependent on fork/page type or stored in
> pg_tablespace to utilize different features.
>

When it comes to authenticated encryption, it’s also the case that it’s
unclear what value the checksum field has, if any…  it’s certainly not
directly needed as a checksum, as the auth tag is much better for the
purpose of seeing if the page has been changed in some way. It’s also not
big enough to serve as an auth tag per NIST guidelines regarding the size
of the authenticated data vs. the size of the tag.  Using it to indicate
what features are enabled on the page seems pretty useful, as David notes.

Thanks,

Stephen



Re: make pg_ctl more friendly

2023-11-08 Thread Crisp Lee
Hi,

I know that. But my question is not about that. I did a PITR operation with
recovery_target_name and recovery_target_action ('shutdown'). The PITR
process was very short and was done before pg_ctl's check. The postmaster
shut down due to recovery_target_action, and there was no crash. But
pg_ctl reported a startup failure.  I think the startup had succeeded and
the outcome was not an exception. pg_ctl should give users more detailed
messages.

On Thu, Nov 9, 2023 at 9:32 AM Andres Freund  wrote:

> Hi,
>
> On 2023-11-09 09:29:32 +0800, Crisp Lee wrote:
> > How can one judge from 'DB_SHUTDOWNED' that PITR ended normally?
> > 'DB_SHUTDOWNED' is just a state; it cannot convey more meaning, so I
> > reuse the recovery.done.
>
> DB_SHUTDOWNED cannot be encountered while recovery is ongoing. If there
> was a
> hard crash, you'd see DB_IN_ARCHIVE_RECOVERY or such, if the command was
> shut
> down orderly before PITR has finished, you'd see DB_SHUTDOWNED_IN_RECOVERY.
>
> - Andres
>


Re: GUC names in messages

2023-11-08 Thread Peter Smith
On Thu, Nov 2, 2023 at 1:25 AM Tom Lane  wrote:
>
...
> Our error message style guidelines say not to assemble messages out
> of separate parts, because it makes translation difficult.  Originally
> we applied that rule to GUC names mentioned in messages as well.
> Awhile ago the translation team decided that that made for too many
> duplicative translations, so they'd be willing to compromise on
> substituting GUC names.  That's only been changed in a haphazard
> fashion though, mostly in cases where there actually were duplicative
> messages that could be merged this way.  And there's never been any
> real clarity about whether to quote GUC names, though certainly we're
> more likely to quote anything injected with %s.  So that's why we have
> a mishmash right now.

Right. While looking at all the messages I observed a number of them
having almost the same (but not quite the same) wording:

For example,

errhint("Consider increasing the configuration parameter \"max_wal_size\".")));
errhint("You might need to increase %s.", "max_locks_per_transaction")));
errhint("You might need to increase %s.", "max_pred_locks_per_transaction")));
errmsg("could not find free replication state, increase
max_replication_slots")));
hint ? errhint("You might need to increase %s.", "max_slot_wal_keep_size") : 0);
errhint("You may need to increase max_worker_processes.")));
errhint("Consider increasing configuration parameter
\"max_worker_processes\".")));
errhint("Consider increasing the configuration parameter
\"max_worker_processes\".")));
errhint("You might need to increase %s.", "max_worker_processes")));
errhint("You may need to increase max_worker_processes.")));
errhint("You might need to increase %s.", "max_logical_replication_workers")));

~

The most common pattern there is "You might need to increase %s.".

Here is a patch to modify those other similar variations so they share
that common wording.

PSA.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Use-a-common-message-for-increasing-a-GUC.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread David Christensen
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:

> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > > From: David Christensen 
> > > Date: Tue, 9 May 2023 16:56:15 -0500
> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> > >
> > > This space is reserved for extended data on the Page structure which
> will be ultimately used for
> > > encrypted data, extended checksums, and potentially other things.
> This data appears at the end of
> > > the Page, after any `pd_special` area, and will be calculated at
> runtime based on specific
> > > ControlFile features.
> > >
> > > No effort is made to ensure this is backwards-compatible with existing
> clusters for `pg_upgrade`, as
> > > we will require logical replication to move data into a cluster with
> > > different settings here.
> >
> > The first part of the last paragraph makes it sound like pg_upgrade
> won't be
> > supported across this commit, rather than just between different
> settings...
>

Yeah, that's vague, but you picked up on what I meant.


> > I think as a whole this is not an insane idea. A few comments:
>
> Thanks for all the feedback!
>
> > - Why is it worth sacrificing space on every page to indicate which
> features
> >   were enabled?  I think there'd need to be some convincing reasons for
> >   introducing such overhead.
>
> In conversations with folks (my memory specifically is a discussion with
> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> there was a pretty strong push that a page should be able to 'stand
> alone' and not depend on something else (eg: pg_control, or whatever) to
> provide info needed be able to interpret the page.  For my part, I don't
> have a particularly strong feeling on that, but that's what lead to this
> design.
>

Unsurprisingly, I agree that it's useful to keep these features on the page
itself; from a forensic standpoint this seems much easier to interpret what
is happening here, as well it would allow you to have different features on
a given page or type of page depending on need.  The initial patch utilizes
pg_control to store the cluster page features, but there's no reason it
couldn't be dependent on fork/page type or stored in pg_tablespace to
utilize different features.

Thanks,

David


Re: make pg_ctl more friendly

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-09 09:29:32 +0800, Crisp Lee wrote:
> How can one judge from 'DB_SHUTDOWNED' that PITR ended normally?
> 'DB_SHUTDOWNED' is just a state; it cannot convey more meaning, so I
> reuse the recovery.done.

DB_SHUTDOWNED cannot be encountered while recovery is ongoing. If there was a
hard crash, you'd see DB_IN_ARCHIVE_RECOVERY or such, if the command was shut
down orderly before PITR has finished, you'd see DB_SHUTDOWNED_IN_RECOVERY.

- Andres




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-09 10:25:18 +0900, Michael Paquier wrote:
> On Thu, Nov 09, 2023 at 10:10:39AM +0900, torikoshia wrote:
> > I am a little concerned that the reset time is not the same and that
> > GetCurrentTimestamp() is called multiple times, but I think it would be
> > acceptable because the function is probably not used that often and the
> > reset time is not atomic in practice.
> 
> Arf, right.  I misremembered that this is just a clock_timestamp() so
> that's not transaction-resilient.  Anyway, my take is that this is not
> a big deal in practice compared to the usability of the wrapper.

It seems inconsequential cost-wise. Resetting stats is way more expensive than
a few timestamp determinations. Correctness wise it actually seems *better* to
record the timestamps more granularly, after all, that moves them closer to
the time the individual kind of stats is reset.

Greetings,

Andres Freund




Re: make pg_ctl more friendly

2023-11-08 Thread Crisp Lee
How can one judge from 'DB_SHUTDOWNED' that PITR ended normally?
'DB_SHUTDOWNED' is just a state; it cannot convey more meaning, so I
reuse the recovery.done.

On Sat, Nov 4, 2023 at 9:56 AM Andres Freund  wrote:

> Hi,
>
> On 2023-11-02 14:50:14 +0800, Crisp Lee wrote:
> > I got a basebackup using pg_basebackup -R. After that, I created a
> restore
> > point named test on primary, and set recovery_target_name to test,
> > recovery_target_action to shutdown in standby datadir. I got a failure
> > startup message  after 'pg_ctl start -D $standby_datadir'. I think it is
> > not a failure, and makes users nervous, especially for newbies.
> >
> > My thought is to generate a recovery.done file if the postmaster receives
> > exit code 3 from the startup process. When postmaster  exits, pg_ctl will
> > give a more friendly message to users.
>
> I think we can detect this without any additional state - pg_ctl already
> accesses pg_control (via get_control_dbstate()). We should be able to
> detect
> your case by issuing a different warning if
>
> a) get_control_dbstate() at the start was *not* DB_SHUTDOWNED
> b) get_control_dbstate() at the end is DB_SHUTDOWNED
>
> Greetings,
>
> Andres Freund
>


Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread Michael Paquier
On Thu, Nov 09, 2023 at 10:10:39AM +0900, torikoshia wrote:
> I am a little concerned that the reset time is not the same and that
> GetCurrentTimestamp() is called multiple times, but I think it would be
> acceptable because the function is probably not used that often and the
> reset time is not atomic in practice.

Arf, right.  I misremembered that this is just a clock_timestamp() so
that's not transaction-resilient.  Anyway, my take is that this is not
a big deal in practice compared to the usability of the wrapper.
--
Michael


signature.asc
Description: PGP signature


Re: ensure, not insure

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 08:31:28PM +1300, David Rowley wrote:
> On Wed, 8 Nov 2023 at 19:56, Peter Smith  wrote:
>> In passing I found/fixed a bunch of similar misuses in comments.
> 
> Those all look fine to me too.

+1.
--
Michael


signature.asc
Description: PGP signature


Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread torikoshia

On 2023-11-09 08:58, Michael Paquier wrote:
> On Wed, Nov 08, 2023 at 02:15:24PM +0530, Bharath Rupireddy wrote:
>> On Wed, Nov 8, 2023 at 9:43 AM Andres Freund  wrote:
>>> It's not like oids are a precious resource. It's a more confusing API to have
>>> to have to specify a NULL as an argument than not having to do so. If we
>>> really want to avoid a separate oid, a more sensible path would be to add a
>>> default argument to pg_stat_reset_slru() (by doing a CREATE OR REPLACE in
>>> system_functions.sql).
>>
>> +1. Attached the patch.
>>
>>  -- Test that multiple SLRUs are reset when no specific SLRU provided to reset function
>> -SELECT pg_stat_reset_slru(NULL);
>> +SELECT pg_stat_reset_slru();
>
> For the SLRU part, why not.
>
> Hmm.  What's the final plan for pg_stat_reset_shared(), then?  An
> equivalent that calls a series of pgstat_reset_of_kind()?


Sorry for the late reply, and thanks for the feedback, everyone.

As your 1st suggestion, I think "calls a series of 
pgstat_reset_of_kind()" would be enough.
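
With that, the user-facing call would presumably just be:

    SELECT pg_stat_reset_shared('all');  -- each shared stats kind reset in turn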


I am a little concerned that the reset time is not the same and
that GetCurrentTimestamp() is called multiple times, but I think it 
would be acceptable because the function is probably not used that often 
and the reset time is not atomic in practice.


I'll attach the patch.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: meson documentation build open issues

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 16:19:51 +0100, Peter Eisentraut wrote:
> On 08.11.23 13:55, Christoph Berg wrote:
> > Re: Peter Eisentraut
> > > > If the problem is broken doc patches, then maybe a solution is to
> > > > include the `xmllint --noout --valid` target in whatever the check-world
> > > > equivalent is for meson.  Looking at doc/src/sgml/meson.build, we don't
> > > > seem to do that anywhere.  Doing the no-output lint run is very fast
> > > > (375ms real time in my machine, whereas "make html" takes 27s).
> > >
> > > This would be a start, but it wouldn't cover everything.  Lately, we 
> > > require
> > > id attributes on certain elements, which is checked on the XSLT level.
> >
> > I'd think there should be a catchy "make check-world"-equivalent that
> > does run all reasonable check that we can tell people to run by
> > default. Then if that takes too long, we could still offer
> > alternatives that exclude some areas. If it's the other way round,
> > some areas will never be checked widely.
>
> I think we could build doc/src/sgml/postgres-full.xml by default.  That
> takes less than 0.5 seconds here and it's an intermediate target for html
> and man.

That does require the docbook DTD to be installed, afaict. I think we would
need a configure test for that to be present if we want to build it by
default, otherwise we'll cause errors on plenty of systems that don't have it
today.  The docbook DTDs aren't a huge dependency, but still. Some OSs might
not have a particularly convenient install source for them, e.g. Windows.

I don't think that'd detect the missing ids?

Greetings,

Andres Freund





Re: meson documentation build open issues

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 13:55:02 +0100, Christoph Berg wrote:
> Re: Peter Eisentraut
> > > If the problem is broken doc patches, then maybe a solution is to
> > > include the `xmllint --noout --valid` target in whatever the check-world
> > > equivalent is for meson.  Looking at doc/src/sgml/meson.build, we don't
> > > seem to do that anywhere.  Doing the no-output lint run is very fast
> > > (375ms real time in my machine, whereas "make html" takes 27s).
> > 
> > This would be a start, but it wouldn't cover everything.  Lately, we require
> > id attributes on certain elements, which is checked on the XSLT level.
> 
> I'd think there should be a catchy "make check-world"-equivalent that
> does run all reasonable check that we can tell people to run by
> default. Then if that takes too long, we could still offer
> alternatives that exclude some areas. If it's the other way round,
> some areas will never be checked widely.

The 'test' target (generated by meson, otherwise I'd have named it 'check')
runs all enabled tests.  You can obviously run a subset if you so desire.

Greetings,

Andres Freund




Re: Force the old transactions logs cleanup even if checkpoint is skipped

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 12:44:09PM +, Zakhlystov, Daniil (Nebius) wrote:
>> I am not sure to understand your last sentence here.  Once the
>> archiver is back up, you mean that the WAL segments that were not
>> previously archived still are still not archived?  Or do you mean that
>> because of a succession of skipped checkpoints we are just unable to
>> remove them from pg_wal?
> 
> Yes, the latter is correct - we are unable to clean up the already archived 
> WALs
> due to the checkpoint being skipped. 

Yes, theoretically you could face this situation if you have an
irregular WAL activity with cycles where nothing happens and an
archive command that keeps failing while there is WAL generated, but
works while WAL is not generated.

>> I am not convinced that this is worth the addition in the skipped
>> path.  If your system is idle and a set of checkpoints is skipped, the
>> data folder is not going to be under extra space pressure because of
>> database activity (okay, unlogged tables even if these would generate
>> some WAL for init pages), because there is nothing happening in it
>> with no "important" WAL generated.  Note that the backend is very
>> unlikely going to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
> 
>> More to the point: what's the origin of the disk space issues?  System
>> logs, unlogged tables or something else?  It is usually a good
>> practice to move logs to a different partition.  At the end, it sounds
>> to me that removing segments more aggressively is just kicking the can
>> elsewhere, without taking care of the origin of the disk issues.
> 
> This problem arises when disk space issues are caused by temporary failed 
> archiving.
> As a result, the pg_wal becomes filled with WALs. This situation
> leads to Postgres being unable to perform any write operations since there is 
> no more
> free disk space left. Usually, cloud providers switch the cluster to a 
> Read-Only mode
> if there is less than 3-4% of the available disk space left, but this also 
> does not resolve
> this problem.

> The actual problem is that after archiving starts working normally again, 
> Postgres is
> unable to free the accumulated WAL and switch to Read-Write mode due to the
> checkpoint being skipped, leading to a vicious circle. However, nothing 
> prevents
> Postgres from exiting such situations on its own. This patch addresses this 
> specific
> behavior, enabling Postgres to resolve such situations autonomously.

Yep, but it does not really solve your disk space issues in a reliable
way.

I am not really convinced that this is worth complicating the skipped
path for this goal.  In my experience, I've seen complaints where WAL
archiving bloat was coming from the archive command not able to keep
up with the amount generated by the backend, particularly because the
command invocation was taking longer than it takes to generate a new
segment.  Even if there is a hole of activity in the server, if too
much WAL has been generated it may not be enough to catch up depending
on the number of segments that need to be processed.  Others are free
to chime in with extra opinions, of course.

While on it, I think that your patch would cause incorrect and early
removal of segments.  It computes the name of the last segment to
remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
that it ignores any WAL retention required by replication slots or
wal_keep_size.  And this causes the calculation of an incorrect segno
horizon.
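
(For reference, the retention that KeepLogSeg() accounts for is visible
at the SQL level, for example:)

    SELECT slot_name, restart_lsn, wal_status FROM pg_replication_slots;
    SHOW wal_keep_size;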
--
Michael


signature.asc
Description: PGP signature


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Damir Belyalov
Hello everyone!
Thanks for turning back to this patch.


I had already thought about storing errors in a table / separate file /
the logfile, and it seems to me that the best way is to output errors to
the logfile. For the user it is more convenient to look for errors in the
place where they are usually generated - the logfile - and anyone who wants
to intercept them can easily do that with a few commands.


The analogues of this feature in other DBMSs usually had additional files
for storing errors, but their features had too many options (see attached
files).
I also think that the best way is to simplify this feature for the first
version and not use redundant additions such as extra files and other
options.
IMHO, for more complicated table-loading operations, pgloader
exists: https://github.com/dimitri/pgloader
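
(With the patch as posted, usage would presumably be something along
these lines; the table and file names are invented:)

    COPY measurements FROM '/tmp/measurements.csv'
        WITH (FORMAT csv, IGNORE_DATATYPE_ERRORS);
    -- rows that fail datatype conversion are skipped and reported in the log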


Links of analogues of COPY IGNORE_DATATYPE_ERRORS
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html

Regards,
Damir Belyalov
Postgres Professional


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 19:00:01 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
> >> I think an actually usable feature of this sort would involve
> >> copying all the failed lines to some alternate output medium,
> >> perhaps a second table with a TEXT column to receive the original
> >> data line.
> 
> > If we go in that direction, we should make it possible to *not* use such a
> > table as well, for some uses it'd be pointless.
> 
> Why?  You can always just drop the errors table if you don't want it.

I think it'll often just end up littering the database, particularly if the
callers don't care about a few errors.


> But I fail to see the use-case for ignoring errors altogether.

My experience is that there's often a few errors due to bad encoding, missing
escaping etc that you don't care sufficiently about when importing large
quantities of data.

Greetings,

Andres Freund




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Tom Lane
Andres Freund  writes:
> On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line.

> If we go in that direction, we should make it possible to *not* use such a
> table as well, for some uses it'd be pointless.

Why?  You can always just drop the errors table if you don't want it.
But I fail to see the use-case for ignoring errors altogether.

> Another way of reporting errors could be for copy to return invalid input back
> to the client, via the copy protocol.

Color me skeptical.  There are approximately zero clients in the
world today that could handle simultaneous return of data during
a COPY.  Certainly neither libpq nor psql are within hailing
distance of being able to support that.  Maybe in some far
future it could be made to work --- but if you want it in the v1
patch, you just moved the goalposts into the next county.

regards, tom lane




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 02:15:24PM +0530, Bharath Rupireddy wrote:
> On Wed, Nov 8, 2023 at 9:43 AM Andres Freund  wrote:
>> It's not like oids are a precious resource. It's a more confusing API to have
>> to have to specify a NULL as an argument than not having to do so. If we
>> really want to avoid a separate oid, a more sensible path would be to add a
>> default argument to pg_stat_reset_slru() (by doing a CREATE OR REPLACE in
>> system_functions.sql).
> 
> +1. Attached the patch.
>
>  -- Test that multiple SLRUs are reset when no specific SLRU provided to 
> reset function
> -SELECT pg_stat_reset_slru(NULL);
> +SELECT pg_stat_reset_slru();

For the SLRU part, why not.

Hmm.  What's the final plan for pg_stat_reset_shared(), then?  An
equivalent that calls a series of pgstat_reset_of_kind()?
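
(For what it's worth, the default-argument redefinition in
system_functions.sql could look roughly like this; a sketch only, the
exact attribute list would have to match the catalog entry:)

    CREATE OR REPLACE FUNCTION
      pg_stat_reset_slru(target text DEFAULT NULL)
    RETURNS void
    VOLATILE CALLED ON NULL INPUT
    LANGUAGE internal
    AS 'pg_stat_reset_slru';

    SELECT pg_stat_reset_slru();        -- resets all SLRU counters
    SELECT pg_stat_reset_slru('Xact');  -- a single named cache still works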
--
Michael


signature.asc
Description: PGP signature


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 13:18:39 -0500, Tom Lane wrote:
> Damir  writes:
> > [ v7-0002-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch ]
> 
> Sorry for being so late to the party, but ... I don't think this
> is a well-designed feature as it stands.  Simply dropping failed rows
> seems like an unusable definition for any application that has
> pretensions of robustness.

Not everything needs to be a robust application though. I've definitely cursed
at postgres for lacking this.


> I think an actually usable feature of this sort would involve
> copying all the failed lines to some alternate output medium,
> perhaps a second table with a TEXT column to receive the original
> data line.  (Or maybe an array of text that could receive the
> broken-down field values?)  Maybe we could dump the message info,
> line number, field name etc into additional columns.
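
(As a rough sketch of the kind of side table being proposed here; all
names are invented:)

    CREATE TABLE copy_errors (
        lineno   bigint,   -- input line number
        colname  text,     -- field that failed to parse, if known
        message  text,     -- the error message
        rawline  text      -- the original input line, verbatim
    );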

If we go in that direction, we should make it possible to *not* use such a
table as well, for some uses it'd be pointless.


Another way of reporting errors could be for copy to return invalid input back
to the client, via the copy protocol. That would allow the client to handle
failing rows and also to abort if the number of errors or the type of errors
gets to be too big.

Greetings,

Andres Freund




Some deleted GUCs are still referred to

2023-11-08 Thread Peter Smith
Hi,

I happened to notice that the old GUC names "max_fsm_pages" and
"max_fsm_relations" are still mentioned in these translation files
(from the REL_16_1 source zip)

src\backend\po\fr.po
src\backend\po\tr.po

~~

Should those be removed?

There was a commit [1] that said all traces of those GUCs had
been eliminated.

==

[1] 
https://github.com/postgres/postgres/commit/15c121b3ed7eb2f290e19533e41ccca734d23574#diff-65c699b5d467081e780d255ea0ed7d720b5bca2427e300f9fd0776bffe51560a

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: ResourceOwner refactoring

2023-11-08 Thread Heikki Linnakangas

On 08/11/2023 20:00, Alexander Lakhin wrote:

Please look at a new assertion failure, I've managed to trigger with this 
script:
CREATE TABLE t(
i01 int, i02 int, i03 int, i04 int, i05 int, i06 int, i07 int, i08 int, i09 
int, i10 int,
i11 int, i12 int, i13 int, i14 int, i15 int, i16 int, i17 int, i18 int, i19 
int, i20 int,
i21 int, i22 int, i23 int, i24 int, i25 int, i26 int
);
CREATE TABLE tp PARTITION OF t FOR VALUES IN (1);

(gdb) bt
...
#5  0x560dd4e42f77 in ExceptionalCondition (conditionName=0x560dd5059fbc 
"owner->narr == 0", fileName=0x560dd5059e31
"resowner.c", lineNumber=362) at assert.c:66
#6  0x560dd4e93cbd in ResourceOwnerReleaseAll (owner=0x560dd69cebd0, 
phase=RESOURCE_RELEASE_BEFORE_LOCKS,
printLeakWarnings=false) at resowner.c:362
#7  0x560dd4e92e22 in ResourceOwnerReleaseInternal (owner=0x560dd69cebd0, 
phase=RESOURCE_RELEASE_BEFORE_LOCKS,
isCommit=false, isTopLevel=true) at resowner.c:725
#8  0x560dd4e92d42 in ResourceOwnerReleaseInternal (owner=0x560dd69ce3f8, 
phase=RESOURCE_RELEASE_BEFORE_LOCKS,
isCommit=false, isTopLevel=true) at resowner.c:678
#9  0x560dd4e92cdb in ResourceOwnerRelease (owner=0x560dd69ce3f8, 
phase=RESOURCE_RELEASE_BEFORE_LOCKS,
isCommit=false, isTopLevel=true) at resowner.c:652
#10 0x560dd47316ef in AbortTransaction () at xact.c:2848
#11 0x560dd47329ac in AbortCurrentTransaction () at xact.c:3339
#12 0x560dd4c37284 in PostgresMain (dbname=0x560dd69779f0 "regression", 
username=0x560dd69779d8 "law") at
postgres.c:4370
...


Thanks for the testing! Fixed. I introduced that bug in one of the last 
revisions of the patch: I changed ResourceOwnerSort() to not bother 
moving all the elements from the array to the hash table if the hash 
table is allocated but empty. However, ResourceOwnerRelease() didn't get 
the memo, and still checked "owner->hash != NULL" rather than 
"owner->nhash == 0".


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: meson documentation build open issues

2023-11-08 Thread Andres Freund
Hi,

I really like the idea of an 'help' target that prints the targets. It seemed
annoying to document such targets in both the sgml docs and the input for a
the help target. Particularly due to the redundancies between id attributes,
the target name etc.

First I generated the list of targets from within meson.build, only to later
realize that that would not work when building the docs via make. So I instead
added doc/src/sgml/meson-targets.txt which is lightly postprocessed for the
'help' target, and slightly more processed when building the docs.

That does have some downsides, e.g. it'd be more complicated to only print
targets if a relevant option is enabled. But I think it's acceptable that way.


Example output:

$ ninja help
[0/1 1   0%] Running external command help (wrapped by meson to set env)
Code Targets:
  all      Build everything other than documentation
  backend  Build backend and related modules
  bin      Build frontend binaries
  contrib  Build contrib modules
  pl       Build procedural languages

Documentation Targets:
  docs                         Build documentation in multi-page HTML format
  doc-html                     Build documentation in multi-page HTML format
  doc-man                      Build documentation in man page format
  doc/src/sgml/postgres-A4.pdf Build documentation in PDF format, with A4 pages
  doc/src/sgml/postgres-US.pdf Build documentation in PDF format, with US letter pages
  doc/src/sgml/postgres.html   Build documentation in single-page HTML format
  alldocs                      Build documentation in all supported formats

Installation Targets:
  install          Install postgres, excluding documentation
  install-doc-html Install documentation in multi-page HTML format
  install-doc-man  Install documentation in man page format
  install-docs     Install documentation in multi-page HTML and man page formats
  install-quiet    Like "install", but installed files are not displayed
  install-world    Install postgres, including multi-page HTML and man page documentation
  uninstall        Remove installed files

Other Targets:
  clean    Remove all build products
  test     Run all enabled tests (including contrib)
  world    Build everything, including documentation
  help     List important targets


Because of the common source, some of the descriptions in the state of this
patch are a bit shorter than in the preceding commit. But I don't think that
hurts much.

Greetings,

Andres Freund
From 9b0b5cd952880ecebbd157c05698125755bc53ed Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 8 Nov 2023 09:29:38 -0800
Subject: [PATCH v2 1/7] meson: Change default of 'selinux' feature option to
 auto

There is really no reason for selinux to behave differently than other
options.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231103211601.bgqx3cfq6pz2l...@awork3.anarazel.de
Backpatch:
---
 meson_options.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson_options.txt b/meson_options.txt
index d2f95cfec36..be1b327f544 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -136,7 +136,7 @@ option('tcl_version', type: 'string', value: 'tcl',
 option('readline', type: 'feature', value: 'auto',
   description: 'Use GNU Readline or BSD Libedit for editing')
 
-option('selinux', type: 'feature', value: 'disabled',
+option('selinux', type: 'feature', value: 'auto',
   description: 'SELinux support')
 
 option('ssl', type: 'combo', choices: ['auto', 'none', 'openssl'],
-- 
2.38.0

From a2aa43811296c92a8d4611c996d4aebcf1b88f31 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 3 Nov 2023 10:21:13 -0700
Subject: [PATCH v2 2/7] docs: Document --with-selinux/-Dselinux options
 centrally

Previously --with-selinux was documented for autoconf in the sepgsql
documentation and not at all for meson. There are further improvements related
to this that could be made, but this seems like a clear improvement.

Author:
Reviewed-by:
Reported-by: Christoph Berg 
Discussion: https://postgr.es/m/20231103163848.26egkh5qdgw3v...@awork3.anarazel.de
Backpatch:
---
 doc/src/sgml/installation.sgml | 21 +
 doc/src/sgml/sepgsql.sgml  | 11 ---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index a3dc6eb855f..1bfb27fd38d 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1198,6 +1198,16 @@ build-postgresql:

   
 
+  
+   --with-selinux
+   
+
+ Build with selinux support, enabling the 
+ extension.
+
+   
+  
+
  
 

@@ -2629,6 +2639,17 @@ ninja install

  

Re: speed up a logical replica setup

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 09:50:47AM -0300, Euler Taveira wrote:
> On Tue, Nov 7, 2023, at 8:12 PM, Michael Paquier wrote:
> I used the CreateReplicationSlot() from streamutil.h but decided to use the
> CREATE_REPLICATION_SLOT command directly because it needs the LSN as output. 
> As
> you noticed at that time I wouldn't like a dependency in the pg_basebackup
> header files; if we move this binary to base backup directory, it seems 
> natural
> to refactor the referred function and use it.

Right.  It should be OK to store that in an optional XLogRecPtr
pointer, i.e. by allowing NULL to be passed as the argument when the
caller needs nothing.
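
(For reference, the replication command hands back the LSN directly as
the consistent_point column of its result set; issued on a replication
connection, with an illustrative slot name, it looks something like:)

    CREATE_REPLICATION_SLOT my_sub_slot LOGICAL pgoutput
    -- returns: slot_name | consistent_point | snapshot_name | output_plugin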

>> I've read the patch, and the additions to streamutil.h and
>> streamutil.c make it kind of natural to have it sit in pg_basebackup/.
>> There's pg_recvlogical already there.  I am wondering about two
>> things, though:
>> - Should the subdirectory pg_basebackup be renamed into something more
>> generic at this point?  All these things are frontend tools that deal
>> in some way with the replication protocol to do their work.  Say
>> a replication_tools?
> 
> It is a good fit for this tool since it is another replication tool. I also
> agree with the directory renaming; it seems confusing that the directory has
> the same name as one binary but also contains other related binaries in it.

Or cluster_tools?  Or stream_tools?  replication_tools may be OK, but
I have a bad sense for naming new things around here.  So if anybody
has a better idea, feel free..

>> - And if it would be better to refactor some of the code generic to
>> all these streaming tools to fe_utils.  What makes streamutil.h a bit
>> less pluggable are all its extern variables to control the connection,
>> but perhaps that can be an advantage, as well, in some cases.
> 
> I like it. There are common functions such as GetConnection(),
> CreateReplicationSlot(), DropReplicationSlot() and RunIdentifySystem() that are
> used by all of these replication tools. We can move the extern variables into
> parameters to have a pluggable streamutil.h.

And perhaps RetrieveWalSegSize() as well as GetSlotInformation().
These kick replication commands.
--
Michael


signature.asc
Description: PGP signature


Re: Remove MSVC scripts from the tree

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 09:41:19AM +0100, Peter Eisentraut wrote:
> I don't think we should rely on sed being there on Windows.  Maybe it's true
> now on the handful of buildfarm/CI machines and early adopters, but do we
> have any indication that that is systematic or just an accident?

Or both?  When doing builds based on MinGW in the past I vaguely
recall getting annoyed that I needed to look for sed as one thing, so
your suggestion could simplify the experience a bit.

> Since we definitely require Perl now, we could just as well use the Perl
> script and avoid this issue.
>
> Attached is a Perl version of the sed script, converted by hand (so not the
> super-verbose s2p thing).  It's basically just the sed script with
> semicolons added and the backslashes in the regular expressions moved
> around.  I think we could use something like that for all platforms now.

Sounds like a good idea to me now that perl is a hard requirement.
+1.
--
Michael


signature.asc
Description: PGP signature


Re: Moving forward with TDE [PATCH v3]

2023-11-08 Thread David Christensen
On Tue, Nov 7, 2023 at 5:49 PM Andres Freund  wrote:

> Hi,
>
> On 2023-11-06 09:56:37 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > I still am quite quite unconvinced that using the LSN as a nonce is a
> good
> > > design decision.
> >
> > This is a really important part of the overall path to moving this
> > forward, so I wanted to jump to it and have a specific discussion
> > around this.  I agree that there are downsides to using the LSN, some of
> > which we could possibly address (eg: include the timeline ID in the IV),
> > but others that would be harder to deal with.
>
> > The question then is- what's the alternative?
> >
> > One approach would be to change the page format to include space for an
> > explicit nonce.  I don't see the community accepting such a new page
> > format as the only format we support though as that would mean no
> > pg_upgrade support along with wasted space if TDE isn't being used.
>
> Right.
>

Hmm, if we /were/ to introduce some sort of page format change, couldn't
that be a use case for modifying the pd_version field?  Could v4 pages be
read in and written out as v5 pages with different interpretations?


> > Ideally, we'd instead be able to support multiple page formats where
> > users could decide when they create their cluster what features they
> > want- and luckily we even have such an effort underway with patches
> > posted for review [1].
>
> I think there are some details wrong with that patch - IMO the existing
> macros
> should just continue to work as-is and instead the places that want the
> more
> narrow definition should be moved to the new macros and it changes places
> that
> should continue to use compile time constants - but it doesn't seem like a
> fundamentally bad idea to me.  I certainly like it much better than making
> the
> page size runtime configurable.
>

There had been some discussion about this WRT renaming macros and the like
(constants, etc)—I think a new pass eliminating the variable blocksize
pieces and seeing if we can minimize churn here is worthwhile, will take a
look and see what the minimally-viable set of changes is here.


> (I'll try to reply with the above points to [1])
>
>
> > Certainly, with the base page-special-feature patch, we could have an
> option
> > for users to choose that they want a better nonce than the LSN, or we
> could
> > bundle that assumption in with, say, the authenticated-encryption feature
> > (if you want authenticated encryption, then you want more from the
> > encryption system than the basics, and therefore we presume you also
> want a
> > better nonce than the LSN).
>
> I don't think we should support using the LSN as a nonce if we have an
> alternative. The cost and complexity overhead is just not worth it.  Yes,
> it'll be harder for users to migrate to encryption, but adding complexity
> elsewhere in the system to get an inferior result isn't worth it.
>

From my read, XTS (which I'd see as inferior to authenticated encryption,
but better than some other options) could use LSN as an IV without leakage
concerns, perhaps mixing in the BlockNumber as well.  If we are going to
allow multiple encryption types, I think we may need to consider that the needs
for IVs may be different, so this may need to be something that is
selectable per encryption type.

I am unclear how much of a requirement this is, but it seems like having a
design that supports this pluggably (even if only via a static internal
lookup table for encryption type, block length, IV source, etc.) is the most
future proof approach if we ever had to retire an encryption method or
prevent creation of specific methods, say.


> > Another approach would be a separate fork, but that then has a number of
> > downsides too- every write has to touch that too, and a single page of
> > nonces would cover a pretty large number of pages also.
>
> Yea, the costs of doing so is nontrivial. If you were trying to implement
> encryption on the smgr level - which I doubt you should but am not certain
> about - my suggestion would be to interleave pages with metadata like
> nonces
> and AEAD with the data pages. I.e. one metadata page would be followed by
>   (BLCKSZ - SizeOfPageHeaderData) / (sizeof(nonce) + sizeof(AEAD))
> pages containing actual relation data.  That way you still get decent
> locality
> during scans and writes.
>

Hmm, this is actually an interesting idea, I will think about this a bit.


> Relation forks were a mistake, we shouldn't use them in more places.
>
>
> I think it'd be much better if we also encrypted forks, rather than just
> the
> main fork...
>

I believe the existing code should just work by modifying
the PageNeedsToBeEncrypted macro; I will test that and see if anything
blows up.

David


Re: pg_upgrade and logical replication

2023-11-08 Thread Michael Paquier
On Wed, Nov 08, 2023 at 10:52:29PM +0530, vignesh C wrote:
> Upgrading logical replication nodes requires multiple steps to be
> performed. Because not all operations are transactional, the user is
> advised to take backups.
> Backups can be taken as described in
> https://www.postgresql.org/docs/current/backup.html

There's a similar risk with --link if the upgrade fails after the new
cluster was started and the files linked began getting modified, so
that's something users would be OK with, I guess.

> Upgrading 2 node logical replication cluster:
> 1) Let's say publisher is in Node1 and subscriber is in Node2.
> 2) Stop the publisher server in Node1.
> 3) Disable the subscriptions in Node2.
> 4) Upgrade the publisher node Node1 to Node1_new.
> 5) Start the publisher node Node1_new.
> 6) Stop the subscriber server in Node2.
> 7) Upgrade the subscriber node Node2 to Node2_new.
> 8) Start the subscriber node Node2_new.
> 9) Alter the subscription connections in Node2_new to point from Node1
> to Node1_new.

Do they really need to do so in a pg_upgrade flow?  The connection
endpoint would be likely the same for transparency, no?

> 10) Enable the subscriptions in Node2_new.
> 11) Create any tables that were created in Node1_new between step-5
> and now and Refresh the publications.

How about the opposite stance, where an upgrade flow does first the
subscriber and then the publisher?  Would this be worth mentioning?
Case 3 touches that as nodes hold both publishers and subscribers.

> Steps to upgrade cascaded logical replication clusters:
> 1) Let's say we have a cascaded logical replication setup
> Node1->Node2->Node3. Here Node2 is subscribing to Node1 and Node3 is
> subscribing to Node2.
> 2) Stop the server in Node1.
> 3) Disable the subscriptions in Node2 and Node3.
> 4) Upgrade the publisher node Node1 to Node1_new.
> 5) Start the publisher node Node1_new.
> 6) Stop the server in Node2.
> 7) Upgrade the subscriber node Node2 to Node2_new.
> 8) Start the subscriber node Node2_new.
> 9) Alter the subscription connections in Node2_new to point from Node1
> to Node1_new.

Same here.

> 10) Enable the subscriptions in Node2_new.
> 11) Create any tables that were created in Node1_new between step-5
> and now and Refresh the publications.
> 12) Stop the server in Node3.
> 13) Upgrade the subscriber node Node3 to Node3_new.
> 14) Start the subscriber node Node3_new.
> 15) Alter the subscription connections in Node3_new to point from
> Node2 to Node2_new.
> 16) Enable the subscriptions in Node3_new.
> 17) Create any tables that were created in Node2_new between step-8
> and now and Refresh the publications.
> 
> Upgrading 2 node circular logical replication cluster:
> 1) Let's say we have a circular logical replication setup Node1->Node2
> & Node2->Node1. Here Node2 is subscribing to Node1 and Node1 is
> subscribing to Node2.
> 2) Stop the server in Node1.
> 3) Disable the subscriptions in Node2.
> 4) Upgrade the node Node1 to Node1_new.
> 5) Start the node Node1_new.
> 6) Enable the subscriptions in Node1_new.
> 7) Wait till all the incremental changes are synchronized.
> 8) Alter the subscription connections in Node2 to point from Node1 to 
> Node1_new.
> 9) Create any tables that were created in Node2 between step-2 and now
> and Refresh the publications.
> 10) Stop the server in Node2.
> 11) Disable the subscriptions in Node1.
> 12) Upgrade the node Node2 to Node2_new.
> 13) Start the subscriber node Node2_new.
> 14) Enable the subscriptions in Node2_new.
> 15) Alter the subscription connections in Node1 to point from Node2 to
> Node2_new.
> 16) Create any tables that were created in Node1_new between step-10
> and now and Refresh the publications.
> 
> I have done basic testing with this, I will do further testing and
> update it if I find any issues.
> Let me know if this idea is ok or we need something different.

I have not tested, but having documentation among these lines is good
because it becomes clear what the steps one needs to do are.

Another thing that I doubt is worth mentioning is the schema changes
that may happen.  We could just say that the schema should be fixed
while running an upgrade, which is kind of fair to expect in logical
setups for tables replicated anyway?

Do you think that there would be an issue in automating such tests
once support for the upgrade of subscribers is done (hopefully)?  The
first scenario may not need extra coverage if we have already
003_logical_slots.pl and a second file to test for the subscriber
part, though.
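
(For readers following along, the per-subscription operations in the
steps above map to commands like these; the subscription name and
connection string are purely illustrative:)

    ALTER SUBSCRIPTION mysub DISABLE;
    -- ... upgrade and restart the publisher ...
    ALTER SUBSCRIPTION mysub CONNECTION 'host=node1_new port=5432 dbname=postgres';
    ALTER SUBSCRIPTION mysub ENABLE;
    ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;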
--
Michael


signature.asc
Description: PGP signature


Re: GUC names in messages

2023-11-08 Thread Peter Smith
On Wed, Nov 8, 2023 at 7:40 AM Peter Smith  wrote:
>
> FWIW, I am halfway through doing regex checking of the PG16 source for
> all GUC names in messages to see what current styles are in use today.
>
> Not sure if those numbers will influence the decision.
>
> I hope I can post my findings today or tomorrow.
>

Here are my findings from the current PG16 source messages.

I used a regex search:
".*GUCNAME

to find how each GUCNAME is used in the messages in *.c files.

The GUC names are taken from the guc_tables.c code, so they are
grouped accordingly below.

~TOTALS:

messages where GUC names are QUOTED:
- bool = 11
- int = 11
- real = 0
- string = 10
- enum = 7
TOTAL = 39

messages where GUC names are NOT QUOTED:
- bool = 14
- int = 60
- real = 0
- string = 59
- enum = 31
TOTAL = 164

~~~

Details are in the attached file. PSA.

I've categorised them as being currently QUOTED, NOT QUOTED, and NONE
(most are not used in any messages).

Notice that NOT QUOTED is by far the more common pattern, so my vote
would be to standardise on making everything that way. I know
there was some concern raised about ambiguous words like "timezone"
and "datestyle", but in practice those are rare. Also, those GUCs
are different in that they are written in camel case (e.g.
"DateStyle") in guc_tables.c, so if they were also written in
camel case in the messages, that could remove ambiguities with normal
words. YMMV.
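
As a side note on the ambiguity point, GUC names are case-insensitive at
the SQL level, so camel-casing them in messages would be purely a
readability aid; for example:

SET datestyle = 'ISO, MDY';   -- accepted; same setting as "DateStyle"
SHOW DateStyle;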

Anyway, I will await a verdict about what to do.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Details below...


///

--
ConfigureNamesBool
--

{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_indexscan", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_indexonlyscan", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_tidscan", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_sort", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_incremental_sort", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_memoize", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_nestloop", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_mergejoin", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_hashjoin", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_gathermerge", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_partitionwise_join", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_partitionwise_aggregate", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_parallel_append", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_parallel_hash", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_partition_pruning", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_presorted_aggregate", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
- NONE

{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
- NONE

{"is_superuser", PGC_INTERNAL, UNGROUPED,
- NONE

{"bonjour", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
- NONE

{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION_SENDING,
- QUOTED
errhint("Make sure the configuration parameter \"%s\" is set on the 
primary server.", "track_commit_timestamp") :
errhint("Make sure the configuration parameter \"%s\" is set.", 
"track_commit_timestamp")));

{"ssl", PGC_SIGHUP, CONN_AUTH_SSL,
- NOT QUOTED
errhint("Set ssl = on in postgresql.conf."),

{"ssl_passphrase_command_supports_reload", PGC_SIGHUP, CONN_AUTH_SSL,
- NONE

{"ssl_prefer_server_ciphers", PGC_SIGHUP, CONN_AUTH_SSL,
- NONE

{"fsync", PGC_SIGHUP, WAL_SETTINGS,
- NONE

{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
- NOT QUOTED
gettext_noop(... "Setting zero_damaged_pages to true causes the system to
instead report a "...

{"ignore_invalid_pages", PGC_POSTMASTER, DEVELOPER_OPTIONS,
- NOT QUOTED
gettext_noop(... "Setting ignore_invalid_pages to true causes "

{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
- NOT QUOTED
errmsg("WAL generated with full_page_writes=off was replayed ""since 
last restartpoint"),
errhint(...  "Enable full_page_writes and run CHECKPOINT on the 
primary, ""and then try an online backup again.")));
errmsg("WAL generated with 

Re: Moving forward with TDE [PATCH v3]

2023-11-08 Thread David Christensen
On Tue, Nov 7, 2023 at 6:47 PM Andres Freund  wrote:

> Hi,
>
> On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote:
> > On Sat, 4 Nov 2023 at 03:38, Andres Freund  wrote:
> > > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > > > I'm quite surprised at the significant number of changes being made
> > > > outside the core storage manager files. I thought that changing out
> > > > mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired)
> > > > would be the most obvious change to implement cluster-wide encryption
> > > > with the least code touched, as relations don't need to know whether
> > > > the files they're writing are encrypted, right? Is there a reason to
> > > > not implement this at the smgr level that I overlooked in the
> > > > documentation of these patches?
> > >
> > > You can't really implement encryption transparently inside an smgr
> without
> > > significant downsides. You need a way to store an initialization vector
> > > associated with the page (or you can store that elsewhere, but then
> you've
> > > doubled the worst cse amount of random reads/writes). The patch uses
> the LSN
> > > as the IV (which I doubt is a good idea). For authenticated encryption
> further
> > > additional storage space is required.
> >
> > I am unaware of any user of the smgr API that doesn't also use the
> > buffer cache, and thus implicitly the Page layout with PageHeader
> > [^1]
>
> Everything indeed uses a PageHeader - but there are a number of places
> that do
> *not* utilize pd_lower/upper/special. E.g. visibilitymap.c just assumes
> that
> those fields are zero - and changing that wouldn't be trivial / free,
> because
> we do a lot of bitmasking/shifting with constants derived from
>
> #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
>
> which obviously wouldn't be constant anymore if you could reserve space on
> the
> page.
>

While not constants, I was able to get this working with variable values
here in a way that did not have the overhead of the original patch for the
vismap, specifically by using Montgomery multiplication for the
division/mod operations. That was actually the heaviest of the changes in
moving to runtime-calculated values, so we might be able to use this
approach in this specific case even if only this change is required for
this specific fork.


> > The API of smgr is also tailored to page-sized quanta of data
> > with mostly relation-level information. I don't see why there would be
> > a veil covering the layout of Page for smgr when all other information
> > already points to the use of PageHeader and Page layouts. In my view,
> > it would even make sense to allow the smgr to get exclusive access to
> > some part of the page in the current Page layout.
> >
> > Yes, I agree that there will be an impact on usable page size if you
> > want authenticated encryption, and that AMs will indeed need to
> > account for storage space now being used by the smgr - inconvenient,
> > but it serves a purpose. That would happen regardless of whether smgr
> > or some higher system decides where to store the data for encryption -
> > as long as it is on the page, the AM effectively can't use those
> > bytes.
> > But I'd say that's best solved by making the Page documentation and
> > PageInit API explicit about the potential use of that space by the
> > chosen storage method (encrypted, plain, ...) instead of requiring the
> > various AMs to manually consider encryption when using Postgres' APIs
> > for writing data to disk without hitting shared buffers; page space
> > management is already a task of AMs, but handling the actual
> > encryption is not.
>
> I don't particularly disagree with any detail here - but to me reserving
> space
> for nonces etc at PageInit() time pretty much is the opposite of handling
> encryption inside smgr.
>

Originally, I was anticipating that we might want different space amounts
reserved on different classes of pages (apart from encryption), so while
we'd be storing the default page reserved size in pg_control we'd not be
limited to this in the structure of the page calls.  We could presumably
just move the logic into PageInit() itself if every reserved allocation is
the same and individual call sites wouldn't need to know about it.  The
call sites do have more context as to the requirements of the page or the
"type" of page in play, which if we made it dependent on page type would
need to get passed in somehow, which was where the reserved_page_size
parameter came in to the current patch.

>
> > Should the AM really care whether the data on disk is encrypted or
> > not? I don't think so. When the disk contains encrypted bytes, but
> > smgrread() and smgrwrite() both produce and accept plaintext data,
> > who's going to complain? Requiring AMs to be mindful about encryption
> > on all common paths only adds pitfalls where encryption would be
> > forgotten by the developer of AMs in one path or another.
>
> I agree with that - I think the way the 

Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-08 Thread Tom Lane
Jian Guo  writes:
> I found a new approach to fix this issue, which seems better, so I would like 
> to post another version of the patch here. The origin patch made the 
> assumption of the values of Vars from CTE must be unique, which could be very 
> wrong. This patch examines variables for Vars inside CTE, which avoided the 
> bad assumption, so the results could be much more accurate.

You have the right general idea, but there is nothing about this patch
that's right in detail.  The outer Var doesn't refer to any particular
RTE within the subquery; it refers to a targetlist entry.  You have to
drill down to that, see if it's a Var, and if so you can recurse into
the subroot with that Var.  As this stands, it might accidentally get
the right answer for "SELECT * FROM foo" subqueries, but it will get
the wrong answer or even crash for anything that's not that.

The existing RTE_SUBQUERY stanza has most of what we need for this,
so I experimented with extending that to also handle RTE_CTE.  It
seems to work, though I soon found out that it needed tweaking for
the case where the CTE is INSERT/UPDATE/DELETE RETURNING.

Interestingly, this does not change any existing regression test
results.  I'd supposed there might be at least one place with a
visible plan change, but nope.  Running a coverage test does show
that the new code paths are exercised, but I wonder if we ought
to try to devise a regression test that proves it more directly.
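
A direct test might look something like this (a sketch only; the table
name and sizes are illustrative, not taken from the patch):

CREATE TABLE cte_stats_tab (a int);
INSERT INTO cte_stats_tab SELECT i % 10 FROM generate_series(1, 10000) i;
ANALYZE cte_stats_tab;
-- with the fix, the join estimate should reflect the 10 distinct values
-- of "a" seen through the CTE (about 10,000,000 rows), instead of a
-- default guess
EXPLAIN WITH c AS MATERIALIZED (SELECT a FROM cte_stats_tab)
SELECT * FROM c JOIN cte_stats_tab USING (a);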

regards, tom lane

PS: please, please, please do not quote the entire damn thread
when replying.  Trim it to just a minimum amount of relevant
text.  You think people want to read all that again?

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076e..196f50b241 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5493,13 +5493,17 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 			vardata->acl_ok = true;
 		}
 	}
-	else if (rte->rtekind == RTE_SUBQUERY && !rte->inh)
+	else if ((rte->rtekind == RTE_SUBQUERY && !rte->inh) ||
+			 (rte->rtekind == RTE_CTE && !rte->self_reference))
 	{
 		/*
-		 * Plain subquery (not one that was converted to an appendrel).
+		 * Plain subquery (not one that was converted to an appendrel) or
+		 * non-recursive CTE.  In either case, we can try to find out what the
+		 * Var refers to within the subquery.
 		 */
-		Query	   *subquery = rte->subquery;
-		RelOptInfo *rel;
+		Query	   *subquery;
+		PlannerInfo *subroot;
+		List	   *subtlist;
 		TargetEntry *ste;
 
 		/*
@@ -5508,6 +5512,80 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		if (var->varattno == InvalidAttrNumber)
 			return;
 
+		/*
+		 * Otherwise, find the subquery's query tree and planner subroot.
+		 */
+		if (rte->rtekind == RTE_SUBQUERY)
+		{
+			RelOptInfo *rel;
+
+			/*
+			 * Fetch RelOptInfo for subquery.  Note that we don't change the
+			 * rel returned in vardata, since caller expects it to be a rel of
+			 * the caller's query level.  Because we might already be
+			 * recursing, we can't use that rel pointer either, but have to
+			 * look up the Var's rel afresh.
+			 */
+			rel = find_base_rel(root, var->varno);
+
+			subquery = rte->subquery;
+			subroot = rel->subroot;
+		}
+		else
+		{
+			/* CTE case is more difficult */
+			PlannerInfo *cteroot;
+			Index		levelsup;
+			int			ndx;
+			int			plan_id;
+			ListCell   *lc;
+
+			/*
+			 * Find the referenced CTE, and locate the subroot previously made
+			 * for it.
+			 */
+			levelsup = rte->ctelevelsup;
+			cteroot = root;
+			while (levelsup-- > 0)
+			{
+cteroot = cteroot->parent_root;
+if (!cteroot)	/* shouldn't happen */
+	elog(ERROR, "bad levelsup for CTE \"%s\"", rte->ctename);
+			}
+
+			/*
+			 * Note: cte_plan_ids can be shorter than cteList, if we are still
+			 * working on planning the CTEs (ie, this is a side-reference from
+			 * another CTE).  So we mustn't use forboth here.
+			 */
+			ndx = 0;
+			subquery = NULL;
+			foreach(lc, cteroot->parse->cteList)
+			{
+CommonTableExpr *cte = (CommonTableExpr *) lfirst(lc);
+
+if (strcmp(cte->ctename, rte->ctename) == 0)
+{
+	subquery = castNode(Query, cte->ctequery);
+	break;
+}
+ndx++;
+			}
+			if (subquery == NULL)	/* shouldn't happen */
+elog(ERROR, "could not find CTE \"%s\"", rte->ctename);
+			if (ndx >= list_length(cteroot->cte_plan_ids))
+elog(ERROR, "could not find plan for CTE \"%s\"", rte->ctename);
+			plan_id = list_nth_int(cteroot->cte_plan_ids, ndx);
+			if (plan_id <= 0)
+elog(ERROR, "no plan was made for CTE \"%s\"", rte->ctename);
+			subroot = list_nth(root->glob->subroots, plan_id - 1);
+		}
+
+		/* If the subquery hasn't been planned yet, we have to punt */
+		if (subroot == NULL)
+			return;
+		Assert(IsA(subroot, PlannerInfo));
+
 		/*
 		 * Punt if subquery uses set operations or GROUP BY, as these will
 		 * mash underlying columns' stats beyond 

Re: Wrong sentence in the README?

2023-11-08 Thread Bruce Momjian
On Sun, Sep 22, 2019 at 12:28:02PM -0400, Tom Lane wrote:
> "Daniel Westermann (DWE)"  writes:
> > in the README, top level, there is this:
> 
> > PostgreSQL has many language interfaces, many of which are listed here:
> > https://www.postgresql.org/download
> 
> > I don't think the download page lists any language interfaces or do I miss 
> > something?
> 
> Not directly on that page, though if you drill down into the "software
> catalogue" you can find them.
> 
> Since there's already a link to that same page a bit further down in
> the file, I'm inclined to just remove the quoted sentence.  Maybe
> add something about "and related software" to the later link.  Certainly
> "language interfaces" is just one small part of that.

Four years later, attached patch applied to master to implement these
changes.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/README b/README
index 6416a8cf3b..56d0c951a9 100644
--- a/README
+++ b/README
@@ -9,10 +9,6 @@ that supports an extended subset of the SQL standard, including
 transactions, foreign keys, subqueries, triggers, user-defined types
 and functions.  This distribution also contains C language bindings.
 
-PostgreSQL has many language interfaces, many of which are listed here:
-
-	https://www.postgresql.org/download/
-
 See the file INSTALL for instructions on how to build and install
 PostgreSQL.  That file also lists supported operating systems and
 hardware platforms and contains information regarding any other
@@ -22,6 +18,6 @@ file COPYRIGHT.  A comprehensive documentation set is included in this
 distribution; it can be read as described in the installation
 instructions.
 
-The latest version of this software may be obtained at
-https://www.postgresql.org/download/.  For more information look at our
-web site located at https://www.postgresql.org/.
+The latest version of this software, and related software, may be
+obtained at https://www.postgresql.org/download/.  For more information
+look at our web site located at https://www.postgresql.org/.


Re: XX000: tuple concurrently deleted during DROP STATISTICS

2023-11-08 Thread Tomas Vondra
On 11/8/23 20:58, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 11/8/23 16:52, Tom Lane wrote:
>>> Shouldn't DROP STATISTICS be taking a lock on the associated table
>>> that is strong enough to lock out ANALYZE?
> 
>> Yes, I think that's the correct thing to do. I recall having a
>> discussion about this with someone while working on the patch, leading
>> to the current code. But I haven't managed to find that particular bit
>> in the archives :-(
>> Anyway, the attached patch should fix this by getting the lock, I think.
> 
> This looks generally correct, but surely we don't need it to be as
> strong as AccessExclusiveLock?  There seems no reason to conflict with
> ordinary readers/writers of the table.
> 
> ANALYZE takes ShareUpdateExclusiveLock, and offhand I think this
> command should do the same.
> 

Right. I did copy that from the DROP TRIGGER code somewhat mindlessly, but
you're right that this does not need to block readers/writers.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 19:29:38 +0100, Tomas Vondra wrote:
> >>> I haven't checked, but I'd assume that 100bytes back and forth should 
> >>> easily
> >>> fit a new message to update LSNs and the existing feedback response. Even 
> >>> just
> >>> the difference between sending 100 bytes and sending 10k (a bit more than 
> >>> a
> >>> single WAL page) is pretty significant on a 1gbit network.
> >>>
> >>
> >> I'm on decaf so I may be a bit slow, but it's not very clear to me what
> >> conclusion to draw from these numbers. What is the takeaway?
> >>
> >> My understanding is that in both cases the latency is initially fairly
> >> stable, independent of the request size. This applies to request up to
> >> ~1000B. And then the latency starts increasing fairly quickly, even
> >> though it shouldn't hit the bandwidth (except maybe the 1MB requests).
> >
> > Except for the smallest end, these are bandwidth related, I think. 
> > Converting
> > 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving
> > the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on
> > 1gbit. If you subtract the minmum latency of about 130us, that's nearly all 
> > of
> > the latency.
> >
>
> Maybe I don't understand what you mean "bandwidth related" but surely
> the smaller requests are not limited by bandwidth. I mean, 100B and 1kB
> (and even 10kB) requests have almost the same transaction rate, yet
> there's an order of magnitude difference in bandwidth (sure, there's
> overhead, but this much magnitude?).

What I mean is that bandwidth is the biggest factor determining latency in the
numbers I showed (due to decent sized packet and it being a local network). At
line rate it takes ~80us to send 10kB over 1gbit ethernet. So a roundtrip
cannot be faster than 80us, even if everything else added zero latency.

That's why my numbers show such a lower latency for the 10gbit network - it's
simply faster to put even small-ish amounts of data onto the wire.

That does not mean that the link is fully utilized over time - because we wait
for the other side to receive the data, wake up a user space process, send
back 100 bytes, wait for the data be transmitted, and then wake up a process,
there are periods where the link in one direction is largely idle.  But in
case of a 10kB packet on the 1gbit network, yes, we are bandwidth limited for
~80us (or perhaps more interestingly, we are bandwidth limited for 0.8ms when
sending 100kB).

Greetings,

Andres Freund




Re: max_parallel_workers question

2023-11-08 Thread Bruce Momjian
On Sat, Sep 28, 2019 at 12:10:53AM -0400, Robert Haas wrote:
> On Fri, Sep 27, 2019 at 8:07 PM Jeff Davis  wrote:
> > The current docs for max_parallel_workers start out:
> >
> > "Sets the maximum number of workers that the system can support for
> > parallel operations..."
> >
> > In my interpretation, "the system" means the entire cluster, but the
> > max_parallel_workers setting is PGC_USERSET. That's a bit confusing,
> > because two different backends can have different settings for "the
> > maximum number ... the system can support".
> 
> Oops.
> 
> I intended it to mean "the entire cluster." Basically, how many
> workers out of max_worker_processes are you willing to use for
> parallel query, as opposed to other things. I agree that PGC_USERSET
> doesn't make any sense.

I found two places where "cluster" was better than "system", so I applied
the attached patch to master.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd70ff2e4b..fc35a46e5e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2572,7 +2572,7 @@ include_dir 'conf.d'


 
- Sets the maximum number of background processes that the system
+ Sets the maximum number of background processes that the cluster
  can support.  This parameter can only be set at server start.  The
  default is 8.
 
@@ -2680,7 +2680,7 @@ include_dir 'conf.d'


 
- Sets the maximum number of workers that the system can support for
+ Sets the maximum number of workers that the cluster can support for
  parallel operations.  The default value is 8.  When increasing or
  decreasing this value, consider also adjusting
   and


Re: XID-wraparound hazards in LISTEN/NOTIFY

2023-11-08 Thread Bruce Momjian
On Wed, Nov  8, 2023 at 02:52:16PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
> >>> It suddenly strikes me to worry that we have an XID wraparound hazard
> >>> for entries in the notify queue.
> 
> > Is this still an open issue?  Should it be a TODO item?
> 
> I don't think anyone's done anything about it, so yeah.
> 
> Realistically, if you've got NOTIFY messages that are going unread
> for long enough to risk XID wraparound, your app is broken.  So
> maybe it'd be sufficient to discard messages that are old enough
> to approach the wrap horizon.  But still that's code that doesn't
> exist.

Thanks, TODO added.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 8 Nov 2023, at 19:18, Tom Lane  wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line.  (Or maybe an array of text that could receive the
>> broken-down field values?)  Maybe we could dump the message info,
>> line number, field name etc into additional columns.

> I agree that the errors should be easily visible to the user in some way.  The
> feature is for sure interesting, especially in data warehouse type jobs where
> dirty data is often ingested.

I agree it's interesting, but we need to get it right the first time.

Here is a very straw-man-level sketch of what I think might work.
The option to COPY FROM looks something like

ERRORS TO other_table_name (item [, item [, ...]])

where the "items" are keywords identifying the information item
we will insert into each successive column of the target table.
This design allows the user to decide which items are of use
to them.  I envision items like

LINENO  bigint  COPY line number, counting from 1
LINEtextraw text of line (after encoding conversion)
FIELDS  text[]  separated, de-escaped string fields (the data
that was or would be fed to input functions)
FIELD   textname of troublesome field, if field-specific
MESSAGE texterror message text
DETAIL  texterror message detail, if any
SQLSTATE text   error SQLSTATE code

Some of these would have to be populated as NULL if we didn't get
that far in processing the line.  In the worst case, which is
encoding conversion failure, I think we couldn't populate any of
the data items except LINENO.

Not sure if we need to insist that the target table columns be
exactly the data types I show above.  It'd be nice to allow
the LINENO target to be plain int, perhaps.  OTOH, do we really
want to have to deal with issues like conversion failures while
trying to report an error?
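
To illustrate, usage under this straw-man design might look like the
following (entirely hypothetical syntax; nothing here is implemented):

CREATE TABLE load_errors (
    lineno   bigint,
    line     text,
    message  text,
    sqlstate text
);

COPY target_table FROM '/tmp/data.csv' WITH (FORMAT csv)
    ERRORS TO load_errors (LINENO, LINE, MESSAGE, SQLSTATE);

-- failed lines land in load_errors; good rows are loaded normally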

> As a data point, Greenplum has this feature with additional SQL syntax to
> control it:
>   COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;
> LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
> LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
> operation errors out.  I'm not at all advocating that we should mimic this,
> just wanted to add a reference to postgres derivative where this has been
> implemented.

Hm.  A "reject limit" might be a useful add-on, but I wouldn't advocate
including it in the initial patch.

regards, tom lane




Re: XX000: tuple concurrently deleted during DROP STATISTICS

2023-11-08 Thread Tom Lane
Tomas Vondra  writes:
> On 11/8/23 16:52, Tom Lane wrote:
>> Shouldn't DROP STATISTICS be taking a lock on the associated table
>> that is strong enough to lock out ANALYZE?

> Yes, I think that's the correct thing to do. I recall having a
> discussion about this with someone while working on the patch, leading
> to the current code. But I haven't managed to find that particular bit
> in the archives :-(
> Anyway, the attached patch should fix this by getting the lock, I think.

This looks generally correct, but surely we don't need it to be as
strong as AccessExclusiveLock?  There seems no reason to conflict with
ordinary readers/writers of the table.

ANALYZE takes ShareUpdateExclusiveLock, and offhand I think this
command should do the same.
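
For illustration (not part of the patch), the lock could be verified with
something like:

BEGIN;
DROP STATISTICS t_stats;
-- with the suggested lock level this should report
-- ShareUpdateExclusiveLock, which conflicts with ANALYZE but not with
-- ordinary readers and writers
SELECT locktype, mode, granted
FROM pg_locks
WHERE relation = 't'::regclass;
ROLLBACK;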

regards, tom lane




Re: XID-wraparound hazards in LISTEN/NOTIFY

2023-11-08 Thread Tom Lane
Bruce Momjian  writes:
> On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
>>> It suddenly strikes me to worry that we have an XID wraparound hazard
>>> for entries in the notify queue.

> Is this still an open issue?  Should it be a TODO item?

I don't think anyone's done anything about it, so yeah.

Realistically, if you've got NOTIFY messages that are going unread
for long enough to risk XID wraparound, your app is broken.  So
maybe it'd be sufficient to discard messages that are old enough
to approach the wrap horizon.  But still that's code that doesn't
exist.
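
Until then, applications can at least monitor queue consumption, e.g.:

-- fraction of the NOTIFY queue currently in use (0.0 to 1.0); a value
-- that keeps growing suggests some listener has stopped reading
SELECT pg_notification_queue_usage();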

regards, tom lane




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Daniel Gustafsson
> On 8 Nov 2023, at 19:18, Tom Lane  wrote:

> I think an actually usable feature of this sort would involve
> copying all the failed lines to some alternate output medium,
> perhaps a second table with a TEXT column to receive the original
> data line.  (Or maybe an array of text that could receive the
> broken-down field values?)  Maybe we could dump the message info,
> line number, field name etc into additional columns.

I agree that the errors should be easily visible to the user in some way.  The
feature is for sure interesting, especially in data warehouse type jobs where
dirty data is often ingested.

As a data point, Greenplum has this feature with additional SQL syntax to
control it:

COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;

LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
operation errors out.  I'm not at all advocating that we should mimic this,
just wanted to add a reference to postgres derivative where this has been
implemented.

--
Daniel Gustafsson





Re: XX000: tuple concurrently deleted during DROP STATISTICS

2023-11-08 Thread Tomas Vondra
On 11/8/23 16:52, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 11/8/23 16:10, Justin Pryzby wrote:
>>> I found this in our logs, and reproduced it under v11-v16.
>>>
>>> CREATE TABLE t(a int, b int);
>>> INSERT INTO t SELECT generate_series(1,999);
>>> CREATE STATISTICS t_stats ON a,b FROM t;
>>>
>>> while :; do psql postgres -qtxc "ANALYZE t"; done &
>>> while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &
>>>
>>> It's known that concurrent DDL can hit elog().  But in this case,
>>> there's only one DDL operation.
> 
>> AFAICS this happens because store_statext (after ANALYZE builds the new
>> statistics) does this:
> 
> Shouldn't DROP STATISTICS be taking a lock on the associated table
> that is strong enough to lock out ANALYZE?
> 

Yes, I think that's the correct thing to do. I recall having a
discussion about this with someone while working on the patch, leading
to the current code. But I haven't managed to find that particular bit
in the archives :-(

Anyway, the attached patch should fix this by getting the lock, I think.

- RemoveStatisticsById is what gets called for DROP STATISTICS (or for
dependencies), so that's where we get the AE lock

- RemoveStatisticsDataById gets called from ANALYZE, so that already
should have a lock (so no need to acquire another one)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 36bc8c33ba..dae0fd6b4f 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -734,18 +734,11 @@ void
 RemoveStatisticsById(Oid statsOid)
 {
 	Relation	relation;
+	Relation	rel;
 	HeapTuple	tup;
 	Form_pg_statistic_ext statext;
 	Oid			relid;
 
-	/*
-	 * First delete the pg_statistic_ext_data tuples holding the actual
-	 * statistical data. There might be data with/without inheritance, so
-	 * attempt deleting both.
-	 */
-	RemoveStatisticsDataById(statsOid, true);
-	RemoveStatisticsDataById(statsOid, false);
-
 	/*
 	 * Delete the pg_statistic_ext tuple.  Also send out a cache inval on the
 	 * associated table, so that dependent plans will be rebuilt.
@@ -760,12 +753,25 @@ RemoveStatisticsById(Oid statsOid)
 	statext = (Form_pg_statistic_ext) GETSTRUCT(tup);
 	relid = statext->stxrelid;
 
+	rel = table_open(relid, AccessExclusiveLock);
+
+	/*
+	 * First delete the pg_statistic_ext_data tuples holding the actual
+	 * statistical data. There might be data with/without inheritance, so
+	 * attempt deleting both.
+	 */
+	RemoveStatisticsDataById(statsOid, true);
+	RemoveStatisticsDataById(statsOid, false);
+
 	CacheInvalidateRelcacheByRelid(relid);
 
 	CatalogTupleDelete(relation, &tup->t_self);
 
 	ReleaseSysCache(tup);
 
+	/* Keep lock on the rel until end of xact */
+	table_close(rel, NoLock);
+
 	table_close(relation, RowExclusiveLock);
 }
 


Re: Fix some memory leaks in ecpg.addons

2023-11-08 Thread Alexander Lakhin

Hello Tristan,

08.11.2023 20:37, Tristan Partin wrote:

Are people using some suppression file or setting ASAN_OPTIONS to something?



I use the following:
ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:print_stacktrace=1:\
disable_coredump=0:strict_string_checks=1:check_initialization_order=1:\
strict_init_order=1:detect_stack_use_after_return=0

(You'll need to add detect_stack_use_after_return=0 with a newer clang
(I use clang-18) to workaround an incompatibility of check_stack_depth()
with that sanitizer feature enabled by default.)

There is also another story with hwasan ([1]), and yet another
incompatibility of check_stack_depth() related to the aarch64-specific
address tagging (TBI).

So I would say that fixing ecpg won't make postgres sanitizer-friendly in
a whole.

[1] 
https://www.postgresql.org/message-id/dbf77bf7-6e54-ed8a-c4ae-d196eeb664ce%40gmail.com

Best regards,
Alexander




Re: XID-wraparound hazards in LISTEN/NOTIFY

2023-11-08 Thread Bruce Momjian
On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
> Mark Dilger  writes:
> > On 11/23/19 8:34 AM, Tom Lane wrote:
> >> It suddenly strikes me to worry that we have an XID wraparound hazard
> >> for entries in the notify queue.
> 
> > Is it worth checking for this condition in autovacuum?
> 
> Dunno, maybe.  It's a different avenue to consider, at least.
> 
> > There shouldn't be too much reason to back-patch any of this, since
> > the change in 51004c717 only applies to v13 and onward.  Or do you
> > see the risk you described as "pretty minimal" as still being large
> > enough to outweigh the risk of anything we might back-patch?
> 
> There may not be a risk large enough to worry about before 51004c717,
> assuming that we discount cases like a single session staying
> idle-in-transaction for long enough for the XID counter to wrap
> (which'd cause problems for more than just LISTEN/NOTIFY).  I haven't
> analyzed this carefully enough to be sure.  We'd have to consider
> that, as well as the complexity of whatever fix we choose for HEAD,
> while deciding if we need a back-patch.

Is this still an open issue?  Should it be a TODO item?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Tomas Vondra



On 11/8/23 18:11, Andres Freund wrote:
> Hi,
> 
> On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
>>> I used netperf's tcp_rr between my workstation and my laptop on a local 
>>> 10Gbit
>>> network (albeit with a crappy external card for my laptop), to put some
>>> numbers to this. I used -r $s,100 to test sending a variable sized data to 
>>> the
>>> other size, with the other side always responding with 100 bytes (assuming
>>> that'd more than fit a feedback response).
>>>
>>> Command:
>>> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate";
>>> echo $fields; for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done
>>>
>>> 10gbe:
>>>
>>> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
>>> 10            100            43           64.30         390          96           15526.084
>>> 100           100            57           75.12         428          122          13286.602
>>> 1000          100            47           74.41         270          108          13412.125
>>> 10000         100            89           114.63        712          152          8700.643
>>> 100000        100            167          255.90        584          312          3903.516
>>> 1000000       100            891          1015.99       2470         1143         983.708
>>>
>>>
>>> Same hosts, but with my workstation forced to use a 1gbit connection:
>>>
>>> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
>>> 10            100            78           131.18        2425         257          7613.416
>>> 100           100            81           129.25        425          255          7727.473
>>> 1000          100            100          162.12        1444         266          6161.388
>>> 10000         100            310          686.19        1797         927          1456.204
>>> 100000        100            1006         1114.20       1472         1199         896.770
>>> 1000000       100            8338         8420.96       8827         8498         118.410
> 
> Looks like the 1gbit numbers were somewhat bogus-ified due having configured
> jumbo frames and some network component doing something odd with that
> (handling them in software maybe?).
> 
> 10gbe:
> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> 10            100            56           68.56         483          87           14562.476
> 100           100            57           75.68         353          123          13185.485
> 1000          100            60           71.97         391          94           13870.659
> 10000         100            58           92.42         489          140          10798.444
> 100000        100            184          260.48        1141         338          3834.504
> 1000000       100            926          1071.46       2012         1466         933.009
>
> 1gbe
> request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> 10            100            77           132.19        1097         257          7555.420
> 100           100            79           127.85        534          249          7810.862
> 1000          100            98           155.91        966          265          6406.818
> 10000         100            176          235.37        1451         314          4245.304
> 100000        100            944          1022.00       1380         1148         977.930
> 1000000       100            8649         8768.42       9018         8895         113.703
> 
> 
>>> I haven't checked, but I'd assume that 100bytes back and forth should easily
>>> fit a new message to update LSNs and the existing feedback response. Even 
>>> just
>>> the difference between sending 100 bytes and sending 10k (a bit more than a
>>> single WAL page) is pretty significant on a 1gbit network.
>>>
>>
>> I'm on decaf so I may be a bit slow, but it's not very clear to me what
>> conclusion to draw from these numbers. What is the takeaway?
>>
>> My understanding is that in both cases the latency is initially fairly
>> stable, independent of the request size. This applies to request up to
>> ~1000B. And then the latency starts increasing fairly quickly, even
>> though it shouldn't hit the bandwidth (except maybe the 1MB requests).
> 

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-11-08 Thread Tom Lane
Damir  writes:
> [ v7-0002-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch ]

Sorry for being so late to the party, but ... I don't think this
is a well-designed feature as it stands.  Simply dropping failed rows
seems like an unusable definition for any application that has
pretensions of robustness.  "But", you say, "we're emitting WARNING
messages about it".  That's *useless*.  For most applications WARNING
messages just go into the bit bucket, or worse they cause memory leaks
(because the app never reads them).  An app that tried to read them
would have to cope with all sorts of fun such as translated messages.
Furthermore, as best I can tell from the provided test cases, the
messages completely lack basic context such as which field or line
the problem occurred in.  An app trying to use this to understand
which input lines had failed would not get far.

I think an actually usable feature of this sort would involve
copying all the failed lines to some alternate output medium,
perhaps a second table with a TEXT column to receive the original
data line.  (Or maybe an array of text that could receive the
broken-down field values?)  Maybe we could dump the message info,
line number, field name etc into additional columns.

Also it'd be a good idea to have a vision of how the feature
could be extended to cope with lower-level errors, such as
lines that have the wrong number of columns or other problems
with line-level syntax.  I don't say we need to cope with that
immediately, but it's going to be something people will want
to add, I think.

regards, tom lane




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-11-08 Thread Robert Haas
On Tue, Oct 31, 2023 at 7:39 PM Michael Paquier  wrote:
> Point 7. of what you quote says to use one?  True that this needs a
> refresh, and perhaps a bit fat warning about the fact that these are
> required if you want to fetch WAL from other sources than the local
> pg_wal/.  Perhaps there may be a point of revisiting the default
> behavior of recovery_target_timeline in this case, I don't know.

I don't really know what to say to this -- sure, point 7 of
"Recovering Using a Continuous Archive Backup" says to use
recovery.signal. But as I said in the preceding paragraph, it doesn't
say either "use recovery.signal or standby.signal". Nor does it or
anything else in the documentation explain under what circumstances
you're allowed to have neither. So the whole thing is very unclear.

> >> As you're telling me, and I've considered that as an option as well,
> >> perhaps we should just consider the presence of a backup_label file
> >> with no .signal files as a synonym of crash recovery?  In the recovery
> >> path, currently the essence of the problem is when we do
> >> InArchiveRecovery=true, but ArchiveRecoveryRequested=false, meaning
> >> that it should do archive recovery but we don't want it, and that does
> >> not really make sense.  The rest of the code sort of implies that this
> >> is not a suported combination.  So basically, my suggestion here, is
> >> to just replay WAL up to the end of what's in your local pg_wal/ and
> >> hope for the best, without TLI jumps, except that we'd do nothing.
> >
> > This sentence seems to be incomplete.
>
> I've re-read it, and it looks OK to me.

Well, the sentence ends with "except that we'd do nothing" and I don't
know what that means. It would make sense to me if it said "except
that we'd do nothing about " or "except that we'd do nothing
instead of " but as you've written it basically seems to
boil down to "my suggestion is to replay WAL except do nothing" which
makes no sense. If you replay WAL, you're not doing nothing.

> > But I was not saying we should treat the case where we have a
> > backup_label file like crash recovery. The real question here is why
> > we don't treat it fully like archive recovery.
>
> Timeline jump at the end of recovery?  Archive recovery forces a TLI
> jump by default at the end of redo if there's a signal file, and some
> users may not want a TLI jump by default?

Uggh. I don't know what to think about that. I bet some people do want
that, but that makes it pretty easy to end up with multiple copies of
the same cluster running on the same TLI, too, which is not a thing
that you really want to have happen.

At the end of the day, I'm coming around to the view that the biggest
problem here is the documentation. Nobody can really know what's
supposed to work right now because the documentation doesn't say which
things you are and are not allowed to do and what results you should
expect in each case. If it did, it would be easier to discuss possible
behavior changes. Right now, it's hard to change any code at all,
because there's no list of supported scenarios, so you can't tell
whether a potential change affects a scenario that somebody thinks
should work, or only cases that nobody can possibly care about. It's
sort of possible to reason your way through that, to an extent, but
it's pretty hard. The fact that I didn't know that starting from a
backup with neither recovery.signal nor standby.signal was a thing
that anybody did or cared about is good evidence of that.

I'm coming to the understanding that we have four supported scenarios.
One, no backup_label, no recovery.signal, and no standby.signal.
Hence, replay WAL until the end, then start up. Two, backup_label
exists but neither recovery.signal nor standby.signal does. As before,
but if I understand correctly, now we can check that we reached the
backup end location. Three, recovery.signal exists, with or without
backup_label. Now we create a new TLI at the end of recovery, and
also, now can fetch WAL that is not present in pg_wal using
primary_conninfo or restore_command. In fact, I think we may prefer to
do that over using WAL we have locally, but I'm not quite sure about
that. Fourth, standby.signal exists, with or without backup_label. As
the previous scenario, but now when we reach the end of WAL we wait
for more to appear instead of ending recovery. I have a feeling this
is not quite an exhaustive list of differences between the various
modes, and I'm not even sure that it lists all of the things someone
might try to do. Thoughts?

I also feel like the terminology here sometimes obscures more than it
illuminates. For instance, it seems like ArchiveRecoveryRequested
really means "are any signal files present?" while InArchiveRecovery
means "are we fetching WAL from outside pg_wal rather than using
what's in pg_wal?". But these are not obvious from the names, and
sometimes we have additional variables with overlapping meanings, like
readSource, which indicates whether 

Re: ResourceOwner refactoring

2023-11-08 Thread Alexander Lakhin

Hello Heikki,

08.11.2023 14:37, Heikki Linnakangas wrote:


Fixed these, and pushed. Thanks everyone for reviewing!



Please look at a new assertion failure, which I've managed to trigger with
this script:
CREATE TABLE t(
i01 int, i02 int, i03 int, i04 int, i05 int, i06 int, i07 int, i08 int, i09 
int, i10 int,
i11 int, i12 int, i13 int, i14 int, i15 int, i16 int, i17 int, i18 int, i19 
int, i20 int,
i21 int, i22 int, i23 int, i24 int, i25 int, i26 int
) PARTITION BY LIST (i01);
CREATE TABLE tp PARTITION OF t FOR VALUES IN (1);

(gdb) bt
...
#5  0x560dd4e42f77 in ExceptionalCondition (conditionName=0x560dd5059fbc "owner->narr == 0", fileName=0x560dd5059e31 
"resowner.c", lineNumber=362) at assert.c:66
#6  0x560dd4e93cbd in ResourceOwnerReleaseAll (owner=0x560dd69cebd0, phase=RESOURCE_RELEASE_BEFORE_LOCKS, 
printLeakWarnings=false) at resowner.c:362
#7  0x560dd4e92e22 in ResourceOwnerReleaseInternal (owner=0x560dd69cebd0, phase=RESOURCE_RELEASE_BEFORE_LOCKS, 
isCommit=false, isTopLevel=true) at resowner.c:725
#8  0x560dd4e92d42 in ResourceOwnerReleaseInternal (owner=0x560dd69ce3f8, phase=RESOURCE_RELEASE_BEFORE_LOCKS, 
isCommit=false, isTopLevel=true) at resowner.c:678
#9  0x560dd4e92cdb in ResourceOwnerRelease (owner=0x560dd69ce3f8, phase=RESOURCE_RELEASE_BEFORE_LOCKS, 
isCommit=false, isTopLevel=true) at resowner.c:652

#10 0x560dd47316ef in AbortTransaction () at xact.c:2848
#11 0x560dd47329ac in AbortCurrentTransaction () at xact.c:3339
#12 0x560dd4c37284 in PostgresMain (dbname=0x560dd69779f0 "regression", username=0x560dd69779d8 "law") at 
postgres.c:4370

...

Best regards,
Alexander




Re: cataloguing NOT NULL constraints

2023-11-08 Thread Alvaro Herrera
On 2023-Oct-12, Alexander Lakhin wrote:

Hello,

> I've discovered that that commit added several recursive functions, and
> some of them are not protected from stack overflow.

True.  I reproduced the first two, but didn't attempt to reproduce the
third one -- patching all these to check for stack depth is cheap
protection.  I also patched ATAddCheckNNConstraint:

> (ATAddCheckNNConstraint() is protected because it calls
> AddRelationNewConstraints(), which in turn calls StoreRelCheck() ->
> CreateConstraintEntry() ->  recordDependencyOnSingleRelExpr() ->
> find_expr_references_walker() ->  expression_tree_walker() ->
> expression_tree_walker() -> check_stack_depth().)

because it seems uselessly risky to rely on depth checks that exist on
completely unrelated pieces of code, when the function visibly recurses
on itself.  Especially so since the test cases that demonstrate crashes
are so expensive to run, which means we're not going to detect it if at
some point that other stack depth check stops being called for whatever
reason.

BTW probably the tests could be made much cheaper by running the server
with a lower "ulimit -s" setting.  I didn't try.
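
For reference, the kind of script that triggers the recursion looks
roughly like this (a sketch, not Alexander's exact script; the depth
needed depends on max_stack_depth and the stack limit):

DO $$
BEGIN
    EXECUTE 'CREATE TABLE t0 (a int)';
    FOR i IN 1..10000 LOOP
        EXECUTE format('CREATE TABLE t%s () INHERITS (t%s)', i, i - 1);
    END LOOP;
END $$;
-- recurses once per inheritance level:
ALTER TABLE t0 ALTER COLUMN a SET NOT NULL;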

I noticed one more crash while trying to "drop table" one of the
hierarchies your scripts create.  But it's a preexisting issue which
needs a backpatched fix, and I think Egor already reported it in the
other thread.

Thank you

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."  (E. Dijkstra)




Re: Fix some memory leaks in ecpg.addons

2023-11-08 Thread Tom Lane
"Tristan Partin"  writes:
> On Wed Nov 8, 2023 at 11:18 AM CST, Michael Meskes wrote:
>> Agreed, it's not exactly uncommon for tools like ecpg to not worry
>> about memory. After all it gets freed when the program ends.

> In the default configuration of AddressSanitizer, I can't even complete 
> a full build of Postgres.

Why is the meson stuff building ecpg test cases as part of the core build?
That seems wrong for a number of reasons, not only that we don't hold
that code to the same standards as the core server.

regards, tom lane




Re: Version 14/15 documentation Section "Alter Default Privileges"

2023-11-08 Thread Bruce Momjian
On Wed, Nov  8, 2023 at 01:12:24PM +0100, Laurenz Albe wrote:
> On Tue, 2023-11-07 at 17:30 -0500, Bruce Momjian wrote:
> > You didn't seem to like my SET ROLE suggestion so I removed it.
> 
> I thought that the information that you can use SET ROLE to assume
> the identity of another role is correct, but leads a bit too far
> in the manual page of ALTER DEFAULT PRIVILEGES.

Agreed, it was a stretch.

> > > +  
> > > +   There is no way to change the default privileges for objects created 
> > > by
> > > +   arbitrary roles.  You have run ALTER DEFAULT 
> > > PRIVILEGES
> > 
> > I find the above sentence odd.  What is its purpose?
> 
> I cannot count how many times I have seen the complaint "I have run ALTER 
> DEFAULT
> PRIVILEGES, and now when some other user creates a table, the permissions are
> unchanged". People tend to think that if you omit FOR ROLE, the change 
> applies to
> PUBLIC.

I agree we have to be clear, and this is complex, which is why we are
struggling.  I feel we have to be clear about who is allowed to modify
which default privileges, and what default privileges are active during
object creation.  I ended up basically saying that you can modify the
default privileges of roles you are a member of, but that they don't apply
at creation time for your own role.  I am open to better wording.
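
For example (role and schema names illustrative), the FOR ROLE clause pins
the change to objects created by that specific role:

-- affects only tables that alice herself creates in schema app
ALTER DEFAULT PRIVILEGES FOR ROLE alice IN SCHEMA app
    GRANT SELECT ON TABLES TO readers;

-- without FOR ROLE it applies to the current role, not to PUBLIC
ALTER DEFAULT PRIVILEGES IN SCHEMA app
    GRANT SELECT ON TABLES TO readers;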

> Your improved documentation of "target_role" already covers that somewhat, so 
> if
> you don't like the repetition, I'm alright with that.  I just thought it might
> be worth stating it explicitly.
> 
> I think your patch is fine and ready to go.

Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Fix some memory leaks in ecpg.addons

2023-11-08 Thread Tristan Partin

On Wed Nov 8, 2023 at 11:18 AM CST, Michael Meskes wrote:

Am Mittwoch, dem 08.11.2023 um 12:07 -0500 schrieb Tom Lane:
> "Tristan Partin"  writes:
> > clang and gcc both now support -fsanitize=address,undefined. These
> > are 
> > really useful to me personally when trying to debug issues. 
> > Unfortunately ecpg code has a ton of memory leaks, which makes
> > builds 
> > really painful. It would be great to fix all of them, but I don't
> > have 
> > the patience to try to read flex/bison code. Here are two memory
> > leak 
> > fixes in any case.
> 
> I'm kind of failing to see the point.  As you say, the ecpg

> preprocessor leaks memory like there's no tomorrow.  But given its
> usage (process one source file and exit) I'm not sure that is worth
> much effort to fix.  And what does it buy to fix just two spots?

Agreed, it's not exactly uncommon for tools like ecpg to not worry
about memory. After all it gets freed when the program ends.


In the default configuration of AddressSanitizer, I can't even complete 
a full build of Postgres.


meson setup build -Db_sanitize=address
ninja -C build
[1677/1855] Generating 
src/interfaces/ecpg/test/compat_informix/rfmtlong.c with a custom command
	FAILED: src/interfaces/ecpg/test/compat_informix/rfmtlong.c 
	/home/tristan957/Projects/work/postgresql/build/src/interfaces/ecpg/preproc/ecpg --regression -I../src/interfaces/ecpg/test/compat_informix -I../src/interfaces/ecpg/include/ -C INFORMIX -o src/interfaces/ecpg/test/compat_informix/rfmtlong.c ../src/interfaces/ecpg/test/compat_informix/rfmtlong.pgc


=
==114881==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7f88c34814a8 in strdup (/lib64/libasan.so.8+0x814a8) (BuildId: 
6f17f87dc4c1aa9f9dde7c4856604c3a25ba4872)
#1 0x4cfd93 in get_progname ../src/port/path.c:589
#2 0x4b6dae in main ../src/interfaces/ecpg/preproc/ecpg.c:137
#3 0x7f88c3246149 in __libc_start_call_main 
(/lib64/libc.so.6+0x28149) (BuildId: 651b2bed7ecaf18098a63b8f10299821749766e6)
#4 0x7f88c324620a in __libc_start_main_impl 
(/lib64/libc.so.6+0x2820a) (BuildId: 651b2bed7ecaf18098a63b8f10299821749766e6)
#5 0x402664 in _start 
(/home/tristan957/Projects/work/postgresql/build/src/interfaces/ecpg/preproc/ecpg+0x402664)
 (BuildId: fab06f774e305cbe628e03cdc22d935f7bb70a76)

SUMMARY: AddressSanitizer: 5 byte(s) leaked in 1 allocation(s).
ninja: build stopped: subcommand failed.

Are people using some suppression file or setting ASAN_OPTIONS to 
something?


Here is a patch with a better solution.

--
Tristan Partin
Neon (https://neon.tech)
From da1bce1d68a55d8f00093a1aa1b2dfe913bd4549 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Wed, 8 Nov 2023 11:35:48 -0600
Subject: [PATCH v1] Silence AddressSanitizer when creating custom targets with
 ecpg

ecpg leaks like no tomorrow and will cause builds to fail.
---
 src/interfaces/ecpg/test/compat_oracle/meson.build | 12 
 src/interfaces/ecpg/test/meson.build   |  9 +
 2 files changed, 21 insertions(+)

diff --git a/src/interfaces/ecpg/test/compat_oracle/meson.build b/src/interfaces/ecpg/test/compat_oracle/meson.build
index f07d9859d1..3aff426fec 100644
--- a/src/interfaces/ecpg/test/compat_oracle/meson.build
+++ b/src/interfaces/ecpg/test/compat_oracle/meson.build
@@ -4,6 +4,17 @@ pgc_files = [
   'char_array',
 ]
 
+exe_preproc_kw = {}
+
+# ecpg leaks memory like no tomorrow; silence AddressSanitizer
+if get_option('b_sanitize').contains('address')
+  exe_preproc_kw += {
+'env': {
+  'ASAN_OPTIONS': 'detect_leaks=0'
+}
+  }
+endif
+
 foreach pgc_file : pgc_files
   exe_input = custom_target('@0@.c'.format(pgc_file),
 input: '@0@.pgc'.format(pgc_file),
@@ -13,6 +24,7 @@ foreach pgc_file : pgc_files
   ecpg_preproc_test_command_end,
 install: false,
 build_by_default: false,
+kwargs: exe_preproc_kw,
   )
 
   ecpg_test_dependencies += executable(pgc_file,
diff --git a/src/interfaces/ecpg/test/meson.build b/src/interfaces/ecpg/test/meson.build
index b7a3fb4e0e..621e76bd88 100644
--- a/src/interfaces/ecpg/test/meson.build
+++ b/src/interfaces/ecpg/test/meson.build
@@ -40,6 +40,15 @@ ecpg_preproc_kw = {
   'build_by_default': false,
 }
 
+# ecpg leaks memory like no tomorrow; silence AddressSanitizer
+if get_option('b_sanitize').contains('address')
+  ecpg_preproc_kw += {
+'env': {
+  'ASAN_OPTIONS': 'detect_leaks=0'
+}
+  }
+endif
+
 ecpg_preproc_test_command_start = [
   ecpg_exe,
   '--regression',
-- 
Tristan Partin
Neon (https://neon.tech)



Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-08 Thread vignesh C
On Wed, 8 Nov 2023 at 08:43, vignesh C  wrote:
>
> On Tue, 7 Nov 2023 at 13:25, Amit Kapila  wrote:
> >
> > On Tue, Nov 7, 2023 at 10:01 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Tuesday, November 7, 2023 12:14 PM Kuroda, Hayato/黒田 隼人 
> > >  wrote:
> > > >
> > > > Dear hackers,
> > > >
> > > > PSA the patch to solve the issue [1].
> > > >
> > > > Kindly Peter E. and Andrew raised an issue that delete_old_cluster.sh is
> > > > generated in the source directory, even when the VPATH/meson build.
> > > > This can avoid by changing the directory explicitly.
> > > >
> > > > [1]:
> > > > https://www.postgresql.org/message-id/flat/7b8a9460-5668-b372-04e6-7b
> > > > 52e9308493%40dunslane.net#554090099bbbd12c94bf570665a6badf
> > >
> > > Thanks for the patch, I have confirmed that the files won't be generated
> > > in source directory after applying the patch.
> > >
> > > After running: "meson test -C build/ --suite pg_upgrade",
> > > The files are in the test directory:
> > > ./build/testrun/pg_upgrade/003_logical_slots/data/delete_old_cluster.sh
> > >
> >
> > Thanks for the patch and verification. Pushed the fix.
>
> While verifying the upgrade-of-subscriber patch, I found one issue with
> upgrade in verbose mode.
> I was able to reproduce this issue by performing an upgrade with the
> verbose option.
>
> The trace for the same is given below:
> Program received signal SIGSEGV, Segmentation fault.
> __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126
> 126../sysdeps/x86_64/multiarch/strlen-vec.S: No such file or 
> directory.
> (gdb) bt
> #0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126
> #1  0x5556f572 in dopr (target=0x7fffbb90,
> format=0x5557859e "\", plugin: \"%s\", two_phase: %s",
> args=0x7fffdc40) at snprintf.c:444
> #2  0x5556ed95 in pg_vsnprintf (str=0x7fffbc10 "slot_name:
> \"ication slots within the database:", count=8192, fmt=0x55578590
> "slot_name: \"%s\", plugin: \"%s\", two_phase: %s",
> args=0x7fffdc40) at snprintf.c:195
> #3  0x555667e3 in pg_log_v (type=PG_VERBOSE,
> fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s",
> ap=0x7fffdc40) at util.c:184
> #4  0x55566b38 in pg_log (type=PG_VERBOSE, fmt=0x55578590
> "slot_name: \"%s\", plugin: \"%s\", two_phase: %s") at util.c:264
> #5  0x55561a06 in print_slot_infos (slot_arr=0x55595ed0)
> at info.c:813
> #6  0x5556186e in print_db_infos (db_arr=0x55587518
> ) at info.c:782
> #7  0x555606da in get_db_rel_and_slot_infos
> (cluster=0x555874a0 , live_check=false) at info.c:308
> #8  0x839a in check_new_cluster () at check.c:215
> #9  0x55563010 in main (argc=13, argv=0x7fffdf08) at
> pg_upgrade.c:136
>
> This issue occurs because we are accessing uninitialized slot array 
> information.
>
> We could fix it in a couple of ways: a) Initialize the whole of
> dbinfos by using pg_malloc0 instead of pg_malloc, which will ensure
> that the slot information is set to 0. b) Set only the slot
> information. The attached patch has the changes for both approaches.
> Thoughts?

Here is a small improvement: num_slots need not be initialized any
longer, as it is used only after the result is assigned. The attached
patch includes this change.
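
For reference, option (b) from the quoted mail would look roughly like
this inside the get_db_infos() tuple loop (a sketch only; the attached
patch goes with option (a), i.e. pg_malloc0):

	/* hypothetical variant (b): zero just the slot array explicitly */
	dbinfos[tupnum].slot_arr.slots = NULL;
	dbinfos[tupnum].slot_arr.nslots = 0;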

Regards,
Vignesh
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 7f21d26fd2..4878aa22bf 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -408,7 +408,7 @@ get_db_infos(ClusterInfo *cluster)
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	ntups = PQntuples(res);
-	dbinfos = (DbInfo *) pg_malloc(sizeof(DbInfo) * ntups);
+	dbinfos = (DbInfo *) pg_malloc0(sizeof(DbInfo) * ntups);
 
 	for (tupnum = 0; tupnum < ntups; tupnum++)
 	{
@@ -636,15 +636,11 @@ get_old_cluster_logical_slot_infos(DbInfo *dbinfo, bool live_check)
 	PGconn	   *conn;
 	PGresult   *res;
 	LogicalSlotInfo *slotinfos = NULL;
-	int			num_slots = 0;
+	int			num_slots;
 
 	/* Logical slots can be migrated since PG17. */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
-	{
-		dbinfo->slot_arr.slots = slotinfos;
-		dbinfo->slot_arr.nslots = num_slots;
 		return;
-	}
 
	conn = connectToServer(&old_cluster, dbinfo->db_name);
 


Re: meson documentation build open issues

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 12:04:30 +0100, Peter Eisentraut wrote:
> Ok, I didn't know about ninja install-world.  That works for me.  Maybe a
> "world" target would also be good.

Yea, I thought so as well. I'll send out a patch shortly. Kinda wondering if
it's worth backpatching to 16. Uniformity seems useful and it's low risk.


> I played around with this a bit and noticed some files missing or in the
> wrong place.  See two attached patches (plus e9f075f9a1 already committed).

Makes sense.

Greetings,

Andres Freund




Re: pg_upgrade and logical replication

2023-11-08 Thread vignesh C
On Thu, 2 Nov 2023 at 17:01, Amit Kapila  wrote:
>
> On Thu, Nov 2, 2023 at 3:41 PM vignesh C  wrote:
> >
> > I have slightly modified it now and also made it consistent with the
> > replication slot upgrade, but I was not sure if we need to add
> > anything more. Let me know if anything else needs to be added. I will
> > add it.
> >
>
> I think it is important for users to know how they upgrade their
> multi-node setup. Say a two-node setup where replication is working
> both ways (aka each node has both publications and subscriptions),
> similarly, how to upgrade, if there are multiple nodes involved?

I was thinking of documenting something like this:
Steps to upgrade logical replication clusters:
Warning:
Upgrading logical replication nodes requires multiple steps to be
performed. Because not all operations are transactional, the user is
advised to take backups.
Backups can be taken as described in
https://www.postgresql.org/docs/current/backup.html

Upgrading a two-node logical replication cluster:
1) Let's say the publisher is in Node1 and the subscriber is in Node2.
2) Stop the publisher server in Node1.
3) Disable the subscriptions in Node2 (ALTER SUBSCRIPTION ... DISABLE).
4) Upgrade the publisher node Node1 to Node1_new.
5) Start the publisher node Node1_new.
6) Stop the subscriber server in Node2.
7) Upgrade the subscriber node Node2 to Node2_new.
8) Start the subscriber node Node2_new.
9) Alter the subscription connections in Node2_new to point from Node1
to Node1_new (ALTER SUBSCRIPTION ... CONNECTION ...).
10) Enable the subscriptions in Node2_new (ALTER SUBSCRIPTION ... ENABLE).
11) Create any tables that were created in Node1_new between step 5 and
now, and refresh the publications (ALTER SUBSCRIPTION ... REFRESH
PUBLICATION).

Steps to upgrade cascaded logical replication clusters:
1) Let's say we have a cascaded logical replication setup
Node1->Node2->Node3. Here Node2 is subscribing to Node1 and Node3 is
subscribing to Node2.
2) Stop the server in Node1.
3) Disable the subscriptions in Node2 and Node3.
4) Upgrade the publisher node Node1 to Node1_new.
5) Start the publisher node Node1_new.
6) Stop the server in Node2.
7) Upgrade the subscriber node Node2 to Node2_new.
8) Start the subscriber node Node2_new.
9) Alter the subscription connections in Node2_new to point from Node1
to Node1_new.
10) Enable the subscriptions in Node2_new.
11) Create any tables that were created in Node1_new between step 5
and now, and refresh the publications.
12) Stop the server in Node3.
13) Upgrade the subscriber node Node3 to Node3_new.
14) Start the subscriber node Node3_new.
15) Alter the subscription connections in Node3_new to point from
Node2 to Node2_new.
16) Enable the subscriptions in Node3_new.
17) Create any tables that were created in Node2_new between step 8
and now, and refresh the publications.

Upgrading a two-node circular logical replication cluster:
1) Let's say we have a circular logical replication setup Node1->Node2
& Node2->Node1. Here Node2 is subscribing to Node1 and Node1 is
subscribing to Node2.
2) Stop the server in Node1.
3) Disable the subscriptions in Node2.
4) Upgrade the node Node1 to Node1_new.
5) Start the node Node1_new.
6) Enable the subscriptions in Node1_new.
7) Wait till all the incremental changes are synchronized.
8) Alter the subscription connections in Node2 to point from Node1 to Node1_new.
9) Create any tables that were created in Node2 between step 2 and now,
and refresh the publications.
10) Stop the server in Node2.
11) Disable the subscriptions in Node1_new.
12) Upgrade the node Node2 to Node2_new.
13) Start the subscriber node Node2_new.
14) Enable the subscriptions in Node2_new.
15) Alter the subscription connections in Node1_new to point from Node2
to Node2_new.
16) Create any tables that were created in Node1_new between step 10
and now, and refresh the publications.

I have done basic testing with this; I will do further testing and
update it if I find any issues.
Let me know if this idea is OK or if we need something different.

Regards,
Vignesh




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-07 15:55:48 -0500, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-11-06 13:02:50 -0500, Stephen Frost wrote:
> > > > > The max_total_memory limit is checked whenever the global counters are
> > > > > updated. There is no special error handling if a memory allocation 
> > > > > exceeds
> > > > > the global limit. That allocation returns a NULL for malloc style
> > > > > allocations or an ENOMEM for shared memory allocations. Postgres has
> > > > > existing mechanisms for dealing with out of memory conditions.
> > > > 
> > > > I still think it's extremely unwise to do tracking of memory and 
> > > > limiting of
> > > > memory in one patch.  You should work towards and acceptable patch that 
> > > > just
> > > > tracks memory usage in an as simple and low overhead way as possible. 
> > > > Then we
> > > > later can build on that.
> > > 
> > > Frankly, while tracking is interesting, the limiting is the feature
> > > that's needed more urgently imv.
> > 
> > I agree that we need limiting, but that the tracking needs to be very robust
> > for that to be usable.
> 
> Is there an issue with the tracking in the patch that you saw?  That's
> certainly an area that we've tried hard to get right and to match up to
> numbers from the rest of the system, such as the memory context system.

There are some details I am pretty sure aren't right - the DSM tracking piece
seems bogus to me. But beyond that: I don't know. There's enough other stuff
in the patch that it's hard to focus on that aspect. That's why I'd like to
merge a patch doing just that, so we actually can collect numbers. If any of
the developers of the patch had focused on polishing that part instead of
focusing on the limiting, it'd have been ready to be merged a while ago, maybe
even in 16.  I think the limiting piece is unlikely to be ready for 17.
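
To be concrete about what "simple and low overhead" could mean, here is a
rough sketch - not code from the patch; the shared struct name and the
batching threshold are invented for illustration:

	static int64 unreported_alloc_bytes = 0;	/* process-local */

	static inline void
	report_allocation(int64 nbytes)		/* nbytes < 0 on free */
	{
		unreported_alloc_bytes += nbytes;

		/* touch shared memory only once per ~1MB of net change */
		if (unreported_alloc_bytes > 1024 * 1024 ||
			unreported_alloc_bytes < -1024 * 1024)
		{
			pg_atomic_fetch_add_u64(&backend_mem_stats->total_bytes,
									unreported_alloc_bytes);
			unreported_alloc_bytes = 0;
		}
	}

Something in that vein would let us collect real numbers without slowing
down allocation-heavy workloads.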

Greetings,

Andres Freund




Re: Fix some memory leaks in ecpg.addons

2023-11-08 Thread Michael Meskes
On Wednesday, 2023-11-08 at 12:07 -0500, Tom Lane wrote:
> "Tristan Partin"  writes:
> > clang and gcc both now support -fsanitize=address,undefined. These
> > are 
> > really useful to me personally when trying to debug issues. 
> > Unfortunately ecpg code has a ton of memory leaks, which makes
> > builds 
> > really painful. It would be great to fix all of them, but I don't
> > have 
> > the patience to try to read flex/bison code. Here are two memory
> > leak 
> > fixes in any case.
> 
> I'm kind of failing to see the point.  As you say, the ecpg
> preprocessor leaks memory like there's no tomorrow.  But given its
> usage (process one source file and exit) I'm not sure that is worth
> much effort to fix.  And what does it buy to fix just two spots?

Agreed, it's not exactly uncommon for tools like ecpg not to worry
about memory. After all, it gets freed when the program ends.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De
Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org




Re: Syncrep and improving latency due to WAL throttling

2023-11-08 Thread Andres Freund
Hi,

On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
> > I used netperf's tcp_rr between my workstation and my laptop on a local
> > 10Gbit network (albeit with a crappy external card for my laptop), to put
> > some numbers to this. I used -r $s,100 to test sending variable sized data
> > to the other side, with the other side always responding with 100 bytes
> > (assuming that'd more than fit a feedback response).
> >
> > Command:
> > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate";
> >  echo $fields; for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t
> > TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done
> >
> > 10gbe:
> >
> > request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> > 10            100            43           64.30         390          96           15526.084
> > 100           100            57           75.12         428          122          13286.602
> > 1000          100            47           74.41         270          108          13412.125
> > 10000         100            89           114.63        712          152          8700.643
> > 100000        100            167          255.90        584          312          3903.516
> > 1000000       100            891          1015.99       2470         1143         983.708
> >
> >
> > Same hosts, but with my workstation forced to use a 1gbit connection:
> >
> > request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
> > 10            100            78           131.18        2425         257          7613.416
> > 100           100            81           129.25        425          255          7727.473
> > 1000          100            100          162.12        1444         266          6161.388
> > 10000         100            310          686.19        1797         927          1456.204
> > 100000        100            1006         1114.20       1472         1199         896.770
> > 1000000       100            8338         8420.96       8827         8498         118.410

Looks like the 1gbit numbers were somewhat bogus-ified due to having configured
jumbo frames and some network component doing something odd with that
(handling them in software maybe?).

10gbe:
request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
10            100            56           68.56         483          87           14562.476
100           100            57           75.68         353          123          13185.485
1000          100            60           71.97         391          94           13870.659
10000         100            58           92.42         489          140          10798.444
100000        100            184          260.48        1141         338          3834.504
1000000       100            926          1071.46       2012         1466         933.009

1gbe
request_size  response_size  min_latency  mean_latency  max_latency  p99_latency  transaction_rate
10            100            77           132.19        1097         257          7555.420
100           100            79           127.85        534          249          7810.862
1000          100            98           155.91        966          265          6406.818
10000         100            176          235.37        1451         314          4245.304
100000        100            944          1022.00       1380         1148         977.930
1000000       100            8649         8768.42       9018         8895         113.703


> > I haven't checked, but I'd assume that 100 bytes back and forth should easily
> > fit a new message to update LSNs and the existing feedback response. Even
> > just the difference between sending 100 bytes and sending 10k (a bit more
> > than a single WAL page) is pretty significant on a 1gbit network.
> >
>
> I'm on decaf so I may be a bit slow, but it's not very clear to me what
> conclusion to draw from these numbers. What is the takeaway?
>
> My understanding is that in both cases the latency is initially fairly
> stable, independent of the request size. This applies to requests up to
> ~1000B. And then the latency starts increasing fairly quickly, even
> though it shouldn't hit the bandwidth (except maybe the 1MB requests).

Except for the smallest end, these are bandwidth related, I think. Converting
1gbit/s to bytes/us gives 125 bytes/us - before tcp/ip overhead. At that rate,
pushing 1,000,000 bytes takes about 8000 us, which lines up with the ~8768 us
mean latency measured for the largest request size on 1gbe.

Re: Fix some memory leaks in ecpg.addons

2023-11-08 Thread Tom Lane
"Tristan Partin"  writes:
> clang and gcc both now support -fsanitize=address,undefined. These are 
> really useful to me personally when trying to debug issues. 
> Unfortunately ecpg code has a ton of memory leaks, which makes builds 
> really painful. It would be great to fix all of them, but I don't have 
> the patience to try to read flex/bison code. Here are two memory leak 
> fixes in any case.

I'm kind of failing to see the point.  As you say, the ecpg
preprocessor leaks memory like there's no tomorrow.  But given its
usage (process one source file and exit) I'm not sure that is worth
much effort to fix.  And what does it buy to fix just two spots?

regards, tom lane




Fix some memory leaks in ecpg.addons

2023-11-08 Thread Tristan Partin
clang and gcc both now support -fsanitize=address,undefined. These are 
really useful to me personally when trying to debug issues. 
Unfortunately ecpg code has a ton of memory leaks, which makes builds 
really painful. It would be great to fix all of them, but I don't have 
the patience to try to read flex/bison code. Here are two memory leak 
fixes in any case.


--
Tristan Partin
Neon (https://neon.tech)
From af69e1711e87ae96d75814fc83a8588a4bdf7cac Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Wed, 8 Nov 2023 10:53:02 -0600
Subject: [PATCH v1] Patch two memory leaks in ecpg.addons

make_str() allocates memory, but we were never releasing it. There are
more issues in this code, but these were just two easy ones to fix.
---
 src/interfaces/ecpg/preproc/ecpg.addons | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons
index e94da2a3f8..dd1ea39c6e 100644
--- a/src/interfaces/ecpg/preproc/ecpg.addons
+++ b/src/interfaces/ecpg/preproc/ecpg.addons
@@ -51,6 +51,8 @@ ECPG: stmtExecuteStmt block
 str[strlen(str) - 1] = '\0';
 sprintf(length, "%zu", strlen(str));
 add_variable_to_tail(&argsinsert, new_variable(str, ECPGmake_simple_type(ECPGt_const, length, 0), 0), &no_indicator);
+
+free(str);
 			}
 			output_statement(cat_str(3, mm_strdup("execute"), mm_strdup("$0"), $1.type), 0, ECPGst_exec_with_exprlist);
 		}
@@ -79,6 +81,8 @@ ECPG: stmtPrepareStmt block
 str[strlen(str) - 1] = '\0';
 sprintf(length, "%zu", strlen(str));
 add_variable_to_tail(&argsinsert, new_variable(str, ECPGmake_simple_type(ECPGt_const, length, 0), 0), &no_indicator);
+
+free(str);
 			}
 			output_statement(cat_str(5, mm_strdup("prepare"), mm_strdup("$0"), $1.type, mm_strdup("as"), $1.stmt), 0, ECPGst_prepare);
 		}
-- 
Tristan Partin
Neon (https://neon.tech)



Re: Add PQsendSyncMessage() to libpq

2023-11-08 Thread Alvaro Herrera
On 2023-Nov-07, Jelte Fennema-Nio wrote:

> I think this function should be named something with the "PQsend"
> prefix since that's the way we name all our public async message
> sending functions in libpq. The "Put" word we only use in internal
> libpq functions, so I feel it has no place in the external API
> surface. My proposal would be to call the function PQsendPipelineSync
> (i.e. having the PQsend prefix while still looking similar to the
> existing PQpipelineSync).

Argued that way, it makes sense to me.

> Also I think the flag argument is completely unnecessary. [...]
> Instead of looking at it as adding another version of our current
> PQpipelineSync API, we should look at it as an addition to our current
> list of PQsend functions for a new packet type. And none of those
> PQsend functions ever needed a flag.

True.

> Finally, I have one suggestion for a behavioural change: I think the
> function should still call pqPipelineFlush, just like all of our other
> PQsend functions (except PQsendFlushRequest, but that seems like an
> oversight there too).

I agree.

So, yeah, it looks like this will be pretty similar to Anton's original
patch, with PQpipelineSync() being just PQsendPipelineSync() + PQflush().
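
That is, roughly this shape (a sketch, not the final patch):

	int
	PQpipelineSync(PGconn *conn)
	{
		if (PQsendPipelineSync(conn) == 0)
			return 0;			/* error already saved in conn */

		/* flush immediately, preserving the historical behavior */
		if (pqFlush(conn) < 0)
			return 0;
		return 1;
	}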

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)




Re: Call pqPipelineFlush from PQsendFlushRequest

2023-11-08 Thread Alvaro Herrera
On 2023-Nov-07, Michael Paquier wrote:

> On Tue, Nov 07, 2023 at 10:38:04AM +0100, Jelte Fennema-Nio wrote:
> > In pipeline mode after queuing a message to be sent we would flush the
> > buffer if the size of the buffer passed some threshold. The only message
> > type that we didn't do that for was the Flush message. This addresses
> > that oversight.
> > 
> > I noticed this discrepancy while reviewing the
> > PQsendSyncMessage/PQpipelinePutSync patchset.
> 
> Indeed, it looks a bit strange that there is no flush if the buffer
> threshold is reached once the message is sent, so your suggestion
> sounds right.  Alvaro?

Pushed, thanks.
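
For the archives: the change amounts to adding this at the end of
PQsendFlushRequest(), after the Flush message has been queued
(paraphrased, not the literal diff):

	/*
	 * Give the data a push if we're past the size threshold, as the
	 * other PQsend* functions do.  In nonblocking mode, PQgetResult()
	 * will take care of any remaining flushing.
	 */
	if (pqPipelineFlush(conn) < 0)
		return 0;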

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Porque Kim no hacía nada, pero, eso sí,
con extraordinario éxito" ("Kim", Kipling)




Re: 64-bit integer subtraction bug on some platforms

2023-11-08 Thread Tom Lane
Laurenz Albe  writes:
> On Wed, 2023-11-08 at 11:58 +, Dean Rasheed wrote:
>> This should overflow, since the correct result (+9223372036854775808)
>> is out of range. However, on platforms without integer overflow
>> builtins or 128-bit integers, pg_sub_s64_overflow() does the
>> following:
>> ...
>> which fails to spot the fact that overflow is also possible when a ==
>> 0. So on such platforms, it returns the wrong result.
>> 
>> Patch attached.

> The patch looks good to me.

+1: good catch, fix looks correct.
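
For anyone following along without the patch at hand, the non-builtin
fallback is essentially this (paraphrased from src/include/common/int.h):

	if ((a < 0 && b > 0 && a < PG_INT64_MIN + b) ||
		(a > 0 && b < 0 && a > PG_INT64_MAX + b))
	{
		*result = 0x5EED;		/* to avoid spurious warnings */
		return true;
	}
	*result = a - b;
	return false;

and the fix widens the second arm to test a >= 0, so the a == 0,
b == PG_INT64_MIN case is caught.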

regards, tom lane




Re: Fix use of openssl.path() if openssl isn't found

2023-11-08 Thread Tristan Partin

On Wed Nov 8, 2023 at 2:31 AM CST, Michael Paquier wrote:

On Wed, Nov 08, 2023 at 12:07:49AM -0600, Tristan Partin wrote:
>    'with_ssl': ssl_library,
> -  'OPENSSL': openssl.path(),
> +  'OPENSSL': openssl.found() ? openssl.path : '',

Except that this was incorrect.  I've fixed the grammar and applied
that down to 16.


Coding at 12 in the morning is never conducive to coherent thought :). 
Thanks. Sorry for the trouble.


--
Tristan Partin
Neon (https://neon.tech)




Re: Cleaning up array_in()

2023-11-08 Thread Tom Lane
Alexander Lakhin  writes:
> Thank you for the update! I haven't looked into the code, just did manual
> testing and rechecked commands given in the arrays documentation ([1]).
> Everything works correctly, except for one minor difference:
> INSERT INTO sal_emp
>      VALUES ('Bill',
>      '{10000, 10000, 10000, 10000}',
>      '{{"meeting", "lunch"}, {"meeting"}}');

> currently gives:
> ERROR:  malformed array literal: "{{"meeting", "lunch"}, {"meeting"}}"
> LINE 4: '{{"meeting", "lunch"}, {"meeting"}}');
>      ^
> DETAIL:  Multidimensional arrays must have sub-arrays with matching 
> dimensions.

> not
> ERROR:  multidimensional arrays must have array expressions with matching 
> dimensions

Oh!  I had not realized we had actual documentation examples covering
this area.  Yeah, that doc needs to be updated to show the current
wording of the error.  Thanks for catching that.

regards, tom lane




Re: meson documentation build open issues

2023-11-08 Thread Tristan Partin

Looks good to me. Thanks for finding this.

--
Tristan Partin
Neon (https://neon.tech)




Re: meson documentation build open issues

2023-11-08 Thread Alvaro Herrera
On 2023-Nov-08, Peter Eisentraut wrote:

> I think we could build doc/src/sgml/postgres-full.xml by default.  That
> takes less than 0.5 seconds here and it's an intermediate target for html
> and man.

If that detects problems like the id attributes you mentioned, apart
from the other checks in the `xmllint --noout`, then that WFM.

At least with the makefile the command to produce postgres-full.xml
includes --valid, so I think we're covered.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)




Re: XX000: tuple concurrently deleted during DROP STATISTICS

2023-11-08 Thread Tom Lane
Tomas Vondra  writes:
> On 11/8/23 16:10, Justin Pryzby wrote:
>> I found this in our logs, and reproduced it under v11-v16.
>> 
>> CREATE TABLE t(a int, b int);
>> INSERT INTO t SELECT generate_series(1,999);
>> CREATE STATISTICS t_stats ON a,b FROM t;
>> 
>> while :; do psql postgres -qtxc "ANALYZE t"; done &
>> while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &
>> 
>> It's known that concurrent DDL can hit elog().  But in this case,
>> there's only one DDL operation.

> AFAICS this happens because store_statext (after ANALYZE builds the new
> statistics) does this:

Shouldn't DROP STATISTICS be taking a lock on the associated table
that is strong enough to lock out ANALYZE?
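
Something like this near the top of RemoveStatisticsById(), perhaps -
an untested sketch, assuming the object's stxrelid has already been
fetched into statextform:

	/*
	 * ANALYZE takes ShareUpdateExclusiveLock on the table, so taking
	 * the same self-conflicting lock here would serialize DROP
	 * STATISTICS against any concurrent ANALYZE.
	 */
	LockRelationOid(statextform->stxrelid, ShareUpdateExclusiveLock);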

regards, tom lane




Re: XX000: tuple concurrently deleted during DROP STATISTICS

2023-11-08 Thread Tomas Vondra
On 11/8/23 16:10, Justin Pryzby wrote:
> I found this in our logs, and reproduced it under v11-v16.
> 
> CREATE TABLE t(a int, b int);
> INSERT INTO t SELECT generate_series(1,999);
> CREATE STATISTICS t_stats ON a,b FROM t;
> 
> while :; do psql postgres -qtxc "ANALYZE t"; done &
> while :; do psql postgres -qtxc "begin; DROP STATISTICS t_stats"; done &
> 
> It's known that concurrent DDL can hit elog().  But in this case,
> there's only one DDL operation.
> 
AFAICS this happens because store_statext (after ANALYZE builds the new
statistics) does this:


/*
 * Delete the old tuple if it exists, and insert a new one. It's easier
 * than trying to update or insert, based on various conditions.
 */
RemoveStatisticsDataById(statOid, inh);

/* form and insert a new tuple */
stup = heap_form_tuple(RelationGetDescr(pg_stextdata), values, nulls);
CatalogTupleInsert(pg_stextdata, stup);


So it deletes the tuple first (if there's one), and then inserts the new
statistics tuple.

We could update the tuple instead, but that would be more complex (as
the comment explains), and it doesn't actually fix anything because then
simple_heap_delete just fails with TM_Updated instead.

I think the only solution would be to lock the statistics tuple before
running ANALYZE, or something like that. Or maybe we should even lock
the statistics object itself, so that ANALYZE and DROP can't run
concurrently on it?
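
If we went the object-locking route, it could be as simple as both code
paths doing (untested sketch):

	/* in ANALYZE, before store_statext, and likewise in DROP STATISTICS */
	LockDatabaseObject(StatisticExtRelationId, statOid, 0,
					   AccessExclusiveLock);

so whichever comes second simply waits for the first to commit.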


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



