date:20241002

On Fri, May 17, 2024 at 12:46:49PM +1000, David Steele wrote:
> This is greatly simplified implementation of the patch proposed in [1] and
> hopefully it addresses the concerns expressed there. Since the
> implementation is quite different it seemed like a new thread was
> appropriate, especially since the old thread title would be very misleading
> regarding the new functionality.

-/* No backup_label file has been found if we are here. */
+/*
+ * No backup_label file has been found if we are here. Error if the
+ * control file requires backup_label.
+ */
+if (ControlFile->backupLabelRequired)
+ereport(FATAL,
+(errmsg("could not find backup_label required for 
recovery"),
+ errhint("backup_label must be present for recovery to 
succeed")));

I thought that this had some similarities with my last fight in this
area, where xlogrecovery.c would fail hard if there was a backup_label
file but no signal files, but nope that's not the case:
https://www.postgresql.org/message-id/flat/ZArVOMifjzE7f8W7%40paquier.xyz

> The basic idea is to harden recovery by returning a copy of pg_control from
> pg_backup_stop() that has a flag set to prevent recovery if the backup_label
> file is missing. Instead of backup software copying pg_control from PGDATA,
> it stores an updated version that is returned from pg_backup_stop().

Hmm, okay.  There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces.  As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces.  That seems slightly
cleaner to me, still I agree that both are the same things.

Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.

> * We don't need to worry about backup software seeing a torn copy of
> pg_control, since Postgres can safely read it out of memory and provide a
> valid copy via pg_backup_stop(). This solves torn reads without needing to
> write pg_control via a temp file, which may affect performance on a standby.

We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues.  So I'm not sure to
see your point here?

> * For backup from standby, we no longer need to instruct the backup software
> to copy pg_control last. In fact the backup software should not copy
> pg_control from PGDATA at all.

Yep.  Not from PGDATA, but from the function.

> These changes have no impact on current backup software and they are free to
> use the pg_control available from pg_stop_backup() or continue to use
> pg_control from PGDATA. Of course they will miss the benefits of getting a
> consistent copy of pg_control and the backup_label checking, but will be no
> worse off than before.

There is a large comment block in do_pg_backup_stop() around
backup_stopped_in_recovery.  Perhaps it should be improved based on
this patch.

The main concern that I have over this patch is: who is actually going
to use this extension of the SQL stop function?  Perhaps existing
backup solutions are good enough risk vs reward is not worth it?  The
label_file and the tablespace map are text, this is a series of bytes
that has no visibility for the end-user unless checked on the
client-side.  This adds a new step where backups would need to copy
the control file to the data folder.
--
Michael

signature.asc
Description: PGP signature

Re: Enhance create subscription reference manual

2024-10-02 Thread Tatsuo Ishii

>> > I agreed with adding  tag to "failover" since it is done in the
>> > description on "slot_name" parameter. 
>> > 
>> > How about also rewrite it to "enable the failover option" rather than 
>> > simply
>> > "enable the failover" to clarify that the parameter is refereed to.
>> > 
>> > We could also use "enable the failover parameter". I think both make 
>> > sense, but
>> > it seems that "failover option" is preferred in the slot_name description.
>> 
>> But a few lines above we have:
>> 
>>  
>>   This clause specifies optional parameters for a subscription.
>>  
>> 
>>  
>>   The following parameters control what happens during subscription 
>> creation:
>> 
>> So it seems "paramter" is more consistent than "option" here.
> 
> For consistency, using "parameter" seems better. 
> 
> If we consider this, should we rewrite other places using "option" to use 
> "parameter"?
> For example, I can find uses of "option" in the "connect", "slot_name", and 
> "binary"
> descriptions in the CREATE SUBSCRIPTION document.

Not sure. In some places I think "option" is an abbreviation of
"optional parameter". So using "option" there does not seem to be
inconsistent or incorrect.  See following example in create
subscription manual:

This clause specifies optional parameters for a subscription.
:
:
connect (boolean)
:
:
Since no connection is made when this option is false, no tables are 
subscribed. 

> Also, the "public" parameter in CREATE PUBLICATION doc,

You mean "publish"?

> "vacuum_index_cleanup" and
> "vacuum_truncate" storage parameters in CREATE TABLE doc might be also 
> targets to be
> rewritten.  I am not sure if this covers all, though.

I would like to hear opinions from native English speakers.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: Enhance create subscription reference manual

2024-10-02 Thread Yugo NAGATA

On Wed, 02 Oct 2024 16:25:35 +0900 (JST)
Tatsuo Ishii  wrote:


> > Also, the "public" parameter in CREATE PUBLICATION doc,
> 
> You mean "publish"?

Yes. Sorry for the typo.

> 
> > "vacuum_index_cleanup" and
> > "vacuum_truncate" storage parameters in CREATE TABLE doc might be also 
> > targets to be
> > rewritten.  I am not sure if this covers all, though.
> 
> I would like to hear opinions from native English speakers.

+1

Regards,
Yugo Nagata


-- 
Yugo NAGATA

Re: not null constraints, again

2024-10-02 Thread Alvaro Herrera

On 2024-Oct-02, jian he wrote:

> On Tue, Oct 1, 2024 at 11:20 PM Alvaro Herrera  
> wrote:
> >
> > On 2024-Oct-01, jian he wrote:
> >
> > > create table t2 (a int primary key  constraint foo not null no inherit);
> > > primary key cannot coexist with not-null no inherit?
> > > here t2, pg_dump/restore will fail.
> >
> > Yeah, this needs to throw an error.  If you use a table constraint, it
> > does fail as expected:
> >
> > create table notnull_tbl_fail (a int primary key, not null a no inherit);
> > ERROR:  conflicting NO INHERIT declaration for not-null constraint on 
> > column "a"
> >
> > I missed adding the check in the column constraint case.
> >
> after v7, still not bullet-proof. as before, pg_dump/restore will fail
> for the following:
> 
> drop table if exists t2, t2_0
> create table t2 (a int, b int, c int, constraint foo primary key(a),
> constraint foo1 not null a no inherit);
> create table t2_0 (a int constraint foo1 not null no inherit, b int, c
> int, constraint foo12 primary key(a));

Rats.  Fixing :-)

> drop table if exists idxpart,idxpart0,idxpart1 cascade;
> create table idxpart (a int not null) partition by list (a);
> create table idxpart0 (a int constraint foo not null no inherit);
> alter table idxpart attach partition idxpart0 for values in (0,1);
> 
> With V7, we basically cannot change the status of "NO INHERIT".
> now, we need to drop the not-null constraint foo,
> recreate a not-null constraint on idxpart0,
> then attach it to the partitioned table idxpart.

Yeah, that sucks.  We'll need a new command
  ALTER TABLE .. ALTER CONSTRAINT .. INHERIT
(exact syntax TBD) which allows you to turn a NO INHERIT constraint into
a normal one, to avoid this problem.  I suggest we don't hold up this
patch for that.

> -
> drop table if exists inh_parent,inh_child1,inh_child2;
> create table inh_parent(f1 int);
> create table inh_child1(f1 int not null);
> alter table inh_child1 inherit inh_parent;
> alter table only inh_parent add constraint nn not null f1;
> alter table only inh_parent  alter column f1 set not null;
> 
> minor inconsistency, i guess.
> "alter table only inh_parent add constraint nn not null f1;"
> will fail.
> But
> "alter table only inh_parent  alter column f1 set not null;"
> will not fail, but add a "NOT NULL f1 NO INHERIT" constraint.
> I thought they should behave the same.
> 
> 
> for partitioned table
> now both ALTER TABLE ONLY ADD CONSTRAINT NOT NULL,
> ALTER TABLE ONLY ALTER COLUMN  SET NOT NULL
> will error out.
> I am fine with partitioned table behavior.

Yeah, this naughty relationship between ONLY and NO INHERIT is
bothersome and maybe we need to re-examine it.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
#error "Operator lives in the wrong universe"
  ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)

Re: Add support to TLS 1.3 cipher suites and curves lists

2024-10-02 Thread Jacob Champion

On Wed, Sep 25, 2024 at 6:39 AM Daniel Gustafsson  wrote:
> I can't recall specific bounds for supporting LibreSSL even being discussed,
> the support is also not documented as an official thing.  Requiring TLS 1.3
> APIs for supporting a library in 2025 (when 18 ships) doesn't seem entirely
> unreasonable so maybe 3.4 is a good cutoff.  The fact that LibreSSL trailed
> behind OpenSSL in adding these APIs shouldn't limit our functionality.

Okay. At minimum I think we'll lose conchuela, plover, and morepork
from the master builds until they are updated. schnauzer is new enough
to keep going.

> Thinking on it a bit I went (to some
> degree inspired by what we did in curl) with ssl_tls13_ciphers which makes the
> name very similar to the tls12 GUC but with the clear distinction of being
> protocol specific.  It also makes the GUC name more readable to place the
> protocol before "ciphers" I think.

Looks fine to me.

> I ended
> up adding a version of SSLerrmessage which takes a replacement string for 
> ecode
> 0 (which admittedly is hardcoded version knowledge as well..).  This can be
> used for scenarios when it's known that OpenSSL sometimes reports and error 
> and
> sometimes not (I'm sure there are quite a few more).

I like this new API! And yeah, I think it'll get more use elsewhere.

My only nitpick for this particular error message is that there's no
longer any breadcrumb back to the setting that's broken:

FATAL:  ECDH: failed to set curve names: No valid groups found
HINT:  Ensure that each group name is spelled correctly and
supported by the installed version of OpenSSL

If I migrate a server to a different machine that doesn't support my
groups, I don't know that this would give me enough information to fix
the configuration.

--

One nice side effect of the new ssl_groups implementation is that we
now support common group aliases. For example, "P-256", "prime256v1",
and "secp256r1" can all be specified now, whereas before ony
"prime256v1" worked because of how we looked up curves. Is that worth
a note in the docs? Even if not, it might be good to keep in mind for
support threads.

Thanks,
--Jacob

Re: pg_parse_json() should not leak token copies on failure

2024-10-02 Thread Andrew Dunstan




On 2024-10-01 Tu 3:04 PM, Jacob Champion wrote:

(Tom, thank you for the fixup in 75240f65!)

Off-list, Andrew suggested an alternative approach to the one I'd been
taking: let the client choose whether it wants to take token ownership
to begin with. v3 adds a new flag (and associated API) which will
allow libpq to refuse ownership of those tokens. The lexer is then
free to clean everything up during parse failures.

Usually I'm not a fan of "do the right thing" flags, but in this case,
libpq really is the outlier. And it's nice that existing clients
(potentially including extensions) don't have to worry about an API
change.



Yeah.

Generally looks good. Should we have a check in 
setJsonLexContextOwnsTokens() that we haven't started parsing yet, for 
the incremental case?






At the moment, we have a test matrix consisting of "standard frontend"
and "shlib frontend" tests for the incremental parser. I'm planning
for the v4 patch to extend that with a "owned/not owned" dimension;
any objections?



Sounds reasonable.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_verifybackup: TAR format backup verification

2024-10-02 Thread Tom Lane

Robert Haas  writes:
> On Tue, Oct 1, 2024 at 1:32 PM Tom Lane  wrote:
>> Yes, mamba thinks this is OK.

> Committed.

Sadly, it seems adder[1] is even pickier than mamba:

../pgsql/src/backend/backup/basebackup_incremental.c: In function 
\342\200\230CreateIncrementalBackupInfo\342\200\231:
../pgsql/src/backend/backup/basebackup_incremental.c:179:30: error: assignment 
to \342\200\230json_manifest_per_file_callback\342\200\231 {aka 
\342\200\230void (*)(JsonManifestParseContext *, const char *, long long 
unsigned int,  pg_checksum_type,  int,  unsigned char *)\342\200\231} from 
incompatible pointer type \342\200\230void (*)(JsonManifestParseContext *, 
const char *, size_t,  pg_checksum_type,  int,  uint8 *)\342\200\231 {aka 
\342\200\230void (*)(JsonManifestParseContext *, const char *, unsigned int,  
pg_checksum_type,  int,  unsigned char *)\342\200\231} 
[-Wincompatible-pointer-types]
  179 | context->per_file_cb = manifest_process_file;
  |  ^


regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-10-02%2014%3A09%3A58

Re: On disable_cost

On Wed, 2024-10-02 at 21:13 +0300, Alena Rybakina wrote:
> >   CREATE TABLE tab_a (id integer);
> > 
> >   CREATE TABLE tab_b (id integer);
> > 
> >   SET enable_nestloop = off;
> >   SET enable_hashjoin = off;
> > 
> >   EXPLAIN SELECT * FROM tab_a JOIN tab_b USING (id);
> > 
> >    QUERY PLAN  
> >   ═
> >    Merge Join  (cost=359.57..860.00 rows=32512 width=4)
> >  Merge Cond: (tab_a.id = tab_b.id)
> >  ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
> >    Sort Key: tab_a.id
> >    ->  Seq Scan on tab_a  (cost=0.00..35.50 rows=2550 width=4)
> >  ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
> >    Sort Key: tab_b.id
> >    ->  Seq Scan on tab_b  (cost=0.00..35.50 rows=2550 width=4)
> > 
> > I would have expected to see "Disabled nodes: 2" with the merge join,
> > because both the nested loop join and the hash join have been disabled.
> > 
> > Why is there no disabled node shown?
> > 
> > 
> > 
> > 
> 
> Disabled nodes show the number of disabled paths, you simply don’t
> have them here in mergejoin, because hashjoin and nestedloop were
> not selected. The reason is the compare_path_costs_fuzzily function,
> because the function decides which path is better based on fewer
> disabled nodes. hashjoin and nestedloop have 1 more nodes compared
> to mergejoin. you can disable mergejoin, I think the output about
> this will appear.

I see; the merge join happened to be the preferred join path, so nothing
had to be excluded.

  /* reset all parameters */

  EXPLAIN (COSTS OFF) SELECT * FROM tab_a JOIN tab_b USING (id);

   QUERY PLAN  
  ═
   Merge Join
 Merge Cond: (tab_a.id = tab_b.id)
 ->  Sort
   Sort Key: tab_a.id
   ->  Seq Scan on tab_a
 ->  Sort
   Sort Key: tab_b.id
   ->  Seq Scan on tab_b

So now if I disable merge joins, I should get a different strategy and see
a disabled node, right?

  SET enable_mergejoin = off;

  EXPLAIN (COSTS OFF) SELECT * FROM tab_a JOIN tab_b USING (id);

   QUERY PLAN 
  
   Hash Join
 Hash Cond: (tab_a.id = tab_b.id)
 ->  Seq Scan on tab_a
 ->  Hash
   ->  Seq Scan on tab_b

No disabled node shown... Ok, I still don't get it.

Yours,
Laurenz Albe

Re: Conflict Detection and Resolution

Hi, here are my code/test review comments for patch v14-0001.

(the v14-0001 docs review was already previously posted)

==
src/backend/commands/subscriptioncmds.c

parse_subscription_conflict_resolvers:

1.
+ /* Check if the conflict type already exists in the list */
+ if (list_member(SeenTypes, makeString(defel->defname)))

This 'SeenTypes' logic of string comparison of list elements all seems
overkill to me.

There is a known number of conflict types, so why not use a more
efficient bool array:
e.g. bool seen_conflict_type[CONFLICT_NUM_TYPES] = {0};

NOTE - I've already made this change in the nits attachment to
demonstrate that it works fine.

~

2.
+ foreach(lc, stmtresolvers)
+ {
+ DefElem*defel = (DefElem *) lfirst(lc);

There are macros to do this, so you don't need to use lfirst(). e.g.
foreach_ptr?

~

3.
nit (a) - remove excessive blank lines
nit (b) - minor comment reword + periods.

~~~

CreateSubscription:

4.
  bits32 supported_opts;
  SubOpts opts = {0};
  AclResult aclresult;
+ ConflictTypeResolver conflictResolvers[CONFLICT_NUM_TYPES];

nit - Suggest rename to 'conflict_resolvers' to keep similar style to
other variables.

~

5.
nit - add a missing period to some comments.

~~~

AlterSubscription:

6.
nit (a) - remove excessing blank lines.
nit (b) - add missing period in comments.
nit (c) - rename 'conflictResolvers' to 'conflict_resolvers' for
consistent variable style

==
src/backend/parser/gram.y

7.
+ | ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR conflict_type
+ {
+ AlterSubscriptionStmt *n =
+ makeNode(AlterSubscriptionStmt);
+
+ n->kind = ALTER_SUBSCRIPTION_RESET_CONFLICT_RESOLVER;
+ n->subname = $3;
+ n->conflict_type = $8;
+ $$ = (Node *) n;
+ }

Why does this only support resetting resolvers for one conflict type at a time?

~

8.
nit - add whitespace blank line

==
src/backend/replication/logical/conflict.c

9.
Add some static assertions for all of the arrays to ensure everything
is declared as expected.

(I've made this change in the nit attachment)

~~~

10.
+#define CONFLICT_TYPE_MAX_RESOLVERS 4

It's a bit fiddly introducing this constant just for the map
dimensions. You might as well use the already defined
CONFLICT_NUM_RESOLVERS.

Yes, I know they are not exactly the same. But, the extra few ints of
space wasted reusing this is insignificant, and there is no
performance loss in the lookup logic (after my other review comment
later). IMO it is easier to use what you already have instead of
introducing yet another hard-to-explain constant.

(I've made this change in the nit attachment)

~~~

11.
+static const int ConflictTypeResolverMap[][CONFLICT_TYPE_MAX_RESOLVERS] = {
+ [CT_INSERT_EXISTS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR},
+ [CT_UPDATE_EXISTS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR},
+ [CT_UPDATE_ORIGIN_DIFFERS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR},
+ [CT_UPDATE_MISSING] = {CR_APPLY_OR_SKIP, CR_APPLY_OR_ERROR, CR_SKIP,
CR_ERROR},
+ [CT_DELETE_MISSING] = {CR_SKIP, CR_ERROR},
+ [CT_DELETE_ORIGIN_DIFFERS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR}
+};

FYI - There is a subtle quirk (and potential bug) lurking in the way
this map is declared. The array elements for resolvers that are not
defined will default to value 0. So "{CR_SKIP, CR_ERROR}" is
essentially saying "{CR_SKIP, CR_ERROR, 0, 0}", and that means that
the enums for resolvers cannot have a value 0 because then this
ConflictTypeResolverMap would be broken. I see that this was already
handled (enums started at 1) but there was no explanatory comment so
it would be easy to unknowingly break it.

A better way (what I am suggesting here) would be to have a -1 marker
to indicate the end of the list of valid resolvers. This removes any
doubt, and it means the enums can start from 0 as normal, and it means
you can quick-exit from the lookup code (suggested later below) for a
performance gain. Basically, win-win-win.

For example:
 * NOTE: -1 is an end marker for the list of valid resolvers for each conflict
 * type.
 */
static const int ConflictTypeResolverMap[][CONFLICT_NUM_RESOLVERS+1] = {
  [CT_INSERT_EXISTS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR, -1},
  [CT_UPDATE_EXISTS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR, -1},
  [CT_UPDATE_ORIGIN_DIFFERS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR, -1},
  [CT_UPDATE_MISSING] = {CR_APPLY_OR_SKIP, CR_APPLY_OR_ERROR, CR_SKIP,
CR_ERROR, -1},
  [CT_DELETE_MISSING] = {CR_SKIP, CR_ERROR, -1},
  [CT_DELETE_ORIGIN_DIFFERS] = {CR_APPLY_REMOTE, CR_KEEP_LOCAL, CR_ERROR, -1}
};

(I have made this change already in the nit attachment)

~~~

12.
+/*
+ * Default conflict resolver for each conflict type.
+ */
+static const int ConflictTypeDefaultResolvers[] = {
+ [CT_INSERT_EXISTS] = CR_ERROR,
+ [CT_UPDATE_EXISTS] = CR_ERROR,
+ [CT_UPDATE_ORIGIN_DIFFERS] = CR_APPLY_REMOTE,
+ [CT_UPDATE_MISSING] = CR_SKIP,
+ [CT_DELETE_MISSING] = CR_SKIP,
+ [CT_DELETE_ORIGIN_DIFFERS] = CR_APPLY_REMOTE
+
+};
+

IMO you can throw all this away. Instead

Re: On disable_cost

On Tue, 1 Oct 2024 at 06:17, Laurenz Albe  wrote:
> Why did you change "Disabled" from an integer to a boolean?

I just don't think "Disabled Nodes" is all that self-documenting and
I'm also unsure why the full integer value of disabled_nodes is
required over just displaying the boolean value of if the node is
disabled or not. Won't readers look at the remainder of the plan to
determine information about which other nodes are disabled? Do we need
to give them a running total?

> If you see a join where two plans were disabled, that's useful information.

I'm not sure if I follow what you mean here.  The patch will show
"Disabled: true" for both the inner and outer side of the join if both
of those are disabled.  The difference is that my patch does not show
the join itself is disabled like master does. I thought that's what
you were complaining about. Can you show an example of what you mean?

David

Re: Return pg_control from pg_backup_stop().

2024-10-02 Thread David Steele


On 10/2/24 10:11, Michael Paquier wrote:

On Fri, May 17, 2024 at 12:46:49PM +1000, David Steele wrote:


The basic idea is to harden recovery by returning a copy of pg_control from
pg_backup_stop() that has a flag set to prevent recovery if the backup_label
file is missing. Instead of backup software copying pg_control from PGDATA,
it stores an updated version that is returned from pg_backup_stop().


Hmm, okay.  There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces.  As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces.  That seems slightly
cleaner to me, still I agree that both are the same things.


Sending pg_control later results in even more code churn because of how 
tar files are terminated. I've updated it that way in v2 so you can see 
what I mean. I don't have a strong preference, though, so if you prefer 
the implementation in v2 then that's fine with me.



Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.


I can definitely see us making other updates to pg_control so I would 
rather keep this logic centralized, even though it is not too 
complicated at this point. Still, even 8 lines of code (as it is now) 
seems better not to duplicate.



* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide a
valid copy via pg_backup_stop(). This solves torn reads without needing to
write pg_control via a temp file, which may affect performance on a standby.


We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues.  So I'm not sure to
see your point here?


Even at 512B it is possible to see tears in pg_control and they happen 
in the build farm right now. In fact, this thread [1] trying to fix the 
problem was what got me thinking about alternate solutions to preventing 
tears in pg_control. Thomas' proposed fixes have not been committed to 
my knowledge so the problem remains, but would be fixed by this commit.



There is a large comment block in do_pg_backup_stop() around
backup_stopped_in_recovery.  Perhaps it should be improved based on
this patch.


I added a sentence to this comment block in v2.


The main concern that I have over this patch is: who is actually going
to use this extension of the SQL stop function?


Primarily existing backup software, I would imagine. The idea is that it 
would give them feature parity with pg_basebackup, rather than the new 
protections being exclusive to pg_basebackup.



Perhaps existing
backup solutions are good enough risk vs reward is not worth it?  


I'm not sure I see the risk here. Saving out pg_control is optional so 
no changes to current software is required. Of course they miss the 
benefit of the protection against tears and missing backup_label, but 
that is a choice.


Also, no matter what current backup solutions do, they cannot prevent a 
user from removing backup_label after restore. This patch prevents 
invalid recovery when that happens.



The
label_file and the tablespace map are text, this is a series of bytes
that has no visibility for the end-user unless checked on the
client-side.  This adds a new step where backups would need to copy
the control file to the data folder.


Again, optional, but if I was able to manage these saves using the psql 
interface in the TAP tests then I'd say it would be pretty easy for 
anyone with a normal connection to Postgres. Also, we require users to 
treat tabelspace_map and backup_label as binary so not too big a change 
here.


[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2Bjig%2BQdBETj_ab%2B%2BVWSoJjbwT3sLNCnk0wFsY_6tRqoA%40mail.gmail.comFrom f9ebd108a9c5dc55fa485dc8c60eb11d2c16715c Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 2 Oct 2024 08:59:12 +
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the follo

Re: On disable_cost



On 02.10.2024 22:08, Laurenz Albe wrote:

On Wed, 2024-10-02 at 21:31 +0300, Alena Rybakina wrote:

Honestly, I like this patch. Before this patch, when disabling any algorithm
in the optimizer, the cost increased significantly and I’m not sure that this
was a reliable solution due to the fact that the cost even without disabling
can be greatly increased because of the high cardinality, for example.

Right there, the mechanism is simple and more honest in my opinion - we simply
count the number of disabled nodes and discard the paths with the largest
number of them.

I have no issue with this way of handling disabled plan nodes, I only
complained about the verbosity of the EXPLAIN output.
I'm willing to agree with you. I think we should display it not all the 
time.

I don't want to see disabled nodes propagated all the way up the tree,
and I would like the output suppressed by default.

I may have misunderstood your message, but disabled nodes number must 
propagate up the tree, otherwise it will be incorrect.


Let consider an example. We disabled seqscan, so the hashjoin containing 
it cannot be equal to 0, since such a path in principle should not be 
generated because a path must be generated that does not contain 
seqscan. It should use indexscan, for example. Therefore the hash join 
path containing indexscan will have fewer disabled nodes and will 
finally be used by the optimizer.


--
Regards,
Alena Rybakina
Postgres Professional

Re: Parallel workers stats in pg_stat_database

2024-10-02 Thread Benoit Lobréau

Hi,

Thanks for your imput ! I will fix the doc as proposed and do the split
as soon as I have time.

On 10/1/24 09:27, Michael Paquier wrote:

I'm less
a fan of the addition for utilities because these are less common
operations.

My thought process was that in order to size max_parallel_workers we
need to
have information on the maintenance parallel worker and "query" parallel
workers.

Actually, could we do better than what's proposed here? How about
presenting an aggregate of this data in pg_stat_statements for each
query instead?

I think both features are useful.

My collegues and I had a discussion about what could be done to improve
parallelism observability in PostgreSQL [0]. We thought about several
places to do it for several use cases.

Guillaume Lelarge worked on pg_stat_statements [1] and
pg_stat_user_[tables|indexes] [2]. I proposed a patch for the logs [3].

As a consultant, I frequently work on installation without
pg_stat_statements and I cannot install it on the client's production
in the timeframe of my intervention.

pg_stat_database is available everywhere and can easily be sampled by
collectors/supervision services (like check_pgactivity).

Lastly the number would be more precise/easier to make sense of, since
pg_stat_statement has a limited size.

[0]
https://www.postgresql.org/message-id/flat/d657df20-c4bf-63f6-e74c-cb85a81d0...@dalibo.com
[1]
https://www.postgresql.org/message-id/CAECtzeWtTGOK0UgKXdDGpfTVSa5bd_VbUt6K6xn8P7X%2B_dZqKw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/CAECtzeXXuMkw-RVGTWvHGOJsmFdsRY%2BjK0ndQa80sw46y2uvVQ%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/8123423a-f041-4f4c-a771-bfd96ab235b0%40dalibo.com

--
Benoit Lobréau
Consultant
http://dalibo.com

Re: On disable_cost

On Thu, 3 Oct 2024 at 08:41, Alena Rybakina  wrote:
> I may have misunderstood your message, but disabled nodes number must 
> propagate up the tree, otherwise it will be incorrect.

I think there are two misunderstandings on this thread:

1) You're misunderstanding what Laurenz is complaining about. He's
only concerned with the EXPLAIN output, not how disasbled_nodes works
internally.
2) Laurenz is misunderstanding what "Disabled Nodes" means. It has
nothing to do with other Paths which were considered and rejected. It
might be better named as "Disabled Degree". It tracks how many plan
nodes below and including this node are disabled.

Because of #2, I think I now understand why Laurenz was interested in
only showing this with VERBOSE. If it worked the way Laurenz thought,
I'd probably agree with him.

Overall, I think we need to do something here. There's no
documentation about what Disabled Nodes means so we either need to
make it easier to understand without documenting it or add something
to the documents about it. If Laurenz, who has a huge amount of
PostgreSQL experience didn't catch it, then what hope is there for the
average user?

David

Re: [BUG FIX]Connection fails with whitespace after keepalives parameter value

2024-10-02 Thread Tom Lane

Fujii Masao  writes:
> On 2024/10/02 11:35, Michael Paquier wrote:
>> On Tue, Oct 01, 2024 at 12:29:15PM -0400, Tom Lane wrote:
>>> I agree with Sasaki-san that useKeepalives seems rather bogus: almost
>>> every other place in fe-connect.c uses pqParseIntParam rather than
>>> calling strtol directly, so why not this one?

> I have no objection to improving the handling of the keepalives parameter.
> OTOH, I think ecpg might have an issue when converting the connection URI.

I went ahead and pushed Sasaki-san's patch.  I think anything we might
do in ecpg is probably just cosmetic and wouldn't get back-patched.

> In the converted URI, whitespace is added before and after the ? character.
> In my quick test, ECPGconnect() seems to handle this without error,
> but when I tried the same URI in psql, it returned an error:
> $ psql "postgresql://localhost:5432/postgres?keepalives=1 & keepalives_idle=1 
> & keepalives_interval=1 & keepalives_count=2"
> psql: error: invalid URI query parameter: " keepalives_idle"

Interesting.  This is unhappy about the space before a parameter name,
not the space after a parameter value, so it's a different issue.
But it's weird that ecpg takes it while libpq doesn't.  Could libecpg
be modifying/reassembling the URI string?  I didn't look.

regards, tom lane

Re: Add support to TLS 1.3 cipher suites and curves lists

2024-10-02 Thread Jacob Champion

On Wed, Oct 2, 2024 at 11:33 AM Daniel Gustafsson  wrote:
> > If I migrate a server to a different machine that doesn't support my
> > groups, I don't know that this would give me enough information to fix
> > the configuration.
>
> Fair point, how about something along the lines of:
>
> +   errmsg("ECDH: failed to set curve names specified in ssl_groups: %s",
> +   SSLerrmessageExt(ERR_get_error(),
> +_("No valid groups found"))),

Yeah, I think that's enough of a pointer. And then maybe "Failed to
set group names specified in ssl_groups: %s" to get rid of the
lingering ECC references?

> > One nice side effect of the new ssl_groups implementation is that we
> > now support common group aliases. For example, "P-256", "prime256v1",
> > and "secp256r1" can all be specified now, whereas before ony
> > "prime256v1" worked because of how we looked up curves. Is that worth
> > a note in the docs?
>
> Maybe. We have this currently in the manual:
>
> "The full list of available curves can be shown with the command
> openssl ecparam -list_curves.  Not all of them are
> usable with TLS though."
>
> Perhaps we can extend that with a short not on aliases?  Got any suggested
> wordings for that if so?

Hm, well, I went down a rabbit hole this afternoon -- OpenSSL has an
open feature request [1] that might eventually document this the right
way. In the meantime, maybe something like...

An incomplete list of available groups can be shown with the
command openssl ecparam -list_curves. Not all of them are usable with
TLS though, and many supported group names and aliases are omitted.

In PostgreSQL versions before 18.0 this setting was named
ssl_ecdh_curve. It only accepted a single value and did not recognize
group aliases at all.

Thanks,
--Jacob

[1] https://github.com/openssl/openssl/issues/17953

Re: Enhance create subscription reference manual

On Wed, Oct 2, 2024 at 6:17 PM Yugo NAGATA wrote:
>
> >
> > I would like to hear opinions from native English speakers.
>
> +1
>

Hi,

You can look up the definitions of the words "option" versus "parameter".

They do have different meanings, like:
- An "option" is a modifier that affects how a command behaves. It is
usually optional (hence the name?) so it typically has a default
value.
- A "parameter" is input required for the command to work. A parameter
possible values can be freely user-specified, whereas option values
are fixed (boolean, enum etc).

Chat-GPT can explain the difference better than me:
--
is postgres subscription failover a parameter or an option?

ChatGPT said:
In PostgreSQL, subscription failover would typically be considered an
option rather than a parameter. Here’s why:

Option: It modifies the behavior of the subscription. Enabling or
configuring failover adjusts how the subscription behaves when a
failover occurs in a high-availability setup. This feature is optional
and alters the default behavior of a subscription.

Parameter: In PostgreSQL, parameters usually refer to specific inputs
or configurations required for a function, command, or feature to
work. For instance, specifying a connection string, user, or host
would be considered parameters.
--

~~~

Unfortunately, it gets a bit muddled in the Postgres docs, because the
CREATE SUBSCRIPTION docs page refers to all these as "subscription
parameters" -- "WITH ( subscription_parameter [= value] [, ... ] ) "
regardless of whether they are options or parameters.

e.g. IMO the "slot_name" really is a parameter, because it can take a
user-specified value.
e.g. IMO "failover" really is an option, even though this page refers
to it in some places as a parameter.

You can see how confused the current docs are because "failover" is
called by both terms even within the same paragraph! [1]
- "failover parameter specified in the subscription"
- "subscription's failover option"

~~~

What to do? Ideally, the docs should have consistent and correct usage
of the words "option" and "parameter" everywhere. But in practice, I
guess most people probably are reading those words as synonyms anyway
so using them wrongly isn't impacting the understanding badly.

Anyway, since you are already fixing something for "failover", then it
would be good to fix the correct term everywhere for that one (e.g.
call it an "option"), or at least call it an option everywhere on the
CREATE SUBSCRIPTION page.

==
[1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-SLOT-NAME

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Enhance create subscription reference manual

2024-10-02 Thread David G. Johnston

On Wednesday, October 2, 2024, Peter Smith  wrote:

>
>
> You can see how confused the current docs are because "failover" is
> called by both terms even within the same paragraph! [1]
> - "failover parameter specified in the subscription"
> - "subscription's failover option"
>
> ~~~
>
> What to do? Ideally, the docs should have consistent and correct usage
> of the words "option" and "parameter" everywhere. But in practice, I
> guess most people probably are reading those words as synonyms anyway
> so using them wrongly isn't impacting the understanding badly.
>
> Anyway, since you are already fixing something for "failover", then it
> would be good to fix the correct term everywhere for that one (e.g.
> call it an "option"), or at least call it an option everywhere on the
> CREATE SUBSCRIPTION page.
>

The distinction between required and optional is not relevant for our
documentation.  The descriptions tell you that.

If you wish to know whether to call something an option or a parameter look
at the syntax placeholder.  In this case, it is named:
 “subscription_parameter” so parameter is the correct term to choose on
this page, for these things.  For explain you call them options because the
placeholder is “option”.

David J.

Re: Pgoutput not capturing the generated columns

Hi Shubham,

The different meanings of the terms "parameter" versus "option" were
discussed in a recent thread [1], and that has made me reconsider this
generated columns feature.

Despite being in the PUBLICATION section "WITH ( publication_parameter
[= value] [, ... ] )", I think that 'publish_generated_columns' is an
"option" (not a parameter).

We should update all those places that are currently calling it a parameter:
- commit messages
- docs
- comments
- etc.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuiRydyrYfMzR1OxOnVJf-_G8OBCLdyqu8jJ8si51d%2BEQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Enhance create subscription reference manual

2024-10-02 Thread David G. Johnston

On Wednesday, October 2, 2024, Peter Smith  wrote:

> `
> - "subscription's failover option"
>

Feels like this one refers to the stored catalog value, which is neither a
parameter nor an option but an attribute.  Though in some cases we are
using “property” as well.  An in this usage option sorta works too but
parameter really doesn’t - a parameter refers to the command syntax parts,
not the resultant table attributes.

David J.

Re: On disable_cost

On Thu, 3 Oct 2024 at 03:04, Robert Haas  wrote:
> I don't think this will produce the right answer in all cases because
> disabled node counts don't propagate across subquery levels.

I see my patch didn't behave correctly when faced with a SubqueryScan
as SubqueryScan does not use the "lefttree" field and has a "subplan"
field instead. The patch will need special handling for that (fixed in
the attached patch).

I can't quite find the area you're talking about where the
disabled_nodes don't propagate through subquery levels. Looking at
cost_subqueryscan(), I see propagation of disabled_nodes. If the
SubqueryScan node isn't present then the propagation just occurs
normally as it does with other path types. e.g. master does:

# set enable_Seqscan=0;
# explain (costs off) select * from (select * from pg_class offset 0)
order by oid;
 QUERY PLAN

 Sort
   Disabled Nodes: 1
   Sort Key: pg_class.oid
   ->  Seq Scan on pg_class
 Disabled Nodes: 1
(5 rows)

Can you provide an example of what you mean?

I've attached an updated PoC patch which I think gets the SubqueryScan
stuff correct. I've not spent time testing everything as if nobody
likes the patch's EXPLAIN output, I don't want to waste time on the
patch for that.

I understand you're keen on keeping the output as it is in master. It
would be good to hear if other people agree with you on this.  I
imagine you'd rather work on other things, but it's easier to discuss
this now than after PG18 is out.

For me, I find master's output overly verbose and not all that easy to
identify the disabled nodes as it requires scanning all the
disabled_node values and finding the nodes where the value is one
higher than the sum of the sum node's disabled_nodes. For example, if
a Nested Loop has "Disabled Nodes: 17" and the inner side of the join
has "Disabled Nodes: 9" and the outer side has "Disabled Nodes: 8",
it's not that easy to determine if the nested loop is disabled or not.
Of course, you only need to do 8+9=17 and see it isn't, but when faced
with run-time pruning done at executor startup, some
Append/MergeAppend nodes might be missing from EXPLAIN and when that
happens, you can't just manually add the Disabled Nodes up. Here's
what I mean:

setup:

create table lp (a int) partition by list(a);
create table lp1 partition of lp for values in(1);
create table lp2 partition of lp for values in(2);
set enable_seqscan=0;
prepare q1(int) as select * from lp where a = $1 order by a;
set plan_cache_mode=force_generic_plan;
explain (analyze, costs off, timing off, summary off) execute q1(1);

master:

 Append (actual rows=0 loops=1)
   Disabled Nodes: 2
   Subplans Removed: 1
   ->  Seq Scan on lp1 lp_1 (actual rows=0 loops=1)
 Disabled Nodes: 1
 Filter: (a = $1)

patched:

 Append (actual rows=0 loops=1)
   Subplans Removed: 1
   ->  Seq Scan on lp1 lp_1 (actual rows=0 loops=1)
 Disabled: true
 Filter: (a = $1)

With master, it looks like Seq Scan and Append are disabled. With the
patched version, you can see it isn't.

David

poc_improve_disabled_nodes_explain_output_v2.patch
Description: Binary data

Re: On disable_cost

2024-10-02 Thread Robert Haas

On Fri, Sep 27, 2024 at 4:42 AM Laurenz Albe  wrote:
> 1. The "disabled nodes" are always displayed.
>I'd be happier if it were only shown for COSTS ON, but I think it
>would be best if they were only shown with VERBOSE ON.
>
>After all, the messages are pretty verbose...

I agree that the messages are more verbose than what we did before
(add a large value to the cost). But I would have thought it wouldn't
matter much because most of the time nothing will be disabled. And I
would think if you get a plan that has some nodes disabled, you would
want to know about that.

I actually thought it was rather nice that this system lets you show
the disabled-nodes information even when COSTS OFF. Regression tests
need to suppress costs because it can vary by platform, but the count
of disabled nodes is stable enough to display.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_verifybackup: TAR format backup verification

2024-10-02 Thread Robert Haas

On Tue, Oct 1, 2024 at 1:32 PM Tom Lane  wrote:
> Yes, mamba thinks this is OK.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: On disable_cost

On Wed, 2024-10-02 at 21:55 +1300, David Rowley wrote:
> On Tue, 1 Oct 2024 at 06:17, Laurenz Albe  wrote:
> > Why did you change "Disabled" from an integer to a boolean?
> 
> I just don't think "Disabled Nodes" is all that self-documenting and
> I'm also unsure why the full integer value of disabled_nodes is
> required over just displaying the boolean value of if the node is
> disabled or not. Won't readers look at the remainder of the plan to
> determine information about which other nodes are disabled? Do we need
> to give them a running total?

I didn't want a running total, but maybe I misunderstood what a disabled
node is; see below.

> > If you see a join where two plans were disabled, that's useful information.
> 
> I'm not sure if I follow what you mean here.  The patch will show
> "Disabled: true" for both the inner and outer side of the join if both
> of those are disabled.  The difference is that my patch does not show
> the join itself is disabled like master does. I thought that's what
> you were complaining about. Can you show an example of what you mean?

I ran the following example, and now I am confused.

  CREATE TABLE tab_a (id integer);

  CREATE TABLE tab_b (id integer);

  SET enable_nestloop = off;
  SET enable_hashjoin = off;

  EXPLAIN SELECT * FROM tab_a JOIN tab_b USING (id);

   QUERY PLAN  
  ═
   Merge Join  (cost=359.57..860.00 rows=32512 width=4)
 Merge Cond: (tab_a.id = tab_b.id)
 ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
   Sort Key: tab_a.id
   ->  Seq Scan on tab_a  (cost=0.00..35.50 rows=2550 width=4)
 ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
   Sort Key: tab_b.id
   ->  Seq Scan on tab_b  (cost=0.00..35.50 rows=2550 width=4)

I would have expected to see "Disabled nodes: 2" with the merge join,
because both the nested loop join and the hash join have been disabled.

Why is there no disabled node shown?

Yours,
Laurenz Albe

Re: On disable_cost


Hi!

On 02.10.2024 21:04, Laurenz Albe wrote:

I didn't want a running total, but maybe I misunderstood what a disabled
node is; see below.


If you see a join where two plans were disabled, that's useful information.

I'm not sure if I follow what you mean here.  The patch will show
"Disabled: true" for both the inner and outer side of the join if both
of those are disabled.  The difference is that my patch does not show
the join itself is disabled like master does. I thought that's what
you were complaining about. Can you show an example of what you mean?

I ran the following example, and now I am confused.

   CREATE TABLE tab_a (id integer);

   CREATE TABLE tab_b (id integer);

   SET enable_nestloop = off;
   SET enable_hashjoin = off;

   EXPLAIN SELECT * FROM tab_a JOIN tab_b USING (id);

QUERY PLAN
   ═
Merge Join  (cost=359.57..860.00 rows=32512 width=4)
  Merge Cond: (tab_a.id = tab_b.id)
  ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
Sort Key: tab_a.id
->  Seq Scan on tab_a  (cost=0.00..35.50 rows=2550 width=4)
  ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
Sort Key: tab_b.id
->  Seq Scan on tab_b  (cost=0.00..35.50 rows=2550 width=4)

I would have expected to see "Disabled nodes: 2" with the merge join,
because both the nested loop join and the hash join have been disabled.

Why is there no disabled node shown?

Disabled nodes show the number of disabled paths, you simply don’t have 
them here in mergejoin, because hashjoin and nestedloop were not 
selected. The reason is the compare_path_costs_fuzzily function, because 
the function decides which path is better based on fewer disabled nodes. 
hashjoin and nestedloop have 1 more nodes compared to mergejoin. you can 
disable mergejoin, I think the output about this will appear.


--
Regards,
Alena Rybakina
Postgres Professional

Re: On disable_cost

On Wed, 2024-10-02 at 10:08 -0400, Robert Haas wrote:
> On Fri, Sep 27, 2024 at 4:42 AM Laurenz Albe  wrote:
> > 1. The "disabled nodes" are always displayed.
> >     I'd be happier if it were only shown for COSTS ON, but I think it
> >     would be best if they were only shown with VERBOSE ON.
> > 
> >     After all, the messages are pretty verbose...
> 
> I agree that the messages are more verbose than what we did before
> (add a large value to the cost). But I would have thought it wouldn't
> matter much because most of the time nothing will be disabled. And I
> would think if you get a plan that has some nodes disabled, you would
> want to know about that.

I'm alright with that, but I certainly don't want to see them propagated
through the tree.  If you have a three page execution plan, and now it
is four pages long because some sequential scan at the lower end was
disabled and I get "Disabled nodes: 1" on every third line, that is
going to make me unhappy.

> I actually thought it was rather nice that this system lets you show
> the disabled-nodes information even when COSTS OFF. Regression tests
> need to suppress costs because it can vary by platform, but the count
> of disabled nodes is stable enough to display.

VERBOSE can be used with COSTS OFF, so that would work nicely if the
disabled nodes were only shown with EXPLAIN (VERBOSE).

I don't think that the feature is bad, I just would prefer it disabled
by default.

Yours,
Laurenz Albe

Re: On disable_cost




you can disable mergejoin, I think the output about this will appear.


I did it and disabled nodes were displayed in the query explain:

alena@postgres=#  CREATE TABLE tab_a (id integer);
alena@postgres=#  CREATE TABLE tab_a (id integer);
alena@postgres=#  CREATE TABLE tab_b (id integer);
alena@postgres=#  CREATE TABLE tab_b (id integer);
alena@postgres=#  SET enable_nestloop = off;
alena@postgres=#  SET enable_nestloop = off;
alena@postgres=#  SET enable_hashjoin = off;
alena@postgres=#  SET enable_mergejoin = off;

alena@postgres=# EXPLAIN SELECT * FROM tab_a JOIN tab_b USING (id);
 QUERY PLAN
-
 Nested Loop  (cost=0.00..97614.88 rows=32512 width=4)
   Disabled Nodes: 1
   Join Filter: (tab_a.id = tab_b.id)
   ->  Seq Scan on tab_a  (cost=0.00..35.50 rows=2550 width=4)
   ->  Materialize  (cost=0.00..48.25 rows=2550 width=4)
 ->  Seq Scan on tab_b  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

The number of disabled nodes
alena@postgres=# set enable_seqscan =off;
SET
alena@postgres=# EXPLAIN SELECT * FROM tab_a JOIN tab_b USING (id);
 QUERY PLAN
-
 Nested Loop  (cost=0.00..97614.88 rows=32512 width=4)
   Disabled Nodes: 3
   Join Filter: (tab_a.id = tab_b.id)
   ->  Seq Scan on tab_a  (cost=0.00..35.50 rows=2550 width=4)
 Disabled Nodes: 1
   ->  Materialize  (cost=0.00..48.25 rows=2550 width=4)
 Disabled Nodes: 1
 ->  Seq Scan on tab_b  (cost=0.00..35.50 rows=2550 width=4)
   Disabled Nodes: 1
(9 rows)

Here is an example, if you also disable seqscan. the number of disabled 
nodes in a join connection is equal to the sum of all disabled subnodes 
and the nestedloop itself (it is also disabled).


Honestly, I like this patch. Before this patch, when disabling any 
algorithm in the optimizer, the cost increased significantly and I’m not 
sure that this was a reliable solution due to the fact that the cost 
even without disabling can be greatly increased because of the high 
cardinality, for example.


Right there, the mechanism is simple and more honest in my opinion - we 
simply count the number of disabled nodes and discard the paths with the 
largest number of them.



--
Regards,
Alena Rybakina
Postgres Professional

Re: BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values

2024-10-02 Thread Takeshi Ideriha

Hi, Amit
(adding hackers for discussion)

>We forgot to set/unset the flag in functions
>systable_beginscan_ordered and systable_endscan_ordered. BTW,

Thank you for the clarification.

>shouldn't this occur even without prepare transaction? If so, we need
>to backpatch this till 14.

Yes, it occurred also at PG14.
I did some tests using 015_stream.pl with some modification like
below, which tests the subscription about stream is on but two-phase
is off.
The same issue occurred at both current head source code and PG14.

```
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -134,9 +134,11 @@ my $node_subscriber =
PostgreSQL::Test::Cluster->new('subscriber');
 $node_subscriber->init;
 $node_subscriber->start;

+my $default = join('', map {chr(65 + rand 26)} (1 .. 1));
+
 # Create some preexisting content on publisher
 $node_publisher->safe_psql('postgres',
-   "CREATE TABLE test_tab (a int primary key, b bytea)");
+   "CREATE TABLE test_tab (a int primary key, b bytea, t text
DEFAULT '$default')");
 $node_publisher->safe_psql('postgres',
"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");

@@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE
TABLE test_tab_2 (a int)");

 # Setup structure on subscriber
 $node_subscriber->safe_psql('postgres',
-   "CREATE TABLE test_tab (a int primary key, b bytea, c
timestamptz DEFAULT now(), d bigint DEFAULT 999)"
+   "CREATE TABLE test_tab (a int primary key, b bytea, t text
DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT
999)"
 );
```

Logs from head source:
```
2024-10-01 12:34:56.694 UTC [575202] LOG:  starting PostgreSQL 18devel
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG:  statement: BEGIN;
2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG:  statement:
INSERT INTO test_tab SELECT i, sha256(i::text::bytea) FROM
generate_series(3, 5000) s(i);
2024-10-01 12:34:57.524 UTC [575242] tap_sub ERROR:  unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.524 UTC [575242] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG:  released logical
replication slot "tap_sub"
2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG:  could not send data
to client: Broken pipe
2024-10-01 12:34:57.525 UTC [575242] tap_sub FATAL:  connection to client lost
...
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG:  starting logical
decoding for slot "tap_sub"
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL:  Streaming
transactions committing after 0/1583A68, reading WAL from 0/1583A30.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG:  logical decoding
found consistent point at 0/1583A30
2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL:  There are no
running transactions.
2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.831 UTC [575260] tap_sub ERROR:  unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 12:34:57.831 UTC [575260] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4',
streaming 'on', origin 'any', publication_names '"tap_pub"')
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG:  released logical
replication slot "tap_sub"
2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG:  could not send data
to client: Broken pipe
2024-10-01 12:34:57.832 UTC [575260] tap_sub FATAL:  connection to client lost
```

Logs from PG14 source:

```
2024-10-01 13:20:29.409 UTC [593696] LOG:  starting PostgreSQL 14.9 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat
11.4.1-2), 64-bit
...
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG:  statement: BEGIN;
2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG:  statement:
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,
5000) s(i);
2024-10-01 13:20:31.301 UTC [593740] tap_sub ERROR:  unexpected
table_index_fetch_tuple call during logical decoding
2024-10-01 13:20:31.301 UTC [593740] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
streaming 'on', publication_names '"tap_pub"')
2024-10-01 13:20:31.302 UTC [593740] tap_sub LOG:  could not send data
to client: Broken pipe
2024-10-01 13:20:31.302 UTC [593740] tap_sub FATAL:  connection to client

Re: Patch: Show queries of processes holding a lock

2024-10-02 Thread Alexey Orlov

On Tue, Oct 1, 2024 at 11:45 AM wenhui qiu  wrote:
>
> Hi Alexey Orlov
>  Thank you for your work on this path,The lock information is recorded in 
> detail,Easy to trace the lock competition at that time there is a detailed 
> lock competition log，But I have a concern,Frequent calls to this function 
> (pgstat_get_backend_current_activity) in heavy lock contention or high 
> concurrency environments may cause performance degradation, especially when 
> processes frequently enter and exit lock waits. Can you add a guc parameter 
> to turn this feature on or off?After all communities for this parameter( 
> log_lock_waits )default values set to on many people concern 
> (https://commitfest.postgresql.org/49/4718/)
>
>
>
> Thanks
>
Yeah, agree, thank you. I just think changing the parameter type would
be nice too.

 typedef enum
{
 LOG_LOCK_WAIT_NONE = 0,
 LOG_LOCK_WAIT_TERSE,
 LOG_LOCK_WAIT_VERBOSE,
} LogLockWaitVerbosity;

LOG_LOCK_WAIT_NONE is "off", LOG_LOCK_WAIT_TERSE is "on",
LOG_LOCK_WAIT_VERBOSE enables  writing the query to to the log.

I've attached a slightly modified patch to use the new log_lock_wait values.

--
Regards,
 Alexey Orlov!
From 0e1835146df63f104f6d4460ffb357a93732eefa Mon Sep 17 00:00:00 2001
From: Alexey Orlov 
Date: Wed, 2 Oct 2024 12:08:13 +0300
Subject: [PATCH v=2] Show queries in log_lock_wait

---
 src/backend/storage/lmgr/proc.c | 22 ++-
 src/backend/utils/misc/guc_tables.c | 33 +
 src/include/storage/proc.h  |  9 +++-
 3 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0d8162a2cc..60b749b140 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -61,7 +61,7 @@ int			LockTimeout = 0;
 int			IdleInTransactionSessionTimeout = 0;
 int			TransactionTimeout = 0;
 int			IdleSessionTimeout = 0;
-bool		log_lock_waits = false;
+int			log_lock_waits = LOG_LOCK_WAIT_NONE;
 
 /* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
@@ -1514,7 +1514,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 		{
 			StringInfoData buf,
 		lock_waiters_sbuf,
-		lock_holders_sbuf;
+		lock_holders_sbuf,
+		lock_queries_sbuf;
 			const char *modename;
 			long		secs;
 			int			usecs;
@@ -1528,6 +1529,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 			initStringInfo(&buf);
 			initStringInfo(&lock_waiters_sbuf);
 			initStringInfo(&lock_holders_sbuf);
+			initStringInfo(&lock_queries_sbuf);
 
 			DescribeLockTag(&buf, &locallock->tag.lock);
 			modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
@@ -1572,6 +1574,15 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 }
 else
 {
+	if (log_lock_waits == LOG_LOCK_WAIT_VERBOSE && myWaitStatus == PROC_WAIT_STATUS_WAITING)
+	{
+		appendStringInfoChar(&lock_queries_sbuf, '\n');
+
+		appendStringInfo(&lock_queries_sbuf,
+		 _("Process %d: %s"),
+		 curproclock->tag.myProc->pid,
+		 pgstat_get_backend_current_activity(curproclock->tag.myProc->pid, false));
+	}
 	if (first_holder)
 	{
 		appendStringInfo(&lock_holders_sbuf, "%d",
@@ -1616,9 +1627,9 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 ereport(LOG,
 		(errmsg("process %d still waiting for %s on %s after %ld.%03d ms",
 MyProcPid, modename, buf.data, msecs, usecs),
-		 (errdetail_log_plural("Process holding the lock: %s. Wait queue: %s.",
-			   "Processes holding the lock: %s. Wait queue: %s.",
-			   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data;
+		 (errdetail_log_plural("Process holding the lock: %s. Wait queue: %s.%s",
+			   "Processes holding the lock: %s. Wait queue: %s.%s",
+			   lockHoldersNum, lock_holders_sbuf.data, lock_waiters_sbuf.data, lock_queries_sbuf.data;
 			else if (myWaitStatus == PROC_WAIT_STATUS_OK)
 ereport(LOG,
 		(errmsg("process %d acquired %s on %s after %ld.%03d ms",
@@ -1654,6 +1665,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 			pfree(buf.data);
 			pfree(lock_holders_sbuf.data);
 			pfree(lock_waiters_sbuf.data);
+			pfree(lock_queries_sbuf.data);
 		}
 	} while (myWaitStatus == PROC_WAIT_STATUS_WAITING);
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58..1eddaaf0e5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -474,6 +474,20 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry log_lock_waits_options[] = {
+	{"verbose", LOG_LOCK_WAIT_VERBOSE, false},
+	{"terse", LOG_LOCK_WAIT_TERSE, false},
+	{"on", LOG_LOCK_WAIT_TERSE, false},
+	{"off", LOG_LOCK_W

Re: Add support to TLS 1.3 cipher suites and curves lists

2024-10-02 Thread Daniel Gustafsson

> On 2 Oct 2024, at 19:16, Jacob Champion  
> wrote:
> 
> On Wed, Sep 25, 2024 at 6:39 AM Daniel Gustafsson  wrote:
>> I can't recall specific bounds for supporting LibreSSL even being discussed,
>> the support is also not documented as an official thing.  Requiring TLS 1.3
>> APIs for supporting a library in 2025 (when 18 ships) doesn't seem entirely
>> unreasonable so maybe 3.4 is a good cutoff.  The fact that LibreSSL trailed
>> behind OpenSSL in adding these APIs shouldn't limit our functionality.
> 
> Okay. At minimum I think we'll lose conchuela, plover, and morepork
> from the master builds until they are updated. schnauzer is new enough
> to keep going.

I will raise it on the thread where bumping to 1.1.1 as the lowest supported
version to make sure it doesn't land as a surprise.

> My only nitpick for this particular error message is that there's no
> longer any breadcrumb back to the setting that's broken:
> 
>FATAL:  ECDH: failed to set curve names: No valid groups found
>HINT:  Ensure that each group name is spelled correctly and
> supported by the installed version of OpenSSL
> 
> If I migrate a server to a different machine that doesn't support my
> groups, I don't know that this would give me enough information to fix
> the configuration.

Fair point, how about something along the lines of:

+   errmsg("ECDH: failed to set curve names specified in ssl_groups: %s",
+   SSLerrmessageExt(ERR_get_error(),
+_("No valid groups found"))),

> One nice side effect of the new ssl_groups implementation is that we
> now support common group aliases. For example, "P-256", "prime256v1",
> and "secp256r1" can all be specified now, whereas before ony
> "prime256v1" worked because of how we looked up curves. Is that worth
> a note in the docs? 

Maybe. We have this currently in the manual:

"The full list of available curves can be shown with the command
openssl ecparam -list_curves.  Not all of them are
usable with TLS though."

Perhaps we can extend that with a short not on aliases?  Got any suggested
wordings for that if so?

--
Daniel Gustafsson

Re: On disable_cost

On Wed, 2024-10-02 at 21:31 +0300, Alena Rybakina wrote:
> Honestly, I like this patch. Before this patch, when disabling any algorithm
> in the optimizer, the cost increased significantly and I’m not sure that this
> was a reliable solution due to the fact that the cost even without disabling
> can be greatly increased because of the high cardinality, for example.
>
> Right there, the mechanism is simple and more honest in my opinion - we simply
> count the number of disabled nodes and discard the paths with the largest
> number of them.

I have no issue with this way of handling disabled plan nodes, I only
complained about the verbosity of the EXPLAIN output.

I don't want to see disabled nodes propagated all the way up the tree,
and I would like the output suppressed by default.

Yours,
Laurenz Albe

Re: On disable_cost


I see; the merge join happened to be the preferred join path, so nothing
had to be excluded.

   /* reset all parameters */

   EXPLAIN (COSTS OFF) SELECT * FROM tab_a JOIN tab_b USING (id);

QUERY PLAN
   ═
Merge Join
  Merge Cond: (tab_a.id = tab_b.id)
  ->  Sort
Sort Key: tab_a.id
->  Seq Scan on tab_a
  ->  Sort
Sort Key: tab_b.id
->  Seq Scan on tab_b

So now if I disable merge joins, I should get a different strategy and see
a disabled node, right?

   SET enable_mergejoin = off;

   EXPLAIN (COSTS OFF) SELECT * FROM tab_a JOIN tab_b USING (id);

QUERY PLAN
   
Hash Join
  Hash Cond: (tab_a.id = tab_b.id)
  ->  Seq Scan on tab_a
  ->  Hash
->  Seq Scan on tab_b

No disabled node shown... Ok, I still don't get it.


No, you don't see it.

you can see that the compare_path_costs_fuzzily function is fundamental 
to determining which path will remain - new path or one of the old paths 
added in the pathlist of relation (see add_path function that calls 
compare_path_costs_fuzzily function).


One of the signs for it is an assessment based on the number of disabled 
paths. This lines from the compare_path_costs_fuzzily function:


/* Number of disabled nodes, if different, trumps all else. */
if (unlikely(path1->disabled_nodes != path2->disabled_nodes))
{
    if (path1->disabled_nodes < path2->disabled_nodes)
    return COSTS_BETTER1;
    else
    return COSTS_BETTER2;

}

Since mergejoin is disabled for optimizer, the number of disabled nodes 
are equal to 1. hashjoin is enabled and the number of its disabled nodes 
are equal to 0. Thus, a hash join will be chosen since the number of 
disabled nodes is less compared to a merge join.


Hashjoin is not disabled, so there are no note in the query plan that it 
is disabled.


  EXPLAIN (COSTS OFF) SELECT * FROM tab_a JOIN tab_b USING (id);

   QUERY PLAN
  
   Hash Join
 Hash Cond: (tab_a.id = tab_b.id)
 ->  Seq Scan on tab_a
 ->  Hash
   ->  Seq Scan on tab_b

--
Regards,
Alena Rybakina
Postgres Professional

Re: bgwrite process is too lazy

2024-10-02 Thread Tomas Vondra

On 10/2/24 17:02, Tony Wayne wrote:
> 
> 
> On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe  > wrote:
> 
> On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > Whenever I check the checkpoint information in a log, most dirty
> pages are written by the checkpoint process
> 
> That's exactly how it should be!
> 
> is it because if bgwriter frequently flushes, the disk io will be more?🤔 

Yes, pretty much. But it's also about where the writes happen.

Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.

Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).

But it's also happens about where the writes happen. Checkpoint does
that in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can cause result in much higher latency.

The bgwriter is somewhere in between - it's happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.

regards

-- 
Tomas Vondra

Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-10-02 Thread Masahiko Sawada

On Tue, Oct 1, 2024 at 8:57 PM Tom Lane  wrote:
>
> Noah Misch  writes:
> > On Tue, Oct 01, 2024 at 11:55:48AM -0700, Masahiko Sawada wrote:
> >> Considering that the population of database cluster signedness will
> >> converge to signedness=true in the future, we can consider using
> >> -fsigned-char to prevent similar problems for the future. We need to
> >> think about possible side-effects as well, though.
>
> > It's good to think about -fsigned-char.  While I find it tempting, several
> > things would need to hold for us to benefit from it:
>
> > - Every supported compiler has to offer it or an equivalent.
> > - The non-compiler parts of every supported C implementation need to
> >   cooperate.  For example, CHAR_MIN must change in response to the flag.  
> > See
> >   the first comment in cash_in().
> > - Libraries we depend on can't do anything incompatible with it.
>
> > Given that, I would lean toward not using -fsigned-char.  It's unlikely all
> > three things will hold.  Even if they do, the benefit is not large.
>
> I am very, very strongly against deciding that Postgres will only
> support one setting of char signedness.  It's a step on the way to
> hardware monoculture, and we know where that eventually leads.
> (In other words, I categorically reject Sawada-san's assertion
> that signed chars will become universal.  I'd reject the opposite
> assertion as well.)

Thank you for pointing this out. I agree with both of you.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Rename PageData to XLogPageData

2024-10-02 Thread Heikki Linnakangas


On 02/10/2024 14:30, Peter Eisentraut wrote:
I was fiddling a bit with making some Page-related APIs const-proof, 
which might involve changing something like "Page p" to "const PageData 
*p", but I was surprised that a type PageData exists but it's an 
unrelated type local to generic_xlog.c.


Good find

This patch renames that type to a more specific name XLogPageData.  This 
makes room for possibly adding another PageData type with the earlier 
meaning, but that's not done here.  But I think even without that, this 
patch is a useful little cleanup that makes the code more consistent and 
clear.


+1 for renaming, but -1 on XLogPageData. That sounds like a WAL page, 
see XLogPageHeaderData for example. I'd suggest GenericXLogPageData or 
just GenericPerPageData or something.


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Pgoutput not capturing the generated columns

On Thu, Oct 3, 2024 at 10:09 AM Peter Smith  wrote:
>
> Hi Shubham,
>
> The different meanings of the terms "parameter" versus "option" were
> discussed in a recent thread [1], and that has made me reconsider this
> generated columns feature.
>
> Despite being in the PUBLICATION section "WITH ( publication_parameter
> [= value] [, ... ] )", I think that 'publish_generated_columns' is an
> "option" (not a parameter).
>
> We should update all those places that are currently calling it a parameter:
> - commit messages
> - docs
> - comments
> - etc.
>
> ==
> [1] 
> https://www.postgresql.org/message-id/CAHut%2BPuiRydyrYfMzR1OxOnVJf-_G8OBCLdyqu8jJ8si51d%2BEQ%40mail.gmail.com
>

It seems there are differing opinions on that other thread about what
term to use. Probably, it is best to just leave the above suggestion
alone for now.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Patch: Show queries of processes holding a lock

On Tue, 1 Oct 2024 at 21:04, Alexey Orlov  wrote:
> session 1:
> CREATE TABLE foo (val integer);
> INSERT INTO foo (val) VALUES (1);
> BEGIN;
> UPDATE foo SET val = 3;
>
> session 2:
> BEGIN;
> UPDATE TABLE foo SET val = 2;
>
> LOG: process 3133043 still waiting for ShareLock on transaction 758
> after 1000.239 ms
> DETAIL: Process holding the lock: 3132855. Wait queue: 3133043.
> Process 3132855: update foo SET val = 3;
> CONTEXT: while updating tuple (0,7) in relation "foo"
> STATEMENT: update foo SET val = 2;

> What do you think?

Can you explain why the last query executed by the blocking process is
relevant output to show?  Are you just hoping that the last statement
to execute is the one that obtained the lock? Wouldn't it be confusing
if the last query to execute wasn't the query which obtained the lock?

David

Re: pg_verifybackup: TAR format backup verification

2024-10-02 Thread Tom Lane

I wrote:
> Sadly, it seems adder[1] is even pickier than mamba:

Nope, it was my testing error: I supposed that this patch only
affected pg_combinebackup and pg_verifybackup, so I only
recompiled those modules not the whole tree.  But there's one
more place with a json_manifest_per_file_callback callback :-(.

I pushed a quick-hack fix.

regards, tom lane

Re: [BUG FIX]Connection fails with whitespace after keepalives parameter value

On Wed, Oct 02, 2024 at 05:39:31PM -0400, Tom Lane wrote:
> Interesting.  This is unhappy about the space before a parameter name,
> not the space after a parameter value, so it's a different issue.

conninfo_uri_parse_options() parses the URI as a set of option/values,
where conninfo_uri_parse_params.  If we were to be careful about
trailing and leading whitespaces for the parameter names, we need to
be careful about the special JDBC cases for "ssl" and "requiressl",
meaning that we should add more logic in conninfo_uri_decode() to
discard these.  That would apply a extra layer of sanity into the
values as well.

> But it's weird that ecpg takes it while libpq doesn't.  Could libecpg
> be modifying/reassembling the URI string?  I didn't look.

ECPGconnect() has some custom logic to discard trailing and leading
spaces: 
/* Skip spaces before keyword */
for (token1 = str; *token1 == ' '; token1++)
[...]
token1[e] = '\0'; //skips trailing spaces.

The argument for libpq where we could be consistent is appealing.  How
about lifting things in libpq like the attached?  I wouldn't backpatch
that, but we have tests for URIs and I didn't break anything.
--
Michael
From dafde0e2d5382571bfb91339e3f9e9a5530738b0 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 3 Oct 2024 09:34:01 +0900
Subject: [PATCH] libpq: Count for leading and trailing whitespaces in URIs

---
 src/interfaces/libpq/fe-connect.c | 42 ++-
 src/interfaces/libpq/t/001_uri.pl | 11 
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 64787bea51..2d09ea5afd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -6763,9 +6763,9 @@ conninfo_uri_parse_params(char *params,
 static char *
 conninfo_uri_decode(const char *str, PQExpBuffer errorMessage)
 {
-	char	   *buf;
-	char	   *p;
-	const char *q = str;
+	char	   *buf;	/* result */
+	char	   *p;		/* output location */
+	const char *q = str;	/* input location */
 
 	buf = malloc(strlen(str) + 1);
 	if (buf == NULL)
@@ -6775,13 +6775,23 @@ conninfo_uri_decode(const char *str, PQExpBuffer errorMessage)
 	}
 	p = buf;
 
+	/* skip leading whitespaces */
+	for (const char *s = q; *s == ' '; s++)
+	{
+		q++;
+		continue;
+	}
+
 	for (;;)
 	{
 		if (*q != '%')
 		{
-			/* copy and check for NUL terminator */
-			if (!(*(p++) = *(q++)))
-break;
+			/* if found a whitespace or NUL, the string ends */
+			if (*q == ' ' || *q == '\0')
+goto end;
+
+			/* copy character */
+			*(p++) = *(q++);
 		}
 		else
 		{
@@ -6817,6 +6827,26 @@ conninfo_uri_decode(const char *str, PQExpBuffer errorMessage)
 		}
 	}
 
+end:
+
+	/* skip trailing whitespaces */
+	for (const char *s = q; *s == ' '; s++)
+	{
+		q++;
+		continue;
+	}
+
+	/* Not at the end of the string yet?  Fail. */
+	if (*q != '\0')
+	{
+		libpq_append_error(errorMessage, "trailing data found: \"%s\"", str);
+		free(buf);
+		return NULL;
+	}
+
+	/* Copy NUL terminator */
+	*p = '\0';
+
 	return buf;
 }
 
diff --git a/src/interfaces/libpq/t/001_uri.pl b/src/interfaces/libpq/t/001_uri.pl
index 49ea5377e0..a872e973e7 100644
--- a/src/interfaces/libpq/t/001_uri.pl
+++ b/src/interfaces/libpq/t/001_uri.pl
@@ -86,6 +86,17 @@ my @tests = (
 		q{user='uri-user' host='host' (inet)},
 		q{},
 	],
+	[
+		# Leading and trailing spaces, works
+		q{postgresql://host? user  = uri-user & port  = 12345 },
+		q{user='uri-user' host='host' port='12345' (inet)},
+		q{},
+	],
+	[
+		q{postgresql://host?  host  =  uri-user & port = 12345 12 },
+		q{},
+		q{libpq_uri_regress: trailing data found: " 12345 12 "},
+	],
 	[ q{postgresql://host?}, q{host='host' (inet)}, q{}, ],
 	[
 		q{postgresql://[::1]:12345/db},
-- 
2.45.2



signature.asc
Description: PGP signature

Re: Enhance create subscription reference manual

On Thu, Oct 3, 2024 at 10:01 AM David G. Johnston
 wrote:
>
> On Wednesday, October 2, 2024, Peter Smith  wrote:
>>
>>
>>
>> You can see how confused the current docs are because "failover" is
>> called by both terms even within the same paragraph! [1]
>> - "failover parameter specified in the subscription"
>> - "subscription's failover option"
>>
>> ~~~
>>
>> What to do? Ideally, the docs should have consistent and correct usage
>> of the words "option" and "parameter" everywhere. But in practice, I
>> guess most people probably are reading those words as synonyms anyway
>> so using them wrongly isn't impacting the understanding badly.
>>
>> Anyway, since you are already fixing something for "failover", then it
>> would be good to fix the correct term everywhere for that one (e.g.
>> call it an "option"), or at least call it an option everywhere on the
>> CREATE SUBSCRIPTION page.
>
>
> The distinction between required and optional is not relevant for our 
> documentation.  The descriptions tell you that.
>
> If you wish to know whether to call something an option or a parameter look 
> at the syntax placeholder.  In this case, it is named:  
> “subscription_parameter” so parameter is the correct term to choose on this 
> page, for these things.  For explain you call them options because the 
> placeholder is “option”.
>

OK. If that is the "rule" that the documentation uses then it is fine.
The same term can be consistently used everywhere the 'thing' is
referred to.

~

IIUC you were referring to the EXPLAIN docs page [1] as an "option" example:
--
EXPLAIN [ ( option [, ...] ) ] statement
where option can be one of:
--

but that page also seems to have a muddle of different terms used to
describe the same "option".

==
[1] https://www.postgresql.org/docs/devel/sql-explain.html

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Partitioned tables and [un]loggedness

On Thu, Sep 19, 2024 at 01:08:33PM +0900, Michael Paquier wrote:
> I have applied 0001 for now to add ATT_PARTITIONED_TABLE.  Attached is
> the remaining piece.

And the second piece is now applied as of e2bab2d79204.
--
Michael


signature.asc
Description: PGP signature

Re: bgwrite process is too lazy

On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> Whenever I check the checkpoint information in a log, most dirty pages are 
> written by the checkpoint process

That's exactly how it should be!

Yours,
Laurenz Albe

Re: general purpose array_sort

2024-10-02 Thread Junwang Zhao

On Wed, Oct 2, 2024 at 9:51 AM jian he  wrote:
>
> > >
> > > + typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
> > > + if (typentry == NULL || typentry->type_id != elmtyp)
> > > + {
> > > + typentry = lookup_type_cache(elmtyp, sort_asc ? TYPECACHE_LT_OPR :
> > > TYPECACHE_GT_OPR);
> > > + fcinfo->flinfo->fn_extra = (void *) typentry;
> > > + }
> > > you need to one-time check typentry->lt_opr or typentry->gt_opr exists?
> > > see CreateStatistics.
> > > /* Disallow data types without a less-than operator */
> > > type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
> > > if (type->lt_opr == InvalidOid)
> > > ereport(ERROR,
> > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > >  errmsg("column \"%s\" cannot be used in
> > > statistics because its type %s has no default btree operator class",
> > > attname, 
> > > format_type_be(attForm->atttypid;
> >
> > I added an Assert for this part, not sure if that is enough.
> >
>
> i think it really should be:
>
> if (typentry == NULL || typentry->type_id != elmtyp)
> {
>  typentry = lookup_type_cache(elmtyp, sort_asc ? TYPECACHE_LT_OPR :
> TYPECACHE_GT_OPR);
>  fcinfo->flinfo->fn_extra = (void *) typentry;
> if ((sort_asc && !OidIsValid(typentry->lt_opr) || (!sort_as &&
> OidIsValid(typentry->gt_opr));
> ereport(ERROR,)
> }
>
> Imagine a type that doesn't have TYPECACHE_LT_OPR or TYPECACHE_GT_OPR
> then we cannot do the sort, we should just error out.
>
> I just tried this colour type [1] with (CREATE TYPE colour (INPUT =
> colour_in, OUTPUT = colour_out, LIKE = pg_catalog.int4);
>
> select array_sort('{#FF, #FF}'::colour[]);
> of course it will segfault  with your new Assert.
>
>
> [1] https://github.com/hlinnaka/colour-datatype/blob/master/colour.c

Make sense, PFA v5 with Jian's suggestion.


-- 
Regards
Junwang Zhao


v5-0001-general-purpose-array_sort.patch
Description: Binary data

Re: bgwrite process is too lazy

2024-10-02 Thread Tony Wayne

On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe 
wrote:

> On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > Whenever I check the checkpoint information in a log, most dirty pages
> are written by the checkpoint process
>
> That's exactly how it should be!
>
> is it because if bgwriter frequently flushes, the disk io will be more?🤔

Re: On disable_cost

2024-10-02 Thread Robert Haas

On Wed, Oct 2, 2024 at 4:55 AM David Rowley  wrote:
> On Tue, 1 Oct 2024 at 06:17, Laurenz Albe  wrote:
> > Why did you change "Disabled" from an integer to a boolean?
>
> I just don't think "Disabled Nodes" is all that self-documenting and
> I'm also unsure why the full integer value of disabled_nodes is
> required over just displaying the boolean value of if the node is
> disabled or not. Won't readers look at the remainder of the plan to
> determine information about which other nodes are disabled? Do we need
> to give them a running total?

I don't think this will produce the right answer in all cases because
disabled node counts don't propagate across subquery levels.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: not null constraints, again

2024-10-02 Thread Alvaro Herrera

On 2024-Oct-02, Alvaro Herrera wrote:

> On 2024-Oct-02, jian he wrote:
> 
> > On Tue, Oct 1, 2024 at 11:20 PM Alvaro Herrera  
> > wrote:

> > after v7, still not bullet-proof. as before, pg_dump/restore will fail
> > for the following:
> > 
> > drop table if exists t2, t2_0
> > create table t2 (a int, b int, c int, constraint foo primary key(a),
> > constraint foo1 not null a no inherit);
> > create table t2_0 (a int constraint foo1 not null no inherit, b int, c
> > int, constraint foo12 primary key(a));
> 
> Rats.  Fixing :-)

Hmm, I thought this was going to be a five-minute job: I figured I could
add a check in DefineIndex() that reads all columns and ensure they're
no-inherit.  First complication: when creating a partition, we do
DefineIndex to create the indexes that the parent table has, before we
do AddRelationNotNullConstraints(), so the not-null constraint lookup
fails.  Easy enough to fix: just move the AddRelationNotNullConstraints
call a few lines up.  However, things are still not OK because ALTER
TABLE ALTER COLUMN TYPE does want to recreate the PK before the
not-nulls (per ATPostAlterTypeParse), because AT_PASS_OLD_INDEX comes
before AT_PASS_OLD_CONSTR ...  and obviously we cannot change that.

Another possibility is to add something like AT_PASS_OLD_NOTNULL but
that sounds far too ad-hoc.

Maybe I need the restriction to appear somewhere else rather than on
DefineIndex.

Still looking ...

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)

Re: bgwrite process is too lazy

2024-10-02 Thread wenhui qiu

Hi Tomas
 Thank you for explaining,If  do not change this static parameter,Can
 refer to the parameter autovacuum_vacuum_cost_delay and lower the minimum
value of the bgwrite_delay parameter to 2ms? Let bgwrite write more dirty
pages to reduce the impact on performance when checkpoint occurs?After all,
extending the checkpoint time and crash recovery time needs to find a
balance, and cannot be increased indefinitely ( checkpoint_timeout
,max_wal_size)



Thanks

Tomas Vondra  于2024年10月3日周四 00:36写道：

> On 10/2/24 17:02, Tony Wayne wrote:
> >
> >
> > On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe  > > wrote:
> >
> > On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > > Whenever I check the checkpoint information in a log, most dirty
> > pages are written by the checkpoint process
> >
> > That's exactly how it should be!
> >
> > is it because if bgwriter frequently flushes, the disk io will be
> more?🤔
>
> Yes, pretty much. But it's also about where the writes happen.
>
> Checkpoint flushes dirty buffers only once per checkpoint interval,
> which is the lowest amount of write I/O that needs to happen.
>
> Every other way of flushing buffers is less efficient, and is mostly a
> sign of memory pressure (shared buffers not large enough for active part
> of the data).
>
> But it's also happens about where the writes happen. Checkpoint does
> that in the background, not as part of regular query execution. What we
> don't want is for the user backends to flush buffers, because it's
> expensive and can cause result in much higher latency.
>
> The bgwriter is somewhere in between - it's happens in the background,
> but may not be as efficient as doing it in the checkpointer. Still much
> better than having to do this in regular backends.
>
>
> regards
>
> --
> Tomas Vondra
>
>
>

Re: Enhance create subscription reference manual

2024-10-02 Thread Yugo Nagata

On Wed, 2 Oct 2024 17:09:42 -0700
"David G. Johnston"  wrote:

> On Wednesday, October 2, 2024, Peter Smith  wrote:
> 
> > `
> > - "subscription's failover option"
> >
> 
> Feels like this one refers to the stored catalog value, which is neither a
> parameter nor an option but an attribute.  Though in some cases we are
> using “property” as well.  An in this usage option sorta works too but
> parameter really doesn’t - a parameter refers to the command syntax parts,
> not the resultant table attributes.

Thank you for your explanation.

If I understand correctly, whether something is an option of a parameter
should be determined by the syntax placeholder, so "failover" is a
parameter in this case (it is an "optional" parameter, though). However,
when we refer to the stored catalog value, we should call it an option or
a property and calling it parameter is not suitable.

If so, I feel that "the failover" in the following statement means
the catalog value (or the failover feature itself), so we should not
rewrite this to "the failover parameter".

> To initiate replication, you must manually create the replication slot, 
> enable the failover if required, enable the subscription, and refresh the
> subscription.

Instead, should we use "failover option"? Or, if it would mean to the failover
feature rather than the parameter, is it not proper to add  tag to this
"failover"?

Regards,
Yugo Nagata

-- 
Yugo Nagata

Re: Address the -Wuse-after-free warning in ATExecAttachPartition()

2024-10-02 Thread Amit Langote

On Tue, Oct 1, 2024 at 3:23 PM Amit Langote  wrote:
> Hi,
>
> On Mon, Jul 8, 2024 at 7:08 PM Junwang Zhao  wrote:
> > On Mon, Jul 8, 2024 at 3:22 PM Nitin Jadhav
> >  wrote:
> > >
> > > In [1], Andres reported a -Wuse-after-free bug in the
> > > ATExecAttachPartition() function.  I've created a patch to address it
> > > with pointers from Amit offlist.
> > >
> > > The issue was that the partBoundConstraint variable was utilized after
> > > the list_concat() function. This could potentially lead to accessing
> > > the partBoundConstraint variable after its memory has been freed.
> > >
> > > The issue was resolved by using the return value of the list_concat()
> > > function, instead of using the list1 argument of list_concat(). I
> > > copied the partBoundConstraint variable to a new variable named
> > > partConstraint and used it for the previous references before invoking
> > > get_proposed_default_constraint(). I confirmed that the
> > > eval_const_expressions(), make_ands_explicit(),
> > > map_partition_varattnos(), QueuePartitionConstraintValidation()
> > > functions do not modify the memory location pointed to by the
> > > partBoundConstraint variable. Therefore, it is safe to use it for the
> > > next reference in get_proposed_default_constraint()
> > >
> > > Attaching the patch. Please review and share the comments if any.
> > > Thanks to Andres for spotting the bug and some off-list advice on how
> > > to reproduce it.
> >
> > The patch LGTM.
>
> I reviewed the faulty code in ATExecAttachPartition() that this patch
> aims to address, and it seems that the fix suggested by Alvaro in the
> original report [1] might be more appropriate. It also allows us to
> resolve a minor issue related to how partBoundConstraint is handled
> later in the function. Specifically, the constraint checked against
> the default partition, if one exists on the parent table, should not
> include the negated version of the parent's constraint, which the
> current code mistakenly does. This occurs because the list_concat()
> operation modifies the partBoundConstraint when combining it with the
> parent table’s partition constraint. Although this doesn't cause a
> live bug, the inclusion of the redundant parent constraint introduces
> unnecessary complexity in the combined constraint expression. This can
> cause slight delays in evaluating the constraint, as the redundant
> check always evaluates to false for rows in the default partition.
> Ultimately, the constraint failure is still determined by the negated
> version of the partition constraint of the table being attached.
>
> I'm inclined to apply the attached patch only to the master branch as
> the case for applying this to back-branches doesn't seem all that
> strong to me.  Thoughts?

I've pushed this to the master branch now.

Thanks Nitin for the report.

-- 
Thanks, Amit Langote

Re: Parallel workers stats in pg_stat_database

On Wed, Oct 02, 2024 at 11:12:37AM +0200, Benoit Lobréau wrote:
> My collegues and I had a discussion about what could be done to improve
> parallelism observability in PostgreSQL [0]. We thought about several
> places to do it for several use cases.
> 
> Guillaume Lelarge worked on pg_stat_statements [1].

Thanks, missed that.  I will post a reply there.  There is a good
overlap with everything you are doing here, because each one of you
wishes to track more data to the executor state and push it to
different part of the system, system view or just an extension.

Tracking the number of workers launched and planned in the executor
state is the strict minimum for a lot of these things, as far as I can
see.  Once the nodes are able to push this data, then extensions can
feed on it the way they want.  So that's a good idea on its own, and
covers two of the counters posted here:
https://www.postgresql.org/message-id/CAECtzeWtTGOK0UgKXdDGpfTVSa5bd_VbUt6K6xn8P7X%2B_dZqKw%40mail.gmail.com

Could you split the patch based on that?  I'd recommend to move
es_workers_launched and es_workers_planned closer to the top, say
es_total_processed, and document what these counters are here for.

After that comes the problem of where to push this data..

> Lastly the number would be more precise/easier to make sense of, since
> pg_stat_statement has a limited size.

Upper bound that can be configured.

When looking for query-level patterns or specific SET tuning, using
query-level data speaks more than this data pushed at database level.
TBH, I am +-0 about pushing this data to pg_stat_database so as we
would be able to tune database-level GUCs.  That does not help with
SET commands tweaking the number of workers to use.  Well, perhaps few
rely on SET and most rely on the system-level GUCs in their
applications, meaning that I'm wrong, making your point about
publishing this data at database-level better, but I'm not really
sure.  If others have an opinion, feel free.

Anyway, what I am sure of is that publishing the same set of data
everywhere leads to bloat, and I'd rather avoid that.  Aggregating
that from the queries also to get an impression of the whole database
offers an equivalent of what would be stored in pg_stat_database
assuming that the load is steady.  Your point about pg_stat_statements
not being set is also true, even if some cloud vendors enable it by
default.

Table/index-level data can be really interesting because we can
cross-check what's happening for more complex queries if there are
many gather nodes with complex JOINs.

Utilities (vacuum, btree, brin) are straight-forward and best at query
level, making pg_stat_statements their best match.  And there is no
need for four counters if pushed at this level while two are able to
do the job as utility and non-utility statements are separated
depending on their PlannedStmt leading to separate entries in PGSS.
--
Michael

signature.asc
Description: PGP signature

Re: general purpose array_sort

2024-10-02 Thread Amit Langote

Hi Junwang,

On Wed, Oct 2, 2024 at 11:46 PM Junwang Zhao  wrote:
> On Wed, Oct 2, 2024 at 9:51 AM jian he  wrote:
> >
> > > >
> > > > + typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
> > > > + if (typentry == NULL || typentry->type_id != elmtyp)
> > > > + {
> > > > + typentry = lookup_type_cache(elmtyp, sort_asc ? TYPECACHE_LT_OPR :
> > > > TYPECACHE_GT_OPR);
> > > > + fcinfo->flinfo->fn_extra = (void *) typentry;
> > > > + }
> > > > you need to one-time check typentry->lt_opr or typentry->gt_opr exists?
> > > > see CreateStatistics.
> > > > /* Disallow data types without a less-than operator */
> > > > type = lookup_type_cache(attForm->atttypid, 
> > > > TYPECACHE_LT_OPR);
> > > > if (type->lt_opr == InvalidOid)
> > > > ereport(ERROR,
> > > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > > >  errmsg("column \"%s\" cannot be used in
> > > > statistics because its type %s has no default btree operator class",
> > > > attname, 
> > > > format_type_be(attForm->atttypid;
> > >
> > > I added an Assert for this part, not sure if that is enough.
> > >
> >
> > i think it really should be:
> >
> > if (typentry == NULL || typentry->type_id != elmtyp)
> > {
> >  typentry = lookup_type_cache(elmtyp, sort_asc ? TYPECACHE_LT_OPR :
> > TYPECACHE_GT_OPR);
> >  fcinfo->flinfo->fn_extra = (void *) typentry;
> > if ((sort_asc && !OidIsValid(typentry->lt_opr) || (!sort_as &&
> > OidIsValid(typentry->gt_opr));
> > ereport(ERROR,)
> > }
> >
> > Imagine a type that doesn't have TYPECACHE_LT_OPR or TYPECACHE_GT_OPR
> > then we cannot do the sort, we should just error out.
> >
> > I just tried this colour type [1] with (CREATE TYPE colour (INPUT =
> > colour_in, OUTPUT = colour_out, LIKE = pg_catalog.int4);
> >
> > select array_sort('{#FF, #FF}'::colour[]);
> > of course it will segfault  with your new Assert.
> >
> >
> > [1] https://github.com/hlinnaka/colour-datatype/blob/master/colour.c
>
> Make sense, PFA v5 with Jian's suggestion.

Have you noticed that the tests have failed on Cirrus CI runs of this patch?

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5277

It might be related to the test machines having a different *default*
locale than your local environment, which could result in a different
sort order for the test data. You may need to add an explicit COLLATE
clause to the tests to ensure consistent sorting across systems.

-- 
Thanks, Amit Langote

Re: long-standing data loss bug in initial sync of logical replication

2024-10-02 Thread Shlok Kyal

> > I have addressed the comment for 0002 patch and attached the patches.
> > Also, I have moved the tests in the 0002 to 0001 patch.
>
> Thanks for updating the patch. 0002 patch seems to remove cache invalidations
> from publication_invalidation_cb(). Related with it, I found an issue and had 
> a concern.
>
> 1.
> The replication continues even after ALTER PUBLICATION RENAME is executed.
> For example - assuming that a subscriber subscribes only "pub":
>
> ```
> pub=# INSERT INTO tab values (1);
> INSERT 0 1
> pub=# ALTER PUBLICATION pub RENAME TO pub1;
> ALTER PUBLICATION
> pub=# INSERT INTO tab values (2);
> INSERT 0 1
>
> sub=# SELECT * FROM tab ; -- (2) should not be replicated however...
>  a
> ---
>  1
>  2
> (2 rows)
> ```
>
> This happens because 1) ALTER PUBLICATION RENAME statement won't be 
> invalidate the
> relation cache, and 2) publications are reloaded only when invalid 
> RelationSyncEntry
> is found. In given example, the first INSERT creates the valid cache and 
> second
> INSERT reuses it. Therefore, the pubname-check is skipped.
>
> For now, the actual renaming is done at AlterObjectRename_internal(), a 
> generic
> function. I think we must implement a dedecated function to publication and 
> must
> invalidate relcaches there.
>
> 2.
> Similarly with above, the relcache won't be invalidated when ALTER PUBLICATION
> OWNER TO is executed. This means that privilage checks may be ignored if the 
> entry
> is still valid. Not sure, but is there a possibility this causes an 
> inconsistency?
>

Hi Kuroda-san,

Thanks for testing the patch. I have fixed the comments and attached
the updated patch.
I have added a callback function rel_sync_cache_publicationrel_cb().
This callback invalidates the cache of tables in a particular
publication.
This callback is called when there is some modification in
pg_publication catalog.

I have tested the two cases  'ALTER PUBLICATION ... RENAME TO ...' and
 'ALTER PUBLICATION ... OWNER TO ...'  and debugged it. The newly
added callback is called and it invalidates the cache of tables
present in that particular publication.
I have also added a test related to 'ALTER PUBLICATION ... RENAME TO
...' to 0001 patch.

Thanks and Regards,
Shlok Kyal


v11-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch
Description: Binary data


v11-0002-Selective-Invalidation-of-Cache.patch
Description: Binary data

Re: Return pg_control from pg_backup_stop().

On Wed, Oct 02, 2024 at 09:03:27AM +, David Steele wrote:
> On 10/2/24 10:11, Michael Paquier wrote:
>> Hmm, okay.  There is also a slight impact for BASE_BACKUP, requiring
>> basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
>> called earlier when sending the main data directory which is the last
>> one in the list of tablespaces.  As far as I can see, this does not
>> change the logic because do_pg_backup_stop() does not touch the
>> control file, but it seems to me that we'd rather keep these two
>> calls as they are now, and send the control file once we are out of
>> the for loop that processes all the tablespaces.  That seems slightly
>> cleaner to me, still I agree that both are the same things.
> 
> Sending pg_control later results in even more code churn because of how tar
> files are terminated. I've updated it that way in v2 so you can see what I
> mean. I don't have a strong preference, though, so if you prefer the
> implementation in v2 then that's fine with me.

It does not make much of a difference, indeed.

>> Anyway, finishing do_pg_backup_stop() and then sending the control
>> file is just a consequence of the implementation choice driven by the
>> output required for the SQL function so as this is stored in the
>> backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
>> could also take one step back and forget about the SQL function,
>> setting only the new flag when sending a BASE_BACKUP, or just not use
>> the backupState to store this data as the exercise involves forcing
>> one boolean and recalculate a CRC32.
> 
> I can definitely see us making other updates to pg_control so I would rather
> keep this logic centralized, even though it is not too complicated at this
> point. Still, even 8 lines of code (as it is now) seems better not to
> duplicate.

I was wondering if the field update should be hidden behind a macro
that uses an offsetof() on ControlFileData, with the name of the field
and a pointer to the value to update to.  If you include the CRC32
calculation in that, that makes for less chunks of code when updating
one field of the control file.

The full CRC calculation could also be hidden inside a macro, as there
are a couple of places where we do the same things, like pg_rewind.c,
etc.

>> We're talking about a 8kB file which has a size of 512B
>> (PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues.  So I'm not sure to
>> see your point here?
> 
> Even at 512B it is possible to see tears in pg_control and they happen in
> the build farm right now. In fact, this thread [1] trying to fix the problem
> was what got me thinking about alternate solutions to preventing tears in
> pg_control. Thomas' proposed fixes have not been committed to my knowledge
> so the problem remains, but would be fixed by this commit.

Ah, right.  That rings a bell.  Thomas has done some work with
c558e6fd92ff and 63a58c6b.  And we're still not taking the
ControlFileLock while copying it over..  It looks like we should do it
separately, and backpatch.  That's not something for this thread to
worry about.

>> Perhaps existing
>> backup solutions are good enough risk vs reward is not worth it?
> 
> I'm not sure I see the risk here. Saving out pg_control is optional so no
> changes to current software is required. Of course they miss the benefit of
> the protection against tears and missing backup_label, but that is a choice.
>
> Again, optional, but if I was able to manage these saves using the psql
> interface in the TAP tests then I'd say it would be pretty easy for anyone
> with a normal connection to Postgres. Also, we require users to treat
> tabelspace_map and backup_label as binary so not too big a change here.

Maintenance cost for a limited user impact overall.  With incremental
backups being a thing in v18 only available through the replication
protocol, the SQL functions have less advantages these days.  My point
would be to see this thread as a two-step process:
1) Update the flag in the control file when sending it across in
replication stream.
2) Do the SQL function thing with the bytea for the control file, if
necessary.

1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol.  Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from.  The SQL part is optional IMO.  It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file. 
--
Michael


signature.asc
Description: PGP signature

Re: Using per-transaction memory contexts for storing decoded tuples

2024-10-02 Thread Masahiko Sawada

On Wed, Oct 2, 2024 at 9:42 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san, Amit,
>
> > > So, decoding a large transaction with many smaller allocations can
> > > have ~2.2% overhead with a smaller block size (say 8Kb vs 8MB). In
> > > real workloads, we will have fewer such large transactions or a mix of
> > > small and large transactions. That will make the overhead much less
> > > visible. Does this mean that we should invent some strategy to defrag
> > > the memory at some point during decoding or use any other technique? I
> > > don't find this overhead above the threshold to invent something
> > > fancy. What do others think?
> >
> > I agree that the overhead will be much less visible in real workloads.
> > +1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
> > branches (if we agree) and to revert the change in case something
> > happens.
>
> I also felt okay. Just to confirm - you do not push rb_mem_block_size patch 
> and
> just replace SLAB_LARGE_BLOCK_SIZE -> SLAB_DEFAULT_BLOCK_SIZE, right?

Right.

> It seems that
> only reorderbuffer.c uses the LARGE macro so that it can be removed.

I'm going to keep the LARGE macro since extensions might be using it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: long-standing data loss bug in initial sync of logical replication

2024-10-02 Thread Shlok Kyal

On Mon, 30 Sept 2024 at 16:56, Hayato Kuroda (Fujitsu)
 wrote:
>
> > 2.
> > Similarly with above, the relcache won't be invalidated when ALTER
> > PUBLICATION
> > OWNER TO is executed. This means that privilage checks may be ignored if the
> > entry
> > is still valid. Not sure, but is there a possibility this causes an 
> > inconsistency?
>
> Hmm, IIUC, the attribute pubowner is not used for now. The paragpargh
> "There are currently no privileges on publications" [1] may show the 
> current
> status. However, to keep the current behavior, I suggest to invalidate the 
> relcache
> of pubrelations when the owner is altered.
>
> [1]: https://www.postgresql.org/docs/devel/logical-replication-security.html

I have shared the updated patch [1].
So now, when 'ALTER .. PUBLICATION .. OWNER TO ..' is executed the
relcache is invalidated for that specific publication.

[1] : 
https://www.postgresql.org/message-id/CANhcyEWEXL3rxvKH9-Xtx-DgGX0D62EktHpW%2BnG%2BMSSaMVUVig%40mail.gmail.com

Thanks and Regards,
Shlok Kyal

Re: Add on_error and log_verbosity options to file_fdw

2024-10-02 Thread Masahiko Sawada

On Tue, Oct 1, 2024 at 11:34 PM Fujii Masao  wrote:
>
>
>
> On 2024/10/02 9:27, Masahiko Sawada wrote:
> > Sorry for being late in joining the review of this patch. Both 0001
> > and 0003 look good to me. I have two comments on the 0002 patch:
>
> Thanks for the review!
>
> > I think that while scanning a file_fdw foreign table with
> > log_verbosity='silent' the query is not interruptible.
>
> You're right. I added CHECK_FOR_INTERRUPTS() in the retry loop.
>
> > Also, we don't switch to the per-tuple memory context when retrying
> > due to a soft error. I'm not sure it's okay as in CopyFrom(), a
> > similar function for COPY command, we switch to the per-tuple memory
> > context every time before parsing an input line. Would it be
> > problematic if we switch to another memory context while parsing an
> > input line? In CopyFrom() we also call ResetPerTupleExprContext() and
> > ExecClearTuple() for every input, so we might want to consider calling
> > them for every input.
>
> Yes, I've updated the patch based on your comment.
> Could you please review the latest version?

Thank you for updating the patch! All patches look good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: On disable_cost