Re: "pgoutput" options missing on documentation

2023-12-13 Thread Peter Smith
Hi, here are some initial review comments.

//

Patch v00-0001

1.
+
+ /* Check required parameters */
+ if (!protocol_version_given)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("proto_version parameter missing")));
+ if (!publication_names_given)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("publication_names parameter missing")));

To reduce translation efforts, perhaps it is better to arrange for
these to share a common message.

For example,

ereport(ERROR,
 errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 /* translator: %s is a pgoutput option */
 errmsg("missing pgoutput option: %s", "proto_version"));

~

ereport(ERROR,
 errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 /* translator: %s is a pgoutput option */
 errmsg("missing pgoutput option: %s", "publication_names"));

Also, I am unsure whether to call these "parameters" or "options" -- I
wanted to call them parameters like in the documentation, but every
other message in this function refers to "options", so in my example,
I mimicked the nearby code. YMMV.

//

Patch v00-0002

2.
-   The logical replication START_REPLICATION command
-   accepts following parameters:
+   The standard logical decoding plugin (pgoutput) accepts
+   following parameters with START_REPLICATION command:

Since the previous line already said pgoutput is the standard decoding
plugin, maybe it's not necessary to repeat that.

SUGGESTION
Using the START_REPLICATION command, pgoutput accepts the
following parameters:

~~~

3.
I noticed in the protocol message formats docs [1] that those messages
are grouped according to the protocol version that supports them.
Perhaps this page should be arranged similarly for these parameters?

For example, document the parameters in the order they were introduced.

SUGGESTION

-proto_version
 ...
-publication_names
 ...
-binary
 ...
-messages
 ...

Since protocol version 2:

-streaming (boolean)
 ...

Since protocol version 3:

-two_phase
 ...

Since protocol version 4:

-streaming (boolean/parallel)
 ...
-origin
 ...

~~~

4.
+   Boolean parameter to use the binary transfer mode.  This is faster
+   than the text mode, but slightly less robust

SUGGESTION
Boolean parameter to use binary transfer mode. Binary mode is faster
than the text mode but slightly less robust


==
[1]  
https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Dilip Kumar
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar  wrote:
>
> > > >
> > >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true which is quite possible) and
> > > take final decision at commit/prepare/abort time. However, that won't
> > > be the case for sequences because of the dependency of determining
> > > transactional cases on one of the prior records. Now, I am not
> > > completely sure at this stage if such a deviation can cause any
> > > problem and or whether we are okay to have such a deviation for
> > > sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved, I
> > mean although we are considering transactional sequence operation as
> > non-transactional we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL that means
> > that transaction can not be streamed because that transaction has to
> > be committed before snapshot become CONSISTENT (based on the snapshot
> > state change machinery).  Ideally based on the same logic that the
> > snapshot is not consistent the non-transactional sequence changes are
> > also skipped.  But the only thing that makes me a bit uncomfortable is
> > that even though the result is not wrong we have made some wrong
> > intermediate decisions i.e. considered transactional change as
> > non-transactions.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL just add the sequence relids to the hash, I mean that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not.  So no adding such relids to hash seems
> > like a root cause of the issue.  Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what are the other dependencies, but this seems like a
> > worthwhile direction IMHO.
>
> I also thought about the same solution. I tried this solution as the
> attached patch on top of Hayato's diagnostic changes.

I think you forgot to attach the patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Ashutosh Bapat
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar  wrote:

> > >
> >
> > It is correct that we can make a wrong decision about whether a change
> > is transactional or non-transactional when sequence DDL happens before
> > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > after that state. However, one thing to note here is that we won't try
> > to stream such a change because for non-transactional cases we don't
> > proceed unless the snapshot is in a consistent state. Now, if the
> > decision had been correct then we would probably have queued the
> > sequence change and discarded at commit.
> >
> > One thing that we deviate here is that for non-sequence transactional
> > cases (including logical messages), we immediately start queuing the
> > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > SnapBuildProcessChange() returns true which is quite possible) and
> > take final decision at commit/prepare/abort time. However, that won't
> > be the case for sequences because of the dependency of determining
> > transactional cases on one of the prior records. Now, I am not
> > completely sure at this stage if such a deviation can cause any
> > problem and or whether we are okay to have such a deviation for
> > sequences.
>
> Okay, so this particular scenario that I raised is somehow saved, I
> mean although we are considering transactional sequence operation as
> non-transactional we also know that if some of the changes for a
> transaction are skipped because the snapshot was not FULL that means
> that transaction can not be streamed because that transaction has to
> be committed before snapshot become CONSISTENT (based on the snapshot
> state change machinery).  Ideally based on the same logic that the
> snapshot is not consistent the non-transactional sequence changes are
> also skipped.  But the only thing that makes me a bit uncomfortable is
> that even though the result is not wrong we have made some wrong
> intermediate decisions i.e. considered transactional change as
> non-transactions.
>
> One solution to this issue is that, even if the snapshot state does
> not reach FULL just add the sequence relids to the hash, I mean that
> hash is only maintained for deciding whether the sequence is changed
> in that transaction or not.  So no adding such relids to hash seems
> like a root cause of the issue.  Honestly, I haven't analyzed this
> idea in detail about how easy it would be to add only these changes to
> the hash and what are the other dependencies, but this seems like a
> worthwhile direction IMHO.

I also thought about the same solution. I tried this solution as the
attached patch on top of Hayato's diagnostic changes. The following log
messages are seen in the server error log. They indicate that the
sequence change was correctly deemed a transactional change (see the
line "2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is
transactional").
2023-12-14 12:12:50.550 IST [321229] ERROR: relation
"pg_replication_slot" does not exist at character 15
2023-12-14 12:12:50.550 IST [321229] STATEMENT: select * from
pg_replication_slot;
2023-12-14 12:12:57.289 IST [321229] LOG: logical decoding found
initial starting point at 0/1598D50
2023-12-14 12:12:57.289 IST [321229] DETAIL: Waiting for transactions
(approximately 1) older than 759 to end.
2023-12-14 12:12:57.289 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: smgr_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: seq_decode. snapshot is
SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: skipped
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.552 IST [321229] LOG: logical decoding found
initial consistent point at 0/1599170
2023-12-14 12:13:49.552 IST [321229] DETAIL: Waiting for transactions
(approximately 1) older than 760 to end.
2023-12-14 12:13:49.552 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: seq_decode. snapshot is
SNAPBUILD_FULL_SNAPSHOT
2023-12-14 12:14:55.591 IST [321230] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional
2023-12-14 12:14:55.591 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.813 IST [321229] LOG: logical decoding found
consistent point at 0/15992E8
2023-12-14 12:14:55.813 IST [321229] DETAIL: There are no running transactions.
2023-12-14 12:14:55.813 IST 

Re: Reducing output size of nodeToString

2023-12-13 Thread Matthias van de Meent
On Thu, 7 Dec 2023 at 13:09, David Rowley  wrote:
>
> On Thu, 7 Dec 2023 at 10:09, Matthias van de Meent
>  wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
>
> It would be very cool to have the technology proposed by Andres back
> in 2019 [1]. With that, we could easily write various output
> functions.  One could be compact and easily machine-readable and
> another designed to be better for humans for debugging purposes.
>
> We could also easily serialize plans to binary format for copying to
> parallel workers rather than converting them to a text-based
> serialized format. It would also allow us to do things like serialize
> PREPAREd plans into a nicely compact single allocation that we could
> just pfree in a single pfree call on DEALLOCATE.

I'm not sure what benefit you're referring to. If you mean "it's more
compact than the current format" then sure; but the other points can
already be covered by either the current nodeToString format, or by
nodeCopy-ing the prepared plan into its own MemoryContext, which would
allow us to do essentially the same thing.
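To illustrate the nodeCopy-based alternative, roughly what I have in mind
is the sketch below (the function and context names are mine, just for
illustration, and not part of any patch):

/*
 * Sketch only: copy a prepared plan into its own memory context, so that
 * DEALLOCATE can release it with a single MemoryContextDelete().
 */
static PlannedStmt *
copy_plan_to_own_context(PlannedStmt *stmt, MemoryContext parent,
						 MemoryContext *plan_cxt)
{
	MemoryContext cxt = AllocSetContextCreate(parent, "prepared plan",
											  ALLOCSET_DEFAULT_SIZES);
	MemoryContext oldcxt = MemoryContextSwitchTo(cxt);
	PlannedStmt *copy = copyObject(stmt);

	MemoryContextSwitchTo(oldcxt);
	*plan_cxt = cxt;
	return copy;
}

/* ... and on DEALLOCATE, MemoryContextDelete(*plan_cxt) frees the whole plan */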

> Likely we could just use the existing Perl scripts to form the
> metadata arrays rather than the clang parsing stuff Andres used in his
> patch.
>
> Anyway, just wanted to ensure you knew about this idea.

I knew about that thread, but didn't notice the metadata arrays
part of it, which indeed looks interesting for this patch. Thanks for
pointing it out. I'll see if I can incorporate parts of that into this
patchset.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




RE: Synchronizing slots from primary to standby

2023-12-13 Thread Zhijie Hou (Fujitsu)
On Thursday, December 14, 2023 12:45 PM Peter Smith  
wrote:

> A review comment for v47-0001

Thanks for the comment.

> 
> ==
> src/backend/replication/slot.c
> 
> 1.  GetStandbySlotList
> 
> +static void
> +WalSndRereadConfigAndReInitSlotList(List **standby_slots) {
> + char*pre_standby_slot_names;
> +
> + ProcessConfigFile(PGC_SIGHUP);
> +
> + /*
> + * If we are running on a standby, there is no need to reload
> + * standby_slot_names since we do not support syncing slots to
> + cascading
> + * standbys.
> + */
> + if (RecoveryInProgress())
> + return;
> 
> Should the RecoveryInProgress() check be first -- even before the
> ProcessConfigFile call?

ProcessConfigFile() is necessary here; it is used not only for standby_slot_names
but also for all the other GUCs that could be used by the caller.

Best Regards,
Hou zj


Re: Synchronizing slots from primary to standby

2023-12-13 Thread Amit Kapila
On Wed, Dec 13, 2023 at 3:53 PM Peter Smith  wrote:
>
> 12.
> +static void
> +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> + bool *slot_updated)
> +{
> + ReplicationSlot *s;
> + ReplicationSlot *slot;
> + char sync_state = '\0';
>
> In my previous review [1]#33a I thought it was strange to assign the
> sync_state (which is essentially an enum) to some meaningless value,
> so I suggested it should be set to SYNCSLOT_STATE_NONE in the
> declaration. The reply [2] was "No, that will change the flow. It
> should stay uninitialized if the slot is not found."
>
> But I am not convinced there is any flow problem. Also,
> SYNCSLOT_STATE_NONE seems the naturally correct default for something
> with no state. It cannot be found and be SYNCSLOT_STATE_NONE at the
> same time (that is reported as an ERROR "skipping sync of slot") so I
> see no problem.
>
> The CURRENT code is like this:
>
> /* Slot created by the slot sync worker exists, sync it */
> if (sync_state)
> {
>   Assert(sync_state == SYNCSLOT_STATE_READY || sync_state ==
> SYNCSLOT_STATE_INITIATED);
>   ...
> }
> /* Otherwise create the slot first. */
> else
> {
>   ...
> }
>
> AFAICT that could easily be changed to like below, with no change to
> the logic, and it avoids setting strange values.
>
> SUGGESTION.
>
> if (sync_state == SYNCSLOT_STATE_NONE)
> {
>   /* Slot not found. Create it. */
>   ..
> }
> else
> {
>   Assert(sync_state == SYNCSLOT_STATE_READY || sync_state ==
> SYNCSLOT_STATE_INITIATED);
>   ...
> }
>

I think that instead of creating the sync slot based on sync_state, it
would be better to create it when we don't find it via
SearchNamedReplicationSlot(). That will avoid the need to initialize
sync_state, and I think it would make the code in this area look better.
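Something like the below sketch is what I am thinking of (just to show the
flow; it is untested and the variable names are only indicative):

slot = SearchNamedReplicationSlot(remote_slot->name, true);
if (slot == NULL)
{
	/* no local slot with this name yet, so create it and then sync it */
	...
}
else
{
	/* the slot already exists, just synchronize it */
	...
}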

> ~~~
>
> 13. synchronize_one_slot
>
> +static void
> +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> + bool *slot_updated)
>
> This *slot_updated parameter looks dubious. It is used in a loop from
> the caller to mean that ANY slot was updated -- e.g. maybe it is true
> or false on entry to this function.
>
> But, Instead of having some dependency between this function and the
> caller, IMO it makes more sense if we would make this just a boolean
> function in the first place (e.g. was updated? T/F)
>
> Then the caller can also be written more easily like:
>
> some_slot_updated |= synchronize_one_slot(wrconn, remote_slot);
>

+1.

>
> 23. slotsync_reread_config
>
> + old_dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo);
> + Assert(old_dbname);
>
> (This is same comment as old review [1]#61)
>
> Hmm. I still don't see why this extraction of the dbname cannot be
> deferred until later when you know the PrimaryConnInfo has changed,
> otherwise, it might be redundant to do this. Shveta replied [2] that
> "Once PrimaryConnInfo is changed, we can not get old-dbname.", but I'm
> not so sure. Isn't this walrcv_get_dbname_from_conninfo just doing a
> string search -- Why can't you defer this until you know
> conninfoChanged is true, and then to get the old_dbname, you can just
> pass the old_primary_conninfo. E.g. call like
> walrcv_get_dbname_from_conninfo(old_primary_conninfo); Maybe I am
> mistaken.
>

I think we should just restart if any of this information is changed,
with a message like: "slotsync worker will restart because of
a parameter change". This would be similar to what we do for the apply
worker in maybe_reread_subscription().

>
> 28.
> + /*
> + * Is this a slot created by a sync-slot worker?
> + *
> + * Only relevant for logical slots on the physical standby.
> + */
> + char sync_state;
> +
>
> (probably I am repeating a previous thought here)
>
> The comment says the field is only relevant for standby, and that's
> how I've been visualizing it, and why I had previously suggested even
> renaming it to 'standby_sync_state'. However, replies are saying that
> after failover these sync_states also have "some meaning for the
> primary server".
>
> That's the part I have trouble understanding. IIUC the server states
> are just either all 'n' (means nothing) or 'r' because they are just
> leftover from the old standby state. So, does it *truly* have meaning
> for the server? Or should those states somehow be removed/ignored on
> the new primary? Anyway, the point is that if this field does have
> meaning also on the primary (I doubt) then those details should be in
> this comment. Otherwise "Only relevant ... on the standby" is too
> misleading.
>

I think this deserves more comments.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Dilip Kumar
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila  wrote:
>
> > > But can this even happen? Can we start decoding in the middle of a
> > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > > messages, where we also call the output plugin in non-transactional cases.
> >
> > It's not a problem for logical messages because whether the message is
> > transaction or non-transactional is decided while WAL logs the message
> > itself.  But here our problem starts with deciding whether the change
> > is transactional vs non-transactional, because if we insert the
> > 'relfilenode' in hash then the subsequent sequence change in the same
> > transaction would be considered transactional otherwise
> > non-transactional.
> >
>
> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state. However, one thing to note here is that we won't try
> to stream such a change because for non-transactional cases we don't
> proceed unless the snapshot is in a consistent state. Now, if the
> decision had been correct then we would probably have queued the
> sequence change and discarded at commit.
>
> One thing that we deviate here is that for non-sequence transactional
> cases (including logical messages), we immediately start queuing the
> changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> SnapBuildProcessChange() returns true which is quite possible) and
> take final decision at commit/prepare/abort time. However, that won't
> be the case for sequences because of the dependency of determining
> transactional cases on one of the prior records. Now, I am not
> completely sure at this stage if such a deviation can cause any
> problem and or whether we are okay to have such a deviation for
> sequences.

Okay, so the particular scenario that I raised is saved after all: although
we are treating a transactional sequence operation as non-transactional, we
also know that if some of the changes for a transaction were skipped because
the snapshot was not yet FULL, that transaction cannot be streamed, because
it has to have committed before the snapshot becomes CONSISTENT (based on
the snapshot state change machinery).  Ideally, by the same logic that the
snapshot is not consistent, the non-transactional sequence changes are also
skipped.  The only thing that makes me a bit uncomfortable is that even
though the result is not wrong, we have made a wrong intermediate decision,
i.e. considered a transactional change as non-transactional.

One solution to this issue is that, even if the snapshot state has not
reached FULL, we just add the sequence relids to the hash; that hash is only
maintained for deciding whether the sequence was changed in that transaction
or not, so not adding such relids to the hash seems like the root cause of
the issue.  Honestly, I haven't analyzed this idea in detail -- how easy it
would be to add only these changes to the hash, and what the other
dependencies are -- but this seems like a worthwhile direction IMHO.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-12-13 Thread Peter Smith
A review comment for v47-0001

==
src/backend/replication/slot.c

1.  GetStandbySlotList

+static void
+WalSndRereadConfigAndReInitSlotList(List **standby_slots)
+{
+ char*pre_standby_slot_names;
+
+ ProcessConfigFile(PGC_SIGHUP);
+
+ /*
+ * If we are running on a standby, there is no need to reload
+ * standby_slot_names since we do not support syncing slots to cascading
+ * standbys.
+ */
+ if (RecoveryInProgress())
+ return;

Should the RecoveryInProgress() check be first -- even before the
ProcessConfigFile call?

==
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: logical decoding and replication of sequences, take 2

2023-12-13 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state.

I found a workload for which the decoder makes the wrong decision.

# Prerequisite

Apply the attached patch for inspecting the sequence status. It can be applied 
atop the v20231203 patch set.
Also, a table and a sequence must be defined:

```
CREATE TABLE foo (var int);
CREATE SEQUENCE s;
```

# Workload

Then, you can execute concurrent transactions from three clients like below:

Client-1

BEGIN;
INSERT INTO foo VALUES (1);

Client-2

SELECT pg_create_logical_replication_slot('slot', 
'test_decoding');

Client-3

BEGIN;
ALTER SEQUENCE s MAXVALUE 5000;
COMMIT;
SAVEPOINT s1;
SELECT setval('s', 2000);
ROLLBACK;

SELECT pg_logical_slot_get_changes('slot', 
'test_decoding');

# Result and analysis

At first, the lines below are output to the log. This means that the WAL
records for ALTER SEQUENCE were decoded but skipped because the snapshot
was still being built.

```
...
LOG:  logical decoding found initial starting point at 0/154D238
DETAIL:  Waiting for transactions (approximately 1) older than 741 to end.
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: skipped
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: skipped
...
```

Note that the `seq_decode...` line above was not output via `setval()`; it
came from the ALTER SEQUENCE statement. Below is the call stack for
inserting the WAL record.

```
XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
fill_seq_fork_with_data
fill_seq_with_data
AlterSequence
```

Then, subsequent lines read as below. This means that the snapshot has
become FULL and `setval()` is wrongly regarded as non-transactional.

```
LOG:  logical decoding found initial consistent point at 0/154D658
DETAIL:  Waiting for transactions (approximately 1) older than 742 to end.
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: the sequence is non-transactional
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: not consistent: skipped
```

The change is then discarded by the code below because the snapshot has not
yet become CONSISTENT. Had it been regarded as transactional, we would have
queued this change, though the transaction would be skipped at commit.

```
else if (!transactional &&
         (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
          SnapBuildXactNeedsSkip(builder, buf->origptr)))
    return;
```

But anyway, we could find a case in which we make a wrong decision. This
example is lucky - nothing wrong is output - but I'm not sure all such cases
are like that.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d48d88081f..73e38cafd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1397,12 +1397,17 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(r), buf->origptr);
 
+	elog(LOG, "XXX: seq_decode. snapshot is %s", SnapBuildIdentify(builder));
+
/*
 * If we don't have snapshot or we are just fast-forwarding, there is no
 * point in decoding sequences.
 */
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
+   {
+   elog(LOG, "XXX: skipped");
return;
+   }
 
/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
@@ -1437,14 +1442,22 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)

 target_locator,

 NULL);
 
+   elog(LOG, "XXX: the 

Re: "pgoutput" options missing on documentation

2023-12-13 Thread Amit Kapila
On Wed, Dec 13, 2023 at 9:25 PM Emre Hasegeli  wrote:
>
> I noticed that Logical Streaming Replication Protocol documentation
> [1] is missing the options added to "pgoutput" since version 14.  A
> patch is attached to fix it together with another small one to give a
> nice error when "proto_version" parameter is not provided.
>

I agree that we missed updating the parameters of the Logical
Streaming Replication Protocol documentation. I haven't reviewed all
the details yet but one minor thing that caught my attention while
looking at your patch is that we can update the exact additional
information we started to send for streaming mode parallel. We should
update the following sentence: "It accepts an additional value
"parallel" to enable sending extra information with the "Stream Abort"
messages to be used for parallelisation."

-- 
With Regards,
Amit Kapila.




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-13 Thread Amul Sul
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar  wrote:

> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar  wrote:
> >
> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar 
> wrote:
>
> Here is the updated patch based on some comments by tender wang (those
> comments were sent only to me)
>

A few nitpicks:

+
+   /*
+* Mask for slotno banks, considering 1GB SLRU buffer pool size and the
+* SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
+*/
+   bits16  bank_mask;
 } SlruCtlData;

...
...

+ int bankno = pageno & ctl->bank_mask;

I am a bit uncomfortable seeing it as a mask. Why can't it simply be a
number of banks (num_banks), with the bank number obtained through a
modulus operation (pageno % num_banks) instead of a bitwise & operation
(pageno & ctl->bank_mask)? The latter is a bit more difficult to read than
the modulus, which is quite simple, straightforward, and common practice
in hashing.

Are there any advantages of using & over %?

Also, a few places in the 0002 and 0003 patches need the bank number; it
would be better to have a macro for that.
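For example, something like the below (the macro name, and the num_banks
field it assumes, are only for illustration):

/* illustration only: centralize the bank-number computation */
#define SlruBankNumber(ctl, pageno)	((pageno) % (ctl)->num_banks)

or, if the mask-based approach is kept:

#define SlruBankNumber(ctl, pageno)	((pageno) & (ctl)->bank_mask)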
---

 extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64
segpage,
   void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
 #endif /* SLRU_H */


Add an empty line after the declaration, in 0002 patch.
---

-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr
lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr
lsn,
+ int slotno)

Unrelated change in the 0003 patch.
---

Regards,
Amul


Re: Remove MSVC scripts from the tree

2023-12-13 Thread NINGWEI CHEN
On Mon, 4 Dec 2023 17:05:24 +0900
Michael Paquier  wrote:

> On Tue, Sep 26, 2023 at 12:17:04PM -0400, Andrew Dunstan wrote:
> > On 2023-09-26 Tu 01:25, NINGWEI CHEN wrote:
> >> hamerkop is not yet prepared for Meson builds, but we plan to work on this 
> >> support soon.
> >> If we go with Meson builds exclusively right now, we will have to 
> >> temporarily remove the master/HEAD for a while.
> > 
> > I don't think we should switch to that until you're ready.
> 
> Agreed that it would just be breaking a build for the sake of breaking
> it.  Saying that, the last exchange that we had about hamerkop
> switching to meson was two months ago.  Are there any plans to do the
> switch?
> --
> Michael


Sorry for the delayed response. 
We are currently working on transitioning hamerkop to the meson build and 
anticipate that this can be accomplished no later than January.

If the old build scripts are removed before that, hamerkop will be temporarily 
taken off the master branch, and will rejoin once the adjustment is done.


Best Regards.
-- 
SRA OSS LLC
Chen Ningwei




Re: Simplify newNode()

2023-12-13 Thread Junwang Zhao
On Thu, Dec 14, 2023 at 9:34 AM Zhang Mingli  wrote:
>
> Hi,
>
> LGTM.
>
> + Assert(size >= sizeof(Node)); /* need the tag, at least */
> + result = (Node *) palloc0fast(size);
> + result->type = tag;
>
> + return result;
> +}
>
> How about moving the comments /* need the tag, at least */ after result->type 
> = tag; by the way?

I don't think so; the comment means that the requested size should be at
least the size of Node, which contains just a NodeTag.

typedef struct Node
{
NodeTag type;
} Node;

>
>
>
> Zhang Mingli
> www.hashdata.xyz



-- 
Regards
Junwang Zhao




Re: Simplify newNode()

2023-12-13 Thread Zhang Mingli
Hi,

LGTM.

+   Assert(size >= sizeof(Node));   /* need the tag, at least */
+   result = (Node *) palloc0fast(size);
+   result->type = tag;

+   return result;
+}

How about moving the comment /* need the tag, at least */ to after
"result->type = tag;", by the way?



Zhang Mingli
www.hashdata.xyz


Re: Synchronizing slots from primary to standby

2023-12-13 Thread Peter Smith
Hi, here are a few more review comments for the patch v47-0002

(plus my review comments of v45-0002 [1] are yet to be addressed)

==
1. General

For consistency and readability, try to use variables of the same
name whenever they have the same purpose, even when they are declared
in different functions. A few like this were already mentioned in the
previous review but there are more I keep noticing.

For example,
'slotnameChanged' in the function, VERSUS 'primary_slot_changed' in its caller.


==
src/backend/replication/logical/slotsync.c

2.
+/*
+ *
+ * Validates the primary server info.
+ *
+ * Using the specified primary server connection, it verifies whether
the master
+ * is a standby itself and returns true in that case to convey the caller that
+ * we are on the cascading standby.
+ * But if master is the primary server, it goes ahead and validates
+ * primary_slot_name. It emits error if the physical slot in primary_slot_name
+ * does not exist on the primary server.
+ */
+static bool
+validate_primary_info(WalReceiverConn *wrconn)

2a.
Extra blank line at the top of that comment?

~

2b.
IMO it is too tricky to have a function called "validate_xxx" when you
have actually given the return value some special, unintuitive meaning
beyond just validation. It is always better for the returned value to
match the function name, so the expectations are obvious. So, in this
case, I think a better function signature would be like this:

SUGGESTION

static void
validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby)

or

static void
validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)

~~~

3.
+ if (res->status != WALRCV_OK_TUPLES)
+ ereport(ERROR,
+ (errmsg("could not fetch recovery and primary_slot_name \"%s\" info from the "
+ "primary: %s", PrimarySlotName, res->err)));

I'm not sure that including "recovery and" in the error message is
meaningful to the user, is it?

~~~

4. slotsync_reread_config

+/*
+ * Re-read the config file.
+ *
+ * If any of the slot sync GUCs have changed, re-validate them. The
+ * worker will exit if the check fails.
+ *
+ * Returns TRUE if primary_slot_name is changed, let the caller re-verify it.
+ */
+static bool
+slotsync_reread_config(WalReceiverConn *wrconn)

Hm. This is another function where the return value has been butchered
to have a special meaning unrelated to the function name. IMO it
makes the code unnecessarily confusing.

IMO a better function signature here would be:

static void
slotsync_reread_config(WalReceiverConn *wrconn, bool *primary_slot_name_changed)

~~~

5. ProcessSlotSyncInterrupts

+/*
+ * Interrupt handler for main loop of slot sync worker.
+ */
+static bool
+ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool
check_cascading_standby)
+{

There is no function comment describing the meaning of the return
value. But actually, IMO this is an example of how conflating the
meanings of validation VERSUS are_we_cascading_standby in the
lower-down function has propagated up to become a big muddle.

The code
+ if (primary_slot_changed || check_cascading_standby)
+ return validate_primary_info(wrconn);

seems unnecessarily hard to understand because,
false -- doesn't mean invalid
true -- doesn't mean valid

Please, consider changing this signature also so the functions return
what you would intuitively expect them to return without surprisingly
different meanings.

SUGGESTION

static void
ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool
check_cascading_standby, bool *am_cascading_standby)

~~~

6. ReplSlotSyncWorkerMain

+ int rc;
+ long naptime = WORKER_DEFAULT_NAPTIME_MS;
+ TimestampTz now;
+ bool slot_updated;
+
+ /*
+ * The transaction env is needed by walrcv_exec() in both the slot
+ * sync and primary info validation flow.
+ */
+ StartTransactionCommand();
+
+ if (!am_cascading_standby)
+ {
+ slot_updated = synchronize_slots(wrconn);
+
+ /*
+ * If any of the slots get updated in this sync-cycle, use default
+ * naptime and update 'last_update_time'. But if no activity is
+ * observed in this sync-cycle, then increase naptime provided
+ * inactivity time reaches threshold.
+ */
+ now = GetCurrentTimestamp();
+ if (slot_updated)
+ last_update_time = now;
+ else if (TimestampDifferenceExceeds(last_update_time,
+ now, WORKER_INACTIVITY_THRESHOLD_MS))
+ naptime = WORKER_INACTIVITY_NAPTIME_MS;
+ }
+ else
+ naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */
+
+ rc = WaitLatch(MyLatch,
+WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+naptime,
+WAIT_EVENT_REPL_SLOTSYNC_MAIN);
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+
+ am_cascading_standby =
+ ProcessSlotSyncInterrupts(wrconn, am_cascading_standby);
+
+ CommitTransactionCommand();

IMO it is more natural to avoid negative conditions, so just reverse
these. Also, some comment is needed to explain why the longer naptime
is needed in this special case.


SUGGESTION
if (am_cascading_standby)
{
  /* comment the reason  */
 

useless LIMIT_OPTION_DEFAULT

2023-12-13 Thread Zhang Mingli
Hi, all

While reading the code, I found that, since commit
357889eb17bb9c9336c4f324ceb1651da616fe57, we process the limit option as
LIMIT_OPTION_WITH_TIES when WITH TIES is used and as LIMIT_OPTION_COUNT in
all other cases, and we check the actual limit node in limit_needed().
There is no need to maintain a useless default limit enum.
I removed it, and ran an installcheck to verify.
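With the removal, the enum becomes simply (shown here just for illustration;
see the attached patch for the actual change):

typedef enum LimitOption
{
	LIMIT_OPTION_COUNT,			/* FETCH FIRST... ONLY */
	LIMIT_OPTION_WITH_TIES,		/* FETCH FIRST... WITH TIES */
} LimitOption;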

Are there any considerations behind this?
Shall we remove it for clarity, as it's not actually the default option?


Zhang Mingli
www.hashdata.xyz


v1-0001-Remove-useless-LIMIT_OPTION_DEFAULT.patch
Description: Binary data


Simplify newNode()

2023-12-13 Thread Heikki Linnakangas
The newNode() macro can be turned into a static inline function, which 
makes it a lot simpler. See attached. This was not possible when the 
macro was originally written, as we didn't require the compiler to have 
static inline support, but nowadays we do.


This was last discussed in 2008, see discussion at 
https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. 
In those tests, Tom observed that gcc refused to inline the static 
inline function. That was weird; the function is very small and doesn't 
do anything special. Whatever the problem was, I think we can dismiss it 
with modern compilers. It does get inlined on the gcc 12 and clang 14 that 
I have installed.


--
Heikki Linnakangas
Neon (https://neon.tech)

From 6ad4c4cf49ef5b3f7ed22acc258a868f1a13f6f4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 14 Dec 2023 01:08:16 +0100
Subject: [PATCH 1/1] Convert newNode() into a static inline function

Because it's simpler. We didn't require static inline support from
compiler when this was originally written, but now we do.

One complication is that the static inline function doesn't work in
the frontend, because it calls palloc0fast() which isn't included
frontend programs. That's OK, the old macro also didn't work in
frontend programs if you actually tried to call it, but we now need to
explicitly put it in an #ifndef FRONTEND block to keep the compiler
happy.
---
 src/backend/nodes/Makefile|  1 -
 src/backend/nodes/meson.build |  1 -
 src/backend/nodes/nodes.c | 31 -
 src/include/nodes/nodes.h | 43 +++
 4 files changed, 13 insertions(+), 63 deletions(-)
 delete mode 100644 src/backend/nodes/nodes.c

diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile
index ebbe9052cb7..66bbad8e6e0 100644
--- a/src/backend/nodes/Makefile
+++ b/src/backend/nodes/Makefile
@@ -23,7 +23,6 @@ OBJS = \
 	makefuncs.o \
 	multibitmapset.o \
 	nodeFuncs.o \
-	nodes.o \
 	outfuncs.o \
 	params.o \
 	print.o \
diff --git a/src/backend/nodes/meson.build b/src/backend/nodes/meson.build
index 31467a12d3b..1efbf2c11ca 100644
--- a/src/backend/nodes/meson.build
+++ b/src/backend/nodes/meson.build
@@ -7,7 +7,6 @@ backend_sources += files(
   'makefuncs.c',
   'multibitmapset.c',
   'nodeFuncs.c',
-  'nodes.c',
   'params.c',
   'print.c',
   'read.c',
diff --git a/src/backend/nodes/nodes.c b/src/backend/nodes/nodes.c
deleted file mode 100644
index 1913a4bdf7d..000
--- a/src/backend/nodes/nodes.c
+++ /dev/null
@@ -1,31 +0,0 @@
-/*-
- *
- * nodes.c
- *	  support code for nodes (now that we have removed the home-brew
- *	  inheritance system, our support code for nodes is much simpler)
- *
- * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/backend/nodes/nodes.c
- *
- * HISTORY
- *	  Andrew Yu			Oct 20, 1994	file creation
- *
- *-
- */
-#include "postgres.h"
-
-#include "nodes/nodes.h"
-
-/*
- * Support for newNode() macro
- *
- * In a GCC build there is no need for the global variable newNodeMacroHolder.
- * However, we create it anyway, to support the case of a non-GCC-built
- * loadable module being loaded into a GCC-built backend.
- */
-
-Node	   *newNodeMacroHolder;
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4c32682e4ce..dbb8eed122c 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -132,6 +132,8 @@ typedef struct Node
 
 #define nodeTag(nodeptr)		(((const Node*)(nodeptr))->type)
 
+#ifndef FRONTEND
+
 /*
  * newNode -
  *	  create a new node of the specified size and tag the node with the
@@ -139,43 +141,24 @@ typedef struct Node
  *
  * !WARNING!: Avoid using newNode directly. You should be using the
  *	  macro makeNode.  eg. to create a Query node, use makeNode(Query)
- *
- * Note: the size argument should always be a compile-time constant, so the
- * apparent risk of multiple evaluation doesn't matter in practice.
- */
-#ifdef __GNUC__
-
-/* With GCC, we can use a compound statement within an expression */
-#define newNode(size, tag) \
-({	Node   *_result; \
-	AssertMacro((size) >= sizeof(Node));		/* need the tag, at least */ \
-	_result = (Node *) palloc0fast(size); \
-	_result->type = (tag); \
-	_result; \
-})
-#else
-
-/*
- *	There is no way to dereference the palloc'ed pointer to assign the
- *	tag, and also return the pointer itself, so we need a holder variable.
- *	Fortunately, this macro isn't recursive so we just define
- *	a global variable for this purpose.
  */
-extern PGDLLIMPORT Node *newNodeMacroHolder;
+static inline Node *
+newNode(size_t size, NodeTag tag)
+{
+	Node	   *result;
 
-#define newNode(size, tag) \
-( \
-	AssertMacro((size) >= sizeof(Node)),		/* need the tag, at 

Re: Teach predtest about IS [NOT] proofs

2023-12-13 Thread James Coleman
Thanks for taking a look!

On Wed, Dec 13, 2023 at 1:36 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > Attached is a patch that solves that issue. It also teaches predtest about
> > quite a few more cases involving BooleanTest expressions (e.g., how they
> > relate to NullTest expressions). One thing I could imagine being an
> > objection is that not all of these warrant cycles in planning. If that
> > turns out to be the case there's not a particularly clear line in my mind
> > about where to draw that line.
>
> I don't have an objection in principle to adding more smarts to
> predtest.c.  However, we should be wary of slowing down cases where
> no BooleanTests are present to be optimized.  I wonder if it could
> help to use a switch on nodeTag rather than a series of if(IsA())
> tests.  (I'd be inclined to rewrite the inner if-then-else chains
> as switches too, really.  You get some benefit from the compiler
> noticing whether you've covered all the enum values.)

I think I could take this on; would you prefer it as a patch in this
series? Or as a new patch thread?

> I note you've actively broken the function's ability to cope with
> NULL input pointers.  Maybe we don't need it to, but I'm not going
> to accept a patch that just side-swipes that case without any
> justification.

I should have explained that. I don't think I've broken it:

1. predicate_implied_by_simple_clause() is only ever called by
predicate_implied_by_recurse()
2. predicate_implied_by_recurse() starts with:
pclass = predicate_classify(predicate, _info);
3. predicate_classify(Node *clause, PredIterInfo info) starts off with:
Assert(clause != NULL);

I believe this means we are currently guaranteed by the caller to
receive a non-NULL pointer, but I could be missing something.

The same argument (just substituting the equivalent "refute" function
names) applies to predicate_refuted_by_simple_clause().

> Another way in which the patch needs more effort is that you've
> not bothered to update the large comment block atop the function.
> Perhaps, rather than hoping people will notice comments that are
> potentially offscreen from what they're modifying, we should relocate
> those comment paras to be adjacent to the relevant parts of the
> function?

Splitting up that block comment makes sense to me.

> I've not gone through the patch in detail to see whether I believe
> the proposed proof rules.  It would help to have more comments
> justifying them.

Most of them are sufficiently simple -- e.g., X IS TRUE implies X --
that I don't think there's a lot to say in justification. In some
cases I've noted the cases that force only strong or weak implication.

There are a few cases, though, (e.g., "X is unknown weakly implies X
is not true") that, reading over this again, don't immediately strike
me as obvious, so I'll expand on those.

> > As noted in a TODO in the patch itself, I think it may be worth refactoring
> > the test_predtest module to run the "x, y" case as well as the "y, x" case
> > with a single call so as to eliminate a lot of repetition in
> > clause/expression test cases. If reviewers agree that's desirable, then I
> > could do that as a precursor.
>
> I think that's actively undesirable.  It is not typically the case that
> a proof rule for A => B also works in the other direction, so this would
> encourage wasting cycles in the tests.  I fear it might also cause
> confusion about which direction a proof rule is supposed to work in.

That makes sense in the general case.

Boolean expressions seem like a special case in that regard: subject to
what it ends up looking like, would you be OK with a wrapping function
that tests both directions (with output that shows which direction is
being tested), used only for the cases where we do want to check both
directions?

Thanks,
James Coleman




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-13 Thread Masahiko Sawada
On Tue, Dec 12, 2023 at 11:53 AM John Naylor  wrote:
>
> On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada  wrote:
>
> > I've attached the updated patch set. From the previous patch set, I've
> > merged patches 0007 to 0010. The other changes such as adding RT_GET()
> > still are unmerged for now, for discussion. Probably we can make them
> > as follow-up patches as we discussed. 0011 to 0015 patches are new
> > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and
> > support variable-length values.
>
> This looks like the right direction, and I'm pleased it's not much
> additional code on top of my last patch.
>
> v44-0014:
>
> +#ifdef RT_VARLEN_VALUE
> + /* XXX: need to choose block sizes? */
> + tree->leaf_ctx = AllocSetContextCreate(ctx,
> +"radix tree leaves",
> +ALLOCSET_DEFAULT_SIZES);
> +#else
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +"radix tree leaves",
> +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +sizeof(RT_VALUE_TYPE));
> +#endif /* RT_VARLEN_VALUE */
>
> Choosing block size: Similar to what we've discussed previously around
> DSA segments, we might model this on CreateWorkExprContext() in
> src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m /
> autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the
> max block size to 1/16 of that, or less.
>
> Also, it occurred to me that compile-time embeddable values don't need
> a leaf context. I'm not sure how many places assume that there is
> always a leaf context. If not many, it may be worth not creating one
> here, just to be tidy.
>
> + size_t copysize;
>
> - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE));
> + copysize = sizeof(RT_VALUE_TYPE);
> +#endif
> +
> + memcpy(leaf.local, value_p, copysize);
>
> I'm not sure this indirection adds clarity. I guess the intent was to
> keep from saying "memcpy" twice, but now the code has to say "copysize
> = foo" twice.
>
> For varlen case, we need to watch out for slowness because of memcpy.
> Let's put that off for later testing, though. We may someday want to
> avoid a memcpy call for the varlen case, so let's keep it flexible
> here.
>
> v44-0015:
>
> +#define SizeOfBlocktableEntry (offsetof(
>
> Unused.
>
> + char buf[MaxBlocktableEntrySize] = {0};
>
> Zeroing this buffer is probably going to be expensive. Also see this
> pre-existing comment:
> /* WIP: slow, since it writes to memory for every bit */
> page->words[wordnum] |= ((bitmapword) 1 << bitnum);
>
> For this function (which will be vacuum-only, so we can assume
> ordering), in the loop we can:
> * declare the local bitmapword variable to be zero
> * set the bits on it
> * write it out to the right location when done.
>
> Let's fix both of these at once.
>
> + if (TidStoreIsShared(ts))
> + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len);
> + else
> + local_rt_set(ts->tree.local, blkno, (void *) page, page_len);
>
> Is there a reason for "void *"? The declared parameter is
> "RT_VALUE_TYPE *value_p" in 0014.
> Also, since this function is for vacuum (and other uses will need a
> new function), let's assert the returned bool is false.
>
> Does iteration still work? If so, it's not too early to re-wire this
> up with vacuum and see how it behaves.
>
> Lastly, my compiler has a warning that CI doesn't have:
>
> In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121:
> ../src/include/lib/radixtree.h: In function ‘rt_find.isra’:
> ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used
> uninitialized [-Wmaybe-uninitialized]
>  2142 | return (RT_VALUE_TYPE*) slot;
>   |^
> ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here
>  2112 | RT_PTR_ALLOC *slot;
>   |   ^~~~

Thank you for the comments! I agreed with all of them and incorporated
them into the attached latest patch set, v45.

In v45, 0001 - 0006 are from earlier versions, but I've merged the previous
updates. So the radix tree now has RT_SET() and RT_FIND() but not
RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
versions that incorporate the above comments. The 0009 patch integrates
tidstore with lazy vacuum. Note that the DSA segment problem is not
resolved yet in this patch. 0010 and 0011 make the DSA initial/max
segment sizes configurable and make parallel vacuum specify both in
proportion to maintenance_work_mem. 0012 is a development-purpose
patch to make it easy to investigate bugs in tidstore. I'd like to
keep it in the patch set at least during development.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v45-ART.tar.gz
Description: GNU Zip compressed data


Re: [BUG] autovacuum may skip tables when session_authorization/role is set on database

2023-12-13 Thread Imseih (AWS), Sami
> What is the actual
> use case for such a setting? 

I don't have exact details on the use case, but this is not a common
one.

> Doesn't it risk security problems?

I cannot see how setting it on the database is any more problematic than
setting it at the session level.


> I'm rather unimpressed by this proposal, first because there are
> probably ten other ways to break autovac with ill-considered settings,

There is existing code in autovacuum that safeguards against such settings.
For example, statement_timeout and lock_timeout are disabled, and a dozen or
more other settings are overridden for autovacuum.

I see this as just another such safeguard, to ensure that autovacuum always
runs as superuser.

> and second because if we do want to consider this a supported case,
> what about other background processes? They'd likely have issues
> as well.

I have not considered other background processes, but autovac is the only
one that I can think of which checks for relation permissions.

Regards,

Sami








Obscure lwlock assertion failure if write fails in initdb

2023-12-13 Thread Thomas Munro
Hi,

In all releases, if bootstrap mode's checkpoint gets an error (ENOSPC,
EDQUOT, EIO, ...) or a short write in md.c, ERROR is promoted to FATAL
and the shmem_exit resowner machinery reaches this:

running bootstrap script ... 2023-12-14 10:38:02.320 NZDT [1409162]
FATAL:  could not write block 42 in file "base/1/1255": wrote only
4096 of 8192 bytes
2023-12-14 10:38:02.320 NZDT [1409162] HINT:  Check free disk space.
2023-12-14 10:38:02.320 NZDT [1409162] CONTEXT:  writing block 42 of
relation base/1/1255
TRAP: failed Assert("!LWLockHeldByMe(BufferDescriptorGetContentLock(buf))"),
File: "bufmgr.c", Line: 2409, PID: 1409162

It's really hard to hit because we'd normally expect smgrextend() to
get the error first, and when it does it looks something like this:

running bootstrap script ... 2023-12-14 10:22:41.940 NZDT [1378512]
FATAL:  could not extend file "base/1/1255": wrote only 4096 of 8192
bytes at block 42
2023-12-14 10:22:41.940 NZDT [1378512] HINT:  Check free disk space.
2023-12-14 10:22:41.940 NZDT [1378512] PANIC:  cannot abort
transaction 1, it was already committed
Aborted (core dumped)

A COW system might succeed in smgrextend() and then fail in
smgrwrite(), and any system might fail here with other errno.

It's an extremely well hidden edge case and doesn't matter to users:
initdb failed for lack of space or worse, the message is clear and the
rest is meaningless detail of interest to developers with assertion
builds.  I only happened to notice because I've been testing short
write and error scenarios via artificially rigged up means for my
vectored I/O work.  No patch, I just wanted to flag this obscure
pre-existing problem spotted in passing.




Re: Remove MSVC scripts from the tree

2023-12-13 Thread Andrew Dunstan



On 2023-12-13 We 09:23, Michael Paquier wrote:

On Tue, Dec 05, 2023 at 07:29:59AM +0900, Michael Paquier wrote:

Okay.  Thanks for the update.

While in Prague, Andres and Peter E. have mentioned me that we perhaps
had better move on with this patch sooner than later, without waiting
for the two buildfarm members to do the switch because much more
cleanup is required for the documentation once the scripts are
removed.

So, any objections with the patch as presented to remove the scripts
while moving the existing doc blocks from install-windows.sgml that
still need more discussion?




TBH I'd prefer to wait. But I have had a couple more urgent things on my 
plate. I hope to get back to it before New Year. In the meantime I have 
switched bowerbird to building only STABLE branches.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [BUG] autovacuum may skip tables when session_authorization/role is set on database

2023-12-13 Thread Tom Lane
"Imseih (AWS), Sami"  writes:
> A recent case in the field in which a database session_authorization is
> altered to a non-superuser, non-owner of tables via alter database .. set 
> session_authorization ..
> caused autovacuum to skip tables.

That seems like an extremely not-bright idea.  What is the actual
use case for such a setting?  Doesn't it risk security problems?

> Attached is a repro and a patch which sets the session user to the BOOTSTRAP 
> superuser
> at the start of the autovac worker.

I'm rather unimpressed by this proposal, first because there are
probably ten other ways to break autovac with ill-considered settings,
and second because if we do want to consider this a supported case,
what about other background processes?  They'd likely have issues
as well.

regards, tom lane




Re: Clean up find_typedefs and add support for Mac

2023-12-13 Thread Tristan Partin

On Wed Dec 13, 2023 at 2:35 PM CST, Andrew Dunstan wrote:


On 2023-12-12 Tu 18:02, Tom Lane wrote:
> "Tristan Partin"  writes:
>> The big patch here is adding support for Mac. objdump -W doesn't work on
>> Mac. So, I used dsymutil and dwarfdump to achieve the same result.
> We should probably nuke the current version of src/tools/find_typedef
> altogether in favor of copying the current buildfarm code for that.
> We know that the buildfarm's version works, whereas I'm not real sure
> that src/tools/find_typedef is being used in anger by anyone.  Also,
> we're long past the point where developers can avoid having Perl
> installed.
>
> Ideally, the buildfarm client would then start to use find_typedef
> from the tree rather than have its own copy, both to reduce
> duplication and to ensure that the in-tree copy keeps working.
>
>


+1. I'd be more than happy to be rid of maintaining the code. We already 
performed a rather more complicated operation along these lines with the 
PostgreSQL::Test::AdjustUpgrade stuff.


I'll work with you on GitHub to help make the transition. I've already 
made a draft PR in the client-code repo, but I am sure I am missing some 
stuff.


--
Tristan Partin
Neon (https://neon.tech)




Re: Add --check option to pgindent

2023-12-13 Thread Andrew Dunstan



On 2023-12-12 Tu 10:30, Alvaro Herrera wrote:

On 2023-Dec-12, Tom Lane wrote:


"Euler Taveira"  writes:

When you add exceptions, it starts to complicate the UI.

Indeed.  It seems like --silent-diff was poorly defined and poorly
named, and we need to rethink that option along the way to adding
this behavior.  The idea that --show-diff and --silent-diff can
be used together is just inherently confusing, because they sound
like opposites.

Maybe it's enough to rename --silent-diff to --check.  You can do
"--show-diff --check" and get both the error and the diff printed; or
just "--check" and it'll throw an error without further ado; or
"--show-diff" and it will both apply the diff and print it.



That seems reasonable. These features were fairly substantially debated 
when we put them in, but I'm fine with some tweaking. But note: 
--show-diff doesn't apply the diff, it's intentionally non-destructive.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





[BUG] autovacuum may skip tables when session_authorization/role is set on database

2023-12-13 Thread Imseih (AWS), Sami
Hi,

A recent case in the field in which a database session_authorization is
altered to a non-superuser, non-owner of tables via alter database .. set 
session_authorization ..
caused autovacuum to skip tables.

The issue was discovered on 13.10, and the logs show such messages:

warning:  skipping "table1" --- only table or database owner can vacuum it

In HEAD, I can repro, but the message is now a bit different due to [1].

WARNING:  permission denied to vacuum "table1", skipping it

It seems to me we should force an autovacuum worker to set the session userid to
a superuser.

Attached is a repro and a patch which sets the session user to the BOOTSTRAP 
superuser
at the start of the autovac worker.
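
Just to sketch the idea (a hypothetical one-liner; the attached patch may
differ in detail):

    /* early in the autovacuum worker startup path */
    SetSessionAuthorization(BOOTSTRAP_SUPERUSERID, true);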

Thoughts?

Regards,

Sami
Amazon Web Services (AWS)


[1] 
https://postgr.es/m/20220726.104712.912995710251150228.horikyota@gmail.com



0001-v1-Force-autovacuum-to-use-bootstrap-superuser.patch
Description: 0001-v1-Force-autovacuum-to-use-bootstrap-superuser.patch

Re: Clean up find_typedefs and add support for Mac

2023-12-13 Thread Andrew Dunstan



On 2023-12-12 Tu 18:02, Tom Lane wrote:

"Tristan Partin"  writes:

The big patch here is adding support for Mac. objdump -W doesn't work on
Mac. So, I used dsymutil and dwarfdump to achieve the same result.

We should probably nuke the current version of src/tools/find_typedef
altogether in favor of copying the current buildfarm code for that.
We know that the buildfarm's version works, whereas I'm not real sure
that src/tools/find_typedef is being used in anger by anyone.  Also,
we're long past the point where developers can avoid having Perl
installed.

Ideally, the buildfarm client would then start to use find_typedef
from the tree rather than have its own copy, both to reduce
duplication and to ensure that the in-tree copy keeps working.





+1. I'd be more than happy to be rid of maintaining the code. We already 
performed a rather more complicated operation along these lines with the 
PostgreSQL::Test::AdjustUpgrade stuff.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Windows default locale vs initdb

2023-12-13 Thread Thomas Munro
Here is a thought that occurs to me, as I follow along with Jeff
Davis's evolving proposals for built-in collations and ctypes:  What
would stop us from dropping support for the libc (sic) provider on
Windows?  That may sound radical and likely to cause extra work for
people on upgrade, but how does that compare to the pain of keeping
this barely maintained code in the tree?  Suppose the idea in this
thread goes ahead and we get people to transition to the modern locale
names: there is non-zero transitional/upgrade pain there too.  How
delicious it would be to just nuke the whole thing from orbit, and
keep only cross-platform code that is maintained with enthusiasm by
active hackers.

That's probably a little extreme, but it's the direction my thoughts
start to go in when confronting the realisation that it's up to us
[Unix hackers making drive-by changes], no one is coming to help us
[from the Windows user community].

I've even heard others talk about dropping Windows completely, due to
the maintenance imbalance.  This would be somewhat more fine grained.
(One could use a similar argument to drop non-NTFS filesystems and
turn on POSIX-mode file links, to end that other locus of struggle.)




LargeObjectRelationId vs LargeObjectMetadataRelationId, redux

2023-12-13 Thread Tom Lane
By chance I discovered that checks on large object ownership
are broken in v16+.  For example,

regression=# create user alice;
CREATE ROLE
regression=# \c - alice
You are now connected to database "regression" as user "alice".
regression=> \lo_import test
lo_import 40378
regression=> comment on large object 40378 is 'test';
ERROR:  unrecognized class ID: 2613

This has been failing since commit afbfc0298, which refactored
ownership checks, replacing pg_largeobject_ownercheck and
allied routines with object_ownercheck.  That function lacks
the little dance that's been stuck into assorted crannies:

if (classId == LargeObjectRelationId)
classId = LargeObjectMetadataRelationId;

which translates from the object-address representation with
classId LargeObjectRelationId to the catalog we actually need
to look at.

The proximate cause of the failure is in get_object_property_data,
so I first thought of making that function do this transposition.
That might be a good thing to do, but it wouldn't be enough to
fix the problem, because we'd then reach this in object_ownercheck:

rel = table_open(classid, AccessShareLock);

which is going to examine the wrong catalog.  So AFAICS what
we have to do is put this substitution into object_ownercheck,
adding to the four or five places that know about it already.

This is an absolutely horrid mess, of course.  The big problem
is that at this point I have exactly zero confidence that there
are not other places with the same bug; and it's not apparent
how to find them.

There seems little choice but to make the hacky fix in v16,
but I wonder whether we shouldn't be more ambitious and try
to fix this permanently in HEAD, by getting rid of the
discrepancy in which OID to use.  ISTM the correct fix
is to change the ObjectAddress representation of large
objects to use classid LargeObjectMetadataRelationId.
Somebody seems to have felt that that would create more
problems than it solves, but I have to disagree.  If we
stick with the current way, we are going to be hitting
problems of this ilk forevermore.

Thoughts?

regards, tom lane




Re: Teach predtest about IS [NOT] proofs

2023-12-13 Thread Tom Lane
James Coleman  writes:
> Attached is a patch that solves that issue. It also teaches predtest about
> quite a few more cases involving BooleanTest expressions (e.g., how they
> relate to NullTest expressions). One thing I could imagine being an
> objection is that not all of these warrant cycles in planning. If that
> turns out to be the case there's not a particularly clear line in my mind
> about where to draw that line.

I don't have an objection in principle to adding more smarts to
predtest.c.  However, we should be wary of slowing down cases where
no BooleanTests are present to be optimized.  I wonder if it could
help to use a switch on nodeTag rather than a series of if(IsA())
tests.  (I'd be inclined to rewrite the inner if-then-else chains
as switches too, really.  You get some benefit from the compiler
noticing whether you've covered all the enum values.)
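
For illustration, the dispatch might look roughly like this (a hypothetical
sketch, not the existing predtest.c code):

    switch (nodeTag(clause))
    {
        case T_NullTest:
            /* ... proof rules involving IS [NOT] NULL ... */
            break;

        case T_BooleanTest:
            switch (((BooleanTest *) clause)->booltesttype)
            {
                /* no default: the compiler can flag a missed BoolTestType */
                case IS_TRUE:
                case IS_NOT_FALSE:
                    /* ... */
                    break;
                case IS_FALSE:
                case IS_NOT_TRUE:
                    /* ... */
                    break;
                case IS_UNKNOWN:
                case IS_NOT_UNKNOWN:
                    /* ... */
                    break;
            }
            break;

        default:
            /* other node types: no special handling */
            break;
    }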

I note you've actively broken the function's ability to cope with
NULL input pointers.  Maybe we don't need it to, but I'm not going
to accept a patch that just side-swipes that case without any
justification.

Another way in which the patch needs more effort is that you've
not bothered to update the large comment block atop the function.
Perhaps, rather than hoping people will notice comments that are
potentially offscreen from what they're modifying, we should relocate
those comment paras to be adjacent to the relevant parts of the
function?

I've not gone through the patch in detail to see whether I believe
the proposed proof rules.  It would help to have more comments
justifying them.

> As noted in a TODO in the patch itself, I think it may be worth refactoring
> the test_predtest module to run the "x, y" case as well as the "y, x" case
> with a single call so as to eliminate a lot of repetition in
> clause/expression test cases. If reviewers agree that's desirable, then I
> could do that as a precursor.

I think that's actively undesirable.  It is not typically the case that
a proof rule for A => B also works in the other direction, so this would
encourage wasting cycles in the tests.  I fear it might also cause
confusion about which direction a proof rule is supposed to work in.

regards, tom lane




Re: Clean up find_typedefs and add support for Mac

2023-12-13 Thread Tristan Partin

On Wed Dec 13, 2023 at 11:27 AM CST, Tom Lane wrote:

"Tristan Partin"  writes:
> That makes sense to me. Where can I find the buildfarm code to propose 
> a different patch, at least pulling in the current state of the 
> buildfarm script? Or perhaps Andrew is the best person for this job.


I think this is the authoritative repo:

https://github.com/PGBuildFarm/client-code.git


Cool. I'll reach out to Andrew off list to work with him. Maybe I'll 
gain a little bit more knowledge of how the buildfarm works :).


--
Tristan Partin
Neon (https://neon.tech)




Re: Clean up find_typedefs and add support for Mac

2023-12-13 Thread Tom Lane
"Tristan Partin"  writes:
> That makes sense to me. Where can I find the buildfarm code to propose 
> a different patch, at least pulling in the current state of the 
> buildfarm script? Or perhaps Andrew is the best person for this job.

I think this is the authoritative repo:

https://github.com/PGBuildFarm/client-code.git

regards, tom lane




Re: Eager page freeze criteria clarification

2023-12-13 Thread Robert Haas
Great results.

On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman
 wrote:
> Values can be "removed" from the accumulator by simply decrementing its
> cardinality and decreasing the sum and sum squared by a value that will
> not change the mean and standard deviation of the overall distribution.
> To adapt to a table's changing access patterns, we'll need to remove
> values from this accumulator over time, but this patch doesn't yet
> decide when to do this. A simple solution may be to cap the cardinality
> of the accumulator to the greater of 1% of the table size, or some fixed
> number of values (perhaps 200?). Even without such removal of values,
> the distribution recorded in the accumulator will eventually skew toward
> more recent data, albeit at a slower rate.

I think we're going to need something here. Otherwise, after 6 months
of use, changing a table's perceived access pattern will be quite
difficult.

I think one challenge here is to find something that doesn't decay too
often and end up with cases where it basically removes all the data.

As long as you avoid that, I suspect that the algorithm might not be
terribly sensitive to other kinds of changes. If you decay after 200
values or 2000 or 20,000, it will only affect how fast we can change
our notion of the access pattern, and my guess would be that any of
those values would produce broadly acceptable results, with some
differences in the details. If you decay after 200,000,000 values or
not at all, then I think there will be problems.
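
For illustration only, here is a minimal sketch of the kind of accumulator
being discussed (made-up names, not the patch's code):

    typedef struct FreezeAccum
    {
        double      n;          /* number of values */
        double      sum;        /* sum of observed page ages */
        double      sumsq;      /* sum of squared page ages */
    } FreezeAccum;

    static void
    accum_add(FreezeAccum *a, double v)
    {
        a->n += 1;
        a->sum += v;
        a->sumsq += v * v;
    }

    /*
     * "Remove" one value without changing the mean or the standard
     * deviation: subtract the current mean from sum, and sumsq/n (which
     * equals variance + mean^2) from sumsq.
     */
    static void
    accum_forget_one(FreezeAccum *a)
    {
        double      mean = a->sum / a->n;

        a->sumsq -= a->sumsq / a->n;
        a->sum -= mean;
        a->n -= 1;
    }

A decay step could then simply call accum_forget_one() repeatedly whenever
the cardinality exceeds whatever cap is chosen.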

> This algorithm is able to predict when pages will be modified before
> target_freeze_threshold as long as their modification pattern fits a
> normal distribution -- this accommodates access patterns where, after
> some period of time, pages become less likely to be modified the older
> they are. As an example, however, access patterns where modification
> times are bimodal aren't handled well by this model (imagine that half
> your modifications are to very new data and the other half are to data
> that is much older but still younger than target_freeze_duration). If
> this is a common access pattern, the normal distribution could easily be
> swapped out for another distribution. The current accumulator is capable
> of tracking a distribution's first two moments of central tendency (the
> mean and standard deviation). We could track more if we wanted to use a
> fancier distribution.

I think it might be fine to ignore this problem. What we're doing
right now is way worse.

> We also must consider what to do when we have few unsets, e.g. with an
> insert-only workload. When there are very few unsets (I chose 30 which
> the internet says is the approximate minimum number required for the
> central limit theorem to hold), we can choose to always freeze freezable
> pages; above this limit, the calculated page threshold is used. However,
> we may not want to make predictions based on 31 values either. To avoid
> this "cliff", we could modify the algorithm a bit to skew the mean and
> standard deviation of the distribution using a confidence interval based
> on the number of values we've collected.

I think some kind of smooth transition might be a good idea, but
I also don't know how much it matters. I think the important thing is
that we freeze aggressively unless there's a clear signal telling us
not to do that. Absence of evidence should cause us to tend toward
aggressive freezing. Otherwise, we get insert-only tables wrong.

> The goal is to keep pages frozen for at least target_freeze_duration.
> target_freeze_duration is in seconds and pages only have a last
> modification LSN, so target_freeze_duration must be converted to LSNs.
> To accomplish this, I've added an LSNTimeline data structure, containing
> XLogRecPtr, TimestampTz pairs stored with decreasing precision as they
> age. When we need to translate the guc value to LSNs, we linearly
> interpolate it on this timeline. For the time being, the global
> LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There
> is no reason it can't be updated with some other cadence and/or by some
> other process (nothing about it is inherently tied to vacuum). The
> cached translated value of target_freeze_duration is stored in each
> table's stats. This is arbitrary as it is not a table-level stat.
> However, it needs to be located somewhere that is accessible on
> update/delete. We may want to recalculate it more often than once per
> table vacuum, especially in case of long-running vacuums.

This part sounds like it isn't quite baked yet. The idea of the data
structure seems fine, but updating it once per vacuum sounds fairly
unprincipled to me? Don't we want the updates to happen on a somewhat
regular wall clock cadence?
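
For what it's worth, the interpolation step itself is simple; a hypothetical
sketch (not the patch's code, and ignoring the coarsening of older entries):

    /* assumes the usual PG typedefs: XLogRecPtr (uint64), TimestampTz (int64) */
    typedef struct LSNTimePair
    {
        XLogRecPtr  lsn;
        TimestampTz time;
    } LSNTimePair;

    /*
     * Estimate the LSN corresponding to "target", given two samples that
     * bracket it, by linear interpolation.
     */
    static XLogRecPtr
    interpolate_lsn(LSNTimePair older, LSNTimePair newer, TimestampTz target)
    {
        double      frac;

        if (newer.time == older.time)
            return newer.lsn;
        frac = (double) (target - older.time) /
               (double) (newer.time - older.time);
        return older.lsn + (XLogRecPtr) (frac * (double) (newer.lsn - older.lsn));
    }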

> To benchmark this new heuristic (I'm calling it algo 6 since it is the
> 6th I've proposed on this thread), I've used a modified subset of my
> original workloads:
>
> [ everything worked perfectly ]

Hard to beat that.

-- 

Re: Clean up find_typedefs and add support for Mac

2023-12-13 Thread Tristan Partin

On Tue Dec 12, 2023 at 5:02 PM CST, Tom Lane wrote:

"Tristan Partin"  writes:
> The big patch here is adding support for Mac. objdump -W doesn't work on 
> Mac. So, I used dsymutil and dwarfdump to achieve the same result.


We should probably nuke the current version of src/tools/find_typedef
altogether in favor of copying the current buildfarm code for that.
We know that the buildfarm's version works, whereas I'm not real sure
that src/tools/find_typedef is being used in anger by anyone.  Also,
we're long past the point where developers can avoid having Perl
installed.

Ideally, the buildfarm client would then start to use find_typedef
from the tree rather than have its own copy, both to reduce
duplication and to ensure that the in-tree copy keeps working.


That makes sense to me. Where can I find the buildfarm code to propose 
a different patch, at least pulling in the current state of the 
buildfarm script? Or perhaps Andrew is the best person for this job.


--
Tristan Partin
Neon (https://neon.tech)




Re: Built-in CTYPE provider

2023-12-13 Thread Jeff Davis
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> But there are CLDR mappings on top of that.

I see, thank you.

Would it still be called "full" case mapping to only use the mappings
in SpecialCasing.txt? And would that be useful?

Regards,
Jeff Davis





"pgoutput" options missing on documentation

2023-12-13 Thread Emre Hasegeli
I noticed that Logical Streaming Replication Protocol documentation
[1] is missing the options added to "pgoutput" since version 14.  A
patch is attached to fix it together with another small one to give a
nice error when "proto_version" parameter is not provided.

[1] https://www.postgresql.org/docs/devel/protocol-logical-replication.html
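
For reference, these are the options passed in the START_REPLICATION
command; an illustrative (made-up) invocation would be:

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '4', publication_names '"mypub"', binary 'true')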


v00-0001-pgoutput-Test-if-protocol_version-is-given.patch
Description: Binary data


v00-0002-pgoutput-Document-missing-options.patch
Description: Binary data


Re: Built-in CTYPE provider

2023-12-13 Thread Daniel Verite
Jeff Davis wrote:

> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that.

According to the Unicode FAQ

   https://unicode.org/faq/casemap_charprop.html#5

   Q: Does the default case mapping work for every language? What
   about the default case folding?

   [...]

   To make case mapping language sensitive, the Unicode Standard
   specifically allows implementations to tailor the mappings for
   each language, but does not provide the necessary data. The file
   SpecialCasing.txt is included in the Standard as a guide to a few
   of the more important individual character mappings needed for
   specific languages, notably the Greek script and the Turkic
   languages. However, for most language-specific mappings and
   tailoring, users should refer to CLDR and other resources.

In particular "el" (modern greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.

The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in *-Lower.xml and *-Upper.xml


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




AW: Building PosgresSQL with LLVM fails on Solaris 11.4

2023-12-13 Thread Sacha Hottinger
Hi Andres

Thanks for your reply.
The reason I was suspicious about the warnings of the gcc build was that
gmake check reported 138 out of 202 tests as failed. I have attached the
output of gmake check.

After you mentioned that gcc did not report any errors, just warnings, we
installed the build.
At first it seemed to work and SELECT pg_jit_available(); showed
"pg_jit_available" as "t", but the DB showed strange behaviour. I.e. not always,
but sometimes running "show parallel_tuple_cost" caused postmaster to restart a
server process.
We had to go back to the previous installation.

It seems there is definitely something wrong with the result gcc created.

Best regards
Sacha

Von: Andres Freund 
Datum: Donnerstag, 7. Dezember 2023 um 17:50
An: Sacha Hottinger 
Cc: pgsql-hack...@postgresql.org 
Betreff: Re: Building PosgresSQL with LLVM fails on Solaris 11.4
Hi,

On 2023-12-07 13:43:55 +, Sacha Hottinger wrote:
> Thanks a lot.
> It now got much further but failed here with Sun Studio:
> …
> gmake[2]: Leaving directory 
> '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src/test/perl'
> gmake -C backend/jit/llvm all
> gmake[2]: Entering directory 
> '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src/backend/jit/llvm'
> /opt/developerstudio12.6/bin/cc -m64 -xarch=native -Xa -v -O  -KPIC 
> -D__STDC_LIMIT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_CONSTANT_MACROS 
> -I/usr/include  -I../../../../src/include-c -o llvmjit.o llvmjit.c
> "llvmjit.c", line 493: warning: argument #1 is incompatible with prototype:
> prototype: pointer to void : 
> "../../../../src/include/jit/llvmjit_emit.h", line 27
> argument : pointer to function(pointer to struct 
> FunctionCallInfoBaseData {pointer to struct FmgrInfo {..} flinfo, pointer to 
> struct Node {..} context, pointer to struct Node {..} resultinfo, unsigned 
> int fncollation, _Bool isnull, short nargs, array[-1] of struct NullableDatum 
> {..} args}) returning unsigned long
> g++ -O -std=c++14 -KPIC -D__STDC_LIMIT_MACROS -D__STDC_FORMAT_MACROS 
> -D__STDC_CONSTANT_MACROS -I/usr/include  -I../../../../src/include-c -o 
> llvmjit_error.o llvmjit_error.cpp
> g++: error: unrecognized command-line option ‘-KPIC’; did you mean ‘-fPIC’?
> gmake[2]: *** [: llvmjit_error.o] Error 1
> gmake[2]: Leaving directory 
> '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src/backend/jit/llvm'
> gmake[1]: *** [Makefile:42: all-backend/jit/llvm-recurse] Error 2
> gmake[1]: Leaving directory '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src'
> gmake: *** [GNUmakefile:11: all-src-recurse] Error 2

I don't know where the -KPIC is coming from. And TBH, I don't see much point
trying to fix a scenario involving matching sun studio C with g++.


> With ggc it fails at the same step as before.
> I have attached the log files of the SunStudio and gcc runs to the email.

I don't see a failure with gcc.

The warnings are emitted for every extension and compilation succeeds.

Greetings,

Andres Freund





gcc_gmake_check.log
Description: gcc_gmake_check.log


Re: Fix bug with indexes on whole-row expressions

2023-12-13 Thread Tom Lane
ywgrit  writes:
> I forbid to create indexes on whole-row expression in the following patch.
> I'd like to hear your opinions.

As I said in the previous thread, I don't think this can possibly
be acceptable.  Surely there are people depending on the capability.
I'm not worried so much about the exact case of an index column
being a whole-row Var --- I agree that that's pretty useless ---
but an index column that is a function on a whole-row Var seems
quite useful.  (Your patch fails to detect that, BTW, which means
it does not block the case presented in bug #18244.)
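
For instance (hypothetical, untested examples just to illustrate the
distinction):

    CREATE TABLE t (a int, b text);
    -- the pretty-useless case: the index column is the whole-row Var itself
    CREATE INDEX t_whole_row ON t ((t.*));
    -- the useful case: an expression over a whole-row Var
    CREATE INDEX t_row_md5 ON t ((md5(t::text)));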

I thought about extending the ALTER TABLE logic to disallow changes
in composite types that appear in index expressions.  We already have
find_composite_type_dependencies(), and it turns out that this already
blocks ALTER for the case you want to forbid, but we concluded that we
didn't need to prevent it for the bug #18244 case:

 * If objsubid identifies a specific column, refer to that in error
 * messages.  Otherwise, search to see if there's a user column of the
 * type.  (We assume system columns are never of interesting types.)
 * The search is needed because an index containing an expression
 * column of the target type will just be recorded as a whole-relation
 * dependency.  If we do not find a column of the type, the dependency
 * must indicate that the type is transiently referenced in an index
 * expression but not stored on disk, which we assume is OK, just as
 * we do for references in views.  (It could also be that the target
 * type is embedded in some container type that is stored in an index
 * column, but the previous recursion should catch such cases.)

Perhaps a reasonable answer would be to issue a WARNING (not error)
in the case where an index has this kind of dependency.  The index
might need to be reindexed --- but it might not, too, and in any case
I doubt that flat-out forbidding the ALTER is a helpful idea.

regards, tom lane




Re: Subsequent request from pg client

2023-12-13 Thread Tom Lane
Abdul Matin  writes:
> Can postgres client send subsequent requests without receiving response? If
> so, how does the postgres client correlate between a request and its
> response? Where can I get more hints about it?

https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-PIPELINING

See also the pipelining-related functions in recent libpq versions.
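
A minimal sketch of those calls (error handling omitted; only meant to show
the flow):

    #include <libpq-fe.h>

    /* send two queries without waiting for their results in between */
    static void
    pipeline_demo(PGconn *conn)
    {
        PQenterPipelineMode(conn);
        PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
        PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
        PQpipelineSync(conn);       /* ends the batch and flushes it to the server */

        /* results come back in the same order the queries were sent */
        for (;;)
        {
            PGresult   *res = PQgetResult(conn);

            if (res == NULL)
                continue;           /* end of one query's results */
            if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
            {
                PQclear(res);
                break;              /* reached the synchronization point */
            }
            /* ... examine res here ... */
            PQclear(res);
        }
        PQexitPipelineMode(conn);
    }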

regards, tom lane




Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()

2023-12-13 Thread Michael Paquier
On Tue, Dec 12, 2023 at 04:54:32PM -0300, Euler Taveira wrote:
> Couldn't it give up before starting if you apply your patch? My main concern 
> is
> due to a slow system, the walrcv_connect() took too long in WalReceiverMain()
> and the code above kills the walreceiver while in the process to start it.
> Since you cannot control the hardcoded WALRCV_STARTUP_TIMEOUT value, you might
> have issues during overload periods.

Sounds like a fair point to me; this area is trickier than it looks.
Another thing that I'm a bit surprised by is why it would be OK to
switch the status to STREAMING only when first_stream is set, discarding
the restart case.
--
Michael




Re: Remove MSVC scripts from the tree

2023-12-13 Thread Michael Paquier
On Tue, Dec 05, 2023 at 07:29:59AM +0900, Michael Paquier wrote:
> Okay.  Thanks for the update.

While in Prague, Andres and Peter E. have mentioned to me that we perhaps
had better move on with this patch sooner rather than later, without waiting
for the two buildfarm members to do the switch, because much more
cleanup is required for the documentation once the scripts are
removed.

So, any objections to the patch as presented, which removes the scripts
while moving the existing doc blocks from install-windows.sgml that
still need more discussion?
--
Michael




Re: Built-in CTYPE provider

2023-12-13 Thread Jeff Davis
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
> My biggest concern is around maintenance. Every year Unicode is
> assigning new characters to existing code points, and those existing
> code points can of course already be stored in old databases before
> libs
> are updated.

Is the concern only about unassigned code points?

I already committed a function "unicode_assigned()" to test whether a
string contains only assigned code points, which can be used in a
CHECK() constraint. I also posted[5] an idea about a per-database
option that could reject the storage of any unassigned code point,
which would make it easier for users highly concerned about
compatibility.
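
For example, something along these lines (table and column names are
illustrative only):

    CREATE TABLE msgs (
        body text CHECK (unicode_assigned(body))
    );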

> And we may end up with
> something like the timezone database where we need to periodically
> add a
> more current ruleset - albeit alongside as a new version in this
> case.

There's a build target "update-unicode" which is run to pull in new
Unicode data files and parse them into static C arrays (we already do
this for the Unicode normalization tables). So I agree that the tables
should be updated but I don't understand why that's a problem.

> If I'm reading the Unicode 15 update correctly, PostgreSQL regex
> expressions with [[:digit:]] will not correctly identify Kaktovik or
> Nag
> Mundari or Kawi digits without that update to character type specs.

Yeah, if we are behind in the Unicode version, then results won't be
the most up-to-date. But ICU or libc could also be behind in the
Unicode version.

> But lets remember that people like to build indexes on character
> classification functions like upper/lower, for case insensitive
> searching.

UPPER()/LOWER() are based on case mapping, not character
classification.

I intend to introduce a SQL-level CASEFOLD() function that would obey
Unicode casefolding rules, which have very strong compatibility
guarantees[6] (essentially, if you are only using assigned code points,
you are fine).

>  It's another case where the index will be corrupted if
> someone happened to store Latin Glottal vowels in their database and
> then we update libs to the latest character type rules.

I don't agree with this characterization at all.

  (a) It's not "another case". Corruption of an index on LOWER() can
happen today. My proposal makes the situation better, not worse.
  (b) These aren't libraries, I am proposing built-in Unicode tables
that only get updated in a new major PG version.
  (c) It likely only affects a small number of indexes and it's easier
for an administrator to guess which ones might be affected, making it
easier to just rebuild those indexes.
  (d) It's not a problem if you stick to assigned code points.

> So even with something as basic as character type, if we're going to
> do
> it right, we still need to either version it or definitively decide
> that
> we're not going to every support newly added Unicode characters like
> Latin Glottals.

If, by "version it", you mean "update the data tables in new Postgres
versions", then I agree. If you mean that one PG version would need to
support many versions of Unicode, I don't agree.

Regards,
Jeff Davis

[5]
https://postgr.es/m/c5e9dac884332824e0797937518da0b8766c1238.ca...@j-davis.com

[6] https://www.unicode.org/policies/stability_policy.html#Case_Folding





Re: trying again to get incremental backup

2023-12-13 Thread Robert Haas
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak
 wrote:
> > I can't exactly say that such a hint would be inaccurate, but I think
> > the impulse to add it here is misguided. One of my design goals for
> > this system is to make it so that you never have to take a new
> > incremental backup "just because,"
>
> Did you mean take a new full backup here?

Yes, apologies for the typo.

> > not even in case of an intervening
> > timeline switch. So, all of the errors in this function are warning
> > you that you've done something that you really should not have done.
> > In this particular case, you've either (1) manually removed the
> > timeline history file, and not just any timeline history file but the
> > one for a timeline for a backup that you still intend to use as the
> > basis for taking an incremental backup or (2) tried to use a full
> > backup taken from one server as the basis for an incremental backup on
> > a completely different server that happens to share the same system
> > identifier, e.g. because you promoted two standbys derived from the
> > same original primary and then tried to use a full backup taken on one
> > as the basis for an incremental backup taken on the other.
> >
>
> Okay, but please consider two other possibilities:
>
> (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> and some cronjob just a day later attempted to take incremental and
> failed with that error.
>
> (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> directory) the DB where I had cronjob that just failed with this error
>
> I bet that (4) is going to happen more often than (1), (2) , which
> might trigger users to complain on forums, support tickets.

Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.

In case (3), I think you would get an error about missing WAL summary files.

> > Huzzah, the cfbot likes the patch set now. Here's a new version with
> > the promised fix for your non-reproducible issue. Let's see whether
> > you and cfbot still like this version.
>
> LGTM, all quick tests work from my end too. BTW: I have also scheduled
> the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> test. I'll let you know how it went.

Awesome, thank you so much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Amit Kapila
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar  wrote:
>
> On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra
>  wrote:
> >
> > Yes, if something like this happens, that'd be a problem:
> >
> > 1) decoding starts, with
> >
> >SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
> >
> > 2) transaction that creates a new refilenode gets decoded, but we skip
> >it because we don't have the correct snapshot
> >
> > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
> >
> > 4) we decode sequence change from nextval() for the sequence
> >
> > This would lead to us attempting to apply sequence change for a
> > relfilenode that's not visible yet (and may even get aborted).
> >
> > But can this even happen? Can we start decoding in the middle of a
> > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > messages, where we also call the output plugin in non-transactional cases.
>
> It's not a problem for logical messages because whether the message is
> transaction or non-transactional is decided while WAL logs the message
> itself.  But here our problem starts with deciding whether the change
> is transactional vs non-transactional, because if we insert the
> 'relfilenode' in hash then the subsequent sequence change in the same
> transaction would be considered transactional otherwise
> non-transactional.
>

It is correct that we can make a wrong decision about whether a change
is transactional or non-transactional when sequence DDL happens before
the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
after that state. However, one thing to note here is that we won't try
to stream such a change because for non-transactional cases we don't
proceed unless the snapshot is in a consistent state. Now, if the
decision had been correct then we would probably have queued the
sequence change and discarded it at commit.

One thing where we deviate here is that for non-sequence transactional
cases (including logical messages), we immediately start queuing the
changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
SnapBuildProcessChange() returns true which is quite possible) and
take final decision at commit/prepare/abort time. However, that won't
be the case for sequences because of the dependency of determining
transactional cases on one of the prior records. Now, I am not
completely sure at this stage if such a deviation can cause any
problem and or whether we are okay to have such a deviation for
sequences.

-- 
With Regards,
Amit Kapila.




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-13 Thread Andrey M. Borodin



> On 12 Dec 2023, at 18:28, Alvaro Herrera  wrote:
> 
> Andrey, do you have any stress tests or anything else that you used to
> gain confidence in this code?

We are using only the first two steps of the patchset; these steps do not touch
the locking stuff.

We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks!


Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3





Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-13 Thread Masahiko Sawada
On Tue, Dec 12, 2023 at 11:09 AM Junwang Zhao  wrote:
>
> On Mon, Dec 11, 2023 at 10:32 PM Masahiko Sawada  
> wrote:
> >
> > On Mon, Dec 11, 2023 at 7:19 PM Michael Paquier  wrote:
> > >
> > > On Mon, Dec 11, 2023 at 10:57:15AM +0900, Masahiko Sawada wrote:
> > > > IIUC we cannot create two same name functions with the same arguments
> > > > but a different return value type in the first place. It seems to me
> > > > to be an overkill to change such a design.
> > >
> > > Agreed to not touch the logictics of LookupFuncName() for the sake of
> > > this thread.  I have not checked the SQL specification, but I recall
> > > that there are a few assumptions from the spec embedded in the lookup
> > > logic particularly when it comes to specify a procedure name without
> > > arguments.
> > >
> > > > Another idea is to encapsulate copy_to/from_handler by a super class
> > > > like copy_handler. The handler function is called with an argument,
> > > > say copyto, and returns copy_handler encapsulating either
> > > > copy_to/from_handler depending on the argument.
> > >
> > > Yep, that's possible as well and can work as a cross-check between the
> > > argument and the NodeTag assigned to the handler structure returned by
> > > the function.
> > >
> > > At the end, the final result of the patch should IMO include:
> > > - Documentation about how one can register a custom copy_handler.
> > > - Something in src/test/modules/, minimalistic still useful that can
> > > be used as a template when one wants to implement their own handler.
> > > The documentation should mention about this module.
> > > - No need for SQL functions for all the in-core handlers: let's just
> > > return pointers to them based on the options given.
> >
> > Agreed.
> >
> > > It would be probably cleaner to split the patch so as the code is
> > > refactored and evaluated with the in-core handlers first, and then
> > > extended with the pluggable facilities and the function lookups.
> >
> > Agreed.
> >
> > I've sketched the above idea including a test module in
> > src/test/module/test_copy_format, based on v2 patch. It's not splitted
> > and is dirty so just for discussion.
> >
> The test_copy_format extension doesn't use the fields of CopyToState and
> CopyFromState in this patch, I think we should move CopyFromStateData
> and CopyToStateData to commands/copy.h, what do you think?

Yes, I basically agree with that; where we move CopyFromStateData to
might depend on how we define the COPY FROM APIs, though. I think we can
move CopyToStateData to copy.h at least.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Some useless includes in llvmjit_inline.cpp

2023-12-13 Thread Shubham Khanna
On Mon, Dec 11, 2023 at 6:28 AM Thomas Munro  wrote:
>
> Hi,
>
> This is not exhaustive, I just noticed in passing that we don't need these.

I was able to compile the changes with "--with-llvm" successfully, and
the changes look good to me.

Thanks and Regards,
Shubham Khanna.




Re: Report planning memory in EXPLAIN ANALYZE

2023-12-13 Thread Ashutosh Bapat
On Wed, Dec 13, 2023 at 1:41 AM Alvaro Herrera  wrote:
>
> I would replace the text in explain.sgml with this:
>
> +  Include information on memory consumption by the query planning phase.
> +  This includes the precise amount of storage used by planner in-memory
> +  structures, as well as overall total consumption of planner memory,
> +  including allocation overhead.
> +  This parameter defaults to FALSE.

To me, "consumption" in the "... total consumption ..." part is the same as
allocation, and that ultimately includes the allocation overhead as well as
any memory that was allocated but remained unused (see discussion [1] if you
haven't already). Mentioning both "total consumption" and
"allocation overhead" seems more confusing.

How about
+  Include information on memory consumption by the query planning phase.
+  Report the precise amount of storage used by planner in-memory
+  structures, and total memory considering allocation overhead.
+  The parameter defaults to FALSE.

Takes care of your complaint about multiple include/ing as well.

>
> and remove the new example you're adding; it's not really necessary.

Done.

>
> In struct ExplainState, I'd put the new member just below summary.

Done

>
> If we're creating a new memory context for planner, we don't need the
> "mem_counts_start" thing, and we don't need the
> MemoryContextSizeDifference thing.  Just add the
> MemoryContextMemConsumed() function (which ISTM should be just below
> MemoryContextMemAllocated() instead of in the middle of the
> MemoryContextStats implementation), and
> just report the values we get there.  I think the SizeDifference stuff
> increases surface damage for no good reason.

240 bytes are used right after creating a memory context, i.e.
mem_counts_start.totalspace = 8192 and mem_counts_start.freespace =
7952. To account for that I used two counters and calculated the
difference. If no planning is involved in EXECUTE,
it will show 240 bytes used instead of 0. Barring that, for all
practical purposes 240 bytes is negligible. E.g. 240 bytes is 4% of the
amount of memory used for planning a simple query " select 'x' ". But
your suggestion simplifies the code a lot. An error of 240 bytes seems
worth the code simplification, so I did it that way.

>
> ExplainOnePlan() is given a MemoryUsage object (or, if we do as above
> and no longer have a MemoryUsage struct at all in the first place, a
> MemoryContextCounters object) even when the MEMORY option is false.
> This means we waste computing memory usage when not needed.  Let's avoid
> that useless overhead.

Done.

Also avoided creating a memory context and switching to it when
es->memory = false.

>
> I'd also do away with the comment you added in explain_ExecutorEnd() and
> do just this, below setting of es->summary:
>
> +   /* No support for MEMORY option */
> +   /* es->memory = false; */

Done.

I ended up rewriting most of the code, so squashed everything into a
single patch as attached.

-- 
Best Wishes,
Ashutosh Bapat
From 8282b3347a8535cfe09b5f2f4a0c5a6bdcba11fa Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH] EXPLAIN reports memory consumed for planning a query

The memory consumed is reported as "Planner Memory" property in EXPLAIN
output when option MEMORY is specified.

A separate memory context is allocated for measuring memory consumption
to eliminate any effect of previous activity on the calculation.  The
memory context statistics is noted before and after calling
pg_plan_query() from ExplainOneQuery(). The difference in the two
statistics is used to calculate the memory consumption.

We report two values "used" and "allocated".  The difference between
memory statistics counters totalspace and freespace gives the memory
that remains palloc'ed at the end of planning. It is reported as memory
"used". This does not account for chunks of memory that were palloc'ed
but subsequently pfree'ed. But such memory remains allocated in the
memory context is given by the totalspace counter. The value of this
counter is reported as memory "allocated".

Ashutosh Bapat, reviewed by David Rowley and Alvaro Herrera
---
 contrib/auto_explain/auto_explain.c   |  2 +
 doc/src/sgml/ref/explain.sgml | 13 +
 src/backend/commands/explain.c| 67 +-
 src/backend/commands/prepare.c| 26 +-
 src/backend/utils/mmgr/mcxt.c | 13 +
 src/include/commands/explain.h|  4 +-
 src/include/utils/memutils.h  |  2 +
 src/test/regress/expected/explain.out | 68 +++
 src/test/regress/sql/explain.sql  |  6 +++
 9 files changed, 197 insertions(+), 4 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c3ac27ae99..26e80e4e16 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -396,6 +396,8 @@ explain_ExecutorEnd(QueryDesc 

Re: trying again to get incremental backup

2023-12-13 Thread Jakub Wartak
Hi Robert,

On Mon, Dec 11, 2023 at 6:08 PM Robert Haas  wrote:
>
> On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
>  wrote:
> > While we are at it, maybe around the below in PrepareForIncrementalBackup()
> >
> > if (tlep[i] == NULL)
> > ereport(ERROR,
> >
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("timeline %u found in
> > manifest, but not in this server's history",
> > range->tli)));
> >
> > we could add
> >
> > errhint("You might need to start a new full backup instead of
> > incremental one")
> >
> > ?
>
> I can't exactly say that such a hint would be inaccurate, but I think
> the impulse to add it here is misguided. One of my design goals for
> this system is to make it so that you never have to take a new
> incremental backup "just because,"

Did you mean take a new full backup here?

> not even in case of an intervening
> timeline switch. So, all of the errors in this function are warning
> you that you've done something that you really should not have done.
> In this particular case, you've either (1) manually removed the
> timeline history file, and not just any timeline history file but the
> one for a timeline for a backup that you still intend to use as the
> basis for taking an incremental backup or (2) tried to use a full
> backup taken from one server as the basis for an incremental backup on
> a completely different server that happens to share the same system
> identifier, e.g. because you promoted two standbys derived from the
> same original primary and then tried to use a full backup taken on one
> as the basis for an incremental backup taken on the other.
>

Okay, but please consider two other possibilities:

(3) I had a corrupted DB where I've fixed it by running pg_resetwal
and some cronjob just a day later attempted to take incremental and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
directory) the DB where I had cronjob that just failed with this error

I bet that (4) is going to happen more often than (1), (2) , which
might trigger users to complain on forums, support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> > >
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.

-J.




Re: Synchronizing slots from primary to standby

2023-12-13 Thread Peter Smith
Hi Shveta, here are some review comments for v45-0002.

==
doc/src/sgml/bgworker.sgml

1.
+   
+
+ BgWorkerStart_PostmasterStart
+ 
+  
+   BgWorkerStart_PostmasterStart
+   Start as soon as postgres itself has finished its own initialization;
+   processes requesting this are not eligible for database connections.
+  
+ 
+
+
+
+ BgWorkerStart_ConsistentState
+ 
+  
+   BgWorkerStart_ConsistentState
+   Start as soon as a consistent state has been reached in a hot-standby,
+   allowing processes to connect to databases and run read-only queries.
+  
+ 
+
+
+
+ BgWorkerStart_RecoveryFinished
+ 
+  
+   BgWorkerStart_RecoveryFinished
+   Start as soon as the system has entered normal read-write state. Note
+   that the BgWorkerStart_ConsistentState and
+  BgWorkerStart_RecoveryFinished are equivalent
+   in a server that's not a hot standby.
+  
+ 
+
+
+
+ BgWorkerStart_ConsistentState_HotStandby
+ 
+  
+   
BgWorkerStart_ConsistentState_HotStandby
+   Same meaning as BgWorkerStart_ConsistentState but
+   it is more strict in terms of the server i.e. start the worker only
+   if it is hot-standby.
+  
+ 
+
+   

Maybe reorder these slightly, because I felt it is better if the
BgWorkerStart_ConsistentState_HotStandby comes next after
BgWorkerStart_ConsistentState, which it refers to

For example::
1st.BgWorkerStart_PostmasterStart
2nd.BgWorkerStart_ConsistentState
3rd.BgWorkerStart_ConsistentState_HotStandby
4th.BgWorkerStart_RecoveryFinished

==
doc/src/sgml/config.sgml

2.
enable_syncslot = true

Not sure, but I thought the "= true" part should be formatted too.

SUGGESTION
enable_syncslot = true

==
doc/src/sgml/logicaldecoding.sgml

3.
+
+ A logical replication slot on the primary can be synchronized to the hot
+ standby by enabling the failover option during slot creation and setting
+ enable_syncslot on the standby. For the synchronization
+ to work, it is mandatory to have a physical replication slot between the
+ primary and the standby. It's highly recommended that the said physical
+ replication slot is listed in standby_slot_names on
+ the primary to prevent the subscriber from consuming changes faster than
+ the hot standby. Additionally, hot_standby_feedback
+ must be enabled on the standby for the slots synchronization to work.
+

I felt those parts that describe the mandatory GUCs should be kept together.

SUGGESTION
For the synchronization to work, it is mandatory to have a physical
replication slot between the primary and the standby, and
hot_standby_feedback must be enabled on the
standby.

It's also highly recommended that the said physical replication slot
is named in standby_slot_names list on the primary,
to prevent the subscriber from consuming changes faster than the hot
standby.

~~~

4. (Chapter 49)

By enabling synchronization of slots, logical replication can be
resumed after failover depending upon the
pg_replication_slots.sync_state for the synchronized slots on the
standby at the time of failover. Only slots that were in ready
sync_state ('r') on the standby before failover can be used for
logical replication after failover. However, the slots which were in
initiated sync_state ('i') and not sync-ready ('r') at the time of
failover will be dropped and logical replication for such slots can
not be resumed after failover. This applies to the case where a
logical subscription is disabled before failover and is enabled after
failover. If the synchronized slot due to disabled subscription could
not be made sync-ready ('r') on standby, then the subscription can not
be resumed after failover even when enabled. If the primary is idle,
then the synchronized slots on the standby may take a noticeable time
to reach the ready ('r') sync_state. This can be sped up by calling
the pg_log_standby_snapshot function on the primary.

~

Somehow, I still felt all that was too wordy/repetitive. Below is my
attempt to make it more concise. Thoughts?

SUGGESTION
The ability to resume logical replication after failover depends upon
the pg_replication_slots.sync_state value for the synchronized slots
on the standby at the time of failover. Only slots that have attained
a "ready" sync_state ('r') on the standby before failover can be used
for logical replication after failover. Slots that have not yet
reached 'r' state (they are still 'i') will be dropped, therefore
logical replication for those slots cannot be resumed. For example, if
the synchronized slot could not become sync-ready on standby due to a
disabled subscription, then the subscription cannot be resumed after
failover even when it is enabled.

If the primary is idle, the synchronized slots on the standby may take
a noticeable time to reach the ready ('r') sync_state. This can be
sped up by calling the 

Re: remaining sql/json patches

2023-12-13 Thread jian he
Hi. small issues I found...

typo:
+-- Test mutabilily od query functions

+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only datetime, bool, numeric, and text types can be casted
to jsonpath types")));

In transformJsonPassingArgs, the call to transformJsonValueExpr will make
the above code unreachable.
Also, based on the `switch (typid)` cases,
I guess the best message would be
errmsg("only datetime, bool, numeric, text, json, jsonb types can be
casted to jsonpath types")));

+ case JSON_QUERY_OP:
+ jsexpr->wrapper = func->wrapper;
+ jsexpr->omit_quotes = (func->quotes == JS_QUOTES_OMIT);
+
+ if (!OidIsValid(jsexpr->returning->typid))
+ {
+ JsonReturning *ret = jsexpr->returning;
+
+ ret->typid = JsonFuncExprDefaultReturnType(jsexpr);
+ ret->typmod = -1;
+ }
+ jsexpr->result_coercion = coerceJsonFuncExprOutput(pstate, jsexpr);

I noticed that if (!OidIsValid(jsexpr->returning->typid)) is true, the
function JsonFuncExprDefaultReturnType may be called twice; not sure
if that's good or not.




Re: brininsert optimization opportunity

2023-12-13 Thread Soumyadeep Chakraborty
On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra
 wrote:


> The attached 0001 patch fixes this by adding the call to validate_index,
which seems like the proper place as it's where the indexInfo is
allocated and destroyed.

Yes, and by reading validate_index's header comment, there is a clear
expectation that we will be adding tuples to the index in the table AM call
table_index_validate_scan (albeit "validating" doesn't necessarily convey this
side effect). So, it makes perfect sense to call it here.


> But this makes me think - are there any other places that might call
> index_insert without then also doing the cleanup? I did grep for the
> index_insert() calls, and it seems OK except for this one.
>
> There's a call in toast_internals.c, but that seems OK because that only
> deals with btree indexes (and those don't need any cleanup). The same
> logic applies to unique_key_recheck(). The rest goes through
> execIndexing.c, which should do the cleanup in ExecCloseIndices().
>
> Note: We should probably call the cleanup even in the btree cases, even
> if only to make it clear it needs to be called after index_insert().

Agreed. Doesn't feel great, but yeah, all of this btree-specific code does call
through the index_* functions instead of calling btree functions directly. So,
ideally we should follow through with that pattern and call the cleanup
explicitly. But we are introducing per-tuple overhead that is totally wasted.
Maybe we can add a comment instead like:

void
toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock)
{
int i;

/*
* Save a few cycles by choosing not to call index_insert_cleanup(). Toast
* indexes are btree, which don't have an aminsertcleanup() anyway.
*/

/* Close relations and clean up things */
...
}

And add something similar for unique_key_recheck()? That said, I am also not
opposed to just accepting these wasted cycles, if the commenting seems wonky.

> I propose we do a much simpler thing instead - allow the cache may be
> initialized / cleaned up repeatedly, and make sure it gets reset at
> convenient place (typically after index_insert calls that don't go
> through execIndexing). That'd mean the cleanup does not need to happen
> very far from the index_insert(), which makes the reasoning much easier.
> 0002 does this.

That kind of goes against the ethos of the ii_AmCache, which is meant to capture
state to be used across multiple index inserts. Also, quoting the current docs:

"If the index AM wishes to cache data across successive index insertions
within an SQL statement, it can allocate space
in indexInfo-ii_Context and store a pointer to the
data in indexInfo-ii_AmCache (which will be NULL
initially). After the index insertions complete, the memory will be freed
automatically. If additional cleanup is required (e.g. if the cache contains
buffers and tuple descriptors), the AM may define an optional function
indexinsertcleanup, called before the memory is released."

The memory will be freed automatically (as soon as ii_Context goes away). So,
why would we explicitly free it, like in the attached 0002 patch? And the whole
point of the cleanup function is to do something other than freeing memory, as
the docs note highlights so well.

Also, the 0002 patch does introduce per-tuple function call overhead in
heapam_index_validate_scan().

Also, now we have split brain...in certain situations we want to call
index_insert_cleanup() per index insert and in certain cases we would like it
called once for all inserts. Not very easy to understand IMO.

Finally, I don't think that any index AM would have anything to clean up after
every insert.

I had tried (abused) an approach with MemoryContextCallbacks upthread and that
seems a no-go. And yes I agree, having a dual to makeIndexInfo() (like
destroyIndexInfo()) means that we lose the benefits of ii_Context. That could
be disruptive to existing AMs in-core and outside.

All said and done, v1 has my vote.

Regards,
Soumyadeep (VMware)