Re: Synchronizing slots from primary to standby

2024-01-10 Thread Bertrand Drouvot
Hi,

On Wed, Jan 10, 2024 at 12:23:14PM +, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, January 10, 2024 2:26 PM Dilip Kumar  
> wrote:
> > 
> > + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
> > + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
> > +remote_slot->catalog_xmin);
> > + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
> > +   remote_slot->restart_lsn);
> > +}
> > 
> > IIUC, on the standby we just want to overwrite what we get from the
> > primary, no? If so, why are we using those APIs that are meant for the
> > actual decoding slots, where they need to take certain logical decisions,
> > instead of merely overwriting?
> 
> I think we don't have a strong reason to use these APIs, but it was
> convenient to use them as they take care of updating the slot's info and
> will call functions like ReplicationSlotsComputeRequiredXmin and
> ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly
> overwriting the fields and calling these manually?

I'd vote for using the APIs, as I think it will be harder to maintain if we
are not using them (meaning we would have to keep ensuring that the "direct"
overwriting still makes sense over time).

FWIW, pg_failover_slots also relies on those APIs from what I can see.
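
For readers skimming the thread, here is a rough sketch (not the patch's
code) of the "direct overwrite" alternative being discussed; the remote_slot
structure is hypothetical, and the point above is that the existing APIs hide
these details:

/* Sketch only; RemoteSlotData is a hypothetical type for illustration. */
static void
overwrite_slot_directly(RemoteSlotData *remote_slot)
{
	ReplicationSlot *slot = MyReplicationSlot;

	SpinLockAcquire(&slot->mutex);
	slot->data.confirmed_flush = remote_slot->confirmed_lsn;
	slot->data.restart_lsn = remote_slot->restart_lsn;
	slot->data.catalog_xmin = remote_slot->catalog_xmin;
	slot->effective_catalog_xmin = remote_slot->catalog_xmin;
	SpinLockRelease(&slot->mutex);

	/* ...and these would then have to be called manually, as noted above. */
	ReplicationSlotsComputeRequiredXmin(false);
	ReplicationSlotsComputeRequiredLSN();
}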

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Make message at end-of-recovery less scary.

2024-01-10 Thread Kyotaro Horiguchi
At Fri, 5 Jan 2024 16:02:24 +0530, vignesh C  wrote in 
> On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi  
> wrote:
> >
> > Anyway, this requires rebasing, and done.
> 
> A few tests are failing at [1], kindly post an updated patch:

Thanks!

The errors occurred in a part of the tests for end-of-WAL detection
added in the master branch. These failures were primarily due to
changes in the message contents introduced by this patch. During the
revision, I discovered an issue with the handling of empty pages that
appear in the middle of reading continuation records. In the previous
version, such empty pages were mistakenly identified as indicating a
clean end-of-WAL (that is, a LOG). However, they should actually be
handled as a WARNING, since the record currently being read is broken
at the empty pages. The following changes have been made in this
version:

1. Adjusting the test to align with the error message changes
  introduced by this patch.

2. Adding tests for the newly added messages.

3. Correcting the handling of empty pages encountered during the
  reading of continuation records. (XLogReaderValidatePageHeader)

4. Revising code comments.

5. Changing the term "log segment" to "WAL
  segment". (XLogReaderValidatePageHeader)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 47cc39a212d7fd6857f30c35c76bcdd0d26bbc3f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v28] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary, as "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks if all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 147 ++
 src/backend/access/transam/xlogrecovery.c |  96 ++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 +-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/035_recovery.pl   | 130 +++
 src/test/recovery/t/039_end_of_wal.pl |  47 +--
 7 files changed, 380 insertions(+), 61 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..94861969eb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -553,6 +556,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -636,16 +640,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -664,19 +664,19 @@ restart:
 	}
 	else
 	{
-		/* There may be no next page if it's too small. */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-  "invalid record length at %X/%X: expected at least %u, got %u",
-  LSN_FORMAT_ARGS(RecPtr),
-  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot 

Re: speed up a logical replica setup

2024-01-10 Thread Amit Kapila
On Thu, Jan 11, 2024 at 7:59 AM Euler Taveira  wrote:
>
> On Wed, Jan 10, 2024, at 1:33 AM, Shlok Kyal wrote:
>
> Here the standby node would be waiting for the 'consistent_lsn' WAL
> during recovery, but this WAL will not be present on the standby if no
> physical replication is set up. Hence the command will be waiting
> indefinitely for the WAL.
>
>
> Hmm. Some validations are missing.
>
> To solve this, I added a timeout of 60s for the recovery process and also
> added a check so that pg_subscriber gives an error when it is called
> for a node which is not in physical replication.
> I have attached a patch for the same. It is a top-up patch on the
> patch shared by Euler at [1].
>
>
> If the user has a node that is not a standby and does not set the GUCs to
> start the recovery process from a backup, the initial setup is broken.
> (That's the case you described.) A good UI would detect this scenario
> earlier. Unfortunately, there isn't a reliable and cheap way to do it. You
> need to start the recovery and check whether it is making progress. (I
> don't have a strong opinion about requiring a standby to use this tool. It
> would reduce the complexity of checking whether the setup has all
> requirements to run this tool.)
>

Right, such a check will reduce some complexity. So, +1 for the check
as proposed by Shlok. Also, what are your thoughts on a timeout during
the wait? I think it is okay to wait for 60s by default but there
should be an option for users to wait for longer.

-- 
With Regards,
Amit Kapila.




Re: A recent message added to pg_upgrade

2024-01-10 Thread Michael Paquier
On Thu, Jan 11, 2024 at 11:25:44AM +0530, Amit Kapila wrote:
> On Thu, Jan 11, 2024 at 9:08 AM Michael Paquier  wrote:
>> Hence, how about something like that:
>> "The logical replication launcher is disabled during binary upgrades,
>> as a logical replication workers running on the cluster upgrading from
>> could cause replication origins to move forward after they are copied
>> to the cluster upgrading to, creating potentially conflicts with the
>> physical files copied over."
> 
> Looks better. One minor nitpick: /potentially/potential

Sure, WFM.  Let's wait a bit and see if others have more comments to
offer.
--
Michael




Re: A recent message added to pg_upgrade

2024-01-10 Thread Amit Kapila
On Thu, Jan 11, 2024 at 9:08 AM Michael Paquier  wrote:
>
> On Wed, Jan 10, 2024 at 06:02:12PM +0530, Amit Kapila wrote:
> > - if (max_logical_replication_workers == 0)
> > + /*
> > + * The logical replication launcher is disabled during binary upgrades,
> > + * as logical replication workers can stream data during the upgrade
> > + * which can cause replication origins to move forward after we have
> > + * copied them.  It can cause the system to request the data which is
> > + * already present in the new cluster.
> > + */
> > + if (max_logical_replication_workers == 0 || IsBinaryUpgrade)
> >
> > This comment is not very clear to me. The first part of the sentence
> > can't apply to the new cluster as after the upgrade, subscriptions
> > will be disabled and the second part talks about requesting the wrong
> > data in the new cluster. As per my understanding, the problem here is
> > that, on the old cluster, the origins may move forward after we copy
> > them and then we copy physical files. Now, in the new cluster when we
> > try to request the data, it will be already present.
>
> As far as I understand your complaint is about being more precise
> about where the workers could run when we do an upgrade.  My patch
> covers the reason why it would be a problem, and I agree that it could
> be more detailed.
>
> Hence, how about something like that:
> "The logical replication launcher is disabled during binary upgrades,
> as a logical replication workers running on the cluster upgrading from
> could cause replication origins to move forward after they are copied
> to the cluster upgrading to, creating potentially conflicts with the
> physical files copied over."
>

Looks better. One minor nitpick: /potentially/potential

-- 
With Regards,
Amit Kapila.




Re: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-10 Thread Amit Kapila
On Wed, Jan 10, 2024 at 2:08 PM Bharath Rupireddy
 wrote:
>
> I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> of late on my developer system:
>
> t/038_save_logical_slots_shutdown.pl .. 1/?
> #   Failed test 'Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location'
> #   at t/038_save_logical_slots_shutdown.pl line 35.
> # Looks like you failed 1 test of 2.
> t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1
> (wstat 256, 0x100)
> Failed 1/2 subtests
>
> I did a quick analysis of the failure and commit
> https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0
> that introduced this test. I think the issue is that the slot's
> confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN
> (0/1508018) are not the same:
>
> tmp_check/log/038_save_logical_slots_shutdown_pub.log:
>
> 2024-01-10 07:55:44.539 UTC [57621] sub LOG:  starting logical
> decoding for slot "sub"
> 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL:  Streaming
> transactions committing after 0/1508000, reading WAL from 0/1507FC8.
> 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT:  START_REPLICATION
> SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any',
> publication_names '"pub"')
>
> ubuntu:~/postgres$ pg17/bin/pg_controldata -D
> src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/
> Database cluster state:   in production
> pg_control last modified: Wed Jan 10 07:55:44 2024
> Latest checkpoint location:   0/1508018
> Latest checkpoint's REDO location:0/1508018
>
> But the tests added by t/038_save_logical_slots_shutdown.pl expects
> both LSNs to be same:
>
> sub compare_confirmed_flush
> {
> # Is it same as the value read from log?
> ok( $latest_checkpoint eq $confirmed_flush_from_log,
> "Check that the slot's confirmed_flush LSN is the same as the
> latest_checkpoint location"
> );
>
> I suspect that it's not quite right to expect the slot's
> confirmed_flush and latest checkpoint location to be the same in the test.
>

As per my understanding, the reason we expect them to be the same is
that we ensure that during shutdown the walsender sends all the WAL
just before the shutdown_checkpoint, and confirmed_flush also points
to the end of the WAL record before the shutdown_checkpoint. So the
next starting location should be that of the shutdown_checkpoint
record, which should ideally be the same. Do you see this failure
consistently?

-- 
With Regards,
Amit Kapila.




Re: introduce dynamic shared memory registry

2024-01-10 Thread Bharath Rupireddy
On Thu, Jan 11, 2024 at 10:42 AM Michael Paquier  wrote:
>
> >> 3. IIUC, this feature eventually makes both shmem_request_hook and
> >> shmem_startup_hook pointless, no? Or put another way, what's the
> >> significance of shmem request and startup hooks in lieu of this new
> >> feature? I think it's quite possible to get rid of the shmem request
> >> and startup hooks (of course, not now but at some point in future to
> >> not break the external modules), because all the external modules can
> >> allocate and initialize the same shared memory via
> >> dsm_registry_init_or_attach and its init_callback. All the external
> >> modules will then need to call dsm_registry_init_or_attach in their
> >> _PG_init callbacks and/or in their bg worker's main functions in case
> >> the modules intend to start up bg workers. Am I right?
> >
> > Well, modules might need to do a number of other things (e.g., adding
> > hooks) that can presently only be done when preloaded, in which case I
> > doubt there's much benefit from switching to the DSM registry.  I don't
> > really intend for it to replace the existing request/startup hooks, but
> > you're probably right that most, if not all, could use the registry
> > instead.  IMHO this is well beyond the scope of this thread, though.
>
> Even if that's not in the scope of this thread, just removing these
> hooks would break a lot of out-of-core things, and they still have a
> lot of value when extensions expect to always be loaded via
> shared_preload_libraries.  They don't cost much in maintenance at this stage.

Adding some notes in the docs on when exactly one needs to use shmem
hooks and the named DSM segments can help greatly.
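
For readers not following the other thread, a rough sketch of how an
extension might use the registry in the way the quoted discussion describes;
the dsm_registry_init_or_attach() signature is assumed for illustration and
may not match the actual patch:

/* Sketch only; the registry function's signature is an assumption here. */
typedef struct MySharedState
{
	int			counter;
} MySharedState;

static MySharedState *my_state = NULL;

static void
my_state_init(void *ptr)
{
	((MySharedState *) ptr)->counter = 0;	/* runs once, in the first attacher */
}

void
_PG_init(void)
{
	/* No shmem_request_hook/shmem_startup_hook needed for this allocation. */
	my_state = dsm_registry_init_or_attach("my_extension_state",
										   sizeof(MySharedState),
										   my_state_init);
}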

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-10 Thread Amit Kapila
On Wed, Jan 10, 2024 at 6:37 PM vignesh C  wrote:
>
> I got the log files from Bharath offline. Thanks Bharath for sharing
> the log files offline.
> The WAL record sequence is exactly the same in the failing test and in
> the tests which are passing.
> One observation: in our case the confirmed_flush LSN points exactly to
> the shutdown checkpoint, but in the failing test the LSN pointed to is
> invalid; pg_waldump says that address is invalid, skips about 24
> bytes, and then sees a valid record.
>
> Passing case confirm flush lsn(0/150D158) from my machine:
> pg_waldump 00010001 -s 0/150D158
> rmgr: XLOGlen (rec/tot):114/   114, tx:  0, lsn:
> 0/0150D158, prev 0/0150D120, desc: CHECKPOINT_SHUTDOWN redo 0/150D158;
> tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
> oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
> timestamp xid: 0/0; oldest running xid 0; shutdown
>
> Failing case confirm flush lsn( 0/1508000) from failing tests log file:
> pg_waldump 00010001 -s 0/1508000
> pg_waldump: first record is after 0/1508000, at 0/1508018, skipping
> over 24 bytes
> rmgr: XLOGlen (rec/tot):114/   114, tx:  0, lsn:
> 0/01508018, prev 0/01507FC8, desc: CHECKPOINT_SHUTDOWN redo 0/1508018;
> tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
> oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
> timestamp xid: 0/0; oldest running xid 0; shutdown
>
> I'm still not sure why, in this case, it is not pointing exactly to a
> valid WAL record; it has to skip 24 bytes to find the valid checkpoint
> shutdown record.
>

Can we see the previous record (as pointed to by prev in the WAL
record) in both cases? Also, you could look at a few prior records in
both cases.

-- 
With Regards,
Amit Kapila.




Introduce XID age and inactive timeout based replication slot invalidation

2024-01-10 Thread Bharath Rupireddy
Hi,

Replication slots in Postgres prevent the removal of required
resources even when there is no connection using them (i.e., they are
inactive). This consumes storage because neither required WAL nor
required rows from the user tables/system catalogs can be removed by
VACUUM as long as they are required by a replication slot. In extreme
cases this could cause transaction ID wraparound.

Currently Postgres has the ability to invalidate inactive replication
slots based on the amount of WAL (set via the max_slot_wal_keep_size
GUC) that would be needed for the slots in case they become active.
However, the wraparound issue isn't effectively covered by
max_slot_wal_keep_size - one can't tell Postgres to invalidate a
replication slot if it is blocking VACUUM. Also, it is often tricky to
choose a suitable value for max_slot_wal_keep_size, because the amount
of WAL that gets generated, and the storage allocated for the
database, can vary.

Therefore, it is often easier for developers to do the following:
a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5
billion, after which the slots get invalidated.
b) set a timeout of say 1 or 2 or 3 days, after which the inactive
slots get invalidated.

To implement (a), postgres needs a new GUC called max_slot_xid_age.
The checkpointer then invalidates all the slots whose xmin (the oldest
transaction that this slot needs the database to retain) or
catalog_xmin (the oldest transaction affecting the system catalogs
that this slot needs the database to retain) has reached the age
specified by this setting.

To implement (b), first postgres needs to track the replication slot
metrics like the time at which the slot became inactive (inactive_at
timestamptz) and the total number of times the slot became inactive in
its lifetime (inactive_count numeric) in ReplicationSlotPersistentData
structure. And, then it needs a new timeout GUC called
inactive_replication_slot_timeout. Whenever a slot becomes inactive,
the current timestamp and inactive count are stored in
ReplicationSlotPersistentData structure and persisted to disk. The
checkpointer then invalidates all the slots that are lying inactive
for about inactive_replication_slot_timeout duration starting from
inactive_at.
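
A rough sketch of the kind of check the checkpointer could perform for (b).
This is illustrative only, not the patch: inactive_at is the field proposed
above, and the timeout GUC is assumed to be in seconds.

static bool
slot_inactive_timeout_expired(ReplicationSlot *slot)
{
	if (slot->active_pid != 0)			/* slot is currently in use */
		return false;
	if (slot->data.inactive_at == 0)	/* never marked inactive (assumed) */
		return false;

	return TimestampDifferenceExceeds(slot->data.inactive_at,
									  GetCurrentTimestamp(),
									  inactive_replication_slot_timeout * 1000);
}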

In addition to implementing (b), these two new metrics enable
developers to improve their monitoring tools, as the metrics are
exposed via the pg_replication_slots system view. For instance, one can
build a monitoring tool that signals when replication slots have been
lying inactive for a day or so using the inactive_at metric, and/or
when a replication slot is becoming inactive too frequently using the
inactive_count metric.

I’m attaching the v1 patch set as described below:
0001 - Tracks invalidation_reason in pg_replication_slots. This is
needed because slots now have multiple reasons for slot invalidation.
0002 - Tracks inactive replication slot information inactive_at and
inactive_timeout.
0003 - Adds inactive_timeout based replication slot invalidation.
0004 - Adds XID based replication slot invalidation.

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Track-invalidation_reason-in-pg_replication_slots.patch
Description: Binary data


v1-0002-Track-inactive-replication-slot-information.patch
Description: Binary data


v1-0003-Add-inactive_timeout-based-replication-slot-inval.patch
Description: Binary data


v1-0004-Add-XID-based-replication-slot-invalidation.patch
Description: Binary data


Re: introduce dynamic shared memory registry

2024-01-10 Thread Michael Paquier
On Mon, Jan 08, 2024 at 11:16:27AM -0600, Nathan Bossart wrote:
> On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote:
>> 1. I think we need to add some notes about this new way of getting
>> shared memory for external modules in the Shared Memory and
>> LWLocks section in xfunc.sgml? This will at least tell there's
>> another way for external modules to get shared memory, not just with
>> the shmem_request_hook and shmem_startup_hook. What do you think?
> 
> Good call.  I definitely think this stuff ought to be documented.  After a
> quick read, I also wonder if it'd be worth spending some time refining that
> section.

+1.  It would also be worth pointing at autoprewarm.c in the docs, as an
extra reference that can be given to users reading them.

>> 3. IIUC, this feature eventually makes both shmem_request_hook and
>> shmem_startup_hook pointless, no? Or put another way, what's the
>> significance of shmem request and startup hooks in lieu of this new
>> feature? I think it's quite possible to get rid of the shmem request
>> and startup hooks (of course, not now but at some point in future to
>> not break the external modules), because all the external modules can
>> allocate and initialize the same shared memory via
>> dsm_registry_init_or_attach and its init_callback. All the external
>> modules will then need to call dsm_registry_init_or_attach in their
>> _PG_init callbacks and/or in their bg worker's main functions in case
>> the modules intend to start up bg workers. Am I right?
> 
> Well, modules might need to do a number of other things (e.g., adding
> hooks) that can presently only be done when preloaded, in which case I
> doubt there's much benefit from switching to the DSM registry.  I don't
> really intend for it to replace the existing request/startup hooks, but
> you're probably right that most, if not all, could use the registry
> instead.  IMHO this is well beyond the scope of this thread, though.

Even if that's not in the scope of this thread, just removing these
hooks would break a lot of out-of-core things, and they still have a
lot of value when extensions expect to always be loaded via
shared_preload_libraries.  They don't cost much in maintenance at this stage.
--
Michael




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Michael Paquier
On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> I have code review feedback as well, but I've saved that for my next email.

Ah, cool.

> On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
>> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
>> Oh, I understand it now. Yes, that makes sense.
>> I thought removing op_bytes completely ( as you said "This patch
>> extends it with two more operation sizes, and there are even cases
>> where it may be a variable" ) from pg_stat_io view then adding
>> something like {read | write | extend}_bytes and {read | write |
>> extend}_calls could be better, so that we don't lose any information.
> 
> Upthread, Michael says:
> 
>> I find the use of 1 in this context a bit confusing, because when
>> referring to a counter at N, then it can be understood as doing N
>> times an operation,
> 
> I didn't understand this argument, so I'm not sure if I agree or
> disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver?  For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing reads, writes or extends 1 byte at a time.  My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

> I think these are the three proposals for handling WAL reads:
> 
> 1) setting op_bytes to 1 and the number of reads is the number of bytes
> 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> number of calls to pg_pread() or similar
> 3) setting op_bytes to NULL and the number of reads is the number of
> calls to pg_pread() or similar

3) could be a number of bytes, actually.

> Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

> For an unpopular idea: we could add separate [IOOp]_bytes columns for
> all those IOOps for which it would be relevant. It kind of stinks but
> it would give us the freedom to document exactly what a single IOOp
> means for each combination of BackendType, IOContext, IOObject, and
> IOOp (as relevant) and still have an accurate number in the *bytes
> columns. Everyone will probably hate us if we do that, though.
> Especially because having bytes for the existing IOObjects is an
> existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view.  That would be up to 8k extra
tuples just for the sake of op_byte's variability.

> A separate question: suppose [1] goes in (to read WAL from WAL buffers
> directly). Now, WAL reads are not from permanent storage anymore. Are
> we only tracking permanent storage I/O in pg_stat_io? I also had this
> question for some of the WAL receiver functions. Should we track any
> I/O other than permanent storage I/O? Or did I miss this being
> addressed upthread?

That's a good point.  I guess that this should just be a different
IOOp?  That's not an IOOP_READ.  An IOOP_HIT is also different.

> In terms of what I/O we should track in a streaming/asynchronous
> world, the options would be:
> 
> 1) track read/write syscalls
> 2) track blocks of BLCKSZ submitted to the kernel
> 3) track bytes submitted to the kernel
> 4) track merged I/Os (after doing any merging in the application)
> 
> I think the debate was largely between 2 and 4. There was some
> disagreement, but I think we landed on 2 because there is merging that
> can happen at many levels in the storage stack (even the storage
> controller). Distinguishing between whether or not Postgres submitted
> 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> but I think it might be confusing for the Postgres user trying to
> determine why their query is slow. It probably makes the most sense to
> still track in block size.
> 
> No matter what solution we pick, you should get a correct number if
> you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> rather, there should be some way of getting an accurate number in
> bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.
--
Michael




Re: Documentation to upgrade logical replication cluster

2024-01-10 Thread vignesh C
On Wed, 10 Jan 2024 at 15:50, Amit Kapila  wrote:
>
> On Thu, Jan 4, 2024 at 2:22 PM vignesh C  wrote:
> >
> > We have documentation on how to upgrade "publisher" and "subscriber"
> > at [1], but currently we do not have any documentation on how to
> > upgrade logical replication clusters.
> > Here is a patch to document how to upgrade different logical
> > replication clusters: a) Upgrade 2 node logical replication cluster b)
> > Upgrade cascaded logical replication cluster c) Upgrade 2 node
> > circular logical replication cluster.
> >
>
> Today, off-list, I had a short discussion on this documentation with
> Jonathan and Michael. I was not sure whether we should add this in the
> main documentation of the upgrade or maintain it as a separate wiki
> page. My primary worry was that this seemed to be taking too much
> space on pgupgrade page and making the information on that page a bit
> unreadable. Jonathan suggested that we can add this information to the
> logical replication page [1] and add a reference in the pgupgrade
> page. That suggestion makes sense to me considering we have
> sub-sections like Monitoring, Security, and Configuration Settings on
> the logical replication page. We can have a new sub-section Upgrade on
> the same lines. What do you think?

I feel that would be better; others like Kuroda-san have said something
along similar lines in comment 2 at [1], and Peter also had a similar
opinion at [2]. I will handle this in the next version.

[1] - 
https://www.postgresql.org/message-id/TY3PR01MB9889BD1202530E8310AC9B3DF5662%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2] - 
https://www.postgresql.org/message-id/CAHut%2BPs4AtGB9MMK51%3D1Z1JQ1FUK%2BX0oXQuAdEad1kEEuw7%2BkA%40mail.gmail.com

Regards,
Vignesh




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-10 Thread Michael Paquier
On Wed, Jan 10, 2024 at 03:21:03PM +0530, Ashutosh Bapat wrote:
> On Tue, Jan 9, 2024 at 10:09 AM Michael Paquier  wrote:
>>> +#ifdef USE_INJECTION_POINTS
>>> +static bool
>>> +file_exists(const char *name)
>>>
>>> There's similar function in jit.c and dfmgr.c. Can we not reuse that code?
>>
>> This has been mentioned in a different comment.  Refactored as of
>> 0001, but there is something here related to EACCES for the JIT path.
>> Seems weird to me that we would not fail if the JIT library cannot be
>> accessed when stat() fails.
> 
> I agree with this change to jit. Without having search permissions on
> every directory in the path, the function can not determine if the
> file exists or not. So throwing an error is better than just returning
> false which means that
> the file does not exist.

I was looking at the original set of threads related to JIT, and this
has been mentioned nowhere.  I think that I'm going to give it a shot
and see how the buildfarm reacts.  If that finishes with red, we could
always revert this part of the patch in jit.c and still keep the
refactored routine.

>> Yeah, that's an intended design choice to keep the code simpler and
>> faster as there is no need to track the library and function names in
>> the local caches or implement something similar to invalidation
>> messages for this facility because it would impact performance anyway
>> in the call paths.  In short, just don't do that, or use two distinct
>> points.
> 
> In practice the InjectionPointDetach() and InjectionPointAttach()
> calls may not be close by, and the user may not be able to figure out
> why the injection points are behaving weirdly. It may impact
> performance, but unexpected behaviour should be avoided, IMO.
> 
> If nothing else this should be documented.

In all the infrastructures I've looked at, folks did not really care
about having an invalidation for the callbacks loaded.  Still I'm OK
to add something in the documentation about that, say along the lines
of an extra sentence like:
"The callbacks loaded by a process are cached within each process.
There is no invalidation facility for the callbacks attached to
injection points, hence updating a callback for an injection point
requires a restart of the process to release its cache and the
previous callbacks attached to it."

> I am OK with not populating the cache but checking with just
> load_external_function(). This is again another ease-of-use scenario
> where a silly mistake by the user is caught earlier, making the user's
> life easier. That at least should be the goal of the first cut.

I don't really aim for complicated here, just useful.

> With v6 I could run the test when built with enable_injection_point
> false. I just ran make check in that folder. Didn't test meson build.

The CI has been failing because 041_invalid_checkpoint_after_promote
was loading Time::HiRes::nanosleep and Windows does not support it.

> Yeah, I think we have to use another shared state. If the waiting
> backend moves ahead without test_injection_point_wake() being called,
> that could lead to unexpected and very weird behaviour.
> 
> It looks like ConditionVariable just remembers the processes that need
> to be woken up during broadcast or signal. But by itself it doesn't
> guarantee desired condition when woken up.

Yeah, I'm not sure yet about how to do that in the most elegant way.
But this part could always happen after 0001~0003.

>> Attached is a v7 series.  What do you think?  0004 and 0005 for the
>> extra tests still need more discussion and much more polishing, IMO.
> 
> Generally I think the 0001 and 0002 are in good shape. However, I
> would like them to be more easy to use - like catching simple user
> errors that can be easily caught. That saves a lot of frustration
> because of unexpected behaviour. I will review 0001 and 0002 from v7
> in detail again, but it might take a few days.

Thanks again for the reviews.  I still intend to focus solely on 0001,
0002 and 0003 for the current commit fest, to have something able to
enforce error states in backends, at least.  There have been quite a
few bugs that could have had coverage thanks to that.
--
Michael
From 11c6e12750b2fc86c16a7ee5554e864fbc4438d3 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 9 Jan 2024 10:32:29 +0900
Subject: [PATCH v8 1/5] Refactor code to check file existence

jit.c and dfmgr.c had a copy of the same code to check if a file exists
or not.  This refactored routine will be used by an upcoming patch.

Note that this adds a check on EACCES for the JIT path.
---
 src/include/storage/fd.h   |  1 +
 src/backend/jit/jit.c  | 20 +---
 src/backend/storage/file/fd.c  | 22 ++
 src/backend/utils/fmgr/dfmgr.c | 25 -
 4 files changed, 28 insertions(+), 40 deletions(-)

diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c4c60bc0a8..60bba5c970 100644
--- a/src/include/storage/fd.h
+++ 

Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-01-10 Thread Tom Lane
Michael Paquier  writes:
> I am wondering if we'd better backpatch all that, TBH.

Seems like a good idea to me.

regards, tom lane




Re: add function argument names to regex* functions.

2024-01-10 Thread Dian Fay
On Wed Jan 10, 2024 at 9:18 AM EST, jian he wrote:
> On Tue, Jan 9, 2024 at 8:52 AM Dian Fay  wrote:
> >
> > On Mon Jan 8, 2024 at 9:26 AM EST, jian he wrote:
> > > On Mon, Jan 8, 2024 at 8:44 AM Dian Fay  wrote:
> > > > The `regexp_replace` summary in table 9.10 is mismatched and still
> > > > specifies the first parameter name as `string` instead of `source`.
> > > > Since all the other functions use `string`, should `regexp_replace` do
> > > > the same or is this a case where an established "standard" diverges?
> > >
> > > got it. Thanks for pointing it out.
> > >
> > > in functions-matching.html
> > > if I change source to
> > > string then
> > > there are no markup "string" and markup "string", it's kind of
> > > slightly confusing.
> > >
> > > So does the following refactored description of regexp_replace make sense:
> > >
> > >  The string is returned unchanged if
> > >  there is no match to the pattern.  If 
> > > there is a
> > >  match, the string is returned with the
> > >  replacement string substituted for the 
> > > matching
> > >  substring.  The replacement string can 
> > > contain
> > >  \n, where
> > > n is 1
> > >  through 9, to indicate that the source substring matching the
> > >  n'th parenthesized subexpression of
> > > the pattern should be
> > >  inserted, and it can contain \ to indicate 
> > > that the
> > >  substring matching the entire pattern should be inserted.  Write
> > >  \\ if you need to put a literal backslash in
> > > the replacement
> > >  text.
> >
> > That change makes sense to me! I'll see about the section refactoring
> > after this lands.
>
> I put the changes into the new patch.

Sorry, I missed one minor issue with v2. The replacement on lines
6027-6028 of func.sgml (originally "`n` rows if there are `n` matches")
is not needed and could be more confusing since the `n` represents a
number, not an argument to `regexp_matches`. I've built v3 and gone over
everything else one more time and it looks good.




Re: heavily contended lwlocks with long wait queues scale badly

2024-01-10 Thread Michael Paquier
On Wed, Jan 10, 2024 at 09:17:47PM -0600, Nathan Bossart wrote:
> Now that commit a4adc31 has had some time to bake and concerns about
> unintended consequences may have abated, I wanted to revive this
> back-patching discussion.  I see a few possibly-related reports [0] [1]
> [2], and I'm now seeing this in the field, too.  While it is debatable
> whether this is a bug, it's a quite nasty issue for users, and it's both
> difficult to detect and difficult to work around.

+1, I've seen this becoming a PITA for a few things.  Knowing that the
size of PGPROC does not change at all, I would be in favor of a
backpatch, especially since it's been in the tree for more than a
year, and even more so knowing that we have 16 released with this stuff
in.
--
Michael




Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-01-10 Thread Michael Paquier
On Tue, Jan 09, 2024 at 09:40:12PM -0500, Tom Lane wrote:
> No Windows expert here, but it does seem like the same argument
> applies.

Yeah, I've applied the same restriction for pg_regress to avoid
similar problems as we spawn a postgres process in this case.  I've
tested it and it was not causing issues in my own setup or the CI.

I am wondering if we'd better backpatch all that, TBH.
--
Michael




Re: A recent message added to pg_upgrade

2024-01-10 Thread Michael Paquier
On Wed, Jan 10, 2024 at 06:02:12PM +0530, Amit Kapila wrote:
> - if (max_logical_replication_workers == 0)
> + /*
> + * The logical replication launcher is disabled during binary upgrades,
> + * as logical replication workers can stream data during the upgrade
> + * which can cause replication origins to move forward after we have
> + * copied them.  It can cause the system to request the data which is
> + * already present in the new cluster.
> + */
> + if (max_logical_replication_workers == 0 || IsBinaryUpgrade)
> 
> This comment is not very clear to me. The first part of the sentence
> can't apply to the new cluster as after the upgrade, subscriptions
> will be disabled and the second part talks about requesting the wrong
> data in the new cluster. As per my understanding, the problem here is
> that, on the old cluster, the origins may move forward after we copy
> them and then we copy physical files. Now, in the new cluster when we
> try to request the data, it will be already present.

As far as I understand your complaint is about being more precise
about where the workers could run when we do an upgrade.  My patch
covers the reason why it would be a problem, and I agree that it could
be more detailed.

Hence, how about something like that:
"The logical replication launcher is disabled during binary upgrades,
as a logical replication workers running on the cluster upgrading from
could cause replication origins to move forward after they are copied
to the cluster upgrading to, creating potentially conflicts with the
physical files copied over." 

If you have a better suggestion, feel free.
--
Michael




Re: introduce dynamic shared memory registry

2024-01-10 Thread Nathan Bossart
On Thu, Jan 11, 2024 at 09:50:19AM +0700, Andrei Lepikhov wrote:
> On 9/1/2024 00:16, Nathan Bossart wrote:
>> On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote:
>> > 2. FWIW, I'd like to call this whole feature "Support for named DSM
>> > segments in Postgres". Do you see anything wrong with this?
>> 
>> Why do you feel it should be renamed?  I don't see anything wrong with it,
>> but I also don't see any particular advantage with that name compared to
>> "dynamic shared memory registry."
> It is not a big issue, I suppose. But for me personally (as not a native
> English speaker), the label "Named DSM segments" seems more straightforward
> to understand.

That is good to know, thanks.  I see that it would also align better with
RequestNamedLWLockTranche/GetNamedLWLockTranche.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Streaming I/O, vectored I/O (WIP)

2024-01-10 Thread Thomas Munro
On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas  wrote:
> On 10/01/2024 06:13, Thomas Munro wrote:
> > Bikeshedding call: I am open to better suggestions for the names
> > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
> > grammatically clumsy.
>
> How will these functions work in the brave new async I/O world? I assume
> PrepareReadBuffer() will initiate the async I/O, and
> CompleteReadBuffers() will wait for it to complete. How about
> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
> WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

 * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
 * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
 * CompleteReadBuffers() -> waits, completes as necessary

In the proposed synchronous version, the middle step is missing, but
streaming_read.c directly calls smgrprefetch() instead.  I thought
about shoving that inside a prosthetic StartReadBuffers() function,
but I backed out of simulating asynchronous I/O too fancifully.  The
streaming read API is where we really want to stabilise a nice API, so
we can move things around behind it if required.
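
For illustration, a possible caller-side shape for those three steps; only
the function names come from the description above, and the signatures are
hypothetical:

/* Sketch only: hypothetical signatures, names from the steps above. */
bool		need_io;

/* 1. Pin the buffer and learn whether it was found or needs I/O/zeroing. */
Buffer		buf = PrepareReadBuffer(rel, MAIN_FORKNUM, blkno, &need_io);

if (need_io)
{
	/* 2. Start the (possibly combined) I/O for the !found buffers... */
	StartReadBuffers(&buf, 1);

	/* 3. ...and wait for it to finish before using the page contents. */
	CompleteReadBuffers(&buf, 1);
}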

A bit of analysis of the one block -> nblocks change and the
synchronous -> asynchronous change:

Two things are new in a world with nblocks > 1.  (1) We needed to be
able to set BM_IO_IN_PROGRESS on more than one block at a time, but
commit 12f3867f already provided that, and (2) someone else might come
along and read a block in the middle of our range, effectively
chopping our range into subranges.  That's true also in master, but
when nblocks == 1 that's all-or-nothing, and now we have partial
cases.  In the proposed synchronous code, CompleteReadBuffers() claims
as many contiguous BM_IO_IN_PROGRESS flags as it can in the range, and
then loops to process the rest, skipping over any blocks that are
already done.  Further down in md.c, you might also cross a segment
boundary.  So those are two different reasons why a single call to
CompleteReadBuffers() might end up generating zero or more than one
I/O system call, though very often it's one.

Hmm, while spelling that out, I noticed an obvious problem and
improvement to make to that part of v5.  If backend #1 is trying to
read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but
backend #2 comes along and starts reading block 102 first, backend
#1's StartBufferIO() call would wait for 102's I/O CV while it still
holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third
backend #3 that wants to read block 101 even though no I/O is in
progress for that block yet!  At least that's deadlock free (because
always in block order), but it seems like undesirable lock chaining.
Here is my proposed improvement: StartBufferIO() gains a nowait flag.
For the head block we wait, but while trying to build a larger range
we don't.  We'll try 102 again in the next loop, with a wait.  Here is
a small fixup for that.
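
Not the attached fixup itself, but a rough illustration of the idea, with a
hypothetical three-argument StartBufferIO():

/*
 * Sketch only: wait for the head block, but don't wait for later blocks
 * someone else is already reading; stop the range there and retry those
 * blocks in the next loop iteration instead of lock-chaining on their I/O
 * condition variables.
 */
for (int i = 0; i < nblocks; i++)
{
	bool		nowait = (i > 0);	/* only the head block is allowed to wait */

	if (!StartBufferIO(bufHdrs[i], true, nowait))
		break;					/* couldn't claim it right now; retry later */
}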

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs).  One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary.  I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one .

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet.  Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged.  While discussing all this we decided it
was a bit strange to have lower code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present simple and lightweight synchronous proposal
lacks the whole subsystem that could do that by magic.)

> About the signature of those functions: Does it make sense for
> CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
> of buffers? If StartReadBuffer() initiates the async I/O immediately, is
> there any benefit to batching the waiting?
>
> If StartReadBuffer() starts the async I/O, the idea that you can call
> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I 

Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Andy Fan


Robert Haas  writes:

> Thanks for jumping in with a review, Matthias!

FWIW, Matthias was also the first one to review this proposal in this
thread; thanks for that as well!

-- 
Best Regards
Andy Fan





Re: heavily contended lwlocks with long wait queues scale badly

2024-01-10 Thread Nathan Bossart
On Mon, Nov 21, 2022 at 10:31:14AM -0500, Jonathan S. Katz wrote:
> On 11/20/22 2:56 PM, Andres Freund wrote:
>> I still think it might be worth backpatching in a bit, but so far the
>> votes on that weren't clear enough to feel comfortable.
> 
> My general feeling is "yes" on backpatching, particularly if this is a bug
> and it's fixable without ABI breaks.

Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion.  I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too.  While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.

Thoughts?

[0] 
https://postgr.es/m/CAM527d-uDn5osa6QPKxHAC6srOfBH3M8iXUM%3DewqHV6n%3Dw1u8Q%40mail.gmail.com
[1] 
https://postgr.es/m/VI1PR05MB62031A41186ACC3FC91ACFC70%40VI1PR05MB6206.eurprd05.prod.outlook.com
[2] https://postgr.es/m/dd0e070809430a31f7ddd8483fbcce59%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Andy Fan

Hi Matthias,

Thanks for the review!

Matthias van de Meent  writes:

> On Wed, 10 Jan 2024 at 02:44, Andy Fan  wrote:
>> Hi,
>>
>> I want to know if Andres or you have plans
>> to do some code review. I don't expect this to happen very soon; I just
>> want to make sure we don't end up in a situation where each of you thinks
>> the other will do it, but in fact neither does. A commitfest entry
>> [1] has been added for this.
>
>
>> +++ b/src/backend/storage/buffer/bufmgr.c
>> @@ -5419,6 +5419,7 @@ LockBufHdr(BufferDesc *desc)
>> perform_spin_delay();
>> }
>> finish_spin_delay();
>> +START_SPIN_LOCK();
>> return old_buf_state | BM_LOCKED;
>> }
>
> I think that we need to 'arm' the checks just before we lock the spin
> lock, and 'disarm' the checks just after we unlock the spin lock,
> rather than after and before, respectively. That way, we won't have a
> chance of false negatives: with your current patch it is possible that
> an interrupt fires between the acquisition of the lock and the code in
> START_SPIN_LOCK() marking the thread as holding a spin lock, which
> would cause any check in that signal handler to incorrectly read that
> we don't hold any spin locks.

That's a good idea. fixed in v2.
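
For readers following along, a rough sketch of the arm-before-acquire /
disarm-after-release ordering Matthias describes; START_SPIN_LOCK() is from
the patch, while the wrapper macros and END_SPIN_LOCK() are hypothetical:

/*
 * Sketch only: arm the check *before* taking the spin lock and disarm it
 * *after* releasing it, so a signal handler can never observe the lock held
 * while the counter still says "no spin lock held".
 */
#define SpinLockAcquireChecked(lock) \
	do { START_SPIN_LOCK(); SpinLockAcquire(lock); } while (0)

#define SpinLockReleaseChecked(lock) \
	do { SpinLockRelease(lock); END_SPIN_LOCK(); } while (0)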

>
>> +++ b/src/backend/storage/lmgr/lock.c
>> @@ -776,6 +776,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
>> boolfound_conflict;
>> boollog_lock = false;
>>
>> +Assert(SpinLockCount == 0);
>> +
>
> I'm not 100% sure on the policy of this, but theoretically you could
> use LockAcquireExtended(dontWait=true) while holding a spin lock, as
> that would not have an unknown duration. Then again, this function
> also does elog/ereport, which would cause issues, still, so this code
> may be the better option.

I took this statement as "keep the current patch as it is", since
"not waiting" doesn't mean just a few dozen instructions in this case.
Please correct me if anything is wrong.

>
>> +elog(PANIC, "stuck spinlock detected at %s, %s:%d after waiting for %u 
>> ms",
>> + func, file, line, delay_ms);
>
> pg_usleep doesn't actually guarantee that we'll wait for exactly that
> duration; depending on signals received while spinning and/or OS
> scheduling decisions it may be off by orders of magnitude.

True, but I did this for two reasons: a) the other solution needs to call
the 'time' syscall twice, and I didn't want to pay that run-time cost; b)
the possibility of getting a signal during pg_usleep should be low, and
even if that happens, since the message is just for humans we don't need
an absolutely accurate number. What do you think?

>
>> +++ b/src/common/scram-common.c
>
> This is unrelated to the main patchset

Fixed in v2.

>
>> +++ b/src/include/storage/spin.h
>
> Minor: I think these changes could better be included in miscadmin, or
> at least the definition for SpinLockCount should be moved there: The
> spin lock system itself shouldn't be needed in places where we need to
> make sure that we don't hold any spinlocks, and miscadmin.h already
> holds things related to "System interrupt and critical section
> handling", which seems quite related.

fixed in v2. 

-- 
Best Regards
Andy Fan

>From 5548477e192abce02a95249f4c1e0c9cf465d668 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 11 Jan 2024 11:12:00 +0800
Subject: [PATCH v2] Detect more misuse of spin lock automatically

Spin locks are intended for *very* short-term locks, but they can be
misused in many cases, e.g. acquiring another LWLock or a regular
lock, or allocating memory. In this patch, all such cases are
automatically detected in an assert-enabled build.

CHECK_FOR_INTERRUPTS should be avoided while holding a spin lock because
it is nearly impossible to release the spin lock correctly if an
ERROR/FATAL happens, since that would cause the code to jump somewhere else.
---
 src/backend/access/table/tableam.c  |  1 +
 src/backend/storage/buffer/bufmgr.c |  1 +
 src/backend/storage/ipc/barrier.c   |  2 ++
 src/backend/storage/lmgr/lock.c |  2 ++
 src/backend/storage/lmgr/lwlock.c   |  2 ++
 src/backend/storage/lmgr/s_lock.c   | 15 ---
 src/backend/utils/hash/dynahash.c   |  1 +
 src/backend/utils/init/globals.c|  1 +
 src/backend/utils/mmgr/mcxt.c   |  9 +
 src/include/miscadmin.h |  9 +
 src/include/storage/buf_internals.h |  1 +
 src/include/storage/s_lock.h|  2 ++
 src/include/storage/shm_toc.h   |  1 +
 src/include/storage/spin.h  | 17 ++---
 14 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a..6c6a65764c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -24,6 +24,7 @@
 #include "access/syncscan.h"
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "miscadmin.h"
 #include "optimizer/plancat.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-10 Thread jian he
On Tue, Jan 9, 2024 at 10:36 PM torikoshia  wrote:
>
> On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada 
> wrote:
> > If we want only such a feature we need to implement it together (the
> > patch could be split, though). But if some parts of the feature are
> > useful for users as well, I'd recommend implementing it incrementally.
> > That way, the patches can get small and it would be easy for reviewers
> > and committers to review/commit them.
>
> Jian, what do you think about this comment?
>
> Looking back at the discussion so far, it seems that not everyone thinks
> saving error information to a table is the best idea [1], and some people
> think just skipping error data is useful. [2]
>
> Since there are design issues to be considered, such as the treatment of
> physical/logical replication, putting error information into a table is
> likely to take time for consensus building and development.
>
> Wouldn't it be better to follow the following advice and develop the
> functionality incrementally?
>
> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada
>  wrote:
> > So I'm thinking we may be able to implement this
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
> > be to support logging destinations such as server logs and tables.
>
>
> Attached is a patch for this "first step", made with reference to the v7
> patch, which logged errors and is simpler than the latest one.
> - This patch adds a new option SAVE_ERROR_TO, but currently only supports
> 'none', which means it just skips error data. It is expected to also
> support 'log' and 'table'.
> - This patch skips just soft errors and doesn't handle other errors such
> as missing column data.

Hi.
I made the following changes based on your patch
(v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch):

* When SAVE_ERROR_TO is specified, move the initialization of
ErrorSaveContext to the function BeginCopyFrom.
I think that's the right place to initialize that CopyFromState field
(a rough sketch follows this list).
* With your patch, when N rows have malformed data, it will initialize
N ErrorSaveContexts.
In struct CopyFromStateData, I changed the field to ErrorSaveContext
*escontext, so if an error occurs you can just set the escontext accordingly.
* doc: mention "If this option is omitted, COPY
stops operation at the first error."
* Since we only support 'none' for now, and 'none' means we don't want
ErrorSaveContext metadata,
we should set cstate->escontext->details_wanted to false.
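
As referenced in the first item above, a rough sketch of that initialization;
the option and field names follow the discussion and may differ from the
actual patch:

/* In BeginCopyFrom(), sketch only: set up one ErrorSaveContext up front. */
if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED)
{
	cstate->escontext = makeNode(ErrorSaveContext);

	/* 'none' just skips bad rows, so no error details need to be kept. */
	cstate->escontext->details_wanted = false;
}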

> BTW I have question and comment about v15 patch:
>
> > +   {
> > +   /*
> > +   *
> > +   * InputFunctionCall is more faster than
> > InputFunctionCallSafe.
> > +   *
> > +   */
>
> Have you measured this?
> When I tested it in an older patch, there was no big difference[3].
Thanks for pointing it out; I was probably overthinking it.

>   > -   SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY
> SELECT
>   > +   SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P
> SECURITY SELECT
>
> There was a comment that we shouldn't add new keyword for this[4].
>
Thanks for pointing it out.


v1-0001-minor-refactor.no-cfbot
Description: Binary data


Re: introduce dynamic shared memory registry

2024-01-10 Thread Andrei Lepikhov

On 9/1/2024 00:16, Nathan Bossart wrote:

On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote:

1. I think we need to add some notes about this new way of getting
shared memory for external modules in the Shared Memory and
LWLocks section in xfunc.sgml? This will at least tell there's
another way for external modules to get shared memory, not just with
the shmem_request_hook and shmem_startup_hook. What do you think?
+1. Maybe even more - in the section related to extensions, this 
approach to using shared data can be mentioned, too.



2. FWIW, I'd like to call this whole feature "Support for named DSM
segments in Postgres". Do you see anything wrong with this?


Why do you feel it should be renamed?  I don't see anything wrong with it,
but I also don't see any particular advantage with that name compared to
"dynamic shared memory registry."
It is not a big issue, I suppose. But for me personally (as a non-native 
English speaker), the label "Named DSM segments" seems more 
straightforward to understand.



3. IIUC, this feature eventually makes both shmem_request_hook and
shmem_startup_hook pointless, no? Or put another way, what's the
significance of shmem request and startup hooks in lieu of this new
feature? I think it's quite possible to get rid of the shmem request
and startup hooks (of course, not now but at some point in future to
not break the external modules), because all the external modules can
allocate and initialize the same shared memory via
dsm_registry_init_or_attach and its init_callback. All the external
modules will then need to call dsm_registry_init_or_attach in their
_PG_init callbacks and/or in their bg worker's main functions in case
the modules intend to start up bg workers. Am I right?


Well, modules might need to do a number of other things (e.g., adding
hooks) that can presently only be done when preloaded, in which case I
doubt there's much benefit from switching to the DSM registry.  I don't
really intend for it to replace the existing request/startup hooks, but
you're probably right that most, if not all, could use the registry
instead.  IMHO this is well beyond the scope of this thread, though.

+1, there may be many reasons to use these hooks.

>> 3. Use NAMEDATALEN instead of 64?
>> +charkey[64];
> I kept this the same, as I didn't see any need to tie the key size to
> NAMEDATALEN.
IMO, we should avoid magic numbers whenever possible. The current logic, 
under which the first users of this feature will be extensions, naturally 
ties this size to the size of the 'name' type.
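
A minimal sketch of what that would look like (the other fields are just
for illustration, not the actual patch):

typedef struct DSMRegistryEntry
{
	char		name[NAMEDATALEN];	/* instead of a bare "char key[64]" */
	dsm_handle	handle;
	size_t		size;
} DSMRegistryEntry;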


And one more point. I think the commit already deserves a more detailed 
commit message.


--
regards,
Andrei Lepikhov
Postgres Professional





RE: Random pg_upgrade test failure on drongo

2024-01-10 Thread Hayato Kuroda (Fujitsu)
Dear Alexander, Amit,

> > But tomorrow it could be for other tables and if we change this
> > TRUNCATE logic for pg_largeobject (of which chances are less) then
> > there is always a chance that one misses changing this comment. I feel
> > keeping it generic in this case would be better as the problem is
> > generic but it is currently shown for pg_largeobject.
> 
> Yes, for sure. So let's keep it generic as you prefer.
> 
> Thank you!

Thanks for working on the patch. I'm also OK with pushing Amit's fix patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: speed up a logical replica setup

2024-01-10 Thread Euler Taveira
On Wed, Jan 10, 2024, at 1:33 AM, Shlok Kyal wrote:
> Here the standby node would be waiting for the 'consistent_lsn' wal
> during recovery but this wal will not be present on standby if no
> physical replication is setup. Hence the command will be waiting
> infinitely for the wal.

Hmm. Some validations are missing.

> To solve this added a timeout of 60s for the recovery process and also
> added a check so that pg_subscriber would give a error when it called
> for node which is not in physical replication.
> Have attached the patch for the same. It is a top-up patch of the
> patch shared by Euler at [1].

If the user has a node that is not a standby and it does not set the GUCs to
start the recovery process from a backup, the initial setup is broken. (That's
the case you described.) A good UI is to detect this scenario earlier.
Unfortunately, there isn't a reliable and cheap way to do it. You need to start
the recovery and check whether it is making progress. (I don't have a strong
opinion about requiring a standby to use this tool. It would reduce the
complexity about checking if the setup has all requirements to run this tool.)


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Update docs for default value of fdw_tuple_cost

2024-01-10 Thread John Naylor
On Tue, Dec 26, 2023 at 11:27 PM Umair Shahid
 wrote:
>
> Commit cac169d686eddb277880a0d8a760ac3007b4846a updated the default value of 
> fdw_tuple_cost from 0.01 to 0.2. The attached patch updates the docs to 
> reflect this change.

Pushed, thanks!




Re: Synchronizing slots from primary to standby

2024-01-10 Thread Peter Smith
Here are some review comments for patch v58-0002

(FYI - I quickly checked with the latest v59-0002 and AFAIK all these
review comments below are still relevant)

==
Commit message

1.
If a logical slot is invalidated on the primary, slot on the standby is also
invalidated.

~

/slot on the standby/then that slot on the standby/

==
doc/src/sgml/logicaldecoding.sgml

2.
In order to resume logical replication after failover from the synced
logical slots, it is required that 'conninfo' in subscriptions are
altered to point to the new primary server using ALTER SUBSCRIPTION
... CONNECTION. It is recommended that subscriptions are first
disabled before promoting the standby and are enabled back once these
are altered as above after failover.

~

Minor rewording mainly to reduce a long sentence.

SUGGESTION
To resume logical replication after failover from the synced logical
slots, the subscription's 'conninfo' must be altered to point to the
new primary server. This is done using ALTER SUBSCRIPTION ...
CONNECTION. It is recommended that subscriptions are first disabled
before promoting the standby and are enabled back after altering the
connection string.

==
doc/src/sgml/system-views.sgml

3.
+  
+   synced bool
+  
+  
+  True if this logical slot was synced from a primary server.
+  
+  

SUGGESTION
True if this is a logical slot that was synced from a primary server.

==
src/backend/access/transam/xlogrecovery.c

4.
+ /*
+ * Shutdown the slot sync workers to prevent potential conflicts between
+ * user processes and slotsync workers after a promotion.
+ *
+ * We do not update the 'synced' column from true to false here, as any
+ * failed update could leave some slot's 'synced' column as false. This
+ * could cause issues during slot sync after restarting the server as a
+ * standby. While updating after switching to the new timeline is an
+ * option, it does not simplify the handling for 'synced' column.
+ * Therefore, we retain the 'synced' column as true after promotion as they
+ * can provide useful information about their origin.
+ */

Minor comment wording changes.

BEFORE
...any failed update could leave some slot's 'synced' column as false.
SUGGESTION
...any failed update could leave 'synced' column false for some slots.

~

BEFORE
Therefore, we retain the 'synced' column as true after promotion as
they can provide useful information about their origin.
SUGGESTION
Therefore, we retain the 'synced' column as true after promotion as it
may provide useful information about the slot origin.

==
src/backend/replication/logical/slotsync.c

5.
+ * While creating the slot on physical standby, if the local restart_lsn and/or
+ * local catalog_xmin is ahead of those on the remote then the worker cannot
+ * create the local slot in sync with the primary server because that would
+ * mean moving the local slot backwards and the standby might not have WALs
+ * retained for old LSN. In this case, the worker will mark the slot as
+ * RS_TEMPORARY. Once the primary server catches up, it will move the slot to
+ * RS_PERSISTENT and will perform the sync periodically.

/will move the slot to RS_PERSISTENT/will mark the slot as RS_PERSISTENT/

~~~

6. drop_synced_slots_internal
+/*
+ * Helper function for drop_obsolete_slots()
+ *
+ * Drops synced slot identified by the passed in name.
+ */
+static void
+drop_synced_slots_internal(const char *name, bool nowait)
+{
+ Assert(MyReplicationSlot == NULL);
+
+ ReplicationSlotAcquire(name, nowait);
+
+ Assert(MyReplicationSlot->data.synced);
+
+ ReplicationSlotDropAcquired();
+}

IMO you don't need this function. AFAICT it is only called from one
place and does not result in fewer lines of code.

~~~

7. get_local_synced_slots

+ /* Check if it is logical synchronized slot */
+ if (s->in_use && SlotIsLogical(s) && s->data.synced)
+ {
+ local_slots = lappend(local_slots, s);
+ }

Do you need to check SlotIsLogical(s) here? I thought s->data.synced
can never be true for physical slots. I felt you could write this like
below:

if (s->in_use && s->data.synced)
{
  Assert(SlotIsLogical(s));
  local_slots = lappend(local_slots, s);
}

~~~

8. check_sync_slot_on_remote

+static bool
+check_sync_slot_on_remote(ReplicationSlot *local_slot, List *remote_slots,
+   bool *locally_invalidated)
+{
+ ListCell   *lc;
+
+ foreach(lc, remote_slots)
+ {
+ RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);

I think you can use the new style foreach_ptr list macros here.
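
For example, just as a sketch of the new-style macro usage:

foreach_ptr(RemoteSlot, remote_slot, remote_slots)
{
	/* same loop body, without the explicit lfirst() cast */
}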

~~~

9. drop_obsolete_slots

+drop_obsolete_slots(List *remote_slot_list)
+{
+ List*local_slots = NIL;
+ ListCell   *lc;
+
+ local_slots = get_local_synced_slots();
+
+ foreach(lc, local_slots)
+ {
+ ReplicationSlot *local_slot = (ReplicationSlot *) lfirst(lc);

I think you can use the new style foreach_ptr list macros here.

~~~

10. reserve_wal_for_slot

+ Assert(slot != NULL);
+ Assert(slot->data.restart_lsn == InvalidXLogRecPtr);

You can use the XLogRecPtrIsInvalid() macro here.

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-10 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 10 Jan 2024 16:53:48 +0900,
  Masahiko Sawada  wrote:

>> Interesting. But I feel that it introduces another (a bit)
>> tricky mechanism...
> 
> Right. On the other hand, I don't think the idea 3 is good for the
> same reason Michael-san pointed out before[1][2].
>
> [1] https://www.postgresql.org/message-id/ZXEUIy6wl4jHy6Nm%40paquier.xyz
> [2] https://www.postgresql.org/message-id/ZXKm9tmnSPIVrqZz%40paquier.xyz

I think that the important part of Michael-san's opinion
is "keep COPY TO implementation and COPY FROM implementation
separated for maintainability".

The patch discussed in [1][2] uses one routine for both
COPY TO and COPY FROM. If we use that approach, we need to
change one common routine from both copyto.c and copyfrom.c (or
export callbacks from copyto.c and copyfrom.c and use them
in copyto.c to construct one common routine). That's
the problem.

The idea 3 still has separated routines for COPY TO and COPY
FROM. So I think that it still keeps COPY TO implementation
and COPY FROM implementation separated.

>> BTW, we also need to set .type:
>>
>>  .routine = COPYTO_ROUTINE (
>>  .type = T_CopyToFormatRoutine,
>>  .start_fn = testfmt_copyto_start,
>>  .onerow_fn = testfmt_copyto_onerow,
>>  .end_fn = testfmt_copyto_end
>>  )
> 
> I think it's fine as the same is true for table AM.

Ah, sorry. I should have said so explicitly. I don't think that
it's a problem. I just wanted to say that it's missing.


Defining one more static const struct, instead of providing a
convenient (but a bit tricky) macro, may be more straightforward:

static const CopyToFormatRoutine testfmt_copyto_routine = {
.type = T_CopyToFormatRoutine,
.start_fn = testfmt_copyto_start,
.onerow_fn = testfmt_copyto_onerow,
.end_fn = testfmt_copyto_end
};

static const CopyFormatRoutine testfmt_copyto_handler = {
.type = T_CopyFormatRoutine,
.is_from = false,
.routine = (Node *) &testfmt_copyto_routine
};
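
(Presumably the format handler function then simply does
PG_RETURN_POINTER(&testfmt_copyto_handler); that part is my assumption
about how it would be wired up, following the usual handler-function
pattern.)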


Thanks,
-- 
kou




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-10 Thread Masahiko Sawada
On Mon, Jan 8, 2024 at 8:35 PM John Naylor  wrote:
>
> On Wed, Jan 3, 2024 at 9:10 PM John Naylor  wrote:
> >
> > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  
> > wrote:
> >
> > > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > > calls part. I think that even if we expose them, we will still need to
> > > do something like "if (TidStoreIsShared(ts))
> > > shared_rt_lock_share(ts->tree.shared)", no?
> >
> > I'll come back to this topic separately.
>
> To answer your question, sure, but that "if (TidStoreIsShared(ts))"
> part would be pushed down into a function so that only one place has
> to care about it.
>
> However, I'm starting to question whether we even need that. Meaning,
> lock the tidstore separately. To "lock the tidstore" means to take a
> lock, _separate_ from the radix tree's internal lock, to control
> access to two fields in a separate "control object":
>
> +typedef struct TidStoreControl
> +{
> + /* the number of tids in the store */
> + int64 num_tids;
> +
> + /* the maximum bytes a TidStore can use */
> + size_t max_bytes;
>
> I'm pretty sure max_bytes does not need to be in shared memory, and
> certainly not under a lock: Thinking of a hypothetical
> parallel-prune-phase scenario, one way would be for a leader process
> to pass out ranges of blocks to workers, and when the limit is
> exceeded, stop passing out blocks and wait for all the workers to
> finish.

True. I agreed that it doesn't need to be under a lock anyway, as it's
read-only.

>
> As for num_tids, vacuum previously put the similar count in
>
> @@ -176,7 +179,8 @@ struct ParallelVacuumState
>   PVIndStats *indstats;
>
>   /* Shared dead items space among parallel vacuum workers */
> - VacDeadItems *dead_items;
> + TidStore *dead_items;
>
> VacDeadItems contained "num_items". What was the reason to have new
> infrastructure for that count? And it doesn't seem like access to it
> was controlled by a lock -- can you confirm? If we did get parallel
> pruning, maybe the count would belong inside PVShared?

I thought that since the tidstore is a general-purpose data structure
the shared counter should be protected by a lock. One thing I'm
concerned about is that we might need to update both the radix tree
and the counter atomically in some cases. But it's true that we don't
need it for lazy vacuum, at least for now. Even given the parallel scan
phase, probably we won't need to have workers check the total number
of stored tuples during a parallel scan.

>
> The number of tids is not that tightly bound to the tidstore's job. I
> believe tidbitmap.c (a possible future client) doesn't care about the
> global number of tids -- not only that, but AND/OR operations can
> change the number in a non-obvious way, so it would not be convenient
> to keep an accurate number anyway. But the lock would still be
> mandatory with this patch.

Very good point.

>
> If we can make vacuum work a bit closer to how it does now, it'd be a
> big step up in readability, I think. Namely, getting rid of all the
> locking logic inside tidstore.c and let the radix tree's locking do
> the right thing. We'd need to make that work correctly when receiving
> pointers to values upon lookup, and I already shared ideas for that.
> But I want to see if there is any obstacle in the way of removing the
> tidstore control object and it's separate lock.

So I agree to remove both max_bytes and num_items from the control
object. Also, as you mentioned, we can remove the tidstore control
object itself. TidStoreGetHandle() returns a radix tree handle, and we
can pass it to TidStoreAttach().  I'll try it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Melanie Plageman
I have code review feedback as well, but I've saved that for my next email.

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
>
> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
> >
> > On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > > On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
> > >> I am not sure while the whole point of the exercise is to have all the
> > >> I/O related data in a single view.  Something that I've also found a
> > >> bit disturbing yesterday while looking at your patch is the fact that
> > >> the operation size is guessed from the context and object type when
> > >> querying the view because now everything is tied to BLCKSZ.  This
> > >> patch extends it with two more operation sizes, and there are even
> > >> cases where it may be a variable.  Could it be a better option to
> > >> extend pgstat_count_io_op_time() so as callers can themselves give the
> > >> size of the operation?
> > >
> > > Do you mean removing the op_bytes column and tracking the number of
> > > bytes in reads, writes, and extends? If so, that makes sense to me but
> > > I don't want to remove the number of operations; I believe that has a
> > > value too. We can extend the pgstat_count_io_op_time() so it can both
> > > track the number of bytes and the number of operations.
> >
> > Apologies if my previous wording sounded confusing.  The idea I had in
> > mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> > of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> > as a number of bytes.
>
> Oh, I understand it now. Yes, that makes sense.
> I thought removing op_bytes completely ( as you said "This patch
> extends it with two more operation sizes, and there are even cases
> where it may be a variable" ) from pg_stat_io view then adding
> something like {read | write | extend}_bytes and {read | write |
> extend}_calls could be better, so that we don't lose any information.

Forgive me as I catch up on this thread.

Upthread, Michael says:

> I find the use of 1 in this context a bit confusing, because when
> referring to a counter at N, then it can be understood as doing N
> times a operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

Looking at the patch, I think it is still doing 2.

It would be good to list all our options with pros and cons (if only
because they are a bit spread throughout the thread now).

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

> > > Also, it is not directly related to this patch but vectored IO [1] is
> > > coming soon; so the number of operations could be wrong since vectored
> > > IO could merge a couple of operations.
> >
> > Hmm.  I have not checked this patch series so I cannot say for sure,
> > but we'd likely just want to track the number of bytes if a single
> > operation has a non-equal size rather than registering in pg_stat_io N
> > rows with different op_bytes, no?
>
> Yes, that is correct.

I do not like the idea of having basically GROUP BY op_bytes in the
view (if that is the suggestion).

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most 

Re: initdb --no-locale=C doesn't work as specified when the environment is not C

2024-01-10 Thread Tom Lane
Kyotaro Horiguchi  writes:
> It seems somewhat intentional that only lc_messages references the
> environment at boot time. On the other hand, previously, in the
> absence of a specified locale, initdb would embed the environmental
> value in the configuration file, as it seems to be documented. Given
> that initdb is always used for cluster creation, it's unlikey that
> systems depend on this boot-time default for their operation.

Yeah, after further reflection there doesn't seem to be a lot of value
in leaving these entries commented-out, even in the cases where that's
technically correct.  Let's just go back to the old behavior of always
uncommenting them; that stood for years without complaints.  So I
committed your latest patch as-is.

> By the way, the lines around lc_* in the sample file seem to have
> somewhat inconsistent indentations. Wouldnt' it be preferable to fix
> this? (The attached doesn't that.)

They look all right if you assume the tab width is 8, which seems to
be what is used elsewhere in the file.  I think there's been some
prior discussion about whether to ban use of tabs at all in these
sample files, so as to reduce confusion about how wide the tabs are.
But I'm not touching that question today.

regards, tom lane




Re: Built-in CTYPE provider

2024-01-10 Thread Daniel Verite
Jeff Davis wrote:

> Attached a more complete version that fixes a few bugs

[v15 patch]

When selecting the builtin provider with initdb, I'm getting the
following setup:

$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
 
  The database cluster will be initialized with this locale configuration:
default collation provider:  builtin
default collation locale:C.UTF-8
LC_COLLATE:  C.UTF-8
LC_CTYPE:C.UTF-8
LC_MESSAGES: C.UTF-8
LC_MONETARY: C.UTF-8
LC_NUMERIC:  C.UTF-8
LC_TIME: C.UTF-8
  The default database encoding has accordingly been set to "UTF8".
  The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8

I would expect all LC_* variables to be fr_FR.UTF-8, and the default
text search configuration to be "french".  It is what happens
when selecting ICU as the provider in the same environment:

$ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

  Using language tag "en" for ICU locale "en".
  The database cluster will be initialized with this locale configuration:
default collation provider:  icu
default collation locale:en
LC_COLLATE:  fr_FR.UTF-8
LC_CTYPE:fr_FR.UTF-8
LC_MESSAGES: fr_FR.UTF-8
LC_MONETARY: fr_FR.UTF-8
LC_NUMERIC:  fr_FR.UTF-8
LC_TIME: fr_FR.UTF-8
  The default database encoding has accordingly been set to "UTF8".
  The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization.
The problem AFAIU is that --locale has two distinct
meanings in the v15 patch:
--locale-provider=X --locale=Y means use "X" as the provider
with "Y" as datlocale, and it means use "Y" as the locale for all
localized libc functionalities.

I wonder what would happen if invoking
 bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
on a system where C.UTF-8 does not exist as a libc locale.
Would it fail? (I don't have an OS like this to test ATM, will try later).

A related comment is about naming the builtin locale C.UTF-8, the same
name as in libc. On one hand this is semantically sound, but on the
other hand, it's likely to confuse people. What about using completely
different names, like "pg_unicode" or something else prefixed by "pg_"
both for the locale name and the collation name (currently
C.UTF-8/c_utf8)?



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: GUCifying MAX_WAL_SEND

2024-01-10 Thread Michael Paquier
On Thu, Jan 11, 2024 at 01:15:48AM +0330, Majid Garoosi wrote:
> Why do I think it can be beneficial to GUCify it?
> We use Postgres in K8s along with RBD disks. Today, I found out that
> because we use remote disks, it's better to read bigger chunks of data from
> the disk in one operation. In our setup (which I assume is not a silly
> setup), we had a query that took 4.5 seconds to execute on the primary
> server and ~21 seconds to send the WAL records to our synchronous standby
> server. After recompiling Postgres and setting MAX_WAL_SEND to 16MiB, that
> 21 seconds decreased to just 1.5 seconds, which is a 92.9% improvement.

I did not follow the older thread, but MAX_SEND_SIZE does not strike
me as a performance bottleneck in walsender.c when we use it for
calculations with the end LSN, so I can get behind the argument for
GUC-ifying it when reading WAL data in batches, even if it can be easy
to get wrong.  The hardcoded value goes back to 40f908bdcdc7 in 2010,
and hardware has evolved a lot since then.

So I'd +1 your suggestion.  An RBD setup sounds exotic to me for a
database, though :)
--
Michael


signature.asc
Description: PGP signature


Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-10 Thread Jelte Fennema-Nio
On Wed, 10 Jan 2024 at 22:12, Robert Haas  wrote:
> Could you explain why you think that the protocol version bump is necessary?

Patch 0006 adds a new protocol message to the protocol. Somehow the
client needs to be told that the server understands that message.
Using the protocol version seems like the best/simplest/cleanest way
to do that to me.

In theory we could add a dedicated protocol extension parameter (e.g.
_pq_.enable_protocol_level_set) that the client would need to set to
true before it would be able to use ParameterSet. But that just sounds
like introducing unnecessary complexity to me.

Bumping the protocol version carries exactly the same level of risk as
adding new protocol extension parameters. Both will always allow old
clients to connect to the newer server. And both also allow a new
client to connect to an old server just fine as well, as long as that
server has ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed (which was
introduced in PG11.0 and was backpatched to all then supported PG
versions).

> > It now limits the possible GUCs that can be changed to PGC_USERSET and
> > PGC_SUSET. If desired, we could limit it even further by using an
> > allowlist of reasonable GUCs or set a flag on GUCs that can be
> > "upgraded" . Things that seem at least reasonable to me are "role",
> > "session_authorization", "client_encoding".
>
> I don't know whether that limit helps anything or not, and you haven't
> really said why you imposed it.

The main reason I did this is to make sure that the required context
can only be hardened, not softened. E.g., it would be very bad if
PGC_POSTMASTER GUCs could suddenly be changed with a protocol message.
So it was meant more as fixing a bug than as significantly reducing the
number of GUCs this has an impact on.

> I'm still afraid that
> trying to allow this kind of nail-down for a broad range of GUCs (even
> if not all) is going to be messy. But I'm also plenty willing to
> listen to contrary opinions. I hear yours, but I wonder what others
> think? Tom particularly.

I wouldn't mind heavily reducing the GUCs that can be nailed down like
this. For my use case (connection pooling) I only really care about it
being possible to nail down session_authorization.

Honestly, I care more about patch 0010 than patch 0008. Patch 0008
simply seemed like the easiest way to demonstrate the ParameterSet
message.




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-10 Thread Melanie Plageman
On Wed, Jan 10, 2024 at 3:54 PM Robert Haas  wrote:
>
> On Tue, Jan 9, 2024 at 5:42 PM Melanie Plageman
>  wrote:
> > Done in attached v6.
>
> Don't kill me, but:
>
> +   /*
> +* For now, pass no_indexes == false
> regardless of whether or not
> +* the relation has indexes. In the future we
> may enable immediate
> +* reaping for on access pruning.
> +*/
> +   heap_page_prune(relation, buffer, vistest, false,
> +   , NULL);
>
> My previous comments about lying seem equally applicable here.

Yes, the options I can think of are:

1) rename the parameter to "immed_reap" or similar and make very clear
in heap_page_prune_opt() that you are not to pass true.
2) make immediate reaping work for on-access pruning. I would need a
low cost way to find out if there are any indexes on the table. Do you
think this is possible? Should we just rename the parameter for now
and think about that later?

>
> -   if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> +   if (!ItemIdIsUsed(itemid))
> continue;
>
> There is a bit of overhead here for the !no_indexes case. I assume it
> doesn't matter.

Right. Should be okay. Alternatively, and I'm not saying this is a
good idea, but we could throw this into the loop in heap_page_prune()
which calls heap_prune_chain():

+   if (ItemIdIsDead(itemid) && prstate.no_indexes)
+   {
+   heap_prune_record_unused(, offnum);
+   continue;
+   }

I think that is correct?
But, from a consistency perspective, we don't call the
heap_prune_record* functions directly from heap_page_prune() in any
other cases. And it seems like it would be nice to have all of those
in one place?

>  static void
>  heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum)
>  {
> +   /*
> +* If the relation has no indexes, we can remove dead tuples during
> +* pruning instead of marking their line pointers dead. Set this 
> tuple's
> +* line pointer LP_UNUSED. We hint that tables with indexes are more
> +* likely.
> +*/
> +   if (unlikely(prstate->no_indexes))
> +   {
> +   heap_prune_record_unused(prstate, offnum);
> +   return;
> +   }
>
> I think this should be pushed to the callers. Else you make the
> existing function name into another lie.

Yes, so I preferred it in the body of heap_prune_chain() (the caller).
Andres didn't like the extra level of indentation. I could wrap
heap_record_dead() and heap_record_unused(), but I couldn't really
think of a good wrapper name. heap_record_dead_or_unused() seems long
and literal. But, it might be the best I can do. I can't think of a
general word which encompasses both the transition to death and to
disposal.
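
To be concrete, a minimal sketch of such a wrapper (the name is the one
discussed above, not final):

static void
heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum)
{
	/*
	 * If the table has no indexes, the tuple can be set LP_UNUSED right
	 * away; otherwise mark its line pointer dead as before.
	 */
	if (prstate->no_indexes)
		heap_prune_record_unused(prstate, offnum);
	else
		heap_prune_record_dead(prstate, offnum);
}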

> +   boolrecordfreespace;
>
> Not sure if it's necessary to move this to an outer scope like this?
> The whole handling of this looks kind of confusing. If we're going to
> do it this way, then I think lazy_scan_prune() definitely needs to
> document how it handles this function (i.e. might set true to false,
> won't set false to true, also meaning). But are we sure we want to let
> a local variable with a weird name "leak out" like this?

Which function do you mean when you say "document how
lazy_scan_prune() handles this function". And no we definitely don't
want a variable like this to be hanging out in lazy_scan_heap(), IMHO.
The other patches in the stack move the FSM updates into
lazy_scan_[no]prune() and eliminate this parameter. I moved up the
scope because lazy_scan_noprune() already had recordfreespace as an
output parameter and initialized it unconditionally inside. I
initialize it unconditionally in lazy_scan_prune() as well. I mention
in the commit message that this is temporary because we plan to
eliminate recordfreespace as an output parameter by updating the FSM
in lazy_scan_[no]prune(). I could have stuck recordfreespace into the
LVPagePruneState with the other output parameters. But, leaving it as
a separate output parameter made the diffs lovely for the rest of the
patches in the stack.

> +   Assert(vacrel->do_index_vacuuming);
>
> Is this related?

Yes, previously the assert was:
Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
And we eliminated the caller of lazy_vacuum_heap_page() with
vacrel->nindexes == 0. Now it should only be called after doing index
vacuuming (thus index vacuuming should definitely be enabled).

- Melanie




GUCifying MAX_WAL_SEND

2024-01-10 Thread Majid Garoosi
Hello all,

*TL; DR*
There was a discussion about GUCifying the MAX_WAL_SEND constant in
walsender.c here nearly 7 years ago, but it resulted in nothing in the end.
Today, I found out that the configurability of this parameter can be very
helpful. So, I want to submit a patch for it, but I also want to hear your
comments first.

What is MAX_WAL_SEND?
It's the maximum size of WAL records that walsender reads from the disk and
then sends to the standby servers. Its current value is hardcoded as
XLOG_BLCKSZ * 16, which is 128KiB (8KiB * 16) by default.
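
For illustration, a minimal sketch of the kind of change involved (the GUC
name, unit and read-site code are assumptions of mine, not a patch):

/* today, hardcoded in walsender.c */
#define MAX_WAL_SEND (XLOG_BLCKSZ * 16)

/* a GUC-ified variant might look like */
int		max_wal_send_blocks = 16;		/* in units of XLOG_BLCKSZ */

/* and at the read site, instead of the constant: */
Size	max_send = (Size) max_wal_send_blocks * XLOG_BLCKSZ;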

Why do I think it can be beneficial to GUCify it?
We use Postgres in K8s along with RBD disks. Today, I found out that
because we use remote disks, it's better to read bigger chunks of data from
the disk in one operation. In our setup (which I assume is not a silly
setup), we had a query that took 4.5 seconds to execute on the primary
server and ~21 seconds to send the WAL records to our synchronous standby
server. After recompiling Postgres and setting MAX_WAL_SEND to 16MiB, that
21 seconds decreased to just 1.5 seconds, which is a 92.9% improvement.

Thank you for your comments in advance
Majid


Re: Stack overflow issue

2024-01-10 Thread Heikki Linnakangas

On 05/01/2024 19:23, Robert Haas wrote:

On Fri, Nov 24, 2023 at 10:47 AM Heikki Linnakangas  wrote:

What do you think?


At least for 0001 and 0002, I think we should just add the stack depth checks.

With regard to 0001, CommitTransactionCommand() and friends are hard
enough to understand as it is; they need "goto" like I need an extra
hole in my head.

With regard to 0002, this function isn't sufficiently important to
justify adding special-case code for an extremely rare event. We
should just handle it the way we do in general.

I agree that in the memory-context case it might be worth expending
some more code to be more clever. But I probably wouldn't do that for
MemoryContextStats(); check_stack_depth() seems fine for that one.

In general, I think we should try to keep the number of places that
handle stack overflow in "special" ways as small as possible.


The problem with CommitTransactionCommand() (or rather 
AbortCurrentTransaction(), which has the same problem) 
and ShowTransactionStateRec() is that they get called in a state where 
aborting can lead to a panic. If you add a check_stack_depth() to them 
and try the reproducer scripts that Egor posted, you still get a panic.


I'm not sure if MemoryContextStats() could safely elog(ERROR). But at 
least it would mask the "out of memory" that caused the stats to be 
printed in the first place.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-10 Thread Robert Haas
On Mon, Jan 8, 2024 at 5:52 PM Jelte Fennema-Nio  wrote:
> It's still bumping the protocol version. I think this is a necessity
> with the current changeset, since ParameterSet now applies to plain
> GUCs too and. As I clarified in a previous email, this does **not**
> break old clients, since the server silently downgrades when an older
> protocol version is requested.

Could you explain why you think that the protocol version bump is necessary?

> > Second, I don't really like the idea of selectively turning GUCs into
> > protocol-managed parameters. I think there are a relatively small
> > number of things that we'd ever want to manage on a protocol level,
> > but this design would force us to make it work for any GUC whatsoever.
>
> It now limits the possible GUCs that can be changed to PGC_USERSET and
> PGC_SUSET. If desired, we could limit it even further by using an
> allowlist of reasonable GUCs or set a flag on GUCs that can be
> "upgraded" . Things that seem at least reasonable to me are "role",
> "session_authorization", "client_encoding".

I don't know whether that limit helps anything or not, and you haven't
really said why you imposed it. Personally, I think that the login
role should be changeable via a protocol message and make it just as
if we'd logged in using the selected role originally, except that a
further protocol message can change it again (when not in
transaction). SET ROLE and SET SESSION AUTHORIZATION would behave in
accordance with the idea that it was the originally selected role.
Then, a connected client can't distinguish between being directly
connected to the database in a session created for that role and being
connected to a pooler which has used this protocol message to create a
session that is effectively for that role. With your design, the
client can see behavioral differences between those cases.

I agree that client_encoding feels like a protocol parameter rather
than a GUC as we know them today. How to get there with reasonable
impact on backward compatibility, I'm not sure. I'm still afraid that
trying to allow this kind of nail-down for a broad range of GUCs (even
if not all) is going to be messy. But I'm also plenty willing to
listen to contrary opinions. I hear yours, but I wonder what others
think? Tom particularly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-10 Thread Robert Haas
On Tue, Jan 9, 2024 at 5:42 PM Melanie Plageman
 wrote:
> Done in attached v6.

Don't kill me, but:

+   /*
+* For now, pass no_indexes == false
regardless of whether or not
+* the relation has indexes. In the future we
may enable immediate
+* reaping for on access pruning.
+*/
+   heap_page_prune(relation, buffer, vistest, false,
+   , NULL);

My previous comments about lying seem equally applicable here.

-   if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+   if (!ItemIdIsUsed(itemid))
continue;

There is a bit of overhead here for the !no_indexes case. I assume it
doesn't matter.

 static void
 heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum)
 {
+   /*
+* If the relation has no indexes, we can remove dead tuples during
+* pruning instead of marking their line pointers dead. Set this tuple's
+* line pointer LP_UNUSED. We hint that tables with indexes are more
+* likely.
+*/
+   if (unlikely(prstate->no_indexes))
+   {
+   heap_prune_record_unused(prstate, offnum);
+   return;
+   }

I think this should be pushed to the callers. Else you make the
existing function name into another lie.

+   boolrecordfreespace;

Not sure if it's necessary to move this to an outer scope like this?
The whole handling of this looks kind of confusing. If we're going to
do it this way, then I think lazy_scan_prune() definitely needs to
document how it handles this function (i.e. might set true to false,
won't set false to true, also meaning). But are we sure we want to let
a local variable with a weird name "leak out" like this?

+   Assert(vacrel->do_index_vacuuming);

Is this related?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Create shorthand for including all extra tests

2024-01-10 Thread Peter Eisentraut

On 05.09.23 19:26, Nazir Bilal Yavuz wrote:

Thanks for the feedback! I updated the patch, 'needs-private-lo'
option enables kerberos, ldap, load_balance and ssl extra tests now.


As was discussed, I don't think "needs private lo" is the only condition 
for these tests.  At least kerberos and ldap also need extra software 
installed, and load_balance might need editing the system's hosts file. 
So someone would still need to familiarize themselves with these tests 
individually before setting a global option like this.


Also, if we were to create test groupings like this, I think the 
implementation should be different.  The way you have it, there is a 
sort of central registry of all affected tests in 
src/test/perl/PostgreSQL/Test/Utils.pm and a mapping of groups to tests. 
 I would prefer a more decentralized approach where each test decides 
on its own whether to run, with pseudo-code conditionals like


if (!(PG_TEST_EXTRA contains "ldap" or PG_TEST_EXTRA contains 
"needs-private-lo"))

  skip_all

Anyway, at the moment, I don't see a sensible way to group these things 
beyond what we have now (effectively, "ldap" is already a group, because 
it affects more than one test suite).  Right now, we have six possible 
values, which is probably just about doable to keep track of manually. 
If we get a lot more, then we need to look into this again, but maybe 
then we'll also have more patterns to group things around.






Re: Streaming I/O, vectored I/O (WIP)

2024-01-10 Thread Heikki Linnakangas

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.


How will these functions work in the brave new async I/O world? I assume 
PrepareReadBuffer() will initiate the async I/O, and 
CompleteReadBuffers() will wait for it to complete. How about 
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and 
WaitBufferRead()?


About the signature of those functions: Does it make sense for the 
CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector 
of buffers? If StartReadBuffer() initiates the async I/O immediately, is 
there any benefit to batching the waiting?


If StartReadBuffer() starts the async I/O, the idea that you can call 
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think 
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for 
you in RBM_ZERO_* modes.



Putting all that together, I propose:

/*
 * Initiate reading a block from disk to the buffer cache.
 *
 * XXX: Until we have async I/O, this just allocates the buffer in the
 * buffer cache. The actual I/O happens in WaitReadBuffer().
 */
Buffer
StartReadBuffer(BufferManagerRelation bmr,
                ForkNumber forkNum,
                BlockNumber blockNum,
                BufferAccessStrategy strategy,
                ReadBufferMode mode,
                bool *foundPtr);

/*
 * Wait for a read that was started earlier with StartReadBuffer() to
 * finish.
 *
 * XXX: Until we have async I/O, this is the function that actually
 * performs the I/O. If multiple I/Os have been started with
 * StartReadBuffer(), this will try to perform all of them in one syscall.
 * Subsequent calls to WaitReadBuffer(), for those other buffers, will
 * finish quickly.
 */
void
WaitReadBuffer(Buffer buf);
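
For what it's worth, a sketch of how I imagine a caller would use the
proposed API (illustrative only, not working code):

bool	found;
Buffer	buf;

buf = StartReadBuffer(BMR_REL(rel), MAIN_FORKNUM, blkno,
					  NULL, RBM_NORMAL, &found);
if (!found)
	WaitReadBuffer(buf);	/* for now, this is where the actual I/O happens */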


I'm not sure how well this fits with the streaming read API. The 
streaming read code groups adjacent blocks into one 
CompleteReadBuffers() call. If WaitReadBuffer() does the batching, 
that's not really required. But does that make sense with async I/O? 
With async I/O, will you need a vectorized version of StartReadBuffer() too?


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Make psql ignore trailing semicolons in \sf, \ef, etc

2024-01-10 Thread Tom Lane
Laurenz Albe  writes:
> On Mon, 2024-01-08 at 15:48 -0500, Tom Lane wrote:
>> Is this enough of a bug to deserve back-patching?  I'm not sure.

> I like the patch, but I wouldn't back-patch it.  I'd call the current
> behavior a slight inconsistency rather than an outright bug, and I think
> that we should be conservative with back-patching.

Nobody spoke in favor of back-patching, so committed to HEAD only.

regards, tom lane




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-10 Thread Jelte Fennema-Nio
v6 attached with the following changes:

1. Fixed rebase conflicts with master

2. removed PGC_S_PROTOCOL (but kept PGC_PROTOCOL and PGC_SU_PROTOCOL).
This extra source level was not needed. And after some more testing I
realized this extra source level even caused problems, since protocol
messages could not override values set by SET commands anymore.

3. Added a new patch (0010) with a protocol parameter to configure
which GUCs are GUC_REPORT. This is partially to show that the GUC
interface makes sense for protocol parameters, but also because this
would be an extremely useful feature for connection poolers. And [2]
would be able to use this too.

4. Don't error, but only warn, if a GUC provided to
_pq_.protocol_managed_params is unknown. It seemed useful to be able
to specify GUCs in this list that not all Postgres versions support in
the StartupMessage, without having to guess what postgres version
you're going to connect to.

[2]: 
https://www.postgresql.org/message-id/flat/cafj8prbfu-wzzqhnrwrhn67n0ug8a9-0-9boo69pptchibd...@mail.gmail.com


v6-0003-Prepare-server-code-for-addition-of-protocol-exte.patch
Description: Binary data


v6-0002-libpq-Handle-NegotiateProtocolVersion-message-mor.patch
Description: Binary data


v6-0004-libpq-Include-minor-version-number-in-PQprotocolV.patch
Description: Binary data


v6-0001-libpq-Remove-instance-of-hardcoded-protocol-versi.patch
Description: Binary data


v6-0005-Bump-protocol-version-to-3.1.patch
Description: Binary data


v6-0006-Add-protocol-message-to-change-parameters.patch
Description: Binary data


v6-0008-Add-_pq_.protocol_managed_params-protocol-extensi.patch
Description: Binary data


v6-0009-Add-tests-for-ParameterSet-and-_pq_.protocol_mana.patch
Description: Binary data


v6-0010-Add-_pq_.report_parameters-protocol-extension.patch
Description: Binary data


v6-0007-Add-GUC-contexts-for-protocol-extensions.patch
Description: Binary data


Re: Postgres Partitions Limitations (5.11.2.3)

2024-01-10 Thread Laurenz Albe
On Wed, 2024-01-10 at 13:41 +0100, Magnus Hagander wrote:
> It still reads a bit weird to me. How about the attached wording instead?

Thanks!  I am fine with your wording.

Yours,
Laurenz Albe




Re: [PATCH] Add sortsupport for range types and btree_gist

2024-01-10 Thread Bernd Helmle
Am Mittwoch, dem 10.01.2024 um 20:13 +0500 schrieb Andrey M. Borodin:
> I think GiST sortsupport comments are more relevant, so there's no
> need for this pfree()s.
> 
> Also, please check other thread, maybe you will find some useful code
> there [0,1]. It was committed[2] once, but reverted. Please make sure
> that corrections made there are taken into account in your patch.
> 

At least, I believe we have the same problem described here (many
thanks Andrey for the links, I wasn't aware of this discussion):

https://www.postgresql.org/message-id/98b34b51-a6db-acc4-1bcf-a29caf69bbc7%40iki.fi

> Thanks for working on this!

Absolutely. This patch needs input ... 






Re: Add BF member koel-like indentation checks to SanityCheck CI

2024-01-10 Thread Tristan Partin

On Wed Jan 10, 2024 at 1:58 AM CST, Tom Lane wrote:

Bharath Rupireddy  writes:
> IMO, running the
> pgindent in at least one of the CI systems if not all (either as part
> task SyanityCheck or task Linux - Debian Bullseye - Autoconf) help
> catches things early on in CF bot runs itself. This saves committers
> time but at the cost of free run-time that cirrus-ci provides.

But that puts the burden of pgindent-cleanliness onto initial patch
submitters, which I think is the wrong thing for reasons mentioned
upthread.  We want to enforce this at commit into the master repo, but
I fear enforcing it earlier will drive novice contributors away for
no very good reason.


If we are worried about turning away novice contributors, there are much 
bigger fish to fry than worrying if we will turn them away due to 
requiring code to be formatted a certain way. Like I said earlier, 
formatters are pretty common tools to be using these days. go fmt, deno 
fmt, rustfmt, prettier, black, clang-format, uncrustify, etc.


Code formatting requirements are more likely to turn someone away from 
contributing if code reviews are spent making numerous comments. 
Luckily, we can just say "run this command line incantation of 
pgindent," which in the grand scheme of things is easy compared to all 
the other things you have to be aware of to contribute to Postgres.


--
Tristan Partin
Neon (https://neon.tech)




Re: [PATCH] Add sortsupport for range types and btree_gist

2024-01-10 Thread Bernd Helmle
Am Mittwoch, dem 10.01.2024 um 08:00 +0800 schrieb jian he:


> you do  `CREATE INDEX ON pgbench_accounts USING gist(aid);`
> but the original patch didn't change contrib/btree_gist/btree_int4.c
> So I doubt your benchmark is related to the original patch.
> or maybe I missed something.
> 

The patch originally does this:

+ALTER OPERATOR FAMILY gist_int4_ops USING gist ADD
+FUNCTION 11  (int4, int4) btint4sortsupport (internal) ;

This adds a sortsupport function for int4 as well. We reuse the existing
btint4sortsupport() function, so there is no need to change btree_int4.c.

> also per doc:
> `
> sortsupport
> Returns a comparator function to sort data in a way that preserves
> locality. It is used by CREATE INDEX and REINDEX commands. The
> quality
> of the created index depends on how well the sort order determined by
> the comparator function preserves locality of the inputs.
> `
> from the doc, add sortsupport function will only influence index
> build time?
> 

That's the point of this patch, though it influences the index quality
in a way which seems to cause the performance regression measured
upthread.

> 
> per https://www.postgresql.org/docs/current/gist-extensibility.html
> QUOTE:
> All the GiST support methods are normally called in short-lived
> memory
> contexts; that is, CurrentMemoryContext will get reset after each
> tuple is processed. It is therefore not very important to worry about
> pfree'ing everything you palloc. However, in some cases it's useful
> for a support method to
> ENDOF_QUOTE
> 
> so removing the following part should be OK.
> + if ((Datum) range_a != a)
> + pfree(range_a);
> +
> + if ((Datum) range_b != b)
> + pfree(range_b);
> 

Probably; I think we get different range objects here in the case of
detoasting. 

> comparison solely on the lower bounds looks strange to me.
> if lower bound is the same, then compare upper bound, so the
> range_gist_cmp function is consistent with function range_compare.
> so following change:
> 
> + else
> + result = range_cmp_bounds(typcache, , );
> to
> `
> else
> {
> result = range_cmp_bounds(typcache, , );
> if (result == 0)
> result = range_cmp_bounds(typcache, , );
> }
> `
> 
> does contrib/btree_gist/btree_gist--1.7--1.8.sql function be declared
> as strict ? (I am not sure)
> other than that, the whole patch looks good.
> 
> 

That's surely something to consider.





Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-10 Thread Bertrand Drouvot
Hi,

On Wed, Jan 10, 2024 at 05:00:01PM +0300, Alexander Lakhin wrote:
> 10.01.2024 12:46, Bertrand Drouvot wrote:
> 
> > Would it be possible to also send the standby logs?
> 
> Yes, please look at the attached logs. This time I've build postgres with
> -DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
> wal_keep_size=1GB
> wal_debug = on
> log_autovacuum_min_duration = 0
> log_statement = 'all'
> log_min_messages = INFO
> 
> The archive attached contains logs from four runs:
> recovery-1-ok -- an example of successful run for reference
> recovery-7-pruning and recovery-19-pruning -- failures with a failed
>  subtest 'activeslot slot invalidation is logged with on-access pruning'
> recovery-15-vacuum_pg_class -- a failure with a failed
>  subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

Thanks a lot for the testing!

> is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at 
recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In case of an active slot we first call ReportSlotInvalidation() with the second
parameter set to true (to emit the "terminating" message), then SIGTERM the
active process, and then (later) we should make the other
ReportSlotInvalidation() call with the second parameter set to false (to emit
the message that we don't see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second 
ReportSlotInvalidation()
call. 

The thing is that it looks like we exited the loop in
InvalidatePossiblyObsoleteSlot(), because there are more messages from the
startup process (789618) after the:


"
2024-01-10 11:00:29.109 UTC [789618] LOG:  terminating process 790377 to 
release replication slot "row_removal_activeslot"
"

one.

Do you think you could try to add more debug messages in 
InvalidatePossiblyObsoleteSlot()
to understand why the second call to ReportSlotInvalidation() is not done and 
IIUC
why/how we exited the loop?

FWIW, I did try to reproduce by launching pg_recvlogical and then kill -SIGSTOP
it. Then producing a conflict, I'm able to get the first message and not the 
second
one (which is expected). But the startup process does not exit the loop, which 
is
expected here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Relation bulk write facility

2024-01-10 Thread Robert Haas
On Fri, Nov 24, 2023 at 10:22 PM Heikki Linnakangas  wrote:
> Yeah, I'm not very happy with this interface. The model is that you get
> a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
> over to bulkw_write(), which takes ownership of it and frees it later.
> There is no other function to free it, although currently the buffer is
> just palloc'd so you could call pfree on it.

I think we should try to pick prefixes that are one or more words
rather than using word fragments. bulkw is an awkward prefix even for
people whose first language is English, and probably more awkward for
others.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Robert Haas
Thanks for jumping in with a review, Matthias!

On Wed, Jan 10, 2024 at 8:03 AM Matthias van de Meent
 wrote:
> I'm not 100% sure on the policy of this, but theoretically you could
> use LockAquireExtended(dontWait=true) while holding a spin lock, as
> that would not have an unknown duration. Then again, this function
> also does elog/ereport, which would cause issues, still, so this code
> may be the better option.

This is definitely not allowable, and anybody who is thinking about
doing it should replace the spinlock with an LWLock.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Custom explain options

2024-01-10 Thread Andrei Lepikhov

On 10/1/2024 20:27, Konstantin Knizhnik wrote:


On 10/01/2024 8:46 am, Michael Paquier wrote:

On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:
What do you think about this really useful feature? Do you wish to 
develop

it further?

I am biased here.  This seems like a lot of code for something we've
been delegating to the explain hook for ages.  Even if I can see the
appeal of pushing that more into explain.c to get more data on a
per-node basis depending on the custom options given by the caller of
an EXPLAIN entry point, I cannot get really excited about the extra
maintenance this facility would involve compared to the potential
gains, knowing that there's a hook.
--
Michael



Well, I am not sure that the proposed patch is flexible enough to handle all 
possible scenarios.
I just wanted to make it as simple as possible, to leave some chance of it 
being merged.
But it is easy to answer the question of why the existing explain hook is not 
enough:


1. It doesn't allow adding extra options to EXPLAIN. My intention 
was to be able to do something like "explain 
(analyze, buffers, prefetch) ...". That is simply not possible with the 
explain hook.
I agree. Designing mostly planner-related extensions, I also wanted to 
add some information to the explain of nodes. For example, 
pg_query_state could add the status of the node at the time of 
interruption of execution: started, stopped, or loop closed.
Maybe we should gather some statistics on how developers of extensions 
deal with that issue ...


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Multidimensional Histograms

2024-01-10 Thread Andrei Lepikhov

On 8/1/2024 16:21, Alexander Cheshev wrote:

Hi Andrei,


Maybe my wording needed to be more precise. I didn't implement
multidimensional histograms before, so I don't know how expensive they
are. I meant that for dependency statistics over about six columns, we
have a lot of combinations to compute.


Equi-Depth Histogram in a 6 dimensional case requires 6 times more
iterations. Postgres already uses Equi-Depth Histogram. Even if you
increase the number of buckets from 100 to 1000 then there will be no
overhead as the time complexity of Equi-Depth Histogram has no
dependence on the number of buckets. So, no overhead at all!


Maybe. For three columns, we have 9 combinations (passes) for building
dependency statistics and 4 combinations for ndistincts; for six
columns, we have 186 and 57 combinations respectively.
Even remembering that a dependency is just one number per combination,
building the dependency statistics is still massive work. So, in the
multicolumn case, having something like a histogram may be more effective.
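
A quick way to see where these numbers come from: for k columns, each
attribute can be the dependent one with any non-empty subset of the remaining
k-1 attributes as determinant, and ndistinct covers every subset of two or
more columns. A tiny standalone C check of those formulas:

#include <stdio.h>

static unsigned long dependencies(unsigned k) { return k * ((1UL << (k - 1)) - 1); }
static unsigned long ndistincts(unsigned k)   { return (1UL << k) - k - 1; }

int
main(void)
{
	/* prints 9/4 for k=3 and 186/57 for k=6, matching the numbers above */
	for (unsigned k = 3; k <= 6; k++)
		printf("k=%u: dependencies=%lu ndistinct=%lu\n",
			   k, dependencies(k), ndistincts(k));
	return 0;
}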


--
regards,
Andrei Lepikhov
Postgres Professional





Re: [PATCH] Add sortsupport for range types and btree_gist

2024-01-10 Thread Andrey M. Borodin



> On 10 Jan 2024, at 19:18, jian he  wrote:
> 
> what I am confused about:
> In fmgr.h
> 
> /*
> * Support for cleaning up detoasted copies of inputs.  This must only
> * be used for pass-by-ref datatypes, and normally would only be used
> * for toastable types.  If the given pointer is different from the
> * original argument, assume it's a palloc'd detoasted copy, and pfree it.
> * NOTE: most functions on toastable types do not have to worry about this,
> * but we currently require that support functions for indexes not leak
> * memory.
> */
> #define PG_FREE_IF_COPY(ptr,n) \
> do { \
> if ((Pointer) (ptr) != PG_GETARG_POINTER(n)) \
> pfree(ptr); \
> } while (0)
> 
> but the doc (https://www.postgresql.org/docs/current/gist-extensibility.html)
> says:
> All the GiST support methods are normally called in short-lived memory
> contexts; that is, CurrentMemoryContext will get reset after each
> tuple is processed. It is therefore not very important to worry about
> pfree'ing everything you palloc.
> ENDOF_QUOTE
> 
> so I am not sure whether, in range_gist_cmp, we need the following:
> `
> if ((Datum) range_a != a)
> pfree(range_a);
> if ((Datum) range_b != b)
> pfree(range_b);
> `

I think the GiST sortsupport comments are more relevant here, so there's no need for
these pfree()s.

Also, please check the other thread; maybe you will find some useful code there
[0,1]. It was committed [2] once, but reverted. Please make sure that the
corrections made there are taken into account in your patch.

Thanks for working on this!


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/31/2824/
[1] 
https://www.postgresql.org/message-id/flat/285041639646332%40sas1-bf93f9015d57.qloud-c.yandex.net#0e5b4ed57d861d38a3d836c9ec09c0c5
[2] 
https://github.com/postgres/postgres/commit/9f984ba6d23dc6eecebf479ab1d3f2e550a4e9be





Re: POC: Extension for adding distributed tracing - pg_tracing

2024-01-10 Thread Jelte Fennema-Nio
On Wed, 10 Jan 2024 at 15:55, Anthonin Bonnefoy
 wrote:
> PREPARE test_prepared (text, integer) AS /*$1*/ SELECT $2;
> EXECUTE 
> test_prepared('dddbs=''postgres.db'',traceparent=''00-0009-0009-01''',
> 1);

Wow, I had no idea that parameters could be expanded in comments. I'm
honestly wondering if that is supported on purpose.




Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Alexander Lakhin

10.01.2024 13:37, Amit Kapila wrote:

But tomorrow it could be for other tables, and if we change this
TRUNCATE logic for pg_largeobject (of which the chances are low) then
there is always a chance that one misses changing this comment. I feel
keeping it generic in this case would be better, as the problem is
generic but it is currently shown for pg_largeobject.


Yes, for sure. So let's keep it generic as you prefer.

Thank you!

Best regards,
Alexander




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-10 Thread Bharath Rupireddy
On Fri, Jan 5, 2024 at 7:20 AM Jeff Davis  wrote:
>
> On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
> > Thanks. Attaching remaining patches as v18 patch-set after commits
> > c3a8e2a7cb16 and 766571be1659.
>
> Comments:

Thanks for reviewing.

> I still think the right thing for this patch is to call
> XLogReadFromBuffers() directly from the callers who need it, and not
> change WALRead(). I am open to changing this later, but for now that
> makes sense to me so that we can clearly identify which callers benefit
> and why. I have brought this up a few times before[1][2], so there must
> be some reason that I don't understand -- can you explain it?

IMO, WALRead() is the best place to have XLogReadFromBuffers() for 2
reasons: 1) All of the WALRead() callers (except FRONTEND tools) will
benefit if WAL is read from WAL buffers. I don't see any reason for a
caller to skip reading from WAL buffers. If there's a caller (in
future) wanting to skip reading from WAL buffers, I'm open to adding a
flag in XLogReaderState to skip.  2) The amount of code is reduced if
XLogReadFromBuffers() sits in WALRead().
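
To illustrate the shape I mean, here is a rough sketch only (not the patch;
WALReadFromFile() is a made-up name standing in for the existing file-read
path, and the signatures are simplified):

static bool
WALReadSketch(XLogReaderState *state, char *buf, XLogRecPtr startptr,
			  Size count, TimeLineID tli)
{
	/* Try the in-memory WAL buffers first; this may be a short read. */
	Size		nread = XLogReadFromBuffers(buf, startptr, count, tli);

	if (nread == count)
		return true;			/* fully served from WAL buffers */

	/* Fall back to reading the remainder from the WAL segment files. */
	return WALReadFromFile(state, buf + nread, startptr + nread,
						   count - nread, tli);
}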

> The XLogReadFromBuffersResult is never used. I can see how it might be
> useful for testing or asserts, but it's not used even in the test
> module. I don't think we should clutter the API with that kind of thing
> -- let's just return the nread.

Removed.

> I also do not like the terminology "partial hit" to be used in this
> way. Perhaps "short read" or something about hitting the end of
> readable WAL would be better?

"short read" seems good. Done that way in the new patch.

> I like how the callers of WALRead() are being more precise about the
> bytes they are requesting.
>
> You've added several spinlock acquisitions to the loop. Two explicitly,
> and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
> to read slightly further, but introduce performance risk. Was this
> discussed?

I opted to read slightly further thinking that the loops aren't going
to get longer for spinlocks to appear costly. Basically, I wasn't sure
which approach was the best. Now that there's an opinion to keep them
outside, I'd agree with it. Done that way in the new patch.

> The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
> seems like there's a risk of getting partially-written data? And it's
> not clear to me the check of the wal page headers is the right one
> anyway.
>
> It seems like all of this would be simpler if you checked first how far
> you can safely read data, and then just loop and read that far.  I'm not
> sure that it's worth it to try to mix the validity checks with the
> reading of the data.

XLogReadFromBuffers needs the page header check after reading the
page from WAL buffers. Typically, we must not read a WAL buffer page
that has just been initialized, because we waited above for the
in-progress WAL insertions to finish. However, there can still be a
slight window after that wait finishes in which the buffer page being
read gets replaced, especially under high WAL generation rates.
After all, we are reading from WAL buffers without holding any locks here.
So, let's not count such a page.
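
To make that concrete, here is a minimal sketch (not the patch itself, and
the function name is made up for illustration) of the kind of validation
meant above: after copying a page out of the WAL buffers with no lock held,
verify that the copy really is the page that was asked for; if the buffer got
recycled under us, its header carries a different page address and the copy
must be discarded.

#include "access/xlog_internal.h"

static bool
wal_buffer_page_is_wanted(char *page_copy, XLogRecPtr expected_pageaddr)
{
	XLogPageHeader hdr = (XLogPageHeader) page_copy;

	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
		return false;			/* not a valid WAL page header at all */
	if (hdr->xlp_pageaddr != expected_pageaddr)
		return false;			/* buffer was replaced after our wait */
	return true;
}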

I've addressed the above review comments and attached v19 patch-set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From e03af5726957437c15361bdb1b373fe8982f5c7c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 10 Jan 2024 14:12:17 +
Subject: [PATCH v19] Allow WAL reading from WAL buffers

This commit gives postgres the capability to read WAL from WAL
buffers. When the requested WAL isn't available in the WAL buffers,
the WAL is read from the WAL file as usual.

This commit benefits the callers of WALRead(), which are
walsenders, pg_walinspect etc. They can all now avoid reading WAL
from the WAL file (possibly avoiding disk IO). Tests show that the
WAL buffers hit ratio stood at 95% for 1 primary, 1 sync standby,
1 async standby, with pgbench --scale=300 --client=32 --time=900.
In other words, 95% of the time the walsenders avoided reading from
the file and thus avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers puts back the performance close to
that of without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in the future:
- Improving synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh

Re: [PATCH] Add sortsupport for range types and btree_gist

2024-01-10 Thread jian he
On Wed, Jan 10, 2024 at 8:00 AM jian he  wrote:
>
> `
> from the doc, add sortsupport function will only influence index build time?
>
> +/*
> + * GiST sortsupport comparator for ranges.
> + *
> + * Operates solely on the lower bounds of the ranges, comparing them using
> + * range_cmp_bounds().
> + * Empty ranges are sorted before non-empty ones.
> + */
> +static int
> +range_gist_cmp(Datum a, Datum b, SortSupport ssup)
> +{
> + RangeType *range_a = DatumGetRangeTypeP(a);
> + RangeType *range_b = DatumGetRangeTypeP(b);
> + TypeCacheEntry *typcache = ssup->ssup_extra;
> + RangeBound lower1,
> + lower2;
> + RangeBound upper1,
> + upper2;
> + bool empty1,
> + empty2;
> + int result;
> +
> + if (typcache == NULL) {
> + Assert(RangeTypeGetOid(range_a) == RangeTypeGetOid(range_b));
> + typcache = lookup_type_cache(RangeTypeGetOid(range_a), 
> TYPECACHE_RANGE_INFO);
> +
> + /*
> + * Cache the range info between calls to avoid having to call
> + * lookup_type_cache() for each comparison.
> + */
> + ssup->ssup_extra = typcache;
> + }
> +
> + range_deserialize(typcache, range_a, &lower1, &upper1, &empty1);
> + range_deserialize(typcache, range_b, &lower2, &upper2, &empty2);
> +
> + /* For b-tree use, empty ranges sort before all else */
> + if (empty1 && empty2)
> + result = 0;
> + else if (empty1)
> + result = -1;
> + else if (empty2)
> + result = 1;
> + else
> + result = range_cmp_bounds(typcache, &lower1, &lower2);
> +
> + if ((Datum) range_a != a)
> + pfree(range_a);
> +
> + if ((Datum) range_b != b)
> + pfree(range_b);
> +
> + return result;
> +}
>
> per https://www.postgresql.org/docs/current/gist-extensibility.html
> QUOTE:
> All the GiST support methods are normally called in short-lived memory
> contexts; that is, CurrentMemoryContext will get reset after each
> tuple is processed. It is therefore not very important to worry about
> pfree'ing everything you palloc. However, in some cases it's useful
> for a support method to
> ENDOF_QUOTE
>
> so removing the following part should be OK.
> + if ((Datum) range_a != a)
> + pfree(range_a);
> +
> + if ((Datum) range_b != b)
> + pfree(range_b);
>
> Comparison solely on the lower bounds looks strange to me.
> If the lower bound is the same, then compare the upper bound, so that the
> range_gist_cmp function is consistent with the range_compare function.
> So the following change:
>
> + else
> + result = range_cmp_bounds(typcache, &lower1, &lower2);
> to
> `
> else
> {
> result = range_cmp_bounds(typcache, &lower1, &lower2);
> if (result == 0)
> result = range_cmp_bounds(typcache, &upper1, &upper2);
> }
> `
>
> should the function in contrib/btree_gist/btree_gist--1.7--1.8.sql be declared
> as strict? (I am not sure)
> other than that, the whole patch looks good.

The original author's email address (christoph.he...@cybertec.at) bounces
with "Address not found", so I didn't include it.

I split the original author's patch into 2.
1. Add a GiST sortsupport function for all the btree_gist module data
types except the anyrange data type (which is not actually part of this
module)
2. Add a GiST sortsupport function for the anyrange data type.

What changed compared to the original patch:
1. The original patch missed some operator classes for data
types in the btree_gist module, so I added them.
A sortsupport function is now added for all the following data types in btree_gist:

int2,int4,int8,float4,float8,numeric
timestamp with time zone,
timestamp without time zone, time with time zone, time without time zone, date
interval, oid, money, char
varchar, text, bytea, bit, varbit
macaddr, macaddr8, inet, cidr, uuid, bool, enum

2. range_gist_cmp: the GiST range sortsupport function. It looks like
range_cmp, but the range typcache is cached in ssup_extra,
so we don't need to call lookup_type_cache repeatedly.
Refactor: as mentioned above, if the range lower bounds are the same
then the upper bounds are compared.
I also refactored the comment.

What I am confused about:
In fmgr.h

/*
 * Support for cleaning up detoasted copies of inputs.  This must only
 * be used for pass-by-ref datatypes, and normally would only be used
 * for toastable types.  If the given pointer is different from the
 * original argument, assume it's a palloc'd detoasted copy, and pfree it.
 * NOTE: most functions on toastable types do not have to worry about this,
 * but we currently require that support functions for indexes not leak
 * memory.
 */
#define PG_FREE_IF_COPY(ptr,n) \
do { \
if ((Pointer) (ptr) != PG_GETARG_POINTER(n)) \
pfree(ptr); \
} while (0)

but the doc (https://www.postgresql.org/docs/current/gist-extensibility.html)
 says:
All the GiST support methods are normally called in short-lived memory
contexts; that is, CurrentMemoryContext will get reset after each
tuple is processed. It is therefore not very important to worry about
pfree'ing everything you palloc.
ENDOF_QUOTE

so I am not sure whether, in range_gist_cmp, we need the following:
`
if ((Datum) range_a != a)
pfree(range_a);
if ((Datum) range_b != b)
pfree(range_b);
`
From fb85f80ff416f902fde3d80654dfb61b4479f6ec Mon Sep 17 00:00:00 2001
From: jian he 
Date: Wed, 10 Jan 2024 15:10:29 +0800
Subject: [PATCH v5 2/2] Add 

Re: add function argument names to regex* functions.

2024-01-10 Thread jian he
On Tue, Jan 9, 2024 at 8:52 AM Dian Fay  wrote:
>
> On Mon Jan 8, 2024 at 9:26 AM EST, jian he wrote:
> > On Mon, Jan 8, 2024 at 8:44 AM Dian Fay  wrote:
> > > The `regexp_replace` summary in table 9.10 is mismatched and still
> > > specifies the first parameter name as `string` instead of `source`.
> > > Since all the other functions use `string`, should `regexp_replace` do
> > > the same or is this a case where an established "standard" diverges?
> >
> > got it. Thanks for pointing it out.
> >
> > in functions-matching.html
> > if I change source to
> > string then
> > there are no markup "string" and markup "string", it's kind of
> > slightly confusing.
> >
> > So does the following refactored description of regexp_replace make sense:
> >
> >  The string is returned unchanged if
> >  there is no match to the pattern.  If there 
> > is a
> >  match, the string is returned with the
> >  replacement string substituted for the 
> > matching
> >  substring.  The replacement string can 
> > contain
> >  \n, where
> > n is 1
> >  through 9, to indicate that the source substring matching the
> >  n'th parenthesized subexpression of
> > the pattern should be
> >  inserted, and it can contain \ to indicate 
> > that the
> >  substring matching the entire pattern should be inserted.  Write
> >  \\ if you need to put a literal backslash in
> > the replacement
> >  text.
>
> That change makes sense to me! I'll see about the section refactoring
> after this lands.

I put the changes into the new patch.
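
As a quick illustration of what the argument names enable (assuming the
parameter names shown above: string, pattern, replacement, flags), the
functions can then be called with named notation, for example:

SELECT regexp_replace(string => 'foobarbaz',
                      pattern => 'b(..)',
                      replacement => 'X\1Y',
                      flags => 'g');
-- returns 'fooXarYXazY'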
From 8fa04ed1fecb48ca8254d2ed7e60a9013fa130a3 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Wed, 10 Jan 2024 17:46:18 +0800
Subject: [PATCH v3 1/1] add function argument names to regex.* functions.

Specifically, add function argument names to the following functions:
regexp_replace,
regexp_match,
regexp_matches
regexp_count,
regexp_instr,
regexp_like,
regexp_substr,
regexp_split_to_table
regexp_split_to_array
So it will be easier to understand these functions in psql via \df.
Also, these functions can now be called using different notations.
---
 doc/src/sgml/func.sgml  | 50 +++
 src/include/catalog/pg_proc.dat | 71 ++---
 2 files changed, 82 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index de78d58d..23ef07a5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3299,7 +3299,7 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
 
 regexp_instr ( string text, pattern text
  [, start integer
- [, N integer
+ [, occurrence integer
  [, endoption integer
  [, flags text
  [, subexpr integer ] ] ] ] ] )
@@ -3307,7 +3307,7 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in


 Returns the position within string where
-the N'th match of the POSIX regular
+the occurrence'th match of the POSIX regular
 expression pattern occurs, or zero if there is
 no such match; see .

@@ -3413,14 +3413,14 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in

 regexp_replace ( string text, pattern text, replacement text,
  start integer,
- N integer
+ occurrence integer
  [, flags text ] )
 text


-Replaces the substring that is the N'th
+Replaces the substring that is the occurrence'th
 match to the POSIX regular expression pattern,
-or all such matches if N is zero; see
+or all such matches if occurrence is zero; see
 .


@@ -3478,14 +3478,14 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
 
 regexp_substr ( string text, pattern text
  [, start integer
- [, N integer
+ [, occurrence integer
  [, flags text
  [, subexpr integer ] ] ] ] )
 text


 Returns the substring within string that
-matches the N'th occurrence of the POSIX
+matches the occurrence'th occurrence of the POSIX
 regular expression pattern,
 or NULL if there is no such match; see
 .
@@ -5888,13 +5888,13 @@ regexp_count('ABCABCAXYaxy', 'A.', 1, 'i')  4
 
 
  The regexp_instr function returns the starting or
- ending position of the N'th match of a
+ ending position of the occurrence'th match of a
  POSIX regular expression pattern to a string, or zero if there is no
  such match.  It has the syntax
  regexp_instr(string,
  pattern
  , start
- , N
+ , occurrence
  , endoption
  , flags
  , subexpr
@@ -5903,8 +5903,8 @@ regexp_count('ABCABCAXYaxy', 'A.', 1, 'i')  4
  in string, normally from the beginning of
  

Re: System username in pg_stat_activity

2024-01-10 Thread Joe Conway

On 1/10/24 08:59, Magnus Hagander wrote:

On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot

I think it depends what we want the new field to reflect. If it is the exact
same thing as the SYSTEM_USER then I think it has to be text (as the SYSTEM_USER
is made of "auth_method:identity"). Now if we want it to be "only" the identity
part of it, then the `name` datatype would be fine. I'd vote for the exact same
thing as the SYSTEM_USER (means auth_method:identity).


I definitely think it should be the same. If it's not exactly the
same, then it should be *two* columns, one with auth method and one
with the name.

And thinking more about it maybe that's cleaner, because that makes it
easier to do things like filter based on auth method?


+1 for the overall feature and +1 for two columns


> + 
> +  
> +   authname name
> +  
> +  
> +   The authentication method and identity (if any) that the user
> +   used to log in. It contains the same value as
> +returns in the backend.
> +  
> + 

I'm fine with auth_method:identity.

> +S.authname,

What about using system_user as the field name? (because if we keep
auth_method:identity it's not really the authname anyway).


I was worried system_user or sysuser would both be confusing with the
fact that we have usesysid -- which would reference a *different*
sys...



I think if it is exactly "system_user" it would be pretty clearly a 
match for SYSTEM_USER



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: System username in pg_stat_activity

2024-01-10 Thread Bertrand Drouvot
Hi,

On Wed, Jan 10, 2024 at 02:59:42PM +0100, Magnus Hagander wrote:
> On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot
> I definitely think it should be the same. If it's not exactly the
> same, then it should be *two* columns, one with auth method and one
> with the name.
> 
> And thinking more about it maybe that's cleaner, because that makes it
> easier to do things like filter based on auth method?

Yeah, that sounds even better.

> 
> > > + 
> > > +  
> > > +   authname name
> > > +  
> > > +  
> > > +   The authentication method and identity (if any) that the user
> > > +   used to log in. It contains the same value as
> > > +returns in the backend.
> > > +  
> > > + 
> >
> > I'm fine with auth_method:identity.
> >
> > > +S.authname,
> >
> > What about using system_user as the field name? (because if we keep
> > auth_method:identity it's not really the authname anyway).
> 
> I was worried system_user or sysuser would both be confusing with the
> fact that we have usesysid -- which would reference a *different*
> sys...

If we go the 2 fields way, then what about auth_identity and auth_method then?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: System username in pg_stat_activity

2024-01-10 Thread Magnus Hagander
On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Jan 10, 2024 at 02:08:03PM +0100, Magnus Hagander wrote:
> > On Wed, Jan 10, 2024 at 1:44 PM Aleksander Alekseev
> >  wrote:
> > >
> > > Hi,
> > >
> > > Thanks for the patch.
>
> +1
>
> > > > This overlaps with for example the values in pg_stat_gss, but it will
> > > > include values for authentication methods that don't have their own
> > > > view such as peer/ident. gss/ssl info will of course still be shown,
> > > > it is just in more than one place.
>
> Yeah, I think that's a good idea.
>
> > > It hurts my sense of beauty that usename and authname are of different
> > > types. But if I'm the only one, maybe we can close our eyes on this.
> > > Also I suspect that placing usename and authname in a close proximity
> > > can be somewhat confusing. Perhaps adding authname as the last column
> > > of the view will solve both nitpicks?
> >
> > But it should probably actually be name, given that's the underlying
> > datatype. I kept changing it around and ended up half way in
> > between...
> >
> >
> > > ```
> > > +/* Information about the authenticated user */
> > > +charst_authuser[NAMEDATALEN];
> > > ```
> > >
> > > Well, here it's called "authuser" and it looks like the intention was
> > > to use `name` datatype... I suggest using "authname" everywhere for
> > > consistency.
>
> I think it depends what we want the new field to reflect. If it is the exact
> same thing as the SYSTEM_USER then I think it has to be text (as the 
> SYSTEM_USER
> is made of "auth_method:identity"). Now if we want it to be "only" the 
> identity
> part of it, then the `name` datatype would be fine. I'd vote for the exact 
> same
> thing as the SYSTEM_USER (means auth_method:identity).

I definitely think it should be the same. If it's not exactly the
same, then it should be *two* columns, one with auth method and one
with the name.

And thinking more about it maybe that's cleaner, because that makes it
easier to do things like filter based on auth method?


> > + 
> > +  
> > +   authname name
> > +  
> > +  
> > +   The authentication method and identity (if any) that the user
> > +   used to log in. It contains the same value as
> > +returns in the backend.
> > +  
> > + 
>
> I'm fine with auth_method:identity.
>
> > +S.authname,
>
> What about using system_user as the field name? (because if we keep
> auth_method:identity it's not really the authname anyway).

I was worried system_user or sysuser would both be confusing with the
fact that we have usesysid -- which would reference a *different*
sys...


> Also, what about adding a test in say 003_peer.pl to check that the value 
> matches
> the SYSTEM_USER one?

Yeah, that's a good idea I think. I quickly looked around for tests and
couldn't really find any for pg_stat_activity, but we can definitely
piggyback them onto one or more of the auth tests.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Custom explain options

2024-01-10 Thread Konstantin Knizhnik


On 09/01/2024 10:33 am, vignesh C wrote:

On Sat, 21 Oct 2023 at 18:34, Konstantin Knizhnik  wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which 
help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page 
prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is implemented 
not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, 
indexscan and indexonly scan). Another possible candidate  for explain options 
is local file cache (extra caching layer above shared buffers which is used to 
somehow replace file system cache in standalone Postgres).

I think that it would be nice to have a generic mechanism which allows
extensions to add their own options to EXPLAIN.
I have attached the patch with implementation of such mechanism (also available 
as PR:  https://github.com/knizhnik/postgres/pull/1 )

I have demonstrated this mechanism using the Bloom extension - just to report
the number of Bloom matches.
Not sure that it is really useful information, but it is used mostly as an example:

explain (analyze,bloom) select * from t where pk=2000;
QUERY PLAN
-
  Bitmap Heap Scan on t  (cost=15348.00..15352.01 rows=1 width=4) (actual 
time=25.244..25.939 rows=1 loops=1)
Recheck Cond: (pk = 2000)
Rows Removed by Index Recheck: 292
Heap Blocks: exact=283
Bloom: matches=293
->  Bitmap Index Scan on t_pk_idx  (cost=0.00..15348.00 rows=1 width=0) 
(actual time=25.147..25.147 rows=293 loops=1)
  Index Cond: (pk = 2000)
  Bloom: matches=293
  Planning:
Bloom: matches=0
  Planning Time: 0.387 ms
  Execution Time: 26.053 ms
(12 rows)

There are two known issues with this proposal:

There are a few compilation errors reported by CFBot at [1]:
[05:00:40.452] ../src/backend/access/brin/brin.c: In function
‘_brin_end_parallel’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2675:3: error: too
few arguments to function ‘InstrAccumParallelQuery’
[05:00:40.452]  2675 |
InstrAccumParallelQuery(>bufferusage[i],
>walusage[i]);
[05:00:40.452]   |   ^~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452]  from ../src/include/access/brin.h:13,
[05:00:40.452]  from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:151:13: note: declared here
[05:00:40.452]   151 | extern void InstrAccumParallelQuery(BufferUsage
*bufusage, WalUsage *walusage, char* custusage);
[05:00:40.452]   | ^~~
[05:00:40.452] ../src/backend/access/brin/brin.c: In function
‘_brin_parallel_build_main’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2873:2: error: too
few arguments to function ‘InstrEndParallelQuery’
[05:00:40.452]  2873 |
InstrEndParallelQuery([ParallelWorkerNumber],
[05:00:40.452]   |  ^
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452]  from ../src/include/access/brin.h:13,
[05:00:40.452]  from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:150:13: note: declared here
[05:00:40.452]   150 | extern void InstrEndParallelQuery(BufferUsage
*bufusage, WalUsage *walusage, char* custusage);

[1] - https://cirrus-ci.com/task/5452124486631424?logs=build#L374

Regards,
Vignesh



Thank you for reporting the problem.
A rebased version of the patch is attached.


diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index fba3ba7771..383b1fb8e3 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -174,6 +174,13 @@ typedef struct BloomScanOpaqueData
 
 typedef BloomScanOpaqueData *BloomScanOpaque;
 
+typedef struct
+{
+   uint64 matches;
+} BloomUsage;
+
+extern BloomUsage bloomUsage;
+
 /* blutils.c */
 extern void initBloomState(BloomState *state, Relation index);
 extern void BloomFillMetapage(Relation index, Page metaPage);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 0a303a49b2..da29dc7f68 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -166,6 +166,6 @@ blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
CHECK_FOR_INTERRUPTS();
}
FreeAccessStrategy(bas);
-
+   bloomUsage.matches += ntids;
return ntids;
 }
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6836129c90..9120f96083 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -18,6 +18,7 @@
 #include "access/reloptions.h"
 #include "bloom.h"
 #include "catalog/index.h"
+#include "commands/explain.h"
 #include "commands/vacuum.h"
 #include 

Re: System username in pg_stat_activity

2024-01-10 Thread Bertrand Drouvot
Hi,

On Wed, Jan 10, 2024 at 02:08:03PM +0100, Magnus Hagander wrote:
> On Wed, Jan 10, 2024 at 1:44 PM Aleksander Alekseev
>  wrote:
> >
> > Hi,
> >
> > Thanks for the patch.

+1

> > > This overlaps with for example the values in pg_stat_gss, but it will
> > > include values for authentication methods that don't have their own
> > > view such as peer/ident. gss/ssl info will of course still be shown,
> > > it is just in more than one place.

Yeah, I think that's a good idea.

> > It hurts my sense of beauty that usename and authname are of different
> > types. But if I'm the only one, maybe we can close our eyes on this.
> > Also I suspect that placing usename and authname in a close proximity
> > can be somewhat confusing. Perhaps adding authname as the last column
> > of the view will solve both nitpicks?
> 
> But it should probably actually be name, given that's the underlying
> datatype. I kept changing it around and ended up half way in
> between...
> 
> 
> > ```
> > +/* Information about the authenticated user */
> > +charst_authuser[NAMEDATALEN];
> > ```
> >
> > Well, here it's called "authuser" and it looks like the intention was
> > to use `name` datatype... I suggest using "authname" everywhere for
> > consistency.

I think it depends what we want the new field to reflect. If it is the exact
same thing as the SYSTEM_USER then I think it has to be text (as the SYSTEM_USER
is made of "auth_method:identity"). Now if we want it to be "only" the identity
part of it, then the `name` datatype would be fine. I'd vote for the exact same
thing as the SYSTEM_USER (means auth_method:identity).

> + 
> +  
> +   authname name
> +  
> +  
> +   The authentication method and identity (if any) that the user
> +   used to log in. It contains the same value as
> +returns in the backend.
> +  
> + 

I'm fine with auth_method:identity.

> +S.authname,

What about using system_user as the field name? (because if we keep
auth_method:identity it's not really the authname anyway).

Also, what about adding a test in say 003_peer.pl to check that the value 
matches
the SYSTEM_USER one?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: System username in pg_stat_activity

2024-01-10 Thread Magnus Hagander
On Wed, Jan 10, 2024 at 2:27 PM Dagfinn Ilmari Mannsåker
 wrote:
>
> Magnus Hagander  writes:
>
> > On Wed, Jan 10, 2024 at 1:44 PM Aleksander Alekseev
> >  wrote:
> >>
> >> It hurts my sense of beauty that usename and authname are of different
> >> types. But if I'm the only one, maybe we can close our eyes on this.
> >> Also I suspect that placing usename and authname in a close proximity
> >> can be somewhat confusing. Perhaps adding authname as the last column
> >> of the view will solve both nitpicks?
> >
> > But it should probably actually be name, given that's the underlying
> > datatype. I kept changing it around and ended up half way in
> > between...
>
> https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-SESSION-TABLE
> (and pg_typeof(system_user)) says it's text.  Which makes sense, since
> it can easily be longer than 63 bytes, e.g. in the case of a TLS client
> certificate DN.

We already truncate all those to NAMEDATALEN in pg_stat_ssl for
example. so I think the truncation part of those should be OK. We'll
truncate "a little bit more" since we also have the 'cert:', but not
significantly so I think.

but yeah, conceptually it should probably be text because name is
supposedly a *postgres identifier*, which this is not.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Custom explain options

2024-01-10 Thread Konstantin Knizhnik



On 10/01/2024 8:29 am, Andrei Lepikhov wrote:

On 30/11/2023 22:40, Konstantin Knizhnik wrote:
In all these cases we are using an array of `Instrumentation`, and if it
contains a varying part, then it is not clear where to place it.
Yes, there is also code which serializes and sends instrumentation
between worker processes, and I have updated this code in my PR to
send the actual amount of custom instrumentation data. But it cannot
help with the cases above.
What do you think about this really useful feature? Do you wish to 
develop it further?


In Neon (cloud Postgres) we have changed Postgres core to include
information about prefetching and the local file cache in EXPLAIN.
EXPLAIN seems to be the most convenient way for users to get this
information, which can be very useful for investigating query
execution speed.
So my intention was to make it possible to add extra information to
EXPLAIN without patching Postgres core.

The existing explain hook is not enough for that.

I am not sure that the suggested approach is flexible enough. First of
all, I tried to make it as simple as possible, minimizing changes in
Postgres core.






Re: System username in pg_stat_activity

2024-01-10 Thread Dagfinn Ilmari Mannsåker
Magnus Hagander  writes:

> On Wed, Jan 10, 2024 at 1:44 PM Aleksander Alekseev
>  wrote:
>>
>> It hurts my sense of beauty that usename and authname are of different
>> types. But if I'm the only one, maybe we can close our eyes on this.
>> Also I suspect that placing usename and authname in a close proximity
>> can be somewhat confusing. Perhaps adding authname as the last column
>> of the view will solve both nitpicks?
>
> But it should probably actually be name, given that's the underlying
> datatype. I kept changing it around and ended up half way in
> between...

https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-SESSION-TABLE
(and pg_typeof(system_user)) says it's text.  Which makes sense, since
it can easily be longer than 63 bytes, e.g. in the case of a TLS client
certificate DN.

- ilmari




Re: Custom explain options

2024-01-10 Thread Konstantin Knizhnik



On 10/01/2024 8:46 am, Michael Paquier wrote:

On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:

What do you think about this really useful feature? Do you wish to develop
it further?

I am biased here.  This seems like a lot of code for something we've
been delegating to the explain hook for ages.  Even if I can see the
appeal of pushing that more into explain.c to get more data on a
per-node basis depending on the custom options given by the caller of
an EXPLAIN entry point, I cannot get really excited about the extra
maintenance this facility would involve compared to the potential
gains, knowing that there's a hook.
--
Michael



Well, I am not sure that the proposed patch is flexible enough to handle all
possible scenarios.
I just wanted to make it as simple as possible, to leave it some chance of
being merged.
But it is easy to answer the question of why the existing explain hook is not
enough:


1. It doesn't allow adding extra options to EXPLAIN. My intention
was to be able to do something like "explain
(analyze,buffers,prefetch) ...". That is simply not possible with the
explain hook.
2. Maybe I am wrong, but it is currently impossible to collect and combine
instrumentation from all parallel workers without changing Postgres core.


The explain hook can be useful if you add a custom node to the query
execution plan and want to provide information about that node.
But if you are implementing an alternative storage mechanism or an
optimization for existing plan nodes, then it is very difficult to do
using the existing explain hook.







Re: Make attstattarget nullable

2024-01-10 Thread Alvaro Herrera
On 2023-Dec-23, Peter Eisentraut wrote:

> Here is an updated patch rebased over 3e2e0d5ad7.
> 
> The 0001 patch stands on its own, but I also tacked on two additional WIP
> patches that simplify some pg_attribute handling and make these kinds of
> refactorings simpler in the future.  See description in the patches.

I didn't look at 0002 and 0003, since they're marked as WIP.  (But I did
like the removal that happens in 0003, so I hope these two also make it
to 17).

> On 05.12.23 13:52, Peter Eisentraut wrote:
> > In [0] it was discussed that we could make attstattarget a nullable
> > column, instead of always storing an explicit -1 default value for most
> > columns.  This patch implements this.

Seems reasonable.  Do we really need a catversion bump for this?

I like that we now have SET STATISTICS DEFAULT rather than -1 to reset
to default.  Do we want to document that setting explicitly to -1
continues to have that behavior?  (I would add something like "Setting
to a value of -1 is an obsolete spelling to get the same behavior."
after the phrase that explains DEFAULT in the ALTER TABLE manpage.)

I noticed that equalTupleDescs no longer compares attstattarget, and
this is because the field is not in TupleDesc anymore.  I looked at the
callers of equalTupleDescs and I think this is exactly what we want
(precisely because attstattarget is no longer in TupleDesc.)

> > This changes the pg_attribute field attstattarget into a nullable field
> > in the variable-length part of the row.

I don't think we use "the variable-length part of the row" as a term
anywhere.  We only have the variable-length columns, and we made a bit
of a mistake in using CATALOG_VARLEN to differentiate the part of the
catalogs that are not mapped to the structs (because at the time those
were in effect only the variable length fields).  I think this is
largely not a problem, but let's be careful with how we word the related
comments.  So:

I think the comment next to "#ifdef CATALOG_VARLEN" is now a bit
misleading, because the field immediately below is effectively not
varlena.  Maybe make it
#ifdef CATALOG_VARLEN   /* nullable/varlena fields start here */

In RemoveAttributeById, a comment says
"Clear the other variable-length fields."
but this is no longer fully correct.  Again maybe make it "... the other
nullable or variable-length fields".

In get_attstattarget() I think we should return 0 for dropped columns
without reading attstattarget, which is useless anyway, and if it did
happen to return non-null, it might cause us to do stuff, which would be
a waste.
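
Roughly what I have in mind is sketched below (only a sketch: the lookup
follows the usual lsyscache.c syscache pattern, and the null-to-(-1)
mapping and int16 type are my assumptions about how the patch handles the
nullable field):

static int
get_attstattarget_sketch(Oid relid, AttrNumber attnum)
{
	HeapTuple	tp;
	Form_pg_attribute att_tup;
	Datum		dat;
	bool		isnull;
	int			result;

	tp = SearchSysCache2(ATTNUM,
						 ObjectIdGetDatum(relid),
						 Int16GetDatum(attnum));
	if (!HeapTupleIsValid(tp))
		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
			 attnum, relid);
	att_tup = (Form_pg_attribute) GETSTRUCT(tp);

	/* For dropped columns, don't even look at attstattarget. */
	if (att_tup->attisdropped)
	{
		ReleaseSysCache(tp);
		return 0;
	}

	dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
	result = isnull ? -1 : DatumGetInt16(dat);

	ReleaseSysCache(tp);
	return result;
}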

It's annoying that the new code in index_concurrently_swap() is more
verbose than the code being replaced, but it seems OK to me, since it
allows us to distinguish a null value in attstattarget from actual 0
without complicating the get_attstattarget API (which I think we would
have to do if we wanted to use it here.)

I don't have anything else on this patch at this point.

Thanks

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2024-01-10 Thread vignesh C
On Wed, 10 Jan 2024 at 15:04, Amit Kapila  wrote:
>
> On Wed, Jan 10, 2024 at 2:59 PM Shlok Kyal  wrote:
> >
> > This patch is not applying on the HEAD. Please rebase and share the
> > updated patch.
> >
>
> IIRC, there were some regressions observed with this patch. So, one
> needs to analyze those as well. I think we should mark it as "Returned
> with feedback".

Thanks, I have updated the status to "Returned with feedback".
Feel free to post an updated version with the fix for the regression
and start a new entry for the same.

Regards,
Vignesh




Re: System username in pg_stat_activity

2024-01-10 Thread Magnus Hagander
On Wed, Jan 10, 2024 at 1:44 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> Thanks for the patch.
>
> > The attached patch adds a column "authuser" to pg_stat_activity which
> > contains the username of the externally authenticated user, being the
> > same value as the SYSTEM_USER keyword returns in a backend.
>
> I believe what was meant is "authname", not "authuser".
>
> > This overlaps with for example the values in pg_stat_gss, but it will
> > include values for authentication methods that don't have their own
> > view such as peer/ident. gss/ssl info will of course still be shown,
> > it is just in more than one place.
> >
> > I was originally thinking this column should be "sysuser" to map to
> > the keyword, but since we already have "usesysid" as a column name in
> > pg_stat_activity I figured that could be confusing since it actually
> > means something completely different. But happy to change that back if
> > people think that's better.
>
> This part of the documentation is wrong:
>
> ```
> + 
> +  
> +   authname name
> +  
> ```
>
> Actually the type is `text`:
>
> ```
> =# \d  pg_stat_activity ;
>   View "pg_catalog.pg_stat_activity"
>   Column  |   Type   | Collation | Nullable | Default
> --+--+---+--+-
>  datid| oid  |   |  |
>  datname  | name |   |  |
>  pid  | integer  |   |  |
>  leader_pid   | integer  |   |  |
>  usesysid | oid  |   |  |
>  usename  | name |   |  |
>  authname | text |   |  |
> ```
>
> It hurts my sense of beauty that usename and authname are of different
> types. But if I'm the only one, maybe we can close our eyes on this.
> Also I suspect that placing usename and authname in a close proximity
> can be somewhat confusing. Perhaps adding authname as the last column
> of the view will solve both nitpicks?

But it should probably actually be name, given that's the underlying
datatype. I kept changing it around and ended up half way in
between...


> ```
> +/* Information about the authenticated user */
> +charst_authuser[NAMEDATALEN];
> ```
>
> Well, here it's called "authuser" and it looks like the intention was
> to use `name` datatype... I suggest using "authname" everywhere for
> consistency.

Yeah, I flipped back and forth a few times and clearly got stuck in
the middle. They should absolutely be the same everywhere - whatever
name is used it should be consistent.


> Since the patch affects pg_proc.dat I believe the commit message
> should remind bumping the catalog version.

Yes.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index de78d58d4b..ec6a25495f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23462,7 +23462,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 
   

-
+
  system_user
 
 system_user
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5e..dc146a4c9f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -784,6 +784,17 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
+   authname name
+  
+  
+   The authentication method and identity (if any) that the user
+   used to log in. It contains the same value as
+returns in the backend.
+  
+ 
+
  
   
application_name text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e43e36f5ac..fe81aefbbb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -873,6 +873,7 @@ CREATE VIEW pg_stat_activity AS
 S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
+S.authname,
 S.application_name,
 S.client_addr,
 S.client_hostname,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 3dac79c30e..4e0e16a8fe 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -357,6 +357,11 @@ pgstat_bestart(void)
 	else
		MemSet(&lbeentry.st_clientaddr, 0, sizeof(lbeentry.st_clientaddr));
 
+	if (MyProcPort && GetSystemUser())
+		strlcpy(lbeentry.st_authname, GetSystemUser(), NAMEDATALEN);
+	else
+		MemSet(&lbeentry.st_authname, 0, sizeof(lbeentry.st_authname));
+
 #ifdef USE_SSL
 	if (MyProcPort && MyProcPort->ssl_in_use)
 	{
diff --git a/src/backend/utils/adt/pgstatfuncs.c 

Re: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-10 Thread vignesh C
On Wed, 10 Jan 2024 at 14:08, Bharath Rupireddy
 wrote:
>
> Hi,
>
> I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> of late on my developer system:
>
> t/038_save_logical_slots_shutdown.pl .. 1/?
> #   Failed test 'Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location'
> #   at t/038_save_logical_slots_shutdown.pl line 35.
> # Looks like you failed 1 test of 2.
> t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1
> (wstat 256, 0x100)
> Failed 1/2 subtests
>
> I did a quick analysis of the failure and commit
> https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0
> that introduced this test. I think the issue is that the slot's
> confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN
> (0/1508018) are not the same:
>
> tmp_check/log/038_save_logical_slots_shutdown_pub.log:
>
> 2024-01-10 07:55:44.539 UTC [57621] sub LOG:  starting logical
> decoding for slot "sub"
> 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL:  Streaming
> transactions committing after 0/1508000, reading WAL from 0/1507FC8.
> 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT:  START_REPLICATION
> SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any',
> publication_names '"pub"')
>
> ubuntu:~/postgres$ pg17/bin/pg_controldata -D
> src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/
> Database cluster state:   in production
> pg_control last modified: Wed Jan 10 07:55:44 2024
> Latest checkpoint location:   0/1508018
> Latest checkpoint's REDO location:0/1508018
>
> But the tests added by t/038_save_logical_slots_shutdown.pl expects
> both LSNs to be same:
>
> sub compare_confirmed_flush
> {
> # Is it same as the value read from log?
> ok( $latest_checkpoint eq $confirmed_flush_from_log,
> "Check that the slot's confirmed_flush LSN is the same as the
> latest_checkpoint location"
> );
>
> I suspect that it's not quite right to expect the slot's
> confirmed_flush and latest checkpoint locations to be the same in the test.
> This is because the shutdown checkpoint gets an LSN that's greater
> than the slot's confirmed_flush LSN - see the shutdown checkpoint
> record getting inserted into WAL after the slot is marked dirty in
> CheckPointReplicationSlots().
>
> With this analysis in mind, I think the tests need to do something
> like the following:
>
> diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> b/src/test/recovery/t/038_save_logical_slots_shut
> down.pl
> index 5a4f5dc1d4..d49e6014fc 100644
> --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> @@ -32,7 +32,7 @@ sub compare_confirmed_flush
>   unless defined($latest_checkpoint);
>
> # Is it same as the value read from log?
> -   ok( $latest_checkpoint eq $confirmed_flush_from_log,
> +   ok( $latest_checkpoint ge $confirmed_flush_from_log,
> "Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location"
> );
>
> Thoughts?

I got the log files from Bharath offline; thanks Bharath for sharing
them.
The WAL record sequence is exactly the same in the failing test and in
the tests which pass.
One observation: in our (passing) case the confirmed_flush LSN points exactly
to the shutdown checkpoint record, but in the failing test the LSN it points
to is invalid; pg_waldump says that address is invalid, skips about 24
bytes and then sees a valid record.

Passing case confirmed_flush LSN (0/150D158) from my machine:
pg_waldump 00010001 -s 0/150D158
rmgr: XLOGlen (rec/tot):114/   114, tx:  0, lsn:
0/0150D158, prev 0/0150D120, desc: CHECKPOINT_SHUTDOWN redo 0/150D158;
tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown

Failing case confirmed_flush LSN (0/1508000) from the failing test's log file:
pg_waldump 00010001 -s 0/1508000
pg_waldump: first record is after 0/1508000, at 0/1508018, skipping
over 24 bytes
rmgr: XLOGlen (rec/tot):114/   114, tx:  0, lsn:
0/01508018, prev 0/01507FC8, desc: CHECKPOINT_SHUTDOWN redo 0/1508018;
tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown

I'm still not sure why, in this case, it does not point exactly to a
valid WAL record and has to skip 24 bytes to find the valid shutdown
checkpoint record.
I will investigate this further and share the analysis.

Regards,
Vignesh




Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Matthias van de Meent
On Wed, 10 Jan 2024 at 02:44, Andy Fan  wrote:
> Hi,
>
> I want to know if Andres or you have plan
> to do some code review. I don't expect this would happen very soon, just
> want to make sure this will not happen that both of you think the other
> one will do, but actually none of them does it in fact. a commit fest
> [1] has been added for this.


> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -5419,6 +5419,7 @@ LockBufHdr(BufferDesc *desc)
> perform_spin_delay();
> }
> finish_spin_delay();
> +START_SPIN_LOCK();
> return old_buf_state | BM_LOCKED;
> }

I think that we need to 'arm' the checks just before we lock the spin
lock, and 'disarm' the checks just after we unlock the spin lock,
rather than after and before, respectively. That way, we won't have a
chance of false negatives: with your current patch it is possible that
an interrupt fires between the acquisition of the lock and the code in
START_SPIN_LOCK() marking the thread as holding a spin lock, which
would cause any check in that signal handler to incorrectly read that
we don't hold any spin locks.
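
A minimal sketch of the ordering I mean (illustrative only; START_SPIN_LOCK()
is the macro from the patch, and END_SPIN_LOCK() is my placeholder for
whatever the matching "disarm" macro is called):

static slock_t	example_lock;

static void
example_spinlock_usage(void)
{
	START_SPIN_LOCK();			/* arm before the lock can be held */
	SpinLockAcquire(&example_lock);

	/* short critical section: no elog/ereport, no LWLocks, no waits */

	SpinLockRelease(&example_lock);
	END_SPIN_LOCK();			/* disarm only after the lock is released */
}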

> +++ b/src/backend/storage/lmgr/lock.c
> @@ -776,6 +776,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
> boolfound_conflict;
> boollog_lock = false;
>
> +Assert(SpinLockCount == 0);
> +

I'm not 100% sure on the policy of this, but theoretically you could
use LockAcquireExtended(dontWait=true) while holding a spin lock, as
that would not have an unknown duration. Then again, this function
also does elog/ereport, which would cause issues, still, so this code
may be the better option.

> +elog(PANIC, "stuck spinlock detected at %s, %s:%d after waiting for %u 
> ms",
> + func, file, line, delay_ms);

pg_usleep doesn't actually guarantee that we'll wait for exactly that
duration; depending on signals received while spinning and/or OS
scheduling decisions it may be off by orders of magnitude.

> +++ b/src/common/scram-common.c

This is unrelated to the main patchset.

> +++ b/src/include/storage/spin.h

Minor: I think these changes could better be included in miscadmin, or
at least the definition for SpinLockCount should be moved there: The
spin lock system itself shouldn't be needed in places where we need to
make sure that we don't hold any spinlocks, and miscadmin.h already
holds things related to "System interrupt and critical section
handling", which seems quite related.

Kind regards,

Matthias van de Meent




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 08:25, Michael Paquier  wrote:
>
> On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> >
> > I thought removing op_bytes completely ( as you said "This patch
> > extends it with two more operation sizes, and there are even cases
> > where it may be a variable" ) from pg_stat_io view then adding
> > something like {read | write | extend}_bytes and {read | write |
> > extend}_calls could be better, so that we don't lose any information.
>
> But then you'd lose the possibility to analyze correlations between
> the size and the number of the operations, which is something that
> matters for more complex I/O scenarios.  This does not need to be
> tackled in this patch, which is useful on its own, though I am really
> wondering if this is required for the recent work done by Thomas.
> Perhaps Andres, Thomas or Melanie could comment on that?

Yes, you are right.

> >> Yeah, a limitation like that may be acceptable for now.  Tracking the
> >> WAL writer and WAL sender activities can be relevant in a lot of cases
> >> even if we don't have the full picture for the WAL receiver yet.
> >
> > I added that and disabled B_WAL_RECEIVER backend with comments
> > explaining why. v8 is attached.
>
> I can see that's what you have been adding here, which should be OK:
>
> > -if (track_io_timing)
> > +/*
> > + * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp 
> > IOs
> > + * but these IOs are not countable for now because IOOP_READ IOs' 
> > op_bytes
> > + * (number of bytes per unit of I/O) might not be the same all the 
> > time.
> > + * The current implementation requires that the op_bytes must be the 
> > same
> > + * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> > + * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> > + * now.
> > + */
> > +if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > +return;
>
> This could be worded better, but that's one of these nits from me I
> usually tweak when committing stuff.

Thanks for doing that! Do you have any specific comments that can help
improve it?

> > +/*
> > + * Decide if IO timings need to be tracked.  Timings associated to
> > + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> > + * else rely on track_io_timing.
> > + */
> > +static bool
> > +pgstat_should_track_io_time(IOObject io_object)
> > +{
> > +if (io_object == IOOBJECT_WAL)
> > +return track_wal_io_timing;
> > +
> > +return track_io_timing;
> > +}
>
> One thing I was also considering is whether eliminating this routine would
> make the resulting pgstat_count_io_op_time() more readable, but I cannot
> get to that.

I could not think of a way to eliminate the pgstat_should_track_io_time()
routine without causing performance regressions. What do you think about
moving the body of the 'if (pgstat_should_track_io_time(io_object))' check
into another function and calling that function from
pgstat_count_io_op_time()? This does not change anything, but IMO it
increases the readability.
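
Something along these lines (only a sketch of the refactoring I mean; the
helper name is made up, and its body just mirrors the existing timing code
quoted below):

static void
pgstat_accum_io_time(IOObject io_object, IOOp io_op, instr_time io_time)
{
	if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
	{
		if (io_object != IOOBJECT_WAL)
			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));

		if (io_object == IOOBJECT_RELATION)
			INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
		else if (io_object == IOOBJECT_TEMP_RELATION)
			INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
	}
	else if (io_op == IOOP_READ)
	{
		if (io_object != IOOBJECT_WAL)
			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));

		if (io_object == IOOBJECT_RELATION)
			INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
		else if (io_object == IOOBJECT_TEMP_RELATION)
			INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
	}
}

/* ... and in pgstat_count_io_op_time():
 *
 *	if (pgstat_should_track_io_time(io_object))
 *		pgstat_accum_io_time(io_object, io_op, io_time);
 */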

> >  if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> >  {
> > -
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +if (io_object != IOOBJECT_WAL)
> > +
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >  if (io_object == IOOBJECT_RELATION)
> >  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, 
> > io_time);
> >  else if (io_object == IOOBJECT_TEMP_RELATION)
> > @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext 
> > io_context, IOOp io_op,
> >  }
> >  else if (io_op == IOOP_READ)
> >  {
> > -
> > pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +if (io_object != IOOBJECT_WAL)
> > +
> > pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >  if (io_object == IOOBJECT_RELATION)
> >  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, 
> > io_time);
> >  else if (io_object == IOOBJECT_TEMP_RELATION)
>
> A second thing is if this would be better with more switch/cases, say:
> switch (io_op):
> {
> case IOOP_EXTEND:
> case IOOP_WRITE:
> switch (io_object):
> {
> case WAL:
> /* do nothing */
> break;
> case RELATION:
> case TEMP:
> .. blah ..
> }
> break;
> case IOOP_READ:
> switch (io_object):
> {
> .. blah ..
> }
> break;
> }
>
> Or just this one to make it clear that nothing happens for WAL
> objects:
> switch (io_object):
> {
>case WAL:
>/* do nothing */
>break;
>case RELATION:
>switch (io_op):
>{
>case IOOP_EXTEND:
>

Re: System username in pg_stat_activity

2024-01-10 Thread Aleksander Alekseev
Hi,

Thanks for the patch.

> The attached patch adds a column "authuser" to pg_stat_activity which
> contains the username of the externally authenticated user, being the
> same value as the SYSTEM_USER keyword returns in a backend.

I believe what was meant is "authname", not "authuser".

> This overlaps with for example the values in pg_stat_gss, but it will
> include values for authentication methods that don't have their own
> view such as peer/ident. gss/ssl info will of course still be shown,
> it is just in more than one place.
>
> I was originally thinking this column should be "sysuser" to map to
> the keyword, but since we already have "usesysid" as a column name in
> pg_stat_activity I figured that could be confusing since it actually
> means something completely different. But happy to change that back if
> people think that's better.

This part of the documentation is wrong:

```
+ 
+  
+   authname name
+  
```

Actually the type is `text`:

```
=# \d  pg_stat_activity ;
  View "pg_catalog.pg_stat_activity"
  Column  |   Type   | Collation | Nullable | Default
--+--+---+--+-
 datid| oid  |   |  |
 datname  | name |   |  |
 pid  | integer  |   |  |
 leader_pid   | integer  |   |  |
 usesysid | oid  |   |  |
 usename  | name |   |  |
 authname | text |   |  |
```

It hurts my sense of beauty that usename and authname are of different
types. But if I'm the only one, maybe we can overlook this.
Also I suspect that placing usename and authname in close proximity
can be somewhat confusing. Perhaps adding authname as the last column
of the view would solve both nitpicks?

```
+/* Information about the authenticated user */
+charst_authuser[NAMEDATALEN];
```

Well, here it's called "authuser" and it looks like the intention was
to use `name` datatype... I suggest using "authname" everywhere for
consistency.

Since the patch affects pg_proc.dat, I believe the commit message
should include a reminder to bump the catalog version.

-- 
Best regards,
Aleksander Alekseev




Re: Postgres Partitions Limitations (5.11.2.3)

2024-01-10 Thread Magnus Hagander
On Tue, Dec 5, 2023 at 3:57 PM Ashutosh Bapat
 wrote:
>
> On Tue, Dec 5, 2023 at 1:40 AM Laurenz Albe  wrote:
> >
> > On Fri, 2023-12-01 at 18:49 +0530, Ashutosh Bapat wrote:
> > > On Thu, Nov 30, 2023 at 10:29 PM Laurenz Albe  
> > > wrote:
> > > >
> > > > On Thu, 2023-11-30 at 19:22 +0530, Ashutosh Bapat wrote:
> > > > > May be attach the patch to hackers thread (this) as well?
> > > >
> > > > If you want, sure.  I thought it was good enough if the thread
> > > > is accessible via the commitfest app.
> > >
> > > The addition is long enough that it deserved to be outside of parentheses.
> > >
> > > I think it's worth mentioning the exception but in a way that avoids
> > > repeating what's mentioned in the last paragraph of just the previous
> > > section. I don't have brilliant ideas about how to rephrase it.
> > >
> > > Maybe "Using ONLY to add or drop a constraint, other than PRIMARY and
> > > UNIQUE, on only the partitioned table is supported as long as there
> > > are no partitions. ...".
> >
> > I agree that the parenthesis is too long.  I shortened it in the attached
> > patch. Is that acceptable?
>
> It's still longer than the actual sentence :). I am fine with it if
> somebody else finds it acceptable.

It still reads a bit weird to me. How about the attached wording instead?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 22d04006ad..01b1d82b0d 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4348,7 +4348,9 @@ ALTER INDEX measurement_city_id_logdate_key
 Using ONLY to add or drop a constraint on only
 the partitioned table is supported as long as there are no
 partitions.  Once partitions exist, using ONLY
-will result in an error.  Instead, constraints on the partitions
+will result in an error for any constraints other than
+UNIQUE and PRIMARY KEY.
+Instead, constraints on the partitions
 themselves can be added and (if they are not present in the parent
 table) dropped.



Re: Refactoring backend fork+exec code

2024-01-10 Thread Heikki Linnakangas

On 08/12/2023 14:33, Heikki Linnakangas wrote:

+   [PMC_AV_LAUNCHER] = {"autovacuum launcher", AutoVacLauncherMain, true},
+   [PMC_AV_WORKER] = {"autovacuum worker", AutoVacWorkerMain, true},
+   [PMC_BGWORKER] = {"bgworker", BackgroundWorkerMain, true},
+   [PMC_SYSLOGGER] = {"syslogger", SysLoggerMain, false},
+
+   [PMC_STARTUP] = {"startup", StartupProcessMain, true},
+   [PMC_BGWRITER] = {"bgwriter", BackgroundWriterMain, true},
+   [PMC_ARCHIVER] = {"archiver", PgArchiverMain, true},
+   [PMC_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+   [PMC_WAL_WRITER] = {"wal_writer", WalWriterMain, true},
+   [PMC_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
+};


It feels like we have too many different ways of documenting the type of a
process. This new PMC_ stuff, enum AuxProcType, enum BackendType.

Agreed. And "am_walsender" and such variables.


Here's a patch that gets rid of AuxProcType. It's independent of the 
other patches in this thread; if this is committed, I'll rebase the rest 
of the patches over this and get rid of the new PMC_* enum.


Three patches, actually. The first one fixes an existing comment that I 
noticed to be incorrect while working on this. I'll push that soon, 
barring objections. The second one gets rid of AuxProcType, and the 
third one replaces IsBackgroundWorker, IsAutoVacuumLauncherProcess() and 
IsAutoVacuumWorkerProcess() with checks on MyBackendType. So 
MyBackendType is now the primary way to check what kind of a process the 
current process is.
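
To illustrate (this is not a verbatim hunk from the patch), call sites that
used the removed helpers end up as plain comparisons, e.g.:

	/* previously IsAutoVacuumWorkerProcess() */
	if (MyBackendType == B_AUTOVAC_WORKER)
	{
		/* autovacuum-worker-only behaviour here */
	}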


'am_walsender' would also be fairly straightforward to remove and 
replace with AmWalSenderProcess(). I didn't do that because we also have 
am_db_walsender and am_cascading_walsender which cannot be directly 
replaced with MyBackendType. Given that, it might be best to keep 
am_walsender for symmetry.


--
Heikki Linnakangas
Neon (https://neon.tech)
From a0ecc54763252de74907fd0e4aed30fdb753ff86 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 10 Jan 2024 13:46:23 +0200
Subject: [PATCH v6 1/3] Fix incorrect comment on how BackendStatusArray is
 indexed

The comment was copy-pasted from the call to ProcSignalInit() in
AuxiliaryProcessMain(), which uses a similar scheme of having reserved
slots for aux processes after MaxBackends slots for backends. However,
ProcSignalInit() indexing starts from 1, whereas BackendStatusArray
starts from 0. The code is correct, but the comment was wrong.

Reviewed-by: XXX
Discussion: XXX
---
 src/backend/utils/activity/backend_status.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 3dac79c30eb..1a1050c8da1 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -263,9 +263,9 @@ pgstat_beinit(void)
 		 * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
 		 * have a BackendId, the slot is statically allocated based on the
 		 * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-		 * in the range from 1 to MaxBackends (inclusive), so we use
-		 * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-		 * auxiliary process.
+		 * in the range from 0 to MaxBackends (exclusive), so we use
+		 * MaxBackends + AuxProcType as the index of the slot for an auxiliary
+		 * process.
 		 */
 		MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
 	}
-- 
2.39.2

From 68b8c662af229f93244b45282f3112d9a5d809b9 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 10 Jan 2024 14:00:09 +0200
Subject: [PATCH v6 2/3] Remove MyAuxProcType, use MyBackendType instead

MyAuxProcType was redundant with MyBackendType.

Reviewed-by: XXX
Discussion: XXX
---
 src/backend/postmaster/auxprocess.c | 68 +---
 src/backend/postmaster/postmaster.c | 36 +--
 src/backend/utils/activity/backend_status.c | 11 ++--
 src/include/miscadmin.h | 71 ++---
 src/include/postmaster/auxprocess.h |  2 +-
 src/tools/pgindent/typedefs.list|  1 -
 6 files changed, 74 insertions(+), 115 deletions(-)

diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index ab86e802f21..bd3815b63d0 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -38,14 +38,6 @@
 static void ShutdownAuxiliaryProcess(int code, Datum arg);
 
 
-/* 
- *		global variables
- * 
- */
-
-AuxProcType MyAuxProcType = NotAnAuxProcess;	/* declared in miscadmin.h */
-
-
 /*
  *	 AuxiliaryProcessMain
  *
@@ -55,39 +47,13 @@ AuxProcType MyAuxProcType = NotAnAuxProcess;	/* declared in miscadmin.h */
  *	 This code is here just because of historical reasons.
  */
 void
-AuxiliaryProcessMain(AuxProcType auxtype)
+AuxiliaryProcessMain(BackendType auxtype)
 {
 	Assert(IsUnderPostmaster);
 
-	

Re: [patch] pg_basebackup: mention that spread checkpoints are the default in --help

2024-01-10 Thread Magnus Hagander
On Tue, Oct 31, 2023 at 8:01 PM Michael Banck  wrote:
>
> Hi,
>
> On Tue, Oct 31, 2023 at 04:59:24PM +0530, Shlok Kyal wrote:
> > I went through the Cfbot for this patch and found out that the build
> > is failing with the following error (Link:
> > https://cirrus-ci.com/task/4648506929971200?logs=build#L1217):
>
> Oops, sorry. Attached is a working third version of this patch.

While I see Peter's argument that one wording reads better than the
other, it does also increase the "help message bloat" mentioned by
Michael. So I think we're better off actually using the original
version, and I'm going to go ahead and push that one (also to avoid
endless bikeshedding).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: A recent message added to pg_upgade

2024-01-10 Thread Amit Kapila
On Wed, Jan 10, 2024 at 10:11 AM Michael Paquier  wrote:
>
> On Wed, Nov 15, 2023 at 07:58:06AM +0530, Amit Kapila wrote:
> > I am fine with this but there is no harm in doing this before or along
> > with the main patch. As of now, I don't see any problem but as the
> > main patch is still under review, so thought we could even wait for
> > the patch to become "Ready For Committer".
>
> My apologies for the delay.
>
> Now that 9a17be1e244a is in the tree, please find attached a patch to
> restrict the startup of the launcher using IsBinaryUpgrade in
> ApplyLauncherRegister(), with adjustments to the surrounding comments.
>

- if (max_logical_replication_workers == 0)
+ /*
+ * The logical replication launcher is disabled during binary upgrades,
+ * as logical replication workers can stream data during the upgrade
+ * which can cause replication origins to move forward after we have
+ * copied them.  It can cause the system to request the data which is
+ * already present in the new cluster.
+ */
+ if (max_logical_replication_workers == 0 || IsBinaryUpgrade)

This comment is not very clear to me. The first part of the sentence
can't apply to the new cluster, as after the upgrade subscriptions
will be disabled, and the second part talks about requesting the wrong
data in the new cluster. As per my understanding, the problem here is
that, on the old cluster, the origins may move forward after we copy
them but before we copy the physical files. Now, in the new cluster, when
we try to request the data, it will already be present.

> Was there anything else you wanted to be covered and/or updated?
>

No, only this patch.

-- 
With Regards,
Amit Kapila.




Re: Compile warnings in dbcommands.c building with meson

2024-01-10 Thread Aleksander Alekseev
Hi,

> When building current head on debian bullseye I get this compile warning:
>
> In file included from ../src/backend/commands/dbcommands.c:20:
> ../src/backend/commands/dbcommands.c: In function ‘createdb’:
> ../src/include/postgres.h:104:9: warning: ‘src_hasloginevt’ may be
> used uninitialized in this function [-Wmaybe-uninitialized]
>   104 |  return (Datum) (X ? 1 : 0);
>   | ^~~
> ../src/backend/commands/dbcommands.c:683:8: note: ‘src_hasloginevt’
> was declared here
>   683 |  bool  src_hasloginevt;
>   |^~~
>
>
> I only get this when building with meson, not when building with
> autotools. AFAICT, I have the same config:
>
> ./configure --enable-debug --enable-depend --with-python
> --enable-cassert --with-openssl --enable-tap-tests --with-icu
>
> vs
>
> meson setup build -Ddebug=true -Dpython=true -Dcassert=true
> -Dssl=openssl -Dtap-test=true -Dicu=enabled -Dnls=disabled
>
>
> in both cases the compiler is:
> gcc (Debian 10.2.1-6) 10.2.1 20210110

Seems to me that the compiler is not smart enough to process:

```
if (!get_db_info(dbtemplate, ShareLock,
                 &src_dboid, &src_owner, &src_encoding,
                 &src_istemplate, &src_allowconn, &src_hasloginevt,
                 &src_frozenxid, &src_minmxid, &src_deftablespace,
                 &src_collate, &src_ctype, &src_iculocale,
                 &src_icurules, &src_locprovider,
                 &src_collversion))
ereport(ERROR, ...
```

Should we just silence the warning like this - see attachment? I don't
think createdb() is called often enough to worry about a slight
performance change, if there is any.
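
For the archives, since the attachment is not inlined here: the idea is the
usual one-liner, i.e. give the variable a dummy initial value so the compiler
stops guessing (assuming that is indeed what the attached patch does):

```
-	bool		src_hasloginevt;
+	bool		src_hasloginevt = false;
```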

-- 
Best regards,
Aleksander Alekseev


v1-0001-Silence-compiler-warnings.patch
Description: Binary data


Re: Slightly improved meson error for docs tools

2024-01-10 Thread Aleksander Alekseev
Hi,

> least lists what's expected -- I'm not sure if this is the way to go,
> or if we should somehow list the individual tools that are failing
> here?

Personally I think the patch is OK.

-- 
Best regards,
Aleksander Alekseev




System username in pg_stat_activity

2024-01-10 Thread Magnus Hagander
The attached patch adds a column "authuser" to pg_stat_activity which
contains the username of the externally authenticated user, being the
same value as the SYSTEM_USER keyword returns in a backend.

This overlaps with for example the values in pg_stat_gss, but it will
include values for authentication methods that don't have their own
view such as peer/ident. gss/ssl info will of course still be shown,
it is just in more than one place.

I was originally thinking this column should be "sysuser" to map to
the keyword, but since we already have "usesysid" as a column name in
pg_stat_activity I figured that could be confusing since it actually
means something completely different. But happy to change that back if
people think that's better.
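
For illustration, with the attached patch the value shows up like below (the
column is named authname in this version of the patch, and the value is
whatever SYSTEM_USER returns, e.g. "peer:magnus"):

SELECT pid, usename, authname
FROM pg_stat_activity
WHERE pid = pg_backend_pid();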

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index de78d58d4b..ec6a25495f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23462,7 +23462,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 
   

-
+
  system_user
 
 system_user
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5e..dc146a4c9f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -784,6 +784,17 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
+   authname name
+  
+  
+   The authentication method and identity (if any) that the user
+   used to log in. It contains the same value as
+returns in the backend.
+  
+ 
+
  
   
application_name text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e43e36f5ac..fe81aefbbb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -873,6 +873,7 @@ CREATE VIEW pg_stat_activity AS
 S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
+S.authname,
 S.application_name,
 S.client_addr,
 S.client_hostname,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 3dac79c30e..ffface119e 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -357,6 +357,11 @@ pgstat_bestart(void)
 	else
 		MemSet(&lbeentry.st_clientaddr, 0, sizeof(lbeentry.st_clientaddr));
 
+	if (MyProcPort && GetSystemUser())
+		strlcpy(lbeentry.st_authuser, GetSystemUser(), NAMEDATALEN);
+	else
+		MemSet(&lbeentry.st_authuser, 0, sizeof(lbeentry.st_authuser));
+
 #ifdef USE_SSL
 	if (MyProcPort && MyProcPort->ssl_in_use)
 	{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 30a2063505..cd9769c52c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -304,7 +304,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS	31
+#define PG_STAT_GET_ACTIVITY_COLS	32
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -617,6 +617,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 nulls[30] = true;
 			else
 values[30] = UInt64GetDatum(beentry->st_query_id);
+
+			/* Authenticated user */
+			values[31] = CStringGetTextDatum(beentry->st_authuser);
 		}
 		else
 		{
@@ -646,6 +649,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			nulls[28] = true;
 			nulls[29] = true;
 			nulls[30] = true;
+			nulls[31] = true;
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7979392776..bca8ea8db8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5440,9 +5440,9 @@
   proname => 'pg_stat_get_activity', prorows => '100', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => 'int4',
-  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,text,numeric,text,bool,text,bool,bool,int4,int8}',
-  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,gss_delegation,leader_pid,query_id}',
+  proallargtypes => 

Slightly improved meson error for docs tools

2024-01-10 Thread Magnus Hagander
The meson build doesn't tell you what tool is missing when trying to
build the docs (and you don't have it in the path.. sigh), it just
tells you that something is missing. Attached is a small patch that at
least lists what's expected -- I'm not sure if this is the way to go,
or if we should somehow list the individual tools that are failing
here?


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/meson.build b/meson.build
index 57f9735feb..c317144b6b 100644
--- a/meson.build
+++ b/meson.build
@@ -587,7 +587,7 @@ if not docs_opt.disabled()
   if xmllint_bin.found() and xsltproc_bin.found()
 docs_dep = declare_dependency()
   elif docs_opt.enabled()
-error('missing required tools for docs in HTML / man page format')
+error('missing required tools (xmllint and xsltproc needed) for docs in HTML / man page format')
   endif
 endif
 


Re: WIP Incremental JSON Parser

2024-01-10 Thread Andrew Dunstan



On 2024-01-09 Tu 13:46, Jacob Champion wrote:

On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan  wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse that that of the current Recursive
Descent parser.

The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.



Thanks for looking! I've been playing around with a similar idea, but 
yours might be better.






Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?



Also a good suggestion. Will look and see. IIRC I had trouble with this bit.





+   case JSON_SEM_SCALAR_CALL:
+   {
+   json_scalar_action sfunc = sem->scalar;
+
+   if (sfunc != NULL)
+   (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+   }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?



Good point. In fact it might be responsible for the error I'm currently 
trying to get to the bottom of.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Amit Kapila
On Wed, Jan 10, 2024 at 3:30 PM Alexander Lakhin  wrote:
>
> 10.01.2024 12:31, Amit Kapila wrote:
> > I am slightly hesitant to add any particular system table name in the
> > comments, as this can happen for any other system table as well, so I
> > slightly adjusted the comments in the attached. However, I think it is
> > okay to mention the particular system table name in the commit
> > message. Let me know what you think.
>
> Thank you, Amit!
>
> I'd like to note that the culprit is exactly pg_largeobject as coded in
> dumpDatabase():
>  /*
>   * pg_largeobject comes from the old system intact, so set its
>   * relfrozenxids, relminmxids and relfilenode.
>   */
>  if (dopt->binary_upgrade)
> ...
>  appendPQExpBufferStr(loOutQry,
>   "TRUNCATE pg_catalog.pg_largeobject;\n");
>
> I see no other TRUNCATEs (or similar logic) around, so I would specify the
> table name in the comments. Though maybe I'm missing something...
>

But tomorrow it could be for other tables, and if we change this
TRUNCATE logic for pg_largeobject (though the chances of that are low),
there is always a chance that one misses changing this comment. I feel
keeping it generic in this case would be better, as the problem is
generic even though it currently shows up for pg_largeobject.

-- 
With Regards,
Amit Kapila.




Compile warnings in dbcommands.c building with meson

2024-01-10 Thread Magnus Hagander
When building current head on debian bullseye I get this compile warning:

In file included from ../src/backend/commands/dbcommands.c:20:
../src/backend/commands/dbcommands.c: In function ‘createdb’:
../src/include/postgres.h:104:9: warning: ‘src_hasloginevt’ may be
used uninitialized in this function [-Wmaybe-uninitialized]
  104 |  return (Datum) (X ? 1 : 0);
  | ^~~
../src/backend/commands/dbcommands.c:683:8: note: ‘src_hasloginevt’
was declared here
  683 |  bool  src_hasloginevt;
  |^~~


I only get this when building with meson, not when building with
autotools. AFAICT, I have the same config:

./configure --enable-debug --enable-depend --with-python
--enable-cassert --with-openssl --enable-tap-tests --with-icu

vs

meson setup build -Ddebug=true -Dpython=true -Dcassert=true
-Dssl=openssl -Dtap-test=true -Dicu=enabled -Dnls=disabled


in both cases the compiler is:
gcc (Debian 10.2.1-6) 10.2.1 20210110

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Documentation to upgrade logical replication cluster

2024-01-10 Thread Amit Kapila
On Mon, Jan 8, 2024 at 12:52 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 4, 2024 at 2:22 PM vignesh C  wrote:
> >
> > Hi,
> >
> > We have documentation on how to upgrade "publisher" and "subscriber"
> > at [1], but currently we do not have any documentation on how to
> > upgrade logical replication clusters.
> > Here is a patch to document how to upgrade different logical
> > replication clusters: a) Upgrade 2 node logical replication cluster b)
> > Upgrade cascaded logical replication cluster c) Upgrade 2 node
> > circular logical replication cluster.
> > Thoughts?
> >
> > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html
>
> Thanks for this. It really helps developers a lot. In addition to the
> docs, why can't all of these steps be put into a perl/shell script or
> a C tool sitting in the src/bin directory?
>
> I prefer a postgres src/bin tool which takes publisher and subscriber
> connection strings as the inputs, talks to them and upgrades both
> publisher and subscriber. Of course, one can write such a tool outside
> of postgres in their own programming language, but the capability to
> upgrade postgres servers with logical replication is such an important
> task one would often require it. Therefore, an off-the-shelf tool not
> only avoids manual efforts but makes it effortless for the users,
> after all, if any of the steps isn't performed as stated in the docs
> the servers may end up in an inconsistent state.
>

This idea has merit, but I'm not sure whether we should just add a few
tests that users can refer to if they want, or provide a utility as you
described. I would prefer a test or two for now, and if there is demand
then we can consider having such a utility. In either case, I feel it is
better discussed in a separate thread.

-- 
With Regards,
Amit Kapila.




Re: Documentation to upgrade logical replication cluster

2024-01-10 Thread Amit Kapila
On Thu, Jan 4, 2024 at 2:22 PM vignesh C  wrote:
>
> We have documentation on how to upgrade "publisher" and "subscriber"
> at [1], but currently we do not have any documentation on how to
> upgrade logical replication clusters.
> Here is a patch to document how to upgrade different logical
> replication clusters: a) Upgrade 2 node logical replication cluster b)
> Upgrade cascaded logical replication cluster c) Upgrade 2 node
> circular logical replication cluster.
>

Today, off-list, I had a short discussion on this documentation with
Jonathan and Michael. I was not sure whether we should add this to the
main upgrade documentation or maintain it as a separate wiki
page. My primary worry was that this seemed to be taking too much
space on the pgupgrade page and making the information on that page a bit
unreadable.
logical replication page [1] and add a reference in the pgupgrade
page. That suggestion makes sense to me considering we have
sub-sections like Monitoring, Security, and Configuration Settings on
the logical replication page. We can have a new sub-section Upgrade on
the same lines. What do you think?

[1] - https://www.postgresql.org/docs/devel/logical-replication.html

-- 
With Regards,
Amit Kapila.




Re: Is the subtype_diff function in CREATE TYPE only can be C function?

2024-01-10 Thread Ashutosh Bapat
On Wed, Jan 10, 2024 at 1:49 PM ddme  wrote:
>
> Hi all,
>
> I notice that the CREATE TYPE syntax can specify a subtype_diff function
>
> CREATE TYPE name AS RANGE (
> SUBTYPE = subtype
> [ , SUBTYPE_OPCLASS = subtype_operator_class ]
> [ , COLLATION = collation ]
> [ , CANONICAL = canonical_function ]
> [ , SUBTYPE_DIFF = subtype_diff_function ] <— here
> [ , MULTIRANGE_TYPE_NAME = multirange_type_name ]
> )
>
> And an example is
> ```sql
>
> CREATE TYPE float8_range AS RANGE (subtype = float8, subtype_diff = float8mi);
>
> ```
>
> I notice that float8mi is a C function, and looking at call_subtype_diff() in
> the source code, it seems it can only call a C function.

call_subtype_diff() invokes FunctionCall2Coll() which in turn invokes
the function handler for non-C functions. See
fmgr_info_cxt_security() for example. So subtype_diff can be a SQL
callable function written in any supported language.

>
> I want to know
>
> 1. Can the subtype_diff function in CREATE TYPE be a SQL or PL/pgSQL function?

I think so.

> 2. How is the subtype_diff function called? I know it is related to GiST indexes; I
> need an example of how to trigger the subtype_diff function.

I am not familiar enough with the GiST code to answer that question. But
looking at the places where call_subtype_diff() is called, and especially the
comments there, might give you hints.
Or somebody more familiar with the GiST code will give you a direct answer.
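
For what it's worth, here is a quick sketch (untested; all object names are
made up) of a range type whose subtype_diff is a SQL-language function, plus
a GiST index with enough data that the penalty code, and therefore the diff
function, should get exercised:

CREATE FUNCTION my_float8_diff(x float8, y float8) RETURNS float8
    LANGUAGE sql IMMUTABLE AS 'SELECT x - y';

CREATE TYPE my_float8_range AS RANGE (
    subtype = float8,
    subtype_diff = my_float8_diff
);

CREATE TABLE my_ranges (r my_float8_range);
CREATE INDEX my_ranges_r_idx ON my_ranges USING gist (r);
INSERT INTO my_ranges
    SELECT my_float8_range(i, i + 1) FROM generate_series(1, 10000) i;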

-- 
Best Wishes,
Ashutosh Bapat




Re: Extend pgbench partitioning to pgbench_history

2024-01-10 Thread Shlok Kyal
Sorry, I did not intend to send this message to this thread. I sent it
by mistake. Please ignore this mail.

On Wed, 10 Jan 2024 at 15:27, Shlok Kyal  wrote:
>
> Hi,
>
> There are some test failures reported by Cfbot at [1]:
>
> [09:15:01.794] 192/276 postgresql:pgbench /
> pgbench/001_pgbench_with_server ERROR 7.48s exit status 3
> [09:15:01.794] >>>
> INITDB_TEMPLATE=/tmp/cirrus-ci-build/build/tmp_install/initdb-template
> LD_LIBRARY_PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/lib
> REGRESS_SHLIB=/tmp/cirrus-ci-build/build/src/test/regress/regress.so
> PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin:/tmp/cirrus-ci-build/build/src/bin/pgbench:/tmp/cirrus-ci-build/build/src/bin/pgbench/test:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
> PG_REGRESS=/tmp/cirrus-ci-build/build/src/test/regress/pg_regress
> MALLOC_PERTURB_=67 /usr/local/bin/python3
> /tmp/cirrus-ci-build/build/../src/tools/testwrap --basedir
> /tmp/cirrus-ci-build/build --srcdir
> /tmp/cirrus-ci-build/src/bin/pgbench --testgroup pgbench --testname
> 001_pgbench_with_server -- /usr/local/bin/perl -I
> /tmp/cirrus-ci-build/src/test/perl -I
> /tmp/cirrus-ci-build/src/bin/pgbench
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl
> [09:15:01.794] ― ✀
> ―
> [09:15:01.794] stderr:
> [09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_2'
> [09:15:01.794] # at
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
> 1247.
> [09:15:01.794] # Failed test 'transaction count for
> /tmp/cirrus-ci-build/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_log_3.25193
> (11)'
> [09:15:01.794] # at
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
> 1257.
> [09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_3'
> [09:15:01.794] # at
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
> 1257.
> [09:15:01.794] # Looks like you failed 3 tests of 439.
> [09:15:01.794]
> [09:15:01.794] (test program exited with status code 3)
>
> [1] - https://cirrus-ci.com/task/5139049757802496
>
> Thanks and regards
> Shlok Kyal




Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Alexander Lakhin

10.01.2024 12:31, Amit Kapila wrote:

I am slightly hesitant to add any particular system table name in the
comments, as this can happen for any other system table as well, so I
slightly adjusted the comments in the attached. However, I think it is
okay to mention the particular system table name in the commit
message. Let me know what you think.


Thank you, Amit!

I'd like to note that the culprit is exactly pg_largeobject as coded in
dumpDatabase():
    /*
 * pg_largeobject comes from the old system intact, so set its
 * relfrozenxids, relminmxids and relfilenode.
 */
    if (dopt->binary_upgrade)
...
    appendPQExpBufferStr(loOutQry,
 "TRUNCATE pg_catalog.pg_largeobject;\n");

I see no other TRUNCATEs (or similar logic) around, so I would specify the
table name in the comments. Though maybe I'm missing something...

Best regards,
Alexander




Re: Extend pgbench partitioning to pgbench_history

2024-01-10 Thread Shlok Kyal
Hi,

There are some test failures reported by Cfbot at [1]:

[09:15:01.794] 192/276 postgresql:pgbench /
pgbench/001_pgbench_with_server ERROR 7.48s exit status 3
[09:15:01.794] >>>
INITDB_TEMPLATE=/tmp/cirrus-ci-build/build/tmp_install/initdb-template
LD_LIBRARY_PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/lib
REGRESS_SHLIB=/tmp/cirrus-ci-build/build/src/test/regress/regress.so
PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin:/tmp/cirrus-ci-build/build/src/bin/pgbench:/tmp/cirrus-ci-build/build/src/bin/pgbench/test:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
PG_REGRESS=/tmp/cirrus-ci-build/build/src/test/regress/pg_regress
MALLOC_PERTURB_=67 /usr/local/bin/python3
/tmp/cirrus-ci-build/build/../src/tools/testwrap --basedir
/tmp/cirrus-ci-build/build --srcdir
/tmp/cirrus-ci-build/src/bin/pgbench --testgroup pgbench --testname
001_pgbench_with_server -- /usr/local/bin/perl -I
/tmp/cirrus-ci-build/src/test/perl -I
/tmp/cirrus-ci-build/src/bin/pgbench
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl
[09:15:01.794] ― ✀
―
[09:15:01.794] stderr:
[09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_2'
[09:15:01.794] # at
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
1247.
[09:15:01.794] # Failed test 'transaction count for
/tmp/cirrus-ci-build/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_log_3.25193
(11)'
[09:15:01.794] # at
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
1257.
[09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_3'
[09:15:01.794] # at
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
1257.
[09:15:01.794] # Looks like you failed 3 tests of 439.
[09:15:01.794]
[09:15:01.794] (test program exited with status code 3)

[1] - https://cirrus-ci.com/task/5139049757802496

Thanks and regards
Shlok Kyal




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-10 Thread Ashutosh Bapat
On Tue, Jan 9, 2024 at 10:09 AM Michael Paquier  wrote:

>
> Okay, I can see your point.  I have reorganized the docs in the
> following order:
> - INJECTION_POINT
> - Attach
> - Detach

Thanks. This looks better. It needs further wordsmithing, but that can wait
till the code has been reviewed.

>
> > I think, exposing the current injection point strings as
> > #defines and encouraging users to use these macros instead of string
> > literals will be a good start.
>
> Nah, I disagree with this one actually.  It is easy to grep for the
> macro INJECTION_POINT to be able to achieve the same research job, and
> this would make the code more inconsistent when callbacks are run
> within extensions which don't care about a #define in a backend
> header.

The macros should be in an extension-facing header, of course. But I take back
this suggestion, since defining these macros is extra work (every time
a new injection point is declared) and your suggestion to grep works in
practice. We can improve things as needed.

>
> > With the current implementation it's possible to "declare" injection
> > points with the same name in multiple places. It's useful, but is it
> > intended?
>
> Yes.  I would recommend not doing that, but I don't see why there
> would be a point in restricting that, either.

My concern was that an unintentional misspelling might trigger an unintended
injection point. But we will see how much of that happens in practice.

> > +#ifdef USE_INJECTION_POINTS
> > +static bool
> > +file_exists(const char *name)
> >
> > There's similar function in jit.c and dfmgr.c. Can we not reuse that code?
>
> This has been mentioned in a different comment.  Refactored as of
> 0001, but there is something here related to EACCES for the JIT path.
> Seems weird to me that we would not fail if the JIT library cannot be
> accessed when stat() fails.

I agree with this change to jit. Without search permission on
every directory in the path, the function cannot determine whether the
file exists or not. So throwing an error is better than just returning
false, which would mean that the file does not exist.
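
In other words, something along these lines (a sketch, not the exact hunk
from the patch), where only ENOENT is treated as "the file does not exist":

static bool
file_exists(const char *name)
{
	struct stat st;

	if (stat(name, &st) == 0)
		return !S_ISDIR(st.st_mode);
	else if (errno != ENOENT)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not access file \"%s\": %m", name)));

	return false;
}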

>
> > + /* Save the entry */
> > + memcpy(entry_by_name->name, name, sizeof(entry_by_name->name));
> > + entry_by_name->name[INJ_NAME_MAXLEN - 1] = '\0';
> > + memcpy(entry_by_name->library, library, sizeof(entry_by_name->library));
> > + entry_by_name->library[INJ_LIB_MAXLEN - 1] = '\0';
> > + memcpy(entry_by_name->function, function, 
> > sizeof(entry_by_name->function));
> > + entry_by_name->function[INJ_FUNC_MAXLEN - 1] = '\0';
> >
> > Most of the code is using strncpy instead of memcpy. Why is this code 
> > different?
>
> strncpy() is less used in the backend code.  It comes to a matter of
> taste, IMO.

To me using memcpy implies that the contents of the memory being
copied can be non-character. For a buffer containing a character
string I would prefer strncpy. But I wouldn't argue further.

>
> Yeah, that's an intended design choice to keep the code simpler and
> faster as there is no need to track the library and function names in
> the local caches or implement something similar to invalidation
> messages for this facility because it would impact performance anyway
> in the call paths.  In short, just don't do that, or use two distinct
> points.

In practice the InjectionPointDetach() and InjectionPointAttach()
calls may not be close by, and the user may not be able to figure out
why the injection points are behaving weirdly. It may impact
performance, but unexpected behaviour should be avoided, IMO.

If nothing else, this should be documented.

>
> On Thu, Jan 04, 2024 at 06:22:35PM +0530, Ashutosh Bapat wrote:
> > One more comment on 0001
> > InjectionPointAttach() doesn't test whether the given function exists
> > in the given library. Even if InjectionPointAttach() succeeds,
> > INJECTION_POINT might throw error because the function doesn't exist.
> > This can be seen as an unwanted behaviour. I think
> > InjectionPointAttach() should test whether the function exists and
> > possibly load it as well by adding it to the local cache.
>
> This has the disadvantage of filling the local cache but that may not
> be necessary with an extra load_external_function() in the attach
> path.  I agree to make things safer, but I would do that when
> attempting to run the callback instead.  Perhaps there's an argument
> for the case of somebody replacing a library on-the-fly.  I don't
> really buy it, but people like doing fancy things sometimes.

I am ok with not populating the cache and just checking with
load_external_function(). This is again another ease-of-use scenario
where a silly mistake by the user is caught earlier, making the user's
life easier. That at least should be the goal of the first cut.
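
Concretely, I had in mind nothing more than an extra lookup in the attach
path, roughly like this (sketch only; argument names follow the attach
function's parameters):

	/* verify the callback can be resolved now; discard the result */
	if (load_external_function(library, function, false, NULL) == NULL)
		elog(ERROR, "could not find injection point function \"%s\" in library \"%s\"",
			 function, library);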

>
> > 0002 comments
> > --- /dev/null
> > +++ 
> > b/src/test/modules/test_injection_points/expected/test_injection_points.out
> >
> > When built without injection point support, this test fails. We should
> > add an alternate output file for such a build so that the 

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-10 Thread Bertrand Drouvot
Hi,

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully),

Thanks for testing!

> although I've managed to
> see another interesting failure (twice):
> 13  #   Failed test 'activeslot slot invalidation is logged with vacuum 
> on pg_class'
> 13  #   at t/035_standby_logical_decoding.pl line 227.
> 

Looking at the attached log files and particularly 
1/regress_log_035_standby_logical_decoding:

"
[11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with 
vacuum on pg_class
[11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with 
vacuum on pg_class
"

That seems weird: the inactive slot has been invalidated while the active one
is not. While it takes a bit longer to invalidate an active slot, I don't think
the test can move on until both are invalidated (leading to tests 24 and 25).
I can see the tests are very slow to run (13.993s for 24) but still don't get
how 24 could succeed while 25 does not.

Looking at 2/regress_log_035_standby_logical_decoding:

"
[13:41:02.076](20.279s) ok 23 - inactiveslot slot invalidation is logged with 
vacuum on pg_class
[13:41:02.076](0.000s) not ok 24 - activeslot slot invalidation is logged with 
vacuum on pg_class
"

Same "weird" behavior but this time the tests numbering are not the same (23 
and 24).
That is even more weird as those tests should be the 24 and 25 ones.

Would it be possible to also send the standby logs?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-01-10 Thread vignesh C
On Wed, 6 Dec 2023 at 12:53, Euler Taveira  wrote:
>
> On Thu, Nov 9, 2023, at 8:12 PM, Michael Paquier wrote:
>
> On Thu, Nov 09, 2023 at 03:41:53PM +0100, Peter Eisentraut wrote:
> > On 08.11.23 00:12, Michael Paquier wrote:
> >> - Should the subdirectory pg_basebackup be renamed into something more
> >> generic at this point?  All these things are frontend tools that deal
> >> in some way with the replication protocol to do their work.  Say
> >> a replication_tools?
> >
> > Seems like unnecessary churn.  Nobody has complained about any of the other
> > tools in there.
>
> Not sure.  We rename things across releases in the tree from time to
> time, and here that's straight-forward.
>
>
> Based on this discussion it seems we have a consensus that this tool should be
> in the pg_basebackup directory. (If/when we agree with the directory renaming,
> it could be done in a separate patch.) Besides this move, v3 provides a
> dry-run mode. It basically executes every routine but skips the steps that
> would make modifications. It is a useful option to check whether you will be
> able to run it without having issues with connectivity, permissions, and
> existing objects (replication slots, publications, subscriptions). Tests were
> slightly improved.
> Messages were changed to *not* provide INFO messages by default; --verbose
> provides INFO messages and --verbose --verbose also provides DEBUG messages. I
> also refactored the connect_database() function so that the connection will
> always use the logical replication mode. A bug was fixed in the transient
> replication slot name. Ashutosh's review [1] was included. The code was also
> indented.
>
> There are a few suggestions from Ashutosh [2] that I will reply in another
> email.
>
> I'm still planning to work on the following points:
>
> 1. improve the cleanup routine to point out leftover objects if there is any
>connection issue.
> 2. remove the physical replication slot if the standby is using one
>(primary_slot_name).
> 3. provide instructions to promote the logical replica into primary, I mean,
>stop the replication between the nodes and remove the replication setup
>(publications, subscriptions, replication slots). Or even include another
>action to do it. We could add both too.
>
> Point 1 should be done. Points 2 and 3 aren't essential but will provide a 
> nice
> UI for users that would like to use it.

A few comments:
1) We should not allow specifying the same database name twice, as we
will try to create the slots multiple times on the publisher; this can
be detected while parsing the options and an error can be thrown (a
sketch of such a check follows at the end of these comments):
+   case 'd':
+
simple_string_list_append(_names, optarg);
+   num_dbs++;
+   break;

+static bool
+create_all_logical_replication_slots(LogicalRepInfo *dbinfo)
+{
+   int i;
+
+   for (i = 0; i < num_dbs; i++)
+   {
+   PGconn *conn;
+   PGresult   *res;
+   charreplslotname[NAMEDATALEN];



+   /* Create replication slot on publisher. */
+   if (create_logical_replication_slot(conn, &dbinfo[i],
replslotname) != NULL || dry_run)
+   pg_log_info("create replication slot \"%s\" on
publisher", replslotname);
+   else
+   return false;
+
+   disconnect_database(conn);
+   }

E.g.: pg_subscriber -d postgres -d postgres

2) 2023 should be changed to 2024
+/*-
+ *
+ * pg_subscriber.c
+ *   Create a new logical replica from a standby server
+ *
+ * Copyright (C) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_subscriber/pg_subscriber.c
+ *
+ *-
+ */

3) Similarly here too:
diff --git a/src/bin/pg_basebackup/t/040_pg_subscriber.pl
b/src/bin/pg_basebackup/t/040_pg_subscriber.pl
new file mode 100644
index 00..9d20847dc2
--- /dev/null
+++ b/src/bin/pg_basebackup/t/040_pg_subscriber.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+#
+# Test checking options of pg_subscriber.
+#

4) Similarly here too:
diff --git a/src/bin/pg_basebackup/t/041_pg_subscriber_standby.pl
b/src/bin/pg_basebackup/t/041_pg_subscriber_standby.pl
new file mode 100644
index 00..ce25608c68
--- /dev/null
+++ b/src/bin/pg_basebackup/t/041_pg_subscriber_standby.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+#
+# Test using a standby server as the subscriber.
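
As for comment 1, one way to catch the duplicate at option-parsing time would
be roughly the following (sketch only; the list variable name is illustrative
and not taken from the patch):

	case 'd':
		if (simple_string_list_member(&database_names, optarg))
			pg_fatal("database \"%s\" specified more than once", optarg);
		simple_string_list_append(&database_names, optarg);
		num_dbs++;
		break;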

Regards,
Vignesh




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2024-01-10 Thread Amit Kapila
On Wed, Jan 10, 2024 at 2:59 PM Shlok Kyal  wrote:
>
> This patch no longer applies on HEAD. Please rebase and share the
> updated patch.
>

IIRC, there were some regressions observed with this patch. So, one
needs to analyze those as well. I think we should mark it as "Returned
with feedback".

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Amit Kapila
On Tue, Jan 9, 2024 at 4:30 PM Alexander Lakhin  wrote:
>
> 09.01.2024 13:08, Amit Kapila wrote:
> >
> >> As to checkpoint_timeout, personally I would not increase it, because it
> >> seems unbelievable to me that pg_restore (with the cluster containing only
> >> two empty databases) can run for longer than 5 minutes. I'd rather
> >> investigate such situation separately, in case we encounter it, but maybe
> >> it's only me.
> >>
> > I feel it is okay to set a higher value of checkpoint_timeout due to
> > the same reason though the probability is less. I feel here it is
> > important to explain in the comments why we are using these settings
> > in the new test. I have thought of something like: "During the
> > upgrade, bgwriter or checkpointer could hold the file handle for some
> > removed file. Now, during restore when we try to create the file with
> > the same name, it errors out. This behavior is specific to only some
> > specific Windows versions and the probability of seeing this behavior
> > is higher in this test because we use wal_level as logical via
> > allows_streaming => 'logical' which in turn sets shared_buffers as
> > 1MB."
> >
> > Thoughts?
>
> I would describe that behavior as "During upgrade, when pg_restore performs
> CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
> handle for pg_largeobject, so later TRUNCATE pg_largeobject command will
> fail if OS (such as older Windows versions) doesn't remove an unlinked file
> completely till it's open. ..."
>

I am slightly hesitant to add any particular system table name in the
comments, as this can happen for any other system table as well, so I
slightly adjusted the comments in the attached. However, I think it is
okay to mention the particular system table name in the commit
message. Let me know what you think.

-- 
With Regards,
Amit Kapila.


v2-0001-Fix-an-intermetant-BF-failure-in-003_logical_slot.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2024-01-10 Thread Shlok Kyal
Hi,

This patch no longer applies on HEAD. Please rebase and share the
updated patch.

Thanks and Regards
Shlok Kyal

On Wed, 10 Jan 2024 at 14:55, Peter Smith  wrote:
>
> Oops - now with attachments
>
> On Mon, Aug 21, 2023 at 5:56 PM Peter Smith  wrote:
>>
>> Hi Melih,
>>
>> Last week we revisited your implementation of design#2. Vignesh rebased it, 
>> and then made a few other changes.
>>
>> PSA v28*
>>
>> The patch changes include:
>> * changed the logic slightly by setting recv_immediately(new variable), if 
>> this variable is set the main apply worker loop will not wait in this case.
>> * setting the relation state to ready immediately if there are no more 
>> incremental changes to be synced.
>> * receive the incremental changes if applicable and set the relation state 
>> to ready without waiting.
>> * reuse the worker if the worker is free before trying to start a new table 
>> sync worker
>> * restarting the tablesync worker only after wal_retrieve_retry_interval
>>
>> ~
>>
>> FWIW, we just wanted to share with you the performance measurements seen 
>> using this design#2 patch set:
>>
>> ==
>>
>> RESULTS (not busy tests)
>>
>> --
>> 10 empty tables
>> 2w  4w  8w  16w
>> HEAD:   125 119 140 133
>> HEAD+v28*:  92  93  123 134
>> %improvement:   27% 22% 12% -1%
>> --
>> 100 empty tables
>> 2w  4w  8w  16w
>> HEAD:   1037843 11091155
>> HEAD+v28*:  591 625 26162569
>> %improvement:   43% 26% -136%   -122%
>> --
>> 1000 empty tables
>> 2w  4w  8w  16w
>> HEAD:   15874   10047   991910338
>> HEAD+v28*:  33673   12199   90949896
>> %improvement:   -112%   -21%8%  4%
>> --
>> 2000 empty tables
>> 2w  4w  8w  16w
>> HEAD:   45266   24216   19395   19820
>> HEAD+v28*:  88043   21550   21668   22607
>> %improvement:  -95% 11% -12%-14%
>>
>> ~~~
>>
>> Note - the results were varying quite a lot in comparison to the HEAD
>> e.g. HEAD results are very consistent, but the v28* results observed are not
>> HEAD 1000 (2w): 15861, 15777, 16007, 15950, 15886, 15740, 15846, 15740, 
>> 15908, 15940
>> v28* 1000 (2w):  34214, 13679, 8792, 33289, 31976, 56071, 57042, 56163, 
>> 34058, 11969
>>
>> --
>> Kind Regards,
>> Peter Smith.
>> Fujitsu Australia




Re: Relation bulk write facility

2024-01-10 Thread Heikki Linnakangas

On 09/01/2024 08:50, vignesh C wrote:

There are few compilation errors reported by CFBot at [1], patch needs
to be rebased:


Here you go.

--
Heikki Linnakangas
Neon (https://neon.tech)
From b1791303c54da762ecf3d63f1a60d5b93732af57 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 10 Jan 2024 11:24:20 +0200
Subject: [PATCH v4 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to pageinspect GiST test needs an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
---
 contrib/pageinspect/expected/gist.out |  14 +-
 contrib/pageinspect/sql/gist.sql  |  16 +-
 src/backend/access/gist/gistbuild.c   | 122 +++
 src/backend/access/heap/rewriteheap.c |  72 ++
 src/backend/access/nbtree/nbtree.c|  33 +--
 src/backend/access/nbtree/nbtsort.c   | 135 
 src/backend/access/spgist/spginsert.c |  49 ++---
 src/backend/catalog/storage.c |  46 ++--
 src/backend/storage/smgr/Makefile |   1 +
 src/backend/storage/smgr/bulk_write.c | 301 ++
 src/backend/storage/smgr/md.c |  45 +++-
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c   |  31 +++
 src/include/storage/bulk_write.h  |  37 
 src/include/storage/md.h  |   1 +
 src/include/storage/smgr.h|   1 +
 src/tools/pgindent/typedefs.list  |   3 +
 17 files changed, 553 insertions(+), 355 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index d1adbab8ae2..2b1d54a6279 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -1,13 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole relation is dumped to WAL at
--- the end of the transaction if it's smaller than wal_skip_threshold, which
--- updates the LSNs. Wrap the tests on gist_page_opaque_info() in the
--- same transaction with the CREATE INDEX so that we see the LSNs before
--- they are possibly overwritten at end of transaction.
-BEGIN;
--- Create a test table and GiST index.
-CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+-- The gist_page_opaque_info() function prints the page's LSN.
+-- Use an unlogged index, so that the LSN is predictable.
+CREATE UNLOGGED TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
 generate_series(1,1000) i;
 CREATE INDEX test_gist_idx ON test_gist USING gist (p);
 -- Page 0 is the root, the rest are leaf pages
@@ -29,7 +22,6 @@ SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
  0/1 | 0/0 | 1 | {leaf}
 (1 row)
 
-COMMIT;
 SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
  itemoffset |   ctid| itemlen | dead | keys  
 +---+-+--+---
diff --git a/contrib/pageinspect/sql/gist.sql b/contrib/pageinspect/sql/gist.sql
index d263542ba15..85bc44b8000 100644
--- a/contrib/pageinspect/sql/gist.sql
+++ b/contrib/pageinspect/sql/gist.sql
@@ -1,14 +1,6 @@
--- The gist_page_opaque_info() function prints the page's LSN. Normally,
--- that's constant 1 (GistBuildLSN) on every page of a freshly built GiST
--- index. But with wal_level=minimal, the whole 
