Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-31 Thread Andres Freund
Hi,

On 2023-03-31 16:57:41 +0300, Alexander Korotkov wrote:
> On Wed, Mar 29, 2023 at 8:34 PM Alexander Korotkov  
> wrote:
> > On Mon, Mar 27, 2023 at 1:49 PM Alexander Korotkov  
> > wrote:
> > > On Fri, Mar 24, 2023 at 3:39 AM Andres Freund  wrote:
> > > > On 2023-03-23 23:24:19 +0300, Alexander Korotkov wrote:
> > > > > On Thu, Mar 23, 2023 at 8:06 PM Andres Freund  
> > > > > wrote:
> > > > > > I seriously doubt that solving this at the tuple locking level is
> > > > > > the right thing. If we want to avoid refetching tuples, why don't
> > > > > > we add a parameter to delete/update to generally put the old tuple
> > > > > > version into a slot, not just as an optimization for a subsequent
> > > > > > lock_tuple()? Then we could remove all refetching of tuples for
> > > > > > triggers. It'd also provide the basis for adding support for
> > > > > > referencing the OLD version in RETURNING, which'd be quite
> > > > > > powerful.
> > >
> > > After some thought, I think I like the idea of fetching the old tuple
> > > version in update/delete.  Anything that avoids extra tuple fetching and
> > > does more of the related work in a single table AM call makes the table
> > > AM API more flexible.
> > >
> > > I'm working on a patch implementing this.  I'm going to post it later today.
> >
> > Here is the patchset.  I'm continuing to work on comments and refactoring.
> >
> > My quick question is why do we need ri_TrigOldSlot for triggers?
> > Can't we just pass the old tuple for after row trigger in
> > ri_oldTupleSlot?
> >
> > Also, I wonder if we really need a LazyTupleSlot.  It lets us avoid an
> > extra tuple slot allocation.  But as I understand it, in the end the
> > tuple slot allocation is just a single palloc.  I bet the effect would
> > be invisible in the benchmarks.
> 
> Sorry, the previous patches don't even compile.  The fixed version is attached.
> I'm going to post a significantly revised patchset soon.

Given that the in-tree state has been broken for a week, I think it probably
is time to revert the commits that already went in.
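
For reference, the API change discussed upthread amounts to giving the
delete/update table AM entry points an output slot for the old tuple
version. A minimal sketch, assuming the current wrapper style of
access/tableam.h; the trailing oldSlot parameter (and the matching change to
the AM callback) is the hypothetical addition, the rest mirrors the existing
signature:

#include "access/tableam.h"

/*
 * Sketch: as table_tuple_delete(), but if oldSlot is not NULL the AM stores
 * the deleted tuple's old version there while it still has the page at
 * hand, so callers (triggers, RETURNING OLD) need not refetch it.
 */
static inline TM_Result
table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
				   Snapshot snapshot, Snapshot crosscheck, bool wait,
				   TM_FailureData *tmfd, bool changingPart,
				   TupleTableSlot *oldSlot)	/* hypothetical addition */
{
	return rel->rd_tableam->tuple_delete(rel, tid, cid, snapshot, crosscheck,
										 wait, tmfd, changingPart, oldSlot);
}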

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2023-03-31 Thread Amit Kapila
On Fri, Mar 31, 2023 at 7:14 PM Drouvot, Bertrand
 wrote:
>
> On 3/31/23 1:58 PM, Amit Kapila wrote:
> > On Fri, Mar 31, 2023 at 4:17 PM Drouvot, Bertrand
> >  wrote:
> >>
> >
> > + * This is needed for logical decoding on standby. The "problem" is that
> > + * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets
> > + * woken up by walreceiver when new WAL has been flushed. This means that
> > + * typically walsenders will get woken up at the same time as the startup
> > + * process, so by the time the logical walsender checks
> > + * GetXLogReplayRecPtr() it's unlikely that the startup process has
> > + * already replayed the record and updated XLogCtl->lastReplayedEndRecPtr.
> > + *
> > + * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner
> > + * case.
> >
> > IIUC we are introducing condition variables as we can't rely on
> > current wait events because they will lead to spurious wakeups for
> > logical walsenders due to the below code in walreceiver:
> > XLogWalRcvFlush()
> > {
> > ...
> > /* Signal the startup process and walsender that new WAL has arrived */
> > WakeupRecovery();
> > if (AllowCascadeReplication())
> > WalSndWakeup();
> >
> > Is my understanding correct?
> >
>
> Both the walsender and the startup process are woken up at the same time.
> If the walsender does not find any new record that has been replayed
> (because the startup process has not replayed it yet), then it will sleep
> for its full timeout (thereby delaying the decoding).
>
> The CV helps to wake up the walsender as soon as a replay is done.
>
> > Can't we simply avoid waking up logical walsenders at this place and
> > rather wake them up at ApplyWalRecord() where the 0005 patch does
> > conditionvariable broadcast? Now, there doesn't seem to be anything
> > that distinguishes between logical and physical walsender but I guess
> > we can add a variable in WalSnd structure to identify it.
> >
>
> That sounds like a good idea. We could imagine creating a
> LogicalWalSndWakeup() that triages the walsenders based on a new variable
> (as you suggest).
>
> But it looks to me that we:
>
> - would need to go through the list of all the walsenders to do the triage
> - could wake up some logical walsender(s) unnecessarily
>

Why could it wake up unnecessarily?

> This extra work would occur during each replay.
>
> while with the CV, only the ones in the CV wait queue would be woken up.
>

Currently, we wake up walsenders only after writing some WAL records
at the time of flush, so won't it be better to wake up only after
applying some WAL records rather than after applying each record?
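
For reference, the wait/broadcast pattern under discussion boils down to the
following (a minimal sketch: replayedCV comes from the quoted patch and is
assumed to be reachable here; the function names and the wait event were
chosen for illustration):

#include "access/xlogrecovery.h"
#include "storage/condition_variable.h"

/* Startup process, after replaying a record (e.g. in ApplyWalRecord()): */
static void
WakeReplayWaiters(void)
{
	ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);
}

/* Logical walsender, waiting for the replay LSN to pass target_lsn: */
static void
WaitForReplayPast(XLogRecPtr target_lsn)
{
	ConditionVariablePrepareToSleep(&XLogRecoveryCtl->replayedCV);
	while (GetXLogReplayRecPtr(NULL) < target_lsn)
		ConditionVariableSleep(&XLogRecoveryCtl->replayedCV,
							   WAIT_EVENT_WAL_SENDER_WAIT_WAL);
	ConditionVariableCancelSleep();
}

Only backends actually queued on the CV are woken by the broadcast, which is
Bertrand's point above about not having to walk the whole walsender array.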

-- 
With Regards,
Amit Kapila.




Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Drouvot, Bertrand

Hi,

On 4/1/23 1:13 AM, Andres Freund wrote:

Hi,

On 2023-03-31 17:00:00 +0300, Alexander Lakhin wrote:

31.03.2023 15:55, Tom Lane wrote:

See also the thread about bug #16329 [1].  Alexander promised to look
into improving the test coverage in this area, maybe he can keep an
eye on the WAL logic coverage too.


Yes, I'm going to analyze that area too. Maybe it'll take more time
(a week or two) if I encounter some bugs there (for now I observe anomalies
with gist__int_ops), but I will definitely try to improve the gist testing.


Because I needed it to verify the changes in the referenced patch, I wrote
tests exercising killtuples based pruning for gist and hash.



Thanks for the patch!

I did not look at the details but "just" checked that the coverage is now there.

And indeed, when running "make check" + "027_stream_regress.pl":

I can see it moving from (without the patch):

function gistXLogDelete called 0 returned 0% blocks executed 0%
function gistRedoDeleteRecord called 0 returned 0% blocks executed 0%
function gistprunepage called 0 returned 0% blocks executed 0%
function _hash_vacuum_one_page called 0 returned 0% blocks executed 0%

to (with the patch):

function gistXLogDelete called 9 returned 100% blocks executed 100%
function gistRedoDeleteRecord called 5 returned 100% blocks executed 100% 
(thanks to 027_stream_regress.pl)
function gistprunepage called 9 returned 100% blocks executed 79%
function _hash_vacuum_one_page called 12 returned 100% blocks executed 94%


For now I left the new tests in their own files. But possibly they should be
in gist.sql and hash_index.sql respectively?


+1 to put them in gist.sql and hash_index.sql.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WL_SOCKET_ACCEPT fairness on Windows

2023-03-31 Thread Thomas Munro
On Sat, Apr 1, 2023 at 2:35 PM Andres Freund  wrote:
> I wonder if we ought to bite the bullet and replace the use of
> WaitForMultipleObjects() with RegisterWaitForSingleObject() and then use
> GetQueuedCompletionStatus() to wait. The fairness issue here is a motivation,
> but the bigger one is that that'd get us out from under the
> MAXIMUM_WAIT_OBJECTS (IIRC 64) limit. Afaict that'd also allow us to read
> multiple notifications at once, using GetQueuedCompletionStatusEx().

Interesting.  So a callback would run in a system-managed thread, and
that would post a custom message in an IOCP for us to read, kinda like
the fake waitpid() thing?  Seems a bit gross and context-switchy but I
agree that the 64 event limit is also terrible.
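
For the archives, the mooted scheme would glue those Win32 calls together
roughly like this (a hypothetical sketch of the system APIs involved, not
PostgreSQL code; error handling omitted):

#include <windows.h>

static HANDLE iocp;

/* Runs on a system thread-pool thread whenever the event is signaled. */
static VOID CALLBACK
event_signaled(PVOID context, BOOLEAN timed_out)
{
	/* Forward the event's index (smuggled in as context) to the IOCP. */
	PostQueuedCompletionStatus(iocp, 0, (ULONG_PTR) context, NULL);
}

static void
setup_and_wait(HANDLE *events, int nevents)
{
	DWORD		nbytes;
	ULONG_PTR	key;
	LPOVERLAPPED ov;

	iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

	/* Register each event; the MAXIMUM_WAIT_OBJECTS limit no longer applies. */
	for (int i = 0; i < nevents; i++)
	{
		HANDLE		wait_handle;

		RegisterWaitForSingleObject(&wait_handle, events[i], event_signaled,
									(PVOID) (ULONG_PTR) i, INFINITE,
									WT_EXECUTEDEFAULT);
	}

	/* One notification at a time; GetQueuedCompletionStatusEx() could
	 * dequeue a batch instead. */
	if (GetQueuedCompletionStatus(iocp, &nbytes, &key, &ov, INFINITE))
	{
		int			which = (int) key;	/* index of the signaled event */

		/* ... service events[which] ... */
		(void) which;
	}
}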

> Medium term that'd also be a small step towards using readiness based APIs in
> a few places...

Yeah, that would be cool.

> > I think we could get the same effect as pgwin32_select() more cheaply
> > by doing the initial WaitForMultipleEvents() call with the caller's
> > timeout exactly as we do today, and then, while we have space,
> > repeatedly calling
> > WaitForMultipleEvents(handles=[last_signaled_event_index + 1],
> > timeout=0) until it reports timeout.
>
> That would make sense, and should indeed be reasonable cost-wise.

Cool.

> > I mention this now because I'm not sure whether to consider this an
> > 'open item' for 16, or merely an enhancement for 17.  I guess the
> > former, because someone might call that a new denial of service
> > vector.  On the other hand, if you fill up the listen queue for socket
> > 1 with enough vigour, you're also denying service to socket 1, so I
> > don't know if it's worth worrying about.  Opinions on that?
>
> I'm not sure either. It doesn't strike me as a particularly relevant
> bottleneck. And the old approach of doing more work for every single
> connection also made many connections worse, I think?

Alright, let's see if anyone else thinks this is worth fixing for 16.

> > diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> > index f4123e7de7..cc7b572008 100644
> > --- a/src/backend/storage/ipc/latch.c
> > +++ b/src/backend/storage/ipc/latch.c
> > @@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
> >*/
> >   cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
> >
> > +loop:
> > +
>
> As far as I can tell, we'll never see WL_LATCH_SET or WL_POSTMASTER_DEATH. I
> think it'd look cleaner to move the body of if (cur_event->events &
> WL_SOCKET_MASK) into a separate function that we then also can call further
> down.

We could see them, AFAICS, and I don't see much advantage in making
that assumption?  Shouldn't we just shove it in a loop, just like the
other platforms' implementations?  Done in this version, which is best
viewed with git show --ignore-all-space.

> Seems like we could use returned_events to get nevents in the way you want it,
> without adding even more ++/-- to each of the different events?

Yeah.  This time I use reported_events.  I also fixed a maths failure:
I'd forgotten to use rc - WAIT_OBJECT_0, suggesting that CI never
reached the code.
From 125deab4030d7cf8918df19fb67fc4c8bee27570 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 1 Apr 2023 12:36:12 +1300
Subject: [PATCH v2] Teach WaitEventSetWait to report multiple events on
 Windows.

The WaitEventSet implementation on Windows has always reported only one
event at a time, and always the "lowest" in its event array.  Since
commit 7389aad6 started using WaitEventSet to handle incoming socket
connections, this unfairness changes the accept priority when using
multiple server sockets.  If one of them has a non-empty listen queue
due to incoming connections, the other might never be serviced.  The
previous coding based on select() was fair in that way.

Fix, by teaching WaitEventSetWait() to poll for extra events.  No change
in behavior in the common case of callers with nevents=1 (for example
the main FEBE socket code), but for the postmaster's main loop we'll
drain all the events that can fit in the output buffer, which is
deliberately set large enough to handle the maximum possible number of
sockets.  This brings the Windows behavior in line with Unix.

Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKG%2BA2dk29hr5zRP3HVJQ-_PncNJM6HVQ7aaYLXLRBZU-xw%40mail.gmail.com

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..334236aefc 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1933,14 +1933,10 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 #elif defined(WAIT_USE_WIN32)
 
 /*
- * Wait using Windows' WaitForMultipleObjects().
+ * Wait using Windows' WaitForMultipleObjects().  Each call only "consumes" one
+ * event, so we keep calling until we've filled up our output buffer.
  *
- * Unfortunately this will only ever return a single readiness 

Re: WL_SOCKET_ACCEPT fairness on Windows

2023-03-31 Thread Andres Freund
Hi,

On 2023-04-01 13:42:21 +1300, Thomas Munro wrote:
> Commit 7389aad6 started using WaitEventSetWait() to wait for incoming
> connections.  Before that, we used select(), for which we have our own
> implementation for Windows.
> 
> While hacking on patches to rip a bunch of unused Win32 socket wrapper
> code out, I twigged that the old pgwin32_select() code was careful to
> report multiple sockets at once by brute force polling of all of them,
> while WaitEventSetWait() doesn't do that: it reports just one event,
> because that's what the Windows WaitForMultipleEvents() syscall does.
> I guess this means you can probably fill up the listen queue of server
> socket 1 to prevent server socket 2 from ever being serviced, whereas
> on Unix we'll accept one connection at a time from each in round-robin
> fashion.

It does indeed sound like it'd behave that way:

> If bWaitAll is FALSE, the return value minus WAIT_OBJECT_0 indicates the
> lpHandles array index of the object that satisfied the wait. If more than
> one object became signaled during the call, this is the array index of the
> signaled object with the smallest index value of all the signaled objects.

I wonder if we ought to bite the bullet and replace the use of
WaitForMultipleObjects() with RegisterWaitForSingleObject() and then use
GetQueuedCompletionStatus() to wait. The fairness issue here is a motivation,
but the bigger one is that that'd get us out from under the
MAXIMUM_WAIT_OBJECTS (IIRC 64) limit. Afaict that'd also allow us to read
multiple notifications at once, using GetQueuedCompletionStatusEx().

Medium term that'd also be a small step towards using readiness based APIs in
a few places...


> I think we could get the same effect as pgwin32_select() more cheaply
> by doing the initial WaitForMultipleEvents() call with the caller's
> timeout exactly as we do today, and then, while we have space,
> repeatedly calling
> WaitForMultipleEvents(handles=[last_signaled_event_index + 1],
> timeout=0) until it reports timeout.

That would make sense, and should indeed be reasonable cost-wise.


> I mention this now because I'm not sure whether to consider this an
> 'open item' for 16, or merely an enhancement for 17.  I guess the
> former, because someone might call that a new denial of service
> vector.  On the other hand, if you fill up the listen queue for socket
> 1 with enough vigour, you're also denying service to socket 1, so I
> don't know if it's worth worrying about.  Opinions on that?

I'm not sure either. It doesn't strike me as a particularly relevant
bottleneck. And the old approach of doing more work for every single
connection also made many connections worse, I think?

> diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> index f4123e7de7..cc7b572008 100644
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>*/
>   cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
>  
> +loop:
> +

As far as I can tell, we'll never see WL_LATCH_SET or WL_POSTMASTER_DEATH. I
think it'd look cleaner to move the body of if (cur_event->events &
WL_SOCKET_MASK) into a separate function that we then also can call further
down.


>   occurred_events->pos = cur_event->pos;
>   occurred_events->user_data = cur_event->user_data;
>   occurred_events->events = 0;
> @@ -2044,6 +2046,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>   occurred_events->events = WL_LATCH_SET;
>   occurred_events++;
>   returned_events++;
> + nevents--;
>   }
>   }
>   else if (cur_event->events == WL_POSTMASTER_DEATH)
> @@ -2063,6 +2066,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>   occurred_events->events = WL_POSTMASTER_DEATH;
>   occurred_events++;
>   returned_events++;
> + nevents--;
>   }
>   }
>   else if (cur_event->events & WL_SOCKET_MASK)
> @@ -2124,6 +2128,36 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
>   {
>   occurred_events++;
>   returned_events++;
> + nevents--;
> + }
> + }

Seems like we could use returned_events to get nevents in the way you want it,
without adding even more ++/-- to each of the different events?

Greetings,

Andres Freund




RE: Non-superuser subscription owners

2023-03-31 Thread houzj.f...@fujitsu.com
On Saturday, April 1, 2023 4:00 AM Robert Haas 

Hi,

> 
> On Thu, Mar 30, 2023 at 9:49 PM houzj.f...@fujitsu.com
>  wrote:
> > It looks like the superuser check is outside a transaction. I haven't
> > checked why it only failed on one BF animal, but it seems we can put
> > the check inside the transaction like the following:
> 
> That looks like a reasonable fix, but I can't reproduce the problem locally.
> I thought the reason that machine sees the problem might be that it uses
> -DRELCACHE_FORCE_RELEASE, but I tried that option here and the tests still
> pass.  Any ideas how to reproduce?

I think it's a timing problem: the superuser_arg() function caches the
roleid that was passed in last time, so it might not search the syscache and
hit the Assert() check each time. In the regression test, the roleid cache
happened to be invalidated before the superuser_arg() call by some concurrent
ROLE change (maybe in subscription.sql and publication.sql).

I can reproduce it by using gdb and starting another session to change the ROLE.

When the apply worker starts, use gdb to block the apply worker in the
transaction before the superuser check. Then start another session and ALTER
ROLE to invalidate the roleid cache in superuser_arg(), which will cause the
apply worker to search the syscache and hit the Assert().

--
origin_startpos = replorigin_session_get_progress(false);
B*  CommitTransactionCommand();

/* Is the use of a password mandatory? */
must_use_password = MySubscription->passwordrequired &&
! superuser_arg(MySubscription->owner);
--
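
In other words, a fix along the lines proposed above just keeps the check
inside the transaction, so that superuser_arg()'s syscache lookup is always
legal (a sketch of the reordering; surrounding worker code abbreviated):

	StartTransactionCommand();
	/* ... */
	origin_startpos = replorigin_session_get_progress(false);

	/*
	 * Is the use of a password mandatory?  Checked while still inside the
	 * transaction, so the syscache access in superuser_arg() is safe.
	 */
	must_use_password = MySubscription->passwordrequired &&
		!superuser_arg(MySubscription->owner);

	CommitTransactionCommand();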

Best Regards,
Hou zj


WL_SOCKET_ACCEPT fairness on Windows

2023-03-31 Thread Thomas Munro
Hi,

Commit 7389aad6 started using WaitEventSetWait() to wait for incoming
connections.  Before that, we used select(), for which we have our own
implementation for Windows.

While hacking on patches to rip a bunch of unused Win32 socket wrapper
code out, I twigged that the old pgwin32_select() code was careful to
report multiple sockets at once by brute force polling of all of them,
while WaitEventSetWait() doesn't do that: it reports just one event,
because that's what the Windows WaitForMultipleEvents() syscall does.
I guess this means you can probably fill up the listen queue of server
socket 1 to prevent server socket 2 from ever being serviced, whereas
on Unix we'll accept one connection at a time from each in round-robin
fashion.

I think we could get the same effect as pgwin32_select() more cheaply
by doing the initial WaitForMultipleEvents() call with the caller's
timeout exactly as we do today, and then, while we have space,
repeatedly calling
WaitForMultipleEvents(handles=[last_signaled_event_index + 1],
timeout=0) until it reports timeout.  Windows always reports the
signaled event with the lowest index in the array, so the idea is to
poll the remaining part of the array without waiting, to check for any
more.  In the most common case of FEBE socket waits etc there would be
no extra system call (because it uses nevents=1, so no space for
more), and in the postmaster's main loop there would commonly be only
one extra system call to determine that no other events have been
signaled.

The attached patch shows the idea.  It's using an ugly goto, but I
guess it should be made decent with a real loop; I just didn't want to
change the indentation level for this POC.
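
In outline, the polling idea looks something like this (a sketch, not the
attached patch; names are illustrative):

#include <windows.h>

/*
 * Collect up to max_ready ready handles.  The first wait uses the caller's
 * timeout; Windows reports the lowest-index signaled handle, so afterwards
 * we poll only the tail of the array with timeout=0 until nothing more is
 * signaled or the output buffer is full.
 */
static int
collect_ready_events(HANDLE *handles, int nhandles, DWORD timeout,
					 int *ready, int max_ready)
{
	int			nready = 0;
	int			offset = 0;
	DWORD		rc = WaitForMultipleObjects(nhandles, handles, FALSE, timeout);

	while (rc != WAIT_TIMEOUT && rc != WAIT_FAILED && nready < max_ready)
	{
		int			idx = offset + (int) (rc - WAIT_OBJECT_0);

		ready[nready++] = idx;
		offset = idx + 1;
		if (offset >= nhandles || nready >= max_ready)
			break;
		/* Poll the remaining part of the array without blocking. */
		rc = WaitForMultipleObjects(nhandles - offset, handles + offset,
									FALSE, 0);
	}
	return nready;
}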

I mention this now because I'm not sure whether to consider this an
'open item' for 16, or merely an enhancement for 17.  I guess the
former, because someone might call that a new denial of service
vector.  On the other hand, if you fill up the listen queue for socket
1 with enough vigour, you're also denying service to socket 1, so I
don't know if it's worth worrying about.  Opinions on that?

I don't have Windows to test any of this.  Although this patch passes
on CI, that means nothing, as I don't expect the new code to be
reached, so I'm hoping to find someone who would be able to set up
such a test on Windows, i.e. putting an elog() into the new code path and
trying to reach it by connecting to two different ports fast enough to
exercise the multiple-event case.

Or maybe latch.c really needs its own test suite.
From 733fd375578d8216fed9a81d4a1a575b1f542ca9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 1 Apr 2023 12:36:12 +1300
Subject: [PATCH] Teach WaitEventSetWait to report multiple events on Windows.

The WaitEventSet implementation on Windows has always reported only one
event at a time, and always the "lowest" in its event array.  Since
commit 7389aad6 started using WaitEventSet to handle incoming socket
connections, this unfairness might potentially upset someone who wants
to handle incoming connections on multiple sockets.  If one of them has a
non-empty listen queue due to incoming connections, the other might
never be serviced.  The previous coding based on select() was fair in
that way.

Fix, by teaching WaitEventSetWait() to poll for extra events.  No change
in behavior in the common case of callers with nevents=1, but for the
postmaster's main loop, we'll drain all the events that can fit in the
output buffer, which is deliberately set large enough to handle the
maximum possible number of sockets.  This brings the Windows behavior in
line with Unix.
---
 src/backend/storage/ipc/latch.c | 34 +
 1 file changed, 34 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..cc7b572008 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 */
 	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
 
+loop:
+
 	occurred_events->pos = cur_event->pos;
 	occurred_events->user_data = cur_event->user_data;
 	occurred_events->events = 0;
@@ -2044,6 +2046,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			occurred_events->events = WL_LATCH_SET;
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
 	else if (cur_event->events == WL_POSTMASTER_DEATH)
@@ -2063,6 +2066,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			occurred_events->events = WL_POSTMASTER_DEATH;
 			occurred_events++;
 			returned_events++;
+			nevents--;
 		}
 	}
 	else if (cur_event->events & WL_SOCKET_MASK)
@@ -2124,6 +2128,36 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		{
 			occurred_events++;
 			returned_events++;
+			nevents--;
+		}
+	}
+
+	/*
+	 * Do we have space to report more events that might also be signaled later
+	 * in the array than cur_event?  Being able to return multiple 

Re: zstd compression for pg_dump

2023-03-31 Thread Justin Pryzby
On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
> On 4/1/23 01:16, Justin Pryzby wrote:
> > On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> >> On 3/27/23 19:28, Justin Pryzby wrote:
> >>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>  On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion 
> >>  wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> >
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it.  Could you check that what I've
> > added works for your case ?
> >
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't 
> >> feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> >
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> > doesn't apply to pg_dump under windows.  This is an open question.  If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
> 
>  I may be missing something, but why would the patch affect this? Why
>  would it even affect safety of the parallel dump? And I don't see any
>  changes to the clone stuff ...
> >>>
> >>> zstd supports using threads during compression, with -Z zstd:workers=N.
> >>> When unix forks, the child processes can't do anything to mess up the
> >>> state of the parent processes.  
> >>>
> >>> But windows pg_dump uses threads instead of forking, so it seems
> >>> possible that the pg_dump -j threads that then spawn zstd threads could
> >>> "leak threads" and break the main thread.  I suspect there's no issue,
> >>> but we still ought to verify that before declaring it safe.
> >>
> >> OK. I don't have access to a Windows machine so I can't test that. Is it
> >> possible to disable the zstd threading, until we figure this out?
> > 
> > I think that's what's best.  I made it issue a warning if "workers" was
> > specified.  It could also be an error, or just ignored.
> > 
> > I considered disabling workers only for windows, but realized that I
> > haven't tested with threads myself - my local zstd package is compiled
> > without threading, and I remember having some issue recompiling it with
> > threading.  Jacob's recipe for using meson wraps works well, but it
> > still seems better to leave it as a future feature.  I used that recipe
> > to enable zstd with threading on CI (except for linux/autoconf).
> 
> +1 to disable this if we're unsure it works correctly. I agree it's
> better to just error out if workers are requested - I rather dislike
> when a tool just ignores an explicit parameter. And AFAICS it's what
> zstd does too, when someone requests workers on an incompatible build.
> 
> FWIW I've been thinking about this a bit more and I don't quite see why
> the threading would cause issues (except for Windows). I forgot
> pg_basebackup already supports zstd, including the worker threading, so
> why would it work there and not in pg_dump? Sure, pg_basebackup is not
> parallel, but with separate pg_dump processes that shouldn't be an issue
> (although I'm not sure when zstd creates threads).

There's no concern at all except under windows (because on windows
pg_dump -j is implemented using threads rather than forking).
Especially since zstd:workers is already allowed in the basebackup
backend process.
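
For context, zstd's multithreading is a per-context knob in its public API;
"workers=N" boils down to roughly this (a sketch; the function name, nworkers
and the message wording are illustrative):

#include <stdio.h>
#include <zstd.h>

static ZSTD_CCtx *
create_cctx_with_workers(int nworkers)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();

	/*
	 * On a libzstd built without ZSTD_MULTITHREAD this returns an error
	 * code, which is one way a build can reject the workers option.
	 */
	size_t		ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);

	if (ZSTD_isError(ret))
		fprintf(stderr, "could not set zstd workers: %s\n",
				ZSTD_getErrorName(ret));
	return cctx;
}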

> I'll try building zstd with threading enabled, and do some tests over
> the weekend.

Feel free to wait until v17 :)

I used "meson wraps" to get a local version with threading.  Note that
if you want to use a zstd subproject, you may have to specify -D
zstd=enabled, or else meson may not enable the library at all.

Also, in order to introspect its settings, I had to do something like this:

mkdir subprojects
meson wrap install zstd
meson subprojects download
mkdir build.meson
meson setup -C build.meson 

Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-31 Thread Melanie Plageman
On Fri, Mar 31, 2023 at 8:05 PM David Rowley  wrote:
>
> On Sat, 1 Apr 2023 at 12:57, Melanie Plageman  
> wrote:
> > I've attached v7 with that commit dropped and with support for parallel
> > vacuum workers to use the same number of buffers in their own Buffer
> > Access Strategy ring as the main vacuum phase did. I also updated the
> > docs to indicate that vacuum_buffer_usage_limit is per backend (not per
> > instance of VACUUM).
>
> (was just replying about v6-0002 when this came in. Replying here instead)
>
> For v7-0001, can we just get rid of both of those static globals?  I'm
> gobsmacked by the existing "A few variables that don't seem worth
> passing around as parameters" comment.  Not wanting to pass parameters
> around is a horrible excuse for adding global variables, even static
> ones.

Makes sense to me.

> Attached is what I propose in .diff form so that the CFbot can run on
> your v7 patches without picking this up.

Your diff LGTM.

Earlier upthread in [1], Bharath mentioned in a review comment about
removing the global variables that he would have expected the analogous
global in analyze.c, vac_strategy, to also be removed (analyze.c also
has anl_context).

I looked into doing this, and this is what I found out (see full
rationale in [2]):

> it is a bit harder to remove it from analyze because acquire_func
> doesn't take the buffer access strategy as a parameter and
> acquire_sample_rows uses the vac_strategy global variable to pass to
> table_scan_analyze_next_block().
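
For reference, the obstacle is the fixed callback type (as declared in
commands/vacuum.h), which has no strategy parameter to thread through:

typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
									  HeapTuple *rows, int targrows,
									  double *totalrows,
									  double *totaldeadrows);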

I don't know if this is worth mentioning in the commit removing the
other globals? Maybe it will just make it more confusing...

> I considered if we could switch memory contexts before calling
> expand_vacuum_rel() and get_all_vacuum_rels(), but I see, at least in
> the case of expand_vacuum_rel() that we'd probably want to list_free()
> the output of find_all_inheritors() to save that from leaking into the
> vac_context.  It seems safe just to switch into the vac_context only
> when we really want to keep that memory around. (I do think switching
> in each iteration of the foreach(part_lc, part_oids) loop is
> excessive, however. Just not enough for me to want to change it)

Yes, I see what you mean. Your decision makes sense to me.

- Melanie

[1] 
https://www.postgresql.org/message-id/CALj2ACXKgAQpKsCPi6ox%2BK5JLDB9TAxeObyVOfrmgTjqmc0aAA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAAKRu_brtqmd4e7kwEeKjySP22y4ywF32M7pvpi%2Bx5txgF0%2Big%40mail.gmail.com




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-31 Thread Justin Pryzby
On Sat, Apr 01, 2023 at 01:05:19PM +1300, David Rowley wrote:
> Attached is what I propose in .diff form so that the CFbot can run on
> your v7 patches without picking this up.

But it processes .diff, too

https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F




Re: zstd compression for pg_dump

2023-03-31 Thread Tomas Vondra
On 4/1/23 01:16, Justin Pryzby wrote:
> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>> On 3/27/23 19:28, Justin Pryzby wrote:
>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
 On 3/16/23 05:50, Justin Pryzby wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
>> wrote:
>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>> build against it, I had to construct an import library, and put that
>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>> which makes me wonder if I've chosen a harder way than necessary?
>>
>> It looks like pg_dump's meson.build is missing dependencies on zstd
>> (meson couldn't find the headers in the subproject without them).
>
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?
>
>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>> pg_restore output is identical to uncompressed dumps and nothing
>>> explodes. I haven't inspected the threading implementation for safety
>>> yet, as you mentioned.
>>
>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>> handling safety for this, by isolating each thread's state. I don't feel
>> comfortable pronouncing this new addition safe or not, because I'm not
>> sure I understand what the comments in the format-specific _Clone()
>> callbacks are saying yet.
>
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an open question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).

 I may be missing something, but why would the patch affect this? Why
 would it even affect safety of the parallel dump? And I don't see any
 changes to the clone stuff ...
>>>
>>> zstd supports using threads during compression, with -Z zstd:workers=N.
>>> When unix forks, the child processes can't do anything to mess up the
>>> state of the parent processes.  
>>>
>>> But windows pg_dump uses threads instead of forking, so it seems
>>> possible that the pg_dump -j threads that then spawn zstd threads could
>>> "leak threads" and break the main thread.  I suspect there's no issue,
>>> but we still ought to verify that before declaring it safe.
>>
>> OK. I don't have access to a Windows machine so I can't test that. Is it
>> possible to disable the zstd threading, until we figure this out?
> 
> I think that's what's best.  I made it issue a warning if "workers" was
> specified.  It could also be an error, or just ignored.
> 
> I considered disabling workers only for windows, but realized that I
> haven't tested with threads myself - my local zstd package is compiled
> without threading, and I remember having some issue recompiling it with
> threading.  Jacob's recipe for using meson wraps works well, but it
> still seems better to leave it as a future feature.  I used that recipe
> to enable zstd with threading on CI (except for linux/autoconf).
> 

+1 to disable this if we're unsure it works correctly. I agree it's
better to just error out if workers are requested - I rather dislike
when a tool just ignores an explicit parameter. And AFAICS it's what
zstd does too, when someone requests workers on an incompatible build.

FWIW I've been thinking about this a bit more and I don't quite see why
the threading would cause issues (except for Windows). I forgot
pg_basebackup already supports zstd, including the worker threading, so
why would it work there and not in pg_dump? Sure, pg_basebackup is not
parallel, but with separate pg_dump processes that shouldn't be an issue
(although I'm not sure when zstd creates threads).

The one thing I'm wondering about is at which point are the worker
threads spawned - but presumably not before the pg_dump processes fork.

I'll try building zstd with threading enabled, and do some tests over
the weekend.

> The function is first checking if it was passed a filename which already
> has a suffix.  And if not, it searches through a list of suffixes,
> testing for an existing file with each suffix.  The search with stat()
> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> hasSuffix() branch isn't dead code.  Another open question.

 AFAICS it's done this way because of this comment in pg_backup_directory

  * ...
  * ".gz" suffix is added to the filenames. The TOC files are never
  * compressed by pg_dump, however they are accepted with the .gz suffix
  * too, in case the user has manually compressed them with 'gzip'.

 I 

Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-31 Thread David Rowley
On Sat, 1 Apr 2023 at 12:57, Melanie Plageman  wrote:
> I've attached v7 with that commit dropped and with support for parallel
> vacuum workers to use the same number of buffers in their own Buffer
> Access Strategy ring as the main vacuum phase did. I also updated the
> docs to indicate that vacuum_buffer_usage_limit is per backend (not per
> instance of VACUUM).

(was just replying about v6-0002 when this came in. Replying here instead)

For v7-0001, can we just get rid of both of those static globals?  I'm
gobsmacked by the existing "A few variables that don't seem worth
passing around as parameters" comment.  Not wanting to pass parameters
around is a horrible excuse for adding global variables, even static
ones.

Attached is what I propose in .diff form so that the CFbot can run on
your v7 patches without picking this up.

I considered if we could switch memory contexts before calling
expand_vacuum_rel() and get_all_vacuum_rels(), but I see, at least in
the case of expand_vacuum_rel() that we'd probably want to list_free()
the output of find_all_inheritors() to save that from leaking into the
vac_context.  It seems safe just to switch into the vac_context only
when we really want to keep that memory around. (I do think switching
in each iteration of the foreach(part_lc, part_oids) loop is
excessive, however. Just not enough for me to want to change it)
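
In other words, the pattern being described in expand_vacuum_rel() is
roughly the following (a sketch with details elided; names approximate
vacuum.c, and oldcontext/vacrels declarations are omitted):

	/* transient list, built in the caller's (short-lived) context */
	List	   *part_oids = find_all_inheritors(relid, NoLock, NULL);

	foreach(part_lc, part_oids)
	{
		Oid			part_oid = lfirst_oid(part_lc);

		/* switch in only for the allocation we intend to keep */
		oldcontext = MemoryContextSwitchTo(vac_context);
		vacrels = lappend(vacrels, makeVacuumRelation(NULL, part_oid,
													  vrel->va_cols));
		MemoryContextSwitchTo(oldcontext);
	}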

David


get_rid_of_a_few_globals_from_vacuum.c.diff
Description: Binary data


Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-31 Thread Melanie Plageman
On Fri, Mar 31, 2023 at 5:47 PM David Rowley  wrote:
>
> On Sat, 1 Apr 2023 at 02:52, Melanie Plageman  
> wrote:
> > There was one small typo keeping this from compiling. Also a repeated
> > word. I've fixed these. I also edited a bit of indentation and tweaked
> > some wording. Diff attached (to be applied on top of your diff).
>
> Thanks for fixing that mistake.
>
> For reference, I had changed things to end lines early so that the
> glossterm tags could be on a line of their own without breaking to a
> new line. The rest of the file seems to be done that way, so I thought
> we'd better stick to it.
>
> I swapped out "associated WAL" for "unflushed WAL". I didn't agree
> that the WAL that would be flushed would have any particular
> association with the to-be-written page.
>
> I dropped CTAS since I didn't see any other mention in the docs about
> that. I could maybe see the sense in making reference to the
> abbreviated form if we were going to mention it again and didn't want
> to spell the whole thing out each time, but that's not the case here.
>
> I pushed the result.

Cool!

I've attached v7 with that commit dropped and with support for parallel
vacuum workers to use the same number of buffers in their own Buffer
Access Strategy ring as the main vacuum phase did. I also updated the
docs to indicate that vacuum_buffer_usage_limit is per backend (not per
instance of VACUUM).

- Melanie
From 9357d88d14c1ba0a0cffc4bd734e8f3ec0e2dd9f Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 22 Feb 2023 12:06:41 -0500
Subject: [PATCH v7 1/5] remove global variable vac_strategy

---
 src/backend/commands/vacuum.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c54360a6a0..a6aac30529 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -75,7 +75,6 @@ int			vacuum_multixact_failsafe_age;
 
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext vac_context = NULL;
-static BufferAccessStrategy vac_strategy;
 
 
 /*
@@ -94,7 +93,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
 			  TransactionId lastSaneFrozenXid,
 			  MultiXactId lastSaneMinMulti);
 static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
-	   bool skip_privs);
+	   bool skip_privs, BufferAccessStrategy bstrategy);
 static double compute_parallel_delay(void);
 static VacOptValue get_vacoptval_from_boolean(DefElem *def);
 static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -338,7 +337,7 @@ vacuum(List *relations, VacuumParams *params,
 		in_outer_xact = IsInTransactionBlock(isTopLevel);
 
 	/*
-	 * Due to static variables vac_context, anl_context and vac_strategy,
+	 * Due to static variable vac_context,
 	 * vacuum() is not reentrant.  This matters when VACUUM FULL or ANALYZE
 	 * calls a hostile index expression that itself calls ANALYZE.
 	 */
@@ -404,7 +403,6 @@ vacuum(List *relations, VacuumParams *params,
 		bstrategy = GetAccessStrategy(BAS_VACUUM);
 		MemoryContextSwitchTo(old_context);
 	}
-	vac_strategy = bstrategy;
 
 	/*
 	 * Build list of relation(s) to process, putting any new data in
@@ -509,7 +507,7 @@ vacuum(List *relations, VacuumParams *params,
 
 			if (params->options & VACOPT_VACUUM)
 			{
-if (!vacuum_rel(vrel->oid, vrel->relation, params, false))
+if (!vacuum_rel(vrel->oid, vrel->relation, params, false, bstrategy))
 	continue;
 			}
 
@@ -527,7 +525,7 @@ vacuum(List *relations, VacuumParams *params,
 }
 
 analyze_rel(vrel->oid, vrel->relation, params,
-			vrel->va_cols, in_outer_xact, vac_strategy);
+			vrel->va_cols, in_outer_xact, bstrategy);
 
 if (use_own_xacts)
 {
@@ -1838,7 +1836,7 @@ vac_truncate_clog(TransactionId frozenXID,
  *		At entry and exit, we are not inside a transaction.
  */
 static bool
-vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs)
+vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs, BufferAccessStrategy bstrategy)
 {
 	LOCKMODE	lmode;
 	Relation	rel;
@@ -2084,7 +2082,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs)
			cluster_rel(relid, InvalidOid, &cluster_params);
 		}
 		else
-			table_relation_vacuum(rel, params, vac_strategy);
+			table_relation_vacuum(rel, params, bstrategy);
 	}
 
 	/* Roll back any GUC changes executed by index functions */
@@ -2118,7 +2116,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs)
		memcpy(&toast_vacuum_params, params, sizeof(VacuumParams));
 		toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
 
-		vacuum_rel(toast_relid, NULL, &toast_vacuum_params, true);
+		vacuum_rel(toast_relid, NULL, &toast_vacuum_params, true, bstrategy);
 	}
 
 	/*
-- 
2.37.2

From ffa2524260aa3c05b6ee0c272980dcfecb108646 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 22 Feb 2023 12:26:01 -0500

Re: RFC: logical publication via inheritance root?

2023-03-31 Thread Jacob Champion
On Fri, Mar 31, 2023 at 3:17 PM Peter Smith  wrote:
> OK, my understanding is that TimescaleDB uses some kind of
> quasi-partitioned/inherited tables (aka hypertables? [1]) internally,
> and your proposed WIP patch provides a set_logical_root() function
> which combines with the logical replication (LR) PUBLICATION option
> "publish_via_partition_root" to help to replicate those.

Correct!

> You also mentioned pg_partman. IIUC pg_partman is a partitioning
> extension [2] that pre-dated the native PostgreSQL partitioning
> introduced in PG10 (i.e. quite a while ago). I guess it would be a
> very niche group of users that are still using pg_partman old-style
> (pre-PG10) partitions and want to migrate them but have not already
> done so.

Yeah. I've got no evidence either way, unfortunately -- on the one
hand, surely people have been able to upgrade by now? And on the
other, implementation inertia seems to override most other engineering
goals...

Probably best to ask the partman users, and not me. :D Or assume it's
a non-benefit unless someone says otherwise (even then, the partman
maintainers would need to agree it's useful and add support for this).

> Outside the scope of special TimescaleDB tables and the speculated
> pg_partman old-style table migration, will this proposed new feature
> have any other application? In other words, do you know if this
> proposal will be of any benefit to the *normal* users who just have
> native PostgreSQL inherited tables they want to replicate?

I think it comes down to why an inheritance scheme was used. If it's
because you want to group rows into named, queryable subsets (e.g. the
"cities/capitals" example in the docs [1]), I don't think this has any
utility, because I assume you'd want to replicate your subsets as-is.

But if it's because you've implemented a partitioning scheme of your
own (the docs still list reasons you might want to [2], even today),
and all you ever really do is interact with the root table, I think
this feature will give you some of the same benefits that
publish_via_partition_root gives native partition users. We're very
much in that boat, but I don't know how many others are.

Thanks!
--Jacob

[1] https://www.postgresql.org/docs/15/tutorial-inheritance.html
[2] 
https://www.postgresql.org/docs/15/ddl-partitioning.html#DDL-PARTITIONING-USING-INHERITANCE




Re: Add LZ4 compression in pg_dump

2023-03-31 Thread Tomas Vondra
On 3/31/23 11:19, gkokola...@pm.me wrote:
> 
>> ...
>>
>>
>> I think the patch is fine, but I'm wondering if the renames shouldn't go
>> a bit further. It removes references to the LZ4File struct, but there's a
>> bunch of functions with the LZ4File_ prefix. Why not simply use the LZ4_
>> prefix? We don't have GzipFile either.
>>
>> Sure, it might be a bit confusing because lz4.h uses LZ4_ prefix, but
>> then we probably should not define LZ4_compressor_init ...
> 
> This is a good point. The initial thought was that since lz4.h is now
> removed, such ambiguity would not be present. In v2 of the patch the
> function is renamed to `LZ4State_compression_init` since this name
> better describes its purpose. It initializes the LZ4State for
> compression.
> 
> As for the LZ4File_ prefix, I have no objections. Please find the
> prefix changed to LZ4Stream_. For the record, the word 'File' is not
> unique to the lz4 implementation. The common data structure used by
> the API in compress_io.h is:
> 
>typedef struct CompressFileHandle CompressFileHandle; 
> 
> The public functions for this API are named:
> 
>   InitCompressFileHandle
>   InitDiscoverCompressFileHandle
>   EndCompressFileHandle
> 
> And within InitCompressFileHandle the pattern is:
> 
> if (compression_spec.algorithm == PG_COMPRESSION_NONE)
> InitCompressFileHandleNone(CFH, compression_spec);
> else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
> InitCompressFileHandleGzip(CFH, compression_spec);
> else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
> InitCompressFileHandleLZ4(CFH, compression_spec);
> 
> It was felt that a prefix was required due to the inclusion of the 'lz4.h'
> header, where naming functions with an 'LZ4_' prefix would be wrong. The
> prefix 'LZ4File_' seemed to be in line with the naming of the rest of
> the relevant functions and structures. The other compression methods, gzip
> and none, did not face the same issue.
> 
> To conclude, I think that having a prefix is slightly preferred
> over not having one. Since the prefix `LZ4File_` is not desired,
> I propose `LZ4Stream_` in v2.
> 
> I will not object to dismissing the argument and dropping `File` from
> the prefix, if so requested.
> 

Thanks.

I think the LZ4Stream prefix is reasonable, so let's roll with that. I
cleaned up the patch a little bit (mostly comment tweaks, etc.), updated
the commit message and pushed it.

The main tweak I made was renaming all the LZ4State variables from "fs" to
"state". The old name referred to the now-abandoned "file state", but
after the rename to LZ4State that seems confusing. Some of the places
already used "state", and it's easier to know that "state" is always an
LZ4State, so let's keep it consistent.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-03-31 Thread Justin Pryzby
On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>  On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
>  wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
>  It looks like pg_dump's meson.build is missing dependencies on zstd
>  (meson couldn't find the headers in the subproject without them).
> >>>
> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>> I didn't run into an issue without it.  Could you check that what I've
> >>> added works for your case ?
> >>>
> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
> 
>  Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>  handling safety for this, by isolating each thread's state. I don't feel
>  comfortable pronouncing this new addition safe or not, because I'm not
>  sure I understand what the comments in the format-specific _Clone()
>  callbacks are saying yet.
> >>>
> >>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>> doesn't apply to pg_dump under windows.  This is an open question.  If
> >>> there's no solid answer, I could disable/ignore the option (maybe only
> >>> under windows).
> >>
> >> I may be missing something, but why would the patch affect this? Why
> >> would it even affect safety of the parallel dump? And I don't see any
> >> changes to the clone stuff ...
> > 
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.  
> > 
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread.  I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
> 
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

I think that's what's best.  I made it issue a warning if "workers" was
specified.  It could also be an error, or just ignored.

I considered disabling workers only for windows, but realized that I
haven't tested with threads myself - my local zstd package is compiled
without threading, and I remember having some issue recompiling it with
threading.  Jacob's recipe for using meson wraps works well, but it
still seems better to leave it as a future feature.  I used that recipe
to enable zstd with threading on CI (except for linux/autoconf).

> >>> The function is first checking if it was passed a filename which already
> >>> has a suffix.  And if not, it searches through a list of suffixes,
> >>> testing for an existing file with each suffix.  The search with stat()
> >>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> >>> hasSuffix() branch isn't dead code.  Another open question.
> >>
> >> AFAICS it's done this way because of this comment in pg_backup_directory
> >>
> >>  * ...
> >>  * ".gz" suffix is added to the filenames. The TOC files are never
> >>  * compressed by pg_dump, however they are accepted with the .gz suffix
> >>  * too, in case the user has manually compressed them with 'gzip'.
> >>
> >> I haven't tried, but I believe that if you manually compress the
> >> directory, it may hit this branch.
> > 
> > That would make sense, but when I tried, it didn't work like that.
> > The filenames were all uncompressed names.  Maybe it worked differently
> > in an older release.  Or maybe it changed during development of the
> > parallel-directory-dump patch and it's actually dead code.
> 
> Interesting. Would be good to find out. I wonder if a little bit of
> git-log digging could tell us more. Anyway, until we confirm it's dead
> code, we should probably do what .gz does and have the same check for
> .lz4 and .zst files.

I found that hasSuffix() and cfopen() originated in the refactored patch
Heikki sent here; there's no history beyond that.

https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com

The patch published there appends the .gz within cfopen(), and the
caller writes into the TOC the filename without ".gz".  It seems like

Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Andres Freund
Hi,

On 2023-03-31 17:00:00 +0300, Alexander Lakhin wrote:
> 31.03.2023 15:55, Tom Lane wrote:
> > See also the thread about bug #16329 [1].  Alexander promised to look
> > into improving the test coverage in this area, maybe he can keep an
> > eye on the WAL logic coverage too.
> 
> Yes, I'm going to analyze that area too. Maybe it'll take more time
> (a week or two) if I encounter some bugs there (for now I observe anomalies
> with gist__int_ops), but I will definitely try to improve the gist testing.

Because I needed it to verify the changes in the referenced patch, I wrote
tests exercising killtuples based pruning for gist and hash.

For the former I first wrote it in contrib/btree_gist. But that would mean the
recovery side isn't exercised via 027_stream_regress.pl. So I rewrote it to
use point instead, which is a tad more awkward, but...

For now I left the new tests in their own files. But possibly they should be
in gist.sql and hash_index.sql respectively?


As I also wrote at the top of the tests, we can't easily verify that
killtuples pruning has actually happened, nor guarantee that concurrent
activity doesn't prevent it (e.g. autovacuum having a snapshot open or
such). At least not without losing coverage of WAL logging / replay. To make
it more likely I added them as their own test group.


I don't know if we want the tests in this form, but I do find them useful for
now.


Greetings,

Andres Freund
>From 00a8f347fb5a57f3cafe082cdd9c3ea8f6a62053 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 31 Mar 2023 14:58:50 -0700
Subject: [PATCH v1] WIP: Add test exercising hash and gist killtuples

---
 src/test/regress/expected/gist_killtuples.out | 78 
 .../expected/hash_index_killtuples.out| 89 +++
 src/test/regress/parallel_schedule|  4 +
 src/test/regress/sql/gist_killtuples.sql  | 49 ++
 .../regress/sql/hash_index_killtuples.sql | 57 
 5 files changed, 277 insertions(+)
 create mode 100644 src/test/regress/expected/gist_killtuples.out
 create mode 100644 src/test/regress/expected/hash_index_killtuples.out
 create mode 100644 src/test/regress/sql/gist_killtuples.sql
 create mode 100644 src/test/regress/sql/hash_index_killtuples.sql

diff --git a/src/test/regress/expected/gist_killtuples.out b/src/test/regress/expected/gist_killtuples.out
new file mode 100644
index 000..18d39c871a1
--- /dev/null
+++ b/src/test/regress/expected/gist_killtuples.out
@@ -0,0 +1,78 @@
+-- Basic test of gist's killtuples / pruning during split implementation. This
+-- is not guaranteed to reach those paths, concurrent activity could prevent
+-- cleanup of dead tuples or such. But it's likely to reach them, which seems
+-- better than nothing.
+set enable_seqscan=off;
+set enable_bitmapscan=off;
+CREATE TABLE test_gist_killtuples (
+	k point,
+	v int DEFAULT 0
+);
+CREATE INDEX ON test_gist_killtuples USING gist(k);
+-- create index entries pointing to deleted tuples
+INSERT INTO test_gist_killtuples(k, v) SELECT point(generate_series(1, 1000), 0), 0;
+-- via deletes
+DELETE FROM test_gist_killtuples WHERE k << point(500, 0);
+-- and updates
+UPDATE test_gist_killtuples SET k = k + point(1, 0), v = 1 WHERE k >> point(500, 0);
+-- make the index see tuples are dead via killtuples
+PREPARE engage_killtuples AS
+SELECT sum(k <-> point(0, 0)) FROM test_gist_killtuples WHERE k >> point(0, 0);
+EXPLAIN (COSTS OFF) EXECUTE engage_killtuples;
+   QUERY PLAN   
+
+ Aggregate
+   ->  Index Only Scan using test_gist_killtuples_k_idx on test_gist_killtuples
+ Index Cond: (k >> '(0,0)'::point)
+(3 rows)
+
+EXECUTE engage_killtuples;
+  sum   
+
+ 376250
+(1 row)
+
+-- create new index entries, in the same ranges, this should see the dead entries and prune the pages
+INSERT INTO test_gist_killtuples(k, v) SELECT point(generate_series(1, 300), 0), 2;
+UPDATE test_gist_killtuples SET k = k + point(1, 0), v = 3 WHERE k >> point(600, 0);
+-- do some basic cross checking of index scans vs sequential scan
+-- use prepared statement, so we can explain and execute without repeating the statement
+PREPARE check_query AS SELECT tgk.v, min(tgk.k <-> point(0, 0)), max(tgk.k <-> point(0, 0)), count(*),
+  brute = ROW(tgk.v, min(tgk.k <-> point(0, 0)), max(tgk.k <-> point(0, 0)), count(*)) AS are_equal
+FROM (
+SELECT v, min(k <-> point(0, 0)), max(k <-> point(0, 0)), count(*)
+FROM test_gist_killtuples GROUP BY v ORDER BY v
+  ) brute,
+  test_gist_killtuples tgk
+WHERE tgk.k >> point(brute.min - 1, 0) AND tgk.k << point(brute.max + 1, 0)
+GROUP BY brute, tgk.v
+ORDER BY tgk.v;
+EXPLAIN (COSTS OFF) EXECUTE check_query;
+   QUERY PLAN

Re: Data is copied twice when specifying both child and parent table in publication

2023-03-31 Thread Jacob Champion
On 3/31/23 03:04, shiy.f...@fujitsu.com wrote:
> I noticed that a similar problem has been discussed in this thread, see [1]
> [2] [3] [4].

Ah, thank you. I didn't go back far enough in the thread...

> It seems complicated to fix it if we want to automatically skip tables
> that have been synchronized previously by code

I agree, this is looking very complex. I need to read through the
examples you sent more closely.

> and this may be overkill in some
> cases (e.g. the target table on the subscriber is not a partitioned table,
> and the user wants to synchronize all data in the partitioned table from the
> publisher).

Hm. It seems like the setup process doesn't really capture the user's
intent. There are just so many things that they could be theoretically
trying to do.

> Besides, it seems not a common case. So I'm not sure we should fix it. Maybe
> we can just add some documentation for it as Peter mentioned.

I think we should absolutely document the pitfalls here. (I'm still
trying to figure out what they are, though, so I don't have any concrete
suggestions yet...)

Thanks!
--Jacob




Re: Data is copied twice when specifying both child and parent table in publication

2023-03-31 Thread Jacob Champion
On 3/30/23 20:01, Peter Smith wrote:
> For example, Just imagine if logic could be made smarter to recognize
> that since there was already the 'part_def' being subscribed so it
> should NOT use the default 'copy_data=true' when the REFRESH launches
> the ancestor table 'part'...
> 
> Even if that logic was implemented, I have a feeling you could *still*
> run into problems if the 'part' table was made of multiple partitions.
> I think you might get to a situation where you DO want some partition
> data copied (because you did not have it yet but now you are
> subscribing to the root you want it) while at the same time, you DON'T
> want to get duplicated data from other partitions (because you already
> knew about those ones  -- like your example does).

Hm, okay. My interest here is mainly because my logical-roots proposal
generalizes the problem (and therefore makes it worse).

For what it's worth, that patchset introduces the ability for the
subscriber to sync multiple tables into one. I wonder if that could be
used somehow to help fix this problem too?

> At least, we need to check there are sufficient "BE CAREFUL" warnings
> in the documentation for scenarios like this.

Agreed. These are sharp edges.

Thanks,
--Jacob




Re: running logical replication as the subscription owner

2023-03-31 Thread Jeff Davis
On Fri, 2023-03-31 at 15:17 -0400, Robert Haas wrote:
> I think that's too Boolean. The special case in 0001 is a better
> solution for the cases where it works. It's both more granular and
> more convenient.

I guess the "more convenient" is where I'm confused, because the "grant
subscription_owner to table owner with set role true" is not likely to
be conveniently already present; it would need to be issued manually to
take advantage of this special case.

Do you have any concern about the weirdness where assigning the
subscription to a higher-privilege user Z would cause B's trigger to
fail?

Regards,
Jeff Davis





Re: RFC: logical publication via inheritance root?

2023-03-31 Thread Peter Smith
On Wed, Mar 1, 2023 at 9:47 AM Jacob Champion  wrote:
>
> Hi,
>
> I'm going to register this in CF for feedback.
>
> Summary for potential reviewers: we don't use declarative partitions in
> the Timescale partitioning scheme, but it'd be really nice to be able to
> replicate between our tables and standard tables, or between two
> Timescale-partitioned tables with different layouts. This patch lets
> extensions (or savvy users) upgrade an existing inheritance relationship
> between two tables into a "logical partition" relationship, so that they
> can be handled with the publish_via_partition_root machinery.
>
> I hope this might also help pg_partman users migrate between old- and
> new-style partition schemes, but that's speculation.
>

OK, my understanding is that TimescaleDB uses some kind of
quasi-partitioned/inherited tables (aka hypertables? [1]) internally,
and your proposed WIP patch provides a set_logical_root() function
which combines with the logical replication (LR) PUBLICATION option
"publish_via_partition_root" to help to replicate those.

You also mentioned pg_partman. IIUC pg_partman is a partitioning
extension [2] that pre-dated the native PostgreSQL partitioning
introduced in PG10 (i.e. quite a while ago). I guess it would be a
very niche group of users that are still using pg_partman old-style
(pre-PG10) partitions and want to migrate them but have not already
done so. Also, the pg_partman README [3] says since v4.0.0 there is
extensive support for native PostgreSQL partitions, so perhaps
existing LR already works for those.

Outside the scope of special TimescaleDB tables and the speculated
pg_partman old-style table migration, will this proposed new feature
have any other application? In other words, do you know if this
proposal will be of any benefit to the *normal* users who just have
native PostgreSQL inherited tables they want to replicate? I haven’t
yet looked at the WIP patch TAP tests – so apologies for my question
if the benefits to normal users are self-evident from your test cases.

--
[1] 
https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/
[2] https://www.crunchydata.com/blog/native-partitioning-with-postgres
[3] https://github.com/pgpartman/pg_partman

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-31 Thread David Rowley
On Sat, 1 Apr 2023 at 02:52, Melanie Plageman  wrote:
> There was one small typo keeping this from compiling. Also a repeated
> word. I've fixed these. I also edited a bit of indentation and tweaked
> some wording. Diff attached (to be applied on top of your diff).

Thanks for fixing that mistake.

For reference, I had changed things to end lines early so that the
glossterm tags could be on a line of their own without breaking to a
new line. The rest of the file seems to be done that way, so I thought
we'd better stick to it.

I swapped out "associated WAL" for "unflushed WAL". I didn't agree
that the WAL that would be flushed would have any particular
association with the to-be-written page.

I dropped CTAS since I didn't see any other mention in the docs about
that. I could maybe see the sense in making reference to the
abbreviated form if we were going to mention it again and didn't want
to spell the whole thing out each time, but that's not the case here.

I pushed the result.

David




Re: Add pg_walinspect function with block info columns

2023-03-31 Thread Peter Geoghegan
On Thu, Mar 30, 2023 at 2:41 PM Peter Geoghegan  wrote:
> There is still an outstanding question around the overhead of
> outputting FPIs and even block data from pg_get_wal_block_info(). At
> one point Melanie suggested that we'd need to do something about that,
> and I tend to agree. Attached patch provides an optional parameter
> that will make pg_get_wal_block_info return NULLs for both block_data
> and block_fpi_data, no matter whether or not there is something to
> show. Note that this only affects those two bytea columns; we'll still
> show everything else, including valid block_data_length and
> block_fpi_length values (so the metadata describing the on-disk size
> of block_data and block_fpi_data is unaffected).

I pushed this patch just now. Except that the final committed version
had the "suppress_block_data" output parameter name flipped. It was
inverted and renamed to "show_data" (and made "DEFAULT true"). This is
closer to how the pg_stat_statements() function handles a similar
issue with overly large query texts.

I'm very happy with the end result of the work on this thread. It
works a lot better for the sorts of queries I am interested in. Thanks
to all involved, particularly Bharath.

-- 
Peter Geoghegan




Re: Making background psql nicer to use in tap tests

2023-03-31 Thread Daniel Gustafsson
The attached v4 fixes some incorrect documentation (added by me in v3), and
fixes background_psql() not honoring on_error_stop and extraparams passed
by the user.  I've also added a commit which implements the \password test from
the SCRAM iteration count patchset as well as cleaned up a few IPC::Run
includes from test scripts.

--
Daniel Gustafsson



v4-0002-Add-test-SCRAM-iteration-changes-with-psql-passwo.patch
Description: Binary data


v4-0001-Refactor-background-psql-TAP-functions.patch
Description: Binary data


Re: Non-superuser subscription owners

2023-03-31 Thread Robert Haas
On Thu, Mar 30, 2023 at 9:49 PM houzj.f...@fujitsu.com
 wrote:
> It looks like the super user check is out of a transaction, I haven't checked 
> why
> it only failed on one BF animal, but it seems we can put the check into the
> transaction like the following:

That looks like a reasonable fix but I can't reproduce the problem
locally. I thought the reason why that machine sees the problem might
be that it uses -DRELCACHE_FORCE_RELEASE, but I tried that option here
and the tests still pass. Any ideas how to reproduce?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2023-03-31 Thread Andres Freund
Hi,

On 2023-03-30 21:33:00 -0700, Andres Freund wrote:
> > diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
> > index 2ce9366277..93fb9d438a 100644
> > --- a/src/include/access/gistxlog.h
> > +++ b/src/include/access/gistxlog.h
> > @@ -51,11 +51,14 @@ typedef struct gistxlogDelete
> >  {
> > TransactionId snapshotConflictHorizon;
> > uint16  ntodelete;  /* number of deleted offsets */
> > +   boolisCatalogRel;   /* to handle recovery conflict during 
> > logical
> > +* decoding on 
> > standby */
> >
> > -   /* TODELETE OFFSET NUMBER ARRAY FOLLOWS */
> > +   /* TODELETE OFFSET NUMBERS */
> > +   OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> >  } gistxlogDelete;
> >
> > -#define SizeOfGistxlogDelete   (offsetof(gistxlogDelete, ntodelete) + 
> > sizeof(uint16))
> > +#define SizeOfGistxlogDelete   offsetof(gistxlogDelete, offsets)

> 
> I don't think the changes are quite sufficient:
> 
> for gist:
> 
> > @@ -672,11 +668,12 @@ gistXLogUpdate(Buffer buffer,
> >   */
> >  XLogRecPtr
> >  gistXLogDelete(Buffer buffer, OffsetNumber *todelete, int ntodelete,
> > -  TransactionId snapshotConflictHorizon)
> > +  TransactionId snapshotConflictHorizon, Relation 
> > heaprel)
> >  {
> > gistxlogDelete xlrec;
> > XLogRecPtr  recptr;
> >
> > +   xlrec.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
> > xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
> > xlrec.ntodelete = ntodelete;
> 
> Note that gistXLogDelete() continues to register data with two different
> XLogRegisterData() calls. This will append data without any padding:
> 
>   XLogRegisterData((char *) , SizeOfGistxlogDelete);
> 
>   /*
>* We need the target-offsets array whether or not we store the whole
>* buffer, to allow us to find the snapshotConflictHorizon on a standby
>* server.
>*/
>   XLogRegisterData((char *) todelete, ntodelete * sizeof(OffsetNumber));
> 
> 
> But replay now uses the new offset member:
> 
> > @@ -177,6 +177,7 @@ gistRedoDeleteRecord(XLogReaderState *record)
> > gistxlogDelete *xldata = (gistxlogDelete *) XLogRecGetData(record);
> > Buffer  buffer;
> > Pagepage;
> > +   OffsetNumber *toDelete = xldata->offsets;
> >
> > /*
> >  * If we have any conflict processing to do, it must happen before we
> 
> 
> That doesn't look right. If there's any padding before offsets, we'll afaict
> read completely bogus data?
> 
> As it turns out, there is padding:
> 
> struct gistxlogDelete {
> TransactionId  snapshotConflictHorizon; /* 0 4 */
> uint16 ntodelete;/* 4 2 */
> _Bool  isCatalogRel; /* 6 1 */
> 
> /* XXX 1 byte hole, try to pack */
> 
> OffsetNumber   offsets[];/* 8 0 */
> 
> /* size: 8, cachelines: 1, members: 4 */
> /* sum members: 7, holes: 1, sum holes: 1 */
> /* last cacheline: 8 bytes */
> };
> 
> 
> I am frankly baffled how this works at all, this should just about immediately
> crash?
> 
> 
> Oh, I see. We apparently don't reach the gist deletion code in the tests:
> https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
> https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174
> 
> And indeed, if I add an abort() into , it's not reached.
> 
> And it's not because tests use a temp table, the caller is also unreachable:
> https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643

After writing a minimal test to reach it, it turns out to actually work - I
missed that SizeOfGistxlogDelete now includes the padding, where commonly that
pattern tries to *exclude* trailing padding.  Sorry for the noise on this one
:(
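
To make the arithmetic concrete, here is a small standalone sketch. It uses
stdint stand-ins for the PostgreSQL typedefs so it compiles outside the tree;
treat those typedefs as assumptions, not the real headers:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* stand-ins for the PostgreSQL typedefs, for illustration only */
typedef uint32_t TransactionId;
typedef uint16_t OffsetNumber;

typedef struct gistxlogDelete
{
	TransactionId snapshotConflictHorizon;	/* offset 0, 4 bytes */
	uint16_t	ntodelete;		/* offset 4, 2 bytes */
	_Bool		isCatalogRel;	/* offset 6, 1 byte */
	/* one byte of padding here, so that offsets[] is 2-byte aligned */
	OffsetNumber offsets[];		/* offset 8 */
} gistxlogDelete;

int
main(void)
{
	/*
	 * The flexible-array definition includes the hole, so the offsets
	 * registered right after SizeOfGistxlogDelete bytes line up with
	 * xldata->offsets on replay.  The older pattern excludes both
	 * isCatalogRel and the padding byte.
	 */
	printf("offsetof(..., offsets) = %zu\n",
		   offsetof(gistxlogDelete, offsets));	/* prints 8 */
	printf("offsetof(..., ntodelete) + sizeof(uint16) = %zu\n",
		   offsetof(gistxlogDelete, ntodelete) + sizeof(uint16_t));	/* prints 6 */
	return 0;
}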

I am writing two minimal test cases to reach this code for hash and
gist. Not to commit as part of this, but to be able to verify that it
works. I'll post them in the separate thread I started about the lack of
regression test coverage in the area.

Greetings,

Andres Freund




Re: Track IO times in pg_stat_io

2023-03-31 Thread Melanie Plageman
Attached is a rebased version in light of 8aaa04b32d

- Melanie
From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v8 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.
---
 src/backend/storage/buffer/localbuf.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6f9e7eda57..ecccb6c1a9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -114,6 +114,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	LocalBufferLookupEnt *hresult;
 	BufferDesc *bufHdr;
 	int			b;
+	instr_time	io_start,
+io_time;
 	int			trycounter;
 	bool		found;
 	uint32		buf_state;
@@ -220,6 +222,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
   BufTagGetForkNum(>tag),
@@ -233,6 +240,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		/* Temporary table I/O does not use Buffer Access Strategies */
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		pgBufferUsage.local_blks_written++;
 	}
 
-- 
2.37.2

From 726ab546a11707baa167106d5b6266451dfae445 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 21 Mar 2023 20:36:10 -0400
Subject: [PATCH v8 4/4] pgstat_database uses pgstat_io time counters

Use pgstat_io's pending counters to increment pgStatBlockWriteTime and
pgStatBlockReadTime.
---
 src/backend/utils/activity/pgstat_io.c | 14 --
 src/include/pgstat.h   |  4 
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 4e98c4749a..905566decd 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -124,13 +124,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
@@ -181,15 +179,19 @@ pgstat_flush_io(bool nowait)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-instr_time	time;
+PgStat_Counter time;
 
 bktype_shstats->counts[io_object][io_context][io_op] +=
 	PendingIOStats.counts[io_object][io_context][io_op];
 
-time = PendingIOStats.pending_times[io_object][io_context][io_op];
+time = INSTR_TIME_GET_MICROSEC(PendingIOStats.pending_times[io_object][io_context][io_op]);
 
-bktype_shstats->times[io_object][io_context][io_op] +=
-	INSTR_TIME_GET_MICROSEC(time);
+bktype_shstats->times[io_object][io_context][io_op] += time;
+
+if (io_op == IOOP_WRITE)
+	pgStatBlockWriteTime += time;
+else if (io_op == IOOP_READ)
+	pgStatBlockReadTime += time;
 			}
 		}
 	}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bf54c6defe..833476a2bb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -544,10 +544,6 @@ extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
 extern void pgstat_report_connect(Oid dboid);
 
-#define pgstat_count_buffer_read_time(n)			\
-	(pgStatBlockReadTime += (n))
-#define pgstat_count_buffer_write_time(n)			\
-	(pgStatBlockWriteTime += (n))
 #define pgstat_count_conn_active_time(n)			\
 	(pgStatActiveTime += (n))
 #define pgstat_count_conn_txn_idle_time(n)			\
-- 
2.37.2

From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 31 Mar 2023 15:32:36 -0400
Subject: [PATCH v8 3/4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Andres Freund 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 

Re: Schema variables - new implementation for Postgres 15

2023-03-31 Thread Dmitry Dolgov
> On Tue, Mar 28, 2023 at 09:34:20PM +0200, Pavel Stehule wrote:
> Hi
>
> > Talking about documentation I've noticed that the implementation
> > contains few limitations, that are not mentioned in the docs. Examples
> > are WITH queries:
> >
> > WITH x AS (LET public.svar = 100) SELECT * FROM x;
> > ERROR:  LET not supported in WITH query
> >
>
>  The LET statement doesn't support the RETURNING clause, so using inside
> CTE does not make any sense.
>
> Do you have some tips, where this behaviour should be mentioned?

Yeah, you're right, it's probably not worth adding. I usually find it a
good idea to explicitly mention any limitations, but the WITH docs do
actually have a line about statements without the RETURNING clause, and
indeed for LET it makes even less sense.

> > and using with set-returning functions (haven't found any related tests).
> >
>
> There it is:
>
> +CREATE VARIABLE public.svar AS int;
> +-- should be ok
> +LET public.svar = generate_series(1, 1);
> +-- should fail
> +LET public.svar = generate_series(1, 2);
> +ERROR:  expression returned more than one row
> +LET public.svar = generate_series(1, 0);
> +ERROR:  expression returned no rows
> +DROP VARIABLE public.svar;

Oh, interesting. I was looking for another error message from
parse_func.c:

set-returning functions are not allowed in LET assignment expression

Is this one you've posted somehow different?

> > Another small note is about this change in the rowsecurity:
> >
> > /*
> > -* For SELECT, UPDATE and DELETE, add security quals to enforce
> > the USING
> > -* policies.  These security quals control access to existing
> > table rows.
> > -* Restrictive policies are combined together using AND, and
> > permissive
> > -* policies are combined together using OR.
> > +* For SELECT, LET, UPDATE and DELETE, add security quals to
> > enforce the
> > +* USING policies.  These security quals control access to
> > existing table
> > +* rows. Restrictive policies are combined together using AND, and
> > +* permissive policies are combined together using OR.
> >  */
> >
> > From this commentary one may think that LET command supports row level
> > security, but I don't see it being implemented. A wrong commentary?
> >
>
> I don't think so.  The row level security should be supported. I tested it
> on example from doc:
>
> [...]
>
> (2023-03-28 21:32:33) postgres=# set role to t1role;
> SET
> (2023-03-28 21:32:40) postgres=# select * from accounts ;
> ┌─┬─┬┐
> │ manager │ company │ contact_email  │
> ╞═╪═╪╡
> │ t1role  │ xxx │ t1r...@xxx.org │
> └─┴─┴┘
> (1 row)
>
> (2023-03-28 21:32:45) postgres=# let v = (select company from accounts);
> LET
> (2023-03-28 21:32:58) postgres=# select v;
> ┌─┐
> │  v  │
> ╞═╡
> │ xxx │
> └─┘
> (1 row)
>
> (2023-03-28 21:33:03) postgres=# set role to default;
> SET
> (2023-03-28 21:33:12) postgres=# set role to t2role;
> SET
> (2023-03-28 21:33:19) postgres=# select * from accounts ;
> ┌─┬─┬┐
> │ manager │ company │ contact_email  │
> ╞═╪═╪╡
> │ t2role  │ yyy │ t2r...@yyy.org │
> └─┴─┴┘
> (1 row)
>
> (2023-03-28 21:33:22) postgres=# let v = (select company from accounts);
> LET
> (2023-03-28 21:33:26) postgres=# select v;
> ┌─┐
> │  v  │
> ╞═╡
> │ yyy │
> └─┘
> (1 row)

Hm, but isn't the row level security enforced here on the select level,
not when assigning some value via LET? Plus, it seems the comment
originally refers to the command types (CMD_SELECT, etc), and there is no
CMD_LET and no need for it, right?

I'm just trying to understand if there was anything special done for
session variables in this regard; if not, the commentary change
seems unnecessary (I know, I know, it's totally nitpicking).




Re: BUG #17877: Referencing a system column in a foreign key leads to incorrect memory access

2023-03-31 Thread Tom Lane
Ranier Vilela  writes:
> I think that commit f0d65c0
> 
> has an oversight.
> Attnum == 0 is a system column too, right?

No, it's not valid in pg_attribute rows.

> All other places at tablecmds.c, has this test:
> if (attnum <= 0)
> ereport(ERROR,

I was actually copying this code in indexcmds.c:

if (attno < 0)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("index creation on system columns is not 
supported")));

There's really no reason to prefer one over the other in this context.

regards, tom lane




Re: running logical replication as the subscription owner

2023-03-31 Thread Robert Haas
On Fri, Mar 31, 2023 at 3:36 AM Jeff Davis  wrote:
> That's moving the goalposts a little, though:
>
> "Some concern was expressed ... might break things that are currently
> working..."
>
> https://www.postgresql.org/message-id/CA+TgmoaE35kKS3-zSvGiZszXP9Tb9rNfYzT=+fo8ehk5edk...@mail.gmail.com
>
> If the original use case was "don't break stuff", I think patch 0002
> solves that, and we don't need this special case in 0001. Would you
> agree with that statement?

I think that's too Boolean. The special case in 0001 is a better
solution for the cases where it works. It's both more granular and
more convenient. The fact that you might be able to get by with 0002
doesn't negate that.

> > But I imagine CREATE SUBSCRIPTION being used either by
> > superusers or by people who already have those role grants anyway,
> > because I imagine replication as something that a highly privileged
> > user configures on behalf of everyone who uses the system. And in that
> > case those role grants aren't something new that you do specifically
> > for logical replication - they're already there because you need them
> > to administer stuff. Or you're the superuser and don't need them
> > anyway.
>
> Did the discussion drift back towards the SET ROLE in the other
> direction? I thought we had settled that in v16 we would require that
> the subscription owner can SET ROLE to the table owner (as in your
> current 0001), but that we could revisit it later.

Yeah, I think that's what we agreed. I'm just saying that I'm not as
concerned about that design as you are, and encouraging you to maybe
not be quite so dismayed by it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Should vacuum process config file reload more often

2023-03-31 Thread Melanie Plageman
On Fri, Mar 31, 2023 at 10:31 AM Melanie Plageman
 wrote:
>
> On Thu, Mar 30, 2023 at 3:26 PM Daniel Gustafsson  wrote:
> >
> > > On 30 Mar 2023, at 04:57, Masahiko Sawada  wrote:
> >
> > > As another idea, why don't we use macros for that? For example,
> > > suppose VacuumCostStatus is like:
> > >
> > > typedef enum VacuumCostStatus
> > > {
> > >VACUUM_COST_INACTIVE_LOCKED = 0,
> > >VACUUM_COST_INACTIVE,
> > >VACUUM_COST_ACTIVE,
> > > } VacuumCostStatus;
> > > VacuumCostStatus VacuumCost;
> > >
> > > non-vacuum code can use the following macros:
> > >
> > > #define VacuumCostActive() (VacuumCost == VACUUM_COST_ACTIVE)
> > > #define VacuumCostInactive() (VacuumCost <= VACUUM_COST_INACTIVE) //
> > > or we can use !VacuumCostActive() instead.
> >
> > I'm in favor of something along these lines.  A variable with a name that
> > implies a boolean value (active/inactive) but actually contains a tri-value
> > is easily misunderstood.  A VacuumCostState tri-value variable (or a better
> > name) with a set of convenient macros for extracting the boolean
> > active/inactive that most of the code needs to be concerned with would make
> > for more readable code I think.
>
> The macros are very error-prone. I was just implementing this idea and
> mistakenly tried to set the macro instead of the variable in multiple
> places. Avoiding this involves another set of macros, and, in the end, I
> think the complexity is much worse. Given the reviewers' uniform dislike
> of VacuumCostInactive, I favor going back to two variables
> (VacuumCostActive + VacuumFailsafeActive) and moving
> LVRelState->failsafe_active to the global VacuumFailsafeActive.
>
> I will reimplement this in the next version.
>
> On the subject of globals, the next version will implement
> Horiguchi-san's proposal to separate GUC variables from the globals used
> in the code (quoted below). It should hopefully reduce the complexity of
> this patchset.
>
> > Although it's somewhat unrelated to the goal of this patch, I think we
> > should tidy up the code before proceeding. Shouldn't we separate
> > the actual parameters from the GUC base variables, and sort out
> > all the related variables? (something like the attached, on top of your
> > patch.)

Attached is v12. It has a number of updates, including a commit to
separate VacuumCostLimit and VacuumCostDelay from the gucs
vacuum_cost_limit and vacuum_cost_delay, and a return to
VacuumCostActive.

- Melanie
From 106f5db65846e0497945b6171bdc29f5727aadc3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 31 Mar 2023 10:38:39 -0400
Subject: [PATCH v12 1/4] Make vacuum's failsafe_active a global

While vacuuming a table in failsafe mode, VacuumCostActive should not be
re-enabled. This currently isn't a problem because vacuum cost
parameters are only refreshed in between vacuuming tables and failsafe
status is reset for every table. In preparation for allowing vacuum cost
parameters to be updated more frequently, elevate
LVRelState->failsafe_active to a global, VacuumFailsafeActive, which
will be checked when determining whether or not to re-enable vacuum
cost-related delays.
---
 src/backend/access/heap/vacuumlazy.c  | 16 +++-
 src/backend/commands/vacuum.c |  1 +
 src/backend/commands/vacuumparallel.c |  1 +
 src/include/commands/vacuum.h |  1 +
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..f4755bcc4b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,8 +153,6 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
-	/* Wraparound failsafe has been triggered? */
-	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
 	bool		consider_bypass_optimization;
 
@@ -391,7 +389,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->failsafe_active = false;
+	VacuumFailsafeActive = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
 	vacrel->do_index_cleanup = true;
@@ -709,7 +707,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			}
 			else
 			{
-if (!vacrel->failsafe_active)
+if (!VacuumFailsafeActive)
 	appendStringInfoString(, _("index scan bypassed: "));
 else
 	appendStringInfoString(, _("index scan bypassed by failsafe: "));
@@ -2293,7 +2291,7 @@ lazy_vacuum(LVRelState *vacrel)
 		 * vacuuming or heap vacuuming.  This VACUUM operation won't end up
 		 * back here again.
 		 */
-		Assert(vacrel->failsafe_active);
+		Assert(VacuumFailsafeActive);
 	}
 
 	/*
@@ -2374,7 +2372,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	 */
 	

Re: [BUG] pg_stat_statements and extended query protocol

2023-03-31 Thread Imseih (AWS), Sami
> So...  The idea here is to set a custom fetch size so that the number of
> calls can be deterministic in the tests, still more than 1 for the
> tests we'd have.  And your point is that libpq always sends 0 when
> sending the EXECUTE message, causing it to return all the rows
> for any caller of PQsendQueryGuts().

That is correct.

> The extended protocol allows that, so you would like a libpq API to
> have more control of what we send with EXECUTE:
> https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-QUERY-CONCEPTS


> The extended query protocol would require multiple 'E' messages, but
> we would not need multiple describe or bind messages, meaning that
> this cannot just be an extra flavor of PQsendQueryParams().  Am I getting
> that right?

Correct, there will need to be separate APIs for Parse/Bind, Execute
and Close of a Portal.


> The correct API design seems tricky, to say the least.
> Perhaps requiring this much extra work in libpq for the purpose of
> having some tests in this thread is not a brilliant idea..  Or perhaps
> we could just do it and have something a-la-JDBC with two routines?
> That would be one libpq routine for describe/bind and one for execute
> where the limit can be given by the caller in the latter case, similar
> to sendDescribeStatement() and sendExecute() in
> QueryExecutorImpl.java.

I am not too clear on your point here. ISTM you are suggesting adding
new libpq APIs similar to JDBC, which is what I am also suggesting.

Did I understand correctly?
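
To make the shape of that concrete, here is a rough sketch. These names and
signatures are hypothetical, nothing like them exists in libpq today; they
just mirror PgJDBC's sendDescribeStatement()/sendExecute() split:

/* Hypothetical declarations, for discussion only -- not in libpq. */

/* Send Parse/Bind/Describe, creating (but not running) a named portal. */
extern int	PQsendPortalBind(PGconn *conn,
							 const char *portalName,
							 const char *command,
							 int nParams,
							 const char *const *paramValues);

/*
 * Send Execute for an existing portal with a row limit (0 = no limit,
 * matching what PQsendQueryGuts() hardcodes today).  The caller repeats
 * this until the server reports the portal completed rather than
 * suspended.
 */
extern int	PQsendPortalExecute(PGconn *conn,
								const char *portalName,
								int maxRows);

/* Send Close for the portal once the caller is done fetching. */
extern int	PQsendPortalClose(PGconn *conn, const char *portalName);

With something along those lines, a test could bind once and then issue
PQsendPortalExecute() with a small maxRows several times, making the number
of 'E' messages, and hence the calls counter, deterministic.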


--
Sami Imseih
Amazon Web Services (AWS)





Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-03-31 Thread Jacob Champion
On 3/31/23 02:14, Daniel Gustafsson wrote:
>> On 14 Mar 2023, at 20:20, Jacob Champion  wrote:
> 
>> Rebased over yesterday's Meson changes in v8.
> 
> I had a look at this and agree that it's something we should do.

Great, thanks for the review!

> +  # Let tests differentiate between vanilla OpenSSL and LibreSSL.
> +  AC_CHECK_DECLS([LIBRESSL_VERSION_NUMBER], [], [], [#include 
> ])
> We have a check for SSL_CTX_set_cert_cb which is specifically done since it's
> not present in Libressl.  Rather than spending more cycles in autoconf/meson,
> couldn't we use HAVE_SSL_CTX_SET_CERT_CB for this test?  (Longer term, maybe 
> we
> should make the checks properly distinguish between OpenSSL and LibreSSL as
> they are diverging, but thats not for this patch to tackle.)

I can make that change; note that it'll also skip some of the new tests
with OpenSSL 1.0.1, where there's no SSL_CTX_set_cert_cb. If that's
acceptable, it should be an easy switch.

> I can agree with the comment that this seems brittle. How about moving the 
> installation of openssl to after the brew cleanup stage to avoid the need for 
> this? While that may leave more in the cache, it seems more palatable. 
> Something like this essentially:
> 
>   brew install 
>   brew cleanup -s 
>   # Comment about why OpenSSL is kept separate
>   brew install openssl@3

That looks much better to me, but it didn't work when I tried it. One or
more of the packages above it (and/or the previous cache?) has already
installed OpenSSL as one of its dependencies, so the last `brew install`
becomes a no-op. I tried an `install --force` as well, but that didn't
seem to do anything differently. :/

> +   libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be used 
> with sslrootcert=system",
> +   conn->sslmode);
> I think we should help the user by indicating which sslmode we allow in this
> message.

Added in v9.

> +
> + /*
> +  * sslmode is not specified. Let it be filled in with the compiled
> +  * default for now, but if sslrootcert=system, we'll override the
> +  * default later before returning.
> +  */
> + sslmode_default = option;
> As a note to self and other reviewers, "git am" misplaced this when applying 
> the
> patch such that the result was syntactically correct but semantically wrong,
> causing very weird test errors.

Lovely... I've formatted v9 with a longer patch context.

> + sslmode_default->val = strdup("verify-full");
> This needs to be checked for OOM error.

Whoops, should be fixed now.

> -   if (fnbuf[0] != '\0' &&
> -   stat(fnbuf, ) == 0)
> +   if (strcmp(fnbuf, "system") == 0)
> I'm not a fan of magic values, but sadly I don't have a better idea for this.
> We should however document that the keyword takes precedence over a file with
> the same name (even though the collision is unlikely).

Added a note to the docs.

> +   if (SSL_CTX_set_default_verify_paths(SSL_context) != 1)
> OpenSSL documents this as "A missing default location is still treated as a
> success.", is that something we need to document or in any way deal with?
> (Skimming the OpenSSL code I'm not sure it's actually correct in v3+, but I
> might very well have missed something.)

I think it's still true in v3+, because that sounds exactly like the
brew issue we're working around in Cirrus. I'm not sure if there's much
for us to do in that case, short of reimplementing the OpenSSL defaults
logic and checking it ourselves. (And even that would look different
between OpenSSL and LibreSSL...)

Is there something we could document that's more helpful than "make sure
your installation isn't broken"?
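
For reference, a minimal sketch of the branch under discussion, following the
quoted diff; the function name is made up for illustration, and real code
needs fuller error reporting:

#include <string.h>
#include <openssl/ssl.h>

/* sketch only: fnbuf holds the resolved sslrootcert value */
static int
load_verify_locations(SSL_CTX *ctx, const char *fnbuf)
{
	if (strcmp(fnbuf, "system") == 0)
	{
		/*
		 * Use OpenSSL's default CA locations.  Note the caveat quoted
		 * above: OpenSSL treats a missing default location as success.
		 */
		return SSL_CTX_set_default_verify_paths(ctx) == 1 ? 0 : -1;
	}

	/* traditional behavior: the value names a root certificate file */
	return SSL_CTX_load_verify_locations(ctx, fnbuf, NULL) == 1 ? 0 : -1;
}
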

--Jacob1:  84f67249e6 ! 1:  18fd368e0e libpq: add sslrootcert=system to use default CAs
@@ doc/src/sgml/libpq.sgml: postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
 + they control, rendering the verify-ca mode 
useless.
 +
 +   
++   
++
++ The magic system value will take precedence 
over a
++ local certificate file with the same name. If for some reason 
you find
++ yourself in this situation, use an alternative path like
++ sslrootcert=./system instead.
++
++   

   
  
2:  11b69d0bc0 ! 2:  ba09e1d83f libpq: force sslmode=verify-full for system CAs
@@ doc/src/sgml/libpq.sgml: postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
 + weaker modes useless.
  
 -   
-+   
-   
-  
- 
+-   
+ 
+  The magic system value will take precedence 
over a
+  local certificate file with the same name. If for some reason 
you find
 
  ## doc/src/sgml/runtime.sgml ##
 @@ doc/src/sgml/runtime.sgml: pg_dumpall -p 5432 | psql -d postgres -p 5433
@@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn)
 +  && 

Re: [RFC] Add jit deform_counter

2023-03-31 Thread Dmitry Dolgov
> On Wed, Mar 29, 2023 at 01:50:37PM +1300, David Rowley wrote:
> On Sun, 12 Jun 2022 at 21:14, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > I've noticed that JIT performance counter generation_counter seems to 
> > include
> > actions, relevant for both jit_expressions and jit_tuple_deforming options. 
> > It
> > means one can't directly see what is the influence of jit_tuple_deforming
> > alone, which would be helpful when adjusting JIT options. To make it better 
> > a
> > new counter can be introduced, does it make sense?
>
> I'm not so sure about this idea. As of now, if I look at EXPLAIN
> ANALYZE's JIT summary, the individual times add up to the total time.
>
> If we add this deform time, then that's no longer going to be true as
> the "Generation" time includes the newly added deform time.
>
> master:
>  JIT:
>Functions: 600
>Options: Inlining false, Optimization false, Expressions true, Deforming 
> true
>Timing: Generation 37.758 ms, Inlining 0.000 ms, Optimization 6.736
> ms, Emission 172.244 ms, Total 216.738 ms
>
> 37.758 + 6.736 + 172.244 = 216.738
>
> I think if I was a DBA wondering why JIT was taking so long, I'd
> probably either be very astonished or I'd report a bug if I noticed
> that all the individual component JIT times didn't add up to the total
> time.
>
> I don't think the solution is to subtract the deform time from the
> generation time either.
>
> Can't users just get this by looking at EXPLAIN ANALYZE with and
> without jit_tuple_deforming?

It could be done this way, but then users need to know that tuple
deforming is included in the generation time (I've skimmed through the
docs; there seem to be no direct statements about that, although it
could be guessed). At the same time, I don't think it's a very
user-friendly approach -- after all, the same argument could be made for
the other timings, i.e. present only one counter for all JIT operations
and expect users to experiment to see how it would change if this or
that option were different.

I agree about adding up to the total time though. What about changing
the format to something like this?

   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 37.758 ms (Deforming 1.234 ms), Inlining 0.000 ms, 
Optimization 6.736 ms, Emission 172.244 ms, Total 216.738 ms

This way it doesn't look like deforming timing is in the same category
as others, but rather a part of another value.




Re: SELECT INTO without columns or star

2023-03-31 Thread Kirk Wolak
On Fri, Mar 31, 2023 at 11:26 AM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > On Fri, Mar 31, 2023 at 8:10 AM Zhang Mingli 
> wrote:
> >> When I exec a sql SELECT INTO without columns or * by mistake, it
> succeeds:
>
> > Yes, a table may have zero columns by design.
>
> Yup, we've allowed that for some time now; see the compatibility comments
> at the bottom of the SELECT man page.
>
> psql's display of zero-column results is a bit weird, which maybe
> somebody should fix sometime:
>
> regression=# select from generate_series(1,4);
> --
> (4 rows)
>
> I'd expect four blank lines there.  Expanded format is even less sane:
>
> regression=# \x
> Expanded display is on.
> regression=# select from generate_series(1,4);
> (4 rows)
>
> ISTM that should produce
>
> [ RECORD 1 ]
> [ RECORD 2 ]
> [ RECORD 3 ]
> [ RECORD 4 ]
>
> and no "(4 rows)" footer, because \x mode doesn't normally print that.
>
> This is all just cosmetic of course, but it's still confusing.
>
> regards, tom lane
>

Tom,
  I wouldn't mind working on a patch to fix this... (especially if it helps
the %T get into psql). I find this output confusing as well.

  Should I start a new email thread, "Proposal: Fix psql output when
selecting no columns", to get the discussion moving? I'd like to get a
clear answer on what to output. I have become more comfortable with psql
after messing with readline for Windows and 2-3 other patches I've been
working on.

Thanks, Kirk


Re: SQL/JSON revisited

2023-03-31 Thread Alvaro Herrera
Here's v14 of the IS JSON predicate.  I find no further problems here
and intend to get it pushed later today.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
>From 998ec6bd170e83cdb916ab40bec5cac56ecd1d81 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 31 Mar 2023 18:47:51 +0200
Subject: [PATCH v14] SQL/JSON: support the IS JSON predicate

This patch introduces the SQL standard IS JSON predicate. It operates
on text and bytea values representing JSON, as well as on the json and
jsonb types. Each test has IS and IS NOT variants and supports a WITH
UNIQUE KEYS flag. The tests are:

IS JSON [VALUE]
IS JSON ARRAY
IS JSON OBJECT
IS JSON SCALAR

These should be self-explanatory.

The WITH UNIQUE KEYS flag makes these return false when duplicate keys
exist in any object within the value, not necessarily directly contained
in the outermost object.

Author: Nikita Glukhov 
Author: Teodor Sigaev 
Author: Oleg Bartunov 
Author: Alexander Korotkov 
Author: Amit Langote 
Author: Andrew Dunstan 

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby.

Discussion: https://postgr.es/m/CAF4Au4w2x-5LTnN_bxky-mq4=woqsgsxspenczhrazsned8...@mail.gmail.com
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886dea...@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6d...@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
---
 doc/src/sgml/func.sgml|  80 +++
 src/backend/executor/execExpr.c   |  13 ++
 src/backend/executor/execExprInterp.c |  99 +
 src/backend/jit/llvm/llvmjit_expr.c   |   6 +
 src/backend/jit/llvm/llvmjit_types.c  |   1 +
 src/backend/nodes/makefuncs.c |  19 ++
 src/backend/nodes/nodeFuncs.c |  26 +++
 src/backend/parser/gram.y |  70 ++-
 src/backend/parser/parse_expr.c   |  76 +++
 src/backend/parser/parser.c   |   3 +
 src/backend/utils/adt/json.c  | 132 +++-
 src/backend/utils/adt/jsonfuncs.c |  20 ++
 src/backend/utils/adt/ruleutils.c |  37 
 src/include/catalog/catversion.h  |   2 +-
 src/include/executor/execExpr.h   |   8 +
 src/include/nodes/makefuncs.h |   3 +
 src/include/nodes/primnodes.h |  26 +++
 src/include/parser/kwlist.h   |   1 +
 src/include/utils/json.h  |   1 +
 src/include/utils/jsonfuncs.h |   3 +
 src/interfaces/ecpg/preproc/parse.pl  |   1 +
 src/interfaces/ecpg/preproc/parser.c  |   3 +
 .../ecpg/test/expected/sql-sqljson.c  |  73 +--
 .../ecpg/test/expected/sql-sqljson.stderr |  86 +---
 .../ecpg/test/expected/sql-sqljson.stdout |   8 +
 src/interfaces/ecpg/test/sql/sqljson.pgc  |  17 ++
 src/test/regress/expected/sqljson.out | 198 ++
 src/test/regress/sql/sqljson.sql  |  96 +
 28 files changed, 1039 insertions(+), 69 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 38e7f46760..918a492234 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16005,6 +16005,86 @@ table2-mapping
 

 
+  
+details SQL/JSON
+   facilities for testing JSON.
+  
+
+  
+   SQL/JSON Testing Functions
+   
+
+ 
+  
+Function signature
+   
+   
+Description
+   
+   
+Example(s)
+  
+ 
+
+
+ 
+  
+IS JSON
+expression IS  NOT  JSON
+ { VALUE | SCALAR | ARRAY | OBJECT } 
+ { WITH | WITHOUT } UNIQUE  KEYS  
+   
+   
+This predicate tests whether expression can be
+parsed as JSON, possibly of a specified type.
+If SCALAR or ARRAY or
+OBJECT is specified, the
+test is whether or not the JSON is of that particular type. If
+WITH UNIQUE KEYS is specified, then any object in the
+expression is also tested to see if it
+has duplicate keys.
+   
+   
+
+SELECT js,
+  js IS JSON "json?",
+  js IS JSON SCALAR "scalar?",
+  js IS JSON OBJECT "object?",
+  js IS JSON ARRAY "array?"
+FROM (VALUES
+  ('123'), ('"abc"'), ('{"a": "b"}'), ('[1,2]'),('abc')) foo(js);
+ js | json? | scalar? | object? | array? 
++---+-+-+
+ 123| t | t   | f   | f
+ "abc"  | t | t   | f   | f
+ {"a": "b"} | t | f   | t   | f
+ [1,2]  | t | f   | f   | t
+ abc| f | f   | f   | f
+
+   
+   
+
+SELECT js,
+  js IS JSON OBJECT 

Re: BUG #17877: Referencing a system column in a foreign key leads to incorrect memory access

2023-03-31 Thread Ranier Vilela
Hi,

I think that commit f0d65c0

has an oversight.

Attnum == 0 is a system column too, right?

All other places at tablecmds.c, has this test:

if (attnum <= 0)
ereport(ERROR,

regards,
Ranier Vilela


reject_system_column_equal_zero.patch
Description: Binary data


Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Andres Freund
Hi,

On 2023-03-31 10:45:51 +0300, Andrey Borodin wrote:
> On Fri, Mar 31, 2023 at 8:07 AM Andres Freund  wrote:
> >
> > I was working on committing patch 0001 from [1] and was a bit confused about
> > some of the changes to the WAL format for gist and hash index vacuum. It
> > looked to me that the changed code just flat out would not work.
> >
> > Turns out the problem is that we don't reach deletion for hash and gist
> > vacuum:
> >
> > gist:
> >
> > > Oh, I see. We apparently don't reach the gist deletion code in the tests:
> > > https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
> > > https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174
> > >
> > > And indeed, if I add an abort() into , it's not reached.
> > >
> > > And it's not because tests use a temp table, the caller is also 
> > > unreachable:
> > > https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643
> >
> 
> GiST logs deletions in gistXLogUpdate(), which is covered.
> gistXLogDelete() is only used for cleaning during page splits.

I am not sure what your point here is - deleting entries to prevent a page
split is deleting entries. What am I missing?


> I propose refactoring GiST WAL to remove gistXLogDelete() and use
> gistXLogUpdate() instead.

I think we still would need to have coverage for gistprunepage(), even if
so. So that seems a separate issue.

I think what gist.sql is missing is a combination of deletes, an index scan (to
mark entries dead), and new insertions (to trigger pruning to prevent page
splits).

Greetings,

Andres Freund




Re: SELECT INTO without columns or star

2023-03-31 Thread Tom Lane
"David G. Johnston"  writes:
> On Fri, Mar 31, 2023 at 8:10 AM Zhang Mingli  wrote:
>> When I exec a sql SELECT INTO without columns or * by mistake, it succeeds:

> Yes, a table may have zero columns by design.

Yup, we've allowed that for some time now; see the compatibility comments
at the bottom of the SELECT man page.

psql's display of zero-column results is a bit weird, which maybe
somebody should fix sometime:

regression=# select from generate_series(1,4);
--
(4 rows)

I'd expect four blank lines there.  Expanded format is even less sane:

regression=# \x
Expanded display is on.
regression=# select from generate_series(1,4);
(4 rows)

ISTM that should produce

[ RECORD 1 ]
[ RECORD 2 ]
[ RECORD 3 ]
[ RECORD 4 ]

and no "(4 rows)" footer, because \x mode doesn't normally print that.

This is all just cosmetic of course, but it's still confusing.

regards, tom lane




Re: SELECT INTO without columns or star

2023-03-31 Thread David G. Johnston
On Fri, Mar 31, 2023 at 8:10 AM Zhang Mingli  wrote:

> When I exec a sql SELECT INTO without columns or * by mistake, it succeeds:
>
>
Yes, a table may have zero columns by design.

David J.


SELECT INTO without columns or star

2023-03-31 Thread Zhang Mingli
Hi, hackers

When I exec a sql SELECT INTO without columns or * by mistake, it succeeds:

select * from t1;
 a | b
---+---
 1 | 2
 2 | 3
 3 | 4
(3 rows)

select into t2 from t1;
SELECT 3

 \pset null '(null)'
Null display is "(null)".

select * from t2;
--
(3 rows)

It seems that t2 has empty rows rather than nulls. Is this expected behavior?
And what are the semantics of SELECT INTO without any columns?
I also see lots of such SELECT INTO statements in our test cases, like:
-- SELECT INTO doesn't support USING
SELECT INTO tableam_tblselectinto_heap2 USING heap2 FROM tableam_tbl_heap2;

Regards,
Zhang Mingli


Re: SQL JSON path enhanced numeric literals

2023-03-31 Thread Nikita Malakhov
Hi!

Sorry to bother, but there is a question on JsonPath: how many bits in
the JsonPath header could be used for the version? The JsonPath header
is 4 bytes, and currently the version part is defined as

#define JSONPATH_VERSION (0x01)
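
For context, the surrounding definitions in src/include/utils/jsonpath.h look
roughly like this (quoted from memory, so verify against the tree); the high
bit of the 4-byte header is taken by the lax flag, which leaves the remaining
bits for the version:

typedef struct
{
	int32		vl_len_;	/* varlena header */
	uint32		header;		/* version and flags (see below) */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} JsonPath;

#define JSONPATH_VERSION	(0x01)
#define JSONPATH_LAX		(0x80000000)
#define JSONPATH_HDRSZ		(offsetof(JsonPath, data))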

Thanks!

On Sun, Mar 5, 2023 at 6:55 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 03.03.23 21:16, Dean Rasheed wrote:
> > I think this new feature ought to be mentioned in the docs somewhere.
> > Perhaps a sentence or two in the note below table 9.49 would suffice,
> > since it looks like that's where jsonpath numbers are mentioned for
> > the first time.
>
> Done.  I actually put it into the data types chapter, where some other
> differences between SQL and SQL/JSON syntax were already discussed.
>
> > In jsonpath_scan.l, I think the hex/oct/bininteger cases could do with
> > a comment, such as
> >
> > /* Non-decimal integers in ECMAScript; must not have underscore after
> radix */
> > hexinteger0[xX]{hexdigit}(_?{hexdigit})*
> > octinteger0[oO]{octdigit}(_?{octdigit})*
> > bininteger0[bB]{bindigit}(_?{bindigit})*
> >
> > since that's different from the main lexer's syntax.
>
> done
>
> > Perhaps it's worth mentioning that difference in the docs.
>
> done
>
> > Otherwise, this looks good to me.
>
> committed
>
>
>
>

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Should vacuum process config file reload more often

2023-03-31 Thread Melanie Plageman
On Thu, Mar 30, 2023 at 3:26 PM Daniel Gustafsson  wrote:
>
> > On 30 Mar 2023, at 04:57, Masahiko Sawada  wrote:
>
> > As another idea, why don't we use macros for that? For example,
> > suppose VacuumCostStatus is like:
> >
> > typedef enum VacuumCostStatus
> > {
> >VACUUM_COST_INACTIVE_LOCKED = 0,
> >VACUUM_COST_INACTIVE,
> >VACUUM_COST_ACTIVE,
> > } VacuumCostStatus;
> > VacuumCostStatus VacuumCost;
> >
> > non-vacuum code can use the following macros:
> >
> > #define VacuumCostActive() (VacuumCost == VACUUM_COST_ACTIVE)
> > #define VacuumCostInactive() (VacuumCost <= VACUUM_COST_INACTIVE) //
> > or we can use !VacuumCostActive() instead.
>
> I'm in favor of something along these lines.  A variable with a name that
> implies a boolean value (active/inactive) but actually contains a tri-value is
> easily misunderstood.  A VacuumCostState tri-value variable (or a better name)
> with a set of convenient macros for extracting the boolean active/inactive
> that most of the code needs to be concerned with would make for more readable
> code I think.

The macros are very error-prone. I was just implementing this idea and
mistakenly tried to set the macro instead of the variable in multiple
places. Avoiding this involves another set of macros, and, in the end, I
think the complexity is much worse. Given the reviewers' uniform dislike
of VacuumCostInactive, I favor going back to two variables
(VacuumCostActive + VacuumFailsafeActive) and moving
LVRelState->failsafe_active to the global VacuumFailsafeActive.

I will reimplement this in the next version.

On the subject of globals, the next version will implement
Horiguchi-san's proposal to separate GUC variables from the globals used
in the code (quoted below). It should hopefully reduce the complexity of
this patchset.

> Although it's somewhat unrelated to the goal of this patch, I think we
> should tidy up the code before proceeding. Shouldn't we separate
> the actual parameters from the GUC base variables, and sort out
> all the related variables? (something like the attached, on top of your
> patch.)

- Melanie




Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Alexander Lakhin

31.03.2023 15:55, Tom Lane wrote:

See also the thread about bug #16329 [1].  Alexander promised to look
into improving the test coverage in this area, maybe he can keep an
eye on the WAL logic coverage too.


Yes, I'm going to analyze that area too. Maybe it'll take more time
(a week or two) if I encounter some bugs there (for now I observe anomalies
with gist__int_ops), but I will definitely try to improve the gist testing.

Best regards,
Alexander




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-31 Thread Alexander Korotkov
On Wed, Mar 29, 2023 at 8:34 PM Alexander Korotkov  wrote:
> On Mon, Mar 27, 2023 at 1:49 PM Alexander Korotkov  
> wrote:
> > On Fri, Mar 24, 2023 at 3:39 AM Andres Freund  wrote:
> > > On 2023-03-23 23:24:19 +0300, Alexander Korotkov wrote:
> > > > On Thu, Mar 23, 2023 at 8:06 PM Andres Freund  
> > > > wrote:
> > > > > I seriously doubt that solving this at the tuple locking level is the 
> > > > > right
> > > > > thing. If we want to avoid refetching tuples, why don't we add a 
> > > > > parameter to
> > > > > delete/update to generally put the old tuple version into a slot, not 
> > > > > just as
> > > > > an optimization for a subsequent lock_tuple()? Then we could remove 
> > > > > all
> > > > > refetching tuples for triggers. It'd also provide the basis for 
> > > > > adding support
> > > > > for referencing the OLD version in RETURNING, which'd be quite 
> > > > > powerful.
> >
> > After some thoughts, I think I like idea of fetching old tuple version
> > in update/delete.  Everything that evades extra tuple fetching and do
> > more of related work in a single table AM call, makes table AM API
> > more flexible.
> >
> > I'm working on patch implementing this.  I'm going to post it later today.
>
> Here is the patchset.  I'm continue to work on comments and refactoring.
>
> My quick question is why do we need ri_TrigOldSlot for triggers?
> Can't we just pass the old tuple for after row trigger in
> ri_oldTupleSlot?
>
> Also, I wonder if we really need a LazyTupleSlot.  It allows to evade
> extra tuple slot allocation.  But as I get in the end the tuple slot
> allocation is just a single palloc.  I bet the effect would be
> invisible in the benchmarks.

Sorry, previous patches don't even compile.  The fixed version is attached.
I'm going to post significantly revised patchset soon.

--
Regards,
Alexander Korotkov


0001-Improve-lazy-tuple-slot-v2.patch
Description: Binary data


0002-Revise-changes-for-tuple_update-and-tuple_delete-v2.patch
Description: Binary data


Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-31 Thread Melanie Plageman
On Thu, Mar 30, 2023 at 11:54 PM David Rowley  wrote:
>
> On Mon, 20 Mar 2023 at 11:50, Melanie Plageman
>  wrote:
> > Attached is an updated v6.
>
> I had a look over the v6-0001 patch.  There are a few things I think
> could be done better:
>
> "Some operations will access a large number of pages at a time", does
> this really need "at a time"? I think it's more relevant that the
> operation uses a large number of pages.
>
> Missing  around Buffer Access Strategy.
>
> Various things could be linked to other sections of the glossary, e.g.
> pages could link to glossary-data-page, shared buffers could link to
> glossary-shared-memory and WAL could link to glossary-wal.
>
> The final paragraph should have  tags around the various
> commands that you list.
>
> I have adjusted those and slightly reworded a few other things. See
> the attached .diff which can be applied atop of v6-0001.

There was one small typo keeping this from compiling. Also a repeated
word. I've fixed these. I also edited a bit of indentation and tweaked
some wording. Diff attached (to be applied on top of your diff).

- Melanie
From 7ca6e4c47656d0e3fe5984f9e67aa51f1d22c73c Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 31 Mar 2023 09:50:45 -0400
Subject: [PATCH v6] more adjustments

---
 doc/src/sgml/glossary.sgml | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index eb837d4622..8bf5e2b92d 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -256,26 +256,26 @@
Buffer Access Strategy

 
- Some operations will access a large number of
- pages.  A
- Buffer Access Strategy helps to prevent these
- these operations from evicting too many pages from
- shared buffers.
+ Some operations will access a large number of pages.  A Buffer
+ Access Strategy helps to prevent these operations from
+ evicting too many pages from shared buffers.
 
 
  A Buffer Access Strategy sets up references to a limited number of
- shared buffers
- and reuses them circularly.  When the operation requires a new page, a
- victim buffer is chosen from the buffers in the strategy, which may
- require dirty buffers and possibly also require
- WAL to be flushed to disk.
- 
- 
- Buffer Access Strategies are used for various operations such as;
+ shared buffers and
+ reuses them circularly.  When the operation requires a new page, a victim
+ buffer is chosen from the buffers in the strategy ring, which may require
+ flushing the page's dirty data and possibly also associated WAL to permanent storage.
+
+
+ Buffer Access Strategies are used for various operations such as:
  sequential scans of large tables, VACUUM,
- COPY, CREATE TABLE AS SELECT,
+ COPY, CREATE TABLE AS SELECT (CTAS),
  ALTER TABLE, CREATE DATABASE,
- CREATE INDEX, and CLUSTER.
+ CREATE INDEX, and CLUSTER.
 

   
-- 
2.37.2



Re: Minimal logical decoding on standbys

2023-03-31 Thread Drouvot, Bertrand

Hi,

On 3/31/23 1:58 PM, Amit Kapila wrote:

On Fri, Mar 31, 2023 at 4:17 PM Drouvot, Bertrand
 wrote:




+ * This is needed for logical decoding on standby. Indeed the "problem" is that
+ * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets woken up
+ * by walreceiver when new WAL has been flushed. Which means that typically
+ * walsenders will get woken up at the same time that the startup process
+ * will be - which means that by the time the logical walsender checks
+ * GetXLogReplayRecPtr() it's unlikely that the startup process
already replayed
+ * the record and updated XLogCtl->lastReplayedEndRecPtr.
+ *
+ * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner case.

IIUC we are introducing condition variables as we can't rely on
current wait events because they will lead to spurious wakeups for
logical walsenders due to the below code in walreceiver:
XLogWalRcvFlush()
{
...
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
WalSndWakeup();

Is my understanding correct?



Both the walsender and the startup process are woken up at the same time.
If the walsender does not find any new record that has been replayed
(because the startup process has not replayed it yet), then it will sleep
for its full timeout (delaying the decoding).

The CV helps to wake up the walsender as soon as a replay is done.


Can't we simply avoid waking up logical walsenders at this place and
rather wake them up at ApplyWalRecord() where the 0005 patch does a
condition variable broadcast? Now, there doesn't seem to be anything
that distinguishes between logical and physical walsenders, but I guess
we can add a variable in the WalSnd structure to identify it.



That sounds like a good idea. We could imagine creating a LogicalWalSndWakeup()
that triages the walsenders based on a new variable (as you suggest).

But it looks to me that we:

- would need to go through the list of all the walsenders to do the triage
- could wake up some logical walsender(s) unnecessarily

This extra work would occur during each replay, while with the CV, only the
ones in the CV wait queue would be woken up.
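
For concreteness, the pattern in the patch looks roughly like this (a sketch;
the wait-event constant is illustrative rather than the patch's exact choice):

    /* Walsender side, waiting for replay to pass a target LSN: */
    ConditionVariablePrepareToSleep(&XLogRecoveryCtl->replayedCV);
    while (GetXLogReplayRecPtr(NULL) < target_lsn)
        ConditionVariableSleep(&XLogRecoveryCtl->replayedCV,
                               WAIT_EVENT_WAL_SENDER_WAIT_WAL);
    ConditionVariableCancelSleep();

    /* Startup process side, once per replayed record (in ApplyWalRecord()):
     * this wakes only the backends currently queued on the CV. */
    ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);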

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: FW: Add the ability to limit the amount of memory that can be allocated to backends.

2023-03-31 Thread Reid Thompson
On Thu, 2023-03-30 at 16:11 +, John Morris wrote:
>  Hi Reid!
> Some thoughts:
>
> I was looking at lmgr/proc.c, and I see a potential integer overflow -
> both max_total_bkend_mem and result are declared as "int", so the
> expression "max_total_bkend_mem * 1024 * 1024 - result * 1024 * 1024"
> could have a problem when max_total_bkend_mem is set to 2G or more.
>
>     /*
>      * Account for shared memory size and initialize
>      * max_total_bkend_mem_bytes.
>      */
>     pg_atomic_init_u64(>max_total_bkend_mem_bytes,
>                        max_total_bkend_mem * 1024 * 1024 - result * 1024 * 1024);
>
> As more of a style thing (and definitely not an error), the calling
> code might look smoother if the memory check and error handling were
> moved into a helper function, say "backend_malloc".  For example, the
> following calling code
>
>     /* Do not exceed maximum allowed memory allocation */
>     if (exceeds_max_total_bkend_mem(Slab_CONTEXT_HDRSZ(chunksPerBlock)))
>     {
>         MemoryContextStats(TopMemoryContext);
>         ereport(ERROR,
>                 (errcode(ERRCODE_OUT_OF_MEMORY),
>                  errmsg("out of memory - exceeds max_total_backend_memory"),
>                  errdetail("Failed while creating memory context \"%s\".",
>                            name)));
>     }
>
>     slab = (SlabContext *) malloc(Slab_CONTEXT_HDRSZ(chunksPerBlock));
>     if (slab == NULL)
>     ....
>
> Could become a single line:
>
>     slab = (SlabContext *) backend_malloc(Slab_CONTEXT_HDRSZ(chunksPerBlock));
>  
> Note this is a change in error handling as backend_malloc() throws
> memory errors. I think this change is already happening, as the error
> throws you’ve already added are potentially visible to callers (to be
> verified). It also could result in less informative error messages
> without the specifics of “while creating memory context”.  Still, it
> pulls repeated code into a single wrapper and might be worth
> considering.
>  
> I do appreciate the changes in updating the global memory counter. My
> recollection is the original version updated stats with every
> allocation, and there was something that looked like a spinlock
> around the update.  (My memory may be wrong …). The new approach
> eliminates contention, both by having fewer updates and by using
> atomic ops.  Excellent.
>  
>  I also have some thoughts about simplifying the atomic update logic
> you are currently using. I need to think about it a bit more and will
> get back to you later on that.
>  
> * John Morris
>  
>  
>  
>  

John,
Thanks for looking this over and catching the overflow. I appreciate the
catch and the guidance.

Thanks,
Reid
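
For reference, the overflow boils down to the following (a sketch with
made-up values; the fix is simply to widen before multiplying):

    int         max_total_bkend_mem = 3072; /* MB, e.g. a 3GB limit */
    int         result = 512;               /* MB of shared memory */

    /* Overflows: the arithmetic happens in 32-bit int before widening. */
    uint64      bad = max_total_bkend_mem * 1024 * 1024 - result * 1024 * 1024;

    /* Safe: cast to a 64-bit type first. */
    uint64      good = (uint64) max_total_bkend_mem * 1024 * 1024 -
                       (uint64) result * 1024 * 1024;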



Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Tom Lane
Andrey Borodin  writes:
> On Fri, Mar 31, 2023 at 8:07 AM Andres Freund  wrote:
>> Turns out the problem is that we don't reach deletion for hash and gist
>> vacuum:

> GiST logs deletions in gistXLogUpdate(), which is covered.
> gistXLogDelete() is only used for cleaning during page splits. I'd
> propose refactoring GiST WAL to remove gistXLogDelete() and using
> gistXLogUpdate() instead.
> However I see that gistXLogPageDelete() is not exercised, and is worth
> fixing IMO. Simply adding 10x more data in gist.sql helps, but I think
> we can do something more clever...

See also the thread about bug #16329 [1].  Alexander promised to look
into improving the test coverage in this area, maybe he can keep an
eye on the WAL logic coverage too.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/16329-7a6aa9b6fa1118a1%40postgresql.org




Re: Infinite Interval

2023-03-31 Thread Ashutosh Bapat
I hurried too much on the previous patch, and it introduced other
problems. Attached is a better patch, which also fixes the problem below:
#select 'infinity'::interval * 0;
 ?column?
--
 infinity
(1 row)

with the patch we see
#select 'infinity'::interval * 0;
2023-03-31 18:00:43.131 IST [240892] ERROR:  interval out of range
2023-03-31 18:00:43.131 IST [240892] STATEMENT:  select
'infinity'::interval * 0;
ERROR:  interval out of range

which looks more appropriate given that 0 * inf = NaN for floats.

There's a way to avoid the separate checks for the infiniteness of the
interval and the factor by using a single block with some integer
arithmetic, but I think this version is more readable, so I avoided doing
that. Let me know if this works for you.

Also added some test cases.
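
To summarize, the expected behavior with the patch applied is (a sketch of
the intended output, not committed behavior):

#select 'infinity'::interval * 0;
ERROR:  interval out of range

#select 'infinity'::interval * (-2.0);
 ?column?
-----------
 -infinity
(1 row)

#select '5 days'::interval * 'infinity'::float8;
 ?column?
----------
 infinity
(1 row)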

--
Best Wishes,
Ashutosh Bapat

On Fri, Mar 31, 2023 at 3:46 PM Ashutosh Bapat
 wrote:
>
> On Tue, Mar 28, 2023 at 7:17 PM Ashutosh Bapat
>  wrote:
>  > make sure that every
> > operator that has interval as one of its operands or as the result has
> > been covered in the code.
>
> time_mi_time - do we want to add an Assert to make sure that this
> function does not produce an Interval structure which looks like a
> non-finite interval?
>
> multiplying an interval by infinity throws an error
> #select '5 days'::interval * 'infinity'::float8;
> 2023-03-29 19:40:15.797 IST [136240] ERROR:  interval out of range
> 2023-03-29 19:40:15.797 IST [136240] STATEMENT:  select '5
> days'::interval * 'infinity'::float8;
> ERROR:  interval out of range
>
> I think this should produce an infinite interval now. Attached patch
> to fix this, to be applied on top of your patch. With the patch
> #select '5 days'::interval * 'infinity'::float8;
>  ?column?
> --
>  infinity
> (1 row)
>
> Going through the tests now.
>
> --
> Best Wishes,
> Ashutosh Bapat
commit 29d8501bb0e1b727abc81e862185770fd8e3a6c9
Author: Ashutosh Bapat 
Date:   Fri Mar 31 18:02:25 2023 +0530

fixup! Add infinite interval values

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c006e27dc0..cd05c61de3 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -3566,6 +3566,7 @@ interval_mul(PG_FUNCTION_ARGS)
 	int32		orig_month = span->month,
 orig_day = span->day;
 	Interval   *result;
+	int			is_factor_inf = isinf(factor);
 
 	result = (Interval *) palloc(sizeof(Interval));
 
@@ -3580,12 +3581,35 @@ interval_mul(PG_FUNCTION_ARGS)
 	 */
 	if (INTERVAL_NOT_FINITE(span))
 	{
-		if (factor < 0.0)
+		if (factor == 0.0)
+			ereport(ERROR,
+	(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+	 errmsg("interval out of range")));
+		else if (factor < 0.0)
 			interval_um_internal(span, result);
 		else
 			memcpy(result, span, sizeof(Interval));
 		PG_RETURN_INTERVAL_P(result);
 	}
+	else if (is_factor_inf)
+	{
+		Interval	zero;
+		int			result_is_inf;
+
+		memset(, 0, sizeof(zero));
+		result_is_inf = interval_cmp_internal(span, ) * is_factor_inf;
+
+		if (result_is_inf == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+	 errmsg("interval out of range")));
+		else if (result_is_inf < 0)
+			INTERVAL_NOBEGIN(result);
+		else
+			INTERVAL_NOEND(result);
+
+		PG_RETURN_INTERVAL_P(result);
+	}
 
 	result_double = span->month * factor;
 	if (isnan(result_double) ||


Re: Memory leak from ExecutorState context?

2023-03-31 Thread Jehan-Guillaume de Rorthais
On Tue, 28 Mar 2023 17:25:49 +0200
Tomas Vondra  wrote:
...
>  * Note that BufFile structs are allocated with palloc(), and therefore
>  * will go away automatically at query/transaction end.  Since the
> underlying
>  * virtual Files are made with OpenTemporaryFile, all resources for
>  * the file are certain to be cleaned up even if processing is aborted
>  * by ereport(ERROR).  The data structures required are made in the
>  * palloc context that was current when the BufFile was created, and
>  * any external resources such as temp files are owned by the ResourceOwner
>  * that was current at that time.
> 
> which I take as confirmation that it's legal to allocate BufFile in any
> memory context, and that cleanup is handled by the cache in fd.c.

OK. I had just over-interpreted the comments and was being overly prudent.

> [...]
> >> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> >> idea at the moment.  
> > 
> > I stepped it down to NOTICE and added some more infos.
> > 
> > [...]
> >   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
> > memory (137494656 > 2097152)
> [...]
> 
> OK, although NOTICE that may actually make it less useful - the default
> level is WARNING, and regular users are unable to change the level. So
> very few people will actually see these messages.

The main purpose of NOTICE was to notify the user/dev, as
client_min_messages=notice by default.

But while playing with it, I wonder if the message is in the right place anyway.
It is probably pessimistic, as it shows memory consumption when increasing the
number of batches, but before the actual BufFiles are (lazily) allocated. The
message should probably pop up a bit sooner, with better numbers.

Anyway, maybe this should be added in light of the next patch, which balances
between increasing the number of batches and the allowed memory. The
WARNING/LOG/NOTICE message could appear when we actually break the memory
rules because of some bad HJ situation.

Another way to expose the bad memory consumption would be to add memory info
to the HJ node in the EXPLAIN output, or maybe to collect some min/max/mean
for pg_stat_statements, but I suspect tracking memory spikes by query is
another challenge altogether...

In the meantime, find attached v3 of the patch, with the debug and NOTICE
messages removed. Given the same plan from my previous email, here are the
memory contexts close to the end of the query:

 ExecutorState: 32768 total in 3 blocks; 15512 free (6 chunks); 17256 used
   HashTableContext: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used
     HashBatchFiles: 28605680 total in 3256 blocks; 970128 free (38180 chunks); 27635552 used
     HashBatchContext: 960544 total in 23 blocks; 7928 free (0 chunks); 952616 used


Regards,
>From 6814994fa0576a8ba6458412ac5f944135fc3813 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c | 27 ++-
 src/backend/executor/nodeHashjoin.c | 18 +-
 src/include/executor/hashjoin.h | 15 ---
 src/include/executor/nodeHashjoin.h |  2 +-
 4 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..0c0d5b4a3c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	 *
 	 * The hashtable control block is just palloc'd from the executor's
 	 * per-query memory context.  Everything else should be kept inside the
-	 * subsidiary hashCxt or batchCxt.
+	 * subsidiary hashCxt, batchCxt or fileCxt.
 	 */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
+ "HashBatchFiles",
+ ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		

Re: Minimal logical decoding on standbys

2023-03-31 Thread Amit Kapila
On Fri, Mar 31, 2023 at 4:17 PM Drouvot, Bertrand
 wrote:
>

+ * This is needed for logical decoding on standby. Indeed the "problem" is that
+ * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets woken up
+ * by walreceiver when new WAL has been flushed. Which means that typically
+ * walsenders will get woken up at the same time that the startup process
+ * will be - which means that by the time the logical walsender checks
+ * GetXLogReplayRecPtr() it's unlikely that the startup process
already replayed
+ * the record and updated XLogCtl->lastReplayedEndRecPtr.
+ *
+ * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner case.

IIUC we are introducing condition variables as we can't rely on
current wait events because they will lead to spurious wakeups for
logical walsenders due to the below code in walreceiver:
XLogWalRcvFlush()
{
...
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
WalSndWakeup();

Is my understanding correct?

Can't we simply avoid waking up logical walsenders at this place and
rather wake them up at ApplyWalRecord() where the 0005 patch does a
condition variable broadcast? Now, there doesn't seem to be anything
that distinguishes between logical and physical walsenders, but I guess
we can add a variable in the WalSnd structure to identify it.

-- 
With Regards,
Amit Kapila.




Re: TAP output format in pg_regress

2023-03-31 Thread Daniel Gustafsson
> On 29 Mar 2023, at 22:08, Daniel Gustafsson  wrote:
> 
>> On 29 Mar 2023, at 21:14, Daniel Gustafsson  wrote:
> 
>> Ugh, I clearly should've stayed on the couch yesterday.
> 
> Maybe today as well, just as I had sent this I realized there is mention
> of the output format in the docs that I had missed.  The attached changes
> that as well to match the new format.  (The section in question doesn't
> mention using meson, but I have a feeling that's tackled in one of the
> meson docs threads.)

Pushed to master now with a few tweaks to comments and docs.  Thanks for all
the reviews; I hope the new format will make meson builds easier/better without
compromising the human-readable aspect of traditional builds.

--
Daniel Gustafsson





Re: Infinite Interval

2023-03-31 Thread Ashutosh Bapat
On Tue, Mar 28, 2023 at 7:17 PM Ashutosh Bapat
 wrote:
 > make sure that every
> operator that has interval as one of its operands or as the result has
> been covered in the code.

time_mi_time - do we want to add an Assert to make sure that this
function does not produce an Interval structure which looks like a
non-finite interval?

multiplying an interval by infinity throws an error
#select '5 days'::interval * 'infinity'::float8;
2023-03-29 19:40:15.797 IST [136240] ERROR:  interval out of range
2023-03-29 19:40:15.797 IST [136240] STATEMENT:  select '5
days'::interval * 'infinity'::float8;
ERROR:  interval out of range

I think this should produce an infinite interval now. Attached patch
to fix this, to be applied on top of your patch. With the patch
#select '5 days'::interval * 'infinity'::float8;
 ?column?
--
 infinity
(1 row)

Going through the tests now.

--
Best Wishes,
Ashutosh Bapat
commit e345924db6e6b531f98124159731560a38eae7d0
Author: Ashutosh Bapat 
Date:   Fri Mar 31 15:41:25 2023 +0530

fixup! Add infinite interval values

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c006e27dc0..79842d3e0d 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -3566,6 +3566,7 @@ interval_mul(PG_FUNCTION_ARGS)
 	int32		orig_month = span->month,
 orig_day = span->day;
 	Interval   *result;
+	int			is_factor_inf = isinf(factor);
 
 	result = (Interval *) palloc(sizeof(Interval));
 
@@ -3574,6 +3575,15 @@ interval_mul(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
  errmsg("interval out of range")));
 
+	if (is_factor_inf != 0)
+	{
+		if (is_factor_inf < 0)
+			INTERVAL_NOBEGIN(result);
+		else
+			INTERVAL_NOEND(result);
+		PG_RETURN_INTERVAL_P(result);
+	}
+
 	/*
 	 * Multiplying infinite interval by finite number keeps it infinite but
 	 * may change the sign.


Re: ICU locale validation / canonicalization

2023-03-31 Thread Jeff Davis
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:
> I don't think we should show the notice when the canonicalization 
> doesn't change anything.  This is not useful:
> 
> +NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-
> upper"

Done.

> Also, the message should be phrased more from the perspective of the 
> user instead of using ICU jargon, like
> 
> NOTICE:  using canonicalized form "%s" for locale specification "%s"
> 
> (Still too many big words?)

Changed to:

   NOTICE:  using standard form "%s" for locale "%s"
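
So creating a collation from a non-canonical locale string now looks
something like this (a sketch, assuming ICU canonicalizes "fr_CA.UTF-8" to
the language tag "fr-CA"):

    CREATE COLLATION mycoll (provider = icu, locale = 'fr_CA.UTF-8');
    NOTICE:  using standard form "fr-CA" for locale "fr_CA.UTF-8"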

> Needs documentation updates in doc/src/sgml/charset.sgml.

I made a very minor update. Do you have something more specific in
mind?

Regards,
Jeff Davis


From 0069b0a6e6972c91445f8b07b42aff519159c0e8 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v11] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.ca...@j-davis.com
---
 doc/src/sgml/charset.sgml |  2 +-
 src/backend/commands/collationcmds.c  | 46 +-
 src/backend/commands/dbcommands.c | 20 +
 src/backend/utils/adt/pg_locale.c | 85 +++
 src/bin/initdb/initdb.c   | 81 ++
 src/bin/initdb/t/001_initdb.pl|  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl  |  4 +-
 src/include/utils/pg_locale.h |  1 +
 .../regress/expected/collate.icu.utf8.out | 29 ++-
 src/test/regress/sql/collate.icu.utf8.sql | 15 
 10 files changed, 258 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 12fabb7372..6dd95b8966 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -893,7 +893,7 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
 The first example selects the ICU locale using a language
 tag per BCP 47.  The second example uses the traditional
 ICU-specific locale syntax.  The first style is preferred going
-forward, but it is not supported by older ICU versions.
+forward, and is used internally to store locales.


 Note that you can name the collation objects in the SQL environment
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 45de78352c..c91fe66d9b 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, );
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -259,6 +264,25 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 		 errmsg("parameter \"locale\" must be specified")));
 
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+char *langtag = icu_language_tag(colliculocale,
+ icu_validation_level);
+
+if (langtag && strcmp(colliculocale, langtag) != 0)
+{
+	ereport(NOTICE,
+			(errmsg("using standard form \"%s\" for locale \"%s\"",
+	langtag, colliculocale)));
+
+	colliculocale = langtag;
+}
+			}
+
 			icu_validate_locale(colliculocale);
 		}
 
@@ -569,26 +593,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, );
-	if (U_FAILURE(status))
-		ereport(ERROR,
-(errmsg("could not convert locale name \"%s\" to language tag: %s",
-		localename, u_errorName(status;
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we 

RE: Data is copied twice when specifying both child and parent table in publication

2023-03-31 Thread shiy.f...@fujitsu.com
On Fri, Mar 31, 2023 2:16 AM Jacob Champion  wrote:
> 
> On Wed, Mar 29, 2023 at 2:00 AM Amit Kapila 
> wrote:
> > Pushed.
> 
> While rebasing my logical-roots patch over the top of this, I ran into
> another situation where mixed viaroot settings can duplicate data. The
> key idea is to subscribe to two publications with mixed settings, as
> before, and add a partition root that's already been replicated with
> viaroot=false to the other publication with viaroot=true.
> 
>   pub=# CREATE TABLE part (a int) PARTITION BY RANGE (a);
>   pub=# CREATE PUBLICATION pub_all FOR ALL TABLES;
>   pub=# CREATE PUBLICATION pub_other FOR TABLE other WITH
> (publish_via_partition_root);
>   -- populate with data, then switch to subscription side
>   sub=# CREATE SUBSCRIPTION sub CONNECTION ... PUBLICATION pub_all,
> pub_other;
>   -- switch back to publication
>   pub=# ALTER PUBLICATION pub_other ADD TABLE part;
>   -- and back to subscription
>   sub=# ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
>   -- data is now duplicated
> 
> (Standalone reproduction attached.)
> 
> This is similar to what happens if you alter the
> publish_via_partition_root setting for an existing publication, but
> I'd argue it's easier to hit by accident. Is this part of the same
> class of bugs, or is it different (or even expected) behavior?
> 

I noticed that a similar problem has been discussed in this thread, see [1] [2]
[3] [4]. It seems complicated to fix if we want the code to automatically skip
tables that have been synchronized previously, and this may be overkill in some
cases (e.g., the target table on the subscriber is not a partitioned table, and
the user wants to synchronize all data in the partitioned table from the
publisher). Besides, it seems not to be a common case, so I'm not sure we
should fix it. Maybe we can just add some documentation for it, as Peter
mentioned.

[1] 
https://www.postgresql.org/message-id/CAJcOf-eQR_%3Dq0f4ZVHd342QdLvBd_995peSr4xCU05hrS3TeTg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/OS0PR01MB5716C756312959F293A822C794869%40OS0PR01MB5716.jpnprd01.prod.outlook.com
 (the second issue in it)
[3] 
https://www.postgresql.org/message-id/CA%2BHiwqHnDHcT4OOcga9rDFyc7TvDrpN5xFH9J2pyHQo9ptvjmQ%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/CAA4eK1%2BNWreG%3D2sKiMz8vFzTsFhEHCjgQMyAu6zj3sdLmcheYg%40mail.gmail.com

Regards,
Shi Yu


Re: Minimal logical decoding on standbys

2023-03-31 Thread Jeff Davis
On Fri, 2023-03-31 at 01:31 +0900, Masahiko Sawada wrote:
> I think that we don't need any change for the latter case, as
> WalSndWait() works perfectly. As for the former cases, since we need
> to wait for a CV, a timeout, or the socket becoming writable, we can
> use ConditionVariableEventSleep().

For this patch series, I agree.

But if the ConditionVariableEventSleep() API is added, then I think we
should change the non-recovery case to use a CV as well for
consistency, and it would avoid the need for WalSndWakeup().

Regards,
Jeff Davis





Re: Minimal logical decoding on standbys

2023-03-31 Thread Jeff Davis
On Wed, 2023-03-08 at 11:25 +0100, Drouvot, Bertrand wrote:
> - Maybe ConditionVariableEventSleep() should take care of the
> “WaitEventSetWait returns 1 and cvEvent.event == WL_POSTMASTER_DEATH”
> case?

Thank you, done. I think the nearby line was also wrong, returning true
when there was no timeout. I combined the lines and got rid of the
early return so it can check the list and timeout condition like
normal. Attached.

> - Maybe ConditionVariableEventSleep() could accept and deal with the
> CV being NULL?
> I used it in the POC attached to handle logical decoding on the
> primary server case.
> One option should be to create a dedicated CV for that case though.

I don't think it's a good idea to have a CV-based API that doesn't need
a CV. Wouldn't that just be a normal WaitEventSet?

> - In the POC attached I had to add this extra condition “(cv &&
> !RecoveryInProgress())” to avoid waiting on the timeout when there is
> a promotion.
> That makes me think that we may want to add 2 extra parameters (as 2
> functions returning a bool?) to ConditionVariableEventSleep()
> to check whether or not we still want to test the socket or the CV
> wake up in each loop iteration.

That seems like a complex API. Would it work to signal the CV during
promotion instead?

> Also 3 additional remarks:
> 
> 1) About InitializeConditionVariableWaitSet() and
> ConditionVariableWaitSetCreate(): I'm not sure about the naming as
> there is no CV yet (they "just" deal with WaitEventSet).

It's a WaitEventSet that contains the conditions always required for
any CV, and allows you to add in more.

> 3)
> 
> I wonder if there are race conditions: ConditionVariableWaitSet is
> being initialized with PGINVALID_SOCKET as WL_LATCH_SET, and might
> also (if IsUnderPostmaster) be initialized with PGINVALID_SOCKET as
> WL_EXIT_ON_PM_DEATH.
> 
> So IIUC, the patch is introducing 2 new possible sources of wake-up.

Those should be the same conditions already required by
ConditionVariableTimedSleep() in master, right?


Regards,
Jeff Davis
From 2f05cab9012950ae9290fccbb6366d50fc01553e Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Wed, 1 Mar 2023 20:02:42 -0800
Subject: [PATCH v2] Introduce ConditionVariableEventSleep().

The new API takes a WaitEventSet which can include socket events. The
WaitEventSet must have been created by
ConditionVariableWaitSetCreate(), another new function, so that it
includes the wait events necessary for a condition variable.
---
 src/backend/storage/lmgr/condition_variable.c | 106 --
 src/backend/storage/lmgr/proc.c   |   6 +
 src/backend/utils/init/miscinit.c |   1 +
 src/include/storage/condition_variable.h  |  10 ++
 4 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 7e2bbf46d9..3dbfa7468b 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -27,9 +27,29 @@
 #include "storage/spin.h"
 #include "utils/memutils.h"
 
+#define ConditionVariableWaitSetLatchPos	0
+
 /* Initially, we are not prepared to sleep on any condition variable. */
 static ConditionVariable *cv_sleep_target = NULL;
 
+/* Used by ConditionVariableSleep() and ConditionVariableTimedSleep(). */
+static WaitEventSet *ConditionVariableWaitSet = NULL;
+
+/*
+ * Initialize the process-local condition variable WaitEventSet.
+ *
+ * This must be called once during startup of any process that can wait on
+ * condition variables, before it issues any ConditionVariableInit() calls.
+ */
+void
+InitializeConditionVariableWaitSet(void)
+{
+	Assert(ConditionVariableWaitSet == NULL);
+
+	ConditionVariableWaitSet = ConditionVariableWaitSetCreate(
+		TopMemoryContext, 0);
+}
+
 /*
  * Initialize a condition variable.
  */
@@ -40,6 +60,51 @@ ConditionVariableInit(ConditionVariable *cv)
 	proclist_init(>wakeup);
 }
 
+/*
+ * Create a WaitEventSet for ConditionVariableEventSleep(). This should be
+ * used when the caller of ConditionVariableEventSleep() would like to wake up
+ * on either the condition variable signal or a socket event. For example:
+ *
+ *   ConditionVariableInit();
+ *   waitset = ConditionVariableWaitSetCreate(mcxt, 1);
+ *   event_pos = AddWaitEventToSet(waitset, 0, sock, NULL, NULL);
+ *   ...
+ *   ConditionVariablePrepareToSleep();
+ *   while (...condition not met...)
+ *   {
+ *   socket_wait_events = ...
+ *   ModifyWaitEvent(waitset, event_pos, socket_wait_events, NULL);
+ *   ConditionVariableEventSleep(, waitset, ...);
+ *   }
+ *   ConditionVariableCancelSleep();
+ *
+ * The waitset is created with the standard events for a condition variable,
+ * and room for adding n_socket_events additional socket events. The
+ * initially-filled event positions should not be modified, but added socket
+ * events can be modified. The same waitset can be used for multiple condition
+ * variables as long 

Re: PGdoc: add ID attribute to create_publication.sgml

2023-03-31 Thread Amit Kapila
On Thu, Mar 30, 2023 at 3:48 PM Amit Kapila  wrote:
>
> On Thu, Mar 30, 2023 at 9:22 AM Peter Smith  wrote:
> >
> > I have marked the CF entry for this patch as "ready for committer"
> >
>
> LGTM. I'll push this tomorrow unless there are more comments for it.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Add LZ4 compression in pg_dump

2023-03-31 Thread gkokolatos





--- Original Message ---
On Wednesday, March 29th, 2023 at 12:02 AM, Tomas Vondra 
 wrote:


> 
> 
> On 3/28/23 18:07, gkokola...@pm.me wrote:
> 
> > --- Original Message ---
> > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me gkokola...@pm.me 
> > wrote:
> > 
> > > --- Original Message ---
> > > On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra 
> > > tomas.von...@enterprisedb.com wrote:
> > > 
> > > > This leaves the empty-data issue (which we have a fix for) and the
> > > > switch to LZ4F. And then the zstd part.
> > > 
> > > Please expect promptly a patch for the switch to frames.
> > 
> > Please find the expected patch attached. Note that the bulk of the
> > patch is code unification, variable renaming to something more
> > appropriate, and comment addition. These are changes that are not
> > strictly necessary to switch to LZ4F. I do believe that are
> > essential for code hygiene after the switch and they do belong
> > on the same commit.
> 
> 
> I think the patch is fine, but I'm wondering if the renames shouldn't go
> a bit further. It removes references to LZ4File struct, but there's a
> bunch of functions with LZ4File_ prefix. Why not to simply use LZ4_
> prefix? We don't have GzipFile either.
> 
> Sure, it might be a bit confusing because lz4.h uses LZ4_ prefix, but
> then we probably should not define LZ4_compressor_init ...

This is a good point. The initial thought was that since lz4.h is now
removed, such ambiguity would not be present. In v2 of the patch the
function is renamed to `LZ4State_compression_init`, since this name
better describes its purpose: it initializes the LZ4State for
compression.

As for the LZ4File_ prefix, I have no objections. Please find the
prefix changed to LZ4Stream_. For the record, the word 'File' is not
unique to the LZ4 implementation. The common data structure used by
the API in compress_io.h is:

   typedef struct CompressFileHandle CompressFileHandle; 

The public functions for this API are named:

  InitCompressFileHandle
  InitDiscoverCompressFileHandle
  EndCompressFileHandle

And within InitCompressFileHandle the pattern is:

if (compression_spec.algorithm == PG_COMPRESSION_NONE)
InitCompressFileHandleNone(CFH, compression_spec);
else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
InitCompressFileHandleGzip(CFH, compression_spec);
else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
InitCompressFileHandleLZ4(CFH, compression_spec);

It was felt that a prefix was required due to the inclusion of the
'lz4.h' header, where naming functions with an 'LZ4_' prefix would be
wrong. The prefix 'LZ4File_' seemed to be in line with the naming of
the rest of the relevant functions and structures. The other
compressions, gzip and none, did not face the same issue.

To conclude, I think that having a prefix is slightly preferable to
not having one. Since the prefix `LZ4File_` is not desired, I propose
`LZ4Stream_` in v2.

I will not object to dismissing the argument and dropping `File` from
the prefix, if so requested.

> 
> Also, maybe the comments shouldn't use "File API" when compress_io.c
> calls that "Compressed stream API".

Done.

Cheers,
//Georgios

> 
> 
> regards
> 
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL CompanyFrom b17b60cc1ff608f85c6c75ab19ad40c0863cfa93 Mon Sep 17 00:00:00 2001
From: Georgios Kokolatos 
Date: Fri, 31 Mar 2023 09:16:52 +
Subject: [PATCH v2] Use LZ4 frames in pg_dump's compressor API.

This change allows for greater compaction of data, especially in very narrow
relations, by avoiding at least a compression header and footer per row. Since
LZ4 frames are now used by both compression APIs, some code deduplication
opportunities have become obvious and are also implemented.

While at it, rename LZ4File* functions to LZ4Stream* to improve readability.

Reported by: Justin Pryzby
---
 src/bin/pg_dump/compress_lz4.c | 420 +
 1 file changed, 275 insertions(+), 145 deletions(-)

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index fc2f4e116d..7023b11a2c 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -17,7 +17,6 @@
 #include "compress_lz4.h"
 
 #ifdef USE_LZ4
-#include 
 #include 
 
 /*
@@ -29,133 +28,286 @@
 #endif
 
 /*--
- * Compressor API
- *--
+ * Common to both APIs
  */
 
-typedef struct LZ4CompressorState
+/*
+ * State used for LZ4 (de)compression by both APIs.
+ */
+typedef struct LZ4State
 {
-	char	   *outbuf;
-	size_t		outsize;
-} LZ4CompressorState;
+	/*
+	 * Used by the Stream API to keep track of the file stream.
+	 */
+	FILE	   *fp;
+
+	LZ4F_preferences_t prefs;
+
+	LZ4F_compressionContext_t	ctx;
+	LZ4F_decompressionContext_t	dtx;
+
+	/*
+	 * Used by the Stream API's lazy initialization.
+	 */
+	bool		inited;
+
+	/*
+	 * Used by the Stream API to distinguish between compression

Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-03-31 Thread Daniel Gustafsson
> On 14 Mar 2023, at 20:20, Jacob Champion  wrote:

> Rebased over yesterday's Meson changes in v8.

I had a look at this and agree that it's something we should do.  The patch
seems quite close to committable, I just have a few comments on it:

+  # Let tests differentiate between vanilla OpenSSL and LibreSSL.
+  AC_CHECK_DECLS([LIBRESSL_VERSION_NUMBER], [], [], [#include 
])
We have a check for SSL_CTX_set_cert_cb which is specifically done since it's
not present in LibreSSL.  Rather than spending more cycles in autoconf/meson,
couldn't we use HAVE_SSL_CTX_SET_CERT_CB for this test?  (Longer term, maybe we
should make the checks properly distinguish between OpenSSL and LibreSSL as
they are diverging, but that's not for this patch to tackle.)


+# brew cleanup removes the empty certs directory in OPENSSLDIR, causing
+# OpenSSL to report unexpected errors ("unregistered scheme") during
+# verification failures. Put it back for now as a workaround.
+#
+#   https://github.com/orgs/Homebrew/discussions/4030
+#
+# Note that $(brew --prefix openssl) will give us the opt/ prefix but not
+# the etc/ prefix, so we hardcode the full path here. openssl@3 is pinned
+# above to try to minimize the chances of this changing beneath us, but 
it's
+# brittle...
+mkdir -p "/opt/homebrew/etc/openssl@3/certs"
I can agree with the comment that this seems brittle. How about moving the 
installation of openssl to after the brew cleanup stage to avoid the need for 
this? While that may leave more in the cache, it seems more palatable. 
Something like this essentially:

brew install 
brew cleanup -s 
# Comment about why OpenSSL is kept separate
brew install openssl@3


+   libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be used with 
sslrootcert=system",
+   conn->sslmode);
I think we should help the user by indicating which sslmode we allow in this
message.


+
+   /*
+* sslmode is not specified. Let it be filled in with the compiled
+* default for now, but if sslrootcert=system, we'll override the
+* default later before returning.
+*/
+   sslmode_default = option;
As a note to self and other reviewers, "git am" misplaced this when applying the
patch such that the result was syntactically correct but semantically wrong,
causing very weird test errors.


+   sslmode_default->val = strdup("verify-full");
This needs to be checked for OOM error.


-   if (fnbuf[0] != '\0' &&
-   stat(fnbuf, ) == 0)
+   if (strcmp(fnbuf, "system") == 0)
I'm not a fan of magic values, but sadly I don't have a better idea for this.
We should however document that the keyword takes precedence over a file with
the same name (even though the collision is unlikely).
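
For illustration, the intended usage is something like the following (host
and database names are made up):

    psql "host=db.example.com dbname=app sslrootcert=system"

Here "system" is the keyword selecting the platform CA pool, and with the
patch the effective sslmode defaults to verify-full.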


+   if (SSL_CTX_set_default_verify_paths(SSL_context) != 1)
OpenSSL documents this as "A missing default location is still treated as a
success."  Is that something we need to document or in any way deal with?
(Skimming the OpenSSL code I'm not sure it's actually correct in v3+, but I
might very well have missed something.)

--
Daniel Gustafsson





Re: pg_basebackup: Correct type of WalSegSz

2023-03-31 Thread Daniel Gustafsson
> On 30 Mar 2023, at 08:24, Peter Eisentraut 
>  wrote:
> 
> The pg_basebackup code has WalSegSz as uint32, whereas the rest of the code 
> has it as int.  This seems confusing, and using the extra range wouldn't 
> actually work.  This was in the original commit (fc49e24fa6), but I suppose 
> it was an oversight.

LGTM, it indeed seems to be an accidental oversight.

--
Daniel Gustafsson





Re: Experiments with Postgres and SSL

2023-03-31 Thread Greg Stark
And the cfbot wants a small tweak
From 3d0a502c25504da32b7a362831c700b4e891f79b Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Fri, 31 Mar 2023 02:31:31 -0400
Subject: [PATCH v6 5/6] Allow pipelining data after ssl request

---
 src/backend/postmaster/postmaster.c | 54 ++---
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9b4b37b997..33b317f98b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2001,13 +2001,14 @@ ProcessSSLStartup(Port *port)
 
 		if (port->raw_buf_remaining > 0)
 		{
-			/* This shouldn't be possible -- it would mean the client sent
-			 * encrypted data before we established a session key...
-			 */
-			elog(LOG, "Buffered unencrypted data remains after negotiating native SSL connection");
+			ereport(COMMERROR,
+	(errcode(ERRCODE_PROTOCOL_VIOLATION),
+	 errmsg("received unencrypted data after SSL request"),
+	 errdetail("This could be either a client-software bug or evidence of an attempted man-in-the-middle attack.")));
 			return STATUS_ERROR;
 		}
-		pfree(port->raw_buf);
+		if (port->raw_buf)
+			pfree(port->raw_buf);
 
 		if (port->ssl_alpn_protocol == NULL)
 		{
@@ -2158,15 +2159,44 @@ ProcessStartupPacket(Port *port, bool ssl_done, bool gss_done)
 		}
 
 #ifdef USE_SSL
-		if (SSLok == 'S' && secure_open_server(port) == -1)
-			return STATUS_ERROR;
+		if (SSLok == 'S')
+		{
+			/* push unencrypted buffered data back through SSL setup */
+			len = pq_buffer_has_data();
+			if (len > 0)
+			{
+buf = palloc(len);
+if (pq_getbytes(buf, len) == EOF)
+	return STATUS_ERROR; /* shouldn't be possible */
+port->raw_buf = buf;
+port->raw_buf_remaining = len;
+port->raw_buf_consumed = 0;
+			}
+			Assert(pq_buffer_has_data() == 0);
+
+			if (secure_open_server(port) == -1)
+return STATUS_ERROR;
+			
+			/*
+			 * At this point we should have no data already buffered.  If we do,
+			 * it was received before we performed the SSL handshake, so it wasn't
+			 * encrypted and indeed may have been injected by a man-in-the-middle.
+			 * We report this case to the client.
+			 */
+			if (port->raw_buf_remaining > 0)
+ereport(FATAL,
+		(errcode(ERRCODE_PROTOCOL_VIOLATION),
+		 errmsg("received unencrypted data after SSL request"),
+		 errdetail("This could be either a client-software bug or evidence of an attempted man-in-the-middle attack.")));
+			if (port->raw_buf)
+pfree(port->raw_buf);
+		}
 #endif
 
-		/*
-		 * At this point we should have no data already buffered.  If we do,
-		 * it was received before we performed the SSL handshake, so it wasn't
-		 * encrypted and indeed may have been injected by a man-in-the-middle.
-		 * We report this case to the client.
+		/* This can only really occur now if there was data pipelined after
+		 * the SSL Request but we have refused to do SSL. In that case we need
+		 * to give up because the client has presumably assumed the SSL
+		 * request would be accepted.
 		 */
 		if (pq_buffer_has_data())
 			ereport(FATAL,
-- 
2.40.0

From 5413f1a1ee897640bd3bb99bae226eec7e2e9f50 Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Mon, 20 Mar 2023 14:09:30 -0400
Subject: [PATCH v6 4/6] Direct SSL connections ALPN support

---
 src/backend/libpq/be-secure-openssl.c | 66 +++
 src/backend/libpq/be-secure.c |  3 +
 src/backend/postmaster/postmaster.c   | 25 +++
 src/backend/utils/misc/guc_tables.c   |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/bin/psql/command.c|  7 +-
 src/include/libpq/libpq-be.h  |  1 +
 src/include/libpq/libpq.h |  1 +
 src/include/libpq/pqcomm.h| 19 ++
 src/interfaces/libpq/fe-connect.c |  5 ++
 src/interfaces/libpq/fe-secure-openssl.c  | 31 +
 src/interfaces/libpq/libpq-int.h  |  1 +
 12 files changed, 167 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 685aa2ed69..620ffafb0b 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -67,6 +67,12 @@ static int	ssl_external_passwd_cb(char *buf, int size, int rwflag, void *userdat
 static int	dummy_ssl_passwd_cb(char *buf, int size, int rwflag, void *userdata);
 static int	verify_cb(int ok, X509_STORE_CTX *ctx);
 static void info_cb(const SSL *ssl, int type, int args);
+static int  alpn_cb(SSL *ssl,
+	const unsigned char **out,
+	unsigned char *outlen,
+	const unsigned char *in,
+	unsigned int inlen,
+	void *userdata);
 static bool initialize_dh(SSL_CTX *context, bool isServerStart);
 static bool initialize_ecdh(SSL_CTX *context, bool isServerStart);
 static const char *SSLerrmessage(unsigned long ecode);
@@ -432,6 +438,13 @@ 

Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Andrey Borodin
Hi!

On Fri, Mar 31, 2023 at 8:07 AM Andres Freund  wrote:
>
> I was working on committing patch 0001 from [1] and was a bit confused about
> some of the changes to the WAL format for gist and hash index vacuum. It
> looked to me that the changed code just flat out would not work.
>
> Turns out the problem is that we don't reach deletion for hash and gist
> vacuum:
>
> gist:
>
> > Oh, I see. We apparently don't reach the gist deletion code in the tests:
> > https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
> > https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174
> >
> > And indeed, if I add an abort() into , it's not reached.
> >
> > And it's not because tests use a temp table, the caller is also unreachable:
> > https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643
>

GiST logs deletions in gistXLogUpdate(), which is covered.
gistXLogDelete() is only used for cleaning during page splits. I'd
propose refactoring GiST WAL to remove gistXLogDelete() and using
gistXLogUpdate() instead.
However I see that gistXLogPageDelete() is not exercised, and is worth
fixing IMO. Simply adding 10x more data in gist.sql helps, but I think
we can do something more clever...


Best regards, Andrey Borodin.




Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-03-31 Thread John Naylor
On Tue, Mar 21, 2023 at 6:44 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

Okay, the changes look good. To go further, I think we need to combine these
into two patches, one with 0001-0003 and one with 0004:

1. Correct false statements about "shutdown" etc. This should contain
changes that can safely be patched all the way to v11.
2. Change bad advice (single-user mode) into good advice. We can target
head first, and then try to adopt as far back as we safely can (and test).

(...and future work, so not part of the CF here) 3. Tell the user what
caused the problem, instead of saying "go figure it out yourself".

> > In swapping this topic back in my head, I also saw [2] where Robert
suggested
> >
> > "that old prepared transactions and stale replication
> > slots should be emphasized more prominently.  Maybe something like:
> >
> > HINT:  Commit or roll back old prepared transactions, drop stale
> > replication slots, or kill long-running sessions.
> > Ensure that autovacuum is progressing, or run a manual database-wide
VACUUM."
>
> It looks like the hint regarding replication slots was added at some
> point. Currently we have:
>
> ```
> errhint( [...]
> "You might also need to commit or roll back old prepared
> transactions, or drop stale replication slots.")));
> ```

Yes, the exact same text as it appeared in the [2] thread above, which
prompted Robert's comment I quoted. I take the point to mean: All of these
things need to be taken care of *first*, before vacuuming, so the hint
should order things so that it is clear.

> Please let me know if you think
> we should also add a suggestion to kill long-running sessions, etc.

+1 for also adding that.

- errmsg("database is not accepting commands to avoid wraparound data loss
in database \"%s\"",
+ errmsg("database is not accepting commands that generate new XIDs to
avoid wraparound data loss in database \"%s\"",

I'm not quite on board with the new message, but welcome additional
opinions. For one, it's a bit longer and now ambiguous. I also bet that
"generate XIDs" doesn't  really communicate anything useful. The people who
understand exactly what that means, and what the consequences are, are
unlikely to let the system get near wraparound in the first place, and
might even know enough to ignore the hint.

I'm thinking we might need to convey something about "writes". While it's
less technically correct, I imagine it's more useful. Remember, many users
have it drilled in their heads that they need to drop immediately to
single-user mode. I'd like to give some idea of what they can and cannot do.

+ Previously it was required to stop the postmaster and VACUUM the
database
+ in a single-user mode. There is no need to use a single-user mode
anymore.

I think we need to go further and actively warn against it: it's slow,
impossible to monitor, disables replication, and disables the safeguards
against wraparound. (Other bad things too, but these are easily
understandable for users.)

Maybe also mention that its main use in wraparound situations is as a way
to perform DROPs and TRUNCATEs, if that would help speed up resolution.

I propose for discussion that 0004 should show in the docs all the queries
for finding prepared xacts, repl slots etc. If we ever show the info at
runtime, we can dispense with the queries, but there seems to be no urgency
for that...
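
Something like the following, perhaps (a sketch; the exact column choices
are open for discussion):

    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
    ORDER BY age(transaction) DESC;

    SELECT slot_name, slot_type, active, xmin, catalog_xmin
    FROM pg_replication_slots
    ORDER BY age(xmin) DESC;

    SELECT pid, datname, usename, state, backend_xmin
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC;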

--
John Naylor
EDB: http://www.enterprisedb.com


Re: running logical replication as the subscription owner

2023-03-31 Thread Jeff Davis
On Thu, 2023-03-30 at 16:08 -0400, Robert Haas wrote:
>  But the run_as_owner option applies to the entire subscription.
> If A activates that option, then B's hypothetical triggers that run
> afoul of the SECURITY_RESTRICTED_OPERATION restrictions start working
> again (woohoo!) but they're now vulnerable to attacks from C. With
> the
> patch as coded, A doesn't need to use run_as_owner, everything still
> just works for B, and A is still protected against C.

That's moving the goalposts a little, though:

"Some concern was expressed ... might break things that are currently
working..."

https://www.postgresql.org/message-id/CA+TgmoaE35kKS3-zSvGiZszXP9Tb9rNfYzT=+fo8ehk5edk...@mail.gmail.com

If the original use case was "don't break stuff", I think patch 0002
solves that, and we don't need this special case in 0001. Would you
agree with that statement?

Hypothetically, if 0001 (without the special case) along with 0002 were
already in 16, and then there was some hypothetical 0003 that
introduced the special case to solve the problem described above with
the bidirectional trust relationship, I'm not sure I'd be sold on 0003.
First, the problem seems fairly minor to me, at least in comparison to
the main problem you are solving in this thread. Second, it seems like
you could work around it by having two subscriptions. Third, it's a bit
unintuitive at least to me: if you introduce a new user Z that can SET
ROLE to any of A, B, or C, and then Z reassigns the subscription to
themselves, then B's trigger will break because B can't SET ROLE to Z.

Others seem to like it, so don't take that as a hard objection.

> 
> But I imagine CREATE SUBSCRIPTION being used either by
> superusers or by people who already have those role grants anyway,
> because I imagine replication as something that a highly privileged
> user configures on behalf of everyone who uses the system. And in
> that
> case those role grants aren't something new that you do specifically
> for logical replication - they're already there because you need them
> to administer stuff. Or you're the superuser and don't need them
> anyway.

Did the discussion drift back towards the SET ROLE in the other
direction? I thought we had settled that in v16 we would require that
the subscription owner can SET ROLE to the table owner (as in your
current 0001), but that we could revisit it later.

Regards,
Jeff Davis





Re: Experiments with Postgres and SSL

2023-03-31 Thread Greg Stark
On Mon, 20 Mar 2023 at 16:31, Greg Stark  wrote:
>
> Here's a first cut at ALPN support.
>
> Currently it's using a hard coded "Postgres/3.0" protocol

Apparently that is explicitly recommended against by the IETF folk. They
want something like "TBD" so people don't start using a string until
it's been added to the registry. So I've changed this for now (to
"TBD-pgsql").

Ok, I think this has pretty much everything I was hoping to do.

The one thing I'm not sure of is that some code paths in postmaster
have ereport(COMMERROR) followed by returning an error, whereas other
code paths just have ereport(FATAL), and I don't actually see much
logic in which does which. (I get the principle behind COMMERROR; it
just seems like it doesn't really match the code.)

I realized I had exactly the infrastructure needed to allow pipelining
the SSL ClientHello like Neon wanted to do, so I added that too. It's
kind of redundant with direct SSL connections, but it seems like there
may be reasons to use that instead.



-- 
greg
From 083df15eff52f025064e2879a404270e405f7dde Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Thu, 16 Mar 2023 11:55:16 -0400
Subject: [PATCH v5 2/6] Direct SSL connections client support

---
 src/interfaces/libpq/fe-connect.c | 91 +--
 src/interfaces/libpq/libpq-fe.h   |  1 +
 src/interfaces/libpq/libpq-int.h  |  3 +
 3 files changed, 89 insertions(+), 6 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index bb7347cb0c..7cd0eb261f 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -274,6 +274,10 @@ static const internalPQconninfoOption PQconninfoOptions[] = {
 		"SSL-Mode", "", 12,		/* sizeof("verify-full") == 12 */
 	offsetof(struct pg_conn, sslmode)},
 
+	{"sslnegotiation", "PGSSLNEGOTIATION", "postgres", NULL,
+		"SSL-Negotiation", "", 14,		/* strlen("requiredirect") == 14 */
+	offsetof(struct pg_conn, sslnegotiation)},
+
 	{"sslcompression", "PGSSLCOMPRESSION", "0", NULL,
 		"SSL-Compression", "", 1,
 	offsetof(struct pg_conn, sslcompression)},
@@ -1504,11 +1508,36 @@ connectOptions2(PGconn *conn)
 		}
 #endif
 	}
+
+	/*
+	 * validate sslnegotiation option, default is "postgres" for the postgres
+	 * style negotiated connection with an extra round trip but more options.
+	 */
+	if (conn->sslnegotiation)
+	{
+		if (strcmp(conn->sslnegotiation, "postgres") != 0
+			&& strcmp(conn->sslnegotiation, "direct") != 0
+			&& strcmp(conn->sslnegotiation, "requiredirect") != 0)
+		{
+			conn->status = CONNECTION_BAD;
+			libpq_append_conn_error(conn, "invalid %s value: \"%s\"",
+	"sslnegotiation", conn->sslnegotiation);
+			return false;
+		}
+
+#ifndef USE_SSL
+		if (conn->sslnegotiation[0] != 'p') {
+			conn->status = CONNECTION_BAD;
+			libpq_append_conn_error(conn, "sslnegotiation value \"%s\" invalid when SSL support is not compiled in",
+	conn->sslnegotiation);
+			return false;
+		}
+#endif
+	}
 	else
 	{
-		conn->sslmode = strdup(DefaultSSLMode);
-		if (!conn->sslmode)
-			goto oom_error;
+		libpq_append_conn_error(conn, "sslnegotiation missing?");
+		return false;
 	}
 
 	/*
@@ -1614,6 +1643,18 @@ connectOptions2(PGconn *conn)
 	conn->gssencmode);
 			return false;
 		}
+#endif
+#ifdef USE_SSL
+		/* GSS is incompatible with direct SSL connections so it requires the
+		 * default postgres style connection ssl negotiation  */
+		if (strcmp(conn->gssencmode, "require") == 0 &&
+			strcmp(conn->sslnegotiation, "postgres") != 0)
+		{
+			conn->status = CONNECTION_BAD;
+			libpq_append_conn_error(conn, "gssencmode value \"%s\" invalid when Direct SSL negotiation is enabled",
+	conn->gssencmode);
+			return false;
+		}
 #endif
 	}
 	else
@@ -2747,11 +2788,12 @@ keep_going:		/* We will come back to here until there is
 		/* initialize these values based on SSL mode */
 		conn->allow_ssl_try = (conn->sslmode[0] != 'd');	/* "disable" */
 		conn->wait_ssl_try = (conn->sslmode[0] == 'a'); /* "allow" */
+		/* direct ssl is incompatible with "allow" or "disabled" ssl */
+		conn->allow_direct_ssl_try = conn->allow_ssl_try && !conn->wait_ssl_try && (conn->sslnegotiation[0] != 'p');
 #endif
 #ifdef ENABLE_GSS
 		conn->try_gss = (conn->gssencmode[0] != 'd');	/* "disable" */
 #endif
-
 		reset_connection_state_machine = false;
 		need_new_connection = true;
 	}
@@ -3209,6 +3251,28 @@ keep_going:		/* We will come back to here until there is
 if (pqsecure_initialize(conn, false, true) < 0)
 	goto error_return;
 
+/* If SSL is enabled and direct SSL connections are enabled
+ * and we haven't already established an SSL connection (or
+ * already tried a direct connection and failed or succeeded)
+ * then try just enabling SSL directly.
+ *
+ * If we fail then we'll either fail the connection (if
+ * sslnegotiation is set to requiredirect or turn
+ * allow_direct_ssl_try to false
+ */
+if (conn->allow_ssl_try
+	&&