Re: Perform streaming logical transactions by background workers and parallel apply

2023-02-02 Thread Masahiko Sawada
On Fri, Feb 3, 2023 at 12:29 PM houzj.f...@fujitsu.com
 wrote:
>
> On Friday, February 3, 2023 11:04 AM Amit Kapila  
> wrote:
> >
> > On Thu, Feb 2, 2023 at 4:52 AM Peter Smith 
> > wrote:
> > >
> > > Some minor review comments for v91-0001
> > >
> >
> > Pushed this yesterday after addressing your comments!
>
> Thanks for pushing.
>
> Currently, we have two remaining patches that we are not sure are worth
> committing for now. We share them here for reference.
>
> 0001:
>
> Based on our discussion[1] on -hackers, it's not clear whether it's necessary
> to add the sub-feature to stop extra workers when
> max_parallel_apply_workers_per_subscription is reduced. Because:
> 
> - it's not clear whether reducing 'max_parallel_apply_workers_per_subscription'
>   is very common.

A use case I'm concerned about is a temporarily intensive data load,
for example, a data loading batch job in a maintenance window. In this
case, the user might want to temporarily increase
max_parallel_apply_workers_per_subscription in order to avoid a large
replication lag, and revert the change back to normal after the job.
If the regular workload is unlikely to stream changes because
logical_decoding_work_mem is big enough to handle the regular
transaction data, the excess parallel workers won't exit. Another
subscription might want to use parallel workers, but there might not be
any free workers. That's why I thought we need to free the excess workers
at some point.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Andres Freund
Hi,

On 2023-02-03 20:34:36 +1300, Thomas Munro wrote:
> What if we block signals, fork, then in the child, install the default
> SIGTERM handler, then unblock, and then exec the shell?

Yep.  I was momentarily wondering why we'd even need to unblock signals,
but while exec (et al) reset the signal handler, they don't reset the
mask...

We could, for good measure, do PGSharedMemoryDetach() etc. But I don't
think it's quite worth it if we're careful with signals.  However
ClosePostmasterPorts() might be a good idea? I think not doing it might
cause issues like keeping the listen sockets alive after we shut down
postmaster, preventing us from starting up again?

Looks like PR_SET_PDEATHSIG isn't reset across an execve(). But that
actually seems good?


> If SIGTERM is delivered either before or after exec (but before
> whatever is loaded installs a new handler) then the child is
> terminated, but without running the handler.  Isn't that what we want
> here?

Yep, I think so.
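
For concreteness, here is a minimal sketch of that sequence (block SIGTERM,
fork, restore the default handler and the original mask in the child, then
exec the shell). It is only an illustration of the idea, not the eventual
patch; the function name and the hard-coded /bin/sh are assumptions made for
the sketch, and error handling is omitted.

#include <signal.h>
#include <unistd.h>

static pid_t
spawn_shell_command(const char *command)
{
	sigset_t	block_set, orig_set;
	pid_t		pid;

	/* block SIGTERM so it cannot be processed between fork() and exec() */
	sigemptyset(&block_set);
	sigaddset(&block_set, SIGTERM);
	sigprocmask(SIG_BLOCK, &block_set, &orig_set);

	pid = fork();
	if (pid == 0)
	{
		/* child: default disposition, original mask, then exec the shell */
		signal(SIGTERM, SIG_DFL);
		sigprocmask(SIG_SETMASK, &orig_set, NULL);
		execl("/bin/sh", "sh", "-c", command, (char *) NULL);
		_exit(127);				/* exec failed */
	}

	/* parent: restore the original mask and carry on as before */
	sigprocmask(SIG_SETMASK, &orig_set, NULL);
	return pid;
}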

Greetings,

Andres Freund




RE: Deadlock between logrep apply worker and tablesync worker

2023-02-02 Thread houzj.f...@fujitsu.com
On Thursday, February 2, 2023 7:21 PM Amit Kapila  
wrote:
> 
> On Thu, Feb 2, 2023 at 12:05 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, January 31, 2023 1:07 AM vignesh C 
> wrote:
> > > On Mon, 30 Jan 2023 at 17:30, vignesh C  wrote:
> > >
> >
> > I also tried to test the time of "src/test/subscription/t/002_types.pl"
> > before and after the patch (change the lock level) and Tom's
> > patch (split
> > transaction), like what Vignesh has shared on -hackers.
> >
> > I ran it about 100 times for each case. Tom's and the lock level patch
> > behave similarly on my machines[1].
> >
> > HEAD: 3426 ~ 6425 ms
> > HEAD + Tom: 3404 ~ 3462 ms
> > HEAD + Vignesh: 3419 ~ 3474 ms
> > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> >
> > Even apart from the testing time reduction, reducing the lock level
> > and locking the specific object can also help reduce lock contention,
> > from which users (of the exposed function), the table sync worker and the
> > apply worker can all benefit. So, I think pushing the patch to
> > change
> the lock level makes sense.
> >
> > And the patch looks good to me.
> >
> 
> Thanks for the tests. I also see a reduction in test time variability with 
> Vignesh's
> patch. I think we can release the locks in case the origin is concurrently
> dropped as in the attached patch. I am planning to commit this patch
> tomorrow unless there are more comments or objections.
> 
> > While on it, after pushing the patch, I think there is another case that
> > might also be worth improving: the table sync and apply
> > workers try to drop the same origin, which might cause some delay. This
> > is another case (different from the deadlock), so I feel we can try to
> > improve
> this in another patch.
> >
> 
> Right, I think that case could be addressed by Tom's patch to some extent but
> I am thinking we should also try to analyze if we can completely avoid the 
> need
> to remove origins from both processes. One idea could be to introduce
> another relstate something like PRE_SYNCDONE and set it in a separate
> transaction before we set the state as SYNCDONE and remove the slot and
> origin in tablesync worker.
> Now, if the tablesync worker errors out due to some reason during the second
> transaction, it can remove the slot and origin after restart by checking the 
> state.
> However, it would add another relstate which may not be the best way to
> address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve the same.
Basically, the patch removes the code that drops the origin in the apply worker,
and adds a new state PRE_SYNCDONE that is set after synchronization has finished
ahead of the apply position (sublsn set), but before dropping the origin and the
other final cleanups. The tablesync worker will restart and redo the cleanup if
it failed after reaching the new state. Besides, since changes can already be
applied on the table in the PRE_SYNCDONE state, I also modified the check in
should_apply_changes_for_rel(). Some other conditions for the origin drop in
subscription commands were adjusted in this patch as well.
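
For illustration, here is a rough sketch (not the actual patch) of how the
check in should_apply_changes_for_rel() could look with such a state; the name
SUBREL_STATE_PRE_SYNCDONE and the exact condition are assumptions made only
for this sketch, and the surrounding fields are the ones the existing function
in worker.c already uses.

static bool
should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
{
	if (am_tablesync_worker())
		return MyLogicalRepWorker->relid == rel->localreloid;

	/*
	 * Once a relation has reached PRE_SYNCDONE at or before the
	 * transaction's final LSN, treat it like SYNCDONE: its changes are
	 * applied by the apply worker even though the origin has not been
	 * dropped yet.
	 */
	return (rel->state == SUBREL_STATE_READY ||
			((rel->state == SUBREL_STATE_SYNCDONE ||
			  rel->state == SUBREL_STATE_PRE_SYNCDONE) &&
			 rel->statelsn <= remote_final_lsn));
}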

Best Regards,
Hou zj


0001-Avoid-dropping-origins-from-both-apply-and-tablesync.patch
Description:  0001-Avoid-dropping-origins-from-both-apply-and-tablesync.patch


Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
>> setsid(2) is required since SUSv2, so I'm not sure which systems
>> are of concern here ... other than Redmond's of course.

> I was thinking of windows, yes.

But given the lack of fork(2), Windows requires a completely
different solution anyway, no?

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Feb 3, 2023 at 8:35 PM Andres Freund  wrote:
>> we don't have a hard dependency on setsid()

> FTR There are no Unixes without setsid()...

Yeah.  What I just got done reading in SUSv2 (1997) is
"Derived from the POSIX.1-1988 standard".  We need not
concern ourselves with any systems not having it.

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Andres Freund
Hi,

On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Ugh, I think I might understand what's happening:
> 
> > The signal arrives just after the fork() (within system()). Because we
> > have all our processes configure themselves as process group leaders,
> > and we signal the entire process group (c.f. signal_child()), both the
> > child process and the parent will process the signal. So we'll end up
> > doing a proc_exit() in both. As both are trying to remove themselves
> > from the same PGPROC etc entry, that doesn't end well.
> 
> Ugh ...
> 
> > I don't see how we can solve that properly as long as we use system().
> 
> ... but I don't see how that's system()'s fault?  Doing the fork()
> ourselves wouldn't change anything about that.

If we did the fork ourselves, we'd temporarily change the signal mask
before the fork() and reset it immediately in the parent, but not in the
child. We can't do that with system(), because we don't get control back
early enough - we'd just block signals for the entire duration of
system().

I wonder if this shows a problem with the change in 14 to make pgarch.c
be attached to shared memory. Before that, we didn't have to worry about
problems like the above in the archiver, but now we do. It's less severe
than the startup process issue, because we don't have a comparable
signal handler in pgarch, but still.

I'm e.g. not sure that there aren't issues with
procsignal_sigusr1_handler() or such executing in a forked process.


> > A workaround for the back branches could be to have a test in
> > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
> > not do the proc_exit() if they don't match. We probably should just do
> > an _exit() in that case.
> 
> Might work.

I wonder if we should add code complaining loudly about such a mismatch
to proc_exit(), in addition to handling it more silently in
StartupProcShutdownHandler().  Also, an assertion in
[Auxiliary]ProcKill that proc->pid == MyProcPid == getpid() seems like a
good idea.
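
For reference, a minimal sketch of the back-branch workaround quoted above,
based on the handler as it exists today (illustration only, not a tested
patch):

static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	if (in_restore_command)
	{
		if (MyProcPid == getpid())
			proc_exit(1);		/* genuinely the startup process */
		else
			_exit(1);			/* forked child: exit without touching shared state */
	}
	else
		shutdown_requested = true;
	WakeupRecovery();

	errno = save_errno;
}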


> > OTOH, the current approach only works on systems with setsid(2) support,
> > so we probably shouldn't rely so hard on it anyway.
> 
> setsid(2) is required since SUSv2, so I'm not sure which systems
> are of concern here ... other than Redmond's of course.

I was thinking of windows, yes.

Greetings,

Andres Freund




Re: In-place persistance change of a relation

2023-02-02 Thread Heikki Linnakangas

I want to call out this part of this patch:


Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.


This is interesting to a much wider audience than ALTER TABLE SET 
LOGGED/UNLOGGED. It also adds most of the complexity, with the new 
marker files. Can you please split the first patch into two:


1. Cleanup of newly created relations on crash

2. ALTER TABLE SET LOGGED/UNLOGGED changes

Then we can review the first part independently.

Regarding the first part, I'm not sure the marker files are the best 
approach to implement it. You need to create an extra file for every 
relation, just to delete it at commit. It feels a bit silly, but maybe 
it's OK in practice. The undo log patch set solved this problem with the 
undo log, but it looks like that patch set isn't going anywhere. Maybe 
invent a very lightweight version of the undo log for this?
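
For illustration only, a very rough sketch of the marker-file idea being
discussed; the ".pending" suffix and all function names are made up for this
sketch and do not reflect the patch:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Called when a relation file is created inside a transaction. */
static void
create_pending_marker(const char *relpath)
{
	char		marker[1024];
	int			fd;

	snprintf(marker, sizeof(marker), "%s.pending", relpath);
	fd = open(marker, O_CREAT | O_WRONLY, 0600);
	if (fd >= 0)
		close(fd);
}

/* Called at commit: the relation file is now permanent. */
static void
remove_pending_marker(const char *relpath)
{
	char		marker[1024];

	snprintf(marker, sizeof(marker), "%s.pending", relpath);
	unlink(marker);
}

/* Called during crash recovery: any surviving marker flags an orphan. */
static void
cleanup_orphaned_files(const char *dirpath)
{
	DIR		   *dir = opendir(dirpath);
	struct dirent *de;

	if (dir == NULL)
		return;
	while ((de = readdir(dir)) != NULL)
	{
		size_t		len = strlen(de->d_name);

		if (len > 8 && strcmp(de->d_name + len - 8, ".pending") == 0)
		{
			char		path[1024];

			/* remove the orphaned relation file, then its marker */
			snprintf(path, sizeof(path), "%s/%.*s", dirpath,
					 (int) (len - 8), de->d_name);
			unlink(path);
			snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
			unlink(path);
		}
	}
	closedir(dir);
}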


- Heikki





Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Thomas Munro
On Fri, Feb 3, 2023 at 8:35 PM Andres Freund  wrote:
> we don't have a
> hard dependency on setsid()

FTR There are no Unixes without setsid()...  HAVE_SETSID was only left
in the tree because we were discussing whether to replace it with
!defined(WIN32) or whether that somehow made things more confusing,
but then while trying to figure out what to do about that, I noticed
that Windows *does* have a near-equivalent thing, or IIRC several
things like that, and that kinda stopped me in my tracks.




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Andres Freund
Hi,

On 2023-02-02 14:39:19 -0800, Nathan Bossart wrote:
> Actually, this still doesn't really explain why we need to exit immediately
> in the SIGTERM handler for restore_command.  We already have handling for
> when the command indicates it exited due to SIGTERM, so it should be no
> problem if the command receives it before the startup process.  And
> HandleStartupProcInterrupts() should exit at an appropriate time after the
> startup process receives SIGTERM.

> My guess was that this is meant to allow breaking out of the system() call,
> but I don't understand why that's important here.  Maybe we could just
> remove this exit-in-SIGTERM-handler business...

I don't think you can, at least not easily. For one, we have no
guarantee that the child process got a signal at all - we don't have a
hard dependency on setsid(). And even if we have setsid(), there's no
guarantee that the executed process reacts to SIGTERM and that the child
didn't create its own process group (and thus isn't reached by the
signal to the process group, sent in signal_child()).

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Thomas Munro
On Fri, Feb 3, 2023 at 8:24 PM Tom Lane  wrote:
> Andres Freund  writes:
> > Ugh, I think I might understand what's happening:
>
> > The signal arrives just after the fork() (within system()). Because we
> > have all our processes configure themselves as process group leaders,
> > and we signal the entire process group (c.f. signal_child()), both the
> > child process and the parent will process the signal. So we'll end up
> > doing a proc_exit() in both. As both are trying to remove themselves
> > from the same PGPROC etc entry, that doesn't end well.
>
> Ugh ...

Yuck, but yeah that makes sense.

> > I don't see how we can solve that properly as long as we use system().
>
> ... but I don't see how that's system()'s fault?  Doing the fork()
> ourselves wouldn't change anything about that.

What if we block signals, fork, then in the child, install the default
SIGTERM handler, then unblock, and then exec the shell?  If SIGTERM is
delivered either before or after exec (but before whatever is loaded
installs a new handler) then the child is terminated, but without
running the handler.  Isn't that what we want here?




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Tom Lane
Andres Freund  writes:
> Ugh, I think I might understand what's happening:

> The signal arrives just after the fork() (within system()). Because we
> have all our processes configure themselves as process group leaders,
> and we signal the entire process group (c.f. signal_child()), both the
> child process and the parent will process the signal. So we'll end up
> doing a proc_exit() in both. As both are trying to remove themselves
> from the same PGPROC etc entry, that doesn't end well.

Ugh ...

> I don't see how we can solve that properly as long as we use system().

... but I don't see how that's system()'s fault?  Doing the fork()
ourselves wouldn't change anything about that.

> A workaround for the back branches could be to have a test in
> StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
> not do the proc_exit() if they don't match. We probably should just do
> an _exit() in that case.

Might work.

> OTOH, the current approach only works on systems with setsid(2) support,
> so we probably shouldn't rely so hard on it anyway.

setsid(2) is required since SUSv2, so I'm not sure which systems
are of concern here ... other than Redmond's of course.

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Andres Freund
Hi,

On 2023-02-02 10:23:21 +0900, Michael Paquier wrote:
> On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote:
> > On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
> >> Andres Freund  writes:
> >> The main thing that system() brings to the table is platform-specific
> >> knowledge of where the shell is.  I'm not very sure that we want to
> >> wire in "/bin/sh".
> > 
> > We seem to be doing OK with using SHELLPROG in pg_regress, which just
> > seems to be using $SHELL from the build environment.
> 
> It looks like this had better centralize a bit more of the logic from
> pg_regress.c if that were to happen, to avoid more fuzzy logic with
> WIN32.  That becomes invasive for a back-patch.

I don't think we should consider backpatching such a change. There's
enough subtlety that I'd want to see it bake for some time.


> By the way, there is something that's itching me a bit here.  9a740f8
> has enlarged by a lot the window between PreRestoreCommand() and
> PostRestoreCommand(), however curculio has reported a failure on
> REL_15_STABLE, where we only manipulate my_wait_event_info while the
> flag is on.  Or I am getting that right that there is no way out of it
> unless we remove the dependency to system() even in the back-branches?
> Could there be an extra missing piece here?

Yea, that's indeed odd.


Ugh, I think I might understand what's happening:

The signal arrives just after the fork() (within system()). Because we
have all our processes configure themselves as process group leaders,
and we signal the entire process group (c.f. signal_child()), both the
child process and the parent will process the signal. So we'll end up
doing a proc_exit() in both. As both are trying to remove themselves
from the same PGPROC etc entry, that doesn't end well.

I don't see how we can solve that properly as long as we use system().

A workaround for the back branches could be to have a test in
StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
not do the proc_exit() if they don't match. We probably should just do
an _exit() in that case.


I doubt the idea to signal the entire process group in signal_child()
is good. I regularly see core dumps of archive commands because we sent
SIGQUIT during an immediate shutdown, and of course cp etc don't have a
SIGQUIT handler, and the default action is to core dump.

But a replacement for it is not a small amount of work. While a
subprocess is running, we can't just handle SIGQUIT with _exit(); we
need to first signal the child with
something appropriate (SIGKILL?).
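
A hypothetical sketch of that alternative (the variable and function names
are made up for illustration; this is not a worked-out proposal):

static volatile pid_t restore_child_pid = 0;	/* set around the fork/exec */

static void
quickdie_forward_to_child(SIGNAL_ARGS)
{
	/* don't leave the archive/restore command running */
	if (restore_child_pid > 0)
		kill(restore_child_pid, SIGKILL);

	/* exit immediately, without shared-memory cleanup */
	_exit(2);
}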

OTOH, the current approach only works on systems with setsid(2) support,
so we probably shouldn't rely so hard on it anyway.

Greetings,

Andres Freund




Re: [PATCH] Add pretty-printed XML output option

2023-02-02 Thread Jim Jones
The system somehow returns different error messages on Linux and
macOS/Windows, which was causing the cfbot to fail.


SELECT xmlpretty('<foo>')::xml;
  ^
-DETAIL:  line 1: chunk is not well balanced
+DETAIL:  line 1: Premature end of data in tag foo line 1

Test removed in v2.

On 02.02.23 21:35, Jim Jones wrote:

Hi,

This small patch introduces an XML pretty print function. It basically 
takes advantage of the indentation feature of xmlDocDumpFormatMemory 
from libxml2 to format XML strings.


postgres=# SELECT xmlpretty('id="z">42');

    xmlpretty
--
 +
     +
 42+
       +
   +

(1 row)


The patch also contains regression tests and documentation.

Feedback is very welcome!

JimFrom ced9fccddc033de98709a6e93dc6530ce68149db Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Thu, 2 Feb 2023 21:27:16 +0100
Subject: [PATCH v2 1/2] Add pretty-printed XML output option

This small patch introduces an XML pretty print function.
It basically takes advantage of the indentation feature
of xmlDocDumpFormatMemory from libxml2 to format XML strings.
---
 doc/src/sgml/func.sgml|  34 ++
 src/backend/utils/adt/xml.c   |  30 +
 src/include/catalog/pg_proc.dat   |   3 +
 src/test/regress/expected/xml.out | 107 ++
 src/test/regress/sql/xml.sql  |  34 ++
 5 files changed, 208 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e09e289a43..e8b5e581f0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14293,6 +14293,40 @@ SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
 ]]>
 

+   
+ 
+xmlpretty
+
+ 
+ xmlpretty
+ 
+ 
+
+xmlpretty ( xml ) xml
+
+ 
+ 
+ Converts the given XML value to pretty-printed, indented text.
+ 
+ 
+ 
+ Example:
+ 
+ 
+   
+   

 

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 079bcb1208..6409133137 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -473,6 +473,36 @@ xmlBuffer_to_xmltype(xmlBufferPtr buf)
 }
 #endif
 
+Datum
+xmlpretty(PG_FUNCTION_ARGS)
+{
+#ifdef USE_LIBXML
+
+xmlDocPtr  doc;
+xmlChar*buf = NULL;
+text   *arg = PG_GETARG_TEXT_PP(0);
+
+doc = xml_parse(arg, XMLOPTION_DOCUMENT, false, GetDatabaseEncoding(), NULL);
+
+/**
+* xmlDocDumpFormatMemory (
+*   xmlDocPtr doc,  # the XML document.
+*   xmlChar ** buf, # buffer where the formatted XML document will be stored.
+*   int *size,  # this could store the size of the created buffer
+* but as we do not need it, we can leave it NULL.
+*   int format) # 1 = node indenting.
+*/
+xmlDocDumpFormatMemory(doc, &buf, NULL, 1);
+
+xmlFreeDoc(doc);
+PG_RETURN_XML_P(cstring_to_xmltype((char*)buf));
+
+#else
+NO_XML_SUPPORT();
+return 0;
+#endif
+}
+
 
 Datum
 xmlcomment(PG_FUNCTION_ARGS)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c0f2a8a77c..3224dc3e76 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8842,6 +8842,9 @@
 { oid => '3053', descr => 'determine if a string is well formed XML content',
   proname => 'xml_is_well_formed_content', prorettype => 'bool',
   proargtypes => 'text', prosrc => 'xml_is_well_formed_content' },
+  { oid => '4642', descr => 'Indented text from xml',
+  proname => 'xmlpretty', prorettype => 'xml',
+  proargtypes => 'xml', prosrc => 'xmlpretty' },
 
 # json
 { oid => '321', descr => 'I/O',
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index 3c357a9c7e..98a338ad8d 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1599,3 +1599,110 @@ SELECT * FROM XMLTABLE('.' PASSING XMLELEMENT(NAME a) columns a varchar(20) PATH
   | foo/
 (1 row)
 
+-- XML pretty print: single line XML string
+SELECT xmlpretty('Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650')::xml;
+xmlpretty 
+--
+ +
+ +
+ Belgian Waffles+
+ $5.95+
+ Two of our famous Belgian Waffles with plenty of real maple syrup+
+ 650+
+  +
+   

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Takamichi Osumi (Fujitsu)
Hi,

On Friday, February 3, 2023 2:21 PM Amit Kapila  wrote:
> On Fri, Feb 3, 2023 at 6:41 AM Peter Smith  wrote:
> > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu)
> >  wrote:
> > >
> > ...
> > > > Besides, I am not sure it's a stable test to check the log. Is it
> > > > possible that there's no such log on a slow machine? I modified
> > > > the code to sleep 1s at the beginning of apply_dispatch(), then
> > > > the new added test failed because the server log cannot match.
> > > To get the log by itself is necessary to ensure that the delay is
> > > conducted by the apply worker, because we emit the diffms only if
> > > it's bigger than 0 in maybe_apply_delay(). If we omit the step, we
> > > are not sure the delay is caused by other reasons or the time-delayed
> feature.
> > >
> > > As you mentioned, it's possible that no log is emitted on slow
> > > machine. Then, the idea to make the test safer for such machines should
> be to make the delayed time longer.
> > > But we shortened the delay time to 1 second to mitigate the long test
> execution time of this TAP test.
> > > So, I'm not sure if it's a good idea to make it longer again.
> >
> > I think there are a couple of things that can be done about this problem:
> >
> > 1. If you need the code/test to remain as-is then at least the test
> > message could include some comforting text like "(this can fail on
> > slow machines when the delay time is already exceeded)" so then a test
> > failure will not cause undue alarm.
> >
> > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that
> > it will *always* log the remaining wait time even if that wait time
> > becomes negative. Then I think the test cases can be made
> > deterministic instead of relying on good luck. This seems like the
> > better option.
> >
> 
> I don't understand why we have to do any of this instead of using 3s as
> min_apply_delay similar to what we are doing in
> src/test/recovery/t/005_replay_delay. Also, I think we should use exactly the
> same way to verify the test even though we want to keep the log level as
> DEBUG2 to check logs in case of any failures.
OK, will try to make our tests similar to the tests in 005_replay_delay
as much as possible.
 

> Also, I don't see the need to add more tests like the ones below:
> +# Test whether ALTER SUBSCRIPTION changes the delayed time of the apply
> +worker # (1 day 5 minutes). Note that the extra 5 minute is to account
> +for any # decoding/network overhead.
> 
> Let's try to add tests similar to what we have for recovery_min_apply_delay
> unless there is some functionality in this patch that is not there in the
> recovery_min_apply_delay feature.
The above command is a preparatory step for checking a behavior unique to
time-delayed
logical replication: DISABLE-ing a subscription causes the apply worker
not to apply
the suspended (delayed) transaction. So, it should be OK to have this test.


Best Regards,
Takamichi Osumi





Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-02 Thread shveta malik
On Fri, Feb 3, 2023 at 11:50 AM shveta malik  wrote:
>
> On Thu, Feb 2, 2023 at 5:01 PM shveta malik  wrote:
> >
>
> 2e)
> When it goes to reuse flow (i.e. before walrcv_slot_snapshot), if
> needed we can dump newly obtained origin_startpos also:
>
> ereport(DEBUG2,
> (errmsg("[subid:%d] LogicalRepSyncWorker_%ld reusing slot %s and origin %s",
> MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, slotname,
> originname)));
>

One addition: I think it would be good to add relid as well in the above, so
that we can see for which relid we are reusing the old slot.
Once we have all of the above in the log file, it becomes very easy to
diagnose reuse-sync-worker-related problems just by looking at the
log file.

thanks
Shveta




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-02 Thread shveta malik
On Thu, Feb 2, 2023 at 5:01 PM shveta malik  wrote:
>
> Reviewing further
>

Few more comments for v10-0002 and v7-0001:

1)
+ * need_full_snapshot
+ * if true, create a snapshot able to read all tables,
+ * otherwise do not create any snapshot.
+ *
CreateDecodingContext(..,CreateDecodingContext,..)

--Is the comment correct? Shall we have the same comment here as that of
'CreateDecodingContext'?
 * need_full_snapshot -- if true, must obtain a snapshot able to read all
 *  tables; if false, one that can read only catalogs is acceptable.
This function is not going to create a snapshot anyway. It is just a
pre-step, and then the caller needs to call the 'SnapBuild' functions to
build a snapshot. Here need_full_snapshot decides whether we need changes
for all tables or only for catalog tables, and thus the comment change
is needed.

==

2)

Can we please add more logging:

2a)
when lastusedId is incremented and updated in pg_* table
ereport(DEBUG2,
(errmsg("[subid:%d] Incremented lastusedid
to:%ld",MySubscription->oid, MySubscription->lastusedid)));


Comments for LogicalRepSyncTableStart():

2b ) After every UpdateSubscriptionRel:

ereport(DEBUG2,
(errmsg("[subid:%d] LogicalRepSyncWorker_%ld updated origin to %s and
slot to %s for relid %d",
MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id,
originname, slotname, MyLogicalRepWorker->relid)));


2c )
After walrcv_create_slot:

ereport(DEBUG2,
(errmsg("[subid:%d] LogicalRepSyncWorker_%ld created slot %s",
MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, slotname)));


2d)
After replorigin_create:

ereport(DEBUG2,
(errmsg("[subid:%d] LogicalRepSyncWorker_%ld created origin %s [id: %d]",
MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id,
originname, originid)));


2e)
When it goes to reuse flow (i.e. before walrcv_slot_snapshot), if
needed we can dump newly obtained origin_startpos also:

ereport(DEBUG2,
(errmsg("[subid:%d] LogicalRepSyncWorker_%ld reusing slot %s and origin %s",
MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, slotname,
originname)));


2f)
Also in existing comment:

+ (errmsg("logical replication table synchronization worker for
subscription \"%s\" has moved to sync table \"%s\".",
+ MySubscription->name, get_rel_name(MyLogicalRepWorker->relid))));

we can add relid as well, along with relname. relid is the one stored in
pg_subscription_rel, which makes it easy to map. Also, we can
change 'logical replication table synchronization worker' to
'LogicalRepSyncWorker_%ld'.
The same change is needed in other similar log lines where we say that the
worker started and finished.


Please feel free to change the above log lines as you find
appropriate. I have given just a rough sample.

thanks
Shveta




Re: Use windows VMs instead of windows containers on the CI

2023-02-02 Thread Andres Freund
Hi,

On 2023-02-02 17:47:37 +0300, Nazir Bilal Yavuz wrote:
> On 1/12/2023 3:30 AM, Andres Freund wrote:
> > Yea, there's a problem where packer on windows doesn't seem to abort after a
> > powershell script errors out. The reason isn't yet quite clear. I think
> > Bilal
> > is working on a workaround.
> 
> 
> That should be fixed now. Also, adding a patch for PG15. There were
> conflicts while applying current patch to the REL_15_STABLE branch.

And pushed!  I think an improvement in CI times of this degree is pretty
awesome.


Unfortunately I also noticed that the tap tests on mingw don't run
anymore, due to IPC::Run not being available. But it's independent of
this change. I don't know when that broke. Could you check it out?

Greetings,

Andres Freund




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Amit Kapila
On Fri, Feb 3, 2023 at 11:12 AM Peter Smith  wrote:
>
> On Fri, Feb 3, 2023 at 4:21 PM Amit Kapila  wrote:
> >
> > On Fri, Feb 3, 2023 at 6:41 AM Peter Smith  wrote:
> > >
> > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu)
> > >  wrote:
> > > >
> > > ...
> > > >
> > > >
> > > > > Besides, I am not sure it's a stable test to check the log. Is it 
> > > > > possible that there's
> > > > > no such log on a slow machine? I modified the code to sleep 1s at the 
> > > > > beginning
> > > > > of apply_dispatch(), then the new added test failed because the 
> > > > > server log
> > > > > cannot match.
> > > > To get the log by itself is necessary to ensure
> > > > that the delay is conducted by the apply worker, because we emit the 
> > > > diffms
> > > > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step,
> > > > we are not sure the delay is caused by other reasons or the 
> > > > time-delayed feature.
> > > >
> > > > As you mentioned, it's possible that no log is emitted on slow machine. 
> > > > Then,
> > > > the idea to make the test safer for such machines should be to make the 
> > > > delayed time longer.
> > > > But we shortened the delay time to 1 second to mitigate the long test 
> > > > execution time of this TAP test.
> > > > So, I'm not sure if it's a good idea to make it longer again.
> > >
> > > I think there are a couple of things that can be done about this problem:
> > >
> > > 1. If you need the code/test to remain as-is then at least the test
> > > message could include some comforting text like "(this can fail on
> > > slow machines when the delay time is already exceeded)" so then a test
> > > failure will not cause undue alarm.
> > >
> > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that
> > > it will *always* log the remaining wait time even if that wait time
> > > becomes negative. Then I think the test cases can be made
> > > deterministic instead of relying on good luck. This seems like the
> > > better option.
> > >
> >
> > I don't understand why we have to do any of this instead of using 3s
> > as min_apply_delay similar to what we are doing in
> > src/test/recovery/t/005_replay_delay. Also, I think we should use
> > exactly the same way to verify the test even though we want to keep
> > the log level as DEBUG2 to check logs in case of any failures.
> >
>
> IIUC the reasons are due to conflicting requirements. e.g.
> - A longer delay like 3s might work better for testing this feature, but OTOH
> - A longer delay will also cause the whole BF execution to take longer
>

Sure, but we already have the same test for a similar feature and it
seems to be a proven reliable way to test the feature. We do seem to
have seen buildfarm failures for tests related to
recovery_min_apply_delay and the current way is quite stable, so I
would prefer to go with that.

-- 
With Regards,
Amit Kapila.




Re: O(n) tasks cause lengthy startups and checkpoints

2023-02-02 Thread Nathan Bossart
rebased for cfbot

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From abbd26a3bcfcc828e196187e9f6abf6af64f3393 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 5 Jan 2022 19:24:22 +
Subject: [PATCH v19 1/4] Introduce custodian.

The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing.  This commit simply adds the new process; it does
not yet do anything useful.
---
 doc/src/sgml/glossary.sgml  |  11 +
 src/backend/postmaster/Makefile |   1 +
 src/backend/postmaster/auxprocess.c |   8 +
 src/backend/postmaster/custodian.c  | 377 
 src/backend/postmaster/meson.build  |   1 +
 src/backend/postmaster/postmaster.c |  38 ++-
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/wait_event.c |   3 +
 src/backend/utils/init/miscinit.c   |   3 +
 src/include/miscadmin.h |   3 +
 src/include/postmaster/custodian.h  |  32 ++
 src/include/storage/proc.h  |  11 +-
 src/include/utils/wait_event.h  |   1 +
 14 files changed, 488 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/postmaster/custodian.c
 create mode 100644 src/include/postmaster/custodian.h

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..ad3f53e2a3 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -144,6 +144,7 @@
  (but not the autovacuum workers),
  the background writer,
  the checkpointer,
+ the custodian,
  the logger,
  the startup process,
  the WAL archiver,
@@ -484,6 +485,16 @@

   
 
+  
+   Custodian (process)
+   
+
+ An auxiliary process
+ that is responsible for executing assorted cleanup tasks.
+
+   
+  
+
   
Data area

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	bgworker.o \
 	bgwriter.o \
 	checkpointer.o \
+	custodian.o \
 	fork_process.o \
 	interrupt.o \
 	pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..a1f042f13a 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 		case CheckpointerProcess:
 			MyBackendType = B_CHECKPOINTER;
 			break;
+		case CustodianProcess:
+			MyBackendType = B_CUSTODIAN;
+			break;
 		case WalWriterProcess:
 			MyBackendType = B_WAL_WRITER;
 			break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 			CheckpointerMain();
 			proc_exit(1);
 
+		case CustodianProcess:
+			CustodianMain();
+			proc_exit(1);
+
 		case WalWriterProcess:
 			WalWriterMain();
 			proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 00..98bb9efcfd
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,377 @@
+/*-
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc.  Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding).  However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode).  The custodian is not
+ * an essential process and can shutdown quickly when requested.  The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/postmaster/custodian.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(void);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+	slock_t		cust_lck;
+
+	CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+	int			

Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Peter Smith
On Fri, Feb 3, 2023 at 4:21 PM Amit Kapila  wrote:
>
> On Fri, Feb 3, 2023 at 6:41 AM Peter Smith  wrote:
> >
> > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu)
> >  wrote:
> > >
> > ...
> > >
> > >
> > > > Besides, I am not sure it's a stable test to check the log. Is it 
> > > > possible that there's
> > > > no such log on a slow machine? I modified the code to sleep 1s at the 
> > > > beginning
> > > > of apply_dispatch(), then the new added test failed because the server 
> > > > log
> > > > cannot match.
> > > To get the log by itself is necessary to ensure
> > > that the delay is conducted by the apply worker, because we emit the 
> > > diffms
> > > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step,
> > > we are not sure the delay is caused by other reasons or the time-delayed 
> > > feature.
> > >
> > > As you mentioned, it's possible that no log is emitted on slow machine. 
> > > Then,
> > > the idea to make the test safer for such machines should be to make the 
> > > delayed time longer.
> > > But we shortened the delay time to 1 second to mitigate the long test 
> > > execution time of this TAP test.
> > > So, I'm not sure if it's a good idea to make it longer again.
> >
> > I think there are a couple of things that can be done about this problem:
> >
> > 1. If you need the code/test to remain as-is then at least the test
> > message could include some comforting text like "(this can fail on
> > slow machines when the delay time is already exceeded)" so then a test
> > failure will not cause undue alarm.
> >
> > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that
> > it will *always* log the remaining wait time even if that wait time
> > becomes negative. Then I think the test cases can be made
> > deterministic instead of relying on good luck. This seems like the
> > better option.
> >
>
> I don't understand why we have to do any of this instead of using 3s
> as min_apply_delay similar to what we are doing in
> src/test/recovery/t/005_replay_delay. Also, I think we should use
> exactly the same way to verify the test even though we want to keep
> the log level as DEBUG2 to check logs in case of any failures.
>

IIUC the reasons are due to conflicting requirements. e.g.
- A longer delay like 3s might work better for testing this feature, but OTOH
- A longer delay will also cause the whole BF execution to take longer

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Amit Kapila
On Fri, Feb 3, 2023 at 8:02 AM Peter Smith  wrote:
>
> I think there is some misunderstanding. I was not suggesting removing
> the condition -- only that I thought it could be written without the >
> 0 as:
>
> if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
> opts->min_apply_delay && opts->streaming == LOGICALREP_STREAM_PARALLEL)
> ereport(ERROR,
>

Yeah, we can probably write that way but in the error message we are
already using > 0, so the current style used by the patch seems good to
me. Also, I think using the way you are suggesting is more apt for
booleans.

-- 
With Regards,
Amit Kapila.




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Nathan Bossart
On Thu, Feb 02, 2023 at 02:39:19PM -0800, Nathan Bossart wrote:
> Maybe we could just
> remove this exit-in-SIGTERM-handler business...

I've spent some time testing this.  It seems to work pretty well, but only
if I keep the exit-on-SIGTERM logic in shell_restore().  Without that, I'm
seeing delayed shutdowns, which I assume means
HandleStartupProcInterrupts() isn't getting called (I'm still investigating
this).  In any case, the fact that shell_restore() exits if the command
fails due to SIGTERM seems like an implementation detail that we won't
necessarily want to rely on once recovery modules are available.  In short,
we seem to depend on the SIGTERM handling in RestoreArchivedFile() in order
to be responsive to shutdown requests.

One idea I have is to approximate the current behavior by simply checking
for the shutdown_requested flag before and after executing
restore_command.  This seems to work as desired even if the exit-on-SIGTERM
logic is removed from shell_restore().  Unless there is some reason to
break out of system() (versus just waiting for the command to fail after it
receives SIGTERM), I think this approach should suffice.

I've attached a draft patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 4b89addf97..56c8bf8c18 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -148,16 +148,17 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 		XLogFileName(lastRestartPointFname, 0, 0L, wal_segment_size);
 
 	/*
-	 * Check signals before restore command and reset afterwards.
+	 * Check for pending shutdown requests before and after executing
+	 * restore_command and exit if there is one.
 	 */
-	PreRestoreCommand();
+	HandleStartupProcShutdownRequest();
 
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
 	ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);
 
-	PostRestoreCommand();
+	HandleStartupProcShutdownRequest();
 
 	if (ret)
 	{
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index bcd23542f1..30c67a670c 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -55,12 +55,6 @@ static volatile sig_atomic_t got_SIGHUP = false;
 static volatile sig_atomic_t shutdown_requested = false;
 static volatile sig_atomic_t promote_signaled = false;
 
-/*
- * Flag set when executing a restore command, to tell SIGTERM signal handler
- * that it's safe to just proc_exit.
- */
-static volatile sig_atomic_t in_restore_command = false;
-
 /*
  * Time at which the most recent startup operation started.
  */
@@ -120,10 +114,7 @@ StartupProcShutdownHandler(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
 
-	if (in_restore_command)
-		proc_exit(1);
-	else
-		shutdown_requested = true;
+	shutdown_requested = true;
 	WakeupRecovery();
 
 	errno = save_errno;
@@ -183,8 +174,7 @@ HandleStartupProcInterrupts(void)
 	/*
 	 * Check if we were requested to exit without finishing recovery.
 	 */
-	if (shutdown_requested)
-		proc_exit(1);
+	HandleStartupProcShutdownRequest();
 
 	/*
 	 * Emergency bailout if postmaster has died.  This is to avoid the
@@ -273,26 +263,16 @@ StartupProcessMain(void)
 	proc_exit(0);
 }
 
+/*
+ * Exit if there is a pending shutdown request.
+ */
 void
-PreRestoreCommand(void)
+HandleStartupProcShutdownRequest(void)
 {
-	/*
-	 * Set in_restore_command to tell the signal handler that we should exit
-	 * right away on SIGTERM. We know that we're at a safe point to do that.
-	 * Check if we had already received the signal, so that we don't miss a
-	 * shutdown request received just before this.
-	 */
-	in_restore_command = true;
 	if (shutdown_requested)
 		proc_exit(1);
 }
 
-void
-PostRestoreCommand(void)
-{
-	in_restore_command = false;
-}
-
 bool
 IsPromoteSignaled(void)
 {
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index dd957f9291..fc20eea1d4 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -27,8 +27,7 @@ extern PGDLLIMPORT int log_startup_progress_interval;
 
 extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(void) pg_attribute_noreturn();
-extern void PreRestoreCommand(void);
-extern void PostRestoreCommand(void);
+extern void HandleStartupProcShutdownRequest(void);
 extern bool IsPromoteSignaled(void);
 extern void ResetPromoteSignaled(void);
 


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Amit Kapila
On Fri, Feb 3, 2023 at 6:41 AM Peter Smith  wrote:
>
> On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> ...
> >
> >
> > > Besides, I am not sure it's a stable test to check the log. Is it 
> > > possible that there's
> > > no such log on a slow machine? I modified the code to sleep 1s at the 
> > > beginning
> > > of apply_dispatch(), then the new added test failed because the server log
> > > cannot match.
> > To get the log by itself is necessary to ensure
> > that the delay is conducted by the apply worker, because we emit the diffms
> > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step,
> > we are not sure the delay is caused by other reasons or the time-delayed 
> > feature.
> >
> > As you mentioned, it's possible that no log is emitted on slow machine. 
> > Then,
> > the idea to make the test safer for such machines should be to make the 
> > delayed time longer.
> > But we shortened the delay time to 1 second to mitigate the long test 
> > execution time of this TAP test.
> > So, I'm not sure if it's a good idea to make it longer again.
>
> I think there are a couple of things that can be done about this problem:
>
> 1. If you need the code/test to remain as-is then at least the test
> message could include some comforting text like "(this can fail on
> slow machines when the delay time is already exceeded)" so then a test
> failure will not cause undue alarm.
>
> 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that
> it will *always* log the remaining wait time even if that wait time
> becomes negative. Then I think the test cases can be made
> deterministic instead of relying on good luck. This seems like the
> better option.
>

I don't understand why we have to do any of this instead of using 3s
as min_apply_delay similar to what we are doing in
src/test/recovery/t/005_replay_delay. Also, I think we should use
exactly the same way to verify the test even though we want to keep
the log level as DEBUG2 to check logs in case of any failures.

Also, I don't see the need to add more tests like the ones below:
+# Test whether ALTER SUBSCRIPTION changes the delayed time of the apply worker
+# (1 day 5 minutes). Note that the extra 5 minute is to account for any
+# decoding/network overhead.

Let's try to add tests similar to what we have for
recovery_min_apply_delay unless there is some functionality in this
patch that is not there in the recovery_min_apply_delay feature.

-- 
With Regards,
Amit Kapila.




Re: Deadlock between logrep apply worker and tablesync worker

2023-02-02 Thread Amit Kapila
On Thu, Feb 2, 2023 at 4:51 PM Amit Kapila  wrote:
>
> Thanks for the tests. I also see a reduction in test time variability
> with Vignesh's patch. I think we can release the locks in case the
> origin is concurrently dropped as in the attached patch. I am planning
> to commit this patch tomorrow unless there are more comments or
> objections.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Logical replication timeout problem

2023-02-02 Thread Amit Kapila
On Wed, Feb 1, 2023 at 10:04 AM Amit Kapila  wrote:
>
> On Tue, Jan 31, 2023 at 8:24 PM Ashutosh Bapat
>  wrote:
> >
> > On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat
> > >  wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > > Thanks, the patch looks good to me. I have slightly adjusted one of
> > > > > the comments and ran pgindent. See attached. As mentioned in the
> > > > > commit message, we shouldn't backpatch this as this requires a new
> > > > > callback and moreover, users can increase the wal_sender_timeout and
> > > > > wal_receiver_timeout to avoid this problem. What do you think?
> > > >
> > > > The callback and the implementation is all in core. What's the risk
> > > > you see in backpatching it?
> > > >
> > >
> > > Because we are changing the exposed structure and which can break
> > > existing extensions using it.
> >
> > Is that because we are adding the new member in the middle of the
> > structure?
> >
>
> Not only that but this changes the size of the structure and we want
> to avoid that as well in stable branches. See email [1] (you can't
> change the struct size either ...). As per my understanding, our usual
> practice is to not change the exposed structure's size/definition in
> stable branches.
>
>

I am planning to push this to HEAD sometime next week (by Wednesday).
To backpatch this, we need to fix it in some non-standard way, like
without introducing a callback which I am not sure is a good idea. If
some other committers vote to get this in back branches with that or
some different idea that can be backpatched then we can do that
separately as well. I don't see this as a must-fix in back branches
because we have a workaround (increase timeout) or users can use the
streaming option (for >=14).

-- 
With Regards,
Amit Kapila.




Re: Index problem Need an urgent fix

2023-02-02 Thread Julien Rouhaud
hi,

On Fri, Feb 3, 2023 at 11:33, chanukya SDS  wrote:

> Hey All,
>
> I have a query like below
>
>
> SELECT * FROM data WHERE val=trunc(val) AND
> acc_id='kfd50ed6-0bc3-44a9-881f-ec89713fdd80'::uuid ORDER BY ct DESC LIMIT
> 10;
>
> table structure is
>
>  data
> (id uuid,
>   c_id uuid,
>   acc_id uuid,
>   val numeric,
>   ct timestamptz);
>
>
> Can you please help me to write an index?
>
> or Can someone write an index and revert here..
>
> its very urgent for me.
>

pgsql-hackers isn't a suitable mailing list for this question; you should
use pgsql-general or pgsql-performance. Please also look at
https://wiki.postgresql.org/wiki/Slow_Query_Questions for details to
provide.

Note that if you're not really sure what index would help and don't want
to disturb your database too much (and assuming you don't have another
environment to test things on), you can use hypopg to create hypothetical
indexes and see how various index definitions would behave without actually
creating them: https://github.com/HypoPG/hypopg

>


Re: Index problem Need an urgent fix

2023-02-02 Thread Bharath Rupireddy
On Fri, Feb 3, 2023 at 9:03 AM chanukya SDS  wrote:
>
> Hey All,
>
> I have a query like below
>
> SELECT * FROM data WHERE val=trunc(val) AND 
> acc_id='kfd50ed6-0bc3-44a9-881f-ec89713fdd80'::uuid ORDER BY ct DESC LIMIT 10;
>
> table structure is
>
>  data
> (id uuid,
>   c_id uuid,
>   acc_id uuid,
>   val numeric,
>   ct timestamptz);
>
> Can you please help me to write an index?
>
> or Can someone write an index and revert here..
>
> its very urgent for me.

Thanks for reaching out. The question is rather generic, and the actual
problem isn't described. Is the SELECT query taking more time? If
yes, why? Have you looked at EXPLAIN (ANALYZE...) output? Have you
tried to tune/vary configuration settings to see if it helps?

Also, the best place to ask this question is pgsql-general or pgsql-sql.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: when the startup process doesn't (logging startup delays)

2023-02-02 Thread Bharath Rupireddy
On Fri, Feb 3, 2023 at 2:29 AM Robert Haas  wrote:
>

Thanks for looking at this.

> On Tue, Nov 22, 2022 at 6:05 AM Bharath Rupireddy
>  wrote:
> > If we place just the Assert(!StandbyMode); in
> > enable_startup_progress_timeout(), it fails for
> > begin_startup_progress_phase() in ResetUnloggedRelations() because the
> > InitWalRecovery(), that sets StandbyMode to true, is called before
> > ResetUnloggedRelations() . However, with the if (StandbyMode) {
> > return; }, we fail to report progress of ResetUnloggedRelations() in a
> > standby, which isn't a good idea at all because we only want to
> > disable the timeout during the recovery's main loop.
>
> Ugh. Well, in that case, I guess my vote is to forget about this whole
> Assert business and just commit what you had in v4. Does that work for
> you?

Yes, it seems reasonable to me.

> Protecting against specifically the situation where we're in the
> standby's main redo apply loop is not really what I had in mind here,
> but this is already sort of weirdly complicated-looking, and making it
> more weirdly complicated-looking to achieve the kind of protection
> that I had in mind doesn't really seem worth it.

IMHO, the responsibility of whether or not to report progress of any
operation can lie with the developers writing features using the
progress reporting mechanism. The progress reporting mechanism can
just be independent of all that.

I took the v4 patch, added some comments and attached it as the v7
patch here. Please find it.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 0ff4bc13db8abf70f8da158d11eecdfbe7a7e199 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 3 Feb 2023 03:30:49 +
Subject: [PATCH v7] Disable STARTUP_PROGRESS_TIMEOUT in standby mode

In standby mode, we don't report the progress of recovery, so as not
to bloat server logs, as the standby is always in recovery unless
promoted. However, startup_progress_timeout_handler() gets called
every log_startup_progress_interval seconds, which is unnecessary.

Therefore, we disable the timeout in standby mode.

Reported-by: Thomas Munro
Author: Bharath Rupireddy
Reviewed-by: Robert Haas, Simon Riggs
Reviewed-by: Thomas Munro
Backpatch-through: 15
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c | 25 ---
 src/backend/postmaster/startup.c  | 30 ---
 src/include/postmaster/startup.h  |  2 ++
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 2a5352f879..dbe9394762 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -385,6 +385,7 @@ static bool recoveryStopAfter;
 /* prototypes for local functions */
 static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *replayTLI);
 
+static void EnableStandbyMode(void);
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static bool read_backup_label(XLogRecPtr *checkPointLoc,
@@ -469,6 +470,24 @@ XLogRecoveryShmemInit(void)
	ConditionVariableInit(&XLogRecoveryCtl->recoveryNotPausedCV);
 }
 
+/*
+ * A thin wrapper to enable StandbyMode and do other preparatory work as
+ * needed.
+ */
+static void
+EnableStandbyMode(void)
+{
+	StandbyMode = true;
+
+	/*
+	 * To avoid server log bloat, we don't report recovery progress in a
+	 * standby as it will always be in recovery unless promoted. We disable
+	 * startup progress timeout in standby mode to avoid calling
+	 * startup_progress_timeout_handler() unnecessarily.
+	 */
+	disable_startup_progress_timeout();
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -602,7 +621,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 */
 		InArchiveRecovery = true;
 		if (StandbyModeRequested)
-			StandbyMode = true;
+			EnableStandbyMode();
 
 		/*
 		 * When a backup_label file is present, we want to roll forward from
@@ -739,7 +758,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			InArchiveRecovery = true;
 			if (StandbyModeRequested)
-StandbyMode = true;
+EnableStandbyMode();
 		}
 
 		/* Get the last valid checkpoint record. */
@@ -3117,7 +3136,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
 InArchiveRecovery = true;
 if (StandbyModeRequested)
-	StandbyMode = true;
+	EnableStandbyMode();
 
 SwitchIntoArchiveRecovery(xlogreader->EndRecPtr, replayTLI);
 minRecoveryPoint = xlogreader->EndRecPtr;
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index bcd23542f1..efc2580536 100644
--- 

Re: How to write a new tuple into page?

2023-02-02 Thread Bharath Rupireddy
On Thu, Feb 2, 2023 at 7:31 PM jack...@gmail.com  wrote:
>
> Hi, I'm trying to construct a new tuple type that's not a heap tuple.
> When I get a TupleTableSlot, I will get the data info from it and then I
> will construct a new tuple, and now I need to put it into a physical
> page. How should I do that?

Postgres writes table pages from shared buffers/buffer pool to disk
via storage manager (smgr.c). I think looking at the code around
FlushBuffer()/FlushRelationBuffers()/smgrwrite() might help.
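
To connect that with the question: a backend normally places the tuple on a
page while the page sits in shared buffers, marks the buffer dirty, and the
actual write to disk then happens later through the smgr path mentioned above.
A very rough sketch of that in-buffer step (free-space search, WAL logging and
critical sections are deliberately omitted, so treat this as an illustration
rather than a complete recipe):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Rough sketch: add an already-formed item (e.g. a custom tuple) to an
 * existing page of "rel" via the buffer manager.  Real code must also pick
 * a page with enough free space, WAL-log the change inside a critical
 * section, and fill whatever header fields the new tuple format needs.
 */
static void
put_item_on_page(Relation rel, BlockNumber blkno, Item item, Size itemsz)
{
	Buffer		buf = ReadBuffer(rel, blkno);
	Page		page;
	OffsetNumber offnum;

	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
	page = BufferGetPage(buf);

	offnum = PageAddItem(page, item, itemsz, InvalidOffsetNumber, false, false);
	if (offnum == InvalidOffsetNumber)
		elog(ERROR, "not enough space to add item to block %u", blkno);

	/* the checkpointer/bgwriter will eventually write the page out via smgr */
	MarkBufferDirty(buf);
	UnlockReleaseBuffer(buf);
}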

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Index problem Need an urgent fix

2023-02-02 Thread chanukya SDS
Hey All,

I have a query like below


SELECT * FROM data WHERE val=trunc(val) AND
acc_id='kfd50ed6-0bc3-44a9-881f-ec89713fdd80'::uuid ORDER BY ct DESC LIMIT
10;

table structure is

 data
(id uuid,
  c_id uuid,
  acc_id uuid,
  val numeric,
  ct timestamptz);


Can you please help me to write an index?

Or can someone write an index and reply here?

It's very urgent for me.

Thanks
-- 
Thanks,
Chanukya SDS
+918186868384

Sent From My iPhone


RE: Perform streaming logical transactions by background workers and parallel apply

2023-02-02 Thread houzj.f...@fujitsu.com
On Friday, February 3, 2023 11:04 AM Amit Kapila  
wrote:
> 
> On Thu, Feb 2, 2023 at 4:52 AM Peter Smith 
> wrote:
> >
> > Some minor review comments for v91-0001
> >
> 
> Pushed this yesterday after addressing your comments!

Thanks for pushing.

Currently, we have two remaining patches which we are not sure whether it's 
worth
committing for now. Just share them here for reference.

0001:

Based on our discussion[1] on -hackers, it's not clear whether it's necessary
to add the sub-feature to stop extra workers when
max_apply_workers_per_subscription is reduced. Because:

- it's not clear whether reducing the 'max_apply_workers_per_subscription' is
  very common.
- even when the GUC is reduced, at that point in time all the workers might be
  in use so there may be nothing that can be immediately done.
- IIUC the excess workers (for a reduced GUC) are going to get freed naturally
  anyway over time as more transactions are completed so the pool size will
  reduce accordingly.

And given that the logic of this patch is simple, it would be easy to add this
at a later point if we really see a use case for this.

0002:

All the deadlock errors and other errors caused by parallel streaming
will be logged, so the user can check for this kind of ERROR and disable the
parallel streaming mode to resolve it. Besides, for this retry feature, it would
be hard to distinguish whether an ERROR was caused by parallel streaming, and we
might need to retry in serialize mode for all kinds of ERRORs. So, it's not very
clear whether automatically using serialize mode to retry in case of any ERROR in
parallel streaming is necessary or not. And we can also add this when we see a use case.

[1] 
https://www.postgresql.org/message-id/CAA4eK1LotEuPsteuJMNpixxTj6R4B8k93q-6ruRmDzCxKzMNpA%40mail.gmail.com

Best Regards,
Hou zj


v92-0001-Stop-extra-worker-if-GUC-was-changed.patch
Description: v92-0001-Stop-extra-worker-if-GUC-was-changed.patch


v92-0002-Retry-to-apply-streaming-xact-only-in-apply-work.patch
Description:  v92-0002-Retry-to-apply-streaming-xact-only-in-apply-work.patch


Re: Perform streaming logical transactions by background workers and parallel apply

2023-02-02 Thread Amit Kapila
On Thu, Feb 2, 2023 at 4:52 AM Peter Smith  wrote:
>
> Some minor review comments for v91-0001
>

Pushed this yesterday after addressing your comments!

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Peter Smith
Here are my review comments for patch v26-0001.

On Thu, Feb 2, 2023 at 7:18 PM Takamichi Osumi (Fujitsu)
 wrote:
>
> Hi,
>
> On Wednesday, February 1, 2023 1:37 PM Peter Smith  
> wrote:
> > Here are my review comments for the patch v25-0001.
> Thank you for your review !
>

> > 8.
> > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
> > + opts->min_apply_delay > 0 && opts->streaming ==
> > + opts->LOGICALREP_STREAM_PARALLEL)
> > + ereport(ERROR,
> > + errcode(ERRCODE_SYNTAX_ERROR),
> >
> > Saying "> 0" (in the condition) is not strictly necessary here, since it is 
> > never < 0.
> This check is necessary.
>
> For example, imagine a case when we CREATE a subscription with streaming = on
> and then try to ALTER the subscription with streaming = parallel
> without any settings for min_apply_delay. The ALTER command
> throws an error of "min_apply_delay > 0 and streaming = parallel are
> mutually exclusive options." then.
>
> This is because min_apply_delay is supported by ALTER command
> (so the first condition becomes true) and we set
> streaming = parallel (which makes the 2nd condition true).
>
> So, we need to check the opts's actual min_apply_delay value
> to make the irrelevant case pass.

I think there is some misunderstanding. I was not suggesting removing
the condition -- only that I thought it could be written without the >
0 as:

if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
opts->min_apply_delay && opts->streaming == LOGICALREP_STREAM_PARALLEL)
ereport(ERROR,

> > ~~~
> >
> > 9. AlterSubscription
> >
> > + /*
> > + * The combination of parallel streaming mode and
> > + * min_apply_delay is not allowed. See
> > + * parse_subscription_options for details of the reason.
> > + */
> > + if (opts.streaming == LOGICALREP_STREAM_PARALLEL) if
> > + ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) &&
> > opts.min_apply_delay > 0) ||
> > + (!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) &&
> > sub->minapplydelay > 0))
> >
> > Saying "> 0" (in the condition) is not strictly necessary here, since it is 
> > never < 0.
> This is also necessary.
>
> For example, imagine a case that
> there is a subscription whose min_apply_delay is 1 day.
> Then, you want to try to execute ALTER SUBSCRIPTION
> with (min_apply_delay = 0, streaming = parallel).
> If we remove the condition of otps.min_apply_delay > 0,
> then we error out in this case too.
>
> First we pass the first condition
> of the opts.streaming == LOGICALREP_STREAM_PARALLEL,
> since we use streaming option.
> Then, we also set min_apply_delay in this example,
> then without checking the value of min_apply_delay,
> the second condition becomes true
> (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)).
>
> So, we need to make this case(min_apply_delay = 0) pass.
> Meanwhile, checking the "sub" value is necessary for checking existing 
> subscription value.

I think there is some misunderstanding. I was not suggesting removing
the condition -- only that I thought it could be written without the >
0 as::

if (opts.streaming == LOGICALREP_STREAM_PARALLEL)
if ((IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) &&
opts.min_apply_delay) ||
(!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay))
ereport(ERROR,

> > ~~~
> >
> > 10.
> > + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) {
> > + /*
> > + * The combination of parallel streaming mode and
> > + * min_apply_delay is not allowed.
> > + */
> > + if (opts.min_apply_delay > 0)
> >
> > Saying "> 0" (in the condition) is not strictly necessary here, since it is 
> > never < 0.
> This is also required to check the value equals to 0 or not.
> Kindly imagine a case when we want to execute ALTER min_apply_delay from 1day
> with a pair of (min_apply_delay = 0 and
> streaming = parallel). If we remove this check, then this ALTER command fails
> with error. Without the check, when we set min_apply_delay
> and parallel streaming mode, even when making the min_apply_delay 0,
> the error is invoked.
>
> The check for sub.stream is necessary for existing definition of target 
> subscription.

I think there is some misunderstanding. I was not suggesting removing
the condition -- only that I thought it could be written without the >
0 as::

if (opts.min_apply_delay)
if ((IsSet(opts.specified_opts, SUBOPT_STREAMING) && opts.streaming ==
LOGICALREP_STREAM_PARALLEL) ||
(!IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream ==
LOGICALREP_STREAM_PARALLEL))
ereport(ERROR,

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: heapgettup refactoring

2023-02-02 Thread David Rowley
On Fri, 3 Feb 2023 at 06:23, Melanie Plageman  wrote:
> Your code also switches to saving the OffsetNumber -- just in a separate
> variable of OffsetNumber type. I am fine with this if it your rationale
> is that it is not a good idea to store a smaller number in a larger
> datatype. However, the benefit I saw in reusing rs_cindex is that we
> could someday converge the code for heapgettup() and
> heapgettup_pagemode() even more. Even in my final refactor, there is a
> lot of duplicate code between the two.

I was more concerned about the reuse of an unrelated field.  I'm
struggling to imagine why using the separate field would cause any
issues around not being able to reduce the code duplication any more
than we otherwise would. Surely in one case you need to get the offset
by indexing the rs_vistuples[] array and in the other it is the offset
directly. The only thing I can think of that would allow us not to
have a condition there would be if we populated the rs_vistuples[]
array with 1..n.  I doubt we should do that, and if we did, we could just
use the rs_cindex to index that without having to worry that we're
using an unrelated field for something.
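
To spell out the two lookups being compared (a sketch only; rs_vistuples and
rs_cindex are existing scan fields, while the field holding the offset in
non-pagemode is named here purely for illustration):

#include "postgres.h"
#include "access/heapam.h"

/*
 * Sketch: how the current offset is obtained in the two modes.  "rs_coffset"
 * is a stand-in name for wherever the offset ends up being stored directly
 * in non-page-at-a-time mode.
 */
static OffsetNumber
current_heap_offset(HeapScanDesc scan, bool pagemode)
{
	if (pagemode)
		return scan->rs_vistuples[scan->rs_cindex]; /* index the visible-offset array */

	return scan->rs_coffset;	/* offset stored directly (assumed name) */
}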

I've pushed all but the final 2 patches now.

David




Re: CI and test improvements

2023-02-02 Thread Thomas Munro
On Thu, Feb 2, 2023 at 2:12 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Some observations:
> > So what should our policy be on when to roll the CI image forward?  I
> > guess around New Year/now (~6 months after release) is a good time and
> > we should just do it.  Anyone got a reason why we should wait?  Our
> > other CI OSes have slower major version release cycles and longer
> > lives, so it's not quite the same hamster wheel of upgrades.
>
> I'd argue that developers are probably the kind of people who update
> their OS sooner rather than later --- I've usually updated my laptop
> and at least one BF animal to $latest macOS within a month or so of
> the dot-zero release.  So waiting 6 months seems to me like CI will be
> behind the users, which will be unhelpful.  I'd rather drop the oldest
> release sooner, if we need to hold down the number of macOS revisions
> under test.

Cool.  Done.

Out of curiosity, I wondered how the "graphical installer" packagers
like EDB and Postgres.app choose a target, when Apple is moving so
fast.  I see that the current EDB installers target 10.14 for PG15,
which was 5 years old at initial release, and thus already EOL'd for 2
years.  Postgres.app goes back one more year.  In other words, even
though that preadv/pwritev "decl" stuff is unnecessary for PG16 if you
think we should only target OSes that the vendor still supports (which
will be 12, 13, 14), someone would still shout at me if I removed it.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-02 Thread Peter Smith
On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu)
 wrote:
>
...
>
>
> > Besides, I am not sure it's a stable test to check the log. Is it possible 
> > that there's
> > no such log on a slow machine? I modified the code to sleep 1s at the 
> > beginning
> > of apply_dispatch(), then the new added test failed because the server log
> > cannot match.
> Getting the log is itself necessary to ensure
> that the delay is performed by the apply worker, because we emit the diffms
> only if it's bigger than 0 in maybe_apply_delay(). If we omit that step,
> we cannot be sure whether the delay was caused by the time-delayed feature
> or by other reasons.
>
> As you mentioned, it's possible that no log is emitted on slow machine. Then,
> the idea to make the test safer for such machines should be to make the 
> delayed time longer.
> But we shortened the delay time to 1 second to mitigate the long test 
> execution time of this TAP test.
> So, I'm not sure if it's a good idea to make it longer again.

I think there are a couple of things that can be done about this problem:

1. If you need the code/test to remain as-is then at least the test
message could include some comforting text like "(this can fail on
slow machines when the delay time is already exceeded)" so then a test
failure will not cause undue alarm.

2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that
it will *always* log the remaining wait time even if that wait time
becomes negative. Then I think the test cases can be made
deterministic instead of relying on good luck. This seems like the
better option.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Support logical replication of DDLs

2023-02-02 Thread Peter Smith
Here are some review comments for patch v63-0001.

==
General

1.
(This is not really a review comment - more just an observation...)

This patch seemed mostly like an assortment of random changes that
don't seem to have anything in common except that some *later* patches
of this set are apparently going to want them.

Now maybe doing it this way was the best and neatest thing to do --
I'm not sure. But my first impression was I felt this has gone too far
in some places -- e.g. perhaps some of these changes would have been
better deferred until they are *really* needed instead of just
plonking a whole lot of un-called (i.e. untested) code into patch
0001.


==
Commit message

2.
2) Some of the prototype and structures were moved from pg_publication.h
   to publicationcmds.h as one of the later patch requires inclusion of
   pg_publication.h and these prototype had references to server header
   files.

SUGGESTION (?)
2) Some prototypes and structures were moved from pg_publication.h to
publicationcmds.h. This was because one of the later patches required
the inclusion of pg_publication.h and these prototypes had references
to server header files.


==
src/backend/catalog/aclchk.c

3. ExecuteGrantStmt

+ /* Copy the grantor id needed for DDL deparsing of Grant */
+ istmt.grantor_uid = grantor;
+

SUGGESTION (comment)
Copy the grantor id to the parsetree, needed for DDL deparsing of Grant

==
src/backend/catalog/objectaddress.c

4. getObjectIdentityParts

@@ -5922,7 +5922,7 @@ getObjectIdentityParts(const ObjectAddress *object,
  transformType = format_type_be_qualified(transform->trftype);
  transformLang = get_language_name(transform->trflang, false);

- appendStringInfo(, "for %s on language %s",
+ appendStringInfo(, "for %s language %s",
  transformType,
  transformLang);

There is no clue anywhere what this change was for.

Perhaps this ought to be mentioned in the Commit Message.

Alternatively, maybe defer this change until it becomes clearer who needs it?

==
src/backend/commands/collationcmds.c

5.
+ /*
+ * Make from existing collationid available to callers for statement such as
+ * CREATE COLLATION any_name FROM any_name
+ */
+ if (from_existing_collid && OidIsValid(collid))
+ ObjectAddressSet(*from_existing_collid, CollationRelationId, collid);

"for statement such as" --> "for statements such as"

==
src/backend/commands/event_trigger.c

6.
+EventTriggerQueryState *currentEventTriggerState = NULL;

It seems overkill to make this non-static here. I didn't find anybody
using this variable from outside this source, so unless this was a
mistake I guess it's preparing the ground for some future patch.
Either way, it didn't seem like this belonged in patch 0001.

==
src/backend/commands/sequence.c

7.
+Form_pg_sequence_data
+get_sequence_values(Oid sequenceId)
+{
+ Buffer  buf;
+ SeqTable elm;
+ Relation seqrel;
+ HeapTupleData seqtuple;
+ Form_pg_sequence_data seq;
+ Form_pg_sequence_data retSeq;
+
+ /* Open and AccessShareLock sequence */
+ init_sequence(sequenceId, &elm, &seqrel);
+
+ if (pg_class_aclcheck(sequenceId, GetUserId(),
+ ACL_SELECT | ACL_UPDATE | ACL_USAGE) != ACLCHECK_OK)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied for sequence %s",
+ RelationGetRelationName(seqrel))));
+
+ seq = read_seq_tuple(seqrel, &buf, &seqtuple);
+ retSeq = palloc(sizeof(FormData_pg_sequence_data));
+
+ memcpy(retSeq, seq, sizeof(FormData_pg_sequence_data));
+
+ UnlockReleaseBuffer(buf);
+ relation_close(seqrel, NoLock);
+
+ return retSeq;
+}

IMO the palloc might be better done up-front when the retSeq was declared.

==
src/backend/tcop/utility.c

8.
+/*
+ * Return the given object type as a string.
+ */
+const char *
+stringify_objtype(ObjectType objtype, bool isgrant)
+{
+ switch (objtype)
+ {
+ case OBJECT_AGGREGATE:
+ return "AGGREGATE";
+ case OBJECT_CAST:
+ return "CAST";
+ case OBJECT_COLLATION:
+ return "COLLATION";
+ case OBJECT_COLUMN:
+ return isgrant ? "TABLE" : "COLUMN";
+ case OBJECT_CONVERSION:
+ return "CONVERSION";
+ case OBJECT_DATABASE:
+ return "DATABASE";
+ case OBJECT_DOMAIN:
+ return "DOMAIN";
+ case OBJECT_EVENT_TRIGGER:
+ return "EVENT TRIGGER";
+ case OBJECT_EXTENSION:
+ return "EXTENSION";
+ case OBJECT_FDW:
+ return "FOREIGN DATA WRAPPER";
+ case OBJECT_FOREIGN_SERVER:
+ return isgrant ? "FOREIGN SERVER" : "SERVER";
+ case OBJECT_FOREIGN_TABLE:
+ return "FOREIGN TABLE";

That 'isgrant' param seemed a bit hacky.

At least some comment should be given (maybe in the function header?)
to explain why this boolean is modifying the return string.

Or maybe it is better to have another stringify_objtype_for_grant that
just wraps this?
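
For illustration, the kind of wrapper I mean is a one-liner living next to
stringify_objtype in utility.c (a sketch only; the name is just a suggestion):

const char *
stringify_objtype_for_grant(ObjectType objtype)
{
	return stringify_objtype(objtype, true);
}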

==
src/backend/utils/adt/regproc.c

9.
+
+/*
+ * Append the parenthesized arguments of the given pg_proc row into the output
+ * buffer. force_qualify indicates whether to schema-qualify type names
+ * regardless of visibility.
+ */
+static void
+format_procedure_args_internal(Form_pg_proc 

Re: Remove some useless casts to (void *)

2023-02-02 Thread Corey Huinker
On Thu, Feb 2, 2023 at 5:22 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> I have found that in some corners of the code some calls to standard C
> functions are decorated with casts to (void *) for no reason, and this
> code pattern then gets copied around.  I have gone through and cleaned
> this up a bit, in the attached patches.
>
> The involved functions are: repalloc, memcpy, memset, memmove, memcmp,
> qsort, bsearch
>
> Also hash_search(), for which there was a historical reason (the
> argument used to be char *), but not anymore.


+1

All code is example code.

Applies.
Passes make check-world.


Re: Amcheck verification of GiST and GIN

2023-02-02 Thread Peter Geoghegan
On Thu, Feb 2, 2023 at 12:56 PM Nikolay Samokhvalov
 wrote:
> I already realized my mistake – indeed, having multiple errors for 1 index
> doesn't seem to be super practically helpful.

I wouldn't mind supporting it if the cost wasn't too high. But I
believe that it's not a good trade-off.

>> I think that that problem should be solved at a higher level, in the
>> program that runs amcheck. Note that pg_amcheck will already do this
>> for B-Tree indexes.
>
>
> That's a great tool, and it's great it supports parallelization, very useful
> on large machines.

Another big advantage of just using pg_amcheck is that running each
index verification in a standalone query avoids needlessly holding the
same MVCC snapshot across all indexes verified (compared to running
one big SQL query that verifies multiple indexes). As simple as
pg_amcheck's approach is (it's doing nothing that you couldn't
replicate in a shell script), in practice its standardized
approach probably makes things a lot smoother, especially in terms of
how VACUUM is impacted.

-- 
Peter Geoghegan




Re: Remove unused code related to unknown type

2023-02-02 Thread Tom Lane
Peter Eisentraut  writes:
> Here is a patch that removes some unused leftovers from commit 
> cfd9be939e9c516243c5b6a49ad1e1a9a38f1052 (old).

Ugh.  Those are outright wrong now, aren't they?  Better nuke 'em.

regards, tom lane




Re: Move defaults toward ICU in 16?

2023-02-02 Thread Tom Lane
Thomas Munro  writes:
> It's still important to have libc support as an option, though,
> because it's a totally reasonable thing to want sort order to agree
> with the "sort" command on the same host, and you are willing to deal
> with all the complexities that we're trying to escape.

Yeah.  I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

regards, tom lane




Re: Move defaults toward ICU in 16?

2023-02-02 Thread Peter Geoghegan
On Thu, Feb 2, 2023 at 2:15 PM Thomas Munro  wrote:
> Sure, there could be a clean-room implementation that replaces it in
> some sense (just as there is a Java implementation) but it would very
> likely be "the same" because the real thing we're buying here is the
> set of algorithms and data maintenance that the whole industry has
> agreed on.

I don't think that a clean room implementation is implausible. They
seem to already exist, and be explicitly provided for by CLDR, which
is not joined at the hip to ICU:

https://github.com/elixir-cldr/cldr

Most of the value that we tend to think of as coming from ICU actually
comes from CLDR itself, as well as related Unicode Consortium and IETF
standards/RFCs such as BCP-47.

> Unless Britain decides to exit the Latin alphabet, terminate
> membership of ISO and revert to anglo-saxon runes with a sort order
> that is defined in the new constitution as "the opposite of whatever
> Unicode says", it's hard to see obstacles to ICU's long term universal
> applicability.

It would have to literally be defined as "not unicode" for it to
present a real problem. A key goal of Unicode is to accommodate
political and cultural shifts, since even countries can come and go.
In principle Unicode should be able to accommodate just about any
change in preferences, except when there is an irreconcilable
difference of opinion among people that are from the same natural
language group. For example it can accommodate relatively minor
differences of opinion about how text should be sorted among groups
that each speak a regional dialect of the same language. Hardly
anybody even notices this.

Accommodating these variations can only come from making a huge
investment. Most of the work is actually done by natural language
scholars, not technologists. That effort is very unlikely to be
duplicated by some other group with its own conflicting goals. AFAICT
there is no great need for any schisms, since differences of opinion
can usually be accommodated under the umbrella of Unicode.

-- 
Peter Geoghegan




Re: Underscores in numeric literals

2023-02-02 Thread Peter Eisentraut

On 31.01.23 17:09, Dean Rasheed wrote:

On Tue, 31 Jan 2023 at 15:28, Peter Eisentraut
 wrote:


Did you have any thoughts about what to do with the float types?  I
guess we could handle those in a separate patch?



I was assuming that we'd do nothing for float types, because anything
we did would necessarily impact their performance.


Yeah, as long as we are using strtof() and strtod() we should just leave 
it alone.  If we have to break that open and hand-code something, we can 
reconsider it.


So I think you could go ahead with committing your patch and we can 
consider this topic done for now.






Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Nathan Bossart
On Thu, Feb 02, 2023 at 02:01:13PM -0800, Nathan Bossart wrote:
> I've been digging into the history here.  This e-mail seems to have the
> most context [0].  IIUC this was intended to prevent "fast" shutdowns from
> escalating to "immediate" shutdowns because the restore command died
> unexpectedly.  This doesn't apply to archive_cleanup_command because we
> don't FATAL if it dies unexpectedly.  It seems like this idea should apply
> to recovery_end_command, too, but AFAICT it doesn't use the same approach.
> My guess is that this hasn't come up because it's less likely that both 1)
> recovery_end_command is used and 2) someone initiates shutdown while it is
> running.

Actually, this still doesn't really explain why we need to exit immediately
in the SIGTERM handler for restore_command.  We already have handling for
when the command indicates it exited due to SIGTERM, so it should be no
problem if the command receives it before the startup process.  And
HandleStartupProcInterrupts() should exit at an appropriate time after the
startup process receives SIGTERM.

My guess was that this is meant to allow breaking out of the system() call,
but I don't understand why that's important here.  Maybe we could just
remove this exit-in-SIGTERM-handler business...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Remove some useless casts to (void *)

2023-02-02 Thread Peter Eisentraut
I have found that in some corners of the code some calls to standard C 
functions are decorated with casts to (void *) for no reason, and this 
code pattern then gets copied around.  I have gone through and cleaned 
this up a bit, in the attached patches.


The involved functions are: repalloc, memcpy, memset, memmove, memcmp, 
qsort, bsearch


Also hash_search(), for which there was a historical reason (the 
argument used to be char *), but not anymore.From b4f050d23761187e05576cfbe8277b593d9c19fb Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 2 Feb 2023 23:02:34 +0100
Subject: [PATCH 1/8] Remove useless casts to (void *) (hash_search)

Some of these appear to be leftovers from when hash_search took a char
* argument (changed in 5999e78fc45dcb91784b64b6e9ae43f4e4f68ca2).
---
 contrib/postgres_fdw/shippable.c  |  6 +--
 src/backend/access/gist/gistbuild.c   |  4 +-
 src/backend/access/gist/gistbuildbuffers.c|  2 +-
 src/backend/access/transam/xlogutils.c|  6 +--
 src/backend/catalog/storage.c |  4 +-
 src/backend/optimizer/util/predtest.c |  2 +-
 src/backend/parser/parse_oper.c   |  6 +--
 src/backend/replication/logical/relation.c|  6 +--
 .../replication/logical/reorderbuffer.c   | 16 +++---
 src/backend/replication/pgoutput/pgoutput.c   |  2 +-
 src/backend/storage/buffer/buf_table.c|  6 +--
 src/backend/storage/buffer/bufmgr.c   |  8 +--
 src/backend/storage/buffer/localbuf.c | 12 ++---
 src/backend/storage/lmgr/lock.c   | 50 +--
 src/backend/storage/smgr/smgr.c   |  6 +--
 src/backend/storage/sync/sync.c   |  4 +-
 src/backend/tsearch/ts_typanalyze.c   |  4 +-
 src/backend/utils/adt/array_typanalyze.c  |  4 +-
 src/backend/utils/adt/ri_triggers.c   |  8 +--
 src/backend/utils/cache/attoptcache.c |  6 +--
 src/backend/utils/cache/relcache.c| 12 ++---
 src/backend/utils/cache/relfilenumbermap.c|  6 +--
 src/backend/utils/cache/spccache.c|  6 +--
 src/backend/utils/cache/ts_cache.c| 12 ++---
 src/backend/utils/cache/typcache.c|  8 +--
 src/backend/utils/time/combocid.c |  2 +-
 src/pl/plpgsql/src/pl_comp.c  |  6 +--
 src/pl/plpgsql/src/pl_exec.c  |  2 +-
 28 files changed, 108 insertions(+), 108 deletions(-)

diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 1b47a30dbf..a500b90390 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -77,7 +77,7 @@ InvalidateShippableCacheCallback(Datum arg, int cacheid, 
uint32 hashvalue)
while ((entry = (ShippableCacheEntry *) hash_seq_search(&status)) != NULL)
{
if (hash_search(ShippableCacheHash,
-   (void *) &entry->key,
+   &entry->key,
HASH_REMOVE,
NULL) == NULL)
elog(ERROR, "hash table corrupted");
@@ -184,7 +184,7 @@ is_shippable(Oid objectId, Oid classId, PgFdwRelationInfo 
*fpinfo)
/* See if we already cached the result. */
entry = (ShippableCacheEntry *)
hash_search(ShippableCacheHash,
-   (void *) &key,
+   &key,
HASH_FIND,
NULL);
 
@@ -200,7 +200,7 @@ is_shippable(Oid objectId, Oid classId, PgFdwRelationInfo 
*fpinfo)
 */
entry = (ShippableCacheEntry *)
hash_search(ShippableCacheHash,
-   (void *) &key,
+   &key,
HASH_ENTER,
NULL);
 
diff --git a/src/backend/access/gist/gistbuild.c 
b/src/backend/access/gist/gistbuild.c
index d21a308d41..7a6d93bb87 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1587,7 +1587,7 @@ gistMemorizeParent(GISTBuildState *buildstate, 
BlockNumber child, BlockNumber pa
boolfound;
 
entry = (ParentMapEntry *) hash_search(buildstate->parentMap,
-   (const void *) &child,
+   &child,
    HASH_ENTER,
    &found);
entry->parentblkno = parent;
@@ -1625,7 +1625,7 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber 
child)
 
/* Find node buffer in hash table */
entry = 

Re: Move defaults toward ICU in 16?

2023-02-02 Thread Thomas Munro
On Fri, Feb 3, 2023 at 5:31 AM Jeff Davis  wrote:
> On Thu, 2023-02-02 at 08:44 -0500, Robert Haas wrote:
> > On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis  wrote:
> > > If we don't want to nudge users toward ICU, is it because we are
> > > waiting for something, or is there a lack of consensus that ICU is
> > > actually better?
> >
> > Do you think it's better?
>
> Yes:
>
>   * ICU more featureful: e.g. supports case-insensitive collations (the
> citext docs suggest looking at ICU instead).
>   * It's faster: a simple non-contrived sort is something like 70%
> faster[1] than one using glibc.
>   * It can provide consistent semantics across platforms.

+1

>   * Easier for users to control what library version is available on
> their system. We can also ask packagers to keep some old versions of
> ICU available for an extended period of time.
>   * If one of the ICU multilib patches makes it in, it will be easier
> for users to select which of the library versions Postgres will use.
>   * Reports versions for indiividual collators, distinct from the
> library version.

+1

> The biggest disadvantage (rather, the flip side of its advantages) is
> that it's a separate dependency. Will ICU still be maintained in 10
> years or will we end up stuck maintaining it ourselves? Then again,
> we've already been shipping it, so I don't know if we can avoid that
> problem entirely now even if we wanted to.

It has a pretty special status, with an absolutely enormous amount of
technology depending on it.

http://blog.unicode.org/2016/05/icu-joins-unicode-consortium.html
https://unicode.org/consortium/consort.html
https://home.unicode.org/membership/members/
https://home.unicode.org/about-unicode/

I mean, who knows what the future holds, but ultimately what we're
doing here is taking the de facto reference implementation of the
Unicode collation algorithm.  Are Unicode and the consortium still
going to be here in 10 years?  We're all in on Unicode, and it's also
tangled up with ISO standards, as are parts of the collation stuff.
Sure, there could be a clean-room implementation that replaces it in
some sense (just as there is a Java implementation) but it would very
likely be "the same" because the real thing we're buying here is the
set of algorithms and data maintenance that the whole industry has
agreed on.

Unless Britain decides to exit the Latin alphabet, terminate
membership of ISO and revert to anglo-saxon runes with a sort order
that is defined in the new constitution as "the opposite of whatever
Unicode says", it's hard to see obstacles to ICU's long term universal
applicability.

It's still important to have libc support as an option, though,
because it's a totally reasonable thing to want sort order to agree
with the "sort" command on the same host, and you are willing to deal
with all the complexities that we're trying to escape.




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Nathan Bossart
On Thu, Feb 02, 2023 at 04:14:54PM -0500, Robert Haas wrote:
> +   /*
> +* When exitOnSigterm is set and we are in the startup process, use 
> the
> +* special wrapper for system() that enables exiting immediately upon
> +* receiving SIGTERM.  This ensures we can break out of system() if
> +* required.
> +*/
> 
> This comment, for me, raises more questions than it answers. Why do we
> only do this in the startup process?

Currently, this functionality only exists in the startup process because it
is only used for restore_command.  More below...

> Also, and this part is not the fault of this patch but a defect of the
> pre-existing comments, under what circumstances do we not want to exit
> when we get a SIGTERM? It's standard behavior for PostgreSQL backends
> to exit when they receive SIGTERM, so the question isn't why we
> sometimes exit immediately but why we ever don't. The existing code
> calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
> false in other cases, and AFAICS there are zero words of comments
> explaining the reasoning.

I've been digging into the history here.  This e-mail seems to have the
most context [0].  IIUC this was intended to prevent "fast" shutdowns from
escalating to "immediate" shutdowns because the restore command died
unexpectedly.  This doesn't apply to archive_cleanup_command because we
don't FATAL if it dies unexpectedly.  It seems like this idea should apply
to recovery_end_command, too, but AFAICT it doesn't use the same approach.
My guess is that this hasn't come up because it's less likely that both 1)
recovery_end_command is used and 2) someone initiates shutdown while it is
running.

BTW the relevant commits are cdd46c7 (added SIGTERM handling for
restore_command), 9e403c2 (added recovery_end_command), and c21ac0b (added
what is today called archive_cleanup_command).

> +   if (exitOnSigterm && MyBackendType == B_STARTUP)
> +   rc = RunInterruptibleShellCommand(command);
> +   else
> +   rc = system(command);
> 
> And this looks like pure magic. I'm all in favor of not relying on
> system(), but using it under some opaque set of conditions and
> otherwise doing something else is not the way. At the very least this
> needs to be explained a whole lot better.

If we applied this exit-on-SIGTERM behavior to recovery_end_command, I
think we could combine failOnSignal and exitOnSigterm into one flag, and
then it might be a little easier to explain what is going on.  In any case,
I agree that this deserves a lengthy explanation, which I'll continue to
work on.

[0] https://postgr.es/m/499047FE.9090407%40enterprisedb.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Remove unused code related to unknown type

2023-02-02 Thread Peter Eisentraut
Here is a patch that removes some unused leftovers from commit 
cfd9be939e9c516243c5b6a49ad1e1a9a38f1052 (old).From 638fa48b0852bf501069b4a6993454c011d9d42a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 2 Feb 2023 22:53:26 +0100
Subject: [PATCH] Remove unused code related to unknown type

These are leftovers obsoleted by
cfd9be939e9c516243c5b6a49ad1e1a9a38f1052.
---
 src/backend/utils/adt/varlena.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index fd81c47474..170b3a3820 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -45,7 +45,6 @@
 /* GUC variable */
 intbytea_output = BYTEA_OUTPUT_HEX;
 
-typedef struct varlena unknown;
 typedef struct varlena VarString;
 
 /*
@@ -112,12 +111,6 @@ typedef struct
  */
 #define TEXTBUFLEN 1024
 
-#define DatumGetUnknownP(X) ((unknown *) PG_DETOAST_DATUM(X))
-#define DatumGetUnknownPCopy(X) ((unknown *) PG_DETOAST_DATUM_COPY(X))
-#define PG_GETARG_UNKNOWN_P(n) DatumGetUnknownP(PG_GETARG_DATUM(n))
-#define PG_GETARG_UNKNOWN_P_COPY(n) DatumGetUnknownPCopy(PG_GETARG_DATUM(n))
-#define PG_RETURN_UNKNOWN_P(x) PG_RETURN_POINTER(x)
-
 #define DatumGetVarStringP(X)  ((VarString *) PG_DETOAST_DATUM(X))
 #define DatumGetVarStringPP(X) ((VarString *) PG_DETOAST_DATUM_PACKED(X))
 
-- 
2.39.1



Re: Prevent accidental whole-table DELETEs and UPDATEs

2023-02-02 Thread Tom Lane
Nikolay Samokhvalov  writes:
> In many cases, a DELETE or UPDATE not having a WHERE clause (or having it
> with a condition matching all rows in the table) is a sign of some kind of
> mistake, leading to accidental data loss, performance issues, producing a
> lot of dead tuples, and so on. Recently, this topic was again discussed [1]

> Attached is a patch implemented by Andrey Borodin during
> today's online session [2], containing a rough prototype for two new GUCs:

> - prevent_unqualified_deletes
> - prevent_unqualified_updates

This sort of thing has been proposed before and rejected before.
I do not think anything has changed.  In any case, I seriously
doubt that something that's basically a one-line test (excluding
overhead such as GUC definitions) is going to meaningfully
improve users' lives.  The cases that I actually see reported
are not "I left off the WHERE" but more like "I fat-fingered
a variable in a sub-select so that it's an outer reference,
causing the test to degenerate to WHERE x = x", or perhaps
"I misunderstood the behavior of NOT IN with nulls, ending up
with a constant-false or constant-true condition".  I'm not sure
if there's a reliable way to spot those sorts of not-so-trivial
semantic errors ... but if we could, that'd be worth discussing.

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Robert Haas
On Thu, Feb 2, 2023 at 3:10 PM Nathan Bossart  wrote:
> Hm.  I don't know if we want to encourage further use of
> in_restore_command since it seems to be prone to misuse.  Here's a v2 that
> demonstrates Tom's idea (bikeshedding on names and comments is welcome).  I
> personally like this approach a bit more.

+   /*
+* When exitOnSigterm is set and we are in the startup process, use the
+* special wrapper for system() that enables exiting immediately upon
+* receiving SIGTERM.  This ensures we can break out of system() if
+* required.
+*/

This comment, for me, raises more questions than it answers. Why do we
only do this in the startup process?

Also, and this part is not the fault of this patch but a defect of the
pre-existing comments, under what circumstances do we not want to exit
when we get a SIGTERM? It's standard behavior for PostgreSQL backends
to exit when they receive SIGTERM, so the question isn't why we
sometimes exit immediately but why we ever don't. The existing code
calls ExecuteRecoveryCommand with exitOnSigterm true in some cases and
false in other cases, and AFAICS there are zero words of comments
explaining the reasoning.

+   if (exitOnSigterm && MyBackendType == B_STARTUP)
+   rc = RunInterruptibleShellCommand(command);
+   else
+   rc = system(command);

And this looks like pure magic. I'm all in favor of not relying on
system(), but using it under some opaque set of conditions and
otherwise doing something else is not the way. At the very least this
needs to be explained a whole lot better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: when the startup process doesn't (logging startup delays)

2023-02-02 Thread Robert Haas
On Tue, Nov 22, 2022 at 6:05 AM Bharath Rupireddy
 wrote:
> If we place just the Assert(!StandbyMode); in
> enable_startup_progress_timeout(), it fails for
> begin_startup_progress_phase() in ResetUnloggedRelations() because the
> InitWalRecovery(), that sets StandbyMode to true, is called before
> ResetUnloggedRelations() . However, with the if (StandbyMode) {
> return; }, we fail to report progress of ResetUnloggedRelations() in a
> standby, which isn't a good idea at all because we only want to
> disable the timeout during the recovery's main loop.

Ugh. Well, in that case, I guess my vote is to forget about this whole
Assert business and just commit what you had in v4. Does that work for
you?

Protecting against specifically the situation where we're in the
standby's main redo apply loop is not really what I had in mind here,
but this is already sort of weirdly complicated-looking, and making it
more weirdly complicated-looking to achieve the kind of protection
that I had in mind doesn't really seem worth it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Amcheck verification of GiST and GIN

2023-02-02 Thread Nikolay Samokhvalov
On Thu, Feb 2, 2023 at 12:43 PM Peter Geoghegan  wrote:

> I agree that this matters at the level of whole indexes.
>

I already realized my mistake – indeed, having multiple errors for 1 index
doesn't seem to be super practically helpful.


> I think that that problem should be solved at a higher level, in the
> program that runs amcheck. Note that pg_amcheck will already do this
> for B-Tree indexes.
>

That's a great tool, and it's great it supports parallelization, very useful
on large machines.


> We should add a "Tip" to the amcheck documentation on 14+ about this.
> We should clearly advise users that they should probably just use
> pg_amcheck.


and with -j$N, with high $N (unless it's production)


Re: Amcheck verification of GiST and GIN

2023-02-02 Thread Peter Geoghegan
On Thu, Feb 2, 2023 at 12:31 PM Nikolay Samokhvalov
 wrote:
> I understand your thoughts (I think) and agree with them, but at least one
> scenario where I do want to see *all* errors is corruption prevention – 
> running
> amcheck in lower environments, not in production, to predict and prevent 
> issues.
> For example, not long ago, Ubuntu 16.04 became EOL (in phases), and people
> needed to upgrade, with glibc version change. It was quite good to use amcheck
> on production clones (running on a new OS/glibc) to identify all indexes that
> need to be rebuilt. Being able to see only one of them would be very
> inconvenient. Rebuilding all indexes didn't seem a good idea in the case of
> large databases.

I agree that this matters at the level of whole indexes. That is, if
you want to check every index in the database, it is unhelpful if the
whole process stops just because one individual index has corruption.
Any extra information about the index that is corrupt may not be all
that valuable, but information about other indexes remains almost as
valuable.

I think that that problem should be solved at a higher level, in the
program that runs amcheck. Note that pg_amcheck will already do this
for B-Tree indexes. While verify_nbtree.c won't try to limp on with an
index that is known to be corrupt, pg_amcheck will continue with other
indexes.

We should add a "Tip" to the amcheck documentation on 14+ about this.
We should clearly advise users that they should probably just use
pg_amcheck. Using the SQL interface directly should now mostly be
something that only a tiny minority of experts need to do -- and even
the experts won't do it that way unless they have a good reason to.

-- 
Peter Geoghegan




[PATCH] Add pretty-printed XML output option

2023-02-02 Thread Jim Jones

Hi,

This small patch introduces an XML pretty-print function. It basically
takes advantage of the indentation feature of xmlDocDumpFormatMemory
from libxml2 to format XML strings.


postgres=# SELECT xmlpretty('id="z">42');

    xmlpretty
--
 +
     +
 42+
       +
   +

(1 row)


The patch also contains regression tests and documentation.

Feedback is very welcome!

Jim
From ced9fccddc033de98709a6e93dc6530ce68149db Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Thu, 2 Feb 2023 21:27:16 +0100
Subject: [PATCH v1] Add pretty-printed XML output option

This small patch introduces an XML pretty-print function.
It basically takes advantage of the indentation feature
of xmlDocDumpFormatMemory from libxml2 to format XML strings.
---
 doc/src/sgml/func.sgml|  34 ++
 src/backend/utils/adt/xml.c   |  30 +
 src/include/catalog/pg_proc.dat   |   3 +
 src/test/regress/expected/xml.out | 107 ++
 src/test/regress/sql/xml.sql  |  34 ++
 5 files changed, 208 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e09e289a43..e8b5e581f0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14293,6 +14293,40 @@ SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
 ]]>
 

+   
+ 
+xmlpretty
+
+ 
+ xmlpretty
+ 
+ 
+
+xmlpretty ( xml ) xml
+
+ 
+ 
+ Converts the given XML value to pretty-printed, indented text.
+ 
+ 
+ 
+ Example:
+ 
+ 
+   
+   

 

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 079bcb1208..6409133137 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -473,6 +473,36 @@ xmlBuffer_to_xmltype(xmlBufferPtr buf)
 }
 #endif
 
+Datum
+xmlpretty(PG_FUNCTION_ARGS)
+{
+#ifdef USE_LIBXML
+
+xmlDocPtr  doc;
+xmlChar*buf = NULL;
+text   *arg = PG_GETARG_TEXT_PP(0);
+
+doc = xml_parse(arg, XMLOPTION_DOCUMENT, false, GetDatabaseEncoding(), NULL);
+
+/**
+* xmlDocDumpFormatMemory (
+*   xmlDocPtr doc,  # the XML document.
+*   xmlChar ** buf, # buffer where the formatted XML document will be stored.
+*   int *size,  # this could store the size of the created buffer
+* but as we do not need it, we can leave it NULL.
+*   int format) # 1 = node indenting.
+*/
xmlDocDumpFormatMemory(doc, &buf, NULL, 1);
+
+xmlFreeDoc(doc);
+PG_RETURN_XML_P(cstring_to_xmltype((char*)buf));
+
+#else
+NO_XML_SUPPORT();
+return 0;
+#endif
+}
+
 
 Datum
 xmlcomment(PG_FUNCTION_ARGS)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c0f2a8a77c..3224dc3e76 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8842,6 +8842,9 @@
 { oid => '3053', descr => 'determine if a string is well formed XML content',
   proname => 'xml_is_well_formed_content', prorettype => 'bool',
   proargtypes => 'text', prosrc => 'xml_is_well_formed_content' },
+  { oid => '4642', descr => 'Indented text from xml',
+  proname => 'xmlpretty', prorettype => 'xml',
+  proargtypes => 'xml', prosrc => 'xmlpretty' },
 
 # json
 { oid => '321', descr => 'I/O',
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index 3c357a9c7e..98a338ad8d 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1599,3 +1599,110 @@ SELECT * FROM XMLTABLE('.' PASSING XMLELEMENT(NAME a) columns a varchar(20) PATH
   | foo/
 (1 row)
 
+-- XML pretty print: single line XML string
+SELECT xmlpretty('Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650')::xml;
+xmlpretty 
+--
+ +
+ +
+ Belgian Waffles+
+ $5.95+
+ Two of our famous Belgian Waffles with plenty of real maple syrup+
+ 650+
+  +
++
+ 
+(1 row)
+
+-- XML pretty print: XML string with space, tabs and newline between nodes
+SELECT xmlpretty(' Belgian Waffles $15.95
+ Two of our famous Belgian Waffles with plenty of real maple syrup
+650')::xml;
+ 

Re: Amcheck verification of GiST and GIN

2023-02-02 Thread Nikolay Samokhvalov
On Thu, Feb 2, 2023 at 12:15 PM Peter Geoghegan  wrote:

> On Thu, Feb 2, 2023 at 11:51 AM Peter Geoghegan  wrote:
>
...

> Admittedly there is some value in seeing multiple WARNINGs to true
> experts that are performing some kind of forensic analysis, but that
> doesn't seem worth it to me -- I'm an expert, and I don't think that
> I'd do it this way for any reason other than it being more convenient
> as a way to get information about a system that I don't have access
> to. Even then, I think that I'd probably have serious doubts about
> most of the extra information that I'd get, since it might very well
> be a downstream consequence of the same basic problem.
>
...

I understand your thoughts (I think) and agree with them, but at least one
scenario where I do want to see *all* errors is corruption prevention –
running
amcheck in lower environments, not in production, to predict and prevent
issues.
For example, not long ago, Ubuntu 16.04 became EOL (in phases), and people
needed to upgrade, with glibc version change. It was quite good to use
amcheck
on production clones (running on a new OS/glibc) to identify all indexes
that
need to be rebuilt. Being able to see only one of them would be very
inconvenient. Rebuilding all indexes didn't seem a good idea in the case of
large databases.


Re: transition tables and UPDATE

2023-02-02 Thread Corey Huinker
>
>
> even uglier than what I already had.  So yeah, I think it might be
> useful if we had a way to inject a counter or something in there.
>
>
This came up for me when I was experimenting with making the referential
integrity triggers fire on statements rather than rows. Doing so has the
potential to take a lot of the sting out of big deletes where the
referencing column isn't indexed (thus resulting in N sequential scans of
the referencing table). If that were 1 statement then we'd get a single
(still ugly) hash join, but it's an improvement.

It has been suggested that the overhead of forming the tuplestores of
affected rows and reconstituting them into EphemerialNamedRelations could
be made better by instead storing an array of old ctids and new ctids,
which obviously would be in the same order, if we had a means of
reconstituting those with just the columns needed for the check (and
generating a fake ordinality column for your needs), that would be
considerably lighter weight than the tuplestores, and it might make
statement level triggers more useful all around.


a very minor bug and a couple of comment changes for basebackup.c

2023-02-02 Thread Robert Haas
Here are a few small patches for basebackup.c:

0001 fixes what I believe to be a slight logical error in sendFile(),
introduced by me during the v15 development cycle when I introduced
the bbsink abstraction. I believe that it is theoretically possible
for this to cause an assertion failure, although the chances of that
actually happening seem extremely remote in practice. I don't believe
there are any consequences worse than that; for instance, I don't
think this can result in your backup getting corrupted. See the
proposed commit message for full details. Because I believe that this
is formally a bug, I am inclined to think that this should be
back-patched, but I also think it's fairly likely that no one would
ever notice if we didn't. However, patch authors have been known to be
wrong about the consequences of their own bugs from time to time, so
please do let me know if this seems more serious to you than what I'm
indicating, or conversely if you think it's not a problem at all for
some reason.

0002 removes an old comment from the file that I find useless and
slightly misleading.

0003 rewrites a comment about the way that we verify checksums during
backups. If we get a checksum mismatch, we reread the block and see if
the perceived problem goes away. If it doesn't, then we report it.
This is intended as protection against the backup reading a block
while some other process is in the midst of writing it, but there's no
guarantee that any concurrent write will finish quickly enough for our
second read attempt to see the updated contents. The comment claims
otherwise, and that's false, and I'm getting tired of reading this
false claim every time I read this code, so I rewrote the comment to
say what I believe to be true, namely, that our algorithm is flaky and
we have no good way to fix that right now. I'm pretty sure that Andres
pointed this problem out when this feature was under discussion, but
somehow it's still like this. There's another nearby comment which is
also false or at least misleading for basically the same reasons which
probably should be rewritten too, but I was a bit less certain how to
rewrite it and it wasn't making me as annoyed as this one, so for now
I only rewrote the one.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0002-Remove-an-old-comment-that-doesn-t-seem-especially-u.patch
Description: Binary data


0001-In-basebackup.c-perform-end-of-file-test-after-check.patch
Description: Binary data


0003-Reword-overly-optimistic-comment-about-backup-checks.patch
Description: Binary data


Re: run pgindent on a regular basis / scripted manner

2023-02-02 Thread Andrew Dunstan

On 2023-02-02 Th 10:00, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Sure. There's probably some work that could still be done in this area
>> too, such as making pgperltidy work similarly to how we've now made
>> pgindent work.
> Perhaps.  But before we commit to that, I'd like to see some tweaks to the
> pgperltidy rules to make it less eager to revisit the formatting of lines
> close to a change.  Its current behavior will induce a lot of "git blame"
> noise if we apply these same procedures to Perl code.


I haven't done anything about that yet, but I have reworked the script
so it's a lot more like pgindent, with --show-diff and --silent-diff
modes, and allowing a list of files to be indented on the command line.
Non-perl files are filtered out from such a list.


>
> (Should I mention reformat-dat-files?)


If you want I can add those flags there too.


>
>> There's also a question of timing. Possibly the best time would be when
>> we next fork the tree.
> Yeah.  We have generally not wanted to do a mass indent except
> when there's a minimum amount of pending patches, ie after the last
> CF of a cycle.  What I'd suggest is that we plan on doing a mass
> indent and then switch over to the new rules, right after the March
> CF closes.  That gives us a couple months to nail down and test out
> the new procedures before they go live.
>
>   


WFM. Of course then we're not waiting for the developer meeting.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/tools/pgindent/pgperltidy b/src/tools/pgindent/pgperltidy
index 5e704119eb..388c113526 100755
--- a/src/tools/pgindent/pgperltidy
+++ b/src/tools/pgindent/pgperltidy
@@ -1,12 +1,162 @@
-#!/bin/sh
+#!/usr/bin/perl
 
 # src/tools/pgindent/pgperltidy
 
-set -e
+use strict;
+use warnings;
 
-# set this to override default perltidy program:
-PERLTIDY=${PERLTIDY:-perltidy}
+use Getopt::Long;
+use FindBin;
+use File::Find;
+use Perl::Tidy;
 
-. src/tools/perlcheck/find_perl_files
+# we require this exact version. See README.
+die "wrong perltidy version $Perl::Tidy::VERSION"
+  unless $Perl::Tidy::VERSION eq '20170521';
+
+# make sure we're in the source root if we need to look for files
+chdir "$FindBin::RealBin/../../.." unless @ARGV;
+
+my ($show_diff, $silent_diff, $help);
+
+$help = 0;
+
+my %options = (
+	"help"   => \$help,
+	"show-diff"  => \$show_diff,
+	"silent-diff"=> \$silent_diff,);
+GetOptions(%options) || usage("bad command line argument");
+
+usage() if $help;
+
+my @files;
+
+# get the filtered list of files
+if (@ARGV)
+{
+	filter_perl_files();
+}
+else
+{
+	File::Find::find(\&wanted, ".");
+}
+
+# process according to the command switches
+if ($silent_diff || $show_diff)
+{
+	indent_files_diff();
+}
+else
+{
+	indent_files_in_place();
+}
+
+exit 0;
+
+# non-destructive perltidy mode
+sub indent_files_diff
+{
+	my $diff_found = 0;
+	foreach my $file (@files)
+	{
+		my $error_flag = Perl::Tidy::perltidy(
+			source=> $file,
+destination   => "$file.tdy",
+perltidyrc=> "$FindBin::RealBin/perltidyrc",
+		   );
+		if ($error_flag)
+		{
+			unlink "$file.tdy";
+			die "pgperltidy error!";
+		}
+		my $diff = `diff -ud -F '^sub ' $file $file.tdy 2>&1`;
+		unlink "$file.tdy";
+		next unless $diff;
+		exit 2 if ($silent_diff);
+		$diff_found = 1;
+		print $diff;
+	}
+}
+
+# destructive perltidy in place
+# expects that the perltidyrc will have --backup-and-modify-in-place
+sub indent_files_in_place
+{
+	foreach my $file (@files)
+	{
+		my $error_flag = Perl::Tidy::perltidy(
+			source=> $file,
+perltidyrc=> "$FindBin::RealBin/perltidyrc",
+		   );
+		if ($error_flag)
+		{
+			die "pgperltidy error!";
+		}
+	}
+}
+
+# subroutine for use with File::Find::find
+sub wanted
+{
+my $mode;
+
+($mode = (lstat($_))[2]) &&
+	  -f _ &&
+	  (/^.*\.p[lm]\z/s ||
+	   (!/^.*\.dat\z/s && (($mode & 0100) == 0100) &&
+		`file $_ 2>/dev/null` =~/:.*perl[0-9]*$/i)) &&
+	   push(@files, $File::Find::name);
+}
+
+# Similar but filtering files specified in @ARGV
+sub filter_perl_files
+{
+	while (my $file = shift @ARGV)
+	{
+		if (! -f $file)
+		{
+			die "can't find $file";
+		}
+		# ignore .dat files in case they are found
+		next if $file =~ /\.dat$/;
+		if ($file =~ /\.p[lm]\z/)
+		{
+			# file has an explicit perl extension, nothing more to test
+			push @files, $file;
+			next;
+		}
+		my $mode = (lstat(_))[2];
+		# check if it's executable by owner
+		next unless (($mode & 0100) == 0100);
+		# check if `file` says it's a perl file
+		if (`file $file 2>/dev/null` =~/:.*perl[0-9]*$/i)
+		{
+			push @files, $file;
+		}
+	}
+}
+
+
+sub usage
+{
+	my $message  = shift;
+	my $helptext = <<'EOF';
+Usage:
+pgperltidy [OPTION]... [FILE]...
+Options:
+	--help  show this message and quit
+	--show-diff show the changes that would be made
+	--silent-diff   exit with status 2 

Prevent accidental whole-table DELETEs and UPDATEs

2023-02-02 Thread Nikolay Samokhvalov
In many cases, a DELETE or UPDATE not having a WHERE clause (or having it
with a condition matching all rows in the table) is a sign of some kind of
mistake, leading to accidental data loss, performance issues, producing a
lot of dead tuples, and so on. Recently, this topic was again discussed [1].

Attached is a patch implemented by Andrey Borodin during our online
session today [2], containing a rough prototype for two new GUCs:

- prevent_unqualified_deletes
- prevent_unqualified_updates

Both are "false" by default; for superusers, they are not applied.

There is also another implementation of this idea, in the form of an
extension [3], but I think having this in the core would be beneficial to
many users.

Looking forward to your feedback.

[1] https://news.ycombinator.com/item?id=34560332
[2] https://www.youtube.com/watch?v=samLkrC5xQA
[3] https://github.com/eradman/pg-safeupdate


0001-Add-GUCs-to-preven-some-accidental-massive-DML.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-02-02 Thread Peter Geoghegan
On Thu, Feb 2, 2023 at 11:51 AM Peter Geoghegan  wrote:
> I also have some questions about the verification functionality itself:

I forgot to include another big concern here:

* Why are there only WARNINGs, never ERRORs here?

It's far more likely that you'll run into problems when running
amcheck this way. I understand that the heapam checks can do that, but
that is both more useful, and less risky. With heapam we're not
traversing a tree structure in logical/keyspace order. I'm not
claiming that this approach is impossible; just that it doesn't seem
even remotely worth it. Indexes are never supposed to be corrupt, but
if they are corrupt the solution always involves a REINDEX. You never
try to recover the data from an index, since it's redundant and less
authoritative, almost by definition (at least in Postgres).

By far the most important piece of information is that an index has
some non-zero amount of corruption. Any amount of corruption is
supposed to be extremely surprising. It's kind of like if you see one
cockroach in your home. The problem is not that you have one cockroach
in your home; the problem is that you simply have cockroaches. We can
all agree that in some abstract sense, fewer cockroaches is better.
But that doesn't seem to have any practical relevance -- it's a purely
theoretical point. It doesn't really affect what you do about the
problem at that point.

Admittedly there is some value in seeing multiple WARNINGs to true
experts that are performing some kind of forensic analysis, but that
doesn't seem worth it to me -- I'm an expert, and I don't think that
I'd do it this way for any reason other than it being more convenient
as a way to get information about a system that I don't have access
to. Even then, I think that I'd probably have serious doubts about
most of the extra information that I'd get, since it might very well
be a downstream consequence of the same basic problem.

-- 
Peter Geoghegan




Re: Weird failure with latches in curculio on v15

2023-02-02 Thread Nathan Bossart
On Thu, Feb 02, 2023 at 01:24:15PM +0900, Michael Paquier wrote:
> On Wed, Feb 01, 2023 at 09:34:44PM -0500, Tom Lane wrote:
>> I was vaguely wondering about removing both of those functions
>> in favor of an integrated function that does a system() call
>> with those things before and after it.
> 
> It seems to me that this is pretty much the same as storing
> in_restore_command in shell_restore.c, and that for recovery modules
> this comes down to the addition of an extra callback called in
> startup.c to check if the flag is up or not.  Now the patch is doing 
> things the opposite way: like on HEAD, store the flag in startup.c but
> switch it at will with the routines in startup.c.  I find the approach
> of the patch a bit more intuitive, TBH, as that makes the interface
> simpler for other recovery modules that may want to switch the flag
> back-and-forth, and I suspect that there may be cases in recovery
> modules where we'd still want to switch the flag, but not necessarily
> link it to system().

Hm.  I don't know if we want to encourage further use of
in_restore_command since it seems to be prone to misuse.  Here's a v2 that
demonstrates Tom's idea (bikeshedding on names and comments is welcome).  I
personally like this approach a bit more.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 0a26c49fa74bde307f094492917138c799917d45 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 2 Feb 2023 12:04:12 -0800
Subject: [PATCH v2 1/1] stopgap fix for restore_command

---
 src/backend/access/transam/shell_restore.c | 15 +++-
 src/backend/access/transam/xlogarchive.c   |  7 --
 src/backend/postmaster/startup.c   | 27 +-
 src/include/postmaster/startup.h   |  3 +--
 4 files changed, 36 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/shell_restore.c b/src/backend/access/transam/shell_restore.c
index 8458209f49..8fc3e86a10 100644
--- a/src/backend/access/transam/shell_restore.c
+++ b/src/backend/access/transam/shell_restore.c
@@ -21,6 +21,8 @@
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
 #include "common/percentrepl.h"
+#include "miscadmin.h"
+#include "postmaster/startup.h"
 #include "storage/ipc.h"
 #include "utils/wait_event.h"
 
@@ -146,7 +148,18 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	 */
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
-	rc = system(command);
+
+	/*
+	 * When exitOnSigterm is set and we are in the startup process, use the
+	 * special wrapper for system() that enables exiting immediately upon
+	 * receiving SIGTERM.  This ensures we can break out of system() if
+	 * required.
+	 */
+	if (exitOnSigterm && MyBackendType == B_STARTUP)
+		rc = RunInterruptibleShellCommand(command);
+	else
+		rc = system(command);
+
 	pgstat_report_wait_end();
 
 	if (rc != 0)
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 4b89addf97..66312c816b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -147,18 +147,11 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	else
 		XLogFileName(lastRestartPointFname, 0, 0L, wal_segment_size);
 
-	/*
-	 * Check signals before restore command and reset afterwards.
-	 */
-	PreRestoreCommand();
-
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
 	ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);
 
-	PostRestoreCommand();
-
 	if (ret)
 	{
 		/*
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 8786186898..aa94430c6f 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -273,9 +273,24 @@ StartupProcessMain(void)
 	proc_exit(0);
 }
 
-void
-PreRestoreCommand(void)
+/*
+ * This is a wrapper for system() that enables exiting immediately on SIGTERM.
+ * It is intended for use with restore_command since there isn't a good way to
+ * break out while it is executing.  Note that this behavior only works in the
+ * startup process.
+ *
+ * NB: Since we might call proc_exit() in a signal handler here, it is
+ * imperative that nothing but the system() call happens between setting
+ * and resetting in_restore_command.  Any additional code must go before or
+ * after this section.
+ */
+int
+RunInterruptibleShellCommand(const char *command)
 {
+	int ret;
+
+	Assert(MyBackendType == B_STARTUP);
+
 	/*
 	 * Set in_restore_command to tell the signal handler that we should exit
 	 * right away on SIGTERM. We know that we're at a safe point to do that.
@@ -285,12 +300,12 @@ PreRestoreCommand(void)
 	in_restore_command = true;
 	if (shutdown_requested)
 		proc_exit(1);
-}
 
-void
-PostRestoreCommand(void)
-{
+	ret = system(command);
+
 	in_restore_command = false;
+
+	return ret;
 }
 
 bool
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index dd957f9291..5188f49d21 100644
--- 

Re: Amcheck verification of GiST and GIN

2023-02-02 Thread Peter Geoghegan
On Fri, Jan 13, 2023 at 8:15 PM Andrey Borodin  wrote:
> (v21 of patch series)

I can see why the refactoring patch is necessary overall, but I have
some concerns about the details. More specifically:

* PageGetItemIdCareful() doesn't seem like it needs to be moved to
amcheck.c and generalized to work with GIN and GiST.

It seems better to just allow some redundancy, by having static/local
versions of PageGetItemIdCareful() for both GIN and GiST. There are
numerous reasons why that seems better to me. For one thing it's
simpler. For another, the requirements are already a bit different,
and may become more different in the future. I have seriously
considered adding a new PageGetItemCareful() routine to nbtree in the
past (which would work along similar lines when we access
IndexTuples), which would have to be quite different across each index
AM. Maybe this idea of adding a PageGetItemCareful() would totally
supersede the existing PageGetItemIdCareful() function.

But even now, without any of that, the rules for
PageGetItemIdCareful() are already different. For example, with GIN
you cannot have LP_DEAD bits set, so ISTM that you should be checking
for that in its own custom version of PageGetItemIdCareful().

You can just have comments that refer the reader to the original
nbtree version of PageGetItemIdCareful() for a high level overview.

* You have distinct versions of the current btree_index_checkable()
function for both GIN and GiST, which doesn't seem necessary to me --
so this is kind of the opposite of the situation with
PageGetItemIdCareful() IMV.

The only reason to have separate versions of these is to detect when
the wrong index AM is used -- the other 2 checks are 100% common to
all index AMs. Why not just move that one non-generic check out of the
function, to each respective index AM .c file, while keeping the other
2 generic checks in amcheck.c?

Once things are structured this way, it would then make sense to add a
can't-be-LP_DEAD check to the GIN specific version of
PageGetItemIdCareful().

I also have some questions about the verification functionality itself:

* Why haven't you done something like palloc_btree_page() for both
GiST and GIN, and use that for everything?

Obviously this may not be possible in 100% of all cases -- even
verify_nbtree.c doesn't manage that. But I see no reason for that
here. Though, in general, it's not exactly clear what's going on with
buffer lock coupling in general.

* Why does gin_refind_parent() buffer lock the parent while the child
buffer lock remains held?

In any case this doesn't really need to have any buffer lock coupling.
Since both of the new verification functions you're adding are
"parent" variants, shouldn't they acquire a ShareLock to block
concurrent modifications and concurrent VACUUM?

* Oh wait, they don't use a ShareLock at all -- they use an
AccessShareLock. This means that there are significant inconsistencies
with the verify_nbtree.c scheme.

I now realize that gist_index_parent_check() and
gin_index_parent_check() are actually much closer to bt_index_check()
than to bt_index_parent_check(). I think that you should stick with
the convention of using the word "parent" whenever we'll need a
ShareLock, and omitting "parent" whenever we will only require an
AccessShareLock. I'm not sure if that means that you should change the
lock strength or change the name of the functions. I am sure that you
should follow the general convention that we have already.

I feel rather pessimistic about our ability to get all the details
right with GIN. Frankly I have serious doubts that GIN itself gets
everything right, which makes our task just about impossible. The GIN
README did gain a "Concurrency" section in 2019, at my behest, but in
general the locking protocols are still chronically under-documented,
and have been revised in various ways as a response to bugs. So at
least in the case of GIN, we really need amcheck coverage, but should
take a very conservative approach.

With GIN I think that we need to make the most modest possible
assumptions about concurrency, by using a ShareLock. Without that, I
think that we can have very little confidence in the verification
checks -- the concurrency rules are just too complicated right now.
Maybe it will be possible in the future, but right now I'd rather not
try that. I find it very difficult to figure out the GIN locking
protocol, even for things that seem like they should be quite
straightforward. This situation would be totally unthinkable in
nbtree, and perhaps with GiST.

* Why does the GIN patch change a comment in contrib/amcheck/amcheck.c?

* There is no pg_amcheck patch here, but I think that there should be,
since that is now the preferred and recommended way to run amcheck in
general.

We could probably do something very similar to what is already there
for nbtree. Maybe it would make sense to change --heapallindexed and
--parent-check so that they call your parent check functions for GiST
and 

Re: recovery modules

2023-02-02 Thread Nathan Bossart
On Thu, Feb 02, 2023 at 02:34:17PM +0900, Michael Paquier wrote:
> Okay, the changes done here look straight-forward seen from here.  I
> got one small-ish comment.
> 
> +basic_archive_startup(ArchiveModuleState *state)
> +{
> +   BasicArchiveData *data = palloc0(sizeof(BasicArchiveData));
> 
> Perhaps this should use MemoryContextAlloc() rather than a plain
> palloc().  This should not matter based on the position where the
> startup callback is called, still that may be a pattern worth
> encouraging.

Good call.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 97585c26468a39b0f881a9cc8b81ea614da9a547 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH v5 1/3] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3c714a79c6..dda6698509 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char	   *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;
 
 
 /*
@@ -406,8 +406,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();
 
 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+!ArchiveCallbacks.check_configured_cb())
 			{
 ereport(WARNING,
 		(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -508,7 +508,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);
 
-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -814,7 +814,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -827,7 +827,7 @@ LoadArchiveLibrary(void)
  errmsg("both archive_command and archive_library set"),
  errdetail("Only one of archive_command, archive_library may be set.")));
 
-	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
+	memset(&ArchiveCallbacks, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -844,9 +844,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 (errmsg("archive modules have to define the symbol %s", "_PG_archive_module_init")));
 
-	(*archive_init) (&ArchiveContext);
+	(*archive_init) (&ArchiveCallbacks);
 
-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 (errmsg("archive modules must register an archive callback")));
 
@@ -859,6 +859,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

>From 8f26acafe541abcdd9261c2668876ff288fc02f4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH v5 2/3] move archive module exports to dedicated headers

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c |  2 ++
 src/backend/postmaster/shell_archive.c  |  3 +-
 src/backend/utils/misc/guc_tables.c |  1 +
 src/include/postmaster/archive_module.h | 47 +
 src/include/postmaster/pgarch.h | 39 
 src/include/postmaster/shell_archive.h  | 24 +
 7 files changed, 77 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h
 create mode 100644 src/include/postmaster/shell_archive.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -32,7 +32,7 @@
 
 #include "common/int.h"
 #include "miscadmin.h"
-#include "postmaster/pgarch.h"
+#include "postmaster/archive_module.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "utils/guc.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index dda6698509..ec47b2cc20 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,8 +34,10 @@
 #include "lib/binaryheap.h"
 #include "libpq/pqsignal.h"
 #include "pgstat.h"
+#include "postmaster/archive_module.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/pgarch.h"

Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2023-02-02 Thread Nathan Bossart
On Thu, Feb 02, 2023 at 06:59:51AM -0800, Andres Freund wrote:
> On 2022-09-20 11:32:02 -0700, Nathan Bossart wrote:
>> Note that this change also disallows XMAX_COMMITTED together with
>> the special pre-v9.3 locked-only bit pattern that
>> HEAP_XMAX_IS_LOCKED_ONLY checks for.  This locked-only bit pattern
>> may still be present on servers pg_upgraded from pre-v9.3 versions.
> 
> Given that fact, that aspect at least seems to be not viable?

AFAICT from looking at the v9.2 code, the same idea holds true for this
special bit pattern.  I only see HEAP_XMAX_INVALID set when one of the
infomask lock bits is set, and those bits now correspond to
HEAP_XMAX_LOCK_ONLY and HEAP_XMAX_EXCL_LOCK (which are both covered by the
HEAP_XMAX_IS_LOCKED_ONLY macro).  Of course, I could be missing something.
Do you think we should limit this to the v9.3+ bit pattern?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: heapgettup refactoring

2023-02-02 Thread Melanie Plageman
On Thu, Feb 02, 2023 at 07:00:37PM +1300, David Rowley wrote:
> I've attached a version of the next patch in the series. I admit to
> making a couple of adjustments. I couldn't bring myself to remove the
> snapshot local variable in this commit. We can deal with that when we
> come to it in whichever patch that needs to be changed in.

That seems fine to keep the diff easy to understand. Also,
heapgettup_pagemode() didn't have a snapshot local variable either.

> The only other thing I really did was question your use of rs_cindex
> to store the last OffsetNumber.  I ended up adding a new field which
> slots into the padding between a bool and BlockNumber field named
> rs_coffset for this purpose. I noticed what Andres wrote [1] earlier
> in this thread about that, so thought we should move away from looking
> at the last tuple to get this number
> 
> [1] https://postgr.es/m/20221101010948.hsf33emgnwzvi...@awork3.anarazel.de

So, what Andres had said was: 

> Hm. Finding the next offset via rs_ctup doesn't seem quite right. For one,
> it's not actually that cheap to extract the offset from an ItemPointer because
> of the way we pack it into ItemPointerData.

Because I was doing this in an earlier version:

> > + HeapTuple   tuple = &(scan->rs_ctup);

And then in the later part of the code got tuple->t_self.

I did this because the code in master does this:

  lineoff = /* previous offnum */
    Min(lines,
        OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self))));

So I figured it was the same. Based on Andres' feedback, I switched to
saving the offset number in the scan descriptor and figured I could
reuse rs_cindex since it is larger than an OffsetNumber.

Your code also switches to saving the OffsetNumber -- just in a separate
variable of OffsetNumber type. I am fine with this if your rationale
is that it is not a good idea to store a smaller number in a larger
datatype. However, the benefit I saw in reusing rs_cindex is that we
could someday converge the code for heapgettup() and
heapgettup_pagemode() even more. Even in my final refactor, there is a
lot of duplicate code between the two.

- Melanie




Re: Move defaults toward ICU in 16?

2023-02-02 Thread Jeff Davis
On Thu, 2023-02-02 at 08:44 -0500, Robert Haas wrote:
> On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis  wrote:
> > If we don't want to nudge users toward ICU, is it because we are
> > waiting for something, or is there a lack of consensus that ICU is
> > actually better?
> 
> Do you think it's better?

Yes:

  * ICU more featureful: e.g. supports case-insensitive collations (the
citext docs suggest looking at ICU instead).
  * It's faster: a simple non-contrived sort is something like 70%
faster[1] than one using glibc.
  * It can provide consistent semantics across platforms.

I believe the above reasons are enough to call ICU "better", but it
also seems like a better path for addressing/mitigating collator
versioning problems:

  * Easier for users to control what library version is available on
their system. We can also ask packagers to keep some old versions of
ICU available for an extended period of time.
  * If one of the ICU multilib patches makes it in, it will be easier
for users to select which of the library versions Postgres will use.
  * Reports versions for individual collators, distinct from the
library version.

The biggest disadvantage (rather, the flip side of its advantages) is
that it's a separate dependency. Will ICU still be maintained in 10
years or will we end up stuck maintaining it ourselves? Then again,
we've already been shipping it, so I don't know if we can avoid that
problem entirely now even if we wanted to.

I don't mean that ICU solves all of our problems -- far from it. But
you asked if I think it's better, and my answer is yes.

Regards,
Jeff Davis

[1]
https://postgr.es/m/64039a2dbcba6f42ed2f32bb5f0371870a70afda.ca...@j-davis.com





Re: Improve logging when using Huge Pages

2023-02-02 Thread Alvaro Herrera
On 2023-Jan-24, Justin Pryzby wrote:

> On Mon, Jan 23, 2023 at 05:33:35PM -0800, Andres Freund wrote:

> > I'm ok with this being a GUC, it doesn't feel elegant, but I suspect there's
> > no realistic elegant answer.

> Whether it's a GUC or a function, I propose to name it hugepages_active.
> If there's no objections, I'll add to the CF.

Maybe I misread the code (actually I only read the patch), but -- does
it set active=true when huge_pages=on?  I think the code only works when
huge_pages=try.  That might be pretty confusing; I think it should say
"on" in both cases.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Hay dos momentos en la vida de un hombre en los que no debería
especular: cuando puede permitírselo y cuando no puede" (Mark Twain)




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-02-02 Thread Reid Thompson
Regarding the shared counter noted here,

> What you could do is to have a single, imprecise, shared counter for the total
> memory allocation, and have a backend-local "allowance". When the allowance is
> used up, refill it from the shared counter (a single atomic op).

Is there a preferred or suggested location to put variables like this?
Perhaps a current variable to use as a reference?
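
For reference, here is a minimal sketch of the scheme described in the quote
above. All of these names (and the 1MB refill size) are made up for
illustration -- they are not existing symbols, and where to put them is
exactly the question.

typedef struct SharedMemAccounting
{
	pg_atomic_uint64	total_allocated;	/* imprecise cluster-wide total */
} SharedMemAccounting;

static SharedMemAccounting *memAccounting;	/* lives in shared memory */
static uint64	local_allowance = 0;		/* backend-local budget */

#define ALLOWANCE_REFILL	(1024 * 1024)

static void
account_allocation(Size size)
{
	if (local_allowance < size)
	{
		uint64		refill = Max(ALLOWANCE_REFILL, size - local_allowance);

		/* a single atomic op charges a whole chunk to this backend */
		pg_atomic_fetch_add_u64(&memAccounting->total_allocated, refill);
		local_allowance += refill;
	}
	local_allowance -= size;
}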

Thanks,
Reid





Re: Error "initial slot snapshot too large" in create replication slot

2023-02-02 Thread Andres Freund
Hi,

On 2022-10-12 14:10:15 +0900, Michael Paquier wrote:
> The discussion seems to have stopped here.  As this is classified as a
> bug fix, I have moved this patch to next CF, waiting on author for the
> moment.

FWIW, I view this more as lifting a limitation. I wouldn't want to
backpatch the change.

Greetings,

Andres Freund




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-02 Thread Justin Pryzby
On Wed, Feb 01, 2023 at 07:24:48PM +0100, Matthias van de Meent wrote:
> On Wed, 1 Feb 2023 at 18:51, Ilya Gladyshev  
> wrote:
> >
> > 1 февр. 2023 г., в 20:27, Matthias van de Meent 
> >  написал(а):
> >
> >> In HEAD we set TOTAL to whatever number partitioned table we're
> >> currently processing has - regardless of whether we're the top level
> >> statement.
> >> With the patch we instead add the number of child relations to that
> >> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
> >> posted by Ilya. Approximately immediately after updating that count we
> >> recurse to the child relations, and that only returns once it is done
> >> creating the indexes, so both TOTAL and DONE go up as we process more
> >> partitions in the hierarchy.
> >
> >
> > The TOTAL in the patch is set only when processing the top-level parent and 
> > it is not updated when we recurse, so yes, it is constant. From v3:
> 
> Ugh, I misread the patch, more specifically count_leaf_partitions and
> the !OidIsValid(parentIndexId) condition changes.
> 
> You are correct, sorry for the noise.

That suggests that the comments could've been more clear.  I added a
comment suggested by Tomas and adjusted some others and wrote a commit
message.  I even ran pgindent for about the 3rd time ever.

002 are my changes as a separate patch, which you could apply to your
local branch.

And 003/4 are assertions that I wrote to demonstrate the problem and the
verify the fixes, but not being proposed for commit.

-- 
Justin
>From 375961e18aaa7ed7b2ebee972ad07c7a38099ef4 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Tue, 31 Jan 2023 19:13:07 +0400
Subject: [PATCH 1/4] create index progress increment

---
 doc/src/sgml/monitoring.sgml  |  4 +-
 src/backend/commands/indexcmds.c  | 64 +--
 src/backend/utils/activity/backend_progress.c | 25 
 src/include/utils/backend_progress.h  |  1 +
 4 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b67..a911900271c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6601,7 +6601,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
   
   
When creating an index on a partitioned table, this column is set to
-   the total number of partitions on which the index is to be created.
+   the total number of leaf partitions on which the index is to be created or attached.
This field is 0 during a REINDEX.
   
  
@@ -6612,7 +6612,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
   
   
When creating an index on a partitioned table, this column is set to
-   the number of partitions on which the index has been created.
+   the number of leaf partitions on which the index has been created or attached.
This field is 0 during a REINDEX.
   
  
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 16ec0b114e6..936b4e3c1db 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -130,6 +130,30 @@ typedef struct ReindexErrorInfo
 	char		relkind;
 } ReindexErrorInfo;
 
+
+/*
+ * Count the number of direct and indirect leaf partitions, excluding foreign
+ * tables.
+ */
+static int
+count_leaf_partitions(Oid relid)
+{
+	int			nleaves = 0;
+	List	   *childs = find_all_inheritors(relid, NoLock, NULL);
+	ListCell   *lc;
+
+	foreach(lc, childs)
+	{
+		Oid	partrelid = lfirst_oid(lc);
+
+		if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid)))
+			nleaves++;
+	}
+
+	list_free(childs);
+	return nleaves;
+}
+
 /*
  * CheckIndexCompatible
  *		Determine whether an existing index definition is compatible with a
@@ -1219,8 +1243,14 @@ DefineIndex(Oid relationId,
 			Relation	parentIndex;
 			TupleDesc	parentDesc;
 
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
-		 nparts);
+			if (!OidIsValid(parentIndexId))
+			{
+int total_parts;
+
+total_parts = count_leaf_partitions(relationId);
+pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+			 total_parts);
+			}
 
 			/* Make a local copy of partdesc->oids[], just for safety */
 			memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
@@ -1250,6 +1280,7 @@ DefineIndex(Oid relationId,
 			{
 Oid			childRelid = part_oids[i];
 Relation	childrel;
+char		child_relkind;
 Oid			child_save_userid;
 int			child_save_sec_context;
 int			child_save_nestlevel;
@@ -1259,6 +1290,7 @@ DefineIndex(Oid relationId,
 bool		found = false;
 
 childrel = table_open(childRelid, lockmode);
+child_relkind = RelationGetForm(childrel)->relkind;
 
GetUserIdAndSecContext(&child_save_userid,
	   &child_save_sec_context);
@@ -1431,9 +1463,25 @@ DefineIndex(Oid relationId,
 	SetUserIdAndSecContext(child_save_userid,
 		   child_save_sec_context);
 }
+else
+{
+	

Re: pg_stat_statements and "IN" conditions

2023-02-02 Thread Dmitry Dolgov
> On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote:
> This appears to have massive conflicts.  Would you please rebase?

Sure, I was already mentally preparing myself to do so in the view of
recent changes in query jumbling. Will post soon.




Re: pg_dump versus hash partitioning

2023-02-02 Thread Tom Lane
Alvaro Herrera  writes:
> ... so for --load-via-partition-root=auto (or
> whatever), we need to ensure that we detect hash partitioning all the
> way down from the topmost to the leaves.

Yeah, that had already occurred to me, which is why I was not feeling
confident about it being an easy hack in pg_dump.

regards, tom lane




Re: run pgindent on a regular basis / scripted manner

2023-02-02 Thread Tom Lane
Andrew Dunstan  writes:
> Sure. There's probably some work that could still be done in this area
> too, such as making pgperltidy work similarly to how we've now made
> pgindent work.

Perhaps.  But before we commit to that, I'd like to see some tweaks to the
pgperltidy rules to make it less eager to revisit the formatting of lines
close to a change.  Its current behavior will induce a lot of "git blame"
noise if we apply these same procedures to Perl code.

(Should I mention reformat-dat-files?)

> There's also a question of timing. Possibly the best time would be when
> we next fork the tree.

Yeah.  We have generally not wanted to do a mass indent except
when there's a minimum amount of pending patches, ie after the last
CF of a cycle.  What I'd suggest is that we plan on doing a mass
indent and then switch over to the new rules, right after the March
CF closes.  That gives us a couple months to nail down and test out
the new procedures before they go live.

regards, tom lane




Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2023-02-02 Thread Andres Freund
Hi,

On 2022-09-20 11:32:02 -0700, Nathan Bossart wrote:
> Note that this change also disallows XMAX_COMMITTED together with
> the special pre-v9.3 locked-only bit pattern that
> HEAP_XMAX_IS_LOCKED_ONLY checks for.  This locked-only bit pattern
> may still be present on servers pg_upgraded from pre-v9.3 versions.

Given that fact, that aspect at least seems to be not viable?

Greetings,

Andres Freund




Re: pg_dump versus hash partitioning

2023-02-02 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-02-01 We 20:03, Tom Lane wrote:
>> Anyway, after re-reading the old thread I wonder if my first instinct
>> (force --load-via-partition-root for enum hash cases only) was the
>> best compromise after all.  I'm not sure how painful it is to get
>> pg_dump to detect such cases, but it's probably possible.

> Given the other problems you enumerated upthread, I'd be more inclined
> to go with your other suggestion of
> "--load-via-partition-root=on/off/auto" (with the default presumably
> "auto").

Hmm ... is there any actual value in "off" in this case?  We can be
just about certain that dump/reload of a hashed enum key will fail.

If we made "auto" also use --load-via-partition-root for range keys
having collation properties, there'd be more of an argument for
letting users override it.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-02 Thread Robert Haas
On Wed, Feb 1, 2023 at 6:14 PM Tom Lane  wrote:
> You waved your arms about inventing some new hashing infrastructure,
> but it was phrased in such a way that it wasn't clear to me if that
> was actually a serious proposal or not.  But if it is: how will you
> get around the fact that any change to hashing behavior will break
> pg_upgrade of existing hash-partitioned tables?  New infrastructure
> avails nothing if it has to be bug-compatible with the old.  So I'm
> not sure how restricting the fix to master helps us.

I think what I'd propose to do is invent a new method of hashing enums
that can be used for hash partitioning on enum columns going forward,
and make it the default for hash partitioning going forward. The
existing method can continue to be used for hash indexes, to avoid
breaking on disk compatibility.

Now, for the back branches, if you wanted to force
--load-via-partition-root specifically for the case of hash
partitioning on an enum column, I think that would be fine. It's a
hack, but there's no way out in the back branches that is not a hack.
What I don't like is the idea of enabling --load-via-partition-root in
the back branches more broadly, e.g. in all cases whatsoever, or in
all cases involving hash partitioning. If we really want to, we can
make --load-via-partition-root the new default categorically in
master, and make the pg_dump option --no-load-via-partition-root. I'm
not convinced that's a good idea, but maybe it is, and you know, major
releases change behavior sometimes, that's how life goes. Minor
releases, though, should minimize behavior changes, IMHO. It's not at
all nice to apply a critical security update and find that pg_dump
works differently to fix a problem you weren't having.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: generic plans and "initial" pruning

2023-02-02 Thread Amit Langote
On Fri, Jan 27, 2023 at 4:01 PM Amit Langote  wrote:
> On Fri, Jan 20, 2023 at 12:52 PM Amit Langote  wrote:
> > Alright, I'll try to get something out early next week.  Thanks for
> > all the pointers.
>
> Sorry for the delay.  Attached is what I've come up with so far.
>
> I didn't actually go with calling the plancache on every lock taken on
> a relation, that is, in ExecGetRangeTableRelation().  One thing about
> doing it that way that I didn't quite like (or didn't see a clean
> enough way to code) is the need to complicate the ExecInitNode()
> traversal for handling the abrupt suspension of the ongoing setup of
> the PlanState tree.

OK, I gave this one more try and attached is what I came up with.

This adds an ExecPlanStillValid(), which is called right after anything
that may in turn call ExecGetRangeTableRelation() which has been
taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in
EState.es_top_eflags.  That includes all ExecInitNode() calls, and a
few other functions that call ExecGetRangeTableRelation() directly,
such as ExecOpenScanRelation().  If ExecPlanStillValid() returns
false, that is, if EState.es_cachedplan is found to have been
invalidated after a lock being taken by ExecGetRangeTableRelation(),
whatever function called it must return immediately and so must its
caller and so on.  ExecEndPlan() seems to be able to clean up after a
partially finished attempt of initializing a PlanState tree in this
way.  Maybe my preliminary testing didn't catch cases where pointers
to resources that are normally put into the nodes of a PlanState tree
are now left dangling, because a partially built PlanState tree is not
accessible to ExecEndPlan; QueryDesc.planstate would remain NULL in
such cases.  Maybe there's only es_tupleTable and es_relations that
needs to be explicitly released and the rest is taken care of by
resetting the ExecutorState context.
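
To make the shape of that concrete, here is a schematic of the pattern as
described above. The attached patch is authoritative; this is only an
illustration, and ExecInitFoo/FooState/scanrelid are placeholders.

static void
ExecInitFoo(FooState *node, EState *estate, int eflags)
{
	/* may take a lock if EXEC_FLAG_GET_LOCKS is set in es_top_eflags */
	node->scanrel = ExecGetRangeTableRelation(estate, scanrelid);

	/* if taking that lock invalidated es_cachedplan, abandon the setup */
	if (!ExecPlanStillValid(estate))
		return;			/* the caller must likewise return immediately */

	/* ... continue with the usual ExecInitNode() work ... */
}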

On testing, I'm afraid we're going to need something like
src/test/modules/delay_execution to test that concurrent changes to
relation(s) in PlannedStmt.relationOids that occur somewhere between
RevalidateCachedQuery() and InitPlan() result in the latter to be
aborted and that it is handled correctly.  It seems like it is only
the locking of partitions (that are not present in an unplanned Query
and thus not protected by AcquirePlannerLocks()) that can trigger
replanning of a CachedPlan, so any tests we write should involve
partitions.  Should this try to test as many plan shapes as possible
though given the uncertainty around ExecEndPlan() robustness or should
manual auditing suffice to be sure that nothing's broken?

On possibly needing to move permission checking to occur *after*
taking locks, I realized that we don't really need to, because no
relation that needs its permissions should be unlocked by the time we
get to ExecCheckPermissions(); note we only check permissions of
tables that are present in the original parse tree and
RevalidateCachedQuery() should have locked those.  I found a couple of
exceptions to that invariant in that views sometimes appear not to be
in the set of relations that RevalidateCachedQuery() locks.  So, I
invented PlannedStmt.viewRelations, a list of RT indexes of view RTEs
that is populated in setrefs.c. ExecLockViewRelations(), called before
ExecCheckPermissions(), locks those.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v32-0001-Move-AcquireExecutorLocks-s-responsibility-into-.patch
Description: Binary data


Re: Use windows VMs instead of windows containers on the CI

2023-02-02 Thread Nazir Bilal Yavuz

Hi,


On 1/12/2023 3:30 AM, Andres Freund wrote:

Yea, there's a problem where packer on windows doesn't seem to abort after a
powershell script errors out. The reason isn't yet quite clear. I think Bilal
is working on a workaround.



That should be fixed now. Also, adding a patch for PG15. There were 
conflicts while applying current patch to the REL_15_STABLE branch.



Regards,
Nazir Bilal Yavuz
Microsoft
From ba88fe8effc1400f08bfa4fcdc44862faf2977d4 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 10 Jan 2023 13:58:39 +0300
Subject: [PATCH v2] [master] Use windows VMs instead of windows containers

---
 .cirrus.yml | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 69837bcd5a..5700b8cd66 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -549,8 +549,10 @@ task:
   depends_on: SanityCheck
   only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || 
$CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*windows.*'
 
-  windows_container:
-image: $CONTAINER_REPO/windows_ci_vs_2019:latest
+  compute_engine_instance:
+image_project: $IMAGE_PROJECT
+image: family/pg-ci-windows-ci-vs-2019
+platform: windows
 cpu: $CPUS
 memory: 4G
 
@@ -589,8 +591,10 @@ task:
   # otherwise it'll be sorted before other tasks
   depends_on: SanityCheck
 
-  windows_container:
-image: $CONTAINER_REPO/windows_ci_mingw64:latest
+  compute_engine_instance:
+image_project: $IMAGE_PROJECT
+image: family/pg-ci-windows-ci-mingw64
+platform: windows
 cpu: $CPUS
 memory: 4G
 
-- 
2.25.1

From bcde952b53c1e14a7eddf39542368c3dd7a97c13 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 2 Feb 2023 16:37:17 +0300
Subject: [PATCH v2] [REL_15] Use windows VMs instead of windows containers

---
 .cirrus.yml | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index ab7043c54e..cf0fe29a14 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -400,8 +400,10 @@ task:
 
   only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || 
$CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*windows.*'
 
-  windows_container:
-image: $CONTAINER_REPO/windows_ci_vs_2019:latest
+  compute_engine_instance:
+image_project: $IMAGE_PROJECT
+image: family/pg-ci-windows-ci-vs-2019
+platform: windows
 cpu: $CPUS
 memory: 4G
 
-- 
2.25.1



Re: Temporary tables versus wraparound... again

2023-02-02 Thread Alvaro Herrera
On 2022-Dec-13, Greg Stark wrote:

> So here I've done it that way. It is a bit of an unfortunate layering
> since it means the heapam_handler is doing the catalog change but it
> does seem inconvenient to pass relfrozenxid etc back up and have the
> caller make the changes when there are no other changes to make.

Are you still at this?  CFbot says the meson tests failed last time for
some reason:
http://commitfest.cputube.org/greg-stark.html

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Non-superuser subscription owners

2023-02-02 Thread Robert Haas
On Wed, Feb 1, 2023 at 3:37 PM Andres Freund  wrote:
> The general point is that IMO is that in many setups you should use a
> user with fewer privileges than a superuser.  It doesn't really matter
> whether we have an ad-hoc restriction for system catalogs. More often
> than not being able to modify other tables will give you a lot of
> privileges too.

I don't know what you mean by this. DML doesn't confer privileges. If
code gets executed and runs with the replication user's credentials,
that could lead to privilege escalation, but just moving rows around
doesn't, at least not in the database sense. It might confer
unanticipated real-world benefits, like if you can update your own
salary or something, but in the context of replication you have to
have had the ability to do that on some other node anyway. If that
change wasn't supposed to get replicated to the local node, then why
are we using replication? Or why is that table in the publication? I'm
confused.

> > The thing I'm struggling to understand is: if you only want to
> > replicate into tables that Alice can write, why not just make Alice
> > own the subscription?
>
> Because it implies that the replication happens as a user that's
> privileged enough to change the configuration of replication.

But again, replication is just about inserting, updating, and deleting
rows. To change the replication configuration, you have to be able to
parlay that into the ability to execute code. That's why I think
trigger security is really important. But I'm wondering if there's
some way we can handle that that doesn't require us to make a decision
about arun-as user. For instance, if firing triggers as the table
owner is an acceptable solution, then the only thing that the run-as
user is actually controlling is which tables we're willing to
replicate into in the first place (unless there's some other way that
logical replication can run arbitrary code). The name almost becomes a
misnomer in that case. It's not a run-as user, it's
use-this-user's-permissions-to-see-if-I-should-fail-replication user.

> > I was thinking a little bit more about that. I
> > still maintain that the current system is poorly set up to make that
> > work, but suppose we wanted to do better. We could add filtering on
> > the subscriber side, like you list schemas or specific relations that
> > you are or are not willing to replicate into.
>
> Isn't that largely a duplication of the ACLs on relations etc?

Yeah, maybe.

> > I'm not quite sure what we do here now, but I agree that trigger
> > firing seems like a problem. It might be that we need to worry about
> > the user on the origin server, too. If Alice inserts a row that causes
> > a replicated table owned by Bob to fire a trigger or evaluate a
> > default expression or whatever due the presence of a subscription
> > owned by Charlie, there is a risk that Alice might try to attack
> > either Bob or Charlie, or that Bob might try to attack Charlie.
>
> The attack on Bob exists without logical replication too - a REINDEX or
> such is executed as the owner of the relation and re-evaluates index
> expressions, constraints etc.  Given our security model I don't think we
> can protect the relation owner if they trust somebody to insert rows, so
> I don't really know what we can do to protect Charlie against Bob.

Yikes.

> > > > And if we suppose that that already works and is safe, well then
> > > > what's the case where I do need a run-as user?
> > >
> > > It's not at all safe today, IMO. You need to trust that nothing bad will
> > > be replicated, otherwise the owner of the subscription has to be
> > > considered compromised.
> >
> > What kinds of things are bad to replicate?
>
> I think that's unfortunately going to be specific to a setup.

Can you give an example?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-02-02 Thread Robert Haas
On Wed, Feb 1, 2023 at 1:09 PM Mark Dilger  wrote:
> > On Feb 1, 2023, at 6:43 AM, Robert Haas  wrote:
> > The thing I'm
> > struggling to understand is: if you only want to replicate into tables
> > that Alice can write, why not just make Alice own the subscription?
> > For a run-as user to make sense, you need a scenario where we want the
> > replication to target only tables that Alice can touch, but we also
> > don't want Alice herself to be able to touch the subscription, so you
> > make Alice the run-as user and yourself the owner, or something like
> > that. But I'm not sure what that scenario is exactly.
>
> This "run-as" idea came about because we didn't want arbitrary roles to be 
> able to change the subscription's connection string.  A competing idea was to 
> have a server object rather than a string, with roles like Alice being able 
> to use the server object if they have been granted usage privilege, and not 
> otherwise.  So the "run-as" and "server" ideas were somewhat competing.

As far as not changing the connection string goes, a few more ideas
have entered the fray: the current patch uses a password_required
property that is modelled on postgres_fdw, and I've elsewhere proposed
a reverse pg_hba.conf.

IMHO, for the use cases that I can imagine, the reverse pg_hba.conf
idea feels better than all competitors, because it's the only idea
that lets you define a class of acceptable connection strings. Jeff's
idea of a separate connection object is fine if you have a specific,
short list of connection strings and you want to allow those and
disallow everything else, and there may be cases where people want
that, and that's fine, but my guess is that it's overly restrictive in
a lot of environments. The password_required property has the virtue
of being compatible with what we do in other places right now, and of
preventing wraparound-to-superuser attacks effectively, but it's
totally unconfigurable and that sucks. The runas user idea gives you
some control over who is allowed to set the connection string, but it
doesn't help you delegate that to a non-superuser, because the idea
there is that you want the non-superuser to be able to set connection
strings that are OK with the actual superuser but not others.

I think part of my confusion here is that I thought that the point of
the runas user was to defend against logical replication itself
changing the connection string, and I don't see how it would do that.
It's just moving rows around. If the point is that somebody who can
log in as the runas user might change the connection string to
something we don't like, that makes somewhat more sense. I think I had
in my head that you wouldn't use someone's actual login role to run
logical replication, but rather some role specifically set up for that
purpose. In that scenario, nobody's running SQL commands as the runas
user, so even if they also own the subscription, there's no way for it
to get modified.

> > Mark was postulating a scenario where the publisher and subscriber
> > don't trust each other. I was thinking a little bit more about that. I
> > still maintain that the current system is poorly set up to make that
> > work, but suppose we wanted to do better. We could add filtering on
> > the subscriber side, like you list schemas or specific relations that
> > you are or are not willing to replicate into. Then you could, for
> > example, connect your subscription to a certain remote publication,
> > but with the restriction that you're only willing to replicate into
> > the "headquarters" schema. Then we'll replicate whatever tables they
> > send us, but if the dorks at headquarters mess up the publications on
> > their end (intentionally or otherwise) and add some tables from the
> > "locally_controlled_stuff" schema, we'll refuse to replicate that into
> > our eponymous schema.
>
> That example is good, though I don't see how "filters" are better than 
> roles+privileges.  Care to elaborate?

I'm not sure that they are. Are we assuming that the user who is
creating subscriptions is also powerful enough to create roles and
give them just the required amount of privilege? If so, it seems like
they might as well just do it that way. And maybe we should assume
that, because in most cases, a dedicated replication role makes more
sense to me than running replication under some role that you're also
using for other things. On the other hand, I bet a lot of people today
are just running replication as a superuser, in which case maybe this
could be useful? This whole idea was mostly just me spitballing to see
what other people thought. I'm not wild about the complexity involved
for what you get out of it, so if we don't need it, that's more than
fine with me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_stat_statements and "IN" conditions

2023-02-02 Thread Alvaro Herrera
This appears to have massive conflicts.  Would you please rebase?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)




Re: run pgindent on a regular basis / scripted manner

2023-02-02 Thread Andrew Dunstan


On 2023-02-02 Th 01:40, Tom Lane wrote:
> Noah Misch  writes:
>> Regarding the concern about a pre-receive hook blocking an emergency push, 
>> the
>> hook could approve every push where a string like "pgindent: no" appears in a
>> commit message within the push.  You'd still want to make the tree clean
>> sometime the same week or so.  It's cheap to provide a break-glass like that.
> I think the real question here is whether we can get all (or at least
> a solid majority of) committers to accept such draconian constraints.
> I'd buy into it, and evidently so would you, but I can't help noting
> that less than a quarter of active committers have bothered to
> comment on this thread.  I suspect the other three-quarters would
> be quite annoyed if we tried to institute such requirements.  That's
> not manpower we can afford to drive away.


I'd be very surprised if this caused any active committer to walk away
from the project. Many will probably appreciate the nudge. But maybe I'm
overoptimistic.


>
> Maybe this should get taken up at the this-time-for-sure developer
> meeting at PGCon?
>
>   


Sure. There's probably some work that could still be done in this area
too, such as making pgperltidy work similarly to how we've now made
pgindent work.


There's also a question of timing. Possibly the best time would be when
we next fork the tree.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





How to write a new tuple into page?

2023-02-02 Thread jack...@gmail.com

Hi, I'm trying to construct a new tuple type that's not a heap tuple.
When I get a TupleTableSlot, I will extract the data from it and then
construct a new tuple; now I need to put that tuple into a physical
page. How should I do that?
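
For context, the generic page API this maps onto looks roughly like the
sketch below (buffer selection, WAL logging and critical sections omitted;
rel, blkno, mytuple and mytuplen are placeholders for whatever the new
tuple format produces):

Buffer		buf = ReadBuffer(rel, blkno);
Page		page;
OffsetNumber offnum;

LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);

offnum = PageAddItem(page, (Item) mytuple, mytuplen,
					 InvalidOffsetNumber, false, false);
if (offnum == InvalidOffsetNumber)
	elog(ERROR, "failed to add tuple to page");

MarkBufferDirty(buf);
UnlockReleaseBuffer(buf);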

--

jack...@gmail.com




Re: Move defaults toward ICU in 16?

2023-02-02 Thread Robert Haas
On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis  wrote:
> If we don't want to nudge users toward ICU, is it because we are
> waiting for something, or is there a lack of consensus that ICU is
> actually better?

Do you think it's better?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WAL Insertion Lock Improvements

2023-02-02 Thread Bharath Rupireddy
On Tue, Jan 24, 2023 at 7:00 PM Bharath Rupireddy
 wrote:
>
> I'm attaching the v3 patch with the above review comments addressed.
> Hopefully, no memory ordering issues now. FWIW, I've added it to CF
> https://commitfest.postgresql.org/42/4141/.
>
> Test results with the v3 patch and insert workload are the same as
> that of the earlier run - TPS starts to scale at higher clients as
> expected after 512 clients and peaks at 2X with 2048 and 4096 clients.
>
> HEAD:
> 1 1380.411086
> 2 1358.378988
> 4 2701.974332
> 8 5925.380744
> 16 10956.501237
> 32 20877.513953
> 64 40838.046774
> 128 70251.744161
> 256 108114.321299
> 512 120478.988268
> 768 99140.425209
> 1024 93645.984364
> 2048 70111.159909
> 4096 55541.804826
>
> v3 PATCHED:
> 1 1493.800209
> 2 1569.414953
> 4 3154.186605
> 8 5965.578904
> 16 11912.587645
> 32 22720.964908
> 64 42001.094528
> 128 78361.158983
> 256 110457.926232
> 512 148941.378393
> 768 167256.590308
> 1024 155510.675372
> 2048 147499.376882
> 4096 119375.457779

I slightly modified the comments and attached the v4 patch for further
review. I also took perf report - there's a clear reduction in the
functions that are affected by the patch - LWLockWaitListLock,
WaitXLogInsertionsToFinish, LWLockWaitForVar and
LWLockConflictsWithVar. Note that I compiled the source code with
-ggdb for capturing symbols for perf, still the benefit stands at > 2X
for a higher number of clients.

HEAD:
+   16.87% 0.01%  postgres  [.] CommitTransactionCommand
+   16.86% 0.00%  postgres  [.] finish_xact_command
+   16.81% 0.01%  postgres  [.] CommitTransaction
+   15.09% 0.20%  postgres  [.] LWLockWaitListLock
+   14.53% 0.01%  postgres  [.] WaitXLogInsertionsToFinish
+   14.51% 0.02%  postgres  [.] LWLockWaitForVar
+   11.70%11.63%  postgres  [.] pg_atomic_read_u32_impl
+   11.66% 0.08%  postgres  [.] pg_atomic_read_u32
+9.96% 0.03%  postgres  [.] LWLockConflictsWithVar
+4.78% 0.00%  postgres  [.] LWLockQueueSelf
+1.91% 0.01%  postgres  [.] pg_atomic_fetch_or_u32
+1.91% 1.89%  postgres  [.] pg_atomic_fetch_or_u32_impl
+1.73% 0.00%  postgres  [.] XLogInsert
+1.69% 0.01%  postgres  [.] XLogInsertRecord
+1.41% 0.01%  postgres  [.] LWLockRelease
+1.37% 0.47%  postgres  [.] perform_spin_delay
+1.11% 1.11%  postgres  [.] spin_delay
+1.10% 0.03%  postgres  [.] exec_bind_message
+0.91% 0.00%  postgres  [.] WALInsertLockRelease
+0.91% 0.00%  postgres  [.] LWLockReleaseClearVar
+0.72% 0.02%  postgres  [.] LWLockAcquire
+0.60% 0.00%  postgres  [.] LWLockDequeueSelf
+0.58% 0.00%  postgres  [.] GetTransactionSnapshot
 0.58% 0.49%  postgres  [.] GetSnapshotData
+0.58% 0.00%  postgres  [.] WALInsertLockAcquire
+0.55% 0.00%  postgres  [.] XactLogCommitRecord

TPS (compiled with -ggdb for capturing symbols for perf)
1 1392.512967
2 1435.899119
4 3104.091923
8 6159.305522
16 11477.641780
32 22701.000718
64 41662.425880
128 23743.426209
256 89837.651619
512 65164.221500
768 66015.733370
1024 56421.223080
2048 52909.018072
4096 40071.146985

PATCHED:
+2.19% 0.05%  postgres  [.] LWLockWaitListLock
+2.10% 0.01%  postgres  [.] LWLockQueueSelf
+1.73% 1.71%  postgres  [.] pg_atomic_read_u32_impl
+1.73% 0.02%  postgres  [.] pg_atomic_read_u32
+1.72% 0.02%  postgres  [.] LWLockRelease
+1.65% 0.04%  postgres  [.] exec_bind_message
+1.43% 0.00%  postgres  [.] XLogInsert
+1.42% 0.01%  postgres  [.] WaitXLogInsertionsToFinish
+1.40% 0.03%  postgres  [.] LWLockWaitForVar
+1.38% 0.02%  postgres  [.] XLogInsertRecord
+0.93% 0.03%  postgres  [.] LWLockAcquireOrWait
+0.91% 0.00%  postgres  [.] GetTransactionSnapshot
+0.91% 0.79%  postgres  [.] GetSnapshotData
+0.91% 0.00%  postgres  [.] WALInsertLockRelease
+0.91% 0.00%  postgres  [.] LWLockReleaseClearVar
+0.53% 0.02%  postgres  [.] ExecInitModifyTable

TPS (compiled with -ggdb for capturing symbols for perf)
1 1295.296611
2 1459.079162
4 2865.688987
8 5533.724983
16 10771.697842
32 20557.499312
64 39436.423783
128 42555.639048
256 73139.060227
512 124649.665196
768 131162.826976
1024 132185.160007
2048 117377.586644
4096 88240.336940


--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 74c5bd8cc4f1497aa7f2fa02c6487039dc91e847 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 2 Feb 2023 03:42:27 +
Subject: [PATCH v4] Optimize WAL insertion lock acquisition and release

This commit optimizes WAL insertion lock acquisition and release
in the following way:

1. The WAL insertion lock's variable insertingAt is currently read and
written under the lwlock's wait list lock to avoid torn reads/writes.
This wait list lock can become a point of contention under highly
concurrent write workloads.
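
For illustration only, a minimal sketch of the direction this takes (the
struct is simplified and the helper names are illustrative, not the actual
code): make insertingAt a 64-bit atomic, so readers and writers no longer
need the wait list lock to avoid torn reads/writes.

#include "access/xlogdefs.h"
#include "port/atomics.h"
#include "storage/lwlock.h"

/* Simplified; the real WALInsertLock has more members. */
typedef struct WALInsertLock
{
	LWLock		lock;
	pg_atomic_uint64 insertingAt;	/* previously XLogRecPtr under wait list lock */
} WALInsertLock;

/* Writer: publish how far this backend intends to insert. */
static inline void
SetInsertingAt(WALInsertLock *l, XLogRecPtr upto)
{
	pg_atomic_write_u64(&l->insertingAt, (uint64) upto);
}

/* Reader (e.g. WaitXLogInsertionsToFinish): no wait list lock needed. */
static inline XLogRecPtr
GetInsertingAt(WALInsertLock *l)
{
	return (XLogRecPtr) pg_atomic_read_u64(&l->insertingAt);
}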

Move defaults toward ICU in 16?

2023-02-02 Thread Jeff Davis


As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

If so, is version 16 the right time to adjust defaults to favor ICU?

  * At build time, default to --with-icu (-Dicu=enabled); users who
don't want ICU can specify --without-icu (-Dicu=disabled/auto)
  * At initdb time, default to --locale-provider=icu if built with
ICU support

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Regards,
Jeff Davis





Re: pg_dump versus hash partitioning

2023-02-02 Thread Andrew Dunstan


On 2023-02-01 We 20:03, Tom Lane wrote:
>
> Anyway, after re-reading the old thread I wonder if my first instinct
> (force --load-via-partition-root for enum hash cases only) was the
> best compromise after all.  I'm not sure how painful it is to get
> pg_dump to detect such cases, but it's probably possible.
>
>   


Given the other problems you enumerated upthread, I'd be more inclined
to go with your other suggestion of
"--load-via-partition-root=on/off/auto" (with the default presumably
"auto").


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] New [relation] option engine

2023-02-02 Thread Alvaro Herrera
On 2023-Feb-02, vignesh C wrote:

> On Wed, 1 Feb 2023 at 14:49, Alvaro Herrera  wrote:

> > Well, no feedback has been given, so I'm not sure this is a great
> > outcome.  In the interest of keeping it alive, I've rebased it.  It
> > turns out that the only conflict is with the 2022 -> 2023 copyright line
> > update.
> 
> Since there was no response to rebase the patch, I was not sure if the
> author was planning to continue. Anyways Nikolay Shaplov plans to work
> on this soon, So I have added this to the next commitfest at [1].

Thank you!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)




Re: run pgindent on a regular basis / scripted manner

2023-02-02 Thread Amit Kapila
On Thu, Feb 2, 2023 at 5:05 PM Dean Rasheed  wrote:
>
> On Thu, 2 Feb 2023 at 06:40, Tom Lane  wrote:
> >
> > Noah Misch  writes:
> > > Regarding the concern about a pre-receive hook blocking an emergency 
> > > push, the
> > > hook could approve every push where a string like "pgindent: no" appears 
> > > in a
> > > commit message within the push.  You'd still want to make the tree clean
> > > sometime the same week or so.  It's cheap to provide a break-glass like 
> > > that.
> >
> > I think the real question here is whether we can get all (or at least
> > a solid majority of) committers to accept such draconian constraints.
> > I'd buy into it, and evidently so would you, but I can't help noting
> > that less than a quarter of active committers have bothered to
> > comment on this thread.  I suspect the other three-quarters would
> > be quite annoyed if we tried to institute such requirements.
> >
>
> I didn't reply until now, but I'm solidly in the camp of committers
> who care about keeping the tree properly indented, and I wouldn't have
> any problem with such a check being imposed.
>
> I regularly run pgindent locally, and if I ever commit without
> indenting, it's either intentional, or because I forgot, so the
> reminder would be useful.
>
> And as someone who runs pgindent regularly, I think this will be a net
> time saver, since I won't have to skip over other unrelated indent
> changes all the time.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: run pgindent on a regular basis / scripted manner

2023-02-02 Thread Thomas Munro
On Fri, Feb 3, 2023 at 12:35 AM Dean Rasheed  wrote:
> And as someone who runs pgindent regularly, I think this will be a net
> time saver, since I won't have to skip over other unrelated indent
> changes all the time.

+1




Re: run pgindent on a regular basis / scripted manner

2023-02-02 Thread Dean Rasheed
On Thu, 2 Feb 2023 at 06:40, Tom Lane  wrote:
>
> Noah Misch  writes:
> > Regarding the concern about a pre-receive hook blocking an emergency push, 
> > the
> > hook could approve every push where a string like "pgindent: no" appears in 
> > a
> > commit message within the push.  You'd still want to make the tree clean
> > sometime the same week or so.  It's cheap to provide a break-glass like 
> > that.
>
> I think the real question here is whether we can get all (or at least
> a solid majority of) committers to accept such draconian constraints.
> I'd buy into it, and evidently so would you, but I can't help noting
> that less than a quarter of active committers have bothered to
> comment on this thread.  I suspect the other three-quarters would
> be quite annoyed if we tried to institute such requirements.
>

I didn't reply until now, but I'm solidly in the camp of committers
who care about keeping the tree properly indented, and I wouldn't have
any problem with such a check being imposed.

I regularly run pgindent locally, and if I ever commit without
indenting, it's either intentional, or because I forgot, so the
reminder would be useful.

And as someone who runs pgindent regularly, I think this will be a net
time saver, since I won't have to skip over other unrelated indent
changes all the time.

Regards,
Dean




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-02 Thread shveta malik
On Wed, Feb 1, 2023 at 5:37 PM Melih Mutlu  wrote:
>
> Hi,
> Please see attached patches.
>
> Thanks,
> --
> Melih Mutlu
> Microsoft

Hi Melih,

A few suggestions on the v10-0002 (Reuse) patch:

1)
for (int64 i = 1; i <= lastusedid; i++)
{
	char		originname_to_drop[NAMEDATALEN] = {0};

	snprintf(originname_to_drop, sizeof(originname_to_drop),
			 "pg_%u_%lld", subid, (long long) i);
	...
}

--Would it be better to use the function
'ReplicationOriginNameForLogicalRep' here instead of snprintf, just to
be consistent everywhere in constructing the origin name?


2)
pa_launch_parallel_worker:
launched = logicalrep_worker_launch(MyLogicalRepWorker->dbid,
									MySubscription->oid,
									MySubscription->name,
									MyLogicalRepWorker->userid,
									InvalidOid,
									dsm_segment_handle(winfo->dsm_seg),
									0);

--Can we please define an 'InvalidRepSlotId' macro and pass it here as the
last argument to make this more readable?
#define InvalidRepSlotId 0
The same applies in ApplyLauncherMain, where we pass 0 as the last argument
to logicalrep_worker_launch.

3)
We are safe to drop the replication trackin origin after this
--typo: trackin -->tracking

4)
process_syncing_tables_for_sync:
if (MyLogicalRepWorker->slot_name &&
	strcmp(syncslotname, MyLogicalRepWorker->slot_name) != 0)
{
	...
	ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
									   MyLogicalRepWorker->relid,
									   originname,
									   sizeof(originname));

	/* Drop replication origin */
	replorigin_drop_by_name(originname, true, false);
}

--Are we passing missing_ok as true (second arg) intentionally here in
replorigin_drop_by_name? Once we fix the issue reported  in my earlier
email (ASSERT), do you think it makes sense to  pass missing_ok as
false here?

5)
process_syncing_tables_for_sync:
foreach(lc, rstates)
{
	rstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState));
	memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));

	/*
	 * Pick the table for the next run if it is not already picked up
	 * by another worker.
	 */
	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
	if (rstate->state != SUBREL_STATE_SYNCDONE &&
		!logicalrep_worker_find(MySubscription->oid, rstate->relid, false))
	{
		...
	}
	LWLockRelease(LogicalRepWorkerLock);
}

--Do we need to palloc for each relation separately? Could we do it once
outside the loop and reuse it (see the sketch below)? Also, pfree is not
done for rstate here.
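
Something like this is what I have in mind (just a sketch based on the
excerpt above):

SubscriptionRelState *rstate;

rstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState));

foreach(lc, rstates)
{
	memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));

	/*
	 * Pick the table for the next run if it is not already picked up
	 * by another worker.
	 */
	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
	if (rstate->state != SUBREL_STATE_SYNCDONE &&
		!logicalrep_worker_find(MySubscription->oid, rstate->relid, false))
	{
		/* ... */
	}
	LWLockRelease(LogicalRepWorkerLock);
}

pfree(rstate);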



6)
LogicalRepSyncTableStart:
1385	slotname = (char *) palloc(NAMEDATALEN);
1413	prev_slotname = (char *) palloc(NAMEDATALEN);
1481	slotname = prev_slotname;
1502	pfree(prev_slotname);
1512	UpdateSubscriptionRel(MyLogicalRepWorker->subid,
1513	                      MyLogicalRepWorker->relid,
1514	                      MyLogicalRepWorker->relstate,
1515	                      MyLogicalRepWorker->relstate_lsn,
1516	                      slotname,
1517	                      originname);

Can you please review the above flow (line numbers given alongside)? I
think it could be problematic: we allocate prev_slotname, assign it to
slotname, free prev_slotname, and then use slotname after prev_slotname
has been freed.
Also, slotname itself is allocated memory that is never freed.

Reviewing further

JFYI, I get the following while applying the patch:


shveta@shveta-vm:~/repos/postgres1/postgres$ git am
~/Desktop/shared/reuse/v10-0002-Reuse-Logical-Replication-Background-worker.patch
Applying: Reuse Logical Replication Background worker
.git/rebase-apply/patch:142: trailing whitespace.
values[Anum_pg_subscription_rel_srrelslotname - 1] =
.git/rebase-apply/patch:692: indent with spaces.
errmsg("could not receive list of slots associated
with the subscription %u, error: %s",
.git/rebase-apply/patch:1055: trailing whitespace.
/*
.git/rebase-apply/patch:1057: trailing whitespace.
 * relations.
.git/rebase-apply/patch:1059: trailing whitespace.
 * and origin. Then stop the worker gracefully.
warning: 5 lines add whitespace errors.
 



thanks
Shveta




Re: Deadlock between logrep apply worker and tablesync worker

2023-02-02 Thread Amit Kapila
On Thu, Feb 2, 2023 at 12:05 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, January 31, 2023 1:07 AM vignesh C  wrote:
> > On Mon, 30 Jan 2023 at 17:30, vignesh C  wrote:
> >
>
> I also tried to test the time of "src/test/subscription/t/002_types.pl"
> before and after the patch(change the lock level) and Tom's patch(split
> transaction) like what Vignesh has shared on -hackers.
>
> I run about 100 times for each case. Tom's and the lock level patch
> behave similarly on my machines[1].
>
> HEAD: 3426 ~ 6425 ms
> HEAD + Tom: 3404 ~ 3462 ms
> HEAD + Vignesh: 3419 ~ 3474 ms
> HEAD + Tom + Vignesh: 3408 ~ 3454 ms
>
> Even apart from the reduction in test time, lowering the lock level and
> locking the specific object also helps reduce lock contention, which the
> user (of the exposed function), the table sync worker, and the apply worker
> can all benefit from. So, I think pushing the patch to change the lock
> level makes sense.
>
> And the patch looks good to me.
>

Thanks for the tests. I also see a reduction in test time variability
with Vignesh's patch. I think we can release the locks in case the
origin is concurrently dropped as in the attached patch. I am planning
to commit this patch tomorrow unless there are more comments or
objections.

> While on it, after pushing the patch, I think there is another case that
> might also be worth improving: the table sync and apply workers try to drop
> the same origin, which can cause some delay. This is another case (different
> from the deadlock), so I feel we can try to improve this in another patch.
>

Right, I think that case could be addressed by Tom's patch to some
extent but I am thinking we should also try to analyze if we can
completely avoid the need to remove origins from both processes. One
idea could be to introduce another relstate something like
PRE_SYNCDONE and set it in a separate transaction before we set the
state as SYNCDONE and remove the slot and origin in tablesync worker.
Now, if the tablesync worker errors out due to some reason during the
second transaction, it can remove the slot and origin after restart by
checking the state. However, it would add another relstate which may
not be the best way to address this problem. Anyway, that can be
accomplished as a separate patch.
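
Just to illustrate the idea, a rough sketch of the flow (the new state name
is hypothetical and this is not a patch; the functions shown are the existing
ones the tablesync worker already uses):

/* Transaction 1: record that cleanup is about to happen. */
StartTransactionCommand();
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
						   MyLogicalRepWorker->relid,
						   SUBREL_STATE_PRE_SYNCDONE,	/* hypothetical */
						   MyLogicalRepWorker->relstate_lsn);
CommitTransactionCommand();

/*
 * Cleanup that may error out; a restarted worker that sees PRE_SYNCDONE
 * simply redoes it.
 */
ReplicationSlotDropAtPubNode(LogicalRepWorkerWalRcvConn, syncslotname, true);
replorigin_drop_by_name(originname, true, false);

/* Transaction 2: only now mark the relation as SYNCDONE. */
StartTransactionCommand();
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
						   MyLogicalRepWorker->relid,
						   SUBREL_STATE_SYNCDONE,
						   MyLogicalRepWorker->relstate_lsn);
CommitTransactionCommand();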

-- 
With Regards,
Amit Kapila.


v4-0001-Optimize-the-origin-drop-functionality.patch
Description: Binary data


Re: Syncrep and improving latency due to WAL throttling

2023-02-02 Thread Jakub Wartak
On Thu, Feb 2, 2023 at 11:03 AM Tomas Vondra
 wrote:

> > I agree that some other concurrent backend's
> > COMMIT could fsync it, but I was wondering if that's sensible
> > optimization to perform (so that issue_fsync() would be called for
> > only commit/rollback records). I can imagine a scenario with 10 such
> > concurrent backends running - all of them with this $thread-GUC set -
> > but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> > HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> > would be wasted close to 400s just due to local fsyncs?). I don't have
> > a strong opinion or in-depth on this, but that smells like IO waste.
> >
>
> Not sure what optimization you mean,

Let me clarify, let's say something like below (on top of the v3) just
to save IOPS:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2340,6 +2340,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
+			bool		openedLogFile = false;
 			if (openLogFile >= 0 &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -2351,9 +2352,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				openLogTLI = tli;
 				openLogFile = XLogFileOpen(openLogSegNo, tli);
 				ReserveExternalFD();
+				openedLogFile = true;
 			}
 
-			issue_xlog_fsync(openLogFile, openLogSegNo, tli);
+			/* can we bypass fsyncing() XLOG from the backend if
+			 * we have been called without commit request?
+			 * usually the feature will be off here (XLogDelayPending=false)
+			 */
+			if (openedLogFile == true || XLogDelayPending == false)
+				issue_xlog_fsync(openLogFile, openLogSegNo, tli);
 		}

Plus maybe some additional logic to ensure that this micro-optimization for
saving IOPS is not applied when the backend is calling XLogFlush()/XLogWrite()
for an actual COMMIT record.

> But I think the backends still have to sleep at some point, so that they
> don't queue too much unflushed WAL - that's kinda the whole point, no?

Yes, but it can be flushed to the standby, and flushed locally but not
fsynced locally (?) - provided that it was not a COMMIT. I'm just wondering
whether that makes sense (Question 1).

> The issue is more about triggering the throttling too early, before we
> hit the bandwidth limit. Which happens simply because we don't have a
> very good way to decide whether the latency is growing, so the patch
> just throttles everything.

The maximum TCP bandwidth seems to fluctuate in the real world, I suppose,
so it couldn't be a hard limit. On the other hand, I can imagine operators
setting "throttle-those-backends-if-global-WAL-latency-or-rate > XXX" as an
administrative decision. That would be cool to have, but yes, it would
require measuring WAL latency and rate first (which on its own would make a
very nice addition to pg_stat_replication). One thing to note is that there
could be many potential latencies (and WAL throughput rates) to consider
(e.g. a quorum of 3 sync standbys having different latencies) - which one
do we choose?

(Question 2) I think we have simply reached a decision point: is the WIP/PoC
good enough as it is (as Andres wanted and you +1'd), should it work as you
propose, or should we keep it as an idea for the future?

-J.




Re: pg_dump versus hash partitioning

2023-02-02 Thread Alvaro Herrera
On 2023-Feb-01, Robert Haas wrote:

> I think you can construct plausible cases where it's not just
> academic. For instance, suppose I intend to use some kind of logical
> replication system, not necessarily the one built into PostgreSQL, to
> replicate data between two systems. Before engaging that system, I
> need to make the initial database contents match. The system is
> oblivious to partitioning, and just replicates each table to a table
> with a matching name.

This only works if that other system's hashing behavior is identical to
Postgres' for hashing that particular enum; there's no other way that
you could make the tables match exactly in the way you propose.  What
this tells me is that it's not really reasonable for users to expect
that this situation would actually work.  It is totally reasonable for
range and list, but not for hash.


If the idea of --load-via-partition-root=auto is going to be the fix for
this problem, then it has to consider that hash partitioning might be at
a level below the topmost one.  For example,

create type colors as enum ('blue', 'red', 'green');
create table topmost (prim int, col colors, a int) partition by range (prim);
create table parent partition of topmost for values from (0) to (1000) partition by hash (col);
create table child1 partition of parent for values with (modulus 3, remainder 0);
create table child2 partition of parent for values with (modulus 3, remainder 1);
create table child3 partition of parent for values with (modulus 3, remainder 2);

If you dump this with --load-via-partition-root, for child1 it'll give you this:

--
-- Data for Name: child1; Type: TABLE DATA; Schema: public; Owner: alvherre
--

COPY public.topmost (prim, col, a) FROM stdin;
\.

which is what we want; so for --load-via-partition-root=auto (or
whatever), we need to ensure that we detect hash partitioning all the
way down from the topmost to the leaves.
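
FWIW, detecting that seems feasible; just as a sketch (not literally what
pg_dump would run, but it shows the catalog information needed), this reports
whether any ancestor of a given leaf uses hash partitioning:

-- does any ancestor of leaf partition 'child1' use hash partitioning?
SELECT EXISTS (
    SELECT 1
    FROM pg_partition_ancestors('child1') AS a(relid)
    JOIN pg_partitioned_table p ON p.partrelid = a.relid
    WHERE p.partstrat = 'h'
) AS has_hash_ancestor;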

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)




Re: Syncrep and improving latency due to WAL throttling

2023-02-02 Thread Tomas Vondra
On 2/1/23 14:40, Jakub Wartak wrote:
> On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
>  wrote:
> 
>>> Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
>>> HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
>>> we are flushing just because of WAL thortting ?) Would that still be
>>> safe?
>>
>> It's not clear to me how could this work and still be safe. I mean, we
>> *must* flush the local WAL first, otherwise the replica could get ahead
>> (if we send unflushed WAL to replica and then crash). Which would be
>> really bad, obviously.
> 
> Well it was just a thought: in this particular test - with no other
> concurrent activity happening - we are fsyncing() uncommitted
> Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
> record comes into play.

Right. I see it as testing (more or less) a worst-case scenario,
measuring impact on commands generating a lot of WAL. I'm not sure the
slowdown comes from the extra fsyncs, though - I'd bet it's more about
the extra waits for confirmations from the replica.

> I agree that some other concurrent backend's
> COMMIT could fsync it, but I was wondering if that's sensible
> optimization to perform (so that issue_fsync() would be called for
> only commit/rollback records). I can imagine a scenario with 10 such
> concurrent backends running - all of them with this $thread-GUC set -
> but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> would be wasted close to 400s just due to local fsyncs?). I don't have
> a strong opinion or in-depth on this, but that smells like IO waste.
> 

Not sure what optimization you mean, but triggering the WAL flushes from
a separate process would be beneficial. But we already do that, more or
less - that's what WAL writer is about, right? Maybe it's not aggressive
enough or something, not sure.

But I think the backends still have to sleep at some point, so that they
don't queue too much unflushed WAL - that's kinda the whole point, no?
The issue is more about triggering the throttling too early, before we
hit the bandwidth limit. Which happens simply because we don't have a
very good way to decide whether the latency is growing, so the patch
just throttles everything.

Consider a replica on a network link with 10ms round trip. Then commit
latency can't really be better than 10ms, and throttling at that point
can't really improve anything, it just makes it slower. Instead, we
should measure the latency somehow, and only throttle when it increases.
And probably make it proportional to the delta (so the higher it's from
the "minimal" latency, the more we'd throttle).

I'd imagine we'd measure the latency (or the wait for the sync replica) over
reasonably short time windows (1/10 of a second?) and use that to drive the
throttling. If the latency is below some acceptable value, don't throttle at
all. If it increases, start throttling.
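
Roughly, as a sketch (every name here is invented; none of this is in the
patch), the decision could look something like:

/*
 * Sample the sync-rep confirmation latency over short windows and throttle
 * only once it rises above an acceptable baseline, with the delay growing
 * in proportion to the excess.  Max/Min are the usual macros from c.h.
 */
typedef struct WalThrottleState
{
	double		floor_latency_ms;	/* best latency seen, ~network RTT */
	double		window_latency_ms;	/* average over the last window */
} WalThrottleState;

static int
WalThrottleDelayMs(const WalThrottleState *s,
				   double acceptable_ms, double max_delay_ms)
{
	double		excess;

	if (s->window_latency_ms <= acceptable_ms)
		return 0;				/* latency is fine, don't throttle at all */

	/* throttle proportionally to how far we are above the baseline */
	excess = s->window_latency_ms - Max(s->floor_latency_ms, acceptable_ms);

	return (int) Min(max_delay_ms, excess);
}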


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_dump versus hash partitioning

2023-02-02 Thread Laurenz Albe
On Wed, 2023-02-01 at 17:49 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Wed, Feb 1, 2023 at 5:08 PM Tom Lane  wrote:
> > > I can agree with that argument for range or list partitioning, where
> > > the partitions have some semantic meaning to the user.  I don't buy it
> > > for hash partitioning.  It was an implementation artifact to begin
> > > with that a given row ended up in partition 3 not partition 11, so why
> > > would users care which partition it ends up in after a dump/reload?
> > > If they think there is a difference between the partitions, they need
> > > education.
> 
> > I see your point. I think it's somewhat valid. However, I also think
> > it muddies the definition of what pg_dump is allowed to do in a way
> > that I do not like. I think there's a difference between the CTID or
> > XMAX of a tuple changing and it ending up in a totally different
> > partition. It feels like it makes the definition of correctness
> > subjective: we do think that people care about range and list
> > partitions as individual entities, so we'll put the rows back where
> > they were and complain if we can't, but we don't think they think
> > about hash partitions that way, so we will err on the side of making
> > the dump restoration succeed. That's a level of guessing what the user
> > meant that I think is uncomfortable.
> 
> I see your point too, and to me it's evidence for the position that
> we should never have done hash partitioning in the first place.

You suggested earlier to deprecate hash partitioning.  That's a bit
much, but I'd say that most use cases of hash partitioning that I can
imagine would involve integers.  We could warn against using hash
partitioning for data types other than numbers and date/time in the
documentation.

I also understand the bad feeling of changing partitions during a
dump/restore, but I cannot think of a better way out.

> What do you think of "--load-via-partition-root=on/off/auto", where
> auto means "not with hash partitions" or the like?

That's perhaps the best way.  So users who know that their hash
partitions won't change and want the small speed benefit can have it.

Yours,
Laurenz Albe




Re: [PATCH] Fix old thinko in formula to compute sweight in numeric_sqrt().

2023-02-02 Thread Dean Rasheed
On Tue, 31 Jan 2023 at 21:59, Joel Jacobson  wrote:
>
> Nice, you managed to simplify it even further.
> I think the comment and the code now are crystal clear together.
>
> I've tested it successfully, test report attached.
>

Cool. Thanks for testing.
Committed.

Regards,
Dean




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-02 Thread shveta malik
On Thu, Feb 2, 2023 at 9:18 AM shveta malik  wrote:
>
>
> Hi Melih,
> I think I am able to identify the root cause. It is not memory
> corruption, but the way origin-names are stored in system-catalog
> mapped to a particular relation-id before even those are created.
>

Apart from the problem mentioned in my earlier email, I think there is
one more issue here, seen with the same assert-triggering test case. The
'lastusedid' stored in the system catalog keeps increasing even when no
slot and origin get created. Two workers worked fine with
max_replication_slots=2; then, since all slots were consumed, the third
worker could not create any slot and exited, but it still incremented
lastusedid. Then another worker came along, incremented lastUsedId in the
system catalog, failed to create a slot, exited, and so on. This makes
lastUsedId grow continuously until you stop the subscriber or free a
previously used slot. If you keep the subscriber running long enough,
lastUsedId will go beyond its limit.
Shouldn't lastUsedId be incremented only after making sure that the worker
has created a slot and origin corresponding to that particular rep_slot_id
(derived from lastUsedId)? Please let me know if my understanding is not
correct.
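
To make the suggestion concrete, a rough sketch of the ordering I have in
mind (the helper that persists the counter is hypothetical; the slot/origin
calls mirror the existing tablesync code):

int64	rep_slot_id = lastusedid + 1;

/* These can ERROR out (e.g. no free replication slots) ... */
walrcv_create_slot(LogicalRepWorkerWalRcvConn, slotname,
				   false /* permanent */ , false /* two_phase */ ,
				   CRS_USE_SNAPSHOT, origin_startpos);
replorigin_create(originname);

/* ... so only now persist the incremented counter in the catalog. */
UpdateSubscriptionLastUsedId(MySubscription->oid, rep_slot_id);	/* hypothetical */
lastusedid = rep_slot_id;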

thanks
Shveta



