Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread Masahiko Sawada
Hi,

On Wed, Mar 15, 2023 at 8:55 AM Peter Smith  wrote:
>
> On Tue, Mar 14, 2023 at 10:43 PM Amit Kapila  wrote:
> >
> > On Tue, Mar 14, 2023 at 12:37 PM Peter Smith  wrote:
> > >
> > > Thanks for the review!
> > >
> > > On Mon, Mar 13, 2023 at 6:19 PM vignesh C  wrote:
> > > ...
> > >
> > > > 4) We check if txn->toptxn is not null twice here both in if condition
> > > > and in the assignment, we could retain the assignment operation as
> > > > earlier to remove the 2nd check:
> > > > -   if (txn->toptxn)
> > > > -   txn = txn->toptxn;
> > > > +   if (isa_subtxn(txn))
> > > > +   txn = get_toptxn(txn);
> > > >
> > > > We could avoid one check again by:
> > > > +   if (isa_subtxn(txn))
> > > > +   txn = txn->toptxn;
> > > >
> > >
> > > Yeah, that is true, but I chose not to keep the original assignment in
> > > this case mainly because then it is consistent with the other changed
> > > code ---  e.g. Every other direct member assignment/access of the
> > > 'toptxn' member now hides behind the macros so I did not want this
> > > single place to be the odd one out. TBH, I don't think 1 extra check
> > > is of any significance, but it is not a problem to change like you
> > > suggested if other people also want it done that way.
> > >
> >
> > Can't we directly use rbtxn_get_toptxn() for this case? I think that
> > way code will look neat. I see that it is not exactly matching the
> > existing check so you might be worried but I feel the new code will
> > achieve the same purpose and will be easy to follow.
> >
>
> OK. Done as suggested.
>

+1 to the idea. Here are some minor comments:

@@ -1667,7 +1658,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn, bool txn_prep
  * about the toplevel xact (we send the XID in all messages), but we never
  * stream XIDs of empty subxacts.
  */
- if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
+ if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
  txn->txn_flags |= RBTXN_IS_STREAMED;

Probably the comment above these lines (quoted below) also needs to be updated?

 * The toplevel transaction, identified by (toptxn==NULL), is marked as
 * streamed always,

---
+/* Is this a top-level transaction? */
+#define rbtxn_is_toptxn(txn)\
+(\
+ (txn)->toptxn == NULL\
+)
+
+/* Is this a subtransaction? */
+#define rbtxn_is_subtxn(txn)\
+(\
+ (txn)->toptxn != NULL\
+)
+
+/* Get the top-level transaction of this (sub)transaction. */
+#define rbtxn_get_toptxn(txn)\
+(\
+ rbtxn_is_subtxn(txn) ? (txn)->toptxn : (txn)\
+)

We need whitespace before the backslashes.
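
For example, the same macros with a space added before each line-continuation
backslash (a whitespace-only change, no functional difference):

/* Is this a top-level transaction? */
#define rbtxn_is_toptxn(txn) \
( \
	(txn)->toptxn == NULL \
)

/* Is this a subtransaction? */
#define rbtxn_is_subtxn(txn) \
( \
	(txn)->toptxn != NULL \
)

/* Get the top-level transaction of this (sub)transaction. */
#define rbtxn_get_toptxn(txn) \
( \
	rbtxn_is_subtxn(txn) ? (txn)->toptxn : (txn) \
)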

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Allow logical replication to copy tables in binary format

2023-03-14 Thread Amit Kapila
On Tue, Mar 14, 2023 at 8:50 PM Takamichi Osumi (Fujitsu)
 wrote:
>
> On Tuesday, March 14, 2023 8:02 PM Melih Mutlu  wrote:
> (3) copy_table()
>
> +   /*
> +* If the publisher version is earlier than v14, it COPY command 
> doesn't
> +* support the binary option.
> +*/
>
> This sentence doesn't look correct grammatically. We can replace "it COPY 
> command" with "subscription" for example. Kindly please fix it.
>

How about something like: "The binary option for replication is
supported since v14."?

-- 
With Regards,
Amit Kapila.




Re: Should vacuum process config file reload more often

2023-03-14 Thread Masahiko Sawada
On Sat, Mar 11, 2023 at 8:11 AM Melanie Plageman
 wrote:
>
> Quotes below are combined from two of Sawada-san's emails.
>
> I've also attached a patch with my suggested current version.
>
> On Thu, Mar 9, 2023 at 10:27 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 10, 2023 at 11:23 AM Melanie Plageman
> >  wrote:
> > >
> > > On Tue, Mar 7, 2023 at 12:10 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Mar 6, 2023 at 5:26 AM Melanie Plageman
> > > >  wrote:
> > > > >
> > > > > On Thu, Mar 2, 2023 at 6:37 PM Melanie Plageman
> > > > > In this version I've removed wi_cost_delay from WorkerInfoData. There 
> > > > > is
> > > > > no synchronization of cost_delay amongst workers, so there is no 
> > > > > reason
> > > > > to keep it in shared memory.
> > > > >
> > > > > One consequence of not updating VacuumCostDelay from wi_cost_delay is
> > > > > that we have to have a way to keep track of whether or not autovacuum
> > > > > table options are in use.
> > > > >
> > > > > This patch does this in a cringeworthy way. I added two global
> > > > > variables, one to track whether or not cost delay table options are in
> > > > > use and the other to store the value of the table option cost delay. I
> > > > > didn't want to use a single variable with a special value to indicate
> > > > > that table option cost delay is in use because
> > > > > autovacuum_vacuum_cost_delay already has special values that mean
> > > > > certain things. My code needs a better solution.
> > > >
> > > > While it's true that wi_cost_delay doesn't need to be shared, it seems
> > > > to make the logic somewhat complex. We need to handle cost_delay in a
> > > > different way from other vacuum-related parameters and we need to make
> > > > sure av[_use]_table_option_cost_delay are set properly. Removing
> > > > wi_cost_delay from WorkerInfoData saves 8 bytes shared memory per
> > > > autovacuum worker but it might be worth considering to keep
> > > > wi_cost_delay for simplicity.
> > >
> > > Ah, it turns out we can't really remove wi_cost_delay from WorkerInfo
> > > anyway because the launcher doesn't know anything about table options
> > > and so the workers have to keep an updated wi_cost_delay that the
> > > launcher or other autovac workers who are not vacuuming that table can
> > > read from when calculating the new limit in autovac_balance_cost().
> >
> > IIUC if any of the cost delay parameters has been set individually,
> > the autovacuum worker is excluded from the balance algorithm.
>
> Ah, yes! That's right. So it is not a problem. Then I still think
> removing wi_cost_delay from the worker info makes sense. wi_cost_delay
> is a double and can't easily be accessed atomically the way
> wi_cost_limit can be.
>
> Keeping the cost delay local to the backends also makes it clear that
> cost delay is not something that should be written to by other backends
> or that can differ from worker to worker. Without table options in the
> picture, the cost delay should be the same for any worker who has
> reloaded the config file.

Agreed.

>
> As for the cost limit safe access issue, maybe we can avoid a LWLock
> acquisition for reading wi_cost_limit by using an atomic similar to what
> you suggested here for "did_rebalance".
>
> > > I've added in a shared lock for reading from wi_cost_limit in this
> > > patch. However, AutoVacuumUpdateLimit() is called unconditionally in
> > > vacuum_delay_point(), which is called quite often (per block-ish), so I
> > > was trying to think if there is a way we could avoid having to check
> > > this shared memory variable on every call to vacuum_delay_point().
> > > Rebalances shouldn't happen very often (done by the launcher when a new
> > > worker is launched and by workers between vacuuming tables). Maybe we
> > > can read from it less frequently?
> >
> > Yeah, acquiring the lwlock for every call to vacuum_delay_point()
> > seems to be harmful. One idea would be to have one sig_atomic_t
> > variable in WorkerInfoData and autovac_balance_cost() set it to true
> > after rebalancing the worker's cost-limit. The worker can check it
> > without locking and update its delay parameters if the flag is true.
>
> Instead of having the atomic indicate whether or not someone (launcher
> or another worker) did a rebalance, it would simply store the current
> cost limit. Then the worker can normally access it with a simple read.
>
> My rationale is that if we used an atomic to indicate whether or not we
> did a rebalance ("did_rebalance"), we would have the same cache
> coherency guarantees as if we just used the atomic for the cost limit.
> If we read from the "did_rebalance" variable and missed someone having
> written to it on another core, we still wouldn't get around to checking
> the wi_cost_limit variable in shared memory, so it doesn't matter that
> we bothered to keep it in shared memory and use a lock to access it.
>
> I noticed we don't allow wi_cost_limit to ever be less than 0, so we
> could store 
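
For reference, a minimal sketch of the lock-free access being discussed,
assuming wi_cost_limit were kept in shared memory as a pg_atomic_uint32 (the
pg_atomic_* calls and the VacuumCostLimit global are existing names; the rest
is illustrative, not code from the patch):

	/* publisher side (launcher or worker, after rebalancing), sketch only */
	pg_atomic_write_u32(&worker->wi_cost_limit, (uint32) new_limit);

	/* reader side, e.g. in AutoVacuumUpdateLimit(), no LWLock needed */
	VacuumCostLimit = (int) pg_atomic_read_u32(&MyWorkerInfo->wi_cost_limit);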

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

2023-03-14 Thread Thomas Munro
On Wed, Mar 15, 2023 at 7:54 AM Nathan Bossart  wrote:
> Here is roughly what I had in mind:
>
> NOTE: Although the delay is specified in microseconds, older Unixen 
> and
> Windows use periodic kernel ticks to wake up, which might increase the
> delay time significantly.  We've observed delay increases as large as
> 20 milliseconds on supported platforms.

Sold.  And pushed.

I couldn't let that 20ms != 1s/100 problem go, despite my claim that I
would, and now I see:  NetBSD does have 10ms resolution, so everyone
can relax, arithmetic still works.  It's just that it always or often
adds on one extra tick, for some strange reason.  So you can measure
20ms, 30ms, ... but never as low as 10ms.  *Shrug*.  Your description
covered that nicely.

https://marc.info/?l=netbsd-current-users&m=144832117108168&w=2

> > (The word "interrupt" is a bit overloaded, which doesn't help with
> > this discussion.)
>
> Yeah, I think it would be clearer if "interrupt" was disambiguated.

OK, I rewrote it to avoid that terminology.

One small detail, after reading Tom's 2019 proposal to do this [1]: he
mentioned SUSv2's ENOSYS error.  I see that SUSv3 (POSIX.1-2001)
dropped that.  Systems that don't have the "timers" option simply
shouldn't define the function, but we already require the "timers"
option for clock_gettime().  And more practically, I know that all our
target systems have it and it works.

Pushed.

[1] https://www.postgresql.org/message-id/4902.1552349...@sss.pgh.pa.us




Re: Fix fseek() detection of unseekable files on WIN32

2023-03-14 Thread Michael Paquier
On Tue, Mar 14, 2023 at 01:26:27PM +0100, Juan José Santamaría Flecha wrote:
> As highlighted in [1], fseek() might fail to report an error even when
> accessing unseekable streams.
> 
> PFA a patch that checks the file type before the actual fseek(), so only
> supported calls are made.

+ * streams, so harden that funcion with our version.
s/funcion/function/.

+extern int pgfseek64(FILE *stream, pgoff_t offset, int origin);
+extern pgoff_t pgftell64(FILE *stream);
+#define fseeko(stream, offset, origin) pgfseek64(stream, offset, origin)
+#define ftello(stream) pgftell64(stream)

What about naming the internal wrappers _pgfseeko64() and
_pgftello64(), located in a new file named win32fseek.c?  It may be
possible that we would need a similar treatment for fseek(), in the
future, though I don't see an issue why this would be needed now.

+   if (GetFileType((HANDLE) _get_osfhandle(_fileno(stream))) != FILE_TYPE_DISK)
+   {
+   errno = ESPIPE;
+   return -1;
+   }
Shouldn't there be cases where we should return EINVAL for some of the
other types, like FILE_TYPE_REMOTE or FILE_TYPE_UNKNOWN?  We should
return ESPIPE only for FILE_TYPE_PIPE and FILE_TYPE_CHAR, then?
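
Something along these lines, perhaps (a sketch of that suggestion, not the
proposed patch; the wrapper name is illustrative):

static bool
pg_win32_is_seekable(FILE *stream)
{
	switch (GetFileType((HANDLE) _get_osfhandle(_fileno(stream))))
	{
		case FILE_TYPE_DISK:
			return true;		/* seekable, proceed with the real fseek */
		case FILE_TYPE_PIPE:
		case FILE_TYPE_CHAR:
			errno = ESPIPE;		/* pipes and character devices cannot seek */
			return false;
		default:
			/* FILE_TYPE_REMOTE, FILE_TYPE_UNKNOWN, or GetFileType() failure */
			errno = EINVAL;
			return false;
	}
}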
--
Michael


signature.asc
Description: PGP signature


Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Nathan Bossart
+   sleep = strtod(opt, &opt_end);
+   if (sleep < 0 || *opt_end || errno == ERANGE)

Should we set errno to 0 before calling strtod()?
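
That is, something like the following sketch (variable names taken from the
quoted snippet; the surrounding error handling is illustrative):

	errno = 0;					/* strtod() sets errno on overflow but never clears it */
	sleep = strtod(opt, &opt_end);
	if (sleep < 0 || *opt_end || errno == ERANGE)
	{
		pg_log_error("\\watch: invalid delay interval: \"%s\"", opt);
		return false;
	}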

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: recovery modules

2023-03-14 Thread Nathan Bossart
I noticed that the new TAP test for basic_archive was failing
intermittently for cfbot.  It looks like the query for checking that the
post-backup WAL is restored sometimes executes before archive recovery is
complete (because hot_standby is on).  To fix this, I adjusted the test to
use poll_query_until instead.  There are no other changes in v14.

I first tried to set hot_standby to off on the restored node so that the
query wouldn't run until archive recovery completed.  This seemed like it
would work because start() uses "pg_ctl --wait", which has the following
note in the docs:

Startup is considered complete when the PID file indicates that the
server is ready to accept connections.

However, that's not what happens when hot_standby is off.  In that case,
the postmaster.pid file is updated with PM_STATUS_STANDBY once recovery
starts, which wait_for_postmaster_start() interprets as "ready."  I see
this was reported before [0], but that discussion fizzled out.  IIUC it was
done this way to avoid infinite waits when hot_standby is off and standby
mode is enabled.  I could be missing something obvious, but that doesn't
seem necessary when hot_standby is off and recovery mode is enabled because
recovery should end at some point (never mind the halting problem).  I'm
still digging into this and may spin off a new thread if I can conjure up a
proposal.

[0] 
https://postgr.es/m/CAMkU%3D1wrMqPggnEfszE-c3PPLmKgRK17_qr7tmxBECYEbyV-4Q%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From ef9aade7270d12104647439a99e3b1822393a318 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 23 Feb 2023 14:27:48 -0800
Subject: [PATCH v14 1/7] Move extra code out of the Pre/PostRestoreCommand()
 block.

If SIGTERM is received within this block, the startup process will
immediately proc_exit() in the signal handler, so it is inadvisable
to include any more code than is required in this section.  This
change moves the code recently added to this block (see 1b06d7b and
7fed801) to outside of the block.  This ensures that only system()
is called while proc_exit() might be called in the SIGTERM handler,
which is how this code worked from v8.4 to v14.
---
 src/backend/access/transam/xlogarchive.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index fcc87ff44f..41684418b6 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -159,20 +159,27 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 			 xlogRestoreCmd)));
 
+	fflush(NULL);
+	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
+
 	/*
-	 * Check signals before restore command and reset afterwards.
+	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
+	 * that it should proc_exit() right away.  This is done for the duration of
+	 * the system() call because there isn't a good way to break out while it
+	 * is executing.  Since we might call proc_exit() in a signal handler, it
+	 * is best to put any additional logic before or after the
+	 * PreRestoreCommand()/PostRestoreCommand() section.
 	 */
 	PreRestoreCommand();
 
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 	rc = system(xlogRestoreCmd);
-	pgstat_report_wait_end();
 
 	PostRestoreCommand();
+
+	pgstat_report_wait_end();
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
-- 
2.25.1

>From 78fdf06e4b1bbb37c8ae37937728aaeb4b2cf50b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 14 Feb 2023 09:44:53 -0800
Subject: [PATCH v14 2/7] Don't proc_exit() in startup's SIGTERM handler if
 forked by system().

Instead, emit a message to STDERR and _exit() in this case.  This
change also adds assertions to proc_exit(), ProcKill(), and
AuxiliaryProcKill() to verify that these functions are not called
by a process forked by system(), etc.
---
 src/backend/postmaster/startup.c | 17 -
 src/backend/storage/ipc/ipc.c|  3 +++
 src/backend/storage/lmgr/proc.c  |  2 ++
 src/backend/utils/error/elog.c   | 28 
 src/include/utils/elog.h |  6 +-
 5 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..0e7de26bc2 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"
 
+#include 
+
 #include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -121,7 +123,20 @@ StartupProcShutdownHandler(SIGNAL_ARGS)
 	int			save_errno = errno;
 
 	if (in_restore_command)
-		proc_exit(1);
+	{
+		/*
+		 * If we are in a child process (e.g., forked by system() in
+		 * RestoreArchivedFile()), we don't want to 

Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Michael Paquier
On Tue, Mar 14, 2023 at 08:20:23PM -0700, Andrey Borodin wrote:
> PFA v8. Thanks!

Looks OK to me.  I've looked as well at resetting query_buffer on
failure, which I guess is better this way because this is an
accumulation of the previous results, right?
--
Michael


signature.asc
Description: PGP signature


Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32

2023-03-14 Thread Michael Paquier
On Tue, Mar 14, 2023 at 01:47:09PM +0100, Juan José Santamaría Flecha wrote:
> I have just posted a patch to enforce the detection of unseekable streams
> in the fseek() calls [1], please feel free to review it.

Thanks.  I have been able to get around 0001 to fix _pgfstat64() and
applied it down to v14 where this code has been introduced.  Now to
the part about fseek() and ftello()..
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Amit Kapila
On Tue, Mar 14, 2023 at 3:18 PM Önder Kalacı  wrote:
>>

Pushed this patch but forgot to add a new testfile. Will do that soon.


-- 
With Regards,
Amit Kapila.




Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Andrey Borodin
On Tue, Mar 14, 2023 at 6:25 PM Michael Paquier  wrote:
>
> On Wed, Mar 15, 2023 at 10:19:28AM +0900, Kyotaro Horiguchi wrote:
> > I hesitated to propose such a level of simplification, but basically I
> > was also thinking the same thing.
+1

> Okay, fine by me to use one single message.  I'd rather still keep the
> three tests, though, as they check the three conditions upon which the
> error would be triggered.

PFA v8. Thanks!

Best regards, Andrey Borodin.


v8-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


v8-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-14 Thread Masahiko Sawada
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
 wrote:
>
> I wrote:
>
> > > > Since the block-level measurement is likely overestimating quite a bit, 
> > > > I propose to simply reverse the order of the actions here, effectively 
> > > > reporting progress for the *last page* and not the current one: First 
> > > > update progress with the current memory usage, then add tids for this 
> > > > page. If this allocated a new block, only a small bit of that will be 
> > > > written to. If this block pushes it over the limit, we will detect that 
> > > > up at the top of the loop. It's kind of like our earlier attempts at a 
> > > > "fudge factor", but simpler and less brittle. And, as far as OS pages 
> > > > we have actually written to, I think it'll effectively respect the 
> > > > memory limit, at least in the local mem case. And the numbers will make 
> > > > sense.
> > > >
> > > > Thoughts?
> > >
> > > It looks to work but it still doesn't work in a case where a shared
> > > tidstore is created with a 64kB memory limit, right?
> > > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > > from the beginning.
> >
> > I have two ideas:
> >
> > 1. Make it optional to track chunk memory space by a template parameter. It 
> > might be tiny compared to everything else that vacuum does. That would 
> > allow other users to avoid that overhead.
> > 2. When context block usage exceeds the limit (rare), make the additional 
> > effort to get the precise usage -- I'm not sure such a top-down facility 
> > exists, and I'm not feeling well enough today to study this further.
>
> Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at 
> now. Unfortunately, if I had spent five minutes reminding myself what the 
> original objections were to this approach, I could have saved us some effort. 
> Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow 
> [1], and fragmentation [2] (leading to underestimation).
>
> In v31, in the local case at least, the underestimation is actually worse 
> than tracking chunk space, since it ignores chunk header and alignment.  I'm 
> not sure about the DSA case. This doesn't seem great.

Right.

>
> It shouldn't be a surprise why a simple increment of raw allocation size is 
> comparable in speed -- GetMemoryChunkSpace() calls the right function through 
> a pointer, which is slower. If we were willing to underestimate for the sake 
> of speed, that takes away the reason for making memory tracking optional.
>
> Further, if the option is not specified, in v31 there is no way to get the 
> memory use at all, which seems odd. Surely the caller should be able to ask 
> the context/area, if it wants to.

There are precedents that don't provide a way to return memory usage,
such as simplehash.h and dshash.c.

>
> I still like my idea at the top of the page -- at least for vacuum and m_w_m. 
> It's still not completely clear if it's right but I've got nothing better. It 
> also ignores the work_mem issue, but I've given up anticipating all future 
> cases at the moment.
>

What do you mean by "the precise usage" in your idea? Quoting from
the email you referred to, Andres said:

---
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.

Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
---

IIUC he suggested measuring memory usage at the block level in order to
count blocks that are not actually freed even though some of their chunks
have been freed. That's why we used MemoryContextMemAllocated(). On the
other hand, you recently pointed out [1]:

---
I think we're trying to solve the wrong problem here. I need to study
this more, but it seems that code that needs to stay within a memory
limit only needs to track what's been allocated in chunks within a
block, since writing there is what invokes a page fault.
---

IIUC you suggested measuring memory usage by tracking how much memory is
allocated in chunks within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.
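
For reference, a minimal sketch of the two accounting approaches being
contrasted here (MemoryContextMemAllocated() and GetMemoryChunkSpace() are the
existing APIs; the wrapper functions are illustrative):

/*
 * Block-level accounting: counts whole blocks, including space not yet
 * carved into chunks, so it can overestimate what has actually been written.
 */
static Size
block_level_usage(MemoryContext cxt)
{
	return MemoryContextMemAllocated(cxt, true);
}

/*
 * Chunk-level accounting: add up each allocation as it is made; this counts
 * chunk headers and alignment padding, but not unused space within blocks.
 */
static Size
chunk_level_usage(Size running_total, void *chunk)
{
	return running_total + GetMemoryChunkSpace(chunk);
}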

> I'll put this item and a couple other things together in a separate email 
> tomorrow.

Thanks!

Regards,

[1] 
https://www.postgresql.org/message-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y%3Dq6ow%3DLTzyDvA%40mail.gmail.com


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-14 Thread Andres Freund
Hi,

On 2023-03-15 09:56:10 +0900, Michael Paquier wrote:
> On Tue, Mar 14, 2023 at 02:54:40PM -0700, Andres Freund wrote:
> >  Object description
> >  ---
> >   function pg_get_wal_record_info(pg_lsn)
> > - function pg_get_wal_records_info(pg_lsn,pg_lsn)
> >   function pg_get_wal_records_info_till_end_of_wal(pg_lsn)
> > - function pg_get_wal_stats(pg_lsn,pg_lsn,boolean)
> > + function pg_get_wal_records_info(pg_lsn,pg_lsn)
> >   function pg_get_wal_stats_till_end_of_wal(pg_lsn,boolean)
> > + function pg_get_wal_stats(pg_lsn,pg_lsn,boolean)
> >  (5 rows)
> > 
> >  -- Make sure checkpoints don't interfere with the test.
> > 
> > Looks like it's missing an ORDER BY.
> 
> Interesting.  This is "\dx+ pg_walinspect".
> listOneExtensionContents() uses pg_describe_object() for that, and
> there is already an ORDER BY based on it.  I would not have expected
> this part to be that much sensitive.  Is this using a specific ICU
> collation, because this is a side-effect of switching ICU as the
> default in initdb?

It's using ICU, but not a specific collation. The build I linked to is WIP
hackery to add ICU support to windows CI. Here's the initdb output:
https://api.cirrus-ci.com/v1/artifact/task/6288336663347200/testrun/build/testrun/pg_walinspect/regress/log/initdb.log

The database cluster will be initialized with this locale configuration:
  provider:icu
  ICU locale:  en_US
  LC_COLLATE:  English_United States.1252
  LC_CTYPE:English_United States.1252
  LC_MESSAGES: English_United States.1252
  LC_MONETARY: English_United States.1252
  LC_NUMERIC:  English_United States.1252
  LC_TIME: English_United States.1252
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

For comparison, here's a recent CI run (which also failed on windows, but for
unrelated reasons), without ICU:
https://api.cirrus-ci.com/v1/artifact/task/6478925920993280/testrun/build/testrun/pg_walinspect/regress/log/initdb.log

The database cluster will be initialized with locale "English_United 
States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".


> As a solution, this could use pg_identify_object(classid, objid, 0) in
> the ORDER BY clause to enforce a better ordering of the objects dealt
> with as it decomposes the object name and the object type.  That
> should be enough, I assume, as it looks to be parenthesis vs
> underscore that switch the order.


Greetings,

Andres Freund




Re: Add pg_walinspect function with block info columns

2023-03-14 Thread Peter Geoghegan
On Tue, Mar 14, 2023 at 5:34 PM Melanie Plageman
 wrote:
> On Tue, Mar 14, 2023 at 6:57 PM Peter Geoghegan  wrote:
> > Why doesn't it already work like this? Why do we need a separate
> > pg_get_wal_block_info() function at all?
>
> Well, I think if you only care about the WAL record-level information
> and not the block-level information, having the WAL record information
> denormalized like that with all the block information would be a
> nuisance.

I generally care about both. When I want to look at things at the
pg_get_wal_records_info() level (as opposed to a summary), the
block_ref information is *always* of primary importance. I don't want
to have to write my own bug-prone parser for block_ref, but why should
the only alternative be joining against pg_get_wal_block_info()? The
information that I'm interested in is "close at hand" to
pg_get_wal_records_info() already.

I understand that in the general case there might be quite a few
blocks associated with a WAL record. For complicated cases,
pg_get_wal_block_info() does make sense. However, the vast majority of
individual WAL records (and possibly most WAL record types) are
related to one block only. One block that is generally from the
relation's main fork.

> But, perhaps you are suggesting a parameter to pg_get_wal_records_info()
> like "with_block_info" or something, which produces the full
> denormalized block + record output?

I was thinking of something like that, yes -- though it wouldn't
necessarily have to be the *full* denormalized block_ref info, the FPI
itself, etc. Just the more useful stuff.

It occurs to me that my concern about the information that
pg_get_wal_records_info() lacks could be restated as a concern about
what pg_get_wal_block_info() lacks: pg_get_wal_block_info() fails to
show basic information about the WAL record whose blocks it reports
on, even though it could easily show all of the
pg_get_wal_records_info() info once per block (barring block_ref). So
addressing my concern by adjusting pg_get_wal_block_info() might be
the best approach. I'd probably be happy with that -- I'd likely just
stop using pg_get_wal_records_info() completely under this scheme.

Overall, I'm concerned that we may have missed the opportunity to make
simple things easier. Again, wanting to see (say) all of the PRUNE
records and VACUUM records with an "order by relfilenode,
block_number, lsn" seems likely to be a very common requirement to me.
It's exactly the kind of thing that you'd expect an SQL interface to
make easy.

--
Peter Geoghegan




Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Michael Paquier
On Wed, Mar 15, 2023 at 10:19:28AM +0900, Kyotaro Horiguchi wrote:
> I hesitated to propose such a level of simplification, but basically I
> was also thinking the same thing.

Okay, fine by me to use one single message.  I'd rather still keep the
three tests, though, as they check the three conditions upon which the
error would be triggered.
--
Michael


signature.asc
Description: PGP signature


Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Kyotaro Horiguchi
At Tue, 14 Mar 2023 12:03:00 -0700, Nathan Bossart  
wrote in 
> On Tue, Mar 14, 2023 at 01:58:59PM +0900, Kyotaro Horiguchi wrote:
> > +   if (*opt_end)
> > +   pg_log_error("\\watch: incorrect 
> > interval value '%s'", opt);
> > +   else if (errno == ERANGE)
> > +   pg_log_error("\\watch: out-of-range 
> > interval value '%s'", opt);
> > +   else
> > +   pg_log_error("\\watch: interval value 
> > '%s' less than zero", opt);
> > 
> > I'm not sure if we need error messages for that resolution and I'm a
> > bit happier to have fewer messages to translate:p. Merging the cases
> > of ERANGE and negative values might be better. And I think we usually
> > refer to unparsable input as "invalid".
> > 
> > if (*opt_end)
> >pg_log_error("\\watch: invalid interval value '%s'", opt);
> > else
> >pg_log_error("\\watch: interval value '%s' out of range", opt);
> 
> +1, I don't think it's necessary to complicate these error messages too
> much.  This code hasn't reported errors for nearly 10 years, and I'm not
> aware of any complaints.  I still think we could simplify this to "\watch:
> invalid delay interval: %s" and call it a day.

I hesitated to propose such a level of simplification, but basically I
was also thinking the same thing.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Making background psql nicer to use in tap tests

2023-03-14 Thread Andres Freund
Hi,

On 2023-03-14 21:24:32 +0100, Daniel Gustafsson wrote:
> > On 31 Jan 2023, at 01:00, Andres Freund  wrote:
> 
> > I've hacked some on this. I first tried to just introduce a few helper
> > functions in Cluster.pm, but that ended up being awkward. So I bit the 
> > bullet
> > and introduced a new class (in BackgroundPsql.pm), and made 
> > background_psql()
> > and interactive_psql() return an instance of it.
> 
> Thanks for working on this!

Thanks for helping it move along :)


> > This is just a rough prototype. Several function names don't seem great, it
> > needs POD documentation, etc.
> 
> It might be rough around the edges but I don't think it's too far off a state
> in which it can be committed, given that it's replacing something even 
> rougher.
> With documentation and some polish I think we can iterate on it in the tree.

Cool.


> > I don't quite like the new interface yet:
> > - It's somewhat common to want to know if there was a failure, but also get
> >  the query result, not sure what the best function signature for that is in
> >  perl.
> 
> What if query() returns a list with the return value last?  The caller will 
> get
> the return value when assigning a single var as the return, and can get both 
> in
> those cases when it's interesting.  That would make for reasonably readable
> code in most places?

>$ret_val = $h->query("SELECT 1;");
>($query_result, $ret_val) = $h->query("SELECT 1;");

I hate perl.


> Returning a hash seems like a worse option since it will complicate callsites
> which only want to know success/failure.

Yea. Perhaps it's worth having a separate function for this? ->query_rc() or 
such?



> > - right now there's a bit too much logic in background_psql() /
> >  interactive_psql() for my taste
> 
> Not sure what you mean, I don't think they're especially heavy on logic?

-EMISSINGWORD on my part. A bit too much duplicated logic.


> +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up,
> +which can modified later.
> 
> This requires a bit of knowledge about the internals which I think we should
> hide in this new API.  How about providing a function for defining the 
> timeout?

"definining" in the sense of accessing it? Or passing one in?


> Re timeouts: one thing I've done repeatedly is to use short timeouts and reset
> them per query, and that turns pretty ugly fast.  I hacked up your patch to
> provide $h->reset_timer_before_query() which then injects a {timeout}->start
> before running each query without the caller having to do it.  Not sure if I'm
> alone in doing that but if not I think it makes sense to add.

I don't quite understand the use case, but I don't mind it as a functionality.

Greetings,

Andres Freund




Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-14 Thread Michael Paquier
On Tue, Mar 14, 2023 at 10:35:43AM +0530, Bharath Rupireddy wrote:
> My thoughts are simple here - how would one (an end user, not me and
> not you) figure out how to get info/stats till the end of WAL? I'm
> sure it would be difficult to find that out without looking at the
> code or commit history. Search for till end of WAL behaviour with new
> version will be more given the 1.0 version has explicit functions to
> do that. IMO, there's no harm in being explicit in how to achieve till
> end of WAL functionality around in the docs.

Okay.  I have kept these notes, but tweaked the wording to be a bit
cleaner, replacing the term "till" by "until".  To my surprise while
studying this point, "till" is a term older than "until" in the English
language, but it is rarely used.

> I get it. I divided the patches to 0001 and 0002 with 0001 focussing
> on the change of behaviour around future end LSNs, dropping till end
> of WAL functions and tests tweakings related to it. 0002 has all other
> tests tidy up things.
> 
> Please find the attached v8 patch set for further review.

The tests of 0001 were still too complex IMO.  The changes can be much
simpler, as they only require moving the till_end_of_wal() calls from
pg_walinspect.sql to oldextversions.sql.  Nothing more.
--
Michael


signature.asc
Description: PGP signature


Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-14 Thread Melanie Plageman
Thanks for the review!

On Sat, Mar 11, 2023 at 2:16 PM Justin Pryzby  wrote:
>
> On Sat, Mar 11, 2023 at 09:55:33AM -0500, Melanie Plageman wrote:
> > Subject: [PATCH v3 2/3] use shared buffers when failsafe active
> >
> > + /*
> > +  * Assume the caller who allocated the memory for the
> > +  * BufferAccessStrategy object will free it.
> > +  */
> > + vacrel->bstrategy = NULL;
>
> This comment could use elaboration:
>
> ** VACUUM normally restricts itself to a small ring buffer; but in
> failsafe mode, in order to process tables as quickly as possible, allow
> leaving behind a large number of dirty buffers.

Agreed. It definitely needs a comment like this. I will update in the
next version along with addressing your other feedback (after sorting
out some of the other points in this mail on which I still have
questions).

> > Subject: [PATCH v3 3/3] add vacuum option to specify ring_size and guc
>
> >  #define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var
> > +#define bufsize_limit_to_nbuffers(bufsize) (bufsize * 1024 / BLCKSZ)
>
> Macros are normally capitalized

Yes, there doesn't seem to be a great amount of consistency around
this... See pgstat.c read_chunk_s and bufmgr.c BufHdrGetBlock and
friends. Though there are probably more capitalized than not. Since it
does a bit of math and returns a value, I wanted to convey that it was
more like a function. Also, since the name was long, I thought all-caps
would be hard to read. However, if you or others feel strongly, I am
attached neither to the capitalization nor to the name at all (what do
you think of the name?).

> It's a good idea to write "(bufsize)", in case someone passes "a+b".

Ah yes, this is a good idea -- I always miss at least one set of
parentheses when writing a macro. In this case, I didn't think of it
since the current caller couldn't pass an expression.
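
For example, the parenthesized form and the kind of call it protects against:

#define bufsize_limit_to_nbuffers(bufsize) (((bufsize) * 1024) / BLCKSZ)

/*
 * Without the parentheses around the argument, a call such as
 * bufsize_limit_to_nbuffers(a + b) would expand to (a + b * 1024 / BLCKSZ)
 * and silently compute the wrong value.
 */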

> > @@ -586,6 +587,45 @@ GetAccessStrategy(BufferAccessStrategyType btype)
> > +BufferAccessStrategy
> > +GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size)
>
> Maybe it would make sense for GetAccessStrategy() to call
> GetAccessStrategyWithSize().  Or maybe not.

You mean instead of having anyone call GetAccessStrategyWithSize()?
We would need to change the signature of GetAccessStrategy() then -- and
there are quite a few callers.
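
One way to read that suggestion without touching the existing callers would be
for GetAccessStrategy() to keep its current signature and simply forward to the
new function with per-type defaults, roughly like this (a sketch only; it
assumes GetAccessStrategyWithSize() takes a size in kilobytes, and the sizes
shown are the long-standing defaults):

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
	int			ring_size_kb;

	switch (btype)
	{
		case BAS_NORMAL:
			return NULL;		/* no ring: use ordinary shared buffers */
		case BAS_BULKREAD:
		case BAS_VACUUM:
			ring_size_kb = 256;
			break;
		case BAS_BULKWRITE:
			ring_size_kb = 16 * 1024;
			break;
		default:
			elog(ERROR, "unrecognized buffer access strategy: %d",
				 (int) btype);
			return NULL;		/* keep compiler quiet */
	}

	return GetAccessStrategyWithSize(btype, ring_size_kb);
}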

>
> > + {"vacuum_buffer_usage_limit", PGC_USERSET, RESOURCES_MEM,
> > + gettext_noop("Sets the buffer pool size for 
> > operations employing a buffer access strategy."),
>
> The description should mention vacuum, if that's the scope of the GUC's
> behavior.

Good catch. Will update in next version.

> > +#vacuum_buffer_usage_limit = -1 # size of vacuum buffer access strategy 
> > ring.
> > + # -1 to use default,
> > + # 0 to disable vacuum buffer access strategy 
> > and use shared buffers
> > + # > 0 to specify size
>
> If I'm not wrong, there's still no documentation about "ring buffers" or
> postgres' "strategy".  Which seems important to do for this patch, along
> with other documentation.

Yes, it is. I have been thinking about where in the docs to add it (the
docs about buffer access strategies). Any ideas?

> This patch should add support in vacuumdb.c.

Oh, I had totally forgotten about vacuumdb.

> And maybe a comment about adding support there, since it's annoying
> when the release notes one year say "support VACUUM (FOO)" and then
> one year later say "support vacuumdb --foo".

I'm not sure I totally follow. Do you mean to add a comment to
ExecVacuum() saying to add support to vacuumdb when adding a new option
to vacuum?

In the past, did people forget to add support to vacuumdb for new vacuum
options or did they forget to document that they did that or did they
forget to include that they did that in the release notes?

- Melanie




Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-14 Thread Michael Paquier
On Tue, Mar 14, 2023 at 02:54:40PM -0700, Andres Freund wrote:
>  Object description
>  ---
>   function pg_get_wal_record_info(pg_lsn)
> - function pg_get_wal_records_info(pg_lsn,pg_lsn)
>   function pg_get_wal_records_info_till_end_of_wal(pg_lsn)
> - function pg_get_wal_stats(pg_lsn,pg_lsn,boolean)
> + function pg_get_wal_records_info(pg_lsn,pg_lsn)
>   function pg_get_wal_stats_till_end_of_wal(pg_lsn,boolean)
> + function pg_get_wal_stats(pg_lsn,pg_lsn,boolean)
>  (5 rows)
> 
>  -- Make sure checkpoints don't interfere with the test.
> 
> Looks like it's missing an ORDER BY.

Interesting.  This is "\dx+ pg_walinspect".
listOneExtensionContents() uses pg_describe_object() for that, and
there is already an ORDER BY based on it.  I would not have expected
this part to be that much sensitive.  Is this using a specific ICU
collation, because this is a side-effect of switching ICU as the
default in initdb?

As a solution, this could use pg_identify_object(classid, objid, 0) in
the ORDER BY clause to enforce a better ordering of the objects dealt
with as it decomposes the object name and the object type.  That
should be enough, I assume, as it looks to be parenthesis vs
underscore that switch the order.
--
Michael


signature.asc
Description: PGP signature


Re: Add pg_walinspect function with block info columns

2023-03-14 Thread Melanie Plageman
On Tue, Mar 14, 2023 at 6:57 PM Peter Geoghegan  wrote:
>
> On Tue, Mar 14, 2023 at 3:34 PM Melanie Plageman
>  wrote:
> > After patching master to add in the columns from
> > pg_get_wal_records_info() which are not returned by
> > pg_get_wal_block_info() (except block_ref column of course), this query:
> >
> >   SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn);
> >
> > took 467 ms.
> >
> > Perhaps this difference isn't important, but I found it noticeable.
>
> This seems weird to me too. It's not so much the performance overhead
> that bothers me (though that's not great either). It seems *illogical*
> to me. The query you end up writing must do two passes over the WAL
> records, but its structure almost suggests that it's necessary to do
> two separate passes over distinct "streams".
>
> Why doesn't it already work like this? Why do we need a separate
> pg_get_wal_block_info() function at all?

Well, I think if you only care about the WAL record-level information
and not the block-level information, having the WAL record information
denormalized like that with all the block information would be a
nuisance.

But, perhaps you are suggesting a parameter to pg_get_wal_records_info()
like "with_block_info" or something, which produces the full
denormalized block + record output?

- Melanie




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-14 Thread Melanie Plageman
Thanks for your interest in this patch!

On Mon, Mar 13, 2023 at 8:38 AM Ants Aasma  wrote:
>
> On Sat, 11 Mar 2023 at 16:55, Melanie Plageman
>  wrote:
> >
> > > On Tue, Feb 28, 2023 at 3:16 AM Bharath Rupireddy
> > >  wrote:
> > >
> > > > On Thu, Jan 12, 2023 at 6:06 AM Andres Freund  
> > > > wrote:
> > > > >
> > > > > On 2023-01-11 17:26:19 -0700, David G. Johnston wrote:
> > > > > > Should we just add "ring_buffers" to the existing "shared_buffers" 
> > > > > > and
> > > > > > "temp_buffers" settings?
> > > > >
> > > > > The different types of ring buffers have different sizes, for good 
> > > > > reasons. So
> > > > > I don't see that working well. I also think it'd be more often useful 
> > > > > to
> > > > > control this on a statement basis - if you have a parallel import 
> > > > > tool that
> > > > > starts NCPU COPYs you'd want a smaller buffer than a single threaded 
> > > > > COPY. Of
> > > > > course each session can change the ring buffer settings, but still.
> > > >
> > > > How about having GUCs for each ring buffer (bulk_read_ring_buffers,
> > > > bulk_write_ring_buffers, vacuum_ring_buffers - ah, 3 more new GUCs)?
> > > > These options can help especially when statement level controls aren't
> > > > easy to add (COPY, CREATE TABLE AS/CTAS, REFRESH MAT VIEW/RMV)? If
> > > > needed users can also set them at the system level. For instance, one
> > > > can set bulk_write_ring_buffers to other than 16MB or -1 to disable
> > > > the ring buffer to use shared_buffers and run a bunch of bulk write
> > > > queries.
> >
> > In attached v3, I've changed the name of the guc from buffer_usage_limit
> > to vacuum_buffer_usage_limit, since it is only used for vacuum and
> > autovacuum.
>
> Sorry for arriving late to this thread, but what about sizing the ring
> dynamically? From what I gather the primary motivation for larger ring
> size is avoiding WAL flushes due to dirty buffer writes. We already
> catch that event with StrategyRejectBuffer(). So maybe a dynamic
> sizing algorithm could be applied to the ringbuffer. Make the buffers
> array in strategy capable of holding up to the limit of buffers, but
> set ring size conservatively. If we have to flush WAL, double the ring
> size (up to the limit). If we loop around the ring without flushing,
> decrease the ring size by a small amount to let clock sweep reclaim
> them for use by other backends.
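
A rough sketch of that adaptive scheme (purely illustrative; the fields and
helper names below do not exist in the patch under discussion):

/*
 * Grow the ring when reusing a buffer would have forced a WAL flush
 * (the condition StrategyRejectBuffer() already detects).
 */
static void
ring_grow_on_wal_flush(BufferAccessStrategy strategy)
{
	strategy->nbuffers = Min(strategy->nbuffers * 2,
							 strategy->max_nbuffers);
}

/*
 * Shrink slowly after a full loop around the ring with no WAL flush,
 * letting the clock sweep reclaim the released buffers.
 */
static void
ring_shrink_after_quiet_loop(BufferAccessStrategy strategy)
{
	strategy->nbuffers = Max(strategy->nbuffers - 1, 1);
}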

So, the original motivation of this patch was to allow autovacuum in
failsafe mode to abandon use of a buffer access strategy, since, at that
point, there is no reason to hold back. The idea was expanded to be an
option to explicit vacuum, since users often must initiate an explicit
vacuum after a forced shutdown due to transaction ID wraparound.

As for routine vacuuming and the other buffer access strategies, I think
there is an argument for configurability based on operator knowledge --
perhaps your workload will use the data you are COPYing as soon as the
COPY finishes, so you might as well disable a buffer access strategy or
use a larger fraction of shared buffers. Also, the ring sizes were
selected sixteen years ago and average server memory and data set sizes
have changed.

StrategyRejectBuffer() will allow bulkreads to, as you say, use more
buffers than the original ring size, since it allows them to kick
dirty buffers out of the ring and claim new shared buffers.

Bulkwrites and vacuums, however, will inevitably dirty buffers and
require flushing the buffer (and thus flushing the associated WAL) when
reusing them. Bulkwrites and vacuum do not kick dirtied buffers out of
the ring, since dirtying buffers is their common case. A dynamic
resizing like the one you suggest would likely devolve to vacuum and
bulkwrite strategies always using the max size.

As for decreasing the ring size, buffers are only "added" to the ring
lazily and, technically, as it is now, buffers which have been added
to the ring can always be reclaimed by the clocksweep (as long as
they are not pinned). The buffer access strategy is more of a
self-imposed restriction than it is a reservation. Since the ring is
small and the buffers are being frequently reused, odds are the usage
count will be 1 and we will be the one who set it to 1, but there is no
guarantee. If, when attempting to reuse the buffer, its usage count is
> 1 (or it is pinned), we also will kick it out of the ring and go look
for a replacement buffer.

I do think that it is a bit unreasonable to expect users to know how
large they would like to make their buffer access strategy ring. What we
want is some way of balancing different kinds of workloads and
maintenance tasks reasonably. If your database has no activity because
it is the middle of the night or it was shutdown because of transaction
id wraparound, there is no reason why vacuum should limit the number of
buffers it uses. I'm sure there are many other such examples.

- Melanie




Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread Peter Smith
On Tue, Mar 14, 2023 at 10:33 PM vignesh C  wrote:
>
> On Tue, 14 Mar 2023 at 12:37, Peter Smith  wrote:
> >
> > Thanks for the review!
> >
> > On Mon, Mar 13, 2023 at 6:19 PM vignesh C  wrote:
> > ...
> > > Few comments:
> > > 1) Can we move the macros along with the other macros present in this
> > > file, just above this structure, similar to the macros added for
> > > txn_flags:
> > > /* Toplevel transaction for this subxact (NULL for top-level). */
> > > +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> > > +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> > > +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
> > >
> > > 2) The macro name can be changed to rbtxn_is_toptxn, rbtxn_is_subtxn
> > > and rbtxn_get_toptxn to keep it consistent with others:
> > > /* Toplevel transaction for this subxact (NULL for top-level). */
> > > +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> > > +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> > > +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
> > >
> > > 3) We could add separate comments for each of the macros:
> > > /* Toplevel transaction for this subxact (NULL for top-level). */
> > > +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> > > +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> > > +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
> > >
> >
> > All the above are fixed as suggested.
> >
> > > 4) We check if txn->toptxn is not null twice here both in if condition
> > > and in the assignment, we could retain the assignment operation as
> > > earlier to remove the 2nd check:
> > > -   if (txn->toptxn)
> > > -   txn = txn->toptxn;
> > > +   if (isa_subtxn(txn))
> > > +   txn = get_toptxn(txn);
> > >
> > > We could avoid one check again by:
> > > +   if (isa_subtxn(txn))
> > > +   txn = txn->toptxn;
> > >
> >
> > Yeah, that is true, but I chose not to keep the original assignment in
> > this case mainly because then it is consistent with the other changed
> > code ---  e.g. Every other direct member assignment/access of the
> > 'toptxn' member now hides behind the macros so I did not want this
> > single place to be the odd one out. TBH, I don't think 1 extra check
> > is of any significance, but it is not a problem to change like you
> > suggested if other people also want it done that way.
>
> The same issue exists here too:
> 1)
> -   if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
> +   if (rbtxn_is_subtxn(txn))
> {
> -   toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> -   dclist_push_tail(&rb->catchange_txns, &toptxn->catchange_node);
> +   ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);
>
> 2)
>  -   if (change->txn->toptxn)
> -   topxid = change->txn->toptxn->xid;
> +   if (rbtxn_is_subtxn(change->txn))
> +   topxid = rbtxn_get_toptxn(change->txn)->xid;
>
> If you plan to fix it, both of the above should also be handled.

OK, noted. Anyway, for now, I preferred the 'toptxn' member to be
consistently hidden in the code so I don't plan to remove those
macros.

Also, please see Amit's reply [1] to your suggestion.

>
> 3) The comment on top of rbtxn_get_toptxn could be kept similar in
> both the below places. I know it is not because of your change, but
> since it is a very small change probably we could include it along
> with this patch:
> @@ -717,10 +717,7 @@ ReorderBufferProcessPartialChange(ReorderBuffer
> *rb, ReorderBufferTXN *txn,
> return;
>
> /* Get the top transaction. */
> -   if (txn->toptxn != NULL)
> -   toptxn = txn->toptxn;
> -   else
> -   toptxn = txn;
> +   toptxn = rbtxn_get_toptxn(txn);
>
> /*
>  * Indicate a partial change for toast inserts.  The change will be
> @@ -812,10 +809,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb,
> TransactionId xid, XLogRecPtr lsn,
> ReorderBufferTXN *toptxn;
>
> /* get the top transaction */
> -   if (txn->toptxn != NULL)
> -   toptxn = txn->toptxn;
> -   else
> -   toptxn = txn;
> +   toptxn = rbtxn_get_toptxn(txn);
>

IMO the comment ("/* get the top transaction */") was not really
saying anything useful that is not already obvious from the macro name
("rbtxn_get_toptxn"). So I've removed it entirely in your 2nd case.
This change is consistent with other parts of the patch where the
toptxn is just assigned in the declaration.

PSA v3. [2]

--
[1] Amit reply to your suggestion -
https://www.postgresql.org/message-id/CAA4eK1%2BoqfUSC3vpu6bJzgfnSmu_yaeoLS%3DRb3416GuS5iRP1Q%40mail.gmail.com
[2] v3 - 
https://www.postgresql.org/message-id/CAHut%2BPtrD4xU4OPUB64ZK%2BDPDhfKn3zph%3DnDpEWUFFzUvMKo2w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread Peter Smith
On Tue, Mar 14, 2023 at 10:43 PM Amit Kapila  wrote:
>
> On Tue, Mar 14, 2023 at 12:37 PM Peter Smith  wrote:
> >
> > Thanks for the review!
> >
> > On Mon, Mar 13, 2023 at 6:19 PM vignesh C  wrote:
> > ...
> >
> > > 4) We check if txn->toptxn is not null twice here both in if condition
> > > and in the assignment, we could retain the assignment operation as
> > > earlier to remove the 2nd check:
> > > -   if (txn->toptxn)
> > > -   txn = txn->toptxn;
> > > +   if (isa_subtxn(txn))
> > > +   txn = get_toptxn(txn);
> > >
> > > We could avoid one check again by:
> > > +   if (isa_subtxn(txn))
> > > +   txn = txn->toptxn;
> > >
> >
> > Yeah, that is true, but I chose not to keep the original assignment in
> > this case mainly because then it is consistent with the other changed
> > code ---  e.g. Every other direct member assignment/access of the
> > 'toptxn' member now hides behind the macros so I did not want this
> > single place to be the odd one out. TBH, I don't think 1 extra check
> > is of any significance, but it is not a problem to change like you
> > suggested if other people also want it done that way.
> >
>
> Can't we directly use rbtxn_get_toptxn() for this case? I think that
> way code will look neat. I see that it is not exactly matching the
> existing check so you might be worried but I feel the new code will
> achieve the same purpose and will be easy to follow.
>

OK. Done as suggested.

PSA v3.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v3-0001-Add-macros-for-ReorderBufferTXN-toptxn.patch
Description: Binary data


Re: [PATCH] Add pretty-printed XML output option

2023-03-14 Thread Tom Lane
Jim Jones  writes:
> On 14.03.23 18:40, Tom Lane wrote:
>> I poked at this for awhile and ran into a problem that I'm not sure
>> how to solve: it misbehaves for input with embedded DOCTYPE.

> The issue was the flag XML_SAVE_NO_EMPTY. It was forcing empty elements 
> to be serialized with start-end tag pairs. Removing it did the trick ...
> ... but as a side effect empty start-end tags will be now serialized as 
> empty elements

> postgres=# SELECT xmlserialize(CONTENT '' AS text 
> INDENT);
>   xmlserialize
> --
>      +
>         +
>   
> (1 row)

Huh, interesting.  That is a legitimate pretty-fication of the input,
I suppose, but some people might think it goes beyond the charter of
"indentation".  I'm okay with it personally; anyone want to object?

regards, tom lane




Re: [PATCH] Add pretty-printed XML output option

2023-03-14 Thread Jim Jones

On 14.03.23 18:40, Tom Lane wrote:

Jim Jones  writes:

[ v22-0001-Add-pretty-printed-XML-output-option.patch ]

I poked at this for awhile and ran into a problem that I'm not sure
how to solve: it misbehaves for input with embedded DOCTYPE.

regression=# SELECT xmlserialize(DOCUMENT '' as text indent);
  xmlserialize
--
  +
   +
  
(1 row)


The issue was the flag XML_SAVE_NO_EMPTY. It was forcing empty elements 
to be serialized with start-end tag pairs. Removing it did the trick ...


postgres=# SELECT xmlserialize(DOCUMENT '' AS text INDENT);
 xmlserialize
--
 +
 +

(1 row)

... but as a side effect empty start-end tags will now be serialized as 
empty elements


postgres=# SELECT xmlserialize(CONTENT '' AS text 
INDENT);

 xmlserialize
--
    +
       +
 
(1 row)

It seems to be the standard behavior of other xml indent tools 
(including Oracle)



regression=# SELECT xmlserialize(CONTENT '' as text indent);
  xmlserialize
--
  
(1 row)


The bad result for CONTENT is because xml_parse() decides to
parse_as_document, but xmlserialize_indent has no idea that happened
and tries to use the content_nodes list anyway.  I don't especially
care for the laissez faire "maybe we'll set *content_nodes and maybe
we won't" API you adopted for xml_parse, which seems to be contributing
to the mess.  We could pass back more info so that xmlserialize_indent
knows what really happened.


I added a new (nullable) parameter to the xml_parse function that will 
return the actual XmlOptionType used to parse the xml data. Now 
xmlserialize_indent knows how the data was really parsed:


postgres=# SELECT xmlserialize(CONTENT '' AS text INDENT);
 xmlserialize
--
 +
 +

(1 row)

I added test cases for these queries.

v23 attached.

Thanks!

Best, Jim
From 98fe15f07da345e046b8d29d5dde27ce191055a2 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Fri, 10 Mar 2023 13:47:16 +0100
Subject: [PATCH v23] Add pretty-printed XML output option

This patch implements the XML/SQL:2011 feature 'X069, XMLSERIALIZE: INDENT.'
It adds the options INDENT and NO INDENT (default) to the existing
xmlserialize function. It uses the indentation feature of xmlSaveToBuffer
from libxml2 to indent XML strings - see option XML_SAVE_FORMAT.

Although the INDENT feature is designed to work with xml strings of type
DOCUMENT, this implementation also allows the usage of CONTENT type strings
as long as it contains a well balanced xml.

This patch also includes documentation, regression tests and their three
possible output files xml.out, xml_1.out and xml_2.out.
---
 doc/src/sgml/datatype.sgml|   8 +-
 src/backend/catalog/sql_features.txt  |   2 +-
 src/backend/executor/execExprInterp.c |   9 +-
 src/backend/parser/gram.y |  14 +-
 src/backend/parser/parse_expr.c   |   1 +
 src/backend/utils/adt/xml.c   | 154 +++--
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |   4 +-
 src/include/parser/kwlist.h   |   1 +
 src/include/utils/xml.h   |   1 +
 src/test/regress/expected/xml.out | 188 ++
 src/test/regress/expected/xml_1.out   | 106 +++
 src/test/regress/expected/xml_2.out   | 188 ++
 src/test/regress/sql/xml.sql  |  38 ++
 14 files changed, 697 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 467b49b199..53d59662b9 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4460,14 +4460,18 @@ xml 'bar'
 xml, uses the function
 xmlserialize:xmlserialize
 
-XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type )
+XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ [NO] INDENT ] )
 
 type can be
 character, character varying, or
 text (or an alias for one of those).  Again, according
 to the SQL standard, this is the only way to convert between type
 xml and character types, but PostgreSQL also allows
-you to simply cast the value.
+you to simply cast the value. The option INDENT allows to
+indent the serialized xml output - the default is NO INDENT.
+It is designed to indent XML strings of type DOCUMENT, but it can also
+   be used with CONTENT as long as value
+   contains a well-formed XML.

 

diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index 0fb9ab7533..bb4c135a7f 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -621,7 +621,7 @@ X061	XMLParse: character string input and DOCUMENT option			YES
 X065	XMLParse: binary string input and CONTENT option			NO	
 X066	XMLParse: binary string input and DOCUMENT option			NO	
 X068	XMLSerialize: BOM			NO	
-X069	XMLSerialize: INDENT			NO	
+X069	XMLSerialize: INDENT			YES	
 X070	XMLSerialize: character string serialization and CONTENT option			

Re: Add pg_walinspect function with block info columns

2023-03-14 Thread Peter Geoghegan
On Tue, Mar 14, 2023 at 3:34 PM Melanie Plageman
 wrote:
> After patching master to add in the columns from
> pg_get_wal_records_info() which are not returned by
> pg_get_wal_block_info() (except block_ref column of course), this query:
>
>   SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn);
>
> took 467 ms.
>
> Perhaps this difference isn't important, but I found it noticeable.

This seems weird to me too. It's not so much the performance overhead
that bothers me (though that's not great either). It seems *illogical*
to me. The query you end up writing must do two passes over the WAL
records, but its structure almost suggests that it's necessary to do
two separate passes over distinct "streams".

Why doesn't it already work like this? Why do we need a separate
pg_get_wal_block_info() function at all?

-- 
Peter Geoghegan




Re: Track IO times in pg_stat_io

2023-03-14 Thread Andres Freund
Hi,

On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> On Tue, Mar 7, 2023 at 1:39 PM Andres Freund  wrote:
> > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think 
> > > > that would be a good idea
> > > > to also check that if counts are not Zero then times are not Zero.
> > >
> > > Yes, I think adding some validation around the relationship between
> > > counts and timing should help prevent developers from forgetting to call
> > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > >
> > > However, I think that we cannot check that if IO counts are non-zero
> > > that IO times are non-zero, because the user may not have
> > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > counts are not zero. I've done this in the attached v3.
> >
> > And even if track_io_timing is enabled, the timer granularity might be so 
> > low
> > that we *still* get zeroes.
> >
> > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
>
> And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> stats?

Yes.


> > > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char 
> > > relpersistence, ForkNumber forkNum,
> > >
> > >   if (isExtend)
> > >   {
> > > + instr_time  io_start,
> > > + io_time;
> > > +
> > >   /* new buffers are zero-filled */
> > >   MemSet((char *) bufBlock, 0, BLCKSZ);
> > > +
> > > + if (track_io_timing)
> > > + INSTR_TIME_SET_CURRENT(io_start);
> > > + else
> > > + INSTR_TIME_SET_ZERO(io_start);
> > > +
> >
> > I wonder if there's an argument for tracking this in the existing IO stats 
> > as
> > well. But I guess we've lived with this for a long time...
>
> Not sure I want to include that in this patchset.

No, probably not.


> > >  typedef struct PgStat_BktypeIO
> > >  {
> > > - PgStat_Counter 
> > > data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > + PgStat_Counter 
> > > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > + instr_time  
> > > times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > >  } PgStat_BktypeIO;
> >
> > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > another patch that gets substantial peformance gains by varying the 
> > frequency
> > at which instr_time keeps track of time based on the CPU frequency...
>
> What does that have to do with what we can store on disk?

The frequency can change.


> If so, would it not be enough to do this when reading/writing the stats
> file?

Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.



> void
> instr_time_deserialize(instr_time *dest, int64 *src, int length)
> {
> for (size_t i = 0; i < length; i++)
> {
> INSTR_TIME_SET_ZERO(dest[i]);
> dest[i].ticks = src[i];
> }
> }

That wouldn't be correct, because what ticks means will at some point change
between postgres stopping and starting.



> > It also just doesn't have enough range to keep track of system wide
> > time on a larger system. A single backend won't run for 293 years, but
> > with a few thousand backends that's a whole different story.
> >
> > I think we need to accumulate in instr_time, but convert to floating point
> > when flushing stats.
>
> Hmmm. So, are you saying that we need to read from disk when we query
> the view and add that to what is in shared memory? That we only store
> the delta since the last restart in the instr_time array?

No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.


> But, I don't see how that avoids the problem of backend total runtime
> being 293 years. We would have to reset and write out the delta whenever
> we thought the times would overflow.

The overflow risk is due to storing nanoseconds (which is what instr_time
stores internally on linux now) - which we don't need to do once
accumulated. Right now we store them as microseconds.

nanosecond range:
((2**63) - 1)/(10**9*60*60*24*365) -> 292 years
microsecond range:
((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years

If you assume 5k connections continually doing IO, a range of 292 years would
last 21 days at nanosecond resolution. At microsecond resolution it's 58
years.
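
A minimal sketch of that split (struct and function names here are simplified
stand-ins, not the actual pgstat code):

#include "portability/instr_time.h"

typedef struct PendingIOTime
{
	instr_time	accum;			/* native ticks, accumulated per backend */
} PendingIOTime;

typedef struct SharedIOTime
{
	int64		time_us;		/* microseconds, as stored today */
} SharedIOTime;

static void
flush_io_time(PendingIOTime *pending, SharedIOTime *shared)
{
	/*
	 * Convert only when flushing to shared stats, so nothing persistent
	 * depends on the instr_time tick frequency (which can change across
	 * restarts), and the microsecond representation has the wider range.
	 */
	shared->time_us += (int64) INSTR_TIME_GET_MICROSEC(pending->accum);
	INSTR_TIME_SET_ZERO(pending->accum);
}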

Greetings,

Andres Freund




Re: Add pg_walinspect function with block info columns

2023-03-14 Thread Melanie Plageman
I'm excited to see that pg_get_wal_block_info() was merged. Thanks for
working on this!

Apologies for jumping back in here a bit late. I've been playing around
with it and wanted to comment on the performance of JOINing to
pg_get_wal_records_info().

On Tue, Mar 7, 2023 at 7:08 AM Matthias van de Meent
 wrote:
>
> On Tue, 7 Mar 2023 at 01:34, Michael Paquier  wrote:
> >
> > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
> > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
> > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
> > >> me as it outputs most of the columns that are already given by
> > >> pg_get_wal_records_info.What I think the best way at this point is to
> > >> make it return the following:
> > >> lsn pg_lsn
> > >> block_id int8
> > >> spcOid oid
> > >> dbOid oid
> > >> relNumber oid
> > >> forkNames text
> > >> fpi bytea
> > >> fpi_info text
> >
> > I would add the length of the block data (without the hole and
> > compressed, as the FPI data should always be presented as
> > uncompressed), and the block data if any (without the block data
> > length as one can guess it based on the bytea data length).  Note that
> > a block can have both a FPI and some data assigned to it, as far as I
> > recall.
> >
> > > The basic idea is to create a single entrypoint to all relevant data
> > > from DecodedXLogRecord in SQL, not multiple.
> >
> > While I would agree with this principle on simplicity's ground in
> > terms of minimizing the SQL interface and the pg_wal/ lookups, I
> > disagree about it on unsability ground, because we can avoid extra SQL
> > tweaks with more functions.  One recent example I have in mind is
> > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
> > on the catalogs.
>
> Correct, but in that case the user would build the same query (or at
> least with the same complexity) as what we're executing under the
> hood, right?
>
> > There are of course various degrees of complexity,
> > and perhaps unnest() cannot qualify as one, but having two functions
> > returning normalized records (one for the record information, and a
> > second for the block information), is a rather good balance between
> > usability and interface complexity, in my experience.
>
> I would agree, if it weren't for the reasons written below.
>
> >  If you have two
> > functions, a JOIN is enough to cross-check the block data and the
> > record data,
>
> Joins are expensive on large datasets; and because WAL is one of the
> largest datasets in the system, why would we want to force the user to
> JOIN them if we can produce the data in one pre-baked data structure
> without a need to join?

I wanted to experiment to see how much slower it is to do the join
between pg_get_wal_block_info() and pg_get_wal_records_info() and
profile where the time was spent.

I saved the WAL LSN before and after a bunch of inserts (generating
~1,000,000 records).

On master, I did a join like this:

  SELECT count(*) FROM
pg_get_wal_block_info(:start_lsn, :end_lsn) b JOIN
pg_get_wal_records_info(:start_lsn, :end_lsn) w ON w.start_lsn = b.lsn;

which took 1191 ms.

After patching master to add in the columns from
pg_get_wal_records_info() which are not returned by
pg_get_wal_block_info() (except block_ref column of course), this query:

  SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn);

took 467 ms.

Perhaps this difference isn't important, but I found it noticeable.

A large chunk of the time is spent joining the tuplestores.

The second largest chunk of time seems to be in GetWalRecordInfo()'s
calls to XLogRecGetBlockRefInfo(), which spends quite a bit of time in
string construction -- mainly for strings we wouldn't end up needing
after joining to the block info function.

Surprisingly, the string construction seemed to overshadow the
performance impact of doubling the decoding passes over the WAL records.

So maybe it is worth including more record-level info?

On an unrelated note, I had thought that we should have some kind of
parameter to pg_get_wal_block_info() to control whether or not we output
the FPI to save us from wasting time decompressing the FPIs.

AFAIK, we don't have access to projection information from inside the
function, so a parameter like "output_fpi" or the like would have to do.

I wanted to share what I found in trying this in case someone else had
had that thought.

TL;DR, it doesn't seem to matter from a performance perspective if we
skip decompressing the FPIs in pg_get_wal_block_info().

I hacked "output_fpi" into pg_get_wal_block_info(), enabled pglz wal
compression, and generated a boatload of FPIs by dirtying buffers, doing
a checkpoint and then updating those pages again right after the
checkpoint. With output_fpi = true (same as master), my call to
pg_get_wal_block_info() took around 7 seconds and with output_fpi =
false, it took around 6 seconds. Not an impressive difference.

I noticed that only around 

Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-14 Thread Andres Freund
Hi,

I just rebased a patch over

commit 1f282c24e46
Author: Michael Paquier 
Date:   2023-03-13 13:03:29 +0900
 
Refactor and improve tests of pg_walinspect

and got a test failure:

https://cirrus-ci.com/task/5693041982308352
https://api.cirrus-ci.com/v1/artifact/task/5693041982308352/testrun/build/testrun/pg_walinspect/regress/regression.diffs

diff -w -U3 C:/cirrus/contrib/pg_walinspect/expected/oldextversions.out 
C:/cirrus/build/testrun/pg_walinspect/regress/results/oldextversions.out
--- C:/cirrus/contrib/pg_walinspect/expected/oldextversions.out 2023-03-14 
21:19:01.399716500 +
+++ C:/cirrus/build/testrun/pg_walinspect/regress/results/oldextversions.out
2023-03-14 21:26:27.504876700 +
@@ -8,10 +8,10 @@
 Object description
 ---
  function pg_get_wal_record_info(pg_lsn)
- function pg_get_wal_records_info(pg_lsn,pg_lsn)
  function pg_get_wal_records_info_till_end_of_wal(pg_lsn)
- function pg_get_wal_stats(pg_lsn,pg_lsn,boolean)
+ function pg_get_wal_records_info(pg_lsn,pg_lsn)
  function pg_get_wal_stats_till_end_of_wal(pg_lsn,boolean)
+ function pg_get_wal_stats(pg_lsn,pg_lsn,boolean)
 (5 rows)

 -- Make sure checkpoints don't interfere with the test.

Looks like it's missing an ORDER BY.

Greetings,

Andres Freund




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-03-14 Thread Thomas Munro
Here's a better version with more explicit comments about some
details.  It passes on CI.  Planning to push this soon.
From 0ac63e73313fffa19db22f19452c4bb08ea7bf14 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 14 Mar 2023 11:53:26 +1300
Subject: [PATCH v2] Fix waitpid() emulation on Windows.

Our waitpid() emulation didn't prevent a PID from being recycled by the
OS before the call to waitpid(), and the postmaster could finish up
tracking multiple children with the same PID, leading to confusion.

Fix, by moving the guts of pgwin32_deadchild_callback() into waitpid(),
so that resources are released synchronously.  The process and PID
continue to exist until we close the process handle, which only happens
once we're ready to adjust our book-keeping of running children.

This seems to explain a couple of failures on CI.  It had never been
reported before, despite the code being as old as the Windows port.
Perhaps Windows started recycling PIDs more rapidly, or perhaps timing
changes due to commit 7389aad6 made it more likely to break.

Thanks to Alexander Lakhin for analysis and Andres Freund for tracking
down the root cause.

Back-patch to all releases.

Reported-by: Andres Freund 
Discussion: https://postgr.es/m/20230208012852.bvkn2am4h4iqjogq%40awork3.anarazel.de
---
 src/backend/postmaster/postmaster.c | 70 -
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2552327d90..71198b72c8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4811,7 +4811,7 @@ retry:
 (errmsg_internal("could not register process for wait: error code %lu",
  GetLastError(;
 
-	/* Don't close pi.hProcess here - the wait thread needs access to it */
+	/* Don't close pi.hProcess here - waitpid() needs access to it */
 
 	CloseHandle(pi.hThread);
 
@@ -6421,36 +6421,21 @@ ShmemBackendArrayRemove(Backend *bn)
 static pid_t
 waitpid(pid_t pid, int *exitstatus, int options)
 {
+	win32_deadchild_waitinfo *childinfo;
+	DWORD		exitcode;
 	DWORD		dwd;
 	ULONG_PTR	key;
 	OVERLAPPED *ovl;
 
-	/*
-	 * Check if there are any dead children. If there are, return the pid of
-	 * the first one that died.
-	 */
-	if (GetQueuedCompletionStatus(win32ChildQueue, , , , 0))
+	/* Try to consume one win32_deadchild_waitinfo from the queue. */
+	if (!GetQueuedCompletionStatus(win32ChildQueue, , , , 0))
 	{
-		*exitstatus = (int) key;
-		return dwd;
+		errno = EAGAIN;
+		return -1;
 	}
 
-	return -1;
-}
-
-/*
- * Note! Code below executes on a thread pool! All operations must
- * be thread safe! Note that elog() and friends must *not* be used.
- */
-static void WINAPI
-pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
-{
-	win32_deadchild_waitinfo *childinfo = (win32_deadchild_waitinfo *) lpParameter;
-	DWORD		exitcode;
-
-	if (TimerOrWaitFired)
-		return;	/* timeout. Should never happen, since we use
- * INFINITE as timeout value. */
+	childinfo = (win32_deadchild_waitinfo *) key;
+	pid = childinfo->procId;
 
 	/*
 	 * Remove handle from wait - required even though it's set to wait only
@@ -6466,13 +6451,11 @@ pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
 		write_stderr("could not read exit code for process\n");
 		exitcode = 255;
 	}
-
-	if (!PostQueuedCompletionStatus(win32ChildQueue, childinfo->procId, (ULONG_PTR) exitcode, NULL))
-		write_stderr("could not post child completion status\n");
+	*exitstatus = exitcode;
 
 	/*
-	 * Handle is per-process, so we close it here instead of in the
-	 * originating thread
+	 * Close the process handle.  Only after this point can the PID can be
+	 * recycled by the kernel.
 	 */
 	CloseHandle(childinfo->procHandle);
 
@@ -6482,7 +6465,34 @@ pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
 	 */
 	free(childinfo);
 
-	/* Queue SIGCHLD signal */
+	return pid;
+}
+
+/*
+ * Note! Code below executes on a thread pool! All operations must
+ * be thread safe! Note that elog() and friends must *not* be used.
+ */
+static void WINAPI
+pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
+{
+	/* Should never happen, since we use INFINITE as timeout value. */
+	if (TimerOrWaitFired)
+		return;
+
+	/*
+	 * Post the win32_deadchild_waitinfo object for waitpid() to deal with. If
+	 * that fails, we leak the object, but we also leak a whole process and
+	 * get into an unrecoverable state, so there's not much point in worrying
+	 * about that.  We'd like to panic, but we can't use that infrastructure
+	 * from this thread.
+	 */
+	if (!PostQueuedCompletionStatus(win32ChildQueue,
+	0,
+	(ULONG_PTR) lpParameter,
+	NULL))
+		write_stderr("could not post child completion status\n");
+
+	/* Queue SIGCHLD signal. */
 	pg_queue_signal(SIGCHLD);
 }
 #endif			/* WIN32 */
-- 
2.39.1



Re: Making background psql nicer to use in tap tests

2023-03-14 Thread Daniel Gustafsson
> On 31 Jan 2023, at 01:00, Andres Freund  wrote:

> I've hacked some on this. I first tried to just introduce a few helper
> functions in Cluster.pm, but that ended up being awkward. So I bit the bullet
> and introduced a new class (in BackgroundPsql.pm), and made background_psql()
> and interactive_psql() return an instance of it.

Thanks for working on this!

> This is just a rough prototype. Several function names don't seem great, it
> need POD documentation, etc.

It might be rough around the edges but I don't think it's too far off a state
in which it can be committed, given that it's replacing something even rougher.
With documentation and some polish I think we can iterate on it in the tree.

I've played around a lot with it and it seems fairly robust.

> I don't quite like the new interface yet:
> - It's somewhat common to want to know if there was a failure, but also get
>  the query result, not sure what the best function signature for that is in
>  perl.

What if query() returns a list with the return value last?  The caller will get
the return value when assigning a single var as the return, and can get both in
those cases when it's interesting.  That would make for reasonably readable
code in most places?

   $ret_val = $h->query("SELECT 1;");
   ($query_result, $ret_val) = $h->query("SELECT 1;"); 

Returning a hash seems like a worse option since it will complicate callsites
which only want to know success/failure.

> - query_until() sounds a bit too much like $node->poll_query_until(). Maybe
>  query_wait_until() is better? OTOH, the other function has poll in the name,
>  so maybe it's ok.

query_until isn't great but query_wait_until is IMO worse since the function
may well be used for tests which aren't using long-running waits.  It's also
very useful for things which aren't queries at all, like psql backslash
commands.  I don't have any better ideas though, so +1 for sticking with
query_until.

> - right now there's a bit too much logic in background_psql() /
>  interactive_psql() for my taste

Not sure what you mean, I don't think they're especially heavy on logic?

> Those points aside, I think it already makes the tests a good bit more
> readable. My WIP vacuum_defer_cleanup_age patch shrunk by half with it.

The test for \password in the SCRAM iteration count patch shrunk to 1/3 of the
previous coding.

> I think with a bit more polish it's easy enough to use that we could avoid a
> good number of those one-off psql's that we do all over.

Agreed, and ideally implement tests which were left unwritten due to the old
API being clunky.

+   # feed the query to psql's stdin, follwed by \n (so psql processes the

s/follwed/followed/

+A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up,
+which can modified later.

This requires a bit of knowledge about the internals which I think we should
hide in this new API.  How about providing a function for defining the timeout?

Re timeouts: one thing I've done repeatedly is to use short timeouts and reset
them per query, and that turns pretty ugly fast.  I hacked up your patch to
provide $h->reset_timer_before_query() which then injects a {timeout}->start
before running each query without the caller having to do it.  Not sure if I'm
alone in doing that but if not I think it makes sense to add.

--
Daniel Gustafsson





Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-14 Thread Tom Lane
Gilles Darold  writes:
> Thanks Stepane, I've changed commit fest status to "Ready for committers".

Pushed with some minor editing.

regards, tom lane




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-03-14 Thread Jacob Champion
On Tue, Mar 14, 2023 at 11:01 AM Gregory Stark (as CFM)
 wrote:
> It does look like a rebase for meson.build would be helpful. I'm not
> marking it waiting on author just for meson.build conflicts but it
> would be perhaps more likely to be picked up by a committer if it's
> showing green in cfbot.

Rebased over yesterday's Meson changes in v8.

Thanks!
--Jacob
1:  b07af1c564 ! 1:  84f67249e6 libpq: add sslrootcert=system to use default CAs
@@ doc/src/sgml/runtime.sgml: pg_dumpall -p 5432 | psql -d postgres -p 5433

 
  ## meson.build ##
-@@ meson.build: if get_option('ssl') == 'openssl'
+@@ meson.build: if sslopt in ['auto', 'openssl']
+ else
+   ssl = not_found_dep
  endif
-   endforeach
- 
++
++if ssl.found()
 +  # Let tests differentiate between vanilla OpenSSL and LibreSSL.
 +  sym = 'LIBRESSL_VERSION_NUMBER'
 +  found = cc.has_header_symbol('openssl/opensslv.h', sym, 
dependencies: ssl_int)
 +  cdata.set10('HAVE_DECL_' + sym, found, description:
 +'''Define to 1 if you have the declaration of `@0@', and to 0 if you
 +   don't.'''.format(sym))
-+
-   cdata.set('USE_OPENSSL', 1,
- description: 'Define to 1 to build with OpenSSL support. 
(-Dssl=openssl)')
-   cdata.set('OPENSSL_API_COMPAT', '0x10001000L',
++endif
+   endif
+ endif
+ 
 
  ## src/include/pg_config.h.in ##
 @@
2:  5de1c458b1 = 2:  11b69d0bc0 libpq: force sslmode=verify-full for system CAs
From 84f67249e683d79a9a004b3f12507e77be3d9c6f Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 24 Oct 2022 15:30:25 -0700
Subject: [PATCH v8 1/2] libpq: add sslrootcert=system to use default CAs

Based on a patch by Thomas Habets.

Note the workaround in .cirrus.yml for a strange interaction between
brew and the openssl@3 formula; hopefully this can be removed in the
near future.

Discussion: https://www.postgresql.org/message-id/flat/CA%2BkHd%2BcJwCUxVb-Gj_0ptr3_KZPwi3%2B67vK6HnLFBK9MzuYrLA%40mail.gmail.com
---
 .cirrus.yml   |  14 +-
 configure | 289 +-
 configure.ac  |   2 +
 doc/src/sgml/libpq.sgml   |  17 ++
 doc/src/sgml/runtime.sgml |   6 +-
 meson.build   |   9 +
 src/include/pg_config.h.in|   4 +
 src/interfaces/libpq/fe-secure-openssl.c  |  29 +-
 src/test/ssl/ssl/server-cn-only+server_ca.crt |  38 +++
 src/test/ssl/sslfiles.mk  |   6 +-
 src/test/ssl/t/001_ssltests.pl|  29 ++
 11 files changed, 298 insertions(+), 145 deletions(-)
 create mode 100644 src/test/ssl/ssl/server-cn-only+server_ca.crt

diff --git a/.cirrus.yml b/.cirrus.yml
index 505c50f328..67b42db559 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -469,12 +469,24 @@ task:
   make \
   meson \
   openldap \
-  openssl \
+  openssl@3 \
   python \
   tcl-tk \
   zstd
 
 brew cleanup -s # to reduce cache size
+
+# brew cleanup removes the empty certs directory in OPENSSLDIR, causing
+# OpenSSL to report unexpected errors ("unregistered scheme") during
+# verification failures. Put it back for now as a workaround.
+#
+#   https://github.com/orgs/Homebrew/discussions/4030
+#
+# Note that $(brew --prefix openssl) will give us the opt/ prefix but not
+# the etc/ prefix, so we hardcode the full path here. openssl@3 is pinned
+# above to try to minimize the chances of this changing beneath us, but it's
+# brittle...
+mkdir -p "/opt/homebrew/etc/openssl@3/certs"
   upload_caches: homebrew
 
   ccache_cache:
diff --git a/configure b/configure
index e35769ea73..df7e937e0c 100755
--- a/configure
+++ b/configure
@@ -2094,6 +2094,56 @@ $as_echo "$ac_res" >&6; }
 
 } # ac_fn_c_check_func
 
+# ac_fn_c_check_decl LINENO SYMBOL VAR INCLUDES
+# -
+# Tests whether SYMBOL is declared in INCLUDES, setting cache variable VAR
+# accordingly.
+ac_fn_c_check_decl ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  # Initialize each $ac_[]_AC_LANG_ABBREV[]_decl_warn_flag once.
+  as_decl_name=`echo $2|sed 's/ *(.*//'`
+  as_decl_use=`echo $2|sed -e 's/(/((/' -e 's/)/) 0&/' -e 's/,/) 0& (/g'`
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether $as_decl_name is declared" >&5
+$as_echo_n "checking whether $as_decl_name is declared... " >&6; }
+if eval \${$3+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_save_werror_flag=$ac_c_werror_flag
+  ac_c_werror_flag="$ac_c_decl_warn_flag$ac_c_werror_flag"
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$4
+int
+main ()
+{
+#ifndef $as_decl_name
+#ifdef __cplusplus
+  (void) $as_decl_use;
+#else
+  (void) $as_decl_name;
+#endif
+#endif
+
+  ;
+  

Re: [PoC] Let libpq reject unexpected authentication requests

2023-03-14 Thread Jacob Champion
On Mon, Mar 13, 2023 at 10:39 PM Michael Paquier  wrote:
> 0001 was looking fine enough seen from here, so applied it after
> tweaking a few comments.  That's enough to cover most of the needs of
> this thread.

Thank you very much!

> 0002 looks pretty simple as well, I think that's worth a look for this
> CF.

Cool. v17 just rebases the set over HEAD, then, for cfbot.

> I am not sure about 0003, to be honest, as I am wondering if
> there could be a better solution than tying more the mechanism names
> with the expected AUTH_REQ_* values..

Yeah, I'm not particularly excited about the approach I took. It'd be
easier if we had a second SASL method to verify the implementation...
I'd also proposed just adding an Assert, as a third option, to guide
the eventual SASL implementer back to this conversation?

--Jacob
From e2343c008916fe8dd9ecacfa71c60cc250fa208d Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Tue, 18 Oct 2022 16:55:36 -0700
Subject: [PATCH v17 2/2] require_auth: decouple SASL and SCRAM

Rather than assume that an AUTH_REQ_SASL* code refers to SCRAM-SHA-256,
future-proof by separating the single allowlist into a list of allowed
authentication request codes and a list of allowed SASL mechanisms.

The require_auth check is now separated into two tiers. The
AUTH_REQ_SASL* codes are always allowed. If the server sends one,
responsibility for the check then falls to pg_SASL_init(), which
compares the selected mechanism against the list of allowed mechanisms.
(Other SCRAM code is already responsible for rejecting unexpected or
out-of-order AUTH_REQ_SASL_* codes, so that's not explicitly handled
with this addition.)

Since there's only one recognized SASL mechanism, conn->sasl_mechs
currently only points at static hardcoded lists. Whenever a second
mechanism is added, the list will need to be managed dynamically.
---
 src/interfaces/libpq/fe-auth.c| 34 +++
 src/interfaces/libpq/fe-connect.c | 41 +++
 src/interfaces/libpq/libpq-int.h  |  3 +-
 src/test/authentication/t/001_password.pl | 14 +---
 src/test/ssl/t/002_scram.pl   |  6 
 5 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index b2497acdad..4ff49d8207 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -522,6 +522,40 @@ pg_SASL_init(PGconn *conn, int payloadlen)
 		goto error;
 	}
 
+	/*
+	 * Before going ahead with any SASL exchange, ensure that the user has
+	 * allowed (or, alternatively, has not forbidden) this particular mechanism.
+	 *
+	 * In a hypothetical future where a server responds with multiple SASL
+	 * mechanism families, we would need to instead consult this list up above,
+	 * during mechanism negotiation. We don't live in that world yet. The server
+	 * presents one auth method and we decide whether that's acceptable or not.
+	 */
+	if (conn->sasl_mechs)
+	{
+		const char **mech;
+		bool		found = false;
+
+		Assert(conn->require_auth);
+
+		for (mech = conn->sasl_mechs; *mech; mech++)
+		{
+			if (strcmp(*mech, selected_mechanism) == 0)
+			{
+found = true;
+break;
+			}
+		}
+
+		if ((conn->sasl_mechs_denied && found)
+			|| (!conn->sasl_mechs_denied && !found))
+		{
+			libpq_append_conn_error(conn, "auth method \"%s\" requirement failed: server requested unacceptable SASL mechanism \"%s\"",
+	conn->require_auth, selected_mechanism);
+			goto error;
+		}
+	}
+
 	if (conn->channel_binding[0] == 'r' &&	/* require */
 		strcmp(selected_mechanism, SCRAM_SHA_256_PLUS_NAME) != 0)
 	{
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index cbadb3f6af..a048793b46 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -1258,12 +1258,25 @@ connectOptions2(PGconn *conn)
 	more;
 		bool		negated = false;
 
+		static const uint32 default_methods = (
+			1 << AUTH_REQ_SASL
+			| 1 << AUTH_REQ_SASL_CONT
+			| 1 << AUTH_REQ_SASL_FIN
+		);
+		static const char *no_mechs[] = { NULL };
+
 		/*
-		 * By default, start from an empty set of allowed options and add to
+		 * By default, start from a minimum set of allowed options and add to
 		 * it.
+		 *
+		 * NB: The SASL method codes are always "allowed" here. If the server
+		 * requests SASL auth, pg_SASL_init() will enforce adherence to the
+		 * sasl_mechs list, which by default is empty.
 		 */
 		conn->auth_required = true;
-		conn->allowed_auth_methods = 0;
+		conn->allowed_auth_methods = default_methods;
+		conn->sasl_mechs = no_mechs;
+		conn->sasl_mechs_denied = false;
 
 		for (first = true, more = true; more; first = false)
 		{
@@ -1290,6 +1303,9 @@ connectOptions2(PGconn *conn)
 	 */
 	conn->auth_required = false;
 	conn->allowed_auth_methods = -1;
+
+	/* conn->sasl_mechs is now a list of denied mechanisms. */
+	conn->sasl_mechs_denied = true;
 }
 else if (!negated)
 			

Re: pg_stat_statements and "IN" conditions

2023-03-14 Thread Dmitry Dolgov
> On Tue, Mar 14, 2023 at 02:14:17PM -0400, Gregory Stark (as CFM) wrote:
> So I was seeing that this patch needs a rebase according to cfbot.

Yeah, folks are getting up to speed with pgss improvements recently.
Thanks for letting me know.

> However it looks like the review feedback you're looking for is more
> of design questions. What jumbling is best to include in the feature
> set and which is best to add in later patches. It sounds like you've
> gotten conflicting feedback from initial reviews.
>
> It does sound like the patch is pretty mature and you're actively
> responding to feedback so if you got more authoritative feedback it
> might even be committable now. It's already been two years of being
> rolled forward so it would be a shame to keep rolling it forward.

You got it about right. There is a balance to strike between an
implementation that would cover more useful cases but has more
dependencies (something like the possibility of having multiple query
IDs), and a more minimalistic implementation that would actually be
acceptable in the way it is now. But I haven't heard back from David
about it, so I assume everybody is fine with the minimalistic approach.

> Or is there some fatal problem that you're trying to work around and
> still haven't found the magic combination that convinces any
> committers this is something we want? In which case perhaps we set
> this patch returned? I don't get that impression myself though.

Nothing like this on my side, although I'm not good at conjuring
committing powers of the nature.




Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Nathan Bossart
On Tue, Mar 14, 2023 at 01:58:59PM +0900, Kyotaro Horiguchi wrote:
> + if (*opt_end)
> + pg_log_error("\\watch: incorrect 
> interval value '%s'", opt);
> + else if (errno == ERANGE)
> + pg_log_error("\\watch: out-of-range 
> interval value '%s'", opt);
> + else
> + pg_log_error("\\watch: interval value 
> '%s' less than zero", opt);
> 
> I'm not sure if we need error messages for that resolution and I'm a
> bit happier to have fewer messages to translate:p. Merging the cases
> of ERANGE and negative values might be better. And I think we usually
> refer to unparsable input as "invalid".
> 
>   if (*opt_end)
>  pg_log_error("\\watch: invalid interval value '%s'", opt);
>   else
>  pg_log_error("\\watch: interval value '%s' out of range", opt);

+1, I don't think it's necessary to complicate these error messages too
much.  This code hasn't reported errors for nearly 10 years, and I'm not
aware of any complaints.  I still think we could simplify this to "\watch:
invalid delay interval: %s" and call it a day.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Raising the SCRAM iteration count

2023-03-14 Thread Gregory Stark (as CFM)
CFBot is failing with this test failure... I'm not sure  if this just
represents a timing dependency or a bad test or what?

[09:44:49.937] --- stdout ---
[09:44:49.937] # executing test in
/tmp/cirrus-ci-build/build-32/testrun/authentication/001_password
group authentication test 001_password
[09:44:49.937] ok 1 - scram_iterations in server side ROLE
[09:44:49.937] # test failed
[09:44:49.937] --- stderr ---
[09:44:49.937] # Tests were run but no plan was declared and
done_testing() was not seen.
[09:44:49.937] # Looks like your test exited with 2 just after 1.
[09:44:49.937]
[09:44:49.937] (test program exited with status code 2)

It looks like perhaps a Perl issue?

# Running: /tmp/cirrus-ci-build/build-32/src/test/regress/pg_regress
--config-auth 
/tmp/cirrus-ci-build/build-32/testrun/authentication/001_password/data/t_001_password_primary_data/pgdata
### Starting node "primary"
# Running: pg_ctl -w -D
/tmp/cirrus-ci-build/build-32/testrun/authentication/001_password/data/t_001_password_primary_data/pgdata
-l 
/tmp/cirrus-ci-build/build-32/testrun/authentication/001_password/log/001_password_primary.log
-o --cluster-name=primary start
waiting for server to start done
server started
# Postmaster PID for node "primary" is 66930
[09:44:07.411](1.875s) ok 1 - scram_iterations in server side ROLE
Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty
module) (@INC contains: /tmp/cirrus-ci-build/src/test/perl
/tmp/cirrus-ci-build/src/test/authentication /etc/perl
/usr/local/lib/i386-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1
/usr/lib/i386-linux-gnu/perl5/5.32 /usr/share/perl5
/usr/lib/i386-linux-gnu/perl/5.32 /usr/share/perl/5.32
/usr/local/lib/site_perl) at /usr/share/perl5/IPC/Run.pm line 1828.
Unexpected SCALAR(0x5814b508) in harness() parameter 3 at
/tmp/cirrus-ci-build/src/test/perl/PostgreSQL/Test/Cluster.pm line
2112.
Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty
module) (@INC contains: /tmp/cirrus-ci-build/src/test/perl
/tmp/cirrus-ci-build/src/test/authentication /etc/perl
/usr/local/lib/i386-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1
/usr/lib/i386-linux-gnu/perl5/5.32 /usr/share/perl5
/usr/lib/i386-linux-gnu/perl/5.32 /usr/share/perl/5.32
/usr/local/lib/site_perl) at /usr/share/perl5/IPC/Run.pm line 1939.
# Postmaster PID for node "primary" is 66930
### Stopping node "primary" using mode immediate
# Running: pg_ctl -D
/tmp/cirrus-ci-build/build-32/testrun/authentication/001_password/data/t_001_password_primary_data/pgdata
-m immediate stop
waiting for server to shut down done
server stopped
# No postmaster PID for node "primary"
[09:44:07.521](0.110s) # Tests were run but no plan was declared and
done_testing() was not seen.
[09:44:07.521](0.000s) # Looks like your test exited with 2 just after 1.




Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

2023-03-14 Thread Nathan Bossart
On Tue, Mar 14, 2023 at 03:38:45PM +1300, Thomas Munro wrote:
> On Tue, Mar 14, 2023 at 12:10 PM Nathan Bossart
>  wrote:
>> >   * NOTE: although the delay is specified in microseconds, the effective
>> > - * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
>> > - * the requested delay to be rounded up to the next resolution boundary.
>> > + * resolution is only 1/HZ on systems that use periodic kernel ticks to 
>> > wake
>> > + * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on 
>> > older
>> > + * Unixen and Windows.
>>
>> nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
>> Otherwise, I think this is the right idea.
> 
> Better words welcome; 1-20ms summarises the range I actually measured,
> and if reports are correct about Windows' HZ=64 (1/HZ = 15.625ms) then
> it neatly covers that too, so I don't feel too bad about not chasing
> down the reason for that 10ms/20ms discrepancy; maybe I looked at the
> wrong HZ number (which you can change, anyway), I'm not too used to
> NetBSD...  BTW they have a project plan to fix that
> https://wiki.netbsd.org/projects/project/tickless/

Here is roughly what I had in mind:

NOTE: Although the delay is specified in microseconds, older Unixen and
Windows use periodic kernel ticks to wake up, which might increase the
delay time significantly.  We've observed delay increases as large as
20 milliseconds on supported platforms.

>> > + * CAUTION: if interrupted by a signal, this function will return, but its
>> > + * interface doesn't report that.  It's not a good idea to use this
>> > + * for long sleeps in the backend, because backends are expected to 
>> > respond to
>> > + * interrupts promptly.  Better practice for long sleeps is to use 
>> > WaitLatch()
>> > + * with a timeout.
>>
>> I'm not sure this argument follows.  If pg_usleep() returns if interrupted,
>> then why are we concerned about delayed responses to interrupts?
> 
> Because you can't rely on it:
> 
> 1.  Maybe the signal is delivered just before pg_usleep() begins, and
> a handler sets some flag we would like to react to.  Now pg_usleep()
> will not be interrupted.  That problem is solved by using latches
> instead.
> 2.  Maybe the signal is one that is no longer handled by a handler at
> all; these days, latches use SIGURG, which pops out when you read a
> signalfd or kqueue, so pg_usleep() will not wake up.  That problem is
> solved by using latches instead.
> 
> (The word "interrupt" is a bit overloaded, which doesn't help with
> this discussion.)

Yeah, I think it would be clearer if "interrupt" was disambiguated.
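
(For reference, the latch-based pattern being recommended over pg_usleep()
for long sleeps is roughly the following; the exact flags and wait event
depend on the call site.)

/* Sleep up to timeout_ms, but wake promptly on latch set or postmaster death. */
(void) WaitLatch(MyLatch,
				 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 timeout_ms,
				 WAIT_EVENT_PG_SLEEP);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();			/* react to whatever woke us */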

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-03-14 Thread Greg Stark
On Tue, 17 Jan 2023 at 14:52, Tomas Vondra
 wrote:
>
> On 1/17/23 19:46, Andres Freund wrote:
>
> > I think a "hybrid" explain mode might be worth thinking about. Use the
> > "current" sampling method for the first execution of a node, and for the 
> > first
> > few milliseconds of a query (or perhaps the first few timestamp
> > acquisitions). That provides an accurate explain analyze for short queries,
> > without a significant slowdown. Then switch to sampling, which provides 
> > decent
> > attribution for a bit longer running queries.
> >
>
> Yeah, this is essentially the sampling I imagined when I first read the
> subject of this thread. It samples which node executions to measure (and
> then measures those accurately), while these patches sample timestamps.

That sounds interesting. FWIW my first thought would be to implement
it a bit differently: always have a sampling timer running right from
the start, but also, if there are fewer than, say, 1000 samples for a
node, measure the actual start/finish time.

So for any given node once you've hit enough samples to get a decent
estimate you stop checking the time. That way any fast or rarely
called nodes still have accurate measurements even if they get few
samples and any long or frequently called nodes stop getting
timestamps and just use timer counts.
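
Roughly, per node (field and wrapper names below are invented for
illustration; this is a sketch of the idea, not actual executor code):

static TupleTableSlot *
ExecProcNodeHybridInstr(PlanState *node)
{
	TupleTableSlot *result;

	if (node->n_timed_samples < 1000)	/* hypothetical counter */
	{
		instr_time	start,
					end;

		INSTR_TIME_SET_CURRENT(start);
		result = node->ExecProcNodeReal(node);
		INSTR_TIME_SET_CURRENT(end);
		INSTR_TIME_ACCUM_DIFF(node->timed_total, end, start);	/* hypothetical */
		node->n_timed_samples++;
	}
	else
	{
		/*
		 * Enough timed samples: stop taking timestamps and rely on the
		 * sampling timer's hit counts for this node from here on.
		 */
		node->n_untimed_calls++;	/* hypothetical */
		result = node->ExecProcNodeReal(node);
	}

	return result;
}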


-- 
greg




Re: Privileges on PUBLICATION

2023-03-14 Thread Gregory Stark (as CFM)
FYI this looks like it needs a rebase due to a conflict in copy.c and
an offset in pgoutput.c.

Is there anything specific that still needs review or do you think
you've handled all Peter's concerns? In particular, is there "a
comprehensive description of what it is trying to do"? :)




Re: Using each rel as both outer and inner for JOIN_ANTI

2023-03-14 Thread Gregory Stark (as CFM)
So what is the status of this patch?

It looks like you received some feedback from Emre, Tom, Ronan, and
Alvaro but it also looks like you responded to most or all of that.
Are you still blocked waiting for feedback? Anything specific you need
help with?

Or is the patch ready for commit now? In which case it would be good
to rebase it since it's currently failing to apply. Well it would be
good to rebase regardless but it would be especially important if we
want to get it committed :)




Re: pg_stat_statements and "IN" conditions

2023-03-14 Thread Gregory Stark (as CFM)
So I was seeing that this patch needs a rebase according to cfbot.

However it looks like the review feedback you're looking for is more
of design questions. What jumbling is best to include in the feature
set and which is best to add in later patches. It sounds like you've
gotten conflicting feedback from initial reviews.

It does sound like the patch is pretty mature and you're actively
responding to feedback so if you got more authoritative feedback it
might even be committable now. It's already been two years of being
rolled forward so it would be a shame to keep rolling it forward.

Or is there some fatal problem that you're trying to work around and
still haven't found the magic combination that convinces any
committers this is something we want? In which case perhaps we set
this patch returned? I don't get that impression myself though.




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-03-14 Thread Gregory Stark (as CFM)
The pgindent run in b6dfee28f is causing this patch to need a rebase
for the cfbot to apply it.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2023-03-14 Thread Greg Stark
On Tue, 14 Mar 2023 at 13:59, Tom Lane  wrote:
>
> "Gregory Stark (as CFM)"  writes:
> > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and
> > fe-connect.c. Every hunk is failing which perhaps means the code
> > you're patching has been moved or refactored?
>
> The cfbot is giving up after
> v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails,
> but that's been superseded (at least in part) by b6dfee28f.

Ah, same with Jelte Fennema's patch for load balancing in libpq.

-- 
greg




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-03-14 Thread Gregory Stark (as CFM)
It does look like a rebase for meson.build would be helpful. I'm not
marking it waiting on author just for meson.build conflicts but it
would be perhaps more likely to be picked up by a committer if it's
showing green in cfbot.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2023-03-14 Thread Tom Lane
"Gregory Stark (as CFM)"  writes:
> It looks like this needs a big rebase in fe-auth.c fe-auth-scram.c and
> fe-connect.c. Every hunk is failing which perhaps means the code
> you're patching has been moved or refactored?

The cfbot is giving up after
v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails,
but that's been superseded (at least in part) by b6dfee28f.

regards, tom lane




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2023-03-14 Thread Gregory Stark (as CFM)
It looks like this needs a big rebase in fe-auth.c fe-auth-scram.c and
fe-connect.c. Every hunk is failing which perhaps means the code
you're patching has been moved or refactored?




DROP DATABASE is interruptible

2023-03-14 Thread Andres Freund
Hi,


Unfortunately DROP DATABASE does not hold interrupts over its crucial steps. If
you e.g. set a breakpoint on DropDatabaseBuffers() and then send a SIGINT,
we'll process that interrupt before the transaction commits.

A later connect to that database ends with:
2023-03-14 10:22:24.443 PDT [3439153][client backend][3/2:0][[unknown]] PANIC:  
could not open critical system index 2662


It's not entirely obvious how to fix this. We can't just hold interrupts for
the whole transaction - for one, we hang if we do so, because it prevents
us from absorbing our own barrier:
/* Close all smgr fds in all backends. */

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));


ISTM that at the very least dropdb() needs to internally commit *before*
dropping buffers - after that point the database is corrupt.
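
Schematically (an illustration of the ordering constraint only, not a
proposed patch):

/* Pseudo-structure of dropdb(); the step ordering is the point, not the code. */

/* 1. catalog work: remove the pg_database tuple, force out other backends */

/*
 * 2. make that state durable *before* anything irreversible happens, e.g. by
 *    committing the transaction internally here (after the smgr barrier has
 *    been absorbed), so a cancel can no longer leave a database that still
 *    looks valid in pg_database but has been gutted
 */

/* 3. the destructive part, with interrupts held throughout */
HOLD_INTERRUPTS();
DropDatabaseBuffers(db_id);		/* after this, the database is unusable */
/* ... drop pgstat entries, delete the on-disk files ... */
RESUME_INTERRUPTS();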

Greetings,

Andres Freund




Re: [PATCH] Add pretty-printed XML output option

2023-03-14 Thread Tom Lane
Jim Jones  writes:
> [ v22-0001-Add-pretty-printed-XML-output-option.patch ]

I poked at this for awhile and ran into a problem that I'm not sure
how to solve: it misbehaves for input with embedded DOCTYPE.

regression=# SELECT xmlserialize(DOCUMENT '' as text indent);
 xmlserialize 
--
 +
  +
 
(1 row)

regression=# SELECT xmlserialize(CONTENT '' as text indent);
 xmlserialize 
--
 
(1 row)

The bad result for CONTENT is because xml_parse() decides to
parse_as_document, but xmlserialize_indent has no idea that happened
and tries to use the content_nodes list anyway.  I don't especially
care for the laissez faire "maybe we'll set *content_nodes and maybe
we won't" API you adopted for xml_parse, which seems to be contributing
to the mess.  We could pass back more info so that xmlserialize_indent
knows what really happened.  However, that won't fix the bogus output
for the DOCUMENT case.  Are we perhaps passing incorrect flags to
xmlSaveToBuffer?

regards, tom lane




Re: ICU locale validation / canonicalization

2023-03-14 Thread Jeff Davis
On Tue, 2023-03-14 at 08:08 +0100, Peter Eisentraut wrote:
> Another issue that came to mind:  Right now, you can, say, develop
> SQL 
> schemas on a newer ICU version, say, your laptop, and then deploy
> them 
> on a server running an older ICU version.  If we have a cutoff beyond
> which we convert ICU locale IDs to language tags, then this won't
> work 
> anymore for certain combinations.  And RHEL/CentOS 7 is still pretty 
> popular.

If we just uloc_canonicalize() in icu_set_collation_attributes() then
versions 50-53 can support language tags. Patch attached.

One loose end is that we really should support language tags like "und"
in those older versions (54 and earlier). Your commit d72900bded
avoided the problem, but perhaps we should fix it by looking for "und"
and replacing it with "root" while opening, or something.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From ba1f0794d3be3fe7a9a365f4b6312bef9c215728 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 14 Mar 2023 09:58:29 -0700
Subject: [PATCH v1] Support language tags in older ICU versions (53 and
 earlier).

By calling uloc_canonicalize() before parsing the attributes, the
existing locale attribute parsing logic works on language tags as
well.

Fix a small memory leak, too.
---
 src/backend/commands/collationcmds.c  |  8 +++---
 src/backend/utils/adt/pg_locale.c | 26 ---
 .../regress/expected/collate.icu.utf8.out |  8 ++
 src/test/regress/sql/collate.icu.utf8.sql |  4 +++
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8949684afe..b8f2e7059f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -950,7 +950,6 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			const char *name;
 			char	   *langtag;
 			char	   *icucomment;
-			const char *iculocstr;
 			Oid			collid;
 
 			if (i == -1)
@@ -959,20 +958,19 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 name = uloc_getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
-			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
 			 * pg_collation
 			 */
-			if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+			if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
 continue;
 
 			collid = CollationCreate(psprintf("%s-x-icu", langtag),
 	 nspid, GetUserId(),
 	 COLLPROVIDER_ICU, true, -1,
-	 NULL, NULL, iculocstr, NULL,
-	 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+	 NULL, NULL, langtag, NULL,
+	 get_collation_actual_version(COLLPROVIDER_ICU, langtag),
 	 true, true);
 			if (OidIsValid(collid))
 			{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 1d3d4d86d3..b9c7fbd511 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2643,9 +2643,28 @@ pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
 {
-	char	   *str = asc_tolower(loc, strlen(loc));
+	UErrorCode	status;
+	int32_t		len;
+	char	   *icu_locale_id;
+	char	   *lower_str;
+	char	   *str;
 
-	str = strchr(str, '@');
+	/* first, make sure the string is an ICU format locale ID */
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, );
+	icu_locale_id = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, );
+	if (U_FAILURE(status))
+		ereport(ERROR,
+(errmsg("canonicalization failed for locale string \"%s\": %s",
+		loc, u_errorName(status;
+
+	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
+
+	pfree(icu_locale_id);
+
+	str = strchr(lower_str, '@');
 	if (!str)
 		return;
 	str++;
@@ -2660,7 +2679,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			char	   *value;
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
-			UErrorCode	status;
 
 			status = U_ZERO_ERROR;
 
@@ -2727,6 +2745,8 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 loc, u_errorName(status;
 		}
 	}
+
+	pfree(lower_str);
 }
 
 #endif			/* USE_ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9a3e12e42d..057c1d1bf6 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1304,6 +1304,14 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
  t| t
 (1 row)
 
+-- test language tags
+CREATE COLLATION langtag (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE langtag = 'AbCd' COLLATE langtag;
+ ?column? 
+--
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE 

Re: Add LZ4 compression in pg_dump

2023-03-14 Thread Tomas Vondra
On 3/14/23 12:07, gkokola...@pm.me wrote:
> 
> 
> --- Original Message ---
> On Monday, March 13th, 2023 at 10:47 PM, Tomas Vondra 
>  wrote:
> 
> 
> 
>>
>>> Change pg_fatal() to an assertion+comment;
>>
>>
>> Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
>> could add such protections against "impossible" stuff to a zillion other
>> places and the confusion likely outweighs the benefits.
>>
> 
> A minor note to add is to not ignore the lessons learned from a7885c9bb.
> 
> For example, as the testing framework stands, one can not test that the
> contents of the custom format are indeed compressed. One can infer it by
> examining the header of the produced dump and searching for the
> compression flag. The code responsible for writing the header and the
> code responsible for actually dealing with data, is not the same. Also,
> the compression library itself will happily read and write uncompressed
> data.
> 
> A pg_fatal, assertion, or similar, is the only guard rail against this
> kind of error. Without it, the tests will continue passing even after
> e.g. a wrong initialization of the API. It was such a case that lead to
> a7885c9bb in the first place. I do think that we wish it to be an
> "impossible" case. Also it will be an untested case with some history
> without such a guard rail.
> 

So is the pg_fatal() dead code or not? My understanding was it's not
really reachable, and the main purpose is to remind people this is not
possible. Or am I mistaken/confused?

If it's reachable, can we test it? AFAICS we don't, per the coverage
reports.

If it's just a protection against incorrect API initialization, then an
assert is the right solution, I think. With proper comment. But can't we
actually verify that *during* the initialization?

Also, how come WriteDataToArchiveLZ4() doesn't need this protection too?
Or is that due to gzip being the default compression method?

> Of course I will not object to removing it, if you think that is more
> confusing than useful.
> 

Not sure, I have a feeling I don't quite understand in what situation
this actually helps.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: ICU 54 and earlier are too dangerous

2023-03-14 Thread Jeff Davis
On Mon, 2023-03-13 at 18:13 -0700, Andres Freund wrote:
> What non-error code is returned in the above example?

When the collator for locale "asdf" is opened, the status is set to
U_USING_DEFAULT_WARNING.

That seemed very promising at first, but it's the same thing returned
after opening most valid locales, including "en" and "en-US". It seems
to only return U_ZERO_ERROR on an exact hit, like "fr-CA" or "root".

There's also U_USING_FALLBACK_WARNING, which also seemed promising, but
it's returned when opening "fr-FR" or "ja-JP" (falls back to "fr" and
"ja" respectively).

> Can we query the returned collator and see if it matches what we were
> looking
> for?

I tried a few variations of that in my canonicalization / validation
patch, which I called "validation". The closest thing I found is:

   ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, )

We could strip away the attributes and compare to the result of that,
and it mostly works. There are a few complications, like I think we
need to preserve the "collation" attribute for things like
"de@collation=phonebook".

Another thing to consider is that the environment might happen to open
the collation you intend at the time the collation is created, but then
later of course the environment can change, so we'd have to check every
time it's opened. And getting an error when the collation is opened is
not great, so it might need to be a WARNING or something, and it starts
to get less useful.

What would be *really* nice is if there was some kind of way to tell if
there was no real match to a known locale, either during open or via
some other API. I wasn't able to find one, though.

Actually, now that I think about it, we could just search all known
locales using either ucol_getAvailable() or uloc_getAvailable(), and
see if there's a match. Not very clean, but it should catch most
problems. I'll look into whether there's a reasonable way to match or
not.

> 
> I'm a bit confused by the dates.
> https://icu.unicode.org/download/55m1 says
> that version was released 2014-12-17, but the linked issue around
> root locales
> is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823  - I
> guess
> the issue tracker was migrated at some point or such...

The dates are misleading in both git (migrated from SVN circa 2016) and
JIRA (migrated circa 2018, see
https://unicode-org.atlassian.net/browse/ICU-1 ). It seems 55.1 was
released in either 2014 or 2015.


Regards,
Jeff Davis





Re: Add LZ4 compression in pg_dump

2023-03-14 Thread Tomas Vondra



On 3/14/23 16:18, gkokola...@pm.me wrote:
> ...> Would you mind me trying to come up with a patch to address your points?
> 

That'd be great, thanks. Please keep it split into smaller patches - two
might work, with one patch for "cosmetic" changes and the other tweaking
the API error-handling stuff.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Allow logical replication to copy tables in binary format

2023-03-14 Thread Takamichi Osumi (Fujitsu)
On Tuesday, March 14, 2023 8:02 PM Melih Mutlu  wrote:
> Attached v13.
Hi,


Thanks for sharing v13. Few minor review comments.
(1) create_subscription.sgml

+  column types as opposed to text format. Even when this option is 
enabled,
+  only data types having binary send and receive functions will be
+  transferred in binary. Note that the initial synchronization requires

(1-1)

I think it's helpful to add a reference for the description about send and 
receive functions (e.g. to the page of CREATE TYPE).

(1-2)

Also, it would be better to have a cross-reference from there back to this doc, 
probably as one paragraph in "Notes". I suggest this because send and receive 
functions are described as "optional" there, and missing them leads to an error in 
the context of binary table synchronization.

(3) copy_table()

+   /*
+* If the publisher version is earlier than v14, it COPY command doesn't
+* support the binary option.
+*/

This sentence doesn't look correct grammatically. We can replace "it COPY 
command" with "subscription" for example. Kindly please fix it.


Best Regards,
Takamichi Osumi





Re: Add LZ4 compression in pg_dump

2023-03-14 Thread gkokolatos






--- Original Message ---
On Monday, March 13th, 2023 at 9:21 PM, Tomas Vondra 
 wrote:


>
>
>
>
> On 3/11/23 11:50, gkokola...@pm.me wrote:
>
> > --- Original Message ---
> > On Saturday, March 11th, 2023 at 7:00 AM, Alexander Lakhin 
> > exclus...@gmail.com wrote:
> >
> > > Hello,
> > > 23.02.2023 23:24, Tomas Vondra wrote:

>
>
> Thanks for the patch.
>
> I did look if there are other places that might have the same issue, and
> I think there are - see attached 0002. For example LZ4File_write is
> declared to return size_t, but then it also does
>
> if (LZ4F_isError(status))
> {
> fs->errcode = status;
>
> return -1;
> }
>
> That won't work :-(

You are right. It is confusing.

>
> And these issues may not be restricted to lz4 code - Gzip_write is
> declared to return size_t, but it does
>
> return gzwrite(gzfp, ptr, size);
>
> and gzwrite returns int. Although, maybe that's correct, because
> gzwrite() is "0 on error" so maybe this is fine ...
>
> However, Gzip_read assigns gzread() to size_t, and that does not seem
> great. It probably will still trigger the following pg_fatal() because
> it'd be very lucky to match the expected 'size', but it's confusing.

Agreed.

>
>
> I wonder whether CompressorState should use int or size_t for the
> read_func/write_func callbacks. I guess no option is perfect, i.e. no
> data type will work for all compression libraries we might use (lz4 uses
> int while lz4f uses size_t, to there's that).
>
> It's a bit weird the "open" functions return int and the read/write
> size_t. Maybe we should stick to int, which is what the old functions
> (cfwrite etc.) did.
>
You are right. These functions are modeled on open/fread/
fwrite etc., and they have kept the return types of those functions. Their
callers do check the return value of read_func and write_func against
the requested number of bytes to be transferred.

>
> But I think the actual problem here is that the API does not clearly
> define how errors are communicated. I mean, it's nice to return the
> value returned by the library function without "mangling" it by
> conversion to size_t, but what if the libraries communicate errors in
> different way? Some may return "0" while others may return "-1".

Agreed.

>
> I think the right approach is to handle all library errors and not just
> let them through. So Gzip_write() needs to check the return value, and
> either call pg_fatal() or translate it to an error defined by the API.

It makes sense. It will change some of the behaviour of the callers,
mostly on what constitutes an error, and what error message is emitted.
This is a reasonable change though.

>
> For example we might say "returns 0 on error" and then translate all
> library-specific errors to that.

Ok.
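
Something along these lines, then, assuming the gzFile is stashed in
CFH->private_data and that we agree on "0 means error" (only a sketch,
not the final code):

    static size_t
    Gzip_write(const void *ptr, size_t size, CompressFileHandle *CFH)
    {
        gzFile      gzfp = (gzFile) CFH->private_data;
        int         ret;

        ret = gzwrite(gzfp, ptr, size);

        /* translate zlib's "0 on error" into the API's agreed error value */
        if (ret <= 0)
            return 0;

        return (size_t) ret;
    }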

> While looking at the code I realized a couple function comments don't
> say what's returned in case of error, etc. So 0004 adds those.
>
> 0003 is a couple minor assorted comments/questions:
>
> - Should we move ZLIB_OUT_SIZE/ZLIB_IN_SIZE to compress_gzip.c?

It would make things clearer.

> - Why are LZ4 buffer sizes different (ZLIB has both 4kB)?

Clearly some comments are needed, if the difference makes sense.

> - I wonder if we actually need LZ4F_HEADER_SIZE_MAX? Is it even possible
> for LZ4F_compressBound to return value this small (especially for 16kB
> input buffer)?
>

I would recommend keeping it. Earlier versions of LZ4 do not provide
LZ4F_HEADER_SIZE_MAX; later versions do advise using it.

Would you mind me trying to come up with a patch to address your points?

Cheers,
//Georgios

>
>
> regards
>
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-14 Thread Ilya Gladyshev



> On 14 Mar 2023, at 18:34, Justin Pryzby  wrote:
> 
> On Mon, Mar 13, 2023 at 10:42:59AM -0400, Tom Lane wrote:
>> Justin Pryzby  writes:
>>> On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote:
 I agree that adding such a field to IndexStmt would be a very bad idea.
 However, adding another parameter to DefineIndex doesn't seem like a
 problem.
>> 
>>> It's a problem since this is a bug and it's desirable to backpatch a
>>> fix, right ?
>> 
>> I do not think this is important enough to justify a back-patch.
> 
> That's fine with me, but it comes as a surprise, and it might invalidate
> earlier discussions, which were working under the constraint of
> maintaining a compatible ABI.
> 
>>> Incrementing by 0 sounds terrible, since someone who has intermediate
>>> partitioned tables is likely to always see 0% done.
>> 
>> How so?  The counter will increase after there's some actual work done,
>> ie building an index.  If there's no indexes to build then it hardly
>> matters, because the command will complete in very little time.
> 
> I misunderstood your idea as suggesting to skip progress reporting for
> *every* intermediate partitioned index, and its leaves.  But I guess
> what you meant is to skip progress reporting when ATTACHing an
> intermediate partitioned index. That seems okay, since 1) intermediate
> partitioned tables are probably rare, and 2) ATTACH is fast, so the
> effect is indistinguisable from querying the progress report a few
> moments later.

+1 that counting attached partitioned indexes as 0 is fine. 

> The idea would be for:
> 1) TOTAL to show the number of direct and indirect leaf partitions;
> 2) update progress while building direct or indirect indexes;
> 3) ATTACHing intermediate partitioned tables to increment by 0;
> 4) ATTACHing a direct child should continue to increment by 1,
> since that common case already works as expected and shouldn't be
> changed.
> 
> The only change from the current patch is (3).  (1) still calls
> count_leaf_partitions(), but only once.  I'd prefer that to rearranging
> the progress reporting to set the TOTAL in ProcessUtilitySlow().
> 
> -- 
> Justin

As for reusing a TOTAL calculated outside of DefineIndex: as far as I can see, 
ProcessUtilitySlow is not the only call site for DefineIndex (although I don't 
know whether all of them need progress tracking); for instance, there is ALTER 
TABLE that calls DefineIndex to create indexes for constraints. So I feel like 
rearranging the progress reporting will result in unnecessary code duplication in 
those call sites, so passing in an optional parameter seems to be easier here, 
if we are going to optimize it, after all. Especially if back-patching is a 
non-issue.
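
To make that concrete, the optional parameter could look roughly like this
inside DefineIndex (only a sketch, names approximate; count_leaf_partitions()
is the helper from the current patch, and a total_parts of -1 would mean
"caller did not compute it"):

    /* sketch: -1 tells DefineIndex to compute the leaf count itself */
    if (total_parts < 0)
        total_parts = count_leaf_partitions(relationId);

    /* report the total only once, for the top-level partitioned index */
    if (!OidIsValid(parentIndexId))
        pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
                                     total_parts);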





Re: doc: mentioned CREATE+ATTACH PARTITION as an alternative to CREATE TABLE..PARTITION OF

2023-03-14 Thread Jakub Wartak
Hi, I've tested the attached patch by Justin and it applied almost
cleanly to the master, but there was a tiny typo and make
postgres-A4.pdf didn't want to run:
Note that creating a partition using PARTITION OF
=> (note lack of closing literal) =>
Note that creating a partition using PARTITION OF

Attached is version v0002 that contains this fix. @Justin maybe you
could set the status to Ready for Committer (
https://commitfest.postgresql.org/42/3790/ ) ?

On Thu, Jan 19, 2023 at 10:58 PM Justin Pryzby  wrote:
>
> On Thu, Jan 19, 2023 at 04:47:59PM -0500, Robert Treat wrote:
> > I think all of that feedback is useful, I guess the immediate question
> > becomes if Justin wants to try to proceed with his patch implementing
> > the change, or if adjusting the documentation for the current
> > implementation is the right move for now.
>
> The docs change is desirable in any case, since it should be
> backpatched, and any patch to change CREATE..PARTITION OF would be for
> v17+ anyway.
>
> --
> Justin
>
>


v0002-Justin-doc_mention_CREATE+ATTACH_PARTITION.patch
Description: Binary data


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-14 Thread Tom Lane
Justin Pryzby  writes:
> The idea would be for:
> 1) TOTAL to show the number of direct and indirect leaf partitions;
> 2) update progress while building direct or indirect indexes;
> 3) ATTACHing intermediate partitioned tables to increment by 0;
> 4) ATTACHing a direct child should continue to increment by 1,
> since that common case already works as expected and shouldn't be
> changed.

OK.

> The only change from the current patch is (3).  (1) still calls
> count_leaf_partitions(), but only once.  I'd prefer that to rearranging
> the progress reporting to set the TOTAL in ProcessUtilitySlow().

I don't agree with that.  find_all_inheritors is fairly expensive
and it seems completely silly to do it twice just to avoid adding
a parameter to DefineIndex.

regards, tom lane




Re: Ignoring BRIN for HOT updates (was: -udpates seems broken)

2023-03-14 Thread Matthias van de Meent
On Tue, 14 Mar 2023 at 14:49, Tomas Vondra
 wrote:
>
>
>
> On 3/8/23 23:31, Matthias van de Meent wrote:
> > On Wed, 22 Feb 2023 at 14:14, Matthias van de Meent
> >
> > I think that the v4 patch solves all comments up to now; and
> > considering that most of this patch was committed but then reverted
> > due to an issue in v15, and that said issue is fixed in this patch,
> > I'm marking this as ready for committer.
> >
> > Tomas, would you be up for that?
> >
>
> Thanks for the patch. I started looking at it yesterday, and I think
> it's 99% RFC. I think it's correct and I only have some minor comments,
> (see the 0002 patch):
>
>
> 1) There were still a couple minor wording issues in the sgml docs.
>
> 2) bikeshedding: I added a bunch of "()" to various conditions, I think
> it makes it clearer.

Sure

> 3) This seems a bit weird way to write a conditional Assert:
>
> if (onlySummarized)
> Assert(HeapTupleIsHeapOnly(heapTuple));
>
> better to do a composed Assert(!(onlySummarized && !...)) or something?

I don't like this double negation, as it adds significant parsing
complexity to the statement. If I had gone with a single Assert()
statement, I'd have used the following:

Assert((!onlySummarized) || HeapTupleIsHeapOnly(heapTuple));

because in the code section above that the HOT + !onlySummarized case
is an early exit.

> 4) A couple comments and minor tweaks.
> 5) Undoing a couple unnecessary changes (whitespace, ...).
> 6) Proper formatting of TU_UpdateIndexes enum.

Allright

> + *
> + * XXX Why do we assign explicit values to the members, instead of just 
> letting
> + * it up to the enum (just like for TM_Result)?

This was from the v15 beta window, to reduce the difference between
bool and TU_UpdateIndexes. With pg16, that can be dropped.

>
> 7) Comment in RelationGetIndexAttrBitmap() is misleading, as it still
> references hotblockingattrs, even though it may update summarizedattrs
> in some cases.

How about

  Since we have covering indexes with non-key columns, we must
  handle them accurately here. Non-key columns must be added into
  the hotblocking or summarizing attrs bitmap, since they are in
  the index, and update shouldn't miss them.

instead for that section?

> If you agree with these changes, I'll get it committed.

Yes, thanks!

Kind regards,

Matthias van de Meent




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-14 Thread Justin Pryzby
On Mon, Mar 13, 2023 at 10:42:59AM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote:
> >> I agree that adding such a field to IndexStmt would be a very bad idea.
> >> However, adding another parameter to DefineIndex doesn't seem like a
> >> problem.
> 
> > It's a problem since this is a bug and it's desirable to backpatch a
> > fix, right ?
> 
> I do not think this is important enough to justify a back-patch.

That's fine with me, but it comes as a surprise, and it might invalidate
earlier discussions, which were working under the constraint of
maintaining a compatible ABI.

> > Incrementing by 0 sounds terrible, since someone who has intermediate
> > partitioned tables is likely to always see 0% done.
> 
> How so?  The counter will increase after there's some actual work done,
> ie building an index.  If there's no indexes to build then it hardly
> matters, because the command will complete in very little time.

I misunderstood your idea as suggesting to skip progress reporting for
*every* intermediate partitioned index, and its leaves.  But I guess
what you meant is to skip progress reporting when ATTACHing an
intermediate partitioned index. That seems okay, since 1) intermediate
partitioned tables are probably rare, and 2) ATTACH is fast, so the
effect is indistinguishable from querying the progress report a few
moments later.

The idea would be for:
1) TOTAL to show the number of direct and indirect leaf partitions;
2) update progress while building direct or indirect indexes;
3) ATTACHing intermediate partitioned tables to increment by 0;
4) ATTACHing a direct child should continue to increment by 1,
since that common case already works as expected and shouldn't be
changed.

The only change from the current patch is (3).  (1) still calls
count_leaf_partitions(), but only once.  I'd prefer that to rearranging
the progress reporting to set the TOTAL in ProcessUtilitySlow().

-- 
Justin




Re: lz4 --rm on Ubuntu 18.04 (Add LZ4 compression to pg_dump)

2023-03-14 Thread Tomas Vondra
On 3/14/23 11:34, Christoph Berg wrote:
> Re: Tomas Vondra
>> and I don't think there's a good place to inject the 'rm' so I ended up
>> adding a 'cleanup_cmd' right after 'compress_cmd'. But it seems a bit
>> strange / hacky. Maybe there's a better way?
> 
> Does the file need to be removed at all? Could we leave it there and
> have "make clean" remove it?
> 

I don't think that'd work, because of the automatic "discovery" where we
check if a file exists, and if not we try to append .gz and .lz4. So if
you leave the .toc, we'd not find the .lz4, making the test useless ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: meson: Non-feature feature options

2023-03-14 Thread Nazir Bilal Yavuz
Hi,

On Wed, 22 Feb 2023 at 12:14, Peter Eisentraut
 wrote:
>
> On 21.02.23 17:32, Nazir Bilal Yavuz wrote:
> >>> I like the second approach, with a 'uuid' feature option.  As you wrote
> >>> earlier, adding an 'auto' choice to a combo option doesn't work fully 
> >>> like a
> >>> real feature option.
> >> But we can make it behave exactly like one, by checking the auto_features
> >> option.
> > Yes, we can set it like `uuidopt = get_option('auto_features')`.
> > However, if someone wants to set 'auto_features' to 'disabled' but
> > 'uuid' to 'enabled'(to find at least one working uuid library); this
> > won't be possible. We can add 'enabled', 'disabled and 'auto' choices
> > to 'uuid' combo option to make all behaviours possible but adding
> > 'uuid' feature option and 'uuid_library' combo option seems better to
> > me.
>
> I think the uuid side of this is making this way too complicated.  I'm
> content leaving this as a manual option for now.
>
> There is much more value in making the ssl option work automatically.
> So I would welcome a patch that just makes -Dssl=auto work smoothly,
> perhaps using the "trick" that Andres described.

I tried to implement for uuid what we did for ssl as well; do you have
any comments?

Regards,
Nazir Bilal Yavuz
Microsoft
From 0e92050c15ca6e9ebeddaafad229eee53fc9e7b0 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 14 Mar 2023 16:38:02 +0300
Subject: [PATCH v1] meson: Make auto the default of the uuid option

---
 .cirrus.yml   |  5 +--
 meson.build   | 80 ++-
 meson_options.txt |  4 +-
 src/makefiles/meson.build |  2 +-
 4 files changed, 59 insertions(+), 32 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 505c50f3285..1ccb14c6340 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -181,7 +181,7 @@ task:
 su postgres <<-EOF
   meson setup \
 --buildtype=debug \
--Dcassert=true -Duuid=bsd -Dtcl_version=tcl86 -Ddtrace=auto \
+-Dcassert=true -Dtcl_version=tcl86 -Ddtrace=auto \
 -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 -Dextra_lib_dirs=/usr/local/lib -Dextra_include_dirs=/usr/local/include/ \
 build
@@ -243,7 +243,6 @@ LINUX_CONFIGURE_FEATURES: _CONFIGURE_FEATURES >-
 
 LINUX_MESON_FEATURES: _MESON_FEATURES >-
   -Dllvm=enabled
-  -Duuid=e2fs
 
 
 task:
@@ -496,7 +495,7 @@ task:
   -Dextra_include_dirs=${brewpath}/include \
   -Dextra_lib_dirs=${brewpath}/lib \
   -Dcassert=true \
-  -Duuid=e2fs -Ddtrace=auto \
+  -Ddtrace=auto \
   -Dsegsize_blocks=6 \
   -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
   build
diff --git a/meson.build b/meson.build
index 2ebdf914c1b..7955ae42d03 100644
--- a/meson.build
+++ b/meson.build
@@ -1282,34 +1282,61 @@ endif
 ###
 
 uuidopt = get_option('uuid')
+uuid_library = 'none'
+uuid = not_found_dep
+if uuidopt == 'auto' and auto_features.disabled()
+  uuidopt = 'none'
+endif
+
 if uuidopt != 'none'
-  uuidname = uuidopt.to_upper()
-  if uuidopt == 'e2fs'
-uuid = dependency('uuid', required: true)
-uuidfunc = 'uuid_generate'
-uuidheader = 'uuid/uuid.h'
-  elif uuidopt == 'bsd'
-# libc should have uuid function
-uuid = declare_dependency()
-uuidfunc = 'uuid_to_string'
-uuidheader = 'uuid.h'
-  elif uuidopt == 'ossp'
-uuid = dependency('ossp-uuid', required: true)
-uuidfunc = 'uuid_export'
-uuidheader = 'uuid.h'
-  else
-error('huh')
-  endif
 
-  if not cc.has_header_symbol(uuidheader, uuidfunc, args: test_c_args, dependencies: uuid)
-error('uuid library @0@ missing required function @1@'.format(uuidopt, uuidfunc))
-  endif
-  cdata.set('HAVE_@0@'.format(uuidheader.underscorify().to_upper()), 1)
+  uuids = {
+'e2fs': {
+  'uuiddep': 'uuid',
+  'uuidfunc': 'uuid_generate',
+  'uuidheader': 'uuid/uuid.h',
+},
+'bsd': {
+  'uuiddep': '',
+  'uuidfunc': 'uuid_to_string',
+  'uuidheader': 'uuid.h',
+},
+'ossp': {
+  'uuiddep': 'ossp-uuid',
+  'uuidfunc': 'uuid_export',
+  'uuidheader': 'uuid.h',
+}
+  }
 
-  cdata.set('HAVE_UUID_@0@'.format(uuidname), 1,
-   description: 'Define to 1 if you have @0@ UUID support.'.format(uuidname))
-else
-  uuid = not_found_dep
+  foreach uuidname, uuid_values : uuids
+if uuidopt != 'auto' and uuidname != uuidopt
+  continue
+endif
+uuid_required = (uuidname == uuidopt)
+uuiddep = uuid_values['uuiddep']
+uuidheader = uuid_values['uuidheader']
+uuidfunc = uuid_values['uuidfunc']
+
+# libc should have uuid function
+uuid = uuiddep != '' ? dependency(uuiddep, required: uuid_required) : \
+declare_dependency()
+
+if uuid.found() and cc.has_header_symbol(uuidheader, uuidfunc, args: test_c_args,
+ dependencies: uuid, required: uuid_required)
+  uuidname_upper = 

Re: Ignoring BRIN for HOT updates (was: -udpates seems broken)

2023-03-14 Thread Tomas Vondra


On 3/8/23 23:31, Matthias van de Meent wrote:
> On Wed, 22 Feb 2023 at 14:14, Matthias van de Meent
>  wrote:
>>
>> On Wed, 22 Feb 2023 at 13:15, Tomas Vondra
>>  wrote:
>>>
>>> On 2/20/23 19:15, Matthias van de Meent wrote:
 Thanks. Based on feedback, attached is v2 of the patch, with as
 significant changes:

 - We don't store the columns we mention in predicates of summarized
 indexes in the hotblocking column anymore, they are stored in the
 summarized columns bitmap instead. This further reduces the chance of
 failiing to apply HOT with summarizing indexes.
>>>
>>> Interesting idea. I need to think about the correctness, but AFAICS it
>>> should work. Do we have any tests covering such cases?
>>
>> There is a test that checks that an update to the predicated column
>> does update the index (on table brin_hot_2). However, the description
>> was out of date, so I've updated that in v4.
>>
 - The heaptuple header bit for summarized update in inserted tuples is
 replaced with passing an out parameter. This simplifies the logic and
 decreases chances of accidentally storing incorrect data.

>>>
>>> OK.
>>>
>>> 0002 proposes a minor RelationGetIndexPredicate() tweak, getting rid of
>>> the repeated if/else branches. Feel free to discard, if you think the v2
>>> approach is better.
>>
>> I agree that this is better, it's included in v4 of the patch, as attached.
> 
> I think that the v4 patch solves all comments up to now; and
> considering that most of this patch was committed but then reverted
> due to an issue in v15, and that said issue is fixed in this patch,
> I'm marking this as ready for committer.
> 
> Tomas, would you be up for that?
> 

Thanks for the patch. I started looking at it yesterday, and I think
it's 99% RFC. I think it's correct and I only have some minor comments,
(see the 0002 patch):


1) There were still a couple minor wording issues in the sgml docs.

2) bikeshedding: I added a bunch of "()" to various conditions, I think
it makes it clearer.

3) This seems a bit weird way to write a conditional Assert:

if (onlySummarized)
Assert(HeapTupleIsHeapOnly(heapTuple));

better to do a composed Assert(!(onlySummarized && !...)) or something?

4) A couple comments and minor tweaks.

5) Undoing a couple unnecessary changes (whitespace, ...).

6) Proper formatting of TU_UpdateIndexes enum.

7) Comment in RelationGetIndexAttrBitmap() is misleading, as it still
references hotblockingattrs, even though it may update summarizedattrs
in some cases.


If you agree with these changes, I'll get it committed.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 4459b7b325843d2dfcfad42d645ae5d3b47d784d Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 14 Mar 2023 01:11:38 +0100
Subject: [PATCH v5 1/2] Ignore BRIN indexes when checking for HOT updates

When determining whether an index update may be skipped by using HOT, we
can ignore attributes indexed by block summarizing indexes without
references to individual tuples that need to be cleaned up.

This also removes rd_indexattr list, and replaces it with rd_attrsvalid
flag. The list was not used anywhere, and a simple flag is sufficient.

A new type TU_UpdateIndexes is invented to provide a signal to the executor
to determine which indexes to update - no indexes, all indexes, or only
the summarizing indexes.

One otherwise unused bit in the heap tuple header is (ab)used to signal
that the HOT update would still update at least one summarizing index.
The bit is cleared immediately

Original patch by Josef Simanek, various fixes and improvements by
Tomas Vondra and me.

Authors: Josef Simanek, Tomas Vondra, Matthias van de Meent
Reviewed-by: Tomas Vondra, Alvaro Herrera
---
 doc/src/sgml/indexam.sgml |  13 ++
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/heap/heapam.c  |  48 +++-
 src/backend/access/heap/heapam_handler.c  |  19 ++-
 src/backend/access/nbtree/nbtree.c|   1 +
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/access/table/tableam.c|   2 +-
 src/backend/catalog/index.c   |   9 +-
 src/backend/catalog/indexing.c|  35 --
 src/backend/commands/copyfrom.c   |   5 +-
 src/backend/commands/indexcmds.c  |  10 +-
 src/backend/executor/execIndexing.c   |  37 --
 src/backend/executor/execReplication.c|   9 +-
 src/backend/executor/nodeModifyTable.c|  13 +-
 src/backend/nodes/makefuncs.c |   7 +-
 src/backend/utils/cache/relcache.c|  69 +++
 src/include/access/amapi.h|   2 +
 src/include/access/heapam.h   |  

Re: [PATCH] Add pretty-printed XML output option

2023-03-14 Thread Jim Jones

On 09.03.23 21:21, Tom Lane wrote:

Peter Smith  writes:

The patch v19 LGTM.

Another thing that's mildly irking me is that the current
factorization of this code will result in xml_parse'ing the data
twice, if you have both DOCUMENT and INDENT specified.  We could
consider avoiding that if we merged the indentation functionality
into xmltotext_with_xmloption, but it's probably premature to do so
when we haven't figured out how to get the output right --- we might
end up needing two xml_parse calls anyway with different parameters,
perhaps.


Just a thought: since xmlserialize_indent also calls xml_parse() to 
build the xmlDocPtr, couldn't we simply bypass 
xmltotext_with_xmloption() in case INDENT is specified?


Something like this:

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 19351fe..ea808dd 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -3829,6 +3829,7 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
    {
    Datum  *argvalue = op->d.xmlexpr.argvalue;
    bool   *argnull = op->d.xmlexpr.argnull;
+   text   *result;

    /* argument type is known to be xml */
    Assert(list_length(xexpr->args) == 1);
@@ -3837,8 +3838,14 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
    return;
    value = argvalue[0];

-   *op->resvalue = PointerGetDatum(xmltotext_with_xmloption(DatumGetXmlP(value),
-                                                            xexpr->xmloption));
+   if (xexpr->indent)
+       result = xmlserialize_indent(DatumGetXmlP(value),
+                                    xexpr->xmloption);
+   else
+       result = xmltotext_with_xmloption(DatumGetXmlP(value),
+                                         xexpr->xmloption);
+
+   *op->resvalue = PointerGetDatum(result);
    *op->resnull = false;
    }

    break;






Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32

2023-03-14 Thread Juan José Santamaría Flecha
Please, don't top post.

On Tue, Mar 14, 2023 at 12:30 PM Daniel Watzinger <
daniel.watzin...@gmail.com> wrote:

> I'm sorry I couldn't contribute to the discussion in time. The fix of the
> fstat() Win32 port looks good to me. I agree that there's a need for
> multiple fseek() ports to address the shortcomings of the MSVC
> functionality.
>
> The documentation even states that "on devices incapable of seeking, the
> return value is undefined". A simple wrapper using GetFileType() or the new
> fstat(), to filter non-seekable devices before delegation, will probably
> suffice.
>
>
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fseek-fseeki64?view=msvc-170
>

I have just posted a patch to enforce the detection of unseekable streams
in the fseek() calls [1], please feel free to review it.

[1]
https://www.postgresql.org/message-id/CAC%2BAXB26a4EmxM2suXxPpJaGrqAdxracd7hskLg-zxtPB50h7A%40mail.gmail.com

Regards,

Juan José Santamaría Flecha


Fix fseek() detection of unseekable files on WIN32

2023-03-14 Thread Juan José Santamaría Flecha
Hello all,

As highlighted in [1], fseek() might fail to report an error even when
accessing unseekable streams.

PFA a patch that checks the file type before the actual fseek(), so only
supported calls are made.

[1]
https://www.postgresql.org/message-id/flat/b1448cd7-871e-20e3-8398-895e2d1d3...@gmail.com

Regards,

Juan José Santamaría Flecha


0001-fix-fseek-detection-of-unseekable-files-for-WIN32.patch
Description: Binary data


Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread Amit Kapila
On Tue, Mar 14, 2023 at 5:03 PM vignesh C  wrote:
>
> On Tue, 14 Mar 2023 at 12:37, Peter Smith  wrote:
> >
>
> The same issue exists here too:
> 1)
> -   if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
> +   if (rbtxn_is_subtxn(txn))
> {
> -   toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> -   dclist_push_tail(>catchange_txns, 
> >catchange_node);
> +   ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);
>
> 2)
>  -   if (change->txn->toptxn)
> -   topxid = change->txn->toptxn->xid;
> +   if (rbtxn_is_subtxn(change->txn))
> +   topxid = rbtxn_get_toptxn(change->txn)->xid;
>
> If you plan to fix it, both the above should also be handled.
>

I don't know if it would be any better to change the above two as
compared to what the proposed patch has.

-- 
With Regards,
Amit Kapila.




Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread Amit Kapila
On Tue, Mar 14, 2023 at 12:37 PM Peter Smith  wrote:
>
> Thanks for the review!
>
> On Mon, Mar 13, 2023 at 6:19 PM vignesh C  wrote:
> ...
>
> > 4) We check if txn->toptxn is not null twice here both in if condition
> > and in the assignment, we could retain the assignment operation as
> > earlier to remove the 2nd check:
> > -   if (txn->toptxn)
> > -   txn = txn->toptxn;
> > +   if (isa_subtxn(txn))
> > +   txn = get_toptxn(txn);
> >
> > We could avoid one check again by:
> > +   if (isa_subtxn(txn))
> > +   txn = txn->toptxn;
> >
>
> Yeah, that is true, but I chose not to keep the original assignment in
> this case mainly because then it is consistent with the other changed
> code ---  e.g. Every other direct member assignment/access of the
> 'toptxn' member now hides behind the macros so I did not want this
> single place to be the odd one out. TBH, I don't think 1 extra check
> is of any significance, but it is not a problem to change like you
> suggested if other people also want it done that way.
>

Can't we directly use rbtxn_get_toptxn() for this case? I think that
way code will look neat. I see that it is not exactly matching the
existing check so you might be worried but I feel the new code will
achieve the same purpose and will be easy to follow.

-- 
With Regards,
Amit Kapila.




Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread vignesh C
On Tue, 14 Mar 2023 at 12:37, Peter Smith  wrote:
>
> Thanks for the review!
>
> On Mon, Mar 13, 2023 at 6:19 PM vignesh C  wrote:
> ...
> > Few comments:
> > 1) Can we move the macros along with the other macros present in this
> > file, just above this structure, similar to the macros added for
> > txn_flags:
> > /* Toplevel transaction for this subxact (NULL for top-level). */
> > +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> > +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> > +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
> >
> > 2) The macro name can be changed to rbtxn_is_toptxn, rbtxn_is_subtxn
> > and rbtxn_get_toptxn to keep it consistent with others:
> > /* Toplevel transaction for this subxact (NULL for top-level). */
> > +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> > +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> > +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
> >
> > 3) We could add separate comments for each of the macros:
> > /* Toplevel transaction for this subxact (NULL for top-level). */
> > +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> > +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> > +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
> >
>
> All the above are fixed as suggested.
>
> > 4) We check if txn->toptxn is not null twice here both in if condition
> > and in the assignment, we could retain the assignment operation as
> > earlier to remove the 2nd check:
> > -   if (txn->toptxn)
> > -   txn = txn->toptxn;
> > +   if (isa_subtxn(txn))
> > +   txn = get_toptxn(txn);
> >
> > We could avoid one check again by:
> > +   if (isa_subtxn(txn))
> > +   txn = txn->toptxn;
> >
>
> Yeah, that is true, but I chose not to keep the original assignment in
> this case mainly because then it is consistent with the other changed
> code ---  e.g. Every other direct member assignment/access of the
> 'toptxn' member now hides behind the macros so I did not want this
> single place to be the odd one out. TBH, I don't think 1 extra check
> is of any significance, but it is not a problem to change like you
> suggested if other people also want it done that way.

The same issue exists here too:
1)
-   if (toptxn != NULL && !rbtxn_has_catalog_changes(toptxn))
+   if (rbtxn_is_subtxn(txn))
{
-   toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
-   dclist_push_tail(&rb->catchange_txns, &txn->catchange_node);
+   ReorderBufferTXN *toptxn = rbtxn_get_toptxn(txn);

2)
 -   if (change->txn->toptxn)
-   topxid = change->txn->toptxn->xid;
+   if (rbtxn_is_subtxn(change->txn))
+   topxid = rbtxn_get_toptxn(change->txn)->xid;

If you plan to fix it, both the above should also be handled.

3) The comment on top of rbtxn_get_toptxn could be kept similar in
both the below places. I know it is not because of your change, but
since it is a very small change probably we could include it along
with this patch:
@@ -717,10 +717,7 @@ ReorderBufferProcessPartialChange(ReorderBuffer
*rb, ReorderBufferTXN *txn,
return;

/* Get the top transaction. */
-   if (txn->toptxn != NULL)
-   toptxn = txn->toptxn;
-   else
-   toptxn = txn;
+   toptxn = rbtxn_get_toptxn(txn);

/*
 * Indicate a partial change for toast inserts.  The change will be
@@ -812,10 +809,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn,
ReorderBufferTXN *toptxn;

/* get the top transaction */
-   if (txn->toptxn != NULL)
-   toptxn = txn->toptxn;
-   else
-   toptxn = txn;
+   toptxn = rbtxn_get_toptxn(txn);

Regards,
Vignesh




Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32

2023-03-14 Thread Daniel Watzinger
I'm sorry I couldn't contribute to the discussion in time. The fix of the
fstat() Win32 port looks good to me. I agree that there's a need for
multiple fseek() ports to address the shortcomings of the MSVC
functionality.

The documentation even states that "on devices incapable of seeking, the
return value is undefined". A simple wrapper using GetFileType() or the new
fstat(), to filter non-seekable devices before delegation, will probably
suffice.

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fseek-fseeki64?view=msvc-170
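
Roughly something like this, just to illustrate the idea (untested, and the
name pgwin32_fseek is made up; the real port would live next to the other
win32 wrappers):

    static int
    pgwin32_fseek(FILE *stream, pgoff_t offset, int origin)
    {
        HANDLE      hFile = (HANDLE) _get_osfhandle(_fileno(stream));

        /* refuse to seek on pipes, character devices and the like */
        if (GetFileType(hFile) != FILE_TYPE_DISK)
        {
            errno = ESPIPE;
            return -1;
        }

        return _fseeki64(stream, offset, origin);
    }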

Regarding test automation and regression testing, there's a programmatic
way to simulate how the "pipe operator" of cmd.exe and other shells works
using CreateProcess and manual "piping" by means of various WinAPI
functionality. This is actually how the bug was discovered in the first
place. However, existing tests are probably platform-agnostic.
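
For reference, the core of that simulation looks something like the following
(heavily trimmed, no error handling, and cmdline only standing in for the
actual command string):

    char        cmdline[] = "pg_dump -Fc postgres";    /* example only */
    SECURITY_ATTRIBUTES sa = {sizeof(sa), NULL, TRUE};
    STARTUPINFO si = {sizeof(si)};
    PROCESS_INFORMATION pi;
    HANDLE      hRead;
    HANDLE      hWrite;

    /* anonymous pipe; the read end must not be inherited by the child */
    CreatePipe(&hRead, &hWrite, &sa, 0);
    SetHandleInformation(hRead, HANDLE_FLAG_INHERIT, 0);

    /* wire the child's stdout to the pipe, exactly like "a | b" does */
    si.dwFlags = STARTF_USESTDHANDLES;
    si.hStdInput = GetStdHandle(STD_INPUT_HANDLE);
    si.hStdOutput = hWrite;
    si.hStdError = GetStdHandle(STD_ERROR_HANDLE);

    CreateProcess(NULL, cmdline, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi);
    CloseHandle(hWrite);

    /* the parent then drains hRead with ReadFile() and feeds the consumer */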

--
Daniel

On Tue, Mar 14, 2023 at 12:41 AM Michael Paquier 
wrote:

> On Mon, Mar 13, 2023 at 05:49:41PM +0100, Juan José Santamaría Flecha
> wrote:
> > WFM, making fseek() behaviour more resilient seems like a good
> improvement
> > overall.
>
> I have not looked in details, but my guess would be to add a
> win32seek.c similar to win32stat.c with a port of fseek() that's more
> resilient to the definitions in POSIX.
>
> > Should we open a new thread to make that part more visible?
>
> Yes, perhaps it makes sense to do so to attract the correct audience,
> There may be a few things we are missing.
>
> When it comes to pg_dump, both fixes are required, still it seems to
> me that adjusting the fstat() port and the fseek() ports are two
> different bugs, as they influence different parts of the code base
> when taken individually (aka this fseek() port for WIN32 would need
> fstat() to properly detect a pipe, as far as I understand).
>
> Meanwhile, I'll go apply and backpatch 0001 to fix the first bug at
> hand with the fstat() port, if there are no objections.
> --
> Michael
>


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-14 Thread John Naylor
I wrote:

> > > Since the block-level measurement is likely overestimating quite a
bit, I propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
> > >
> > > Thoughts?
> >
> > It looks to work but it still doesn't work in a case where a shared
> > tidstore is created with a 64kB memory limit, right?
> > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > from the beginning.
>
> I have two ideas:
>
> 1. Make it optional to track chunk memory space by a template parameter.
It might be tiny compared to everything else that vacuum does. That would
allow other users to avoid that overhead.
> 2. When context block usage exceeds the limit (rare), make the additional
effort to get the precise usage -- I'm not sure such a top-down facility
exists, and I'm not feeling well enough today to study this further.

Since then, Masahiko incorporated #1 into v31, and that's what I'm looking
at now. Unfortunately, if I had spent five minutes reminding myself what
the original objections were to this approach, I could have saved us some
effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace()
is slow [1], and fragmentation [2] (leading to underestimation).

In v31, in the local case at least, the underestimation is actually worse
than tracking chunk space, since it ignores chunk header and alignment.
I'm not sure about the DSA case. This doesn't seem great.

It shouldn't be a surprise why a simple increment of raw allocation size is
comparable in speed -- GetMemoryChunkSpace() calls the right function
through a pointer, which is slower. If we were willing to underestimate for
the sake of speed, that takes away the reason for making memory tracking
optional.

Further, if the option is not specified, in v31 there is no way to get the
memory use at all, which seems odd. Surely the caller should be able to ask
the context/area, if it wants to.
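
In other words, something as simple as this ought to always be possible
(hand-waving sketch, assuming a TidStoreIsShared() test and ts->area /
ts->context fields along the lines of the current patch):

    size_t
    TidStoreMemoryUsage(TidStore *ts)
    {
        if (TidStoreIsShared(ts))
            return dsa_get_total_size(ts->area);
        else
            return MemoryContextMemAllocated(ts->context, true);
    }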

I still like my idea at the top of the page -- at least for vacuum and
m_w_m. It's still not completely clear if it's right but I've got nothing
better. It also ignores the work_mem issue, but I've given up anticipating
all future cases at the moment.

I'll put this item and a couple other things together in a separate email
tomorrow.

[1]
https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
[2]
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Add LZ4 compression in pg_dump

2023-03-14 Thread gkokolatos



--- Original Message ---
On Monday, March 13th, 2023 at 10:47 PM, Tomas Vondra 
 wrote:



> 
> > Change pg_fatal() to an assertion+comment;
> 
> 
> Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
> could add such protections against "impossible" stuff to a zillion other
> places and the confusion likely outweighs the benefits.
> 

A minor note to add is to not ignore the lessons learned from a7885c9bb.

For example, as the testing framework stands, one can not test that the
contents of the custom format are indeed compressed. One can infer it by
examining the header of the produced dump and searching for the
compression flag. The code responsible for writing the header and the
code responsible for actually dealing with data, is not the same. Also,
the compression library itself will happily read and write uncompressed
data.

A pg_fatal, assertion, or similar, is the only guard rail against this
kind of error. Without it, the tests will continue passing even after
e.g. a wrong initialization of the API. It was such a case that lead to
a7885c9bb in the first place. I do think that we wish it to be an
"impossible" case. Also it will be an untested case with some history
without such a guard rail.

Of course I will not object to removing it, if you think that is more
confusing than useful.

Cheers,
//Georgios

> 
> regards
> 
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: Allow logical replication to copy tables in binary format

2023-03-14 Thread Melih Mutlu
Hi,

Attached v13.

Peter Smith , 14 Mar 2023 Sal, 03:07 tarihinde şunu
yazdı:

> Here are some review comments for patch v12-0001
>

Thanks for reviewing. I tried to make explanations in docs better
according to your comments.
What do you think?

 Amit Kapila , 14 Mar 2023 Sal, 06:17 tarihinde
şunu yazdı:

> I think it would better to write the tests for this feature in the
> existing test file 014_binary as that would save some time for node
> setup/shutdown and also that would be a more appropriate place for
> these tests.


I removed 032_binary_copy.pl and added those tests into 014_binary.pl.
Also added the case with different column order as Peter suggested.

Best,
-- 
Melih Mutlu
Microsoft


v13-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-14 Thread Gilles Darold

On 14/03/2023 at 10:50, stephane tachoires wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

V5 is ok (aside from LLVM 14 deprecation warnings, but that's another problem) 
with meson compile and meson test on Ubuntu 20.04.2.
The code fits well and looks standard, --help explains what it does clearly, and 
the documentation is ok (but as a Frenchman, I'm not an expert in English).



Thanks Stéphane, I've changed the commitfest status to "Ready for committers".

--
Gilles Darold





Re: lz4 --rm on Ubuntu 18.04 (Add LZ4 compression to pg_dump)

2023-03-14 Thread Christoph Berg
Re: Tomas Vondra
> and I don't think there's a good place to inject the 'rm' so I ended up
> adding a 'cleanup_cmd' right after 'compress_cmd'. But it seems a bit
> strange / hacky. Maybe there's a better way?

Does the file need to be removed at all? Could we leave it there and
have "make clean" remove it?

Christoph




Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-14 Thread stephane tachoires
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

V5 is ok (aside from LLVM 14 deprecation warnings, but that's another problem) 
with meson compile and meson test on Ubuntu 20.04.2.
The code fits well and looks standard, --help explains what it does clearly, and 
the documentation is ok (but as a Frenchman, I'm not an expert in English).

Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-14 Thread Stéphane Tachoires
Hi Gilles,
V5 is ok (aside from LLVM 14 deprecation warnings, but that's another
problem) with meson compile and meson test on Ubuntu 20.04.2.
The code fits well and looks standard, --help explains what it does clearly, and
the documentation is ok (but as a Frenchman, I'm not an expert in English).

Stéphane

On Mon, 13 Mar 2023 at 16:15, Gilles Darold  wrote:

> On 12/03/2023 at 19:05, Stéphane Tachoires wrote:
>
>
> Hi Gilles,
>
> On Ubuntu 22.04.2 all deb's updated, I have an error on a test
> I'll try to find where and why, but I think you should know first.
>
> 1/1 postgresql:pg_dump / pg_dump/002_pg_dumpERROR
>  24.40s   exit status 1
> ――
> ✀
>  
> ――
> stderr:
> #   Failed test 'only_dump_measurement: should dump CREATE TABLE
> test_compression'
> #   at /media/hddisk/stephane/postgresql/src/postgresql/src/bin/pg_dump/t/
> 002_pg_dump.pl line 4729.
> # Review only_dump_measurement results in
> /media/hddisk/stephane/postgresql/build/testrun/pg_dump/002_pg_dump/data/tmp_test_iJxJ
> # Looks like you failed 1 test of 10264.
>
>
> Hi Stephane,
>
>
> Odd, I do not have this error when running make installcheck, I have the
> same OS version as you.
>
>
> +++ tap check in src/bin/pg_dump +++
> t/001_basic.pl  ok
> t/002_pg_dump.pl .. ok
> t/003_pg_dump_with_server.pl .. ok
> t/010_dump_connstr.pl . ok
> All tests successful.
> Files=4, Tests=9531, 11 wallclock secs ( 0.33 usr  0.04 sys +  3.05
> cusr  1.22 csys =  4.64 CPU)
> Result: PASS
>
> Anyway this test must be fixed and this is done in version v5 of the patch
> attached here.
>
>
> Thanks for the review.
>
> --
> Gilles Darold
>
>

-- 
"Où se posaient les hirondelles avant l'invention du téléphone ?"
  -- Grégoire Lacroix


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Önder Kalacı
>
>
>
> Thanks for the updated patch.
> Few minor comments:
> 1) The extra line break after IsIndexOnlyOnExpression function can be
> removed:
>

removed


>
>
> 2) Generally we don't terminate with "." for single line comments
> +
> + /*
> + * Simple case, we already have a primary key or a replica identity index.
> + */
> + idxoid = GetRelationIdentityOrPK(localrel);
> + if (OidIsValid(idxoid))
> + return idxoid;
>
> Well, there are several "."  for single line comments even in the same
file such as:

/* 0 means it's a dropped attribute.  See comments atop AttrMap. */

I really don't have any preference on this, but I suspect that if I change it,
I'll get another review suggesting I conform to the existing style in the same
file. So I'm skipping this suggestion for now, unless you have objections.


v51_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread vignesh C
On Tue, 14 Mar 2023 at 14:36, Önder Kalacı  wrote:
>
>
> Amit Kapila , 14 Mar 2023 Sal, 11:59 tarihinde şunu 
> yazdı:
>>
>> On Tue, Mar 14, 2023 at 12:48 PM Önder Kalacı  wrote:
>> >>
>> >> 2.
>> >> +# make sure that the subscriber has the correct data after the update 
>> >> UPDATE
>> >>
>> >> "update UPDATE" seems to be a typo.
>> >>
>> >
>> > thanks, fixed
>> >
>> >>
>> >> 3.
>> >> +# now, drop the index with the expression, and re-create index on column 
>> >> lastname
>> >>
>> >> The comment says "re-create index on column lastname" but it seems we 
>> >> didn't do
>> >> that, should it be modified to something like:
>> >> # now, drop the index with the expression, we will use sequential scan
>> >>
>> >>
>> >
>> > Thanks, fixed
>> >
>> > I'll add the changes to v49 in the next e-mail.
>> >
>>
>> It seems you forgot to address these last two comments in the latest version.
>>
>
> Oops, sorry. I think when I get your test changes, I somehow overridden these 
> changes
> on my local.

Thanks for the updated patch.
Few minor comments:
1) The extra line break after IsIndexOnlyOnExpression function can be removed:
+ }
+
+ return true;
+}
+
+
+/*
+ * Returns true if the attrmap (which belongs to remoterel) contains the
+ * leftmost column of the index.
+ *
+ * Otherwise returns false.
+ */

2) Generally we don't terminate with "." for single line comments
+
+ /*
+ * Simple case, we already have a primary key or a replica identity index.
+ */
+ idxoid = GetRelationIdentityOrPK(localrel);
+ if (OidIsValid(idxoid))
+ return idxoid;

Regards,
Vignesh




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Önder Kalacı
Amit Kapila , 14 Mar 2023 Sal, 11:59 tarihinde
şunu yazdı:

> On Tue, Mar 14, 2023 at 12:48 PM Önder Kalacı 
> wrote:
> >>
> >> 2.
> >> +# make sure that the subscriber has the correct data after the update
> UPDATE
> >>
> >> "update UPDATE" seems to be a typo.
> >>
> >
> > thanks, fixed
> >
> >>
> >> 3.
> >> +# now, drop the index with the expression, and re-create index on
> column lastname
> >>
> >> The comment says "re-create index on column lastname" but it seems we
> didn't do
> >> that, should it be modified to something like:
> >> # now, drop the index with the expression, we will use sequential scan
> >>
> >>
> >
> > Thanks, fixed
> >
> > I'll add the changes to v49 in the next e-mail.
> >
>
> It seems you forgot to address these last two comments in the latest
> version.
>
>
Oops, sorry. I think when I get your test changes, I somehow overridden
these changes
on my local.


v50_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Amit Kapila
On Tue, Mar 14, 2023 at 12:48 PM Önder Kalacı  wrote:
>>
>> 2.
>> +# make sure that the subscriber has the correct data after the update UPDATE
>>
>> "update UPDATE" seems to be a typo.
>>
>
> thanks, fixed
>
>>
>> 3.
>> +# now, drop the index with the expression, and re-create index on column 
>> lastname
>>
>> The comment says "re-create index on column lastname" but it seems we didn't 
>> do
>> that, should it be modified to something like:
>> # now, drop the index with the expression, we will use sequential scan
>>
>>
>
> Thanks, fixed
>
> I'll add the changes to v49 in the next e-mail.
>

It seems you forgot to address these last two comments in the latest version.


-- 
With Regards,
Amit Kapila.




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-14 Thread Bharath Rupireddy
On Tue, Mar 7, 2023 at 11:14 PM Nathan Bossart  wrote:
>
> On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:
> > On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart  
> > wrote:
> >> Is it possible to memcpy more than a page at a time?
> >
> > It would complicate things a lot there; the logic to figure out the
> > last page bytes that may or may not fit in the whole page gets
> > complicated. Also, the logic to verify each page's header gets
> > complicated. We might lose out if we memcpy all the pages at once and
> > start verifying each page's header in another loop.
>
> Doesn't the complicated logic you describe already exist to some extent in
> the patch?  You are copying a page at a time, which involves calculating
> various addresses and byte counts.

Okay, here I am with the v10 patch set attached, which avoids multiple
memcpy calls; that should benefit callers who want to read more than
1 WAL buffer page (the streaming replication WAL sender, for instance).

> >> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for 
> >> given LSN %X/%X, Timeline ID %u",
> >> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
> >>
> >> I definitely don't think we should put an elog() in this code path.
> >> Perhaps this should be guarded behind WAL_DEBUG.
> >
> > Placing it behind WAL_DEBUG doesn't help users/developers. My
> > intention was to let users know that the WAL read hit the buffers,
> > it'll help them report if any issue occurs and also help developers to
> > debug that issue.
>
> I still think an elog() is mighty expensive for this code path, even when
> it doesn't actually produce any messages.  And when it does, I think it has
> the potential to be incredibly noisy.

Well, my motive was to give the user a way to know about WAL buffer hits
and misses so they can report any issues found. However, I plan to add
WAL buffer stats (hits/misses) later. I understand that even if someone
enables DEBUG1, this message can bloat server log files and make
recovery slower, especially on a standby. Hence, I agree to keep these
logs behind the WAL_DEBUG macro like the others, and did so in the attached
v10 patch set.

Please review the attached v10 patch set further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From aa6454d9abb9a70b728dbba7f40279108486a3e4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 14 Mar 2023 07:30:09 +
Subject: [PATCH v10] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c   | 171 
 src/backend/access/transam/xlogreader.c |  42 +-
 src/include/access/xlog.h   |   6 +
 3 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..d40b9562e1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,177 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state PG_USED_FOR_ASSERTS_ONLY,
+	XLogRecPtr startptr,
+	TimeLineID tli,
+	Size count,
+	char *buf)
+{
+	XLogRecPtr	ptr = startptr;
+	Size	nbytes = count;	/* total bytes requested to be read by caller */
+	Size	ntotal = 0;	/* total bytes read */
+	Size	nbatch = 0;	/* bytes to be read in single batch */
+	char	*batchstart = NULL;	/* location to read from for single batch */
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return ntotal;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+		char	*page;
+		char*data;
+		XLogPageHeader	phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;

Re: [PATCH] Add CANONICAL option to xmlserialize

2023-03-14 Thread Jim Jones
v4 attached fixes an encoding issue at the xml_parse call. It now uses 
GetDatabaseEncoding().


Best, Jim
From 3ff8e7bd9a9e43194d834ba6b125841539d5df1c Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Mon, 6 Mar 2023 14:08:54 +0100
Subject: [PATCH v4] Add CANONICAL format to xmlserialize

This patch introduces the CANONICAL option to xmlserialize, which
serializes xml documents in their canonical form - as described in
the W3C Canonical XML Version 1.1 specification. This option can
be used with the additional parameter WITH [NO] COMMENTS to keep
or remove xml comments from the canonical xml output. This feature
is based on the function xmlC14NDocDumpMemory from the C14N module
of libxml2.

This patch also includes regression tests and documentation.
---
 doc/src/sgml/datatype.sgml|  41 +-
 src/backend/executor/execExprInterp.c |  13 +++-
 src/backend/parser/gram.y |  14 +++-
 src/backend/parser/parse_expr.c   |   1 +
 src/backend/utils/adt/xml.c   |  57 ++
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |   9 +++
 src/include/parser/kwlist.h   |   1 +
 src/include/utils/xml.h   |   1 +
 src/test/regress/expected/xml.out | 108 ++
 src/test/regress/expected/xml_1.out   | 104 +
 src/test/regress/expected/xml_2.out   | 108 ++
 src/test/regress/sql/xml.sql  |  61 +++
 13 files changed, 514 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 467b49b199..46ec95dbb8 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4460,7 +4460,7 @@ xml 'bar'
 xml, uses the function
 xmlserialize:xmlserialize
 
-XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type )
+XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ CANONICAL [ WITH [NO] COMMENTS ] ])
 
 type can be
 character, character varying, or
@@ -4470,6 +4470,45 @@ XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS 
 
+   
+The option CANONICAL converts a given
+ XML document to its canonical form (https://www.w3.org/TR/xml-c14n11/#Terminology),
+ based on the W3C Canonical XML 1.1 Specification (https://www.w3.org/TR/xml-c14n11/).
+ It is primarily intended to let applications compare XML documents or test whether they
+ have been changed. The optional parameter WITH [NO] COMMENTS keeps or removes XML comments
+ from the given document.
+
+
+
+ Example:
+
+
+   

 When a character string value is cast to or from type
 xml without going through XMLPARSE or
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 19351fe34b..ce376a6fcd 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -3827,18 +3827,27 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
 
 		case IS_XMLSERIALIZE:
 			{
+text	   *result;
 Datum	   *argvalue = op->d.xmlexpr.argvalue;
 bool	   *argnull = op->d.xmlexpr.argnull;
+XmlSerializeFormat	format = op->d.xmlexpr.xexpr->format;
 
 /* argument type is known to be xml */
 Assert(list_length(xexpr->args) == 1);
 
 if (argnull[0])
 	return;
+
 value = argvalue[0];
 
-*op->resvalue = PointerGetDatum(xmltotext_with_xmloption(DatumGetXmlP(value),
-		 xexpr->xmloption));
+if (format == XMLCANONICAL || format == XMLCANONICAL_WITH_COMMENTS)
+	result = xmlserialize_canonical(DatumGetXmlP(value),
+	format);
+else
+	result = xmltotext_with_xmloption(DatumGetXmlP(value),
+	  xexpr->xmloption);
+
+*op->resvalue = PointerGetDatum(result);
 *op->resnull = false;
 			}
 			break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a0138382a1..af5f3dfdfd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -619,6 +619,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	xmltable_column_option_el
 %type 	xml_namespace_list
 %type 	xml_namespace_el
+%type  	opt_xml_serialize_format
 
 %type 	func_application func_expr_common_subexpr
 %type 	func_expr func_expr_windowless
@@ -676,7 +677,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
 	BOOLEAN_P BOTH BREADTH BY
 
-	CACHE CALL CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
+	CACHE CALL CALLED CANONICAL CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
 	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
 	CLUSTER COALESCE COLLATE COLLATION COLUMN COLUMNS COMMENT COMMENTS COMMIT
 	COMMITTED COMPRESSION CONCURRENTLY CONFIGURATION CONFLICT
@@ -15532,13 +15533,14 @@ func_expr_common_subexpr:
 	$$ = makeXmlExpr(IS_XMLROOT, NULL, NIL,
 	 list_make3($3, $5, $6), @1);
 }
-			| 

Re: logical decoding and replication of sequences, take 2

2023-03-14 Thread John Naylor
I tried a couple toy examples with various combinations of use styles.

Three with "automatic" reading from sequences:

create table test(i serial);
create table test(i int GENERATED BY DEFAULT AS IDENTITY);
create table test(i int default nextval('s1'));

...where s1 has some non-default parameters:

CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;

...and then two with explicit use of s1, one inserting the 'nextval' into a
table with no default, and one with no table at all, just selecting from
the sequence.

The last two seem to work similarly to the first three, so it seems like
FOR ALL TABLES adds all sequences as well. Is that expected? The
documentation for CREATE PUBLICATION mentions sequence options, but doesn't
really say how these options should be used.

Here's the script:

# alter system set wal_level='logical';
# restart
# port  is subscriber

echo
echo "PUB:"
psql -c "drop sequence if exists s1;"
psql -c "drop publication if exists pub1;"

echo
echo "SUB:"
psql -p  -c "drop sequence if exists s1;"
psql -p  -c "drop subscription if exists sub1 ;"

echo
echo "PUB:"
psql -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"

echo
echo "SUB:"
psql -p  -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -p  -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost
dbname=john application_name=sub1 port=5432' PUBLICATION pub1;"


echo
echo "PUB:"
psql -c "select nextval('s1');"
psql -c "select nextval('s1');"
psql -c "select * from s1;"

sleep 1

echo
echo "SUB:"
psql -p  -c "select * from s1;"

psql -p  -c "drop subscription sub1 ;"

psql -p  -c "select nextval('s1');"
psql -p  -c "select * from s1;"


...with the last two queries returning

 nextval 
---------
      67
(1 row)

 last_value | log_cnt | is_called 
------------+---------+-----------
         67 |      32 | t

So, I interpret that the decrement by 32 got logged here.

Also, running

CREATE PUBLICATION pub2 FOR ALL SEQUENCES WITH (publish = 'insert, update,
delete, truncate, sequence');

...reports success, but do non-default values of "publish = ..." have an
effect (or should they), or are these just ignored? It seems like these
cases shouldn't be treated orthogonally.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: ICU 54 and earlier are too dangerous

2023-03-14 Thread Peter Eisentraut

On 14.03.23 01:26, Tom Lane wrote:

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

... that seems like an overreaction.  We know from the buildfarm
that there's still a lot of old ICU out there.  Is it really improving
anybody's life to try to forbid them from using such a version?


If I'm getting the dates right, the 10-year support of RHEL 7 will 
expire in June 2024.  So if we follow past practices, we could drop 
support for RHEL 7 in PG17.  This would allow us to drop support for old 
libicu, and also old openssl, zlib, maybe more.


So if we don't feel like we need to do an emergency change here, there 
is a path to do this in a principled way in the near future.






Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Önder Kalacı
Hi Amit, all


On Tue, Mar 14, 2023 at 9:50 AM Amit Kapila  wrote:

> On Mon, Mar 13, 2023 at 7:46 PM Önder Kalacı 
> wrote:
> >
> > Attaching v47.
> >
>
> I have made the following changes in the attached patch (a) removed
> the function IsIdxSafeToSkipDuplicates() and used the check directly
> in the caller


Should be fine, we can re-introduce this function when I work on the
non-pkey/RI unique index improvement as a follow up to this.


> ; (b) changed a few comments in the patch;


Thanks, looks good.


> (c) the test
> file was inconsistently using ';' while executing statement with
> safe_psql, changed it to remove ';'.
>
>
Alright, thanks.

And as a self-review, when I write regression tests next time, I'll spend a
lot more time on the style/consistency/comments etc. During this review,
the reviewers had to spend many cycles on that area, which is something
I should have done better.

Attaching v49 with some minor changes Shi Yu noted earlier.

Thanks,
Onder KALACI


v49-0001-Allow-the-use-of-indexes-other-than-PK-and-REPLI.patch
Description: Binary data


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Önder Kalacı
Hi Shi Yu,


> in RemoteRelContainsLeftMostColumnOnIdx():
>
> +   if (indexInfo->ii_NumIndexAttrs < 1)
> +   return false;
>
> Did you see any cases that the condition is true? I think there is at
> least one
> column in the index. If so, we can use an Assert().
>

Actually, it was mostly to guard against any edge cases. I thought that,
similar to tables, we could have zero-column indexes, but it turns out that
is not possible. Also, index_create seems to check that, so changing it to
an Assert makes sense:

>
>  /*
> * check parameters
> */
> if (indexInfo->ii_NumIndexAttrs < 1)
> elog(ERROR, "must index at least one column");




>
> +   if (attrmap->maplen <= AttrNumberGetAttrOffset(keycol))
> +   return false;
>
> Similarly, I think `attrmap->maplen` is the number of columns and it is
> always
> greater than keycol. If you agree, we can check it with an Assert().


At this point, I'm really hesitant to make any assumptions. Logical
replication is pretty flexible, and who knows, maybe dropped columns will be
treated differently at some point and this assumption will change?

I really feel more comfortable keeping this as-is. We call this function
very infrequently anyway.


> Besides, it
> seems we don't need AttrNumberGetAttrOffset().
>
>
Why not? We are accessing the AttrNumberGetAttrOffset(keycol) element
of the attnums array.
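
For readers following along, the function under discussion boils down to
roughly the following shape, with the first guard turned into an Assert as
agreed above (a sketch reconstructed from this thread; the attached patch is
authoritative):

static bool
RemoteRelContainsLeftMostColumnOnIdx(IndexInfo *indexInfo, AttrMap *attrmap)
{
	AttrNumber	keycol;

	Assert(indexInfo->ii_NumIndexAttrs >= 1);

	keycol = indexInfo->ii_IndexAttrNumbers[0];
	if (!AttributeNumberIsValid(keycol))
		return false;			/* leftmost column is an expression */

	/* The offset is checked at run time, per the discussion above. */
	if (attrmap->maplen <= AttrNumberGetAttrOffset(keycol))
		return false;

	return attrmap->attnums[AttrNumberGetAttrOffset(keycol)] >= 0;
}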


> 2.
> +# make sure that the subscriber has the correct data after the update
> UPDATE
>
> "update UPDATE" seems to be a typo.
>
>
thanks, fixed


> 3.
> +# now, drop the index with the expression, and re-create index on column
> lastname
>
> The comment says "re-create index on column lastname" but it seems we
> didn't do
> that, should it be modified to something like:
> # now, drop the index with the expression, we will use sequential scan
>
>
>
Thanks, fixed

I'll add the changes to v49 in the next e-mail.

Thanks,
Onder KALACI


Re: ICU locale validation / canonicalization

2023-03-14 Thread Peter Eisentraut

On 13.03.23 16:31, Jeff Davis wrote:

What we had discussed a while ago in one of these threads is that ICU
versions before 54 do not support keyword lists, and we have custom code
to do that parsing ourselves, but we don't have code to do the same for
language tags.  Therefore, if I understand this right, if we
automatically convert ICU locale IDs to language tags, as shown above,
then we break support for such locales in those older ICU versions.


Right. In versions 53 and earlier, and during pg_upgrade, we would just
preserve the locale string as entered.


Another issue that came to mind:  Right now, you can, say, develop SQL 
schemas on a newer ICU version, say, your laptop, and then deploy them 
on a server running an older ICU version.  If we have a cutoff beyond 
which we convert ICU locale IDs to language tags, then this won't work 
anymore for certain combinations.  And RHEL/CentOS 7 is still pretty 
popular.






Re: Add macros for ReorderBufferTXN toptxn

2023-03-14 Thread Peter Smith
Thanks for the review!

On Mon, Mar 13, 2023 at 6:19 PM vignesh C  wrote:
...
> Few comments:
> 1) Can we move the macros along with the other macros present in this
> file, just above this structure, similar to the macros added for
> txn_flags:
> /* Toplevel transaction for this subxact (NULL for top-level). */
> +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
>
> 2) The macro name can be changed to rbtxn_is_toptxn, rbtxn_is_subtxn
> and rbtxn_get_toptxn to keep it consistent with others:
> /* Toplevel transaction for this subxact (NULL for top-level). */
> +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
>
> 3) We could add separate comments for each of the macros:
> /* Toplevel transaction for this subxact (NULL for top-level). */
> +#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> +#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> +#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
>

All the above are fixed as suggested.

> 4) We check if txn->toptxn is not null twice here both in if condition
> and in the assignment, we could retain the assignment operation as
> earlier to remove the 2nd check:
> -   if (txn->toptxn)
> -   txn = txn->toptxn;
> +   if (isa_subtxn(txn))
> +   txn = get_toptxn(txn);
>
> We could avoid one check again by:
> +   if (isa_subtxn(txn))
> +   txn = txn->toptxn;
>

Yeah, that is true, but I chose not to keep the original assignment in
this case mainly because then it is consistent with the other changed
code ---  e.g. Every other direct member assignment/access of the
'toptxn' member now hides behind the macros so I did not want this
single place to be the odd one out. TBH, I don't think 1 extra check
is of any significance, but it is not a problem to change like you
suggested if other people also want it done that way.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.


v2-0001-Add-macros-for-ReorderBufferTXN-toptxn.patch
Description: Binary data


Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-03-14 Thread Peter Eisentraut

On 13.03.23 14:19, Daniel Gustafsson wrote:

On 2 Mar 2023, at 15:44, Tom Lane  wrote:

Peter Eisentraut  writes:

I think an error message like
 "unexpected null value in system cache %d column %d"
is sufficient.  Since these are "can't happen" errors, we don't need to
spend too much extra effort to make it prettier.


I'd at least like to see it give the catalog's OID.  That's easily
convertible to a name, and it doesn't tend to move around across PG
versions, neither of which are true for syscache IDs.

Also, I'm fairly unconvinced that it's a "can't happen" --- this
would be very likely to fire as a result of catalog corruption,
so it would be good if it's at least minimally interpretable
by a non-expert.  Given that we'll now have just one copy of the
code, ISTM there's a good case for doing the small extra work
to report catalog and column by name.


Rebased v3 on top of recent conflicting ICU changes causing the patch to not
apply anymore.  Also took another look around the tree to see if there were
missed callsites but found none new.


I think the only open question here was the granularity of the error 
message, which I think you had addressed in v2.


+	if (isnull)
+	{
+		elog(ERROR,
+			 "unexpected NULL value in cached tuple for pg_catalog.%s.%s",
+			 get_rel_name(cacheinfo[cacheId].reloid),
+			 NameStr(TupleDescAttr(SysCache[cacheId]->cc_tupdesc, attributeNumber - 1)->attname));
+	}

I prefer to use "null value" for SQL null values, and NULL for the C symbol.

I'm a bit hesitant about hardcoding pg_catalog here.  That happens to be 
true, of course, but isn't actually enforced, I think.  I think that 
could be left off.  It's not like people will be confused about which 
schema "pg_class.relname" is in.


Also, the cached tuple isn't really for the attribute, so maybe split 
that up a bit, like


"unexpected null value in cached tuple for catalog %s column %s"

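
Putting these suggestions together, the wrapper under discussion would end
up looking roughly like this (a sketch that would live in syscache.c, where
cacheinfo[] and SysCache[] are visible; the function name and details may
differ in the final patch):

Datum
SysCacheGetAttrNotNull(int cacheId, HeapTuple tup, AttrNumber attributeNumber)
{
	bool		isnull;
	Datum		attr;

	attr = SysCacheGetAttr(cacheId, tup, attributeNumber, &isnull);

	if (isnull)
		elog(ERROR,
			 "unexpected null value in cached tuple for catalog %s column %s",
			 get_rel_name(cacheinfo[cacheId].reloid),
			 NameStr(TupleDescAttr(SysCache[cacheId]->cc_tupdesc,
								   attributeNumber - 1)->attname));

	return attr;
}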



Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread Amit Kapila
On Mon, Mar 13, 2023 at 7:46 PM Önder Kalacı  wrote:
>
> Attaching v47.
>

I have made the following changes in the attached patch (a) removed
the function IsIdxSafeToSkipDuplicates() and used the check directly
in the caller; (b) changed a few comments in the patch; (c) the test
file was inconsistently using ';' while executing statement with
safe_psql, changed it to remove ';'.

-- 
With Regards,
Amit Kapila.


v48-0001-Allow-the-use-of-indexes-other-than-PK-and-REPLI.patch
Description: Binary data


Re: Testing autovacuum wraparound (including failsafe)

2023-03-14 Thread Masahiko Sawada
On Wed, Mar 8, 2023 at 1:52 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas  wrote:
> >
> > On 16/11/2022 06:38, Ian Lawrence Barwick wrote:
> > > Thanks for the patch. While reviewing the patch backlog, we have 
> > > determined that
> > > the latest version of this patch was submitted before meson support was
> > > implemented, so it should have a "meson.build" file added for 
> > > consideration for
> > > inclusion in PostgreSQL 16.
> >
> > I wanted to do some XID wraparound testing again, to test the 64-bit
> > SLRUs patches [1], and revived this.
>
> Thank you for reviving this thread!
>
> >
> > I took a different approach to consuming the XIDs. Instead of setting
> > nextXID directly, bypassing GetNewTransactionId(), this patch introduces
> > a helper function to call GetNewTransactionId() repeatedly. But because
> > that's slow, it does include a shortcut to skip over "uninteresting"
> > XIDs. Whenever nextXid is close to an SLRU page boundary or XID
> > wraparound, it calls GetNewTransactionId(), and otherwise it bumps up
> > nextXid close to the next "interesting" value. That's still a lot slower
> > than just setting nextXid, but exercises the code more realistically.
> >
> > I've written some variant of this helper function many times over the
> > years, for ad hoc testing. I'd love to have it permanently in the git tree.
>
> These functions seem to be better than mine.
>
> > In addition to Masahiko's test for emergency vacuum, this includes two
> > other tests. 002_limits.pl tests the "warn limit" and "stop limit" in
> > GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion
> > transactions in total, exercising XID wraparound in general.
> > Unfortunately these tests are pretty slow; the tests run for about 4
> > minutes on my laptop in total, and use about 20 GB of disk space. So
> > perhaps these need to be put in a special test suite that's not run as
> > part of "check-world". Or perhaps leave out the 003_wraparounds.pl test,
> > that's the slowest of the tests. But I'd love to have these in the git
> > tree in some form.
>
> cfbot reports some failures. The main reason seems to be that meson.build in
> the xid_wraparound directory adds the regression tests but the .sql and
> .out files are missing from the patch. Perhaps the patch wants to add
> only TAP tests, as the Makefile doesn't define REGRESS?
>
> Even after fixing this issue, CI tests (Cirrus CI) are not happy and
> report failures due to a disk full. The size of xid_wraparound test
> directory is 105MB out of 262MB:
>
> % du -sh testrun
> 262Mtestrun
> % du -sh testrun/xid_wraparound/
> 105Mtestrun/xid_wraparound/
> % du -sh testrun/xid_wraparound/*
> 460Ktestrun/xid_wraparound/001_emergency_vacuum
> 93M testrun/xid_wraparound/002_limits
> 12M testrun/xid_wraparound/003_wraparounds
> % ls -lh testrun/xid_wraparound/002_limits/log*
> total 93M
> -rw---. 1 masahiko masahiko 93M Mar  7 17:34 002_limits_wraparound.log
> -rw-rw-r--. 1 masahiko masahiko 20K Mar  7 17:34 regress_log_002_limits
>
> The biggest file is the server log, since an autovacuum worker writes
> autovacuum log messages for every table every second (autovacuum_naptime
> is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the
> test tables instead of enabling it globally.

I think it could be acceptable since the 002 and 003 tests are executed
only when required. And the 001 test seems to be able to pass on cfbot, but
it takes more than 30 seconds. In the attached patch, I made these tests
optional; they are enabled only if the environment variable
ENABLE_XID_WRAPAROUND_TESTS is defined (supported only with autoconf).

>
> The 001 test uses the 2PC transaction that holds locks on tables but
> since we can consume xids while the server running, we don't need
> that. Instead I think we can keep a transaction open in the background
> like 002 test does.

Updated in the new patch. Also, I added a check that the failsafe mode
is triggered.
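
As a side note for anyone skimming the thread, the "skip over uninteresting
XIDs" shortcut quoted above essentially needs a predicate like the one
below; the threshold of 64 and the helper name are made up for illustration,
and the attached patch implements the real thing:

#include "postgres.h"
#include "access/transam.h"

/* Each transaction status takes two bits, so a clog page covers BLCKSZ * 4 XIDs. */
#define XIDS_PER_CLOG_PAGE	((uint32) (BLCKSZ * 4))

static bool
xid_is_interesting(TransactionId xid)
{
	uint32		off = xid % XIDS_PER_CLOG_PAGE;

	/* Near an SLRU (commit log) page boundary? */
	if (off < 64 || off > XIDS_PER_CLOG_PAGE - 64)
		return true;

	/* Near the special XIDs at the low end, or near 32-bit wraparound? */
	if (xid < FirstNormalTransactionId + 64 || xid > MaxTransactionId - 64)
		return true;

	return false;
}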

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-Add-tests-for-XID-wraparound.patch
Description: Binary data


RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-14 Thread shiy.f...@fujitsu.com
On Mon, Mar 13, 2023 10:16 PM Önder Kalacı  wrote:
> 
> Attaching v47.
> 

Thanks for updating the patch. Here are some comments.

1.
in RemoteRelContainsLeftMostColumnOnIdx():

+   if (indexInfo->ii_NumIndexAttrs < 1)
+   return false;

Did you see any cases that the condition is true? I think there is at least one
column in the index. If so, we can use an Assert().

+   if (attrmap->maplen <= AttrNumberGetAttrOffset(keycol))
+   return false;

Similarly, I think `attrmap->maplen` is the number of columns and it is always
greater than keycol. If you agree, we can check it with an Assert(). Besides, it
seems we don't need AttrNumberGetAttrOffset().

2.
+# make sure that the subscriber has the correct data after the update UPDATE

"update UPDATE" seems to be a typo.

3.
+# now, drop the index with the expression, and re-create index on column 
lastname

The comment says "re-create index on column lastname" but it seems we didn't do
that, should it be modified to something like: 
# now, drop the index with the expression, we will use sequential scan

Besides these, the patch LGTM.

Regards,
Shi Yu