Re: Time to drop plpython2?

2022-02-16 Thread Michael Paquier
On Wed, Feb 16, 2022 at 11:14:46PM -0800, Andres Freund wrote:
> I've pinged the owners of the animals failing so far:
> - myna, butterflyfish

These two are managed by a colleague, and I have access to them.
They should get back to green quickly now, if I did not mess up.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-16 Thread Julien Rouhaud
On Thu, Feb 17, 2022 at 03:51:26PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 17 Feb 2022 14:58:38 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > (Sorry for the broken mail...)
> > > >
> > > Ok, if -1 is wrong, what should the return value be if
> > > somebody calls this function like:
> > > pg_encoding_max_length(63);
> > 
> > It should result in an assertion failure, I think.  If that fails, the
> > caller side is broken anyway.  On the other hand, we haven't had a
> > complaint about that, maybe.
> > 
> > > Of course, with the patch applied; because the original code
> > > has memory corruption if built and run without DEBUG.
> > 
> > So we don't assume corruption in production builds.  It should be
> > logically guaranteed.
> > 
> > I'll dig into that further.
> 
> The number comes from pg_char_to_encoding(), which is the internal
> function behind the SQL function PG_char_to_encoding(). It looks fine,
> but I confirmed that for all possible encoding names in pg_encname_tbl[].
> 
> =# select * from (
>   select e as name, PG_char_to_encoding(e) enc
>   from unnest(array['abc', 'alt', 'big5', 'euccn', 'eucjis2004', 'eucjp',
>     'euckr', 'euctw', 'gb18030', 'gbk', 'iso88591', 'iso885910', 'iso885913',
>     'iso885914', 'iso885915', 'iso885916', 'iso88592', 'iso88593', 'iso88594',
>     'iso88595', 'iso88596', 'iso88597', 'iso88598', 'iso88599', 'johab',
>     'koi8', 'koi8r', 'koi8u', 'latin1', 'latin10', 'latin2', 'latin3',
>     'latin4', 'latin5', 'latin6', 'latin7', 'latin8', 'latin9', 'mskanji',
>     'muleinternal', 'shiftjis', 'shiftjis2004', 'sjis', 'sqlascii', 'tcvn',
>     'tcvn5712', 'uhc', 'unicode', 'utf8', 'vscii', 'win', 'win1250', 'win1251',
>     'win1252', 'win1253', 'win1254', 'win1255', 'win1256', 'win1257',
>     'win1258', 'win866', 'win874', 'win932', 'win936', 'win949', 'win950',
>     'windows1250', 'windows1251', 'windows1252', 'windows1253', 'windows1254',
>     'windows1255', 'windows1256', 'windows1257', 'windows1258', 'windows866',
>     'windows874', 'windows932', 'windows936', 'windows949', 'windows950',
>     'hoge']) as e) as t where enc < 0 or enc > 41;
>  name | enc 
> ------+-----
>  hoge |  -1
> (1 row)
> 
> So the function doesn't return 63 for any of the registered names, nor
> for a wrong name.
> 
> So the other possibilities I can think of are:
> - Someone had broken pg_encname_tbl[]
> - A cosmic-ray hit, or a bad memory cell.
> - Coverity worked the wrong way.
> 
> Could you show the workload for the Coverity warning here?

The 63 upthread was hypothetical, right?  pg_encoding_max_length() shouldn't be
called with user-dependent data (unlike pg_encoding_max_length_sql()), so I
also don't see any value in spending cycles on this in release builds.  The
error should only happen with bogus code, which assert builds are there to
catch, or with corrupted memory, and in that case we can't make any promises.
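To make the contract under discussion concrete, here is a minimal standalone sketch (not PostgreSQL code; the table size and values are made up for illustration, though encoding 0 being single-byte SQL_ASCII and encoding 6 being UTF8 with up to 4 bytes match the real IDs): the bound is enforced only in assert-enabled builds, and release builds trust the caller.

```c
#include <assert.h>

/* Hypothetical stand-in for pg_wchar_table[]: max bytes per character
 * for each encoding ID.  The real table and the valid-ID range live in
 * the PostgreSQL sources; most entries are elided here. */
#define N_VALID_ENCODINGS 42
#define VALID_ENCODING(e) ((e) >= 0 && (e) < N_VALID_ENCODINGS)

static const int max_mb_len[N_VALID_ENCODINGS] = {
    [0] = 1,                    /* SQL_ASCII */
    [6] = 4,                    /* UTF8 */
    /* remaining entries default to 0 in this sketch */
};

static int
encoding_max_length(int encoding)
{
    /* Debug builds catch broken callers; release builds pay nothing. */
    assert(VALID_ENCODING(encoding));
    return max_mb_len[encoding];
}
```

With NDEBUG undefined, calling encoding_max_length(63) aborts immediately instead of reading past the table, which is the behavior being argued for.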




Re: Make mesage at end-of-recovery less scary.

2022-02-16 Thread Kyotaro Horiguchi
At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma  
wrote in 
> OK. The v13 patch looks good. I have marked it as ready to commit.
> Thank you for working on all my review comments.

Thanks!  But the recent xlog.c refactoring conflicts with this patch.
And I found a silly bug while rebasing.

xlog.c:12463 / xlogrecovery.c:3168
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
..
{
+   Assert(!StandbyMode);
...
+   xlogreader->EndOfWAL = true;

Yeah, I forgot about promotion there.  So what I should have done is
set EndOfWAL according to StandbyMode:

+   Assert(!StandbyMode || CheckForStandbyTrigger());
...
+   /* promotion exit is not end-of-WAL */
+   xlogreader->EndOfWAL = !StandbyMode;

The rebased v14 is attached.
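For readers skimming the attached patch, the core of the new end-of-WAL test can be sketched as a standalone toy (this is not the patch itself; the block size is shrunk and all header parsing is omitted):

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 32               /* toy page size; XLOG_BLCKSZ in reality */

/* A record with xl_tot_len == 0 is treated as end of WAL only when
 * every remaining byte of the page is zero too; otherwise it is
 * reported as an ordinary invalid record.  (Later pages of recycled
 * segments are deliberately not checked, as the patch notes.) */
static int
looks_like_end_of_wal(const char *page, size_t rec_off)
{
    const char *p = page + rec_off;
    const char *pe = page + BLCKSZ;

    while (p < pe && *p == 0)
        p++;
    return p == pe;
}

/* tiny self-checks */
static int demo_zero_tail(void)
{
    char page[BLCKSZ] = {0};
    return looks_like_end_of_wal(page, 4);
}

static int demo_garbage_tail(void)
{
    char page[BLCKSZ] = {0};
    page[20] = 1;               /* stray nonzero byte after the record */
    return looks_like_end_of_wal(page, 4);
}
```

A fully zeroed tail yields the friendly "reached end of WAL" path; any stray nonzero byte falls through to the usual invalid-record reporting.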

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5613ee80a4d2a9786f5ce8421dcbb560b63a13c1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v14] Make End-Of-Recovery error less scary

When recovery of any kind ends, we currently emit a somewhat scary
message like "invalid record length" that suggests something serious
has happened.  Actually, if recovery meets a record with length = 0,
that usually just means it has finished applying all available WAL
records.

Make this message less scary, as "reached end of WAL".  Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   |  78 
 src/backend/access/transam/xlogrecovery.c |  92 ++-
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 ++-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++
 6 files changed, 268 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+  "missing contrecord at %X/%X",
+  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	  XLogRecPtr PrevRecPtr, XLogRecord *record,
 	  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p;
+		char	   *pe;
+
+		/* scan from the beginning of the record to the end of block */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+		 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+  "empty page in log segment %s, offset %u",
+  fname,
+  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != 

Re: Time to drop plpython2?

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 22:52:08 -0800, Andres Freund wrote:
> On 2022-02-16 11:58:50 -0800, Andres Freund wrote:
> > Cool, will apply 1) later today.
> 
> Done. Curious how red the BF will turn out to be. Let's hope it's not
> too bad.

I've pinged the owners of the animals failing so far:
- snakefly, massasauga
- jay, trilobite, hippopotamus, avocet
- myna, butterflyfish
- rhinoceros

Greetings,

Andres Freund




Re: make tuplestore helper function

2022-02-16 Thread Michael Paquier
On Tue, Feb 15, 2022 at 11:57:56PM -0600, Justin Pryzby wrote:
> On Tue, Jan 11, 2022 at 10:19:33AM -0500, Melanie Plageman wrote:
>> I'll wait to do that if preferred by committer.
>> Are you imagining that patch 0001 would only add the check for
>> expectedDesc that is missing from pg_config() and text_to_table()?

It took me some time to follow up on this point, but we clearly don't
expect a NULL expectedDesc in the context of these two code paths, or
we would finish with a NULL pointer dereference and a crash in
CreateTupleDescCopy(), so the checks had better be added.  It also
makes sense to me to do that first and independently of the rest of
the patch: it is a separate issue.

> Yes, that's what I meant, but I haven't been able to determine if a NULL check
> should be added at all.

+   (errcode(ERRCODE_E_R_I_E_SRF_PROTOCOL_VIOLATED),
+errmsg("expected tuple format not specified as required for "
+   "set-returning function.")))
errmsgs should be in a single line, to ease the grepping of similar
patterns.

There is something else that I think could be cleaned up, while we are
working on this area.  Some of the callers of
MakeFuncResultTuplestore() (prepare.c, portalmem.c, one in genfile.c,
guc.c, misc.c) pass down a NULL tupdesc while building their TupleDesc
using CreateTemplateTupleDesc() and a set of TupleDescInitEntry() for
each attribute.  This data exists already in pg_proc.dat, duplicating
the effort.  Except if I am missing something, wouldn't it be better
to let get_call_result_type() do this work for us, passing a TupleDesc
pointer to the new MakeFuncResultTuplestore() rather than NULL?

Once you do that, there is something else that stands out: all the
callers of MakeFuncResultTuplestore() passing NULL for the tupledesc
*have to* set expectedDesc, as far as I can see from the remaining
callers.  This brings an interesting question: couldn't we move the
check on ReturnSetInfo->expectedDesc being not NULL into
MakeFuncResultTuplestore(), performing it only when tupdesc == NULL?
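To make that proposed shape concrete, here is a rough sketch with stubbed-out types (everything below is illustrative; the real helper would use PostgreSQL's TupleDesc, ReturnSetInfo, tuplestore_begin_heap() and ereport(), none of which are reproduced here):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the PostgreSQL types, for illustration only. */
typedef struct TupleDesc { int natts; } TupleDesc;
typedef struct Tuplestorestate { TupleDesc desc; } Tuplestorestate;
typedef struct ReturnSetInfo { TupleDesc *expectedDesc; } ReturnSetInfo;

/* Proposed contract: when the caller passes tupdesc == NULL, fall back
 * to rsinfo->expectedDesc and centralize the "expectedDesc must not be
 * NULL" check here, instead of repeating it in every caller. */
static Tuplestorestate *
make_func_result_tuplestore(ReturnSetInfo *rsinfo, TupleDesc *tupdesc)
{
    static Tuplestorestate store;       /* toy allocation */

    if (tupdesc == NULL)
    {
        if (rsinfo->expectedDesc == NULL)
            return NULL;                /* the real code would ereport() */
        tupdesc = rsinfo->expectedDesc;
    }
    store.desc = *tupdesc;
    return &store;
}

/* tiny self-checks */
static int demo_missing_desc(void)
{
    ReturnSetInfo rsinfo = {NULL};
    return make_func_result_tuplestore(&rsinfo, NULL) == NULL;
}

static int demo_fallback(void)
{
    TupleDesc expected = {3};
    ReturnSetInfo rsinfo = {&expected};
    Tuplestorestate *ts = make_func_result_tuplestore(&rsinfo, NULL);
    return ts != NULL && ts->desc.natts == 3;
}
```

Callers that resolve their descriptor from the catalogs would pass it explicitly; only the tupdesc==NULL callers depend on expectedDesc, matching the observation above.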

One could say that this changes the error ordering in code paths like
pg_config(), but in this case we actually don't need the "Check to
make sure we have a reasonable tuple descriptor" part at all, since
pg_proc.dat defines the tuple description this function uses, no?
And we could switch to a tuplestore in this case as well, reducing by
one the number of callers with tupdesc=NULL for the new routine.
This could be a separate cleanup, done first.

each_worker_jsonb() does not really need its check on expectedDesc
being NULL, does it?  HEAD grabs the tuple descriptor with
get_call_result_type(), the patch uses MakeFuncResultTuplestore(), so
we make nowhere use of the expected TupleDesc saved by the SRF
executor.

Asserting that we are in the correct memory context when calling
MakeFuncResultTuplestore() sounds rather sensible from here, as per the
magic done in the various JSON functions.  Still, it really feels
like we could do a more centralized consolidation of the whole thing.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-16 Thread Kyotaro Horiguchi
At Thu, 17 Feb 2022 15:51:26 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> - Cosmic ray hit, or ill memory cell.

63 (0x3f) cannot result from a one-bit flip of a value less than
42 (0x2a).  So the possibility of a cosmic ray would be quite low.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
> >> -  while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> >> +  while (!ShutdownRequestPending &&
> >> +         (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> >
> > Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
> > not do their jobs when ShutdownRequestPending is set, particularly without a
> > huge fat comment.
>
> The idea was to avoid delaying shutdown because we're waiting for the
> custodian to finish relatively nonessential tasks.  Another option might be
> to just exit immediately when the custodian receives a shutdown request.

I think we should just not do either of these and let the functions
finish. For the cases where shutdown really needs to be immediate
there's, uhm, immediate mode shutdowns.


> > Why does this not open us up to new xid wraparound issues? Before there
> > was a hard bound on how long these files could linger around. Now there's
> > not anymore.
>
> Sorry, I'm probably missing something obvious, but I'm not sure how this
> adds transaction ID wraparound risk.  These files are tied to LSNs, and
> AFAIK they won't impact slots' xmins.

They're accessed by xid. The LSN is just for cleanup. Accessing files
left over from a previous transaction with the same xid wouldn't be
good - we'd read wrong catalog state for decoding...

Andres




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-16 Thread Nitin Jadhav
> > The progress reporting mechanism of postgres uses the
> > 'st_progress_param' array of 'PgBackendStatus' structure to hold the
> > information related to the progress. There is a function
> > 'pgstat_progress_update_param()' which takes 'index' and 'val' as
> > arguments and updates the 'val' to corresponding 'index' in the
> > 'st_progress_param' array. This mechanism works fine when all the
> > progress information is of type integer as the data type of
> > 'st_progress_param' is of type integer. If the progress data is of a
> > different type than integer, then there is no easy way to do so.
>
> Progress parameters are int64, so all of the new 'checkpoint start
> location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
> time' (store as start time in stat_progress, timestamp fits in 64
> bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
> current stat_progress parameter. Some processing would be required at
> the view, but that's not impossible to overcome.

Thank you for sharing the information.  'Triggering backend PID' (int)
can be stored without any problem.  'Checkpoint or restartpoint?'
(boolean) can be stored as an integer value, like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT (0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT (1).  'Elapsed time' (stored as
the start time in stat_progress; a timestamp fits in 64 bits): as
Timestamptz is internally an int64, we can store the timestamp value
in the progress parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes an int64 (not
Timestamptz) as argument and returns a string representing the
elapsed time.  This function can be called in the view.  Is it
safe/advisable to use the int64 type here rather than Timestamptz for
this purpose?  'Checkpoint start location' (lsn = uint64): I feel we
cannot use progress parameters for this case, as assigning a uint64 to
an int64 would be an issue for larger values and could lead to hidden
bugs.

Thoughts?
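One way the uint64 concern could be sidestepped, sketched here as an assumption rather than a recommendation: if both the writer and the view-side accessor agree to reinterpret the bits rather than treat the value arithmetically, a round-trip of an LSN through an int64 slot is lossless, and memcpy() keeps the reinterpretation well-defined in C even for LSNs above INT64_MAX.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;    /* an LSN is a 64-bit position, as in PostgreSQL */

/* Store an LSN into an int64 progress slot by bit reinterpretation. */
static int64_t
lsn_to_progress_param(XLogRecPtr lsn)
{
    int64_t val;
    memcpy(&val, &lsn, sizeof(val));
    return val;
}

/* Recover the LSN on the view side with the inverse reinterpretation. */
static XLogRecPtr
progress_param_to_lsn(int64_t val)
{
    XLogRecPtr lsn;
    memcpy(&lsn, &val, sizeof(lsn));
    return lsn;
}
```

Whether exposing such a reinterpreted value through the view is acceptable is exactly the question posed above; the sketch only shows that no information is lost in transit.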

Thanks & Regards,
Nitin Jadhav


On Thu, Feb 17, 2022 at 1:33 AM Matthias van de Meent
 wrote:
>
> On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
>  wrote:
> >
> > > > We need at least a trace of the number of buffers to sync (num_to_scan) 
> > > > before the checkpoint start, instead of just emitting the stats at the 
> > > > end.
> > > >
> > > > Bharat, it would be good to show the buffers synced counter and the 
> > > > total buffers to sync, checkpointer pid, substep it is running, whether 
> > > > it is on target for completion, checkpoint_Reason
> > > > (manual/times/forced). BufferSync has several variables tracking the 
> > > > sync progress locally, and we may need some refactoring here.
> > >
> > > I agree to provide above mentioned information as part of showing the
> > > progress of current checkpoint operation. I am currently looking into
> > > the code to know if any other information can be added.
> >
> > Here is the initial patch to show the progress of checkpoint through
> > pg_stat_progress_checkpoint view. Please find the attachment.
> >
> > The information added to this view are pid - process ID of a
> > CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> > for checkpoint (values can be wal, time or force), phase - indicates
> > the current phase of checkpoint operation, total_buffer_writes - total
> > number of buffers to be written, buffers_processed - number of buffers
> > processed, buffers_written - number of buffers written,
> > total_file_syncs - total number of files to be synced, files_synced -
> > number of files synced.
> >
> > There are many operations that happen as part of a checkpoint. For
> > each of these operations I am updating the phase field of the
> > pg_stat_progress_checkpoint view. The values supported for this field
> > are initializing, checkpointing replication slots, checkpointing
> > snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> > pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> > checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> > buffers, performing sync requests, performing two phase checkpoint,
> > recycling old XLOG files and finalizing. In the checkpointing buffers
> > phase, the fields total_buffer_writes, buffers_processed and
> > buffers_written show the detailed progress of writing buffers. In the
> > performing sync requests phase, the fields total_file_syncs and
> > files_synced show the detailed progress of syncing files. In other
> > phases, only the phase field is updated, and it is difficult to show
> > progress because we cannot get the total file count without traversing
> > the directory. It is not worth calculating that, as it would affect
> > the performance of the checkpoint. I also considered just reporting
> > the number of files processed, but this won't give meaningful progress
> > information (it can be treated as statistics). Hence I am just
> > updating the phase field in those phases.

Re: Time to drop plpython2?

2022-02-16 Thread Andres Freund
On 2022-02-16 11:58:50 -0800, Andres Freund wrote:
> Cool, will apply 1) later today.

Done. Curious how red the BF will turn out to be. Let's hope it's not
too bad.




Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-16 Thread Kyotaro Horiguchi
At Thu, 17 Feb 2022 14:58:38 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> (Sorry for the broken mail...)
> 
> At Wed, 16 Feb 2022 09:29:20 -0300, Ranier Vilela  wrote 
> in 
> > > The patch:
> > >  pg_encoding_max_length(int encoding)
> > >  {
> > > -   Assert(PG_VALID_ENCODING(encoding));
> > > -
> > > -   return pg_wchar_table[encoding].maxmblen;
> > > +   if (PG_VALID_ENCODING(encoding))
> > > +   return pg_wchar_table[encoding].maxmblen;
> > > +   else
> > > +   return -1;
> > >
> > > Returning -1 for invalid encoding is, I think, flat wrong.
> > >
> > Ok, if -1 is wrong, what should the return value be if
> > somebody calls this function like:
> > pg_encoding_max_length(63);
> 
> It should result in an assertion failure, I think.  If that fails, the
> caller side is broken anyway.  On the other hand, we haven't had a
> complaint about that, maybe.
> 
> > Of course, with the patch applied; because the original code
> > has memory corruption if built and run without DEBUG.
> 
> So we don't assume corruption in production builds.  It should be
> logically guaranteed.
> 
> I'll dig into that further.

The number comes from pg_char_to_encoding(), which is the internal
function behind the SQL function PG_char_to_encoding(). It looks fine,
but I confirmed that for all possible encoding names in pg_encname_tbl[].

=# select * from (
  select e as name, PG_char_to_encoding(e) enc
  from unnest(array['abc', 'alt', 'big5', 'euccn', 'eucjis2004', 'eucjp',
    'euckr', 'euctw', 'gb18030', 'gbk', 'iso88591', 'iso885910', 'iso885913',
    'iso885914', 'iso885915', 'iso885916', 'iso88592', 'iso88593', 'iso88594',
    'iso88595', 'iso88596', 'iso88597', 'iso88598', 'iso88599', 'johab',
    'koi8', 'koi8r', 'koi8u', 'latin1', 'latin10', 'latin2', 'latin3',
    'latin4', 'latin5', 'latin6', 'latin7', 'latin8', 'latin9', 'mskanji',
    'muleinternal', 'shiftjis', 'shiftjis2004', 'sjis', 'sqlascii', 'tcvn',
    'tcvn5712', 'uhc', 'unicode', 'utf8', 'vscii', 'win', 'win1250', 'win1251',
    'win1252', 'win1253', 'win1254', 'win1255', 'win1256', 'win1257',
    'win1258', 'win866', 'win874', 'win932', 'win936', 'win949', 'win950',
    'windows1250', 'windows1251', 'windows1252', 'windows1253', 'windows1254',
    'windows1255', 'windows1256', 'windows1257', 'windows1258', 'windows866',
    'windows874', 'windows932', 'windows936', 'windows949', 'windows950',
    'hoge']) as e) as t where enc < 0 or enc > 41;
 name | enc 
------+-----
 hoge |  -1
(1 row)

So the function doesn't return 63 for any of the registered names, nor
for a wrong name.

So the other possibilities I can think of are:

- Someone had broken pg_encname_tbl[]

- A cosmic-ray hit, or a bad memory cell.

- Coverity worked the wrong way.


Could you show the workload for the Coverity warning here?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Andrey Borodin



> On 15 Feb 2022, at 23:20, Robert Haas wrote:
> 
> Anyway, those are my thoughts. What are yours?

+1 for adding Zstd everywhere. Quite similar patches are already a part of 
"libpq compression" and "zstd for pg_dump" commitfest entries, so pushing this 
will simplify those entries IMV.

Also I like the idea of defaulting FPI compression to lz4 and adding Zstd as a
wal_compression option.
I don't think Zstd is an obvious choice for default FPI compression: there are
HW setups where disks are almost as fast as RAM. In such cases lz4 can improve
performance, while Zstd might make things slower.

Thanks!

Best regards, Andrey Borodin.



Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-16 Thread Kyotaro Horiguchi
(Sorry for the broken mail...)

At Wed, 16 Feb 2022 09:29:20 -0300, Ranier Vilela  wrote 
in 
> > > ]
> > > 633retval = pg_verify_mbstr_len(src_encoding, src_str, len,
> > false);
> > > 634
> > >
> > > Trivial patch attached.
> >
> > Mmm? If the assert doesn't work, there should be an inconsistency
> > between pg_enc and pg_wchar_table. But AFAICS they are consistent.
> >
> The consistency is between pg_encname_tbl and pg_enc, and AFAICS they
> are consistent.

..Yeah, right.

> > The patch:
> >  pg_encoding_max_length(int encoding)
> >  {
> > -   Assert(PG_VALID_ENCODING(encoding));
> > -
> > -   return pg_wchar_table[encoding].maxmblen;
> > +   if (PG_VALID_ENCODING(encoding))
> > +   return pg_wchar_table[encoding].maxmblen;
> > +   else
> > +   return -1;
> >
> > Returning -1 for invalid encoding is, I think, flat wrong.
> >
> Ok, if -1 is wrong, what should the return value be if
> somebody calls this function like:
> pg_encoding_max_length(63);

It should result in an assertion failure, I think.  If that fails, the
caller side is broken anyway.  On the other hand, we haven't had a
complaint about that, maybe.

> Of course, with the patch applied; because the original code
> has memory corruption if built and run without DEBUG.

So we don't assume corruption in production builds.  It should be
logically guaranteed.

I'll dig into that further.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)

2022-02-16 Thread Kyotaro Horiguchi
At Wed, 16 Feb 2022 09:29:20 -0300, Ranier Vilela  wrote 
in 
> > > ]
> > > 633retval = pg_verify_mbstr_len(src_encoding, src_str, len,
> > false);
> > > 634
> > >
> > > Trivial patch attached.
> >
> > Mmm? If the assert doesn't work, there should be an inconsistency
> > between pg_enc and pg_wchar_table. But AFAICS they are consistent.
> >
> The consistency is between pg_encname_tbl and pg_enc, and AFAICS they
> are consistent.

..Yeah, right.

> > The patch:
> >  pg_encoding_max_length(int encoding)
> >  {
> > -   Assert(PG_VALID_ENCODING(encoding));
> > -
> > -   return pg_wchar_table[encoding].maxmblen;
> > +   if (PG_VALID_ENCODING(encoding))
> > +   return pg_wchar_table[encoding].maxmblen;
> > +   else
> > +   return -1;
> >
> > Returning -1 for invalid encoding is, I think, flat wrong.
> >
> Ok, if -1 is wrong, what should the return value be if
> somebody calls this function like:
> pg_encoding_max_length(63);

It should result in an assertion failure, I think.  If that fails, the
caller side is broken anyway.  On the other hand, we haven't had a
complaint about that, maybe.  I'll dig into that further.

> Of course, with the patch applied; because the original code
> has memory corruption if built and run without DEBUG.

So we don't assume corruption in production builds.  It should be
logically guaranteed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-02-16 Thread Kyotaro Horiguchi
At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov  
wrote in 
> Hello, all.
> 
> I thought about patch simplification, and tested version
> without BufTable and dynahash api change at all.
> 
> It performs suprisingly well. It is just a bit worse
> than v1 since there is more contention around dynahash's
> freelist, but most of improvement remains.
> 
> I'll finish benchmarking and will attach graphs with
> next message. Patch is attached here.

Thanks for the new patch.  The patch as a whole looks fine to me, but
some comments need to be revised.

(existing comments)
> * To change the association of a valid buffer, we'll need to have
> * exclusive lock on both the old and new mapping partitions.
...
> * Somebody could have pinned or re-dirtied the buffer while we were
> * doing the I/O and making the new hashtable entry.  If so, we can't
> * recycle this buffer; we must undo everything we've done and start
> * over with a new victim buffer.

We no longer take a lock on the new partition, and we have no new hash
entry (unless others have already created one) at this point.


+* Clear out the buffer's tag and flags.  We must do this to ensure that
+* linear scans of the buffer array don't think the buffer is valid. We

The reason we can clear out the tag is that it is safe to use the victim
buffer at this point.  This comment needs to mention that reason.

+*
+* Since we are the single pinner, there should be no PIN_COUNT_WAITER or
+* IO_IN_PROGRESS (flags that were not cleared in previous code).
+*/
+   Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);

This seems to be a test for potential bugs in other functions.  As
the comment says, we are sure that no other processes are pinning
the buffer, and the existing code doesn't seem to care about that
condition.  Is it really needed?


+   /*
+* Try to make a hashtable entry for the buffer under its new tag. This
+* could fail because while we were writing someone else allocated another

The most significant point of this patch is the reason that the victim
buffer is protected from stealing until it is set up for new tag. I
think we need an explanation about the protection here.


+* buffer for the same block we want to read in. Note that we have not yet
+* removed the hashtable entry for the old tag.

Since we have removed the hash table entry for the old tag at this
point, the comment is now wrong.


+* the first place.  First, give up the buffer we were planning to use
+* and put it on the free lists.
..
+   StrategyFreeBuffer(buf);

This is one downside of this patch, but it seems to me that the odds
are low that many buffers would be freed in a short time by this logic.
By the way, it would be better if the sentence starting with "First"
had a separate comment section.


(existing comment)
|* Okay, it's finally safe to rename the buffer.

We don't "rename" the buffer here.  And the safety is already
establishsed at the end of the oldPartitionLock section. So it would
be just something like "Now allocate the victim buffer for the new
tag"?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Per-table storage parameters for TableAM/IndexAM extensions

2022-02-16 Thread Dilip Kumar
On Sat, Feb 12, 2022 at 2:35 AM Robert Haas  wrote:

>
> Imagine that I am using the "foo" tableam with "compression=lots" and
> I want to switch to the "bar" AM which does not support that option.
> If I remove the "compression=lots" option using a separate command,
> the "foo" table AM may rewrite my whole table and decompress
> everything. Then when I convert to the "bar" AM it's going to have to
> be rewritten again. That's painful. I clearly need some way to switch
> AMs without having to rewrite the table twice.
>

I agree with you: if we force users to drop the option with a separate
command, then we will have to rewrite the table twice.


> It's also interesting to consider the other direction. If I am
> switching from "bar" to "foo" I would really like to be able to add
> the "compression=lots" option at the same time I make the switch.
> There needs to be some syntax for that.
>
> One way to solve the first of these problems is to silently drop
> unsupported options. Maybe a better way is to have syntax that allows
> you to specify options to be added and removed at the time you switch
> AMs e.g.:
>

+1


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Michael Paquier
On Wed, Feb 16, 2022 at 10:40:01AM -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 9:33 PM Michael Paquier  wrote:
>> Yes, the patch misses the fact that each ./configure switch is
>> documented.  For consistency, I think that you should also add that in
>> the MSVC scripts in the first version.  It is really straight-forward
>> to do so, and it should be just a matter of grepping for the code
>> paths of lz4, then adjust things for zstd.
> 
> Makes sense. Here's an attempt by me to do that.

Thanks.  This looks pretty much right, except for two things that I
have taken the liberty to fix in the attached v3.

%define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
this version fails the sanity check between pg_config.h.in and the
MSVC scripts checking that all flags exist.

@@ -351,6 +351,7 @@ XGETTEXT = @XGETTEXT@
 GZIP   = gzip
 BZIP2  = bzip2
 LZ4= @LZ4@
+ZSTD   = @ZSTD@
A similar refresh is needed in vcregress.pl.

+   $proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
The upstream code is also using this library name, so that should be
fine.
--
Michael
From 54485782e931c9536de2fdf7dbdfd0bb5c0c632d Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Wed, 16 Feb 2022 10:36:36 -0500
Subject: [PATCH v3] Add support for building with ZSTD.

This commit doesn't actually add anything that uses ZSTD; that will be
done separately. It just puts the basic infrastructure into place.

Jeevan Ladhe and Robert Haas
---
 src/include/pg_config.h.in|   9 +
 doc/src/sgml/install-windows.sgml |   9 +
 doc/src/sgml/installation.sgml|   9 +
 configure | 271 ++
 configure.ac  |  33 
 src/Makefile.global.in|   1 +
 src/tools/msvc/Solution.pm|  15 ++
 src/tools/msvc/config_default.pl  |   1 +
 src/tools/msvc/vcregress.pl   |   1 +
 9 files changed, 349 insertions(+)

diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 28a1f0e9f0..1912cf35de 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -352,6 +352,9 @@
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
+/* Define to 1 if you have the `zstd' library (-lzstd). */
+#undef HAVE_LIBZSTD
+
 /* Define to 1 if you have the `link' function. */
 #undef HAVE_LINK
 
@@ -718,6 +721,9 @@
 /* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
 #undef HAVE_X86_64_POPCNTQ
 
+/* Define to 1 if you have the  header file. */
+#undef HAVE_ZSTD_H
+
 /* Define to 1 if the system has the type `_Bool'. */
 #undef HAVE__BOOL
 
@@ -949,6 +955,9 @@
 /* Define to select Win32-style shared memory. */
 #undef USE_WIN32_SHARED_MEMORY
 
+/* Define to 1 to build with ZSTD support. (--with-zstd) */
+#undef USE_ZSTD
+
 /* Define to 1 if `wcstombs_l' requires . */
 #undef WCSTOMBS_L_IN_XLOCALE
 
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index 30dd0c7f75..d2f63db3f2 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -307,6 +307,15 @@ $ENV{MSBFLAGS}="/m";
  
 
 
+    <varlistentry>
+     <term><productname>ZSTD</productname></term>
+     <listitem><para>
+      Required for supporting <literal>ZSTD</literal> compression
+      method. Binaries and source can be downloaded from
+      <ulink url="https://github.com/facebook/zstd/releases"></ulink>.
+     </para></listitem>
+    </varlistentry>
+
 
  OpenSSL
  
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 655095f3b1..c6190f6955 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -989,6 +989,15 @@ build-postgresql:

   
 
+      <varlistentry>
+       <term><option>--with-zstd</option></term>
+       <listitem>
+        <para>
+         Build with <productname>ZSTD</productname> compression support.
+        </para>
+       </listitem>
+      </varlistentry>
+
   
--with-ssl=LIBRARY

diff --git a/configure b/configure
index 930658..f07f689f1a 100755
--- a/configure
+++ b/configure
@@ -650,6 +650,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
+ZSTD
 LZ4
 UUID_LIBS
 LDAP_LIBS_BE
@@ -700,6 +701,9 @@ with_gnu_ld
 LD
 LDFLAGS_SL
 LDFLAGS_EX
+ZSTD_LIBS
+ZSTD_CFLAGS
+with_zstd
 LZ4_LIBS
 LZ4_CFLAGS
 with_lz4
@@ -869,6 +873,7 @@ with_libxslt
 with_system_tzdata
 with_zlib
 with_lz4
+with_zstd
 with_gnu_ld
 with_ssl
 with_openssl
@@ -898,6 +903,8 @@ XML2_CFLAGS
 XML2_LIBS
 LZ4_CFLAGS
 LZ4_LIBS
+ZSTD_CFLAGS
+ZSTD_LIBS
 LDFLAGS_EX
 LDFLAGS_SL
 PERL
@@ -1577,6 +1584,7 @@ Optional Packages:
   use system time zone data in DIR
   --without-zlib  do not use Zlib
   --with-lz4  build with LZ4 support
+  --with-zstd build with ZSTD support
   --with-gnu-ld   assume the C compiler uses GNU ld [default=no]
   --with-ssl=LIB  use LIB for SSL/TLS support (openssl)
   --with-openssl  obsolete spelling of --with-ssl=openssl
@@ -1606,6 +1614,8 @@ Some influential environment variables:
   XML2_LIBS   linker flags for XML2, overriding pkg-config
   LZ4_CFLAGS  C compiler flags for LZ4, overriding 

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 22:58:19 -0500, Tom Lane wrote:
> I wonder though if an exiting walsender would always take that path.

You're right - the CFI()s in PostgresMain() and several replication commands
are not at all guaranteed to be in a transaction when throwing a FATAL.

I don't quite see where we could be throwing an error with a relevant lwlock
held yet though.


> Maybe it'd be good if we released all LWLocks even if we don't think we're
> in a transaction?

Yea, probably a good idea. I think there's a few hooks that could
conceivably be in trouble otherwise.


> More generally, it wasn't obvious to me why you thought that
> BaseInit was a good spot to place the on_shmem_exit call.

> It obviously can't be *before* that because of the pgstats
> dependency, but maybe it should be later?

Yea, I agree, it probably could be done later. It's not super obvious when
though. If e.g. somebody wrote a non-database connected bgworker to
synchronize slot state with another node (for failover; there are even patches
for it), it'd be just after BaseInit(). Additionally, it was previously in
ProcKill(), so moving it to something registered early-ish seemed sensible.

Greetings,

Andres Freund




RE: [Proposal] Add foreign-server health checks infrastructure

2022-02-16 Thread kuroda.hay...@fujitsu.com
> I understood here as removing following mechanism from core:
> 
> * disable timeout at end of tx.

While reading it again, I noticed that this part might be wrong.
Sorry for the inconvenience.
But anyway, some code should be (re)moved from core, right?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-16 Thread Nathan Bossart
Hi Andres,

I appreciate the feedback.

On Wed, Feb 16, 2022 at 05:50:52PM -0800, Andres Freund wrote:
>> +/* Since not using PG_TRY, must reset error stack by hand */
>> +if (sigsetjmp(local_sigjmp_buf, 1) != 0)
>> +{
> 
> I also think it's a bad idea to introduce even more copies of the error
> handling body.  I think we need to unify this. And yes, it's unfair to stick
> you with it, but it's been a while since a new aux process has been added.

+1, I think this is useful refactoring.  I might spin this off to its own
thread.

>> +/*
>> + * These operations are really just a minimal subset of
>> + * AbortTransaction().  We don't have very many resources to 
>> worry
>> + * about.
>> + */
> 
> Given what you're proposing this for, are you actually confident that we don't
> need more than this?

I will give this a closer look.

>> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
>> +bool unlink_all);
> 
> I don't like functions with multiple consecutive booleans, they tend to get
> swapped around. Why not just split unlink_all=true/false into different
> functions?

Will do.

>> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
>>
>> First, pgsql_tmp directories will be renamed to stage them for
>> removal.
> 
> What if the target name already exists?

The integer at the end of the target name is incremented until we find a
unique name.

>> Note that temporary relation files cannot be cleaned up via the
>> aforementioned strategy and will not be offloaded to the custodian.
> 
> This should be in the prior commit message, otherwise people will ask the same
> question as I did.

Will do.

>> +/*
>> + * Find a name for the stage directory.  We just increment an integer 
>> at the
>> + * end of the name until we find one that doesn't exist.
>> + */
>> +for (int n = 0; n <= INT_MAX; n++)
>> +{
>> +snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
>> + PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
> 
> Uninterruptible loops up to INT_MAX do not seem like a good idea.

I modeled this after ChooseRelationName() in indexcmds.c.  Looking again, I
see that it loops forever until a unique name is found.  I suspect this is
unlikely to be a problem in practice.  What strategy would you recommend
for choosing a unique name?  Should we just append a couple of random
characters?

>> +dir = AllocateDir(stage_path);
>> +if (dir == NULL)
>> +{
> 
> Why not just use stat()? That's cheaper, and there's no
> time-to-check-time-to-use issue here, we're the only one writing.

I'm not sure why I didn't use stat().  I will update this.

>> -while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> +while (!ShutdownRequestPending &&
>> +   (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != 
>> NULL)
> 
> Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
> not do their jobs when ShutdownRequestPending is set, particularly without a
> huge fat comment.

The idea was to avoid delaying shutdown because we're waiting for the
custodian to finish relatively nonessential tasks.  Another option might be
to just exit immediately when the custodian receives a shutdown request.

>> +/*
>> + * If we just staged some pgsql_tmp directories for removal, wake up the
>> + * custodian process so that it deletes all the files in the staged
>> + * directories as well as the directories themselves.
>> + */
>> +if (stage && ProcGlobal->custodianLatch)
>> +SetLatch(ProcGlobal->custodianLatch);
> 
> Just signalling without letting the custodian know what it's expected to do
> strikes me as a bad idea.

Good point.  I will work on that.

>> From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
>> From: Nathan Bossart 
>> Date: Sun, 5 Dec 2021 22:02:40 -0800
>> Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to 
>> custodian.
>>
>> This was only done during checkpoints because it was a convenient
>> place to put it.  However, if there are many snapshots to remove,
>> it can significantly extend checkpoint time.  To avoid this, move
>> this work to the newly-introduced custodian process.
>> ---
>>  src/backend/access/transam/xlog.c   |  2 --
>>  src/backend/postmaster/custodian.c  | 11 +++
>>  src/backend/replication/logical/snapbuild.c | 13 +++--
>>  src/include/replication/snapbuild.h |  2 +-
>>  4 files changed, 19 insertions(+), 9 deletions(-)
> 
> Why does this not open us up to new xid wraparound issues? Before there was a
> hard bound on how long these files could linger around. Now there's not
> anymore.

Sorry, I'm probably missing something obvious, but I'm not sure how this
adds transaction ID wraparound 

RE: [Proposal] Add foreign-server health checks infrastructure

2022-02-16 Thread kuroda.hay...@fujitsu.com
Dear Horiguchi-san,

Thank you for giving your suggestions. I want to confirm your saying.

> FWIW, I'm not sure this feature necessarily requires core support
> dedicated to FDWs.  The core have USER_TIMEOUT feature already and
> FDWs are not necessarily connection based.  It seems better if FDWs
> can implement health check feature without core support and it seems
> possible.  Or at least the core feature should be more generic and
> simpler. Why don't we just expose InTransactionHealthCheckCallbacks or
> something and operating functions on it?

I understood that as: the core part is too complicated and the FDW side is
too simplistic, right?

> Mmm. AFAICS the running command will stop with "canceling statement
> due to user request", which is a hoax.  We need a more decent message
> there.

+1 about better messages.

> I understand that the motive of this patch is "to avoid wasted long
> local work when fdw-connection dies".

Yeah your understanding is right.

> In regard to the workload in
> your first mail, it is easily avoided by ending the transaction as soon
> as remote access ends. This feature doesn't work for the case "begin;
> <remote access>; <local work>". But the same measure also works in
> that case.  So the only case where this feature is useful is "begin;
> <local work>; <remote access>; <local work>; end;".  But in the first
> place how frequently do you expecting remote-connection close happens?
> If that happens so frequently, you might need to recheck the system
> health before implementing this feature.  Since it is correctly
> detected when something really went wrong, I feel that it is a bit too
> complex for the usefulness especially for the core part.

Thanks for analyzing motivation.
Indeed, some cases may be resolved by separating the transaction, and this
event rarely happens.

> In conclusion, as my humble opinion I would like to propose to reduce
> this feature to:
> 
> - Just periodically check health (in any aspect) of all live
>   connections regardless of the session state.

I understood here as removing following mechanism from core:

* disable timeout at end of tx.
* skip if held off or read commands

> - If an existing connection is found to be dead, just try canceling
>   the query (or sending query cancel).
> One issue with it is how to show the decent message for the query
> cancel, but maybe we can have a global variable that suggests the
> reason for the cancel.

Currently I have no good idea for that but I'll try.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 19:46:26 -0800, Andres Freund wrote:
> But right now I'm not seeing what prevents us from throwing a FATAL error
> while holding an lwlock?

Nothing. But we register ShutdownPostgres() / ShutdownAuxiliaryProcess() to
take care of that.

I checked and walsenders do have ShutdownPostgres() and I don't see anything
else that's missing it either.


I'll run the test in a loop, perhaps I can reproduce...

Greetings,

Andres Freund




Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> But right now I'm not seeing what prevents us from throwing a FATAL error
> while holding an lwlock?

If we're inside a transaction, then ShutdownPostgres ->
AbortOutOfAnyTransaction would release LWLocks.  I wonder
though if an exiting walsender would always take that path.
Maybe it'd be good if we released all LWLocks even if we
don't think we're in a transaction?

More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?

regards, tom lane




Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 22:11:30 -0500, Tom Lane wrote:
> So (a) it broke around 48 hours ago, which is already a useful
> bit of info

Indeed. 2f6501fa3c54bbe4568e3bcccd9a60d26a46b5ee seems like the obvious commit
to blame.

We document before_shmem_exit hooks as
/*
 * Call before_shmem_exit callbacks.
 *
 * These should be things that need most of the system to still be up 
and
 * working, such as cleanup of temp relations, which requires catalog
 * access; or things that need to be completed because later cleanup 
steps
 * depend on them, such as releasing lwlocks.
 */
and several before_shmem_exit callbacks use lwlocks.

But right now I'm not seeing what prevents us from throwing a FATAL error
while holding an lwlock?


> , and (b) your animals seem far more susceptible than
> anyone else's.  Why do you suppose that is?

Flaviventris and serinus use the newest snapshot of gcc available in Debian,
one with -O0, the other with -O3.

desmoxytes, idiacanthus, komodoensis all have JIT forced for every query.

They all run on the same host - looking at stats it doesn't look crazily
overcommitted or such. But it might have more scheduler variance than most?


Greetings,

Andres Freund




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 7:08 PM Andres Freund  wrote:
> I'm quite worried about the pace and style of the vacuum changes. To me it
> doesn't look like that there was design consensus before 44fa8488 was
> committed, quite the opposite (while 44fa8488 probably isn't the center of
> contention, it's not just off to the side either).

I'm not aware of any disagreement whatsoever that is directly related
to 44fa8488. It's true that there was a lot of discussion (even heated
discussion) between myself and Robert, concerning related work on
freezing. But the work in commit 44fa8488 can clearly be justified as
refactoring work. It makes the code in vacuumlazy.c much cleaner. In
fact, that's how commit 44fa8488 started off -- purely as refactoring
work. All of the stuff with freezing (not committed) was started after
the general shape of 44fa8488 was established. One thing followed from
the other.

> You didn't really give a heads up that you're about to commit either. There's
> a note at the bottom of [1], a very long email, talking about committing in a
> couple of weeks. After which there was substantial discussion of the patchset.

How can you be surprised that I committed 44fa8488? It's essentially
the same patch as the first version, posted November 22 -- almost 3
months ago. And it's certainly not a big patch (though it is
complicated).

How was 44fa8488 substantively different to v1 of the patch, posted
all the way back on November 22? Literally the only thing that changed
is that it became more polished, based in part on your feedback.
That's it!

> It doesn't look to me like there was a lot of review for 44fa8488, despite it
> touching very critical parts of the code. I'm listed as a reviewer, that was a
> few months ago, and rereading my review I don't think it can be read to agree
> with the state of the code back then.

Can you be more specific?

> I also have a hard time making heads or tails out of the commit message of
> 44fa84881ff. It's quite long without being particularly descriptive. The
> commit just changes a lot of things at once, making it hard to precisely
> describe and very hard to review and understand.

The commit message is high level. The important point is that we can
more or less treat all scanned_pages the same, without generally
caring about whether or not a cleanup lock could be acquired (since
the exceptions where that isn't quite true are narrow and rare). That
is the common theme, for everything.

I do recall that you said something about the patch that became commit
44fa8488 being too much all at once. In particular, something about
removing FORCE_CHECK_PAGE() in a separate commit. But I don't think
that that made sense -- which I said at the time. There were subtle
interactions between that and between the other stuff -- which is
actually what led to this 74388a1ac36 fix-up. But as I said in the
commit message to 74388a1ac36, I think that this issue might have been
latent. If it wasn't there all along, then it might have been. Sort of
like the issue referenced in 2011 commit b4b6923e.

-- 
Peter Geoghegan




Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
I wrote:
> So (a) it broke around 48 hours ago, which is already a useful
> bit of info, and (b) your animals seem far more susceptible than
> anyone else's.  Why do you suppose that is?

Eyeballing the commit log for potential causes in that time frame,
I can't help noticing

2 days ago  Andres Freund   Move replication slot release to 
before_shmem_exit().

regards, tom lane




Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
>> There's no disconnection log entry for either, which I suppose means
>> that somebody didn't bother logging disconnection for walsenders ...

> The thing is, we actually *do* log disconnection for walsenders:

Ah, my mistake, now I do see a disconnection entry for the other walsender
launched by the basebackup.

> Starting a node in recovery and having it connect to the primary seems like a
> mighty long time for a process to exit, unless it's stuck behind something.

Fair point.  Also, 019_replslot_limit.pl hasn't been changed in any
material way in months, but *something's* changed recently, because
this just started.  I scraped the buildfarm for instances of
"Failed test 'have walsender pid" going back 6 months, and what I find is

   sysname    | branch |      snapshot       |     stage     |                      l
--------------+--------+---------------------+---------------+---------------------------------------------
 desmoxytes   | HEAD   | 2022-02-15 04:42:05 | recoveryCheck | #   Failed test 'have walsender pid 1685516
 idiacanthus  | HEAD   | 2022-02-15 07:24:05 | recoveryCheck | #   Failed test 'have walsender pid 2758549
 serinus      | HEAD   | 2022-02-15 11:00:08 | recoveryCheck | #   Failed test 'have walsender pid 3682154
 desmoxytes   | HEAD   | 2022-02-15 11:04:05 | recoveryCheck | #   Failed test 'have walsender pid 3775359
 flaviventris | HEAD   | 2022-02-15 18:03:48 | recoveryCheck | #   Failed test 'have walsender pid 1517077
 idiacanthus  | HEAD   | 2022-02-15 22:48:05 | recoveryCheck | #   Failed test 'have walsender pid 2494972
 desmoxytes   | HEAD   | 2022-02-15 23:48:04 | recoveryCheck | #   Failed test 'have walsender pid 3055399
 desmoxytes   | HEAD   | 2022-02-16 10:48:05 | recoveryCheck | #   Failed test 'have walsender pid 1593461
 komodoensis  | HEAD   | 2022-02-16 21:16:04 | recoveryCheck | #   Failed test 'have walsender pid 3726703
 serinus      | HEAD   | 2022-02-17 01:18:17 | recoveryCheck | #   Failed test 'have walsender pid 208363

So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's.  Why do you suppose that is?

regards, tom lane




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 17:16:13 -0800, Peter Geoghegan wrote:
> On Sun, Feb 13, 2022 at 1:43 PM Peter Geoghegan  wrote:
> > This fixes the observed problem directly. It also makes the code
> > robust against other similar problems that might arise in the future.
> > The risk that comes from trusting that scanned_pages is a truly random
> > sample (given these conditions) is generally very large, while the
> > risk that comes from disbelieving it (given these same conditions) is
> > vanishingly small.
>
> Pushed.

I'm quite worried about the pace and style of the vacuum changes. To me it
doesn't look like that there was design consensus before 44fa8488 was
committed, quite the opposite (while 44fa8488 probably isn't the center of
contention, it's not just off to the side either).

You didn't really give a heads up that you're about to commit either. There's
a note at the bottom of [1], a very long email, talking about committing in a
couple of weeks. After which there was substantial discussion of the patchset.

It doesn't look to me like there was a lot of review for 44fa8488, despite it
touching very critical parts of the code. I'm listed as a reviewer, that was a
few months ago, and rereading my review I don't think it can be read to agree
with the state of the code back then.

I also have a hard time making heads or tails out of the commit message of
44fa84881ff. It's quite long without being particularly descriptive. The
commit just changes a lot of things at once, making it hard to precisely
describe and very hard to review and understand.


Then you just committed 74388a1ac36, three days after raising it. Without any
sort of heads up. You talked about a "draft patch".


It'd be one thing if this were happening in isolation, or if nobody had
complained about this before.  But see among others [2], [3], [4], [5].


And yes, it'd have been nice for 44fa8488' thread to have attracted more
reviews. But as far as complicated patchsets go, it has actually gotten more
attention than most of a similar vintage. Engaging with you around vacuum is
time and energy intensive, there tend to be lots of very long
emails. Reviewers have their own work as well.


I'm miffed.

- Andres

[1] 
https://www.postgresql.org/message-id/CAH2-Wz%3DiLnf%2B0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20210414193310.4yk7wriswhqhcxvq%40alap3.anarazel.de
[3] 
https://www.postgresql.org/message-id/CA%2BTgmobyUxq2Ms3g5YMPgqJzNOi7BmStcFGwCNd-W7z8nxbjUA%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/ca+tgmoydp4qy2lfw7yci6yfj1m-rch+npctjqd29k6d6ucx...@mail.gmail.com
[5] 
https://www.postgresql.org/message-id/CA+TgmobkCExv7Zo+pArRCsJ16hYVJJB_VuZKwvk1cCP=rs1...@mail.gmail.com




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 6:56 PM Robert Haas  wrote:
> I think that's not really what is happening, at least not in the cases
> that typically are brought to my attention. In those cases, the
> typical pattern is:

> 5. None of the tables in the database have been vacuumed in a long
> time. There are a million XIDs left. How many of the tables in the
> database are going to be truncated when they are vacuumed and burn one
> of the remaining XIDs? Anybody's guess, could be all or none.

I have to admit that this sounds way more plausible than my
speculative scenario. I haven't been involved in any kind of support
case with a customer in a *long* time, though (not by choice, mind
you).

> 6. Sometimes the user decides to run VACUUM FULL instead of plain
> VACUUM because it sounds better.

It's a pity that the name suggests otherwise. If only we'd named it
something that suggests "option of last resort". Oh well.

-- 
Peter Geoghegan




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 4:48 PM Peter Geoghegan  wrote:
> There might well be an element of survivorship bias here. Most VACUUM
> operations won't ever attempt truncation (speaking very generally).
> How many times might (say) the customer that John mentioned have
> accidentally gone over xidStopLimit for just a little while, before
> the situation corrected itself without anybody noticing? A lot of
> applications are very read-heavy, or aren't very well monitored.
>
> Eventually (maybe after several years of this), some laggard
> anti-wraparound vacuum needs to truncate the relation, due to random
> happenstance. Once that happens, the situation is bound to come to a
> head. The user is bound to finally notice that the system has gone
> over xidStopLimit, because there is no longer any way for the problem
> to go away on its own.

I think that's not really what is happening, at least not in the cases
that typically are brought to my attention. In those cases, the
typical pattern is:

1. Everything is fine.

2. Then the user forgets about a prepared transaction or a replication
slot, or leaves a transaction open forever, or has some kind of
corruption that causes VACUUM to fall over and die every time it tries
to run.

3. The user has no idea that VACUUM is no longer advanced
relfrozenxid. Time passes.

4. Eventually the system stops being willing to allocate new XIDs. It
tells the user to go to single user mode. So they do.

5. None of the tables in the database have been vacuumed in a long
time. There are a million XIDs left. How many of the tables in the
database are going to be truncated when they are vacuumed and burn one
of the remaining XIDs? Anybody's guess, could be all or none.

6. Sometimes the user decides to run VACUUM FULL instead of plain
VACUUM because it sounds better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Time to increase hash_mem_multiplier default?

2022-02-16 Thread Peter Geoghegan
On Tue, Feb 15, 2022 at 8:17 AM Justin Pryzby  wrote:
> The only reason not to is that a single-node hash-aggregate plan will now use
> 2x work_mem.  Which won't make sense to someone who doesn't deal with
> complicated plans (and who doesn't know that work_mem is per-node and can be
> used multiplicatively).

Hearing no objections, I pushed a commit to increase the default to 2.0.

Thanks
--
Peter Geoghegan




Re: Small and unaffected typo in pg_logical_slot_get_changes_guts()

2022-02-16 Thread Kasahara Tatsuhito
On Thu, Feb 17, 2022 at 10:56, Michael Paquier wrote:
>
> On Wed, Feb 16, 2022 at 01:25:09PM +0900, Kasahara Tatsuhito wrote:
> > Remove all references to tuplestore_donestoring() except for the header.
>
> Looks fine, thanks.  This has no impact on Melanie's patch posted on
> [1], so applied after tweaking the comment in tuplestore.h.
>
> [1]: 
> https://www.postgresql.org/message-id/20220106005748.GT14051%40telsasoft.com
Thanks !

Best regards,
-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: killing perl2host

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 14:42:30 -0800, Andres Freund wrote:
> On February 16, 2022 1:10:35 PM PST, Andrew Dunstan  
> wrote:
> >So something like this in Utils.pm:
> >
> >
> >die "Msys targeted perl is unsupported for running TAP tests" if
> >$Config{osname} eq 'msys';
> 
> I don't think we should reject msys in general - it's fine as long as the target 
> is msys, no? Msys includes postgres fwiw...

I think this means we should do the msys test in configure, inside

if test "$enable_tap_tests" = yes; then

and verify that $host_os != msys.

Greetings,

Andres Freund




Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
> There's no disconnection log entry for either, which I suppose means
> that somebody didn't bother logging disconnection for walsenders ...

The thing is, we actually *do* log disconnection for walsenders:

2022-02-16 17:53:55.780 PST [2459806][walsender][:0] LOG:  disconnection: 
session time: 0:00:01.321 user=andres database= host=[local]

The only way I see for the disconnection not to be logged is an immediate
shutdown.

It'd be one thing if the missing walsender disconnect wasn't logged because we
shut down just after. But unless I misunderstand something, between the
basebackup and the failing test, we actually start standby_3:

# Running: pg_basebackup -D 
/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_primary3_data/backup/my_backup
 -h /tmp/hyrtnyd2RU -p 53774 --checkpoint fast --no-sync
# Backup finished

...

# Running: pg_ctl -w -D 
/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_standby_3_data/pgdata
 -l 
/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/log/019_replslot_limit_standby_3.log
 -o --cluster-name=standby_3 start

and also wait till replication has caught up:

Waiting for replication conn standby_3's replay_lsn to pass 0/70 on primary3
done

and only then we have

not ok 17 - have walsender pid 3682154
# 3682136

#   Failed test 'have walsender pid 3682154
# 3682136'
#   at t/019_replslot_limit.pl line 335.
#   '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'


Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.

Greetings,

Andres Freund




Re: libpq async duplicate error results

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-16 20:28:02 -0500, Tom Lane wrote:
>> Uh ... then we'd have to cast away the const to do free().

> I was just thinking of
> if ((const PGresult *) res == &OOM_result)

Oh, I see.  Sure, can do it like that.

regards, tom lane




Re: Small and unaffected typo in pg_logical_slot_get_changes_guts()

2022-02-16 Thread Michael Paquier
On Wed, Feb 16, 2022 at 01:25:09PM +0900, Kasahara Tatsuhito wrote:
> Remove all references to tuplestore_donestoring() except for the header.

Looks fine, thanks.  This has no impact on Melanie's patch posted on
[1], so applied after tweaking the comment in tuplestore.h.

[1]: 
https://www.postgresql.org/message-id/20220106005748.GT14051%40telsasoft.com
--
Michael


signature.asc
Description: PGP signature


Re: libpq async duplicate error results

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 20:28:02 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-02-16 18:51:37 -0500, Tom Lane wrote:
> >> +  /* Also, do nothing if the argument is OOM_result */
> >> +  if (res == unconstify(PGresult *, &OOM_result))
> >> +  return;
>
> > Wouldn't it make more sense to make res const, rather than unconstifying
> > _result?
>
> Uh ... then we'd have to cast away the const to do free().

I was just thinking of

if ((const PGresult *) res == &OOM_result)

It's not important, I just find it stylistically nicer (making a pointer const
from a non-const pointer is safe, the other way round not generally).

- Andres




Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 16:50:57 -0800, Nathan Bossart wrote:
> + * The custodian process is new as of Postgres 15.

I think this kind of comment tends to age badly and not be very useful.


> Its main purpose is to
> + * offload tasks that could otherwise delay startup and checkpointing, but
> + * it needn't be restricted to just those things.  Offloaded tasks should
> + * not be synchronous (e.g., checkpointing shouldn't need to wait for the
> + * custodian to complete a task before proceeding).  Also, ensure that any
> + * offloaded tasks are either not required during single-user mode or are
> + * performed separately during single-user mode.
> + *
> + * The custodian is not an essential process and can shut down quickly when
> + * requested.  The custodian will wake up approximately once every 5
> + * minutes to perform its tasks, but backends can (and should) set its
> + * latch to wake it up sooner.

Hm. This kind of policy makes it easy to introduce bugs where the occasional
runs mask forgotten notifications etc.


> + * Normal termination is by SIGTERM, which instructs the bgwriter to
> + * exit(0).

s/bgwriter/.../

> Emergency termination is by SIGQUIT; like any backend, the
> + * custodian will simply abort and exit on SIGQUIT.
> + *
> + * If the custodian exits unexpectedly, the postmaster treats that the same
> + * as a backend crash: shared memory may be corrupted, so remaining
> + * backends should be killed by SIGQUIT and then a recovery cycle started.

This doesn't really seem useful stuff to me.



> + /*
> +  * If an exception is encountered, processing resumes here.
> +  *
> +  * You might wonder why this isn't coded as an infinite loop around a
> +  * PG_TRY construct.  The reason is that this is the bottom of the
> +  * exception stack, and so with PG_TRY there would be no exception 
> handler
> +  * in force at all during the CATCH part.  By leaving the outermost 
> setjmp
> +  * always active, we have at least some chance of recovering from an 
> error
> +  * during error recovery.  (If we get into an infinite loop thereby, it
> +  * will soon be stopped by overflow of elog.c's internal state stack.)
> +  *
> +  * Note that we use sigsetjmp(..., 1), so that the prevailing signal 
> mask
> +  * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
> +  * signals other than SIGQUIT will be blocked until we complete error
> +  * recovery.  It might seem that this policy makes the HOLD_INTERRUPS()
> +  * call redundant, but it is not since InterruptPending might be set
> +  * already.
> +  */

I think it's bad to copy this comment into even more places.


> + /* Since not using PG_TRY, must reset error stack by hand */
> + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> + {

I also think it's a bad idea to introduce even more copies of the error
handling body.  I think we need to unify this. And yes, it's unfair to stick
you with it, but it's been a while since a new aux process has been added.


> + /*
> +  * These operations are really just a minimal subset of
> +  * AbortTransaction().  We don't have very many resources to 
> worry
> +  * about.
> +  */

Given what you're proposing this for, are you actually confident that we don't
need more than this?


> From d9826f75ad2259984d55fc04622f0b91ebbba65a Mon Sep 17 00:00:00 2001
> From: Nathan Bossart 
> Date: Sun, 5 Dec 2021 19:38:20 -0800
> Subject: [PATCH v5 2/8] Also remove pgsql_tmp directories during startup.
>
> Presently, the server only removes the contents of the temporary
> directories during startup, not the directory itself.  This changes
> that to prepare for future commits that will move temporary file
> cleanup to a separate auxiliary process.

Is this actually safe? Is there a guarantee no process can access a temp table
stored in one of these? Because without WAL guaranteeing consistency, we can't
just access e.g. temp tables written before a crash.


> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
> + bool unlink_all);

I don't like functions with multiple consecutive booleans; they tend to get
swapped around. Why not just split unlink_all=true/false into different
functions?


> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
>
> First, pgsql_tmp directories will be renamed to stage them for
> removal.

What if the target name already exists?


> Then, all files in pgsql_tmp are removed before removing
> the staged directories themselves.  This change is being made in
> preparation for a follow-up change to offload most temporary file
> cleanup to the new custodian process.
>
> Note that temporary relation files cannot be cleaned up via the
> aforementioned strategy and will not be offloaded to the custodian.

This should be in the prior commit message, otherwise people 

Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Michael Paquier
On Tue, Feb 15, 2022 at 06:24:13PM -0800, Andres Freund wrote:
> For backups it's pretty obviously zstd imo. At the lower levels it achieves
> considerably higher compression ratios while still being vastly faster than
> gzip. Without even using the threaded compression support the library has.

Noted.

> For something like wal_compression it'd be a harder question.

FWIW, I have done some measurements for wal_compression with zstd, as
of:
https://www.postgresql.org/message-id/YMmlvyVyAFlxZ+/h...@paquier.xyz

The result is not surprising, a bit more CPU for zstd with more
compression compared to LZ4, both easily outclassing zlib.  I am not
sure which one would be more adapted as default as FPI patterns depend
on the workload, for one, and this is just one corner case.
--
Michael




Re: refactoring basebackup.c

2022-02-16 Thread Jeevan Ladhe
Thanks for the comments Robert. I have addressed your comments in the
attached patch v13-0002-ZSTD-add-server-side-compression-support.patch.
Rest of the patches are similar to v12, but just bumped the version number.

> It will be good if we can also fix
> CreateWalTarMethod to support LZ4 and ZSTD.
Ok we will see, either Dipesh or I will take care of it.

Regards,
Jeevan Ladhe


On Thu, 17 Feb 2022 at 02:37, Robert Haas  wrote:

> On Wed, Feb 16, 2022 at 12:46 PM Jeevan Ladhe 
> wrote:
> > So, I went ahead and have now also implemented client side decompression
> > for zstd.
> >
> > Robert separated[1] the ZSTD configure switch from my original patch
> > of server side compression and also added documentation related to
> > the switch. I have included that patch here in the patch series for
> > simplicity.
> >
> > The server side compression patch
> > 0002-ZSTD-add-server-side-compression-support.patch has also taken care
> > of Justin Pryzby's comments[2]. Also, made changes to pg_basebackup help
> > as suggested by Álvaro Herrera.
>
> The first hunk of the documentation changes is missing a comma between
> gzip and lz4.
>
> + * At the start of each archive we reset the state to start a new
> + * compression operation. The parameters are sticky and they would
> stick
> + * around as we are resetting with option ZSTD_reset_session_only.
>
> I don't think "would" is what you mean here. If you say something
would stick around, that means it could be that way but it isn't. ("I
> would go to the store and buy some apples, but I know they don't have
> any so there's no point.") I think you mean "will".
>
> -printf(_("  -Z,
> --compress={[{client,server}-]gzip,lz4,none}[:LEVEL] or [LEVEL]\n"
> - " compress tar output with given
> compression method or level\n"));
> +printf(_("  -Z,
> --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL]\n"));
> +printf(_("  -Z, --compress=none\n"));
>
> You deleted a line that you should have preserved here.
>
> Overall there doesn't seem to be much to complain about here on a
> first read-through. It will be good if we can also fix
> CreateWalTarMethod to support LZ4 and ZSTD.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


v13-0004-ZSTD-add-client-side-decompression-support.patch
Description: Binary data


v13-0001-Add-support-for-building-with-ZSTD.patch
Description: Binary data


v13-0002-ZSTD-add-server-side-compression-support.patch
Description: Binary data


v13-0003-ZSTD-add-client-side-compression-support.patch
Description: Binary data


Re: Small TAP tests cleanup for Windows and unused modules

2022-02-16 Thread Michael Paquier
On Wed, Feb 16, 2022 at 05:36:14PM -0500, Andrew Dunstan wrote:
> More specifically, all the tests in question are now passing on jacana
> and fairywren, and their $Config{osname} is NOT 'msys' (it's 'MSWin32').

Indeed.  0001 is incorrect.

> So we should simply remove any line that ends "if $Config{osname} eq
> 'msys';" since we don't have any such animals any more.

So, that's related to [1], meaning that we don't have any buildfarm
members that run the perl flavor from Msys.  And, as a result, we
don't need to worry about the CRLF conversions in any of the perl
modules?  That would be nice.

[1]: 
https://www.postgresql.org/message-id/0ba775a2-8aa0-0d56-d780-69427cf6f...@dunslane.net
 
--
Michael




Re: libpq async duplicate error results

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-16 18:51:37 -0500, Tom Lane wrote:
>> +/* Also, do nothing if the argument is OOM_result */
>> +  if (res == unconstify(PGresult *, &OOM_result))
>> +return;

> Wouldn't it make more sense to make res const, rather than unconstifying
> OOM_result?

Uh ... then we'd have to cast away the const to do free().

regards, tom lane




Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> I think the test is telling us that something may be broken. We shouldn't
> silence that without at least some understanding what it is.

I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331.  The two PIDs reported in the failure
correspond to this postmaster log trace:

2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG:  connection received: 
host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG:  replication connection 
authorized: user=bf application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG:  received replication 
command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT:  SHOW 
data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG:  received replication 
command: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT:  
CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG:  received replication 
command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT:  IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG:  received replication 
command: START_REPLICATION SLOT "pg_basebackup_3726690" 0/60 TIMELINE 1
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT:  START_REPLICATION 
SLOT "pg_basebackup_3726690" 0/60 TIMELINE 1

and this one:

2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG:  connection received: 
host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG:  replication connection 
authorized: user=bf application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG:  received replication 
command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT:  IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG:  received replication 
command: START_REPLICATION SLOT "rep3" 0/70 TIMELINE 1
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT:  START_REPLICATION 
SLOT "rep3" 0/70 TIMELINE 1

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that?  But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04




Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM

2022-02-16 Thread Peter Geoghegan
On Sun, Feb 13, 2022 at 1:43 PM Peter Geoghegan  wrote:
> This fixes the observed problem directly. It also makes the code
> robust against other similar problems that might arise in the future.
> The risk that comes from trusting that scanned_pages is a truly random
> sample (given these conditions) is generally very large, while the
> risk that comes from disbelieving it (given these same conditions) is
> vanishingly small.

Pushed.

-- 
Peter Geoghegan




Re: libpq async duplicate error results

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 18:51:37 -0500, Tom Lane wrote:
> This seems workable, and you'll notice it fixes the duplicated text
> in the test case Andres was worried about.

Cool.

I find it mildly scary that we didn't have any other tests verifying the libpq
side of connection termination. Seems like we maybe should add a few more?
Some simple cases we can do via isolationtester. But some others would
probably require a C test program to be robust...


> + /* Also, do nothing if the argument is OOM_result */
> + if (res == unconstify(PGresult *, &OOM_result))
> + return;

Wouldn't it make more sense to make res const, rather than unconstifying
> OOM_result?


Greetings,

Andres Freund




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-16 Thread Nathan Bossart
Here is a rebased patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 8fc744a0b22a1fff9f3470bfdc0e08d9dd5da5be Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v2 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 154 +-
 doc/src/sgml/func.sgml|  30 +-
 src/backend/access/transam/xlog.c | 466 ++
 src/backend/access/transam/xlogfuncs.c| 235 ++---
 src/backend/catalog/system_functions.sql  |  14 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |   6 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   3 +-
 src/include/catalog/pg_proc.dat   |  20 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +--
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 14 files changed, 113 insertions(+), 935 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..f241bcab9c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -884,7 +876,7 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0
  rights to run pg_start_backup (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_start_backup('label', false);
 
  where label is any string you want to use to uniquely
  identify this backup operation. The connection
@@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false);
  issue an immediate checkpoint using as much I/O as available.
 
 
-
- The third parameter being false tells
- pg_start_backup to initiate a non-exclusive base backup.
-


 
@@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false);
 
  In the same connection as before, issue the command:
 
-SELECT * FROM pg_stop_backup(false, true);
+SELECT * FROM pg_stop_backup(true);
 
  This terminates backup mode. On a primary, it also performs an automatic
  switch to the next WAL segment.  On a standby, it is not possible to
@@ -985,142 +973,6 @@ SELECT * FROM pg_stop_backup(false, true);

   
 
-   
-   
-Making an Exclusive Low-Level Backup
-
-
- 
-  The exclusive backup method is deprecated and should be avoided.
-  Prior to PostgreSQL 9.6, this was the only
-  low-level method available, but it is now recommended that all users
-  upgrade their scripts to use non-exclusive backups.
- 
-
-
-
- The process for an exclusive backup is mostly the same as for a
- non-exclusive one, but it differs in a few key steps. This type of
- backup can only be taken on a primary and does not allow concurrent
- backups.  Moreover, because it creates a backup label file, as
- described below, it can block automatic restart of the primary server
- after a crash.  On the other hand, the erroneous removal of this
- file from a backup or standby is a common mistake, which can result
- in serious data corruption.  If it is necessary to use this method,
- the following steps may be used.
-
-
-  
-   
-
- Ensure that WAL archiving is enabled and working.
-
-   
-   
-
- Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a user who has been granted
- EXECUTE on the function) and issue the command:
-
-SELECT pg_start_backup('label');
-
- where label is any string you want to use to uniquely
- identify this backup operation.
- pg_start_backup creates a backup label file,
- called backup_label, in the cluster directory with
- information about your backup, including the start time and label string.
- The function also creates a tablespace map file,
- called tablespace_map, in the cluster directory with
- information about tablespace symbolic links in pg_tblspc/ if
- one or more such link is present.  Both files are 

Re: O(n) tasks cause lengthy startups and checkpoints

2022-02-16 Thread Nathan Bossart
Here is another rebased patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c11a6893d2d509df1389a1c03081b6cc563d5683 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 5 Jan 2022 19:24:22 +
Subject: [PATCH v5 1/8] Introduce custodian.

The custodian process is a new auxiliary process that is intended
to help offload tasks could otherwise delay startup and
checkpointing.  This commit simply adds the new process; it does
not yet do anything useful.
---
 src/backend/postmaster/Makefile |   1 +
 src/backend/postmaster/auxprocess.c |   8 +
 src/backend/postmaster/custodian.c  | 214 
 src/backend/postmaster/postmaster.c |  44 -
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/wait_event.c |   3 +
 src/backend/utils/init/miscinit.c   |   3 +
 src/include/miscadmin.h |   3 +
 src/include/postmaster/custodian.h  |  17 ++
 src/include/storage/proc.h  |  11 +-
 src/include/utils/wait_event.h  |   1 +
 11 files changed, 301 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/postmaster/custodian.c
 create mode 100644 src/include/postmaster/custodian.h

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index dbbeac5a82..1b7aae60f5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	bgworker.o \
 	bgwriter.o \
 	checkpointer.o \
+	custodian.o \
 	fork_process.o \
 	interrupt.o \
 	pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 0587e45920..7eae34884d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 		case CheckpointerProcess:
 			MyBackendType = B_CHECKPOINTER;
 			break;
+		case CustodianProcess:
+			MyBackendType = B_CUSTODIAN;
+			break;
 		case WalWriterProcess:
 			MyBackendType = B_WAL_WRITER;
 			break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 			CheckpointerMain();
 			proc_exit(1);
 
+		case CustodianProcess:
+			CustodianMain();
+			proc_exit(1);
+
 		case WalWriterProcess:
 			WalWriterMain();
 			proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 00..5f2b647544
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,214 @@
+/*-
+ *
+ * custodian.c
+ *
+ * The custodian process is new as of Postgres 15.  Its main purpose is to
+ * offload tasks that could otherwise delay startup and checkpointing, but
+ * it needn't be restricted to just those things.  Offloaded tasks should
+ * not be synchronous (e.g., checkpointing shouldn't need to wait for the
+ * custodian to complete a task before proceeding).  Also, ensure that any
+ * offloaded tasks are either not required during single-user mode or are
+ * performed separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shut down quickly when
+ * requested.  The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ * Normal termination is by SIGTERM, which instructs the bgwriter to
+ * exit(0).  Emergency termination is by SIGQUIT; like any backend, the
+ * custodian will simply abort and exit on SIGQUIT.
+ *
+ * If the custodian exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining
+ * backends should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/postmaster/custodian.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300)		/* 5 minutes */
+
+/*
+ * Main entry point for custodian process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CustodianMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext custodian_context;
+
+	/*
+	 * Properly accept or ignore signals that might be 

Re: pgsql: Add TAP test to automate the equivalent of check_guc

2022-02-16 Thread Michael Paquier
On Wed, Feb 16, 2022 at 07:39:26PM -0500, Robert Haas wrote:
> Sorry, I saw that and then forgot about it ... but isn't it 3 days,
> not 4 weeks? In any case, I'm happy to have you take care of it, but I
> can also look at it tomorrow if you prefer.

Ugh.  I looked at the top of the thread and saw January the 17th :)

So that means that I need more caffeine.  If you are planning to look
at it, please feel free.  Thanks!
--
Michael




Re: libpq async duplicate error results

2022-02-16 Thread Tom Lane
I wrote:
>>> An idea that I don't have time to pursue right now is to track
>>> how much of conn->errorMessage has been read out by PQgetResult,
>>> and only report newly-added text in later PQgetResult calls of
>>> a series, while PQerrorMessage would continue to report the
>>> whole thing.  Buffer resets would occur where they do now.

> I spent a bit more time poking at this issue.  I no longer like the
> idea of reporting only the tail part of conn->errorMessage; I'm
> afraid that would convert "edge cases sometimes return redundant text"
> into "edge cases sometimes miss returning text at all", which is not
> an improvement.

I figured out what seems like an adequately bulletproof solution to
this problem.  The thing that was scaring me before is that there are
code paths in which we'll discard a pending conn->result PGresult
(which could be an error) and make a new one; it wasn't entirely
clear that we could track which result represents what text.
The solution to that has two components:

1. Never advance the how-much-text-has-been-reported counter until
we are just about to return an error PGresult to the application.
This reduces the risk of missing or duplicated text to what seems
an acceptable level.  I found that pqPrepareAsyncResult can be used
to handle this logic -- it's only called at outer levels, and it
already has some state-updating behavior, so we'd already have bugs
if it were called improperly.

2. Don't make error PGresults at all until we reach pqPrepareAsyncResult.
This saves some cycles in the paths where we previously replaced one
error with another.  Importantly, it also means that we can now have
a bulletproof solution to the problem of how to return an error when
we get OOM on any attempt to make a PGresult.  The attached patch
includes a statically-declared PGresult that we can return in that
case.

It's not very practical to do #2 exactly, because in the case of
an error received from the server, we need to build a PGresult that
includes the broken-down error fields.  But all internally-generated
errors can be handled in this style.

This seems workable, and you'll notice it fixes the duplicated text
in the test case Andres was worried about.

Note that this patch is on top of the event-proc changes I posted
in [1].  Without that change, it's not very clear how to cope
with PGEVT_RESULTCREATE failure after we've already updated the
error status.

regards, tom lane

[1] https://www.postgresql.org/message-id/3390587.1645035110%40sss.pgh.pa.us

diff --git a/contrib/test_decoding/expected/slot_creation_error.out b/contrib/test_decoding/expected/slot_creation_error.out
index 321648c339..043bdae0a2 100644
--- a/contrib/test_decoding/expected/slot_creation_error.out
+++ b/contrib/test_decoding/expected/slot_creation_error.out
@@ -98,7 +98,6 @@ t
 
 step s2_init: <... completed>
 FATAL:  terminating connection due to administrator command
-FATAL:  terminating connection due to administrator command
 server closed the connection unexpectedly
 	This probably means the server terminated abnormally
 	before or while processing the request.
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index f8f4111fef..6fceff561b 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -1237,7 +1237,7 @@ PQencryptPasswordConn(PGconn *conn, const char *passwd, const char *user,
 	if (!conn)
 		return NULL;
 
-	resetPQExpBuffer(>errorMessage);
+	pqClearConnErrorState(conn);
 
 	/* If no algorithm was given, ask the server. */
 	if (algorithm == NULL)
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 9c9416e8ff..2a3d68b4d1 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3685,7 +3685,7 @@ keep_going:		/* We will come back to here until there is
  * (and it seems some clients expect it to be empty after a
  * successful connection).
  */
-resetPQExpBuffer(>errorMessage);
+pqClearConnErrorState(conn);
 
 /* We are open for business! */
 conn->status = CONNECTION_OK;
@@ -4231,7 +4231,7 @@ closePGconn(PGconn *conn)
 
 	/*
 	 * Close the connection, reset all transient state, flush I/O buffers.
-	 * Note that this includes clearing conn->errorMessage; we're no longer
+	 * Note that this includes clearing conn's error state; we're no longer
 	 * interested in any failures associated with the old connection, and we
 	 * want a clean slate for any new connection attempt.
 	 */
@@ -4241,7 +4241,7 @@ closePGconn(PGconn *conn)
 	conn->xactStatus = PQTRANS_IDLE;
 	conn->pipelineStatus = PQ_PIPELINE_OFF;
 	pqClearAsyncResult(conn);	/* deallocate result */
-	resetPQExpBuffer(>errorMessage);
+	pqClearConnErrorState(conn);
 	release_conn_addrinfo(conn);
 
 	/* Reset all state obtained from server, too */
@@ -5236,7 +5236,7 @@ ldapServiceLookup(const char *purl, PQconninfoOption *options,
  * Returns 0 on success, nonzero on 

Re: Small TAP tests cleanup for Windows and unused modules

2022-02-16 Thread Andres Freund
Hi, 

On February 16, 2022 2:04:01 PM PST, Andrew Dunstan  wrote:
>
>On 2/16/22 16:36, Daniel Gustafsson wrote:
>> Seeing msys in TAP tests mentioned in a thread [1] tonight reminded me about
>> two related (well, one of them) patches I had sitting around, so rather than
>> forgetting again here are some small cleanups.
>>
>> 0001 attempts to streamline how we detect Windows in the TAP tests (after 
>> that
>> there is a single msys check left that I'm not sure about, but [1] seems to
>> imply it could go); 0002 removes some unused module includes which either 
>> were
>> used at some point in the past or likely came from copy/paste.
>>
>
>0002 looks OK at first glance.
>
>
>0001 is something we should investigate. It's really going in the wrong
>direction I suspect. We should be looking to narrow the scope of these
>platform-specific bits of processing, not expand them.

Yes. The vast majority of those skips look frankly irresponsible, indicating 
bugs in the test or postgres. There's definitely valid cases (e.g. testing sysv 
shm), but most don't look valid to me.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: killing perl2host

2022-02-16 Thread Andres Freund
Hi, 

On February 16, 2022 1:10:35 PM PST, Andrew Dunstan  wrote:
>
>On 2/16/22 16:01, Andres Freund wrote:
>> Hi,
>>
>> On 2022-02-16 15:46:28 -0500, Andrew Dunstan wrote:
>>> I suggest that we apply this patch:
>> +1
>>
>>> and if nothing breaks in a few days I will set about a more thorough
>>> removal of perl2host() and adjusting everywhere it's called, and we can
>>> forget that the whole sorry mess ever happened :-)
>> I think we would need an error check against using an msys perl when 
>> targeting
>> native windows, otherwise this'll be too hard to debug.
>
>
>So something like this in Utils.pm:
>
>
>die "Msys targeted perl is unsupported for running TAP tests" if
>$Config{osname} eq 'msys';

I don't think we should reject msys in general - it's fine as long as the target 
is msys, no? Msys includes postgres fwiw...


>> And we probably should set that environment variable that might fix in-tree
>> tests? Or reject in-tree builds?
>
>
>I think that's an orthogonal issue, but yes we should probably set it.
>Have you tested to make sure it does what we want? I certainly don't
>want to reject in-tree builds.

I don't think it's quite orthogonal - msys perl uses sane PATH semantics and 
thus doesn't have this problem.

Andres




Re: Small TAP tests cleanup for Windows and unused modules

2022-02-16 Thread Andrew Dunstan


On 2/16/22 17:04, Andrew Dunstan wrote:
> On 2/16/22 16:36, Daniel Gustafsson wrote:
>> Seeing msys in TAP tests mentioned in a thread [1] tonight reminded me about
>> two related (well, one of them) patches I had sitting around, so rather than
>> forgetting again here are some small cleanups.
>>
>> 0001 attempts to streamline how we detect Windows in the TAP tests (after 
>> that
>> there is a single msys check left that I'm not sure about, but [1] seems to
>> imply it could go); 0002 removes some unused module includes which either 
>> were
>> used at some point in the past or likely came from copy/paste.
>>
> 0002 looks OK at first glance.
>
>
> 0001 is something we should investigate. It's really going in the wrong
> direction I suspect. We should be looking to narrow the scope of these
> platform-specific bits of processing, not expand them.
>
>


More specifically, all the tests in question are now passing on jacana
and fairywren, and their $Config{osname} is NOT 'msys' (it's 'MSWin32').

So we should simply remove any line that ends "if $Config{osname} eq
'msys';" since we don't have any such animals any more.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Small TAP tests cleanup for Windows and unused modules

2022-02-16 Thread Daniel Gustafsson
> On 16 Feb 2022, at 23:04, Andrew Dunstan  wrote:
> 
> On 2/16/22 16:36, Daniel Gustafsson wrote:
>> Seeing msys in TAP tests mentioned in a thread [1] tonight reminded me about
>> two related (well, one of them) patches I had sitting around, so rather than
>> forgetting again here are some small cleanups.
>> 
>> 0001 attempts to streamline how we detect Windows in the TAP tests (after that
>> there is a single msys check left that I'm not sure about, but [1] seems to
>> imply it could go); 0002 removes some unused module includes which either were
>> used at some point in the past or likely came from copy/paste.
>> 
> 
> 0002 looks OK at first glance.

Thanks!

> 0001 is something we should investigate. It's really going in the wrong
> direction I suspect. We should be looking to narrow the scope of these
> platform-specific bits of processing, not expand them.

No arguments there, I was only looking at making sure that we detect the OS in
the same way everywhere, but if you're working towards eliminating this there
is little use in changing anything.

--
Daniel Gustafsson   https://vmware.com/





Re: Small TAP tests cleanup for Windows and unused modules

2022-02-16 Thread Andrew Dunstan


On 2/16/22 16:36, Daniel Gustafsson wrote:
> Seeing msys in TAP tests mentioned in a thread [1] tonight reminded me about
> two related (well, one of them) patches I had sitting around, so rather than
> forgetting again here are some small cleanups.
>
> 0001 attempts to streamline how we detect Windows in the TAP tests (after that
> there is a single msys check left that I'm not sure about, but [1] seems to
> imply it could go); 0002 removes some unused module includes which either were
> used at some point in the past or likely came from copy/paste.
>

0002 looks OK at first glance.


0001 is something we should investigate. It's really going in the wrong
direction I suspect. We should be looking to narrow the scope of these
platform-specific bits of processing, not expand them.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 1:04 PM Robert Haas  wrote:
> No, what I'm saying is that people running older versions routinely
> run VACUUM in single-user mode because otherwise it fails due to the
> truncation issue. But once they go into single-user mode they lose
> protection.

Seems logically consistent, but absurd. A Catch-22 situation if ever
there was one.

There might well be an element of survivorship bias here. Most VACUUM
operations won't ever attempt truncation (speaking very generally).
How many times might (say) the customer that John mentioned have
accidentally gone over xidStopLimit for just a little while, before
the situation corrected itself without anybody noticing? A lot of
applications are very read-heavy, or aren't very well monitored.

Eventually (maybe after several years of this), some laggard
anti-wraparound vacuum needs to truncate the relation, due to random
happenstance. Once that happens, the situation is bound to come to a
head. The user is bound to finally notice that the system has gone
over xidStopLimit, because there is no longer any way for the problem
to go away on its own.

-- 
Peter Geoghegan




Small TAP tests cleanup for Windows and unused modules

2022-02-16 Thread Daniel Gustafsson
Seeing msys in TAP tests mentioned in a thread [1] tonight reminded me about
two related (well, one of them) patches I had sitting around, so rather than
forgetting again here are some small cleanups.

0001 attempts to streamline how we detect Windows in the TAP tests (after that
there is a single msys check left that I'm not sure about, but [1] seems to
imply it could go); 0002 removes some unused module includes which either were
used at some point in the past or likely came from copy/paste.

--
Daniel Gustafsson   https://vmware.com/

[1] https://postgr.es/m/0ba775a2-8aa0-0d56-d780-69427cf6f...@dunslane.net



0002-Remove-unused-modules-from-TAP-tests.patch
Description: Binary data


0001-Make-Windows-detection-in-TAP-tests-consistent.patch
Description: Binary data


Re: adding 'zstd' as a compression algorithm (initdb/lz4)

2022-02-16 Thread Justin Pryzby
On Tue, Feb 15, 2022 at 11:54:10AM -0800, Andres Freund wrote:
> > Isn't it an incontrovertible fact that LZ4 is superior to pglz in
> > every way? LZ4 is pretty much its successor. And so it seems totally
> > fine to assume that users will always want to use the clearly better
> > option, and that that option will be generally available going
> > forward. TOAST compression is applied selectively already, based on
> > various obscure implementation details, some of which are quite
> > arbitrary.
> 
> Yea, we should really default to lz4 in initdb when available.

This patch intends to implement that.  I have no particular interest in this,
but if anyone wants, I will add it to the next CF (or the one after that).

commit 2a3c5950e625ccfaebc49bbf71b8db16dc143cd2
Author: Justin Pryzby 
Date:   Tue Feb 15 19:14:33 2022 -0600

initdb: default_toast_compression=lz4 if available

TODO: consider same for wal_compression

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc63172efde..f9cd2ef7229 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8537,7 +8537,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 The supported compression methods are pglz and
 (if PostgreSQL was compiled with
 --with-lz4) lz4.
-The default is pglz.
+The default is lz4 if available at the time 
+PostgreSQL was compiled, otherwise
+pglz.

   
  
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f505413a7f9..6b6f6efaba1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4726,7 +4726,11 @@ static struct config_enum ConfigureNamesEnum[] =
NULL
},
		&default_toast_compression,
+#ifdef USE_LZ4
+   TOAST_LZ4_COMPRESSION,
+#else
TOAST_PGLZ_COMPRESSION,
+#endif
default_toast_compression_options,
NULL, NULL, NULL
},
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d78e8e67b8d..bb7c57e00fa 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1185,6 +1185,12 @@ setup_config(void)
 							  "#update_process_title = off");
 #endif
 
+#ifdef USE_LZ4
+	conflines = replace_token(conflines,
+							  "#default_toast_compression = 'pglz'",
+							  "#default_toast_compression = 'lz4'");
+#endif
+
/*
 * Change password_encryption setting to md5 if md5 was chosen as an
 * authentication method, unless scram-sha-256 was also chosen.




Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-02-16 Thread Matthias van de Meent
Hi,

I noticed that effectively all indexes use the special region of a
page to store some index-specific data on the page. In all cases I've
noticed, this is a constant-sized struct, located at what is
effectively a fixed offset from the end of the page; indicated on the
page by pd_special; and accessed through PageGetSpecialPointer() (that
is about equal to `page + page->pd_special` modulo typing and
assertions).

Seeing that these indexes effectively always use a constant-sized
struct as the only data in the special region; why does this code
depend on the pd_special pointer to retrieve their PageOpaque pointer?
Wouldn't a constant offset off of (page), based on the size of the
opaque structure and BLCKSZ, be enough?

Sure, in assertion builds this also validates that pd_special indeed
points in the bounds of the page, but PageGetSpecialPointer() does not
validate that the struct itself is in the bounds of the page, so
there's little guarantee that this is actually safe.

Additionally, this introduces a layer of indirection that is not
necessarily useful: for example the _bt_step_left code has no use for
the page's header information other than that it contains the
pd_special pointer (which is effectively always BLCKSZ -
MAXALIGN(sizeof(opaque))). This introduces more data and ordering
dependencies in the CPU, where this should be as simple as a constant
offset over the base page pointer.


Assuming there's no significant reason to _not_ to change this code to
a constant offset off the page pointer and only relying on pd_special
in asserts when retrieving the IndexPageOpaques, I propose the
attached patches:

0001 adds a new macro PageGetSpecialOpaque(page, opaquedatatyp); which
replaces PageGetSpecialPointer for constant sized special area
structures, and sets up the SPGist, GIST and Gin index methods to use
that instead of PageGetSpecialPointer.
0002 replaces manual PageGetSpecialPointer calls & casts in btree code
with a new macro BTPageGetOpaque, which utilizes PageGetSpecialOpaque.
0003 does the same as 0002, but for hash indexes.

A first good reason to do this is preventing further damage when a
page is corrupted: if I can somehow overwrite pd_special,
non-assert-enabled builds would start reading and writing at arbitrary
offsets from the page pointer, quite possibly in subsequent buffers
(or worse, on the stack, in case of stack-allocated blocks).
A second reason would be less indirection to get to the opaque
pointer. This should improve performance a bit in those cases where we
(initially) only want to access the [Index]PageOpaque struct.
Lastly, 0002 and 0003 remove some repetitive tasks of dealing with the
[nbtree, hash] opaques by providing a typed accessor macro similar to
what is used in the GIN and (SP-)GIST index methods; improving the
legibility of the code and decreasing the churn.

Kind regards,

Matthias van de Meent.


v1-0001-Add-known-size-pre-aligned-special-area-pointer-m.patch
Description: Binary data


v1-0003-Update-hash-code-to-use-PageGetSpecialOpaque-repl.patch
Description: Binary data


v1-0002-Update-nbtree-code-to-use-PageGetSpecialOpaque-re.patch
Description: Binary data


Re: killing perl2host

2022-02-16 Thread Andrew Dunstan


On 2/16/22 16:01, Andres Freund wrote:
> Hi,
>
> On 2022-02-16 15:46:28 -0500, Andrew Dunstan wrote:
>> I suggest that we apply this patch:
> +1
>
>> and if nothing breaks in a few days I will set about a more thorough
>> removal of perl2host() and adjusting everywhere it's called, and we can
>> forget that the whole sorry mess ever happened :-)
> I think we would need an error check against using an msys perl when targeting
> native windows, otherwise this'll be too hard to debug.


So something like this in Utils.pm:


die "Msys targeted perl is unsupported for running TAP tests" if
$Config{osname} eq 'msys';


>
> And we probably should set that environment variable that might fix in-tree
> tests? Or reject in-tree builds?


I think that's an orthogonal issue, but yes we should probably set it.
Have you tested to make sure it does what we want? I certainly don't
want to reject in-tree builds.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: refactoring basebackup.c

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 12:46 PM Jeevan Ladhe  wrote:
> So, I went ahead and have now also implemented client side decompression
> for zstd.
>
> Robert separated[1] the ZSTD configure switch from my original patch
> of server side compression and also added documentation related to
> the switch. I have included that patch here in the patch series for
> simplicity.
>
> The server side compression patch
> 0002-ZSTD-add-server-side-compression-support.patch has also taken care
> of Justin Pryzby's comments[2]. Also, made changes to pg_basebackup help
> as suggested by Álvaro Herrera.

The first hunk of the documentation changes is missing a comma between
gzip and lz4.

+ * At the start of each archive we reset the state to start a new
+ * compression operation. The parameters are sticky and they would stick
+ * around as we are resetting with option ZSTD_reset_session_only.

I don't think "would" is what you mean here. If you say something
would stick around, that means it could be that way but isn't. ("I
would go to the store and buy some apples, but I know they don't have
any so there's no point.") I think you mean "will".

-	printf(_("  -Z, --compress={[{client,server}-]gzip,lz4,none}[:LEVEL] or [LEVEL]\n"
-			 "                 compress tar output with given compression method or level\n"));
+	printf(_("  -Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL]\n"));
+	printf(_("  -Z, --compress=none\n"));

You deleted a line that you should have preserved here.

Overall there doesn't seem to be much to complain about here on a
first read-through. It will be good if we can also fix
CreateWalTarMethod to support LZ4 and ZSTD.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 3:21 PM Peter Geoghegan  wrote:
> On Wed, Feb 16, 2022 at 12:11 PM Robert Haas  wrote:
> > No, I think it's PostgreSQL 13, because before the vacuum failsafe
> > thing you could end up truncating enough tables during vacuum
> > operations to actually wrap around.
>
> Why wouldn't the xidStopLimit thing prevent actual incorrect answers
> to queries, even on Postgres 13? Why wouldn't that be enough, even if
> we make the most pessimistic possible assumptions?
>
> To me it looks like it's physically impossible to advance an XID past
> xidStopLimit, unless you're in single user mode. Does your concern
> have something to do with the actual xidStopLimit value in shared
> memory not being sufficiently protective in practice?

No, what I'm saying is that people running older versions routinely
run VACUUM in single-user mode because otherwise it fails due to the
truncation issue. But once they go into single-user mode they lose
protection.

> > And even in 14+, you can still do that, if you use single user mode.
>
> So what you're saying is that there is *some* reason for vacuuming in
> single user mode after all, and so we should keep the advice about
> that in place?   :-)

We could perhaps amend the text slightly, e.g. "This is a great idea
if you like pain."

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: killing perl2host

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 15:46:28 -0500, Andrew Dunstan wrote:
> I suggest that we apply this patch:

+1

> and if nothing breaks in a few days I will set about a more thorough
> removal of perl2host() and adjusting everywhere it's called, and we can
> forget that the whole sorry mess ever happened :-)

I think we would need an error check against using an msys perl when targeting
native windows, otherwise this'll be too hard to debug.

And we probably should set that environment variable that might fix in-tree
tests? Or reject in-tree builds?

Greetings,

Andres Freund




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
> Did you check whether this is a problem recently introduced or long-lived?

I've reproduced it back to v9.3.  I'm assuming it's much older than that.

> Does USE_BARRIER_SMGRRELEASE actually prevent this problem? Or was it just
> that it influences the timing in a way that makes it less likely?

I think it just influences the timing.  I believe the WAL pre-allocation
stuff triggers the issue because it adds a step between
AbsorbSyncRequests() and incrementing the started counter.

> ISTM that the problem is partially caused by having multiple "checkpoint"
> counters that are used in different ways, but then only waiting for one of
> them. I wonder whether a better approach could be to either unify the
> different counters, or to request / wait for the sync counter specifically?
> 
> Historically the sync stuff was something in md.c that the global system
> didn't really know anything about, but now it's a "proper" subsystem, so we
> can change the separation of concerns a bit more.

If the counters were unified, I think we might still need to do an extra
absorb after incrementing it, and we'd need to make sure that all of those
requests were tagged with the previous counter value so that they are
processed in the current checkpoint.  If callers requested with a specific
counter value, they might need to lock the counter (ckpt_lck) when making
requests.  Maybe that is okay.

>> Here's a patch that adds a call to AbsorbSyncRequests() in
>> CheckpointerMain() instead of SyncPreCheckpoint().
> 
> That doesn't strike me as great architecturally. E.g. in theory the same
> problem could exist in single user mode. I think it doesn't today, because
> RegisterSyncRequest() will effectively "absorb" it immediately, but that kind
> of feels like an implementation detail?

Yeah, maybe that is a reason to add an absorb somewhere within
CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
be sufficient for single-user mode if the requests were not absorbed
immediately.

[0] 
https://www.postgresql.org/message-id/attachment/130994/v1-0001-call-AbsorbSyncRequests-before-advancing-checkpoi.patch

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




killing perl2host

2022-02-16 Thread Andrew Dunstan
Largely following a recipe from Andres, I have migrated buildfarm
animals fairywren and jacana to a setup that shouldn't need (and in fact
won't be able to use) PostgreSQL::Test::Utils::perl2host(). AFAICT these
two are the only buildfarm animals that run TAP tests under msys.

See discussion at


The lesson to be learned, incidentally, is "Don't use the msys targeted
perl to run TAP tests".

I suggest that we apply this patch:


diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm
b/src/test/perl/PostgreSQL/Test/Utils.pm
index 57fcb24089..31e2b0315e 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -311,7 +311,7 @@ The returned path uses forward slashes but has no
trailing slash.
 sub perl2host
 {
    my ($subject) = @_;
-   return $subject unless $Config{osname} eq 'msys';
+   return $subject;
    if ($is_msys2)
    {
    # get absolute, windows type path


and if nothing breaks in a few days I will set about a more thorough
removal of perl2host() and adjusting everywhere it's called, and we can
forget that the whole sorry mess ever happened :-) I know a number of
people who will be overjoyed.


cheers


andrew

-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: initdb / bootstrap design

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-16 13:24:41 -0500, Tom Lane wrote:
>> I remembered the reason why it's done that way: if we replaced those
>> values during genbki.pl, the contents of postgres.bki would become
>> architecture-dependent, belying its distribution as a "share" file.

> Hm. Architecturally I still would like to move it to be processed server
> side. I'd like to eventually get rid of single user mode (but keep bootstrap,
> at least for longer).
> Seems we could make NAMEDATALEN, FLOAT8PASSBYVAL, and ALIGNOF_POINTER
> stuff that bootparse knows about? And remove the need for
> POSTGRES, ENCODING, LC_COLLATE, LC_CTYPE as discussed already?

Yeah, I have no objection to doing it that way.  It should be possible
to do those substitutions on a per-field basis, which'd be cleaner than
what initdb does now ...

regards, tom lane




Re: Time to drop plpython2?

2022-02-16 Thread Andres Freund
On 2022-02-16 15:05:36 -0500, Chapman Flack wrote:
> On 02/16/22 14:58, Andres Freund wrote:
> >> "The minimum required version is Python 3.2 or
> >> later."
> > 
> > I stared for a bit, and I just don't see the redundancy?
> 
> "minimum ... or later" maybe?

Ah. Thanks.




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 12:11 PM Robert Haas  wrote:
> No, I think it's PostgreSQL 13, because before the vacuum failsafe
> thing you could end up truncating enough tables during vacuum
> operations to actually wrap around.

Why wouldn't the xidStopLimit thing prevent actual incorrect answers
to queries, even on Postgres 13? Why wouldn't that be enough, even if
we make the most pessimistic possible assumptions?

To me it looks like it's physically impossible to advance an XID past
xidStopLimit, unless you're in single user mode. Does your concern
have something to do with the actual xidStopLimit value in shared
memory not being sufficiently protective in practice?

> And even in 14+, you can still do that, if you use single user mode.

So what you're saying is that there is *some* reason for vacuuming in
single user mode after all, and so we should keep the advice about
that in place?   :-)

-- 
Peter Geoghegan




Re: check-world has suddenly started spewing stuff on stderr

2022-02-16 Thread Heikki Linnakangas

On 16/02/2022 19:04, Tom Lane wrote:

$ make check-world >/dev/null
2022-02-16 11:57:47.548 EST [3339702] LOG:  database system was not properly shut down; automatic recovery in progress
2022-02-16 11:57:47.612 EST [3339713] LOG:  database system was not properly shut down; automatic recovery in progress
2022-02-16 11:57:59.379 EST [3345366] LOG:  database system was not properly shut down; automatic recovery in progress
2022-02-16 11:57:59.445 EST [3345429] LOG:  database system was not properly shut down; automatic recovery in progress

Whoever is responsible for that, please fix it.


That's on me. 'initdb' started to print those after the xlog.c splitting patch.


Something went wrong in bootstrapping apparently. I'll investigate tomorrow.

- Heikki




Re: Lowering the ever-growing heap->pd_lower

2022-02-16 Thread Matthias van de Meent
On Wed, 16 Feb 2022 at 20:54, Peter Geoghegan  wrote:
>
> On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
>  wrote:
> > Peter Geoghegan asked for good arguments for the two changes
> > implemented. Below are my arguments detailed, with adversarial loads
> > that show the problematic behaviour of the line pointer array that is
> > fixed with the patch.
>
> Why is it okay that lazy_scan_prune() still calls
> PageGetMaxOffsetNumber() once for the page, before it ever calls
> heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff
> now, if only so that its scan-page-items loop doesn't get confused
> when it goes on to read "former line pointers"? This is certainly
> possible with the CLOBBER_FREED_MEMORY stuff in place (which will
> memset the truncated line pointer space with a 0x7F7F7F7F pattern).

Good catch, it is not. Attached a version that re-establishes maxoff
after each prune operation.

-Matthias


v8-0001-Improve-application-of-line-pointer-array-truncat.patch
Description: Binary data


Re: do only critical work during single-user vacuum?

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 2:18 PM Peter Geoghegan  wrote:
> It seems as if the advice about single user mode persisted for no
> great reason at all. Technically there were some remaining reasons to
> keep it around (like the truncation thing), but overall these
> secondary reasons could have been addressed much sooner if somebody
> had thought about it.

I raised it on the list a couple of years ago, actually. I think I had
a bit of difficulty convincing people that it wasn't something we had
to keep recommending.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 1:28 PM Peter Geoghegan  wrote:
> On Wed, Feb 16, 2022 at 10:18 AM Andres Freund  wrote:
> > > I'm pretty sure that some people believe that wraparound can cause
> > > actual data corruption
> >
> > Well, historically they're not wrong.
>
> True, but the most recent version where that's actually possible is
> PostgreSQL 8.0, which was released in early 2005. That was a very
> different time for the project. I don't think that people believe that
> wraparound can cause data corruption because they remember a time when
> it really could. It seems like general confusion to me (which could
> have been avoided).

No, I think it's PostgreSQL 13, because before the vacuum failsafe
thing you could end up truncating enough tables during vacuum
operations to actually wrap around.

And even in 14+, you can still do that, if you use single user mode.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Time to drop plpython2?

2022-02-16 Thread Chapman Flack
On 02/16/22 14:58, Andres Freund wrote:
>> "The minimum required version is Python 3.2 or
>> later."
> 
> I stared for a bit, and I just don't see the redundancy?

"minimum ... or later" maybe?

Regards,
-Chap




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-16 Thread Matthias van de Meent
On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
 wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) 
> > > before the checkpoint start, instead of just emitting the stats at the 
> > > end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total 
> > > buffers to sync, checkpointer pid, substep it is running, whether it is 
> > > on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync 
> > > progress locally, and we may need some refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files without traversing the directory. It is not worth
> calculating that, as it affects the performance of the checkpoint. I also
> considered just mentioning the number of files processed, but this
> won't give meaningful progress information (it can be treated as
> statistics). Hence just updating the phase field in those scenarios.
>
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered a CHECKPOINT command, checkpoint start location, filed
> to indicate whether it is a checkpoint or restartpoint and elapsed
> time of the checkpoint operation. Please share your thoughts. I would
> be happy to add any other information that contributes to showing the
> progress of checkpoint.
>
> As per the discussion in this thread, there should be some mechanism
> to show the progress of checkpoint during shutdown and end-of-recovery
> cases as we cannot access pg_stat_progress_checkpoint in those cases.
> I am working on this to use log_startup_progress_interval mechanism to
> log the progress in the server logs.
>
> Kindly review the patch and share your thoughts.

Interesting idea, and overall a nice addition to the
pg_stat_progress_* reporting infrastructure.

Could you add your patch to the current commitfest at
https://commitfest.postgresql.org/37/?

See below for some comments on the patch:

> xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, 
> checkpoint_progress_end
> +/* In bootstrap mode, we don't actually record anything. */
> +if (IsBootstrapProcessingMode())
> +return;

Why do you check against the state of the system?
pgstat_progress_update_* already provides protections against updating
the progress tables if the progress infrastructure is not loaded; and
otherwise (in the happy path) the cost of updating the progress fields
will be quite a bit higher than normal. Updating stat_progress isn't
very expensive (quite cheap, really), so I don't quite get why you
guard against reporting stats when you expect no other client to be
listening.

I think you can simplify this a lot by directly using
pgstat_progress_update_param() instead.

> xlog.c @ checkpoint_progress_start
> +pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, 
> InvalidOid);
> +checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> + PROGRESS_CHECKPOINT_PHASE_INIT);
> +if (flags & 

Re: Time to drop plpython2?

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 15:43:19 +0100, Peter Eisentraut wrote:
> On 15.02.22 22:40, Andres Freund wrote:
> > 1: plpython: Reject Python 2 during build configuration.
> 
> There is a bit of redundancy in the new wording in installation.sgml:
> 
> "The minimum required version is Python 3.2 or
> later."

I stared for a bit, and I just don't see the redundancy?


> There is also a small bit in install-windows.sgml that would be worth
> updating:
> 
> 
> $config->{python} = 'c:\python26';
> 

Yea. I saw that but ended up with not updating it, because it already was out
of date ;). But I guess I'll just put 310 there or something.


> Other than that, this looks fine to me, and we should move forward with this
> now.

Cool, will apply 1) later today.


> > 2: plpython: Remove plpythonu, plpython2u extensions.
> > 3: plpython: Remove regression test infrastructure for Python 2.
> 
> The removal of the Python 2 contrib extension variants ended up in patch 3,
> which seems incorrect.

Uh, huh. That was a mistake, not sure what happened...

Greetings,

Andres Freund




Re: Lowering the ever-growing heap->pd_lower

2022-02-16 Thread Peter Geoghegan
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
 wrote:
> Peter Geoghegan asked for good arguments for the two changes
> implemented. Below are my arguments detailed, with adversarial loads
> that show the problematic behaviour of the line pointer array that is
> fixed with the patch.

Why is it okay that lazy_scan_prune() still calls
PageGetMaxOffsetNumber() once for the page, before it ever calls
heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff
now, if only so that its scan-page-items loop doesn't get confused
when it goes on to read "former line pointers"? This is certainly
possible with the CLOBBER_FREED_MEMORY stuff in place (which will
memset the truncated line pointer space with a 0x7F7F7F7F pattern).

-- 
Peter Geoghegan




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-16 Thread Matthias van de Meent
On Tue, 15 Feb 2022 at 13:16, Nitin Jadhav
 wrote:
>
> > Apart from above fields, I am planning to add few more fields to the
> > view in the next patch. That is, process ID of the backend process
> > which triggered a CHECKPOINT command, checkpoint start location, filed
> > to indicate whether it is a checkpoint or restartpoint and elapsed
> > time of the checkpoint operation. Please share your thoughts. I would
> > be happy to add any other information that contributes to showing the
> > progress of checkpoint.
>
> The progress reporting mechanism of postgres uses the
> 'st_progress_param' array of 'PgBackendStatus' structure to hold the
> information related to the progress. There is a function
> 'pgstat_progress_update_param()' which takes 'index' and 'val' as
> arguments and updates the 'val' to corresponding 'index' in the
> 'st_progress_param' array. This mechanism works fine when all the
> progress information is of type integer as the data type of
> 'st_progress_param' is of type integer. If the progress data is of a
> different type than integer, then there is no easy way to do so.

Progress parameters are int64, so all of the new 'checkpoint start
location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
time' (store as start time in stat_progress, timestamp fits in 64
bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
current stat_progress parameter. Some processing would be required at
the view, but that's not impossible to overcome.

Kind regards,

Matthias van de Meent




Re: initdb / bootstrap design

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 13:24:41 -0500, Tom Lane wrote:
> I remembered the reason why it's done that way: if we replaced those
> values during genbki.pl, the contents of postgres.bki would become
> architecture-dependent, belying its distribution as a "share" file.
> While we don't absolutely have to continue treating postgres.bki
> as architecture-independent, I'm skeptical that there's enough win
> here to justify a packaging change.

Hm. Architecturally I still would like to move it to be processed server
side. I'd like to eventually get rid of single user mode (but keep bootstrap,
at least for longer).

Seems we could make NAMEDATALEN, FLOAT8PASSBYVAL, ALIGNOF_POINTER stuff
that bootparse knows about? And remove the need for
POSTGRES, ENCODING, LC_COLLATE, LC_CTYPE as discussed already?


> initdb is already plenty fast enough for any plausible production
> usage; it's cases like check-world where we wish it were faster.

It's not just our own usage though. I've seen it be a noticeable time in test
suites of applications using postgres. And that's not really addressable with
the template approach, unless we want to move use of the template database
into initdb itself. I've thought about it, but then we'd need to do a lot more
than if it's just for our own tests.


> So I'm thinking what we really ought to pursue is the idea that's
> been kicked around more than once of capturing the post-initdb
> state of a cluster's files and just doing "cp -a" to duplicate that
> later in the test run.

Yea, we should pursue that independently of improving initdb's architecture /
speed. initdb will never be as fast as copying files around.

I kind of got stuck on how to deal with install.pl / vcregress.pl. For make
it's easy enough to create the template during temp-install. But for
the msvc stuff it's less clear when / where to create the template
database. Nearly everyone uses NO_TEMP_INSTALL on windows, because install is
so slow and happens in every test. But right now there's no command to create
the "temp" installation. Probably need something like a 'temp-install' command
for vcregress.pl and then convert the buildfarm to use that.

Greetings,

Andres Freund




Re: Observability in Postgres

2022-02-16 Thread Michael Banck
Hi,

On Wed, Feb 16, 2022 at 12:48:11AM -0500, Greg Stark wrote:
> But on further thought I think what you're asking is whether there are
> standard database metrics and standard names for them. A lot of this
> work has already been done with pg_exporter but it is worth looking at
> other database software and see if there are opportunities to
> standardize metrics naming for across databases.

Can you clarify what exactly you mean with "pg_exporter", I think you
mentioned it upthread as well?

https://github.com/Vonng/pg_exporter (90 GH stars, never heard of it) ?
https://github.com/prometheus-community/postgres_exporter (1.6k GH stars) ?

Something else?


Michael

-- 
Michael Banck
Teamleiter PostgreSQL-Team
Projektleiter
Tel.: +49 2166 9901-171
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Geoff Richardson, Peter Lilley

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 09:37:21 -0800, Nathan Bossart wrote:
> I've also figured out a
> way to reproduce the issue without the pre-allocation patches applied:
> 
> 1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
>increment ckpt_started.
> 2. In session 1, run the following commands:
>   a. CREATE TABLESPACE test LOCATION '/path/to/dir';
>   b. CREATE TABLE test TABLESPACE test AS SELECT 1;
> 3. In session 2, start a checkpoint.
> 4. In session 1, run these commands:
>   a. ALTER TABLE test SET TABLESPACE pg_default;
>   b. DROP TABLESPACE test;  -- fails
>   c. DROP TABLESPACE test;  -- succeeds

Did you check whether this is a problem recently introduced or long-lived?

Does USE_BARRIER_SMGRRELEASE actually prevent this problem? Or was it just
that it influences the timing in a way that makes it less likely?


ISTM that the problem is partially caused by having multiple "checkpoint"
counters that are used in different ways, but then only waiting for one of
them. I wonder whether a better approach could be to either unify the
different counters, or to request / wait for the sync counter specifically?

Historically the sync stuff was something in md.c that the global system
didn't really know anything about, but now it's a "proper" subsystem, so we
can change the separation of concerns a bit more.


> Here's a patch that adds a call to AbsorbSyncRequests() in
> CheckpointerMain() instead of SyncPreCheckpoint().

That doesn't strike me as great architecturally. E.g. in theory the same
problem could exist in single user mode. I think it doesn't today, because
RegisterSyncRequest() will effectively "absorb" it immediately, but that kind
of feels like an implementation detail?


Greetings,

Andres Freund




Re: Observability in Postgres

2022-02-16 Thread Jacob Champion
On Wed, 2022-02-16 at 02:10 -0500, Greg Stark wrote:
> On Tue, 15 Feb 2022 at 17:37, Magnus Hagander  wrote:
> 
> > But I think you'll run into a different problem much earlier. Pretty
> > much everything out there is going to want to speak http(s). How are
> > you going to terminate that, especially https, on the same port as a
> > PostgreSQL connection? PostgreSQL will have to reply with it's initial
> > negotiating byte before anything else is done, including the TLS
> > negotiation, and that will kill anything http.
> 
> Yeah this is a serious problem. I think there are other even more
> compelling reasons someone else was already looking at this so I'm
> kind of hoping it solves itself :)

Yeah, this seems like a shoo-in with implicit TLS support and ALPN. So
hopefully we can help that piece solve itself. :)

That said, I feel like I should probably advise against forwarding HTTP
through Postgres. With implicit TLS you should be able to run a reverse
proxy out front, which could check the ALPN and redirect traffic to the
bgworker port as needed. (I don't think you have to terminate TLS in
order to do this -- so channel bindings et al should be unaffected --
but I don't have experience with that.)

So Postgres wouldn't have to touch HTTP traffic at all, and the
bgworker extension can upgrade its HTTP stack completely independently.
(And if you don't want to share ports, you don't have to deploy the
proxy at all.)

--Jacob


Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 10:27 AM Peter Geoghegan  wrote:
> True, but the most recent version where that's actually possible is
> PostgreSQL 8.0, which was released in early 2005.

It just occurred to me that the main historic reason for the single
user mode advice was the lack of virtual XIDs. The commit that added
the xidStopLimit behavior (commit 60b2444cc3) came a couple of years
before the introduction of virtual transaction IDs (in commit
295e63983d). AFAICT, the advice about single-user mode was added at a
time where exceeding xidStopLimit caused the system to grind to a halt
completely -- even trivial SELECTs would have failed once Postgres 8.1
crossed the xidStopLimit limit.

It seems as if the advice about single user mode persisted for no
great reason at all. Technically there were some remaining reasons to
keep it around (like the truncation thing), but overall these
secondary reasons could have been addressed much sooner if somebody
had thought about it.

-- 
Peter Geoghegan




Re: Allow root ownership of client certificate key

2022-02-16 Thread David Steele

Hi Tom,

On 1/18/22 14:41, Tom Lane wrote:

David Steele  writes:

[ client-key-perm-002.patch ]


I took a quick look at this and agree with the proposed behavior
change, but also with your self-criticisms:


We may want to do the same on the server side to make the code blocks
look more similar.

Also, on the server side the S_ISREG() check gets its own error and that
might be a good idea on the client side as well. As it is, the error
message on the client is going to be pretty confusing in this case.


Particularly, I think the S_ISREG check should happen before any
ownership/permissions checks; it just seems saner that way.


The two blocks of code now look pretty much identical except for error 
handling and the reference to the other file. Also, the indentation for 
the comment on the server side is less but I kept the comment formatting 
the same to make it easier to copy the comment back and forth.



The only other nitpick I have is that I'd make the cross-references be
to the two file names, ie like "Note that similar checks are performed
in fe-secure-openssl.c ..."  References to the specific functions seem
likely to bit-rot in the face of future code rearrangements.
I suppose filename references could become obsolete too, but it
seems less likely.


Updated these to reference the file instead of the function.

I still don't think we can commit the test for root ownership, but 
testing it manually got a whole lot easier after the refactor in 
c3b34a0f. Before that you had to hack up the source tree, which is a 
pain depending on how it is mounted (I'm testing in a container).


So, to test the new functionality, just add this snippet on line 57 of 
001_ssltests.pl:


chmod 0640, "$cert_tempdir/client.key"
or die "failed to change permissions on $cert_tempdir/client.key: $!";
system_or_bail("sudo chown root $cert_tempdir/client.key");

If you can think of a way to add this to the tests I'm all ears. Perhaps 
we could add these lines commented out and explain what they are for?


Regards,
-David

diff --git a/src/backend/libpq/be-secure-common.c b/src/backend/libpq/be-secure-common.c
index 7e9a64d08e..3f00040efe 100644
--- a/src/backend/libpq/be-secure-common.c
+++ b/src/backend/libpq/be-secure-common.c
@@ -143,6 +143,7 @@ check_ssl_key_file_permissions(const char *ssl_key_file, bool isServerStart)
return false;
}
 
+   /* Key file must be a regular file */
if (!S_ISREG(buf.st_mode))
{
ereport(loglevel,
@@ -153,10 +154,20 @@ check_ssl_key_file_permissions(const char *ssl_key_file, bool isServerStart)
}
 
/*
-* Refuse to load key files owned by users other than us or root.
-*
-* XXX surely we can check this on Windows somehow, too.
-*/
+   * Refuse to load key files owned by users other than us or root and
+   * require no public access to the key file. If the file is owned by us,
+   * require mode 0600 or less. If owned by root, require 0640 or less to
+   * allow read access through our gid, or a supplementary gid that allows
+   * us to read system-wide certificates.
+   *
+   * Note that similar checks are performed in
+   * src/interfaces/libpq/fe-secure-openssl.c so any changes here may need to
+   * be made there as well.
+   *
+   * Ideally we would do similar permissions checks on Windows but it is
+   * not clear how that would work since unix-style permissions may not be
+   * available.
+   */
 #if !defined(WIN32) && !defined(__CYGWIN__)
if (buf.st_uid != geteuid() && buf.st_uid != 0)
{
@@ -166,20 +177,7 @@ check_ssl_key_file_permissions(const char *ssl_key_file, bool isServerStart)
ssl_key_file)));
return false;
}
-#endif
 
-   /*
-* Require no public access to key file. If the file is owned by us,
-* require mode 0600 or less. If owned by root, require 0640 or less to
-* allow read access through our gid, or a supplementary gid that allows
-* to read system-wide certificates.
-*
-* XXX temporarily suppress check when on Windows, because there may not
-* be proper support for Unix-y file permissions.  Need to think of a
-* reasonable check to apply on Windows.  (See also the data directory
-* permission check in postmaster.c)
-*/
-#if !defined(WIN32) && !defined(__CYGWIN__)
if ((buf.st_uid == geteuid() && buf.st_mode & (S_IRWXG | S_IRWXO)) ||
(buf.st_uid == 0 && buf.st_mode & (S_IWGRP | S_IXGRP | S_IRWXO)))
{
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index f6e563a2e5..59e3224f97 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -1245,11 +1245,45 @@ initialize_SSL(PGconn *conn)
 

Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 10:18 AM Andres Freund  wrote:
> > I'm pretty sure that some people believe that wraparound can cause
> > actual data corruption
>
> Well, historically they're not wrong.

True, but the most recent version where that's actually possible is
PostgreSQL 8.0, which was released in early 2005. That was a very
different time for the project. I don't think that people believe that
wraparound can cause data corruption because they remember a time when
it really could. It seems like general confusion to me (which could
have been avoided).

At a minimum, we ought to be very clear on the fact that Postgres
isn't going to just let your database become corrupt in some more or
less predictable way. The xidStopLimit thing is pretty bad, but it's
still much better than that.

-- 
Peter Geoghegan




Re: initdb / bootstrap design

2022-02-16 Thread Tom Lane
Andres Freund  writes:
> Sure, there's a few tokens that we replace in initdb. As it turns out there's
> only two rows that are actually variable. The username of the initial
> superuser in pg_authid and the pg_database row for template 1, where encoding,
> lc_collate and lc_ctype varies. The rest is all compile time constant
> replacements we could do as part of genbki.pl.

I remembered the reason why it's done that way: if we replaced those
values during genbki.pl, the contents of postgres.bki would become
architecture-dependent, belying its distribution as a "share" file.
While we don't absolutely have to continue treating postgres.bki
as architecture-independent, I'm skeptical that there's enough win
here to justify a packaging change.

initdb is already plenty fast enough for any plausible production
usage; it's cases like check-world where we wish it were faster.
So I'm thinking what we really ought to pursue is the idea that's
been kicked around more than once of capturing the post-initdb
state of a cluster's files and just doing "cp -a" to duplicate that
later in the test run.

regards, tom lane




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 10:14:19 -0800, Peter Geoghegan wrote:
> Absolutely -- couldn't agree more. Do you think it's worth targeting
> 14 here, or just HEAD?

I'd go for HEAD first, but wouldn't protest against 14.


> I'm pretty sure that some people believe that wraparound can cause
> actual data corruption

Well, historically they're not wrong. And we've enough things stored in 32bit
counters that I'd be surprised if we didn't have more wraparound issues. Of
course that's not related to anti-wrap vacuums...

Greetings,

Andres Freund




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 9:56 AM Robert Haas  wrote:
> +1. But I think we might want to try to write documentation around
> this. We should explicitly tell people NOT to use single-user mode,
> because that stupid message has been there for a long time and a lot
> of people have probably internalized it by now. And we should also
> tell them that they SHOULD check for prepared transactions, old
> replication slots, etc.

Absolutely -- couldn't agree more. Do you think it's worth targeting
14 here, or just HEAD?

I'm pretty sure that some people believe that wraparound can cause
actual data corruption, in part because of the way the docs present
the information. The system won't do that, of course (precisely
because of this xidStopLimit behavior). The docs make it all sound
absolutely terrifying, which doesn't seem proportionate to me (at
least not with this stuff in place, maybe not ever).

-- 
Peter Geoghegan




Re: PGEventProcs must not be allowed to break libpq

2022-02-16 Thread Tom Lane
I wrote:
> ... more generally, it seems to me that allowing a failing PGEventProc
> to cause this to happen is just not sane.  It breaks absolutely
> every guarantee you might think we have about how libpq will behave.
> As an example that seems very plausible currently, if an event proc
> doesn't know what a PGRES_PIPELINE_SYNC result is and fails on it,
> will the application see behavior that's even a little bit sane?
> I don't think so --- it will think the error results are server
> failures, and then be very confused when answers arrive anyway.

Attached are two proposed patches addressing this.  The first one
turns RESULTCREATE and RESULTCOPY events into pure observers,
ie failure of an event procedure doesn't affect the overall
processing of a PGresult.  I think this is necessary if we want
to be able to reason at all about how libpq behaves.  Event
procedures do still have the option to report failure out to the
application in some out-of-band way, such as via their passThrough
argument.  But they can't break what libpq itself does.

The second patch turns CONNRESET events into pure observers.  While
I'm slightly less hot about making that change, the existing behavior
seems very poorly thought-out, not to mention untested.  Notably,
the code there changes conn->status to CONNECTION_BAD without
closing the socket, which is unlike any other post-connection failure
path; so I wonder just how well that'd work if it were exercised in
anger.

Comments, objections?

regards, tom lane

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index e0ab7cd555..40c39feb7d 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -6831,6 +6831,7 @@ PGresult *PQcopyResult(const PGresult *src, int flags);
   PG_COPYRES_EVENTS specifies copying the source
   result's events.  (But any instance data associated with the source
   is not copied.)
+  The event procedures receive PGEVT_RESULTCOPY events.
  
 

@@ -7126,7 +7127,7 @@ defaultNoticeProcessor(void *arg, const char *message)
,
,
 and
-   PQsetResultInstanceData functions.  Note that
+   PQresultSetInstanceData functions.  Note that
unlike the pass-through pointer, instance data of a PGconn
is not automatically inherited by PGresults created from
it.  libpq does not know what pass-through
@@ -7154,7 +7155,7 @@ defaultNoticeProcessor(void *arg, const char *message)
is called.  It is the ideal time to initialize any
instanceData an event procedure may need.  Only one
register event will be fired per event handler per connection.  If the
-   event procedure fails, the registration is aborted.
+   event procedure fails (returns zero), the registration is cancelled.
 
 
 typedef struct
@@ -7261,11 +7262,11 @@ typedef struct
conn is the connection used to generate the
result.  This is the ideal place to initialize any
instanceData that needs to be associated with the
-   result.  If the event procedure fails, the result will be cleared and
-   the failure will be propagated.  The event procedure must not try to
-   PQclear the result object for itself.  When returning a
-   failure code, all cleanup must be performed as no
-   PGEVT_RESULTDESTROY event will be sent.
+   result.  If an event procedure fails (returns zero), that event
+   procedure will be ignored for the remaining lifetime of the result;
+   that is, it will not receive PGEVT_RESULTCOPY
+   or PGEVT_RESULTDESTROY events for this result or
+   results copied from it.
   
  
 
@@ -7295,12 +7296,12 @@ typedef struct
src result is what was copied while the
dest result is the copy destination.  This event
can be used to provide a deep copy of instanceData,
-   since PQcopyResult cannot do that.  If the event
-   procedure fails, the entire copy operation will fail and the
-   dest result will be cleared.   When returning a
-   failure code, all cleanup must be performed as no
-   PGEVT_RESULTDESTROY event will be sent for the
-   destination result.
+   since PQcopyResult cannot do that.  If an event
+   procedure fails (returns zero), that event procedure will be
+   ignored for the remaining lifetime of the new result; that is, it
+   will not receive PGEVT_RESULTCOPY
+   or PGEVT_RESULTDESTROY events for that result or
+   results copied from it.
   
  
 
@@ -7618,7 +7619,7 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
 mydata *res_data = dup_mydata(conn_data);
 
 /* associate app specific data with result (copy it from conn) */
-PQsetResultInstanceData(e->result, myEventProc, res_data);
+PQresultSetInstanceData(e->result, myEventProc, res_data);
 break;
 }
 
@@ -7629,7 +7630,7 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
 

improve --with-lz4 install documentation

2022-02-16 Thread Jeevan Ladhe
Now that lz4 is also used for backup compression and decompression, and
more enhancements may follow here, I think it is better to make the
--with-lz4 description generic rather than listing specific features
in there.

Attached is the patch to remove the specifics related to WAL and TOAST
compression.

Regards,
Jeevan Ladhe


improve-with-lz4-install-documentation.patch
Description: Binary data


Re: do only critical work during single-user vacuum?

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 12:51 PM Peter Geoghegan  wrote:
> Good news!

+1. But I think we might want to try to write documentation around
this. We should explicitly tell people NOT to use single-user mode,
because that stupid message has been there for a long time and a lot
of people have probably internalized it by now. And we should also
tell them that they SHOULD check for prepared transactions, old
replication slots, etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: initdb / bootstrap design

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 11:47:31 +0100, Peter Eisentraut wrote:
> On 16.02.22 03:12, Andres Freund wrote:
> > Sure, there's a few tokens that we replace in initdb. As it turns out 
> > there's
> > only two rows that are actually variable. The username of the initial
> > superuser in pg_authid and the pg_database row for template 1, where 
> > encoding,
> > lc_collate and lc_ctype varies. The rest is all compile time constant
> > replacements we could do as part of genbki.pl.
> > 
> > It seems we could save a good number of context switches by opening
> > postgres.bki just before boot_yyparse() in BootstrapModeMain() and having 
> > the
> > parser read it.  The pg_authid / pg_database rows we could just do via
> > explicit insertions in BootstrapModeMain(), provided by commandline args?
> 
> I think we could do the locale setup by updating the pg_database row of
> template1 after bootstrap, as in the attached patch.

Another solution could be to have bootstrap create template0 instead of
template1. I think for template0 it'd more accurate to have a hardcoded C
collation and ascii encoding (which I don't think we actually have?).


> I suspect we could do the treatment of pg_authid similarly.

Yea.

Greetings,

Andres Freund




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 8:48 AM Masahiko Sawada  wrote:
> FYI, I've tested the situation that I assumed autovacuum can not
> correct the problem; when the system had already crossed xidStopLimit,
> it keeps failing to vacuum on tables that appear in the front of the
> list and have sufficient garbage to trigger the truncation but are not
> older than the failsafe limit. But contrary to my assumption, it did
> correct the problem since autovacuum continues to the next table in
> the list even after an error. This probably means that autovacuum
> eventually succeeds to process all tables that trigger the failsafe
> mode, ensuring advancing datfrozenxid, which is great.

Right; it seems as if the situation is much improved, even when the
failsafe didn't prevent the system from going over xidStopLimit. If
autovacuum alone can bring the system back to a normal state as soon
as possible, without a human needing to do anything special, then
clearly the general risk is much smaller. Even this worst case
scenario where "the failsafe has failed" is not so bad anymore, in
practice. I don't think that it really matters if some concurrent
non-emergency VACUUMs fail when attempting to truncate the table (it's
no worse than ANALYZE failing, for example).

Good news!

-- 
Peter Geoghegan




Re: refactoring basebackup.c

2022-02-16 Thread Jeevan Ladhe
Hi Everyone,

So, I went ahead and have now also implemented client side decompression
for zstd.

Robert separated[1] the ZSTD configure switch from my original patch
of server side compression and also added documentation related to
the switch. I have included that patch here in the patch series for
simplicity.

The server side compression patch
0002-ZSTD-add-server-side-compression-support.patch has also taken care
of Justin Pryzby's comments[2]. Also, made changes to pg_basebackup help
as suggested by Álvaro Herrera.

[1]
https://www.postgresql.org/message-id/CA%2BTgmobRisF-9ocqYDcMng6iSijGj1EZX99PgXA%3D3VVbWuahog%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/20220215175944.GY31460%40telsasoft.com

Regards,
Jeevan Ladhe

On Wed, 16 Feb 2022 at 21:46, Robert Haas  wrote:

> On Wed, Feb 16, 2022 at 11:11 AM Alvaro Herrera 
> wrote:
> > This is hard to interpret for humans though because of the nested
> > brackets and braces.  It gets considerably easier if you split it in
> > separate variants:
> >
> >-Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
> >-Z, --compress=LEVEL
> >-Z, --compress=none
> >  compress tar output with given compression
> method or level
> >
> >
> > or, if you choose to leave the level-only variant undocumented, then
> >
> >-Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
> >-Z, --compress=none
> >  compress tar output with given compression
> method or level
> >
> > There still are some nested brackets and braces, but the scope is
> > reduced enough that interpreting seems quite a bit simpler.
>
> I could go for that. I'm also just noticing that "none" is not really
> a compression method or level, and the statement that it can only
> compress "tar" output is no longer correct, because server-side
> compression can be used together with -Fp. So maybe we should change
> the sentence afterward to something a bit more generic, like "specify
> whether and how to compress the backup".
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


v12-0003-ZSTD-add-client-side-compression-support.patch
Description: Binary data


v2-0001-Add-support-for-building-with-ZSTD.patch
Description: Binary data


v12-0004-ZSTD-add-client-side-decompression-support.patch
Description: Binary data


v12-0002-ZSTD-add-server-side-compression-support.patch
Description: Binary data


Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 08:44:42AM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
>> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>>> It looks like register_unlink_segment() is called prior to the checkpoint,
>>> but the checkpointer is not calling RememberSyncRequest() until after
>>> SyncPreCheckpoint().  This means that the requests are registered with the
>>> next checkpoint cycle count, so they aren't processed until the next
>>> checkpoint.
>> 
>> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
>> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
>> out of the critical section in CreateCheckPoint().  Patch attached.
> 
> An alternative fix might be to call AbsorbSyncRequests() after increasing
> the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
> just before checkpointing where new requests are registered for the
> checkpoint following the one about to begin.

Here's a patch that adds a call to AbsorbSyncRequests() in
CheckpointerMain() instead of SyncPreCheckpoint().  I've also figured out a
way to reproduce the issue without the pre-allocation patches applied:

1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
   increment ckpt_started.
2. In session 1, run the following commands:
a. CREATE TABLESPACE test LOCATION '/path/to/dir';
b. CREATE TABLE test TABLESPACE test AS SELECT 1;
3. In session 2, start a checkpoint.
4. In session 1, run these commands:
a. ALTER TABLE test SET TABLESPACE pg_default;
b. DROP TABLESPACE test;  -- fails
c. DROP TABLESPACE test;  -- succeeds

With the attached patch applied, the first attempt at dropping the
tablespace no longer fails.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e9707dfde25eaa9c447032c4b5a61e3011141dc9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 16 Feb 2022 09:26:08 -0800
Subject: [PATCH v2 1/1] call AbsorbSyncRequests() after indicating checkpoint
 start

---
 src/backend/postmaster/checkpointer.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..e93d34b71f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -401,6 +401,18 @@ CheckpointerMain(void)
 
 			ConditionVariableBroadcast(&CheckpointerShmem->start_cv);
 
+			/*
+			 * Pick up any last minute requests.  DROP TABLESPACE schedules a
+			 * checkpoint to clean up any lingering files that are scheduled for
+			 * deletion.  If we don't absorb those requests now, they might not
+			 * be absorbed until after incrementing the checkpoint cycle
+			 * counter, so the files won't be deleted until the following
+			 * checkpoint.  By absorbing requests after indicating the
+			 * checkpoint has started, operations like DROP TABLESPACE can be
+			 * sure that the next checkpoint will clean up any such files.
+			 */
+			AbsorbSyncRequests();
+
 			/*
 			 * The end-of-recovery checkpoint is a real checkpoint that's
 			 * performed while we're still in recovery.
-- 
2.25.1



Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 18:04:14 +0900, Masahiko Sawada wrote:
> On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada  
> > wrote in
> > > Or it's possible that the process took a time to clean up the
> > > temporary replication slot?
> >
> > Checkpointer may take ReplicationSlotControlLock. Dead lock between
> > ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
> > happened?

A deadlock requires some form of incorrect lock (or lock-like) nesting. Do
you have an idea what that could be?


> That's possible. Whatever the exact cause of this failure, I think we
> can stabilize this test by adding a condition of application_name to
> the query.

I think the test is telling us that something may be broken. We shouldn't
silence that without at least some understanding of what it is.

It'd be good to try to reproduce this locally...

- Andres




check-world has suddenly started spewing stuff on stderr

2022-02-16 Thread Tom Lane
$ make check-world >/dev/null
2022-02-16 11:57:47.548 EST [3339702] LOG:  database system was not properly 
shut down; automatic recovery in progress
2022-02-16 11:57:47.612 EST [3339713] LOG:  database system was not properly 
shut down; automatic recovery in progress
2022-02-16 11:57:59.379 EST [3345366] LOG:  database system was not properly 
shut down; automatic recovery in progress
2022-02-16 11:57:59.445 EST [3345429] LOG:  database system was not properly 
shut down; automatic recovery in progress

Whoever is responsible for that, please fix it.

regards, tom lane




Re: Implementing Incremental View Maintenance

2022-02-16 Thread huyajun
Hi, Nagata-san
I am very interested in IMMV. I read your patch and have some comments on
v25-0007-Add-Incremental-View-Maintenance-support.patch that I would like to
discuss with you.

+   /* For IMMV, we need to rewrite matview query */
+   query = rewriteQueryForIMMV(query, into->colNames);
+   query_immv = copyObject(query);

/* Create triggers on incremental maintainable materialized view */
+   Assert(query_immv != NULL);
+   CreateIvmTriggersOnBaseTables(query_immv, 
matviewOid, true);
1. Do we need to copy the query? Is it okay for CreateIvmTriggersOnBaseTables
to directly use (Query *) into->viewQuery instead of query_immv, like
CreateIndexOnIMMV does? It seems only the planner may change the query, but
that shouldn't affect finding the correct base tables in
CreateIvmTriggersOnBaseTables.

+void
+CreateIndexOnIMMV(Query *query, Relation matviewRel)
+{
+   Query *qry = (Query *) copyObject(query);
2. Also, is it okay not to copy the query in CreateIndexOnIMMV? It seems we
only read the query there.

Regards,





Re: do only critical work during single-user vacuum?

2022-02-16 Thread Masahiko Sawada
On Wed, Feb 16, 2022 at 2:29 AM Peter Geoghegan  wrote:
>
> On Mon, Feb 14, 2022 at 10:04 PM John Naylor
>  wrote:
> > Well, the point of inventing this new vacuum mode was because I
> > thought that upon reaching xidStopLimit, we couldn't issue commands,
> > period, under the postmaster. If it was easier to get a test instance
> > to xidStopLimit, I certainly would have discovered this sooner.
>
> I did notice from my own testing of the failsafe (by artificially
> inducing wraparound failure using an XID burning C function) that
> autovacuum seemed to totally correct the problem, even when the system
> had already crossed xidStopLimit - it came back on its own. I wasn't
> completely sure of how robust this effect was, though.

FYI, I've tested the situation in which I assumed autovacuum could not
correct the problem: when the system has already crossed xidStopLimit, it
keeps failing to vacuum tables that appear at the front of the list and have
enough garbage to trigger truncation but are not older than the failsafe
limit. Contrary to my assumption, it did correct the problem, since
autovacuum continues to the next table in the list even after an error. This
probably means that autovacuum eventually succeeds in processing all tables
that trigger the failsafe mode, ensuring that datfrozenxid advances, which
is great.
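The behavior described above, where a per-table error does not stop the autovacuum cycle, can be sketched with a toy loop (plain Python, not PostgreSQL code; the function names are invented for illustration):

```python
def autovacuum_cycle(tables, vacuum):
    """Illustrative sketch: vacuum every table in the list, continuing
    past per-table errors, and return the tables that succeeded."""
    advanced = []
    for t in tables:
        try:
            vacuum(t)
        except RuntimeError:
            continue          # autovacuum logs the error and moves on
        advanced.append(t)
    return advanced

def vacuum(t):
    # Stand-in for a table whose vacuum keeps failing (e.g. during
    # truncation); every other table is processed normally.
    if t == "broken":
        raise RuntimeError("could not vacuum")

# The failing table does not block the ones behind it in the list:
assert autovacuum_cycle(["broken", "t1", "t2"], vacuum) == ["t1", "t2"]
```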

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Nathan Bossart
On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>> It looks like register_unlink_segment() is called prior to the checkpoint,
>> but the checkpointer is not calling RememberSyncRequest() until after
>> SyncPreCheckpoint().  This means that the requests are registered with the
>> next checkpoint cycle count, so they aren't processed until the next
>> checkpoint.
> 
> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
> out of the critical section in CreateCheckPoint().  Patch attached.

An alternative fix might be to call AbsorbSyncRequests() after increasing
the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
just before checkpointing where new requests are registered for the
checkpoint following the one about to begin.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: refactoring basebackup.c

2022-02-16 Thread Robert Haas
On Wed, Feb 16, 2022 at 11:11 AM Alvaro Herrera  wrote:
> This is hard to interpret for humans though because of the nested
> brackets and braces.  It gets considerably easier if you split it in
> separate variants:
>
>-Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
>-Z, --compress=LEVEL
>-Z, --compress=none
>  compress tar output with given compression method or 
> level
>
>
> or, if you choose to leave the level-only variant undocumented, then
>
>-Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
>-Z, --compress=none
>  compress tar output with given compression method or 
> level
>
> There still are some nested brackets and braces, but the scope is
> reduced enough that interpreting seems quite a bit simpler.

I could go for that. I'm also just noticing that "none" is not really
a compression method or level, and the statement that it can only
compress "tar" output is no longer correct, because server-side
compression can be used together with -Fp. So maybe we should change
the sentence afterward to something a bit more generic, like "specify
whether and how to compress the backup".
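Purely as an illustration of the grammar under discussion (this is not pg_basebackup's actual option parser), the variants can be parsed like so; treating a bare LEVEL as client-side gzip and an unspecified location as "client" are assumptions made here for the sketch:

```python
import re

def parse_compress(value):
    """Parse [{client|server}-]{gzip|lz4}[:LEVEL] | LEVEL | none.

    Returns (location, method, level).  Illustrative only; the defaults
    chosen for bare-LEVEL and missing-location inputs are assumptions.
    """
    if value == "none":
        return ("client", "none", None)
    if value.isdigit():                      # bare LEVEL, e.g. -Z5
        return ("client", "gzip", int(value))
    m = re.fullmatch(r"(?:(client|server)-)?(gzip|lz4)(?::(\d+))?", value)
    if m is None:
        raise ValueError(f"invalid compression spec: {value!r}")
    location, method, level = m.groups()
    return (location or "client", method, int(level) if level else None)

assert parse_compress("server-lz4") == ("server", "lz4", None)
assert parse_compress("client-gzip:9") == ("client", "gzip", 9)
assert parse_compress("5") == ("client", "gzip", 5)
assert parse_compress("none") == ("client", "none", None)
```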

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-16 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 12:43 AM John Naylor
 wrote:
> I'll put some effort in finding any way that it might not be robust.
> After that, changing the message and docs is trivial.

Great, thanks John.

-- 
Peter Geoghegan




Re: refactoring basebackup.c

2022-02-16 Thread Alvaro Herrera
On 2022-Feb-14, Robert Haas wrote:

> A more consistent way of writing the supported syntax would be like this:
> 
>   -Z, --compress={[{client|server}-]{gzip|lz4}[:LEVEL]|LEVEL|none}
> 
> I would be somewhat inclined to leave the level-only variant
> undocumented and instead write it like this:
> 
>   -Z, --compress={[{client|server}-]{gzip|lz4}[:LEVEL]|none}

This is hard to interpret for humans though because of the nested
brackets and braces.  It gets considerably easier if you split it in
separate variants:

   -Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
   -Z, --compress=LEVEL
   -Z, --compress=none
 compress tar output with given compression method or 
level


or, if you choose to leave the level-only variant undocumented, then

   -Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
   -Z, --compress=none
 compress tar output with given compression method or 
level

There still are some nested brackets and braces, but the scope is
reduced enough that interpreting seems quite a bit simpler.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



