Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Julien Rouhaud
On Wed, Feb 09, 2022 at 02:30:08PM -0500, Robert Haas wrote:
> On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian  wrote:
> > Well, I think the worst case is that the extra checkpoint happens exactly
> > midway between two scheduled checkpoints, so you are checkpointing twice as
> > often, but if it happens just before or after a checkpoint, I assume the
> > effect would be minimal.
> 
> I agree for the most part. I think that if checkpoints happen every 8
> minutes normally and the extra checkpoint happens 2 minutes after the
> previous checkpoint, the impact may be almost as bad as if it had
> happened right in the middle. If it happens 5 seconds after the
> previous checkpoint, it should be low impact.

But the extra checkpoints will be immediate, while on a properly configured
system they would be spread checkpoints.  That will add some more overhead.

> > So, it seems we are weighing having a checkpoint happen in the middle of
> > a checkpoint interval vs writing more WAL.  If the WAL traffic, without
> > CREATE DATABASE, is high, and the template database is small, writing
> > more WAL and skipping the checkpoint will be a win, but if the WAL traffic
> > is small and the template database is big, the extra WAL will be a loss.
> > Is this accurate?
> 
> I think that's basically correct. I would expect that the worry about
> big template databases is mostly about template databases that are
> REALLY big. I think if your template database is 10GB you probably
> shouldn't be worried about this feature. 10GB of extra WAL isn't
> nothing, but if you've got reasonably capable hardware, it's not
> overloaded, and max_wal_size is big enough, it's probably not going to
> have a huge impact. Also, most of the impact will probably be on the
> CREATE DATABASE command itself, and other things running on the system
> at the same time will be impacted to a lesser degree. I think it's
> even possible that you will be happier with this feature than without,
> because you may like the idea that CREATE DATABASE itself is slow more
> than you like the idea of it making everything else on the system
> slow. On the other hand, if your template database is 1TB, the extra
> WAL is probably going to be a fairly big problem.
> 
> Basically I think for most people this should be neutral or a win. For
> people with really large template databases, it's a loss. Hence the
> discussion about having a way for people who prefer the current
> behavior to keep it.

That extra WAL will also impact backups and replication.  You could have
fancy hardware, a read-mostly workload, and the need to replicate over a slow
WAN, and in that case the 10GB could be much more problematic.




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-09 Thread Nitin Jadhav
> > We need at least a trace of the number of buffers to sync (num_to_scan) 
> > before the checkpoint start, instead of just emitting the stats at the end.
> >
> > Bharat, it would be good to show the buffers synced counter and the total 
> > buffers to sync, checkpointer pid, substep it is running, whether it is on 
> > target for completion, checkpoint_Reason
> > (manual/times/forced). BufferSync has several variables tracking the sync 
> > progress locally, and we may need some refactoring here.
>
> I agree to provide the above-mentioned information as part of showing
> the progress of the current checkpoint operation. I am currently looking
> into the code to see if any other information can be added.

Here is the initial patch to show the progress of a checkpoint through
the pg_stat_progress_checkpoint view. Please find the attachment.

The information added to this view is:

- pid: process ID of the checkpointer process
- kind: kind of checkpoint, indicating the reason for the checkpoint
  (values can be wal, time or force)
- phase: the current phase of the checkpoint operation
- total_buffer_writes: total number of buffers to be written
- buffers_processed: number of buffers processed
- buffers_written: number of buffers written
- total_file_syncs: total number of files to be synced
- files_synced: number of files synced

Many operations happen as part of a checkpoint. For each of these
operations I update the phase field of the pg_stat_progress_checkpoint
view. The values supported for this field are: initializing,
checkpointing replication slots, checkpointing snapshots, checkpointing
logical rewrite mappings, checkpointing CLOG pages, checkpointing
CommitTs pages, checkpointing SUBTRANS pages, checkpointing MULTIXACT
pages, checkpointing SLRU pages, checkpointing buffers, performing sync
requests, performing two phase checkpoint, recycling old XLOG files and
finalizing. In the checkpointing buffers phase, the fields
total_buffer_writes, buffers_processed and buffers_written show the
detailed progress of writing buffers. In the performing sync requests
phase, the fields total_file_syncs and files_synced show the detailed
progress of syncing files. In the other phases, only the phase field is
updated; it is difficult to show progress there because we cannot get
the total file count without traversing the directory, and it is not
worth calculating that since it would affect the performance of the
checkpoint. I also considered just reporting the number of files
processed, but that would not give meaningful progress information (it
would merely be a statistic). Hence only the phase field is updated in
those scenarios.
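
For reference, this kind of view is driven through the existing
command-progress reporting API that backs the other pg_stat_progress_*
views; a rough sketch of the buffer-write phase is below (the
PROGRESS_COMMAND_CHECKPOINT and PROGRESS_CHECKPOINT_* symbols are
placeholder names for illustration, not final API):

    /* Sketch only: publish checkpoint progress via the progress API. */
    pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);

    /* Entering the "checkpointing buffers" phase; publish the target. */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
                                 PROGRESS_CHECKPOINT_PHASE_BUFFERS);
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
                                 num_to_scan);

    /* In BufferSync(), after each buffer is written out. */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
                                 ++num_written);

    /* Once the checkpoint completes, clear the view row. */
    pgstat_progress_end_command();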

Apart from the above fields, I am planning to add a few more fields to
the view in the next patch: the process ID of the backend which issued a
CHECKPOINT command, the checkpoint start location, a field to indicate
whether it is a checkpoint or a restartpoint, and the elapsed time of
the checkpoint operation. Please share your thoughts. I would be happy
to add any other information that contributes to showing the progress of
the checkpoint.

As per the discussion in this thread, there should be some mechanism to
show the progress of a checkpoint during the shutdown and
end-of-recovery cases, as we cannot access pg_stat_progress_checkpoint
then. I am working on using the log_startup_progress_interval mechanism
to log the progress in the server logs.
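
For the startup checkpoint this could reuse the existing
ereport_startup_progress() machinery, roughly as below (the message text
is only illustrative, and the progress-report timeout has to be armed
around the checkpoint for this to fire):

    /*
     * Sketch only: periodic progress line while writing checkpoint
     * buffers.  The macro fills in the elapsed seconds/hundredths for
     * the first two format placeholders.
     */
    ereport_startup_progress("checkpoint writing buffers, elapsed time: %ld.%02d s, written %d of %d",
                             num_written, num_to_scan);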

Kindly review the patch and share your thoughts.


On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
 wrote:
>
> On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
>  wrote:
> >
> > > I think the right choice to solve the *general* problem is the
> > > mentioned pg_stat_progress_checkpoints.
> > >
> > > We may want to *additionally* have the ability to log the progress
> > > specifically for the special cases when we're not able to use that
> > > view. And in those case, we can perhaps just use the existing
> > > log_startup_progress_interval parameter for this as well -- at least
> > > for the startup checkpoint.
> >
> > +1
> >
> > > We need at least a trace of the number of buffers to sync (num_to_scan) 
> > > before the checkpoint start, instead of just emitting the stats at the 
> > > end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total 
> > > buffers to sync, checkpointer pid, substep it is running, whether it is 
> > > on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync 
> > > progress locally, and we may need some refactoring here.
> >
> > I agree to provide the above-mentioned information as part of showing
> > the progress of the current checkpoint operation. I am currently looking
> > into the code to see if any other information can be added.
>
> As suggested in the other thread by Julien, I'm changing the subject
> of this thread to reflect the discussion.
>
> Regards,
> Bharath Rupireddy.



Re: Replacing TAP test planning with done_testing()

2022-02-09 Thread Kyotaro Horiguchi
At Thu, 10 Feb 2022 09:58:27 +0900, Michael Paquier  wrote 
in 
> On Wed, Feb 09, 2022 at 02:49:47PM -0500, Tom Lane wrote:
> > =?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
> >> Daniel Gustafsson  writes:
> >>> The attached patch removes all Test::More planning and instead ensures 
> >>> that all
> >>> tests conclude with a done_testing() call.  While there, I also removed a 
> >>> few
> >>> exit(0) calls from individual tests making them more consistent.
> > 
> >> LGTM, +1.
> > 
> > LGTM too.

+1. I was annoyed by the plan definitions, especially when adding
non-Windows-only tests.

> Not tested, but +1.  Could it be possible to backpatch that even if
> this could be qualified as only cosmetic?  Each time a test is
> backpatched we need to tweak the number of tests planned, and that may
> change slightly depending on the branch dealt with.

+1. I think that makes backpatching easier.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Make mesage at end-of-recovery less scary.

2022-02-09 Thread Kyotaro Horiguchi
At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma  
wrote in 
> On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
>  wrote:
> > This means archive-recovery is requested but not started yet. That is,
> > we've just finished crash recovery.  The existing comment cited
> > together is mentioning that.
> >
> > At the end of PITR (or archive recovery), the other code works.
> >
> 
> This is quite understandable, the point here is that the message that
> we are emitting says, we have just finished reading the wal files in
> the pg_wal directory during crash recovery and are now entering
> archive recovery when we are actually doing point-in-time recovery
> which seems a bit misleading.

Here are the messages.

> 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> recovery to WAL location (LSN) "0/5227790"
> 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> properly shut down; automatic recovery in progress
> 2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
> 2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at
> 0/3D0 on timeline 1 in pg_wal during crash recovery, entering
> archive recovery

In the first place, the last DEBUG1 is not mine, but one of the
messages added by this patch says the same thing.  Is your point that
archive recovery is a different thing from PITR?  Regarding the
difference, I think PITR is a form of archive recovery.

That being said, after some thought on this, I changed my mind: we
don't need to say what operation was being performed at the end of WAL.
So in the attached patch the end-of-WAL message is not accompanied by
the kind of recovery.

> LOG:  reached end of WAL at 0/300 on timeline 1

I removed the archive-source part along with the operation mode,
because it made the message untranslatable.  It is now very simple but
seems sufficient.

While working on this, I noticed that we need to set EndOfWAL when
WaitForWALToBecomeAvailable returns with failure.  That means the
file does not exist at all, so it is a kind of end-of-WAL.  In that
sense the following existing comment in ReadRecord is a bit wrong.

>* We only end up here without a message when XLogPageRead()
>* failed - in that case we already logged something. In
>* StandbyMode that only happens if we have been triggered, so we
>* shouldn't loop anymore in that case.

Actually, there is a case where we get there without a message and
without having logged anything: when a segment file is not found and
we're not in standby mode.
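
A simplified sketch of that change in XLogPageRead() (abbreviated from
xlog.c; the attached patch is authoritative):

    if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
                                     private->randAccess,
                                     private->fetching_ckpt,
                                     targetRecPtr))
    {
        /*
         * The requested segment does not exist at all, which is a kind
         * of end-of-WAL, so flag it instead of letting it surface as a
         * scary read error.
         */
        xlogreader->EndOfWAL = true;
        readFile = -1;
        return -1;
    }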

> > Well. I guess that the "automatic recovery" is ambiguous.  Does it
> > make sense if the second line were like the follows instead?
> >
> > + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not 
> > properly shut down; crash recovery in progress
> >
> 
> Well, according to me the current message looks fine.

Good to hear. (In the previous version I modified the message by accident..)

> $chkptfile is declared twice in the same scope. We can probably remove
> the first one.

Ugh.. Fixed.  (I wonder why Perl doesn't complain about this..)


In this version 12 I made the following changes.

- Rewrote (partially reverted) a comment in ReadRecord

- Simplified the "reached end of WAL" message by removing recovery
  mode and WAL source in ReadRecord.

- XLogPageRead sets EndOfWAL flag in the ENOENT case.

- Removed redundant declaration of the same variable in TAP script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e553164dbca709389d92b05cf8ae7a8b427e83a6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v12] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary, as "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c |  92 +-
 src/backend/access/transam/xlogreader.c   |  78 
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 ++-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 108 +-
 6 files changed, 268 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..bf1d40e7cb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 

Re: Move replication slot structures/enums/macros to a new header file for better usability by external tools/modules

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-07 10:52:08 +0530, Bharath Rupireddy wrote:
> While working on the pg_replslotdata tool [1], it was observed that some
> of the replication slot structures/enums/macros such as
> ReplicationSlotPersistentData, ReplicationSlotPersistency,
> ReplicationSlotOnDisk, etc. are currently in
> replication/slot.h and replication/slot.c. This makes the replication
> slot on-disk data structures unusable by external tools/modules.

FWIW, I still don't see a point in pg_replslotdata.  And I don't think these
data structures should ever be accessed outside the server environment.

Greetings,

Andres Freund




Use return value of XLogRecGetBlockTag instead of explicit block ref checks and also use XLogRecHasBlockRef/Image macros instead of explicit checks

2022-02-09 Thread Bharath Rupireddy
Hi,

I think we can use the return value of XLogRecGetBlockTag instead of an
explicit XLogRecHasBlockRef check, as XLogRecGetBlockTag returns false
in that case anyway. Also, in some places the macros XLogRecHasBlockRef
and XLogRecHasBlockImage aren't being used; the checks are performed
directly instead.
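
To illustrate the first point, the change has this shape (a sketch;
local variable declarations shown for completeness):

    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;

    /* before: separate existence check, then fetch */
    if (XLogRecHasBlockRef(record, block_id))
    {
        XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno);
        /* use rnode/forknum/blkno */
    }

    /*
     * after: XLogRecGetBlockTag already returns false when the record
     * has no reference for this block ID, so the macro check is
     * redundant
     */
    if (XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
    {
        /* use rnode/forknum/blkno */
    }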

The above change doesn't add any functionality or improve performance,
but it makes better use of the existing code and removes some
unnecessary/duplicate code. Attaching a small patch.

Thoughts?

Regards,
Bharath Rupireddy.


v1-0001-Use-return-value-of-XLogRecGetBlockTag-instead-of.patch
Description: Binary data


Re: Move replication slot structures/enums/macros to a new header file for better usability by external tools/modules

2022-02-09 Thread Bharath Rupireddy
On Wed, Feb 9, 2022 at 2:16 AM Peter Smith  wrote:
>
> On Mon, Feb 7, 2022 at 4:22 PM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > While working on the pg_replslotdata tool [1], it was observed that some
> > of the replication slot structures/enums/macros such as
> > ReplicationSlotPersistentData, ReplicationSlotPersistency,
> > ReplicationSlotOnDisk, etc. are currently in
> > replication/slot.h and replication/slot.c. This makes the replication
> > slot on-disk data structures unusable by external tools/modules.
> > How about moving these structures to a new header file called
> > slot_common.h as attached in the patch here?
> >
> > Thoughts?
> >
> > PS: I'm planning to maintain the tool separately, as it was rejected for
> > inclusion in core.
> >
> > [1] 
> > https://www.postgresql.org/message-id/CALj2ACW0rV5gWK8A3m6_X62qH%2BVfaq5hznC%3Di0R5Wojt5%2Byhyw%40mail.gmail.com
>
> Recently I was also looking to add some new enums but I found it
> was difficult to find any good place to put them where they could be
> shared by the replication code and the pg_recvlogical tool.
>
> So +1 to your suggestion to have a common header, but I wonder whether it
> can have a more generic name (e.g. repl_common.h? ...) since the
> stuff I wanted to put there was not really "slot" related.

Thanks. repl_common.h sounds suitably generic to me too, so I changed
it. Another important note here: this file should hold only
replication data structures and functions that are meant to be usable
across all Postgres modules (core, tools and contrib); internal data
structures can be added elsewhere.

Attaching v2 patch.

I will add a CF entry in a day or two.

Regards,
Bharath Rupireddy.
From 087de72a5aa9cb8c53729d4c15ea0f481fbc97a2 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 10 Feb 2022 04:18:53 +
Subject: [PATCH v2] Add new header file for replication slot common data
 structures

This lets external tools/modules use the replication slot data
structures.
---
 src/backend/replication/slot.c|  39 
 src/include/replication/repl_common.h | 138 ++
 src/include/replication/slot.h|  82 +--
 3 files changed, 139 insertions(+), 120 deletions(-)
 create mode 100644 src/include/replication/repl_common.h

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e5e0cf8768..42dc297a03 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -50,45 +50,6 @@
 #include "storage/procarray.h"
 #include "utils/builtins.h"
 
-/*
- * Replication slot on-disk data structure.
- */
-typedef struct ReplicationSlotOnDisk
-{
-	/* first part of this struct needs to be version independent */
-
-	/* data not covered by checksum */
-	uint32		magic;
-	pg_crc32c	checksum;
-
-	/* data covered by checksum */
-	uint32		version;
-	uint32		length;
-
-	/*
-	 * The actual data in the slot that follows can differ based on the above
-	 * 'version'.
-	 */
-
-	ReplicationSlotPersistentData slotdata;
-} ReplicationSlotOnDisk;
-
-/* size of version independent data */
-#define ReplicationSlotOnDiskConstantSize \
-	offsetof(ReplicationSlotOnDisk, slotdata)
-/* size of the part of the slot not covered by the checksum */
-#define ReplicationSlotOnDiskNotChecksummedSize  \
-	offsetof(ReplicationSlotOnDisk, version)
-/* size of the part covered by the checksum */
-#define ReplicationSlotOnDiskChecksummedSize \
-	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskNotChecksummedSize
-/* size of the slot data that is version dependent */
-#define ReplicationSlotOnDiskV2Size \
-	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskConstantSize
-
-#define SLOT_MAGIC		0x1051CA1	/* format identifier */
-#define SLOT_VERSION	2		/* version for new files */
-
 /* Control array for replication slot management */
 ReplicationSlotCtlData *ReplicationSlotCtl = NULL;
 
diff --git a/src/include/replication/repl_common.h b/src/include/replication/repl_common.h
new file mode 100644
index 00..b59a70811a
--- /dev/null
+++ b/src/include/replication/repl_common.h
@@ -0,0 +1,138 @@
+/*-
+ * repl_common.h
+ *
+ * This header file is meant to hold replication data structures and
+ * functions usable across all Postgres modules, i.e. core, tools,
+ * and contrib modules.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *src/include/replication/repl_common.h
+ *
+ *-
+ */
+#ifndef REPL_COMMON_H
+#define REPL_COMMON_H
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crash-safe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.  Slots marked TEMPORARY will be dropped at the end of a session
+ * or on error.
+ *
+ 

Re: [PATCH] Add min() and max() aggregate functions for xid8

2022-02-09 Thread Fujii Masao




On 2022/02/09 13:04, Kyotaro Horiguchi wrote:

At Wed, 09 Feb 2022 12:04:51 +0900 (JST), Kyotaro Horiguchi 
 wrote in

At Wed, 9 Feb 2022 11:01:57 +0900, Fujii Masao  
wrote in

Agreed. So barring any objection, I will commit that patch.


Sorry for being late, but I don't like the function names.

+xid8_larger(PG_FUNCTION_ARGS)
+xid8_smaller(PG_FUNCTION_ARGS)

Aren't they named like xid8gt and xid8lt conventionally?  At least the
names lack a mention of equality.


Ah, sorry. The function returns the larger/smaller of its
parameters. Sorry for the noise.


Thanks for the review! I pushed the patch.
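
For reference, the committed support functions are simple two-argument
selectors, roughly like this (sketched from memory; the committed code
is authoritative):

    Datum
    xid8_larger(PG_FUNCTION_ARGS)
    {
        FullTransactionId fxid1 = PG_GETARG_FULLTRANSACTIONID(0);
        FullTransactionId fxid2 = PG_GETARG_FULLTRANSACTIONID(1);

        /* return the larger of the two inputs, not a comparison result */
        if (FullTransactionIdFollows(fxid1, fxid2))
            PG_RETURN_FULLTRANSACTIONID(fxid1);
        else
            PG_RETURN_FULLTRANSACTIONID(fxid2);
    }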

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Database-level collation version tracking

2022-02-09 Thread Julien Rouhaud
On Wed, Feb 09, 2022 at 05:12:41PM +0100, Peter Eisentraut wrote:
> On 08.02.22 13:55, Julien Rouhaud wrote:
> > Apart from that I still think that we should check the collation version of 
> > the
> > source database when creating a new database.  It won't cost much but will 
> > give
> > the DBA a chance to recreate the indexes before risking invalid index usage.
> 
> A question on this:  In essence, this would be putting code into createdb()
> similar to the code in postinit.c:CheckMyDatabase().  But what should we
> make it do and say exactly?
> 
> After thinking about this for a bit, I suggest: If the actual collation
> version of the newly created database does not match the recorded collation
> version of the template database, we should error out and make the user fix
> the template database (by reindexing there etc.).

I'm not sure what you really mean by "actual collation version of the newly
created database".  Is it really a check after having done all the work, or
just checking for a discrepancy when computing the to-be-created database's
version from the source database, i.e. something like

    if (dbcollversion == NULL)
+   {
        dbcollversion = src_collversion;
+       if (strcmp(src_collversion,
+                  get_collation_actual_version(/* of the source db */)) != 0)
+           /* raise an error or warning */
+   }

> The alternative is to warn, as it does now in postinit.c.  But then the
> phrasing of the message becomes complicated: Should we make the user fix the
> new database or the template database or both?  And if they don't fix the
> template database, they will have the same problem again.  So making it a
> hard failure seems better to me.

Agreed, I'm in favor of a hard error.  Maybe a message like:

errmsg("cannot create database %s")
errdetail("The template database %s was created using collation version %s,
but the operating system provides version %s.")

Also, that check shouldn't be done when using the COLLATION_VERSION option
of CREATE DATABASE, since that option is there for pg_upgrade's usage?




Re: catalog access with reset GUCs during parallel worker startup

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 7:33 PM Andres Freund  wrote:
> > Can we arrange to absorb the leader's values before starting the
> > worker's transaction, instead of inside it?
>
> I assume Robert did it this way for a reason? It'd not be surprising if there
> were a bunch of extensions assuming it's safe to do catalog accesses if
> IsUnderPostmaster or such :(

I remember trying that, and if I remember correctly, it broke with
core GUCs, without any need for extensions in the picture. I don't
remember the details too well, unfortunately, but I think it had to do
with the behavior of some of the check and/or assign hooks.

It's probably time to try it again, because (a) maybe things are
different now, or (b) maybe I did it wrong, and in any event (c) I
can't really remember why it didn't work and we probably want to know
that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Logging in LockBufferForCleanup()

2022-02-09 Thread Michael Paquier
On Wed, Feb 09, 2022 at 06:22:05PM -0800, Andres Freund wrote:
> Previously the code looked somewhat safe to use in critical section like
> blocks (although whether it'd be good idea to use in one is a different
> question), but not after. Even if not used in a critical section, adding new
> failure conditions to low-level code that's holding LWLocks etc. doesn't seem
> like a good idea.

This is an interesting point.  Would the addition of one or more
critical sections in this area impact its performance in any way?

> It also just increases the overhead of LockBuffer(). Adding palloc(), copying
> of process title, GetCurrentTimestamp() to a low level routine like this isn't
> free - even if it's mostly in the contended paths.

Good point.
--
Michael


signature.asc
Description: PGP signature


Re: Possible uninitialized use of the variables (src/backend/access/transam/twophase.c)

2022-02-09 Thread Michael Paquier
On Thu, Feb 10, 2022 at 10:18:15AM +0900, Kyotaro Horiguchi wrote:
> Who reported that to you?

Let's bet on Coverity.

> StartPrepare and EndPrepare are virtually a single function that
> accepts some additional operations in the middle.  StartPrepare leaves
> the "records" incomplete, then EndPrepare completes them.  It is not
> designed so that the fields are accessed by others before the
> completion.  There seems to be no critical reason for EndPrepare to
> do the pointed-out operations, but seeing that another
> replorigin-related operation is needed in the same function, it is
> sensible that the author intended to consolidate all replorigin-related
> changes in EndPrepare.

Well, that's the intention I recall from reading this code a couple of
weeks ago.  In the worst case, this could be considered bad practice,
as having clean data at least helps debugging if you look at this data
between the calls of StartPrepare() and EndPrepare(), and we do things
this way for all the other fields, like total_len.

The proposed change is incomplete anyway once you consider this
argument.  First, there is no need to set up those fields in
EndPrepare() anymore if there is no origin data, no?  It would be good
to comment that these are filled in EndPrepare(), I guess, once you do
the initialization in StartPrepare().  I agree that those changes are
purely cosmetic.
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 608c5149e5..874c8ed125 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1074,6 +1074,9 @@ StartPrepare(GlobalTransaction gxact)
 	hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
 														  &hdr.initfileinval);
 	hdr.gidlen = strlen(gxact->gid) + 1;	/* Include '\0' */
+	/* EndPrepare will fill the origin data, if necessary */
+	hdr.origin_lsn = InvalidXLogRecPtr;
+	hdr.origin_timestamp = 0;
 
 	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
 	save_state_data(gxact->gid, hdr.gidlen);
@@ -1133,11 +1136,6 @@ EndPrepare(GlobalTransaction gxact)
 		hdr->origin_lsn = replorigin_session_origin_lsn;
 		hdr->origin_timestamp = replorigin_session_origin_timestamp;
 	}
-	else
-	{
-		hdr->origin_lsn = InvalidXLogRecPtr;
-		hdr->origin_timestamp = 0;
-	}
 
 	/*
 	 * If the data size exceeds MaxAllocSize, we won't be able to read it in


signature.asc
Description: PGP signature


Re: [RFC] building postgres with meson - perl embedding

2022-02-09 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-08 18:42:33 -0500, Tom Lane wrote:
>> I'd be a little inclined to replace it with some rule about stripping '-bE:'
>> switches out of the ldopts result.

> Similar. That's a lot easier to understand than -bE ending up stripped by
> what we're doing. Should I do so, or do you want to?

I could look at it later, but if you want to do it, feel free.

regards, tom lane




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-02-09 Thread Bharath Rupireddy
On Sun, Feb 6, 2022 at 9:15 PM Robert Haas  wrote:
>
> On Mon, Jan 31, 2022 at 4:40 PM Greg Stark  wrote:
> > So I looked at this patch and I have the same basic question as Bruce.
> > Do we really want to expose every binary tool associated with Postgres
> > as an extension? Obviously this is tempting for cloud provider users
> > which is not an unreasonable argument. But it does have consequences.
> >
> > 1) Some things like pg_waldump are running code that is not normally
> > under user control. This could have security issues or reliability
> > issues.
>
> For what it's worth, I am generally in favor of having something like
> this in PostgreSQL. I think it's wrong of us to continue assuming that
> everyone has command-line access. Even when that's true, it's not
> necessarily convenient. If you choose to use a relational database,
> you may be the sort of person who likes SQL. And if you are, you may
> want to have the database tell you what's going on via SQL rather than
> command-line tools or operating system utilities. Imagine if we didn't
> have pg_stat_activity and you had to get that information by running a
> separate binary. Would anyone like that? Why is this case any
> different?
>
> A few years ago we exposed data from pg_control via SQL and similar
> concerns were raised - but it turns out to be pretty useful. I don't
> know why this shouldn't be equally useful. Sure, there's some
> duplication in functionality, but it's not a huge maintenance burden
> for the project, and people (including me) like having it available. I
> think the same things will be true here.
>
> If decoding WAL causes security problems, that's something we better
> fix, because WAL is constantly decoded on standbys and via logical
> decoding on systems all over the place. I agree that we can't let
> users supply their own hand-crafted WAL records to be decoded without
> causing more trouble than we can handle, but if it's not safe to
> decode the WAL the system generated then we are in a lot of trouble
> already.
>
> I hasten to say that I'm not endorsing every detail or indeed any
> detail of the proposed patch, and some of the concerns you mention
> later sound well-founded to me. But I disagree with the idea that we
> shouldn't have both a command-line utility that roots through files on
> disk and an SQL interface that works with a running system.

Thanks Robert for your comments.

> + appendStringInfo(_blk_ref, "blkref #%u: rel %u/%u/%u fork %s blk %u",
>
> But that's kind of out of place for an SQL interface. It makes it hard
> to write queries since things like the relid, block number etc are in
> the string. If I wanted to use these functions I would expect to be
> doing something like "select all the decoded records pertaining to
> block n".

Thanks Greg for your review of the patches. Since there can be multiple
blkrefs for a WAL record of type HEAP2 (for multi-inserts, basically)
[1], I couldn't find a better way to break it up and represent it as a
non-text column. IMO this is simpler, and users can easily find answers
to "how many WAL records did my relation generate between lsn1 and lsn2,
how many WAL records of type Heap exist, and so on?", see [2]. I've
also added a test case to show this in the 0002 patch.

Here's the v4 patch set that has the following changes along with
Greg's review comments addressed:

1) Added documentation as 0003 patch.
2) Removed CHECKPOINT commands from tests as it is unnecessary.
3) Added input validation code and tests.
4) A few more comments have been added.
5) Currently, only superusers can create the extension, but users with
the pg_monitor role can use the functions.
6) The test cases are basic, yet they cover all the functions and error
cases with input validation. I don't think we need to add many more
test cases as suggested upthread, but I'm open to adding a few more if
I have missed any use case.

Please review the v4 patch set further and let me know your thoughts.

[1]
rmgr: Heap2   len (rec/tot): 64/  8256, tx:  0, lsn:
0/014A9070, prev 0/014A8FF8, desc: VISIBLE cutoff xid 709 flags 0x01,
blkref #0: rel 1663/12757/16384 fork vm blk 0 FPW, blkref #1: rel
1663/12757/16384 blk 0
rmgr: Heap2   len (rec/tot): 59/59, tx:  0, lsn:
0/014AB0C8, prev 0/014A9070, desc: VISIBLE cutoff xid 709 flags 0x01,
blkref #0: rel 1663/12757/16384 fork vm blk 0, blkref #1: rel
1663/12757/16384 blk 1
rmgr: Heap2   len (rec/tot): 59/59, tx:  0, lsn:
0/014AB108, prev 0/014AB0C8, desc: VISIBLE cutoff xid 709 flags 0x01,
blkref #0: rel 1663/12757/16384 fork vm blk 0, blkref #1: rel
1663/12757/16384 blk 2
rmgr: Heap2   len (rec/tot): 59/59, tx:  0, lsn:
0/014AB148, prev 0/014AB108, desc: VISIBLE cutoff xid 709 flags 0x01,
blkref #0: rel 1663/12757/16384 fork vm blk 0, blkref #1: rel
1663/12757/16384 blk 3
rmgr: Heap2   len (rec/tot): 59/59, tx:  0, lsn:
0/014AB188, prev 0/014AB148, desc: VISIBLE cutoff xid 709 flags 

Logging in LockBufferForCleanup()

2022-02-09 Thread Andres Freund
Hi,

I just noticed somewhat new code in LockBufferForCleanup(), added in

commit 39b03690b529935a3c33024ee68f08e2d347cf4f
Author: Fujii Masao 
Date:   2021-01-13 22:59:17 +0900
 
Log long wait time on recovery conflict when it's resolved.

commit 64638ccba3a659d8b8a3a4bc5b47307685a64a8a
Author: Fujii Masao 
Date:   2020-03-30 17:35:03 +0900
 
Report waiting via PS while recovery is waiting for buffer pin in hot 
standby.


After those commits LockBufferForCleanup() contains code doing memory
allocations and elogs. That doesn't strike me as a good idea:

Previously the code looked somewhat safe to use in critical section like
blocks (although whether it'd be good idea to use in one is a different
question), but not after. Even if not used in a critical section, adding new
failure conditions to low-level code that's holding LWLocks etc. doesn't seem
like a good idea.

Secondly, the way it's done seems like a significant layering violation.
Before, the HS-related code in LockBufferForCleanup() was encapsulated
in a few calls to routines dedicated to dealing with that. Now it's all
over LockBufferForCleanup().

It also just increases the overhead of LockBuffer(). Adding palloc(), copying
of process title, GetCurrentTimestamp() to a low level routine like this isn't
free - even if it's mostly in the contended paths.

Greetings,

Andres Freund




Re: Plug minor memleak in pg_dump

2022-02-09 Thread Michael Paquier
On Wed, Feb 09, 2022 at 02:48:35PM -0300, Ranier Vilela wrote:
> IMO I think that we still have trouble here.
> 
> ReadStr can return NULL, so the fix can crash.

-   sscanf(tmp, "%u", &te->catalogId.tableoid);
-   free(tmp);
+   if (tmp)
+   {
+   sscanf(tmp, "%u", &te->catalogId.tableoid);
+   free(tmp);
+   }
+   else
+   te->catalogId.tableoid = InvalidOid;

This patch makes things worse, doesn't it?  Doesn't this localized
change mean that we expose ourselves more to *ignoring* TOC entries
if we mess up this code in the future?  That seems particularly
plausible if you have a couple of corrupted bytes in a dump.
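
If anything, a corrupted dump arguably deserves a hard failure at this
spot rather than a silently invalid OID; a sketch, using fatal() as the
archiver does elsewhere:

    tmp = ReadStr(AH);
    if (tmp == NULL)
        fatal("corrupted archive: missing table OID in TOC entry");
    sscanf(tmp, "%u", &te->catalogId.tableoid);
    free(tmp);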
--
Michael


signature.asc
Description: PGP signature


Re: [BUG]Update Toast data failure in logical replication

2022-02-09 Thread Amit Kapila
On Wed, Feb 9, 2022 at 11:08 AM tanghy.f...@fujitsu.com
 wrote:
>
> On Tue, Feb 8, 2022 3:18 AM Andres Freund  wrote:
> >
> > On 2022-02-07 08:44:00 +0530, Amit Kapila wrote:
> > > Right, and it is getting changed. We are just printing the first 200
> > > characters (by using SQL [1]) from the decoded tuple so what is shown
> > > in the results is the initial 200 bytes.
> >
> > Ah, I knew I must have been missing something.
> >
> >
> > > The complete decoded data after the patch is as follows:
> >
> > Hm. I think we should change the way the strings are shortened - otherwise 
> > we
> > don't really verify much in that test. Perhaps we could just replace the 
> > long
> > repetitive strings with something shorter in the output?
> >
> > E.g. using something like regexp_replace(data,
> > '(1234567890|9876543210){200}', '\1{200}','g')
> > inside the substr().
> >
> > Wonder if we should deduplicate the number of different toasted strings in 
> > the
> > file to something that'd allow us to have a single "redact_toast" function 
> > or
> > such. There's too many different ones to have a reasonably simple redaction
> > function right now. But that's perhaps better done separately.
> >
>
> I tried to make the output shorter using your suggestion like the following 
> SQL,
> please see the attached patch, which is based on v8 patch[1].
>
> SELECT substr(regexp_replace(data, '(1234567890|9876543210){200}', 
> '\1{200}','g'), 1, 200) FROM pg_logical_slot_get_changes('regression_slot', 
> NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
>
> Note that some strings are still longer than 200 characters even though they
> have been shortened, so they can't be shown entirely.
>
> e.g.
> table public.toasted_key: UPDATE: old-key: 
> toasted_key[text]:'1234567890{200}' new-tuple: id[integer]:1 
> toasted_key[text]:unchanged-toast-datum 
> toasted_col1[text]:unchanged-toast-datum toasted_col2[te
>
> The entire string is:
> table public.toasted_key: UPDATE: old-key: 
> toasted_key[text]:'1234567890{200}' new-tuple: id[integer]:1 
> toasted_key[text]:unchanged-toast-datum 
> toasted_col1[text]:unchanged-toast-datum toasted_col2[text]:'9876543210{200}'
>
> Maybe it's better to change the substr length to 250 to show the entire
> string, or we can do it as a separate HEAD-only improvement where we can
> deduplicate some of the other long strings as well. Thoughts?
>

I think it is better to do this as a separate HEAD-only improvement as
it can affect other test results. We can also try to deduplicate some
of the other long strings used in the toast.sql file along with it.

-- 
With Regards,
Amit Kapila.




Re: warn if GUC set to an invalid shared library

2022-02-09 Thread Justin Pryzby
On Fri, Jan 28, 2022 at 09:42:17AM -0500, Robert Haas wrote:
> -1 from me on using "guc" in any user-facing error message. And even
> guc -> setting isn't a big improvement. If we're going to structure
> the reporting this way there, we should try to use a meaningful phrase
> there, probably beginning with the word "while"; see "git grep
> errcontext.*while" for interesting precedents.

On Wed, Feb 02, 2022 at 06:06:01AM +, Maciek Sakrejda wrote:
> I do agree that GUC is awkward here, and I like the "while..." wording 
> suggested both for consistency with other messages and how it could work here:
> CONTEXT: while loading "shared_preload_libraries"

FYI, it has said "while..." and hasn't said "guc" since the 2nd revision of the
patch.

> That said, that series of messages seems a bit suspect to me, because
> the WARNING seems to be stating the same problem as the subsequent
> FATAL and CONTEXT lines. Ideally we'd tighten that somehow.

I avoided the warning by checking IsUnderPostmaster, though I'm not sure if
that's the right condition..

On Wed, Feb 02, 2022 at 06:06:01AM +, Maciek Sakrejda wrote:
> I think this is great, but it would be really helpful to also indicate that 
> at this point the server will fail to come back up after a restart.

> I don't really know a good wording here, but maybe a hint like "The server or 
> session will not be able to start if it has been configured to use libraries 
> that cannot be loaded."?

postgres=# ALTER SYSTEM SET shared_preload_libraries =a,b;
WARNING:  could not access file "$libdir/plugins/a"
HINT:  The server will fail to start with the existing configuration.  If it
is shut down, it will be necessary to manually edit the postgresql.auto.conf
file to allow it to start.
WARNING:  could not access file "$libdir/plugins/b"
HINT:  The server will fail to start with the existing configuration.  If it
is shut down, it will be necessary to manually edit the postgresql.auto.conf
file to allow it to start.
ALTER SYSTEM
postgres=# ALTER SYSTEM SET session_preload_libraries =c,d;
WARNING:  could not access file "$libdir/plugins/c"
HINT:  New sessions will fail with the existing configuration.
WARNING:  could not access file "$libdir/plugins/d"
HINT:  New sessions will fail with the existing configuration.
ALTER SYSTEM

$ ./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data 
-clogging_collector=on
2022-02-09 19:53:48.034 CST postmaster[30979] FATAL:  could not access file 
"a": No such file or directory
2022-02-09 19:53:48.034 CST postmaster[30979] CONTEXT:  while loading shared 
libraries for setting "shared_preload_libraries"
from 
/home/pryzbyj/src/postgres/src/test/regress/tmp_check/data/postgresql.auto.conf:3
2022-02-09 19:53:48.034 CST postmaster[30979] LOG:  database system is shut down

Maybe it's enough to show the GucSource rather than file:line...
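
The check hooks the patch adds have roughly this shape (a sketch only:
the resolution of $libdir and the plugins/ prefix that produces the
paths shown above is elided, and the exact messages come from the patch
below):

    /* Sketch: warn, but never hard-fail, when a listed library is missing. */
    static bool
    check_preload_libraries(char **newval, void **extra, GucSource source)
    {
        char       *rawstring;
        List       *elemlist;
        ListCell   *l;

        if (*newval == NULL || (*newval)[0] == '\0')
            return true;        /* empty list: nothing to check */

        /* SplitIdentifierString() modifies its input, so use a copy */
        rawstring = pstrdup(*newval);
        if (!SplitIdentifierString(rawstring, ',', &elemlist))
        {
            GUC_check_errdetail("List syntax is invalid.");
            pfree(rawstring);
            list_free(elemlist);
            return false;
        }

        foreach(l, elemlist)
        {
            char       *filename = (char *) lfirst(l);

            /* $libdir / plugins-directory resolution elided here */
            if (access(filename, R_OK) != 0)
                ereport(WARNING,
                        (errcode_for_file_access(),
                         errmsg("could not access file \"%s\"", filename)));
        }

        pfree(rawstring);
        list_free(elemlist);
        return true;            /* accept the value regardless */
    }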

-- 
Justin
>From 69560e81bfbb43a67269f54b564fe8ac5ae1b5c6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 13 Dec 2021 08:42:38 -0600
Subject: [PATCH v4 1/3] warn when setting GUC to a nonextant library

---
 src/backend/utils/misc/guc.c  | 103 +-
 .../unsafe_tests/expected/rolenames.out   |   1 +
 .../worker_spi/expected/worker_spi.out|   2 +
 .../modules/worker_spi/sql/worker_spi.sql |   2 +
 src/test/regress/expected/rules.out   |   9 ++
 src/test/regress/sql/rules.sql|   1 +
 6 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f505413a7f9..c44b8ebbfd6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -170,6 +170,12 @@ static bool call_enum_check_hook(struct config_enum *conf, int *newval,
  void **extra, GucSource source, int elevel);
 
 static bool check_log_destination(char **newval, void **extra, GucSource source);
+static bool check_local_preload_libraries(char **newval, void **extra,
+		GucSource source);
+static bool check_session_preload_libraries(char **newval, void **extra,
+		GucSource source);
+static bool check_shared_preload_libraries(char **newval, void **extra,
+		GucSource source);
 static void assign_log_destination(const char *newval, void *extra);
 
 static bool check_wal_consistency_checking(char **newval, void **extra,
@@ -4198,7 +4204,7 @@ static struct config_string ConfigureNamesString[] =
 		},
 		&session_preload_libraries_string,
 		"",
-		NULL, NULL, NULL
+		check_session_preload_libraries, NULL, NULL
 	},
 
 	{
@@ -4209,7 +4215,7 @@ static struct config_string ConfigureNamesString[] =
 		},
 		&shared_preload_libraries_string,
 		"",
-		NULL, NULL, NULL
+		check_shared_preload_libraries, NULL, NULL
 	},
 
 	{
@@ -4220,7 +4226,7 @@ static struct config_string ConfigureNamesString[] =
 		},
 		&local_preload_libraries_string,
 		"",
-		NULL, NULL, NULL
+		check_local_preload_libraries, NULL, NULL
 	},
 
 	{
@@ -12149,6 +12155,97 @@ 

Re: [RFC] building postgres with meson - perl embedding

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-08 18:42:33 -0500, Tom Lane wrote:
> I'd be a little inclined to replace it with some rule about stripping '-bE:'
> switches out of the ldopts result.

Similar. That's a lot easier to understand than -bE ending up stripped by
what we're doing. Should I do so, or do you want to?

Greetings,

Andres Freund




Re: make MaxBackends available in _PG_init

2022-02-09 Thread Michael Paquier
On Wed, Feb 09, 2022 at 09:53:38AM -0800, Nathan Bossart wrote:
> On Wed, Feb 09, 2022 at 12:31:16PM -0500, Robert Haas wrote:
>> On Wed, Feb 9, 2022 at 12:18 AM Michael Paquier  wrote:
>>> Okay, I'd rather apply the same rule everywhere for consistency, then,
>>> like in the attached.  That's minimal, still.
>> 
>> That's fine with me. In the interest of full disclosure, I did kind of
>> notice this when reviewing the patch, though perhaps not every
>> instance, and just decided that it didn't seem important enough to
>> worry about. I'm totally OK with you thinking otherwise, though,
>> especially since you also volunteered to do the work thus generated.
>> :-)

I guessed so :p

> This is fine with me as well.  I only left these out because the extra
> variable felt unnecessary to me for these functions.

Okay, done, then.

> While you are at it, would you mind fixing the misspelling of
> OldestVisibleMXactId in multixact.c (as per the attached)?

Indeed.  Fixed as well.
--
Michael


signature.asc
Description: PGP signature


Re: GetRelationPublicationActions. - Remove unreachable code

2022-02-09 Thread Peter Smith
On Thu, Feb 10, 2022 at 11:34 AM Tom Lane  wrote:
>
> Peter Smith  writes:
> > There appears to be some unreachable code in the relcache function
> > GetRelationPublicationActions.
> > When the 'relation->rd_pubactions' is not NULL then the function
> > unconditionally returns early (near the top).
> > Therefore, the following code (near the bottom) seems to have no
> > purpose because IIUC the 'rd_pubactions' can never be non-NULL here.
>
> I'm not sure it's as unreachable as all that.  What you're not
> accounting for is the possibility of recursive cache loading,
> ie something down inside the catalog fetches we have to do here
> could already have made the field valid.
>
> Admittedly, that's probably not very likely given expected usage
> patterns (to wit, that system catalogs shouldn't really have
> publication actions).  But there are other places in relcache.c where
> a coding pattern similar to this is demonstrably necessary to avoid
> memory leakage.  I'd rather leave the code as-is, maybe add a comment
> along the lines of "rd_pubactions is probably still NULL, but just in
> case something already made it valid, avoid leaking memory".

OK. Thanks for your explanation and advice.

PSA another patch to just add a comment as suggested.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-GetRelationPublicationActions-comment-code.patch
Description: Binary data


Re: Possible uninitialized use of the variables (src/backend/access/transam/twophase.c)

2022-02-09 Thread Kyotaro Horiguchi
At Wed, 9 Feb 2022 08:15:45 -0300, Ranier Vilela  wrote in 
> Hi,
> 
> Commit
> https://github.com/postgres/postgres/commit/1eb6d6527aae264b3e0b9c95aa70bb7a594ad1cf,
> modified
> data struct TwoPhaseFileHeader and added two new fields:
> 
> XLogRecPtr origin_lsn; /* lsn of this record at origin node */
> TimestampTz origin_timestamp; /* time of prepare at origin node */
> 
> I think they forgot to initialize these fields in the function StartPrepare
> because, when calling the function
> save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
> 
> I have one report about possible uninitialized usage of the variables.

Who reported that to you?

StartPrepare and EndPrepare are virtually a single function that
accepts some additional operations in the middle.  StartPrepare leaves
the "records" incomplete, then EndPrepare completes them.  It is not
designed so that the fields are accessed by others before the
completion.  There seems to be no critical reason for EndPrepare to
do the pointed-out operations, but seeing that another
replorigin-related operation is needed in the same function, it is
sensible that the author intended to consolidate all replorigin-related
changes in EndPrepare.

So, my humble opinion here is, that change is not needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Justin Pryzby

2022-02-09 Thread Michael Paquier
Hi Alexander,

On Wed, Feb 09, 2022 at 03:18:12PM +0300, Alexander Pyhalov wrote:
> I've looked at the patches introducing CREATE INDEX CONCURRENTLY for
> partitioned tables - 
> https://www.postgresql.org/message-id/flat/20210226182019.GU20769%40telsasoft.com#da169a0a518bf8121604437d9ab053b3
> .
> The thread didn't have any activity for a year.

If you plan to review a patch set, could you reply directly on its
thread without changing the subject please?  This way of replying
breaks the flow of a thread, making it harder to track.

Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Replacing TAP test planning with done_testing()

2022-02-09 Thread Michael Paquier
On Wed, Feb 09, 2022 at 02:49:47PM -0500, Tom Lane wrote:
> =?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
>> Daniel Gustafsson  writes:
>>> The attached patch removes all Test::More planning and instead ensures that 
>>> all
>>> tests conclude with a done_testing() call.  While there, I also removed a 
>>> few
>>> exit(0) calls from individual tests making them more consistent.
> 
>> LGTM, +1.
> 
> LGTM too.

Not tested, but +1.  Could it be possible to backpatch that even if
this could be qualified as only cosmetic?  Each time a test is
backpatched we need to tweak the number of tests planned, and that may
change slightly depending on the branch dealt with.
--
Michael


signature.asc
Description: PGP signature


Re: row filtering for logical replication

2022-02-09 Thread Peter Smith
On Tue, Jan 25, 2022 at 2:18 PM houzj.f...@fujitsu.com
 wrote:
>
...

> > 4. src/backend/utils/cache/relcache.c - RelationBuildPublicationDesc
> >
> > - if (relation->rd_pubactions)
> > + if (relation->rd_pubdesc)
> >   {
> > - pfree(relation->rd_pubactions);
> > - relation->rd_pubactions = NULL;
> > + pfree(relation->rd_pubdesc);
> > + relation->rd_pubdesc = NULL;
> >   }
> >
> > What is the purpose of this code? Can't it all just be removed?
> > e.g. Can't you Assert that relation->rd_pubdesc is NULL at this point?
> >
> > (if it was not-null the function would have returned immediately from the 
> > top)
>
> I think it might be better to change this as a separate patch.

OK. I have made a separate thread [1] for discussing this one.

--
[1] 
https://www.postgresql.org/message-id/flat/1524753.1644453267%40sss.pgh.pa.us#1c40bbc4126daaf75b927a021526654a

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: GetRelationPublicationActions. - Remove unreachable code

2022-02-09 Thread Tom Lane
Peter Smith  writes:
> There appears to be some unreachable code in the relcache function
> GetRelationPublicationActions.
> When the 'relation->rd_pubactions' is not NULL then the function
> unconditionally returns early (near the top).
> Therefore, the following code (near the bottom) seems to have no
> purpose because IIUC the 'rd_pubactions' can never be non-NULL here.

I'm not sure it's as unreachable as all that.  What you're not
accounting for is the possibility of recursive cache loading,
ie something down inside the catalog fetches we have to do here
could already have made the field valid.

Admittedly, that's probably not very likely given expected usage
patterns (to wit, that system catalogs shouldn't really have
publication actions).  But there are other places in relcache.c where
a coding pattern similar to this is demonstrably necessary to avoid
memory leakage.  I'd rather leave the code as-is, maybe add a comment
along the lines of "rd_pubactions is probably still NULL, but just in
case something already made it valid, avoid leaking memory".
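
Concretely, something like this on the existing code (just a sketch of
the suggested comment):

    /*
     * rd_pubactions is probably still NULL here, but the catalog
     * fetches above could have recursively reloaded this relcache
     * entry and already made it valid; free it to avoid leaking
     * memory.
     */
    if (relation->rd_pubactions)
    {
        pfree(relation->rd_pubactions);
        relation->rd_pubactions = NULL;
    }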

regards, tom lane




Re: catalog access with reset GUCs during parallel worker startup

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-09 18:56:41 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Do we really need to reset GUCs to their boot value? Particularly for
> > PGC_SIGHUP variables I don't think we can do that if we want to do
> > RestoreGUCState() in a transaction.  It'd obviously involve a bit more code,
> > but it seems entirely possible to just get rid of the reset phase?
> 
> In an EXEC_BACKEND build, they're going to start out with those
> values anyway.  If you're not happy about the consequences,
> "skipping the reset" is not the way to improve matters.

Postmaster's GUC state will already be loaded via read_nondefault_variables(),
much earlier in startup, well before bgworkers get control and before the
transaction environment is up.

Setting a watchpoint on enableFsync, in a parallel worker where postmaster
runs with enableFsync=off, shows the following:

Old value = true
New value = false
set_config_option (name=0x7f42bd9ec110 "fsync", value=0x7f42bd9ec000 "false", 
context=PGC_POSTMASTER, source=PGC_S_ARGV, action=GUC_ACTION_SET, 
changeVal=true, elevel=21, is_reload=true) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
7760			set_extra_field(&conf->gen, &conf->gen.extra,
(rr) bt
#0  set_config_option (name=0x7f42bd9ec110 "fsync", value=0x7f42bd9ec000 
"false", context=PGC_POSTMASTER, source=PGC_S_ARGV, action=GUC_ACTION_SET, 
changeVal=true, elevel=21, is_reload=true) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
#1  0x7f42bcd5f178 in read_nondefault_variables () at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:10635
#2  0x7f42bca9ccd1 in SubPostmasterMain (argc=3, argv=0x7f42bd9c93c0) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5086
#3  0x7f42bc98bcef in main (argc=3, argv=0x7f42bd9c93c0) at 
/home/andres/src/postgresql/src/backend/main/main.c:194

Old value = false
New value = true
InitializeOneGUCOption (gconf=0x7f42bd0a32c0 ) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:5900
5900			conf->gen.extra = conf->reset_extra = extra;
(rr) bt
#0  InitializeOneGUCOption (gconf=0x7f42bd0a32c0 ) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:5900
#1  0x7f42bcd5ff95 in RestoreGUCState (gucstate=0x7f42bc4f1d00) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:11135
#2  0x7f42bc720873 in ParallelWorkerMain (main_arg=545066472) at 
/home/andres/src/postgresql/src/backend/access/transam/parallel.c:1408
#3  0x7f42bca884e5 in StartBackgroundWorker () at 
/home/andres/src/postgresql/src/backend/postmaster/bgworker.c:858
#4  0x7f42bca9cfd1 in SubPostmasterMain (argc=3, argv=0x7f42bd9c93c0) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5230
#5  0x7f42bc98bcef in main (argc=3, argv=0x7f42bd9c93c0) at 
/home/andres/src/postgresql/src/backend/main/main.c:194


Old value = true
New value = false
set_config_option (name=0x7f42bc4f1e04 "fsync", value=0x7f42bc4f1e0a "false", 
context=PGC_POSTMASTER, source=PGC_S_ARGV, action=GUC_ACTION_SET, 
changeVal=true, elevel=21, is_reload=true) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
7760			set_extra_field(&conf->gen, &conf->gen.extra,
(rr) bt
#0  set_config_option (name=0x7f42bc4f1e04 "fsync", value=0x7f42bc4f1e0a 
"false", context=PGC_POSTMASTER, source=PGC_S_ARGV, action=GUC_ACTION_SET, 
changeVal=true, elevel=21, is_reload=true) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
#1  0x7f42bcd600fc in RestoreGUCState (gucstate=0x7f42bc4f1d00) at 
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:11172
#2  0x7f42bc720873 in ParallelWorkerMain (main_arg=545066472) at 
/home/andres/src/postgresql/src/backend/access/transam/parallel.c:1408
#3  0x7f42bca884e5 in StartBackgroundWorker () at 
/home/andres/src/postgresql/src/backend/postmaster/bgworker.c:858
#4  0x7f42bca9cfd1 in SubPostmasterMain (argc=3, argv=0x7f42bd9c93c0) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5230
#5  0x7f42bc98bcef in main (argc=3, argv=0x7f42bd9c93c0) at 
/home/andres/src/postgresql/src/backend/main/main.c:194




> Can we arrange to absorb the leader's values before starting the
> worker's transaction, instead of inside it?

I assume Robert did it this way for a reason? It'd not be surprising if there
were a bunch of extensions assuming it's safe to do catalog accesses if
IsUnderPostmaster or such :(

Greetings,

Andres Freund




GetRelationPublicationActions. - Remove unreachable code

2022-02-09 Thread Peter Smith
Hi hackers.

There appears to be some unreachable code in the relcache function
GetRelationPublicationActions.

When the 'relation->rd_pubactions' is not NULL then the function
unconditionally returns early (near the top).

Therefore, the following code (near the bottom) seems to have no
purpose because IIUC the 'rd_pubactions' can never be non-NULL here.

if (relation->rd_pubactions)
{
pfree(relation->rd_pubactions);
relation->rd_pubactions = NULL;
}

The latest Code Coverage report [1] also shows that this code is never executed.

5601  3556 : if (relation->rd_pubactions)
5602       : {
5603     0 : pfree(relation->rd_pubactions);
5604     0 : relation->rd_pubactions = NULL;
5605       : }

~~

PSA a patch to remove this unreachable code.

--
[1] https://coverage.postgresql.org/src/backend/utils/cache/relcache.c.gcov.html

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Remove-unreachable-code-in-GetRelationPublication.patch
Description: Binary data


Re: catalog access with reset GUCs during parallel worker startup

2022-02-09 Thread Tom Lane
Andres Freund  writes:
> Do we really need to reset GUCs to their boot value? Particularly for
> PGC_SIGHUP variables I don't think we can do that if we want to do
> RestoreGUCState() in a transaction.  It'd obviously involve a bit more code,
> but it seems entirely possible to just get rid of the reset phase?

In an EXEC_BACKEND build, they're going to start out with those
values anyway.  If you're not happy about the consequences,
"skipping the reset" is not the way to improve matters.

Can we arrange to absorb the leader's values before starting the
worker's transaction, instead of inside it?

regards, tom lane




Re: New developer papercut - Makefile references INSTALL

2022-02-09 Thread Tom Lane
Andres Freund  writes:
> On 2022-02-09 22:32:59 +0100, Magnus Hagander wrote:
>> post-commit hooks don't run on the git server, they run locally on
>> your machine. There is a "post receive" hook that runs on the git
>> server, but we definitely don't want that one to fabricate new commits
>> I think.

> Why not? We probably wouldn't want to do it synchronously as part of the receive
> hook, but if we have a policy that INSTALL is not to be updated by humans, but
> updated automatically whenever its sources are modified, I'd be OK with
> auto-committing that.

What happens when the INSTALL build fails (which is quite possible,
I believe, even if a plain html build works)?

I don't really want any post-commit or post-receive hooks doing
anything interesting to the tree.  I think the odds for trouble
are significantly greater than any value we'd get out of it.

I'm in favor of unifying README and README.git along the lines
we discussed above.  I think that going further than that
will be a lot of trouble for very little gain; in fact no gain,
because I do not buy any of the arguments that have been
made about why changing the INSTALL setup would be beneficial.
If we adjust the README contents to be less confusing about that,
we're done.

regards, tom lane




Re: [PATCH] Improve function toast_delete_external (src/backend/access/table/toast_helper.c)

2022-02-09 Thread Ranier Vilela
On Wed, Feb 9, 2022 at 8:10 PM Dong Wook Lee wrote:

> Yes, now I understand it.
> Thank you for letting me know about that.
>
You are welcome.

Best regards,
Ranier Vilela


Re: New developer papercut - Makefile references INSTALL

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-09 22:32:59 +0100, Magnus Hagander wrote:
> On Fri, Jan 21, 2022 at 11:53 PM Andres Freund  wrote:
> > On 2022-01-21 17:25:08 -0500, Tom Lane wrote:
> > > Perhaps this could be finessed by making updating of INSTALL
> > > the responsibility of some post-commit hook on the git server.
> > > Not sure that we want to go there, though.  In any case, that
> > > approach would negate your point about seeing the results.
> >
> > It would. I guess it'd still be better than the situation today, but...
> 
> post-commit hooks don't run on the git server, they run locally on
> your machine. There is a "post receive" hook that runs on the git
> server, but we definitely don't want that one to fabricate new commits
> I think.

Why not? We probably wouldn't want to do it synchronously as part of the receive
hook, but if we have a policy that INSTALL is not to be updated by humans, but
updated automatically whenever its sources are modified, I'd be OK with
auto-committing that.


But before we go there, it might be worth checking if the generated INSTALL
actually changes meaningfully across "doc toolchain" versions. If not, a
simpler receive hook verifying that INSTALL was updated when the relevant sgml
files changed probably would be sufficient.

Greetings,

Andres Freund




Re: [PATCH] nodeindexscan with reorder memory leak

2022-02-09 Thread Tom Lane
Alexander Korotkov  writes:
> I've rechecked that the now patched code branch is covered by
> regression tests.  I think the memory leak issue is independent of the
> computational errors we've observed.
> So, I'm going to push and backpatch this if no objections.

+1.  We should work on the roundoff-error issue as well, but
as you say, it's an independent problem.

regards, tom lane




Re: decoupling table and index vacuum

2022-02-09 Thread Peter Geoghegan
On Wed, Feb 9, 2022 at 1:41 PM Robert Haas  wrote:
> I'm not sure that we can. I mean, there's still only going to be ~3
> autovacuum workers, and there could be arbitrarily many tables. Even
> if the vacuum load is within the bounds of what the system can
> sustain, individual tables can't be assured of being visited
> frequently (or so it seems to me) and it could be that there are
> actually not enough resources to vacuum everything, and we have to try to cope as
> best we can. Less unnecessary vacuuming of large indexes can help, of
> course, but I'm not sure it fundamentally changes the calculus.

You seem to be vastly underestimating the value in being able to
spread out and reschedule the work, and manage costs more generally.
If you can multiplex autovacuum workers across tables, by splitting up
work across a table's index over time, then it might not matter at all
that you only have 3 workers. If you can spread out the work over
time, then you make things much cheaper (fewer FPIs by aligning to
checkpoint boundaries). And, because you have a schedule that can be
dynamically updated, you get to update your global view of the world
(not just one table) before you've fully committed to it -- if you
provisionally say that you think that a certain index won't need to be
vacuumed for a long time, that isn't the last word anymore.

Costs are paid by the whole system, but benefits only go to individual
tables and indexes. Being able to manage costs over time with a sense
of the benefits, and a sense of high level priorities will be *huge*
for us. Managing debt at the level of the entire system (not just one
table or index) is also really important. (Though maybe we should just
focus on the v1, just because that's what is needed right now.)

> > We will need something like that. I think that LP_DEAD items (or
> > would-be LP_DEAD items -- tuples with storage that would get pruned
> > into LP_DEAD items if we were to prune) in the table are much more
> > interesting than dead heap-only tuples, and also more interesting than
> > dead index tuples. Especially the distribution of such LP_DEAD items
> > in the table, and their concentration. That does seem much more likely
> > to be robust as a quantitative driver of index vacuuming.
>
> Hmm... why would the answer have to do with dead items in the heap?

We're eventually going to have to make the LP_DEAD items LP_UNUSED
anyway here. So we might as well get started on that, with the index
that we *also* think is the one that might need it the most, for its
own reasons. We're making a decision on the basis of multiple factors,
knowing that in the worst case (when the index really didn't need
anything at all) we will have at least had the benefit of doing some
actually-useful work sooner rather than later. We should probably
consider multiple reasons to do any unit of work.

> I was thinking along the lines of trying to figure out either a more
> reliable count of dead tuples in the index, subtracting out whatever
> we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a
> count of the subset of dead tuples that are likely not to get
> opportunistically pruned in one way or another, if there's some way to
> guess that.

I don't know how to build something like that, since that works by
understanding what's working, not by noticing that some existing
strategy plainly isn't working. The only positive information that I have
confidence in is the extreme case where you have zero index growth.
Which is certainly possible, but perhaps not that interesting with a
real workload.

There are emergent behaviors with bottom-up deletion. Purely useful
behaviors, as far as I know, but still very hard to precisely nail
down. For example, Victor Yegorov came up with an adversarial
benchmark [1] that showed that the technique dealt with index bloat
from queue-like inserts and deletes that recycled the same distinct
key values over time, since they happened to be mixed with non-hot
updates. It dealt very well with that, even though *I had no clue*
that it would work *at all*, and might have even incorrectly predicted
the opposite if Victor had asked about it in advance.

> I realize I'm
> hand-waving, but if the property is a property of the heap rather than
> the index, how will different indexes get different treatment?

Maybe by making the primary key growth an indicator of what is
reasonable for the other indexes (or other B-Tree indexes) -- it has a
natural tendency to be the least bloated possible index. If you have
something like a GiST index, or if you have a B-Tree index that
constantly gets non-HOT updates that logically modify an indexed
column, then it should become reasonably obvious. Maybe there'd be
some kind of feedback behavior to lock in "bloat prone index" for a
time.

If we can bring costs into it too (e.g., spreading out the burden of
index vacuuming over time), then it becomes acceptable to incorrectly
determine which index needed special attention. We will 

catalog access with reset GUCs during parallel worker startup

2022-02-09 Thread Andres Freund
Hi,

I noticed that during ParallelWorkerMain()->RestoreGUCState() we perform
catalog accesses with GUCs being set to wrong values, specifically values that
aren't what's in postgresql.conf, and that haven't been set in the leader.

The reason for this is that

1) RestoreGUCState() is called in a transaction, which signals to several GUC
hooks that they can do catalog access

2) RestoreGUCState() resets all non-skipped GUC values to their "boot_val"
first and then in another pass sets all GUCs that were non-default in the
leader.


This e.g. can lead to parallel worker startup using the wrong fsync or
wal_sync_method value when catalog accesses trigger WAL writes. Luckily
PGC_POSTMASTER GUCs are skipped, otherwise this'd be kinda bad. But it still
doesn't seem good.
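As a toy illustration of that window (standalone C, nothing from guc.c; the
variable and function names are invented), a reset-then-restore scheme lets
any hook that runs during the second pass observe boot defaults for GUCs
that have not been restored yet:

#include <stdio.h>
#include <stdbool.h>

static bool fsync_guc = true;	/* stand-in for a GUC whose boot_val is "on" */

static void
hook_doing_catalog_access(void)
{
	/* in the real case, a GUC hook can trigger catalog reads / WAL writes */
	printf("catalog access runs with fsync=%s\n", fsync_guc ? "on" : "off");
}

int
main(void)
{
	fsync_guc = true;			/* pass 1: reset everything to boot values */

	/*
	 * pass 2: restoring some *other* GUC fires a hook doing catalog
	 * access -- fsync_guc still holds its boot value here, not the value
	 * the leader was actually running with.
	 */
	hook_doing_catalog_access();

	fsync_guc = false;			/* the leader's value only arrives now */
	return 0;
}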

Do we really need to reset GUCs to their boot value? Particularly for
PGC_SIGHUP variables I don't think we can do that if we want to do
RestoreGUCState() in a transaction.  It'd obviously involve a bit more code,
but it seems entirely possible to just get rid of the reset phase?

Greetings,

Andres Freund




Re: [PATCH] nodeindexscan with reorder memory leak

2022-02-09 Thread Alexander Korotkov
On Thu, Feb 10, 2022 at 1:19 AM Aliaksandr Kalenik  wrote:
> On Mon, Feb 7, 2022 at 11:20 PM Alexander Korotkov  
> wrote:
> > Regarding the memory leak, could you add a corresponding regression
> > test to the patch (probably similar to Tom's query upthread)?
>
> Yes, added similar to Tom's query but with polygons instead of circles.

I've rechecked that the now patched code branch is covered by
regression tests.  I think the memory leak issue is independent of the
computational errors we've observed.

So, I'm going to push and backpatch this if no objections.

--
Regards,
Alexander Korotkov




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-09 16:42:30 -0600, Justin Pryzby wrote:
> On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> > On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> > a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> > == 0.
> 
> You could also stat() the file in proc/self/fd/N and compare st_ino.  It
> "looks" like a symlink (in which case that wouldn't work) but it's actually a
> Very Special File.  You can even recover deleted, still-opened files that way..

Yea, the readlink() thing above relies on it being a /proc/self/fd/$fd being a
"Very Special File".

In most places we'd not have convenient access to a inode / filename to
compare it to. I don't think there's any additional information we could gain
anyway, compared to looking at st_nlink == 0 and then doing a readlink() to
get the filename?


> PS. I didn't know pg_upgrade knew about Reply-To ;)

Ugh, formatting fail...

Greetings,

Andres Freund




Re: [PATCH] Improve function toast_delete_external (src/backend/access/table/toast_helper.c)

2022-02-09 Thread Dong Wook Lee
Yes, now I understand it.
Thank you for letting me know about that.

Thanks.

On Thu, Feb 10, 2022 at 00:39 Ranier Vilela wrote:

>
> On Wed, Feb 9, 2022 at 11:34 AM Dong Wook Lee wrote:
>
>> Hi,
>>
>> I've applied your patch. I think it's reasonable.
>> but  IMHO It looks more complicated to read because of many conditions
>> in if statement.
>> so what about just moving up if-statement?
>>
>
> No.
> Your version is worse than HEAD, please don't.
>
> 1. Do not declare variables after statement.
> Although Postgres accepts C99, this is not acceptable, for a number of
> problems, already discussed.
>
> 2. Still slow unnecessarily, because still test TupleDescAttr(tupleDesc,
> i)->attlen,
> which is much more onerous than test isnull.
>
> 3. We don't write Baby C code.
> C has short break expressions, use it!
>
> My version is the right thing, here.
>
> regards,
> Ranier Vilela
>


Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-09 Thread Justin Pryzby
On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> == 0.

You could also stat() the file in proc/self/fd/N and compare st_ino.  It
"looks" like a symlink (in which case that wouldn't work) but it's actually a
Very Special File.  You can even recover deleted, still-opened files that way..

PS. I didn't know pg_upgrade knew about Reply-To ;)




Re: [PATCH] nodeindexscan with reorder memory leak

2022-02-09 Thread Aliaksandr Kalenik
On Mon, Feb 7, 2022 at 11:20 PM Alexander Korotkov  wrote:
> Regarding the memory leak, could you add a corresponding regression
> test to the patch (probably similar to Tom's query upthread)?

Yes, added similar to Tom's query but with polygons instead of circles.


0002-nodeindexscan_with_reorder_memory_leak.patch
Description: Binary data


wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-09 Thread Andres Freund
Hi,

I was working on rebasing the AIO branch. Tests started to fail after, but it
turns out that the problem exists independent of AIO.

The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas 
Date:   2022-01-24 14:23:15 -0500

pg_upgrade: Preserve database OIDs.

Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve
relfilenodes and tablespace OIDs. For similar reasons, also arrange
to preserve database OIDs.

after this commit the database contents for databases that exist before
pg_upgrade runs, and the databases pg_upgrade (re-)creates can have the same
oid. Consequently the RelFileNode of files in the "pre-existing" database and
the re-created database will be the same.

That's a problem: There is no reliable invalidation mechanism forcing all
processes to clear SMgrRelationHash when a database is dropped /
created. Which means that backends can end up using file descriptors for the
old, dropped, database, when accessing contents in the new database.  This is
most likely to happen to bgwriter, but could be any backend (backends in other
databases can end up opening files in another database to write out dirty
victim buffers during buffer replacement).

That's of course dangerous, because we can end up reading the wrong data (by
reading from the old file), or losing writes (by writing to the old file).


Now, this isn't a problem that's entirely new - but previously it required
relfilenodes to be recycled due to wraparound or such. The timeframes for that
are long enough that it's very likely that *something* triggers smgr
invalidation processing. E.g. the bgwriter main loop has

	if (FirstCallSinceLastCheckpoint())
	{
		/*
		 * After any checkpoint, close all smgr files.  This is so we
		 * won't hang onto smgr references to deleted files indefinitely.
		 */
		smgrcloseall();
	}

in its mainloop. Previously for the problem to occur there would have to be
"oid wraparound relfilenode recycling" within a single checkpoint window -
quite unlikely. But now a bgwriter cycle just needs to span a drop/create
database, easier to hit.


I think the most realistic way to address this is to use the mechanism from
https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

We perhaps could avoid all relfilenode recycling, issues on the fd level, by
adding a barrier whenever relfilenode allocation wraps around. But I'm not
sure how easy that is to detect.


I think we might actually be able to remove the checkpoint from dropdb() by
using barriers?


I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

b) might also work on a few other operating systems, but I have not verified
that. a) has the advantage of providing a filename, which makes it a lot
easier to understand the problem.  It'd probably make sense to use b) on all
the operating systems it works on, and then a) on linux for a more useful
error message.
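As a standalone sketch of those two checks (illustrative only -- this is not
the actual debugging patch, and the demo file name and helper are made up):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

static void
report_if_deleted(int fd)
{
	struct stat st;
	char		path[64];
	char		target[4096];
	ssize_t		len;

	/* b) an unlinked-but-still-open file has st_nlink == 0 */
	if (fstat(fd, &st) == 0 && st.st_nlink == 0)
		fprintf(stderr, "fd %d: underlying file was unlinked\n", fd);

	/* a) on Linux, readlink() returns the old name with " (deleted)" appended */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	len = readlink(path, target, sizeof(target) - 1);
	if (len > 0)
	{
		target[len] = '\0';
		if (len >= 10 && strcmp(target + len - 10, " (deleted)") == 0)
			fprintf(stderr, "fd %d: %s\n", fd, target);
	}
}

int
main(void)
{
	FILE	   *f = fopen("/tmp/fd-demo", "w");

	if (f == NULL)
		return 1;
	unlink("/tmp/fd-demo");		/* gone from the namespace, fd still valid */
	report_if_deleted(fileno(f));
	fclose(f);
	return 0;
}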

The checks for this would need to be in FileRead/FileWrite/... and direct
calls to pread etc. I think that's an acceptable cost.   I have a patch for
this, but it's currently based on the AIO tree. if others agree it'd be useful
to have, I'll adapt it to work on master.

Comments?

Greetings,

Andres Freund




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-02-09 Thread Joshua Brindle
On Wed, Feb 9, 2022 at 3:58 PM Robert Haas  wrote:
>
> On Wed, Feb 9, 2022 at 1:13 PM Nathan Bossart  
> wrote:
> > I do wonder if users find the differences between predefined roles and role
> > attributes confusing.  INHERIT doesn't govern role attributes, but it will
> > govern predefined roles when this patch is applied.  Maybe the role
> > attribute system should eventually be deprecated in favor of using
> > predefined roles for everything.  Or perhaps the predefined roles should be
> > converted to role attributes.
>
> I couldn't agree more. Apparently it's even confusing to developers,
> because otherwise (1) we wouldn't have the problem the patch proposes
> to fix in the first place and (2) I would have immediately been
> convinced of the value of the patch once it showed up. Since those
> things didn't happen, this is apparently confusing to (1) whoever
> wrote the code that this patch fixes and (2) me.
>

My original patch removed is_member_of to address #1 above, but that
was rejected[1]. There is now a warning in the header beside it to
hopefully dissuade improper usage going forward.

1. https://www.postgresql.org/message-id/254275.1635357633%40sss.pgh.pa.us




Re: decoupling table and index vacuum

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 2:27 PM Peter Geoghegan  wrote:
> We should probably dispense with the idea that we'll be making these
> decisions about what to do with an index like this (bloated in a way
> that bottom-up index deletion just won't help with) in an environment
> that is similar to how the current "skip index scan when # heap pages
> with one or more LP_DEAD items < 2% of rel_pages" thing. That
> mechanism has to be very conservative because we just don't know when
> the next opportunity to vacuum indexes will be -- we almost have to
> assume that the decision will be static, and made exactly once, so we
> better be defensive. But why should that continue to be true with the
> conveyor belt stuff in place, and with periodic mini-vacuums that
> coordinate over time? I don't think it has to be like that. We can
> make it much more dynamic.

I'm not sure that we can. I mean, there's still only going to be ~3
autovacuum workers, and there could be arbitrarily many tables. Even
if the vacuum load is within the bounds of what the system can
sustain, individual tables can't be assured of being visited
frequently (or so it seems to me) and it could be that there are
actually not enough resources to vacuum everything, and we have to try to cope as
best we can. Less unnecessary vacuuming of large indexes can help, of
course, but I'm not sure it fundamentally changes the calculus.

> We will need something like that. I think that LP_DEAD items (or
> would-be LP_DEAD items -- tuples with storage that would get pruned
> into LP_DEAD items if we were to prune) in the table are much more
> interesting than dead heap-only tuples, and also more interesting that
> dead index tuples. Especially the distribution of such LP_DEAD items
> in the table, and their concentration. That does seem much more likely
> to be robust as a quantitative driver of index vacuuming.

Hmm... why would the answer have to do with dead items in the heap? I
was thinking along the lines of trying to figure out either a more
reliable count of dead tuples in the index, subtracting out whatever
we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a
count of the subset of dead tuples that are likely not to get
opportunistically pruned in one way or another, if there's some way to
guess that. Or maybe something where when we see an index page filling
up we try to figure out (or guess) that it's close to really needing a
split - i.e. that it's not full of tuples that we could just junk to
make space - and notice how often that's happening. I realize I'm
hand-waving, but if the property is a property of the heap rather than
the index, how will different indexes get different treatment?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-02-09 Thread Joe Conway

On 2/9/22 13:13, Nathan Bossart wrote:

On Tue, Feb 08, 2022 at 10:54:50PM -0500, Robert Haas wrote:

On Tue, Feb 8, 2022 at 7:38 PM Joe Conway  wrote:

If we were to start all over again with this feature my vote would be to
do things differently than we have done. I would not have called them
predefined roles, and I would have used attributes of roles (e.g. make
rolsuper into a bitmap rather than a boolean) rather than role
membership to implement them. But I didn't find time to participate in
the original discussion or review/write the code, so I have little room
to complain.


Yep, fair. I kind of like the predefined role concept myself. I find
it sort of elegant, mostly because I think it scales better than a
bitmask, which can run out of bits surprisingly rapidly. But opinions
can vary, of course.


I do wonder if users find the differences between predefined roles and role
attributes confusing.  INHERIT doesn't govern role attributes, but it will
govern predefined roles when this patch is applied.  Maybe the role
attribute system should eventually be deprecated in favor of using
predefined roles for everything.  Or perhaps the predefined roles should be
converted to role attributes.


Yep, I was suggesting that the latter would have been preferable to me 
while Robert seemed to prefer the former. Honestly I could be happy with 
either of those solutions, but as I alluded to that is probably a 
discussion for the next development cycle since I don't see us doing 
that big a change in this one.


Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: New developer papercut - Makefile references INSTALL

2022-02-09 Thread Magnus Hagander
On Fri, Jan 21, 2022 at 11:53 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-01-21 17:25:08 -0500, Tom Lane wrote:
> > Perhaps this could be finessed by making updating of INSTALL
> > the responsibility of some post-commit hook on the git server.
> > Not sure that we want to go there, though.  In any case, that
> > approach would negate your point about seeing the results.
>
> It would. I guess it'd still be better than the situation today, but...

post-commit hooks don't run on the git server, they run locally on
your machine. There is a "post receive" hook that runs on the git
server, but we definitely don't want that one to fabricate new commits
I think.

And it certainly cannot *modify* the commit that came in in flight, as
that would change the hash, and basically break the whole integrity of
the commit chain.

We could certainly have a cronjob somewhere that ran to check that
they were in sync and would auto-generate a patch if they weren't, for
a committer to review, but I'm not sure how much that would help?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 1:13 PM Nathan Bossart  wrote:
> I do wonder if users find the differences between predefined roles and role
> attributes confusing.  INHERIT doesn't govern role attributes, but it will
> govern predefined roles when this patch is applied.  Maybe the role
> attribute system should eventually be deprecated in favor of using
> predefined roles for everything.  Or perhaps the predefined roles should be
> converted to role attributes.

I couldn't agree more. Apparently it's even confusing to developers,
because otherwise (1) we wouldn't have the problem the patch proposes
to fix in the first place and (2) I would have immediately been
convinced of the value of the patch once it showed up. Since those
things didn't happen, this is apparently confusing to (1) whoever
wrote the code that this patch fixes and (2) me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: is the base backup protocol used by out-of-core tools?

2022-02-09 Thread Magnus Hagander
On Wed, Feb 9, 2022 at 2:02 PM Robert Haas  wrote:
>
> On Wed, Feb 9, 2022 at 3:14 AM Kyotaro Horiguchi
>  wrote:
> > Thanks for pining.  AFAICS, as you see, pg_rman doesn't talk
> > basebackup protocol (nor even pg_basebackup command) as it supports
> > inremental backup.  So there's no issue about the removal of old
> > syntax on our side.
>
> Cool. Since the verdict here seems pretty unanimous and we've heard
> from a good number of backup tool authors, I'll give this another day
> or so and then commit, unless objections emerge before then.

Late to the game, but here's another +1.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Replacing TAP test planning with done_testing()

2022-02-09 Thread Tom Lane
Dagfinn Ilmari Mannsåker  writes:
> Daniel Gustafsson  writes:
>> The attached patch removes all Test::More planning and instead ensures that all
>> tests conclude with a done_testing() call.  While there, I also removed a few
>> exit(0) calls from individual tests making them more consistent.

> LGTM, +1.

LGTM too.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian  wrote:
> Well, I think the worst case is that the checkpoint happens exactly
> between two checkpoints, so you are checkpointing twice as often, but if
> it happens just before or after a checkpoint, I assume the effect would
> be minimal.

I agree for the most part. I think that if checkpoints happen every 8
minutes normally and the extra checkpoint happens 2 minutes after the
previous checkpoint, the impact may be almost as bad as if it had
happened right in the middle. If it happens 5 seconds after the
previous checkpoint, it should be low impact.

> So, it seems we are weighing having a checkpoint happen in the middle of
> a checkpoint interval vs writing more WAL.  If the WAL traffic, without
> CREATE DATABASE, is high, and the template database is small, writing
> more WAL and skipping the checkpoint will be win, but if the WAL traffic
> is small and the template database is big, the extra WAL will be a loss.
> Is this accurate?

I think that's basically correct. I would expect that the worry about
big template database is mostly about template databases that are
REALLY big. I think if your template database is 10GB you probably
shouldn't be worried about this feature. 10GB of extra WAL isn't
nothing, but if you've got reasonably capable hardware, it's not
overloaded, and max_wal_size is big enough, it's probably not going to
have a huge impact. Also, most of the impact will probably be on the
CREATE DATABASE command itself, and other things running on the system
at the same time will be impacted to a lesser degree. I think it's
even possible that you will be happier with this feature than without,
because you may like the idea that CREATE DATABASE itself is slow more
than you like the idea of it making everything else on the system
slow. On the other hand, if your template database is 1TB, the extra
WAL is probably going to be a fairly big problem.

Basically I think for most people this should be neutral or a win. For
people with really large template databases, it's a loss. Hence the
discussion about having a way for people who prefer the current
behavior to keep it.

> Agreed.  We would want to have a different heap/index key on the standby
> so we can rotate the heap/index key.

I don't like that design, and I don't think that's what we should do,
but I understand that you feel differently. IMHO, this thread is not
the place to hash that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: decoupling table and index vacuum

2022-02-09 Thread Peter Geoghegan
On Wed, Feb 9, 2022 at 6:13 AM Robert Haas  wrote:
> Just to be clear, when I say that the dead index tuples don't matter
> here, I mean from the point of view of the index. From the point of
> view of the table, the presence of dead index tuples (or even the
> potential presence of dead tuples) pointing to dead line pointers is
> an issue that can drive heap bloat. But from the point of view of the
> index, because we don't ever merge sibling index pages, and because we
> have kill_prior_tuple, there's not much value in freeing up space in
> index pages unless it either prevents a split or lets us free the
> whole page. So I agree with Peter that index growth is what really
> matters.

One small caveat that I'd add is this: heap fragmentation likely makes
it harder to avoid page splits in indexes, to a degree. It is arguably
one cause of the page splits that do happen in a table like
pgbench_tellers, with standard pgbench (and lots of throughput, lots
of clients). The tellers (and also branches) primary key tends to
double in size in my recent tests (still way better than a 20x
increase in size, which is what happened in Postgres 11 and maybe even
13). I think that it might be possible to perfectly preserve the
original index size (even with ANALYZE running now and again) by
setting heap fill factor very low, maybe 50 or less.

This is a minor quibble, though. It still makes sense to think of heap
fragmentation as a problem for the heap itself, and not for indexes at
all, since the effect I describe is relatively insignificant, and just
about impossible to model. The problem really is that the heap pages
are failing to hold their original logical rows in place -- the index
size issue is more of a symptom than a problem unto itself.

> However, I have a concern that Peter's idea to use the previous index
> growth to drive future index vacuuming decisions is retrospective
> rather than prospective. If the index is growing more than it should
> based on the data volume, then evidently we didn't do enough vacuuming
> at some point in the past.

That's a very valid concern. As you know, the great advantage about
retrospectively considering what's not going well (and reducing
everything to some highly informative measure like growth in index
size) is that it's reliable -- you don't have to understand precisely
how things got that way, which is just too complicated to get right.
And as you point out, the great disadvantage is that it has already
happened -- which might already be too late.

More on that later...

> It's reasonable to step up our efforts in
> the present to make sure that the problem doesn't continue, but in
> some sense it's already too late. What we would really like is a
> measure that answers the question: is the index going to bloat in the
> relatively near future if we don't vacuum it now? I think that the
> dead tuple count is trying, however imperfectly, to figure that out.
> All other things being equal, the more dead tuples there are in the
> index, the more bloat we're going to have later if we don't clean them
> out now.

One of the key intuitions behind bottom-up index deletion is to treat
the state of an index page as a dynamic thing, not a static thing. The
information that we take from the page that drives our decisions is
very reliable on average, over time, in the aggregate. At the same
time, the information is very noisy, and could be wrong in important
ways at just about any time. The fundamental idea was to take
advantage of the first property, without ever getting killed by the
second property.

To me this seems conceptually similar to how one manages risk when
playing a game of chance while applying the Kelly criterion. The
criterion provides probabilistic certainty on the best strategy to use
in a situation where we have known favorable odds, an infinite series
of bets, and a personal bankroll. I recently came across a great blog
post about it, which gets the idea across well:

https://explore.paulbutler.org/bet/

If you scroll down to the bottom of the page, there are some general
conclusions, some of which are pretty counterintuitive. Especially
"Maximizing expected value can lead to a bad long-run strategy".
Growth over time is much more important than anything else, since you
can play as many individual games as you like -- provided you never go
bankrupt. I think that *a lot* of problems can be usefully analyzed
that way, or at least benefit from a general awareness of certain
paradoxical aspects of probability. Bankruptcy must be recognized as
qualitatively different to any non-zero bankroll. A little bit like
how splitting a leaf page unnecessarily is truly special.

Once we simply avoid ruin, we can get the benefit of playing a game
with favorable odds -- growth over time is what matters, not
necessarily our current bankroll. It sounds like a fairly small
difference, but that's deceptive -- it's a huge difference.
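For reference, the textbook statement of the criterion (my addition, not
something from this thread): staking a fraction f of the bankroll on a bet
that pays b-to-1 and wins with probability p, the expected log-growth per
bet is

    E[\log W] = p \log(1 + f b) + (1 - p) \log(1 - f)

which is maximized at

    f^* = p - \frac{1 - p}{b}

Maximizing expected wealth instead drives f to 1, which guarantees eventual
bankruptcy over repeated bets -- exactly the "bad long-run strategy" the
page warns about.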

> The problem is not with that core idea, which 

Re: Avoiding smgrimmedsync() during nbtree index builds

2022-02-09 Thread Melanie Plageman
Hi,
v5 attached and all email feedback addressed below

On Thu, Jan 13, 2022 at 12:18 PM Robert Haas  wrote:
>
> On Wed, Sep 29, 2021 at 2:36 PM Melanie Plageman
>  wrote:
> > unbuffered_write() and unbuffered_extend() might be able to be used even
> > if unbuffered_prep() and unbuffered_finish() are not used -- for example
> > hash indexes do something I don't entirely understand in which they call
> > smgrextend() directly when allocating buckets but then initialize the
> > new bucket pages using the bufmgr machinery.
>
> My first thought was that someone might do this to make sure that we
> don't run out of disk space after initializing some but not all of the
> buckets. Someone might have some reason for wanting to avoid that
> corner case. However, in _hash_init() that explanation doesn't make
> any sense, because an abort would destroy the entire relation. And in
> _hash_alloc_buckets() the variable "zerobuf" points to a buffer that
> is not, in fact, all zeroes. So my guess is this is just old, crufty
> code - either whatever reasons somebody had for doing it that way are
> no longer valid, or there wasn't any good reason even at the time.

I notice in the comment before _hash_alloc_buckets() is called, it says

/*
* We treat allocation of buckets as a separate WAL-logged action.
* Even if we fail after this operation, won't leak bucket pages;
* rather, the next split will consume this space. In any case, even
* without failure we don't use all the space in one split operation.
*/

Does this mean that it is okay that these pages are written outside of
shared buffers and, though skipFsync is passed as false, a checkpoint
starting and finishing between writing the WAL and
register_dirty_segment() followed by a crash could result in lost data?

On Thu, Jan 13, 2022 at 10:52 AM Justin Pryzby  wrote:
> I think the ifndef should be outside the includes:

Thanks, fixed!

On Sun, Jan 16, 2022 at 3:26 PM Justin Pryzby  wrote:
> Separate from this issue, I wonder if it'd be useful to write a DEBUG log
> showing when btree uses shared_buffers vs fsync.  And a regression test which
> first SETs client_min_messages=debug to capture the debug log to demonstrate
> when/that new code path is being hit.  I'm not sure if that would be good to
> merge, but it may be useful for now.

I will definitely think about doing this.

On Mon, Jan 17, 2022 at 12:22 PM Justin Pryzby  wrote:
>
> On Sun, Jan 16, 2022 at 02:25:59PM -0600, Justin Pryzby wrote:
> > On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
> > > This is failing on windows CI when I use initdb --data-checksums, as
> > > attached.
> > >
> > > https://cirrus-ci.com/task/5612464120266752
> > > https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
> > >
> > > +++ c:/cirrus/src/test/regress/results/bitmapops.out	2022-01-13 00:47:46.704621200 +
> > > ..
> > > +ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes
> >
> > The failure isn't consistent, so I double checked my report.  I have
> > some more details:
> >
> > The problem occurs maybe only ~25% of the time.
> >
> > The issue is in the 0001 patch.
> >
> > data-checksums isn't necessary to hit the issue.
> >
> > errlocation says: LOCATION:  mdread, md.c:686 (the only place the error
> > exists)
> >
> > With Andres' windows crash patch, I obtained a backtrace - attached.
> > https://cirrus-ci.com/task/5978171861368832
> > https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt
> >
> > Maybe it's a race condition or synchronization problem that nowhere
> > else tends to hit.
>
> I meant to say that I had not seen this issue anywhere but windows.
>
> But now, by chance, I still had the 0001 patch in my tree, and hit the same
> issue on linux:
>
> https://cirrus-ci.com/task/4550618281934848
> +++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tuplesort.out	2022-01-17 16:06:35.759108172 +
>  EXPLAIN (COSTS OFF)
>  SELECT id, noabort_increasing, noabort_decreasing FROM abbrev_abort_uuids ORDER BY noabort_increasing LIMIT 5;
> +ERROR:  could not read block 0 in file "base/16387/t3_36794": read only 0 of 8192 bytes

Yes, I think this is due to the problem Andres mentioned with my saving
the SMgrRelation and then trying to use it after a relcache flush. The
new patch version addresses this by always re-executing
RelationGetSmgr() as recommended in the comments.
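For what it's worth, a minimal sketch of the two patterns (RelationGetSmgr()
and smgrextend() are the real interfaces from rel.h/smgr.h; the wrapper
function, block numbers, and buffer here are invented for illustration):

static void
extend_two_blocks(Relation rel, BlockNumber blkno, char *zerobuf)
{
	/*
	 * Unsafe pattern, roughly the hazard described above:
	 *
	 *     SMgrRelation smgr = RelationGetSmgr(rel);
	 *     smgrextend(smgr, MAIN_FORKNUM, blkno, zerobuf, true);
	 *     ... anything that processes relcache invalidations ...
	 *     smgrextend(smgr, MAIN_FORKNUM, blkno + 1, zerobuf, true);  -- stale
	 *
	 * Safe pattern: re-execute RelationGetSmgr() at every call site, since
	 * it cheaply (re)opens rel->rd_smgr on demand.
	 */
	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, blkno, zerobuf, true);
	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, blkno + 1, zerobuf, true);
}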

On Sun, Jan 23, 2022 at 4:55 PM Andres Freund  wrote:
> On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> > On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> >  wrote:
> > Thus, the backend must ensure that
> > either the Redo pointer has not moved or that the data is fsync'd before
> > freeing the page.
>
> "freeing"?

Yes, I agree this wording was confusing/incorrect. I meant before it
moves on 

Re: How to get started with contribution

2022-02-09 Thread Daniel Gustafsson
> On 9 Feb 2022, at 18:13, 603_Annada Dash wrote:

> Respected Sir/Ma'am,
> I am Annada Dash, a computer science undergraduate, currently about to finish 
> the first semester of University (MKSSS Cummins College of Engineering for 
> Women, Pune). I am new to open source contributions but I am well aware of 
> python, C and R. I would love to contribute to your organization, but could 
> you please tell me how to get started?
> Hoping to hear from you soon.

See this recent thread from this list with some information on this:

https://www.postgresql.org/message-id/flat/CALyR1r4m%2Bhd17ubDQXzBFghXEL_0UQJtEaABb2D-hcrSNj7Y_w%40mail.gmail.com

--
Daniel Gustafsson   https://vmware.com/





Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Bruce Momjian
On Wed, Feb  9, 2022 at 11:00:06AM -0500, Robert Haas wrote:
> Try running pgbench with the --progress option and enough concurrent
> jobs to keep a moderately large system busy and watching what happens
> to the tps each time a checkpoint occurs. It's extremely dramatic, or
> at least it was the last time I ran such tests. I think that
> performance will sometimes drop by a factor of five or more when the
> checkpoint hits, and take multiple minutes to recover.
> 
> I think your statement that doing an extra checkpoint "just means the
> next checkpoint will do less work" is kind of misleading. That's
> certainly true in some situations. But when the same pages are being
> dirtied over and over again, an extra checkpoint often means that the
> system will do MUCH MORE work, because every checkpoint triggers a new
> set of full-page writes over the actively-updated portion of the
> database.
> 
> I think that very few people run systems with heavy write workloads
> with checkpoint_timeout=5m, precisely because of this issue. Almost
> every system I see has had that raised to at least 10m and sometimes
> 30m or more. It can make a massive difference.

Well, I think the worst case is that the checkpoint happens exactly
between two checkpoints, so you are checkpointing twice as often, but if
it happens just before or after a checkpoint, I assume the effect would
be minimal.

So, it seems we are weighing having a checkpoint happen in the middle of
a checkpoint interval vs writing more WAL.  If the WAL traffic, without
CREATE DATABASE, is high, and the template database is small, writing
more WAL and skipping the checkpoint will be win, but if the WAL traffic
is small and the template database is big, the extra WAL will be a loss.
Is this accurate?

> I can't predict whether PostgreSQL will get TDE in the future, and if
> it does, I can't predict what form it will take. Therefore any strong
> statement about whether this will benefit TDE or not seems to me to be
> pretty questionable - we don't know that it will be useful, and we

Agreed.  We would want to have a different heap/index key on the standby
so we can rotate the heap/index key.

> don't know that it won't. But, like Dilip, I think the way we're
> WAL-logging CREATE DATABASE right now is a hack, and I *know* it can

Yes, it is a hack, but it seems to be a clever one that we might have
chosen if it had not been part of the original system.

> cause massive performance drops on busy systems.

See above.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Identify missing publications from publisher while create/alter subscription.

2022-02-09 Thread Euler Taveira
On Wed, Feb 9, 2022, at 12:06 PM, Ashutosh Sharma wrote:
> Just wondering if we should also be detecting the incorrect conninfo
> set with ALTER SUBSCRIPTION command as well. See below:
> 
> -- try creating a subscription with incorrect conninfo. the command fails.
> postgres=# create subscription sub1 connection 'host=localhost
> port=5490 dbname=postgres' publication pub1;
> ERROR:  could not connect to the publisher: connection to server at
> "localhost" (::1), port 5490 failed: Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> connection to server at "localhost" (127.0.0.1), port 5490 failed:
> Connection refused
> Is the server running on that host and accepting TCP/IP connections?
That's because the 'connect' parameter is true by default.

The important routine for all SUBSCRIPTION commands that handle a connection
string is to validate it, e.g., check that all parameters are correct. See
walrcv_check_conninfo, which calls PQconninfoParse.

The connection string is syntactically correct. Hence, no error. It could be
the case that the service is temporarily down. It is a useful and common
scenario that I wouldn't want to forbid.
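
To illustrate the syntax-only nature of that check, here is a minimal
standalone libpq program (my example, not PostgreSQL code):
PQconninfoParse() succeeds even though nothing is listening on port 5490.

/* build: cc conninfo_check.c -lpq */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	char	   *errmsg = NULL;
	PQconninfoOption *opts =
		PQconninfoParse("host=localhost port=5490 dbname=postgres", &errmsg);

	if (opts == NULL)
	{
		/* syntactically invalid conninfo; errmsg explains why */
		fprintf(stderr, "invalid conninfo: %s\n", errmsg ? errmsg : "?");
		PQfreemem(errmsg);
		return 1;
	}

	/* parsing succeeded -- no connection was attempted */
	printf("conninfo is syntactically valid\n");
	PQconninfoFree(opts);
	return 0;
}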

> -- reset the connninfo in the subscription to some wrong value. the
> command succeeds.
> postgres=# alter subscription sub1 connection 'host=localhost
> port=5490 dbname=postgres';
> ALTER SUBSCRIPTION
> postgres=#
> 
> postgres=# drop subscription sub1;
> ERROR:  could not connect to publisher when attempting to drop
> replication slot "sub1": connection to server at "localhost" (::1),
> port 5490 failed: Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> connection to server at "localhost" (127.0.0.1), port 5490 failed:
> Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> HINT:  Use ALTER SUBSCRIPTION ... SET (slot_name = NONE) to
> disassociate the subscription from the slot.
Again, dropping a subscription that is associated with a replication slot
requires a connection to remove the replication slot. If the publisher is gone
(and so the replication slot), follow the HINT advice.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Unnecessary call to resetPQExpBuffer in getIndexes

2022-02-09 Thread Nathan Bossart
On Wed, Feb 09, 2022 at 10:50:07AM +0800, Julien Rouhaud wrote:
> I just noticed that e2c52beecd (adding PeterE in Cc) added a 
> resetPQExpBuffer()
> which seems unnecessary since the variable is untouched since the initial
> createPQExpBuffer().
> 
> Simple patch attached.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-02-09 Thread Nathan Bossart
On Tue, Feb 08, 2022 at 10:54:50PM -0500, Robert Haas wrote:
> On Tue, Feb 8, 2022 at 7:38 PM Joe Conway  wrote:
>> If we were to start all over again with this feature my vote would be to
>> do things differently than we have done. I would not have called them
>> predefined roles, and I would have used attributes of roles (e.g. make
>> rolsuper into a bitmap rather than a boolean) rather than role
>> membership to implement them. But I didn't find time to participate in
>> the original discussion or review/write the code, so I have little room
>> to complain.
> 
> Yep, fair. I kind of like the predefined role concept myself. I find
> it sort of elegant, mostly because I think it scales better than a
> bitmask, which can run out of bits surprisingly rapidly. But opinions
> can vary, of course.

I do wonder if users find the differences between predefined roles and role
attributes confusing.  INHERIT doesn't govern role attributes, but it will
govern predefined roles when this patch is applied.  Maybe the role
attribute system should eventually be deprecated in favor of using
predefined roles for everything.  Or perhaps the predefined roles should be
converted to role attributes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Typo in archive modules docs

2022-02-09 Thread Nathan Bossart
On Wed, Feb 09, 2022 at 03:04:34PM +0100, Daniel Gustafsson wrote:
> Spotted a typo when reading the new archive modules documentation, will apply
> the attached shortly to fix.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




How to get started with contribution

2022-02-09 Thread 603_Annada Dash
Respected Sir/Ma'am,
I am Annada Dash, a computer science undergraduate, currently about to
finish the first semester of University (MKSSS Cummins College of
Engineering for Women, Pune). I am new to open source contributions but I
am well aware of python, C and R. I would love to contribute to your
organization, but could you please tell me how to get started?
Hoping to hear from you soon.
Regards
Annada




Re: make MaxBackends available in _PG_init

2022-02-09 Thread Nathan Bossart
On Wed, Feb 09, 2022 at 12:31:16PM -0500, Robert Haas wrote:
> On Wed, Feb 9, 2022 at 12:18 AM Michael Paquier  wrote:
>> Okay, I'd rather apply the same rule everywhere for consistency, then,
>> like in the attached.  That's minimal, still.
> 
> That's fine with me. In the interest of full disclosure, I did kind of
> notice this when reviewing the patch, though perhaps not every
> instance, and just decided that it didn't seem important enough to
> worry about. I'm totally OK with you thinking otherwise, though,
> especially since you also volunteered to do the work thus generated.
> :-)

This is fine with me as well.  I only left these out because the extra
variable felt unnecessary to me for these functions.

While you are at it, would you mind fixing the misspelling of
OldestVisibleMXactId in multixact.c (as per the attached)?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index f00c272521..6a70d49738 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -285,7 +285,7 @@ typedef struct MultiXactStateData
  * Pointers to the state data in shared memory
  *
  * The index of the last element of the OldestMemberMXactId and
- * OldestVisibleMXacId arrays can be obtained with GetMaxOldestSlot().
+ * OldestVisibleMXactId arrays can be obtained with GetMaxOldestSlot().
  */
 static MultiXactStateData *MultiXactState;
 static MultiXactId *OldestMemberMXactId;


Re: Plug minor memleak in pg_dump

2022-02-09 Thread Ranier Vilela
> No, but I was distracted by other things leaving this on the TODO list.
> It's been pushed now.

Hi,

IMO we still have a problem here.

ReadStr can return NULL, so the fix can crash.

regards,

Ranier Vilela


v1_fix_possible_null_dereference_pg_backup_archiver.patch
Description: Binary data


Re: make MaxBackends available in _PG_init

2022-02-09 Thread Nathan Bossart
On Tue, Feb 08, 2022 at 04:12:26PM -0500, Robert Haas wrote:
> After some investigation I've determined that it's no longer Friday
> afternoon.

This matches my analysis.

> I also spent time investigating whether the patch had
> problems that would make me uncomfortable with the idea of committing
> it, and I did not find any. So, I committed it.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: make MaxBackends available in _PG_init

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 12:18 AM Michael Paquier  wrote:
> Okay, I'd rather apply the same rule everywhere for consistency, then,
> like in the attached.  That's minimal, still.

That's fine with me. In the interest of full disclosure, I did kind of
notice this when reviewing the patch, though perhaps not every
instance, and just decided that it didn't seem important enough to
worry about. I'm totally OK with you thinking otherwise, though,
especially since you also volunteered to do the work thus generated.
:-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Add reloption for views to enable RLS

2022-02-09 Thread walther

Laurenz Albe:

So even though the view owner "duff" has no permissions
on the schema "viewtest", we can still select from the table.
Permissions on the schema containing the table are not
checked, only permissions on the table itself.

[...]

If not, I don't know if it is the business of this patch to
change the behavior.


Ah, good find. In that case, I suggest changing the docs slightly to 
say that the schema will not be checked.


In one place it's described as "it will cause all access to the 
underlying tables to be checked as ..." which is fine, I think. But in 
another place it's "access to tables, functions and *other objects* 
referenced in the view, ..." which is misleading.



I agree that the name "security_invoker" is suggestive of SECURITY INVOKER
in CREATE FUNCTION, but the behavior is different.
Perhaps the solution is as simple as choosing a different name that does
not prompt this association, for example "permissions_invoker".


Yes, given that there is not much that can be done about the 
functionality anymore, a different name would be better. This would also 
avoid the implicit "if security_invoker=false, the view behaves like 
SECURITY DEFINER" association, which is also clearly wrong. And this 
assumption is actually what made me think the chained views example was 
somehow off.


I am not convinced "permissions_invoker" is much better, though. The 
difference between SECURITY INVOKER and SECURITY DEFINER is invoker vs 
definer... where, I think, we need something else to describe what we 
currently have and what the patch provides.


Maybe we can look at it from the other perspective: Both ways of 
operating keep the CURRENT_USER the same, pretty much like what we 
understand "security invoker" should do. The difference, however, is the 
current default, in which permissions are checked as the view *owner*. 
Let's treat that difference as the thing that can be set: 
security_owner=true|false, or run_as_owner=true|false.


xxx_owner=true would be the default and xxx_owner=false could be set 
explicitly to get the behavior we are looking for in this patch?




I guess more documentation on how this works would be a good idea.
[...] but since it exposes this
in new ways, it might as well clarify how all this works.


+1

Best

Wolfgang




Re: Database-level collation version tracking

2022-02-09 Thread Alvaro Herrera
On 2022-Feb-08, Julien Rouhaud wrote:

> On Tue, Feb 08, 2022 at 12:14:02PM +0100, Peter Eisentraut wrote:

> > I don't think you can run ALTER DATABASE from the regression test scripts,
> > since the database name is not fixed.  You'd have to paste the command
> > together using psql tricks or something.  I guess it could be done, but
> > maybe there is a better solution.  We could put it into the createdb test
> > suite, or write a new TAP test suite somewhere.  There is no good precedent
> > for this.
> 
> I was thinking of using a simple DO command, as there are already some other
> usages of that for non-deterministic things (like TOAST tables).  If it's too
> problematic I'm fine with not having tests for that for now.

You can do this:

select current_database() as datname \gset
alter database :"datname" owner to foo;

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Robert Haas
On Tue, Feb 8, 2022 at 11:47 PM Dilip Kumar  wrote:
> Now, one bigger question is can we proceed with this patch without
> fixing [2], IMHO, if we are deciding to keep the old method also
> intact then one option could be that for now only change CREATE
> DATABASE to support both old and new way of creating database and for
> time being leave the ALTER DATABASE SET TABLESPACE alone and let it
> work only with the old method?  And another option is that we first
> fix the issue related to the tombstone file and then come back to
> this?
>
> IMHO, the first option could be better in a way that we have already
> made better progress in this patch and this is in better shape than
> the other patch we are trying to make for removing the tombstone
> files.

Yeah, it's getting quite close to the end of this release cycle. I'm
not sure whether we can get anything committed here at all in the time
we have remaining, but I agree with you that this patch seems like a
better prospect than that one.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Andrew Dunstan


On 2/9/22 10:58, Dilip Kumar wrote:
> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan  wrote:
>>
>> On 6/16/21 03:52, Dilip Kumar wrote:
>>> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan  wrote:
 Rather than use size, I'd be inclined to say use this if the source
 database is marked as a template, and use the copydir approach for
 anything that isn't.
>>> Yeah, that is possible; another thought: wouldn't it be good to
>>> give the user control by providing two different commands, e.g.
>>> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
>>> for the new method (fully WAL-logged)?
>>>
>>
>> This proposal seems to have gotten lost.
> Yeah, I am planning to work on this part so that we can support both methods.
>

OK, many thanks.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Database-level collation version tracking

2022-02-09 Thread Peter Eisentraut

On 08.02.22 13:55, Julien Rouhaud wrote:

Apart from that I still think that we should check the collation version of the
source database when creating a new database.  It won't cost much but will give
the DBA a chance to recreate the indexes before risking invalid index usage.


A question on this:  In essence, this would be putting code into 
createdb() similar to the code in postinit.c:CheckMyDatabase().  But 
what should we make it do and say exactly?


After thinking about this for a bit, I suggest: If the actual collation 
version of the newly created database does not match the recorded 
collation version of the template database, we should error out and make 
the user fix the template database (by reindexing there etc.).


The alternative is to warn, as it does now in postinit.c.  But then the 
phrasing of the message becomes complicated: Should we make the user fix 
the new database or the template database or both?  And if they don't 
fix the template database, they will have the same problem again.  So 
making it a hard failure seems better to me.
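
For example, the flow could look roughly like this (the error wording is
only illustrative):

  CREATE DATABASE mydb TEMPLATE template1;
  -- ERROR: collation version mismatch in template database "template1"
  -- fix the template first, then retry:
  \c template1
  REINDEX DATABASE template1;
  ALTER DATABASE template1 REFRESH COLLATION VERSION;
  CREATE DATABASE mydb TEMPLATE template1;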





Re: [PATCH] Add reloption for views to enable RLS

2022-02-09 Thread Laurenz Albe
On Fri, 2022-02-04 at 22:28 +0100, walt...@technowledgy.de wrote:
> This is a feature I have long been looking for. I tested the patch (v5) 
> and found two cases that I feel need to be either fixed or documented 
> explicitly.

Thanks for testing and weighing in!

> Case 1 - Schema privileges:
> 
> create schema a;
> create table a.t();
> 
> create schema b;
> create view b.v with (security_invoker=true) as table a.t;
> 
> create role alice;
> grant usage on schema b to alice; -- missing schema a
> grant select on table a.t, b.v to alice;
> 
> set role alice;
> table a.t; -- ERROR: permission denied for schema a (good)
> table b.v; -- no error (good or bad?)
> 
> User alice does not have USAGE privileges on schema a, but only on table 
> a.t. A SELECT directly on the table fails as expected, but a SELECT on 
> the view succeeds. I assume the schema access is checked when the query 
> is parsed - and at that stage, the user is still the view owner?
> The docs mention explicitly that *all* objects are accessed with invoker 
> privileges, which is not the case.
> 
> Personally I actually like this. It allows keeping a view-based API in a
> separate schema, while:
> - preserving full RLS capabilities and
> - forcing the user to go through the api, because a direct access to the 
> data schema is not possible.
> 
> However, since this behavior was likely unintended until now, it raises 
> the question whether there are any other privilege checks that are not 
> taking the invoking user into account properly?

This behavior is not new:

 CREATE SCHEMA viewtest;

 CREATE ROLE duff LOGIN;
 CREATE ROLE jock LOGIN;

 CREATE TABLE viewtest.tab (id integer);
 GRANT SELECT ON viewtest.tab TO duff;

 CREATE VIEW v AS SELECT * FROM viewtest.tab;
 ALTER VIEW v OWNER TO duff;
 GRANT SELECT ON v TO jock;

 SET ROLE jock;

 SELECT * FROM v;
  id 
 ----
 (0 rows)

So even though the view owner "duff" has no permissions
on the schema "viewtest", we can still select from the table.
Permissions on the schema containing the table are not
checked, only permissions on the table itself.

I am not sure how to feel about this.  It is not what I would have
expected, but changing it would be a compatibility break.
Should this be considered a live bug in PostgreSQL?

If not, I don't know if it is the business of this patch to
change the behavior.

> Case 2 - Chained views:
> 
> create schema a;
> create table a.t();
> 
> create role bob;
> grant create on database postgres to bob;
> grant usage on schema a to bob;
> set role bob;
> create schema b;
> create view b.v1 with (security_invoker=true) as table a.t;
> create view b.v2 with (security_invoker=false) as table b.v1;
> 
> reset role;
> create role alice;
> grant usage on schema a, b to alice;
> grant select on table a.t to bob;
> grant select on table b.v2 to alice;
> 
> set role alice;
> table b.v2; -- ERROR: permission denied for table t (bad)
> 
> When alice runs the SELECT on b.v2, the query on b.v1 is made with bob's
> privileges, as the view owner of b.v2. This is verified, because alice
> does not have privileges to access b.v1, but no such error is thrown.
> 
> b.v1 will then access a.t - and my first assumption was, that in this 
> case a.t should be accessed by bob, still as the view owner of b.v2. 
> Clearly, this is not the case as the permission denied error shows.
> 
> This is not actually a problem with this patch, I think, but just 
> highlighting a quirk in the current implementation of views 
> (security_invoker=false) in general: While the query will be run with 
> the view owner, the CURRENT_USER is still the invoker, even "after" the 
> view. In other words, the current implementation is *not* the same as 
> "security definer". It's somewhere between "security definer" and 
> "security invoker" - a strange mix really.

Right.  Even though permissions on "v1" are checked for user "bob",
permissions on the table are checked for the current user, which remains
"alice".

I agree that the name "security_invoker" is suggestive of SECURITY INVOKER
in CREATE FUNCTION, but the behavior is different.
Perhaps the solution is as simple as choosing a different name that does
not prompt this association, for example "permissions_invoker".
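
For illustration, that would read (the name is purely hypothetical at this
point):

  CREATE VIEW b.v WITH (permissions_invoker = true) AS TABLE a.t;

i.e. the same semantics the patch implements, just without suggesting that
CURRENT_USER changes inside the view.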

> Afaik this mix is not documented explicitly so far. But the 
> security_invoker reloption exposes it in a much less expected way, so I 
> only see two options really:
> a) make the current implementation of security_invoker=false a true 
> "security definer", i.e. change the CURRENT_USER "after" the view for good.
> b) document the "security infiner/devoker" default behavior as a feature.
> 
> I really like a), as this would make a clear cut between security 
> definer and security invoker views - but this would be a big breaking 
> change, which I don't think is acceptable.

I agree that changing the current behavior is not acceptable.

I guess more documentation on how this works would be a good idea.
Not sure if this is the job of this patch, though.

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar  wrote:
> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan  wrote:
> > On 6/16/21 03:52, Dilip Kumar wrote:
> > > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan  
> > > wrote:
> > >> Rather than use size, I'd be inclined to say use this if the source
> > >> database is marked as a template, and use the copydir approach for
> > >> anything that isn't.
> > > Yeah, that is possible; another thought: wouldn't it be good to
> > > give the user control by providing two different commands, e.g.
> > > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > > for the new method (fully WAL-logged)?
> >
> > This proposal seems to have gotten lost.
>
> Yeah, I am planning to work on this part so that we can support both methods.

But can we pick a different syntax? In my view this should be an
option to CREATE DATABASE rather than a whole new command.
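
Something like this, say (the spelling is purely illustrative, not a
proposal for the exact grammar):

  CREATE DATABASE newdb TEMPLATE olddb STRATEGY wal_log;    -- new method
  CREATE DATABASE newdb TEMPLATE olddb STRATEGY file_copy;  -- current method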

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 9:19 AM Bruce Momjian  wrote:
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was a problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work.  It is hard to see how avoiding
> checkpoints at the cost of added WAL writes, fsyncs, and replication
> traffic could be a win.

Try running pgbench with the --progress option and enough concurrent
jobs to keep a moderately large system busy, and watch what happens
to the tps each time a checkpoint occurs. It's extremely dramatic, or
at least it was the last time I ran such tests. I think that
performance will sometimes drop by a factor of five or more when the
checkpoint hits, and take multiple minutes to recover.

I think your statement that doing an extra checkpoint "just means the
next checkpoint will do less work" is kind of misleading. That's
certainly true in some situations. But when the same pages are being
dirtied over and over again, an extra checkpoint often means that the
system will do MUCH MORE work, because every checkpoint triggers a new
set of full-page writes over the actively-updated portion of the
database.

I think that very few people run systems with heavy write workloads
with checkpoint_timeout=5m, precisely because of this issue. Almost
every system I see has had that raised to at least 10m and sometimes
30m or more. It can make a massive difference.
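
For example, a busy write-heavy system might run with something like this
(values illustrative only, not a recommendation):

  ALTER SYSTEM SET checkpoint_timeout = '30min';
  ALTER SYSTEM SET max_wal_size = '16GB';
  ALTER SYSTEM SET checkpoint_completion_target = 0.9;
  SELECT pg_reload_conf();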

> I see the patch justification outlined here:
>
> 
> https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=bm+...@mail.gmail.com
>
> TDE is mentioned as a value for this patch, but I don't see why it is
> needed --- TDE can easily decrypt/encrypt the pages while they are
> copied.

That's true, but depending on what other design decisions we make,
WAL-logging it might be a problem.

Right now, when someone creates a new database, we log a single record
that basically says "go copy the directory". That's very different
than what we normally do, which is to log changes to individual pages,
or where required, small groups of pages (e.g. a single WAL record is
written for an UPDATE even though it may touch two pages). The fact
that in this case we only log a single WAL record for an operation
that could touch an unbounded amount of data is why this needs special
handling around checkpoints. It also introduces a certain amount of
fragility into the system, because if for some reason the source
directory on the standby doesn't exactly match the source directory on
the primary, the new databases won't match either. Any errors that
creep into the process can be propagated around to other places by a
system like this. However, ordinarily that doesn't happen, which is
why we've been able to use this system successfully for so many years.

The other reason we've been able to use this successfully is that
we're confident that we can perform exactly the same operation on the
standby as we do on the primary knowing only the relevant directory
names. If we say "copy this directory to there" we believe we'll be
able to do that exactly the same way on the standby. Is that still
true with TDE? Well, it depends. If the encryption can be performed
knowing only the key and the identity of the block (database OID,
tablespace OID, relfilenode, fork, block number) then it's true. But
if the encryption needs to, for example, generate a random nonce for
each block, then it's false. If you want the standby to be an exact
copy of the master in a system where new blocks get random nonces,
then you need to replicate the copy block-by-block, not as one
gigantic operation, so that you can log the nonce you picked for each
block. On the other hand, maybe you DON'T want the standby to be an
exact copy of the master. If, for example, you imagine a system where
the master and standby aren't even using the same key, then this is a
lot less relevant.

I can't predict whether PostgreSQL will get TDE in the future, and if
it does, I can't predict what form it will take. Therefore any strong
statement about whether this will benefit TDE or not seems to me to be
pretty questionable - we don't know that it will be useful, and we
don't know that it won't. But, like Dilip, I think the way we're
WAL-logging CREATE DATABASE right now is a hack, and I *know* it can
cause massive performance drops on busy systems.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Dilip Kumar
On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan  wrote:
>
>
> On 6/16/21 03:52, Dilip Kumar wrote:
> > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan  wrote:
> >> Rather than use size, I'd be inclined to say use this if the source
> >> database is marked as a template, and use the copydir approach for
> >> anything that isn't.
> > Yeah, that is possible; another thought: wouldn't it be good to
> > give the user control by providing two different commands, e.g.
> > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > for the new method (fully WAL-logged)?
> >
>
>
> This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Dilip Kumar
On Wed, Feb 9, 2022 at 7:49 PM Bruce Momjian  wrote:
>
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was a problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work.  It is hard to see how avoiding
> checkpoints at the cost of added WAL writes, fsyncs, and replication
> traffic could be a win.

But don't you think that the current way of WAL logging the CREATE
DATABASE is a bit hacky?  I mean we are just logically WAL logging the
source and destination directory paths without actually WAL logging
what content we want to copy.  IMHO this is against the basic
principle of WAL and that's the reason we are forcefully checkpointing
to avoid replaying that WAL during crash recovery.  Even so, some of
the code comments say that we have limitations during PITR [1] that we
want to remove sometime in the future.

[1]
* In PITR replay, the first of these isn't an issue, and the second
* is only a risk if the CREATE DATABASE and subsequent template
* database change both occur while a base backup is being taken.
* There doesn't seem to be much we can do about that except document
* it as a limitation.
*
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Andrew Dunstan


On 6/16/21 03:52, Dilip Kumar wrote:
> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan  wrote:
>> Rather than use size, I'd be inclined to say use this if the source
>> database is marked as a template, and use the copydir approach for
>> anything that isn't.
> Yeah, that is possible; another thought: wouldn't it be good to
> give the user control by providing two different commands, e.g.
> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> for the new method (fully WAL-logged)?
>


This proposal seems to have gotten lost.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] Improve function toast_delete_external (src/backend/access/table/toast_helper.c)

2022-02-09 Thread Alvaro Herrera
Our style is that we group all declarations in a block to appear at the
top.  We don't mix declarations and statements.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Tomas Vondra

On 2/9/22 06:41, Noah Misch wrote:


The explanation was more boring than that.  v13 and earlier have an additional
InvalidateSystemCaches() call site, which I neglected to update.  Here's the
fix I intend to push.


I tried this patch on 10 and 13, and it seems to fix the issue. So +1.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Identify missing publications from publisher while create/alter subscription.

2022-02-09 Thread Ashutosh Sharma
Just wondering if we should also detect incorrect conninfo
set with the ALTER SUBSCRIPTION command. See below:

-- try creating a subscription with incorrect conninfo. the command fails.
postgres=# create subscription sub1 connection 'host=localhost
port=5490 dbname=postgres' publication pub1;
ERROR:  could not connect to the publisher: connection to server at
"localhost" (::1), port 5490 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5490 failed:
Connection refused
Is the server running on that host and accepting TCP/IP connections?
postgres=#
postgres=#

-- this time the conninfo is correct and the command succeeded.
postgres=# create subscription sub1 connection 'host=localhost
port=5432 dbname=postgres' publication pub1;
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
postgres=#
postgres=#

-- reset the conninfo in the subscription to some wrong value. the
command succeeds.
postgres=# alter subscription sub1 connection 'host=localhost
port=5490 dbname=postgres';
ALTER SUBSCRIPTION
postgres=#

postgres=# drop subscription sub1;
ERROR:  could not connect to publisher when attempting to drop
replication slot "sub1": connection to server at "localhost" (::1),
port 5490 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5490 failed:
Connection refused
Is the server running on that host and accepting TCP/IP connections?
HINT:  Use ALTER SUBSCRIPTION ... SET (slot_name = NONE) to
disassociate the subscription from the slot.

==

When creating a subscription we do connect to the publisher node, hence
incorrect connection info gets detected. But that's not the case
with ALTER SUBSCRIPTION.
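
In the meantime, one can at least sanity-check a conninfo by hand before
handing it to ALTER SUBSCRIPTION, e.g. with dblink (assuming that
extension is installed):

  CREATE EXTENSION IF NOT EXISTS dblink;
  SELECT dblink_connect('check', 'host=localhost port=5490 dbname=postgres');
  -- errors out ("could not establish connection") for a bad conninfo
  SELECT dblink_disconnect('check');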

--
With Regards,
Ashutosh Sharma.

On Sat, Nov 13, 2021 at 6:27 PM Bharath Rupireddy
 wrote:
>
> On Sat, Nov 13, 2021 at 12:50 PM vignesh C  wrote:
> >
> > Thanks for the comments, the attached v13 patch has the fixes for the same.
>
> Thanks for the updated v13 patch. I have no further comments, it looks
> good to me.
>
> Regards,
> Bharath Rupireddy.
>
>




Re: [PATCH] Improve function toast_delete_external (src/backend/access/table/toast_helper.c)

2022-02-09 Thread Dong Wook Lee
Oops, I realized that I made a mistake in my patch and now I have corrected it.

On Wed, Feb 9, 2022 at 11:33 PM Dong Wook Lee wrote:
>
> Hi,
>
> I've applied your patch and I think it's reasonable,
> but IMHO it looks harder to read because of the many conditions
> in the if statement.
> So what about just moving the if statement up?
>
> Thanks.
> Dong Wook Lee
>
> On Wed, Feb 9, 2022 at 7:56 PM Ranier Vilela wrote:
> >
> > Hi,
> >
> > I think this change can improve this particular function by avoiding
> > touching the value when not needed.
> > Testing for !isnull is cheaper than testing whether TupleDescAttr is -1.
> >
> > best regards,
> >
> > Ranier Vilela


v3_tiny_improvemnt_toast_helper.patch
Description: Binary data


Re: Replacing TAP test planning with done_testing()

2022-02-09 Thread Dagfinn Ilmari Mannsåker
Daniel Gustafsson  writes:

> Whether or not to explicitly plan the number of TAP tests per suite has been
> discussed a number of times on this list, often as a side-note in an unrelated
> thread which adds/modifies a test.  The consensus has so far weighed towards
> not doing manual bookkeeping of test plans but to let Test::More deal with it.
> So far, no concrete patch to make that happen has been presented though.
>
> The attached patch removes all Test::More planning and instead ensures that 
> all
> tests conclude with a done_testing() call.  While there, I also removed a few
> exit(0) calls from individual tests making them more consistent.
>
> Thoughts?

LGTM, +1.

I have grepped the patched code and can verify that there are no
occurrences of `tests => N` (with or without quotes around `tests`) left
in any test scripts, and `make check-world` passes.

I'm especially pleased by the removal of the big chunks of test count
calculating code in the pg_dump tests.

- ilmari




Re: [PATCH] Improve function toast_delete_external (src/backend/access/table/toast_helper.c)

2022-02-09 Thread Dong Wook Lee
Hi,

I've applied your patch and I think it's reasonable,
but IMHO it looks harder to read because of the many conditions
in the if statement.
So what about just moving the if statement up?

Thanks.
Dong Wook Lee

On Wed, Feb 9, 2022 at 7:56 PM Ranier Vilela wrote:
>
> Hi,
>
> > I think this change can improve this particular function by avoiding
> > touching the value when not needed.
> > Testing for !isnull is cheaper than testing whether TupleDescAttr is -1.
>
> best regards,
>
> Ranier Vilela


v2_tiny_improvemnt_toast_helper.patch
Description: Binary data


Re: Add tag/category to the commitfest app

2022-02-09 Thread Julien Rouhaud
On Wed, Feb 09, 2022 at 03:14:30PM +0100, Daniel Gustafsson wrote:
> > On 8 Feb 2022, at 03:10, Julien Rouhaud  wrote:
> 
> > I was thinking that we could add a tag system, probably with some limited
> > predefined tags, to address that problem.
> > 
> > Do you think it's worth adding, or do you have a better idea to help newer
> > contributors?
> 
> We already have the categories for patches, not sure that a second level would
> be all that helpful.

Yes, but those categories are quite orthogonal to the problem of finding a patch
suitable for a new contributor or attracting the attention of other groups of
people.

Note that those tags wouldn't be a subcategory, just a parallel thing where a
patch could have 0, 1 or n predefined tags, with the search bar allowing easy
access to the combination you want.

> I still think that adding an abstract/description field
> as discussed in 3c53e8aa-b53f-43c2-b3c6-fa71e18a4...@yesql.se would be helpful
> but I haven't had time to work on that.

Agreed, but it doesn't seem to help apart from being a good indication that
this is likely not a good choice for a new contributor.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-09 Thread Bruce Momjian
On Tue, Feb  8, 2022 at 12:09:08PM -0500, Robert Haas wrote:
> Sadly, it doesn't appear to me that anyone has done any performance
> testing of this patch, along the lines suggested above or otherwise,
> and I think it's a crucial question for the patch. My reading of this
> thread is that nobody really likes the idea of maintaining two methods
> for performing CREATE DATABASE, but nobody wants to hose people who
> are using it to clone large databases, either. To some extent those
> things are inexorably in conflict. If we postulate that the 10TB
> template database is on a local RAID array with 40 spindles, while
> pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
> then the new system is going to be infinitely worse. But real
> situations aren't likely to be that bad, and it would be useful in my
> opinion to have an idea how bad they actually are.

Honestly, I never understood why the checkpoint during CREATE DATABASE
was a problem --- we checkpoint by default every five minutes anyway,
so why is an additional two a problem --- it just means the next
checkpoint will do less work.  It is hard to see how avoiding
checkpoints at the cost of added WAL writes, fsyncs, and replication
traffic could be a win.

I see the patch justification outlined here:


https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=bm+...@mail.gmail.com

TDE is mentioned as a value for this patch, but I don't see why it is
needed --- TDE can easily decrypt/encrypt the pages while they are
copied.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Replacing TAP test planning with done_testing()

2022-02-09 Thread Julien Rouhaud
Hi,

On Wed, Feb 09, 2022 at 03:01:36PM +0100, Daniel Gustafsson wrote:
> Whether or not to explicitly plan the number of TAP tests per suite has been
> discussed a number of times on this list, often as a side-note in an unrelated
> thread which adds/modifies a test.  The consensus has so far weighed towards
> not doing manual bookkeeping of test plans but to let Test::More deal with it.
> So far, no concrete patch to make that happen has been presented though.
> 
> The attached patch removes all Test::More planning and instead ensures that 
> all
> tests conclude with a done_testing() call.  While there, I also removed a few
> exit(0) calls from individual tests making them more consistent.
> 
> Thoughts?

+1, I don't think it adds much value compared to the extra work it requires.
Not having the running total also doesn't seem like much of a problem; as was
said on the other thread, no test suite is long enough for it to really
matter, at least on a development box.

I still remember the last time I wanted to add some TAP tests to pg_dump, and
it wasn't pleasant.  In any case we should at least get rid of this one.

The patch itself looks good to me.




Re: Add tag/category to the commitfest app

2022-02-09 Thread Daniel Gustafsson
> On 8 Feb 2022, at 03:10, Julien Rouhaud  wrote:

> I was thinking that we could add a tag system, probably with some limited
> predefined tags, to address that problem.
> 
> Do you think it's worth adding, or do you have a better idea to help newer
> contributors?

We already have the categories for patches, not sure that a second level would
be all that helpful.  I still think that adding an abstract/description field
as discussed in 3c53e8aa-b53f-43c2-b3c6-fa71e18a4...@yesql.se would be helpful
but I haven't had time to work on that.

--
Daniel Gustafsson   https://vmware.com/





Re: decoupling table and index vacuum

2022-02-09 Thread Robert Haas
On Wed, Feb 9, 2022 at 1:18 AM Dilip Kumar  wrote:
> I agree with the point that we should be focusing more on index size
> growth compared to dead tuples.  But I don't think that we can
> completely ignore the number of dead tuples.  Although we have the
> bottom-up index deletion but whether the index structure will be
> preserved or not will depend upon what keys we are inserting next.  So
> for example if there are 80% dead tuples but so far index size is fine
> then can we avoid vacuum? If we avoid vacuuming then it is very much
> possible that in some cases we will create a huge bloat e.g. if we are
> inserting some keys which can not take advantage of bottom up
> deletion.  So IMHO the decision should be a combination of index size
> bloat and % dead tuples.  Maybe we can add more weight to the size
> bloat and less weight to % dead tuple but we should not completely
> ignore it.

I think that dead index tuples really don't matter if they're going to
get removed anyway before a page split happens. In particular, if
we're going to do a bottom-up index deletion pass before splitting the
page, then who cares if there are going to be dead tuples around until
then? You might think that they'd have the unfortunate effect of
slowing down scans, and they could slow down ONE scan, but if they do,
then I think kill_prior_tuple will hint them dead and they won't
matter any more. Now, if we have a page that is going to split,
because it's going to receive inserts but neither kill_prior_tuple nor
bottom-up index deletion are going to keep us out of trouble, then the
dead tuples matter. And if we have a page where all the tuples are
dead and no further inserts are ever going to happen, those dead
tuples also matter, because getting rid of them would let us recycle
the page.

Just to be clear, when I say that the dead index tuples don't matter
here, I mean from the point of view of the index. From the point of
view of the table, the presence of dead index tuples (or even the
potential presence of dead tuples) pointing to dead line pointers is
an issue that can drive heap bloat. But from the point of view of the
index, because we don't ever merge sibling index pages, and because we
have kill_prior_tuple, there's not much value in freeing up space in
index pages unless it either prevents a split or lets us free the
whole page. So I agree with Peter that index growth is what really
matters.
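
(As an aside, that growth is something one can already watch directly, for
example with pgstattuple — the relation name here is hypothetical:

  CREATE EXTENSION IF NOT EXISTS pgstattuple;
  SELECT index_size, avg_leaf_density, leaf_fragmentation
  FROM pgstatindex('my_index');

tracking index_size across vacuum cycles relative to the table's data
volume.)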

However, I have a concern that Peter's idea to use the previous index
growth to drive future index vacuuming distinction is retrospective
rather than prospective. If the index is growing more than it should
based on the data volume, then evidently we didn't do enough vacuuming
at some point in the past. It's reasonable to step up our efforts in
the present to make sure that the problem doesn't continue, but in
some sense it's already too late. What we would really like is a
measure that answers the question: is the index going to bloat in the
relatively near future if we don't vacuum it now? I think that the
dead tuple count is trying, however imperfectly, to figure that out.
All other things being equal, the more dead tuples there are in the
index, the more bloat we're going to have later if we don't clean them
out now.

The problem is not with that core idea, which IMHO is actually good,
but that all other things are NOT equal. Peter has shown pretty
convincingly that in some workloads, essentially 100% of dead tuples
are going to get removed without causing a page split and the index
growth will be 0, whereas in other workloads 0% of dead tuples are
going to get removed without causing index growth. If you knew that
you had the second case, then counting dead index tuples to decide
when to vacuum would, in my opinion, be a very sensible thing to do.
It would still not be perfect, because dead tuples in pages that are
going to get split are a lot worse than dead tuples in pages that
aren't going to be split, but it doesn't seem meaningless. However, if
all of the index tuples are going to get removed in a timely fashion
anyway, then it's as useful as a stopped clock: it will be right
whenever it says the index doesn't need to be vacuumed, and wrong when
it says anything else.

In a certain sense, bottom-up index deletion may have exacerbated the
problems in this area. The more ways we add to remove dead tuples from
indexes without vacuum, the less useful dead tuples will become as a
predictor of index growth. Maybe #-of-dead-tuples and
future-index-growth weren't that tightly coupled even before bottom-up
index deletion, but it must be worse now.

I'm not hung up on using the # of dead tuples specifically as the
metric for index vacuuming, and it may be best to pick some other
measure. But I am a little suspicious that if the only measure is past
index growth, we will let some situations go too far before we wake up
and do something about them. My intuition is that it would be a good
idea to come up with something we could measure prospectively.

Typo in archive modules docs

2022-02-09 Thread Daniel Gustafsson
Spotted a typo when reading the new archive modules documentation, will apply
the attached shortly to fix.

--
Daniel Gustafsson   https://vmware.com/



typo-recyling.diff
Description: Binary data


Replacing TAP test planning with done_testing()

2022-02-09 Thread Daniel Gustafsson
Whether or not to explicitly plan the number of TAP tests per suite has been
discussed a number of times on this list, often as a side-note in an unrelated
thread which adds/modifies a test.  The consensus has so far weighed towards
not doing manual bookkeeping of test plans but to let Test::More deal with it.
So far, no concrete patch to make that happen has been presented though.

The attached patch removes all Test::More planning and instead ensures that all
tests conclude with a done_testing() call.  While there, I also removed a few
exit(0) calls from individual tests making them more consistent.

Thoughts?

--
Daniel Gustafsson   https://vmware.com/



0001-Replace-Test-More-plans-with-done_testing.patch
Description: Binary data


Re: [PATCH] Improve function toast_delete_external (src/backend/access/table/toast_helper.c)

2022-02-09 Thread Julien Rouhaud
Hi,

On Wed, Feb 09, 2022 at 07:56:35AM -0300, Ranier Vilela wrote:
> 
> I think this change can improve this particular function by avoiding
> touching the value when not needed.
> Testing for !isnull is cheaper than testing whether TupleDescAttr is -1.

It looks sensible.




Re: is the base backup protocol used by out-of-core tools?

2022-02-09 Thread Bernd Helmle
Am Mittwoch, dem 09.02.2022 um 08:01 -0500 schrieb Robert Haas:
> On Wed, Feb 9, 2022 at 3:14 AM Kyotaro Horiguchi
>  wrote:
> > Thanks for pining.  AFAICS, as you see, pg_rman doesn't talk
> > basebackup protocol (nor even pg_basebackup command) as it supports
> > inremental backup.  So there's no issue about the removal of old
> > syntax on our side.
> 
> Cool. Since the verdict here seems pretty unanimous and we've heard
> from a good number of backup tool authors, I'll give this another day
> or so and then commit, unless objections emerge before then.
> 

A while ago I worked on a tool which talks the streaming protocol
directly. In the past there were already additions to the protocol
(like MANIFEST), so to support those optional features and multiple
major versions you already have to abstract the code to deal with that.

Having said that, I don't have any objections either to removing the old
protocol support.

Thanks
Bernd





Re: refactoring basebackup.c

2022-02-09 Thread Abhijit Menon-Sen
At 2022-02-02 10:55:53 -0500, robertmh...@gmail.com wrote:
>
> On Tue, Jan 18, 2022 at 1:55 PM Robert Haas  wrote:
> > 0001 adds "server" and "blackhole" as backup targets. It now has some
> > tests. This might be more or less ready to ship, unless somebody else
> > sees a problem, or I find one.
> 
> I played around with this a bit and it seems quite easy to extend this
> further. So please find attached a couple more patches to generalize
> this mechanism.

It took me a while to assimilate these patches, including the backup
targets one, which I hadn't looked at before. Now that I've wrapped my
head around how to put the pieces together, I really like the idea. As
you say, writing non-trivial integrations in C will take some effort,
but it seems worthwhile. It's also nice that one can continue to use
pg_basebackup to trigger the backups and see progress information.

> Granted, coding up a new base backup target is
> something only experienced C hackers are likely to do, but the fact
> that I was able to throw this together so quickly suggests to me that
> I've got the design basically right, and that anyone who does want to
> plug into the new mechanism shouldn't have too much trouble doing so.
> 
> Thoughts?

Yes, it looks simple to follow the example set by basebackup_to_shell to
write a custom target. The complexity will be in whatever we need to do
to store/forward the backup data, rather than in obtaining the data in
the first place, which is exactly as it should be.

Thanks!

-- Abhijit




Re: Database-level collation version tracking

2022-02-09 Thread Julien Rouhaud
On Wed, Feb 09, 2022 at 12:48:35PM +0100, Peter Eisentraut wrote:
> On 08.02.22 13:55, Julien Rouhaud wrote:
> > I'm just saying that without such a lock you can easily trigger the "cache
> > lookup" error, and that's something that can happen with normal usage, I
> > think.  So there should be a better message saying that the database has
> > been concurrently dropped, or that it simply does not exist, like it's
> > done in AlterDatabaseOwner() for the same pattern:
> > 
> > [...]
> > tuple = systable_getnext(scan);
> > if (!HeapTupleIsValid(tuple))
> > ereport(ERROR,
> > (errcode(ERRCODE_UNDEFINED_DATABASE),
> >  errmsg("database \"%s\" does not exist", 
> > dbname)));
> > [...]
> 
> In my code, the existence of the database is checked by
> 
> dboid = get_database_oid(stmt->dbname, false);
> 
> This also issues an appropriate user-facing error message if the database
> does not exist.

Yes, but if you run a DROP DATABASE concurrently you will either get a
"database does not exist" or "cache lookup failed" depending on whether the
DROP is processed before or after the get_database_oid().

I agree that it's not worth trying to make it concurrent-drop safe, but I also
thought that throwing a plain elog(ERROR) should be avoided when reasonably
doable.  And since we know it can happen in normal situations, having a real
error message looks like a cost-free improvement.  Now if it's better to have
a cache lookup error even in that situation, just for safety or something,
OK; it's not like trying to refresh a database's collation while someone else
drops it at the same time is going to be a common practice anyway.

> The flow in AlterDatabaseOwner() is a bit different, it looks up the
> pg_database tuple directly from the name.  I think both are correct.  My
> code has been copied from the analogous code in AlterCollation().

I also think it would be better to have a "collation does not exist" in the
syscache failure message, but same here: dropping a collation is probably even
less frequent than dropping a database, let alone while refreshing the
collation version.




Re: Refactoring SSL tests

2022-02-09 Thread Andrew Dunstan


On 2/9/22 08:11, Daniel Gustafsson wrote:
>> On 8 Feb 2022, at 16:46, Andrew Dunstan  wrote:
>> There a capitalization typo in SSL/Backend/OpenSSL.pm - looks like
>> that's my fault:
>>
>> +  my $backend = SSL::backend::OpenSSL->new();
> Fixed.
>
>> Also, I think we should document that SSL::Server::new() takes an
>> optional flavor parameter, in the absence of which it uses
>> $ENV{with_ssl}, and that 'openssl' is the only currently supported value.
> Good point, done in the attached.



LGTM


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: I would like to participate for postgresql projects

2022-02-09 Thread Ashesh Vashi
On Wed, Feb 9, 2022 at 6:52 PM Daniel Gustafsson  wrote:

> > On 8 Feb 2022, at 06:12, luminate 22  wrote:
> >
> > Hi! my name is Mario Mercado. I'm from California western pacific time.
> I would like to participate in your projects. I'm a software developer from
> California.
>
> Welcome aboard, everyone is welcome.  I would recommend starting with
> cloning
> the repository and getting familiar with it by building and running tests
> etc:
>
> https://www.postgresql.org/docs/devel/git.html
> https://www.postgresql.org/docs/devel/installation.html
>
> Finding an interesting project to hack on is quite subjective, but
> reviewing
> what others have done and looking at what's in flight might bring some
> ideas to
> mind:
>
> https://commitfest.postgresql.org/37/

You may also want to visit the developer FAQ.
https://wiki.postgresql.org/wiki/Developer_FAQ

Have fun!

-- Ashesh Vashi

>
>
> And, most important: have fun!
>
> --
> Daniel Gustafsson   https://vmware.com/
>
>
>
>


Re: SQL/JSON: JSON_TABLE

2022-02-09 Thread Himanshu Upadhyaya
On Wed, Feb 2, 2022 at 12:44 AM Andrew Dunstan  wrote:
>
>
> rebased with some review comments attended to.

I am in the process of reviewing these patches and have initially started
with 0002-JSON_TABLE-v55.patch.
I tested many different scenarios with various JSON documents and they
all work as expected. Just one question on the output below.

‘postgres[1406146]=#’SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS
(a int PATH '$.a' ERROR ON EMPTY)) jt;
 a
---

(1 row)

‘postgres[1406146]=#’SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS
(a int PATH '$.a' ERROR ON ERROR)) jt;
 a
---

(1 row)

is not "ERROR ON ERROR" is expected to give error?

There are a few minor comments on the patch:
1)
A few typos:
+  Sibling columns are joined using
+  FULL OUTER JOIN ON FALSE, so that both
parent and child
+  rows are included into the output, with NULL values inserted
+  into both child and parrent columns for all missing values.

Parrent should be parent.

+
+ Gerenates a column and inserts a boolean item into each row of
this column.
+
Gerenates should be Generates.

+
+ Extracts SQL/JSON items from nested levels of the row pattern,
+ gerenates one or more columns as defined by the COLUMNS
+ subclause, and inserts the extracted SQL/JSON items into each
row of these columns.
+ The json_table_column expression in the
+ COLUMNS subclause uses the same syntax as in the
+ parent COLUMNS clause.
+

Gerenates should be Generates.

+/*-
+ *
+ * parse_jsontable.c
+ *   pasring of JSON_TABLE

pasring should be parsing.

2) Alphabetical include order.
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+
The include files are not in alphabetical order.

3)
+-- JSON_TABLE: nested paths and plans
+-- Should fail (column names anf path names shall be distinct)
+SELECT * FROM JSON_TABLE(
+   jsonb '[]', '$'
+   COLUMNS (
+   a int,
+   b text,
+   a jsonb
+   )
+) jt;
+ERROR:  duplicate JSON_TABLE column name: a
+HINT:  JSON_TABLE path names and column names shall be distinct from
one another

The HINT is not very relevant; can't we simply say "JSON_TABLE column
names should be distinct from one another"?

4)
@@ -4837,6 +4844,7 @@ ExecEvalJsonExprSubtrans(JsonFunc func, ExprEvalStep *op,
/* Want to execute expressions inside function's memory context */
MemoryContextSwitchTo(oldcontext);

+

we can remove this empty line.

5)
/*
 * The production for a qualified relation name has to exactly match the
 * production for a qualified func_name, because in a FROM clause we cannot
 * tell which we are parsing until we see what comes after it ('(' for a
 * func_name, something else for a relation). Therefore we allow 'indirection'
 * which may contain subscripts, and reject that case in the C code.
 */

I think the sentence "because in a FROM clause we cannot
 * tell which we are parsing..." should be changed to "because in a
FROM clause we cannot
 * tell what we are parsing "

6)
@@ -696,7 +696,7 @@ transformRangeTableFunc(ParseState *pstate,
RangeTableFunc *rtf)
char  **names;
int colno;

-   /* Currently only XMLTABLE is supported */

can we change (and not remove) the comment to "/* Currently only
XMLTABLE and JSON_TABLE are supported */"?

7)
/*
 * TableFunc - node for a table function, such as XMLTABLE.
 *
 * Entries in the ns_names list are either String nodes containing
 * literal namespace names, or NULL pointers to represent DEFAULT.
 */
typedef struct TableFunc

can we change the comment to "...such as XMLTABLE or JSON_TABLE."?

-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com




Re: I would like to participate for postgresql projects

2022-02-09 Thread Daniel Gustafsson
> On 8 Feb 2022, at 06:12, luminate 22  wrote:
> 
> Hi! my name is Mario Mercado. I'm from California western pacific time. I 
> would like to participate in your projects. I'm a software developer from 
> California.

Welcome aboard, everyone is welcome.  I would recommend starting with cloning
the repository and getting familiar with it by building and running tests etc:

https://www.postgresql.org/docs/devel/git.html
https://www.postgresql.org/docs/devel/installation.html

Finding an interesting project to hack on is quite subjective, but reviewing
what others have done and looking at what's in flight might bring some ideas to
mind:

https://commitfest.postgresql.org/37/

And, most important: have fun!

--
Daniel Gustafsson   https://vmware.com/





Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Feb 08, 2022 at 05:43:34PM -0800, Andres Freund wrote:
>> It's stamped, not tagged, so we could send out new tarballs. Or we could skip
>> a release number. IIRC we had to do something along those lines before.

> It does not matter now, but the release is stamped and tagged.

Yeah, I see no need to do anything about this on an emergency
basis.

>> What do you mean with detect here?

> Well, we would not be able to see that something is stuck by default,
> but Noah has just answered my question by mentioning wait_timeout
> in the buildfarm configuration.

The buildfarm's wait_timeout option isn't that helpful here, because
when it triggers, the client just goes belly-up *with no report*.
So even if the CCA animals had it on, you'd not notice unless you
started to wonder why they'd not reported lately.

I think that's a bug that ought to be fixed.  I do agree that
wait_timeout ought to be finite by default, too.

regards, tom lane




Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Noah Misch
On Tue, Feb 08, 2022 at 06:04:03PM -0800, Noah Misch wrote:
> On Tue, Feb 08, 2022 at 04:43:47PM -0800, Andres Freund wrote:
> > On 2022-02-08 22:13:01 +0100, Tomas Vondra wrote:
> > > On 10/24/21 03:40, Noah Misch wrote:
> > > > Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURRENTLY.
> > > > 
> > > > CIC and REINDEX CONCURRENTLY assume backends see their catalog changes
> > > > no later than each backend's next transaction start.  That failed to
> > > > hold when a backend absorbed a relevant invalidation in the middle of
> > > > running RelationBuildDesc() on the CIC index.  Queries that use the
> > > > resulting index can silently fail to find rows.  Fix this for future
> > > > index builds by making RelationBuildDesc() loop until it finishes
> > > > without accepting a relevant invalidation.  It may be necessary to
> > > > reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices.
> > > > Back-patch to 9.6 (all supported versions).
> > > > 
> > > > Noah Misch and Andrey Borodin, reviewed (in earlier versions) by Andres
> > > > Freund.
> > > > 
> > > > Discussion: 
> > > > https://postgr.es/m/20210730022548.ga1940...@gust.leadboat.com
> > > > 
> > > 
> > > Unfortunately, this seems to have broken CLOBBER_CACHE_ALWAYS builds. 
> > > Since
> > > this commit, initdb never completes due to infinite retrying over and over
> > > (on the first RelationBuildDesc call).
> 
> Thanks for the report.  I had added the debug_discard arguments of
> InvalidateSystemCachesExtended() and RelationCacheInvalidate() to make the new
> code survive a CREATE TABLE at debug_discard_caches=5.  Apparently that's not
> enough for initdb.  I'll queue a task to look at it.

The explanation was more boring than that.  v13 and earlier have an additional
InvalidateSystemCaches() call site, which I neglected to update.  Here's the
fix I intend to push.
Author: Noah Misch 
Commit: Noah Misch 

Fix back-patch of "Avoid race in RelationBuildDesc() ..."

The back-patch of commit fdd965d074d46765c295223b119ca437dbcac973 broke
CLOBBER_CACHE_ALWAYS for v9.6 through v13, because it updated the
InvalidateSystemCaches() call pertaining to CLOBBER_CACHE_RECURSIVELY
and not also the one pertaining to CLOBBER_CACHE_ALWAYS.  Back-patch to
v13, v12, v11, and v10.

Reported by Tomas Vondra.  Reviewed by FIXME.

Discussion: 
https://postgr.es/m/df7b4c0b-7d92-f03f-75c4-9e08b269a...@enterprisedb.com

diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 8b0503f..8fe1f0d 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -715,27 +715,27 @@ AcceptInvalidationMessages(void)
  * slows things by at least a factor of 10000, so I wouldn't suggest
  * trying to run the entire regression tests that way.  It's useful to try
  * a few simple tests, to make sure that cache reload isn't subject to
  * internal cache-flush hazards, but after you've done a few thousand
  * recursive reloads it's unlikely you'll learn more.
  */
 #if defined(CLOBBER_CACHE_ALWAYS)
 	{
 		static bool in_recursion = false;

 		if (!in_recursion)
 		{
 			in_recursion = true;
-			InvalidateSystemCaches();
+			InvalidateSystemCachesExtended(true);
 			in_recursion = false;
 		}
 	}
 #elif defined(CLOBBER_CACHE_RECURSIVELY)
 	{
 		static int	recursion_depth = 0;

 		/* Maximum depth is arbitrary depending on your threshold of pain */
 		if (recursion_depth < 3)
 		{
 			recursion_depth++;
 			InvalidateSystemCachesExtended(true);
 			recursion_depth--;


Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Michael Paquier
On Tue, Feb 08, 2022 at 05:43:34PM -0800, Andres Freund wrote:
> It's stamped, not tagged, so we could send out new tarballs. Or we could skip
> a release number. IIRC we had to do something along those lines before.

It does not matter now, but the release is stamped and tagged.

> What do you mean with detect here?

Well, we would not be able to see that something is stuck by default,
but Noah has just answered my question by mentioning wait_timeout
in the buildfarm configuration.
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Noah Misch
On Tue, Feb 08, 2022 at 04:43:47PM -0800, Andres Freund wrote:
> On 2022-02-08 22:13:01 +0100, Tomas Vondra wrote:
> > On 10/24/21 03:40, Noah Misch wrote:
> > > Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURRENTLY.
> > > 
> > > CIC and REINDEX CONCURRENTLY assume backends see their catalog changes
> > > no later than each backend's next transaction start.  That failed to
> > > hold when a backend absorbed a relevant invalidation in the middle of
> > > running RelationBuildDesc() on the CIC index.  Queries that use the
> > > resulting index can silently fail to find rows.  Fix this for future
> > > index builds by making RelationBuildDesc() loop until it finishes
> > > without accepting a relevant invalidation.  It may be necessary to
> > > reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices.
> > > Back-patch to 9.6 (all supported versions).
> > > 
> > > Noah Misch and Andrey Borodin, reviewed (in earlier versions) by Andres
> > > Freund.
> > > 
> > > Discussion: https://postgr.es/m/20210730022548.ga1940...@gust.leadboat.com
> > > 
> > 
> > Unfortunately, this seems to have broken CLOBBER_CACHE_ALWAYS builds. Since
> > this commit, initdb never completes due to infinite retrying over and over
> > (on the first RelationBuildDesc call).

Thanks for the report.  I had added the debug_discard arguments of
InvalidateSystemCachesExtended() and RelationCacheInvalidate() to make the new
code survive a CREATE TABLE at debug_discard_caches=5.  Apparently that's not
enough for initdb.  I'll queue a task to look at it.

It's a good reminder to set wait_timeout on buildfarm animals.  (I should take
that advice, too.)

> Ugh. Do we need to do something WRT the next set of minor releases?

No, given that this code already debuted in the November releases.




Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-09 02:25:09 +0100, Tomas Vondra wrote:
> AFAICS this only affects builds with CLOBBER_CACHE_ALWAYS, and anyone
> running such a build in production clearly likes painful things anyway.

Yea, realistically nobody does that.


> But really, for the infinite loop to happen, building a relation descriptor
> has to invalidate a cache. And I haven't found a way to do that without the
> CLOBBER_CACHE_ALWAYS thing.

Phew.


> Also, all the November minor releases include this commit, and there were no
> reports about this (pretty obvious) issue. Buildfarm did not complain either
> (but an animal may be stuck for months and we would not know about it).

Ah, somehow I thought that wasn't yet in the last set of releases. Phew #2.

Greetings,

Andres Freund




Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR

2022-02-09 Thread Andres Freund
Hi,

On 2022-02-09 10:23:06 +0900, Michael Paquier wrote:
> On Tue, Feb 08, 2022 at 04:43:47PM -0800, Andres Freund wrote:
> > Ugh. Do we need to do something WRT the next set of minor
> > releases?
> 
> The set of minor releases of this week has already been stamped, so
> that's too late :/

It's stamped, not tagged, so we could send out new tarballs. Or we could skip
a release number. IIRC we had to do something along those lines before.


> Ugh++.  The problem is that we would not really detect that
> automatically, would we?

What do you mean with detect here?

Greetings,

Andres Freund



